MARGENTO & Inkpen Paper @ New Directions in the Humanities Conference 2015
… As we have already seen in the case of the previous publications on computational poetry analysis, data collection and the features of the databases analyzed are more intimately related to the specifics and performance of the resulting classifiers and computational tools than one would suspect; moreover, they also involve weighty, even if implicit or unconscious, cultural and literary choices. But the issue is far more complex than that. Data in general, and particularly the huge amount of data that is continuously made available and that grows exponentially in the digital age, has attracted the attention of major scholars before, and has meanwhile come to represent not only a self-sufficient subject and a challenge to a variety of disciplines, but even a new research paradigm. This fourth paradigm succeeds, according to Gray and Szalay (2007), three older ones, the experimental, theoretical, and simulation paradigms, and in computer science “it means that the term e-science is not primarily concerned with faster computation, but with more advanced database technologies.” (Levallois, Steinmetz, and Wouters 2013, 152) For the late Jim Gray, a computer scientist “celebrated as a visionary” (id.), we are witnessing the evolution of two branches in every discipline, “a computational branch and a data-processing branch” (ibid. 153), and the new field dedicated to studying such ramifications is called data-intensive research or data-intensive science.
There is no consensus as to when data are large or complex enough to qualify as an object of data-intensive research, especially since "huge" or "massive" may mean completely different things in different fields and disciplines, but Levallois, Steinmetz, and Wouters advance a highly relevant and potentially very useful definition: “data-intensive research [is] research that requires radical changes in the discipline” involving “new, possibly more standardized and technology-intensive ways to store, annotate, and share data,” a concept that therefore “may point toward quite different research practices and computational tools.” (id.)
In the contributions quoted above, the poem datasets number in the hundreds (the largest, the Malay corpus, contains 1,500 elements, while the handful of other papers ever published on computational poetry analysis employ significantly smaller sets or corpora). By contrast, our first paper, focusing on multilabel subject-based classification of poems, analyzed over 11,000 poems in the Poetry Foundation’s database, and since we have meanwhile consistently expanded our corpora with material from more and more print and online sources, we can assert that the size of our databases and corpora can count as a basis for data-intensive research.
On the other hand, we do use different research practices in that we have put together a model that analyzes poems comprehensively, without limiting the approach (as previous computational analyses have) to a single poetic feature (diction, subject, form, etc.) or to one or several aspects of such a feature. Moreover, using graph theory applications in analyzing both particular poems and poetry corpora is a completely novel practice in poetry criticism and analysis, and it in turn involves different computational tools from those used so far in the field. These tools range from meter parsers and enjambment locators to routines that assemble weighted graphs of poems and analyze features such as connectivity, spotting cut vertices.
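To illustrate the kind of analysis involved, the following is a minimal sketch, not the project's actual code: it models a corpus as an undirected graph whose nodes are poems and whose edges carry (hypothetical) similarity weights, and finds the cut vertices, i.e. the poems whose removal disconnects the corpus graph. The poem names and similarity scores are invented for illustration; cut-vertex detection itself ignores the weights and uses only connectivity (Tarjan's depth-first-search algorithm for articulation points).

```python
def articulation_points(adj):
    """Return the set of cut vertices of an undirected graph given as an
    adjacency dict {node: set(neighbors)}, via Tarjan's DFS algorithm."""
    disc, low, cuts = {}, {}, set()
    timer = [0]

    def dfs(u, parent):
        disc[u] = low[u] = timer[0]
        timer[0] += 1
        children = 0
        for v in adj[u]:
            if v == parent:
                continue
            if v in disc:
                # Back edge: u can reach an ancestor discovered earlier.
                low[u] = min(low[u], disc[v])
            else:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                # Non-root u is a cut vertex if some subtree cannot
                # reach above u without passing through u.
                if parent is not None and low[v] >= disc[u]:
                    cuts.add(u)
        # The DFS root is a cut vertex iff it has more than one child.
        if parent is None and children > 1:
            cuts.add(u)

    for node in adj:
        if node not in disc:
            dfs(node, None)
    return cuts


# Hypothetical corpus: (poem, poem, similarity weight) edges.
edges = [
    ("Poem A", "Poem B", 0.8),
    ("Poem B", "Poem C", 0.6),
    ("Poem C", "Poem D", 0.7),
    ("Poem B", "Poem D", 0.5),
]
adj = {}
for u, v, w in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

# "Poem B" is the sole cut vertex: removing it isolates "Poem A"
# from the C-D cluster, marking it as structurally pivotal.
print(sorted(articulation_points(adj)))
```

In critical terms, a cut vertex in such a graph flags a poem that alone bridges otherwise unconnected regions of a corpus, which is one way the connectivity features mentioned above become interpretable.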
from the Conclusions:
The Graph Poem Project is: