CATCH meeting: patterns in narrative texts
Theme: Patterns in Narrative Texts
The collections of many cultural heritage institutions consist mainly of historical and contemporary texts. Various text processing techniques are available to extract information from such large corpora. A special challenge is posed by the large subset of textual data that takes a narrative form. What distinguishes such narrative texts from factual reports is that they are typically multi-layered, and studying these layers can tell us much about the author's mentality and beliefs, as well as other important cultural and historical information. To explore narrative corpora and disclose the deeper information they contain, new text mining methods must be developed.
The afternoon will revolve around big data of language, narratives, and folklore, with a focus on finding significant patterns, themes and motifs within these data. The data that will be discussed range from narrative journalistic texts to orally transmitted folktales. In the study of history, diachronic corpora can be mined to discover how historical events are reflected in language use. In folk narrative research, patterns of interest include the stability and variability of 'narrative building blocks' (motifs, memes) in oral transmission, and geographical dispersion of folk beliefs in the supernatural. Establishing links between narrative texts is a common factor in all this research.
12.00 – 13.30 Lunch. There is an opportunity to take a guided tour through the Meertens Instituut as well.
13.30 – 13.35 Word of welcome by Hans Bennis
13.35 – 13.50 Word of welcome by Jaap van den Herik
13.50 – 14.00 Introduction on Patterns in Narrative Texts and on the Dutch Folktale Database by Theo Meder
14.00 – 14.15 Dolf Trieschnigg: Learning to Extract Folktale Keywords
14.15 – 14.30 Dong Nguyen: Folktale Classification using Learning to Rank
14.30 – 15.15 Mike Kestemont & Folgert Karsdorp: Mining the Twentieth Century's History from the TIME Magazine Corpus
15.15 – 16.00 Tea break with poster presentations and demonstrations.
1. Dutch Folktale Database/FACT (Dolf Trieschnigg, Iwe Muiser)
2. Tunes & Tales (Peter van Kranenburg)
3. TINPOT (Dong Nguyen)
4. Nederlab (Rob Zeeman)
5. CLARIAH (Patricia Alkhoven)
6. e-Humanities (Andrea Scharnhorst)
7. Riddle of Literary Quality (Corina Koolen, Andreas van Cranenburgh)
16.00 – 17.00 Tim Tangherlini: Tools of the WitchHunter: hGIS and Network Classifiers for the Study of Folklore.
17.00 – 18.00 Drinks
Mike Kestemont & Folgert Karsdorp: Mining the Twentieth Century's History from the TIME Magazine Corpus.
In this presentation we report on quantitative research conducted on the complete archive of "TIME Magazine", containing over 260,000 articles. This well-known American weekly news magazine has had a continuous publication history since 1923, making this collection an exceptionally rich and balanced textual resource for the study of the history of the twentieth century. Because of the sheer size of this so-called "Big Data", we must resort to automated, computational analyses. We apply state-of-the-art techniques from language technology and text mining, for instance from the recent "deep learning" movement. Among researchers, there is widespread acceptance that cultural evolution is somehow reflected in language use; yet there exists no standard methodology to study such phenomena beyond the naive plotting of individual word frequencies through time. In this paper we attempt to move to more advanced analysis techniques for the computational study of history based on a large, diachronic textual corpus. Although TIME's archive naturally offers a strongly America-centric view on history, we will demonstrate how large-scale events such as World Wars I and II, the Moon Landing, or the rise of the Internet have found an interesting and complex reflection in the evolution of TIME's vocabulary. Of particular interest to us is the notorious "TIME100", a highly mediatized list of the most influential people in the world which the magazine brings out yearly. Moreover, in 1999, TIME published such a list for the entire twentieth century, singling out the well-known theoretical physicist Albert Einstein as the single most influential “Person of the Century”. In our research we have paid special attention to the intriguing interplay between this list of influential personalities and the manner in which they are discussed in the magazine's own archive.
Tim Tangherlini: Tools of the WitchHunter: hGIS and Network Classifiers for the Study of Folklore.
With the advent of well-structured databases housing very large collections of traditional cultural expressive materials, folklore researchers find themselves poised on the cusp of a new era in research. Yet with these opportunities come certain challenges. How does one work with thousands or tens of thousands of records when one was trained to work with dozens or maybe hundreds of records? What type of analytical and computational tools does a folklorist need to be able to work with these massive digital collections?
Using a 35,000-story subset of the Evald Tang Kristensen collection of Danish folklore as a starting point, I explore some of the approaches that we have developed for classification and pattern discovery in this collection. We develop a representation of latent semantic connections between stories and project these into a map-based navigation and discovery environment. Our preliminary work is based on the pre-existing corpus indices and a shared-keyword index, coupled to an index of geo-referenced places mentioned in the stories. Combining these allows us to produce heat maps of the relationship between places and a first-level approximation of story topics. A researcher can use these topic concentrations as a method for building and refining research questions. For example, do certain areas have a higher-than-normal concentration of stories about witchcraft? We also allow for spatial querying, an approach that allows a researcher to discover topics that are particularly related to a specific place. Do certain topics, such as vengeful haunts, tend to cluster around certain landscape features such as manor farms? We have also begun developing methods for discovering directionality in topic assignments: do some classes of narrators situate stories about house elves at a greater distance than other classes of narrators? Our corpus representation can be extended to include multimodal network representations of the corpus and LDA topic models to allow for additional visualizations of latent corpus topics.
About the keynote speakers
Folgert Karsdorp is employed as a PhD student at the Meertens Institute (Amsterdam, NL) in the KNAW-funded project Tunes & Tales. In this project he investigates the variation of folk tales through oral transmission based on motifs. His research interests lie in computational linguistics and natural language processing. He attempts to bridge insights from the humanities, notably narratology, and computational approaches.
Currently, Mike Kestemont is a postdoctoral research fellow of the Research Foundation of Flanders (FWO) at the University of Antwerp. He works in the group of Frank Willaert at the Institute for the Study of Literature in the Low Countries (ISLN), as well as in the lab led by Walter Daelemans at the CLiPS Computational Linguistics and Psycholinguistics Group. In the fall of 2012 Kestemont was also a visiting research fellow at the Radboud University Nijmegen (The Netherlands). Kestemont likes to call himself a ‘computational philologist’: his research focuses on computational text analysis and computational stylistics or stylometry. He applies these methods predominantly (though not solely) to medieval literature, in particular that of the Low Countries (Middle Dutch literature). Much of his research is closely linked to the international initiative of Digital Humanities or eHumanities.
Timothy R. Tangherlini teaches folklore, literature and cultural studies at the University of California, where he is a professor in the Scandinavian Section and the Department of Asian Languages and Cultures. He is also an affiliate of the Center for Medieval and Renaissance Studies and the Religious Studies Program, and a faculty member in the Center for Korean Studies and the Center for European and Eurasian Studies.
He has published widely on folklore, literature, film and critical geography. His main theoretical areas of interest are folk narrative, legend, popular culture, and critical geography. His main geographic areas of interest are the Nordic region (particularly Denmark and Iceland), the United States, and Korea.
He is the author of Interpreting Legend: Danish Storytellers and their Repertoires (1994) and Talking Trauma. Paramedics and Their Stories (1998), and the co-editor of Nationalism and the Construction of Korean Identity (1999) and Sitings. Critical Approaches to Korean Geography (2008). He has also produced or co-produced two documentary films, Talking Trauma: Storytelling Among Paramedics (1994) and Our Nation. A Korean Punk Rock Community (2002).
His current work focuses on computation and the humanities. In 2012, along with James Abello and Peter Broadwell, Tim Tangherlini published a paper called ‘Computational Folkloristics’ in Communications of the ACM vol. 55, no. 7, pp. 60-70. In 2013 he published his lecture ‘The Folklore Macroscope. Challenges for a Computational Folkloristics’ in Western Folklore vol. 72, no. 1, pp. 7-27. His new hybrid publication, Danish Folktales, Legends, and Other Stories (2013) weds a print book to a rich digital resource,