We (Tim Causer, Kris Grint, Anna-Maria Sichani, and me!) have recently published an article in Digital Scholarship in the Humanities on the economics of crowdsourcing, reporting on the Transcribe Bentham project, which is formally published here:
Alack, due to our own economic situation, it’s behind a paywall there. It’s also embargoed for two years in our institutional repository (!). But I’ve just been alerted to the fact that the license of this journal allows the author to put the “post-print on the author’s personal website immediately”. Others publishing in DSH may also not be aware of this clause in the license!
So here it is, for free download, for you to grab and enjoy in PDF.
I’ll stick the abstract here. It will help people find it!
In recent years, important research on crowdsourcing in the cultural heritage sector has been published, dealing with topics such as the quantity of contributions made by volunteers, the motivations of those who participate in such projects, the design and establishment of crowdsourcing initiatives, and their public engagement value. This article addresses a gap in the literature, and seeks to answer two key questions in relation to crowdsourced transcription: (1) whether volunteers’ contributions are of a high enough standard for creating a publicly accessible database, and for use in scholarly research; and (2) if crowdsourced transcription makes economic sense, and if the investment in launching and running such a project can ever pay off. In doing so, this article takes the award-winning crowdsourced transcription initiative, Transcribe Bentham, which began in 2010, as its case study. It examines a large data set, namely, 4,364 checked and approved transcripts submitted by volunteers between 1 October 2012 and 27 June 2014. These data include metrics such as the time taken to check and approve each transcript, and the number of alterations made to the transcript by Transcribe Bentham staff. These data are then used to evaluate the long-term cost-effectiveness of the initiative, and its potential impact upon the ongoing production of The Collected Works of Jeremy Bentham at UCL. Finally, the article proposes more general points about successfully planning humanities crowdsourcing projects, and provides a framework in which both the quality of their outputs and the efficiencies of their cost structures can be evaluated.
Read the full piece here.
Since its establishment in 2001, the English version of Wikipedia has grown to host more than 5.6 million articles that reflect content ranging from culture and the arts to technology and the applied sciences. Consistently ranked as one of the top visited sites on the Internet, Wikipedia provides an open and freely accessible resource of interconnected information that anyone can edit. Unfortunately, not everyone actually does. Nine out of ten editors are male. The average Wikipedian is an educated, English-speaking citizen of a majority-Christian nation in the global north. They are technically proficient and likely hold, or are skilled enough to hold, white-collar employment. Not surprisingly, these commonalities have introduced systemic bias to the manner in which content is generated, updated, and, most critically, omitted from the site.
Pages about trans and cis women, gender non-conforming people, cultural communities in the global south, those living in poverty, and people without internet access are chronically underrepresented on Wikipedia. This includes groups in developing nations, as well as racialized and systemically marginalized groups in economically wealthy countries, such as the Black and Latinx communities in the United States. Equally absent are pages about Indigenous peoples, communities, and cultures. As of August 2018 there were 3,468 articles within the scope of the Indigenous Peoples of the Americas WikiProject. This number represents only 0.06% of the articles on English-language Wikipedia, with an even smaller percentage relating to First Nations, Inuit, and Métis peoples in what is currently known as Canada. Overall, representation of Indigenous-focused content is sorely lacking.
As settlers living and working as archivists on the traditional territories of the Neutral, Anishnaabeg, Métis, and Haudenosaunee peoples — Danielle on the Haldimand Tract, land extending six miles from each side of the Grand River that was promised to the Six Nations, and Krista on Robinson-Huron Treaty territory — we have personally and professionally considered the Truth and Reconciliation Commission of Canada Calls to Action (TRC) that outline the responsibilities of cultural heritage workers to educate both themselves and the general public about the Canadian Indian Residential School System (Residential Schools). In working to do so, however, we recognize that Residential Schools were but one of the many horrific consequences of settler colonialism. Meaningful engagement with the reconciliation process and Indigenous communities in Canada means raising awareness about more than Residential Schools. It means understanding the need for cultural organizations to build relationships with Indigenous communities rooted in solidarity and allyship; centering an ethic that moves beyond rote territorial acknowledgements; and setting aside defensive dismissals of wrongs that happened before we were born in order to prioritize what Senator Murray Sinclair calls “a sense of responsibility for the future.” It also means acknowledging that colonialism continues to impact Indigenous communities and working to break down colonial systems that exist within cultural organizations. We believe that editing Wikipedia through a lens of reconciliation is one way to do so.
Read the full post here.
Last week (July 31, 2018), I had the honor of speaking at CLIR’s (Council on Library and Information Resources) summer seminar for new Postdoctoral Fellows. I was very excited to get the opportunity to meet a new cohort of fellows just as they are beginning their new positions at various institutions. (For more information on CLIR Postdoctoral Fellowships, visit their website! And keep an eye out for the next round of applications this fall/winter.)
My talk centered on the work we do at Recovering the US Hispanic Literary Heritage (aka “Recovery”), the importance of minority archives, and working toward inclusivity. For 27 years, Recovery has dedicated itself to recovering, preserving, and disseminating the lost written legacy of Latinas and Latinos in the United States. US Latina/o collections, like other minority collections, do not traditionally form part of a larger national historical narrative. Herein lies the importance of minority collections: the stories they tell give us a more nuanced understanding of US history and culture.
Let’s take a step back to think about the structure of archives, the inherent issues, and the questions that we—as archivists, scholars, students, and educators—should ask ourselves when engaging with historical collections. Archives help structure knowledge and history. Michel Foucault argues that history “now organizes the document” (with “document” here meaning the archival record), “divides it up, distributes it, orders it, arranges it in levels, establishes series, distinguishes between what is relevant and what is not, discovers elements, defines unities, describes relations” (146). Thus history, or perhaps more aptly, what we understand to be or call history, cannot be distinguished from the production and organization of the archive. Furthermore, national archives help to create an authoritative national narrative. The International Council on Archives, for example, describes archives on their webpage as follows:
Archives constitute the memory of nations and societies, shape their identity, and are a cornerstone of the information society. By providing evidence of human actions and transactions, archives support administration and underlie the rights of individuals, organisations and states. By guaranteeing citizens’ rights of access to official information and to knowledge of their history, archives are fundamental to identity, democracy, accountability and good governance.
Given this defined mission of archives, we can think about what archives do or are meant to do; they define:
- “the nation,”
- what is—and what isn’t—considered “important,”
Read the full post here.
In the last 25 years we have seen the web enable new digital means for historians to reach broader publics and audiences. Over that same period of time, archives and archivists have been exploring and engaging with related strands of digital transformation. In one strand, a similar focus on community work through digital means has emerged in both areas. While historians have been developing a community of practice around public history, archivists and archives have similarly been reframing their work as more user-centered and more closely engaged with communities and their records. A body of archival work and scholarship has emerged around the function of community archives that presents significant possibilities for further connections with the practices of history and historians. In a second strand, strategies for understanding and preserving digital cultural heritage have also taken shape. While historians have begun exploring tools to produce new forms of digital scholarship, archivists and archives have been working to develop methods both to care for digital material and to make it available. Archivists have established tools, workflows, vocabularies, and infrastructure for digital archives, and they have also managed the digitization of collections to expand access.
At the intersection of these two developments, we see a significant convergence between the needs and practices of public historians and archivists. Historians’ new forms of scholarship increasingly function as forms of knowledge infrastructure. Archivists’ work on systems for enabling access to collections is itself anchored in longstanding commitments to infrastructure for enabling the use of records. At this convergence, there is a significant opportunity for historians to begin to connect more with archivists as peers, as experts in questions of the structure and order of sources and records.
In this essay we explore the ways that archives, archivists, and archival practice are evolving around both analog and digital activities that are highly relevant for those interested in working in digital public history.
Read the full piece here.
Recently, historians have been trying to understand cultural change by measuring the “distances” that separate texts, songs, or other cultural artifacts. Where distances are large, they infer that change has been rapid. There are many ways to define distance, but one common strategy begins by topic modeling the evidence. Each novel (or song, or political speech) can be represented as a distribution across topics in the model. Then researchers estimate the pace of change by measuring distances between topic distributions.
In 2015, Mauch et al. used this strategy to measure the pace of change in popular music—arguing, for instance, that changes linked to hip-hop were more dramatic than the British invasion. Last year, Barron et al. used a similar strategy to measure the influence of speakers in French Revolutionary debate.
I don’t think topic modeling causes problems in either of the papers I just mentioned. But these methods are so useful that they’re likely to be widely imitated, and I do want to warn interested people about a couple of pitfalls I’ve encountered along the road.
One reason for skepticism will immediately occur to humanists: are human perceptions about difference even roughly proportional to the “distances” between topic distributions? In one case study I examined, the answer turned out to be “yes,” but there are caveats attached. Read the paper if you’re curious.
In this blog post, I’ll explore a simpler and weirder problem. Unless we’re careful about the way we measure “distance,” topic models can warp time. Time may seem to pass more slowly toward the edges of a long topic model, and more rapidly toward its center.
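To make the idea of “distance” concrete, here is a minimal sketch, using invented data rather than anything from the papers above, of two common ways to measure distance between topic distributions (Jensen-Shannon divergence and cosine distance); the choice of measure is exactly the kind of decision that can quietly warp the results.

```python
# A toy sketch of measuring "distance" between two documents that have
# been represented as probability distributions over topics.
# All data here is hypothetical.
import math

def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2, so the result lies in [0, 1])."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability terms.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def cosine_distance(p, q):
    """1 minus the cosine similarity of the two vectors."""
    dot = sum(pi * qi for pi, qi in zip(p, q))
    norm = math.sqrt(sum(pi * pi for pi in p)) * math.sqrt(sum(qi * qi for qi in q))
    return 1 - dot / norm

# Two hypothetical novels, each a distribution over four topics.
novel_a = [0.70, 0.20, 0.05, 0.05]
novel_b = [0.05, 0.05, 0.20, 0.70]

print(jensen_shannon(novel_a, novel_b))
print(cosine_distance(novel_a, novel_b))
```

The two measures rank pairs of documents differently in general, which is why a corpus can look like it is changing faster or slower depending on the metric chosen.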
Read the full post here.
This is a quick introduction on how to get and visualize Google search data with both time and geographical components using the R packages gtrendsR, maps and ggplot2. In this example, we will look at search interest for named hurricanes that hit the U.S. mainland and then plot how often different states search for “guns.”
Source: Mapping search data from Google Trends in R
One useful library for viewing a topic model is LDAvis, an R package for creating interactive web visualizations of topic models, and its Python port, PyLDAvis. This library is focused on visualizing a topic model, using PCA to chart the relationships between topics, and between topics and the words in the topic model. It is also agnostic about the library you use to create the topic model, so long as you extract the necessary data in the correct formats.
While the python version of the library works very smoothly with Gensim, which I have discussed before, there is little documentation for how to move from a topic model created using MALLET to data that can be processed by the LDAvis library. For reasons that require their own blog post, I have shifted from using Gensim for my topic model to using MALLET (spoilers: better documentation of output formats, more widespread use in the humanities so better documentation and code examples generally). But I still wanted to use this library to visualize the full model as a way of generating an overall view of the relationship between the 250 topics it contains.
The documentation for both LDAvis and PyLDAvis relies primarily on code examples to demonstrate how to use the libraries. My primary sources were a Python example and two R examples, one focused on manipulating the model data and one on the full model-to-visualization process. The “details” documentation for the R library also proved key for troubleshooting when the outputs did not match my expectations. (Pro tip: word order matters.)
Looking at the examples, the data required for the visualization library are:
- topic-term distributions (matrix, phi)
- document-topic distributions (matrix, theta)
- document lengths (numeric vector)
- vocab (character vector)
- term frequencies (numeric vector)
One challenge is that the order of the data needs to be managed, so that the term columns in phi, the topic-term matrix, are in the same order as the vocab vector, which is in the same order as the frequencies vector, and the document index of theta, the document-topic matrix, is in the same order as the document lengths vector.
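The alignment constraint can be sketched with a toy example: an invented three-word vocabulary and two topics (hypothetical data, not MALLET’s actual output format), showing one way to build phi, vocab, and the term-frequency vector in a single consistent order.

```python
# Toy illustration of the ordering constraint: column j of phi, vocab[j],
# and term_frequency[j] must all refer to the same word.
# All data here is hypothetical.

# Suppose the modeling tool reports topic-term weights keyed by word:
phi_by_word = [
    {"river": 0.6, "bank": 0.3, "money": 0.1},   # topic 0
    {"river": 0.1, "bank": 0.4, "money": 0.5},   # topic 1
]
term_frequency_by_word = {"bank": 40, "money": 25, "river": 35}

# Fix one vocab order, then build every structure in that same order:
vocab = sorted(term_frequency_by_word)            # ['bank', 'money', 'river']
phi = [[topic[w] for w in vocab] for topic in phi_by_word]
term_frequency = [term_frequency_by_word[w] for w in vocab]

print(vocab)
print(phi)
print(term_frequency)
```

Building everything from one canonical vocab list, as above, is a simple way to avoid the silently scrambled visualizations that a mismatched word order produces.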
Read the full post here.
How is humanities and social science knowledge impacted by the introduction of three-dimensional visualization technologies? While 3D visualization may seem far removed from the everyday work of scholars in the social sciences and humanities, it has great potential to change how we conduct and communicate our work.
Three-dimensional visualizations can be used for creating models, supplementing maps, developing games, printing objects, developing virtual environments, enhancing telecommunications, and housing simulations. They can be used to support retrospective and prospective analysis, exploration of counterfactuals, and representation of hybrid or alternate realities, particularly when they combine objects in 3D contexts. An art historian might want to understand how an artifact was perceived in context, or how a built structure looked in earlier eras, or to document an installation or exhibition. An archeologist might use 3D models or prints to complete a broken artifact or to reassemble a ruin. A sociologist might develop agent-based modeling in a 3D space to understand the social dynamics in a given location. A historian might explore 3D viewsheds to determine lines of sight and power. A linguist might construct a virtual environment for language learning. A literary scholar might build out a navigable imagined space as a form of nonlinear literary criticism. A statistician might display data in 3D infographics to aid in interpretation. And of course, artists, architects, and designers of all stripes might use 3D to create new objects and environments as well as use such techniques as a way to study those that already exist. All of these researchers in turn might communicate their work through multimodal, immersive, affective visualizations for public outreach, policy impact, or funding solicitations.
Although the technologies used to create them are daunting at first, these visualizations are becoming increasingly accessible to nonspecialist users, and the underlying conceptual approaches that they highlight are not new to the disciplines where they’re now being used. Designing and representing 3D space and objects in 2D images, text, and other forms comes naturally to us in many fields. Maps, plans, and networks fill the pages of social science research. Where and how people think, live, work, and interact are contextualized in historical and contemporary places, spaces, environments, and geographies. Artists and architects build their maquettes and design their structures and installations. The dimensional space of the stage, performance hall, or theater is a key component of the production. Lighting, acoustics, and movement are all part of the process. Museums and cultural heritage institutions have taken advantage of the rhetorical power of 3D for their interpretative exhibits for years.
Read the full post here.
[Delivered as part of the “Mid-Range Reading: Manifesto Edition” panel, organized by Alison Booth, of the DH2018 Conference in Mexico City]
A great deal of digital humanities work over the past decade or so has employed scale as the concept that distinguishes it from other methods of literary and cultural study. Quantitative scholars in particular have quite naturally chosen scale as the specific difference of their method. They speak of the computer as a “macroscope” that permits “macroanalysis.” Critics counted words and documents before computers, but computers let them count and compute lots of them. Contrasting themselves with close readers, “distant readers” propose, with the help of machines, to step back from the individual pages and books to see more and see bigger. When the popular press sees fit to feature DH, it is scale that gets touted and scale that gets maligned.
Claims of scalar difference are often apparently precise. Instead of offering a reading of a single novel, distant readers study the titles of 7,000 British novels from 1740 to 1850, or ask how not to read a million books, or search through (at last count) the 60,237 full texts in EEBO TCP I and II. For nearly all quantitative analyses of texts, the authors tell (or could tell) the reader exactly how many words they are counting in exactly how many documents over how many years, since these numbers are the basis of more sophisticated metrics and models.
The concept of scale is not wrong or misguided in any simple sense, and I plan to issue no prohibitions on its use. Nor do I plan to offer a brief for the micro in opposition to the macro (as Roopika Risam and Susan Edwards did at DH2017). I want instead to argue that we should displace scale from its marquee role in differentiating data- and corpus-based digital inquiry from other approaches. That displacement has perhaps already begun. Surveying recent work by a range of scholars in an attempt to forestall attacks on the use of data in literary study, Ted Underwood observes that “None of them, as far as I can tell, have stopped doing close reading.” “We also do close reading” is a totally sensible line of defense, albeit one that fortifies distant reading at the expense of its distinctiveness. This is all to the good.
Read the full post here.
One thing we digital methods people like to harp on about is the fact that quantitative methods are brilliant for dealing with huge amounts of text that are quite frankly incomprehensible at the level of literary-linguistic detail we would like to be able to study them at. The ability to observe frequency at a level of detail that would be impossible at scale is indeed their radical possibility. But ‘scale’ is not a fixed measure. To the literary scholar used to focusing on one text at a time, ‘scale’ could mean that one text, whereas to someone who is used to dealing with corpora, ‘scale’ could represent the entirety of Eighteenth Century Collections Online (ECCO). So when I was approached to write this post, I wanted to model what resources like CLiC offer to the scholar used to closely interacting with one text at a time.
I was intrigued to see that the CLiC project had included a copy of The Awakening and Selected Short Stories by Kate Chopin, in large part because The Awakening can be classified as a very long short story or a very short novella. It is very much the kind of writing which is perceived as wholly human-addressable, insofar as it is short enough to read in one sitting. The Awakening is also widely understood as an early example of the development of an autonomous female identity (Toth 1976, 242; Gray 2004, 53; Chopin 57). Literary critics such as Perkins-Gilman (1898, 79), Skaggs (1974, 348), Seyersted (1969, 134), and Gray (2004) have focused on the performative aspects of women’s social enclosure within the confines of 19th-century home life. And I am intrigued by its history as part of women writing their own literature and history about their own experiences in a society where this was widely frowned upon. In this post, I shall illustrate how simple frequencies can be used to guide a literary analysis.
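As an aside on method, the “simple frequencies” in question can be sketched in a few lines; the passage below is invented for illustration, not Chopin’s text.

```python
# A minimal sketch of counting word frequencies in a passage.
# The passage is invented for illustration.
from collections import Counter
import re

passage = (
    "She felt the sea before she saw the sea. "
    "The house was quiet, and she was alone in the house."
)

# Tokenize crudely: lowercase the text, keep runs of letters/apostrophes.
tokens = re.findall(r"[a-z']+", passage.lower())
freq = Counter(tokens)

# The most frequent items are usually function words; in corpus stylistics
# these can be as revealing as content words.
print(freq.most_common(5))
print(freq["sea"], freq["house"])
```

Even a list this crude can guide a reading: comparing a word’s frequency in one text against its frequency in a reference corpus is the usual next step, and is essentially what tools like CLiC automate.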
Read the full post here.