Data First Manifesto

We propose a strategy for conducting digital humanities teaching and research that prioritizes publishing data above all other project activities. Drawing on our experience working with faculty, librarians, and graduate students on a critical edition in TEI of Charles Baudelaire’s Les Fleurs du Mal, we have created a manifesto to support a data-first strategy in the digital humanities.

The Corpus Baudelaire Project began at Vanderbilt University in 2013, when a hybrid group of approximately ten scholars who had recently learned how to encode literary texts in the TEI aspired to do something practical with their new skills. The group developed a connection to Vanderbilt University Library’s W. T. Bandy Center for Baudelaire and Modern French Studies, which exhaustively collects Baudelaire’s works, including Les Fleurs du Mal. The work itself was published in four editions: 1857; 1861 (adding 35 poems and the Tableaux parisiens section, and lacking six poems censored by the Second Empire); 1866 (including Les Épaves, or The Scraps, and the six poems missing from the 1861 edition); and the posthumous 1868 edition. Participants in the Corpus Baudelaire Project are encoding all the editions using the TEI’s critical apparatus.

Our data-first approach to the Corpus Baudelaire Project minimizes otherwise common tasks such as developing databases or coding interfaces, and offers advantages over alternative approaches in fostering collaboration, pedagogy, and new forms of publishing. We also suggest that this approach may productively be generalized to any digital humanities project that develops significant quantities of data.

A data-first approach differs from other forms of digital humanities scholarship by minimizing startup costs and reducing complexity. Whereas many digital humanities projects aim above all to produce some form of online digital edition or interactive website, a data-first approach invests primarily in producing and sharing data with others. “It’s the data, stupid!” is our informal slogan.

A data-first approach to DH involves at least four preferences, dealing with licensing, computability, curating, and publishing datasets online. The last two are likely to be iterative and emergent.

Licensing

A data-first approach begins with the presupposition of making data openly available and reusable by other scholars. This implies not only attaching an open license to the data, but also making certain that participants can download and reuse the dataset without restriction.

Computability

A data-first approach makes data available for computational analysis. While graphical user interfaces are helpful to individual users, providing an open API fosters maximal utility and diversity of uses.
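
As a rough illustration of what computational access to TEI data can look like, the following Python sketch reads a small TEI fragment and lists the readings recorded for each witness. It is only a sketch under stated assumptions: the fragment, the witness identifiers (fm1857, fm1861), the verse text, and the function name list_variants are invented for the example and are not taken from the Corpus Baudelaire Project’s data. Only the TEI elements app, lem, and rdg and the wit attribute come from the TEI critical apparatus itself.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI P5 documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical fragment: the witness ids and the verse text are invented
# placeholders, not readings from any edition of Les Fleurs du Mal.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <lg>
      <l n="1">Un vers <app><lem wit="#fm1857">ancien</lem><rdg wit="#fm1861">nouveau</rdg></app> pour exemple</l>
      <l n="2">Un vers sans variante</l>
    </lg>
  </body></text>
</TEI>"""


def list_variants(tei_xml):
    """Return one (line number, witness, reading) tuple per witness of each lem/rdg."""
    root = ET.fromstring(tei_xml)
    variants = []
    for line in root.iterfind(".//tei:l", TEI_NS):
        for app in line.iterfind("tei:app", TEI_NS):
            for reading in app:  # the lem and rdg children of this app
                # wit holds a space-separated list of pointers such as "#fm1857"
                for wit in reading.get("wit", "").split():
                    text = (reading.text or "").strip()
                    variants.append((line.get("n"), wit.lstrip("#"), text))
    return variants


if __name__ == "__main__":
    for number, witness, text in list_variants(SAMPLE):
        print(number, witness, text)
```

Because the sketch relies only on the Python standard library and on plain TEI files, anyone who downloads an openly licensed dataset could run this kind of analysis without a project-specific interface.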

Curating

A data-first approach implies that discussions about data curation start at the beginning of the project, not its end. How shall information be encoded? How to decide between alternative options? Are there emerging best practices and converging forms of representation? Documenting data and making available any accompanying schemas is also critical when taking a data-first approach.

Publishing

A data-first approach requires that data be published for comment, criticism, and reuse from the outset of the project. What are the best platforms for publishing digital humanities data? How can digital humanists provide access to their data and get credit for it?

By prioritizing these activities above other forms of digital humanities work, we lower the barriers for participants to join our project while offering them the opportunity to publish and begin receiving credit for their work almost immediately. Crucially, credit is allocated with respect to contributions, not by seniority or other hierarchical designations; the data bear witness directly to their creators.
