Sunday, October 02, 2005

CONTEXTUAL TESTIMONY: THE TEXT ENCODING INITIATIVE

by Marcel H. Faulkner

Cataloguing, classification, subject heading, keywording, and indexing are all important aspects of the library and information sciences, and all have been greatly affected by the ongoing digitization of information. It is therefore essential for library professionals to have at least a rudimentary understanding of metadata formats. Metadata is, literally, “data about other data and objects, used to describe digitized and non-digitized resources located in a distributed system in a networked environment” (Nair & Jeevan 2004).

At the forefront of metadata formatting is the Text Encoding Initiative (TEI), an ongoing international project created to standardize the encoding and interchange of texts. Scholars from across the globe have participated in the project, which since its inception has focused primarily on the humanities, social sciences, and linguistics. The project began in 1987, in the early days of large-scale digitization, and was spearheaded and initially funded by three high-profile scholarly organizations: the Association for Computers and the Humanities (ACH), the Association for Computational Linguistics (ACL), and the Association for Literary and Linguistic Computing (ALLC). Subsequent funding has come from the U.S. National Endowment for the Humanities (NEH), Directorate XIII of the Commission of the European Communities (CEC/DG-XIII), the Andrew W. Mellon Foundation, the Social Sciences and Humanities Research Council of Canada, and other sources. The TEI Consortium, a member-funded non-profit corporation headquartered in Bergen, Norway, was created in 2000 to continue the maintenance and development of the TEI standard. The Consortium also has four hosting universities for the project: Brown, Oxford, the University of Bergen, and the University of Virginia (Nair & Jeevan 2004).

In the 1980s, researchers and research organizations that digitized texts generally devised their own systems of encoding. This resulted in ‘proprietary’ encoding schemes, which were institution-specific and difficult if not impossible to share with other researchers. There was a clear need for an encoding standard that made texts electronically reusable, interchangeable, collaborative, and system-independent (Vanhoutte 2004). In this broad milieu, the primary accomplishment of TEI was the creation of its Guidelines, which are updated every few years and comprise about 400 separate concepts and components, expressed by means of a markup language and defined by a DTD (document type definition) or XML schema (Wikipedia 2005). The most recent version, TEI P5, was released in July 2005. The modular scheme of the Guidelines enables their adaptation or customization to a wide variety of research or production environments (TEI website, P5 Release page: http://www.tei-c.org/P5/). As of September 28, 2005, 123 projects are using the Guidelines, including African American Women Writers of the 19th Century, the American Verse Project, the European Corpus Initiative, and The Scholarly Electronic Text and Imaging Service (TEI website, Projects page: http://www.tei-c.org/Applications/).
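To make the idea of markup-based encoding concrete, here is a minimal sketch of what a TEI-style document can look like and how a program might read it. The sample text, its title, and the helper name `q` are invented for illustration. The element layout (a header followed by the text) reflects my understanding of the P5 conventions and should be checked against the Guidelines before being relied upon.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI P5 documents (assumed here; verify against the Guidelines).
TEI_NS = "http://www.tei-c.org/ns/1.0"

# An invented two-line "poem" with the header a TEI document carries.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>A sample poem</title></titleStmt>
      <publicationStmt><p>Unpublished illustration.</p></publicationStmt>
      <sourceDesc><p>Invented for this example.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <lg type="stanza">
        <l>First line of verse</l>
        <l>Second line of verse</l>
      </lg>
    </body>
  </text>
</TEI>"""


def q(tag):
    """Hypothetical helper: qualify a tag name with the TEI namespace."""
    return "{%s}%s" % (TEI_NS, tag)


root = ET.fromstring(SAMPLE)

# The header describes the text; the body holds the text itself.
title = root.find(".//" + q("title")).text
verse_lines = [line.text for line in root.iter(q("l"))]

print(title)
print(verse_lines)
```

Because the markup names what each piece of text is (a title, a stanza, a line) rather than how it should look, any system that understands the scheme can reuse the file, which is the system-independence the Guidelines were meant to deliver.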

In its Charter, the TEI calls itself “the twentieth century’s most important standardization effort for humanities-related data” (TEI website, Charter page: http://www.tei-c.org/Consortium/charter.xml). While this sounds like an adman’s pitch, the claim has more than a little merit. TEI projects create access to texts that were previously unavailable, as well as ones of extremely limited availability, such as rare, hundreds-of-years-old works that could deteriorate if removed from their specialized storage facilities. TEI projects also make available more recent out-of-print books, as well as works created electronically in obsolete systems that were therefore unreadable to most scholars (Computers in Libraries [CIL] 1996). All in all, TEI projects create access to lost works and give permanence to historical documents. One example is the Smithsonian Institution’s encoding of the documents of the United States Exploring Expedition, aka the Wilkes Expedition, an enormously ambitious 19th-century American exploration and surveying endeavour that circumnavigated the globe from 1838 to 1842 and produced thousands of documents. The original documents comprised 30 bound volumes and cost over $100,000 to print, which made them prohibitively expensive to reproduce and unavailable to scholars who could not visit the Smithsonian in person. They are now available to anyone online via the Smithsonian's "Galaxy of Knowledge" Web site (McKellar 2004).

But there is no magic bullet here. While the TEI scheme has clear advantages, it also has its shortcomings. Few organizations have the resources or capability required to master the entire TEI encoding scheme, which is complex and involved. TEI’s creators implicitly acknowledged this drawback by producing TEI Lite, a stripped-down version of the Guidelines. Initially intended as an introduction to the full TEI scheme, TEI Lite has proven popular and meets the needs of most TEI users (Vanhoutte 2004). Nor is the TEI scheme adequate for all digitization efforts: the METAe project, a research and development group co-funded by the European Commission, found that the TEI scheme was too inexplicit for METAe’s goal of automated recognition (Stehno, Egger & Retti 2003).

Ideal as it may be for digitizing the works of humanities scholars, the TEI scheme may be problematic for library catalogue bibliographic descriptions. The issue is granularity, which refers to the size of the metadata units that are being encoded. Low granularity means the units are larger and the encoding is easier, but there is less flexibility to alter the encoded texts later (Tennant 2002). Difficulties arise if libraries use a low-granularity TEI encoding to digitize their catalogues and then later opt to enhance them with pictures, book reviews, and longer bibliographic descriptions. To input these enhancements, the catalogue may need to be redigitized from scratch. This scenario can be avoided if the catalogue is initially digitized with a high-granularity TEI encoding, but high granularity is labour-intensive and therefore more expensive (Tennant 2002). In other words, the famously flexible TEI Guidelines enable institutions to do great things with digitization, but in some circumstances enable wrong choices that have to be paid for in the long run.
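The granularity trade-off can be illustrated with a small sketch. The catalogue record and the function name `fields` below are invented for the purpose, and the element names are ones I understand TEI to provide for bibliographic citations. This is an illustration of the principle, not a recommended cataloguing practice.

```python
import xml.etree.ElementTree as ET

# Low granularity: the whole description is one undifferentiated unit.
LOW = "<bibl>Melville, Herman. Moby-Dick. New York: Harper and Brothers, 1851.</bibl>"

# High granularity: every unit of the same description is tagged separately.
HIGH = """<bibl>
  <author>Melville, Herman</author>
  <title>Moby-Dick</title>
  <pubPlace>New York</pubPlace>
  <publisher>Harper and Brothers</publisher>
  <date>1851</date>
</bibl>"""


def fields(record_xml):
    """Return a {tag: text} mapping of the units tagged inside one record."""
    record = ET.fromstring(record_xml)
    return {child.tag: (child.text or "").strip() for child in record}


low_fields = fields(LOW)    # nothing is addressable below the record level
high_fields = fields(HIGH)  # each unit can be retrieved, sorted, or extended

print(low_fields)
print(high_fields["date"])
```

The low-granularity record took one tag to encode but yields no individual fields, so attaching a review to the title or sorting by date would require returning to the source and re-encoding. The high-granularity record cost five times the tagging effort but supports those enhancements directly.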

It appears that despite several implementation caveats, the TEI has created the universal encoding standard that is so desperately needed in this digital age. The Guidelines are invaluable to the process of digitizing humanities texts, and have emerged, through some trial and error, as the de facto standard for professional organizations (Cantara 2005). But the belief that the TEI scheme will never become obsolete (CIL 1996) seems like a monstrous presumption: who among us can predict the future of information technology with any semblance of certainty? At any time, some genius could devise an all-encompassing digitization process that will make everything we’re using today look like phooey on a stick. In the meantime, however, the phooey is looking pretty good.


REFERENCE LIST

Articles:
Cantara, L. (2005). Digital libraries in the humanities: the text-encoding initiative, part 1. OCLC Systems & Services: International Digital Library Perspectives, 21 (1), pp. 36-9.

McKellar, H. (2004). Fueling the digital renaissance. KMWorld, 13 (2). Online: http://www.kmworld.com/publications/magazine/index.cfm?action=readarticle&article_id=1677&publication_id=105.

Nair, S. S., & Jeevan, V.K.J. (2004). A brief overview of metadata formats. DESIDOC Bulletin of Information Technology, 24 (4), pp. 3-11.

[No author listed] (1996). SGML in education: the TEI and ICADD initiatives. Computers in Libraries, 16 (3), pp. 26-30.

Stehno, B., Egger, A., & Retti, G. (2003). METAe: automated encoding of digitized texts. Literary and Linguistic Computing, 18 (1), pp. 77-88.

Tennant, R. (2002). The importance of being granular. Library Journal, May 15, 2002, pp. 32-6.

Vanhoutte, E. (2004). An introduction to the TEI and the TEI Consortium. Literary and Linguistic Computing, 19 (1), pp. 9-16.

Websites:
Smithsonian Institution Galaxy of Knowledge: http://www.sil.si.edu/

Text Encoding Initiative (TEI) online: http://www.tei-c.org/.

Wikipedia, online encyclopedia, Text Encoding Initiative page: http://en.wikipedia.org/wiki/Text_Encoding_Initiative.