Marcel's Truckload of Trepidation: CONTEXTUAL TESTIMONY: THE TEXT ENCODING INITIATIVE

CONTEXTUAL TESTIMONY: THE TEXT ENCODING INITIATIVE
by Marcel H. Faulkner

Cataloguing, classification, subject heading, keywording, and indexing are all important aspects of the library and information sciences, and all have been greatly affected by the ongoing digitization of information. It is therefore essential for library professionals to have at least a rudimentary understanding of metadata formats. Metadata is, literally, “data about other data and objects, used to describe digitized and non-digitized resources located in a distributed system in a networked environment” (Nair & Jeevan 2004).

At the forefront of metadata formatting is the Text Encoding Initiative (TEI), an ongoing international project created to standardize the encoding and interchange of texts. Scholars from across the globe have participated in the project, which since its inception has focused primarily on the humanities, social sciences, and linguistics. The project began in 1987, in the early days of large-scale digitization, and was spearheaded and initially funded by three high-profile scholarly organizations: the Association for Computers and the Humanities (ACH), the Association for Computational Linguistics (ACL), and the Association for Literary and Linguistic Computing (ALLC). Subsequent funding has come from the U.S. National Endowment for the Humanities (NEH), Directorate XIII of the Commission of the European Communities (CEC/DG-XIII), the Andrew W. Mellon Foundation, the Social Sciences and Humanities Research Council of Canada, and other sources. The TEI Consortium, a member-funded non-profit corporation headquartered in Bergen, Norway, was created in 2000 to continue the maintenance and development of the TEI standard. The Consortium also has four hosting universities for the project: Brown, Oxford, the University of Bergen, and the University of Virginia (Nair & Jeevan 2004).

In the 1980s, researchers and research organizations that digitized texts generally devised their own systems of encoding. This resulted in ‘proprietary’ encoding schemes, which were institution-specific and difficult if not impossible to share with other researchers. There was a clear need for an encoding standard that made texts electronically reusable, interchangeable, collaborative, and system-independent (Vanhoutte 2004). In this broad milieu, the primary accomplishment of TEI was the creation of its Guidelines, which are updated every few years and comprise about 400 separate concepts and components, expressed by means of a markup language defined by a DTD (document type definition) or XML schema (Wikipedia 2005). The most recent version, TEI P5, was released in July 2005. The modular scheme of the Guidelines enables their adaptation or customization to a wide variety of research or production environments (TEI website, P5 Release page: http://www.tei-c.org/P5/). As of September 28, 2005, 123 projects were using the Guidelines, including African American Women Writers of the 19th Century, the American Verse Project, the European Corpus Initiative, and The Scholarly Electronic Text and Imaging Service (TEI website, Projects page: http://www.tei-c.org/Applications/).
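
For readers who have never seen TEI markup, here is a minimal sketch of what "expressed by means of a markup language" can look like in practice. It is an illustration only: the sample poem, its title, and the helper name summarize are invented for this post, the element names follow P5-style conventions, and Python's built-in XML parser only reads the markup; it does not validate it against the TEI DTD or schema.

```python
import xml.etree.ElementTree as ET

# Namespace used by P5-style TEI documents.
TEI_NS = "http://www.tei-c.org/ns/1.0"

# An invented, deliberately tiny document: a header describing the text,
# followed by the encoded text itself (one stanza of two verse lines).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>A Sample Poem</title>
        <author>Anonymous</author>
      </titleStmt>
      <publicationStmt><p>Unpublished illustration.</p></publicationStmt>
      <sourceDesc><p>Invented for this example.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <lg type="stanza">
        <l n="1">First invented line,</l>
        <l n="2">second invented line.</l>
      </lg>
    </body>
  </text>
</TEI>"""


def summarize(xml_text):
    """Pull the title, author, and verse lines out of a TEI-style document."""
    root = ET.fromstring(xml_text)
    ns = {"tei": TEI_NS}
    title = root.findtext(".//tei:titleStmt/tei:title", namespaces=ns)
    author = root.findtext(".//tei:titleStmt/tei:author", namespaces=ns)
    lines = [line.text for line in root.iterfind(".//tei:l", ns)]
    return {"title": title, "author": author, "lines": lines}


summary = summarize(SAMPLE)
print(summary)
```

Because the structure is spelled out in the markup rather than locked inside one institution's software, any XML-aware tool can extract the same information, which is the system independence the Guidelines were designed to provide.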

In its Charter, the TEI calls itself “the twentieth century’s most important standardization effort for humanities-related data” (TEI website, Charter page: http://www.tei-c.org/Consortium/charter.xml). While this sounds like an adman’s pitch, the claim has more than a little merit. TEI projects create access to texts that were previously unavailable, as well as ones of extremely limited availability, such as rare, centuries-old works that could deteriorate if removed from their specialized storage facilities. TEI projects also make available more recent out-of-print books, as well as works created electronically in obsolete systems that were therefore unreadable to most scholars (Computers in Libraries [CIL] 1996). All in all, TEI projects create access to lost works and give permanence to historical documents. One example is the Smithsonian Institution’s encoding of the documents of the United States Exploring Expedition, aka the Wilkes Expedition, an enormously ambitious 19th-century American exploration and surveying endeavour that circumnavigated the globe from 1838 to 1842 and produced thousands of documents. The original documents comprised 30 bound volumes and cost over $100,000 to print, which made them prohibitively expensive to reproduce and unavailable to scholars who could not visit the Smithsonian in person. They are now available to anyone online via the Smithsonian's "Galaxy of Knowledge" Web site (McKellar 2004).

But there is no magic bullet here. While the TEI schema has clear advantages, it also has its shortcomings. Few organizations have the resources or capability required to master the entire TEI encoding scheme, which is complex and involved. TEI’s creators implicitly acknowledged this drawback by producing TEI Lite, a stripped-down version of the Guidelines. Initially intended as an introduction to the full TEI schema, TEI Lite has proven popular and meets the needs of most TEI users (Vanhoutte 2004). Nor is the TEI schema adequate for all digitization efforts: the METAe project, a research and development group co-funded by the European Commission, found that the TEI scheme was not explicit enough for METAe’s goal of automated recognition (Stehno, Egger & Retti 2003).

Ideal as it may be for digitizing the works of humanities scholars, the TEI schema may be problematic for library catalogue bibliographic descriptions. The problem is one of granularity, which refers to the size of the metadata units being encoded. Low granularity means the units are larger and the encoding is easier, but there is less flexibility to alter the encoded texts later (Tennant 2002). Difficulties arise if libraries use a low-granularity TEI schema to digitize their catalogues and then later opt to enhance them with pictures, book reviews, and longer bibliographic descriptions. To input these enhancements, the catalogue may need to be redigitized from scratch. This scenario can be avoided if the catalogue is initially digitized with a high-granularity TEI schema, but high granularity is labour-intensive and therefore more expensive (Tennant 2002). In other words, the famously flexible TEI Guidelines enable institutions to do great things with digitization, but in some circumstances they also enable wrong choices that have to be paid for in the long run.
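
The granularity trade-off is easier to see with a small sketch. Below, the same citation (taken from the reference list of this post) is encoded twice: once as a single undifferentiated unit, and once with author, date, and titles marked up separately. The two encodings and the helper name field are my own illustration, not a catalogue format recommended by the TEI or by Tennant, and the namespace is left out to keep the sketch short.

```python
import xml.etree.ElementTree as ET

# Low granularity: the whole citation is one unit of text.
LOW = (
    "<bibl>Tennant, R. (2002). The importance of being granular. "
    "Library Journal.</bibl>"
)

# High granularity: each part of the citation is its own unit.
# level="a" marks an article title, level="j" a journal title.
HIGH = """<bibl>
  <author>Tennant, R.</author>
  <date>2002</date>
  <title level="a">The importance of being granular</title>
  <title level="j">Library Journal</title>
</bibl>"""


def field(xml_text, path):
    """Return the text found at path, or None if that unit was never marked up."""
    return ET.fromstring(xml_text).findtext(path)


low_year = field(LOW, "date")                        # nothing to find
high_year = field(HIGH, "date")                      # the year, on its own
high_journal = field(HIGH, "title[@level='j']")      # the journal title only
print(low_year, high_year, high_journal)
```

The low-granularity record is quicker to type, but software cannot pull the year or the journal out of it without someone going back to re-encode it, which is the redigitizing cost described above.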

It appears that despite several implementation caveats, the TEI has created the universal encoding standard that is so desperately needed in this digital age. The Guidelines are invaluable to the process of digitizing humanities texts, and have emerged, through some trial and error, as the de facto standard for professional organizations (Cantara 2005). But the belief that the TEI schema will never become obsolete (CIL 1996) seems like a monstrous presumption: who among us can predict the future of information technology with any semblance of certainty? At any time, some genius could devise an all-encompassing digitization process that will make everything we’re using today look like phooey on a stick. In the meantime, however, the phooey is looking pretty good.

REFERENCE LIST

Articles:
Cantara, L. (2005). Digital libraries in the humanities: the text-encoding initiative, part 1. OCLC Systems & Services: International Digital Library Perspectives, 21 (1), pp. 36-9.

McKellar, H. (2004). Fueling the digital renaissance. KMWorld, 13 (2). Online: http://www.kmworld.com/publications/magazine/index.cfm?action=readarticle&article_id=1677&publication_id=105.

Nair, S. S., & Jeevan, V.K.J. (2004). A brief overview of metadata formats. DESIDOC Bulletin of Information Technology, 24 (4), pp. 3-11.

[No author listed] (1996). SGML in education: the TEI and ICADD initiatives. Computers in Libraries, 16 (3), pp. 26-30.

Stehno, B., Egger, A., & Retti, G. (2003). METAe: automated encoding of digitized texts. Literary and Linguistic Computing, 18 (1), pp. 77-88.

Tennant, R. (2002). The importance of being granular. Library Journal, May 15, 2002, pp. 32-6.

Vanhoutte, E. (2004). An introduction to the TEI and the TEI Consortium. Literary and Linguistic Computing, 19 (1), pp. 9-16.

Websites:
Smithsonian Institution Galaxy of Knowledge: http://www.sil.si.edu/

Text Encoding Initiative (TEI) online: http://www.tei-c.org/.

Wikipedia, online encyclopedia, Text Encoding Initiative page: http://en.wikipedia.org/wiki/Text_Encoding_Initiative.

4 Comments:

At 10:25 AM, Robin said...: Hey Marcel!

Great project profile on TEI. I was just wondering what happens to the data after it is encoded? We've been talking a lot in class lately about copyright laws and legal issues surrounding information. Do you happen to know whether, if the Smithsonian (say) encodes a specific work, it then holds the rights to that set of information? Can companies encode whatever they want, or do they need permission from the author (well, if he’s still around…)? Could the companies then in turn charge for the use of their encoded texts?

-Robin
At 2:59 PM, Marcel Faulkner said...: Hi Robin:

These are excellent questions. Unlicensed publication of copyrighted text is, plain and simple, a violation of copyright, and there is no reason to think that international copyright laws do not apply to the TEI, its projects, the Smithsonian, or any other organization. Publication generates a copyright issue greater than mere encoding, if only because the copyright holder may not be aware of the infringement until the encoded text is published or posted somewhere, online or otherwise. None of this applies, of course, to works already in the public domain. But this is just inexpert speculation on my part, so I've emailed TEI and asked them for a clarification on this. I'll answer your questions in more detail in a few days, whether or not TEI responds to my email.
Cheers,
Marcel.
At 4:44 PM, Marcel Faulkner said...: Actually, Robin, Julia Flanders from TEI answered my email about an hour after I'd sent it. Here is the response I received:

Dear Marcel,

Thank you for writing. I can clarify somewhat by noting that the TEI
itself is not a text encoding project--that is, it does not itself
conduct any encoding of texts (except for the fact that its own web
site is encoded in TEI). Thus the issue you raise is in a sense moot.

However, relocating your question to those groups who do text
encoding: yes, if the source materials being encoded are covered by
copyright, then one needs to get clearance/permission before
republishing them in any form, including as TEI-encoded texts.
Clearance would not be needed to simply encode the texts (you or I
might choose to encode anything just for our private amusement, just
as we might copy them into a notebook) but if the goal is to publish
the encoded results, this would violate copyright unless permission
is received. I should emphasize that it is not the *encoding* that
constitutes the violation, but the *publication* or circulation of
the results. Encoding is no different from any other activity
(transcription, editing, photocopying, etc.) in this respect.

This is of course only an issue with materials that are covered by
copyright. The project I work for encodes pre-Victorian women's
writing, which is all out of copyright, so we are free to transcribe
and encode these materials as we please, and publish the results as a
new edition of the text.

I hope this answers your question--if not, please feel free to write back.

best wishes, Julia

Julia Flanders

Marcel's P.S. In all likelihood, settling copyright questions is the first thing the Smithsonian (or any other encoding organization) would have to do.

Sunday, October 02, 2005
