As in other chapters, there will be many examples drawn from practical experience managing linguistic data, including data that has been collected in the course of linguistic fieldwork, laboratory work, and web crawling.

The same holds true of text corpora: the original text usually has an external source and is considered to be an immutable artifact.

Any transformations of that artifact that involve human judgment, even something as simple as tokenization, are subject to later revision, so it is important to retain the source material in a form that is as close to the original as possible.
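One way to keep the source intact while still recording a revisable tokenization is to store tokens as character offsets into the unmodified text. The following is a minimal sketch of that idea, not a prescribed method: the sample sentence, the function names `tokenize_spans` and `tokens`, and both regular expressions are assumptions chosen for illustration.

```python
import re

# Hypothetical raw source text; nothing below ever modifies it.
SOURCE = "Dr. Smith didn't arrive."


def tokenize_spans(text, pattern=r"\S+"):
    """Return (start, end) character offsets for each token in text."""
    return [m.span() for m in re.finditer(pattern, text)]


def tokens(text, spans):
    """Recover the token strings by slicing the original text."""
    return [text[start:end] for start, end in spans]


# First pass: whitespace-delimited tokens.
v1 = tokenize_spans(SOURCE)

# Later revision: separate runs of word characters from punctuation marks.
v2 = tokenize_spans(SOURCE, r"\w+|[^\w\s]")

print(tokens(SOURCE, v1))
print(tokens(SOURCE, v2))
```

Because each tokenization is only a list of offsets, a revised version can replace an earlier one without touching the source string, and both can be checked against the original at any time.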

Finally, notice that even though TIMIT is a speech corpus, its transcriptions and associated data are just text, and can be processed using programs just like any other text corpus.

Therefore, many of the computational methods described in this book are applicable.
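As a small illustration of treating time-aligned transcriptions as ordinary text, the sketch below parses word-level alignment lines with plain string handling. It assumes a three-column layout (start sample, end sample, word) and a 16 kHz sampling rate. The sample values are invented rather than taken from the corpus, and `WordSpan`, `parse_word_lines`, and `duration_seconds` are hypothetical names.

```python
from collections import namedtuple

# Hypothetical record type for one time-aligned word.
WordSpan = namedtuple("WordSpan", "start end word")

# Invented sample data in an assumed "start end word" layout;
# these are not actual values from the corpus.
SAMPLE_WRD = """\
2000 5000 she
5000 9000 had
9000 11000 your
"""


def parse_word_lines(text):
    """Parse lines of 'start end word' into WordSpan records."""
    spans = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        start, end, word = line.split(maxsplit=2)
        spans.append(WordSpan(int(start), int(end), word))
    return spans


def duration_seconds(span, sample_rate=16000):
    """Convert a span's length in samples to seconds (rate is an assumption)."""
    return (span.end - span.start) / sample_rate


spans = parse_word_lines(SAMPLE_WRD)
for span in spans:
    print(span.word, duration_seconds(span))
```

Once the lines are parsed into records, the words can be counted, filtered, or tagged with the same tools used for any other text.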

These are organized into a tree structure, shown schematically in 1.2.