The filename contains the date, chatroom, and number of posts.

The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University.
This corpus contains text from 500 sources, and the sources have been categorized by genre.

Next, we need to obtain counts for each genre of interest.
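As a minimal sketch of what obtaining per-genre counts can look like, the snippet below tallies a few target words for each genre using only the standard library. The genre labels, the word lists, and the helper name `genre_counts` are hypothetical toy data made up for illustration; they are not taken from the corpus. NLTK's `ConditionalFreqDist` provides a similar per-condition tally when working with the real corpus.

```python
from collections import Counter

# Hypothetical toy data: genre label -> list of word tokens.
# These words are illustrative only and do not come from the Brown Corpus.
toy_corpus = {
    "news": ["the", "jury", "could", "say", "the", "vote", "will", "stand"],
    "romance": ["she", "could", "not", "say", "what", "she", "might", "do"],
}


def genre_counts(corpus, targets):
    """Return {genre: Counter} counting only the target words in each genre."""
    targets = {t.lower() for t in targets}
    return {
        genre: Counter(w.lower() for w in words if w.lower() in targets)
        for genre, words in corpus.items()
    }


counts = genre_counts(toy_corpus, ["could", "will", "might"])
for genre, counter in counts.items():
    print(genre, dict(counter))
```

A `Counter` returns 0 for words it has never seen, so a genre that lacks a target word can still be queried without a special case.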
An interesting property of this collection is its time dimension.

Many text corpora contain linguistic annotations, representing part-of-speech (POS) tags, named entities, syntactic structures, semantic roles, and so forth.
This particular corpus actually contains dozens of individual texts, one per address, but for convenience we glued them end-to-end and treated them as a single text. We also used various pre-defined texts that we accessed interactively.

This program displays three statistics for each text: average word length, average sentence length, and the number of times each vocabulary item appears in the text on average (our lexical diversity score).

For convenience, the corpus methods accept a single fileid or a list of fileids. Similarly, we can specify the words or sentences we want in terms of files or categories.

This chapter continues to present programming concepts by example, in the context of a linguistic processing task. We will wait until later before exploring each Python construct systematically.
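The three statistics mentioned above (average word length, average sentence length, and the lexical diversity score) can be sketched in plain Python. This is an illustrative version under stated assumptions, not the program the text refers to: the function name `text_statistics` and the toy sentences are hypothetical, word length is measured on tokens rather than raw characters, and the vocabulary is case-folded.

```python
def text_statistics(sentences):
    """Compute three summary statistics for a tokenized text.

    `sentences` is a list of sentences, each a list of word tokens.
    Returns (average word length, average sentence length, lexical diversity).
    """
    words = [w for sent in sentences for w in sent]
    vocab = {w.lower() for w in words}  # assumption: vocabulary ignores case

    avg_word_len = sum(len(w) for w in words) / len(words)
    avg_sent_len = len(words) / len(sentences)
    # Average number of times each vocabulary item appears in the text.
    lexical_diversity = len(words) / len(vocab)
    return avg_word_len, avg_sent_len, lexical_diversity


# Hypothetical toy text, two sentences.
toy_sentences = [["The", "cat", "sat"], ["the", "dog", "sat", "too"]]
avg_word_len, avg_sent_len, lexical_diversity = text_statistics(toy_sentences)
print(avg_word_len, avg_sent_len, lexical_diversity)
```

Taking a list of tokenized sentences as input keeps the function independent of any particular corpus reader, so the same code would apply to one file or to several files glued together.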