abydos.corpus package¶
The corpus package includes basic and n-gram corpus classes.
As a quick example of Corpus:
>>> tqbf = 'The quick brown fox jumped over the lazy dog.\n\n'
>>> tqbf += 'And then it slept.\n\n And the dog ran off.'
>>> corp = Corpus(tqbf)
>>> corp.docs()
[[['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog.']],
[['And', 'then', 'it', 'slept.']], [['And', 'the', 'dog', 'ran', 'off.']]]
>>> round(corp.idf('dog'), 10)
1.0986122887
>>> round(corp.idf('the'), 10)
0.4054651081
Here, each sentence is a separate "document". We can retrieve IDF values from
the Corpus. The same Corpus can be used to initialize an NGramCorpus and
calculate TF values:
>>> ngcorp = NGramCorpus(corp)
>>> ngcorp.get_count('the')
2
>>> ngcorp.get_count('fox')
1
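The IDF values above are consistent with the standard natural-log formulation, idf(t) = ln(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. A quick check of the figures (a sketch, not Abydos code):

```python
from math import log

def idf(doc_freq, total_docs):
    """Standard IDF: natural log of the total number of documents
    over the number of documents containing the term."""
    return log(total_docs / doc_freq)

# Three one-sentence documents: 'dog' occurs in 1 of them,
# lowercase 'the' in 2 of them.
print(round(idf(1, 3), 10))  # 1.0986122887
print(round(idf(2, 3), 10))  # 0.4054651081
```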
-
class abydos.corpus.Corpus(corpus_text='', doc_split='\n\n', sent_split='\n', filter_chars='', stop_words=None, word_tokenizer=None)[source]¶
Bases: object
Corpus class.
Internally, this is a list of lists of lists. The corpus itself is a list of documents. Each document is an ordered list of the sentences in that document. And each sentence is an ordered list of the words that make up that sentence.
New in version 0.1.0.
Initialize Corpus.
- By default, when importing a corpus:
  - two consecutive newlines divide documents
  - single newlines divide sentences
  - other whitespace divides words
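That default behavior amounts to nested splitting. A simplified stand-in for the parsing logic (ignoring filter_chars, stop_words, and custom tokenizers) might look like:

```python
def parse_corpus(text, doc_split='\n\n', sent_split='\n'):
    """Split text into documents, documents into sentences, and
    sentences into whitespace-separated words."""
    return [
        [sent.split() for sent in doc.split(sent_split) if sent.split()]
        for doc in text.split(doc_split) if doc.strip()
    ]

tqbf = 'The quick brown fox jumped over the lazy dog.\n\n'
tqbf += 'And then it slept.\n\n And the dog ran off.'
print(len(parse_corpus(tqbf)))  # 3 documents, of one sentence each
```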
- Parameters
corpus_text (str) -- The corpus text as a single string
doc_split (str) -- A character or string used to split corpus_text into documents
sent_split (str) -- A character or string used to split documents into sentences
filter_chars (list) -- A list of characters (as a string, tuple, set, or list) to filter out of the corpus text
stop_words (list) -- A list of words (as a tuple, set, or list) to filter out of the corpus text
word_tokenizer (_Tokenizer) -- A tokenizer to apply to each sentence in order to retrieve the individual "word" tokens. If set to None, str.split() will be used.
Example
>>> tqbf = 'The quick brown fox jumped over the lazy dog.\n'
>>> tqbf += 'And then it slept.\n And the dog ran off.'
>>> corp = Corpus(tqbf)
New in version 0.1.0.
-
docs()[source]¶
Return the docs in the corpus.
Each list within a doc represents the sentences in that doc, each of which is in turn a list of words within that sentence.
- Returns
The docs in the corpus as a list of lists of lists of strs
- Return type
[[[str]]]
Example
>>> tqbf = 'The quick brown fox jumped over the lazy dog.\n'
>>> tqbf += 'And then it slept.\n And the dog ran off.'
>>> corp = Corpus(tqbf)
>>> corp.docs()
[[['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog.'],
['And', 'then', 'it', 'slept.'], ['And', 'the', 'dog', 'ran', 'off.']]]
>>> len(corp.docs())
1
New in version 0.1.0.
-
docs_of_words()[source]¶
Return the docs in the corpus, with sentences flattened.
Each list within the corpus represents all the words of that document. Thus the sentence level of lists has been flattened.
- Returns
The docs in the corpus as a list of list of strs
- Return type
[[str]]
Example
>>> tqbf = 'The quick brown fox jumped over the lazy dog.\n'
>>> tqbf += 'And then it slept.\n And the dog ran off.'
>>> corp = Corpus(tqbf)
>>> corp.docs_of_words()
[['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog.',
'And', 'then', 'it', 'slept.', 'And', 'the', 'dog', 'ran', 'off.']]
>>> len(corp.docs_of_words())
1
New in version 0.1.0.
-
idf(term, transform=None)[source]¶
Calculate the Inverse Document Frequency of a term in the corpus.
- Parameters
term (str) -- The term to calculate the IDF of
transform (function) -- A function to apply to each document term before checking for the presence of term
- Returns
The IDF
- Return type
float
Examples
>>> tqbf = 'The quick brown fox jumped over the lazy dog.\n\n'
>>> tqbf += 'And then it slept.\n\n And the dog ran off.'
>>> corp = Corpus(tqbf)
>>> print(corp.docs())
[[['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog.']],
[['And', 'then', 'it', 'slept.']], [['And', 'the', 'dog', 'ran', 'off.']]]
>>> round(corp.idf('dog'), 10)
1.0986122887
>>> round(corp.idf('the'), 10)
0.4054651081
New in version 0.1.0.
-
paras()[source]¶
Return the paragraphs in the corpus.
Each list within a paragraph represents the sentences in that paragraph, each of which is in turn a list of words within that sentence. This is identical to the docs() member function and exists only to mirror part of NLTK's API for corpora.
- Returns
The paragraphs in the corpus as a list of lists of lists of strs
- Return type
[[[str]]]
Example
>>> tqbf = 'The quick brown fox jumped over the lazy dog.\n'
>>> tqbf += 'And then it slept.\n And the dog ran off.'
>>> corp = Corpus(tqbf)
>>> corp.paras()
[[['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog.'],
['And', 'then', 'it', 'slept.'], ['And', 'the', 'dog', 'ran', 'off.']]]
>>> len(corp.paras())
1
New in version 0.1.0.
-
raw()[source]¶
Return the raw corpus.
This is reconstructed by joining sub-components with the corpus' split characters.
- Returns
The raw corpus
- Return type
str
Example
>>> tqbf = 'The quick brown fox jumped over the lazy dog.\n'
>>> tqbf += 'And then it slept.\n And the dog ran off.'
>>> corp = Corpus(tqbf)
>>> print(corp.raw())
The quick brown fox jumped over the lazy dog.
And then it slept.
And the dog ran off.
>>> len(corp.raw())
85
New in version 0.1.0.
-
sents()[source]¶
Return the sentences in the corpus.
Each list within a sentence represents the words within that sentence.
- Returns
The sentences in the corpus as a list of lists of strs
- Return type
[[str]]
Example
>>> tqbf = 'The quick brown fox jumped over the lazy dog.\n'
>>> tqbf += 'And then it slept.\n And the dog ran off.'
>>> corp = Corpus(tqbf)
>>> corp.sents()
[['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog.'],
['And', 'then', 'it', 'slept.'], ['And', 'the', 'dog', 'ran', 'off.']]
>>> len(corp.sents())
3
-
words()[source]¶
Return the words in the corpus as a single list.
- Returns
The words in the corpus as a list of strs
- Return type
[str]
Example
>>> tqbf = 'The quick brown fox jumped over the lazy dog.\n'
>>> tqbf += 'And then it slept.\n And the dog ran off.'
>>> corp = Corpus(tqbf)
>>> corp.words()
['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog.',
'And', 'then', 'it', 'slept.', 'And', 'the', 'dog', 'ran', 'off.']
>>> len(corp.words())
18
New in version 0.1.0.
-
class abydos.corpus.NGramCorpus(corpus=None)[source]¶
Bases: object
The NGramCorpus class.
Internally, this is a set of recursively embedded dicts, with n layers for a corpus of n-grams. E.g. for a trigram corpus, this will be a dict of dicts of dicts. More precisely, collections.Counter is used in place of dict, making multiset operations valid and allowing unattested n-grams to be queried.
The key at each level is a word. The value at the most deeply embedded level is a numeric value representing the frequency of the n-gram. E.g. the trigram frequency of 'colorless green ideas' would be the value stored in self.ngcorpus['colorless']['green']['ideas'][None].
New in version 0.3.0.
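The nested-Counter layout can be sketched directly with collections.Counter (an illustration of the structure described above, not the class's actual code):

```python
from collections import Counter

# One level of Counter per word, with the n-gram's frequency
# stored under the key None at the deepest level.
ngcorpus = Counter()
ngcorpus['colorless'] = Counter()
ngcorpus['colorless']['green'] = Counter()
ngcorpus['colorless']['green']['ideas'] = Counter()
ngcorpus['colorless']['green']['ideas'][None] += 1

print(ngcorpus['colorless']['green']['ideas'][None])  # 1
# Because each level is a Counter, an unattested continuation
# at an existing level yields 0 rather than a KeyError.
print(ngcorpus['colorless']['green']['sheep'])  # 0
```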
Initialize NGramCorpus.
- Parameters
corpus (Corpus) -- The Corpus from which to initialize the n-gram corpus. By default, this is None, which initializes an empty NGramCorpus. This can then be populated using NGramCorpus methods.
- Raises
TypeError -- Corpus argument must be None or of type abydos.Corpus
Example
>>> tqbf = 'The quick brown fox jumped over the lazy dog.\n'
>>> tqbf += 'And then it slept.\n And the dog ran off.'
>>> ngcorp = NGramCorpus(Corpus(tqbf))
New in version 0.3.0.
-
corpus_importer(corpus, n_val=1, bos='_START_', eos='_END_')[source]¶
Fill in self.ngcorpus from a Corpus argument.
- Parameters
corpus (Corpus) -- The Corpus from which to initialize the n-gram corpus
n_val (int) -- Maximum n value for n-grams
bos (str) -- String to insert as an indicator of beginning of sentence
eos (str) -- String to insert as an indicator of end of sentence
- Raises
TypeError -- Corpus argument of the Corpus class required.
Example
>>> tqbf = 'The quick brown fox jumped over the lazy dog.\n'
>>> tqbf += 'And then it slept.\n And the dog ran off.'
>>> ngcorp = NGramCorpus()
>>> ngcorp.corpus_importer(Corpus(tqbf))
New in version 0.3.0.
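The role of bos and eos can be sketched as follows (a simplified illustration of sentence padding and n-gram counting, not the method's actual implementation):

```python
from collections import Counter

def count_ngrams(sentences, n_val=2, bos='_START_', eos='_END_'):
    """Count n-grams up to length n_val, padding each sentence
    with beginning- and end-of-sentence markers."""
    counts = Counter()
    for words in sentences:
        padded = [bos] + words + [eos]
        for n in range(1, n_val + 1):
            for i in range(len(padded) - n + 1):
                counts[tuple(padded[i:i + n])] += 1
    return counts

counts = count_ngrams([['the', 'dog', 'ran'], ['the', 'dog', 'slept']])
print(counts[('the', 'dog')])      # 2
print(counts[('_START_', 'the')])  # 2
```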
-
get_count(ngram, corpus=None)[source]¶
Get the count of an n-gram in the corpus.
- Parameters
ngram (str) -- The n-gram to retrieve the count of from the n-gram corpus
corpus (Corpus) -- The corpus
- Returns
The n-gram count
- Return type
int
Examples
>>> tqbf = 'The quick brown fox jumped over the lazy dog.\n'
>>> tqbf += 'And then it slept.\n And the dog ran off.'
>>> ngcorp = NGramCorpus(Corpus(tqbf))
>>> ngcorp.get_count('the')
2
>>> ngcorp.get_count('fox')
1
New in version 0.3.0.
-
class
abydos.corpus.
UnigramCorpus
(corpus_text='', documents=0, word_transform=None, word_tokenizer=None)[source]¶ Bases:
object
Unigram corpus class.
Largely intended for calculating inverse document frequency (IDF) from a large corpus of unigram (or smaller) tokens, this class encapsulates a dict object. Each key is a unigram token whose value is a tuple consisting of the number of times the term appeared and the number of distinct documents in which it appeared.
New in version 0.4.0.
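That structure can be sketched as a plain dict of (term count, document count) pairs (illustrative only; add_document here is a stand-in, not the class's method):

```python
from collections import defaultdict

corpus = defaultdict(lambda: (0, 0))

def add_document(corpus, doc):
    """Record each token's total count and the number of
    distinct documents it appears in."""
    tokens = doc.split()
    for token in tokens:
        count, docs = corpus[token]
        corpus[token] = (count + 1, docs)
    for token in set(tokens):
        count, docs = corpus[token]
        corpus[token] = (count, docs + 1)

add_document(corpus, 'the quick brown fox')
add_document(corpus, 'the lazy dog')
print(corpus['the'])  # (2, 2): two occurrences across two documents
```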
Initialize UnigramCorpus.
- Parameters
corpus_text (str) -- The corpus text as a single string
documents (int) -- The number of documents in the corpus. If equal to 0 (the default), then the maximum of the internal dictionary's distinct-documents counts is used.
word_transform (function) -- A function to apply to each term before term tokenization and addition to the corpus. One might use this, for example, to apply Soundex encoding to each term.
word_tokenizer (_Tokenizer) -- A tokenizer to apply to each sentence in order to retrieve the individual "word" tokens. If set to None, str.split() will be used.
Example
>>> tqbf = 'The quick brown fox jumped over the lazy dog.\n'
>>> tqbf += 'And then it slept.\n And the dog ran off.'
>>> corp = UnigramCorpus(tqbf)
New in version 0.4.0.
-
add_document(doc)[source]¶
Add a new document to the corpus.
- Parameters
doc (str) -- A string, representing the document to be added.
New in version 0.4.0.
-
gng_importer(corpus_file)[source]¶
Fill in self.corpus from a Google NGram corpus file.
- Parameters
corpus_file (file) -- The Google NGram file from which to initialize the n-gram corpus
New in version 0.4.0.
-
idf(term)[source]¶
Calculate the Inverse Document Frequency of a term in the corpus.
- Parameters
term (str) -- The term to calculate the IDF of
- Returns
The IDF
- Return type
float
Examples
>>> tqbf = 'the quick brown fox jumped over the lazy dog\n\n'
>>> tqbf += 'and then it slept\n\n and the dog ran off'
>>> corp = UnigramCorpus(tqbf)
>>> round(corp.idf('dog'), 10)
0.6931471806
>>> round(corp.idf('the'), 10)
0.6931471806
New in version 0.4.0.
-
load_corpus(filename)[source]¶
Load the corpus from a file.
This employs pickle to load the corpus (a defaultdict). Other parameters of the corpus, such as its word_tokenizer, will not be affected and should be set during initialization.
- Parameters
filename (str) -- The filename to load the corpus from.
New in version 0.4.0.
-
save_corpus(filename)[source]¶
Save the corpus to a file.
This employs pickle to save the corpus (a defaultdict). Other parameters of the corpus, such as its word_tokenizer, will not be affected and should be set during initialization.
- Parameters
filename (str) -- The filename to save the corpus to.
New in version 0.4.0.
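Both methods rely on pickle round-tripping the underlying defaultdict. The same pattern in miniature (the temporary file is illustrative; note that the dict's default factory must itself be picklable, so a named function or built-in like int works where a lambda would not):

```python
import os
import pickle
from collections import defaultdict
from tempfile import NamedTemporaryFile

counts = defaultdict(int)
counts['dog'] = 2
counts['fox'] = 1

# Save the corpus dict to disk.
with NamedTemporaryFile(suffix='.pkl', delete=False) as f:
    pickle.dump(counts, f)
    path = f.name

# Load it back; the contents (and the int factory) survive.
with open(path, 'rb') as f:
    restored = pickle.load(f)
os.unlink(path)

print(restored['dog'])  # 2
```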