abydos.corpus package
The corpus package includes basic and n-gram corpus classes, Corpus and NGramCorpus.
As a quick example of Corpus:
>>> tqbf = 'The quick brown fox jumped over the lazy dog.\n\n'
>>> tqbf += 'And then it slept.\n\n And the dog ran off.'
>>> corp = Corpus(tqbf)
>>> corp.docs()
[[['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog.']],
[['And', 'then', 'it', 'slept.']], [['And', 'the', 'dog', 'ran', 'off.']]]
>>> round(corp.idf('dog'), 10)
0.4771212547
>>> round(corp.idf('the'), 10)
0.1760912591
Here, each sentence is a separate "document". We can retrieve IDF values from the Corpus. The same Corpus can be used to initialize an NGramCorpus and calculate TF values:
>>> ngcorp = NGramCorpus(corp)
>>> ngcorp.get_count('the')
2
>>> ngcorp.get_count('fox')
1
>>> ngcorp.tf('the')
1.3010299956639813
>>> ngcorp.tf('fox')
1.0
class abydos.corpus.Corpus(corpus_text='', doc_split='\n\n', sent_split='\n', filter_chars='', stop_words=None)
Bases: object
Corpus class.
Internally, this is a list of lists of lists. The corpus itself is a list of documents. Each document is an ordered list of the sentences in that document, and each sentence is an ordered list of the words that make up that sentence.
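The nesting described above can be illustrated without the library. The following is a minimal sketch, assuming documents are separated by a blank line and sentences by a newline, as the doctests on this page suggest. `split_corpus` is a hypothetical helper, not part of abydos, and it ignores filter_chars and stop_words.

```python
def split_corpus(text, doc_split='\n\n', sent_split='\n'):
    """Split text into documents, sentences, and words (hypothetical helper)."""
    docs = []
    for doc in text.split(doc_split):
        # Each sentence becomes a list of whitespace-separated words.
        sents = [sent.split() for sent in doc.split(sent_split)]
        # Drop empty sentences left behind by stray separators.
        docs.append([sent for sent in sents if sent])
    return docs


text = 'The quick brown fox.\n\nAnd then it slept.\nAnd the dog ran off.'
nested = split_corpus(text)
# nested[0] is the first document, nested[1][0] the first sentence of the second.
```

The result has two documents: the first holds one sentence and the second holds two.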
docs()
Return the docs in the corpus.
Each list within a doc represents the sentences in that doc, each of which is in turn a list of words within that sentence.
Returns: The docs in the corpus as a list of lists of lists of strs
Return type: [[[str]]]
Example
>>> tqbf = 'The quick brown fox jumped over the lazy dog.\n'
>>> tqbf += 'And then it slept.\n And the dog ran off.'
>>> corp = Corpus(tqbf)
>>> corp.docs()
[[['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog.'],
['And', 'then', 'it', 'slept.'], ['And', 'the', 'dog', 'ran', 'off.']]]
>>> len(corp.docs())
1
docs_of_words()
Return the docs in the corpus, with sentences flattened.
Each list within the corpus represents all the words of that document. Thus the sentence level of lists has been flattened.
Returns: The docs in the corpus as a list of lists of strs
Return type: [[str]]
Example
>>> tqbf = 'The quick brown fox jumped over the lazy dog.\n'
>>> tqbf += 'And then it slept.\n And the dog ran off.'
>>> corp = Corpus(tqbf)
>>> corp.docs_of_words()
[['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog.',
'And', 'then', 'it', 'slept.', 'And', 'the', 'dog', 'ran', 'off.']]
>>> len(corp.docs_of_words())
1
idf(term, transform=None)
Calculate the Inverse Document Frequency of a term in the corpus.
Parameters:
- term (str) -- The term to calculate the IDF of
- transform (function) -- A function to apply to each document term before checking for the presence of term
Returns: The IDF
Return type: float
Examples
>>> tqbf = 'The quick brown fox jumped over the lazy dog.\n\n'
>>> tqbf += 'And then it slept.\n\n And the dog ran off.'
>>> corp = Corpus(tqbf)
>>> print(corp.docs())
[[['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog.']],
[['And', 'then', 'it', 'slept.']], [['And', 'the', 'dog', 'ran', 'off.']]]
>>> round(corp.idf('dog'), 10)
0.4771212547
>>> round(corp.idf('the'), 10)
0.1760912591
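The values in the examples above are consistent with IDF computed as log10(N / df), where N is the number of documents and df is the number of documents containing the term. The sketch below reproduces them in plain Python and shows the effect of a transform. It assumes that formula and exact string matching; `idf_sketch` and `normalize` are hypothetical names, and the handling of unattested terms is an assumption rather than confirmed library behavior.

```python
from math import log10


def idf_sketch(docs, term, transform=None):
    """IDF as log10(N / df) over nested docs (hypothetical helper)."""
    matches = 0
    for doc in docs:
        # Flatten the sentence level so each doc is one list of words.
        words = [word for sent in doc for word in sent]
        if transform is not None:
            words = [transform(word) for word in words]
        if term in words:
            matches += 1
    if matches == 0:
        return float('inf')  # assumption: no document contains the term
    return log10(len(docs) / matches)


docs = [
    [['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog.']],
    [['And', 'then', 'it', 'slept.']],
    [['And', 'the', 'dog', 'ran', 'off.']],
]


def normalize(word):
    """Lowercase a word and strip trailing periods (hypothetical transform)."""
    return word.lower().rstrip('.')


plain = idf_sketch(docs, 'dog')                  # only the third doc has 'dog'
normalized = idf_sketch(docs, 'dog', normalize)  # 'dog.' now matches as well
```

Without a transform, 'dog.' in the first document does not match 'dog', so only one of three documents counts. With the transform, two documents match and the IDF drops accordingly.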
paras()
Return the paragraphs in the corpus.
Each list within a paragraph represents the sentences in that paragraph, each of which is in turn a list of words within that sentence. This is identical to the docs() member function and exists only to mirror part of NLTK's API for corpora.
Returns: The paragraphs in the corpus as a list of lists of lists of strs
Return type: [[[str]]]
Example
>>> tqbf = 'The quick brown fox jumped over the lazy dog.\n'
>>> tqbf += 'And then it slept.\n And the dog ran off.'
>>> corp = Corpus(tqbf)
>>> corp.paras()
[[['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog.'],
['And', 'then', 'it', 'slept.'], ['And', 'the', 'dog', 'ran', 'off.']]]
>>> len(corp.paras())
1
raw()
Return the raw corpus.
This is reconstructed by joining sub-components with the corpus' split characters.
Returns: The raw corpus
Return type: str
Example
>>> tqbf = 'The quick brown fox jumped over the lazy dog.\n'
>>> tqbf += 'And then it slept.\n And the dog ran off.'
>>> corp = Corpus(tqbf)
>>> print(corp.raw())
The quick brown fox jumped over the lazy dog.
And then it slept.
And the dog ran off.
>>> len(corp.raw())
85
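The reconstruction described above can be sketched as a nested join. This is an illustration only, assuming words are joined with single spaces, sentences with a newline, and documents with a blank line; `join_raw` is a hypothetical helper, not the library's implementation.

```python
def join_raw(docs, doc_split='\n\n', sent_split='\n'):
    """Rebuild a raw string from nested docs (hypothetical helper)."""
    return doc_split.join(
        sent_split.join(' '.join(sent) for sent in doc)
        for doc in docs
    )


docs = [[
    ['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog.'],
    ['And', 'then', 'it', 'slept.'],
    ['And', 'the', 'dog', 'ran', 'off.'],
]]
rebuilt = join_raw(docs)
```

Under these assumptions the rebuilt string has 85 characters, one fewer than the 86-character input in the example, because the space before the last "And" is discarded when the sentence is split into words.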
sents()
Return the sentences in the corpus.
Each list in the result holds the words of one sentence.
Returns: The sentences in the corpus as a list of lists of strs
Return type: [[str]]
Example
>>> tqbf = 'The quick brown fox jumped over the lazy dog.\n'
>>> tqbf += 'And then it slept.\n And the dog ran off.'
>>> corp = Corpus(tqbf)
>>> corp.sents()
[['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog.'],
['And', 'then', 'it', 'slept.'], ['And', 'the', 'dog', 'ran', 'off.']]
>>> len(corp.sents())
3
words()
Return the words in the corpus as a single list.
Returns: The words in the corpus as a list of strs
Return type: [str]
Example
>>> tqbf = 'The quick brown fox jumped over the lazy dog.\n'
>>> tqbf += 'And then it slept.\n And the dog ran off.'
>>> corp = Corpus(tqbf)
>>> corp.words()
['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog.',
'And', 'then', 'it', 'slept.', 'And', 'the', 'dog', 'ran', 'off.']
>>> len(corp.words())
18
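The accessors above differ only in how many levels of nesting they flatten. A minimal sketch of that relationship in plain Python, using the nested structure from the examples; the variable names are illustrative and this is not the library's code:

```python
docs = [[
    ['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog.'],
    ['And', 'then', 'it', 'slept.'],
    ['And', 'the', 'dog', 'ran', 'off.'],
]]

# Flatten the document level: one list per sentence.
sents = [sent for doc in docs for sent in doc]

# Flatten the sentence level only: one list of words per document.
docs_of_words = [[word for sent in doc for word in sent] for doc in docs]

# Flatten both levels: a single list of words.
words = [word for sent in sents for word in sent]
```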
class abydos.corpus.NGramCorpus(corpus=None)
Bases: object
The NGramCorpus class.
Internally, this is a set of recursively embedded dicts, with n layers for a corpus of n-grams. E.g. for a trigram corpus, this will be a dict of dicts of dicts. More precisely, collections.Counter is used in place of dict, making multiset operations valid and allowing unattested n-grams to be queried.
The key at each level is a word. The value at the most deeply embedded level is a numeric value representing the frequency of the trigram. E.g. the trigram frequency of 'colorless green ideas' would be the value stored in self.ngcorpus['colorless']['green']['ideas'][None].
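The nested layout described above can be sketched with collections.Counter directly. This is an illustration of the data structure only, under the assumption that each level maps a word to a child Counter and that the None key holds the count; `add_ngram` and `lookup` are hypothetical helpers, not abydos functions.

```python
from collections import Counter


def add_ngram(root, words, count=1):
    """Record an n-gram in a nested Counter structure (hypothetical helper)."""
    node = root
    for word in words:
        if word not in node:
            node[word] = Counter()
        node = node[word]
    # The None key at the deepest level stores the n-gram frequency.
    node[None] += count


def lookup(root, words):
    """Return the stored count, or 0 for an unattested n-gram."""
    node = root
    for word in words:
        if word not in node:
            return 0
        node = node[word]
    return node[None]  # a Counter returns 0 for a missing key


ngcorpus = Counter()
add_ngram(ngcorpus, ['colorless', 'green', 'ideas'])
add_ngram(ngcorpus, ['colorless', 'green', 'ideas'])
```

After the two insertions, ngcorpus['colorless']['green']['ideas'][None] holds 2, while a lookup of an n-gram that was never added returns 0 instead of raising KeyError.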
corpus_importer(corpus, n_val=1, bos='_START_', eos='_END_')
Fill in self.ngcorpus from a Corpus argument.
Parameters:
- corpus (Corpus) -- The Corpus from which to initialize the n-gram corpus
- n_val (int) -- Maximum n value for n-grams
- bos (str) -- String to insert as an indicator of beginning of sentence
- eos (str) -- String to insert as an indicator of end of sentence
Raises: TypeError -- Corpus argument of the Corpus class required.
Example
>>> tqbf = 'The quick brown fox jumped over the lazy dog.\n'
>>> tqbf += 'And then it slept.\n And the dog ran off.'
>>> ngcorp = NGramCorpus()
>>> ngcorp.corpus_importer(Corpus(tqbf))
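To make the roles of n_val, bos, and eos concrete, here is a rough sketch of sentence-padded n-gram counting. It is not the library's implementation: `count_ngrams` is a hypothetical helper, it stores n-grams as tuples in a flat Counter, and it assumes every sentence is padded with the markers regardless of n_val, which may differ from what corpus_importer actually does.

```python
from collections import Counter


def count_ngrams(sents, n_val=1, bos='_START_', eos='_END_'):
    """Count all n-grams up to length n_val (hypothetical helper)."""
    counts = Counter()
    for sent in sents:
        # Assumption: pad every sentence with boundary markers.
        padded = [bos] + sent + [eos]
        for n in range(1, n_val + 1):
            for i in range(len(padded) - n + 1):
                counts[tuple(padded[i:i + n])] += 1
    return counts


sents = [
    ['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog.'],
    ['And', 'then', 'it', 'slept.'],
    ['And', 'the', 'dog', 'ran', 'off.'],
]
bigrams = count_ngrams(sents, n_val=2)
```

With n_val=2 the counter holds both unigrams and bigrams, so the bigram ('_START_', 'And') records how many sentences begin with "And".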
get_count(ngram, corpus=None)
Get the count of an n-gram in the corpus.
Parameters:
- ngram (str) -- The n-gram to retrieve the count of from the n-gram corpus
- corpus (Corpus) -- The corpus
Returns: The n-gram count
Return type: int
Examples
>>> tqbf = 'The quick brown fox jumped over the lazy dog.\n'
>>> tqbf += 'And then it slept.\n And the dog ran off.'
>>> ngcorp = NGramCorpus(Corpus(tqbf))
>>> ngcorp.get_count('the')
2
>>> ngcorp.get_count('fox')
1
gng_importer(corpus_file)
Fill in self.ngcorpus from a Google NGram corpus file.
Parameters: corpus_file (file) -- The Google NGram file from which to initialize the n-gram corpus
tf(term)
Return term frequency.
Parameters: term (str) -- The term for which to calculate tf
Returns: The term frequency (tf)
Return type: float
Raises: ValueError -- tf can only calculate the frequency of individual words
Examples
>>> tqbf = 'The quick brown fox jumped over the lazy dog.\n'
>>> tqbf += 'And then it slept.\n And the dog ran off.'
>>> ngcorp = NGramCorpus(Corpus(tqbf))
>>> ngcorp.tf('the')
1.3010299956639813
>>> ngcorp.tf('fox')
1.0
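The tf values in the examples above match a log-scaled term frequency of 1 + log10(count): a count of 2 gives about 1.30103 and a count of 1 gives 1.0. The sketch below assumes that formula. `tf_sketch` is a hypothetical helper working on a plain dict of counts, and returning 0.0 for an unattested term is an assumption rather than confirmed library behavior.

```python
from math import log10


def tf_sketch(counts, term):
    """Log-scaled term frequency, 1 + log10(count) (hypothetical helper)."""
    if ' ' in term:
        raise ValueError(
            'tf can only calculate the frequency of individual words'
        )
    count = counts.get(term, 0)
    if count == 0:
        return 0.0  # assumption: unattested terms get a tf of 0
    return 1 + log10(count)


counts = {'the': 2, 'fox': 1}
tf_the = tf_sketch(counts, 'the')
tf_fox = tf_sketch(counts, 'fox')
```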