abydos.corpus package

The corpus package includes basic and n-gram corpus classes: Corpus, NGramCorpus, and UnigramCorpus.

As a quick example of Corpus:

>>> tqbf = 'The quick brown fox jumped over the lazy dog.\n\n'
>>> tqbf += 'And then it slept.\n\n And the dog ran off.'
>>> corp = Corpus(tqbf)
>>> corp.docs()
[[['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog.']],
[['And', 'then', 'it', 'slept.']], [['And', 'the', 'dog', 'ran', 'off.']]]
>>> round(corp.idf('dog'), 10)
1.0986122887
>>> round(corp.idf('the'), 10)
0.4054651081

Here, each sentence is a separate "document". We can retrieve IDF values from the Corpus. The same Corpus can be used to initialize an NGramCorpus and calculate TF values:

>>> ngcorp = NGramCorpus(corp)
>>> ngcorp.get_count('the')
2
>>> ngcorp.get_count('fox')
1

class abydos.corpus.Corpus(corpus_text='', doc_split='\n\n', sent_split='\n', filter_chars='', stop_words=None, word_tokenizer=None)[source]

Bases: object

Corpus class.

Internally, this is a list of lists of lists. The corpus itself is a list of documents. Each document is an ordered list of the sentences in that document, and each sentence is an ordered list of the words that make up that sentence.

New in version 0.1.0.

Initialize Corpus.

By default, when importing a corpus:
  • two consecutive newlines divide documents

  • single newlines divide sentences

  • other whitespace divides words

Parameters
  • corpus_text (str) -- The corpus text as a single string

  • doc_split (str) -- A character or string used to split corpus_text into documents

  • sent_split (str) -- A character or string used to split documents into sentences

  • filter_chars (list) -- A list of characters (as a string, tuple, set, or list) to filter out of the corpus text

  • stop_words (list) -- A list of words (as a tuple, set, or list) to filter out of the corpus text

  • word_tokenizer (_Tokenizer) -- A tokenizer to apply to each sentence in order to retrieve the individual "word" tokens. If set to None, str.split() will be used.

Example

>>> tqbf = 'The quick brown fox jumped over the lazy dog.\n'
>>> tqbf += 'And then it slept.\n And the dog ran off.'
>>> corp = Corpus(tqbf)
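
Since the corpus is stored as nested lists (documents, then sentences, then words), the value returned by docs() can be indexed level by level. The following sketch is illustrative; its outputs follow from the docs() example shown below under docs():

>>> corp.docs()[0][1]
['And', 'then', 'it', 'slept.']
>>> corp.docs()[0][1][3]
'slept.'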

New in version 0.1.0.

docs()[source]

Return the docs in the corpus.

Each list within a doc represents the sentences in that doc, each of which is in turn a list of words within that sentence.

Returns

The docs in the corpus as a list of lists of lists of strs

Return type

[[[str]]]

Example

>>> tqbf = 'The quick brown fox jumped over the lazy dog.\n'
>>> tqbf += 'And then it slept.\n And the dog ran off.'
>>> corp = Corpus(tqbf)
>>> corp.docs()
[[['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy',
'dog.'], ['And', 'then', 'it', 'slept.'], ['And', 'the', 'dog',
'ran', 'off.']]]
>>> len(corp.docs())
1

New in version 0.1.0.

docs_of_words()[source]

Return the docs in the corpus, with sentences flattened.

Each list within the corpus represents all the words of that document. Thus the sentence level of lists has been flattened.

Returns

The docs in the corpus as a list of list of strs

Return type

[[str]]

Example

>>> tqbf = 'The quick brown fox jumped over the lazy dog.\n'
>>> tqbf += 'And then it slept.\n And the dog ran off.'
>>> corp = Corpus(tqbf)
>>> corp.docs_of_words()
[['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy',
'dog.', 'And', 'then', 'it', 'slept.', 'And', 'the', 'dog', 'ran',
'off.']]
>>> len(corp.docs_of_words())
1

New in version 0.1.0.

idf(term, transform=None)[source]

Calculate the Inverse Document Frequency of a term in the corpus.

Parameters
  • term (str) -- The term to calculate the IDF of

  • transform (function) -- A function to apply to each document term before checking for the presence of term

Returns

The IDF

Return type

float

Examples

>>> tqbf = 'The quick brown fox jumped over the lazy dog.\n\n'
>>> tqbf += 'And then it slept.\n\n And the dog ran off.'
>>> corp = Corpus(tqbf)
>>> print(corp.docs())
[[['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy',
'dog.']],
[['And', 'then', 'it', 'slept.']],
[['And', 'the', 'dog', 'ran', 'off.']]]
>>> round(corp.idf('dog'), 10)
1.0986122887
>>> round(corp.idf('the'), 10)
0.4054651081
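
The values above are consistent with a natural-log IDF of the form ln(N / df), where N is the number of documents (3) and df is the number of documents containing the term. A quick illustrative check (not the library's internal code):

>>> import math
>>> round(math.log(3 / 1), 10)  # 'dog' occurs in 1 of the 3 documents
1.0986122887
>>> round(math.log(3 / 2), 10)  # 'the' occurs in 2 of the 3 documents
0.4054651081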

New in version 0.1.0.

paras()[source]

Return the paragraphs in the corpus.

Each list within a paragraph represents the sentences in that doc, each of which is in turn a list of words within that sentence. This is identical to the docs() member function and exists only to mirror part of NLTK's API for corpora.

Returns

The paragraphs in the corpus as a list of lists of lists of strs

Return type

[[[str]]]

Example

>>> tqbf = 'The quick brown fox jumped over the lazy dog.\n'
>>> tqbf += 'And then it slept.\n And the dog ran off.'
>>> corp = Corpus(tqbf)
>>> corp.paras()
[[['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy',
'dog.'], ['And', 'then', 'it', 'slept.'], ['And', 'the', 'dog',
'ran', 'off.']]]
>>> len(corp.paras())
1

New in version 0.1.0.

raw()[source]

Return the raw corpus.

This is reconstructed by joining sub-components with the corpus' split characters.

Returns

The raw corpus

Return type

str

Example

>>> tqbf = 'The quick brown fox jumped over the lazy dog.\n'
>>> tqbf += 'And then it slept.\n And the dog ran off.'
>>> corp = Corpus(tqbf)
>>> print(corp.raw())
The quick brown fox jumped over the lazy dog.
And then it slept.
And the dog ran off.
>>> len(corp.raw())
85

New in version 0.1.0.

sents()[source]

Return the sentences in the corpus.

Each list within a sentence represents the words within that sentence.

Returns

The sentences in the corpus as a list of lists of strs

Return type

[[str]]

Example

>>> tqbf = 'The quick brown fox jumped over the lazy dog.\n'
>>> tqbf += 'And then it slept.\n And the dog ran off.'
>>> corp = Corpus(tqbf)
>>> corp.sents()
[['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy',
'dog.'], ['And', 'then', 'it', 'slept.'], ['And', 'the', 'dog',
'ran', 'off.']]
>>> len(corp.sents())
3

New in version 0.1.0.

words()[source]

Return the words in the corpus as a single list.

Returns

The words in the corpus as a list of strs

Return type

[str]

Example

>>> tqbf = 'The quick brown fox jumped over the lazy dog.\n'
>>> tqbf += 'And then it slept.\n And the dog ran off.'
>>> corp = Corpus(tqbf)
>>> corp.words()
['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy',
'dog.', 'And', 'then', 'it', 'slept.', 'And', 'the', 'dog', 'ran',
'off.']
>>> len(corp.words())
18

New in version 0.1.0.

class abydos.corpus.NGramCorpus(corpus=None)[source]

Bases: object

The NGramCorpus class.

Internally, this is a set of recursively embedded dicts, with n layers for a corpus of n-grams. E.g. for a trigram corpus, this will be a dict of dicts of dicts. More precisely, collections.Counter is used in place of dict, making multiset operations valid and allowing unattested n-grams to be queried.

The key at each level is a word. The value at the most deeply embedded level is a numeric value representing the frequency of the trigram. E.g. the trigram frequency of 'colorless green ideas' would be the value stored in self.ngcorpus['colorless']['green']['ideas'][None].
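
As an illustration of the nested-Counter layout described above (built by hand with hypothetical values, not by library code):

>>> from collections import Counter
>>> ngcorpus = Counter()
>>> ngcorpus['colorless'] = Counter()
>>> ngcorpus['colorless']['green'] = Counter()
>>> ngcorpus['colorless']['green']['ideas'] = Counter({None: 1})
>>> ngcorpus['colorless']['green']['ideas'][None]
1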

New in version 0.3.0.

Initialize NGramCorpus.

Parameters

corpus (Corpus) -- The Corpus from which to initialize the n-gram corpus. By default, this is None, which initializes an empty NGramCorpus. This can then be populated using NGramCorpus methods.

Raises

TypeError -- Corpus argument must be None or of type abydos.Corpus

Example

>>> tqbf = 'The quick brown fox jumped over the lazy dog.\n'
>>> tqbf += 'And then it slept.\n And the dog ran off.'
>>> ngcorp = NGramCorpus(Corpus(tqbf))

New in version 0.3.0.

corpus_importer(corpus, n_val=1, bos='_START_', eos='_END_')[source]

Fill in self.ngcorpus from a Corpus argument.

Parameters
  • corpus (Corpus) -- The Corpus from which to initialize the n-gram corpus

  • n_val (int) -- Maximum n value for n-grams

  • bos (str) -- String to insert as an indicator of beginning of sentence

  • eos (str) -- String to insert as an indicator of end of sentence

Raises

TypeError -- Corpus argument of the Corpus class required.

Example

>>> tqbf = 'The quick brown fox jumped over the lazy dog.\n'
>>> tqbf += 'And then it slept.\n And the dog ran off.'
>>> ngcorp = NGramCorpus()
>>> ngcorp.corpus_importer(Corpus(tqbf))
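
To store higher-order n-grams, raise n_val. A minimal sketch (bigram import of the same text, with the default sentence markers written out explicitly):

>>> ngcorp = NGramCorpus()
>>> ngcorp.corpus_importer(Corpus(tqbf), n_val=2, bos='_START_', eos='_END_')

Per the class description above, bigram counts are then stored in the nested form self.ngcorpus[word1][word2][None].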

New in version 0.3.0.

get_count(ngram, corpus=None)[source]

Get the count of an n-gram in the corpus.

Parameters
  • ngram (str) -- The n-gram to retrieve the count of from the n-gram corpus

  • corpus (Corpus) -- The n-gram corpus to search. If None (the default), the object's own n-gram corpus is used.

Returns

The n-gram count

Return type

int

Examples

>>> tqbf = 'The quick brown fox jumped over the lazy dog.\n'
>>> tqbf += 'And then it slept.\n And the dog ran off.'
>>> ngcorp = NGramCorpus(Corpus(tqbf))
>>> ngcorp.get_count('the')
2
>>> ngcorp.get_count('fox')
1

New in version 0.3.0.

gng_importer(corpus_file)[source]

Fill in self.ngcorpus from a Google NGram corpus file.

Parameters

corpus_file (file) -- The Google NGram file from which to initialize the n-gram corpus
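
A minimal usage sketch, assuming corpus_file accepts the path of a local Google NGram file (the filename here is hypothetical):

>>> ngcorp = NGramCorpus()
>>> ngcorp.gng_importer('googlebooks-eng-all-1gram-sample.txt')  # hypothetical path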

New in version 0.3.0.

class abydos.corpus.UnigramCorpus(corpus_text='', documents=0, word_transform=None, word_tokenizer=None)[source]

Bases: object

Unigram corpus class.

Largely intended for calculating inverse document frequency (IDF) from a large corpus of unigram (or smaller) tokens, this class encapsulates a dict object. Each key is a unigram token, and its value is a tuple consisting of the number of times the term appeared and the number of distinct documents in which it appeared.

New in version 0.4.0.

Initialize UnigramCorpus.

Parameters
  • corpus_text (str) -- The corpus text as a single string

  • documents (int) -- The number of documents in the corpus. If equal to 0 (the default), the maximum distinct-documents count in the internal dictionary is used.

  • word_transform (function) -- A function to apply to each term before term tokenization and addition to the corpus. One might use this, for example, to apply Soundex encoding to each term.

  • word_tokenizer (_Tokenizer) -- A tokenizer to apply to each sentence in order to retrieve the individual "word" tokens. If set to None, str.split() will be used.

Example

>>> tqbf = 'The quick brown fox jumped over the lazy dog.\n'
>>> tqbf += 'And then it slept.\n And the dog ran off.'
>>> corp = UnigramCorpus(tqbf)

New in version 0.4.0.

add_document(doc)[source]

Add a new document to the corpus.

Parameters

doc (str) -- A string, representing the document to be added.
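
A minimal sketch of building a corpus one document at a time:

>>> corp = UnigramCorpus()
>>> corp.add_document('And then it slept')
>>> corp.add_document('And the dog ran off')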

New in version 0.4.0.

gng_importer(corpus_file)[source]

Fill in self.corpus from a Google NGram corpus file.

Parameters

corpus_file (file) -- The Google NGram file from which to initialize the n-gram corpus

New in version 0.4.0.

idf(term)[source]

Calculate the Inverse Document Frequency of a term in the corpus.

Parameters

term (str) -- The term to calculate the IDF of

Returns

The IDF

Return type

float

Examples

>>> tqbf = 'the quick brown fox jumped over the lazy dog\n\n'
>>> tqbf += 'and then it slept\n\n and the dog ran off'
>>> corp = UnigramCorpus(tqbf)
>>> round(corp.idf('dog'), 10)
0.6931471806
>>> round(corp.idf('the'), 10)
0.6931471806

New in version 0.4.0.

load_corpus(filename)[source]

Load the corpus from a file.

This employs pickle to load the corpus (a defaultdict). Other parameters of the corpus, such as its word_tokenizer, will not be affected and should be set during initialization.

Parameters

filename (str) -- The filename to load the corpus from.

New in version 0.4.0.

save_corpus(filename)[source]

Save the corpus to a file.

This employs pickle to save the corpus (a defaultdict). Other parameters of the corpus, such as its word_tokenizer, will not be affected and should be set during initialization.

Parameters

filename (str) -- The filename to save the corpus to.
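
A minimal round-trip sketch pairing save_corpus() with load_corpus() above (the pickle filename is hypothetical):

>>> tqbf = 'The quick brown fox jumped over the lazy dog.\n'
>>> tqbf += 'And then it slept.\n And the dog ran off.'
>>> corp = UnigramCorpus(tqbf)
>>> corp.save_corpus('tqbf_corpus.pkl')
>>> corp2 = UnigramCorpus()
>>> corp2.load_corpus('tqbf_corpus.pkl')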

New in version 0.4.0.