abydos.ngram module

The NGramCorpus class is a container for an n-gram corpus.

class abydos.ngram.NGramCorpus(corpus=None)

Bases: object

The NGramCorpus class.

Internally, this is a set of recursively embedded dicts, with n layers for a corpus of n-grams. E.g. for a trigram corpus, this will be a dict of dicts of dicts. More precisely, collections.Counter is used in place of dict, making multiset operations valid and allowing unattested n-grams to be queried.

The key at each level is a word. The value at the most deeply embedded level is a number representing the frequency of the n-gram, stored under the key None. E.g. the frequency of the trigram 'colorless green ideas' would be the value stored in self.ngcorpus['colorless']['green']['ideas'][None].
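
The nesting described above can be sketched with plain collections.Counter objects. This is only an illustration of the data layout, not the library's own code; the helper name add_ngram is hypothetical.

```python
from collections import Counter


def add_ngram(root, words, count=1):
    """Add `count` to the nested entry for the word sequence `words`.

    Hypothetical helper, used only to illustrate the layout.
    """
    node = root
    for word in words:
        # A missing Counter key reads as 0, so child levels are created explicitly.
        if word not in node:
            node[word] = Counter()
        node = node[word]
    # The frequency lives under the None key at the deepest level.
    node[None] += count


ngcorpus = Counter()
add_ngram(ngcorpus, ['colorless', 'green', 'ideas'])
add_ngram(ngcorpus, ['colorless', 'green', 'ideas'])
add_ngram(ngcorpus, ['colorless', 'green', 'dreams'])

print(ngcorpus['colorless']['green']['ideas'][None])  # 2
print(ngcorpus['colorless']['green']['sheep'])        # 0 (unattested)
```

Because a Counter returns 0 for a missing key instead of raising KeyError, the last lookup succeeds even though 'sheep' never follows 'colorless green' in the toy data.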

corpus_importer(corpus, n_val=1, bos='_START_', eos='_END_')

Fill in self.ngcorpus from a Corpus argument.

Parameters:
  • corpus (Corpus) – The Corpus from which to initialize the n-gram corpus
  • n_val (int) – The maximum n value for n-grams
  • bos (str) – The string to insert as an indicator of beginning of sentence
  • eos (str) – The string to insert as an indicator of end of sentence
>>> from abydos.corpus import Corpus
>>> from abydos.ngram import NGramCorpus
>>> tqbf = 'The quick brown fox jumped over the lazy dog.\n'
>>> tqbf += 'And then it slept.\n And the dog ran off.'
>>> ngcorp = NGramCorpus()
>>> ngcorp.corpus_importer(Corpus(tqbf))
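
As a rough illustration of how the n_val, bos and eos parameters might interact, the sketch below lists the n-grams of one tokenized sentence. It assumes that unigrams are taken from the bare sentence and that higher orders are taken from the sentence padded with the bos and eos markers; the helper padded_ngrams is hypothetical and the real importer may differ in detail.

```python
def padded_ngrams(words, n_val=1, bos='_START_', eos='_END_'):
    """List all n-grams of orders 1..n_val for one sentence (assumed behavior)."""
    # Unigrams come from the sentence itself.
    ngrams = [(word,) for word in words]
    # Higher orders see the sentence wrapped in boundary markers.
    padded = [bos] + list(words) + [eos]
    for n in range(2, n_val + 1):
        for i in range(len(padded) - n + 1):
            ngrams.append(tuple(padded[i:i + n]))
    return ngrams


bigrams = padded_ngrams(['And', 'then', 'it', 'slept'], n_val=2)
for ngram in bigrams:
    print(ngram)
```

With n_val=2 this prints the four unigrams followed by five bigrams, the first being ('_START_', 'And') and the last ('slept', '_END_').
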
get_count(ngram, corpus=None)

Get the count of an n-gram in the corpus.

Parameters:
  • ngram (list, tuple, or str) – The n-gram whose count is to be retrieved from the n-gram corpus
  • corpus (Corpus) – The corpus to search (defaults to None)
Returns: The n-gram count
Return type: int

>>> tqbf = 'The quick brown fox jumped over the lazy dog.\n'
>>> tqbf += 'And then it slept.\n And the dog ran off.'
>>> ngcorp = NGramCorpus(Corpus(tqbf))
>>> ngcorp.get_count('the')
2
>>> ngcorp.get_count('fox')
1
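
A lookup over the nested Counter layout can be sketched as a walk down one level per word. The function below is a hypothetical stand-in, not the method itself; it assumes that a string argument is split on whitespace and that an unattested n-gram counts as 0.

```python
from collections import Counter


def get_count_sketch(ngram, node):
    """Return the count stored for `ngram` in a nested Counter (illustrative)."""
    # Accept a string as well as a list or tuple of words.
    if isinstance(ngram, str):
        ngram = ngram.split()
    for word in ngram:
        if word not in node:
            return 0  # unattested n-gram
        node = node[word]
    # The count sits under the None key of the final level.
    return node[None]


# Toy data, built by hand for the example.
toy = Counter()
toy['the'] = Counter({None: 2})
toy['the']['lazy'] = Counter({None: 1})
toy['fox'] = Counter({None: 1})

print(get_count_sketch('the', toy))              # 2
print(get_count_sketch(('the', 'lazy'), toy))    # 1
print(get_count_sketch('the quick', toy))        # 0
```
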
gng_importer(corpus_file)

Fill in self.ngcorpus from a Google NGram corpus file.

Parameters: corpus_file (file) – The Google NGram file from which to initialize the n-gram corpus
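
For orientation, the sketch below reads lines in the tab-separated layout of the Google Books Ngram dataset, assumed here to be ngram, year, match_count, volume_count. Whether gng_importer expects exactly this layout is an assumption; the function read_gng_lines and the sample rows are made up for illustration.

```python
from collections import Counter


def read_gng_lines(lines):
    """Sum match counts per n-gram over all years (assumed file layout)."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        ngram = tuple(fields[0].split())  # words of the n-gram
        counts[ngram] += int(fields[2])   # third column: match_count
    return counts


# Invented sample rows: ngram, year, match_count, volume_count.
sample = [
    'lazy dog\t1999\t12\t7\n',
    'lazy dog\t2000\t15\t9\n',
    'quick fox\t2000\t3\t2\n',
]
counts = read_gng_lines(sample)
print(counts[('lazy', 'dog')])   # 27
print(counts[('quick', 'fox')])  # 3
```
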
tf(term)

Return term frequency.

Parameters: term (str) – The term for which to calculate tf
Returns: The term frequency (tf)
Return type: float
>>> tqbf = 'The quick brown fox jumped over the lazy dog.\n'
>>> tqbf += 'And then it slept.\n And the dog ran off.'
>>> ngcorp = NGramCorpus(Corpus(tqbf))
>>> ngcorp.tf('the')
1.3010299956639813
>>> ngcorp.tf('fox')
1.0
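
The values in this example are consistent with a log-scaled term frequency, 1 + log10(count), applied to the raw counts 2 and 1. The sketch below reproduces them under that assumption; the function name log_tf is hypothetical, and returning 0.0 for an unattested term is a guess rather than documented behavior.

```python
from math import log10


def log_tf(count):
    """Log-scaled term frequency for a raw count (assumed formula)."""
    if count == 0:
        return 0.0  # assumption: unattested terms get a tf of zero
    return 1 + log10(count)


print(log_tf(2))  # count of 'the' in the example corpus
print(log_tf(1))  # count of 'fox' in the example corpus
```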