abydos.ngram module
abydos.ngram.
The NGramCorpus class is a container for an n-gram corpus.
class abydos.ngram.NGramCorpus(corpus=None)
Bases: object
The NGramCorpus class.
Internally, this is a set of recursively embedded dicts, with n layers for a corpus of n-grams. E.g. for a trigram corpus, this will be a dict of dicts of dicts. More precisely, collections.Counter is used in place of dict, making multiset operations valid and allowing unattested n-grams to be queried.
The key at each level is a word. The value at the most deeply embedded level is a numeric value representing the frequency of the trigram. E.g. the trigram frequency of 'colorless green ideas' would be the value stored in self.ngcorpus['colorless']['green']['ideas'][None].
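As a rough illustration of the layout described above, the following sketch builds the nested structure by hand. The helper names add_ngram and lookup are hypothetical and are not part of abydos; only the nesting of Counter objects and the use of the None key follow the description.

```python
from collections import Counter


def add_ngram(level, words, count=1):
    """Store `count` for the n-gram `words` in nested Counters (hypothetical helper)."""
    head, rest = words[0], words[1:]
    if head not in level:
        level[head] = Counter()
    if rest:
        # Descend one layer per remaining word.
        add_ngram(level[head], rest, count)
    else:
        # The frequency lives under the None key at the deepest level.
        level[head][None] += count


def lookup(level, words):
    """Return the stored count, or 0 for an unattested n-gram (hypothetical helper)."""
    for word in words:
        if word not in level:
            return 0
        level = level[word]
    return level[None]


ngcorpus = Counter()
add_ngram(ngcorpus, ['colorless', 'green', 'ideas'])
add_ngram(ngcorpus, ['colorless', 'green', 'ideas'])
add_ngram(ngcorpus, ['colorless', 'green'])

# Same access path as in the description above.
trigram_count = ngcorpus['colorless']['green']['ideas'][None]
```

Because the count sits under None, a bigram such as 'colorless green' can hold its own frequency alongside the trigrams that extend it.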
corpus_importer(corpus, n_val=1, bos='_START_', eos='_END_')
Fill in self.ngcorpus from a Corpus argument.
Parameters:
- corpus (Corpus) – The Corpus from which to initialize the n-gram corpus
- n_val (int) – maximum n value for n-grams
- bos (str) – string to insert as an indicator of beginning of sentence
- eos (str) – string to insert as an indicator of end of sentence
>>> tqbf = 'The quick brown fox jumped over the lazy dog.\n'
>>> tqbf += 'And then it slept.\n And the dog ran off.'
>>> ngcorp = NGramCorpus()
>>> ngcorp.corpus_importer(Corpus(tqbf))
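To make the roles of n_val, bos, and eos more concrete, here is a minimal sketch of how one tokenized sentence might be expanded into n-grams of every length up to n_val. The function padded_ngrams is hypothetical, and the choice to leave the boundary markers out of the unigram counts is an assumption of this sketch, not a documented guarantee of corpus_importer.

```python
def padded_ngrams(tokens, n_val=1, bos='_START_', eos='_END_'):
    """Yield every n-gram of length 1..n_val from one sentence (hypothetical helper)."""
    # Unigrams come from the bare sentence, so the markers are never counted alone
    # (an assumption made for this sketch).
    for token in tokens:
        yield (token,)
    # Longer n-grams are read off the sentence padded with the boundary markers.
    padded = [bos] + list(tokens) + [eos]
    for n in range(2, n_val + 1):
        for start in range(len(padded) - n + 1):
            yield tuple(padded[start:start + n])


sentence = ['And', 'then', 'it', 'slept']
bigrams = [ng for ng in padded_ngrams(sentence, n_val=2) if len(ng) == 2]
```

With n_val=2, the first bigram pairs the bos marker with the first word and the last pairs the final word with the eos marker, which is what allows sentence-initial and sentence-final positions to be counted.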
get_count(ngram, corpus=None)
Get the count of an n-gram in the corpus.
Parameters:
- ngram (list, tuple, or string) – The n-gram to retrieve the count of from the n-gram corpus
- corpus (Corpus) – The corpus
Returns: The n-gram count
Return type: int
>>> tqbf = 'The quick brown fox jumped over the lazy dog.\n'
>>> tqbf += 'And then it slept.\n And the dog ran off.'
>>> ngcorp = NGramCorpus(Corpus(tqbf))
>>> NGramCorpus(Corpus(tqbf)).get_count('the')
2
>>> NGramCorpus(Corpus(tqbf)).get_count('fox')
1
gng_importer(corpus_file)
Fill in self.ngcorpus from a Google NGram corpus file.
Parameters: corpus_file (file) – The Google NGram file from which to initialize the n-gram corpus
tf(term)
Return term frequency.
Parameters: term (str) – The term for which to calculate tf
Returns: The term frequency (tf)
Return type: float
>>> tqbf = 'The quick brown fox jumped over the lazy dog.\n'
>>> tqbf += 'And then it slept.\n And the dog ran off.'
>>> ngcorp = NGramCorpus(Corpus(tqbf))
>>> NGramCorpus(Corpus(tqbf)).tf('the')
1.3010299956639813
>>> NGramCorpus(Corpus(tqbf)).tf('fox')
1.0
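The values in the doctest above are consistent with a log-scaled term frequency of 1 + log10(count): 'the' occurs twice and 'fox' once in the sample text. The sketch below reproduces those numbers under that assumption. The name log_tf is hypothetical, and returning 0.0 for an unattested term is a choice made for this sketch rather than documented abydos behaviour.

```python
from math import log10


def log_tf(count):
    """Log-scaled term frequency, 1 + log10(count) (hypothetical helper)."""
    if count <= 0:
        # Assumption for this sketch: unattested terms get a frequency of 0.
        return 0.0
    return 1 + log10(count)


tf_the = log_tf(2)  # 'the' has a count of 2 in the sample text
tf_fox = log_tf(1)  # 'fox' has a count of 1
```

The logarithm dampens the effect of very frequent terms, so doubling a raw count adds only about 0.3 to the score.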