abydos.tokenizer package


The tokenizer package collects classes whose purpose is to tokenize text or individual words. Each tokenizer also accepts a scaler argument at construction, which adjusts how token counts are scaled. scaler defaults to None, which performs no scaling. Setting scaler to 'set' converts token counters from multisets to sets, so that even if multiple instances of a token are present, they are counted as one. Alternatively, a callable function of one argument (such as log, exp, or lambda x: x + 1) may be passed as scaler, and this function will be applied to each count value.
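
For instance, a minimal sketch of the three scaler modes (outputs omitted here, since the exact Counter contents and ordering are shown in each class's own examples below):

>>> from math import log1p
>>> from abydos.tokenizer import QGrams
>>> raw = QGrams().tokenize('AATTATAT')                # 'AT' counted 3 times
>>> once = QGrams(scaler='set').tokenize('AATTATAT')   # 'AT' counted once
>>> damped = QGrams(scaler=log1p).tokenize('AATTATAT') # count 3 becomes log1p(3)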

The following general tokenizers are provided:

  • QGrams tokenizes a string into q-grams, substrings of length q. The class supports different values of q, the addition of start and stop symbols, and skip values. It even supports multiple values for q and skip, using lists or ranges.

  • QSkipgrams tokenizes a string into skipgrams of length q. A skipgram is a sequence of q characters drawn from a string, which need not be contiguous. For example, the string 'ABCD' has the following 2-skipgrams: 'AB', 'AC', 'AD', 'BC', 'BD', 'CD'.

  • CharacterTokenizer tokenizes a string into individual characters.

  • RegexpTokenizer tokenizes a string according to a supplied regular expression.

  • WhitespaceTokenizer tokenizes a string by dividing it at instances of whitespace.

  • WordpunctTokenizer tokenizes a string by dividing it into strings of letters and strings of punctuation.

Six syllable-oriented tokenizers are provided:

  • COrVClusterTokenizer tokenizes a string by dividing it into strings of consonants (C* clusters), vowels (V* clusters), or non-letter characters.

  • CVClusterTokenizer tokenizes a string by dividing it into strings of consonants then vowels (C*V* clusters) or non-letter characters.

  • VCClusterTokenizer tokenizes a string by dividing it into strings of vowels then consonants (V*C* clusters) or non-letter characters.

  • SAPSTokenizer tokenizes a string according to the rules specified by the SAPS syllabification algorithm [RY05].

  • SonoriPyTokenizer does syllabification according to the sonority sequencing principle, using SyllabiPy. It requires that SyllabiPy be installed.

  • LegaliPyTokenizer does syllabification according to the onset maximization principle (principle of legality), using SyllabiPy. It requires that SyllabiPy be installed, and works best if it has been trained on a corpus of text.

Finally, an NLTK tokenizer is provided:

  • NLTKTokenizer does tokenization using an instantiated NLTK tokenizer. Accordingly, NLTK needs to be installed.


class abydos.tokenizer.QGrams(qval=2, start_stop='$#', skip=0, scaler=None)[source]

Bases: abydos.tokenizer._tokenizer._Tokenizer

A q-gram class, which functions like a bag/multiset.

A q-gram is here defined as all sequences of q characters. Q-grams are also known as k-grams and n-grams, but the term n-gram more typically refers to sequences of whitespace-delimited words in a string, while q-gram refers to sequences of characters in a word or string.

New in version 0.1.0.

Initialize QGrams.

Parameters
  • qval (int or Iterable) -- The q-gram length (defaults to 2), can be an integer, range object, or list

  • start_stop (str) -- A string of length >= 0 indicating start & stop symbols. If the string is '', q-grams will be calculated without start & stop symbols appended to each end. Otherwise, the first character of start_stop will pad the beginning of the string and the last character of start_stop will pad the end of the string before q-grams are calculated. (In the case that start_stop is only 1 character long, the same symbol will be used for both.)

  • skip (int or Iterable) -- The number of characters to skip, can be an integer, range object, or list

  • scaler (None, str, or function) --

    A scaling function for the Counter:

    • None : no scaling

    • 'set' : All non-zero values are set to 1.

    • 'length' : Each token has weight equal to its length.

    • 'length-log' : Each token has weight equal to the log of its length + 1.

    • 'length-exp' : Each token has weight equal to e raised to its length.

    • a callable function : The function is applied to each value in the Counter. Some useful functions include math.exp, math.log1p, math.sqrt, and indexes into interesting integer sequences such as the Fibonacci sequence.

Raises

ValueError -- Use WhitespaceTokenizer instead of qval=0.

Examples

>>> qg = QGrams().tokenize('AATTATAT')
>>> qg
QGrams({'AT': 3, 'TA': 2, '$A': 1, 'AA': 1, 'TT': 1, 'T#': 1})
>>> qg = QGrams(qval=1, start_stop='').tokenize('AATTATAT')
>>> qg
QGrams({'A': 4, 'T': 4})
>>> qg = QGrams(qval=3, start_stop='').tokenize('AATTATAT')
>>> qg
QGrams({'TAT': 2, 'AAT': 1, 'ATT': 1, 'TTA': 1, 'ATA': 1})
>>> QGrams(qval=2, start_stop='$#').tokenize('interning')
QGrams({'in': 2, '$i': 1, 'nt': 1, 'te': 1, 'er': 1, 'rn': 1,
'ni': 1, 'ng': 1, 'g#': 1})
>>> QGrams(start_stop='', skip=1).tokenize('AACTAGAAC')
QGrams({'AC': 2, 'AT': 1, 'CA': 1, 'TG': 1, 'AA': 1, 'GA': 1, 'A': 1})
>>> QGrams(start_stop='', skip=[0, 1]).tokenize('AACTAGAAC')
QGrams({'AC': 4, 'AA': 3, 'GA': 2, 'CT': 1, 'TA': 1, 'AG': 1,
'AT': 1, 'CA': 1, 'TG': 1, 'A': 1})
>>> QGrams(qval=range(3), skip=[0, 1]).tokenize('interdisciplinarian')
QGrams({'i': 10, 'n': 7, 'r': 4, 'a': 4, 'in': 3, 't': 2, 'e': 2,
'd': 2, 's': 2, 'c': 2, 'p': 2, 'l': 2, 'ri': 2, 'ia': 2, '$i': 1,
'nt': 1, 'te': 1, 'er': 1, 'rd': 1, 'di': 1, 'is': 1, 'sc': 1, 'ci': 1,
'ip': 1, 'pl': 1, 'li': 1, 'na': 1, 'ar': 1, 'an': 1, 'n#': 1, '$n': 1,
'it': 1, 'ne': 1, 'tr': 1, 'ed': 1, 'ds': 1, 'ic': 1, 'si': 1, 'cp': 1,
'il': 1, 'pi': 1, 'ln': 1, 'nr': 1, 'ai': 1, 'ra': 1, 'a#': 1})
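
When start_stop is a single character, the same symbol pads both ends, per the parameter description above. A minimal sketch (output omitted; exact Counter ordering may vary):

>>> padded = QGrams(qval=2, start_stop='_').tokenize('AB')
>>> # expected tokens: '_A', 'AB', 'B_'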

New in version 0.1.0.

Changed in version 0.4.0: Broke tokenization functions out into tokenize method

tokenize(string)[source]

Tokenize the term and store it.

The tokenized term is stored as an ordered list and as a Counter object.

Parameters

string (str) -- The string to tokenize

New in version 0.4.0.

class abydos.tokenizer.QSkipgrams(qval=2, start_stop='$#', scaler=None, ssk_lambda=0.9)[source]

Bases: abydos.tokenizer._tokenizer._Tokenizer

A q-skipgram class, which functions like a bag/multiset.

A q-gram is here defined as all sequences of q characters. Q-grams are also known as k-grams and n-grams, but the term n-gram more typically refers to sequences of whitespace-delimited words in a string, while q-gram refers to sequences of characters in a word or string.

New in version 0.4.0.

Initialize QSkipgrams.

Parameters
  • qval (int or Iterable) -- The q-gram length (defaults to 2), can be an integer, range object, or list

  • start_stop (str) -- A string of length >= 0 indicating start & stop symbols. If the string is '', q-grams will be calculated without start & stop symbols appended to each end. Otherwise, the first character of start_stop will pad the beginning of the string and the last character of start_stop will pad the end of the string before q-grams are calculated. (In the case that start_stop is only 1 character long, the same symbol will be used for both.)

  • scaler (None, str, or function) --

    A scaling function for the Counter:

    • None : no scaling

    • 'set' : All non-zero values are set to 1.

    • 'length' : Each token has weight equal to its length.

    • 'length-log' : Each token has weight equal to the log of its length + 1.

    • 'length-exp' : Each token has weight equal to e raised to its length.

    • a callable function : The function is applied to each value in the Counter. Some useful functions include math.exp, math.log1p, math.sqrt, and indexes into interesting integer sequences such as the Fibonacci sequence.

    • 'SSK' : Applies weighting according to the substring kernel rules of [LSShaweTaylor+02].

  • ssk_lambda (float or Iterable) -- A value in the range (0.0, 1.0) used for discounting gaps between characters according to the method described in [LSShaweTaylor+02]. To supply multiple values of lambda, provide an Iterable of numeric values, such as (0.5, 0.05) or np.arange(0.05, 0.5, 0.05)

Raises

ValueError -- Use WhitespaceTokenizer instead of qval=0.

Examples

>>> QSkipgrams().tokenize('AATTAT')
QSkipgrams({'AT': 7, '$A': 3, '$T': 3, 'AA': 3, 'A#': 3, 'TT': 3,
'T#': 3, 'TA': 2, '$#': 1})
>>> QSkipgrams(qval=1, start_stop='').tokenize('AATTAT')
QSkipgrams({'A': 3, 'T': 3})
>>> QSkipgrams(qval=3, start_stop='').tokenize('AATTAT')
QSkipgrams({'ATT': 6, 'AAT': 5, 'ATA': 4, 'TAT': 2, 'AAA': 1,
'TTA': 1, 'TTT': 1})
>>> QSkipgrams(start_stop='').tokenize('ABCD')
QSkipgrams({'AB': 1, 'AC': 1, 'AD': 1, 'BC': 1, 'BD': 1, 'CD': 1})
>>> QSkipgrams().tokenize('Colin')
QSkipgrams({'$C': 1, '$o': 1, '$l': 1, '$i': 1, '$n': 1, '$#': 1,
'Co': 1, 'Cl': 1, 'Ci': 1, 'Cn': 1, 'C#': 1, 'ol': 1, 'oi': 1, 'on': 1,
'o#': 1, 'li': 1, 'ln': 1, 'l#': 1, 'in': 1, 'i#': 1, 'n#': 1})
>>> QSkipgrams(qval=3).tokenize('AACTAGAAC')
QSkipgrams({'$AA': 20, '$A#': 20, 'AA#': 20, '$AC': 14, 'AC#': 14,
'AAC': 11, 'AAA': 10, '$C#': 8, '$AG': 6, '$CA': 6, '$TA': 6, 'ACA': 6,
'ATA': 6, 'AGA': 6, 'AG#': 6, 'CA#': 6, 'TA#': 6, '$$A': 5, 'A##': 5,
'$AT': 4, '$T#': 4, '$GA': 4, '$G#': 4, 'AT#': 4, 'GA#': 4, 'AAG': 3,
'AGC': 3, 'CTA': 3, 'CAA': 3, 'CAC': 3, 'TAA': 3, 'TAC': 3, '$$C': 2,
'$$#': 2, '$CT': 2, '$CG': 2, '$CC': 2, '$TG': 2, '$TC': 2, '$GC': 2,
'$##': 2, 'ACT': 2, 'ACG': 2, 'ACC': 2, 'ATG': 2, 'ATC': 2, 'CT#': 2,
'CGA': 2, 'CG#': 2, 'CC#': 2, 'C##': 2, 'TGA': 2, 'TG#': 2, 'TC#': 2,
'GAC': 2, 'GC#': 2, '$$T': 1, '$$G': 1, 'AAT': 1, 'CTG': 1, 'CTC': 1,
'CAG': 1, 'CGC': 1, 'TAG': 1, 'TGC': 1, 'T##': 1, 'GAA': 1, 'G##': 1})

QSkipgrams may also be used to produce weights in accordance with the substring kernel rules of [LSShaweTaylor+02] by passing the scaler value 'SSK':

>>> QSkipgrams(scaler='SSK').tokenize('AACTAGAAC')
QSkipgrams({'AA': 6.170192010000001, 'AC': 4.486377699,
'$A': 2.8883286990000006, 'A#': 2.6526399291000002, 'TA': 2.05659,
'AG': 1.931931, 'CA': 1.850931, 'GA': 1.5390000000000001, 'AT': 1.3851,
'C#': 1.2404672100000003, '$C': 1.0047784401000002, 'CT': 0.81,
'TG': 0.7290000000000001, 'CG': 0.6561, 'GC': 0.6561,
'$T': 0.5904900000000001, 'G#': 0.5904900000000001, 'TC': 0.531441,
'$G': 0.4782969000000001, 'CC': 0.4782969000000001,
'T#': 0.4782969000000001, '$#': 0.31381059609000006})
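
Multiple discount values may likewise be supplied via ssk_lambda, as described above. A sketch (output omitted, since the weights depend on the lambdas chosen):

>>> qsg = QSkipgrams(scaler='SSK', ssk_lambda=(0.5, 0.05)).tokenize('AACTAGAAC')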

New in version 0.4.0.

tokenize(string)[source]

Tokenize the term and store it.

The tokenized term is stored as an ordered list and as a Counter object.

Parameters

string (str) -- The string to tokenize

New in version 0.4.0.

class abydos.tokenizer.CharacterTokenizer(scaler=None)[source]

Bases: abydos.tokenizer._tokenizer._Tokenizer

A character tokenizer.

New in version 0.4.0.

Initialize tokenizer.

Parameters

scaler (None, str, or function) --

A scaling function for the Counter:

  • None : no scaling

  • 'set' : All non-zero values are set to 1.

  • a callable function : The function is applied to each value in the Counter. Some useful functions include math.exp, math.log1p, math.sqrt, and indexes into interesting integer sequences such as the Fibonacci sequence.

New in version 0.4.0.

tokenize(string)[source]

Tokenize the term and store it.

The tokenized term is stored as an ordered list and as a Counter object.

Parameters

string (str) -- The string to tokenize

Examples

>>> CharacterTokenizer().tokenize('AACTAGAAC')
CharacterTokenizer({'A': 5, 'C': 2, 'T': 1, 'G': 1})

New in version 0.4.0.

class abydos.tokenizer.RegexpTokenizer(scaler=None, regexp='\w+', flags=0)[source]

Bases: abydos.tokenizer._tokenizer._Tokenizer

A regexp tokenizer.

New in version 0.4.0.

Initialize tokenizer.

Parameters

scaler (None, str, or function) --

A scaling function for the Counter:

  • None : no scaling

  • 'set' : All non-zero values are set to 1.

  • 'length' : Each token has weight equal to its length.

  • 'length-log' : Each token has weight equal to the log of its length + 1.

  • 'length-exp' : Each token has weight equal to e raised to its length.

  • a callable function : The function is applied to each value in the Counter. Some useful functions include math.exp, math.log1p, math.sqrt, and indexes into interesting integer sequences such as the Fibonacci sequence.

New in version 0.4.0.

tokenize(string)[source]

Tokenize the term and store it.

The tokenized term is stored as an ordered list and as a Counter object.

Parameters

string (str) -- The string to tokenize

Examples

>>> RegexpTokenizer(regexp=r'[^-]+').tokenize('AA-CT-AG-AA-CD')
RegexpTokenizer({'AA': 2, 'CT': 1, 'AG': 1, 'CD': 1})
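
The flags argument is presumably passed through to Python's re module when the pattern is compiled (an assumption; verify against the source). A sketch using a standard re flag:

>>> import re
>>> ci = RegexpTokenizer(regexp='[a-z]+', flags=re.IGNORECASE).tokenize('AB cd EF')
>>> # expected tokens: 'AB', 'cd', 'EF'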

New in version 0.4.0.

class abydos.tokenizer.WhitespaceTokenizer(scaler=None, flags=0)[source]

Bases: abydos.tokenizer._regexp.RegexpTokenizer

A whitespace tokenizer.

Examples

>>> WhitespaceTokenizer().tokenize('a b c f a c g e a b')
WhitespaceTokenizer({'a': 3, 'b': 2, 'c': 2, 'f': 1, 'g': 1, 'e': 1})

New in version 0.4.0.

Initialize tokenizer.

Parameters

scaler (None, str, or function) --

A scaling function for the Counter:

  • None : no scaling

  • 'set' : All non-zero values are set to 1.

  • 'length' : Each token has weight equal to its length.

  • 'length-log' : Each token has weight equal to the log of its length + 1.

  • 'length-exp' : Each token has weight equal to e raised to its length.

  • a callable function : The function is applied to each value in the Counter. Some useful functions include math.exp, math.log1p, math.sqrt, and indexes into interesting integer sequences such as the Fibonacci sequence.

New in version 0.4.0.

class abydos.tokenizer.WordpunctTokenizer(scaler=None, flags=0)[source]

Bases: abydos.tokenizer._regexp.RegexpTokenizer

A wordpunct tokenizer.

Examples

>>> WordpunctTokenizer().tokenize("Can't stop the feelin'!")
WordpunctTokenizer({'Can': 1, "'": 1, 't': 1, 'stop': 1, 'the': 1,
'feelin': 1, "'!": 1})

New in version 0.4.0.

Initialize tokenizer.

Parameters

scaler (None, str, or function) --

A scaling function for the Counter:

  • None : no scaling

  • 'set' : All non-zero values are set to 1.

  • 'length' : Each token has weight equal to its length.

  • 'length-log' : Each token has weight equal to the log of its length + 1.

  • 'length-exp' : Each token has weight equal to e raised to its length.

  • a callable function : The function is applied to each value in the Counter. Some useful functions include math.exp, math.log1p, math.sqrt, and indexes into interesting integer sequences such as the Fibonacci sequence.

New in version 0.4.0.

class abydos.tokenizer.COrVClusterTokenizer(scaler=None, consonants=None, vowels=None)[source]

Bases: abydos.tokenizer._tokenizer._Tokenizer

A C- or V-cluster tokenizer.

New in version 0.4.0.

Initialize tokenizer.

Parameters

scaler (None, str, or function) --

A scaling function for the Counter:

  • None : no scaling

  • 'set' : All non-zero values are set to 1.

  • 'length' : Each token has weight equal to its length.

  • 'length-log' : Each token has weight equal to the log of its length + 1.

  • 'length-exp' : Each token has weight equal to e raised to its length.

  • a callable function : The function is applied to each value in the Counter. Some useful functions include math.exp, math.log1p, math.sqrt, and indexes into interesting integer sequences such as the Fibonacci sequence.

New in version 0.4.0.

tokenize(string)[source]

Tokenize the term and store it.

The tokenized term is stored as an ordered list and as a Counter object.

Parameters

string (str) -- The string to tokenize

Examples

>>> COrVClusterTokenizer().tokenize('seven-twelfths')
COrVClusterTokenizer({'e': 3, 's': 1, 'v': 1, 'n': 1, '-': 1,
'tw': 1, 'lfths': 1})
>>> COrVClusterTokenizer().tokenize('character')
COrVClusterTokenizer({'a': 2, 'r': 2, 'ch': 1, 'ct': 1, 'e': 1})
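
The consonants and vowels parameters are assumed here to accept a collection of characters redefining the respective character classes, e.g. to treat 'y' as a vowel. A hypothetical sketch (output omitted):

>>> y_vowel = COrVClusterTokenizer(vowels=set('aeiouyAEIOUY'))
>>> tokens = y_vowel.tokenize('rhythm')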

New in version 0.4.0.

class abydos.tokenizer.CVClusterTokenizer(scaler=None, consonants=None, vowels=None)[source]

Bases: abydos.tokenizer._tokenizer._Tokenizer

A C*V*-cluster tokenizer.

New in version 0.4.0.

Initialize tokenizer.

Parameters

scaler (None, str, or function) --

A scaling function for the Counter:

  • None : no scaling

  • 'set' : All non-zero values are set to 1.

  • 'length' : Each token has weight equal to its length.

  • 'length-log' : Each token has weight equal to the log of its length + 1.

  • 'length-exp' : Each token has weight equal to e raised to its length.

  • a callable function : The function is applied to each value in the Counter. Some useful functions include math.exp, math.log1p, math.sqrt, and indexes into interesting integer sequences such as the Fibonacci sequence.

New in version 0.4.0.

tokenize(string)[source]

Tokenize the term and store it.

The tokenized term is stored as an ordered list and as a Counter object.

Parameters

string (str) -- The string to tokenize

Examples

>>> CVClusterTokenizer().tokenize('seven-twelfths')
CVClusterTokenizer({'se': 1, 've': 1, 'n': 1, '-': 1, 'twe': 1,
'lfths': 1})
>>> CVClusterTokenizer().tokenize('character')
CVClusterTokenizer({'cha': 1, 'ra': 1, 'cte': 1, 'r': 1})

New in version 0.4.0.

class abydos.tokenizer.VCClusterTokenizer(scaler=None, consonants=None, vowels=None)[source]

Bases: abydos.tokenizer._tokenizer._Tokenizer

A V*C*-cluster tokenizer.

New in version 0.4.0.

Initialize tokenizer.

Parameters

scaler (None, str, or function) --

A scaling function for the Counter:

  • None : no scaling

  • 'set' : All non-zero values are set to 1.

  • 'length' : Each token has weight equal to its length.

  • 'length-log' : Each token has weight equal to the log of its length + 1.

  • 'length-exp' : Each token has weight equal to e raised to its length.

  • a callable function : The function is applied to each value in the Counter. Some useful functions include math.exp, math.log1p, math.sqrt, and indexes into interesting integer sequences such as the Fibonacci sequence.

New in version 0.4.0.

tokenize(string)[source]

Tokenize the term and store it.

The tokenized term is stored as an ordered list and as a Counter object.

Parameters

string (str) -- The string to tokenize

Examples

>>> VCClusterTokenizer().tokenize('seven-twelfths')
VCClusterTokenizer({'s': 1, 'ev': 1, 'en': 1, '-': 1, 'tw': 1,
'elfths': 1})
>>> VCClusterTokenizer().tokenize('character')
VCClusterTokenizer({'ch': 1, 'ar': 1, 'act': 1, 'er': 1})

New in version 0.4.0.

class abydos.tokenizer.SAPSTokenizer(scaler=None)[source]

Bases: abydos.tokenizer._tokenizer._Tokenizer

Syllable Alignment Pattern Searching tokenizer.

This is the syllabifier described on p. 917 of [RY05].

New in version 0.4.0.

Initialize Tokenizer.

Parameters

scaler (None, str, or function) --

A scaling function for the Counter:

  • None : no scaling

  • 'set' : All non-zero values are set to 1.

  • 'length' : Each token has weight equal to its length.

  • 'length-log' : Each token has weight equal to the log of its length + 1.

  • 'length-exp' : Each token has weight equal to e raised to its length.

  • a callable function : The function is applied to each value in the Counter. Some useful functions include math.exp, math.log1p, math.sqrt, and indexes into interesting integer sequences such as the Fibonacci sequence.

New in version 0.4.0.

tokenize(string)[source]

Tokenize the term and store it.

The tokenized term is stored as an ordered list and as a Counter object.

Parameters

string (str) -- The string to tokenize

Examples

>>> SAPSTokenizer().tokenize('seven-twelfths')
SAPSTokenizer({'t': 2, 'se': 1, 'ven': 1, '-': 1, 'wel': 1, 'f': 1,
'h': 1, 's': 1})
>>> SAPSTokenizer().tokenize('character')
SAPSTokenizer({'c': 1, 'ha': 1, 'rac': 1, 'ter': 1})

New in version 0.4.0.

class abydos.tokenizer.SonoriPyTokenizer(scaler=None)[source]

Bases: abydos.tokenizer._tokenizer._Tokenizer

SonoriPy tokenizer.

New in version 0.4.0.

Initialize Tokenizer.

Parameters

scaler (None, str, or function) --

A scaling function for the Counter:

  • None : no scaling

  • 'set' : All non-zero values are set to 1.

  • 'length' : Each token has weight equal to its length.

  • 'length-log' : Each token has weight equal to the log of its length + 1.

  • 'length-exp' : Each token has weight equal to e raised to its length.

  • a callable function : The function is applied to each value in the Counter. Some useful functions include math.exp, math.log1p, math.sqrt, and indexes into interesting integer sequences such as the Fibonacci sequence.

New in version 0.4.0.

tokenize(string)[source]

Tokenize the term and store it.

The tokenized term is stored as an ordered list and as a Counter object.

Parameters

string (str) -- The string to tokenize

Examples

>>> SonoriPyTokenizer().tokenize('seven-twelfths')
SonoriPyTokenizer({'se': 1, 'ven-': 1, 'twelfths': 1})
>>> SonoriPyTokenizer().tokenize('character')
SonoriPyTokenizer({'cha': 1, 'rac': 1, 'ter': 1})

New in version 0.4.0.

class abydos.tokenizer.LegaliPyTokenizer(scaler=None)[source]

Bases: abydos.tokenizer._tokenizer._Tokenizer

LegaliPy tokenizer.

New in version 0.4.0.

Initialize Tokenizer.

Parameters

scaler (None, str, or function) --

A scaling function for the Counter:

  • None : no scaling

  • 'set' : All non-zero values are set to 1.

  • 'length' : Each token has weight equal to its length.

  • 'length-log' : Each token has weight equal to the log of its length + 1.

  • 'length-exp' : Each token has weight equal to e raised to its length.

  • a callable function : The function is applied to each value in the Counter. Some useful functions include math.exp, math.log1p, math.sqrt, and indexes into interesting integer sequences such as the Fibonacci sequence.

New in version 0.4.0.

tokenize(string, ipa=False)[source]

Tokenize the term and store it.

The tokenized term is stored as an ordered list and as a Counter object.

Parameters
  • string (str) -- The string to tokenize

  • ipa (bool) -- If True, indicates that the string is in IPA

Examples

>>> LegaliPyTokenizer().tokenize('seven-twelfths')
LegaliPyTokenizer({'s': 1, 'ev': 1, 'en-tw': 1, 'elfths': 1})
>>> LegaliPyTokenizer().tokenize('character')
LegaliPyTokenizer({'ch': 1, 'ar': 1, 'act': 1, 'er': 1})

New in version 0.4.0.

train_onsets(text, threshold=0.0002, clean=True, append=False)[source]

Train the onsets on a text.

Parameters
  • text (str) -- The text on which to train

  • threshold (float) -- Threshold proportion above which an onset is included in the onset list

  • clean (bool) -- If True, the text is stripped of numerals and punctuation

  • append (bool) -- If True, the current onset list is extended
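
Examples

A sketch of the intended workflow (the training string here is a hypothetical stand-in; real use calls for a substantial corpus):

>>> tok = LegaliPyTokenizer()
>>> tok.train_onsets('consider training on sentences drawn from a larger corpus')
>>> trained = tok.tokenize('character')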

New in version 0.4.0.

class abydos.tokenizer.NLTKTokenizer(nltk_tokenizer=None, scaler=None)[source]

Bases: abydos.tokenizer._tokenizer._Tokenizer

NLTK tokenizer wrapper class.

New in version 0.4.0.

Initialize Tokenizer.

Parameters
  • scaler (None, str, or function) --

    A scaling function for the Counter:

    • None : no scaling

    • 'set' : All non-zero values are set to 1.

    • 'length' : Each token has weight equal to its length.

    • 'length-log' : Each token has weight equal to the log of its length + 1.

    • 'length-exp' : Each token has weight equal to e raised to its length.

    • a callable function : The function is applied to each value in the Counter. Some useful functions include math.exp, math.log1p, math.sqrt, and indexes into interesting integer sequences such as the Fibonacci sequence.

  • nltk_tokenizer (Object) -- An instantiated tokenizer from NLTK.

New in version 0.4.0.

tokenize(string)[source]

Tokenize the term and store it.

The tokenized term is stored as an ordered list and as a Counter object.

Parameters

string (str) -- The string to tokenize

Examples

>>> from nltk.tokenize.casual import TweetTokenizer
>>> nltk_tok = TweetTokenizer()
>>> NLTKTokenizer(nltk_tokenizer=nltk_tok).tokenize(
... '.@Twitter Today is #lit!')
NLTKTokenizer({'.': 1, '@Twitter': 1, 'Today': 1, 'is': 1, '#lit': 1,
'!': 1})

New in version 0.4.0.