abydos.tokenizer package


The tokenizer package collects classes whose purpose is to tokenize text or individual words. Each tokenizer also accepts a scaler argument at construction, which adjusts how token counts are scaled. scaler defaults to None, which performs no scaling. Setting scaler to 'set' converts token counters from multisets to sets, so that even if multiple instances of a token are present, they are counted as one. Alternatively, a callable function of one argument (such as log, exp, or lambda x: x + 1) may be passed as scaler, and this function will be applied to each count value.
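
For instance, a minimal sketch of the three scaler modes (outputs omitted here, since the exact Counter contents and ordering are shown in each class's own examples below):

>>> from math import log1p
>>> from abydos.tokenizer import QGrams
>>> raw = QGrams().tokenize('AATTATAT')                # 'AT' counted 3 times
>>> once = QGrams(scaler='set').tokenize('AATTATAT')   # 'AT' counted once
>>> damped = QGrams(scaler=log1p).tokenize('AATTATAT') # count 3 becomes log1p(3)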

The following general tokenizers are provided:

  • QGrams tokenizes a string into q-grams, substrings of length q. The class supports different values of q, the addition of start and stop symbols, and skip values. It even supports multiple values for q and skip, using lists or ranges.

  • QSkipgrams tokenizes a string into skipgrams of length q. A skipgram is a sequence of q characters drawn from a string, which need not be contiguous. For example, the string 'ABCD' has the following 2-skipgrams: 'AB', 'AC', 'AD', 'BC', 'BD', 'CD'.

  • CharacterTokenizer tokenizes a string into individual characters.

  • RegexpTokenizer tokenizes a string according to a supplied regular expression.

  • WhitespaceTokenizer tokenizes a string by dividing it at instances of whitespace.

  • WordpunctTokenizer tokenizes a string by dividing it into strings of letters and strings of punctuation.

Six syllable-oriented tokenizers are provided:

  • COrVClusterTokenizer tokenizes a string by dividing it into strings of consonants (C* clusters), vowels (V* clusters), or non-letter characters.

  • CVClusterTokenizer tokenizes a string by dividing it into strings of consonants then vowels (C*V* clusters) or non-letter characters.

  • VCClusterTokenizer tokenizes a string by dividing it into strings of vowels then consonants (V*C* clusters) or non-letter characters.

  • SAPSTokenizer tokenizes a string according to the rules specified by the SAPS syllabification algorithm [RY05].

  • SonoriPyTokenizer does syllabification according to the sonority sequencing principle, using SyllabiPy. It requires that SyllabiPy be installed.

  • LegaliPyTokenizer does syllabification according to the onset maximization principle (principle of legality), using SyllabiPy. It requires that SyllabiPy be installed, and works best if it has been trained on a corpus of text.

Finally, an NLTK tokenizer is provided:

  • NLTKTokenizer does tokenization using an instantiated NLTK tokenizer. Accordingly, NLTK needs to be installed.


class abydos.tokenizer.QGrams(qval=2, start_stop='$#', skip=0, scaler=None)[source]

Bases: abydos.tokenizer._tokenizer._Tokenizer

A q-gram class, which functions like a bag/multiset.

A q-gram is here defined as all sequences of q characters. Q-grams are also known as k-grams and n-grams, but the term n-gram more typically refers to sequences of whitespace-delimited words in a string, while q-gram refers to sequences of characters in a word or string.

New in version 0.1.0.

Initialize QGrams.

Parameters
  • qval (int or Iterable) -- The q-gram length (defaults to 2), can be an integer, range object, or list

  • start_stop (str) -- A string of length >= 0 indicating start & stop symbols. If the string is '', q-grams will be calculated without start & stop symbols appended to each end. Otherwise, the first character of start_stop will pad the beginning of the string and the last character of start_stop will pad the end of the string before q-grams are calculated. (In the case that start_stop is only 1 character long, the same symbol will be used for both.)

  • skip (int or Iterable) -- The number of characters to skip, can be an integer, range object, or list

  • scaler (None, str, or function) --

    A scaling function for the Counter:

    • None : no scaling

    • 'set' : All non-zero values are set to 1.

    • 'length' : Each token has weight equal to its length.

    • 'length-log' : Each token has weight equal to the log of its length + 1.

    • 'length-exp' : Each token has weight equal to e raised to its length.

    • a callable function : The function is applied to each value in the Counter. Some useful functions include math.exp, math.log1p, math.sqrt, and indexes into interesting integer sequences such as the Fibonacci sequence.

Raises

ValueError -- Use WhitespaceTokenizer instead of qval=0.

Examples

>>> qg = QGrams().tokenize('AATTATAT')
>>> qg
QGrams({'AT': 3, 'TA': 2, '$A': 1, 'AA': 1, 'TT': 1, 'T#': 1})
>>> qg = QGrams(qval=1, start_stop='').tokenize('AATTATAT')
>>> qg
QGrams({'A': 4, 'T': 4})
>>> qg = QGrams(qval=3, start_stop='').tokenize('AATTATAT')
>>> qg
QGrams({'TAT': 2, 'AAT': 1, 'ATT': 1, 'TTA': 1, 'ATA': 1})
>>> QGrams(qval=2, start_stop='$#').tokenize('interning')
QGrams({'in': 2, '$i': 1, 'nt': 1, 'te': 1, 'er': 1, 'rn': 1,
'ni': 1, 'ng': 1, 'g#': 1})
>>> QGrams(start_stop='', skip=1).tokenize('AACTAGAAC')
QGrams({'AC': 2, 'AT': 1, 'CA': 1, 'TG': 1, 'AA': 1, 'GA': 1, 'A': 1})
>>> QGrams(start_stop='', skip=[0, 1]).tokenize('AACTAGAAC')
QGrams({'AC': 4, 'AA': 3, 'GA': 2, 'CT': 1, 'TA': 1, 'AG': 1,
'AT': 1, 'CA': 1, 'TG': 1, 'A': 1})
>>> QGrams(qval=range(3), skip=[0, 1]).tokenize('interdisciplinarian')
QGrams({'i': 10, 'n': 7, 'r': 4, 'a': 4, 'in': 3, 't': 2, 'e': 2,
'd': 2, 's': 2, 'c': 2, 'p': 2, 'l': 2, 'ri': 2, 'ia': 2, '$i': 1,
'nt': 1, 'te': 1, 'er': 1, 'rd': 1, 'di': 1, 'is': 1, 'sc': 1, 'ci': 1,
'ip': 1, 'pl': 1, 'li': 1, 'na': 1, 'ar': 1, 'an': 1, 'n#': 1, '$n': 1,
'it': 1, 'ne': 1, 'tr': 1, 'ed': 1, 'ds': 1, 'ic': 1, 'si': 1, 'cp': 1,
'il': 1, 'pi': 1, 'ln': 1, 'nr': 1, 'ai': 1, 'ra': 1, 'a#': 1})
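
When start_stop is a single character, the same symbol pads both ends, per the parameter description above. A minimal sketch (output omitted; exact Counter ordering may vary):

>>> padded = QGrams(qval=2, start_stop='_').tokenize('AB')
>>> # expected tokens: '_A', 'AB', 'B_'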

New in version 0.1.0.

Changed in version 0.4.0: Broke tokenization functions out into tokenize method

tokenize(string)[source]

Tokenize the term and store it.

The tokenized term is stored as an ordered list and as a Counter object.

Parameters

string (str) -- The string to tokenize

New in version 0.4.0.

class abydos.tokenizer.QSkipgrams(qval=2, start_stop='$#', scaler=None, ssk_lambda=0.9)[source]

Bases: abydos.tokenizer._tokenizer._Tokenizer

A q-skipgram class, which functions like a bag/multiset.

A q-gram is here defined as all sequences of q characters. Q-grams are also known as k-grams and n-grams, but the term n-gram more typically refers to sequences of whitespace-delimited words in a string, while q-gram refers to sequences of characters in a word or string.

New in version 0.4.0.

Initialize QSkipgrams.

Parameters
  • qval (int or Iterable) -- The q-gram length (defaults to 2), can be an integer, range object, or list

  • start_stop (str) -- A string of length >= 0 indicating start & stop symbols. If the string is '', q-grams will be calculated without start & stop symbols appended to each end. Otherwise, the first character of start_stop will pad the beginning of the string and the last character of start_stop will pad the end of the string before q-grams are calculated. (In the case that start_stop is only 1 character long, the same symbol will be used for both.)

  • scaler (None, str, or function) --

    A scaling function for the Counter:

    • None : no scaling

    • 'set' : All non-zero values are set to 1.

    • 'length' : Each token has weight equal to its length.

    • 'length-log' : Each token has weight equal to the log of its length + 1.

    • 'length-exp' : Each token has weight equal to e raised to its length.

    • a callable function : The function is applied to each value in the Counter. Some useful functions include math.exp, math.log1p, math.sqrt, and indexes into interesting integer sequences such as the Fibonacci sequence.

    • 'SSK' : Applies weighting according to the substring kernel rules of [LSShaweTaylor+02].

  • ssk_lambda (float or Iterable) -- A value in the range (0.0, 1.0) used for discounting gaps between characters according to the method described in [LSShaweTaylor+02]. To supply multiple values of lambda, provide an Iterable of numeric values, such as (0.5, 0.05) or np.arange(0.05, 0.5, 0.05)

Raises

ValueError -- Use WhitespaceTokenizer instead of qval=0.

Examples

>>> QSkipgrams().tokenize('AATTAT')
QSkipgrams({'AT': 7, '$A': 3, '$T': 3, 'AA': 3, 'A#': 3, 'TT': 3,
'T#': 3, 'TA': 2, '$#': 1})
>>> QSkipgrams(qval=1, start_stop='').tokenize('AATTAT')
QSkipgrams({'A': 3, 'T': 3})
>>> QSkipgrams(qval=3, start_stop='').tokenize('AATTAT')
QSkipgrams({'ATT': 6, 'AAT': 5, 'ATA': 4, 'TAT': 2, 'AAA': 1,
'TTA': 1, 'TTT': 1})
>>> QSkipgrams(start_stop='').tokenize('ABCD')
QSkipgrams({'AB': 1, 'AC': 1, 'AD': 1, 'BC': 1, 'BD': 1, 'CD': 1})
>>> QSkipgrams().tokenize('Colin')
QSkipgrams({'$C': 1, '$o': 1, '$l': 1, '$i': 1, '$n': 1, '$#': 1,
'Co': 1, 'Cl': 1, 'Ci': 1, 'Cn': 1, 'C#': 1, 'ol': 1, 'oi': 1, 'on': 1,
'o#': 1, 'li': 1, 'ln': 1, 'l#': 1, 'in': 1, 'i#': 1, 'n#': 1})
>>> QSkipgrams(qval=3).tokenize('AACTAGAAC')
QSkipgrams({'$AA': 20, '$A#': 20, 'AA#': 20, '$AC': 14, 'AC#': 14,
'AAC': 11, 'AAA': 10, '$C#': 8, '$AG': 6, '$CA': 6, '$TA': 6, 'ACA': 6,
'ATA': 6, 'AGA': 6, 'AG#': 6, 'CA#': 6, 'TA#': 6, '$$A': 5, 'A##': 5,
'$AT': 4, '$T#': 4, '$GA': 4, '$G#': 4, 'AT#': 4, 'GA#': 4, 'AAG': 3,
'AGC': 3, 'CTA': 3, 'CAA': 3, 'CAC': 3, 'TAA': 3, 'TAC': 3, '$$C': 2,
'$$#': 2, '$CT': 2, '$CG': 2, '$CC': 2, '$TG': 2, '$TC': 2, '$GC': 2,
'$##': 2, 'ACT': 2, 'ACG': 2, 'ACC': 2, 'ATG': 2, 'ATC': 2, 'CT#': 2,
'CGA': 2, 'CG#': 2, 'CC#': 2, 'C##': 2, 'TGA': 2, 'TG#': 2, 'TC#': 2,
'GAC': 2, 'GC#': 2, '$$T': 1, '$$G': 1, 'AAT': 1, 'CTG': 1, 'CTC': 1,
'CAG': 1, 'CGC': 1, 'TAG': 1, 'TGC': 1, 'T##': 1, 'GAA': 1, 'G##': 1})

QSkipgrams may also be used to produce weights in accordance with the substring kernel rules of [LSShaweTaylor+02] by passing the scaler value 'SSK':

>>> QSkipgrams(scaler='SSK').tokenize('AACTAGAAC')
QSkipgrams({'AA': 6.170192010000001, 'AC': 4.486377699,
'$A': 2.8883286990000006, 'A#': 2.6526399291000002, 'TA': 2.05659,
'AG': 1.931931, 'CA': 1.850931, 'GA': 1.5390000000000001, 'AT': 1.3851,
'C#': 1.2404672100000003, '$C': 1.0047784401000002, 'CT': 0.81,
'TG': 0.7290000000000001, 'CG': 0.6561, 'GC': 0.6561,
'$T': 0.5904900000000001, 'G#': 0.5904900000000001, 'TC': 0.531441,
'$G': 0.4782969000000001, 'CC': 0.4782969000000001,
'T#': 0.4782969000000001, '$#': 0.31381059609000006})
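
Multiple discount values may likewise be supplied via ssk_lambda, as described above. A sketch (output omitted, since the weights depend on the lambdas chosen):

>>> qsg = QSkipgrams(scaler='SSK', ssk_lambda=(0.5, 0.05)).tokenize('AACTAGAAC')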

New in version 0.4.0.

tokenize(string)[source]

Tokenize the term and store it.

The tokenized term is stored as an ordered list and as a Counter object.

Parameters

string (str) -- The string to tokenize

New in version 0.4.0.

class abydos.tokenizer.CharacterTokenizer(scaler=None)[source]

Bases: abydos.tokenizer._tokenizer._Tokenizer

A character tokenizer.

New in version 0.4.0.

Initialize tokenizer.

Parameters

scaler (None, str, or function) --

A scaling function for the Counter:

  • None : no scaling

  • 'set' : All non-zero values are set to 1.

  • a callable function : The function is applied to each value in the Counter. Some useful functions include math.exp, math.log1p, math.sqrt, and indexes into interesting integer sequences such as the Fibonacci sequence.

New in version 0.4.0.

tokenize(string)[source]

Tokenize the term and store it.

The tokenized term is stored as an ordered list and as a Counter object.

Parameters

string (str) -- The string to tokenize

Examples

>>> CharacterTokenizer().tokenize('AACTAGAAC')
CharacterTokenizer({'A': 5, 'C': 2, 'T': 1, 'G': 1})

New in version 0.4.0.

class abydos.tokenizer.RegexpTokenizer(scaler=None, regexp='\w+', flags=0)[source]

Bases: abydos.tokenizer._tokenizer._Tokenizer

A regexp tokenizer.

New in version 0.4.0.

Initialize tokenizer.

Parameters

scaler (None, str, or function) --

A scaling function for the Counter:

  • None : no scaling

  • 'set' : All non-zero values are set to 1.

  • 'length' : Each token has weight equal to its length.

  • 'length-log' : Each token has weight equal to the log of its length + 1.

  • 'length-exp' : Each token has weight equal to e raised to its length.

  • a callable function : The function is applied to each value in the Counter. Some useful functions include math.exp, math.log1p, math.sqrt, and indexes into interesting integer sequences such as the Fibonacci sequence.

New in version 0.4.0.

tokenize(string)[source]

Tokenize the term and store it.

The tokenized term is stored as an ordered list and as a Counter object.

Parameters

string (str) -- The string to tokenize

Examples

>>> RegexpTokenizer(regexp=r'[^-]+').tokenize('AA-CT-AG-AA-CD')
RegexpTokenizer({'AA': 2, 'CT': 1, 'AG': 1, 'CD': 1})
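
The flags argument is presumably passed through to Python's re module when the pattern is compiled (an assumption; verify against the source). A sketch using a standard re flag:

>>> import re
>>> ci = RegexpTokenizer(regexp='[a-z]+', flags=re.IGNORECASE).tokenize('AB cd EF')
>>> # expected tokens: 'AB', 'cd', 'EF'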

New in version 0.4.0.

class abydos.tokenizer.WhitespaceTokenizer(scaler=None, flags=0)[source]

Bases: abydos.tokenizer._regexp.RegexpTokenizer

A whitespace tokenizer.

Examples

>>> WhitespaceTokenizer().tokenize('a b c f a c g e a b')
WhitespaceTokenizer({'a': 3, 'b': 2, 'c': 2, 'f': 1, 'g': 1, 'e': 1})

New in version 0.4.0.

Initialize tokenizer.

Parameters

scaler (None, str, or function) --

A scaling function for the Counter:

  • None : no scaling

  • 'set' : All non-zero values are set to 1.

  • 'length' : Each token has weight equal to its length.

  • 'length-log' : Each token has weight equal to the log of its length + 1.

  • 'length-exp' : Each token has weight equal to e raised to its length.

  • a callable function : The function is applied to each value in the Counter. Some useful functions include math.exp, math.log1p, math.sqrt, and indexes into interesting integer sequences such as the Fibonacci sequence.

New in version 0.4.0.

class abydos.tokenizer.WordpunctTokenizer(scaler=None, flags=0)[source]

Bases: abydos.tokenizer._regexp.RegexpTokenizer

A wordpunct tokenizer.

Examples

>>> WordpunctTokenizer().tokenize("Can't stop the feelin'!")
WordpunctTokenizer({'Can': 1, "'": 1, 't': 1, 'stop': 1, 'the': 1,
'feelin': 1, "'!": 1})

New in version 0.4.0.

Initialize tokenizer.

Parameters

scaler (None, str, or function) --

A scaling function for the Counter:

  • None : no scaling

  • 'set' : All non-zero values are set to 1.

  • 'length' : Each token has weight equal to its length.

  • 'length-log' : Each token has weight equal to the log of its length + 1.

  • 'length-exp' : Each token has weight equal to e raised to its length.

  • a callable function : The function is applied to each value in the Counter. Some useful functions include math.exp, math.log1p, math.sqrt, and indexes into interesting integer sequences such as the Fibonacci sequence.

New in version 0.4.0.

class abydos.tokenizer.COrVClusterTokenizer(scaler=None, consonants=None, vowels=None)[source]

Bases: abydos.tokenizer._tokenizer._Tokenizer

A C- or V-cluster tokenizer.

New in version 0.4.0.

Initialize tokenizer.

Parameters

scaler (None, str, or function) --

A scaling function for the Counter:

  • None : no scaling

  • 'set' : All non-zero values are set to 1.

  • 'length' : Each token has weight equal to its length.

  • 'length-log' : Each token has weight equal to the log of its length + 1.

  • 'length-exp' : Each token has weight equal to e raised to its length.

  • a callable function : The function is applied to each value in the Counter. Some useful functions include math.exp, math.log1p, math.sqrt, and indexes into interesting integer sequences such as the Fibonacci sequence.

New in version 0.4.0.

tokenize(string)[source]

Tokenize the term and store it.

The tokenized term is stored as an ordered list and as a Counter object.

Parameters

string (str) -- The string to tokenize

Examples

>>> COrVClusterTokenizer().tokenize('seven-twelfths')
COrVClusterTokenizer({'e': 3, 's': 1, 'v': 1, 'n': 1, '-': 1,
'tw': 1, 'lfths': 1})
>>> COrVClusterTokenizer().tokenize('character')
COrVClusterTokenizer({'a': 2, 'r': 2, 'ch': 1, 'ct': 1, 'e': 1})
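
The consonants and vowels parameters are assumed here to accept a collection of characters redefining the respective character classes, e.g. to treat 'y' as a vowel. A hypothetical sketch (output omitted):

>>> y_vowel = COrVClusterTokenizer(vowels=set('aeiouyAEIOUY'))
>>> tokens = y_vowel.tokenize('rhythm')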

New in version 0.4.0.

class abydos.tokenizer.CVClusterTokenizer(scaler=None, consonants=None, vowels=None)[source]

Bases: abydos.tokenizer._tokenizer._Tokenizer

A C*V*-cluster tokenizer.

New in version 0.4.0.

Initialize tokenizer.

Parameters

scaler (None, str, or function) --

A scaling function for the Counter:

  • None : no scaling

  • 'set' : All non-zero values are set to 1.

  • 'length' : Each token has weight equal to its length.

  • 'length-log' : Each token has weight equal to the log of its length + 1.

  • 'length-exp' : Each token has weight equal to e raised to its length.

  • a callable function : The function is applied to each value in the Counter. Some useful functions include math.exp, math.log1p, math.sqrt, and indexes into interesting integer sequences such as the Fibonacci sequence.

New in version 0.4.0.

tokenize(string)[source]

Tokenize the term and store it.

The tokenized term is stored as an ordered list and as a Counter object.

Parameters

string (str) -- The string to tokenize

Examples

>>> CVClusterTokenizer().tokenize('seven-twelfths')
CVClusterTokenizer({'se': 1, 've': 1, 'n': 1, '-': 1, 'twe': 1,
'lfths': 1})
>>> CVClusterTokenizer().tokenize('character')
CVClusterTokenizer({'cha': 1, 'ra': 1, 'cte': 1, 'r': 1})

New in version 0.4.0.

class abydos.tokenizer.VCClusterTokenizer(scaler=None, consonants=None, vowels=None)[source]

Bases: abydos.tokenizer._tokenizer._Tokenizer

A V*C*-cluster tokenizer.

New in version 0.4.0.

Initialize tokenizer.

Parameters

scaler (None, str, or function) --

A scaling function for the Counter:

  • None : no scaling

  • 'set' : All non-zero values are set to 1.

  • 'length' : Each token has weight equal to its length.

  • 'length-log' : Each token has weight equal to the log of its length + 1.

  • 'length-exp' : Each token has weight equal to e raised to its length.

  • a callable function : The function is applied to each value in the Counter. Some useful functions include math.exp, math.log1p, math.sqrt, and indexes into interesting integer sequences such as the Fibonacci sequence.

New in version 0.4.0.

tokenize(string)[source]

Tokenize the term and store it.

The tokenized term is stored as an ordered list and as a Counter object.

Parameters

string (str) -- The string to tokenize

Examples

>>> VCClusterTokenizer().tokenize('seven-twelfths')
VCClusterTokenizer({'s': 1, 'ev': 1, 'en': 1, '-': 1, 'tw': 1,
'elfths': 1})
>>> VCClusterTokenizer().tokenize('character')
VCClusterTokenizer({'ch': 1, 'ar': 1, 'act': 1, 'er': 1})

New in version 0.4.0.

class abydos.tokenizer.SAPSTokenizer(scaler=None)[source]

Bases: abydos.tokenizer._tokenizer._Tokenizer

Syllable Alignment Pattern Searching tokenizer.

This is the syllabifier described on p. 917 of [RY05].

New in version 0.4.0.

Initialize Tokenizer.

Parameters

scaler (None, str, or function) --

A scaling function for the Counter:

  • None : no scaling

  • 'set' : All non-zero values are set to 1.

  • 'length' : Each token has weight equal to its length.

  • 'length-log' : Each token has weight equal to the log of its length + 1.

  • 'length-exp' : Each token has weight equal to e raised to its length.

  • a callable function : The function is applied to each value in the Counter. Some useful functions include math.exp, math.log1p, math.sqrt, and indexes into interesting integer sequences such as the Fibonacci sequence.

New in version 0.4.0.

tokenize(string)[source]

Tokenize the term and store it.

The tokenized term is stored as an ordered list and as a Counter object.

Parameters

string (str) -- The string to tokenize

Examples

>>> SAPSTokenizer().tokenize('seven-twelfths')
SAPSTokenizer({'t': 2, 'se': 1, 'ven': 1, '-': 1, 'wel': 1, 'f': 1,
'h': 1, 's': 1})
>>> SAPSTokenizer().tokenize('character')
SAPSTokenizer({'c': 1, 'ha': 1, 'rac': 1, 'ter': 1})

New in version 0.4.0.

class abydos.tokenizer.SonoriPyTokenizer(scaler=None)[source]

Bases: abydos.tokenizer._tokenizer._Tokenizer

SonoriPy tokenizer.

New in version 0.4.0.

Initialize Tokenizer.

Parameters

scaler (None, str, or function) --

A scaling function for the Counter:

  • None : no scaling

  • 'set' : All non-zero values are set to 1.

  • 'length' : Each token has weight equal to its length.

  • 'length-log' : Each token has weight equal to the log of its length + 1.

  • 'length-exp' : Each token has weight equal to e raised to its length.

  • a callable function : The function is applied to each value in the Counter. Some useful functions include math.exp, math.log1p, math.sqrt, and indexes into interesting integer sequences such as the Fibonacci sequence.

New in version 0.4.0.

tokenize(string)[source]

Tokenize the term and store it.

The tokenized term is stored as an ordered list and as a Counter object.

Parameters

string (str) -- The string to tokenize

Examples

>>> SonoriPyTokenizer().tokenize('seven-twelfths')
SonoriPyTokenizer({'se': 1, 'ven-': 1, 'twelfths': 1})
>>> SonoriPyTokenizer().tokenize('character')
SonoriPyTokenizer({'cha': 1, 'rac': 1, 'ter': 1})

New in version 0.4.0.

class abydos.tokenizer.LegaliPyTokenizer(scaler=None)[source]

Bases: abydos.tokenizer._tokenizer._Tokenizer

LegaliPy tokenizer.

New in version 0.4.0.

Initialize Tokenizer.

Parameters

scaler (None, str, or function) --

A scaling function for the Counter:

  • None : no scaling

  • 'set' : All non-zero values are set to 1.

  • 'length' : Each token has weight equal to its length.

  • 'length-log' : Each token has weight equal to the log of its length + 1.

  • 'length-exp' : Each token has weight equal to e raised to its length.

  • a callable function : The function is applied to each value in the Counter. Some useful functions include math.exp, math.log1p, math.sqrt, and indexes into interesting integer sequences such as the Fibonacci sequence.

New in version 0.4.0.

tokenize(string, ipa=False)[source]

Tokenize the term and store it.

The tokenized term is stored as an ordered list and as a Counter object.

Parameters
  • string (str) -- The string to tokenize

  • ipa (bool) -- If True, indicates that the string is in IPA

Examples

>>> LegaliPyTokenizer().tokenize('seven-twelfths')
LegaliPyTokenizer({'s': 1, 'ev': 1, 'en-tw': 1, 'elfths': 1})
>>> LegaliPyTokenizer().tokenize('character')
LegaliPyTokenizer({'ch': 1, 'ar': 1, 'act': 1, 'er': 1})

New in version 0.4.0.

train_onsets(text, threshold=0.0002, clean=True, append=False)[source]

Train the onsets on a text.

Parameters
  • text (str) -- The text on which to train

  • threshold (float) -- Threshold proportion above which an onset is included in the onset list

  • clean (bool) -- If True, the text is stripped of numerals and punctuation

  • append (bool) -- If True, the current onset list is extended
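
Examples

A sketch of the intended workflow (the training string here is a hypothetical stand-in; real use calls for a substantial corpus):

>>> tok = LegaliPyTokenizer()
>>> tok.train_onsets('consider training on sentences drawn from a larger corpus')
>>> trained = tok.tokenize('character')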

New in version 0.4.0.

class abydos.tokenizer.NLTKTokenizer(nltk_tokenizer=None, scaler=None)[source]

Bases: abydos.tokenizer._tokenizer._Tokenizer

NLTK tokenizer wrapper class.

New in version 0.4.0.

Initialize Tokenizer.

Parameters
  • scaler (None, str, or function) --

    A scaling function for the Counter:

    • None : no scaling

    • 'set' : All non-zero values are set to 1.

    • 'length' : Each token has weight equal to its length.

    • 'length-log' : Each token has weight equal to the log of its length + 1.

    • 'length-exp' : Each token has weight equal to e raised to its length.

    • a callable function : The function is applied to each value in the Counter. Some useful functions include math.exp, math.log1p, math.sqrt, and indexes into interesting integer sequences such as the Fibonacci sequence.

  • nltk_tokenizer (Object) -- An instantiated tokenizer from NLTK.

New in version 0.4.0.

tokenize(string)[source]

Tokenize the term and store it.

The tokenized term is stored as an ordered list and as a Counter object.

Parameters

string (str) -- The string to tokenize

Examples

>>> from nltk.tokenize.casual import TweetTokenizer
>>> nltk_tok = TweetTokenizer()
>>> NLTKTokenizer(nltk_tokenizer=nltk_tok).tokenize(
... '.@Twitter Today is #lit!')
NLTKTokenizer({'.': 1, '@Twitter': 1, 'Today': 1, 'is': 1, '#lit': 1,
'!': 1})

New in version 0.4.0.