abydos package

Submodules

abydos.clustering module

abydos.clustering

The clustering module implements clustering algorithms such as:
  • string fingerprinting
abydos.clustering.fingerprint(phrase)

String fingerprint

The fingerprint of a string is a string consisting of all of the unique words in the input, alphabetized & concatenated with intervening spaces.

Parameters:phrase (str) – the string from which to calculate the fingerprint
Returns:the fingerprint of the phrase
Return type:str
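
The behavior described above can be sketched in a few lines; note that lowercasing and whitespace splitting are assumptions about the library's normalization, not part of the documented contract:

```python
def fingerprint(phrase):
    """Return the string fingerprint: unique words, lowercased,
    alphabetized, and joined with single spaces."""
    return ' '.join(sorted(set(phrase.lower().split())))

print(fingerprint('To be or not to be'))  # 'be not or to'
```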
abydos.clustering.mean_pairwise_similarity(collection, metric=<function sim>, meanfunc=<function hmean>, symmetric=False)

Mean pairwise similarity of a collection of strings

Takes the mean of the pairwise similarity between each member of a collection, optionally in both directions (for asymmetric similarity metrics).

Parameters:
  • collection (list) – a collection of terms or a string that can be split
  • metric (function) – a similarity metric function
  • meanfunc (function) – a mean function that takes a list of values and returns a float
  • symmetric (bool) – set to True if all pairwise similarities should be calculated in both directions
Returns:

the mean pairwise similarity of a collection of strings

Return type:

float
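
A simplified sketch of the computation (the library defaults to a harmonic mean and its own sim function; the arithmetic mean and the toy shared-character metric below are illustrative assumptions):

```python
from itertools import combinations

def mean_pairwise_similarity(collection, metric, symmetric=False):
    """Average the metric over every unordered pair; with symmetric=True,
    evaluate each pair in both directions."""
    vals = []
    for a, b in combinations(collection, 2):
        vals.append(metric(a, b))
        if symmetric:
            vals.append(metric(b, a))
    return sum(vals) / len(vals)

# toy symmetric metric: Jaccard overlap of character sets (illustrative only)
sim = lambda a, b: len(set(a) & set(b)) / len(set(a) | set(b))
print(mean_pairwise_similarity(['cat', 'hat', 'bat'], sim))  # 0.5
```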

abydos.clustering.omission_key(word)

Omission key

The omission key of a word is defined in: Pollock, Joseph J. and Antonio Zamora. 1984. “Automatic Spelling Correction in Scientific and Scholarly Text.” Communications of the ACM, 27(4). 358–368. <http://dl.acm.org/citation.cfm?id=358048>

Parameters:word (str) – the word to transform into its omission key
Returns:the omission key
Return type:str
abydos.clustering.phonetic_fingerprint(phrase, phonetic_algorithm=<function double_metaphone>, *args)

Return the phonetic fingerprint of a phrase

A phonetic fingerprint is identical to a standard string fingerprint, as implemented in abydos.clustering.fingerprint(), but performs the fingerprinting function after converting the string to its phonetic form, as determined by some phonetic algorithm.

Parameters:
  • phrase (str) – the string from which to calculate the phonetic fingerprint
  • phonetic_algorithm (function) – a phonetic algorithm that takes a string and returns a string (presumably a phonetic representation of the original string) By default, this function uses abydos.phonetic.double_metaphone()
  • args – additional arguments to pass to the phonetic algorithm, along with the phrase itself
Returns:

the phonetic fingerprint of the phrase

Return type:

str

abydos.clustering.qgram_fingerprint(phrase, qval=2, start_stop=u'')

Q-Gram fingerprint

A q-gram fingerprint is a string consisting of all of the unique q-grams in a string, alphabetized & concatenated.

Parameters:
  • phrase (str) – the string from which to calculate the q-gram fingerprint
  • qval (int) – the length of each q-gram (by default 2)
  • start_stop (str) – the start & stop symbol(s) to concatenate on either end of the phrase, as defined in abydos.util.qgram()
Returns:

the q-gram fingerprint of the phrase

Return type:

str
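
A minimal sketch of the q-gram fingerprint as described (lowercasing and whitespace removal are assumptions about the library's normalization, and no start/stop padding is applied here):

```python
def qgram_fingerprint(phrase, qval=2):
    """Unique q-grams of the lowercased, despaced phrase,
    alphabetized and concatenated."""
    phrase = ''.join(phrase.lower().split())
    qgrams = {phrase[i:i + qval] for i in range(len(phrase) - qval + 1)}
    return ''.join(sorted(qgrams))

print(qgram_fingerprint('paper'))  # 'aperpape'
```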

abydos.clustering.skeleton_key(word)

Skeleton key

The skeleton key of a word is defined in: Pollock, Joseph J. and Antonio Zamora. 1984. “Automatic Spelling Correction in Scientific and Scholarly Text.” Communications of the ACM, 27(4). 358–368. <http://dl.acm.org/citation.cfm?id=358048>

Parameters:word (str) – the word to transform into its skeleton key
Returns:the skeleton key
Return type:str

abydos.compression module

abydos.compression

The compression module defines compression and compression-related functions for use within Abydos, including implementations of the following:

  • arithmetic coding functions (ac_train, ac_encode, & ac_decode)
  • Burrows-Wheeler transform encoder/decoder (bwt_encode & bwt_decode)
  • Run-Length Encoding encoder/decoder (rle_encode & rle_decode)
abydos.compression.ac_decode(longval, nbits, probs)

Decode the number to a string using the given statistics

This is based on Andrew Dalke’s public domain implementation: http://code.activestate.com/recipes/306626/. It has been ported to use the abydos.util.Rational class.

Parameters:
  • longval – The first part of an encoded tuple from ac_encode
  • nbits – The second part of an encoded tuple from ac_encode
  • probs – A probability statistics dictionary generated by ac_train
Returns:

The arithmetically decoded text

Return type:

str

abydos.compression.ac_encode(text, probs)

Encode a text using arithmetic coding with the provided probabilities

Text and the 0-order probability statistics -> longval, nbits

The encoded number is rational(longval, 2**nbits)

This is based on Andrew Dalke’s public domain implementation: http://code.activestate.com/recipes/306626/. It has been ported to use the abydos.util.Rational class.

Parameters:
  • text – A string to encode
  • probs – A probability statistics dictionary generated by ac_train
Returns:

The arithmetically coded text

Return type:

tuple

abydos.compression.ac_train(text)

Generate a probability dict from the provided text

Text -> 0-order probability statistics as a dict

This is based on Andrew Dalke’s public domain implementation: http://code.activestate.com/recipes/306626/. It has been ported to use the abydos.util.Rational class.

Parameters:text – The text data over which to calculate probability statistics. This must not contain the NUL (0x00) character because that’s used to indicate the end of data.
Returns:a probability dict
Return type:dict
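
The counting step of the training phase can be sketched as follows. This simplified version returns plain symbol probabilities as exact fractions; the library's version additionally tracks cumulative ranges for the encoder, so the exact return format here is an assumption:

```python
from fractions import Fraction
from collections import Counter

def train_probs(text):
    """0-order symbol probabilities as exact fractions, with NUL
    reserved as the end-of-data symbol."""
    counts = Counter(text + '\x00')
    total = sum(counts.values())
    return {ch: Fraction(n, total) for ch, n in counts.items()}

probs = train_probs('abracadabra')
print(probs['a'])  # Fraction(5, 12)
```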
abydos.compression.bwt_decode(code, terminator=u'\x00')

Return a word decoded from BWT form

The Burrows-Wheeler transform is an attempt at placing similar characters together to improve compression. This function reverses the transform. Cf. https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform

Parameters:
  • code – the word to transform from BWT form
  • terminator – a character added to word to signal the end of the string
Returns:

word decoded by BWT

Return type:

str

abydos.compression.bwt_encode(word, terminator=u'\x00')

Return the Burrows-Wheeler transformed form of a word

The Burrows-Wheeler transform is an attempt at placing similar characters together to improve compression. Cf. https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform

Parameters:
  • word – the word to transform using BWT
  • terminator – a character to add to word to signal the end of the string
Returns:

word encoded by BWT

Return type:

str
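
The two BWT entries above can be sketched with the textbook (naive, O(n² log n)) rotation-sorting construction; production encoders use suffix arrays instead:

```python
def bwt_encode(word, terminator='\x00'):
    """Sort all rotations of word + terminator; the BWT is the last column."""
    word += terminator
    rotations = sorted(word[i:] + word[:i] for i in range(len(word)))
    return ''.join(rot[-1] for rot in rotations)

def bwt_decode(code, terminator='\x00'):
    """Invert by repeatedly prepending the code column and re-sorting."""
    table = [''] * len(code)
    for _ in range(len(code)):
        table = sorted(code[i] + table[i] for i in range(len(code)))
    row = next(r for r in table if r.endswith(terminator))
    return row.rstrip(terminator)

print(bwt_decode(bwt_encode('banana')))  # 'banana'
```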

abydos.compression.rle_decode(text, use_bwt=True)

Simple, crude run-length-encoding (RLE) decoder

Based on http://rosettacode.org/wiki/Run-length_encoding#Python

Digits 0-9 cannot have been in the original text.

Parameters:
  • text – a text string to decode
  • use_bwt – boolean indicating whether to perform BWT decoding after RLE decoding
Returns:

the RLE decoded text

Return type:

str

abydos.compression.rle_encode(text, use_bwt=True)

Simple, crude run-length-encoding (RLE) encoder

Based on http://rosettacode.org/wiki/Run-length_encoding#Python

Digits 0-9 cannot be in text.

Parameters:
  • text – a text string to encode
  • use_bwt – boolean indicating whether to perform BWT encoding before RLE encoding
Returns:

the RLE encoded text

Return type:

str
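
A sketch of the crude RLE pair in the Rosetta Code style, without the BWT stage. Every run becomes <count><char> here; how the library handles short runs may differ, so treat this as illustrative:

```python
from itertools import groupby
import re

def rle_encode(text):
    """Replace each run of a character with its length plus the character.
    Digits 0-9 must not appear in the input."""
    return ''.join(str(len(list(group))) + char
                   for char, group in groupby(text))

def rle_decode(text):
    """Expand each <count><char> pair back into a run."""
    return ''.join(char * int(count)
                   for count, char in re.findall(r'(\d+)(\D)', text))

print(rle_encode('WWWWWWWWWWWWBWWWWWWWWWWWWBBB'))  # '12W1B12W3B'
```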

abydos.corpus module

abydos.corpus

The corpus class is a container for linguistic corpora and includes various functions for corpus statistics, language modeling, etc.

class abydos.corpus.Corpus(corpus_text=u'', doc_split=u'\n\n', sent_split=u'\n', filter_chars=u'', stop_words=None)

Bases: object

The Corpus class

Internally, this is a list of lists of lists. The corpus itself is a list of documents. Each document is an ordered list of sentences in that document. And each sentence is an ordered list of the words that make up that sentence.
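
The nested structure can be built with two splits and a word split; the helper name below is hypothetical, and filter_chars/stop_words handling is omitted:

```python
def parse_corpus(text, doc_split='\n\n', sent_split='\n'):
    """Split text into docs, docs into sentences, sentences into words,
    yielding the list-of-lists-of-lists structure described above."""
    return [[sent.split() for sent in doc.split(sent_split) if sent]
            for doc in text.split(doc_split) if doc]

print(parse_corpus('a b\nc d\n\ne f'))
# [[['a', 'b'], ['c', 'd']], [['e', 'f']]]
```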

docs()

Get the docs in the corpus

Each list within a doc represents the sentences in that doc, each of which is in turn a list of words within that sentence.

Returns:the docs in the corpus as a list of lists of lists of strs
Return type:list(list(list(str)))
docs_of_words()

Get the docs in the corpus, with sentences flattened

Each list within the corpus represents all the words of that document. Thus the sentence level of lists has been flattened.

Returns:the docs in the corpus as a list of list of strs
Return type:list(list(str))
idf(term, transform=None)

Calculates the Inverse Document Frequency (IDF) of a term in the corpus.

Parameters:
  • term – the term to calculate the IDF of
  • transform – a function to apply to each document term before checking for the presence of term
Returns:

the IDF

Return type:

float
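
IDF over the corpus structure can be sketched directly on a list of word-lists. The base-10 logarithm is an assumption about the library's formula, and the transform hook is omitted:

```python
from math import log10

def idf(docs, term):
    """log10 of (total docs / docs containing the term)."""
    matches = sum(1 for doc in docs if term in doc)
    return log10(len(docs) / matches)

docs = [['the', 'cat'], ['the', 'dog'], ['a', 'bird']]
print(idf(docs, 'the'))  # log10(3/2) ~ 0.1761
```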

paras()

Get the paragraphs in the corpus

Each list within a paragraph represents the sentences in that doc, each of which is in turn a list of words within that sentence. This is identical to the docs() member function and exists only to mirror part of NLTK’s API for corpora.

Returns:the paragraphs in the corpus as a list of lists of lists of strs
Return type:list(list(list(str)))
raw()

Get the raw corpus

This is reconstructed by joining sub-components with the corpus’ split characters

Returns:the raw corpus
Return type:str
sents()

Get the sentences in the corpus

Each list within a sentence represents the words within that sentence.

Returns:the sentences in the corpus as a list of lists of strs
Return type:list(list(str))
words()

Get the words in the corpus as a single list

Returns:the words in the corpus as a list of strs
Return type:list(str)

abydos.distance module

abydos.distance

The distance module implements string edit distance functions including:

  • Levenshtein distance (incl. a [0, 1] normalized variant)
  • Optimal String Alignment distance (incl. a [0, 1] normalized variant)
  • Levenshtein-Damerau distance (incl. a [0, 1] normalized variant)
  • Hamming distance (incl. a [0, 1] normalized variant)
  • Tversky index
  • Sørensen–Dice coefficient & distance
  • Jaccard similarity coefficient & distance
  • overlap similarity & distance
  • Tanimoto coefficient & distance
  • cosine similarity & distance
  • Jaro distance
  • Jaro-Winkler distance (incl. the strcmp95 algorithm variant)
  • Longest common substring
  • Ratcliff-Obershelp similarity & distance
  • Match Rating Algorithm similarity
  • Normalized Compression Distance (NCD) & similarity
  • Monge-Elkan similarity & distance
  • Matrix similarity
  • Needleman-Wunsch score
  • Smith-Waterman score
  • Gotoh score
  • Length similarity
  • Prefix, Suffix, and Identity similarity & distance
  • Modified Language-Independent Product Name Search (MLIPNS) similarity & distance
  • Bag distance (incl. a [0, 1] normalized variant)
  • Editex distance (incl. a [0, 1] normalized variant)
  • TF-IDF similarity

Functions beginning with the prefixes ‘sim’ and ‘dist’ are guaranteed to be in the range [0, 1], and sim_X = 1 - dist_X since the two are complements. If a sim_X function is supplied identical src & tar arguments, it is guaranteed to return 1; the corresponding dist_X function is guaranteed to return 0.

abydos.distance.bag(src, tar)

bag distance

Bag distance is: \(max( |multiset(src)-multiset(tar)|, |multiset(tar)-multiset(src)| )\)

Parameters:src, tar (str) – two strings to be compared
Returns:bag distance
Return type:int
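
The multiset formula above maps directly onto collections.Counter, whose subtraction already discards negative counts:

```python
from collections import Counter

def bag(src, tar):
    """Bag distance: the larger of the two multiset differences."""
    src_bag, tar_bag = Counter(src), Counter(tar)
    return max(sum((src_bag - tar_bag).values()),
               sum((tar_bag - src_bag).values()))

print(bag('cat', 'hat'))  # 1
```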
abydos.distance.damerau_levenshtein(src, tar, cost=(1, 1, 1, 1))

Damerau-Levenshtein distance

This computes the Damerau-Levenshtein distance. Cf. https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance

Damerau-Levenshtein code based on Java code by Kevin L. Stern, under the MIT license: https://github.com/KevinStern/software-and-algorithms/blob/master/src/main/java/blogspot/software_and_algorithms/stern_library/string/DamerauLevenshteinAlgorithm.java

Parameters:
  • src, tar (str) – two strings to be compared
  • cost (tuple) – a 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and transpositions, respectively (by default: (1, 1, 1, 1))
Returns:

the Damerau-Levenshtein distance between src & tar

Return type:

int (may return a float if cost has float values)

abydos.distance.dist(src, tar, method=<function sim_levenshtein>)

generalized distance

This is a generalized function for calling other distance functions.

Parameters:
  • src, tar (str) – two strings to be compared
  • method (function) – specifies the similarity metric (Levenshtein by default) – Note that this takes a similarity metric function, not a distance metric function.
Returns:

distance according to the specified function

Return type:

float

abydos.distance.dist_bag(src, tar)

Return the normalized bag distance of two strings.

Bag distance is normalized by dividing by \(max( |src|, |tar| )\).

Parameters:src, tar (str) – two strings to be compared
Returns:normalized bag distance
Return type:float
abydos.distance.dist_compression(src, tar, compressor=u'bz2', probs=None)

normalized compression distance (NCD)

Cf. https://en.wikipedia.org/wiki/Normalized_compression_distance#Normalized_compression_distance

Parameters:
  • src, tar (str) – two strings to be compared
  • compressor (str) –

    a compression scheme to use for the similarity calculation, from the following:

    • zlib – standard zlib/gzip
    • bz2 – bzip2
    • lzma – Lempel–Ziv–Markov chain algorithm
    • arith – arithmetic coding
    • rle – run-length encoding
    • bwtrle – Burrows-Wheeler transform followed by run-length encoding
  • probs (dict) – a dictionary trained with ac_train (for the arith compressor only)
Returns:

compression distance

Return type:

float

abydos.distance.dist_cosine(src, tar, qval=2)

cosine distance

Cosine distance is the complement of cosine similarity: \(dist_{cosine} = 1 - sim_{cosine}\)

Parameters:
  • src, tar (str) – two strings to be compared (or QGrams/Counter objects)
  • qval (int) – the length of each q-gram; 0 or None for non-q-gram version
Returns:

cosine distance

Return type:

float

abydos.distance.dist_damerau(src, tar, cost=(1, 1, 1, 1))

Damerau-Levenshtein distance normalized to the interval [0, 1]

The Damerau-Levenshtein distance is normalized by dividing the Damerau-Levenshtein distance by the greater of the number of characters in src times the cost of a delete and the number of characters in tar times the cost of an insert. For the case in which all operations have \(cost = 1\), this is equivalent to the greater of the length of the two strings src & tar.

The arguments are identical to those of the levenshtein() function.

Parameters:
  • src, tar (str) – two strings to be compared
  • cost (tuple) – a 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and transpositions, respectively (by default: (1, 1, 1, 1))
Returns:

normalized Damerau-Levenshtein distance

Return type:

float

abydos.distance.dist_dice(src, tar, qval=2)

Sørensen–Dice distance

Sørensen–Dice distance is the complement of the Sørensen–Dice coefficient: \(dist_{dice} = 1 - sim_{dice}\)

Parameters:
  • src, tar (str) – two strings to be compared (or QGrams/Counter objects)
  • qval (int) – the length of each q-gram; 0 or None for non-q-gram version
Returns:

Sørensen–Dice distance

Return type:

float

abydos.distance.dist_editex(src, tar, cost=(0, 1, 2), local=False)

Editex distance normalized to the interval [0, 1]

The Editex distance is normalized by dividing the Editex distance by the greater of the number of characters in src times the cost of a delete and the number of characters in tar times the cost of an insert. For the case in which all operations have \(cost = 1\), this is equivalent to the greater of the length of the two strings src & tar.

Parameters:
  • src, tar (str) – two strings to be compared
  • cost (tuple) – a 3-tuple representing the cost of the three possible edit operations: match, same-group, and mismatch, respectively (by default: (0, 1, 2))
  • local (bool) – if True, the local variant of Editex is used
Returns:

normalized Editex distance

Return type:

float

abydos.distance.dist_hamming(src, tar, difflens=True)

Hamming distance normalized to the interval [0, 1]

The Hamming distance is normalized by dividing it by the greater of the number of characters in src & tar (unless difflens is set to False, in which case an exception is raised for strings of unequal length).

The arguments are identical to those of the hamming() function.

Parameters:
  • src, tar (str) – two strings to be compared
  • difflens (bool) – If True (default), this returns the Hamming distance for those characters that have a matching character in both strings plus the difference in the strings’ lengths. This is equivalent to extending the shorter string with obligatorily non-matching characters. If False, an exception is raised in the case of strings of unequal lengths.
Returns:

normalized Hamming distance

Return type:

float

abydos.distance.dist_ident(src, tar)

identity distance

This is 0 if the two strings are identical, otherwise 1, i.e. \(dist_{identity} = 1 - sim_{identity}\)

Parameters:src, tar (str) – two strings to be compared
Returns:identity distance
Return type:int
abydos.distance.dist_jaccard(src, tar, qval=2)

Jaccard distance

Jaccard distance is the complement of the Jaccard similarity coefficient: \(dist_{Jaccard} = 1 - sim_{Jaccard}\)

Parameters:
  • src, tar (str) – two strings to be compared (or QGrams/Counter objects)
  • qval (int) – the length of each q-gram; 0 or None for non-q-gram version
Returns:

Jaccard distance

Return type:

float

abydos.distance.dist_jaro_winkler(src, tar, qval=1, mode=u'winkler', long_strings=False, boost_threshold=0.7, scaling_factor=0.1)

Jaro(-Winkler) distance

Jaro-Winkler distance is 1 - the Jaro-Winkler similarity

Parameters:
  • src, tar (str) – two strings to be compared
  • qval (int) – the length of each q-gram (defaults to 1: character-wise matching)
  • mode (str) –

    indicates which variant of this distance metric to compute:

    • ‘winkler’ – computes the Jaro-Winkler distance (default) which increases the score for matches near the start of the word
    • ‘jaro’ – computes the Jaro distance

The following arguments apply only when mode is ‘winkler’:

Parameters:
  • long_strings (bool) – set to True to “Increase the probability of a match when the number of matched characters is large. This option allows for a little more tolerance when the strings are large. It is not an appropriate test when comparing fixed length fields such as phone and social security numbers.”
  • boost_threshold (float) – a value between 0 and 1, below which the Winkler boost is not applied (defaults to 0.7)
  • scaling_factor (float) – a value between 0 and 0.25, indicating by how much to boost scores for matching prefixes (defaults to 0.1)
Returns:

Jaro or Jaro-Winkler distance

Return type:

float
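
The character-wise (qval=1) case of the similarity underlying this distance can be sketched as below; the distance is then 1 minus the result. The matching-window and prefix-boost details follow the standard published formulation, which this implementation may refine:

```python
def sim_jaro(src, tar):
    """Jaro similarity: common characters within a sliding window,
    penalized by half-transpositions."""
    if src == tar:
        return 1.0
    if not src or not tar:
        return 0.0
    window = max(len(src), len(tar)) // 2 - 1
    src_m = [False] * len(src)
    tar_m = [False] * len(tar)
    matches = 0
    for i, ch in enumerate(src):
        lo, hi = max(0, i - window), min(len(tar), i + window + 1)
        for j in range(lo, hi):
            if not tar_m[j] and tar[j] == ch:
                src_m[i] = tar_m[j] = True
                matches += 1
                break
    if not matches:
        return 0.0
    transpositions = sum(a != b for a, b in zip(
        (c for c, m in zip(src, src_m) if m),
        (c for c, m in zip(tar, tar_m) if m))) // 2
    return (matches / len(src) + matches / len(tar)
            + (matches - transpositions) / matches) / 3

def sim_jaro_winkler(src, tar, boost_threshold=0.7, scaling_factor=0.1):
    """Winkler variant: boost scores above the threshold for a shared
    prefix of up to four characters."""
    j = sim_jaro(src, tar)
    if j <= boost_threshold:
        return j
    prefix = 0
    for a, b in zip(src[:4], tar[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * scaling_factor * (1 - j)

print(round(sim_jaro('MARTHA', 'MARHTA'), 4))  # 0.9444
```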

abydos.distance.dist_lcsseq(src, tar)

longest common subsequence distance (\(dist_{LCSseq}\))

This employs the LCSseq function to derive a similarity metric: \(dist_{LCSseq}(s,t) = 1 - sim_{LCSseq}(s,t)\)

Parameters:src, tar (str) – two strings to be compared
Returns:LCSseq distance
Return type:float
abydos.distance.dist_lcsstr(src, tar)

longest common substring distance (\(dist_{LCSstr}\))

This employs the LCS function to derive a similarity metric: \(dist_{LCSstr}(s,t) = 1 - sim_{LCSstr}(s,t)\)

Parameters:src, tar (str) – two strings to be compared
Returns:LCSstr distance
Return type:float
abydos.distance.dist_length(src, tar)

length distance

Length distance is the complement of length similarity: \(dist_{length} = 1 - sim_{length}\)

Parameters:src, tar (str) – two strings to be compared
Returns:length distance
Return type:float
abydos.distance.dist_levenshtein(src, tar, mode=u'lev', cost=(1, 1, 1, 1))

Levenshtein distance normalized to the interval [0, 1]

The Levenshtein distance is normalized by dividing the Levenshtein distance (calculated by any of the three supported methods) by the greater of the number of characters in src times the cost of a delete and the number of characters in tar times the cost of an insert. For the case in which all operations have \(cost = 1\), this is equivalent to the greater of the length of the two strings src & tar.

Parameters:
  • src, tar (str) – two strings to be compared
  • mode (str) –

    specifies a mode for computing the Levenshtein distance:

    • ‘lev’ (default) computes the ordinary Levenshtein distance, in which edits may include inserts, deletes, and substitutions
    • ‘osa’ computes the Optimal String Alignment distance, in which edits may include inserts, deletes, substitutions, and transpositions but substrings may only be edited once
    • ‘dam’ computes the Damerau-Levenshtein distance, in which edits may include inserts, deletes, substitutions, and transpositions and substrings may undergo repeated edits
  • cost (tuple) – a 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and transpositions, respectively (by default: (1, 1, 1, 1))
Returns:

normalized Levenshtein distance

Return type:

float

abydos.distance.dist_mlipns(src, tar, threshold=0.25, maxmismatches=2)

Modified Language-Independent Product Name Search (MLIPNS)

MLIPNS distance is the complement of MLIPNS similarity: \(dist_{MLIPNS} = 1 - sim_{MLIPNS}\)

This function returns only 0.0 (distant) or 1.0 (not distant)

Parameters:
  • src, tar (str) – two strings to be compared
  • threshold (float) – a number [0, 1] indicating the maximum similarity score, below which the strings are considered ‘similar’ (0.25 by default)
  • maxmismatches (int) – a number indicating the allowable number of mismatches to remove before declaring two strings not similar (2 by default)
Returns:

MLIPNS distance

Return type:

float

abydos.distance.dist_monge_elkan(src, tar, sim_func=<function sim_levenshtein>, symmetric=False)

Monge-Elkan distance

Monge-Elkan is defined in: Monge, Alvaro E. and Charles P. Elkan. 1996. “The field matching problem: Algorithms and applications.” KDD-96 Proceedings. http://www.aaai.org/Papers/KDD/1996/KDD96-044.pdf

Note: Monge-Elkan is NOT a symmetric similarity algorithm. Thus, the distance between src and tar is not necessarily equal to the distance between tar and src. If the symmetric argument is True, a symmetric value is calculated, at the cost of doubling the computation time (since \(sim_{Monge-Elkan}(src, tar)\) and \(sim_{Monge-Elkan}(tar, src)\) are both calculated and then averaged).

Parameters:
  • src, tar (str) – two strings to be compared
  • sim_func (function) – the internal similarity metric to employ
  • symmetric (bool) – return a symmetric similarity measure
Returns:

Monge-Elkan distance

Return type:

float

abydos.distance.dist_mra(src, tar)

normalized Match Rating Algorithm distance

MRA distance is the complement of MRA similarity: \(dist_{MRA} = 1 - sim_{MRA}\)

Parameters:src, tar (str) – two strings to be compared
Returns:normalized MRA distance
Return type:float
abydos.distance.dist_overlap(src, tar, qval=2)

overlap distance

Overlap distance is the complement of the overlap coefficient: \(dist_{overlap} = 1 - sim_{overlap}\)

Parameters:
  • src, tar (str) – two strings to be compared (or QGrams/Counter objects)
  • qval (int) – the length of each q-gram; 0 or None for non-q-gram version
Returns:

overlap distance

Return type:

float

abydos.distance.dist_prefix(src, tar)

prefix distance

Prefix distance is the complement of prefix similarity: \(dist_{prefix} = 1 - sim_{prefix}\)

Parameters:src, tar (str) – two strings to be compared
Returns:prefix distance
Return type:float
abydos.distance.dist_ratcliff_obershelp(src, tar)

Ratcliff-Obershelp distance

Ratcliff-Obershelp distance is the complement of Ratcliff-Obershelp similarity: \(dist_{Ratcliff-Obershelp} = 1 - sim_{Ratcliff-Obershelp}\)

Parameters:src, tar (str) – two strings to be compared
Returns:Ratcliff-Obershelp distance
Return type:float
abydos.distance.dist_strcmp95(src, tar, long_strings=False)

strcmp95 distance

strcmp95 distance is 1 - strcmp95 similarity

Parameters:
  • src, tar (str) – two strings to be compared
  • long_strings (bool) – set to True to “Increase the probability of a match when the number of matched characters is large. This option allows for a little more tolerance when the strings are large. It is not an appropriate test when comparing fixed length fields such as phone and social security numbers.”
Returns:

strcmp95 distance

Return type:

float

abydos.distance.dist_suffix(src, tar)

suffix distance

Suffix distance is the complement of suffix similarity: \(dist_{suffix} = 1 - sim_{suffix}\)

Parameters:src, tar (str) – two strings to be compared
Returns:suffix distance
Return type:float
abydos.distance.dist_tversky(src, tar, qval=2, alpha=1, beta=1, bias=None)

Tversky distance

Tversky distance is the complement of the Tversky index (similarity): \(dist_{Tversky} = 1-sim_{Tversky}\)

The symmetric variant is defined in Jiménez, Sergio, Claudio Becerra, and Alexander Gelbukh. 2013. SOFTCARDINALITY-CORE: Improving Text Overlap with Distributional Measures for Semantic Textual Similarity. This is activated by specifying a bias parameter. Cf. http://aclweb.org/anthology/S/S13/S13-1028.pdf

Parameters:
  • src, tar (str) – two strings to be compared (or QGrams/Counter objects)
  • qval (int) – the length of each q-gram; 0 or None for non-q-gram version
  • alpha, beta (float) – the two Tversky index parameters
Returns:

Tversky distance

Return type:

float
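
The similarity side of this metric can be sketched over unpadded q-gram sets (the library pads q-grams with start/stop symbols, so its values differ; the bias parameter is also omitted here). With alpha = beta = 1 it reduces to Jaccard, and with alpha = beta = 0.5 to Sørensen–Dice:

```python
def sim_tversky(src, tar, qval=2, alpha=1, beta=1):
    """Tversky index: |X n Y| / (|X n Y| + alpha|X - Y| + beta|Y - X|)."""
    qgrams = lambda s: {s[i:i + qval] for i in range(len(s) - qval + 1)}
    x, y = qgrams(src), qgrams(tar)
    inter = len(x & y)
    return inter / (inter + alpha * len(x - y) + beta * len(y - x))

print(sim_tversky('night', 'nacht'))  # 1/7; the distance is 1 - 1/7
```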

abydos.distance.editex(src, tar, cost=(0, 1, 2), local=False)

Editex distance

As described on pages 3 & 4 of Zobel, Justin and Philip Dart. 1996. Phonetic string matching: Lessons from information retrieval. In: Proceedings of the ACM-SIGIR Conference on Research and Development in Information Retrieval, Zurich, Switzerland. 166–173. http://goanna.cs.rmit.edu.au/~jz/fulltext/sigir96.pdf

The local variant is based on Ring, Nicholas and Alexandra L. Uitdenbogerd. 2009. Finding ‘Lucy in Disguise’: The Misheard Lyric Matching Problem. In: Proceedings of the 5th Asia Information Retrieval Symposium, Sapporo, Japan. 157-167. http://www.seg.rmit.edu.au/research/download.php?manuscript=404

Parameters:
  • src, tar (str) – two strings to be compared
  • cost (tuple) – a 3-tuple representing the cost of the three possible edit operations: match, same-group, and mismatch, respectively (by default: (0, 1, 2))
  • local (bool) – if True, the local variant of Editex is used
Returns:

Editex distance

Return type:

int

abydos.distance.gotoh(src, tar, gap_open=1, gap_ext=0.4, sim_func=<function sim_ident>)

Gotoh score

Gotoh’s algorithm is essentially Needleman-Wunsch with affine gap penalties: https://www.cs.umd.edu/class/spring2003/cmsc838t/papers/gotoh1982.pdf

Parameters:
  • src, tar (str) – two strings to be compared
  • gap_open (float) – the cost of an open alignment gap (1 by default)
  • gap_ext (float) – the cost of an alignment gap extension (0.4 by default)
  • sim_func (function) – a function that returns the similarity of two characters (identity similarity by default)
Returns:

Gotoh score

Return type:

float (in fact dependent on the gap_cost & return value of sim_func)

abydos.distance.hamming(src, tar, difflens=True)

Hamming distance

Hamming distance equals the number of character positions at which two strings differ. For strings of unequal lengths, it is not normally defined. By default, this implementation calculates the Hamming distance of the first n characters where n is the lesser of the two strings’ lengths and adds to this the difference in string lengths.

Parameters:
  • src, tar (str) – two strings to be compared
  • difflens (bool) – If True (default), this returns the Hamming distance for those characters that have a matching character in both strings plus the difference in the strings’ lengths. This is equivalent to extending the shorter string with obligatorily non-matching characters. If False, an exception is raised in the case of strings of unequal lengths.
Returns:

the Hamming distance between src & tar

Return type:

int
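
The default (difflens=True) behavior described above is a positionwise comparison plus the length difference:

```python
def hamming(src, tar, difflens=True):
    """Count positionwise mismatches; unmatched trailing characters
    each count as one mismatch when difflens is True."""
    if len(src) != len(tar) and not difflens:
        raise ValueError('Undefined for strings of unequal length')
    return abs(len(src) - len(tar)) + sum(a != b for a, b in zip(src, tar))

print(hamming('karolin', 'kathrin'))  # 3
```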

abydos.distance.lcsseq(src, tar)

longest common subsequence (LCSseq)

Based on the dynamic programming algorithm from http://rosettacode.org/wiki/Longest_common_subsequence#Dynamic_Programming_6. This is licensed GFDL 1.2.

Modifications include:

  • conversion to a numpy array in place of a list of lists
Parameters:src, tar (str) – two strings to be compared
Returns:the longest common subsequence
Return type:str
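
The underlying dynamic program can be sketched with plain lists in place of the numpy array: fill a table of prefix LCS lengths, then backtrace to recover one longest common subsequence:

```python
def lcsseq(src, tar):
    """LCS by DP: lengths[i][j] holds the LCS length of src[:i], tar[:j]."""
    lengths = [[0] * (len(tar) + 1) for _ in range(len(src) + 1)]
    for i, a in enumerate(src):
        for j, b in enumerate(tar):
            lengths[i + 1][j + 1] = (lengths[i][j] + 1 if a == b else
                                     max(lengths[i + 1][j], lengths[i][j + 1]))
    # backtrace from the bottom-right corner
    result, i, j = [], len(src), len(tar)
    while i and j:
        if lengths[i][j] == lengths[i - 1][j]:
            i -= 1
        elif lengths[i][j] == lengths[i][j - 1]:
            j -= 1
        else:
            result.append(src[i - 1])
            i -= 1
            j -= 1
    return ''.join(reversed(result))

print(lcsseq('ABCBDAB', 'BDCABA'))  # 'BCBA'
```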
abydos.distance.lcsstr(src, tar)

longest common substring (LCSstr)

Based on the code from https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Longest_common_substring#Python. This is licensed Creative Commons: Attribution-ShareAlike 3.0.

Modifications include:

  • conversion to a numpy array in place of a list of lists
  • conversion to Python 2/3-safe _range from xrange
Parameters:src, tar (str) – two strings to be compared
Returns:the longest common substring
Return type:str
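
The same DP idea, restricted to suffix matches, recovers the longest common substring; plain lists stand in for the numpy array here:

```python
def lcsstr(src, tar):
    """Track the length of the common suffix ending at each (i, j);
    the best cell marks the end of the longest common substring."""
    longest, end = 0, 0
    prev = [0] * (len(tar) + 1)
    for i, a in enumerate(src, 1):
        curr = [0] * (len(tar) + 1)
        for j, b in enumerate(tar, 1):
            if a == b:
                curr[j] = prev[j - 1] + 1
                if curr[j] > longest:
                    longest, end = curr[j], i
        prev = curr
    return src[end - longest:end]

print(lcsstr('Saturday', 'Sunday'))  # 'day'
```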
abydos.distance.levenshtein(src, tar, mode=u'lev', cost=(1, 1, 1, 1))

Levenshtein distance

This is the standard edit distance measure. Cf. https://en.wikipedia.org/wiki/Levenshtein_distance

Two additional variants: optimal string alignment (aka restricted Damerau-Levenshtein distance) and the Damerau-Levenshtein distance are also supported. Cf. https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance

The ordinary Levenshtein & Optimal String Alignment distance both employ the Wagner-Fischer dynamic programming algorithm. Cf. https://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm

Levenshtein edit distance ordinarily has unit insertion, deletion, and substitution costs.

Parameters:
  • src, tar (str) – two strings to be compared
  • mode (str) –

    specifies a mode for computing the Levenshtein distance:

    • ‘lev’ (default) computes the ordinary Levenshtein distance, in which edits may include inserts, deletes, and substitutions
    • ‘osa’ computes the Optimal String Alignment distance, in which edits may include inserts, deletes, substitutions, and transpositions but substrings may only be edited once
    • ‘dam’ computes the Damerau-Levenshtein distance, in which edits may include inserts, deletes, substitutions, and transpositions and substrings may undergo repeated edits
  • cost (tuple) – a 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and transpositions, respectively (by default: (1, 1, 1, 1))
Returns:

the Levenshtein distance between src & tar

Return type:

int (may return a float if cost has float values)
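
The Wagner-Fischer dynamic program for the default 'lev' mode can be sketched as below. This simplified signature takes a 3-tuple (insert, delete, substitute) since the transposition cost only matters in the 'osa' and 'dam' modes:

```python
def levenshtein(src, tar, cost=(1, 1, 1)):
    """Plain Levenshtein distance, row by row over a DP table."""
    ins, dele, sub = cost
    prev = [j * ins for j in range(len(tar) + 1)]
    for i, a in enumerate(src, 1):
        curr = [i * dele]
        for j, b in enumerate(tar, 1):
            curr.append(min(prev[j] + dele,                      # delete
                            curr[j - 1] + ins,                   # insert
                            prev[j - 1] + (0 if a == b else sub)))  # sub
        prev = curr
    return prev[-1]

print(levenshtein('kitten', 'sitting'))  # 3
```

Dividing the result by max(len(src), len(tar)) (for unit costs) gives the normalized dist_levenshtein described above.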

abydos.distance.mra_compare(src, tar)

Western Airlines Surname Match Rating Algorithm comparison rating

A description of the algorithm can be found on page 18 of https://archive.org/details/accessingindivid00moor

Parameters:src, tar (str) – two strings to be compared
Returns:MRA comparison rating
Return type:int
abydos.distance.needleman_wunsch(src, tar, gap_cost=1, sim_func=<function sim_ident>)

Needleman-Wunsch score

This is the standard global sequence alignment score.

Cf. https://en.wikipedia.org/wiki/Needleman–Wunsch_algorithm

Cf. http://csb.stanford.edu/class/public/readings/Bioinformatics_I_Lecture6/Needleman_Wunsch_JMB_70_Global_alignment.pdf

Parameters:
  • src, tar (str) – two strings to be compared
  • gap_cost (float) – the cost of an alignment gap (1 by default)
  • sim_func (function) – a function that returns the similarity of two characters (identity similarity by default)
Returns:

Needleman-Wunsch score

Return type:

int (in fact dependent on the gap_cost & return value of sim_func)
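
A sketch of the global-alignment dynamic program: gaps subtract gap_cost, and aligned pairs add sim_func(a, b) (identity similarity by default, as above):

```python
def needleman_wunsch(src, tar, gap_cost=1, sim_func=lambda a, b: int(a == b)):
    """Needleman-Wunsch score by DP over alignment prefixes."""
    prev = [-gap_cost * j for j in range(len(tar) + 1)]
    for i, a in enumerate(src, 1):
        curr = [-gap_cost * i]
        for j, b in enumerate(tar, 1):
            curr.append(max(prev[j - 1] + sim_func(a, b),  # align a with b
                            prev[j] - gap_cost,            # gap in tar
                            curr[j - 1] - gap_cost))       # gap in src
        prev = curr
    return prev[-1]

print(needleman_wunsch('GATTACA', 'GCATGCU'))  # 3
```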

abydos.distance.sim(src, tar, method=<function sim_levenshtein>)

generalized similarity

This is a generalized function for calling other similarity functions.

Parameters:
  • src, tar (str) – two strings to be compared
  • method (function) – specifies the similarity metric (Levenshtein by default)
Returns:

similarity according to the specified function

Return type:

float

abydos.distance.sim_bag(src, tar)

normalized bag similarity

Normalized bag similarity is the complement of normalized bag distance: \(sim_{bag} = 1 - dist_{bag}\)

Parameters:src, tar (str) – two strings to be compared
Returns:normalized bag similarity
Return type:float
abydos.distance.sim_compression(src, tar, compressor=u'bz2', probs=None)

normalized compression similarity (NCS)

Normalized compression similarity is the complement of normalized compression distance: \(sim_{NCS} = 1 - dist_{NCD}\)

Parameters:
  • src, tar (str) – two strings to be compared
  • compressor (str) –

    a compression scheme to use for the similarity calculation:

    • zlib – standard zlib/gzip
    • bz2 – bzip2
    • lzma – Lempel–Ziv–Markov chain algorithm
    • arith – arithmetic coding
    • rle – run-length encoding
    • bwtrle – Burrows-Wheeler transform followed by run-length encoding
  • probs (dict) – a dictionary trained with ac_train (for the arith compressor only)
Returns:

compression similarity

Return type:

float

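The underlying normalized compression distance can be sketched with zlib alone. This assumes the standard NCD formula \(dist_{NCD}(x, y) = \frac{C(xy) - min(C(x), C(y))}{max(C(x), C(y))}\), where C is compressed length; it is an illustration, not the abydos implementation:

```python
import zlib

def dist_ncd(src, tar):
    """Normalized compression distance using zlib.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(s) is the compressed length of s.
    """
    c_src = len(zlib.compress(src.encode('utf-8')))
    c_tar = len(zlib.compress(tar.encode('utf-8')))
    c_cat = len(zlib.compress((src + tar).encode('utf-8')))
    return (c_cat - min(c_src, c_tar)) / max(c_src, c_tar)

def sim_ncs(src, tar):
    """Normalized compression similarity: the complement of NCD."""
    return 1.0 - dist_ncd(src, tar)
```

The intuition: if tar adds little information beyond src, their concatenation compresses nearly as well as src alone, so the distance is small.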
abydos.distance.sim_cosine(src, tar, qval=2)

cosine similarity (Ochiai coefficient)

For two sets X and Y, the cosine similarity (Ochiai coefficient) is: \(sim_{cosine}(X, Y) = \frac{|X \cap Y|}{\sqrt{|X| \cdot |Y|}}\)

Parameters:
  • src, tar (str) – two strings to be compared (or QGrams/Counter objects)
  • qval (int) – the length of each q-gram; 0 or None for non-q-gram version
Returns:

cosine similarity

Return type:

float
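On q-gram sets, the Ochiai coefficient reduces to a few set operations. This is a stdlib-only sketch without the start/stop padding that abydos's QGrams class applies, so values may differ from the library's output:

```python
from math import sqrt

def qgrams(term, q=2):
    """The set of length-q substrings of term (no start/stop padding)."""
    return {term[i:i + q] for i in range(len(term) - q + 1)}

def sim_cosine(src, tar, q=2):
    """Ochiai coefficient: |X ∩ Y| / sqrt(|X| * |Y|) on q-gram sets."""
    x, y = qgrams(src, q), qgrams(tar, q)
    if not x or not y:
        return 1.0 if src == tar else 0.0
    return len(x & y) / sqrt(len(x) * len(y))
```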

abydos.distance.sim_damerau(src, tar, cost=(1, 1, 1, 1))

Damerau-Levenshtein similarity normalized to the interval [0, 1]

Damerau-Levenshtein similarity is the complement of normalized Damerau-Levenshtein distance: \(sim_{Damerau} = 1 - dist_{Damerau}\)

The arguments are identical to those of the damerau_levenshtein() function.

Parameters:
  • src, tar (str) – two strings to be compared
  • cost (tuple) – a 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and transpositions, respectively (by default: (1, 1, 1, 1))
Returns:

normalized Damerau-Levenshtein similarity

Return type:

float

abydos.distance.sim_dice(src, tar, qval=2)

Sørensen–Dice coefficient

For two sets X and Y, the Sørensen–Dice coefficient is \(sim_{dice}(X, Y) = \frac{2 \cdot |X \cap Y|}{|X| + |Y|}\)

This is identical to the Tanimoto similarity coefficient and the Tversky index for \(\alpha = \beta = 0.5\)

Parameters:
  • src, tar (str) – two strings to be compared (or QGrams/Counter objects)
  • qval (int) – the length of each q-gram; 0 or None for non-q-gram version
Returns:

Sørensen–Dice similarity

Return type:

float

abydos.distance.sim_editex(src, tar, cost=(0, 1, 2), local=False)

Editex similarity normalized to the interval [0, 1]

The Editex similarity is the complement of normalized Editex distance: \(sim_{Editex} = 1 - dist_{Editex}\)

The arguments are identical to those of the editex() function.

Parameters:
  • src, tar (str) – two strings to be compared
  • cost (tuple) – a 3-tuple representing the cost of the three possible edits: match, same-group, and mismatch, respectively (by default: (0, 1, 2))
  • local (bool) – if True, the local variant of Editex is used
Returns:

normalized Editex similarity

Return type:

float

abydos.distance.sim_hamming(src, tar, difflens=True)

Hamming similarity normalized to the interval [0, 1]

Hamming similarity is the complement of normalized Hamming distance: \(sim_{Hamming} = 1 - dist_{Hamming}\)

Provided that difflens==True, the Hamming similarity is identical to the Language-Independent Product Name Search (LIPNS) similarity score. For further information, see the sim_mlipns documentation.

The arguments are identical to those of the hamming() function.

Parameters:
  • src, tar (str) – two strings to be compared
  • difflens (bool) – If True (default), this returns the Hamming distance for those characters that have a matching character in both strings plus the difference in the strings’ lengths. This is equivalent to extending the shorter string with obligatorily non-matching characters. If False, an exception is raised in the case of strings of unequal lengths.
Returns:

normalized Hamming similarity

Return type:

float

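The difflens behavior can be sketched as follows; a stdlib-only illustration, not the abydos implementation:

```python
def hamming(src, tar, difflens=True):
    """Hamming distance, optionally extended to unequal-length strings.

    With difflens=True the length difference counts as that many mismatches.
    """
    if not difflens and len(src) != len(tar):
        raise ValueError('Undefined for sequences of unequal length')
    dist = abs(len(src) - len(tar))
    dist += sum(a != b for a, b in zip(src, tar))
    return dist

def sim_hamming(src, tar, difflens=True):
    """Normalized Hamming similarity: 1 - dist / max length."""
    if len(src) == 0 and len(tar) == 0:
        return 1.0
    return 1.0 - hamming(src, tar, difflens) / max(len(src), len(tar))
```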
abydos.distance.sim_ident(src, tar)

identity similarity

This is 1 if the two strings are identical, otherwise 0.

Parameters:src, tar (str) – two strings to be compared
Returns:identity similarity
Return type:int
abydos.distance.sim_jaccard(src, tar, qval=2)

Jaccard similarity coefficient

For two sets X and Y, the Jaccard similarity coefficient is \(sim_{jaccard}(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}\)

This is identical to the Tanimoto similarity coefficient and the Tversky index for \(\alpha = \beta = 1\)

Parameters:
  • src, tar (str) – two strings to be compared (or QGrams/Counter objects)
  • qval (int) – the length of each q-gram; 0 or None for non-q-gram version
Returns:

Jaccard similarity

Return type:

float

abydos.distance.sim_jaro_winkler(src, tar, qval=1, mode=u'winkler', long_strings=False, boost_threshold=0.7, scaling_factor=0.1)

Jaro(-Winkler) similarity

This is a Python implementation based on the C code for strcmp95: http://web.archive.org/web/20110629121242/http://www.census.gov/geo/msb/stand/strcmp.c The above file is a US Government publication and, accordingly, in the public domain.

Parameters:
  • src, tar (str) – two strings to be compared
  • qval (int) – the length of each q-gram (defaults to 1: character-wise matching)
  • mode (str) –

    indicates which variant of this distance metric to compute:

    • ‘winkler’ – computes the Jaro-Winkler distance (default) which increases the score for matches near the start of the word
    • ‘jaro’ – computes the Jaro distance

The following arguments apply only when mode is ‘winkler’:

Parameters:
  • long_strings (bool) – set to True to “Increase the probability of a match when the number of matched characters is large. This option allows for a little more tolerance when the strings are large. It is not an appropriate test when comparing fixed length fields such as phone and social security numbers.”
  • boost_threshold (float) – a value between 0 and 1, below which the Winkler boost is not applied (defaults to 0.7)
  • scaling_factor (float) – a value between 0 and 0.25, indicating by how much to boost scores for matching prefixes (defaults to 0.1)
Returns:

Jaro or Jaro-Winkler similarity

Return type:

float

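A character-wise (qval=1) sketch of the Jaro and Winkler variants follows; the strcmp95-derived long_strings adjustment is omitted, and this is an illustration rather than the abydos implementation:

```python
def sim_jaro_winkler(src, tar, mode='winkler',
                     boost_threshold=0.7, scaling_factor=0.1):
    """Character-wise Jaro(-Winkler) similarity (long_strings omitted)."""
    if src == tar:
        return 1.0
    if not src or not tar:
        return 0.0
    # Characters match if equal and within half the longer length, minus one.
    window = max(len(src), len(tar)) // 2 - 1
    src_flags = [False] * len(src)
    tar_flags = [False] * len(tar)
    matches = 0
    for i, ch in enumerate(src):
        lo, hi = max(0, i - window), min(len(tar), i + window + 1)
        for j in range(lo, hi):
            if not tar_flags[j] and tar[j] == ch:
                src_flags[i] = tar_flags[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count half-transpositions between the matched characters, in order.
    src_matched = [c for c, f in zip(src, src_flags) if f]
    tar_matched = [c for c, f in zip(tar, tar_flags) if f]
    transposes = sum(a != b for a, b in zip(src_matched, tar_matched)) // 2
    jaro = (matches / len(src) + matches / len(tar)
            + (matches - transposes) / matches) / 3
    if mode != 'winkler' or jaro < boost_threshold:
        return jaro
    # Winkler boost for a shared prefix of up to 4 characters.
    prefix = 0
    for a, b in zip(src[:4], tar[:4]):
        if a != b:
            break
        prefix += 1
    return jaro + prefix * scaling_factor * (1 - jaro)
```

The classic example: 'MARTHA' vs. 'MARHTA' has six matches and one transposition, giving a Jaro score of 17/18 ≈ 0.944, boosted by the three-character shared prefix to ≈ 0.961.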
abydos.distance.sim_lcsseq(src, tar)

longest common subsequence similarity (\(sim_{LCSseq}\))

This employs the LCSseq function to derive a similarity metric: \(sim_{LCSseq}(s,t) = \frac{|LCSseq(s,t)|}{max(|s|, |t|)}\)

Parameters:src, tar (str) – two strings to be compared
Returns:LCSseq similarity
Return type:float
abydos.distance.sim_lcsstr(src, tar)

longest common substring similarity (\(sim_{LCSstr}\))

This employs the LCS function to derive a similarity metric: \(sim_{LCSstr}(s,t) = \frac{|LCSstr(s,t)|}{max(|s|, |t|)}\)

Parameters:src, tar (str) – two strings to be compared
Returns:LCSstr similarity
Return type:float
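A sketch of the underlying computation: the longest-common-substring length by dynamic programming, divided by the longer string's length. Illustrative only, not the abydos implementation:

```python
def lcsstr_len(src, tar):
    """Length of the longest common substring, by dynamic programming."""
    longest = 0
    d = [[0] * (len(tar) + 1) for _ in range(len(src) + 1)]
    for i in range(1, len(src) + 1):
        for j in range(1, len(tar) + 1):
            if src[i - 1] == tar[j - 1]:
                # Extend the common substring ending at (i-1, j-1).
                d[i][j] = d[i - 1][j - 1] + 1
                longest = max(longest, d[i][j])
    return longest

def sim_lcsstr(src, tar):
    """|LCSstr(s, t)| / max(|s|, |t|)."""
    if not src and not tar:
        return 1.0
    return lcsstr_len(src, tar) / max(len(src), len(tar))
```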
abydos.distance.sim_length(src, tar)

length similarity

This is the ratio of the length of the shorter string to the longer.

Parameters:src, tar (str) – two strings to be compared
Returns:length similarity
Return type:float
abydos.distance.sim_levenshtein(src, tar, mode=u'lev', cost=(1, 1, 1, 1))

Levenshtein similarity normalized to the interval [0, 1]

Levenshtein similarity is the complement of normalized Levenshtein distance: \(sim_{Levenshtein} = 1 - dist_{Levenshtein}\)

The arguments are identical to those of the levenshtein() function.

Parameters:
  • src, tar (str) – two strings to be compared
  • mode (str) –

    specifies a mode for computing the Levenshtein distance:

    • ‘lev’ (default) computes the ordinary Levenshtein distance, in which edits may include inserts, deletes, and substitutions
    • ‘osa’ computes the Optimal String Alignment distance, in which edits may include inserts, deletes, substitutions, and transpositions but substrings may only be edited once
    • ‘dam’ computes the Damerau-Levenshtein distance, in which edits may include inserts, deletes, substitutions, and transpositions and substrings may undergo repeated edits
  • cost (tuple) – a 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and transpositions, respectively (by default: (1, 1, 1, 1))
Returns:

normalized Levenshtein similarity

Return type:

float

abydos.distance.sim_matrix(src, tar, mat=None, mismatch_cost=0, match_cost=1, symmetric=True, alphabet=None)

matrix similarity

With the default parameters, this is identical to sim_ident. It is possible for sim_matrix to return values outside of the range \([0, 1]\), if values outside that range are present in mat, mismatch_cost, or match_cost.

Parameters:
  • src, tar (str) – two strings to be compared
  • mat (dict) – a dict mapping tuples to costs; the tuples are (src, tar) pairs of symbols from the alphabet parameter
  • mismatch_cost (float) – the value returned if (src, tar) is absent from mat when src does not equal tar
  • match_cost (float) – the value returned if (src, tar) is absent from mat when src equals tar
  • symmetric (bool) – True if the cost of src not matching tar is identical to the cost of tar not matching src; in this case, the values in mat need only contain (src, tar) or (tar, src), not both
  • alphabet (str) – a collection of tokens from which src and tar are drawn; if this is defined a ValueError is raised if either tar or src is not found in alphabet
Returns:

matrix similarity

Return type:

float

abydos.distance.sim_mlipns(src, tar, threshold=0.25, maxmismatches=2)

Modified Language-Independent Product Name Search (MLIPNS)

The MLIPNS algorithm is described in Shannaq, Boumedyen A. N. and Victor V. Alexandrov. 2010. “Using Product Similarity for Adding Business.” Global Journal of Computer Science and Technology. 10(12). 2-8. http://www.sial.iias.spb.su/files/386-386-1-PB.pdf

This function returns only 1.0 (similar) or 0.0 (not similar).

LIPNS similarity is identical to normalized Hamming similarity.

Parameters:
  • src, tar (str) – two strings to be compared
  • threshold (float) – a number [0, 1] indicating the maximum similarity score, below which the strings are considered ‘similar’ (0.25 by default)
  • maxmismatches (int) – a number indicating the allowable number of mismatches to remove before declaring two strings not similar (2 by default)
Returns:

MLIPNS similarity

Return type:

float

abydos.distance.sim_monge_elkan(src, tar, sim_func=<function sim_levenshtein>, symmetric=False)

Monge-Elkan similarity

Monge-Elkan is defined in: Monge, Alvaro E. and Charles P. Elkan. 1996. “The field matching problem: Algorithms and applications.” KDD-9 Proceedings. http://www.aaai.org/Papers/KDD/1996/KDD96-044.pdf

Note: Monge-Elkan is NOT a symmetric similarity algorithm. Thus, the similarity of src to tar is not necessarily equal to the similarity of tar to src. If the symmetric argument is True, a symmetric value is calculated, at the cost of doubling the computation time (since \(sim_{Monge-Elkan}(src, tar)\) and \(sim_{Monge-Elkan}(tar, src)\) are both calculated and then averaged).

Parameters:
  • src, tar (str) – two strings to be compared
  • sim_func (function) – the internal similarity metric to employ
  • symmetric (bool) – return a symmetric similarity measure
Returns:

Monge-Elkan similarity

Return type:

float

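The structure of the (asymmetric) computation can be sketched with a deliberately simple inner metric; here an exact-token-match function stands in for sim_levenshtein, so the values illustrate the mechanics only:

```python
def sim_ident(a, b):
    """Placeholder internal metric: exact token match."""
    return 1.0 if a == b else 0.0

def sim_monge_elkan(src, tar, sim_func=sim_ident, symmetric=False):
    """Mean over src tokens of each token's best match among tar tokens."""
    src_toks, tar_toks = src.split(), tar.split()
    if not src_toks or not tar_toks:
        return 0.0
    total = sum(max(sim_func(s, t) for t in tar_toks) for s in src_toks)
    sim = total / len(src_toks)
    if symmetric:
        # Average with the reverse direction for a symmetric value.
        sim = (sim + sim_monge_elkan(tar, src, sim_func)) / 2
    return sim
```

The asymmetry is visible directly: 'a b c' against 'a b' scores 2/3 in one direction but 1.0 in the other.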
abydos.distance.sim_mra(src, tar)

normalized Match Rating Algorithm similarity

This is the MRA normalized to \([0, 1]\), given that MRA itself is constrained to the range \([0, 6]\).

Parameters:src, tar (str) – two strings to be compared
Returns:normalized MRA similarity
Return type:float
abydos.distance.sim_overlap(src, tar, qval=2)

overlap coefficient

For two sets X and Y, the overlap coefficient is \(sim_{overlap}(X, Y) = \frac{|X \cap Y|}{min(|X|, |Y|)}\)

Parameters:
  • src, tar (str) – two strings to be compared (or QGrams/Counter objects)
  • qval (int) – the length of each q-gram; 0 or None for non-q-gram version
Returns:

overlap similarity

Return type:

float

abydos.distance.sim_prefix(src, tar)

prefix similarity

Prefix similarity is the ratio of the length of the shorter term that exactly matches the longer term to the length of the shorter term, beginning at the start of both terms.

Parameters:src, tar (str) – two strings to be compared
Returns:prefix similarity
Return type:float
abydos.distance.sim_ratcliff_obershelp(src, tar)

Ratcliff-Obershelp similarity

This follows the Ratcliff-Obershelp algorithm to derive a similarity measure:

  1. Find the length of the longest common substring in src & tar.
  2. Recurse on the strings to the left & right of this substring in src & tar. The base case is a zero-length common substring, in which case return 0. Otherwise, return the length of the current longest common substring plus the left & right recursed sums.
  3. Multiply this total by 2 and divide by the sum of the lengths of src & tar.

Cf. http://www.drdobbs.com/database/pattern-matching-the-gestalt-approach/184407970

Parameters:src, tar (str) – two strings to be compared
Returns:Ratcliff-Obershelp similarity
Return type:float
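The three steps above can be sketched directly. This is a stdlib-only illustration of the gestalt recursion, not the abydos implementation:

```python
def _lcsstr(src, tar):
    """Longest common substring as (start in src, start in tar, length)."""
    best = (0, 0, 0)
    d = [[0] * (len(tar) + 1) for _ in range(len(src) + 1)]
    for i in range(1, len(src) + 1):
        for j in range(1, len(tar) + 1):
            if src[i - 1] == tar[j - 1]:
                d[i][j] = d[i - 1][j - 1] + 1
                if d[i][j] > best[2]:
                    best = (i - d[i][j], j - d[i][j], d[i][j])
    return best

def _matched(src, tar):
    """Total matched characters, recursing left & right of the LCS."""
    i, j, length = _lcsstr(src, tar)
    if length == 0:
        return 0
    return (length
            + _matched(src[:i], tar[:j])
            + _matched(src[i + length:], tar[j + length:]))

def sim_ratcliff_obershelp(src, tar):
    """2 * matched / (|src| + |tar|)."""
    if not src and not tar:
        return 1.0
    return 2 * _matched(src, tar) / (len(src) + len(tar))
```

Python's difflib.SequenceMatcher.ratio() computes a closely related gestalt measure.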
abydos.distance.sim_strcmp95(src, tar, long_strings=False)

strcmp95 similarity

This is a Python translation of the C code for strcmp95: http://web.archive.org/web/20110629121242/http://www.census.gov/geo/msb/stand/strcmp.c The above file is a US Government publication and, accordingly, in the public domain.

This is based on the Jaro-Winkler distance, but also attempts to correct for some common typos and frequently confused characters. It is also limited to uppercase ASCII characters, so it is appropriate to American names, but not much else.

Parameters:
  • src, tar (str) – two strings to be compared
  • long_strings (bool) – set to True to “Increase the probability of a match when the number of matched characters is large. This option allows for a little more tolerance when the strings are large. It is not an appropriate test when comparing fixed length fields such as phone and social security numbers.”
Returns:

strcmp95 similarity

Return type:

float

abydos.distance.sim_suffix(src, tar)

suffix similarity

Suffix similarity is the ratio of the length of the shorter term that exactly matches the longer term to the length of the shorter term, beginning at the end of both terms.

Parameters:src, tar (str) – two strings to be compared
Returns:suffix similarity
Return type:float
abydos.distance.sim_tanimoto(src, tar, qval=2)

Tanimoto similarity

For two sets X and Y, the Tanimoto similarity coefficient is \(sim_{Tanimoto}(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}\) This is identical to the Jaccard similarity coefficient and the Tversky index for \(\alpha = \beta = 1\)

Parameters:
  • src, tar (str) – two strings to be compared (or QGrams/Counter objects)
  • qval (int) – the length of each q-gram; 0 or None for non-q-gram version
Returns:

Tanimoto similarity

Return type:

float

abydos.distance.sim_tfidf(src, tar, qval=2, docs_src=None, docs_tar=None)

TF-IDF similarity

This is chiefly based on the “Formal Definition of TF/IDF Distance” at: http://alias-i.com/lingpipe/docs/api/com/aliasi/spell/TfIdfDistance.html

Parameters:
  • src, tar (str) – two strings to be compared (or QGrams/Counter objects)
  • qval (int) – the length of each q-gram; 0 or None for non-q-gram version
  • docs_src (Counter) – a Counter object or string representing the document corpus for the src string
  • docs_tar (Counter) – a Counter object or string representing the document corpus for the tar string (or set to None to use the docs_src for both)
Returns:

TF-IDF similarity

Return type:

float

abydos.distance.sim_tversky(src, tar, qval=2, alpha=1, beta=1, bias=None)

Tversky index

For two sets X and Y, the Tversky index is defined as: \(sim_{Tversky}(X, Y) = \frac{|X \cap Y|} {|X \cap Y| + \alpha|X - Y| + \beta|Y - X|}\)

Cf. https://en.wikipedia.org/wiki/Tversky_index

\(\alpha = \beta = 1\) is equivalent to the Jaccard & Tanimoto similarity coefficients.

\(\alpha = \beta = 0.5\) is equivalent to the Sørensen-Dice similarity coefficient.

Unequal α and β will tend to emphasize one or the other set’s contributions:

  • \(\alpha > \beta\) emphasizes the contributions of X over Y
  • \(\alpha < \beta\) emphasizes the contributions of Y over X

Parameter values’ relation to 1 emphasizes different types of contributions:

  • \(\alpha, \beta > 1\) emphasize unique contributions over the intersection
  • \(\alpha, \beta < 1\) emphasize the intersection over unique contributions

The symmetric variant is defined in Jimenez, Sergio, Claudio Becerra, and Alexander Gelbukh. 2013. SOFTCARDINALITY-CORE: Improving Text Overlap with Distributional Measures for Semantic Textual Similarity. This is activated by specifying a bias parameter. Cf. http://aclweb.org/anthology/S/S13/S13-1028.pdf

Parameters:
  • src, tar (str) – two strings to be compared (or QGrams/Counter objects)
  • qval (int) – the length of each q-gram; 0 or None for non-q-gram version
  • alpha, beta (float) – two Tversky index parameters as described above
Returns:

Tversky similarity

Return type:

float

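The core (unbiased) Tversky index over q-gram sets can be sketched directly; the symmetric bias variant and abydos's start/stop padding are omitted, so this is an illustration rather than the library's implementation. Setting \(\alpha = \beta = 1\) recovers Jaccard and \(\alpha = \beta = 0.5\) recovers Sørensen-Dice, as noted above:

```python
def qgrams(term, q=2):
    """The set of length-q substrings of term (no start/stop padding)."""
    return {term[i:i + q] for i in range(len(term) - q + 1)}

def sim_tversky(src, tar, q=2, alpha=1.0, beta=1.0):
    """|X∩Y| / (|X∩Y| + alpha*|X-Y| + beta*|Y-X|) over q-gram sets."""
    x, y = qgrams(src, q), qgrams(tar, q)
    if not x and not y:
        return 1.0
    inter = len(x & y)
    denom = inter + alpha * len(x - y) + beta * len(y - x)
    return inter / denom if denom else 0.0
```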
abydos.distance.smith_waterman(src, tar, gap_cost=1, sim_func=<function sim_ident>)

Smith-Waterman score

This computes the Smith-Waterman score for local sequence alignment.

Cf. https://en.wikipedia.org/wiki/Smith–Waterman_algorithm

Parameters:
  • src, tar (str) – two strings to be compared
  • gap_cost (float) – the cost of an alignment gap (1 by default)
  • sim_func (function) – a function that returns the similarity of two characters (identity similarity by default)
Returns:

Smith-Waterman score

Return type:

int or float (the actual type depends on gap_cost and the return values of sim_func)

abydos.distance.tanimoto(src, tar, qval=2)

Tanimoto distance

Tanimoto distance is \(-log_{2}sim_{Tanimoto}\)

Parameters:
  • src, tar (str) – two strings to be compared (or QGrams/Counter objects)
  • qval (int) – the length of each q-gram; 0 or None for non-q-gram version
Returns:

Tanimoto distance

Return type:

float

abydos.ngram module

abydos.ngram

The NGram class is a container for an n-gram corpus

class abydos.ngram.NGramCorpus(corpus=None)

Bases: object

The NGramCorpus class

Internally, this is a set of recursively embedded dicts, with n layers for a corpus of n-grams. E.g. for a trigram corpus, this will be a dict of dicts of dicts. More precisely, collections.Counter is used in place of dict, making multiset operations valid and allowing unattested n-grams to be queried.

The key at each level is a word. The value at the most deeply embedded level is a numeric value representing the frequency of the trigram. E.g. the trigram frequency of ‘colorless green ideas’ would be the value stored in self.ngcorpus[‘colorless’][‘green’][‘ideas’].

corpus_importer(corpus)

Fill in self.ngcorpus from a Corpus argument

Parameters:corpus – The Corpus from which to initialize the n-gram corpus
gng_importer(corpus_file)

Fill in self.ngcorpus from a Google NGram corpus file

Parameters:corpus_file (file) – The Google NGram file from which to initialize the n-gram corpus
ngcorpus = Counter()

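The recursively embedded structure can be sketched as follows. This TrigramCorpus class is a hypothetical illustration of the nested layout described above (dict layers keyed by word, with a Counter at the innermost level), not the NGramCorpus implementation:

```python
from collections import Counter

class TrigramCorpus:
    """Nested word-keyed layers with an integer frequency innermost."""

    def __init__(self):
        self.ngcorpus = {}

    def add(self, w1, w2, w3, freq=1):
        """Accumulate the frequency of the trigram (w1, w2, w3)."""
        level1 = self.ngcorpus.setdefault(w1, {})
        level2 = level1.setdefault(w2, Counter())
        level2[w3] += freq

    def get(self, w1, w2, w3):
        """Frequency of the trigram; unattested trigrams return 0."""
        return self.ngcorpus.get(w1, {}).get(w2, Counter())[w3]
```

Because Counter returns 0 for missing keys, unattested n-grams can be queried without raising KeyError.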
abydos.phones module

abydos.phones

The phones module implements ...

abydos.phones.cmp_features(feat1, feat2)

Compare features

This returns a number in the range [0, 1] representing a comparison of two feature bundles.

If one of the bundles is negative, -1 is returned (for unknown values).

If the bundles are identical, 1 is returned.

If they are inverses of one another, 0 is returned.

Otherwise, a float representing their similarity is returned.

Parameters:feat1, feat2 (int) – Two feature bundles to compare
abydos.phones.get_feature(vector, feature)

Get feature

This returns a list of ints, equal in length to the vector input, representing presence/absence/neutrality with respect to a particular phonetic feature.

Parameters:
  • vector (list) – a tuple or list of ints representing the phonetic features of a phone or series of phones (such as is returned by the ipa_to_features function)
  • feature (str) – a feature name from the set: ‘consonantal’, ‘sonorant’, ‘syllabic’, ‘labial’, ‘round’, ‘coronal’, ‘anterior’, ‘distributed’, ‘dorsal’, ‘high’, ‘low’, ‘back’, ‘tense’, ‘pharyngeal’, ‘ATR’, ‘voice’, ‘spread_glottis’, ‘constricted_glottis’, ‘continuant’, ‘strident’, ‘lateral’, ‘delayed_release’, ‘nasal’
Returns:

a list indicating presence/absence/neutrality with respect to the feature

Return type:

list(int)

abydos.phones.ipa_to_features(ipa)

IPA to features

This translates an IPA string of one or more phones to a list of ints representing the features of the string.

Parameters:ipa (str) – the IPA representation of a phone or series of phones
Returns:a representation of the features of the input string
Return type:list(int)

abydos.phonetic module

abydos.phonetic

The phonetic module implements phonetic algorithms including:

  • Robert C. Russell’s Index
  • American Soundex
  • Daitch-Mokotoff Soundex
  • Kölner Phonetik
  • NYSIIS
  • Match Rating Algorithm
  • Metaphone
  • Double Metaphone
  • Caverphone
  • Alpha Search Inquiry System
  • Fuzzy Soundex
  • Phonex
  • Phonem
  • Phonix
  • SfinxBis
  • phonet
  • Standardized Phonetic Frequency Code
  • Beider-Morse Phonetic Matching
abydos.phonetic.alpha_sis(word, maxlength=14)

IBM Alpha Search Inquiry System

Based on the algorithm described in “Accessing individual records from personal data files using non-unique identifiers” / Gwendolyn B. Moore, et al.; prepared for the Institute for Computer Sciences and Technology, National Bureau of Standards, Washington, D.C (1977): https://archive.org/stream/accessingindivid00moor#page/15/mode/1up

A collection is necessary since there can be multiple values for a single word. But the collection must be ordered since the first value is the primary coding.

Parameters:
  • word (str) – the word to transform
  • maxlength (int) – the length of the code returned (defaults to 14)
Returns:

the Alpha SIS value

Return type:

tuple

abydos.phonetic.bmpm(word, language_arg=0, name_mode=u'gen', match_mode=u'approx', concat=False, filter_langs=False)

Beider-Morse Phonetic Matching algorithm

The Beider-Morse Phonetic Matching algorithm is described at: http://stevemorse.org/phonetics/bmpm.htm The reference implementation is licensed under GPLv3 and available at: http://stevemorse.org/phoneticinfo.htm

Parameters:
  • word (str) – the word to transform
  • language_arg (str) –

    the language of the term; supported values include:

    • ‘any’
    • ‘arabic’
    • ‘cyrillic’
    • ‘czech’
    • ‘dutch’
    • ‘english’
    • ‘french’
    • ‘german’
    • ‘greek’
    • ‘greeklatin’
    • ‘hebrew’
    • ‘hungarian’
    • ‘italian’
    • ‘polish’
    • ‘portuguese’
    • ‘romanian’
    • ‘russian’
    • ‘spanish’
    • ‘turkish’
    • ‘germandjsg’
    • ‘polishdjskp’
    • ‘russiandjsre’
  • name_mode (str) –

    the name mode of the algorithm:

    • ‘gen’ – general (default)
    • ‘ash’ – Ashkenazi
    • ‘sep’ – Sephardic
  • match_mode (str) – matching mode: ‘approx’ or ‘exact’
  • concat (bool) – concatenation mode
  • filter_langs (bool) – filter out incompatible languages
Returns:

the BMPM value(s)

Return type:

tuple

abydos.phonetic.caverphone(word, version=2)

Caverphone

A description of version 1 of the algorithm can be found at: http://caversham.otago.ac.nz/files/working/ctp060902.pdf

A description of version 2 of the algorithm can be found at: http://caversham.otago.ac.nz/files/working/ctp150804.pdf

Parameters:
  • word (str) – the word to transform
  • version (int) – the version of Caverphone to employ for encoding (defaults to 2)
Returns:

the Caverphone value

Return type:

str

abydos.phonetic.dm_soundex(word, maxlength=6, reverse=False, zero_pad=True)

Daitch-Mokotoff Soundex

Returns values of a word as a set. A collection is necessary since there can be multiple values for a single word.

Parameters:
  • word – the word to transform
  • maxlength – the length of the code returned (defaults to 6)
  • reverse – reverse the word before computing the selected Soundex (defaults to False); this results in “Reverse Soundex”
  • zero_pad – pad the end of the return value with 0s to achieve a maxlength string
Returns:

the Daitch-Mokotoff Soundex value

Return type:

str

abydos.phonetic.double_metaphone(word, maxlength=inf)

Double Metaphone

Based on Lawrence Philips’ (Visual) C++ code from 1999: http://aspell.net/metaphone/dmetaph.cpp

Parameters:
  • word – the word to transform
  • maxlength – the maximum length of the returned Double Metaphone codes (defaults to unlimited, but in Philips’ original implementation this was 4)
Returns:

the Double Metaphone value(s)

Return type:

tuple

abydos.phonetic.fuzzy_soundex(word, maxlength=5, zero_pad=True)

Fuzzy Soundex

Fuzzy Soundex is an algorithm derived from Soundex, defined in: Holmes, David and M. Catherine McCabe. “Improving Precision and Recall for Soundex Retrieval.” http://wayback.archive.org/web/20100629121128/http://www.ir.iit.edu/publications/downloads/IEEESoundexV5.pdf

Parameters:
  • word (str) – the word to transform
  • maxlength (int) – the length of the code returned (defaults to 5)
  • zero_pad (bool) – pad the end of the return value with 0s to achieve a maxlength string
Returns:

the Fuzzy Soundex value

Return type:

str

abydos.phonetic.koelner_phonetik(word)

Kölner Phonetik (numeric output)

Based on the algorithm described at https://de.wikipedia.org/wiki/Kölner_Phonetik

While the output code is numeric, it is still a str because 0s can lead the code.

Parameters:word (str) – the word to transform
Returns:the Kölner Phonetik value as a numeric string
Return type:str
abydos.phonetic.koelner_phonetik_alpha(word)

Kölner Phonetik (alphabetic output)

Parameters:word (str) – the word to transform
Returns:the Kölner Phonetik value as an alphabetic string
Return type:str
abydos.phonetic.koelner_phonetik_num_to_alpha(num)

Convert Kölner Phonetik from numeric to alphabetic

Parameters:num (str) – a numeric Kölner Phonetik representation
Returns:an alphabetic representation of the same word
Return type:str
abydos.phonetic.metaphone(word, maxlength=inf)

Metaphone

Based on Lawrence Philips’ Pick BASIC code from 1990: http://aspell.net/metaphone/metaphone.basic This incorporates some corrections to the above code, particularly some of those suggested by Michael Kuhn in: http://aspell.net/metaphone/metaphone-kuhn.txt

Parameters:
  • word (str) – the word to transform
  • maxlength (int) – the maximum length of the returned Metaphone code (defaults to unlimited, but in Philips’ original implementation this was 4)
Returns:

the Metaphone value

Return type:

str

abydos.phonetic.mra(word)

Western Airlines Surname Match Rating Algorithm personal numeric identifier (PNI)

A description of the algorithm can be found on page 18 of https://archive.org/details/accessingindivid00moor

Parameters:word (str) – the word to transform
Returns:the MRA PNI
Return type:str
abydos.phonetic.nysiis(word, maxlength=6)

New York State Identification and Intelligence System (NYSIIS)

A description of the algorithm can be found at https://en.wikipedia.org/wiki/New_York_State_Identification_and_Intelligence_System

Parameters:
  • word (str) – the word to transform
  • maxlength (int) – the maximum length (default 6) of the code to return
Returns:

the NYSIIS value

Return type:

str

abydos.phonetic.phonem(word)

Phonem

Phonem is defined in Wilde, Georg and Carsten Meyer. 1999. “Doppelgaenger gesucht - Ein Programm fuer kontextsensitive phonetische Textumwandlung.” ct Magazin fuer Computer & Technik 25/1999.

This version is based on the Perl implementation documented at: http://phonetik.phil-fak.uni-koeln.de/fileadmin/home/ritters/Allgemeine_Dateien/Martin_Wilz.pdf It includes some enhancements presented in the Java port at: https://github.com/dcm4che/dcm4che/blob/master/dcm4che-soundex/src/main/java/org/dcm4che3/soundex/Phonem.java

Phonem is intended chiefly for German names/words.

Parameters:word (str) – the word to transform
Returns:the Phonem value
Return type:str
abydos.phonetic.phonet(word, mode=1, lang=u'de', trace=False)

phonet was developed by Jörg Michael and documented in c’t magazine vol. 25/1999, p. 252. It is a phonetic algorithm designed primarily for German. Cf. http://www.heise.de/ct/ftp/99/25/252/

This is a port of Jesper Zedlitz’s code, which is licensed LGPL: https://code.google.com/p/phonet4java/source/browse/trunk/src/main/java/com/googlecode/phonet4java/Phonet.java

That is, in turn, based on Michael’s C code, which is also licensed LGPL: ftp://ftp.heise.de/pub/ct/listings/phonet.zip

Parameters:
  • word (str) – the word to transform
  • mode (int) – the phonet variant to employ (1 or 2)
  • lang (str) – ‘de’ (default) for German; ‘none’ for no language
  • trace (bool) – prints debugging info if True
Returns:

the phonet value

Return type:

str

abydos.phonetic.phonex(word, maxlength=4, zero_pad=True)

Phonex

Phonex is an algorithm derived from Soundex, defined in: Lait, A. J. and B. Randell. “An Assessment of Name Matching Algorithms”. http://homepages.cs.ncl.ac.uk/brian.randell/Genealogy/NameMatching.pdf

Parameters:
  • word (str) – the word to transform
  • maxlength (int) – the length of the code returned (defaults to 4)
  • zero_pad (bool) – pad the end of the return value with 0s to achieve a maxlength string
Returns:

the Phonex value

Return type:

str

abydos.phonetic.phonix(word, maxlength=4, zero_pad=True)

Phonix

Phonix is a Soundex-like algorithm defined in: T.N. Gadd: PHONIX — The Algorithm, Program 24/4, 1990, p.363-366.

This implementation is based on http://cpansearch.perl.org/src/ULPFR/WAIT-1.800/soundex.c http://cs.anu.edu.au/people/Peter.Christen/Febrl/febrl-0.4.01/encode.py and https://metacpan.org/pod/Text::Phonetic::Phonix

Parameters:
  • word (str) – the word to transform
  • maxlength (int) – the length of the code returned (defaults to 4)
  • zero_pad (bool) – pad the end of the return value with 0s to achieve a maxlength string
Returns:

the Phonix value

Return type:

str

abydos.phonetic.russell_index(word)

Russell Index (integer output)

This follows Robert C. Russell’s Index algorithm, as described in US Patent 1,261,167 (1917)

Parameters:word (str) – the word to transform
Returns:the Russell Index value
Return type:int
abydos.phonetic.russell_index_alpha(word)

Russell Index (alphabetic output)

This follows Robert C. Russell’s Index algorithm, as described in US Patent 1,261,167 (1917)

Parameters:word (str) – the word to transform
Returns:the Russell Index value as an alphabetic string
Return type:str
abydos.phonetic.russell_index_num_to_alpha(num)

Russell Index integer to alphabetic string

This follows Robert C. Russell’s Index algorithm, as described in US Patent 1,261,167 (1917)

Parameters:num (int) – a Russell Index integer value
Returns:the Russell Index as an alphabetic string
Return type:str
abydos.phonetic.sfinxbis(word, maxlength=None)

SfinxBis

SfinxBis is a Soundex-like algorithm defined in: http://www.swami.se/download/18.248ad5af12aa8136533800091/SfinxBis.pdf

This implementation follows the reference implementation: http://www.swami.se/download/18.248ad5af12aa8136533800093/swamiSfinxBis.java.txt

SfinxBis is intended chiefly for Swedish names.

Parameters:
  • word (str) – the word to transform
  • maxlength (int) – the length of the code returned (defaults to unlimited)
Returns:

the SfinxBis value

Return type:

tuple

abydos.phonetic.soundex(word, maxlength=4, var=u'American', reverse=False, zero_pad=True)

Soundex

Parameters:
  • word (str) – the word to transform
  • maxlength (int) – the length of the code returned (defaults to 4)
  • var (str) –

    the variant of the algorithm to employ (defaults to ‘American’):

    • ‘American’ follows the American Soundex algorithm, as described at http://www.archives.gov/publications/general-info-leaflets/55-census.html and in Knuth(1998:394); this is also called Miracode
    • ‘special’ follows the rules from the 1880-1910 US Census, in which h & w are not treated as blocking consonants but as vowels
    • ‘dm’ computes the Daitch-Mokotoff Soundex
  • reverse (bool) – reverse the word before computing the selected Soundex (defaults to False); this yields the “Reverse Soundex”
  • zero_pad (bool) – pad the end of the return value with 0s to achieve a maxlength string
Returns:

the Soundex value

Return type:

str
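As an illustrative sketch (not abydos’ implementation), the ‘American’ variant can be coded as follows; the function name soundex_american and the exact edge-case handling are assumptions:

```python
def soundex_american(word, maxlength=4, zero_pad=True):
    """Sketch of American Soundex: consonants map to digit classes,
    adjacent duplicates collapse, h & w are transparent, vowels separate."""
    word = ''.join(c for c in word.upper() if c.isalpha())
    if not word:
        return '0' * maxlength if zero_pad else ''
    codes = {**dict.fromkeys('BFPV', '1'), **dict.fromkeys('CGJKQSXZ', '2'),
             **dict.fromkeys('DT', '3'), 'L': '4',
             **dict.fromkeys('MN', '5'), 'R': '6'}
    result = word[0]
    last_code = codes.get(word[0], '')
    for char in word[1:]:
        if char in 'HW':
            continue  # h & w do not separate letters of the same class
        code = codes.get(char, '')
        if code and code != last_code:
            result += code
        last_code = code  # a vowel (empty code) resets the previous class
    result = result[:maxlength]
    return result.ljust(maxlength, '0') if zero_pad else result
```

The h/w transparency is what distinguishes e.g. Ashcraft (A261) from a naive encoding (A226).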

abydos.phonetic.spfc(word)

Standardized Phonetic Frequency Code (SPFC)

Standardized Phonetic Frequency Code is roughly Soundex-like. This implementation is based on page 19-21 of https://archive.org/stream/accessingindivid00moor#page/19/mode/1up

Parameters:word (str) – the word to transform
Returns:the SPFC value
Return type:str

abydos.qgram module

abydos.qgram

The qgram module defines the QGrams multi-set class

class abydos.qgram.QGrams(term, qval=2, start_stop=u'$#')

Bases: collections.Counter

A q-gram class, which functions like a bag/multiset

A q-gram is here defined as any contiguous sequence of q characters. Q-grams are also known as k-grams and n-grams, though the term n-gram more typically refers to sequences of whitespace-delimited words in a string, whereas q-gram refers to sequences of characters in a word or string.

count()

q-grams count

Returns:the total count of q-grams in a QGrams object
Return type:int
ordered_list = []
term = u''
term_ss = u''
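A q-gram bag can be sketched with collections.Counter, the class QGrams derives from. The padding convention here (q−1 copies of the start symbol before the term and q−1 copies of the stop symbol after it) is an assumption, not necessarily abydos’ exact behavior:

```python
from collections import Counter


def qgrams(term, qval=2, start_stop='$#'):
    """Return a Counter (bag/multiset) of the q-grams of term.

    The term is padded with qval-1 start symbols and qval-1 stop
    symbols so that word-initial and word-final characters also
    appear in a full complement of q-grams.
    """
    padded = start_stop[0] * (qval - 1) + term + start_stop[-1] * (qval - 1)
    return Counter(padded[i:i + qval] for i in range(len(padded) - qval + 1))
```

For example, qgrams('NELSON') yields the bigram bag {$N, NE, EL, LS, SO, ON, N#}.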

abydos.stats module

abydos.stats

The stats module defines functions for calculating various statistical data about linguistic objects.

This includes the ConfusionTable object, which includes methods capable of calculating the following data based on a confusion table:

  • population counts
  • precision, recall, specificity, negative predictive value, fall-out, false discovery rate, accuracy, balanced accuracy, informedness, and markedness
  • various means of the precision & recall, including: arithmetic, geometric, harmonic, quadratic, logarithmic, contraharmonic, identric (exponential), & Hölder (power/generalized) means
  • \(F_{\beta}\)-scores, \(E\)-scores, \(G\)-measures, along with special functions for \(F_{1}\), \(F_{0.5}\), & \(F_{2}\) scores
  • significance & Matthews correlation coefficient calculation

Functions are provided for calculating the following means:

  • arithmetic
  • geometric
  • harmonic
  • quadratic
  • contraharmonic
  • logarithmic
  • identric (exponential)
  • Seiffert’s
  • Lehmer
  • Heronian
  • Hölder (power/generalized)
  • arithmetic-geometric
  • geometric-harmonic
  • arithmetic-geometric-harmonic

And for calculating:

  • midrange
  • median
  • mode
class abydos.stats.ConfusionTable(tp=0, tn=0, fp=0, fn=0)

Bases: object

ConfusionTable object

This object is initialized by passing either four integers (or a tuple of four integers) representing the cells of a confusion table: true positives, true negatives, false positives, and false negatives

The object possesses methods for the calculation of various statistics based on the confusion table.
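A minimal sketch of the idea (hypothetical class and method names, not abydos’ API) shows how the four cells feed the basic statistics:

```python
class SimpleConfusionTable:
    """Minimal confusion-table sketch holding tp, tn, fp, fn."""

    def __init__(self, tp=0, tn=0, fp=0, fn=0):
        self.tp, self.tn, self.fp, self.fn = tp, tn, fp, fn

    def population(self):
        """Total population: tp + tn + fp + fn."""
        return self.tp + self.tn + self.fp + self.fn

    def precision(self):
        """tp / (tp + fp)"""
        return self.tp / (self.tp + self.fp)

    def recall(self):
        """tp / (tp + fn)"""
        return self.tp / (self.tp + self.fn)

    def accuracy(self):
        """(tp + tn) / population"""
        return (self.tp + self.tn) / self.population()
```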

accuracy()

Accuracy is defined as \(\frac{tp + tn}{population}\)

Cf. https://en.wikipedia.org/wiki/Accuracy

Returns:The accuracy of the confusion table
Return type:float
accuracy_gain()

gain in accuracy

The gain in accuracy is defined as: \(G(accuracy) = \frac{accuracy}{random~ accuracy}\)

Cf. https://en.wikipedia.org/wiki/Gain_(information_retrieval)

Returns:The gain in accuracy of the confusion table
Return type:float
balanced_accuracy()

balanced accuracy

Balanced accuracy is defined as \(\frac{sensitivity + specificity}{2}\)

Cf. https://en.wikipedia.org/wiki/Accuracy

Returns:The balanced accuracy of the confusion table
Return type:float
cond_neg_pop()

condition negative population

Returns:The condition negative population of the confusion table
Return type:int
cond_pos_pop()

condition positive population

Returns:The condition positive population of the confusion table
Return type:int
correct_pop()

correct population

Returns:the correct population of the confusion table
Return type:int
dict()

cast to dict

Returns:the confusion table as a dict
Return type:dict
e_score(beta=1)

\(E\)-score

This is Van Rijsbergen’s effectiveness measure

Cf. https://en.wikipedia.org/wiki/Information_retrieval#F-measure

Returns:The \(E\)-score of the confusion table
Return type:float
error_pop()

error population

Returns:The error population of the confusion table
Return type:int
f1_score()

\(F_{1}\) score

\(F_{1}\) score is the harmonic mean of precision and recall: \(2 \cdot \frac{precision \cdot recall}{precision + recall}\)

Cf. https://en.wikipedia.org/wiki/F1_score

Returns:The \(F_{1}\) of the confusion table
Return type:float
f2_score()

\(F_{2}\) score

The \(F_{2}\) score emphasizes recall over precision in comparison to the \(F_{1}\) score

Cf. https://en.wikipedia.org/wiki/F1_score

Returns:The \(F_{2}\) of the confusion table
Return type:float
f_measure()

\(F\)-measure

\(F\)-measure is the harmonic mean of precision and recall: \(2 \cdot \frac{precision \cdot recall}{precision + recall}\)

Cf. https://en.wikipedia.org/wiki/F1_score

Returns:The \(F\)-measure of the confusion table
Return type:float
fallout()

fall-out

Fall-out is defined as \(\frac{fp}{fp + tn}\)

AKA false positive rate (FPR)

Cf. https://en.wikipedia.org/wiki/Information_retrieval#Fall-out

Returns:The fall-out of the confusion table
Return type:float
false_neg()

false negatives

Returns:the false negatives of the confusion table
Return type:int
false_pos()

false positives

Returns:the false positives of the confusion table
Return type:int
fbeta_score(beta=1)

\(F_{\beta}\) score

\(F_{\beta}\) for a positive real value \(\beta\) “measures the effectiveness of retrieval with respect to a user who attaches \(\beta\) times as much importance to recall as precision” (van Rijsbergen 1979)

\(F_{\beta}\) score is defined as: \((1 + \beta^2) \cdot \frac{precision \cdot recall} {((\beta^2 \cdot precision) + recall)}\)

Cf. https://en.wikipedia.org/wiki/F1_score

Parameters:beta (numeric) – The \(\beta\) parameter in the above formula
Returns:The \(F_{\beta}\) of the confusion table
Return type:float
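The formula above can be computed directly from precision and recall; this standalone sketch (not abydos’ method, which works from the table’s cells) illustrates it:

```python
def fbeta_score(precision, recall, beta=1):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)

    Larger beta weights recall more heavily; smaller beta weights
    precision more heavily. beta = 1 gives the harmonic mean (F_1).
    """
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
```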
fdr()

false discovery rate (FDR)

False discovery rate is defined as \(\frac{fp}{fp + tp}\)

Cf. https://en.wikipedia.org/wiki/False_discovery_rate

Returns:The false discovery rate of the confusion table
Return type:float
fhalf_score()

\(F_{0.5}\) score

The \(F_{0.5}\) score emphasizes precision over recall in comparison to the \(F_{1}\) score

Cf. https://en.wikipedia.org/wiki/F1_score

Returns:The \(F_{0.5}\) score of the confusion table
Return type:float
g_measure()

G-measure

\(G\)-measure is the geometric mean of precision and recall: \(\sqrt{precision \cdot recall}\)

This is identical to the Fowlkes–Mallows (FM) index for two clusters.

Cf. https://en.wikipedia.org/wiki/F1_score#G-measure

Cf. https://en.wikipedia.org/wiki/Fowlkes%E2%80%93Mallows_index

Returns:The \(G\)-measure of the confusion table
Return type:float
informedness()

Informedness is defined as \(sensitivity + specificity - 1\).

AKA Youden’s J statistic

AKA DeltaP’

Cf. https://en.wikipedia.org/wiki/Youden%27s_J_statistic

Cf. http://dspace.flinders.edu.au/xmlui/bitstream/handle/2328/27165/Powers%20Evaluation.pdf

Returns:The informedness of the confusion table
Return type:float
kappa_statistic()

κ statistic

The κ statistic is defined as: \(\kappa = \frac{accuracy - random~ accuracy} {1 - random~ accuracy}\)

The κ statistic compares the performance of the classifier relative to the performance of a random classifier. κ = 0 indicates performance identical to random. κ = 1 indicates perfect predictive success. κ = -1 indicates perfect predictive failure.

Returns:The κ statistic of the confusion table
Return type:float
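The κ formula can be sketched as follows. Note the baseline used here — chance agreement computed from the table’s marginals, as in Cohen’s kappa — is an assumption, since the entry above does not spell out how random accuracy is defined:

```python
def kappa_statistic(tp, tn, fp, fn):
    """kappa = (accuracy - random accuracy) / (1 - random accuracy)

    Random accuracy is taken as the chance agreement implied by the
    table's marginals (the usual Cohen's kappa baseline).
    """
    n = tp + tn + fp + fn
    accuracy = (tp + tn) / n
    random_accuracy = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    return (accuracy - random_accuracy) / (1 - random_accuracy)
```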
markedness()

Markedness is defined as \(precision + npv - 1\)

AKA DeltaP

Cf. https://en.wikipedia.org/wiki/Youden%27s_J_statistic

Cf. http://dspace.flinders.edu.au/xmlui/bitstream/handle/2328/27165/Powers%20Evaluation.pdf

Returns:The markedness of the confusion table
Return type:float
mcc()

Matthews correlation coefficient (MCC)

The Matthews correlation coefficient is defined as: \(\frac{(tp \cdot tn) - (fp \cdot fn)} {\sqrt{(tp + fp)(tp + fn)(tn + fp)(tn + fn)}}\)

This is equivalent to the geometric mean of informedness and markedness, defined above.

Cf. https://en.wikipedia.org/wiki/Matthews_correlation_coefficient

Returns:The Matthews correlation coefficient of the confusion table
Return type:float
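The MCC formula translates directly into code; this is an illustrative standalone function, not abydos’ method:

```python
from math import sqrt


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient of a confusion table:
    (tp*tn - fp*fn) / sqrt((tp+fp)(tp+fn)(tn+fp)(tn+fn))
    """
    numerator = tp * tn - fp * fn
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator
```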
npv()

negative predictive value (NPV)

NPV is defined as \(\frac{tn}{tn + fn}\)

Cf. https://en.wikipedia.org/wiki/Negative_predictive_value

Returns:The negative predictive value of the confusion table
Return type:float
population()
Returns:The population (N) of the confusion table
Return type:int
pr_aghmean()

arithmetic-geometric-harmonic mean of precision & recall

Iterates over arithmetic, geometric, & harmonic means until they converge to a single value (rounded to 12 digits), following the method described by Raïssouli, Leazizi, & Chergui: http://www.emis.de/journals/JIPAM/images/014_08_JIPAM/014_08.pdf

Returns:The arithmetic-geometric-harmonic mean of the confusion table’s precision & recall
Return type:float
pr_agmean()

arithmetic-geometric mean of precision & recall

Iterates between arithmetic & geometric means until they converge to a single value (rounded to 12 digits)

Cf. https://en.wikipedia.org/wiki/Arithmetic–geometric_mean

Returns:The arithmetic-geometric mean of the confusion table’s precision & recall
Return type:float
pr_amean()

arithmetic mean of precision & recall

The arithmetic mean of precision and recall is defined as: \(\frac{precision + recall}{2}\)

Cf. https://en.wikipedia.org/wiki/Arithmetic_mean

Returns:The arithmetic mean of the confusion table’s precision & recall
Return type:float
pr_cmean()

contraharmonic mean of precision & recall

The contraharmonic mean is: \(\frac{precision^{2} + recall^{2}}{precision + recall}\)

Cf. https://en.wikipedia.org/wiki/Contraharmonic_mean

Returns:The contraharmonic mean of the confusion table’s precision & recall
Return type:float
pr_ghmean()

geometric-harmonic mean of precision & recall

Iterates between geometric & harmonic means until they converge to a single value (rounded to 12 digits)

Cf. https://en.wikipedia.org/wiki/Geometric–harmonic_mean

Returns:The geometric-harmonic mean of the confusion table’s precision & recall
Return type:float
pr_gmean()

geometric mean of precision & recall

The geometric mean of precision and recall is defined as: \(\sqrt{precision \cdot recall}\)

Cf. https://en.wikipedia.org/wiki/Geometric_mean

Returns:The geometric mean of the confusion table’s precision & recall
Return type:float
pr_heronian_mean()

Heronian mean of precision & recall

The Heronian mean of precision and recall is defined as: \(\frac{precision + \sqrt{precision \cdot recall} + recall}{3}\)

Cf. https://en.wikipedia.org/wiki/Heronian_mean

Returns:The Heronian mean of the confusion table’s precision & recall
Return type:float
pr_hmean()

harmonic mean of precision & recall

The harmonic mean of precision and recall is defined as: \(\frac{2 \cdot precision \cdot recall}{precision + recall}\)

Cf. https://en.wikipedia.org/wiki/Harmonic_mean

Returns:The harmonic mean of the confusion table’s precision & recall
Return type:float
pr_hoelder_mean(exp=2)

Hölder (power/generalized) mean of precision & recall

The power mean of precision and recall is defined as: \(\left(\frac{precision^{exp} + recall^{exp}}{2}\right)^{\frac{1}{exp}}\) for \(exp \ne 0\), and the geometric mean for \(exp = 0\)

Cf. https://en.wikipedia.org/wiki/Generalized_mean

Parameters:exp (numeric) – The exponent of the Hölder mean
Returns:The Hölder mean for the given exponent of the confusion table’s precision & recall
Return type:float
pr_imean()

identric (exponential) mean of precision & recall

The identric mean is: precision if precision = recall, otherwise \(\frac{1}{e} \cdot \sqrt[precision - recall]{\frac{precision^{precision}} {recall^{recall}}}\)

Cf. https://en.wikipedia.org/wiki/Identric_mean

Returns:The identric mean of the confusion table’s precision & recall
Return type:float
pr_lehmer_mean(exp=2)

Lehmer mean of precision & recall

The Lehmer mean is: \(\frac{precision^{exp} + recall^{exp}} {precision^{exp-1} + recall^{exp-1}}\)

Cf. https://en.wikipedia.org/wiki/Lehmer_mean

Parameters:exp (numeric) – The exponent of the Lehmer mean
Returns:The Lehmer mean for the given exponent of the confusion table’s precision & recall
Return type:float
pr_lmean()

logarithmic mean of precision & recall

The logarithmic mean is: 0 if either precision or recall is 0, the precision if they are equal, otherwise \(\frac{precision - recall} {ln(precision) - ln(recall)}\)

Cf. https://en.wikipedia.org/wiki/Logarithmic_mean

Returns:The logarithmic mean of the confusion table’s precision & recall
Return type:float
pr_qmean()

quadratic mean of precision & recall

The quadratic mean of precision and recall is defined as: \(\sqrt{\frac{precision^{2} + recall^{2}}{2}}\)

Cf. https://en.wikipedia.org/wiki/Quadratic_mean

Returns:The quadratic mean of the confusion table’s precision & recall
Return type:float
pr_seiffert_mean()

Seiffert’s mean of precision & recall

Seiffert’s mean of precision and recall is: \(\frac{precision - recall}{4 \cdot arctan \sqrt{\frac{precision}{recall}} - \pi}\)

Cf. http://www.helsinki.fi/~hasto/pp/miaPreprint.pdf

Returns:Seiffert’s mean of the confusion table’s precision & recall
Return type:float
precision()

Precision is defined as \(\frac{tp}{tp + fp}\)

AKA positive predictive value (PPV)

Cf. https://en.wikipedia.org/wiki/Precision_and_recall

Cf. https://en.wikipedia.org/wiki/Information_retrieval#Precision

Returns:The precision of the confusion table
Return type:float
precision_gain()

gain in precision

The gain in precision is defined as: \(G(precision) = \frac{precision}{random~ precision}\)

Cf. https://en.wikipedia.org/wiki/Gain_(information_retrieval)

Returns:The gain in precision of the confusion table
Return type:float
recall()

Recall is defined as \(\frac{tp}{tp + fn}\)

AKA sensitivity

AKA true positive rate (TPR)

Cf. https://en.wikipedia.org/wiki/Precision_and_recall

Cf. https://en.wikipedia.org/wiki/Sensitivity_(test)

Cf. https://en.wikipedia.org/wiki/Information_retrieval#Recall

Returns:The recall of the confusion table
Return type:float
significance()

the significance (\(\chi^{2}\))

Significance is defined as: \(\chi^{2} = \frac{(tp \cdot tn - fp \cdot fn)^{2} (tp + tn + fp + fn)} {(tp + fp)(tp + fn)(tn + fp)(tn + fn)}\)

Also: \(\chi^{2} = MCC^{2} \cdot n\)

Cf. https://en.wikipedia.org/wiki/Pearson%27s_chi-square_test

Returns:The significance of the confusion table
Return type:float
specificity()

Specificity is defined as \(\frac{tn}{tn + fp}\)

AKA true negative rate (TNR)

Cf. https://en.wikipedia.org/wiki/Specificity_(tests)

Returns:The specificity of the confusion table
Return type:float
test_neg_pop()

test negative population

Returns:The test negative population of the confusion table
Return type:int
test_pos_pop()

test positive population

Returns:The test positive population of the confusion table
Return type:int
true_neg()

true negatives

Returns:the true negatives of the confusion table
Return type:int
true_pos()

true positives

Returns:the true positives of the confusion table
Return type:int
tuple()

cast to tuple

Returns:the confusion table as a 4-tuple (tp, tn, fp, fn)
Return type:tuple
abydos.stats.aghmean(nums)

arithmetic-geometric-harmonic mean

Iterates over arithmetic, geometric, & harmonic means until they converge to a single value (rounded to 12 digits), following the method described by Raïssouli, Leazizi, & Chergui: http://www.emis.de/journals/JIPAM/images/014_08_JIPAM/014_08.pdf

Parameters:nums (list) – A series of numbers
Returns:The arithmetic-geometric-harmonic mean of nums
Return type:float
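The iterate-to-convergence scheme can be sketched as follows (an illustrative version with helper means defined inline, not abydos’ implementation):

```python
def amean(nums):
    return sum(nums) / len(nums)


def gmean(nums):
    product = 1.0
    for x in nums:
        product *= x
    return product ** (1 / len(nums))


def hmean(nums):
    return len(nums) / sum(1 / x for x in nums)


def aghmean(nums):
    """Iterate arithmetic, geometric, & harmonic means of positive
    numbers until all three agree to 12 digits."""
    a, g, h = amean(nums), gmean(nums), hmean(nums)
    while round(a, 12) != round(g, 12) or round(g, 12) != round(h, 12):
        a, g, h = amean([a, g, h]), gmean([a, g, h]), hmean([a, g, h])
    return a
```

The result always lies between the plain harmonic and arithmetic means of the input.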
abydos.stats.agmean(nums)

arithmetic-geometric mean

Iterates between arithmetic & geometric means until they converge to a single value (rounded to 12 digits)

Cf. https://en.wikipedia.org/wiki/Arithmetic–geometric_mean

Parameters:nums (list) – A series of numbers
Returns:The arithmetic-geometric mean of nums
Return type:float
abydos.stats.amean(nums)

arithmetic mean

The arithmetic mean is defined as: \(\frac{\sum{nums}}{|nums|}\)

Cf. https://en.wikipedia.org/wiki/Arithmetic_mean

Parameters:nums (list) – A series of numbers
Returns:The arithmetic mean of nums
Return type:float
abydos.stats.cmean(nums)

contraharmonic mean

The contraharmonic mean is: \(\frac{\sum\limits_i x_i^2}{\sum\limits_i x_i}\)

Cf. https://en.wikipedia.org/wiki/Contraharmonic_mean

Parameters:nums (list) – A series of numbers
Returns:The contraharmonic mean of nums
Return type:float
abydos.stats.ghmean(nums)

geometric-harmonic mean

Iterates between geometric & harmonic means until they converge to a single value (rounded to 12 digits)

Cf. https://en.wikipedia.org/wiki/Geometric–harmonic_mean

Parameters:nums (list) – A series of numbers
Returns:The geometric-harmonic mean of nums
Return type:float
abydos.stats.gmean(nums)

geometric mean

The geometric mean is defined as: \(\sqrt[|nums|]{\prod\limits_{i} nums_{i}}\)

Cf. https://en.wikipedia.org/wiki/Geometric_mean

Parameters:nums (list) – A series of numbers
Returns:The geometric mean of nums
Return type:float
abydos.stats.heronian_mean(nums)

Heronian mean

The Heronian mean is: \(\frac{\sum\limits_{i, j}\sqrt{{x_i \cdot x_j}}} {|nums| \cdot \frac{|nums| + 1}{2}}\) for \(j \ge i\)

Cf. https://en.wikipedia.org/wiki/Heronian_mean

Parameters:nums (list) – A series of numbers
Returns:The Heronian mean of nums
Return type:float
abydos.stats.hmean(nums)

harmonic mean

The harmonic mean is defined as: \(\frac{|nums|}{\sum\limits_{i}\frac{1}{nums_i}}\)

Following the behavior of Wolfram|Alpha: If one of the values in nums is 0, return 0. If more than one value in nums is 0, return NaN.

Cf. https://en.wikipedia.org/wiki/Harmonic_mean

Parameters:nums (list) – A series of numbers
Returns:The harmonic mean of nums
Return type:float
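The definition, including the Wolfram|Alpha-style zero handling described above, can be sketched as:

```python
def hmean(nums):
    """Harmonic mean: |nums| / sum(1/x for x in nums).

    One zero in nums yields 0; more than one zero yields NaN,
    mirroring the behavior described above.
    """
    zeros = nums.count(0)
    if zeros > 1:
        return float('nan')
    if zeros == 1:
        return 0.0
    return len(nums) / sum(1 / x for x in nums)
```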
abydos.stats.hoelder_mean(nums, exp=2)

Hölder (power/generalized) mean

The Hölder mean is defined as: \(\sqrt[p]{\frac{1}{|nums|} \cdot \sum\limits_i{x_i^p}}\) for \(p \ne 0\), and the geometric mean for \(p = 0\)

Cf. https://en.wikipedia.org/wiki/Generalized_mean

Parameters:
  • nums (list) – A series of numbers
  • exp (numeric) – The exponent of the Hölder mean
Returns:

The Hölder mean of nums for the given exponent

Return type:

float
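An illustrative sketch of the power mean (not abydos’ implementation):

```python
def hoelder_mean(nums, exp=2):
    """Hölder (power) mean: (mean of x^exp)^(1/exp) for exp != 0;
    the geometric mean when exp == 0."""
    if exp == 0:
        product = 1.0
        for x in nums:
            product *= x
        return product ** (1 / len(nums))
    return (sum(x**exp for x in nums) / len(nums)) ** (1 / exp)
```

exp = 1 recovers the arithmetic mean, exp = 2 the quadratic mean, and exp = −1 the harmonic mean.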

abydos.stats.imean(nums)

identric (exponential) mean

The identric mean of two numbers x and y is: x if x = y; otherwise \(\frac{1}{e} \sqrt[x-y]{\frac{x^x}{y^y}}\)

Cf. https://en.wikipedia.org/wiki/Identric_mean

Parameters:nums (list) – A series of numbers
Returns:The identric mean of nums
Return type:float
abydos.stats.lehmer_mean(nums, exp=2)

Lehmer mean

The Lehmer mean is: \(\frac{\sum\limits_i{x_i^p}}{\sum\limits_i{x_i^{p-1}}}\)

Cf. https://en.wikipedia.org/wiki/Lehmer_mean

Parameters:
  • nums (list) – A series of numbers
  • exp (numeric) – The exponent of the Lehmer mean
Returns:

The Lehmer mean of nums for the given exponent

Return type:

float
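The Lehmer mean formula maps to a one-liner; this sketch is illustrative, not abydos’ implementation:

```python
def lehmer_mean(nums, exp=2):
    """Lehmer mean: sum(x^exp) / sum(x^(exp-1)).

    exp = 0 gives the harmonic mean, exp = 1 the arithmetic mean,
    and exp = 2 the contraharmonic mean.
    """
    return sum(x**exp for x in nums) / sum(x**(exp - 1) for x in nums)
```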

abydos.stats.lmean(nums)

logarithmic mean

The logarithmic mean of an arbitrarily long series is defined by http://www.survo.fi/papers/logmean.pdf as: \(L(x_1, x_2, ..., x_n) = (n-1)! \sum\limits_{i=1}^n \frac{x_i} {\prod\limits_{\substack{j = 1\\j \ne i}}^n ln \frac{x_i}{x_j}}\)

Cf. https://en.wikipedia.org/wiki/Logarithmic_mean

Parameters:nums (list) – A series of numbers
Returns:The logarithmic mean of nums
Return type:float
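The series formula above can be sketched directly (for distinct positive inputs; handling of repeated or zero values is omitted here):

```python
from math import factorial, log


def lmean(nums):
    """Logarithmic mean of a series of distinct positive numbers:
    (n-1)! * sum_i( x_i / prod_{j != i} ln(x_i / x_j) )
    """
    total = 0.0
    for i, x_i in enumerate(nums):
        denom = 1.0
        for j, x_j in enumerate(nums):
            if i != j:
                denom *= log(x_i / x_j)
        total += x_i / denom
    return factorial(len(nums) - 1) * total
```

For two values this reduces to the familiar \((x - y) / (\ln x - \ln y)\); e.g. the logarithmic mean of 1 and \(e\) is \(e - 1\).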
abydos.stats.median(nums)

With numbers sorted by value, the median is the middle value (if there is an odd number of values) or the arithmetic mean of the two middle values (if there is an even number of values).

Cf. https://en.wikipedia.org/wiki/Median

Parameters:nums (list) – A series of numbers
Returns:The median of nums
Return type:float
abydos.stats.midrange(nums)

The midrange is the arithmetic mean of the maximum & minimum of a series.

Cf. https://en.wikipedia.org/wiki/Midrange

Parameters:nums (list) – A series of numbers
Returns:The midrange of nums
Return type:float
abydos.stats.mode(nums)

The mode of a series is the most common element of that series

Cf. https://en.wikipedia.org/wiki/Mode_(statistics)

Parameters:nums (list) – A series of numbers
Returns:The mode of nums
Return type:float
abydos.stats.qmean(nums)

quadratic mean

The quadratic mean is defined as: \(\sqrt{\frac{\sum\limits_{i} nums_i^2}{|nums|}}\)

Cf. https://en.wikipedia.org/wiki/Quadratic_mean

Parameters:nums (list) – A series of numbers
Returns:The quadratic mean of nums
Return type:float
abydos.stats.seiffert_mean(nums)

Seiffert’s mean

Seiffert’s mean of two numbers x and y is: \(\frac{x - y}{4 \cdot arctan \sqrt{\frac{x}{y}} - \pi}\)

Cf. http://www.helsinki.fi/~hasto/pp/miaPreprint.pdf

Parameters:nums (list) – A series of numbers
Returns:Seiffert’s mean of nums
Return type:float
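The two-number formula above can be sketched as follows; the equal-arguments guard is needed because the defining expression is 0/0 at x = y:

```python
from math import atan, pi, sqrt


def seiffert_mean(x, y):
    """Seiffert's mean of two positive numbers:
    (x - y) / (4 * arctan(sqrt(x / y)) - pi)
    """
    if x == y:
        return x  # the formula degenerates to 0/0 at x == y
    return (x - y) / (4 * atan(sqrt(x / y)) - pi)
```

Seiffert’s mean lies strictly between the geometric and arithmetic means of its arguments.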

abydos.stemmer module

abydos.stemmer

The stemmer module defines word stemmers including:

  • the Lovins stemmer
  • the Porter and Porter2 (Snowball English) stemmers
  • Snowball stemmers for German, Dutch, Norwegian, Swedish, and Danish
  • CLEF German, German plus, and Swedish stemmers
  • Caumanns’ German stemmer
abydos.stemmer.caumanns(word)

Caumanns German stemmer

Jörg Caumanns’ stemmer is described in his article at: http://edocs.fu-berlin.de/docs/servlets/MCRFileNodeServlet/FUDOCS_derivate_000000000350/tr-b-99-16.pdf

This implementation is based on the GermanStemFilter described at: http://www.evelix.ch/unternehmen/Blog/evelix/2013/11/11/inner-workings-of-the-german-analyzer-in-lucene

Parameters:word – the word to calculate the stem of
Returns:word stem
Return type:str
abydos.stemmer.clef_german(word)

CLEF German stemmer

The CLEF German stemmer is defined at: http://members.unine.ch/jacques.savoy/clef/germanStemmer.txt

Parameters:word – the word to calculate the stem of
Returns:word stem
Return type:str
abydos.stemmer.clef_german_plus(word)

CLEF German stemmer plus

The CLEF German stemmer plus is defined at: http://members.unine.ch/jacques.savoy/clef/germanStemmerPlus.txt

Parameters:word – the word to calculate the stem of
Returns:word stem
Return type:str
abydos.stemmer.clef_swedish(word)

CLEF Swedish stemmer

The CLEF Swedish stemmer is defined at: http://members.unine.ch/jacques.savoy/clef/swedishStemmer.txt

Parameters:word – the word to calculate the stem of
Returns:word stem
Return type:str
abydos.stemmer.lovins(word)

Lovins stemmer

The Lovins stemmer is described in Julie Beth Lovins’s article at: http://www.mt-archive.info/MT-1968-Lovins.pdf

Parameters:word – the word to stem
Returns:word stem
Return type:string
abydos.stemmer.porter(word, early_english=False)

Porter stemmer

The Porter stemmer is defined at: http://snowball.tartarus.org/algorithms/porter/stemmer.html

Parameters:
  • word – the word to calculate the stem of
  • early_english – set to True in order to remove -eth & -est (2nd & 3rd person singular verbal agreement suffixes)
Returns:

word stem

Return type:

str

abydos.stemmer.porter2(word, early_english=False)

Porter2 (Snowball English) stemmer

The Porter2 (Snowball English) stemmer is defined at: http://snowball.tartarus.org/algorithms/english/stemmer.html

Parameters:
  • word – the word to calculate the stem of
  • early_english – set to True in order to remove -eth & -est (2nd & 3rd person singular verbal agreement suffixes)
Returns:

word stem

Return type:

str

abydos.stemmer.sb_danish(word)

Snowball Danish stemmer

The Snowball Danish stemmer is defined at: http://snowball.tartarus.org/algorithms/danish/stemmer.html

Parameters:word – the word to calculate the stem of
Returns:word stem
Return type:str
abydos.stemmer.sb_dutch(word)

Snowball Dutch stemmer

The Snowball Dutch stemmer is defined at: http://snowball.tartarus.org/algorithms/dutch/stemmer.html

Parameters:word – the word to calculate the stem of
Returns:word stem
Return type:str
abydos.stemmer.sb_german(word, alternate_vowels=False)

Snowball German stemmer

The Snowball German stemmer is defined at: http://snowball.tartarus.org/algorithms/german/stemmer.html

Parameters:
  • word – the word to calculate the stem of
  • alternate_vowels – composes ae as ä, oe as ö, and ue as ü before running the algorithm
Returns:

word stem

Return type:

str

abydos.stemmer.sb_norwegian(word)

Snowball Norwegian stemmer

The Snowball Norwegian stemmer is defined at: http://snowball.tartarus.org/algorithms/norwegian/stemmer.html

Parameters:word – the word to calculate the stem of
Returns:word stem
Return type:str
abydos.stemmer.sb_swedish(word)

Snowball Swedish stemmer

The Snowball Swedish stemmer is defined at: http://snowball.tartarus.org/algorithms/swedish/stemmer.html

Parameters:word – the word to calculate the stem of
Returns:word stem
Return type:str

abydos.util module

abydos.util

The util module defines various utility functions for other modules within Abydos, including:

  • prod – computes the product of a collection of numbers (akin to sum)

  • jitter – adds small random noise to each member of a list of numbers

    (ported from R’s jitter function)

  • Rational – a rational number class

class abydos.util.Rational(p=0, q=1)

Bases: object

Rational number object supporting arithmetic and comparison operations

denominator()

Get the denominator (q)

Returns:the denominator q
Return type:int
numerator()

Get the numerator (p)

Returns:the numerator p
Return type:int
p = 0L
q = 1L
abydos.util.jitter(nums, factor=1, amount=None, min_val=None, max_val=None, rfunc=u'normal')

Jitter

Adapted from R documentation as this is ported directly from the R code:

The result, say r, is r = x + numpy.random.uniform(-a, a, n), where n = len(x) and a is the amount argument (if specified).

Let z = max(x) - min(x) (assuming the usual case). The amount a to be added is either provided as positive argument amount or otherwise computed from z, as follows:

If amount == 0, we set a = factor * z/50 (same as S).

If amount is None (default), we set a = factor * d/5 where d is the smallest difference between adjacent unique x values.

Based on: http://svn.r-project.org/R/trunk/src/library/base/R/jitter.R

Parameters:
  • nums – numeric collection to which jitter should be added
  • factor – numeric
  • amount – numeric; if positive, used as amount (see below); if 0, the default is factor * z/50. Default (None): factor * d/5 where d is about the smallest difference between values of nums.
  • min_val – the minimum permitted value in the returned list
  • max_val – the maximum permitted value in the returned list
  • rfunc – a string or function indicating the random distribution to use: ‘normal’ (default), ‘uniform’ (standard in the R version), or ‘laplace’ (requires NumPy). If a function is supplied, it should take one argument (the value passed as amount).
Returns:

a list of numbers with random noise added, according to the R jitter function
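The amount-selection rules above can be sketched with the standard library alone. This simplified version covers only the uniform distribution and omits the min_val/max_val clamping and alternate distributions:

```python
import random


def jitter(nums, factor=1, amount=None):
    """Add uniform noise in [-a, a] to each number, choosing a by the
    R-style rules: a = factor * d/5 when amount is None (d = smallest
    gap between unique values), a = factor * z/50 when amount == 0
    (z = range of nums), otherwise a = amount."""
    z = max(nums) - min(nums)
    if amount is None:
        uniq = sorted(set(nums))
        diffs = [b - a for a, b in zip(uniq, uniq[1:])]
        # fallback when all values are equal (R's rule differs slightly)
        d = min(diffs) if diffs else (abs(uniq[0]) or 1)
        a = factor * d / 5
    elif amount == 0:
        a = factor * z / 50
    else:
        a = amount
    return [x + random.uniform(-a, a) for x in nums]
```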

abydos.util.prod(nums)

Product

The product is defined as \(\prod(nums)\).

Cf. https://en.wikipedia.org/wiki/Product_(mathematics)

Parameters:nums – a collection (list, tuple, set, etc.) of numbers
Returns:the product of nums
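A product function akin to the built-in sum can be sketched with functools.reduce (an illustrative version, not abydos’ implementation):

```python
from functools import reduce
from operator import mul


def prod(nums):
    """Return the product of a collection of numbers.

    The empty product is 1, mirroring sum()'s empty result of 0.
    """
    return reduce(mul, nums, 1)
```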

Module contents

abydos

Abydos NLP/IR library by Christopher C. Little

This library contains code I’m using for research, in particular dissertation research & experimentation.

Further documentation to come...