abydos.distance package

abydos.distance.

The distance module implements string edit distance functions including:

  • Levenshtein distance
  • Optimal String Alignment distance
  • Damerau-Levenshtein distance
  • Hamming distance
  • Tversky index
  • Sørensen–Dice coefficient & distance
  • Jaccard similarity coefficient & distance
  • overlap similarity & distance
  • Tanimoto coefficient & distance
  • Minkowski distance & similarity
  • Manhattan distance & similarity
  • Euclidean distance & similarity
  • Chebyshev distance
  • cosine similarity & distance
  • Jaro distance
  • Jaro-Winkler distance (incl. the strcmp95 algorithm variant)
  • Longest common substring
  • Ratcliff-Obershelp similarity & distance
  • Match Rating Algorithm similarity
  • Normalized Compression Distance (NCD) & similarity
  • Monge-Elkan similarity & distance
  • Matrix similarity
  • Needleman-Wunsch score
  • Smith-Waterman score
  • Gotoh score
  • Length similarity
  • Prefix, Suffix, and Identity similarity & distance
  • Modified Language-Independent Product Name Search (MLIPNS) similarity & distance
  • Bag similarity & distance
  • Editex distance
  • Eudex distances
  • Sift4 distance
  • Baystat distance & similarity
  • Typo distance
  • Indel distance
  • Synoname

Functions beginning with the prefixes ‘sim’ and ‘dist’ are guaranteed to return values in the range [0, 1], and sim_X = 1 - dist_X since the two are complements. If a sim_X function is given identical src & tar arguments, it is guaranteed to return 1; the corresponding dist_X function is guaranteed to return 0.

abydos.distance.sim(src, tar, method=<function sim_levenshtein>)

Return a similarity of two strings.

This is a generalized function for calling other similarity functions.

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
  • method (function) – specifies the similarity metric (sim_levenshtein by default)
Returns:

similarity according to the specified function

Return type:

float

>>> round(sim('cat', 'hat'), 12)
0.666666666667
>>> round(sim('Niall', 'Neil'), 12)
0.4
>>> sim('aluminum', 'Catalan')
0.125
>>> sim('ATCG', 'TAGC')
0.25
abydos.distance.dist(src, tar, method=<function sim_levenshtein>)

Return a distance between two strings.

This is a generalized function for calling other distance functions.

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
  • method (function) – specifies the similarity metric (sim_levenshtein by default) – Note that this takes a similarity metric function, not a distance metric function.
Returns:

distance according to the specified function

Return type:

float

>>> round(dist('cat', 'hat'), 12)
0.333333333333
>>> round(dist('Niall', 'Neil'), 12)
0.6
>>> dist('aluminum', 'Catalan')
0.875
>>> dist('ATCG', 'TAGC')
0.75
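
The relationship between sim() and dist() described above can be sketched as follows. This is an illustrative stand-in rather than the library's source: toy_sim, sim_sketch, and dist_sketch are hypothetical names, and toy_sim is a deliberately simple positional-match similarity chosen only so the example runs without abydos installed.

```python
def toy_sim(src, tar):
    """Hypothetical similarity: fraction of aligned positions that match."""
    if not src and not tar:
        return 1.0
    matches = sum(a == b for a, b in zip(src, tar))
    return matches / max(len(src), len(tar))


def sim_sketch(src, tar, method=toy_sim):
    """Dispatch to any similarity function returning a value in [0, 1]."""
    return method(src, tar)


def dist_sketch(src, tar, method=toy_sim):
    """Accept a *similarity* function and return its complement."""
    return 1 - method(src, tar)


print(sim_sketch('cat', 'hat'))   # 2 of 3 positions match
print(dist_sketch('cat', 'cat'))  # identical strings give a distance of 0
```

Passing a similarity function to both wrappers keeps the complement relation sim = 1 - dist in a single place.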
abydos.distance.levenshtein(src, tar, mode='lev', cost=(1, 1, 1, 1))

Return the Levenshtein distance between two strings.

This is the standard edit distance measure. Cf. [Lev65][Lev66].

Two additional variants are also supported: optimal string alignment (aka restricted Damerau-Levenshtein distance) [Boy11] and the Damerau-Levenshtein distance [Dam64].

The ordinary Levenshtein & Optimal String Alignment distance both employ the Wagner-Fischer dynamic programming algorithm [WF74].

Levenshtein edit distance ordinarily has unit insertion, deletion, and substitution costs.

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
  • mode (str) –

    specifies a mode for computing the Levenshtein distance:

    • 'lev' (default) computes the ordinary Levenshtein distance, in which edits may include inserts, deletes, and substitutions
    • 'osa' computes the Optimal String Alignment distance, in which edits may include inserts, deletes, substitutions, and transpositions but substrings may only be edited once
    • 'dam' computes the Damerau-Levenshtein distance, in which edits may include inserts, deletes, substitutions, and transpositions and substrings may undergo repeated edits
  • cost (tuple) – a 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and transpositions, respectively (by default: (1, 1, 1, 1))
Returns:

the Levenshtein distance between src & tar

Return type:

int (may return a float if cost has float values)

>>> levenshtein('cat', 'hat')
1
>>> levenshtein('Niall', 'Neil')
3
>>> levenshtein('aluminum', 'Catalan')
7
>>> levenshtein('ATCG', 'TAGC')
3
>>> levenshtein('ATCG', 'TAGC', mode='osa')
2
>>> levenshtein('ACTG', 'TAGC', mode='osa')
4
>>> levenshtein('ATCG', 'TAGC', mode='dam')
2
>>> levenshtein('ACTG', 'TAGC', mode='dam')
3
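
The Wagner-Fischer recurrence behind the 'lev' and 'osa' modes can be sketched as below. This is a minimal illustration under stated assumptions, not the library's code: levenshtein_sketch is a hypothetical name, the cost tuple is read in the documented order (insert, delete, substitute, transpose), and the 'dam' mode is not covered.

```python
def levenshtein_sketch(src, tar, mode='lev', cost=(1, 1, 1, 1)):
    """Wagner-Fischer dynamic programme for 'lev' and 'osa' modes."""
    ins_cost, del_cost, sub_cost, trans_cost = cost
    rows, cols = len(src) + 1, len(tar) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * del_cost          # delete every character of src
    for j in range(1, cols):
        d[0][j] = j * ins_cost          # insert every character of tar
    for i in range(1, rows):
        for j in range(1, cols):
            sub = 0 if src[i - 1] == tar[j - 1] else sub_cost
            d[i][j] = min(
                d[i - 1][j] + del_cost,
                d[i][j - 1] + ins_cost,
                d[i - 1][j - 1] + sub,
            )
            # OSA additionally allows swapping two adjacent characters
            if (mode == 'osa' and i > 1 and j > 1
                    and src[i - 1] == tar[j - 2]
                    and src[i - 2] == tar[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + trans_cost)
    return d[-1][-1]


print(levenshtein_sketch('cat', 'hat'))                # 1
print(levenshtein_sketch('ATCG', 'TAGC'))              # 3
print(levenshtein_sketch('ATCG', 'TAGC', mode='osa'))  # 2
```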
abydos.distance.dist_levenshtein(src, tar, mode='lev', cost=(1, 1, 1, 1))

Return the normalized Levenshtein distance between two strings.

The Levenshtein distance is normalized by dividing the Levenshtein distance (calculated by any of the three supported methods) by the greater of the number of characters in src times the cost of a delete and the number of characters in tar times the cost of an insert. For the case in which all operations have \(cost = 1\), this is equivalent to the greater of the length of the two strings src & tar.

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
  • mode (str) –

    specifies a mode for computing the Levenshtein distance:

    • 'lev' (default) computes the ordinary Levenshtein distance, in which edits may include inserts, deletes, and substitutions
    • 'osa' computes the Optimal String Alignment distance, in which edits may include inserts, deletes, substitutions, and transpositions but substrings may only be edited once
    • 'dam' computes the Damerau-Levenshtein distance, in which edits may include inserts, deletes, substitutions, and transpositions and substrings may undergo repeated edits
  • cost (tuple) – a 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and transpositions, respectively (by default: (1, 1, 1, 1))
Returns:

normalized Levenshtein distance

Return type:

float

>>> round(dist_levenshtein('cat', 'hat'), 12)
0.333333333333
>>> round(dist_levenshtein('Niall', 'Neil'), 12)
0.6
>>> dist_levenshtein('aluminum', 'Catalan')
0.875
>>> dist_levenshtein('ATCG', 'TAGC')
0.75
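
The normalization rule described above can be sketched on its own, given a raw edit distance. The helper name normalize_edit_distance is hypothetical, and returning 0.0 for identical inputs is an assumption made to avoid dividing by zero on two empty strings.

```python
def normalize_edit_distance(raw, src, tar, cost=(1, 1, 1, 1)):
    """Divide a raw edit distance by the documented normalizer."""
    ins_cost, del_cost = cost[0], cost[1]
    if src == tar:
        return 0.0  # assumption: identical strings (including '') give 0
    # greater of: deleting all of src, or inserting all of tar
    return raw / max(len(src) * del_cost, len(tar) * ins_cost)


# Raw distances taken from the levenshtein() examples in this document
print(normalize_edit_distance(3, 'Niall', 'Neil'))        # 0.6
print(normalize_edit_distance(7, 'aluminum', 'Catalan'))  # 0.875
```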
abydos.distance.sim_levenshtein(src, tar, mode='lev', cost=(1, 1, 1, 1))

Return the Levenshtein similarity of two strings.

Normalized Levenshtein similarity is the complement of normalized Levenshtein distance: \(sim_{Levenshtein} = 1 - dist_{Levenshtein}\).

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
  • mode (str) –

    specifies a mode for computing the Levenshtein distance:

    • 'lev' (default) computes the ordinary Levenshtein distance, in which edits may include inserts, deletes, and substitutions
    • 'osa' computes the Optimal String Alignment distance, in which edits may include inserts, deletes, substitutions, and transpositions but substrings may only be edited once
    • 'dam' computes the Damerau-Levenshtein distance, in which edits may include inserts, deletes, substitutions, and transpositions and substrings may undergo repeated edits
  • cost (tuple) – a 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and transpositions, respectively (by default: (1, 1, 1, 1))
Returns:

normalized Levenshtein similarity

Return type:

float

>>> round(sim_levenshtein('cat', 'hat'), 12)
0.666666666667
>>> round(sim_levenshtein('Niall', 'Neil'), 12)
0.4
>>> sim_levenshtein('aluminum', 'Catalan')
0.125
>>> sim_levenshtein('ATCG', 'TAGC')
0.25
abydos.distance.damerau_levenshtein(src, tar, cost=(1, 1, 1, 1))

Return the Damerau-Levenshtein distance between two strings.

This computes the Damerau-Levenshtein distance [Dam64]. Damerau-Levenshtein code is based on Java code by Kevin L. Stern [Ste14], under the MIT license: https://github.com/KevinStern/software-and-algorithms/blob/master/src/main/java/blogspot/software_and_algorithms/stern_library/string/DamerauLevenshteinAlgorithm.java

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
  • cost (tuple) – a 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and transpositions, respectively (by default: (1, 1, 1, 1))
Returns:

the Damerau-Levenshtein distance between src & tar

Return type:

int (may return a float if cost has float values)

>>> damerau_levenshtein('cat', 'hat')
1
>>> damerau_levenshtein('Niall', 'Neil')
3
>>> damerau_levenshtein('aluminum', 'Catalan')
7
>>> damerau_levenshtein('ATCG', 'TAGC')
2
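
For illustration, the unrestricted Damerau-Levenshtein distance can be sketched with the textbook alphabet-indexed dynamic programme. This is not a port of the Java code cited above: damerau_levenshtein_sketch is a hypothetical name and the sketch assumes unit costs only.

```python
def damerau_levenshtein_sketch(src, tar):
    """Unrestricted Damerau-Levenshtein distance with unit costs."""
    last_row = {}                        # last row in which each char was seen
    big = len(src) + len(tar)            # sentinel larger than any distance
    d = [[0] * (len(tar) + 2) for _ in range(len(src) + 2)]
    d[0][0] = big
    for i in range(len(src) + 1):
        d[i + 1][0] = big
        d[i + 1][1] = i
    for j in range(len(tar) + 1):
        d[0][j + 1] = big
        d[1][j + 1] = j
    for i in range(1, len(src) + 1):
        last_col = 0                     # last column in this row with a match
        for j in range(1, len(tar) + 1):
            k = last_row.get(tar[j - 1], 0)
            m = last_col
            if src[i - 1] == tar[j - 1]:
                sub = 0
                last_col = j
            else:
                sub = 1
            d[i + 1][j + 1] = min(
                d[i][j] + sub,                           # substitute or match
                d[i + 1][j] + 1,                         # insert
                d[i][j + 1] + 1,                         # delete
                d[k][m] + (i - k - 1) + 1 + (j - m - 1)  # transpose
            )
        last_row[src[i - 1]] = i
    return d[len(src) + 1][len(tar) + 1]


print(damerau_levenshtein_sketch('ATCG', 'TAGC'))  # 2
print(damerau_levenshtein_sketch('CA', 'ABC'))     # 2
```

The pair 'CA' and 'ABC' is a case where unrestricted transposition pays off: the result is 2 here, whereas the restricted (optimal string alignment) variant needs 3 edits.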
abydos.distance.dist_damerau(src, tar, cost=(1, 1, 1, 1))

Return the normalized Damerau-Levenshtein distance between two strings.

Damerau-Levenshtein distance normalized to the interval [0, 1].

The Damerau-Levenshtein distance is normalized by dividing the Damerau-Levenshtein distance by the greater of the number of characters in src times the cost of a delete and the number of characters in tar times the cost of an insert. For the case in which all operations have \(cost = 1\), this is equivalent to the greater of the length of the two strings src & tar.

The arguments are identical to those of the damerau_levenshtein() function.

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
  • cost (tuple) – a 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and transpositions, respectively (by default: (1, 1, 1, 1))
Returns:

normalized Damerau-Levenshtein distance

Return type:

float

>>> round(dist_damerau('cat', 'hat'), 12)
0.333333333333
>>> round(dist_damerau('Niall', 'Neil'), 12)
0.6
>>> dist_damerau('aluminum', 'Catalan')
0.875
>>> dist_damerau('ATCG', 'TAGC')
0.5
abydos.distance.sim_damerau(src, tar, cost=(1, 1, 1, 1))

Return the Damerau-Levenshtein similarity of two strings.

Normalized Damerau-Levenshtein similarity is the complement of normalized Damerau-Levenshtein distance: \(sim_{Damerau} = 1 - dist_{Damerau}\).

The arguments are identical to those of the damerau_levenshtein() function.

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
  • cost (tuple) – a 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and transpositions, respectively (by default: (1, 1, 1, 1))
Returns:

normalized Damerau-Levenshtein similarity

Return type:

float

>>> round(sim_damerau('cat', 'hat'), 12)
0.666666666667
>>> round(sim_damerau('Niall', 'Neil'), 12)
0.4
>>> sim_damerau('aluminum', 'Catalan')
0.125
>>> sim_damerau('ATCG', 'TAGC')
0.5
abydos.distance.dist_indel(src, tar)

Return the normalized indel distance between two strings.

This is equivalent to normalized Levenshtein distance, when only inserts and deletes are possible.

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
Returns:

indel distance

Return type:

float

>>> round(dist_indel('cat', 'hat'), 12)
0.333333333333
>>> round(dist_indel('Niall', 'Neil'), 12)
0.333333333333
>>> round(dist_indel('Colin', 'Cuilen'), 12)
0.454545454545
>>> dist_indel('ATCG', 'TAGC')
0.5
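
One way to picture the indel distance is through the longest common subsequence (LCS): every character outside the LCS must be deleted from src or inserted from tar. The sketch below rests on that identity and on the assumption, consistent with the example values above, that the normalizer is the combined length of both strings. The names lcs_length and dist_indel_sketch are hypothetical.

```python
def lcs_length(src, tar):
    """Length of the longest common subsequence, computed row by row."""
    prev = [0] * (len(tar) + 1)
    for ch in src:
        cur = [0]
        for j, other in enumerate(tar, 1):
            if ch == other:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def dist_indel_sketch(src, tar):
    """Insert/delete-only edit distance, normalized by total length."""
    total = len(src) + len(tar)
    if total == 0:
        return 0.0
    return (total - 2 * lcs_length(src, tar)) / total


print(dist_indel_sketch('ATCG', 'TAGC'))  # 0.5
```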
abydos.distance.sim_indel(src, tar)

Return the normalized indel similarity of two strings.

This is equivalent to normalized Levenshtein similarity, when only inserts and deletes are possible.

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
Returns:

indel similarity

Return type:

float

>>> round(sim_indel('cat', 'hat'), 12)
0.666666666667
>>> round(sim_indel('Niall', 'Neil'), 12)
0.666666666667
>>> round(sim_indel('Colin', 'Cuilen'), 12)
0.545454545455
>>> sim_indel('ATCG', 'TAGC')
0.5
abydos.distance.hamming(src, tar, diff_lens=True)

Return the Hamming distance between two strings.

Hamming distance [Ham50] equals the number of character positions at which two strings differ. For strings of unequal lengths, it is not normally defined. By default, this implementation calculates the Hamming distance of the first n characters where n is the lesser of the two strings’ lengths and adds to this the difference in string lengths.

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
  • diff_lens (bool) – If True (default), this returns the Hamming distance for those characters that have a matching character in both strings plus the difference in the strings’ lengths. This is equivalent to extending the shorter string with obligatorily non-matching characters. If False, an exception is raised in the case of strings of unequal lengths.
Returns:

the Hamming distance between src & tar

Return type:

int

>>> hamming('cat', 'hat')
1
>>> hamming('Niall', 'Neil')
3
>>> hamming('aluminum', 'Catalan')
8
>>> hamming('ATCG', 'TAGC')
4
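
The diff_lens behaviour can be sketched in a few lines. The name hamming_sketch is hypothetical, and raising ValueError for unequal lengths is an assumption: the text above only says that an exception is raised.

```python
def hamming_sketch(src, tar, diff_lens=True):
    """Count mismatched positions, plus the length difference if allowed."""
    if not diff_lens and len(src) != len(tar):
        # assumption: the exception type is not specified in the text
        raise ValueError('Undefined for sequences of unequal length.')
    mismatches = sum(a != b for a, b in zip(src, tar))
    # the overhang of the longer string counts as obligatory mismatches
    return mismatches + abs(len(src) - len(tar))


print(hamming_sketch('Niall', 'Neil'))        # 3
print(hamming_sketch('aluminum', 'Catalan'))  # 8
```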
abydos.distance.dist_hamming(src, tar, diff_lens=True)

Return the normalized Hamming distance between two strings.

Hamming distance normalized to the interval [0, 1].

The Hamming distance is normalized by dividing it by the greater of the number of characters in src & tar (unless diff_lens is set to False, in which case an exception is raised).

The arguments are identical to those of the hamming() function.

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
  • diff_lens (bool) – If True (default), this returns the Hamming distance for those characters that have a matching character in both strings plus the difference in the strings’ lengths. This is equivalent to extending the shorter string with obligatorily non-matching characters. If False, an exception is raised in the case of strings of unequal lengths.
Returns:

normalized Hamming distance

Return type:

float

>>> round(dist_hamming('cat', 'hat'), 12)
0.333333333333
>>> dist_hamming('Niall', 'Neil')
0.6
>>> dist_hamming('aluminum', 'Catalan')
1.0
>>> dist_hamming('ATCG', 'TAGC')
1.0
abydos.distance.sim_hamming(src, tar, diff_lens=True)

Return the normalized Hamming similarity of two strings.

Hamming similarity normalized to the interval [0, 1].

Hamming similarity is the complement of normalized Hamming distance: \(sim_{Hamming} = 1 - dist_{Hamming}\).

Provided that diff_lens==True, the Hamming similarity is identical to the Language-Independent Product Name Search (LIPNS) similarity score. For further information, see the sim_mlipns documentation.

The arguments are identical to those of the hamming() function.

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
  • diff_lens (bool) – If True (default), this returns the Hamming distance for those characters that have a matching character in both strings plus the difference in the strings’ lengths. This is equivalent to extending the shorter string with obligatorily non-matching characters. If False, an exception is raised in the case of strings of unequal lengths.
Returns:

normalized Hamming similarity

Return type:

float

>>> round(sim_hamming('cat', 'hat'), 12)
0.666666666667
>>> sim_hamming('Niall', 'Neil')
0.4
>>> sim_hamming('aluminum', 'Catalan')
0.0
>>> sim_hamming('ATCG', 'TAGC')
0.0
abydos.distance.dist_jaro_winkler(src, tar, qval=1, mode='winkler', long_strings=False, boost_threshold=0.7, scaling_factor=0.1)

Return the Jaro or Jaro-Winkler distance between two strings.

Jaro(-Winkler) similarity is the complement of Jaro(-Winkler) distance: \(sim_{Jaro(-Winkler)} = 1 - dist_{Jaro(-Winkler)}\).

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
  • qval (int) – the length of each q-gram (defaults to 1: character-wise matching)
  • mode (str) –

    indicates which variant of this distance metric to compute:

    • 'winkler' – computes the Jaro-Winkler distance (default) which increases the score for matches near the start of the word
    • 'jaro' – computes the Jaro distance

The following arguments apply only when mode is ‘winkler’:

Parameters:
  • long_strings (bool) – set to True to “Increase the probability of a match when the number of matched characters is large. This option allows for a little more tolerance when the strings are large. It is not an appropriate test when comparing fixed length fields such as phone and social security numbers.”
  • boost_threshold (float) – a value between 0 and 1, below which the Winkler boost is not applied (defaults to 0.7)
  • scaling_factor (float) – a value between 0 and 0.25, indicating by how much to boost scores for matching prefixes (defaults to 0.1)
Returns:

Jaro or Jaro-Winkler distance

Return type:

float

>>> round(dist_jaro_winkler('cat', 'hat'), 12)
0.222222222222
>>> round(dist_jaro_winkler('Niall', 'Neil'), 12)
0.195
>>> round(dist_jaro_winkler('aluminum', 'Catalan'), 12)
0.39880952381
>>> round(dist_jaro_winkler('ATCG', 'TAGC'), 12)
0.166666666667
>>> round(dist_jaro_winkler('cat', 'hat', mode='jaro'), 12)
0.222222222222
>>> round(dist_jaro_winkler('Niall', 'Neil', mode='jaro'), 12)
0.216666666667
>>> round(dist_jaro_winkler('aluminum', 'Catalan', mode='jaro'), 12)
0.39880952381
>>> round(dist_jaro_winkler('ATCG', 'TAGC', mode='jaro'), 12)
0.166666666667
abydos.distance.sim_jaro_winkler(src, tar, qval=1, mode='winkler', long_strings=False, boost_threshold=0.7, scaling_factor=0.1)

Return the Jaro or Jaro-Winkler similarity of two strings.

Jaro(-Winkler) distance is a string edit distance initially proposed by Jaro and extended by Winkler [Jar89][Win90].

This Python implementation is based on the C code for strcmp95: http://web.archive.org/web/20110629121242/http://www.census.gov/geo/msb/stand/strcmp.c [WMJL94]. The above file is a US Government publication and, accordingly, in the public domain.

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
  • qval (int) – the length of each q-gram (defaults to 1: character-wise matching)
  • mode (str) –

    indicates which variant of this distance metric to compute:

    • 'winkler' – computes the Jaro-Winkler distance (default) which increases the score for matches near the start of the word
    • 'jaro' – computes the Jaro distance

The following arguments apply only when mode is ‘winkler’:

Parameters:
  • long_strings (bool) – set to True to “Increase the probability of a match when the number of matched characters is large. This option allows for a little more tolerance when the strings are large. It is not an appropriate test when comparing fixed length fields such as phone and social security numbers.”
  • boost_threshold (float) – a value between 0 and 1, below which the Winkler boost is not applied (defaults to 0.7)
  • scaling_factor (float) – a value between 0 and 0.25, indicating by how much to boost scores for matching prefixes (defaults to 0.1)
Returns:

Jaro or Jaro-Winkler similarity

Return type:

float

>>> round(sim_jaro_winkler('cat', 'hat'), 12)
0.777777777778
>>> round(sim_jaro_winkler('Niall', 'Neil'), 12)
0.805
>>> round(sim_jaro_winkler('aluminum', 'Catalan'), 12)
0.60119047619
>>> round(sim_jaro_winkler('ATCG', 'TAGC'), 12)
0.833333333333
>>> round(sim_jaro_winkler('cat', 'hat', mode='jaro'), 12)
0.777777777778
>>> round(sim_jaro_winkler('Niall', 'Neil', mode='jaro'), 12)
0.783333333333
>>> round(sim_jaro_winkler('aluminum', 'Catalan', mode='jaro'), 12)
0.60119047619
>>> round(sim_jaro_winkler('ATCG', 'TAGC', mode='jaro'), 12)
0.833333333333
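
A character-wise sketch of the Jaro score and the Winkler prefix boost follows. It is a simplified illustration, not the strcmp95-derived implementation: jaro_winkler_sketch is a hypothetical name, the qval and long_strings options are left out, and the four-character prefix cap and the handling of empty strings are assumptions.

```python
def jaro_winkler_sketch(src, tar, mode='winkler',
                        boost_threshold=0.7, scaling_factor=0.1):
    """Jaro similarity, optionally boosted for a shared prefix."""
    if not src or not tar:
        return 1.0 if src == tar else 0.0  # assumption for empty input
    # characters match only if they lie within this many positions
    window = max(max(len(src), len(tar)) // 2 - 1, 0)
    src_hit = [False] * len(src)
    tar_hit = [False] * len(tar)
    matches = 0
    for i, ch in enumerate(src):
        lo = max(0, i - window)
        hi = min(len(tar), i + window + 1)
        for j in range(lo, hi):
            if not tar_hit[j] and tar[j] == ch:
                src_hit[i] = tar_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # matched characters that appear in a different order are transposed
    src_seq = [c for c, hit in zip(src, src_hit) if hit]
    tar_seq = [c for c, hit in zip(tar, tar_hit) if hit]
    transpositions = sum(a != b for a, b in zip(src_seq, tar_seq)) // 2
    score = (matches / len(src) + matches / len(tar)
             + (matches - transpositions) / matches) / 3
    if mode == 'winkler' and score > boost_threshold:
        prefix = 0
        for a, b in zip(src[:4], tar[:4]):  # assumed cap of 4 characters
            if a != b:
                break
            prefix += 1
        score += prefix * scaling_factor * (1 - score)
    return score


print(round(jaro_winkler_sketch('Niall', 'Neil'), 12))               # 0.805
print(round(jaro_winkler_sketch('Niall', 'Neil', mode='jaro'), 12))  # 0.783333333333
```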
abydos.distance.dist_strcmp95(src, tar, long_strings=False)

Return the strcmp95 distance between two strings.

strcmp95 distance is the complement of strcmp95 similarity: \(dist_{strcmp95} = 1 - sim_{strcmp95}\).

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
  • long_strings (bool) – set to True to “Increase the probability of a match when the number of matched characters is large. This option allows for a little more tolerance when the strings are large. It is not an appropriate test when comparing fixed length fields such as phone and social security numbers.”
Returns:

strcmp95 distance

Return type:

float

>>> round(dist_strcmp95('cat', 'hat'), 12)
0.222222222222
>>> round(dist_strcmp95('Niall', 'Neil'), 12)
0.1545
>>> round(dist_strcmp95('aluminum', 'Catalan'), 12)
0.345238095238
>>> round(dist_strcmp95('ATCG', 'TAGC'), 12)
0.166666666667
abydos.distance.sim_strcmp95(src, tar, long_strings=False)

Return the strcmp95 similarity of two strings.

This is a Python translation of the C code for strcmp95: http://web.archive.org/web/20110629121242/http://www.census.gov/geo/msb/stand/strcmp.c [WMJL94]. The above file is a US Government publication and, accordingly, in the public domain.

This is based on the Jaro-Winkler distance, but also attempts to correct for some common typos and frequently confused characters. It is also limited to uppercase ASCII characters, so it is appropriate to American names, but not much else.

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
  • long_strings (bool) – set to True to “Increase the probability of a match when the number of matched characters is large. This option allows for a little more tolerance when the strings are large. It is not an appropriate test when comparing fixed length fields such as phone and social security numbers.”
Returns:

strcmp95 similarity

Return type:

float

>>> sim_strcmp95('cat', 'hat')
0.7777777777777777
>>> sim_strcmp95('Niall', 'Neil')
0.8454999999999999
>>> sim_strcmp95('aluminum', 'Catalan')
0.6547619047619048
>>> sim_strcmp95('ATCG', 'TAGC')
0.8333333333333334
abydos.distance.minkowski(src, tar, qval=2, pval=1, normalized=False, alphabet=None)

Return the Minkowski distance (\(L^p\)-norm) of two strings.

The Minkowski distance [Min10] is a distance metric in \(L^p\)-space.

Parameters:
  • src (str) – source string (or QGrams/Counter objects) for comparison
  • tar (str) – target string (or QGrams/Counter objects) for comparison
  • qval (int) – the length of each q-gram; 0 for non-q-gram version
  • pval (int or float) – the \(p\)-value of the \(L^p\)-space.
  • normalized (bool) – normalizes to [0, 1] if True
  • alphabet (collection or int) – the values or size of the alphabet
Returns:

the Minkowski distance

Return type:

float

>>> minkowski('cat', 'hat')
4.0
>>> minkowski('Niall', 'Neil')
7.0
>>> minkowski('Colin', 'Cuilen')
9.0
>>> minkowski('ATCG', 'TAGC')
10.0
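
The q-gram form of the Minkowski distance can be sketched as below. Several details are assumptions inferred from the example values rather than taken from the library: strings are padded with qval - 1 copies of '$' and '#', the normalizer is the \(L^p\)-norm of the summed q-gram counts, pval must be positive, and the alphabet and qval=0 options are omitted. The names qgram_counts and minkowski_sketch are hypothetical.

```python
from collections import Counter


def qgram_counts(text, qval=2, start='$', stop='#'):
    """Count q-grams of text after padding (padding is an assumption)."""
    padded = start * (qval - 1) + text + stop * (qval - 1)
    return Counter(padded[i:i + qval]
                   for i in range(len(padded) - qval + 1))


def minkowski_sketch(src, tar, qval=2, pval=1, normalized=False):
    """L^p distance between q-gram count vectors (pval > 0 assumed)."""
    q_src, q_tar = qgram_counts(src, qval), qgram_counts(tar, qval)
    diffs = [abs(q_src[g] - q_tar[g]) for g in set(q_src) | set(q_tar)]
    dist = sum(d ** pval for d in diffs) ** (1 / pval)
    if normalized:
        totals = (q_src + q_tar).values()
        norm = sum(t ** pval for t in totals) ** (1 / pval)
        return dist / norm if norm else 0.0
    return dist


print(minkowski_sketch('cat', 'hat'))                   # 4.0 (Manhattan)
print(minkowski_sketch('cat', 'hat', pval=2))           # 2.0 (Euclidean)
print(minkowski_sketch('cat', 'hat', normalized=True))  # 0.5
```

With pval=1 this reduces to the Manhattan distance and with pval=2 to the Euclidean distance.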
abydos.distance.dist_minkowski(src, tar, qval=2, pval=1, alphabet=None)

Return normalized Minkowski distance of two strings.

The normalized Minkowski distance [Min10] is a distance metric in \(L^p\)-space, normalized to [0, 1].

Parameters:
  • src (str) – source string (or QGrams/Counter objects) for comparison
  • tar (str) – target string (or QGrams/Counter objects) for comparison
  • qval (int) – the length of each q-gram; 0 for non-q-gram version
  • pval (int or float) – the \(p\)-value of the \(L^p\)-space.
  • alphabet (collection or int) – the values or size of the alphabet
Returns:

the normalized Minkowski distance

Return type:

float

>>> dist_minkowski('cat', 'hat')
0.5
>>> round(dist_minkowski('Niall', 'Neil'), 12)
0.636363636364
>>> round(dist_minkowski('Colin', 'Cuilen'), 12)
0.692307692308
>>> dist_minkowski('ATCG', 'TAGC')
1.0
abydos.distance.sim_minkowski(src, tar, qval=2, pval=1, alphabet=None)

Return normalized Minkowski similarity of two strings.

Minkowski similarity is the complement of Minkowski distance: \(sim_{Minkowski} = 1 - dist_{Minkowski}\).

Parameters:
  • src (str) – source string (or QGrams/Counter objects) for comparison
  • tar (str) – target string (or QGrams/Counter objects) for comparison
  • qval (int) – the length of each q-gram; 0 for non-q-gram version
  • pval (int or float) – the \(p\)-value of the \(L^p\)-space.
  • alphabet (collection or int) – the values or size of the alphabet
Returns:

the normalized Minkowski similarity

Return type:

float

>>> sim_minkowski('cat', 'hat')
0.5
>>> round(sim_minkowski('Niall', 'Neil'), 12)
0.363636363636
>>> round(sim_minkowski('Colin', 'Cuilen'), 12)
0.307692307692
>>> sim_minkowski('ATCG', 'TAGC')
0.0
abydos.distance.manhattan(src, tar, qval=2, normalized=False, alphabet=None)

Return the Manhattan distance between two strings.

Manhattan distance is the city-block or taxi-cab distance, equivalent to Minkowski distance in \(L^1\)-space.

Parameters:
  • src (str) – source string (or QGrams/Counter objects) for comparison
  • tar (str) – target string (or QGrams/Counter objects) for comparison
  • qval (int) – the length of each q-gram; 0 for non-q-gram version
  • normalized (bool) – normalizes to [0, 1] if True
  • alphabet (collection or int) – the values or size of the alphabet
Returns:

the Manhattan distance

Return type:

float

>>> manhattan('cat', 'hat')
4.0
>>> manhattan('Niall', 'Neil')
7.0
>>> manhattan('Colin', 'Cuilen')
9.0
>>> manhattan('ATCG', 'TAGC')
10.0
abydos.distance.dist_manhattan(src, tar, qval=2, alphabet=None)

Return the normalized Manhattan distance between two strings.

The normalized Manhattan distance is a distance metric in \(L^1\)-space, normalized to [0, 1].

This is identical to Canberra distance.

Parameters:
  • src (str) – source string (or QGrams/Counter objects) for comparison
  • tar (str) – target string (or QGrams/Counter objects) for comparison
  • qval (int) – the length of each q-gram; 0 for non-q-gram version
  • alphabet (collection or int) – the values or size of the alphabet
Returns:

the normalized Manhattan distance

Return type:

float

>>> dist_manhattan('cat', 'hat')
0.5
>>> round(dist_manhattan('Niall', 'Neil'), 12)
0.636363636364
>>> round(dist_manhattan('Colin', 'Cuilen'), 12)
0.692307692308
>>> dist_manhattan('ATCG', 'TAGC')
1.0
abydos.distance.sim_manhattan(src, tar, qval=2, alphabet=None)

Return the normalized Manhattan similarity of two strings.

Manhattan similarity is the complement of Manhattan distance: \(sim_{Manhattan} = 1 - dist_{Manhattan}\).

Parameters:
  • src (str) – source string (or QGrams/Counter objects) for comparison
  • tar (str) – target string (or QGrams/Counter objects) for comparison
  • qval (int) – the length of each q-gram; 0 for non-q-gram version
  • alphabet (collection or int) – the values or size of the alphabet
Returns:

the normalized Manhattan similarity

Return type:

float

>>> sim_manhattan('cat', 'hat')
0.5
>>> round(sim_manhattan('Niall', 'Neil'), 12)
0.363636363636
>>> round(sim_manhattan('Colin', 'Cuilen'), 12)
0.307692307692
>>> sim_manhattan('ATCG', 'TAGC')
0.0
abydos.distance.euclidean(src, tar, qval=2, normalized=False, alphabet=None)

Return the Euclidean distance between two strings.

Euclidean distance is the straight-line or as-the-crow-flies distance, equivalent to Minkowski distance in \(L^2\)-space.

Parameters:
  • src (str) – source string (or QGrams/Counter objects) for comparison
  • tar (str) – target string (or QGrams/Counter objects) for comparison
  • qval (int) – the length of each q-gram; 0 for non-q-gram version
  • normalized (bool) – normalizes to [0, 1] if True
  • alphabet (collection or int) – the values or size of the alphabet
Returns:

the Euclidean distance

Return type:

float

>>> euclidean('cat', 'hat')
2.0
>>> round(euclidean('Niall', 'Neil'), 12)
2.645751311065
>>> euclidean('Colin', 'Cuilen')
3.0
>>> round(euclidean('ATCG', 'TAGC'), 12)
3.162277660168
abydos.distance.dist_euclidean(src, tar, qval=2, alphabet=None)

Return the normalized Euclidean distance between two strings.

The normalized Euclidean distance is a distance metric in \(L^2\)-space, normalized to [0, 1].

Parameters:
  • src (str) – source string (or QGrams/Counter objects) for comparison
  • tar (str) – target string (or QGrams/Counter objects) for comparison
  • qval (int) – the length of each q-gram; 0 for non-q-gram version
  • alphabet (collection or int) – the values or size of the alphabet
Returns:

the normalized Euclidean distance

Return type:

float

>>> round(dist_euclidean('cat', 'hat'), 12)
0.57735026919
>>> round(dist_euclidean('Niall', 'Neil'), 12)
0.683130051064
>>> round(dist_euclidean('Colin', 'Cuilen'), 12)
0.727606875109
>>> dist_euclidean('ATCG', 'TAGC')
1.0
abydos.distance.sim_euclidean(src, tar, qval=2, alphabet=None)

Return the normalized Euclidean similarity of two strings.

Euclidean similarity is the complement of Euclidean distance: \(sim_{Euclidean} = 1 - dist_{Euclidean}\).

Parameters:
  • src (str) – source string (or QGrams/Counter objects) for comparison
  • tar (str) – target string (or QGrams/Counter objects) for comparison
  • qval (int) – the length of each q-gram; 0 for non-q-gram version
  • alphabet (collection or int) – the values or size of the alphabet
Returns:

the normalized Euclidean similarity

Return type:

float

>>> round(sim_euclidean('cat', 'hat'), 12)
0.42264973081
>>> round(sim_euclidean('Niall', 'Neil'), 12)
0.316869948936
>>> round(sim_euclidean('Colin', 'Cuilen'), 12)
0.272393124891
>>> sim_euclidean('ATCG', 'TAGC')
0.0
abydos.distance.chebyshev(src, tar, qval=2, normalized=False, alphabet=None)

Return the Chebyshev distance between two strings.

Chebyshev distance is the chessboard distance, equivalent to Minkowski distance in \(L^\infty\)-space.

Parameters:
  • src (str) – source string (or QGrams/Counter objects) for comparison
  • tar (str) – target string (or QGrams/Counter objects) for comparison
  • qval (int) – the length of each q-gram; 0 for non-q-gram version
  • normalized (bool) – normalizes to [0, 1] if True
  • alphabet (collection or int) – the values or size of the alphabet
Returns:

the Chebyshev distance

Return type:

float

>>> chebyshev('cat', 'hat')
1.0
>>> chebyshev('Niall', 'Neil')
1.0
>>> chebyshev('Colin', 'Cuilen')
1.0
>>> chebyshev('ATCG', 'TAGC')
1.0
>>> chebyshev('ATCG', 'TAGC', qval=1)
0.0
>>> chebyshev('ATCGATTCGGAATTTC', 'TAGCATAATCGCCG', qval=1)
3.0
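
In the \(L^\infty\) limit the Minkowski sum collapses to the single largest per-q-gram difference, which the sketch below computes directly. The name chebyshev_sketch is hypothetical, and the '$'/'#' padding of qval - 1 characters is an assumption inferred from the example values; the normalized and alphabet options are omitted.

```python
from collections import Counter


def chebyshev_sketch(src, tar, qval=2):
    """Largest absolute difference between any two q-gram counts."""
    def counts(text):
        # assumption: pad with qval - 1 start and stop symbols
        padded = '$' * (qval - 1) + text + '#' * (qval - 1)
        return Counter(padded[i:i + qval]
                       for i in range(len(padded) - qval + 1))

    q_src, q_tar = counts(src), counts(tar)
    grams = set(q_src) | set(q_tar)
    return float(max((abs(q_src[g] - q_tar[g]) for g in grams), default=0))


print(chebyshev_sketch('ATCG', 'TAGC', qval=1))  # 0.0
print(chebyshev_sketch('ATCGATTCGGAATTTC', 'TAGCATAATCGCCG', qval=1))  # 3.0
```

With qval=1 the two strings 'ATCG' and 'TAGC' contain the same letters the same number of times, so every difference is zero even though the strings are not equal.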
abydos.distance.dist_tversky(src, tar, qval=2, alpha=1, beta=1, bias=None)

Return the Tversky distance between two strings.

Tversky distance is the complement of the Tversky index (similarity): \(dist_{Tversky} = 1 - sim_{Tversky}\).

Parameters:
  • src (str) – source string (or QGrams/Counter objects) for comparison
  • tar (str) – target string (or QGrams/Counter objects) for comparison
  • qval (int) – the length of each q-gram; 0 for non-q-gram version
  • alpha (float) – the Tversky index’s alpha parameter
  • beta (float) – the Tversky index’s beta parameter
  • bias (float) – The symmetric Tversky index bias parameter
Returns:

Tversky distance

Return type:

float

>>> dist_tversky('cat', 'hat')
0.6666666666666667
>>> dist_tversky('Niall', 'Neil')
0.7777777777777778
>>> dist_tversky('aluminum', 'Catalan')
0.9375
>>> dist_tversky('ATCG', 'TAGC')
1.0
abydos.distance.sim_tversky(src, tar, qval=2, alpha=1, beta=1, bias=None)

Return the Tversky index of two strings.

The Tversky index [Tve77] is defined as follows. For two sets X and Y: \(sim_{Tversky}(X, Y) = \frac{|X \cap Y|}{|X \cap Y| + \alpha|X - Y| + \beta|Y - X|}\).

\(\alpha = \beta = 1\) is equivalent to the Jaccard & Tanimoto similarity coefficients.

\(\alpha = \beta = 0.5\) is equivalent to the Sørensen-Dice similarity coefficient [Dic45][Sorensen48].

Unequal α and β will tend to emphasize one or the other set’s contributions:

  • \(\alpha > \beta\) emphasizes the contributions of X over Y
  • \(\alpha < \beta\) emphasizes the contributions of Y over X

Parameter values’ relation to 1 emphasizes different types of contributions:

  • \(\alpha\) and \(\beta > 1\) emphasize unique contributions over the intersection
  • \(\alpha\) and \(\beta < 1\) emphasize the intersection over unique contributions

The symmetric variant is defined in [JBG13]. This is activated by specifying a bias parameter.

Parameters:
  • src (str) – source string (or QGrams/Counter objects) for comparison
  • tar (str) – target string (or QGrams/Counter objects) for comparison
  • qval (int) – the length of each q-gram; 0 for non-q-gram version
  • alpha (float) – Tversky index parameter as described above
  • beta (float) – Tversky index parameter as described above
  • bias (float) – The symmetric Tversky index bias parameter
Returns:

Tversky similarity

Return type:

float

>>> sim_tversky('cat', 'hat')
0.3333333333333333
>>> sim_tversky('Niall', 'Neil')
0.2222222222222222
>>> sim_tversky('aluminum', 'Catalan')
0.0625
>>> sim_tversky('ATCG', 'TAGC')
0.0
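
The formula above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's implementation: the names `padded_qgrams` and `tversky_index` are hypothetical, and padding each string with `$` and `#` before splitting it into q-grams is an assumption chosen because it reproduces the example values shown above. The symmetric `bias` variant is not covered.

```python
from collections import Counter


def padded_qgrams(text, qval=2):
    """Return a multiset of q-grams of text (hypothetical helper).

    Assumption: the string is padded with qval - 1 start ('$') and
    stop ('#') symbols before it is split into q-grams.
    """
    padded = '$' * (qval - 1) + text + '#' * (qval - 1)
    return Counter(padded[i:i + qval] for i in range(len(padded) - qval + 1))


def tversky_index(src, tar, qval=2, alpha=1, beta=1):
    """Return |X & Y| / (|X & Y| + alpha*|X - Y| + beta*|Y - X|)."""
    if src == tar:
        return 1.0
    x, y = padded_qgrams(src, qval), padded_qgrams(tar, qval)
    common = sum((x & y).values())   # multiset intersection size
    only_x = sum((x - y).values())   # q-grams found only in src
    only_y = sum((y - x).values())   # q-grams found only in tar
    denominator = common + alpha * only_x + beta * only_y
    return common / denominator if denominator else 0.0


print(tversky_index('cat', 'hat'))                       # alpha = beta = 1
print(tversky_index('cat', 'hat', alpha=0.5, beta=0.5))  # alpha = beta = 0.5
```

With alpha = beta = 1 the sketch reduces to the Jaccard coefficient, and with alpha = beta = 0.5 it reduces to the Sørensen–Dice coefficient.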
abydos.distance.dist_dice(src, tar, qval=2)[source]

Return the Sørensen–Dice distance between two strings.

Sørensen–Dice distance is the complement of the Sørensen–Dice coefficient: \(dist_{dice} = 1 - sim_{dice}\).

Parameters:
  • src (str) – source string (or QGrams/Counter objects) for comparison
  • tar (str) – target string (or QGrams/Counter objects) for comparison
  • qval (int) – the length of each q-gram; 0 for non-q-gram version
Returns:

Sørensen–Dice distance

Return type:

float

>>> dist_dice('cat', 'hat')
0.5
>>> dist_dice('Niall', 'Neil')
0.6363636363636364
>>> dist_dice('aluminum', 'Catalan')
0.8823529411764706
>>> dist_dice('ATCG', 'TAGC')
1.0
abydos.distance.sim_dice(src, tar, qval=2)[source]

Return the Sørensen–Dice coefficient of two strings.

For two sets X and Y, the Sørensen–Dice coefficient [Dic45][Sorensen48] is \(sim_{dice}(X, Y) = \\frac{2 \\cdot |X \\cap Y|}{|X| + |Y|}\).

This is identical to the Tversky index [Tve77] for \(\\alpha = \\beta = 0.5\).

Parameters:
  • src (str) – source string (or QGrams/Counter objects) for comparison
  • tar (str) – target string (or QGrams/Counter objects) for comparison
  • qval (int) – the length of each q-gram; 0 for non-q-gram version
Returns:

Sørensen–Dice similarity

Return type:

float

>>> sim_dice('cat', 'hat')
0.5
>>> sim_dice('Niall', 'Neil')
0.36363636363636365
>>> sim_dice('aluminum', 'Catalan')
0.11764705882352941
>>> sim_dice('ATCG', 'TAGC')
0.0
abydos.distance.dist_jaccard(src, tar, qval=2)[source]

Return the Jaccard distance between two strings.

Jaccard distance is the complement of the Jaccard similarity coefficient: \(dist_{Jaccard} = 1 - sim_{Jaccard}\).

Parameters:
  • src (str) – source string (or QGrams/Counter objects) for comparison
  • tar (str) – target string (or QGrams/Counter objects) for comparison
  • qval (int) – the length of each q-gram; 0 for non-q-gram version
Returns:

Jaccard distance

Return type:

float

>>> dist_jaccard('cat', 'hat')
0.6666666666666667
>>> dist_jaccard('Niall', 'Neil')
0.7777777777777778
>>> dist_jaccard('aluminum', 'Catalan')
0.9375
>>> dist_jaccard('ATCG', 'TAGC')
1.0
abydos.distance.sim_jaccard(src, tar, qval=2)[source]

Return the Jaccard similarity of two strings.

For two sets X and Y, the Jaccard similarity coefficient [Jac01] is \(sim_{jaccard}(X, Y) = \\frac{|X \\cap Y|}{|X \\cup Y|}\).

This is identical to the Tanimoto similarity coefficient [Tan58] and the Tversky index [Tve77] for \(\\alpha = \\beta = 1\).

Parameters:
  • src (str) – source string (or QGrams/Counter objects) for comparison
  • tar (str) – target string (or QGrams/Counter objects) for comparison
  • qval (int) – the length of each q-gram; 0 for non-q-gram version
Returns:

Jaccard similarity

Return type:

float

>>> sim_jaccard('cat', 'hat')
0.3333333333333333
>>> sim_jaccard('Niall', 'Neil')
0.2222222222222222
>>> sim_jaccard('aluminum', 'Catalan')
0.0625
>>> sim_jaccard('ATCG', 'TAGC')
0.0
abydos.distance.dist_overlap(src, tar, qval=2)[source]

Return the overlap distance between two strings.

Overlap distance is the complement of the overlap coefficient: \(dist_{overlap} = 1 - sim_{overlap}\).

Parameters:
  • src (str) – source string (or QGrams/Counter objects) for comparison
  • tar (str) – target string (or QGrams/Counter objects) for comparison
  • qval (int) – the length of each q-gram; 0 for non-q-gram version
Returns:

overlap distance

Return type:

float

>>> dist_overlap('cat', 'hat')
0.5
>>> dist_overlap('Niall', 'Neil')
0.6
>>> dist_overlap('aluminum', 'Catalan')
0.875
>>> dist_overlap('ATCG', 'TAGC')
1.0
abydos.distance.sim_overlap(src, tar, qval=2)[source]

Return the overlap coefficient of two strings.

For two sets X and Y, the overlap coefficient [Szy34][Sim49], also called the Szymkiewicz-Simpson coefficient, is \(sim_{overlap}(X, Y) = \\frac{|X \\cap Y|}{min(|X|, |Y|)}\).

Parameters:
  • src (str) – source string (or QGrams/Counter objects) for comparison
  • tar (str) – target string (or QGrams/Counter objects) for comparison
  • qval (int) – the length of each q-gram; 0 for non-q-gram version
Returns:

overlap similarity

Return type:

float

>>> sim_overlap('cat', 'hat')
0.5
>>> sim_overlap('Niall', 'Neil')
0.4
>>> sim_overlap('aluminum', 'Catalan')
0.125
>>> sim_overlap('ATCG', 'TAGC')
0.0
abydos.distance.tanimoto(src, tar, qval=2)[source]

Return the Tanimoto distance between two strings.

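Tanimoto distance is defined as \(-log_{2}sim_{Tanimoto}\). Note that the values in the examples below are \(log_{2}sim_{Tanimoto}\), without the leading minus sign, so they are never positive and reach -inf when the two strings share no q-grams.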

Parameters:
  • src (str) – source string (or QGrams/Counter objects) for comparison
  • tar (str) – target string (or QGrams/Counter objects) for comparison
  • qval (int) – the length of each q-gram; 0 for non-q-gram version
Returns:

Tanimoto distance

Return type:

float

>>> tanimoto('cat', 'hat')
-1.5849625007211563
>>> tanimoto('Niall', 'Neil')
-2.1699250014423126
>>> tanimoto('aluminum', 'Catalan')
-4.0
>>> tanimoto('ATCG', 'TAGC')
-inf
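
The example values above equal the base-2 logarithm of the Tanimoto similarity, which the following sketch reproduces. The names `padded_qgrams` and `tanimoto_log` are hypothetical, and the `$`/`#` padding of q-grams is an assumption inferred from the example values rather than a documented behavior.

```python
from collections import Counter
from math import log2


def padded_qgrams(text, qval=2):
    """Return a multiset of q-grams, padded with '$' and '#' (assumption)."""
    padded = '$' * (qval - 1) + text + '#' * (qval - 1)
    return Counter(padded[i:i + qval] for i in range(len(padded) - qval + 1))


def tanimoto_log(src, tar, qval=2):
    """Return log2 of |X & Y| / |X | Y| over q-gram multisets."""
    x, y = padded_qgrams(src, qval), padded_qgrams(tar, qval)
    common = sum((x & y).values())
    union = sum(x.values()) + sum(y.values()) - common
    if not common or not union:
        # No shared q-grams: the logarithm of 0 is taken to be -inf.
        return float('-inf')
    return log2(common / union)


print(tanimoto_log('cat', 'hat'))
print(tanimoto_log('ATCG', 'TAGC'))
```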
abydos.distance.sim_tanimoto(src, tar, qval=2)[source]

Return the Tanimoto similarity of two strings.

For two sets X and Y, the Tanimoto similarity coefficient [Tan58] is \(sim_{Tanimoto}(X, Y) = \\frac{|X \\cap Y|}{|X \\cup Y|}\).

This is identical to the Jaccard similarity coefficient [Jac01] and the Tversky index [Tve77] for \(\\alpha = \\beta = 1\).

Parameters:
  • src (str) – source string (or QGrams/Counter objects) for comparison
  • tar (str) – target string (or QGrams/Counter objects) for comparison
  • qval (int) – the length of each q-gram; 0 for non-q-gram version
Returns:

Tanimoto similarity

Return type:

float

>>> sim_tanimoto('cat', 'hat')
0.3333333333333333
>>> sim_tanimoto('Niall', 'Neil')
0.2222222222222222
>>> sim_tanimoto('aluminum', 'Catalan')
0.0625
>>> sim_tanimoto('ATCG', 'TAGC')
0.0
abydos.distance.dist_cosine(src, tar, qval=2)[source]

Return the cosine distance between two strings.

Cosine distance is the complement of cosine similarity: \(dist_{cosine} = 1 - sim_{cosine}\).

Parameters:
  • src (str) – source string (or QGrams/Counter objects) for comparison
  • tar (str) – target string (or QGrams/Counter objects) for comparison
  • qval (int) – the length of each q-gram; 0 for non-q-gram version
Returns:

cosine distance

Return type:

float

>>> dist_cosine('cat', 'hat')
0.5
>>> dist_cosine('Niall', 'Neil')
0.6348516283298893
>>> dist_cosine('aluminum', 'Catalan')
0.882148869802242
>>> dist_cosine('ATCG', 'TAGC')
1.0
abydos.distance.sim_cosine(src, tar, qval=2)[source]

Return the cosine similarity of two strings.

For two sets X and Y, the cosine similarity, Otsuka-Ochiai coefficient, or Ochiai coefficient [Ots36][Och57] is: \(sim_{cosine}(X, Y) = \\frac{|X \\cap Y|}{\\sqrt{|X| \\cdot |Y|}}\).

Parameters:
  • src (str) – source string (or QGrams/Counter objects) for comparison
  • tar (str) – target string (or QGrams/Counter objects) for comparison
  • qval (int) – the length of each q-gram; 0 for non-q-gram version
Returns:

cosine similarity

Return type:

float

>>> sim_cosine('cat', 'hat')
0.5
>>> sim_cosine('Niall', 'Neil')
0.3651483716701107
>>> sim_cosine('aluminum', 'Catalan')
0.11785113019775793
>>> sim_cosine('ATCG', 'TAGC')
0.0
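
The overlap and cosine coefficients differ only in how the intersection size is normalized, which a short sketch can make concrete. The names `padded_qgrams` and `overlap_and_cosine` are hypothetical, and the `$`/`#` padding is an assumption inferred from the example values; the library's own implementation may differ.

```python
from collections import Counter
from math import sqrt


def padded_qgrams(text, qval=2):
    """Return a multiset of q-grams, padded with '$' and '#' (assumption)."""
    padded = '$' * (qval - 1) + text + '#' * (qval - 1)
    return Counter(padded[i:i + qval] for i in range(len(padded) - qval + 1))


def overlap_and_cosine(src, tar, qval=2):
    """Return the (overlap, cosine) coefficients of two strings."""
    x, y = padded_qgrams(src, qval), padded_qgrams(tar, qval)
    size_x, size_y = sum(x.values()), sum(y.values())
    if not size_x or not size_y:
        return 0.0, 0.0
    common = sum((x & y).values())
    overlap = common / min(size_x, size_y)    # |X & Y| / min(|X|, |Y|)
    cosine = common / sqrt(size_x * size_y)   # |X & Y| / sqrt(|X| * |Y|)
    return overlap, cosine


print(overlap_and_cosine('Niall', 'Neil'))
```

Because the smaller set size is never larger than the geometric mean of the two sizes, the overlap coefficient is always at least as large as the cosine similarity.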
abydos.distance.bag(src, tar)[source]

Return the bag distance between two strings.

Bag distance is proposed in [BCP02]. It is defined as: \(max(|multiset(src)-multiset(tar)|, |multiset(tar)-multiset(src)|)\).

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
Returns:

bag distance

Return type:

int

>>> bag('cat', 'hat')
1
>>> bag('Niall', 'Neil')
2
>>> bag('aluminum', 'Catalan')
5
>>> bag('ATCG', 'TAGC')
0
>>> bag('abcdefg', 'hijklm')
7
>>> bag('abcdefg', 'hijklmno')
8
abydos.distance.dist_bag(src, tar)[source]

Return the normalized bag distance between two strings.

Bag distance is normalized by dividing by \(max( |src|, |tar| )\).

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
Returns:

normalized bag distance

Return type:

float

>>> dist_bag('cat', 'hat')
0.3333333333333333
>>> dist_bag('Niall', 'Neil')
0.4
>>> dist_bag('aluminum', 'Catalan')
0.625
>>> dist_bag('ATCG', 'TAGC')
0.0
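
Bag distance and its normalization, as defined above, can be sketched with `collections.Counter` acting as the multiset. The function names `bag_distance` and `normalized_bag_distance` are hypothetical and the code is an illustration, not the library's implementation.

```python
from collections import Counter


def bag_distance(src, tar):
    """Return max(|bag(src) - bag(tar)|, |bag(tar) - bag(src)|)."""
    src_bag, tar_bag = Counter(src), Counter(tar)
    # Counter subtraction keeps only positive counts, i.e. multiset difference.
    only_src = sum((src_bag - tar_bag).values())
    only_tar = sum((tar_bag - src_bag).values())
    return max(only_src, only_tar)


def normalized_bag_distance(src, tar):
    """Return the bag distance divided by the length of the longer string."""
    longest = max(len(src), len(tar))
    return bag_distance(src, tar) / longest if longest else 0.0


print(bag_distance('aluminum', 'Catalan'))
print(normalized_bag_distance('aluminum', 'Catalan'))
```

Since only character counts are compared, anagrams such as 'ATCG' and 'TAGC' have a bag distance of 0.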
abydos.distance.sim_bag(src, tar)[source]

Return the normalized bag similarity of two strings.

Normalized bag similarity is the complement of normalized bag distance: \(sim_{bag} = 1 - dist_{bag}\).

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
Returns:

normalized bag similarity

Return type:

float

>>> round(sim_bag('cat', 'hat'), 12)
0.666666666667
>>> sim_bag('Niall', 'Neil')
0.6
>>> sim_bag('aluminum', 'Catalan')
0.375
>>> sim_bag('ATCG', 'TAGC')
1.0
abydos.distance.dist_monge_elkan(src, tar, sim_func=<function sim_levenshtein>, symmetric=False)[source]

Return the Monge-Elkan distance between two strings.

Monge-Elkan distance is the complement of Monge-Elkan similarity: \(dist_{Monge-Elkan} = 1 - sim_{Monge-Elkan}\).

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
  • sim_func (function) – the internal similarity metric to employ
  • symmetric (bool) – return a symmetric similarity measure
Returns:

Monge-Elkan distance

Return type:

float

>>> dist_monge_elkan('cat', 'hat')
0.25
>>> round(dist_monge_elkan('Niall', 'Neil'), 12)
0.333333333333
>>> round(dist_monge_elkan('aluminum', 'Catalan'), 12)
0.611111111111
>>> dist_monge_elkan('ATCG', 'TAGC')
0.5
abydos.distance.sim_monge_elkan(src, tar, sim_func=<function sim_levenshtein>, symmetric=False)[source]

Return the Monge-Elkan similarity of two strings.

Monge-Elkan is defined in [ME96].

Note: Monge-Elkan is NOT a symmetric similarity algorithm. Thus, the similarity of src to tar is not necessarily equal to the similarity of tar to src. If the symmetric argument is True, a symmetric value is calculated, at the cost of doubling the computation time (since \(sim_{Monge-Elkan}(src, tar)\) and \(sim_{Monge-Elkan}(tar, src)\) are both calculated and then averaged).

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
  • sim_func (function) – the internal similarity metric to employ
  • symmetric (bool) – return a symmetric similarity measure
Returns:

Monge-Elkan similarity

Return type:

float

>>> sim_monge_elkan('cat', 'hat')
0.75
>>> round(sim_monge_elkan('Niall', 'Neil'), 12)
0.666666666667
>>> round(sim_monge_elkan('aluminum', 'Catalan'), 12)
0.388888888889
>>> sim_monge_elkan('ATCG', 'TAGC')
0.5
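
The asymmetry and the effect of the symmetric argument can be illustrated with a small sketch. It assumes the usual Monge-Elkan formulation (the mean, over the tokens of src, of the best similarity to any token of tar) and assumes that the tokens are 2-grams padded with `$` and `#`, a choice inferred from the example values above. All function names are hypothetical, and the library's implementation may differ.

```python
def padded_qgram_list(text, qval=2):
    """Return the padded q-grams of text as a list (assumed tokenization)."""
    padded = '$' * (qval - 1) + text + '#' * (qval - 1)
    return [padded[i:i + qval] for i in range(len(padded) - qval + 1)]


def sim_levenshtein_sketch(a, b):
    """Return 1 - levenshtein(a, b) / max(len(a), len(b))."""
    if a == b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        cur = [i]
        for j, char_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                       # deletion
                           cur[j - 1] + 1,                    # insertion
                           prev[j - 1] + (char_a != char_b)))  # substitution
        prev = cur
    return 1 - prev[-1] / max(len(a), len(b))


def monge_elkan_sketch(src, tar, sim_func=sim_levenshtein_sketch,
                       symmetric=False):
    """Return the mean best-match similarity of src tokens against tar."""
    if src == tar:
        return 1.0
    src_tokens, tar_tokens = padded_qgram_list(src), padded_qgram_list(tar)
    if not src_tokens or not tar_tokens:
        return 0.0
    total = sum(max(sim_func(s, t) for t in tar_tokens) for s in src_tokens)
    sim = total / len(src_tokens)
    if symmetric:
        # Average both directions, at the cost of a second full pass.
        sim = (sim + monge_elkan_sketch(tar, src, sim_func, False)) / 2
    return sim


print(monge_elkan_sketch('Niall', 'Neil'))
print(monge_elkan_sketch('Niall', 'Neil', symmetric=True))
```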
abydos.distance.needleman_wunsch(src, tar, gap_cost=1, sim_func=<function sim_ident>)[source]

Return the Needleman-Wunsch score of two strings.

The Needleman-Wunsch score [NW70] is a standard edit distance measure.

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
  • gap_cost (float) – the cost of an alignment gap (1 by default)
  • sim_func (function) – a function that returns the similarity of two characters (identity similarity by default)
Returns:

Needleman-Wunsch score

Return type:

float

>>> needleman_wunsch('cat', 'hat')
2.0
>>> needleman_wunsch('Niall', 'Neil')
1.0
>>> needleman_wunsch('aluminum', 'Catalan')
-1.0
>>> needleman_wunsch('ATCG', 'TAGC')
0.0
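
A minimal sketch of the global-alignment recurrence behind this score is shown below, assuming the textbook Needleman-Wunsch formulation with a linear gap cost. The name `needleman_wunsch_score` and the inline identity similarity are hypothetical stand-ins, not the library's code.

```python
def needleman_wunsch_score(src, tar, gap_cost=1, sim_func=None):
    """Return the best global alignment score of src and tar."""
    if sim_func is None:
        # Identity similarity: 1 for equal characters, otherwise 0.
        def sim_func(a, b):
            return 1.0 if a == b else 0.0

    # Row 0: aligning a prefix of tar against nothing costs one gap each.
    prev = [-j * gap_cost for j in range(len(tar) + 1)]
    for i, src_char in enumerate(src, 1):
        cur = [-i * gap_cost]
        for j, tar_char in enumerate(tar, 1):
            cur.append(max(
                prev[j - 1] + sim_func(src_char, tar_char),  # align the pair
                prev[j] - gap_cost,                          # gap in tar
                cur[j - 1] - gap_cost,                       # gap in src
            ))
        prev = cur
    return float(prev[-1])


print(needleman_wunsch_score('Niall', 'Neil'))
```

Unlike the sim_X and dist_X functions, the score is not normalized and can be negative when gaps outweigh matches.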
abydos.distance.smith_waterman(src, tar, gap_cost=1, sim_func=<function sim_ident>)[source]

Return the Smith-Waterman score of two strings.

The Smith-Waterman score [SW81] is a standard edit distance measure, differing from Needleman-Wunsch in that it focuses on local alignment and disallows negative scores.

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
  • gap_cost (float) – the cost of an alignment gap (1 by default)
  • sim_func (function) – a function that returns the similarity of two characters (identity similarity by default)
Returns:

Smith-Waterman score

Return type:

float

>>> smith_waterman('cat', 'hat')
2.0
>>> smith_waterman('Niall', 'Neil')
1.0
>>> smith_waterman('aluminum', 'Catalan')
0.0
>>> smith_waterman('ATCG', 'TAGC')
1.0
abydos.distance.gotoh(src, tar, gap_open=1, gap_ext=0.4, sim_func=<function sim_ident>)[source]

Return the Gotoh score of two strings.

The Gotoh score [Got82] is essentially Needleman-Wunsch with affine gap penalties.

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
  • gap_open (float) – the cost of an open alignment gap (1 by default)
  • gap_ext (float) – the cost of an alignment gap extension (0.4 by default)
  • sim_func (function) – a function that returns the similarity of two characters (identity similarity by default)
Returns:

Gotoh score

Return type:

float

>>> gotoh('cat', 'hat')
2.0
>>> gotoh('Niall', 'Neil')
1.0
>>> round(gotoh('aluminum', 'Catalan'), 12)
-0.4
abydos.distance.sim_matrix(src, tar, mat=None, mismatch_cost=0, match_cost=1, symmetric=True, alphabet=None)[source]

Return the matrix similarity of two strings.

With the default parameters, this is identical to sim_ident. It is possible for sim_matrix to return values outside of the range \([0, 1]\), if values outside that range are present in mat, mismatch_cost, or match_cost.

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
  • mat (dict) – a dict mapping tuples to costs; the tuples are (src, tar) pairs of symbols from the alphabet parameter
  • mismatch_cost (float) – the value returned if (src, tar) is absent from mat when src does not equal tar
  • match_cost (float) – the value returned if (src, tar) is absent from mat when src equals tar
  • symmetric (bool) – True if the cost of src not matching tar is identical to the cost of tar not matching src; in this case, the values in mat need only contain (src, tar) or (tar, src), not both
  • alphabet (str) – a collection of tokens from which src and tar are drawn; if this is defined a ValueError is raised if either tar or src is not found in alphabet
Returns:

matrix similarity

Return type:

float

>>> sim_matrix('cat', 'hat')
0
>>> sim_matrix('hat', 'hat')
1
abydos.distance.lcsseq(src, tar)[source]

Return the longest common subsequence of two strings.

Longest common subsequence (LCSseq) is the longest subsequence of characters that two strings have in common.

Based on the dynamic programming algorithm from http://rosettacode.org/wiki/Longest_common_subsequence#Dynamic_Programming_6 [Cod18a]. This is licensed GFDL 1.2.

Modifications include:

  • conversion to a numpy array in place of a list of lists
Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
Returns:

the longest common subsequence

Return type:

str

>>> lcsseq('cat', 'hat')
'at'
>>> lcsseq('Niall', 'Neil')
'Nil'
>>> lcsseq('aluminum', 'Catalan')
'aln'
>>> lcsseq('ATCG', 'TAGC')
'AC'
abydos.distance.dist_lcsseq(src, tar)[source]

Return the longest common subsequence distance between two strings.

Longest common subsequence distance (\(dist_{LCSseq}\)).

This employs the LCSseq function to derive a distance metric: \(dist_{LCSseq}(s,t) = 1 - sim_{LCSseq}(s,t)\)

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
Returns:

LCSseq distance

Return type:

float

>>> dist_lcsseq('cat', 'hat')
0.33333333333333337
>>> dist_lcsseq('Niall', 'Neil')
0.4
>>> dist_lcsseq('aluminum', 'Catalan')
0.625
>>> dist_lcsseq('ATCG', 'TAGC')
0.5
abydos.distance.sim_lcsseq(src, tar)[source]

Return the longest common subsequence similarity of two strings.

Longest common subsequence similarity (\(sim_{LCSseq}\)).

This employs the LCSseq function to derive a similarity metric: \(sim_{LCSseq}(s,t) = \\frac{|LCSseq(s,t)|}{max(|s|, |t|)}\)

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
Returns:

LCSseq similarity

Return type:

float

>>> sim_lcsseq('cat', 'hat')
0.6666666666666666
>>> sim_lcsseq('Niall', 'Neil')
0.6
>>> sim_lcsseq('aluminum', 'Catalan')
0.375
>>> sim_lcsseq('ATCG', 'TAGC')
0.5
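
The relationship between the longest common subsequence and the similarity derived from it can be sketched with a plain list-of-lists dynamic programming table. The names `lcsseq_sketch` and `sim_lcsseq_sketch` are hypothetical. When several subsequences share the maximal length, the one returned depends on tie-breaking and may differ from the library's result.

```python
def lcsseq_sketch(src, tar):
    """Return one longest common subsequence of src and tar."""
    rows, cols = len(src), len(tar)
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i, src_char in enumerate(src, 1):
        for j, tar_char in enumerate(tar, 1):
            if src_char == tar_char:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back from the bottom-right corner to recover the characters.
    result = []
    i, j = rows, cols
    while i and j:
        if src[i - 1] == tar[j - 1]:
            result.append(src[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return ''.join(reversed(result))


def sim_lcsseq_sketch(src, tar):
    """Return |LCSseq(src, tar)| / max(|src|, |tar|)."""
    if src == tar:
        return 1.0
    if not src or not tar:
        return 0.0
    return len(lcsseq_sketch(src, tar)) / max(len(src), len(tar))


print(lcsseq_sketch('aluminum', 'Catalan'))
print(sim_lcsseq_sketch('aluminum', 'Catalan'))
```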
abydos.distance.lcsstr(src, tar)[source]

Return the longest common substring of two strings.

Longest common substring (LCSstr).

Based on the code from https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Longest_common_substring#Python [Wik18]. This is licensed Creative Commons: Attribution-ShareAlike 3.0.

Modifications include:

  • conversion to a numpy array in place of a list of lists
  • conversion to Python 2/3-safe range from xrange via six
Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
Returns:

the longest common substring

Return type:

str

>>> lcsstr('cat', 'hat')
'at'
>>> lcsstr('Niall', 'Neil')
'N'
>>> lcsstr('aluminum', 'Catalan')
'al'
>>> lcsstr('ATCG', 'TAGC')
'A'
abydos.distance.dist_lcsstr(src, tar)[source]

Return the longest common substring distance between two strings.

Longest common substring distance (\(dist_{LCSstr}\)).

This employs the LCSstr function to derive a distance metric: \(dist_{LCSstr}(s,t) = 1 - sim_{LCSstr}(s,t)\)

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
Returns:

LCSstr distance

Return type:

float

>>> dist_lcsstr('cat', 'hat')
0.33333333333333337
>>> dist_lcsstr('Niall', 'Neil')
0.8
>>> dist_lcsstr('aluminum', 'Catalan')
0.75
>>> dist_lcsstr('ATCG', 'TAGC')
0.75
abydos.distance.sim_lcsstr(src, tar)[source]

Return the longest common substring similarity of two strings.

Longest common substring similarity (\(sim_{LCSstr}\)).

This employs the LCSstr function to derive a similarity metric: \(sim_{LCSstr}(s,t) = \\frac{|LCSstr(s,t)|}{max(|s|, |t|)}\)

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
Returns:

LCSstr similarity

Return type:

float

>>> sim_lcsstr('cat', 'hat')
0.6666666666666666
>>> sim_lcsstr('Niall', 'Neil')
0.2
>>> sim_lcsstr('aluminum', 'Catalan')
0.25
>>> sim_lcsstr('ATCG', 'TAGC')
0.25
abydos.distance.dist_ratcliff_obershelp(src, tar)[source]

Return the Ratcliff-Obershelp distance between two strings.

Ratcliff-Obershelp distance is the complement of Ratcliff-Obershelp similarity: \(dist_{Ratcliff-Obershelp} = 1 - sim_{Ratcliff-Obershelp}\).

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
Returns:

Ratcliff-Obershelp distance

Return type:

float

>>> round(dist_ratcliff_obershelp('cat', 'hat'), 12)
0.333333333333
>>> round(dist_ratcliff_obershelp('Niall', 'Neil'), 12)
0.333333333333
>>> round(dist_ratcliff_obershelp('aluminum', 'Catalan'), 12)
0.6
>>> dist_ratcliff_obershelp('ATCG', 'TAGC')
0.5
abydos.distance.sim_ratcliff_obershelp(src, tar)[source]

Return the Ratcliff-Obershelp similarity of two strings.

This follows the Ratcliff-Obershelp algorithm [RM88] to derive a similarity measure:

  1. Find the length of the longest common substring in src & tar.
  2. Recurse on the strings to the left & right of this substring in src & tar. The base case is a 0 length common substring, in which case, return 0. Otherwise, return the sum of the current longest common substring and the left & right recursed sums.
  3. Multiply this length by 2 and divide by the sum of the lengths of src & tar.

Cf. http://www.drdobbs.com/database/pattern-matching-the-gestalt-approach/184407970

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
Returns:

Ratcliff-Obershelp similarity

Return type:

float

>>> round(sim_ratcliff_obershelp('cat', 'hat'), 12)
0.666666666667
>>> round(sim_ratcliff_obershelp('Niall', 'Neil'), 12)
0.666666666667
>>> round(sim_ratcliff_obershelp('aluminum', 'Catalan'), 12)
0.4
>>> sim_ratcliff_obershelp('ATCG', 'TAGC')
0.5
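
The three steps listed above can be sketched as a short recursion. The helper names are hypothetical, and when several common substrings share the maximal length this sketch simply takes the first one it finds, which is an assumption about tie-breaking rather than documented behavior.

```python
def _longest_common_substring(src, tar):
    """Return (src_start, tar_start, length) of a longest common substring."""
    best = (0, 0, 0)
    prev = [0] * (len(tar) + 1)
    for i in range(1, len(src) + 1):
        cur = [0] * (len(tar) + 1)
        for j in range(1, len(tar) + 1):
            if src[i - 1] == tar[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[2]:
                    best = (i - cur[j], j - cur[j], cur[j])
        prev = cur
    return best


def _matched_length(src, tar):
    """Steps 1 and 2: sum common substring lengths recursively."""
    src_start, tar_start, length = _longest_common_substring(src, tar)
    if length == 0:
        return 0  # base case: nothing in common
    return (length
            + _matched_length(src[:src_start], tar[:tar_start])
            + _matched_length(src[src_start + length:],
                              tar[tar_start + length:]))


def sim_ratcliff_obershelp_sketch(src, tar):
    """Step 3: twice the matched length over the summed string lengths."""
    if src == tar:
        return 1.0
    if not src or not tar:
        return 0.0
    return 2 * _matched_length(src, tar) / (len(src) + len(tar))


print(sim_ratcliff_obershelp_sketch('aluminum', 'Catalan'))
```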
abydos.distance.dist_ident(src, tar)[source]

Return the identity distance between two strings.

This is 0 if the two strings are identical, otherwise 1, i.e. \(dist_{identity} = 1 - sim_{identity}\).

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
Returns:

identity distance

Return type:

int

>>> dist_ident('cat', 'hat')
1
>>> dist_ident('cat', 'cat')
0
abydos.distance.sim_ident(src, tar)[source]

Return the identity similarity of two strings.

Identity similarity is 1 if the two strings are identical, otherwise 0.

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
Returns:

identity similarity

Return type:

int

>>> sim_ident('cat', 'hat')
0
>>> sim_ident('cat', 'cat')
1
abydos.distance.dist_length(src, tar)[source]

Return the length distance between two strings.

Length distance is the complement of length similarity: \(dist_{length} = 1 - sim_{length}\).

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
Returns:

length distance

Return type:

float

>>> dist_length('cat', 'hat')
0.0
>>> dist_length('Niall', 'Neil')
0.19999999999999996
>>> dist_length('aluminum', 'Catalan')
0.125
>>> dist_length('ATCG', 'TAGC')
0.0
abydos.distance.sim_length(src, tar)[source]

Return the length similarity of two strings.

Length similarity is the ratio of the length of the shorter string to the longer.

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
Returns:

length similarity

Return type:

float

>>> sim_length('cat', 'hat')
1.0
>>> sim_length('Niall', 'Neil')
0.8
>>> sim_length('aluminum', 'Catalan')
0.875
>>> sim_length('ATCG', 'TAGC')
1.0
abydos.distance.dist_prefix(src, tar)[source]

Return the prefix distance between two strings.

Prefix distance is the complement of prefix similarity: \(dist_{prefix} = 1 - sim_{prefix}\).

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
Returns:

prefix distance

Return type:

float

>>> dist_prefix('cat', 'hat')
1.0
>>> dist_prefix('Niall', 'Neil')
0.75
>>> dist_prefix('aluminum', 'Catalan')
1.0
>>> dist_prefix('ATCG', 'TAGC')
1.0
abydos.distance.sim_prefix(src, tar)[source]

Return the prefix similarity of two strings.

Prefix similarity is the length of the longest common prefix of the two terms divided by the length of the shorter term.

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
Returns:

prefix similarity

Return type:

float

>>> sim_prefix('cat', 'hat')
0.0
>>> sim_prefix('Niall', 'Neil')
0.25
>>> sim_prefix('aluminum', 'Catalan')
0.0
>>> sim_prefix('ATCG', 'TAGC')
0.0
abydos.distance.dist_suffix(src, tar)[source]

Return the suffix distance between two strings.

Suffix distance is the complement of suffix similarity: \(dist_{suffix} = 1 - sim_{suffix}\).

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
Returns:

suffix distance

Return type:

float

>>> dist_suffix('cat', 'hat')
0.33333333333333337
>>> dist_suffix('Niall', 'Neil')
0.75
>>> dist_suffix('aluminum', 'Catalan')
1.0
>>> dist_suffix('ATCG', 'TAGC')
1.0
abydos.distance.sim_suffix(src, tar)[source]

Return the suffix similarity of two strings.

Suffix similarity is the length of the longest common suffix of the two terms divided by the length of the shorter term.

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
Returns:

suffix similarity

Return type:

float

>>> sim_suffix('cat', 'hat')
0.6666666666666666
>>> sim_suffix('Niall', 'Neil')
0.25
>>> sim_suffix('aluminum', 'Catalan')
0.0
>>> sim_suffix('ATCG', 'TAGC')
0.0
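
Prefix and suffix similarity, as described above, can be sketched in a few lines. The names `sim_prefix_sketch` and `sim_suffix_sketch` are hypothetical, and the handling of empty strings is an assumption.

```python
def sim_prefix_sketch(src, tar):
    """Return the common prefix length over the shorter string's length."""
    if src == tar:
        return 1.0
    if not src or not tar:
        return 0.0  # assumption: an empty string matches nothing
    matched = 0
    for src_char, tar_char in zip(src, tar):
        if src_char != tar_char:
            break
        matched += 1
    return matched / min(len(src), len(tar))


def sim_suffix_sketch(src, tar):
    """Return the common suffix length over the shorter string's length."""
    # A common suffix is a common prefix of the reversed strings.
    return sim_prefix_sketch(src[::-1], tar[::-1])


print(sim_prefix_sketch('Niall', 'Neil'))
print(sim_suffix_sketch('cat', 'hat'))
```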
abydos.distance.dist_ncd_zlib(src, tar)[source]

Return the NCD between two strings using zlib compression.

Normalized compression distance (NCD) [CV05].

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
Returns:

compression distance

Return type:

float

>>> dist_ncd_zlib('cat', 'hat')
0.3333333333333333
>>> dist_ncd_zlib('Niall', 'Neil')
0.45454545454545453
>>> dist_ncd_zlib('aluminum', 'Catalan')
0.5714285714285714
>>> dist_ncd_zlib('ATCG', 'TAGC')
0.4
abydos.distance.sim_ncd_zlib(src, tar)[source]

Return the NCD similarity between two strings using zlib compression.

Normalized compression distance (NCD) [CV05].

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
Returns:

compression similarity

Return type:

float

>>> sim_ncd_zlib('cat', 'hat')
0.6666666666666667
>>> sim_ncd_zlib('Niall', 'Neil')
0.5454545454545454
>>> sim_ncd_zlib('aluminum', 'Catalan')
0.4285714285714286
>>> sim_ncd_zlib('ATCG', 'TAGC')
0.6
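
As a rough illustration of the idea behind NCD, the sketch below applies the usual definition from [CV05], \(NCD(x, y) = \\frac{C(xy) - min(C(x), C(y))}{max(C(x), C(y))}\), where C gives the compressed length, using Python's standard zlib module. The function names are hypothetical. The values it produces are not expected to match the examples above, since the library may treat compression headers or concatenation order differently.

```python
import zlib


def ncd_zlib_sketch(src, tar):
    """Return (C(xy) - min(C(x), C(y))) / max(C(x), C(y)) using zlib."""
    x, y = src.encode('utf-8'), tar.encode('utf-8')
    c_x = len(zlib.compress(x))
    c_y = len(zlib.compress(y))
    c_xy = len(zlib.compress(x + y))  # compressed size of the concatenation
    return (c_xy - min(c_x, c_y)) / max(c_x, c_y)


def sim_ncd_zlib_sketch(src, tar):
    """Return the complement of the sketched NCD."""
    return 1 - ncd_zlib_sketch(src, tar)


print(ncd_zlib_sketch('cat', 'hat'))
```

The intuition is that if tar adds little new information to src, compressing the two together costs barely more than compressing the larger one alone, so the distance stays small.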
abydos.distance.dist_ncd_bz2(src, tar)[source]

Return the NCD between two strings using bz2 compression.

Normalized compression distance (NCD) [CV05].

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
Returns:

compression distance

Return type:

float

>>> dist_ncd_bz2('cat', 'hat')
0.08
>>> dist_ncd_bz2('Niall', 'Neil')
0.037037037037037035
>>> dist_ncd_bz2('aluminum', 'Catalan')
0.20689655172413793
>>> dist_ncd_bz2('ATCG', 'TAGC')
0.037037037037037035
abydos.distance.sim_ncd_bz2(src, tar)[source]

Return the NCD similarity between two strings using bz2 compression.

Normalized compression distance (NCD) [CV05].

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
Returns:

compression similarity

Return type:

float

>>> sim_ncd_bz2('cat', 'hat')
0.92
>>> sim_ncd_bz2('Niall', 'Neil')
0.962962962962963
>>> sim_ncd_bz2('aluminum', 'Catalan')
0.7931034482758621
>>> sim_ncd_bz2('ATCG', 'TAGC')
0.962962962962963
abydos.distance.dist_ncd_lzma(src, tar)[source]

Return the NCD between two strings using lzma compression.

Normalized compression distance (NCD) [CV05].

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
Returns:

compression distance

Return type:

float

>>> dist_ncd_lzma('cat', 'hat')
0.08695652173913043
>>> dist_ncd_lzma('Niall', 'Neil')
0.16
>>> dist_ncd_lzma('aluminum', 'Catalan')
0.16
>>> dist_ncd_lzma('ATCG', 'TAGC')
0.08695652173913043
abydos.distance.sim_ncd_lzma(src, tar)[source]

Return the NCD similarity between two strings using lzma compression.

Normalized compression distance (NCD) [CV05].

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
Returns:

compression similarity

Return type:

float

>>> sim_ncd_lzma('cat', 'hat')
0.9130434782608696
>>> sim_ncd_lzma('Niall', 'Neil')
0.84
>>> sim_ncd_lzma('aluminum', 'Catalan')
0.84
>>> sim_ncd_lzma('ATCG', 'TAGC')
0.9130434782608696
abydos.distance.dist_ncd_bwtrle(src, tar)[source]

Return the NCD between two strings using BWT plus RLE.

Normalized compression distance (NCD) [CV05].

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
Returns:

compression distance

Return type:

float

>>> dist_ncd_bwtrle('cat', 'hat')
0.75
>>> dist_ncd_bwtrle('Niall', 'Neil')
0.8333333333333334
>>> dist_ncd_bwtrle('aluminum', 'Catalan')
1.0
>>> dist_ncd_bwtrle('ATCG', 'TAGC')
0.8
abydos.distance.sim_ncd_bwtrle(src, tar)[source]

Return the NCD similarity between two strings using BWT plus RLE.

Normalized compression distance (NCD) [CV05].

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
Returns:

compression similarity

Return type:

float

>>> sim_ncd_bwtrle('cat', 'hat')
0.25
>>> sim_ncd_bwtrle('Niall', 'Neil')
0.16666666666666663
>>> sim_ncd_bwtrle('aluminum', 'Catalan')
0.0
>>> sim_ncd_bwtrle('ATCG', 'TAGC')
0.19999999999999996
abydos.distance.dist_ncd_rle(src, tar, use_bwt=False)[source]

Return the NCD between two strings using RLE.

Normalized compression distance (NCD) [CV05].

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
  • use_bwt (bool) – boolean indicating whether to perform BWT encoding before RLE encoding
Returns:

compression distance

Return type:

float

>>> dist_ncd_rle('cat', 'hat')
1.0
>>> dist_ncd_rle('Niall', 'Neil')
1.0
>>> dist_ncd_rle('aluminum', 'Catalan')
1.0
>>> dist_ncd_rle('ATCG', 'TAGC')
1.0
abydos.distance.sim_ncd_rle(src, tar, use_bwt=False)[source]

Return the NCD similarity between two strings using RLE.

Normalized compression distance (NCD) [CV05].

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
  • use_bwt (bool) – boolean indicating whether to perform BWT encoding before RLE encoding
Returns:

compression similarity

Return type:

float

>>> sim_ncd_rle('cat', 'hat')
0.0
>>> sim_ncd_rle('Niall', 'Neil')
0.0
>>> sim_ncd_rle('aluminum', 'Catalan')
0.0
>>> sim_ncd_rle('ATCG', 'TAGC')
0.0
abydos.distance.dist_ncd_arith(src, tar, probs=None)[source]

Return the NCD between two strings using arithmetic coding.

Normalized compression distance (NCD) [CV05].

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
  • probs (dict) – a dictionary trained with arithmetic.train (for the arith compressor only)
Returns:

compression distance

Return type:

float

>>> dist_ncd_arith('cat', 'hat')
0.5454545454545454
>>> dist_ncd_arith('Niall', 'Neil')
0.6875
>>> dist_ncd_arith('aluminum', 'Catalan')
0.8275862068965517
>>> dist_ncd_arith('ATCG', 'TAGC')
0.6923076923076923
abydos.distance.sim_ncd_arith(src, tar, probs=None)[source]

Return the NCD similarity between two strings using arithmetic coding.

Normalized compression distance (NCD) [CV05].

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
  • probs (dict) – a dictionary trained with ac_train (for the arith compressor only)
Returns:

compression similarity

Return type:

float

>>> sim_ncd_arith('cat', 'hat')
0.4545454545454546
>>> sim_ncd_arith('Niall', 'Neil')
0.3125
>>> sim_ncd_arith('aluminum', 'Catalan')
0.1724137931034483
>>> sim_ncd_arith('ATCG', 'TAGC')
0.3076923076923077
abydos.distance.mra_compare(src, tar)[source]

Return the MRA comparison rating of two strings.

The Western Airlines Surname Match Rating Algorithm comparison rating, as presented on page 18 of [MKTM77].

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
Returns:

MRA comparison rating

Return type:

int

>>> mra_compare('cat', 'hat')
5
>>> mra_compare('Niall', 'Neil')
6
>>> mra_compare('aluminum', 'Catalan')
0
>>> mra_compare('ATCG', 'TAGC')
5
abydos.distance.dist_mra(src, tar)[source]

Return the normalized MRA distance between two strings.

MRA distance is the complement of MRA similarity: \(dist_{MRA} = 1 - sim_{MRA}\).

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
Returns:

normalized MRA distance

Return type:

float

>>> dist_mra('cat', 'hat')
0.16666666666666663
>>> dist_mra('Niall', 'Neil')
0.0
>>> dist_mra('aluminum', 'Catalan')
1.0
>>> dist_mra('ATCG', 'TAGC')
0.16666666666666663
abydos.distance.sim_mra(src, tar)[source]

Return the normalized MRA similarity of two strings.

This is the MRA normalized to \([0, 1]\), given that MRA itself is constrained to the range \([0, 6]\).

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
Returns:

normalized MRA similarity

Return type:

float

>>> sim_mra('cat', 'hat')
0.8333333333333334
>>> sim_mra('Niall', 'Neil')
1.0
>>> sim_mra('aluminum', 'Catalan')
0.0
>>> sim_mra('ATCG', 'TAGC')
0.8333333333333334
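
Because the MRA comparison rating lies in [0, 6], the normalization amounts to a division by 6. A minimal sketch, assuming the rating has already been obtained from mra_compare; the helper name is hypothetical:

```python
def normalize_mra(rating: int) -> float:
    """Map an MRA comparison rating in [0, 6] onto [0, 1]."""
    if not 0 <= rating <= 6:
        raise ValueError('MRA ratings are expected to lie in [0, 6]')
    return rating / 6


similarity = normalize_mra(5)   # rating reported above for 'cat' vs. 'hat'
distance = 1 - similarity       # the complementary distance
```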
abydos.distance.editex(src, tar, cost=(0, 1, 2), local=False)[source]

Return the Editex distance between two strings.

As described on pages 3 & 4 of [ZD96].

The local variant is based on [RU09].

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
  • cost (tuple) – a 3-tuple representing the cost of the three possible edits: match, same-group, and mismatch respectively (by default: (0, 1, 2))
  • local (bool) – if True, the local variant of Editex is used
Returns:

Editex distance

Return type:

int

>>> editex('cat', 'hat')
2
>>> editex('Niall', 'Neil')
2
>>> editex('aluminum', 'Catalan')
12
>>> editex('ATCG', 'TAGC')
6
abydos.distance.dist_editex(src, tar, cost=(0, 1, 2), local=False)[source]

Return the normalized Editex distance between two strings.

The Editex distance is normalized by dividing it by the largest distance possible for the two strings: the length of the longer of src & tar times the mismatch cost. With the default cost of (0, 1, 2), this is twice the length of the longer string.

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
  • cost (tuple) – a 3-tuple representing the cost of the three possible edits: match, same-group, and mismatch respectively (by default: (0, 1, 2))
  • local (bool) – if True, the local variant of Editex is used
Returns:

normalized Editex distance

Return type:

float

>>> round(dist_editex('cat', 'hat'), 12)
0.333333333333
>>> round(dist_editex('Niall', 'Neil'), 12)
0.2
>>> dist_editex('aluminum', 'Catalan')
0.75
>>> dist_editex('ATCG', 'TAGC')
0.75
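
The doctest values above are consistent with dividing the raw Editex distance by the length of the longer string times the mismatch cost. A minimal sketch under that assumption; normalize_editex is a hypothetical helper, and the raw distances are the ones listed for editex:

```python
def normalize_editex(distance, src, tar, cost=(0, 1, 2)):
    """Scale a raw Editex distance to [0, 1] (assumed normalization)."""
    _match_cost, _group_cost, mismatch_cost = cost
    longest = max(len(src), len(tar))
    if longest == 0 or mismatch_cost == 0:
        return 0.0  # nothing to compare, or mismatches are free
    return distance / (longest * mismatch_cost)


# Raw distances taken from the editex examples
cat_hat = normalize_editex(2, 'cat', 'hat')                   # 2 / (3 * 2)
aluminum_catalan = normalize_editex(12, 'aluminum', 'Catalan')  # 12 / (8 * 2)
```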
abydos.distance.sim_editex(src, tar, cost=(0, 1, 2), local=False)[source]

Return the normalized Editex similarity of two strings.

The Editex similarity is the complement of Editex distance: \(sim_{Editex} = 1 - dist_{Editex}\).

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
  • cost (tuple) – a 3-tuple representing the cost of the three possible edits: match, same-group, and mismatch respectively (by default: (0, 1, 2))
  • local (bool) – if True, the local variant of Editex is used
Returns:

normalized Editex similarity

Return type:

float

>>> round(sim_editex('cat', 'hat'), 12)
0.666666666667
>>> round(sim_editex('Niall', 'Neil'), 12)
0.8
>>> sim_editex('aluminum', 'Catalan')
0.25
>>> sim_editex('ATCG', 'TAGC')
0.25
abydos.distance.dist_mlipns(src, tar, threshold=0.25, max_mismatches=2)[source]

Return the MLIPNS distance between two strings.

MLIPNS distance is the complement of MLIPNS similarity: \(dist_{MLIPNS} = 1 - sim_{MLIPNS}\). This function returns only 0.0 (not distant) or 1.0 (distant).

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
  • threshold (float) – a number in [0, 1] indicating the maximum ratio of mismatches to string length at or below which the strings are considered ‘similar’ (0.25 by default)
  • max_mismatches (int) – a number indicating the allowable number of mismatches to remove before declaring two strings not similar (2 by default)
Returns:

MLIPNS distance

Return type:

float

>>> dist_mlipns('cat', 'hat')
0.0
>>> dist_mlipns('Niall', 'Neil')
1.0
>>> dist_mlipns('aluminum', 'Catalan')
1.0
>>> dist_mlipns('ATCG', 'TAGC')
1.0
abydos.distance.sim_mlipns(src, tar, threshold=0.25, max_mismatches=2)[source]

Return the MLIPNS similarity of two strings.

Modified Language-Independent Product Name Search (MLIPNS) is described in [SA10]. This function returns only 1.0 (similar) or 0.0 (not similar). LIPNS similarity is identical to normalized Hamming similarity.

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
  • threshold (float) – a number in [0, 1] indicating the maximum ratio of mismatches to string length at or below which the strings are considered ‘similar’ (0.25 by default)
  • max_mismatches (int) – a number indicating the allowable number of mismatches to remove before declaring two strings not similar (2 by default)
Returns:

MLIPNS similarity

Return type:

float

>>> sim_mlipns('cat', 'hat')
1.0
>>> sim_mlipns('Niall', 'Neil')
0.0
>>> sim_mlipns('aluminum', 'Catalan')
0.0
>>> sim_mlipns('ATCG', 'TAGC')
0.0
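
One plausible reading of the threshold and max_mismatches parameters is sketched below: start from a Hamming count, then discard mismatches one at a time until the mismatch ratio falls to the threshold or the allowance runs out. This is an illustration, not the library source; both function names are hypothetical, and treating length differences as extra mismatches is an assumption.

```python
def hamming_padded(src, tar):
    """Count differing positions; each surplus trailing character counts as one."""
    diffs = sum(a != b for a, b in zip(src, tar))
    return diffs + abs(len(src) - len(tar))


def mlipns_sketch(src, tar, threshold=0.25, max_mismatches=2):
    """Return 1.0 if the strings are judged similar, else 0.0."""
    if src == tar:
        return 1.0
    if not src or not tar:
        return 0.0
    mismatches = hamming_padded(src, tar)
    length = max(len(src), len(tar))
    removed = 0
    while removed <= max_mismatches:
        # Similar once the remaining mismatch ratio is small enough
        if length < 1 or mismatches / length <= threshold:
            return 1.0
        # Otherwise drop one mismatching position and try again
        removed += 1
        mismatches -= 1
        length -= 1
    return 0.0
```

Under these assumptions the sketch agrees with the four sim_mlipns examples above.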
abydos.distance.dist_baystat(src, tar, min_ss_len=None, left_ext=None, right_ext=None)[source]

Return the Baystat distance.

Normalized Baystat similarity is the complement of normalized Baystat distance: \(sim_{Baystat} = 1 - dist_{Baystat}\).

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
  • min_ss_len (int) – minimum substring length to be considered
  • left_ext (int) – left-side extension length
  • right_ext (int) – right-side extension length
Returns:

the Baystat distance

Return type:

float

>>> round(dist_baystat('cat', 'hat'), 12)
0.333333333333
>>> dist_baystat('Niall', 'Neil')
0.6
>>> round(dist_baystat('Colin', 'Cuilen'), 12)
0.833333333333
>>> dist_baystat('ATCG', 'TAGC')
1.0
abydos.distance.sim_baystat(src, tar, min_ss_len=None, left_ext=None, right_ext=None)[source]

Return the Baystat similarity.

Good results for shorter words are reported when setting min_ss_len to 1 and either left_ext OR right_ext to 1.

The Baystat similarity is defined in [FurnrohrRvR02].

This is ostensibly a port of the R module PPRL’s implementation: https://github.com/cran/PPRL/blob/master/src/MTB_Baystat.cpp [Ruk18]. As such, this could be made more pythonic.

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
  • min_ss_len (int) – minimum substring length to be considered
  • left_ext (int) – left-side extension length
  • right_ext (int) – right-side extension length
Returns:

the Baystat similarity

Return type:

float

>>> round(sim_baystat('cat', 'hat'), 12)
0.666666666667
>>> sim_baystat('Niall', 'Neil')
0.4
>>> round(sim_baystat('Colin', 'Cuilen'), 12)
0.166666666667
>>> sim_baystat('ATCG', 'TAGC')
0.0
abydos.distance.eudex_hamming(src, tar, weights='exponential', max_length=8, normalized=False)[source]

Calculate the Hamming distance between the Eudex hashes of two terms.

Cf. [Tic].

  • If weights is set to None, a simple Hamming distance is calculated.
  • If weights is set to ‘exponential’, weight decays by powers of 2, as proposed in the eudex specification: https://github.com/ticki/eudex.
  • If weights is set to ‘fibonacci’, weight decays through the Fibonacci series, as in the eudex reference implementation.
  • If weights is set to a callable function, this assumes it creates a generator and the generator is used to populate a series of weights.
  • If weights is set to an iterable, the iterable’s values should be integers and will be used as the weights.
Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
  • weights (str, iterable, or generator function) – the weights or weights generator function
  • max_length (int) – the number of characters to encode as a eudex hash
  • normalized (bool) – normalizes to [0, 1] if True
Returns:

the Eudex Hamming distance

Return type:

int

>>> eudex_hamming('cat', 'hat')
128
>>> eudex_hamming('Niall', 'Neil')
2
>>> eudex_hamming('Colin', 'Cuilen')
10
>>> eudex_hamming('ATCG', 'TAGC')
403
>>> eudex_hamming('cat', 'hat', weights='fibonacci')
34
>>> eudex_hamming('Niall', 'Neil', weights='fibonacci')
2
>>> eudex_hamming('Colin', 'Cuilen', weights='fibonacci')
7
>>> eudex_hamming('ATCG', 'TAGC', weights='fibonacci')
117
>>> eudex_hamming('cat', 'hat', weights=None)
1
>>> eudex_hamming('Niall', 'Neil', weights=None)
1
>>> eudex_hamming('Colin', 'Cuilen', weights=None)
2
>>> eudex_hamming('ATCG', 'TAGC', weights=None)
9
>>> # Using the OEIS A000142:
>>> eudex_hamming('cat', 'hat', [1, 1, 2, 6, 24, 120, 720, 5040])
1
>>> eudex_hamming('Niall', 'Neil', [1, 1, 2, 6, 24, 120, 720, 5040])
720
>>> eudex_hamming('Colin', 'Cuilen', [1, 1, 2, 6, 24, 120, 720, 5040])
744
>>> eudex_hamming('ATCG', 'TAGC', [1, 1, 2, 6, 24, 120, 720, 5040])
6243
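
The weighting options described for eudex_hamming can be sketched as a byte-wise weighted Hamming distance over two integer hashes. The hash values used here are made up, the helper names are hypothetical, and giving the largest weight to the most significant byte is inferred from the doctest values (128 and 34 for a single differing bit), not taken from the library source.

```python
from itertools import islice


def exponential_weights(base=2):
    """Yield 1, base, base**2, ..."""
    value = 1
    while True:
        yield value
        value *= base


def fibonacci_weights():
    """Yield 1, 2, 3, 5, 8, ... (starting point inferred from the doctests)."""
    a, b = 1, 2
    while True:
        yield a
        a, b = b, a + b


def weighted_hamming(hash_a, hash_b, weights=None, max_length=8):
    """Weighted bit difference of two integer hashes.

    In this sketch, weights are consumed from the least significant
    byte upward, so the last weight applies to the first character.
    """
    xored = hash_a ^ hash_b
    if weights is None:
        return bin(xored).count('1')  # plain Hamming distance
    distance = 0
    for weight in islice(weights, max_length):
        distance += bin(xored & 0xFF).count('1') * weight
        xored >>= 8
    return distance


# Two made-up 64-bit hashes differing by one bit in the top byte
first, second = 0x01 << 56, 0
```

A single differing bit in the byte for the first character therefore dominates the result, which is the intended effect of decaying weights.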
abydos.distance.dist_eudex(src, tar, weights='exponential', max_length=8)[source]

Return normalized Hamming distance between Eudex hashes of two terms.

This is Eudex distance normalized to [0, 1].

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
  • weights (str, iterable, or generator function) – the weights or weights generator function
  • max_length (int) – the number of characters to encode as a eudex hash
Returns:

the normalized Eudex distance

Return type:

float

>>> round(dist_eudex('cat', 'hat'), 12)
0.062745098039
>>> round(dist_eudex('Niall', 'Neil'), 12)
0.000980392157
>>> round(dist_eudex('Colin', 'Cuilen'), 12)
0.004901960784
>>> round(dist_eudex('ATCG', 'TAGC'), 12)
0.197549019608
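
The values above are consistent with dividing the raw eudex_hamming result by its largest possible value: 8 bits per byte times the sum of the weights, which is 8 × 255 = 2040 for the default exponential weights. A minimal sketch under that assumption; the helper name is hypothetical:

```python
def normalize_weighted_hamming(distance, weights):
    """Divide by the maximum distance: every bit of every byte differing."""
    return distance / (8 * sum(weights))


exponential = [2 ** i for i in range(8)]  # 1, 2, 4, ..., 128

# Raw distances taken from the eudex_hamming examples
cat_hat = normalize_weighted_hamming(128, exponential)
atcg_tagc = normalize_weighted_hamming(403, exponential)
```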
abydos.distance.sim_eudex(src, tar, weights='exponential', max_length=8)[source]

Return normalized Hamming similarity between Eudex hashes of two terms.

Normalized Eudex similarity is the complement of normalized Eudex distance: \(sim_{Eudex} = 1 - dist_{Eudex}\).

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
  • weights (str, iterable, or generator function) – the weights or weights generator function
  • max_length (int) – the number of characters to encode as a eudex hash
Returns:

the normalized Eudex similarity

Return type:

float

>>> round(sim_eudex('cat', 'hat'), 12)
0.937254901961
>>> round(sim_eudex('Niall', 'Neil'), 12)
0.999019607843
>>> round(sim_eudex('Colin', 'Cuilen'), 12)
0.995098039216
>>> round(sim_eudex('ATCG', 'TAGC'), 12)
0.802450980392
abydos.distance.sift4_common(src, tar, max_offset=5, max_distance=0)[source]

Return the “common” Sift4 distance between two terms.

This is an approximation of edit distance, described in [Zac14].

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
  • max_offset – the number of characters to search for matching letters
  • max_distance – the distance at which to stop and exit
Returns:

the Sift4 distance according to the common formula

Return type:

int

>>> sift4_common('cat', 'hat')
1
>>> sift4_common('Niall', 'Neil')
2
>>> sift4_common('Colin', 'Cuilen')
3
>>> sift4_common('ATCG', 'TAGC')
2
abydos.distance.sift4_simplest(src, tar, max_offset=5)[source]

Return the “simplest” Sift4 distance between two terms.

This is an approximation of edit distance, described in [Zac14].

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
  • max_offset – the number of characters to search for matching letters
Returns:

the Sift4 distance according to the simplest formula

Return type:

int

>>> sift4_simplest('cat', 'hat')
1
>>> sift4_simplest('Niall', 'Neil')
2
>>> sift4_simplest('Colin', 'Cuilen')
3
>>> sift4_simplest('ATCG', 'TAGC')
2
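
The “simplest” Sift4 variant walks both strings with a pair of cursors, looks up to max_offset characters ahead for a match when the cursors disagree, and subtracts the total matched length from the longer string's length. The following is a sketch of that idea as commonly presented for [Zac14], not the library source; the function name is hypothetical.

```python
def sift4_simplest_sketch(src, tar, max_offset=5):
    """Approximate edit distance via a bounded look-ahead for matches."""
    if not src:
        return len(tar)
    if not tar:
        return len(src)
    src_len, tar_len = len(src), len(tar)
    src_cur = tar_cur = 0
    lcss = 0      # total length of matched runs found so far
    local_cs = 0  # length of the run currently being extended
    while src_cur < src_len and tar_cur < tar_len:
        if src[src_cur] == tar[tar_cur]:
            local_cs += 1
        else:
            lcss += local_cs
            local_cs = 0
            if src_cur != tar_cur:
                # Resynchronize both cursors at the further position
                src_cur = tar_cur = max(src_cur, tar_cur)
            for i in range(max_offset):
                if src_cur + i >= src_len and tar_cur + i >= tar_len:
                    break
                # Look ahead in src for the current tar character
                if (src_cur + i < src_len and tar_cur < tar_len
                        and src[src_cur + i] == tar[tar_cur]):
                    src_cur += i
                    local_cs += 1
                    break
                # Look ahead in tar for the current src character
                if (tar_cur + i < tar_len and src_cur < src_len
                        and src[src_cur] == tar[tar_cur + i]):
                    tar_cur += i
                    local_cs += 1
                    break
        src_cur += 1
        tar_cur += 1
    lcss += local_cs
    return max(src_len, tar_len) - lcss
```

The sketch agrees with the four sift4_simplest examples above; dividing its result by the longer string's length gives values matching the dist_sift4 examples for these inputs.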
abydos.distance.dist_sift4(src, tar, max_offset=5, max_distance=0)[source]

Return the normalized “common” Sift4 distance between two terms.

This is Sift4 distance, normalized to [0, 1].

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
  • max_offset – the number of characters to search for matching letters
  • max_distance – the distance at which to stop and exit
Returns:

the normalized Sift4 distance

Return type:

float

>>> round(dist_sift4('cat', 'hat'), 12)
0.333333333333
>>> dist_sift4('Niall', 'Neil')
0.4
>>> dist_sift4('Colin', 'Cuilen')
0.5
>>> dist_sift4('ATCG', 'TAGC')
0.5
abydos.distance.sim_sift4(src, tar, max_offset=5, max_distance=0)[source]

Return the normalized “common” Sift4 similarity of two terms.

Normalized Sift4 similarity is the complement of normalized Sift4 distance: \(sim_{Sift4} = 1 - dist_{Sift4}\).

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
  • max_offset – the number of characters to search for matching letters
  • max_distance – the distance at which to stop and exit
Returns:

the normalized Sift4 similarity

Return type:

float

>>> round(sim_sift4('cat', 'hat'), 12)
0.666666666667
>>> sim_sift4('Niall', 'Neil')
0.6
>>> sim_sift4('Colin', 'Cuilen')
0.5
>>> sim_sift4('ATCG', 'TAGC')
0.5
abydos.distance.typo(src, tar, metric='euclidean', cost=(1, 1, 0.5, 0.5), layout='QWERTY')[source]

Return the typo distance between two strings.

This is inspired by Typo-Distance [Son11], and a fair bit of this was copied from that module. Compared to the original, this supports different metrics for substitution.

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
  • metric (str) – supported values include: ‘euclidean’, ‘manhattan’, ‘log-euclidean’, and ‘log-manhattan’
  • cost (tuple) – a 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and shifts, respectively (by default: (1, 1, 0.5, 0.5)). The substitution & shift costs should be significantly less than the cost of an insertion & deletion unless a log metric is used.
  • layout (str) – name of the keyboard layout to use (Currently supported: QWERTY, Dvorak, AZERTY, QWERTZ)
Returns:

typo distance

Return type:

float

>>> typo('cat', 'hat')
1.5811388
>>> typo('Niall', 'Neil')
2.8251407
>>> typo('Colin', 'Cuilen')
3.4142137
>>> typo('ATCG', 'TAGC')
2.5
>>> typo('cat', 'hat', metric='manhattan')
2.0
>>> typo('Niall', 'Neil', metric='manhattan')
3.0
>>> typo('Colin', 'Cuilen', metric='manhattan')
3.5
>>> typo('ATCG', 'TAGC', metric='manhattan')
2.5
>>> typo('cat', 'hat', metric='log-manhattan')
0.804719
>>> typo('Niall', 'Neil', metric='log-manhattan')
2.2424533
>>> typo('Colin', 'Cuilen', metric='log-manhattan')
2.2424533
>>> typo('ATCG', 'TAGC', metric='log-manhattan')
2.3465736
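
The effect of the metric parameter on a single substitution can be sketched as follows. The two key positions are hypothetical (row, column) coordinates on an unstaggered QWERTY grid, the function name is made up, and taking log(1 + d) for the log metrics is an assumption that happens to agree with the 'cat' vs. 'hat' values above.

```python
import math

# Hypothetical grid positions: 'c' on the bottom letter row, 'h' on the home row
KEY_POS = {'c': (3, 2), 'h': (2, 5)}

METRICS = ('euclidean', 'manhattan', 'log-euclidean', 'log-manhattan')


def substitution_cost(a, b, metric='euclidean', sub_cost=0.5):
    """Cost of typing key b instead of key a, scaled by the substitution cost."""
    if metric not in METRICS:
        raise ValueError('unsupported metric: ' + metric)
    (row_a, col_a), (row_b, col_b) = KEY_POS[a], KEY_POS[b]
    if metric.endswith('manhattan'):
        distance = abs(row_a - row_b) + abs(col_a - col_b)
    else:
        distance = math.hypot(row_a - row_b, col_a - col_b)
    if metric.startswith('log-'):
        distance = math.log(1 + distance)  # dampen long key travels
    return sub_cost * distance
```

Since 'cat' and 'hat' differ by one substitution, the substitution cost alone accounts for the whole distance in that example.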
abydos.distance.dist_typo(src, tar, metric='euclidean', cost=(1, 1, 0.5, 0.5))[source]

Return the normalized typo distance between two strings.

This is typo distance, normalized to [0, 1].

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
  • metric (str) – supported values include: ‘euclidean’, ‘manhattan’, ‘log-euclidean’, and ‘log-manhattan’
  • cost (tuple) – a 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and shifts, respectively (by default: (1, 1, 0.5, 0.5)). The substitution & shift costs should be significantly less than the cost of an insertion & deletion unless a log metric is used.
Returns:

normalized typo distance

Return type:

float

>>> round(dist_typo('cat', 'hat'), 12)
0.527046283086
>>> round(dist_typo('Niall', 'Neil'), 12)
0.565028142929
>>> round(dist_typo('Colin', 'Cuilen'), 12)
0.569035609563
>>> dist_typo('ATCG', 'TAGC')
0.625
abydos.distance.sim_typo(src, tar, metric='euclidean', cost=(1, 1, 0.5, 0.5))[source]

Return the normalized typo similarity between two strings.

Normalized typo similarity is the complement of normalized typo distance: \(sim_{typo} = 1 - dist_{typo}\).

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
  • metric (str) – supported values include: ‘euclidean’, ‘manhattan’, ‘log-euclidean’, and ‘log-manhattan’
  • cost (tuple) – a 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and shifts, respectively (by default: (1, 1, 0.5, 0.5)). The substitution & shift costs should be significantly less than the cost of an insertion & deletion unless a log metric is used.
Returns:

normalized typo similarity

Return type:

float

>>> round(sim_typo('cat', 'hat'), 12)
0.472953716914
>>> round(sim_typo('Niall', 'Neil'), 12)
0.434971857071
>>> round(sim_typo('Colin', 'Cuilen'), 12)
0.430964390437
>>> sim_typo('ATCG', 'TAGC')
0.375
abydos.distance.synoname(src, tar, word_approx_min=0.3, char_approx_min=0.73, tests=4095, ret_name=False)[source]

Return the Synoname similarity type of two words.

Cf. [JPGTrust91][Gro91]

Parameters:
  • src (str) – source string for comparison
  • tar (str) – target string for comparison
  • ret_name (bool) – return the name of the match type rather than the int value
  • word_approx_min (float) – the minimum word approximation value to signal a ‘word_approx’ match
  • char_approx_min (float) – the minimum character approximation value to signal a ‘char_approx’ match
  • tests (int or Iterable) – either an integer indicating tests to perform or a list of test names to perform (defaults to performing all tests)
Returns:

Synoname value

Return type:

int (or str if ret_name is True)

>>> synoname(('Breghel', 'Pieter', ''), ('Brueghel', 'Pieter', ''))
2
>>> synoname(('Breghel', 'Pieter', ''), ('Brueghel', 'Pieter', ''),
... ret_name=True)
'omission'
>>> synoname(('Dore', 'Gustave', ''),
... ('Dore', 'Paul Gustave Louis Christophe', ''),
... ret_name=True)
'inclusion'
>>> synoname(('Pereira', 'I. R.', ''), ('Pereira', 'I. Smith', ''),
... ret_name=True)
'word_approx'