abydos.distance package

abydos.distance.

The distance package implements string distance measure and metric classes.

These include traditional Levenshtein edit distance and related algorithms:

Levenshtein distance (Levenshtein)

Optimal String Alignment distance (Levenshtein with mode='osa')

Damerau-Levenshtein distance (DamerauLevenshtein)

Indel distance (Indel)

Hamming distance (Hamming) and the closely related Modified Language-Independent Product Name Search distance (MLIPNS) are provided.

Distance metrics developed for the US Census are included:

Jaro distance (JaroWinkler with mode='jaro')

Jaro-Winkler distance (JaroWinkler)

Strcmp95 distance (Strcmp95)

A large set of multi-set token-based distance metrics are provided, including:

Generalized Minkowski distance (Minkowski)

Manhattan distance (Manhattan)

Euclidean distance (Euclidean)

Chebyshev distance (Chebyshev)

Generalized Tversky distance (Tversky)

Sørensen–Dice coefficient (Dice)

Jaccard similarity (Jaccard)

Tanimoto coefficient (Jaccard.tanimoto_coeff())

Overlap distance (Overlap)

Cosine similarity (Cosine)

Bag distance (Bag)

Monge-Elkan distance (MongeElkan)

Three popular sequence alignment algorithms are provided:

Needleman-Wunsch score (NeedlemanWunsch)

Smith-Waterman score (SmithWaterman)

Gotoh score (Gotoh)

Classes relating to substring and subsequence distances include:

Longest common subsequence (LCSseq)

Longest common substring (LCSstr)

Ratcliff-Obershelp distance (RatcliffObershelp)

A number of simple distance classes provided in the package include:

Identity distance (Ident)

Length distance (Length)

Prefix distance (Prefix)

Suffix distance (Suffix)

Normalized compression distance classes for a variety of compression algorithms are provided:

zlib (NCDzlib)

bzip2 (NCDbz2)

lzma (NCDlzma)

arithmetic coding (NCDarith)

BWT plus RLE (NCDbwtrle)

RLE (NCDrle)

The remaining distance measures & metrics include:

Western Airlines' Match Rating Algorithm comparison (distance.MRA)

Editex (Editex)

Bavarian Landesamt für Statistik distance (Baystat)

Eudex distance (distance.Eudex)

Sift4 distance (Sift4 and Sift4Simplest)

Typo distance (Typo)

Synoname (Synoname)

Most of the distance and similarity measures have sim and dist methods, which return a measure that is normalized to the range \([0, 1]\). The normalized distance and similarity are always complements: for a given measure and the same input, the normalized distance always equals 1 minus the similarity. Some measures also have an absolute distance method, dist_abs, whose result is not limited to any range.

All three methods can be demonstrated using the DamerauLevenshtein class:

>>> dl = DamerauLevenshtein()
>>> dl.dist_abs('orange', 'strange')
2
>>> dl.dist('orange', 'strange')
0.2857142857142857
>>> dl.sim('orange', 'strange')
0.7142857142857143

abydos.distance.sim(src, tar, method=<function sim_levenshtein>)

Return a similarity of two strings.

This is a generalized function for calling other similarity functions.

Parameters:
    src (str) -- Source string for comparison
    tar (str) -- Target string for comparison
    method (function) -- Specifies the similarity metric (`sim_levenshtein()` by default)
Returns:	Similarity according to the specified function
Return type:	float
Raises:	`AttributeError` -- Unknown distance function

Examples

>>> round(sim('cat', 'hat'), 12)
0.666666666667
>>> round(sim('Niall', 'Neil'), 12)
0.4
>>> sim('aluminum', 'Catalan')
0.125
>>> sim('ATCG', 'TAGC')
0.25

abydos.distance.dist(src, tar, method=<function sim_levenshtein>)

Return a distance between two strings.

This is a generalized function for calling other distance functions.

Parameters:
    src (str) -- Source string for comparison
    tar (str) -- Target string for comparison
    method (function) -- Specifies the similarity metric (`sim_levenshtein()` by default). Note that this takes a similarity metric function, not a distance metric function.
Returns:	Distance according to the specified function
Return type:	float
Raises:	`AttributeError` -- Unknown distance function

Examples

>>> round(dist('cat', 'hat'), 12)
0.333333333333
>>> round(dist('Niall', 'Neil'), 12)
0.6
>>> dist('aluminum', 'Catalan')
0.875
>>> dist('ATCG', 'TAGC')
0.75
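
How the two generalized functions relate can be sketched in plain Python. This is a minimal illustration rather than the library's source: the names `generic_sim`, `generic_dist`, and the toy `sim_ident` similarity are hypothetical, and treating a non-callable `method` as the "unknown function" case is an assumption. The only behavior taken from the text above is that `method` is a similarity function, that the distance is its complement, and that `AttributeError` signals an unknown function.

```python
def sim_ident(src, tar):
    """Toy similarity (hypothetical): 1.0 for identical strings, else 0.0."""
    return 1.0 if src == tar else 0.0


def generic_sim(src, tar, method=sim_ident):
    """Dispatch to a similarity function that returns a value in [0, 1]."""
    # Assumption: an argument that cannot be called counts as "unknown".
    if not callable(method):
        raise AttributeError('Unknown similarity function: ' + str(method))
    return method(src, tar)


def generic_dist(src, tar, method=sim_ident):
    """Return the complement of the similarity reported by `method`."""
    # `method` is a similarity function, so the distance is 1 - similarity.
    return 1 - generic_sim(src, tar, method)


print(generic_sim('cat', 'cat'))
print(generic_dist('cat', 'hat'))
```

Passing a distance function as `method` would silently invert the result, which is why the parameter description stresses that a similarity function is expected.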

class abydos.distance.Levenshtein

Bases: abydos.distance._distance._Distance

Levenshtein distance.

This is the standard edit distance measure. Cf. [Lev65][Lev66].

Optimal string alignment (aka restricted Damerau-Levenshtein distance) [Boy11] is also supported.

The ordinary Levenshtein & Optimal String Alignment distances both employ the Wagner-Fischer dynamic programming algorithm [WF74].

Levenshtein edit distance ordinarily has unit insertion, deletion, and substitution costs.

dist(src, tar, mode='lev', cost=(1, 1, 1, 1))

Return the normalized Levenshtein distance between two strings.

The Levenshtein distance is normalized by dividing the Levenshtein distance (calculated in either of the supported modes) by the greater of the number of characters in src times the cost of a delete and the number of characters in tar times the cost of an insert. For the case in which all operations have \(cost = 1\), this is equivalent to dividing by the length of the longer of the two strings src & tar.

Parameters:
    src (str) -- Source string for comparison
    tar (str) -- Target string for comparison
    mode (str) -- Specifies a mode for computing the Levenshtein distance:
        `lev` (default) computes the ordinary Levenshtein distance, in which edits may include inserts, deletes, and substitutions
        `osa` computes the Optimal String Alignment distance, in which edits may include inserts, deletes, substitutions, and transpositions but substrings may only be edited once
    cost (tuple) -- A 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and transpositions, respectively (by default: (1, 1, 1, 1))
Returns:	The normalized Levenshtein distance between src & tar
Return type:	float

Examples

>>> cmp = Levenshtein()
>>> round(cmp.dist('cat', 'hat'), 12)
0.333333333333
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.6
>>> cmp.dist('aluminum', 'Catalan')
0.875
>>> cmp.dist('ATCG', 'TAGC')
0.75
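
The normalization described above can be sketched with a small Wagner-Fischer implementation. This is an illustrative sketch, not the library's code: the function names `lev_abs` and `lev_dist` are hypothetical, only the ordinary `lev` mode is covered, and the fourth (transposition) entry of `cost` is accepted but ignored.

```python
def lev_abs(src, tar, cost=(1, 1, 1, 1)):
    """Ordinary Levenshtein distance via Wagner-Fischer, two rows at a time."""
    ins_cost, del_cost, sub_cost, _ = cost  # transposition cost is unused here
    # Row 0: turning the empty prefix of src into tar[:j] takes j inserts.
    prev = [j * ins_cost for j in range(len(tar) + 1)]
    for i, s_ch in enumerate(src, start=1):
        # Column 0: turning src[:i] into the empty string takes i deletes.
        curr = [i * del_cost]
        for j, t_ch in enumerate(tar, start=1):
            curr.append(min(
                prev[j] + del_cost,                              # delete s_ch
                curr[j - 1] + ins_cost,                          # insert t_ch
                prev[j - 1] + (0 if s_ch == t_ch else sub_cost)  # match or substitute
            ))
        prev = curr
    return prev[-1]


def lev_dist(src, tar, cost=(1, 1, 1, 1)):
    """Normalize by max(len(src) * delete cost, len(tar) * insert cost)."""
    if src == tar:
        return 0.0  # also avoids dividing by zero for two empty strings
    ins_cost, del_cost, _, _ = cost
    return lev_abs(src, tar, cost) / max(len(src) * del_cost, len(tar) * ins_cost)


print(lev_abs('Niall', 'Neil'), lev_dist('Niall', 'Neil'))
```

With unit costs the divisor reduces to the length of the longer string, so the sketch reproduces the values in the examples above.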

dist_abs(src, tar, mode='lev', cost=(1, 1, 1, 1))

Return the Levenshtein distance between two strings.

Parameters:
    src (str) -- Source string for comparison
    tar (str) -- Target string for comparison
    mode (str) -- Specifies a mode for computing the Levenshtein distance:
        `lev` (default) computes the ordinary Levenshtein distance, in which edits may include inserts, deletes, and substitutions
        `osa` computes the Optimal String Alignment distance, in which edits may include inserts, deletes, substitutions, and transpositions but substrings may only be edited once
    cost (tuple) -- A 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and transpositions, respectively (by default: (1, 1, 1, 1))
Returns:	The Levenshtein distance between src & tar
Return type:	int (may return a float if cost has float values)

Examples

>>> cmp = Levenshtein()
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
3
>>> cmp.dist_abs('aluminum', 'Catalan')
7
>>> cmp.dist_abs('ATCG', 'TAGC')
3

>>> cmp.dist_abs('ATCG', 'TAGC', mode='osa')
2
>>> cmp.dist_abs('ACTG', 'TAGC', mode='osa')
4
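
The difference between the two modes comes down to one extra case in the recurrence. The following is a minimal unit-cost sketch of Optimal String Alignment, assuming a hypothetical function name `osa_abs`; it is the textbook formulation and is not taken from the library's source.

```python
def osa_abs(src, tar):
    """Optimal String Alignment distance with unit costs."""
    rows, cols = len(src) + 1, len(tar) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # i deletes
    for j in range(cols):
        d[0][j] = j  # j inserts
    for i in range(1, rows):
        for j in range(1, cols):
            sub = 0 if src[i - 1] == tar[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # delete
                d[i][j - 1] + 1,        # insert
                d[i - 1][j - 1] + sub,  # match or substitute
            )
            # Extra OSA case: swap two adjacent characters in a single edit.
            if (i > 1 and j > 1
                    and src[i - 1] == tar[j - 2]
                    and src[i - 2] == tar[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_abs('ATCG', 'TAGC'))
print(osa_abs('ACTG', 'TAGC'))
```

'ATCG' becomes 'TAGC' with two adjacent swaps, whereas 'ACTG' contains no adjacent pair that appears reversed in 'TAGC', so the transposition case never applies and the result equals the ordinary Levenshtein distance.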

abydos.distance.levenshtein(src, tar, mode='lev', cost=(1, 1, 1, 1))

Return the Levenshtein distance between two strings.

This is a wrapper of Levenshtein.dist_abs().

Parameters:
    src (str) -- Source string for comparison
    tar (str) -- Target string for comparison
    mode (str) -- Specifies a mode for computing the Levenshtein distance:
        `lev` (default) computes the ordinary Levenshtein distance, in which edits may include inserts, deletes, and substitutions
        `osa` computes the Optimal String Alignment distance, in which edits may include inserts, deletes, substitutions, and transpositions but substrings may only be edited once
    cost (tuple) -- A 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and transpositions, respectively (by default: (1, 1, 1, 1))
Returns:	The Levenshtein distance between src & tar
Return type:	int (may return a float if cost has float values)

Examples

>>> levenshtein('cat', 'hat')
1
>>> levenshtein('Niall', 'Neil')
3
>>> levenshtein('aluminum', 'Catalan')
7
>>> levenshtein('ATCG', 'TAGC')
3

>>> levenshtein('ATCG', 'TAGC', mode='osa')
2
>>> levenshtein('ACTG', 'TAGC', mode='osa')
4

abydos.distance.dist_levenshtein(src, tar, mode='lev', cost=(1, 1, 1, 1))

Return the normalized Levenshtein distance between two strings.

This is a wrapper of Levenshtein.dist().

Parameters:
    src (str) -- Source string for comparison
    tar (str) -- Target string for comparison
    mode (str) -- Specifies a mode for computing the Levenshtein distance:
        `lev` (default) computes the ordinary Levenshtein distance, in which edits may include inserts, deletes, and substitutions
        `osa` computes the Optimal String Alignment distance, in which edits may include inserts, deletes, substitutions, and transpositions but substrings may only be edited once
    cost (tuple) -- A 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and transpositions, respectively (by default: (1, 1, 1, 1))
Returns:	The normalized Levenshtein distance between src & tar
Return type:	float

Examples

>>> round(dist_levenshtein('cat', 'hat'), 12)
0.333333333333
>>> round(dist_levenshtein('Niall', 'Neil'), 12)
0.6
>>> dist_levenshtein('aluminum', 'Catalan')
0.875
>>> dist_levenshtein('ATCG', 'TAGC')
0.75

abydos.distance.sim_levenshtein(src, tar, mode='lev', cost=(1, 1, 1, 1))

Return the Levenshtein similarity of two strings.

This is a wrapper of Levenshtein.sim().

Parameters:
    src (str) -- Source string for comparison
    tar (str) -- Target string for comparison
    mode (str) -- Specifies a mode for computing the Levenshtein distance:
        `lev` (default) computes the ordinary Levenshtein distance, in which edits may include inserts, deletes, and substitutions
        `osa` computes the Optimal String Alignment distance, in which edits may include inserts, deletes, substitutions, and transpositions but substrings may only be edited once
    cost (tuple) -- A 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and transpositions, respectively (by default: (1, 1, 1, 1))
Returns:	The Levenshtein similarity between src & tar
Return type:	float

Examples

>>> round(sim_levenshtein('cat', 'hat'), 12)
0.666666666667
>>> round(sim_levenshtein('Niall', 'Neil'), 12)
0.4
>>> sim_levenshtein('aluminum', 'Catalan')
0.125
>>> sim_levenshtein('ATCG', 'TAGC')
0.25

class abydos.distance.DamerauLevenshtein

Bases: abydos.distance._distance._Distance

Damerau-Levenshtein distance.

This computes the Damerau-Levenshtein distance [Dam64]. Damerau-Levenshtein code is based on Java code by Kevin L. Stern [Ste14], under the MIT license: https://github.com/KevinStern/software-and-algorithms/blob/master/src/main/java/blogspot/software_and_algorithms/stern_library/string/DamerauLevenshteinAlgorithm.java

dist(src, tar, cost=(1, 1, 1, 1))

Return the normalized Damerau-Levenshtein distance between two strings.

Damerau-Levenshtein distance normalized to the interval [0, 1].

The Damerau-Levenshtein distance is normalized by dividing the Damerau-Levenshtein distance by the greater of the number of characters in src times the cost of a delete and the number of characters in tar times the cost of an insert. For the case in which all operations have \(cost = 1\), this is equivalent to dividing by the length of the longer of the two strings src & tar.

Parameters:
    src (str) -- Source string for comparison
    tar (str) -- Target string for comparison
    cost (tuple) -- A 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and transpositions, respectively (by default: (1, 1, 1, 1))
Returns:	The normalized Damerau-Levenshtein distance
Return type:	float

Examples

>>> cmp = DamerauLevenshtein()
>>> round(cmp.dist('cat', 'hat'), 12)
0.333333333333
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.6
>>> cmp.dist('aluminum', 'Catalan')
0.875
>>> cmp.dist('ATCG', 'TAGC')
0.5

dist_abs(src, tar, cost=(1, 1, 1, 1))

Return the Damerau-Levenshtein distance between two strings.

Parameters:
    src (str) -- Source string for comparison
    tar (str) -- Target string for comparison
    cost (tuple) -- A 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and transpositions, respectively (by default: (1, 1, 1, 1))
Returns:	The Damerau-Levenshtein distance between src & tar
Return type:	int (may return a float if cost has float values)
Raises:	`ValueError` -- Unsupported cost assignment; the cost of two transpositions must not be less than the cost of an insert plus a delete.

Examples

>>> cmp = DamerauLevenshtein()
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
3
>>> cmp.dist_abs('aluminum', 'Catalan')
7
>>> cmp.dist_abs('ATCG', 'TAGC')
2
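
The cost restriction listed under Raises can be checked before any distance is computed. The sketch below only mirrors that stated rule; the helper name `check_costs` is hypothetical and the library may perform the check differently.

```python
def check_costs(cost):
    """Validate a (insert, delete, substitute, transpose) cost 4-tuple."""
    ins_cost, del_cost, sub_cost, trans_cost = cost
    # Two transpositions must not be cheaper than one insert plus one delete.
    if 2 * trans_cost < ins_cost + del_cost:
        raise ValueError(
            'Unsupported cost assignment; the cost of two transpositions '
            'must not be less than the cost of an insert plus a delete.'
        )
    return cost


print(check_costs((1, 1, 1, 1)))
```

The default unit costs pass the check because 2 * 1 is not less than 1 + 1; a tuple such as (2, 2, 1, 1) would be rejected.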

abydos.distance.damerau_levenshtein(src, tar, cost=(1, 1, 1, 1))

Return the Damerau-Levenshtein distance between two strings.

This is a wrapper of DamerauLevenshtein.dist_abs().

Parameters:
    src (str) -- Source string for comparison
    tar (str) -- Target string for comparison
    cost (tuple) -- A 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and transpositions, respectively (by default: (1, 1, 1, 1))
Returns:	The Damerau-Levenshtein distance between src & tar
Return type:	int (may return a float if cost has float values)

Examples

>>> damerau_levenshtein('cat', 'hat')
1
>>> damerau_levenshtein('Niall', 'Neil')
3
>>> damerau_levenshtein('aluminum', 'Catalan')
7
>>> damerau_levenshtein('ATCG', 'TAGC')
2

abydos.distance.dist_damerau(src, tar, cost=(1, 1, 1, 1))

Return the normalized Damerau-Levenshtein distance between two strings.

This is a wrapper of DamerauLevenshtein.dist().

Parameters:
    src (str) -- Source string for comparison
    tar (str) -- Target string for comparison
    cost (tuple) -- A 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and transpositions, respectively (by default: (1, 1, 1, 1))
Returns:	The normalized Damerau-Levenshtein distance
Return type:	float

Examples

>>> round(dist_damerau('cat', 'hat'), 12)
0.333333333333
>>> round(dist_damerau('Niall', 'Neil'), 12)
0.6
>>> dist_damerau('aluminum', 'Catalan')
0.875
>>> dist_damerau('ATCG', 'TAGC')
0.5

abydos.distance.sim_damerau(src, tar, cost=(1, 1, 1, 1))

Return the Damerau-Levenshtein similarity of two strings.

This is a wrapper of DamerauLevenshtein.sim().

Parameters:
    src (str) -- Source string for comparison
    tar (str) -- Target string for comparison
    cost (tuple) -- A 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and transpositions, respectively (by default: (1, 1, 1, 1))
Returns:	The normalized Damerau-Levenshtein similarity
Return type:	float

Examples

>>> round(sim_damerau('cat', 'hat'), 12)
0.666666666667
>>> round(sim_damerau('Niall', 'Neil'), 12)
0.4
>>> sim_damerau('aluminum', 'Catalan')
0.125
>>> sim_damerau('ATCG', 'TAGC')
0.5

class abydos.distance.Indel

Bases: abydos.distance._distance._Distance

Indel distance.

This is equivalent to Levenshtein distance, when only inserts and deletes are possible.

dist(src, tar)

Return the normalized indel distance between two strings.

This is equivalent to normalized Levenshtein distance, when only inserts and deletes are possible.

Parameters:
    src (str) -- Source string for comparison
    tar (str) -- Target string for comparison
Returns:	Normalized indel distance
Return type:	float

Examples

>>> cmp = Indel()
>>> round(cmp.dist('cat', 'hat'), 12)
0.333333333333
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.333333333333
>>> round(cmp.dist('Colin', 'Cuilen'), 12)
0.454545454545
>>> cmp.dist('ATCG', 'TAGC')
0.5

dist_abs(src, tar)

Return the indel distance between two strings.

Parameters:
    src (str) -- Source string for comparison
    tar (str) -- Target string for comparison
Returns:	Indel distance
Return type:	int

Examples

>>> cmp = Indel()
>>> cmp.dist_abs('cat', 'hat')
2
>>> cmp.dist_abs('Niall', 'Neil')
3
>>> cmp.dist_abs('Colin', 'Cuilen')
5
>>> cmp.dist_abs('ATCG', 'TAGC')
4
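
Because only inserts and deletes are allowed, the indel distance can also be derived from the longest common subsequence: every character outside the LCS must be deleted from src or inserted from tar. The sketch below uses that identity. The function names are hypothetical, and the divisor in `indel_dist` (the combined length of both strings) is inferred from the example values in this section rather than stated by the source.

```python
def lcs_length(src, tar):
    """Length of the longest common subsequence, computed row by row."""
    prev = [0] * (len(tar) + 1)
    for s_ch in src:
        curr = [0]
        for j, t_ch in enumerate(tar, start=1):
            if s_ch == t_ch:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_abs(src, tar):
    """Count the characters of both strings that fall outside the LCS."""
    return len(src) + len(tar) - 2 * lcs_length(src, tar)


def indel_dist(src, tar):
    """Normalize by the combined length (an inference from the examples)."""
    if src == tar:
        return 0.0  # also covers two empty strings
    return indel_abs(src, tar) / (len(src) + len(tar))


print(indel_abs('Colin', 'Cuilen'), round(indel_dist('Colin', 'Cuilen'), 12))
```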

abydos.distance.indel(src, tar)

Return the indel distance between two strings.

Parameters:
    src (str) -- Source string for comparison
    tar (str) -- Target string for comparison
Returns:	Indel distance
Return type:	int

Examples

>>> indel('cat', 'hat')
2
>>> indel('Niall', 'Neil')
3
>>> indel('Colin', 'Cuilen')
5
>>> indel('ATCG', 'TAGC')
4

abydos.distance.dist_indel(src, tar)

Return the normalized indel distance between two strings.

This is equivalent to normalized Levenshtein distance, when only inserts and deletes are possible.

Parameters:
    src (str) -- Source string for comparison
    tar (str) -- Target string for comparison
Returns:	Normalized indel distance
Return type:	float

Examples

>>> round(dist_indel('cat', 'hat'), 12)
0.333333333333
>>> round(dist_indel('Niall', 'Neil'), 12)
0.333333333333
>>> round(dist_indel('Colin', 'Cuilen'), 12)
0.454545454545
>>> dist_indel('ATCG', 'TAGC')
0.5

abydos.distance.sim_indel(src, tar)

Return the normalized indel similarity of two strings.

This is equivalent to normalized Levenshtein similarity, when only inserts and deletes are possible.

Parameters:
    src (str) -- Source string for comparison
    tar (str) -- Target string for comparison
Returns:	Normalized indel similarity
Return type:	float

Examples

>>> round(sim_indel('cat', 'hat'), 12)
0.666666666667
>>> round(sim_indel('Niall', 'Neil'), 12)
0.666666666667
>>> round(sim_indel('Colin', 'Cuilen'), 12)
0.545454545455
>>> sim_indel('ATCG', 'TAGC')
0.5

class abydos.distance.Hamming

Bases: abydos.distance._distance._Distance

Hamming distance.

Hamming distance [Ham50] equals the number of character positions at which two strings differ. For strings of unequal lengths, it is not normally defined. By default, this implementation calculates the Hamming distance of the first n characters where n is the lesser of the two strings' lengths and adds to this the difference in string lengths.

dist(src, tar, diff_lens=True)

Return the normalized Hamming distance between two strings.

Hamming distance normalized to the interval [0, 1].

The Hamming distance is normalized by dividing it by the greater of the number of characters in src & tar. If diff_lens is set to False and the strings are of unequal lengths, an exception is raised instead.

The arguments are identical to those of the hamming() function.

Parameters:
    src (str) -- Source string for comparison
    tar (str) -- Target string for comparison
    diff_lens (bool) -- If True (default), this returns the Hamming distance for those characters that have a matching character in both strings plus the difference in the strings' lengths. This is equivalent to extending the shorter string with obligatorily non-matching characters. If False, an exception is raised in the case of strings of unequal lengths.
Returns:	Normalized Hamming distance
Return type:	float

Examples

>>> cmp = Hamming()
>>> round(cmp.dist('cat', 'hat'), 12)
0.333333333333
>>> cmp.dist('Niall', 'Neil')
0.6
>>> cmp.dist('aluminum', 'Catalan')
1.0
>>> cmp.dist('ATCG', 'TAGC')
1.0

dist_abs(src, tar, diff_lens=True)

Return the Hamming distance between two strings.

Parameters:
    src (str) -- Source string for comparison
    tar (str) -- Target string for comparison
    diff_lens (bool) -- If True (default), this returns the Hamming distance for those characters that have a matching character in both strings plus the difference in the strings' lengths. This is equivalent to extending the shorter string with obligatorily non-matching characters. If False, an exception is raised in the case of strings of unequal lengths.
Returns:	The Hamming distance between src & tar
Return type:	int
Raises:	`ValueError` -- Undefined for sequences of unequal length; set diff_lens to True for Hamming distance between strings of unequal lengths.

Examples

>>> cmp = Hamming()
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
3
>>> cmp.dist_abs('aluminum', 'Catalan')
8
>>> cmp.dist_abs('ATCG', 'TAGC')
4
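
The handling of unequal lengths described above fits in a few lines. This is a minimal sketch under the documented semantics of diff_lens, with hypothetical function names; it is not the library's implementation.

```python
def hamming_abs(src, tar, diff_lens=True):
    """Count differing positions, plus the length difference if allowed."""
    if not diff_lens and len(src) != len(tar):
        raise ValueError(
            'Undefined for sequences of unequal length; set diff_lens to '
            'True for Hamming distance between strings of unequal lengths.'
        )
    # zip() stops at the shorter string, so only shared positions are compared.
    mismatches = sum(s_ch != t_ch for s_ch, t_ch in zip(src, tar))
    # Each surplus character counts as one obligatory mismatch.
    return mismatches + abs(len(src) - len(tar))


def hamming_dist(src, tar, diff_lens=True):
    """Normalize by the length of the longer string."""
    if src == tar:
        return 0.0  # also covers two empty strings
    return hamming_abs(src, tar, diff_lens) / max(len(src), len(tar))


print(hamming_abs('Niall', 'Neil'), hamming_dist('Niall', 'Neil'))
```

For 'Niall' and 'Neil', two of the four shared positions differ and one character is left over, which gives 3 and a normalized value of 3 / 5.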

abydos.distance.hamming(src, tar, diff_lens=True)

Return the Hamming distance between two strings.

This is a wrapper for Hamming.dist_abs().

Parameters:
    src (str) -- Source string for comparison
    tar (str) -- Target string for comparison
    diff_lens (bool) -- If True (default), this returns the Hamming distance for those characters that have a matching character in both strings plus the difference in the strings' lengths. This is equivalent to extending the shorter string with obligatorily non-matching characters. If False, an exception is raised in the case of strings of unequal lengths.
Returns:	The Hamming distance between src & tar
Return type:	int

Examples

>>> hamming('cat', 'hat')
1
>>> hamming('Niall', 'Neil')
3
>>> hamming('aluminum', 'Catalan')
8
>>> hamming('ATCG', 'TAGC')
4

abydos.distance.dist_hamming(src, tar, diff_lens=True)

Return the normalized Hamming distance between two strings.

This is a wrapper for Hamming.dist().

Parameters:
    src (str) -- Source string for comparison
    tar (str) -- Target string for comparison
    diff_lens (bool) -- If True (default), this returns the Hamming distance for those characters that have a matching character in both strings plus the difference in the strings' lengths. This is equivalent to extending the shorter string with obligatorily non-matching characters. If False, an exception is raised in the case of strings of unequal lengths.
Returns:	The normalized Hamming distance
Return type:	float

Examples

>>> round(dist_hamming('cat', 'hat'), 12)
0.333333333333
>>> dist_hamming('Niall', 'Neil')
0.6
>>> dist_hamming('aluminum', 'Catalan')
1.0
>>> dist_hamming('ATCG', 'TAGC')
1.0

abydos.distance.sim_hamming(src, tar, diff_lens=True)

Return the normalized Hamming similarity of two strings.

This is a wrapper for Hamming.sim().

Parameters:
    src (str) -- Source string for comparison
    tar (str) -- Target string for comparison
    diff_lens (bool) -- If True (default), this returns the Hamming distance for those characters that have a matching character in both strings plus the difference in the strings' lengths. This is equivalent to extending the shorter string with obligatorily non-matching characters. If False, an exception is raised in the case of strings of unequal lengths.
Returns:	The normalized Hamming similarity
Return type:	float

Examples

>>> round(sim_hamming('cat', 'hat'), 12)
0.666666666667
>>> sim_hamming('Niall', 'Neil')
0.4
>>> sim_hamming('aluminum', 'Catalan')
0.0
>>> sim_hamming('ATCG', 'TAGC')
0.0

class abydos.distance.JaroWinkler

Bases: abydos.distance._distance._Distance

Jaro-Winkler distance.

Jaro(-Winkler) distance is a string edit distance initially proposed by Jaro and extended by Winkler [Jar89][Win90].

This is a Python implementation based on the C code for strcmp95: http://web.archive.org/web/20110629121242/http://www.census.gov/geo/msb/stand/strcmp.c [WMJL94]. The above file is a US Government publication and, accordingly, in the public domain.

sim(src, tar, qval=1, mode='winkler', long_strings=False, boost_threshold=0.7, scaling_factor=0.1)

Return the Jaro or Jaro-Winkler similarity of two strings.

Parameters:
    src (str) -- Source string for comparison
    tar (str) -- Target string for comparison
    qval (int) -- The length of each q-gram (defaults to 1: character-wise matching)
    mode (str) -- Indicates which variant of this distance metric to compute:
        `winkler` -- computes the Jaro-Winkler distance (default), which increases the score for matches near the start of the word
        `jaro` -- computes the Jaro distance
    long_strings (bool) -- Set to True to "Increase the probability of a match when the number of matched characters is large. This option allows for a little more tolerance when the strings are large. It is not an appropriate test when comparing fixed length fields such as phone and social security numbers." (Used in 'winkler' mode only.)
    boost_threshold (float) -- A value between 0 and 1, below which the Winkler boost is not applied (defaults to 0.7). (Used in 'winkler' mode only.)
    scaling_factor (float) -- A value between 0 and 0.25, indicating by how much to boost scores for matching prefixes (defaults to 0.1). (Used in 'winkler' mode only.)
Returns:	Jaro or Jaro-Winkler similarity
Return type:	float
Raises:
    `ValueError` -- Unsupported boost_threshold assignment; boost_threshold must be between 0 and 1.
    `ValueError` -- Unsupported scaling_factor assignment; scaling_factor must be between 0 and 0.25.

Examples

>>> round(sim_jaro_winkler('cat', 'hat'), 12)
0.777777777778
>>> round(sim_jaro_winkler('Niall', 'Neil'), 12)
0.805
>>> round(sim_jaro_winkler('aluminum', 'Catalan'), 12)
0.60119047619
>>> round(sim_jaro_winkler('ATCG', 'TAGC'), 12)
0.833333333333

>>> round(sim_jaro_winkler('cat', 'hat', mode='jaro'), 12)
0.777777777778
>>> round(sim_jaro_winkler('Niall', 'Neil', mode='jaro'), 12)
0.783333333333
>>> round(sim_jaro_winkler('aluminum', 'Catalan', mode='jaro'), 12)
0.60119047619
>>> round(sim_jaro_winkler('ATCG', 'TAGC', mode='jaro'), 12)
0.833333333333
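
The effect of the two modes can be illustrated with the textbook formulation of Jaro similarity and the Winkler prefix boost. This sketch is not the library's code: the function names are hypothetical, it works on single characters only (no qval), it omits the long_strings adjustment, it caps the rewarded prefix at four characters as in the usual formulation, and it assumes the boost applies only when the Jaro score is strictly above boost_threshold.

```python
def jaro(src, tar):
    """Textbook Jaro similarity on characters."""
    if src == tar:
        return 1.0
    if not src or not tar:
        return 0.0
    # Characters match only if they lie within this many positions of each other.
    window = max(max(len(src), len(tar)) // 2 - 1, 0)
    src_hit = [False] * len(src)
    tar_hit = [False] * len(tar)
    matches = 0
    for i, ch in enumerate(src):
        lo = max(0, i - window)
        hi = min(len(tar), i + window + 1)
        for j in range(lo, hi):
            if not tar_hit[j] and tar[j] == ch:
                src_hit[i] = tar_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are half-transpositions.
    src_seq = [c for c, hit in zip(src, src_hit) if hit]
    tar_seq = [c for c, hit in zip(tar, tar_hit) if hit]
    trans = sum(a != b for a, b in zip(src_seq, tar_seq)) // 2
    return (matches / len(src) + matches / len(tar)
            + (matches - trans) / matches) / 3


def jaro_winkler(src, tar, boost_threshold=0.7, scaling_factor=0.1):
    """Jaro similarity plus a boost for a shared prefix of up to 4 characters."""
    if not 0 <= boost_threshold <= 1:
        raise ValueError('boost_threshold must be between 0 and 1.')
    if not 0 <= scaling_factor <= 0.25:
        raise ValueError('scaling_factor must be between 0 and 0.25.')
    weight = jaro(src, tar)
    if weight > boost_threshold:  # assumption: strictly above the threshold
        prefix = 0
        for a, b in zip(src[:4], tar[:4]):
            if a != b:
                break
            prefix += 1
        weight += prefix * scaling_factor * (1 - weight)
    return weight


print(round(jaro('Niall', 'Neil'), 12), round(jaro_winkler('Niall', 'Neil'), 12))
```

'Niall' and 'Neil' share the one-character prefix 'N', so the Winkler variant scores higher than plain Jaro; pairs with no common prefix, or with a Jaro score below the threshold, score the same in both modes.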

abydos.distance.dist_jaro_winkler(src, tar, qval=1, mode='winkler', long_strings=False, boost_threshold=0.7, scaling_factor=0.1)

Return the Jaro or Jaro-Winkler distance between two strings.

This is a wrapper for JaroWinkler.dist().

Parameters:
    src (str) -- Source string for comparison
    tar (str) -- Target string for comparison
    qval (int) -- The length of each q-gram (defaults to 1: character-wise matching)
    mode (str) -- Indicates which variant of this distance metric to compute:
        `winkler` -- computes the Jaro-Winkler distance (default), which increases the score for matches near the start of the word
        `jaro` -- computes the Jaro distance
    long_strings (bool) -- Set to True to "Increase the probability of a match when the number of matched characters is large. This option allows for a little more tolerance when the strings are large. It is not an appropriate test when comparing fixed length fields such as phone and social security numbers." (Used in 'winkler' mode only.)
    boost_threshold (float) -- A value between 0 and 1, below which the Winkler boost is not applied (defaults to 0.7). (Used in 'winkler' mode only.)
    scaling_factor (float) -- A value between 0 and 0.25, indicating by how much to boost scores for matching prefixes (defaults to 0.1). (Used in 'winkler' mode only.)
Returns:	Jaro or Jaro-Winkler distance
Return type:	float

Examples

>>> round(dist_jaro_winkler('cat', 'hat'), 12)
0.222222222222
>>> round(dist_jaro_winkler('Niall', 'Neil'), 12)
0.195
>>> round(dist_jaro_winkler('aluminum', 'Catalan'), 12)
0.39880952381
>>> round(dist_jaro_winkler('ATCG', 'TAGC'), 12)
0.166666666667

>>> round(dist_jaro_winkler('cat', 'hat', mode='jaro'), 12)
0.222222222222
>>> round(dist_jaro_winkler('Niall', 'Neil', mode='jaro'), 12)
0.216666666667
>>> round(dist_jaro_winkler('aluminum', 'Catalan', mode='jaro'), 12)
0.39880952381
>>> round(dist_jaro_winkler('ATCG', 'TAGC', mode='jaro'), 12)
0.166666666667

abydos.distance.sim_jaro_winkler(src, tar, qval=1, mode='winkler', long_strings=False, boost_threshold=0.7, scaling_factor=0.1)

Return the Jaro or Jaro-Winkler similarity of two strings.

This is a wrapper for JaroWinkler.sim().

Parameters:
    src (str) -- Source string for comparison
    tar (str) -- Target string for comparison
    qval (int) -- The length of each q-gram (defaults to 1: character-wise matching)
    mode (str) -- Indicates which variant of this distance metric to compute:
        `winkler` -- computes the Jaro-Winkler distance (default), which increases the score for matches near the start of the word
        `jaro` -- computes the Jaro distance
    long_strings (bool) -- Set to True to "Increase the probability of a match when the number of matched characters is large. This option allows for a little more tolerance when the strings are large. It is not an appropriate test when comparing fixed length fields such as phone and social security numbers." (Used in 'winkler' mode only.)
    boost_threshold (float) -- A value between 0 and 1, below which the Winkler boost is not applied (defaults to 0.7). (Used in 'winkler' mode only.)
    scaling_factor (float) -- A value between 0 and 0.25, indicating by how much to boost scores for matching prefixes (defaults to 0.1). (Used in 'winkler' mode only.)
Returns:	Jaro or Jaro-Winkler similarity
Return type:	float

Examples

>>> round(sim_jaro_winkler('cat', 'hat'), 12)
0.777777777778
>>> round(sim_jaro_winkler('Niall', 'Neil'), 12)
0.805
>>> round(sim_jaro_winkler('aluminum', 'Catalan'), 12)
0.60119047619
>>> round(sim_jaro_winkler('ATCG', 'TAGC'), 12)
0.833333333333

>>> round(sim_jaro_winkler('cat', 'hat', mode='jaro'), 12)
0.777777777778
>>> round(sim_jaro_winkler('Niall', 'Neil', mode='jaro'), 12)
0.783333333333
>>> round(sim_jaro_winkler('aluminum', 'Catalan', mode='jaro'), 12)
0.60119047619
>>> round(sim_jaro_winkler('ATCG', 'TAGC', mode='jaro'), 12)
0.833333333333

class abydos.distance.Strcmp95

Bases: abydos.distance._distance._Distance

Strcmp95.

This is a Python translation of the C code for strcmp95: http://web.archive.org/web/20110629121242/http://www.census.gov/geo/msb/stand/strcmp.c [WMJL94]. The above file is a US Government publication and, accordingly, in the public domain.

This is based on the Jaro-Winkler distance, but also attempts to correct for some common typos and frequently confused characters. It is also limited to uppercase ASCII characters, so it is appropriate to American names, but not much else.

sim(src, tar, long_strings=False)

Return the strcmp95 similarity of two strings.

Parameters:
    src (str) -- Source string for comparison
    tar (str) -- Target string for comparison
    long_strings (bool) -- Set to True to increase the probability of a match when the number of matched characters is large. This option allows for a little more tolerance when the strings are large. It is not an appropriate test when comparing fixed length fields such as phone and social security numbers.
Returns:	Strcmp95 similarity
Return type:	float

Examples

>>> cmp = Strcmp95()
>>> cmp.sim('cat', 'hat')
0.7777777777777777
>>> cmp.sim('Niall', 'Neil')
0.8454999999999999
>>> cmp.sim('aluminum', 'Catalan')
0.6547619047619048
>>> cmp.sim('ATCG', 'TAGC')
0.8333333333333334

abydos.distance.dist_strcmp95(src, tar, long_strings=False)

Return the strcmp95 distance between two strings.

This is a wrapper for Strcmp95.dist().

Parameters:
    src (str) -- Source string for comparison
    tar (str) -- Target string for comparison
    long_strings (bool) -- Set to True to increase the probability of a match when the number of matched characters is large. This option allows for a little more tolerance when the strings are large. It is not an appropriate test when comparing fixed length fields such as phone and social security numbers.
Returns:	Strcmp95 distance
Return type:	float

Examples

>>> round(dist_strcmp95('cat', 'hat'), 12)
0.222222222222
>>> round(dist_strcmp95('Niall', 'Neil'), 12)
0.1545
>>> round(dist_strcmp95('aluminum', 'Catalan'), 12)
0.345238095238
>>> round(dist_strcmp95('ATCG', 'TAGC'), 12)
0.166666666667

abydos.distance.sim_strcmp95(src, tar, long_strings=False)

Return the strcmp95 similarity of two strings.

This is a wrapper for Strcmp95.sim().

Parameters:
    src (str) -- Source string for comparison
    tar (str) -- Target string for comparison
    long_strings (bool) -- Set to True to increase the probability of a match when the number of matched characters is large. This option allows for a little more tolerance when the strings are large. It is not an appropriate test when comparing fixed length fields such as phone and social security numbers.
Returns:	Strcmp95 similarity
Return type:	float

Examples

>>> sim_strcmp95('cat', 'hat')
0.7777777777777777
>>> sim_strcmp95('Niall', 'Neil')
0.8454999999999999
>>> sim_strcmp95('aluminum', 'Catalan')
0.6547619047619048
>>> sim_strcmp95('ATCG', 'TAGC')
0.8333333333333334

class abydos.distance.Minkowski

Bases: abydos.distance._token_distance._TokenDistance

Minkowski distance.

The Minkowski distance [Min10] is a distance metric in \(L^p\)-space.

dist(src, tar, qval=2, pval=1, alphabet=None)

Return normalized Minkowski distance of two strings.

The normalized Minkowski distance [Min10] is a distance metric in \(L^p\)-space, normalized to [0, 1].

Parameters:	src (str) -- Source string (or QGrams/Counter objects) for comparison tar (str) -- Target string (or QGrams/Counter objects) for comparison qval (int) -- The length of each q-gram; 0 for non-q-gram version pval (int or float) -- The \(p\)-value of the \(L^p\)-space alphabet (collection or int) -- The values or size of the alphabet
Returns:	The normalized Minkowski distance
Return type:	float

Examples

>>> cmp = Minkowski()
>>> cmp.dist('cat', 'hat')
0.5
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.636363636364
>>> round(cmp.dist('Colin', 'Cuilen'), 12)
0.692307692308
>>> cmp.dist('ATCG', 'TAGC')
1.0

dist_abs(src, tar, qval=2, pval=1, normalized=False, alphabet=None)

Return the Minkowski distance (\(L^p\)-norm) of two strings.

Parameters:	src (str) -- Source string (or QGrams/Counter objects) for comparison tar (str) -- Target string (or QGrams/Counter objects) for comparison qval (int) -- The length of each q-gram; 0 for non-q-gram version pval (int or float) -- The \(p\)-value of the \(L^p\)-space normalized (bool) -- Normalizes to [0, 1] if True alphabet (collection or int) -- The values or size of the alphabet
Returns:	The Minkowski distance
Return type:	float

Examples

>>> cmp = Minkowski()
>>> cmp.dist_abs('cat', 'hat')
4.0
>>> cmp.dist_abs('Niall', 'Neil')
7.0
>>> cmp.dist_abs('Colin', 'Cuilen')
9.0
>>> cmp.dist_abs('ATCG', 'TAGC')
10.0
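
The two methods above can be illustrated with a minimal pure-Python sketch. It assumes strings are tokenized into q-grams padded with `$` at the start and `#` at the end; that assumption reproduces the documented values but is not a statement about the library's internals. The names `qgrams` and `minkowski_sketch` are hypothetical, and only finite, positive `pval` is handled.

```python
from collections import Counter


def qgrams(word, qval=2, start="$", stop="#"):
    """Count the q-grams of word (padding symbols are an assumption)."""
    padded = start * (qval - 1) + word + stop * (qval - 1)
    return Counter(padded[i:i + qval] for i in range(len(padded) - qval + 1))


def minkowski_sketch(src, tar, qval=2, pval=1, normalized=False):
    """L^p distance between q-gram count vectors (finite pval > 0 only)."""
    q_src, q_tar = qgrams(src, qval), qgrams(tar, qval)
    # Counter subtraction drops non-positive counts, so the sum of both
    # directions holds |x_i - y_i| for every q-gram that differs.
    diffs = ((q_src - q_tar) + (q_tar - q_src)).values()
    distance = sum(d ** pval for d in diffs) ** (1 / pval)
    if not normalized:
        return distance
    # Normalize by the L^p norm of the combined counts.
    totals = (q_src + q_tar).values()
    normalizer = sum(t ** pval for t in totals) ** (1 / pval)
    return distance / normalizer if normalizer else 0.0


print(minkowski_sketch("cat", "hat"))                   # 4.0
print(minkowski_sketch("cat", "hat", normalized=True))  # 0.5
```

Setting `pval=2` in this sketch gives the Euclidean case, e.g. `minkowski_sketch('cat', 'hat', pval=2)` is 2.0.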

abydos.distance.minkowski(src, tar, qval=2, pval=1, normalized=False, alphabet=None)

Return the Minkowski distance (\(L^p\)-norm) of two strings.

This is a wrapper for Minkowski.dist_abs().

Parameters:	src (str) -- Source string (or QGrams/Counter objects) for comparison tar (str) -- Target string (or QGrams/Counter objects) for comparison qval (int) -- The length of each q-gram; 0 for non-q-gram version pval (int or float) -- The \(p\)-value of the \(L^p\)-space normalized (bool) -- Normalizes to [0, 1] if True alphabet (collection or int) -- The values or size of the alphabet
Returns:	The Minkowski distance
Return type:	float

Examples

>>> minkowski('cat', 'hat')
4.0
>>> minkowski('Niall', 'Neil')
7.0
>>> minkowski('Colin', 'Cuilen')
9.0
>>> minkowski('ATCG', 'TAGC')
10.0

abydos.distance.dist_minkowski(src, tar, qval=2, pval=1, alphabet=None)

Return normalized Minkowski distance of two strings.

This is a wrapper for Minkowski.dist().

Parameters:	src (str) -- Source string (or QGrams/Counter objects) for comparison tar (str) -- Target string (or QGrams/Counter objects) for comparison qval (int) -- The length of each q-gram; 0 for non-q-gram version pval (int or float) -- The \(p\)-value of the \(L^p\)-space alphabet (collection or int) -- The values or size of the alphabet
Returns:	The normalized Minkowski distance
Return type:	float

Examples

>>> dist_minkowski('cat', 'hat')
0.5
>>> round(dist_minkowski('Niall', 'Neil'), 12)
0.636363636364
>>> round(dist_minkowski('Colin', 'Cuilen'), 12)
0.692307692308
>>> dist_minkowski('ATCG', 'TAGC')
1.0

abydos.distance.sim_minkowski(src, tar, qval=2, pval=1, alphabet=None)

Return normalized Minkowski similarity of two strings.

This is a wrapper for Minkowski.sim().

Parameters:	src (str) -- Source string (or QGrams/Counter objects) for comparison tar (str) -- Target string (or QGrams/Counter objects) for comparison qval (int) -- The length of each q-gram; 0 for non-q-gram version pval (int or float) -- The \(p\)-value of the \(L^p\)-space alphabet (collection or int) -- The values or size of the alphabet
Returns:	The normalized Minkowski similarity
Return type:	float

Examples

>>> sim_minkowski('cat', 'hat')
0.5
>>> round(sim_minkowski('Niall', 'Neil'), 12)
0.363636363636
>>> round(sim_minkowski('Colin', 'Cuilen'), 12)
0.307692307692
>>> sim_minkowski('ATCG', 'TAGC')
0.0

class abydos.distance.Manhattan

Bases: abydos.distance._minkowski.Minkowski

Manhattan distance.

Manhattan distance is the city-block or taxi-cab distance, equivalent to Minkowski distance in \(L^1\)-space.

dist(src, tar, qval=2, alphabet=None)

Return the normalized Manhattan distance between two strings.

The normalized Manhattan distance is a distance metric in \(L^1\)-space, normalized to [0, 1].

This is identical to Canberra distance.

Parameters:	src (str) -- Source string (or QGrams/Counter objects) for comparison tar (str) -- Target string (or QGrams/Counter objects) for comparison qval (int) -- The length of each q-gram; 0 for non-q-gram version alphabet (collection or int) -- The values or size of the alphabet
Returns:	The normalized Manhattan distance
Return type:	float

Examples

>>> cmp = Manhattan()
>>> cmp.dist('cat', 'hat')
0.5
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.636363636364
>>> round(cmp.dist('Colin', 'Cuilen'), 12)
0.692307692308
>>> cmp.dist('ATCG', 'TAGC')
1.0

dist_abs(src, tar, qval=2, normalized=False, alphabet=None)

Return the Manhattan distance between two strings.

Parameters:	src (str) -- Source string (or QGrams/Counter objects) for comparison tar (str) -- Target string (or QGrams/Counter objects) for comparison qval (int) -- The length of each q-gram; 0 for non-q-gram version normalized (bool) -- Normalizes to [0, 1] if True alphabet (collection or int) -- The values or size of the alphabet
Returns:	The Manhattan distance
Return type:	float

Examples

>>> cmp = Manhattan()
>>> cmp.dist_abs('cat', 'hat')
4.0
>>> cmp.dist_abs('Niall', 'Neil')
7.0
>>> cmp.dist_abs('Colin', 'Cuilen')
9.0
>>> cmp.dist_abs('ATCG', 'TAGC')
10.0

abydos.distance.manhattan(src, tar, qval=2, normalized=False, alphabet=None)

Return the Manhattan distance between two strings.

This is a wrapper for Manhattan.dist_abs().

Parameters:	src (str) -- Source string (or QGrams/Counter objects) for comparison tar (str) -- Target string (or QGrams/Counter objects) for comparison qval (int) -- The length of each q-gram; 0 for non-q-gram version normalized (bool) -- Normalizes to [0, 1] if True alphabet (collection or int) -- The values or size of the alphabet
Returns:	The Manhattan distance
Return type:	float

Examples

>>> manhattan('cat', 'hat')
4.0
>>> manhattan('Niall', 'Neil')
7.0
>>> manhattan('Colin', 'Cuilen')
9.0
>>> manhattan('ATCG', 'TAGC')
10.0

abydos.distance.dist_manhattan(src, tar, qval=2, alphabet=None)

Return the normalized Manhattan distance between two strings.

This is a wrapper for Manhattan.dist().

Parameters:	src (str) -- Source string (or QGrams/Counter objects) for comparison tar (str) -- Target string (or QGrams/Counter objects) for comparison qval (int) -- The length of each q-gram; 0 for non-q-gram version alphabet (collection or int) -- The values or size of the alphabet
Returns:	The normalized Manhattan distance
Return type:	float

Examples

>>> dist_manhattan('cat', 'hat')
0.5
>>> round(dist_manhattan('Niall', 'Neil'), 12)
0.636363636364
>>> round(dist_manhattan('Colin', 'Cuilen'), 12)
0.692307692308
>>> dist_manhattan('ATCG', 'TAGC')
1.0

abydos.distance.sim_manhattan(src, tar, qval=2, alphabet=None)

Return the normalized Manhattan similarity of two strings.

This is a wrapper for Manhattan.sim().

Parameters:	src (str) -- Source string (or QGrams/Counter objects) for comparison tar (str) -- Target string (or QGrams/Counter objects) for comparison qval (int) -- The length of each q-gram; 0 for non-q-gram version alphabet (collection or int) -- The values or size of the alphabet
Returns:	The normalized Manhattan similarity
Return type:	float

Examples

>>> sim_manhattan('cat', 'hat')
0.5
>>> round(sim_manhattan('Niall', 'Neil'), 12)
0.363636363636
>>> round(sim_manhattan('Colin', 'Cuilen'), 12)
0.307692307692
>>> sim_manhattan('ATCG', 'TAGC')
0.0

class abydos.distance.Euclidean

Bases: abydos.distance._minkowski.Minkowski

Euclidean distance.

Euclidean distance is the straight-line or as-the-crow-flies distance, equivalent to Minkowski distance in \(L^2\)-space.

dist(src, tar, qval=2, alphabet=None)

Return the normalized Euclidean distance between two strings.

The normalized Euclidean distance is a distance metric in \(L^2\)-space, normalized to [0, 1].

Parameters:	src (str) -- Source string (or QGrams/Counter objects) for comparison tar (str) -- Target string (or QGrams/Counter objects) for comparison qval (int) -- The length of each q-gram; 0 for non-q-gram version alphabet (collection or int) -- The values or size of the alphabet
Returns:	The normalized Euclidean distance
Return type:	float

Examples

>>> cmp = Euclidean()
>>> round(cmp.dist('cat', 'hat'), 12)
0.57735026919
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.683130051064
>>> round(cmp.dist('Colin', 'Cuilen'), 12)
0.727606875109
>>> cmp.dist('ATCG', 'TAGC')
1.0

dist_abs(src, tar, qval=2, normalized=False, alphabet=None)

Return the Euclidean distance between two strings.

Parameters:	src (str) -- Source string (or QGrams/Counter objects) for comparison tar (str) -- Target string (or QGrams/Counter objects) for comparison qval (int) -- The length of each q-gram; 0 for non-q-gram version normalized (bool) -- Normalizes to [0, 1] if True alphabet (collection or int) -- The values or size of the alphabet
Returns:	The Euclidean distance
Return type:	float

Examples

>>> cmp = Euclidean()
>>> cmp.dist_abs('cat', 'hat')
2.0
>>> round(cmp.dist_abs('Niall', 'Neil'), 12)
2.645751311065
>>> cmp.dist_abs('Colin', 'Cuilen')
3.0
>>> round(cmp.dist_abs('ATCG', 'TAGC'), 12)
3.162277660168

abydos.distance.euclidean(src, tar, qval=2, normalized=False, alphabet=None)

Return the Euclidean distance between two strings.

This is a wrapper for Euclidean.dist_abs().

Parameters:	src (str) -- Source string (or QGrams/Counter objects) for comparison tar (str) -- Target string (or QGrams/Counter objects) for comparison qval (int) -- The length of each q-gram; 0 for non-q-gram version normalized (bool) -- Normalizes to [0, 1] if True alphabet (collection or int) -- The values or size of the alphabet
Returns:	The Euclidean distance
Return type:	float

Examples

>>> euclidean('cat', 'hat')
2.0
>>> round(euclidean('Niall', 'Neil'), 12)
2.645751311065
>>> euclidean('Colin', 'Cuilen')
3.0
>>> round(euclidean('ATCG', 'TAGC'), 12)
3.162277660168

abydos.distance.dist_euclidean(src, tar, qval=2, alphabet=None)

Return the normalized Euclidean distance between two strings.

This is a wrapper for Euclidean.dist().

Parameters:	src (str) -- Source string (or QGrams/Counter objects) for comparison tar (str) -- Target string (or QGrams/Counter objects) for comparison qval (int) -- The length of each q-gram; 0 for non-q-gram version alphabet (collection or int) -- The values or size of the alphabet
Returns:	The normalized Euclidean distance
Return type:	float

Examples

>>> round(dist_euclidean('cat', 'hat'), 12)
0.57735026919
>>> round(dist_euclidean('Niall', 'Neil'), 12)
0.683130051064
>>> round(dist_euclidean('Colin', 'Cuilen'), 12)
0.727606875109
>>> dist_euclidean('ATCG', 'TAGC')
1.0

abydos.distance.sim_euclidean(src, tar, qval=2, alphabet=None)

Return the normalized Euclidean similarity of two strings.

This is a wrapper for Euclidean.sim().

Parameters:	src (str) -- Source string (or QGrams/Counter objects) for comparison tar (str) -- Target string (or QGrams/Counter objects) for comparison qval (int) -- The length of each q-gram; 0 for non-q-gram version alphabet (collection or int) -- The values or size of the alphabet
Returns:	The normalized Euclidean similarity
Return type:	float

Examples

>>> round(sim_euclidean('cat', 'hat'), 12)
0.42264973081
>>> round(sim_euclidean('Niall', 'Neil'), 12)
0.316869948936
>>> round(sim_euclidean('Colin', 'Cuilen'), 12)
0.272393124891
>>> sim_euclidean('ATCG', 'TAGC')
0.0

class abydos.distance.Chebyshev

Bases: abydos.distance._minkowski.Minkowski

Chebyshev distance.

Chebyshev distance is the chessboard distance, equivalent to Minkowski distance in \(L^\infty\)-space.

dist(*args, **kwargs)

Raise exception when called.

Parameters:	*args -- Variable length argument list **kwargs -- Arbitrary keyword arguments
Raises:	`NotImplementedError` -- Method disabled for Chebyshev distance

dist_abs(src, tar, qval=2, alphabet=None)

Return the Chebyshev distance between two strings.

Parameters:	src (str) -- Source string (or QGrams/Counter objects) for comparison tar (str) -- Target string (or QGrams/Counter objects) for comparison qval (int) -- The length of each q-gram; 0 for non-q-gram version alphabet (collection or int) -- The values or size of the alphabet
Returns:	The Chebyshev distance
Return type:	float

Examples

>>> cmp = Chebyshev()
>>> cmp.dist_abs('cat', 'hat')
1.0
>>> cmp.dist_abs('Niall', 'Neil')
1.0
>>> cmp.dist_abs('Colin', 'Cuilen')
1.0
>>> cmp.dist_abs('ATCG', 'TAGC')
1.0
>>> cmp.dist_abs('ATCG', 'TAGC', qval=1)
0.0
>>> cmp.dist_abs('ATCGATTCGGAATTTC', 'TAGCATAATCGCCG', qval=1)
3.0
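
The `qval=1` examples above can be checked by hand: the Chebyshev distance is the largest difference in the count of any single symbol. A minimal sketch of that case follows; `chebyshev_sketch` is a hypothetical name, and this is an illustration under the assumption that `qval=1` means plain character counts, not the library's code.

```python
from collections import Counter


def chebyshev_sketch(src, tar):
    """Largest per-character count difference between two strings."""
    c_src, c_tar = Counter(src), Counter(tar)
    # Counter subtraction keeps only positive counts, so adding both
    # directions yields the absolute difference for each character.
    diffs = ((c_src - c_tar) + (c_tar - c_src)).values()
    return float(max(diffs, default=0))


# 'T' occurs 6 times in the first string and 3 times in the second.
print(chebyshev_sketch("ATCGATTCGGAATTTC", "TAGCATAATCGCCG"))  # 3.0
```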

sim(*args, **kwargs)

Raise exception when called.

Parameters:	*args -- Variable length argument list **kwargs -- Arbitrary keyword arguments
Raises:	`NotImplementedError` -- Method disabled for Chebyshev distance

abydos.distance.chebyshev(src, tar, qval=2, alphabet=None)

Return the Chebyshev distance between two strings.

This is a wrapper for Chebyshev.dist_abs().

Parameters:	src (str) -- Source string (or QGrams/Counter objects) for comparison tar (str) -- Target string (or QGrams/Counter objects) for comparison qval (int) -- The length of each q-gram; 0 for non-q-gram version alphabet (collection or int) -- The values or size of the alphabet
Returns:	The Chebyshev distance
Return type:	float

Examples

>>> chebyshev('cat', 'hat')
1.0
>>> chebyshev('Niall', 'Neil')
1.0
>>> chebyshev('Colin', 'Cuilen')
1.0
>>> chebyshev('ATCG', 'TAGC')
1.0
>>> chebyshev('ATCG', 'TAGC', qval=1)
0.0
>>> chebyshev('ATCGATTCGGAATTTC', 'TAGCATAATCGCCG', qval=1)
3.0

class abydos.distance.Tversky

Bases: abydos.distance._token_distance._TokenDistance

Tversky index.

For two sets X and Y, the Tversky index [Tve77] is defined as: \(sim_{Tversky}(X, Y) = \frac{|X \cap Y|} {|X \cap Y| + \alpha|X - Y| + \beta|Y - X|}\).

\(\alpha = \beta = 1\) is equivalent to the Jaccard & Tanimoto similarity coefficients.

\(\alpha = \beta = 0.5\) is equivalent to the Sørensen-Dice similarity coefficient [Dic45][Sorensen48].

Unequal α and β will tend to emphasize one or the other set's contributions:

\(\alpha > \beta\) emphasizes the contributions of X over Y

\(\alpha < \beta\) emphasizes the contributions of Y over X

Parameter values' relation to 1 emphasizes different types of contributions:

\(\alpha\) and \(\beta > 1\) emphasize unique contributions over the intersection

\(\alpha\) and \(\beta < 1\) emphasize the intersection over unique contributions

The symmetric variant is defined in [JBG13]. This is activated by specifying a bias parameter.
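
The asymmetric formula above can be sketched in a few lines of Python. This is an illustration only: it assumes q-grams padded with `$` and `#` (which reproduces the documented example values), the names `qgrams` and `tversky_sketch` are hypothetical, and the symmetric `bias` variant is not covered.

```python
from collections import Counter


def qgrams(word, qval=2, start="$", stop="#"):
    """Count the q-grams of word (padding symbols are an assumption)."""
    padded = start * (qval - 1) + word + stop * (qval - 1)
    return Counter(padded[i:i + qval] for i in range(len(padded) - qval + 1))


def tversky_sketch(src, tar, qval=2, alpha=1, beta=1):
    """Tversky index over q-gram multisets (no bias parameter)."""
    if alpha < 0 or beta < 0:
        raise ValueError("alpha and beta must be greater than or equal to 0")
    q_src, q_tar = qgrams(src, qval), qgrams(tar, qval)
    common = sum((q_src & q_tar).values())      # |X intersect Y|
    only_src = sum(q_src.values()) - common     # |X - Y|
    only_tar = sum(q_tar.values()) - common     # |Y - X|
    denominator = common + alpha * only_src + beta * only_tar
    # Returning 0.0 for an empty denominator is an arbitrary choice here.
    return common / denominator if denominator else 0.0


print(tversky_sketch("cat", "hat"))                       # alpha = beta = 1
print(tversky_sketch("cat", "hat", alpha=0.5, beta=0.5))  # 0.5
```

With `alpha = beta = 1` the sketch yields the Jaccard values shown in the examples, and with `alpha = beta = 0.5` the Sørensen-Dice values.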

sim(src, tar, qval=2, alpha=1, beta=1, bias=None)

Return the Tversky index of two strings.

Parameters:	src (str) -- Source string (or QGrams/Counter objects) for comparison tar (str) -- Target string (or QGrams/Counter objects) for comparison qval (int) -- The length of each q-gram; 0 for non-q-gram version alpha (float) -- Tversky index parameter as described above beta (float) -- Tversky index parameter as described above bias (float) -- The symmetric Tversky index bias parameter
Returns:	Tversky similarity
Return type:	float
Raises:	`ValueError` -- Unsupported weight assignment; alpha and beta must be greater than or equal to 0.

Examples

>>> cmp = Tversky()
>>> cmp.sim('cat', 'hat')
0.3333333333333333
>>> cmp.sim('Niall', 'Neil')
0.2222222222222222
>>> cmp.sim('aluminum', 'Catalan')
0.0625
>>> cmp.sim('ATCG', 'TAGC')
0.0

abydos.distance.dist_tversky(src, tar, qval=2, alpha=1, beta=1, bias=None)

Return the Tversky distance between two strings.

This is a wrapper for Tversky.dist().

Parameters:	src (str) -- Source string (or QGrams/Counter objects) for comparison tar (str) -- Target string (or QGrams/Counter objects) for comparison qval (int) -- The length of each q-gram; 0 for non-q-gram version alpha (float) -- Tversky index parameter as described above beta (float) -- Tversky index parameter as described above bias (float) -- The symmetric Tversky index bias parameter
Returns:	Tversky distance
Return type:	float

Examples

>>> dist_tversky('cat', 'hat')
0.6666666666666667
>>> dist_tversky('Niall', 'Neil')
0.7777777777777778
>>> dist_tversky('aluminum', 'Catalan')
0.9375
>>> dist_tversky('ATCG', 'TAGC')
1.0

abydos.distance.sim_tversky(src, tar, qval=2, alpha=1, beta=1, bias=None)

Return the Tversky index of two strings.

This is a wrapper for Tversky.sim().

Parameters:	src (str) -- Source string (or QGrams/Counter objects) for comparison tar (str) -- Target string (or QGrams/Counter objects) for comparison qval (int) -- The length of each q-gram; 0 for non-q-gram version alpha (float) -- Tversky index parameter as described above beta (float) -- Tversky index parameter as described above bias (float) -- The symmetric Tversky index bias parameter
Returns:	Tversky similarity
Return type:	float

Examples

>>> sim_tversky('cat', 'hat')
0.3333333333333333
>>> sim_tversky('Niall', 'Neil')
0.2222222222222222
>>> sim_tversky('aluminum', 'Catalan')
0.0625
>>> sim_tversky('ATCG', 'TAGC')
0.0

class abydos.distance.Dice

Bases: abydos.distance._tversky.Tversky

Sørensen–Dice coefficient.

For two sets X and Y, the Sørensen–Dice coefficient [Dic45][Sorensen48] is \(sim_{dice}(X, Y) = \frac{2 \cdot |X \cap Y|}{|X| + |Y|}\).

This is identical to the Tversky index [Tve77] for \(\alpha = \beta = 0.5\).
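
The equivalence with the Tversky index for \(\alpha = \beta = 0.5\) can be checked by substituting into the Tversky definition and using \(|X| = |X \cap Y| + |X - Y|\) (and likewise for Y):

```latex
\begin{aligned}
sim_{Tversky}(X, Y)\big|_{\alpha = \beta = 0.5}
  &= \frac{|X \cap Y|}{|X \cap Y| + 0.5|X - Y| + 0.5|Y - X|} \\
  &= \frac{2 \cdot |X \cap Y|}{(|X \cap Y| + |X - Y|) + (|X \cap Y| + |Y - X|)} \\
  &= \frac{2 \cdot |X \cap Y|}{|X| + |Y|} = sim_{dice}(X, Y)
\end{aligned}
```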

sim(src, tar, qval=2)

Return the Sørensen–Dice coefficient of two strings.

Parameters:	src (str) -- Source string (or QGrams/Counter objects) for comparison tar (str) -- Target string (or QGrams/Counter objects) for comparison qval (int) -- The length of each q-gram; 0 for non-q-gram version
Returns:	Sørensen–Dice similarity
Return type:	float

Examples

>>> cmp = Dice()
>>> cmp.sim('cat', 'hat')
0.5
>>> cmp.sim('Niall', 'Neil')
0.36363636363636365
>>> cmp.sim('aluminum', 'Catalan')
0.11764705882352941
>>> cmp.sim('ATCG', 'TAGC')
0.0

abydos.distance.dist_dice(src, tar, qval=2)

Return the Sørensen–Dice distance between two strings.

This is a wrapper for Dice.dist().

Parameters:	src (str) -- Source string (or QGrams/Counter objects) for comparison tar (str) -- Target string (or QGrams/Counter objects) for comparison qval (int) -- The length of each q-gram; 0 for non-q-gram version
Returns:	Sørensen–Dice distance
Return type:	float

Examples

>>> dist_dice('cat', 'hat')
0.5
>>> dist_dice('Niall', 'Neil')
0.6363636363636364
>>> dist_dice('aluminum', 'Catalan')
0.8823529411764706
>>> dist_dice('ATCG', 'TAGC')
1.0

abydos.distance.sim_dice(src, tar, qval=2)

Return the Sørensen–Dice coefficient of two strings.

This is a wrapper for Dice.sim().

Parameters:	src (str) -- Source string (or QGrams/Counter objects) for comparison tar (str) -- Target string (or QGrams/Counter objects) for comparison qval (int) -- The length of each q-gram; 0 for non-q-gram version
Returns:	Sørensen–Dice similarity
Return type:	float

Examples

>>> sim_dice('cat', 'hat')
0.5
>>> sim_dice('Niall', 'Neil')
0.36363636363636365
>>> sim_dice('aluminum', 'Catalan')
0.11764705882352941
>>> sim_dice('ATCG', 'TAGC')
0.0

class abydos.distance.Jaccard

Bases: abydos.distance._tversky.Tversky

Jaccard similarity.

For two sets X and Y, the Jaccard similarity coefficient [Jac01] is \(sim_{Jaccard}(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}\).

This is identical to the Tanimoto similarity coefficient [Tan58] and the Tversky index [Tve77] for \(\alpha = \beta = 1\).

sim(src, tar, qval=2)

Return the Jaccard similarity of two strings.

Parameters:	src (str) -- Source string (or QGrams/Counter objects) for comparison tar (str) -- Target string (or QGrams/Counter objects) for comparison qval (int) -- The length of each q-gram; 0 for non-q-gram version
Returns:	Jaccard similarity
Return type:	float

Examples

>>> cmp = Jaccard()
>>> cmp.sim('cat', 'hat')
0.3333333333333333
>>> cmp.sim('Niall', 'Neil')
0.2222222222222222
>>> cmp.sim('aluminum', 'Catalan')
0.0625
>>> cmp.sim('ATCG', 'TAGC')
0.0

tanimoto_coeff(src, tar, qval=2)

Return the Tanimoto distance between two strings.

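Tanimoto distance [Tan58] is \(-log_{2} sim_{Tanimoto}(X, Y)\). Note that the values in the examples below equal \(log_{2} sim_{Tanimoto}(X, Y)\), without the negation, so they are never positive.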

Parameters:	src (str) -- Source string (or QGrams/Counter objects) for comparison tar (str) -- Target string (or QGrams/Counter objects) for comparison qval (int) -- The length of each q-gram; 0 for non-q-gram version
Returns:	Tanimoto distance
Return type:	float

Examples

>>> cmp = Jaccard()
>>> cmp.tanimoto_coeff('cat', 'hat')
-1.5849625007211563
>>> cmp.tanimoto_coeff('Niall', 'Neil')
-2.1699250014423126
>>> cmp.tanimoto_coeff('aluminum', 'Catalan')
-4.0
>>> cmp.tanimoto_coeff('ATCG', 'TAGC')
-inf
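
The example values above are consistent with taking the base-2 logarithm of the Jaccard similarity. A minimal sketch under that reading follows; it assumes q-grams padded with `$` and `#`, and the names `qgrams`, `jaccard_sketch` and `tanimoto_sketch` are hypothetical, not library functions.

```python
import math
from collections import Counter


def qgrams(word, qval=2, start="$", stop="#"):
    """Count the q-grams of word (padding symbols are an assumption)."""
    padded = start * (qval - 1) + word + stop * (qval - 1)
    return Counter(padded[i:i + qval] for i in range(len(padded) - qval + 1))


def jaccard_sketch(src, tar, qval=2):
    """|X intersect Y| / |X union Y| over q-gram multisets."""
    q_src, q_tar = qgrams(src, qval), qgrams(tar, qval)
    union = sum((q_src | q_tar).values())
    return sum((q_src & q_tar).values()) / union if union else 0.0


def tanimoto_sketch(src, tar, qval=2):
    """log2 of the Jaccard similarity; -inf when nothing is shared."""
    sim = jaccard_sketch(src, tar, qval)
    return math.log2(sim) if sim > 0 else float("-inf")


print(tanimoto_sketch("aluminum", "Catalan"))  # -4.0, since sim is 1/16
```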

abydos.distance.dist_jaccard(src, tar, qval=2)

Return the Jaccard distance between two strings.

This is a wrapper for Jaccard.dist().

Parameters:	src (str) -- Source string (or QGrams/Counter objects) for comparison tar (str) -- Target string (or QGrams/Counter objects) for comparison qval (int) -- The length of each q-gram; 0 for non-q-gram version
Returns:	Jaccard distance
Return type:	float

Examples

>>> dist_jaccard('cat', 'hat')
0.6666666666666667
>>> dist_jaccard('Niall', 'Neil')
0.7777777777777778
>>> dist_jaccard('aluminum', 'Catalan')
0.9375
>>> dist_jaccard('ATCG', 'TAGC')
1.0

abydos.distance.sim_jaccard(src, tar, qval=2)

Return the Jaccard similarity of two strings.

This is a wrapper for Jaccard.sim().

Parameters:	src (str) -- Source string (or QGrams/Counter objects) for comparison tar (str) -- Target string (or QGrams/Counter objects) for comparison qval (int) -- The length of each q-gram; 0 for non-q-gram version
Returns:	Jaccard similarity
Return type:	float

Examples

>>> sim_jaccard('cat', 'hat')
0.3333333333333333
>>> sim_jaccard('Niall', 'Neil')
0.2222222222222222
>>> sim_jaccard('aluminum', 'Catalan')
0.0625
>>> sim_jaccard('ATCG', 'TAGC')
0.0

abydos.distance.tanimoto(src, tar, qval=2)

Return the Tanimoto coefficient of two strings.

This is a wrapper for Jaccard.tanimoto_coeff().

Parameters:	src (str) -- Source string (or QGrams/Counter objects) for comparison tar (str) -- Target string (or QGrams/Counter objects) for comparison qval (int) -- The length of each q-gram; 0 for non-q-gram version
Returns:	Tanimoto distance
Return type:	float

Examples

>>> tanimoto('cat', 'hat')
-1.5849625007211563
>>> tanimoto('Niall', 'Neil')
-2.1699250014423126
>>> tanimoto('aluminum', 'Catalan')
-4.0
>>> tanimoto('ATCG', 'TAGC')
-inf

class abydos.distance.Overlap

Bases: abydos.distance._token_distance._TokenDistance

Overlap coefficient.

For two sets X and Y, the overlap coefficient [Szy34][Sim49], also called the Szymkiewicz-Simpson coefficient, is \(sim_{overlap}(X, Y) = \frac{|X \cap Y|}{min(|X|, |Y|)}\).

sim(src, tar, qval=2)

Return the overlap coefficient of two strings.

Parameters:	src (str) -- Source string (or QGrams/Counter objects) for comparison tar (str) -- Target string (or QGrams/Counter objects) for comparison qval (int) -- The length of each q-gram; 0 for non-q-gram version
Returns:	Overlap similarity
Return type:	float

Examples

>>> cmp = Overlap()
>>> cmp.sim('cat', 'hat')
0.5
>>> cmp.sim('Niall', 'Neil')
0.4
>>> cmp.sim('aluminum', 'Catalan')
0.125
>>> cmp.sim('ATCG', 'TAGC')
0.0
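
As a rough illustration of the overlap coefficient formula, the sketch below divides the number of shared q-grams by the size of the smaller q-gram multiset. The `$`/`#` padding and the names `qgrams` and `overlap_sketch` are assumptions made for this example.

```python
from collections import Counter


def qgrams(word, qval=2, start="$", stop="#"):
    """Count the q-grams of word (padding symbols are an assumption)."""
    padded = start * (qval - 1) + word + stop * (qval - 1)
    return Counter(padded[i:i + qval] for i in range(len(padded) - qval + 1))


def overlap_sketch(src, tar, qval=2):
    """|X intersect Y| / min(|X|, |Y|) over q-gram multisets."""
    q_src, q_tar = qgrams(src, qval), qgrams(tar, qval)
    smaller = min(sum(q_src.values()), sum(q_tar.values()))
    if smaller == 0:
        return 0.0  # arbitrary choice for an empty multiset
    return sum((q_src & q_tar).values()) / smaller


# 'Niall' has 6 padded bigrams, 'Neil' has 5, and they share 2.
print(overlap_sketch("Niall", "Neil"))  # 0.4
```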

abydos.distance.dist_overlap(src, tar, qval=2)

Return the overlap distance between two strings.

This is a wrapper for Overlap.dist().

Parameters:	src (str) -- Source string (or QGrams/Counter objects) for comparison tar (str) -- Target string (or QGrams/Counter objects) for comparison qval (int) -- The length of each q-gram; 0 for non-q-gram version
Returns:	Overlap distance
Return type:	float

Examples

>>> dist_overlap('cat', 'hat')
0.5
>>> dist_overlap('Niall', 'Neil')
0.6
>>> dist_overlap('aluminum', 'Catalan')
0.875
>>> dist_overlap('ATCG', 'TAGC')
1.0

abydos.distance.sim_overlap(src, tar, qval=2)

Return the overlap coefficient of two strings.

This is a wrapper for Overlap.sim().

Parameters:	src (str) -- Source string (or QGrams/Counter objects) for comparison tar (str) -- Target string (or QGrams/Counter objects) for comparison qval (int) -- The length of each q-gram; 0 for non-q-gram version
Returns:	Overlap similarity
Return type:	float

Examples

>>> sim_overlap('cat', 'hat')
0.5
>>> sim_overlap('Niall', 'Neil')
0.4
>>> sim_overlap('aluminum', 'Catalan')
0.125
>>> sim_overlap('ATCG', 'TAGC')
0.0

class abydos.distance.Cosine

Bases: abydos.distance._token_distance._TokenDistance

Cosine similarity.

For two sets X and Y, the cosine similarity, Otsuka-Ochiai coefficient, or Ochiai coefficient [Ots36][Och57] is: \(sim_{cosine}(X, Y) = \frac{|X \cap Y|}{\sqrt{|X| \cdot |Y|}}\).

sim(src, tar, qval=2)

Return the cosine similarity of two strings.

Parameters:	src (str) -- Source string (or QGrams/Counter objects) for comparison tar (str) -- Target string (or QGrams/Counter objects) for comparison qval (int) -- The length of each q-gram; 0 for non-q-gram version
Returns:	Cosine similarity
Return type:	float

Examples

>>> cmp = Cosine()
>>> cmp.sim('cat', 'hat')
0.5
>>> cmp.sim('Niall', 'Neil')
0.3651483716701107
>>> cmp.sim('aluminum', 'Catalan')
0.11785113019775793
>>> cmp.sim('ATCG', 'TAGC')
0.0
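
The set-based cosine formula can be sketched the same way: shared q-grams divided by the geometric mean of the two multiset sizes. As before, the `$`/`#` padding is an assumption and `qgrams` and `cosine_sketch` are hypothetical names.

```python
import math
from collections import Counter


def qgrams(word, qval=2, start="$", stop="#"):
    """Count the q-grams of word (padding symbols are an assumption)."""
    padded = start * (qval - 1) + word + stop * (qval - 1)
    return Counter(padded[i:i + qval] for i in range(len(padded) - qval + 1))


def cosine_sketch(src, tar, qval=2):
    """|X intersect Y| / sqrt(|X| * |Y|) over q-gram multisets."""
    q_src, q_tar = qgrams(src, qval), qgrams(tar, qval)
    size_product = sum(q_src.values()) * sum(q_tar.values())
    if size_product == 0:
        return 0.0  # arbitrary choice for an empty multiset
    return sum((q_src & q_tar).values()) / math.sqrt(size_product)


# 2 shared bigrams out of 4 on each side: 2 / sqrt(16)
print(cosine_sketch("cat", "hat"))  # 0.5
```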

abydos.distance.dist_cosine(src, tar, qval=2)

Return the cosine distance between two strings.

This is a wrapper for Cosine.dist().

Parameters:	src (str) -- Source string (or QGrams/Counter objects) for comparison tar (str) -- Target string (or QGrams/Counter objects) for comparison qval (int) -- The length of each q-gram; 0 for non-q-gram version
Returns:	Cosine distance
Return type:	float

Examples

>>> dist_cosine('cat', 'hat')
0.5
>>> dist_cosine('Niall', 'Neil')
0.6348516283298893
>>> dist_cosine('aluminum', 'Catalan')
0.882148869802242
>>> dist_cosine('ATCG', 'TAGC')
1.0

abydos.distance.sim_cosine(src, tar, qval=2)

Return the cosine similarity of two strings.

This is a wrapper for Cosine.sim().

Parameters:	src (str) -- Source string (or QGrams/Counter objects) for comparison tar (str) -- Target string (or QGrams/Counter objects) for comparison qval (int) -- The length of each q-gram; 0 for non-q-gram version
Returns:	Cosine similarity
Return type:	float

Examples

>>> sim_cosine('cat', 'hat')
0.5
>>> sim_cosine('Niall', 'Neil')
0.3651483716701107
>>> sim_cosine('aluminum', 'Catalan')
0.11785113019775793
>>> sim_cosine('ATCG', 'TAGC')
0.0

class abydos.distance.Bag

Bases: abydos.distance._token_distance._TokenDistance

Bag distance.

Bag distance is proposed in [BCP02]. It is defined as: \(max(|multiset(src)-multiset(tar)|, |multiset(tar)-multiset(src)|)\).

dist(src, tar)

Return the normalized bag distance between two strings.

Bag distance is normalized by dividing by \(max( |src|, |tar| )\).

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	Normalized bag distance
Return type:	float

Examples

>>> cmp = Bag()
>>> cmp.dist('cat', 'hat')
0.3333333333333333
>>> cmp.dist('Niall', 'Neil')
0.4
>>> cmp.dist('aluminum', 'Catalan')
0.625
>>> cmp.dist('ATCG', 'TAGC')
0.0

dist_abs(src, tar)

Return the bag distance between two strings.

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	Bag distance
Return type:	int

Examples

>>> cmp = Bag()
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
2
>>> cmp.dist_abs('aluminum', 'Catalan')
5
>>> cmp.dist_abs('ATCG', 'TAGC')
0
>>> cmp.dist_abs('abcdefg', 'hijklm')
7
>>> cmp.dist_abs('abcdefg', 'hijklmno')
8
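
The bag distance definition and its normalization translate directly into character multisets. The following is a minimal sketch of both, with hypothetical function names, rather than the library's implementation.

```python
from collections import Counter


def bag_sketch(src, tar):
    """max(|multiset(src) - multiset(tar)|, |multiset(tar) - multiset(src)|)."""
    c_src, c_tar = Counter(src), Counter(tar)
    # Counter subtraction is multiset difference (negative counts dropped).
    return max(sum((c_src - c_tar).values()), sum((c_tar - c_src).values()))


def bag_normalized_sketch(src, tar):
    """Bag distance divided by the length of the longer string."""
    longest = max(len(src), len(tar))
    return bag_sketch(src, tar) / longest if longest else 0.0


print(bag_sketch("aluminum", "Catalan"))             # 5
print(bag_normalized_sketch("aluminum", "Catalan"))  # 0.625
```

Because only character counts matter, anagrams such as 'ATCG' and 'TAGC' are at distance 0.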

abydos.distance.bag(src, tar)

Return the bag distance between two strings.

This is a wrapper for Bag.dist_abs().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	Bag distance
Return type:	int

Examples

>>> bag('cat', 'hat')
1
>>> bag('Niall', 'Neil')
2
>>> bag('aluminum', 'Catalan')
5
>>> bag('ATCG', 'TAGC')
0
>>> bag('abcdefg', 'hijklm')
7
>>> bag('abcdefg', 'hijklmno')
8

abydos.distance.dist_bag(src, tar)

Return the normalized bag distance between two strings.

This is a wrapper for Bag.dist().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	Normalized bag distance
Return type:	float

Examples

>>> dist_bag('cat', 'hat')
0.3333333333333333
>>> dist_bag('Niall', 'Neil')
0.4
>>> dist_bag('aluminum', 'Catalan')
0.625
>>> dist_bag('ATCG', 'TAGC')
0.0

abydos.distance.sim_bag(src, tar)

Return the normalized bag similarity of two strings.

This is a wrapper for Bag.sim().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	Normalized bag similarity
Return type:	float

Examples

>>> round(sim_bag('cat', 'hat'), 12)
0.666666666667
>>> sim_bag('Niall', 'Neil')
0.6
>>> sim_bag('aluminum', 'Catalan')
0.375
>>> sim_bag('ATCG', 'TAGC')
1.0

class abydos.distance.MongeElkan

Bases: abydos.distance._distance._Distance

Monge-Elkan similarity.

Monge-Elkan is defined in [ME96].

Note: Monge-Elkan is NOT a symmetric similarity algorithm. Thus, the similarity of src to tar is not necessarily equal to the similarity of tar to src. If the symmetric argument is True, a symmetric value is calculated, at the cost of doubling the computation time (since \(sim_{Monge-Elkan}(src, tar)\) and \(sim_{Monge-Elkan}(tar, src)\) are both calculated and then averaged).

sim(src, tar, sim_func=<function sim_levenshtein>, symmetric=False)

Return the Monge-Elkan similarity of two strings.

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison sim_func (function) -- The internal similarity metric to employ symmetric (bool) -- Return a symmetric similarity measure
Returns:	Monge-Elkan similarity
Return type:	float

Examples

>>> cmp = MongeElkan()
>>> cmp.sim('cat', 'hat')
0.75
>>> round(cmp.sim('Niall', 'Neil'), 12)
0.666666666667
>>> round(cmp.sim('aluminum', 'Catalan'), 12)
0.388888888889
>>> cmp.sim('ATCG', 'TAGC')
0.5
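
To make the asymmetry concrete, here is a hedged sketch of the usual Monge-Elkan scheme: for each source token take the best inner similarity against any target token, then average over the source tokens. It assumes tokens are bigrams padded with `$` and `#` and that the inner similarity is a length-normalized Levenshtein similarity; both assumptions reproduce the documented values, and all names below are hypothetical.

```python
def levenshtein(a, b):
    """Plain Levenshtein edit distance (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        cur = [i]
        for j, ch_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                    # deletion
                           cur[j - 1] + 1,                 # insertion
                           prev[j - 1] + (ch_a != ch_b)))  # substitution
        prev = cur
    return prev[-1]


def sim_lev(a, b):
    """Levenshtein similarity normalized by the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def bigrams(word):
    """Padded bigram tokens (the padding is an assumption)."""
    padded = "$" + word + "#"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def monge_elkan_sketch(src, tar, sim_func=sim_lev, symmetric=False):
    """Mean, over source tokens, of the best match among target tokens."""
    toks_src, toks_tar = bigrams(src), bigrams(tar)
    best = [max(sim_func(s, t) for t in toks_tar) for s in toks_src]
    score = sum(best) / len(toks_src)
    if symmetric:
        # Average both directions, at twice the cost.
        score = (score + monge_elkan_sketch(tar, src, sim_func)) / 2
    return score


print(monge_elkan_sketch("cat", "hat"))  # 0.75
```

In this sketch the score for ('Niall', 'Neil') differs from the score for ('Neil', 'Niall'), and passing `symmetric=True` returns their mean.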

abydos.distance.dist_monge_elkan(src, tar, sim_func=<function sim_levenshtein>, symmetric=False)

Return the Monge-Elkan distance between two strings.

This is a wrapper for MongeElkan.dist().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison sim_func (function) -- The internal similarity metric to employ symmetric (bool) -- Return a symmetric similarity measure
Returns:	Monge-Elkan distance
Return type:	float

Examples

>>> dist_monge_elkan('cat', 'hat')
0.25
>>> round(dist_monge_elkan('Niall', 'Neil'), 12)
0.333333333333
>>> round(dist_monge_elkan('aluminum', 'Catalan'), 12)
0.611111111111
>>> dist_monge_elkan('ATCG', 'TAGC')
0.5

abydos.distance.sim_monge_elkan(src, tar, sim_func=<function sim_levenshtein>, symmetric=False)

Return the Monge-Elkan similarity of two strings.

This is a wrapper for MongeElkan.sim().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison sim_func (function) -- The internal similarity metric to employ symmetric (bool) -- Return a symmetric similarity measure
Returns:	Monge-Elkan similarity
Return type:	float

Examples

>>> sim_monge_elkan('cat', 'hat')
0.75
>>> round(sim_monge_elkan('Niall', 'Neil'), 12)
0.666666666667
>>> round(sim_monge_elkan('aluminum', 'Catalan'), 12)
0.388888888889
>>> sim_monge_elkan('ATCG', 'TAGC')
0.5

class abydos.distance.NeedlemanWunsch[source]¶

Bases: abydos.distance._distance._Distance

Needleman-Wunsch score.

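The Needleman-Wunsch score [NW70] is a standard global alignment score. Unlike a true distance, higher scores indicate more similar strings, and the score may be negative.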
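A minimal sketch of the usual dynamic-programming recurrence for this score is given here. The function name `needleman_wunsch_sketch` and the plain list-of-lists matrix are assumptions for illustration; the default character similarity is 1 for a match and 0 for a mismatch, mirroring the identity similarity named in the signature.

```python
def needleman_wunsch_sketch(src, tar, gap_cost=1, sim_func=None):
    """Global alignment score via the classic dynamic-programming table."""
    if sim_func is None:
        sim_func = lambda a, b: 1.0 if a == b else 0.0
    rows, cols = len(src) + 1, len(tar) + 1
    score = [[0.0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per character.
    for i in range(1, rows):
        score[i][0] = float(-i * gap_cost)
    for j in range(1, cols):
        score[0][j] = float(-j * gap_cost)
    for i in range(1, rows):
        for j in range(1, cols):
            match = score[i - 1][j - 1] + sim_func(src[i - 1], tar[j - 1])
            delete = score[i - 1][j] - gap_cost
            insert = score[i][j - 1] - gap_cost
            score[i][j] = max(match, delete, insert)
    return score[-1][-1]


print(needleman_wunsch_sketch('cat', 'hat'))     # 2.0
print(needleman_wunsch_sketch('Niall', 'Neil'))  # 1.0
```

Each matching character adds 1 and each gap subtracts `gap_cost`, so strings with few matches and different lengths can score below zero.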

dist_abs(src, tar, gap_cost=1, sim_func=<function sim_ident>)[source]¶

Return the Needleman-Wunsch score of two strings.

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison gap_cost (float) -- The cost of an alignment gap (1 by default) sim_func (function) -- A function that returns the similarity of two characters (identity similarity by default)
Returns:	Needleman-Wunsch score
Return type:	float

Examples

>>> cmp = NeedlemanWunsch()
>>> cmp.dist_abs('cat', 'hat')
2.0
>>> cmp.dist_abs('Niall', 'Neil')
1.0
>>> cmp.dist_abs('aluminum', 'Catalan')
-1.0
>>> cmp.dist_abs('ATCG', 'TAGC')
0.0

static sim_matrix(src, tar, mat=None, mismatch_cost=0, match_cost=1, symmetric=True, alphabet=None)[source]¶

Return the matrix similarity of two strings.

With the default parameters, this is identical to sim_ident. It is possible for sim_matrix to return values outside of the range \([0, 1]\), if values outside that range are present in mat, mismatch_cost, or match_cost.

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison mat (dict) -- A dict mapping tuples to costs; the tuples are (src, tar) pairs of symbols from the alphabet parameter mismatch_cost (float) -- The value returned if (src, tar) is absent from mat when src does not equal tar match_cost (float) -- The value returned if (src, tar) is absent from mat when src equals tar symmetric (bool) -- True if the cost of src not matching tar is identical to the cost of tar not matching src; in this case, the values in mat need only contain (src, tar) or (tar, src), not both alphabet (str) -- A collection of tokens from which src and tar are drawn; if this is defined a ValueError is raised if either tar or src is not found in alphabet
Returns:	Matrix similarity
Return type:	float
Raises:	`ValueError` -- src value not in alphabet `ValueError` -- tar value not in alphabet
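The lookup rules in the parameter descriptions above can be sketched as follows. This is a re-statement of the documented behavior under the hypothetical name `matrix_similarity`, not the library's source; the sample matrix is invented for the example.

```python
def matrix_similarity(src, tar, mat=None, mismatch_cost=0, match_cost=1,
                      symmetric=True, alphabet=None):
    """Look up a pairwise similarity, falling back to match/mismatch costs."""
    if alphabet:
        alphabet = tuple(alphabet)
        if src not in alphabet:
            raise ValueError('src value not in alphabet')
        if tar not in alphabet:
            raise ValueError('tar value not in alphabet')
    mat = mat or {}
    if (src, tar) in mat:
        return mat[(src, tar)]
    # With symmetric=True, a single entry covers both orderings of the pair.
    if symmetric and (tar, src) in mat:
        return mat[(tar, src)]
    return match_cost if src == tar else mismatch_cost


vowel_costs = {('a', 'e'): 0.5}  # hypothetical partial matrix
print(matrix_similarity('e', 'a', vowel_costs))                   # 0.5
print(matrix_similarity('e', 'a', vowel_costs, symmetric=False))  # 0
```

Because the fallback values are returned unchanged, any cost outside \([0, 1]\) in `mat`, `mismatch_cost`, or `match_cost` passes straight through to the caller.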

Examples

>>> NeedlemanWunsch.sim_matrix('cat', 'hat')
0
>>> NeedlemanWunsch.sim_matrix('hat', 'hat')
1

abydos.distance.needleman_wunsch(src, tar, gap_cost=1, sim_func=<function sim_ident>)[source]¶

Return the Needleman-Wunsch score of two strings.

This is a wrapper for NeedlemanWunsch.dist_abs().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison gap_cost (float) -- The cost of an alignment gap (1 by default) sim_func (function) -- A function that returns the similarity of two characters (identity similarity by default)
Returns:	Needleman-Wunsch score
Return type:	float

Examples

>>> needleman_wunsch('cat', 'hat')
2.0
>>> needleman_wunsch('Niall', 'Neil')
1.0
>>> needleman_wunsch('aluminum', 'Catalan')
-1.0
>>> needleman_wunsch('ATCG', 'TAGC')
0.0

class abydos.distance.SmithWaterman[source]¶

Bases: abydos.distance._needleman_wunsch.NeedlemanWunsch

Smith-Waterman score.

The Smith-Waterman score [SW81] is a standard edit distance measure, differing from Needleman-Wunsch in that it focuses on local alignment and disallows negative scores.
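The two differences named above (local alignment and no negative scores) appear as a floor of zero in every cell of the table. The sketch below is the textbook formulation under the hypothetical name `smith_waterman_sketch`. It reports the best-scoring cell anywhere in the matrix, so it is not guaranteed to reproduce the outputs listed in the examples for this class.

```python
def smith_waterman_sketch(src, tar, gap_cost=1, sim_func=None):
    """Textbook local alignment score: best cell of a zero-floored table."""
    if sim_func is None:
        sim_func = lambda a, b: 1.0 if a == b else 0.0
    rows, cols = len(src) + 1, len(tar) + 1
    # First row and column stay at zero: an alignment may start anywhere.
    score = [[0.0] * cols for _ in range(rows)]
    best = 0.0
    for i in range(1, rows):
        for j in range(1, cols):
            match = score[i - 1][j - 1] + sim_func(src[i - 1], tar[j - 1])
            delete = score[i - 1][j] - gap_cost
            insert = score[i][j - 1] - gap_cost
            # The floor of 0.0 is what disallows negative scores.
            score[i][j] = max(0.0, match, delete, insert)
            best = max(best, score[i][j])
    return best


print(smith_waterman_sketch('cat', 'hat'))  # 2.0
```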

dist_abs(src, tar, gap_cost=1, sim_func=<function sim_ident>)[source]¶

Return the Smith-Waterman score of two strings.

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison gap_cost (float) -- The cost of an alignment gap (1 by default) sim_func (function) -- A function that returns the similarity of two characters (identity similarity by default)
Returns:	Smith-Waterman score
Return type:	float

Examples

>>> cmp = SmithWaterman()
>>> cmp.dist_abs('cat', 'hat')
2.0
>>> cmp.dist_abs('Niall', 'Neil')
1.0
>>> cmp.dist_abs('aluminum', 'Catalan')
0.0
>>> cmp.dist_abs('ATCG', 'TAGC')
1.0

abydos.distance.smith_waterman(src, tar, gap_cost=1, sim_func=<function sim_ident>)[source]¶

Return the Smith-Waterman score of two strings.

This is a wrapper for SmithWaterman.dist_abs().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison gap_cost (float) -- The cost of an alignment gap (1 by default) sim_func (function) -- A function that returns the similarity of two characters (identity similarity by default)
Returns:	Smith-Waterman score
Return type:	float

Examples

>>> smith_waterman('cat', 'hat')
2.0
>>> smith_waterman('Niall', 'Neil')
1.0
>>> smith_waterman('aluminum', 'Catalan')
0.0
>>> smith_waterman('ATCG', 'TAGC')
1.0

class abydos.distance.Gotoh[source]¶

Bases: abydos.distance._needleman_wunsch.NeedlemanWunsch

Gotoh score.

The Gotoh score [Got82] is essentially Needleman-Wunsch with affine gap penalties.
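Affine gap penalties charge more for opening a gap than for extending one. A minimal three-table sketch follows. The name `gotoh_sketch`, the table names, and the convention that a gap of length k costs `gap_open + (k - 1) * gap_ext` are assumptions of this example rather than a statement of the library's internals.

```python
def gotoh_sketch(src, tar, gap_open=1.0, gap_ext=0.4, sim_func=None):
    """Global alignment score with affine gap penalties (three-table form)."""
    if sim_func is None:
        sim_func = lambda a, b: 1.0 if a == b else 0.0
    neg_inf = float('-inf')
    rows, cols = len(src) + 1, len(tar) + 1
    # match[i][j]: best score where src[i-1] is aligned with tar[j-1]
    # gap_t[i][j]: best score where src[i-1] is aligned with a gap
    # gap_s[i][j]: best score where tar[j-1] is aligned with a gap
    match = [[neg_inf] * cols for _ in range(rows)]
    gap_t = [[neg_inf] * cols for _ in range(rows)]
    gap_s = [[neg_inf] * cols for _ in range(rows)]
    match[0][0] = 0.0
    for i in range(1, rows):
        gap_t[i][0] = -gap_open - gap_ext * (i - 1)
    for j in range(1, cols):
        gap_s[0][j] = -gap_open - gap_ext * (j - 1)
    for i in range(1, rows):
        for j in range(1, cols):
            sim = sim_func(src[i - 1], tar[j - 1])
            match[i][j] = sim + max(match[i - 1][j - 1],
                                    gap_t[i - 1][j - 1],
                                    gap_s[i - 1][j - 1])
            # Extending an existing gap is cheaper than opening a new one.
            gap_t[i][j] = max(match[i - 1][j] - gap_open,
                              gap_s[i - 1][j] - gap_open,
                              gap_t[i - 1][j] - gap_ext)
            gap_s[i][j] = max(match[i][j - 1] - gap_open,
                              gap_t[i][j - 1] - gap_open,
                              gap_s[i][j - 1] - gap_ext)
    return max(match[-1][-1], gap_t[-1][-1], gap_s[-1][-1])


print(gotoh_sketch('cat', 'hat'))  # 2.0
```

With the defaults, aligning 'abc' against 'a' scores one match, one gap opening, and one gap extension, which is roughly -0.4.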

dist_abs(src, tar, gap_open=1, gap_ext=0.4, sim_func=<function sim_ident>)[source]¶

Return the Gotoh score of two strings.

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison gap_open (float) -- The cost of an open alignment gap (1 by default) gap_ext (float) -- The cost of an alignment gap extension (0.4 by default) sim_func (function) -- A function that returns the similarity of two characters (identity similarity by default)
Returns:	Gotoh score
Return type:	float

Examples

>>> cmp = Gotoh()
>>> cmp.dist_abs('cat', 'hat')
2.0
>>> cmp.dist_abs('Niall', 'Neil')
1.0
>>> round(cmp.dist_abs('aluminum', 'Catalan'), 12)
-0.4

abydos.distance.gotoh(src, tar, gap_open=1, gap_ext=0.4, sim_func=<function sim_ident>)[source]¶

Return the Gotoh score of two strings.

This is a wrapper for Gotoh.dist_abs().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison gap_open (float) -- The cost of an open alignment gap (1 by default) gap_ext (float) -- The cost of an alignment gap extension (0.4 by default) sim_func (function) -- A function that returns the similarity of two characters (identity similarity by default)
Returns:	Gotoh score
Return type:	float

Examples

>>> gotoh('cat', 'hat')
2.0
>>> gotoh('Niall', 'Neil')
1.0
>>> round(gotoh('aluminum', 'Catalan'), 12)
-0.4

class abydos.distance.LCSseq[source]¶

Bases: abydos.distance._distance._Distance

Longest common subsequence.

Longest common subsequence (LCSseq) is the longest subsequence of characters that two strings have in common.

lcsseq(src, tar)[source]¶

Return the longest common subsequence of two strings.

Based on the dynamic programming algorithm from http://rosettacode.org/wiki/Longest_common_subsequence [Cod18a]. This is licensed GFDL 1.2.

Modifications include: conversion to a numpy array in place of a list of lists
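For orientation, a minimal sketch of the dynamic-programming approach is shown here. It uses plain lists rather than numpy, and the name `lcs_subsequence` is hypothetical. When several subsequences share the maximum length, the tie-breaking in the backtracking step may pick a different one than the library does.

```python
def lcs_subsequence(src, tar):
    """Longest common subsequence via a length table plus backtracking."""
    rows, cols = len(src) + 1, len(tar) + 1
    lengths = [[0] * cols for _ in range(rows)]
    for i, s_char in enumerate(src):
        for j, t_char in enumerate(tar):
            if s_char == t_char:
                lengths[i + 1][j + 1] = lengths[i][j] + 1
            else:
                lengths[i + 1][j + 1] = max(lengths[i + 1][j], lengths[i][j + 1])
    # Walk back from the bottom-right corner to recover the characters.
    result = []
    i, j = len(src), len(tar)
    while i and j:
        if lengths[i][j] == lengths[i - 1][j]:
            i -= 1
        elif lengths[i][j] == lengths[i][j - 1]:
            j -= 1
        else:
            result.append(src[i - 1])
            i -= 1
            j -= 1
    return ''.join(reversed(result))


print(lcs_subsequence('cat', 'hat'))     # at
print(lcs_subsequence('Niall', 'Neil'))  # Nil
```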

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	The longest common subsequence
Return type:	str

Examples

>>> sseq = LCSseq()
>>> sseq.lcsseq('cat', 'hat')
'at'
>>> sseq.lcsseq('Niall', 'Neil')
'Nil'
>>> sseq.lcsseq('aluminum', 'Catalan')
'aln'
>>> sseq.lcsseq('ATCG', 'TAGC')
'AC'

sim(src, tar)[source]¶

Return the longest common subsequence similarity of two strings.

Longest common subsequence similarity (\(sim_{LCSseq}\)).

This employs the LCSseq function to derive a similarity metric: \(sim_{LCSseq}(s,t) = \frac{|LCSseq(s,t)|}{max(|s|, |t|)}\)

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	LCSseq similarity
Return type:	float

Examples

>>> sseq = LCSseq()
>>> sseq.sim('cat', 'hat')
0.6666666666666666
>>> sseq.sim('Niall', 'Neil')
0.6
>>> sseq.sim('aluminum', 'Catalan')
0.375
>>> sseq.sim('ATCG', 'TAGC')
0.5

abydos.distance.lcsseq(src, tar)[source]¶

Return the longest common subsequence of two strings.

This is a wrapper for LCSseq.lcsseq().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	The longest common subsequence
Return type:	str

Examples

>>> lcsseq('cat', 'hat')
'at'
>>> lcsseq('Niall', 'Neil')
'Nil'
>>> lcsseq('aluminum', 'Catalan')
'aln'
>>> lcsseq('ATCG', 'TAGC')
'AC'

abydos.distance.dist_lcsseq(src, tar)[source]¶

Return the longest common subsequence distance between two strings.

This is a wrapper for LCSseq.dist().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	LCSseq distance
Return type:	float

Examples

>>> dist_lcsseq('cat', 'hat')
0.33333333333333337
>>> dist_lcsseq('Niall', 'Neil')
0.4
>>> dist_lcsseq('aluminum', 'Catalan')
0.625
>>> dist_lcsseq('ATCG', 'TAGC')
0.5

abydos.distance.sim_lcsseq(src, tar)[source]¶

Return the longest common subsequence similarity of two strings.

This is a wrapper for LCSseq.sim().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	LCSseq similarity
Return type:	float

Examples

>>> sim_lcsseq('cat', 'hat')
0.6666666666666666
>>> sim_lcsseq('Niall', 'Neil')
0.6
>>> sim_lcsseq('aluminum', 'Catalan')
0.375
>>> sim_lcsseq('ATCG', 'TAGC')
0.5

class abydos.distance.LCSstr[source]¶

Bases: abydos.distance._distance._Distance

Longest common substring.

lcsstr(src, tar)[source]¶

Return the longest common substring of two strings.

Longest common substring (LCSstr).

Based on the code from https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Longest_common_substring [Wik18]. This is licensed Creative Commons: Attribution-ShareAlike 3.0.

Modifications include:

conversion to a numpy array in place of a list of lists

conversion to Python 2/3-safe range from xrange via six
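A minimal sketch of the underlying technique, using plain lists instead of numpy, is given here. The name `longest_common_substring` is hypothetical, and on ties this version keeps the first longest match it encounters, which is an assumption rather than documented behavior.

```python
def longest_common_substring(src, tar):
    """Longest contiguous run of characters shared by src and tar."""
    best_len, best_end = 0, 0
    previous = [0] * (len(tar) + 1)
    for i in range(1, len(src) + 1):
        current = [0] * (len(tar) + 1)
        for j in range(1, len(tar) + 1):
            if src[i - 1] == tar[j - 1]:
                # Extend the common run that ended at the previous characters.
                current[j] = previous[j - 1] + 1
                if current[j] > best_len:
                    best_len, best_end = current[j], i
        previous = current
    return src[best_end - best_len:best_end]


print(longest_common_substring('cat', 'hat'))           # at
print(longest_common_substring('aluminum', 'Catalan'))  # al
```

Unlike the subsequence variant, a mismatch resets the run length to zero, so only contiguous matches count.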

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	The longest common substring
Return type:	str

Examples

>>> sstr = LCSstr()
>>> sstr.lcsstr('cat', 'hat')
'at'
>>> sstr.lcsstr('Niall', 'Neil')
'N'
>>> sstr.lcsstr('aluminum', 'Catalan')
'al'
>>> sstr.lcsstr('ATCG', 'TAGC')
'A'

sim(src, tar)[source]¶

Return the longest common substring similarity of two strings.

Longest common substring similarity (\(sim_{LCSstr}\)).

This employs the LCSstr function to derive a similarity metric: \(sim_{LCSstr}(s,t) = \frac{|LCSstr(s,t)|}{max(|s|, |t|)}\)

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	LCSstr similarity
Return type:	float

Examples

>>> sstr = LCSstr()
>>> sstr.sim('cat', 'hat')
0.6666666666666666
>>> sstr.sim('Niall', 'Neil')
0.2
>>> sstr.sim('aluminum', 'Catalan')
0.25
>>> sstr.sim('ATCG', 'TAGC')
0.25

abydos.distance.lcsstr(src, tar)[source]¶

Return the longest common substring of two strings.

This is a wrapper for LCSstr.lcsstr().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	The longest common substring
Return type:	str

Examples

>>> lcsstr('cat', 'hat')
'at'
>>> lcsstr('Niall', 'Neil')
'N'
>>> lcsstr('aluminum', 'Catalan')
'al'
>>> lcsstr('ATCG', 'TAGC')
'A'

abydos.distance.dist_lcsstr(src, tar)[source]¶

Return the longest common substring distance between two strings.

This is a wrapper for LCSstr.dist().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	LCSstr distance
Return type:	float

Examples

>>> dist_lcsstr('cat', 'hat')
0.33333333333333337
>>> dist_lcsstr('Niall', 'Neil')
0.8
>>> dist_lcsstr('aluminum', 'Catalan')
0.75
>>> dist_lcsstr('ATCG', 'TAGC')
0.75

abydos.distance.sim_lcsstr(src, tar)[source]¶

Return the longest common substring similarity of two strings.

This is a wrapper for LCSstr.sim().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	LCSstr similarity
Return type:	float

Examples

>>> sim_lcsstr('cat', 'hat')
0.6666666666666666
>>> sim_lcsstr('Niall', 'Neil')
0.2
>>> sim_lcsstr('aluminum', 'Catalan')
0.25
>>> sim_lcsstr('ATCG', 'TAGC')
0.25

class abydos.distance.RatcliffObershelp[source]¶

Bases: abydos.distance._distance._Distance

Ratcliff-Obershelp similarity.

This follows the Ratcliff-Obershelp algorithm [RM88] to derive a similarity measure:

Find the length of the longest common substring in src & tar.

Recurse on the strings to the left & right of this substring in src & tar. The base case is a 0 length common substring, in which case, return 0. Otherwise, return the sum of the length of the current longest common substring and the left & right recursed sums.

Multiply this length by 2 and divide by the sum of the lengths of src & tar.

Cf. http://www.drdobbs.com/database/pattern-matching-the-gestalt-approach/184407970
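The three steps above can be sketched as a short recursion. The helper names are hypothetical and the tie-breaking rule (first longest substring found) is an assumption. Different tie-breaking can change the result for some inputs.

```python
def _longest_match(src, tar):
    """Return (src_start, tar_start, length) of a longest common substring."""
    best = (0, 0, 0)
    previous = [0] * (len(tar) + 1)
    for i in range(1, len(src) + 1):
        current = [0] * (len(tar) + 1)
        for j in range(1, len(tar) + 1):
            if src[i - 1] == tar[j - 1]:
                current[j] = previous[j - 1] + 1
                if current[j] > best[2]:
                    best = (i - current[j], j - current[j], current[j])
        previous = current
    return best


def _matched_chars(src, tar):
    """Step 2: anchor on the longest match, then recurse left and right."""
    s_start, t_start, length = _longest_match(src, tar)
    if length == 0:
        return 0
    left = _matched_chars(src[:s_start], tar[:t_start])
    right = _matched_chars(src[s_start + length:], tar[t_start + length:])
    return left + length + right


def ratcliff_obershelp_sketch(src, tar):
    """Step 3: twice the matched characters over the combined length."""
    if src == tar:
        return 1.0
    if not src or not tar:
        return 0.0
    return 2 * _matched_chars(src, tar) / (len(src) + len(tar))


print(ratcliff_obershelp_sketch('ATCG', 'TAGC'))  # 0.5
```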

sim(src, tar)[source]¶

Return the Ratcliff-Obershelp similarity of two strings.

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	Ratcliff-Obershelp similarity
Return type:	float

Examples

>>> cmp = RatcliffObershelp()
>>> round(cmp.sim('cat', 'hat'), 12)
0.666666666667
>>> round(cmp.sim('Niall', 'Neil'), 12)
0.666666666667
>>> round(cmp.sim('aluminum', 'Catalan'), 12)
0.4
>>> cmp.sim('ATCG', 'TAGC')
0.5

abydos.distance.dist_ratcliff_obershelp(src, tar)[source]¶

Return the Ratcliff-Obershelp distance between two strings.

This is a wrapper for RatcliffObershelp.dist().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	Ratcliff-Obershelp distance
Return type:	float

Examples

>>> round(dist_ratcliff_obershelp('cat', 'hat'), 12)
0.333333333333
>>> round(dist_ratcliff_obershelp('Niall', 'Neil'), 12)
0.333333333333
>>> round(dist_ratcliff_obershelp('aluminum', 'Catalan'), 12)
0.6
>>> dist_ratcliff_obershelp('ATCG', 'TAGC')
0.5

abydos.distance.sim_ratcliff_obershelp(src, tar)[source]¶

Return the Ratcliff-Obershelp similarity of two strings.

This is a wrapper for RatcliffObershelp.sim().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	Ratcliff-Obershelp similarity
Return type:	float

Examples

>>> round(sim_ratcliff_obershelp('cat', 'hat'), 12)
0.666666666667
>>> round(sim_ratcliff_obershelp('Niall', 'Neil'), 12)
0.666666666667
>>> round(sim_ratcliff_obershelp('aluminum', 'Catalan'), 12)
0.4
>>> sim_ratcliff_obershelp('ATCG', 'TAGC')
0.5

class abydos.distance.Ident[source]¶

Bases: abydos.distance._distance._Distance

Identity distance and similarity.

sim(src, tar)[source]¶

Return the identity similarity of two strings.

Identity similarity is 1.0 if the two strings are identical, otherwise 0.0.

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	Identity similarity
Return type:	float

Examples

>>> cmp = Ident()
>>> cmp.sim('cat', 'hat')
0.0
>>> cmp.sim('cat', 'cat')
1.0

abydos.distance.dist_ident(src, tar)[source]¶

Return the identity distance between two strings.

This is a wrapper for Ident.dist().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	Identity distance
Return type:	float

Examples

>>> dist_ident('cat', 'hat')
1.0
>>> dist_ident('cat', 'cat')
0.0

abydos.distance.sim_ident(src, tar)[source]¶

Return the identity similarity of two strings.

This is a wrapper for Ident.sim().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	Identity similarity
Return type:	float

Examples

>>> sim_ident('cat', 'hat')
0.0
>>> sim_ident('cat', 'cat')
1.0

class abydos.distance.Length[source]¶

Bases: abydos.distance._distance._Distance

Length similarity and distance.

sim(src, tar)[source]¶

Return the length similarity of two strings.

Length similarity is the ratio of the length of the shorter string to the longer.
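As a minimal sketch of this ratio (the name `length_similarity` and the handling of empty strings are assumptions of the example):

```python
def length_similarity(src, tar):
    """Length of the shorter string divided by the length of the longer."""
    if src == tar:
        return 1.0
    if not src or not tar:
        return 0.0
    shorter, longer = sorted((len(src), len(tar)))
    return shorter / longer


print(length_similarity('Niall', 'Neil'))      # 0.8
print(1 - length_similarity('Niall', 'Neil'))  # 0.19999999999999996
```

The second line shows ordinary binary floating-point rounding: taking the complement of 0.8 yields 0.19999999999999996 rather than exactly 0.2.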

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	Length similarity
Return type:	float

Examples

>>> cmp = Length()
>>> cmp.sim('cat', 'hat')
1.0
>>> cmp.sim('Niall', 'Neil')
0.8
>>> cmp.sim('aluminum', 'Catalan')
0.875
>>> cmp.sim('ATCG', 'TAGC')
1.0

abydos.distance.dist_length(src, tar)[source]¶

Return the length distance between two strings.

This is a wrapper for Length.dist().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	Length distance
Return type:	float

Examples

>>> dist_length('cat', 'hat')
0.0
>>> dist_length('Niall', 'Neil')
0.19999999999999996
>>> dist_length('aluminum', 'Catalan')
0.125
>>> dist_length('ATCG', 'TAGC')
0.0

abydos.distance.sim_length(src, tar)[source]¶

Return the length similarity of two strings.

This is a wrapper for Length.sim().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	Length similarity
Return type:	float

Examples

>>> sim_length('cat', 'hat')
1.0
>>> sim_length('Niall', 'Neil')
0.8
>>> sim_length('aluminum', 'Catalan')
0.875
>>> sim_length('ATCG', 'TAGC')
1.0

class abydos.distance.Prefix[source]¶

Bases: abydos.distance._distance._Distance

Prefix similarity and distance.

sim(src, tar)[source]¶

Return the prefix similarity of two strings.

Prefix similarity is the length of the longest common prefix of the two terms divided by the length of the shorter term.
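A minimal sketch of this measure follows. The name `prefix_similarity` and the empty-string handling are assumptions made for the example.

```python
def prefix_similarity(src, tar):
    """Shared leading characters divided by the length of the shorter string."""
    if src == tar:
        return 1.0
    if not src or not tar:
        return 0.0
    shared = 0
    for s_char, t_char in zip(src, tar):
        if s_char != t_char:
            break
        shared += 1
    return shared / min(len(src), len(tar))


print(prefix_similarity('Niall', 'Neil'))   # 0.25
print(prefix_similarity('cat', 'catalog'))  # 1.0
```

Only the leading 'N' is shared by 'Niall' and 'Neil', and the shorter string has four characters, giving 1/4.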

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	Prefix similarity
Return type:	float

Examples

>>> cmp = Prefix()
>>> cmp.sim('cat', 'hat')
0.0
>>> cmp.sim('Niall', 'Neil')
0.25
>>> cmp.sim('aluminum', 'Catalan')
0.0
>>> cmp.sim('ATCG', 'TAGC')
0.0

abydos.distance.dist_prefix(src, tar)[source]¶

Return the prefix distance between two strings.

This is a wrapper for Prefix.dist().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	Prefix distance
Return type:	float

Examples

>>> dist_prefix('cat', 'hat')
1.0
>>> dist_prefix('Niall', 'Neil')
0.75
>>> dist_prefix('aluminum', 'Catalan')
1.0
>>> dist_prefix('ATCG', 'TAGC')
1.0

abydos.distance.sim_prefix(src, tar)[source]¶

Return the prefix similarity of two strings.

This is a wrapper for Prefix.sim().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	Prefix similarity
Return type:	float

Examples

>>> sim_prefix('cat', 'hat')
0.0
>>> sim_prefix('Niall', 'Neil')
0.25
>>> sim_prefix('aluminum', 'Catalan')
0.0
>>> sim_prefix('ATCG', 'TAGC')
0.0

class abydos.distance.Suffix[source]¶

Bases: abydos.distance._distance._Distance

Suffix similarity and distance.

sim(src, tar)[source]¶

Return the suffix similarity of two strings.

Suffix similarity is the length of the longest common suffix of the two terms divided by the length of the shorter term.
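The same idea, read from the end of each string, can be sketched as follows. The name `suffix_similarity` and the empty-string handling are assumptions made for the example.

```python
def suffix_similarity(src, tar):
    """Shared trailing characters divided by the length of the shorter string."""
    if src == tar:
        return 1.0
    if not src or not tar:
        return 0.0
    shared = 0
    # Walk both strings from their last character toward the front.
    for s_char, t_char in zip(reversed(src), reversed(tar)):
        if s_char != t_char:
            break
        shared += 1
    return shared / min(len(src), len(tar))


print(suffix_similarity('cat', 'hat'))     # 0.6666666666666666
print(suffix_similarity('Niall', 'Neil'))  # 0.25
```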

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	Suffix similarity
Return type:	float

Examples

>>> cmp = Suffix()
>>> cmp.sim('cat', 'hat')
0.6666666666666666
>>> cmp.sim('Niall', 'Neil')
0.25
>>> cmp.sim('aluminum', 'Catalan')
0.0
>>> cmp.sim('ATCG', 'TAGC')
0.0

abydos.distance.dist_suffix(src, tar)[source]¶

Return the suffix distance between two strings.

This is a wrapper for Suffix.dist().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	Suffix distance
Return type:	float

Examples

>>> dist_suffix('cat', 'hat')
0.33333333333333337
>>> dist_suffix('Niall', 'Neil')
0.75
>>> dist_suffix('aluminum', 'Catalan')
1.0
>>> dist_suffix('ATCG', 'TAGC')
1.0

abydos.distance.sim_suffix(src, tar)[source]¶

Return the suffix similarity of two strings.

This is a wrapper for Suffix.sim().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	Suffix similarity
Return type:	float

Examples

>>> sim_suffix('cat', 'hat')
0.6666666666666666
>>> sim_suffix('Niall', 'Neil')
0.25
>>> sim_suffix('aluminum', 'Catalan')
0.0
>>> sim_suffix('ATCG', 'TAGC')
0.0

class abydos.distance.NCDzlib(level=-1)[source]¶

Bases: abydos.distance._distance._Distance

Normalized Compression Distance using zlib compression.

Cf. https://zlib.net/

Normalized compression distance (NCD) [CV05].
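NCD compares the compressed size of the concatenation with the compressed sizes of the parts: \(NCD(x, y) = \frac{C(xy) - min(C(x), C(y))}{max(C(x), C(y))}\), where \(C\) is the compressed length. The sketch below applies that formula with the standard-library `zlib.compress`. The function name `ncd_sketch` and the sample strings are hypothetical, and because compressor settings and framing may differ, it should not be expected to reproduce the values in the examples here.

```python
import bz2
import zlib


def ncd_sketch(src, tar, compress=zlib.compress):
    """Normalized compression distance for any bytes -> bytes compressor."""
    x, y = src.encode('utf-8'), tar.encode('utf-8')
    c_x, c_y = len(compress(x)), len(compress(y))
    c_xy = len(compress(x + y))
    return (c_xy - min(c_x, c_y)) / max(c_x, c_y)


text = 'the quick brown fox jumps over the lazy dog ' * 20
other = 'ZYXWVUTSRQPONMLKJIHGFEDCBA0123456789' * 20

same = ncd_sketch(text, text)      # small: the second copy adds little
different = ncd_sketch(text, other)  # larger: little shared structure
with_bz2 = ncd_sketch(text, other, compress=bz2.compress)
```

On very short inputs the fixed overhead of the compressor dominates the compressed size, so values for strings of only a few characters mostly reflect the compressor rather than the strings.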

dist(src, tar)[source]¶

Return the NCD between two strings using zlib compression.

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	Compression distance
Return type:	float

Examples

>>> cmp = NCDzlib()
>>> cmp.dist('cat', 'hat')
0.3333333333333333
>>> cmp.dist('Niall', 'Neil')
0.45454545454545453
>>> cmp.dist('aluminum', 'Catalan')
0.5714285714285714
>>> cmp.dist('ATCG', 'TAGC')
0.4

abydos.distance.dist_ncd_zlib(src, tar)[source]¶

Return the NCD between two strings using zlib compression.

This is a wrapper for NCDzlib.dist().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	Compression distance
Return type:	float

Examples

>>> dist_ncd_zlib('cat', 'hat')
0.3333333333333333
>>> dist_ncd_zlib('Niall', 'Neil')
0.45454545454545453
>>> dist_ncd_zlib('aluminum', 'Catalan')
0.5714285714285714
>>> dist_ncd_zlib('ATCG', 'TAGC')
0.4

abydos.distance.sim_ncd_zlib(src, tar)[source]¶

Return the NCD similarity between two strings using zlib compression.

This is a wrapper for NCDzlib.sim().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	Compression similarity
Return type:	float

Examples

>>> sim_ncd_zlib('cat', 'hat')
0.6666666666666667
>>> sim_ncd_zlib('Niall', 'Neil')
0.5454545454545454
>>> sim_ncd_zlib('aluminum', 'Catalan')
0.4285714285714286
>>> sim_ncd_zlib('ATCG', 'TAGC')
0.6

class abydos.distance.NCDbz2(level=9)[source]¶

Bases: abydos.distance._distance._Distance

Normalized Compression Distance using bzip2 compression.

Cf. https://en.wikipedia.org/wiki/Bzip2

Normalized compression distance (NCD) [CV05].

dist(src, tar)[source]¶

Return the NCD between two strings using bzip2 compression.

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	Compression distance
Return type:	float

Examples

>>> cmp = NCDbz2()
>>> cmp.dist('cat', 'hat')
0.06666666666666667
>>> cmp.dist('Niall', 'Neil')
0.03125
>>> cmp.dist('aluminum', 'Catalan')
0.17647058823529413
>>> cmp.dist('ATCG', 'TAGC')
0.03125

abydos.distance.dist_ncd_bz2(src, tar)[source]¶

Return the NCD between two strings using bzip2 compression.

This is a wrapper for NCDbz2.dist().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	Compression distance
Return type:	float

Examples

>>> dist_ncd_bz2('cat', 'hat')
0.06666666666666667
>>> dist_ncd_bz2('Niall', 'Neil')
0.03125
>>> dist_ncd_bz2('aluminum', 'Catalan')
0.17647058823529413
>>> dist_ncd_bz2('ATCG', 'TAGC')
0.03125

abydos.distance.sim_ncd_bz2(src, tar)[source]¶

Return the NCD similarity between two strings using bzip2 compression.

This is a wrapper for NCDbz2.sim().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	Compression similarity
Return type:	float

Examples

>>> sim_ncd_bz2('cat', 'hat')
0.9333333333333333
>>> sim_ncd_bz2('Niall', 'Neil')
0.96875
>>> sim_ncd_bz2('aluminum', 'Catalan')
0.8235294117647058
>>> sim_ncd_bz2('ATCG', 'TAGC')
0.96875

class abydos.distance.NCDlzma[source]¶

Bases: abydos.distance._distance._Distance

Normalized Compression Distance using LZMA compression.

Cf. https://en.wikipedia.org/wiki/Lempel-Ziv-Markov_chain_algorithm

Normalized compression distance (NCD) [CV05].

dist(src, tar)[source]¶

Return the NCD between two strings using LZMA compression.

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	Compression distance
Return type:	float
Raises:	`ValueError` -- Install the PylibLZMA module in order to use LZMA

Examples

>>> cmp = NCDlzma()
>>> cmp.dist('cat', 'hat')
0.08695652173913043
>>> cmp.dist('Niall', 'Neil')
0.16
>>> cmp.dist('aluminum', 'Catalan')
0.16
>>> cmp.dist('ATCG', 'TAGC')
0.08695652173913043

abydos.distance.dist_ncd_lzma(src, tar)[source]¶

Return the NCD between two strings using LZMA compression.

This is a wrapper for NCDlzma.dist().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	Compression distance
Return type:	float

Examples

>>> dist_ncd_lzma('cat', 'hat')
0.08695652173913043
>>> dist_ncd_lzma('Niall', 'Neil')
0.16
>>> dist_ncd_lzma('aluminum', 'Catalan')
0.16
>>> dist_ncd_lzma('ATCG', 'TAGC')
0.08695652173913043

abydos.distance.sim_ncd_lzma(src, tar)[source]¶

Return the NCD similarity between two strings using LZMA compression.

This is a wrapper for NCDlzma.sim().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	Compression similarity
Return type:	float

Examples

>>> sim_ncd_lzma('cat', 'hat')
0.9130434782608696
>>> sim_ncd_lzma('Niall', 'Neil')
0.84
>>> sim_ncd_lzma('aluminum', 'Catalan')
0.84
>>> sim_ncd_lzma('ATCG', 'TAGC')
0.9130434782608696

class abydos.distance.NCDarith[source]¶

Bases: abydos.distance._distance._Distance

Normalized Compression Distance using arithmetic coding.

Cf. https://en.wikipedia.org/wiki/Arithmetic_coding

Normalized compression distance (NCD) [CV05].

dist(src, tar, probs=None)[source]¶

Return the NCD between two strings using arithmetic coding.

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison probs (dict) -- A dictionary trained with `Arithmetic.train()`
Returns:	Compression distance
Return type:	float

Examples

>>> cmp = NCDarith()
>>> cmp.dist('cat', 'hat')
0.5454545454545454
>>> cmp.dist('Niall', 'Neil')
0.6875
>>> cmp.dist('aluminum', 'Catalan')
0.8275862068965517
>>> cmp.dist('ATCG', 'TAGC')
0.6923076923076923

abydos.distance.dist_ncd_arith(src, tar, probs=None)[source]¶

Return the NCD between two strings using arithmetic coding.

This is a wrapper for NCDarith.dist().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison probs (dict) -- A dictionary trained with `Arithmetic.train()`
Returns:	Compression distance
Return type:	float

Examples

>>> dist_ncd_arith('cat', 'hat')
0.5454545454545454
>>> dist_ncd_arith('Niall', 'Neil')
0.6875
>>> dist_ncd_arith('aluminum', 'Catalan')
0.8275862068965517
>>> dist_ncd_arith('ATCG', 'TAGC')
0.6923076923076923

abydos.distance.sim_ncd_arith(src, tar, probs=None)[source]¶

Return the NCD similarity between two strings using arithmetic coding.

This is a wrapper for NCDarith.sim().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison probs (dict) -- A dictionary trained with `Arithmetic.train()`
Returns:	Compression similarity
Return type:	float

Examples

>>> sim_ncd_arith('cat', 'hat')
0.4545454545454546
>>> sim_ncd_arith('Niall', 'Neil')
0.3125
>>> sim_ncd_arith('aluminum', 'Catalan')
0.1724137931034483
>>> sim_ncd_arith('ATCG', 'TAGC')
0.3076923076923077

class abydos.distance.NCDbwtrle[source]¶

Bases: abydos.distance._ncd_rle.NCDrle

Normalized Compression Distance using BWT plus RLE.

Cf. https://en.wikipedia.org/wiki/Burrows-Wheeler_transform

Normalized compression distance (NCD) [CV05].
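The Burrows-Wheeler transform reorders a string so that equal characters tend to sit next to each other, which gives run-length encoding more to work with. A naive sketch is shown here. The name `bwt_sketch` and the choice of terminator are assumptions of this example.

```python
def bwt_sketch(text, terminator='\0'):
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    if terminator in text:
        raise ValueError('terminator must not occur in the input')
    text += terminator
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return ''.join(rotation[-1] for rotation in rotations)


print(bwt_sketch('banana', terminator='$'))  # annb$aa
```

In the output the two n's and two of the a's have become adjacent, whereas 'banana' itself contains no repeated neighbors.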

dist(src, tar)[source]¶

Return the NCD between two strings using BWT plus RLE.

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	Compression distance
Return type:	float

Examples

>>> cmp = NCDbwtrle()
>>> cmp.dist('cat', 'hat')
0.75
>>> cmp.dist('Niall', 'Neil')
0.8333333333333334
>>> cmp.dist('aluminum', 'Catalan')
1.0
>>> cmp.dist('ATCG', 'TAGC')
0.8

abydos.distance.dist_ncd_bwtrle(src, tar)[source]¶

Return the NCD between two strings using BWT plus RLE.

This is a wrapper for NCDbwtrle.dist().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	Compression distance
Return type:	float

Examples

>>> dist_ncd_bwtrle('cat', 'hat')
0.75
>>> dist_ncd_bwtrle('Niall', 'Neil')
0.8333333333333334
>>> dist_ncd_bwtrle('aluminum', 'Catalan')
1.0
>>> dist_ncd_bwtrle('ATCG', 'TAGC')
0.8

abydos.distance.sim_ncd_bwtrle(src, tar)[source]¶

Return the NCD similarity between two strings using BWT plus RLE.

This is a wrapper for NCDbwtrle.sim().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	Compression similarity
Return type:	float

Examples

>>> sim_ncd_bwtrle('cat', 'hat')
0.25
>>> sim_ncd_bwtrle('Niall', 'Neil')
0.16666666666666663
>>> sim_ncd_bwtrle('aluminum', 'Catalan')
0.0
>>> sim_ncd_bwtrle('ATCG', 'TAGC')
0.19999999999999996

class abydos.distance.NCDrle[source]¶

Bases: abydos.distance._distance._Distance

Normalized Compression Distance using RLE.

Cf. https://en.wikipedia.org/wiki/Run-length_encoding

Normalized compression distance (NCD) [CV05].
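A small sketch helps explain why short strings without repeated characters yield a distance of 1.0 under RLE. The encoder below is an assumed variant (runs of three or more become a count followed by the character, and digits in the input are not escaped), and the names `rle_encode` and `ncd_rle_sketch` are hypothetical.

```python
from itertools import groupby


def rle_encode(text):
    """Encode runs of 3+ identical characters as '<count><char>'."""
    parts = []
    for char, group in groupby(text):
        run = len(list(group))
        parts.append(f'{run}{char}' if run > 2 else char * run)
    return ''.join(parts)


def ncd_rle_sketch(src, tar):
    """NCD where the 'compressed size' is the length of the RLE encoding."""
    if src == tar:
        return 0.0
    c_src, c_tar = len(rle_encode(src)), len(rle_encode(tar))
    c_concat = min(len(rle_encode(src + tar)), len(rle_encode(tar + src)))
    return (c_concat - min(c_src, c_tar)) / max(c_src, c_tar)


print(rle_encode('aaabcc'))           # 3abcc
print(ncd_rle_sketch('cat', 'hat'))   # 1.0
```

With no runs to compress, the encoded concatenation is as long as both inputs together, so the numerator equals the length of the longer input and the ratio is 1.0.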

dist(src, tar)[source]¶

Return the NCD between two strings using RLE.

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	Compression distance
Return type:	float

Examples

>>> cmp = NCDrle()
>>> cmp.dist('cat', 'hat')
1.0
>>> cmp.dist('Niall', 'Neil')
1.0
>>> cmp.dist('aluminum', 'Catalan')
1.0
>>> cmp.dist('ATCG', 'TAGC')
1.0

abydos.distance.dist_ncd_rle(src, tar)[source]¶

Return the NCD between two strings using RLE.

This is a wrapper for NCDrle.dist().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	Compression distance
Return type:	float

Examples

>>> dist_ncd_rle('cat', 'hat')
1.0
>>> dist_ncd_rle('Niall', 'Neil')
1.0
>>> dist_ncd_rle('aluminum', 'Catalan')
1.0
>>> dist_ncd_rle('ATCG', 'TAGC')
1.0

abydos.distance.sim_ncd_rle(src, tar)[source]¶

Return the NCD similarity between two strings using RLE.

This is a wrapper for NCDrle.sim().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	Compression similarity
Return type:	float

Examples

>>> sim_ncd_rle('cat', 'hat')
0.0
>>> sim_ncd_rle('Niall', 'Neil')
0.0
>>> sim_ncd_rle('aluminum', 'Catalan')
0.0
>>> sim_ncd_rle('ATCG', 'TAGC')
0.0

class abydos.distance.MRA[source]¶

Bases: abydos.distance._distance._Distance

Match Rating Algorithm comparison rating.

The Western Airlines Surname Match Rating Algorithm comparison rating, as presented on page 18 of [MKTM77].

dist_abs(src, tar)[source]¶

Return the MRA comparison rating of two strings.

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	MRA comparison rating
Return type:	int

Examples

>>> cmp = MRA()
>>> cmp.dist_abs('cat', 'hat')
5
>>> cmp.dist_abs('Niall', 'Neil')
6
>>> cmp.dist_abs('aluminum', 'Catalan')
0
>>> cmp.dist_abs('ATCG', 'TAGC')
5

sim(src, tar)[source]¶

Return the normalized MRA similarity of two strings.

This is the MRA normalized to \([0, 1]\), given that MRA itself is constrained to the range \([0, 6]\).
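Since the rating lies in \([0, 6]\), the normalization amounts to a division by 6. A minimal sketch, with the hypothetical names `MAX_MRA_RATING` and `normalize_mra`:

```python
MAX_MRA_RATING = 6  # upper bound of the comparison rating stated above


def normalize_mra(rating):
    """Scale an MRA comparison rating from [0, 6] to [0, 1]."""
    return rating / MAX_MRA_RATING


similarity = normalize_mra(5)  # a rating of 5 maps to roughly 0.8333
distance = 1 - similarity      # the complementary distance, roughly 0.1667
```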

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	Normalized MRA similarity
Return type:	float

Examples

>>> cmp = MRA()
>>> cmp.sim('cat', 'hat')
0.8333333333333334
>>> cmp.sim('Niall', 'Neil')
1.0
>>> cmp.sim('aluminum', 'Catalan')
0.0
>>> cmp.sim('ATCG', 'TAGC')
0.8333333333333334

abydos.distance.mra_compare(src, tar)[source]¶

Return the MRA comparison rating of two strings.

This is a wrapper for MRA.dist_abs().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	MRA comparison rating
Return type:	int

Examples

>>> mra_compare('cat', 'hat')
5
>>> mra_compare('Niall', 'Neil')
6
>>> mra_compare('aluminum', 'Catalan')
0
>>> mra_compare('ATCG', 'TAGC')
5

abydos.distance.dist_mra(src, tar)[source]¶

Return the normalized MRA distance between two strings.

This is a wrapper for MRA.dist().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	Normalized MRA distance
Return type:	float

Examples

>>> dist_mra('cat', 'hat')
0.16666666666666663
>>> dist_mra('Niall', 'Neil')
0.0
>>> dist_mra('aluminum', 'Catalan')
1.0
>>> dist_mra('ATCG', 'TAGC')
0.16666666666666663

abydos.distance.sim_mra(src, tar)[source]¶

Return the normalized MRA similarity of two strings.

This is a wrapper for MRA.sim().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison
Returns:	Normalized MRA similarity
Return type:	float

Examples

>>> sim_mra('cat', 'hat')
0.8333333333333334
>>> sim_mra('Niall', 'Neil')
1.0
>>> sim_mra('aluminum', 'Catalan')
0.0
>>> sim_mra('ATCG', 'TAGC')
0.8333333333333334

class abydos.distance.Editex[source]¶

Bases: abydos.distance._distance._Distance

Editex.

As described on pages 3 & 4 of [ZD96].

The local variant is based on [RU09].

dist(src, tar, cost=(0, 1, 2), local=False)[source]¶

Return the normalized Editex distance between two strings.

The Editex distance is normalized by dividing the absolute Editex distance (in either the global or the local variant) by the greater of two quantities: the number of characters in src times the cost of a delete, and the number of characters in tar times the cost of an insert. When all operations have \(cost = 1\), this is the length of the longer of the two strings src & tar.

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison cost (tuple) -- A 3-tuple representing the cost of the three possible edits: match, same-group, and mismatch respectively (by default: (0, 1, 2)) local (bool) -- If True, the local variant of Editex is used
Returns:	Normalized Editex distance
Return type:	float

Examples

>>> cmp = Editex()
>>> round(cmp.dist('cat', 'hat'), 12)
0.333333333333
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.2
>>> cmp.dist('aluminum', 'Catalan')
0.75
>>> cmp.dist('ATCG', 'TAGC')
0.75
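
The normalization can be sketched as follows. This is not the library source: `normalize_editex` is a hypothetical helper, and it assumes that inserts and deletes are both charged the mismatch cost (the third element of `cost`), an assumption inferred from the example values rather than stated in the text.

```python
def normalize_editex(dist_abs, src, tar, cost=(0, 1, 2)):
    """Scale an absolute Editex distance into [0, 1].

    Assumption: inserts and deletes both cost cost[2] (the mismatch cost).
    """
    mismatch_cost = cost[2]
    denominator = max(len(src) * mismatch_cost, len(tar) * mismatch_cost)
    if denominator == 0:
        # Two empty strings (or a zero mismatch cost): nothing to normalize.
        return 0.0
    return dist_abs / denominator


# Absolute distances taken from the dist_abs() examples.
print(normalize_editex(2, 'cat', 'hat'))
print(normalize_editex(2, 'Niall', 'Neil'))
print(normalize_editex(12, 'aluminum', 'Catalan'))
print(normalize_editex(6, 'ATCG', 'TAGC'))
```

Under this assumption the four pairs give 2/6, 2/10, 12/16 and 6/8, matching the normalized distances listed above.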

dist_abs(src, tar, cost=(0, 1, 2), local=False)[source]¶

Return the Editex distance between two strings.

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison cost (tuple) -- A 3-tuple representing the cost of the three possible edits: match, same-group, and mismatch respectively (by default: (0, 1, 2)) local (bool) -- If True, the local variant of Editex is used
Returns:	Editex distance
Return type:	int

Examples

>>> cmp = Editex()
>>> cmp.dist_abs('cat', 'hat')
2
>>> cmp.dist_abs('Niall', 'Neil')
2
>>> cmp.dist_abs('aluminum', 'Catalan')
12
>>> cmp.dist_abs('ATCG', 'TAGC')
6

abydos.distance.editex(src, tar, cost=(0, 1, 2), local=False)[source]¶

Return the Editex distance between two strings.

This is a wrapper for Editex.dist_abs().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison cost (tuple) -- A 3-tuple representing the cost of the three possible edits: match, same-group, and mismatch respectively (by default: (0, 1, 2)) local (bool) -- If True, the local variant of Editex is used
Returns:	Editex distance
Return type:	int

Examples

>>> editex('cat', 'hat')
2
>>> editex('Niall', 'Neil')
2
>>> editex('aluminum', 'Catalan')
12
>>> editex('ATCG', 'TAGC')
6

abydos.distance.dist_editex(src, tar, cost=(0, 1, 2), local=False)[source]¶

Return the normalized Editex distance between two strings.

This is a wrapper for Editex.dist().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison cost (tuple) -- A 3-tuple representing the cost of the three possible edits: match, same-group, and mismatch respectively (by default: (0, 1, 2)) local (bool) -- If True, the local variant of Editex is used
Returns:	Normalized Editex distance
Return type:	float

Examples

>>> round(dist_editex('cat', 'hat'), 12)
0.333333333333
>>> round(dist_editex('Niall', 'Neil'), 12)
0.2
>>> dist_editex('aluminum', 'Catalan')
0.75
>>> dist_editex('ATCG', 'TAGC')
0.75

abydos.distance.sim_editex(src, tar, cost=(0, 1, 2), local=False)[source]¶

Return the normalized Editex similarity of two strings.

This is a wrapper for Editex.sim().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison cost (tuple) -- A 3-tuple representing the cost of the three possible edits: match, same-group, and mismatch respectively (by default: (0, 1, 2)) local (bool) -- If True, the local variant of Editex is used
Returns:	Normalized Editex similarity
Return type:	float

Examples

>>> round(sim_editex('cat', 'hat'), 12)
0.666666666667
>>> round(sim_editex('Niall', 'Neil'), 12)
0.8
>>> sim_editex('aluminum', 'Catalan')
0.25
>>> sim_editex('ATCG', 'TAGC')
0.25

class abydos.distance.MLIPNS[source]¶

Bases: abydos.distance._distance._Distance

MLIPNS similarity.

Modified Language-Independent Product Name Search (MLIPNS) is described in [SA10]. This function returns only 1.0 (similar) or 0.0 (not similar). LIPNS similarity is identical to normalized Hamming similarity.

hamming = <abydos.distance._hamming.Hamming object>¶

sim(src, tar, threshold=0.25, max_mismatches=2)[source]¶

Return the MLIPNS similarity of two strings.

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison threshold (float) -- A number [0, 1] indicating the maximum similarity score, below which the strings are considered 'similar' (0.25 by default) max_mismatches (int) -- A number indicating the allowable number of mismatches to remove before declaring two strings not similar (2 by default)
Returns:	MLIPNS similarity
Return type:	float

Examples

>>> sim_mlipns('cat', 'hat')
1.0
>>> sim_mlipns('Niall', 'Neil')
0.0
>>> sim_mlipns('aluminum', 'Catalan')
0.0
>>> sim_mlipns('ATCG', 'TAGC')
0.0
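
How threshold and max_mismatches interact can be sketched as follows. This is an illustrative reading of the parameter descriptions, not the abydos source: `mlipns_sketch` and `hamming_mismatches` are hypothetical helpers, and the loop that discards one mismatch at a time is an assumption that happens to reproduce the four results above.

```python
def hamming_mismatches(src, tar):
    """Count differing positions; any length difference counts as mismatches."""
    differing = sum(a != b for a, b in zip(src, tar))
    return differing + abs(len(src) - len(tar))


def mlipns_sketch(src, tar, threshold=0.25, max_mismatches=2):
    """Return 1.0 if the strings are judged similar, otherwise 0.0."""
    if src == tar:
        return 1.0
    if not src or not tar:
        return 0.0

    mismatches = hamming_mismatches(src, tar)
    length = max(len(src), len(tar))
    removed = 0
    while removed <= max_mismatches:
        # Similar once the remaining mismatch ratio is within the threshold.
        if length < 1 or mismatches / length <= threshold:
            return 1.0
        # Otherwise discard one mismatched position and try again.
        removed += 1
        mismatches -= 1
        length -= 1
    return 0.0


for pair in (('cat', 'hat'), ('Niall', 'Neil'),
             ('aluminum', 'Catalan'), ('ATCG', 'TAGC')):
    print(pair, mlipns_sketch(*pair))
```

For 'cat' and 'hat' the initial ratio 1/3 exceeds the threshold, but after one mismatch is discarded the ratio is 0, so the pair is reported as similar.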

abydos.distance.dist_mlipns(src, tar, threshold=0.25, max_mismatches=2)[source]¶

Return the MLIPNS distance between two strings.

This is a wrapper for MLIPNS.dist().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison threshold (float) -- A number [0, 1] indicating the maximum similarity score, below which the strings are considered 'similar' (0.25 by default) max_mismatches (int) -- A number indicating the allowable number of mismatches to remove before declaring two strings not similar (2 by default)
Returns:	MLIPNS distance
Return type:	float

Examples

>>> dist_mlipns('cat', 'hat')
0.0
>>> dist_mlipns('Niall', 'Neil')
1.0
>>> dist_mlipns('aluminum', 'Catalan')
1.0
>>> dist_mlipns('ATCG', 'TAGC')
1.0

abydos.distance.sim_mlipns(src, tar, threshold=0.25, max_mismatches=2)[source]¶

Return the MLIPNS similarity of two strings.

This is a wrapper for MLIPNS.sim().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison threshold (float) -- A number [0, 1] indicating the maximum similarity score, below which the strings are considered 'similar' (0.25 by default) max_mismatches (int) -- A number indicating the allowable number of mismatches to remove before declaring two strings not similar (2 by default)
Returns:	MLIPNS similarity
Return type:	float

Examples

>>> sim_mlipns('cat', 'hat')
1.0
>>> sim_mlipns('Niall', 'Neil')
0.0
>>> sim_mlipns('aluminum', 'Catalan')
0.0
>>> sim_mlipns('ATCG', 'TAGC')
0.0

class abydos.distance.Baystat[source]¶

Bases: abydos.distance._distance._Distance

Baystat similarity and distance.

Good results for shorter words are reported when setting min_ss_len to 1 and either left_ext OR right_ext to 1.

The Baystat similarity is defined in [FurnrohrRvR02].

This is ostensibly a port of the R module PPRL's implementation: https://github.com/cran/PPRL/blob/master/src/MTB_Baystat.cpp [Ruk18]. As such, this could be made more pythonic.

sim(src, tar, min_ss_len=None, left_ext=None, right_ext=None)[source]¶

Return the Baystat similarity.

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison min_ss_len (int) -- Minimum substring length to be considered left_ext (int) -- Left-side extension length right_ext (int) -- Right-side extension length
Returns:	The Baystat similarity
Return type:	float

Examples

>>> cmp = Baystat()
>>> round(cmp.sim('cat', 'hat'), 12)
0.666666666667
>>> cmp.sim('Niall', 'Neil')
0.4
>>> round(cmp.sim('Colin', 'Cuilen'), 12)
0.166666666667
>>> cmp.sim('ATCG', 'TAGC')
0.0

abydos.distance.dist_baystat(src, tar, min_ss_len=None, left_ext=None, right_ext=None)[source]¶

Return the Baystat distance.

This is a wrapper for Baystat.dist().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison min_ss_len (int) -- Minimum substring length to be considered left_ext (int) -- Left-side extension length right_ext (int) -- Right-side extension length
Returns:	The Baystat distance
Return type:	float

Examples

>>> round(dist_baystat('cat', 'hat'), 12)
0.333333333333
>>> dist_baystat('Niall', 'Neil')
0.6
>>> round(dist_baystat('Colin', 'Cuilen'), 12)
0.833333333333
>>> dist_baystat('ATCG', 'TAGC')
1.0

abydos.distance.sim_baystat(src, tar, min_ss_len=None, left_ext=None, right_ext=None)[source]¶

Return the Baystat similarity.

This is a wrapper for Baystat.sim().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison min_ss_len (int) -- Minimum substring length to be considered left_ext (int) -- Left-side extension length right_ext (int) -- Right-side extension length
Returns:	The Baystat similarity
Return type:	float

Examples

>>> round(sim_baystat('cat', 'hat'), 12)
0.666666666667
>>> sim_baystat('Niall', 'Neil')
0.4
>>> round(sim_baystat('Colin', 'Cuilen'), 12)
0.166666666667
>>> sim_baystat('ATCG', 'TAGC')
0.0

class abydos.distance.Eudex[source]¶

Bases: abydos.distance._distance._Distance

Distance between the Eudex hashes of two terms.

Cf. [Tic].

dist(src, tar, weights='exponential', max_length=8)[source]¶

Return normalized distance between the Eudex hashes of two terms.

This is Eudex distance normalized to [0, 1].

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison weights (str, iterable, or generator function) -- The weights or weights generator function max_length (int) -- The number of characters to encode as a eudex hash
Returns:	The normalized Eudex Hamming distance
Return type:	float

Examples

>>> cmp = Eudex()
>>> round(cmp.dist('cat', 'hat'), 12)
0.062745098039
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.000980392157
>>> round(cmp.dist('Colin', 'Cuilen'), 12)
0.004901960784
>>> round(cmp.dist('ATCG', 'TAGC'), 12)
0.197549019608
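
The example values are consistent with dividing the absolute distance by the largest weighted distance possible, that is, 8 differing bits in every weighted byte. The sketch below makes that assumption explicit; `normalize_eudex` and `exponential_weights` are hypothetical helpers, not abydos functions.

```python
def exponential_weights(count, base=2):
    """Return the first `count` powers of `base`, starting at base**0."""
    return [base ** exponent for exponent in range(count)]


def normalize_eudex(dist_abs, weights):
    """Divide by the maximum weighted distance (assumed: 8 bits per byte)."""
    max_distance = 8 * sum(weights)
    return dist_abs / max_distance


weights = exponential_weights(8)  # one weight per encoded character
# Absolute distances taken from the dist_abs() examples.
for dist_abs in (128, 2, 10, 403):
    print(dist_abs, round(normalize_eudex(dist_abs, weights), 12))
```

With eight exponential weights the denominator is 8 * 255 = 2040, so 128 becomes 128/2040, which rounds to the 0.062745098039 shown for 'cat' and 'hat'.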

dist_abs(src, tar, weights='exponential', max_length=8, normalized=False)[source]¶

Calculate the distance between the Eudex hashes of two terms.

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison weights (str, iterable, or generator function) -- The weights or weights generator function. If set to `None`, a simple Hamming distance is calculated. If set to `exponential`, weight decays by powers of 2, as proposed in the eudex specification: https://github.com/ticki/eudex. If set to `fibonacci`, weight decays through the Fibonacci series, as in the eudex reference implementation. If set to a callable function, this assumes it creates a generator and the generator is used to populate a series of weights. If set to an iterable, the iterable's values should be integers and will be used as the weights. max_length (int) -- The number of characters to encode as a eudex hash normalized (bool) -- Normalizes to [0, 1] if True
Returns:	The Eudex Hamming distance
Return type:	int

Examples

>>> cmp = Eudex()
>>> cmp.dist_abs('cat', 'hat')
128
>>> cmp.dist_abs('Niall', 'Neil')
2
>>> cmp.dist_abs('Colin', 'Cuilen')
10
>>> cmp.dist_abs('ATCG', 'TAGC')
403

>>> cmp.dist_abs('cat', 'hat', weights='fibonacci')
34
>>> cmp.dist_abs('Niall', 'Neil', weights='fibonacci')
2
>>> cmp.dist_abs('Colin', 'Cuilen', weights='fibonacci')
7
>>> cmp.dist_abs('ATCG', 'TAGC', weights='fibonacci')
117

>>> cmp.dist_abs('cat', 'hat', weights=None)
1
>>> cmp.dist_abs('Niall', 'Neil', weights=None)
1
>>> cmp.dist_abs('Colin', 'Cuilen', weights=None)
2
>>> cmp.dist_abs('ATCG', 'TAGC', weights=None)
9

>>> # Using the OEIS A000142:
>>> cmp.dist_abs('cat', 'hat', [1, 1, 2, 6, 24, 120, 720, 5040])
1
>>> cmp.dist_abs('Niall', 'Neil', [1, 1, 2, 6, 24, 120, 720, 5040])
720
>>> cmp.dist_abs('Colin', 'Cuilen',
... [1, 1, 2, 6, 24, 120, 720, 5040])
744
>>> cmp.dist_abs('ATCG', 'TAGC', [1, 1, 2, 6, 24, 120, 720, 5040])
6243
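
A weighted Hamming distance over two integer hashes can be sketched as below. This is a generic illustration, not the abydos implementation: `weighted_hamming` is a hypothetical function, the hash values are made up, and applying `weights[0]` to the least significant byte is an assumption about byte order.

```python
def weighted_hamming(hash_a, hash_b, weights):
    """Sum the per-byte bit differences of two hashes, scaled by weights.

    weights[0] is applied to the least significant byte (an assumption).
    """
    xored = hash_a ^ hash_b
    total = 0
    for weight in weights:
        # Count the differing bits in the current lowest byte.
        total += bin(xored & 0xFF).count('1') * weight
        xored >>= 8
    return total


weights = [1, 2, 4, 8, 16, 32, 64, 128]
# Made-up hashes: one differing bit in the most significant of 8 bytes.
print(weighted_hamming(1 << 56, 0, weights))  # 128
# Made-up hashes: two differing bits in the least significant byte.
print(weighted_hamming(0b11, 0, weights))     # 2
```

Passing a list of ones for `weights` reduces this to a plain bit-count Hamming distance over the covered bytes.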

static gen_exponential(base=2)[source]¶

Yield the next value in an exponential series of the base.

Starts at base**0.

Parameters:	base (int) -- The base to exponentiate
Yields:	int -- The next power of base

static gen_fibonacci()[source]¶

Yield the next Fibonacci number.

Starts at Fibonacci number 3 (the second 1). Based on https://www.python-course.eu/generators.php

Yields:	int -- The next Fibonacci number
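
The two weight generators can be sketched as plain Python generators. These are illustrative re-implementations written from the descriptions above, not the library source, and the local variable names are arbitrary.

```python
from itertools import islice


def gen_exponential(base=2):
    """Yield base**0, base**1, base**2, ... without end."""
    exponent = 0
    while True:
        yield base ** exponent
        exponent += 1


def gen_fibonacci():
    """Yield Fibonacci numbers starting from the second 1: 1, 2, 3, 5, ..."""
    current, following = 1, 2
    while True:
        yield current
        current, following = following, current + following


# Take one weight per encoded character (max_length=8 by default).
print(list(islice(gen_exponential(), 8)))  # [1, 2, 4, 8, 16, 32, 64, 128]
print(list(islice(gen_fibonacci(), 8)))    # [1, 2, 3, 5, 8, 13, 21, 34]
```

The eighth values, 128 and 34, equal the 'cat'/'hat' results reported for the exponential and Fibonacci weightings, which is consistent with a first-character difference receiving the largest weight.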

abydos.distance.eudex_hamming(src, tar, weights='exponential', max_length=8, normalized=False)[source]¶

Calculate the Hamming distance between the Eudex hashes of two terms.

This is a wrapper for Eudex.dist_abs().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison weights (str, iterable, or generator function) -- The weights or weights generator function max_length (int) -- The number of characters to encode as a eudex hash normalized (bool) -- Normalizes to [0, 1] if True
Returns:	The Eudex Hamming distance
Return type:	int

Examples

>>> eudex_hamming('cat', 'hat')
128
>>> eudex_hamming('Niall', 'Neil')
2
>>> eudex_hamming('Colin', 'Cuilen')
10
>>> eudex_hamming('ATCG', 'TAGC')
403

>>> eudex_hamming('cat', 'hat', weights='fibonacci')
34
>>> eudex_hamming('Niall', 'Neil', weights='fibonacci')
2
>>> eudex_hamming('Colin', 'Cuilen', weights='fibonacci')
7
>>> eudex_hamming('ATCG', 'TAGC', weights='fibonacci')
117

>>> eudex_hamming('cat', 'hat', weights=None)
1
>>> eudex_hamming('Niall', 'Neil', weights=None)
1
>>> eudex_hamming('Colin', 'Cuilen', weights=None)
2
>>> eudex_hamming('ATCG', 'TAGC', weights=None)
9

>>> # Using the OEIS A000142:
>>> eudex_hamming('cat', 'hat', [1, 1, 2, 6, 24, 120, 720, 5040])
1
>>> eudex_hamming('Niall', 'Neil', [1, 1, 2, 6, 24, 120, 720, 5040])
720
>>> eudex_hamming('Colin', 'Cuilen', [1, 1, 2, 6, 24, 120, 720, 5040])
744
>>> eudex_hamming('ATCG', 'TAGC', [1, 1, 2, 6, 24, 120, 720, 5040])
6243

abydos.distance.dist_eudex(src, tar, weights='exponential', max_length=8)[source]¶

Return normalized Hamming distance between Eudex hashes of two terms.

This is a wrapper for Eudex.dist().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison weights (str, iterable, or generator function) -- The weights or weights generator function max_length (int) -- The number of characters to encode as a eudex hash
Returns:	The normalized Eudex Hamming distance
Return type:	float

Examples

>>> round(dist_eudex('cat', 'hat'), 12)
0.062745098039
>>> round(dist_eudex('Niall', 'Neil'), 12)
0.000980392157
>>> round(dist_eudex('Colin', 'Cuilen'), 12)
0.004901960784
>>> round(dist_eudex('ATCG', 'TAGC'), 12)
0.197549019608

abydos.distance.sim_eudex(src, tar, weights='exponential', max_length=8)[source]¶

Return normalized Hamming similarity between Eudex hashes of two terms.

This is a wrapper for Eudex.sim().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison weights (str, iterable, or generator function) -- The weights or weights generator function max_length (int) -- The number of characters to encode as a eudex hash
Returns:	The normalized Eudex Hamming similarity
Return type:	float

Examples

>>> round(sim_eudex('cat', 'hat'), 12)
0.937254901961
>>> round(sim_eudex('Niall', 'Neil'), 12)
0.999019607843
>>> round(sim_eudex('Colin', 'Cuilen'), 12)
0.995098039216
>>> round(sim_eudex('ATCG', 'TAGC'), 12)
0.802450980392

class abydos.distance.Sift4[source]¶

Bases: abydos.distance._distance._Distance

Sift4 Common version.

This is an approximation of edit distance, described in [Zac14].

dist(src, tar, max_offset=5, max_distance=0)[source]¶

Return the normalized "common" Sift4 distance between two terms.

This is Sift4 distance, normalized to [0, 1].

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison max_offset (int) -- The number of characters to search for matching letters max_distance (int) -- The distance at which to stop and exit
Returns:	The normalized Sift4 distance
Return type:	float

Examples

>>> cmp = Sift4()
>>> round(cmp.dist('cat', 'hat'), 12)
0.333333333333
>>> cmp.dist('Niall', 'Neil')
0.4
>>> cmp.dist('Colin', 'Cuilen')
0.5
>>> cmp.dist('ATCG', 'TAGC')
0.5

dist_abs(src, tar, max_offset=5, max_distance=0)[source]¶

Return the "common" Sift4 distance between two terms.

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison max_offset (int) -- The number of characters to search for matching letters max_distance (int) -- The distance at which to stop and exit
Returns:	The Sift4 distance according to the common formula
Return type:	int

Examples

>>> cmp = Sift4()
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
2
>>> cmp.dist_abs('Colin', 'Cuilen')
3
>>> cmp.dist_abs('ATCG', 'TAGC')
2

class abydos.distance.Sift4Simplest[source]¶

Bases: abydos.distance._sift4.Sift4

Sift4 Simplest version.

This is an approximation of edit distance, described in [Zac14].

dist_abs(src, tar, max_offset=5)[source]¶

Return the "simplest" Sift4 distance between two terms.

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison max_offset (int) -- The number of characters to search for matching letters
Returns:	The Sift4 distance according to the simplest formula
Return type:	int

Examples

>>> cmp = Sift4Simplest()
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
2
>>> cmp.dist_abs('Colin', 'Cuilen')
3
>>> cmp.dist_abs('ATCG', 'TAGC')
2
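
The following is a sketch of the simplest Sift4 variant, transcribed into Python from the algorithm as commonly published for [Zac14]. It is not the abydos source; `sift4_simplest_sketch` is a hypothetical name, and the extra bounds checks are added here so that out-of-range cursors cannot raise an IndexError.

```python
def sift4_simplest_sketch(src, tar, max_offset=5):
    """Approximate edit distance: longer length minus matched characters."""
    if not src:
        return len(tar)
    if not tar:
        return len(src)

    src_len, tar_len = len(src), len(tar)
    src_cur = tar_cur = 0  # cursors into src and tar
    lcss = 0               # characters matched in completed runs
    local_cs = 0           # characters matched in the current run

    while src_cur < src_len and tar_cur < tar_len:
        if src[src_cur] == tar[tar_cur]:
            local_cs += 1
        else:
            lcss += local_cs
            local_cs = 0
            if src_cur != tar_cur:
                src_cur = tar_cur = max(src_cur, tar_cur)
            # Look ahead up to max_offset characters for a match.
            for i in range(max_offset):
                if not (src_cur + i < src_len or tar_cur + i < tar_len):
                    break
                if (src_cur + i < src_len and tar_cur < tar_len
                        and src[src_cur + i] == tar[tar_cur]):
                    src_cur += i
                    local_cs += 1
                    break
                if (tar_cur + i < tar_len and src_cur < src_len
                        and src[src_cur] == tar[tar_cur + i]):
                    tar_cur += i
                    local_cs += 1
                    break
        src_cur += 1
        tar_cur += 1

    lcss += local_cs
    return max(src_len, tar_len) - lcss


for pair in (('cat', 'hat'), ('Niall', 'Neil'),
             ('Colin', 'Cuilen'), ('ATCG', 'TAGC')):
    print(pair, sift4_simplest_sketch(*pair))
```

On the four example pairs this sketch returns 1, 2, 3 and 2, the same values as listed above.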

abydos.distance.sift4_common(src, tar, max_offset=5, max_distance=0)[source]¶

Return the "common" Sift4 distance between two terms.

This is a wrapper for Sift4.dist_abs().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison max_offset (int) -- The number of characters to search for matching letters max_distance (int) -- The distance at which to stop and exit
Returns:	The Sift4 distance according to the common formula
Return type:	int

Examples

>>> sift4_common('cat', 'hat')
1
>>> sift4_common('Niall', 'Neil')
2
>>> sift4_common('Colin', 'Cuilen')
3
>>> sift4_common('ATCG', 'TAGC')
2

abydos.distance.sift4_simplest(src, tar, max_offset=5)[source]¶

Return the "simplest" Sift4 distance between two terms.

This is a wrapper for Sift4Simplest.dist_abs().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison max_offset (int) -- The number of characters to search for matching letters
Returns:	The Sift4 distance according to the simplest formula
Return type:	int

Examples

>>> sift4_simplest('cat', 'hat')
1
>>> sift4_simplest('Niall', 'Neil')
2
>>> sift4_simplest('Colin', 'Cuilen')
3
>>> sift4_simplest('ATCG', 'TAGC')
2

abydos.distance.dist_sift4(src, tar, max_offset=5, max_distance=0)[source]¶

Return the normalized "common" Sift4 distance between two terms.

This is a wrapper for Sift4.dist().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison max_offset (int) -- The number of characters to search for matching letters max_distance (int) -- The distance at which to stop and exit
Returns:	The normalized Sift4 distance
Return type:	float

Examples

>>> round(dist_sift4('cat', 'hat'), 12)
0.333333333333
>>> dist_sift4('Niall', 'Neil')
0.4
>>> dist_sift4('Colin', 'Cuilen')
0.5
>>> dist_sift4('ATCG', 'TAGC')
0.5

abydos.distance.sim_sift4(src, tar, max_offset=5, max_distance=0)[source]¶

Return the normalized "common" Sift4 similarity of two terms.

This is a wrapper for Sift4.sim().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison max_offset (int) -- The number of characters to search for matching letters max_distance (int) -- The distance at which to stop and exit
Returns:	The normalized Sift4 similarity
Return type:	float

Examples

>>> round(sim_sift4('cat', 'hat'), 12)
0.666666666667
>>> sim_sift4('Niall', 'Neil')
0.6
>>> sim_sift4('Colin', 'Cuilen')
0.5
>>> sim_sift4('ATCG', 'TAGC')
0.5

class abydos.distance.Typo[source]¶

Bases: abydos.distance._distance._Distance

Typo distance.

This is inspired by Typo-Distance [Son11], and a fair bit of this was copied from that module. Compared to the original, this supports different metrics for substitution.

dist(src, tar, metric='euclidean', cost=(1, 1, 0.5, 0.5), layout='QWERTY')[source]¶

Return the normalized typo distance between two strings.

This is typo distance, normalized to [0, 1].

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison metric (str) -- Supported values include: `euclidean`, `manhattan`, `log-euclidean`, and `log-manhattan` cost (tuple) -- A 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and shifts, respectively (by default: (1, 1, 0.5, 0.5)). The substitution & shift costs should be significantly less than the cost of an insertion & deletion unless a log metric is used. layout (str) -- Name of the keyboard layout to use (Currently supported: `QWERTY`, `Dvorak`, `AZERTY`, `QWERTZ`)
Returns:	Normalized typo distance
Return type:	float

Examples

>>> cmp = Typo()
>>> round(cmp.dist('cat', 'hat'), 12)
0.527046283086
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.565028142929
>>> round(cmp.dist('Colin', 'Cuilen'), 12)
0.569035609563
>>> cmp.dist('ATCG', 'TAGC')
0.625

dist_abs(src, tar, metric='euclidean', cost=(1, 1, 0.5, 0.5), layout='QWERTY')[source]¶

Return the typo distance between two strings.

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison metric (str) -- Supported values include: `euclidean`, `manhattan`, `log-euclidean`, and `log-manhattan` cost (tuple) -- A 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and shifts, respectively (by default: (1, 1, 0.5, 0.5)). The substitution & shift costs should be significantly less than the cost of an insertion & deletion unless a log metric is used. layout (str) -- Name of the keyboard layout to use (Currently supported: `QWERTY`, `Dvorak`, `AZERTY`, `QWERTZ`)
Returns:	Typo distance
Return type:	float
Raises:	`ValueError` -- char not found in any keyboard layouts

Examples

>>> cmp = Typo()
>>> cmp.dist_abs('cat', 'hat')
1.5811388
>>> cmp.dist_abs('Niall', 'Neil')
2.8251407
>>> cmp.dist_abs('Colin', 'Cuilen')
3.4142137
>>> cmp.dist_abs('ATCG', 'TAGC')
2.5

>>> cmp.dist_abs('cat', 'hat', metric='manhattan')
2.0
>>> cmp.dist_abs('Niall', 'Neil', metric='manhattan')
3.0
>>> cmp.dist_abs('Colin', 'Cuilen', metric='manhattan')
3.5
>>> cmp.dist_abs('ATCG', 'TAGC', metric='manhattan')
2.5

>>> cmp.dist_abs('cat', 'hat', metric='log-manhattan')
0.804719
>>> cmp.dist_abs('Niall', 'Neil', metric='log-manhattan')
2.2424533
>>> cmp.dist_abs('Colin', 'Cuilen', metric='log-manhattan')
2.2424533
>>> cmp.dist_abs('ATCG', 'TAGC', metric='log-manhattan')
2.3465736
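
The effect of the metric parameter on a single substitution can be sketched as follows. Everything here is illustrative: `substitution_cost` is a hypothetical helper, the key positions assume an unstaggered (row, column) QWERTY grid, and the log(1 + d) form of the log metrics is an assumption chosen because it matches the 'cat'/'hat' values above.

```python
import math

# Assumed (row, column) positions on an unstaggered QWERTY grid.
KEY_POSITIONS = {'c': (3, 2), 'h': (2, 5)}


def substitution_cost(key_a, key_b, metric='euclidean', sub_cost=0.5):
    """Return sub_cost scaled by the keyboard distance between two keys."""
    row_a, col_a = KEY_POSITIONS[key_a]
    row_b, col_b = KEY_POSITIONS[key_b]
    d_row, d_col = abs(row_a - row_b), abs(col_a - col_b)

    if metric == 'euclidean':
        distance = math.hypot(d_row, d_col)
    elif metric == 'manhattan':
        distance = d_row + d_col
    elif metric == 'log-euclidean':
        distance = math.log(1 + math.hypot(d_row, d_col))
    elif metric == 'log-manhattan':
        distance = math.log(1 + d_row + d_col)
    else:
        raise ValueError('unsupported metric: ' + metric)
    return sub_cost * distance


# 'cat' -> 'hat' differs only by substituting 'c' with 'h'.
for metric in ('euclidean', 'manhattan', 'log-manhattan'):
    print(metric, substitution_cost('c', 'h', metric))
```

With the default substitution cost of 0.5, the three metrics give roughly 1.5811, 2.0 and 0.8047, in line with the 'cat'/'hat' examples.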

abydos.distance.typo(src, tar, metric='euclidean', cost=(1, 1, 0.5, 0.5), layout='QWERTY')[source]¶

Return the typo distance between two strings.

This is a wrapper for Typo.dist_abs().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison metric (str) -- Supported values include: `euclidean`, `manhattan`, `log-euclidean`, and `log-manhattan` cost (tuple) -- A 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and shifts, respectively (by default: (1, 1, 0.5, 0.5)). The substitution & shift costs should be significantly less than the cost of an insertion & deletion unless a log metric is used. layout (str) -- Name of the keyboard layout to use (Currently supported: `QWERTY`, `Dvorak`, `AZERTY`, `QWERTZ`)
Returns:	Typo distance
Return type:	float

Examples

>>> typo('cat', 'hat')
1.5811388
>>> typo('Niall', 'Neil')
2.8251407
>>> typo('Colin', 'Cuilen')
3.4142137
>>> typo('ATCG', 'TAGC')
2.5

>>> typo('cat', 'hat', metric='manhattan')
2.0
>>> typo('Niall', 'Neil', metric='manhattan')
3.0
>>> typo('Colin', 'Cuilen', metric='manhattan')
3.5
>>> typo('ATCG', 'TAGC', metric='manhattan')
2.5

>>> typo('cat', 'hat', metric='log-manhattan')
0.804719
>>> typo('Niall', 'Neil', metric='log-manhattan')
2.2424533
>>> typo('Colin', 'Cuilen', metric='log-manhattan')
2.2424533
>>> typo('ATCG', 'TAGC', metric='log-manhattan')
2.3465736

abydos.distance.dist_typo(src, tar, metric='euclidean', cost=(1, 1, 0.5, 0.5), layout='QWERTY')[source]¶

Return the normalized typo distance between two strings.

This is a wrapper for Typo.dist().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison metric (str) -- Supported values include: `euclidean`, `manhattan`, `log-euclidean`, and `log-manhattan` cost (tuple) -- A 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and shifts, respectively (by default: (1, 1, 0.5, 0.5)). The substitution & shift costs should be significantly less than the cost of an insertion & deletion unless a log metric is used. layout (str) -- Name of the keyboard layout to use (Currently supported: `QWERTY`, `Dvorak`, `AZERTY`, `QWERTZ`)
Returns:	Normalized typo distance
Return type:	float

Examples

>>> round(dist_typo('cat', 'hat'), 12)
0.527046283086
>>> round(dist_typo('Niall', 'Neil'), 12)
0.565028142929
>>> round(dist_typo('Colin', 'Cuilen'), 12)
0.569035609563
>>> dist_typo('ATCG', 'TAGC')
0.625

abydos.distance.sim_typo(src, tar, metric='euclidean', cost=(1, 1, 0.5, 0.5), layout='QWERTY')[source]¶

Return the normalized typo similarity between two strings.

This is a wrapper for Typo.sim().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison metric (str) -- Supported values include: `euclidean`, `manhattan`, `log-euclidean`, and `log-manhattan` cost (tuple) -- A 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and shifts, respectively (by default: (1, 1, 0.5, 0.5)). The substitution & shift costs should be significantly less than the cost of an insertion & deletion unless a log metric is used. layout (str) -- Name of the keyboard layout to use (Currently supported: `QWERTY`, `Dvorak`, `AZERTY`, `QWERTZ`)
Returns:	Normalized typo similarity
Return type:	float

Examples

>>> round(sim_typo('cat', 'hat'), 12)
0.472953716914
>>> round(sim_typo('Niall', 'Neil'), 12)
0.434971857071
>>> round(sim_typo('Colin', 'Cuilen'), 12)
0.430964390437
>>> sim_typo('ATCG', 'TAGC')
0.375

class abydos.distance.Synoname[source]¶

Bases: abydos.distance._distance._Distance

Synoname.

Cf. [JPGTrust91][Gro91]

dist(src, tar, word_approx_min=0.3, char_approx_min=0.73, tests=4095)[source]¶

Return the normalized Synoname distance between two words.

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison word_approx_min (float) -- The minimum word approximation value to signal a 'word_approx' match char_approx_min (float) -- The minimum character approximation value to signal a 'char_approx' match tests (int or Iterable) -- Either an integer indicating tests to perform or a list of test names to perform (defaults to performing all tests)
Returns:	Normalized Synoname distance
Return type:	float

dist_abs(src, tar, word_approx_min=0.3, char_approx_min=0.73, tests=4095, ret_name=False)[source]¶

Return the Synoname similarity type of two words.

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison word_approx_min (float) -- The minimum word approximation value to signal a 'word_approx' match char_approx_min (float) -- The minimum character approximation value to signal a 'char_approx' match tests (int or Iterable) -- Either an integer indicating tests to perform or a list of test names to perform (defaults to performing all tests) ret_name (bool) -- If True, returns the match name rather than its integer equivalent
Returns:	Synoname value
Return type:	int (or str if ret_name is True)

Examples

>>> cmp = Synoname()
>>> cmp.dist_abs(('Breghel', 'Pieter', ''), ('Brueghel', 'Pieter', ''))
2
>>> cmp.dist_abs(('Breghel', 'Pieter', ''), ('Brueghel', 'Pieter', ''),
... ret_name=True)
'omission'
>>> cmp.dist_abs(('Dore', 'Gustave', ''),
... ('Dore', 'Paul Gustave Louis Christophe', ''), ret_name=True)
'inclusion'
>>> cmp.dist_abs(('Pereira', 'I. R.', ''), ('Pereira', 'I. Smith', ''),
... ret_name=True)
'word_approx'

abydos.distance.synoname(src, tar, word_approx_min=0.3, char_approx_min=0.73, tests=4095, ret_name=False)[source]¶

Return the Synoname similarity type of two words.

This is a wrapper for Synoname.dist_abs().

Parameters:	src (str) -- Source string for comparison tar (str) -- Target string for comparison word_approx_min (float) -- The minimum word approximation value to signal a 'word_approx' match char_approx_min (float) -- The minimum character approximation value to signal a 'char_approx' match tests (int or Iterable) -- Either an integer indicating tests to perform or a list of test names to perform (defaults to performing all tests) ret_name (bool) -- If True, returns the match name rather than its integer equivalent
Returns:	Synoname value
Return type:	int (or str if ret_name is True)

Examples

>>> synoname(('Breghel', 'Pieter', ''), ('Brueghel', 'Pieter', ''))
2
>>> synoname(('Breghel', 'Pieter', ''), ('Brueghel', 'Pieter', ''),
... ret_name=True)
'omission'
>>> synoname(('Dore', 'Gustave', ''),
... ('Dore', 'Paul Gustave Louis Christophe', ''), ret_name=True)
'inclusion'
>>> synoname(('Pereira', 'I. R.', ''), ('Pereira', 'I. Smith', ''),
... ret_name=True)
'word_approx'