abydos.distance package
abydos.distance.
The distance package implements string distance measure and metric classes:
These include traditional Levenshtein edit distance and related algorithms:
Levenshtein distance (Levenshtein)
Optimal String Alignment distance (Levenshtein with mode='osa')
Damerau-Levenshtein distance (DamerauLevenshtein)
Yujian-Bo normalized edit distance (YujianBo)
Higuera-Micó contextual normalized edit distance (HigueraMico)
Indel distance (Indel)
Syllable Alignment Pattern Searching similarity (distance.SAPS)
Meta-Levenshtein distance (MetaLevenshtein)
Covington distance (Covington)
ALINE distance (ALINE)
FlexMetric distance (FlexMetric)
BI-SIM similarity (BISIM)
Discounted Levenshtein distance (DiscountedLevenshtein)
Phonetic edit distance (PhoneticEditDistance)
Hamming distance (Hamming), Relaxed Hamming distance (RelaxedHamming), and the closely related Modified Language-Independent Product Name Search distance (MLIPNS) are provided.
Block edit distances:
Tichy edit distance (Tichy)
Levenshtein distance with block operations (BlockLevenshtein)
Rees-Levenshtein distance (ReesLevenshtein)
Cormode's LZ distance (CormodeLZ)
Shapira-Storer I edit distance with block moves, greedy algorithm (ShapiraStorerI)
Distance metrics developed for the US Census or derived from them are included:
Jaro distance (JaroWinkler with mode='Jaro')
Jaro-Winkler distance (JaroWinkler)
Strcmp95 distance (Strcmp95)
Iterative-SubString (I-Sub) correlation (IterativeSubString)
A large set of multi-set token-based distance metrics are provided, including:
AMPLE similarity (AMPLE)
AZZOO similarity (AZZOO)
Anderberg's D similarity (Anderberg)
Andres & Marzo's Delta correlation (AndresMarzoDelta)
Baroni-Urbani & Buser I similarity (BaroniUrbaniBuserI)
Baroni-Urbani & Buser II correlation (BaroniUrbaniBuserII)
Batagelj & Bren similarity (BatageljBren)
Baulieu I distance (BaulieuI)
Baulieu II distance (BaulieuII)
Baulieu III distance (BaulieuIII)
Baulieu IV distance (BaulieuIV)
Baulieu V distance (BaulieuV)
Baulieu VI distance (BaulieuVI)
Baulieu VII distance (BaulieuVII)
Baulieu VIII distance (BaulieuVIII)
Baulieu IX distance (BaulieuIX)
Baulieu X distance (BaulieuX)
Baulieu XI distance (BaulieuXI)
Baulieu XII distance (BaulieuXII)
Baulieu XIII distance (BaulieuXIII)
Baulieu XIV distance (BaulieuXIV)
Baulieu XV distance (BaulieuXV)
Benini I correlation (BeniniI)
Benini II correlation (BeniniII)
Bennet's S correlation (Bennet)
Braun-Blanquet similarity (BraunBlanquet)
Canberra distance (Canberra)
Cao similarity (Cao)
Chao's Dice similarity (ChaoDice)
Chao's Jaccard similarity (ChaoJaccard)
Chebyshev distance (Chebyshev)
Chord distance (Chord)
Clark distance (Clark)
Clement similarity (Clement)
Cohen's Kappa similarity (CohenKappa)
Cole correlation (Cole)
Consonni & Todeschini I similarity (ConsonniTodeschiniI)
Consonni & Todeschini II similarity (ConsonniTodeschiniII)
Consonni & Todeschini III similarity (ConsonniTodeschiniIII)
Consonni & Todeschini IV similarity (ConsonniTodeschiniIV)
Consonni & Todeschini V correlation (ConsonniTodeschiniV)
Cosine similarity (Cosine)
Dennis similarity (Dennis)
Dice's Asymmetric I similarity (DiceAsymmetricI)
Dice's Asymmetric II similarity (DiceAsymmetricII)
Digby correlation (Digby)
Dispersion correlation (Dispersion)
Doolittle similarity (Doolittle)
Dunning similarity (Dunning)
Euclidean distance (Euclidean)
Eyraud similarity (Eyraud)
Fager & McGowan similarity (FagerMcGowan)
Faith similarity (Faith)
Fidelity similarity (Fidelity)
Fleiss correlation (Fleiss)
Fleiss-Levin-Paik similarity (FleissLevinPaik)
Forbes I similarity (ForbesI)
Forbes II correlation (ForbesII)
Fossum similarity (Fossum)
Generalized Fleiss correlation (GeneralizedFleiss)
Gilbert correlation (Gilbert)
Gilbert & Wells similarity (GilbertWells)
Gini I correlation (GiniI)
Gini II correlation (GiniII)
Goodall similarity (Goodall)
Goodman & Kruskal's Lambda similarity (GoodmanKruskalLambda)
Goodman & Kruskal's Lambda-r correlation (GoodmanKruskalLambdaR)
Goodman & Kruskal's Tau A similarity (GoodmanKruskalTauA)
Goodman & Kruskal's Tau B similarity (GoodmanKruskalTauB)
Gower & Legendre similarity (GowerLegendre)
Guttman Lambda A similarity (GuttmanLambdaA)
Guttman Lambda B similarity (GuttmanLambdaB)
Gwet's AC correlation (GwetAC)
Hamann correlation (Hamann)
Harris & Lahey similarity (HarrisLahey)
Hassanat distance (Hassanat)
Hawkins & Dotson similarity (HawkinsDotson)
Hellinger distance (Hellinger)
Henderson-Heron similarity (HendersonHeron)
Horn-Morisita similarity (HornMorisita)
Hurlbert correlation (Hurlbert)
Jaccard similarity (Jaccard) & Tanimoto coefficient (Jaccard.tanimoto_coeff())
Jaccard-NM similarity (JaccardNM)
Johnson similarity (Johnson)
Kendall's Tau correlation (KendallTau)
Kent & Foster I similarity (KentFosterI)
Kent & Foster II similarity (KentFosterII)
Köppen I correlation (KoppenI)
Köppen II similarity (KoppenII)
Kuder & Richardson correlation (KuderRichardson)
Kuhns I correlation (KuhnsI)
Kuhns II correlation (KuhnsII)
Kuhns III correlation (KuhnsIII)
Kuhns IV correlation (KuhnsIV)
Kuhns V correlation (KuhnsV)
Kuhns VI correlation (KuhnsVI)
Kuhns VII correlation (KuhnsVII)
Kuhns VIII correlation (KuhnsVIII)
Kuhns IX correlation (KuhnsIX)
Kuhns X correlation (KuhnsX)
Kuhns XI correlation (KuhnsXI)
Kuhns XII similarity (KuhnsXII)
Kulczynski I similarity (KulczynskiI)
Kulczynski II similarity (KulczynskiII)
Lorentzian distance (Lorentzian)
Maarel correlation (Maarel)
Manhattan distance (Manhattan)
Morisita similarity (Morisita)
marking distance (Marking)
marking metric (MarkingMetric)
MASI similarity (MASI)
Matusita distance (Matusita)
Maxwell & Pilliner correlation (MaxwellPilliner)
McConnaughey correlation (McConnaughey)
McEwen & Michael correlation (McEwenMichael)
mean squared contingency correlation (MSContingency)
Michael similarity (Michael)
Michelet similarity (Michelet)
Millar distance (Millar)
Minkowski distance (Minkowski)
Mountford similarity (Mountford)
Mutual Information similarity (MutualInformation)
Overlap distance (Overlap)
Pattern difference (Pattern)
Pearson & Heron II correlation (PearsonHeronII)
Pearson II similarity (PearsonII)
Pearson III correlation (PearsonIII)
Pearson's Chi-Squared similarity (PearsonChiSquared)
Pearson's Phi correlation (PearsonPhi)
Peirce correlation (Peirce)
q-gram distance (QGram)
Raup-Crick similarity (RaupCrick)
Rogers & Tanimoto similarity (RogersTanimoto)
Rogot & Goldberg similarity (RogotGoldberg)
Russell & Rao similarity (RussellRao)
Scott's Pi correlation (ScottPi)
Shape difference (Shape)
Size difference (Size)
Sokal & Michener similarity (SokalMichener)
Sokal & Sneath I similarity (SokalSneathI)
Sokal & Sneath II similarity (SokalSneathII)
Sokal & Sneath III similarity (SokalSneathIII)
Sokal & Sneath IV similarity (SokalSneathIV)
Sokal & Sneath V similarity (SokalSneathV)
Sørensen–Dice coefficient (Dice)
Sorgenfrei similarity (Sorgenfrei)
Steffensen similarity (Steffensen)
Stiles similarity (Stiles)
Stuart's Tau correlation (StuartTau)
Tarantula similarity (Tarantula)
Tarwid correlation (Tarwid)
Tetrachoric correlation coefficient (Tetrachoric)
Tulloss' R similarity (TullossR)
Tulloss' S similarity (TullossS)
Tulloss' T similarity (TullossT)
Tulloss' U similarity (TullossU)
Tversky distance (Tversky)
Weighted Jaccard similarity (WeightedJaccard)
Unigram subtuple similarity (UnigramSubtuple)
Unknown A correlation (UnknownA)
Unknown B similarity (UnknownB)
Unknown C similarity (UnknownC)
Unknown D similarity (UnknownD)
Unknown E correlation (UnknownE)
Unknown F similarity (UnknownF)
Unknown G similarity (UnknownG)
Unknown H similarity (UnknownH)
Unknown I similarity (UnknownI)
Unknown J similarity (UnknownJ)
Unknown K distance (UnknownK)
Unknown L similarity (UnknownL)
Unknown M similarity (UnknownM)
Upholt similarity (Upholt)
Warrens I correlation (WarrensI)
Warrens II similarity (WarrensII)
Warrens III correlation (WarrensIII)
Warrens IV similarity (WarrensIV)
Warrens V similarity (WarrensV)
Whittaker distance (Whittaker)
Yates' Chi-Squared similarity (YatesChiSquared)
Yule's Q correlation (YuleQ)
Yule's Q II distance (YuleQII)
Yule's Y correlation (YuleY)
YJHHR distance (YJHHR)
Bhattacharyya distance (Bhattacharyya)
Brainerd-Robinson similarity (BrainerdRobinson)
Quantitative Cosine similarity (QuantitativeCosine)
Quantitative Dice similarity (QuantitativeDice)
Quantitative Jaccard similarity (QuantitativeJaccard)
Roberts similarity (Roberts)
Average linkage distance (AverageLinkage)
Single linkage distance (SingleLinkage)
Complete linkage distance (CompleteLinkage)
Bag distance (Bag)
Soft cosine similarity (SoftCosine)
Monge-Elkan distance (MongeElkan)
TF-IDF similarity (TFIDF)
SoftTF-IDF similarity (SoftTFIDF)
Jensen-Shannon divergence (JensenShannon)
Simplified Fellegi-Sunter distance (FellegiSunter)
MinHash similarity (MinHash)
BLEU similarity (BLEU)
Rouge-L similarity (RougeL)
Rouge-W similarity (RougeW)
Rouge-S similarity (RougeS)
Rouge-SU similarity (RougeSU)
Positional Q-Gram Dice distance (PositionalQGramDice)
Positional Q-Gram Jaccard distance (PositionalQGramJaccard)
Positional Q-Gram Overlap distance (PositionalQGramOverlap)
Three popular sequence alignment algorithms are provided:
Needleman-Wunsch score (NeedlemanWunsch)
Smith-Waterman score (SmithWaterman)
Gotoh score (Gotoh)
Classes relating to substring and subsequence distances include:
Longest common subsequence (LCSseq)
Longest common substring (LCSstr)
Ratcliff-Obershelp distance (RatcliffObershelp)
A number of simple distance classes provided in the package include:
Normalized compression distance classes for a variety of compression algorithms are provided:
Three similarity measures from SeatGeek's FuzzyWuzzy:
FuzzyWuzzy Partial String similarity (FuzzyWuzzyPartialString)
FuzzyWuzzy Token Sort similarity (FuzzyWuzzyTokenSort)
FuzzyWuzzy Token Set similarity (FuzzyWuzzyTokenSet)
A convenience class, allowing one to pass a list of string transforms (phonetic algorithms, string transforms, and/or stemmers) and, optionally, a string distance measure to compute the similarity/distance of two strings that have undergone each transform, is provided in:
Phonetic distance (PhoneticDistance)
The remaining distance measures & metrics include:
Western Airlines' Match Rating Algorithm comparison (distance.MRA)
Editex (Editex)
Bavarian Landesamt für Statistik distance (Baystat)
Eudex distance (distance.Eudex)
Sift4 distance (Sift4, Sift4Simplest, Sift4Extended)
Typo distance (Typo)
Synoname (Synoname)
Ozbay metric (Ozbay)
Indice de Similitude-Guth (ISG)
INClusion Programme (Inclusion)
Guth (Guth)
Victorian Panel Study (VPS)
LIG3 (LIG3)
String subsequence kernel (SSK) (SSK)
Most of the distance and similarity measures have sim and dist methods, which return a measure that is normalized to the range \([0, 1]\). The normalized distance and similarity are always complements, so for a particular measure supplied with the same input, the normalized distance always equals 1 minus the similarity. Some measures also have an absolute distance method, dist_abs, that is not limited to any range.
All three methods can be demonstrated using the DamerauLevenshtein class:
>>> from abydos.distance import DamerauLevenshtein
>>> dl = DamerauLevenshtein()
>>> dl.dist_abs('orange', 'strange')
2
>>> dl.dist('orange', 'strange')
0.2857142857142857
>>> dl.sim('orange', 'strange')
0.7142857142857143
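The relationship between these three values can be checked with plain arithmetic. The sketch below does not use the library; it assumes unit edit costs, so that the normalization term is the length of the longer string, and normalized_pair is a hypothetical helper name chosen for illustration.

```python
def normalized_pair(dist_abs_value, src, tar):
    """Return (dist, sim) for an absolute edit distance, assuming unit costs."""
    # Assumption: the normalization term is the length of the longer string.
    norm = max(len(src), len(tar))
    dist = dist_abs_value / norm if norm else 0.0
    # The normalized similarity is the complement of the normalized distance.
    return dist, 1.0 - dist


# 'orange' vs. 'strange': absolute distance 2, longer string has 7 characters.
dist, sim = normalized_pair(2, 'orange', 'strange')
print(dist, sim)
```

The two printed values sum to 1, which is the complement property described above.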
-
abydos.distance.sim(src, tar, method=<function sim_levenshtein>)
Return a similarity of two strings.
This is a generalized function for calling other similarity functions.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
method (function) -- Specifies the similarity metric (sim_levenshtein() by default)
- Returns
Similarity according to the specified function
- Return type
float
- Raises
AttributeError -- Unknown distance function
Examples
>>> round(sim('cat', 'hat'), 12)
0.666666666667
>>> round(sim('Niall', 'Neil'), 12)
0.4
>>> sim('aluminum', 'Catalan')
0.125
>>> sim('ATCG', 'TAGC')
0.25
New in version 0.1.0.
-
abydos.distance.dist(src, tar, method=<function sim_levenshtein>)
Return a distance between two strings.
This is a generalized function for calling other distance functions.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
method (function) -- Specifies the similarity metric (sim_levenshtein() by default). Note that this takes a similarity metric function, not a distance metric function.
- Returns
Distance according to the specified function
- Return type
float
- Raises
AttributeError -- Unknown distance function
Examples
>>> round(dist('cat', 'hat'), 12)
0.333333333333
>>> round(dist('Niall', 'Neil'), 12)
0.6
>>> dist('aluminum', 'Catalan')
0.875
>>> dist('ATCG', 'TAGC')
0.75
New in version 0.1.0.
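The dispatch behaviour of these two generalized functions can be sketched as follows. This is a stand-alone illustration under stated assumptions, not the library's code: toy_sim is a hypothetical similarity function invented for the example, and the error message text is likewise an assumption.

```python
def toy_sim(src, tar):
    """Hypothetical similarity: fraction of aligned positions that match."""
    longest = max(len(src), len(tar))
    if longest == 0:
        return 1.0
    matches = sum(1 for a, b in zip(src, tar) if a == b)
    return matches / longest


def sim(src, tar, method=toy_sim):
    """Call the given similarity function, rejecting non-callables."""
    if callable(method):
        return method(src, tar)
    raise AttributeError('Unknown similarity function: ' + str(method))


def dist(src, tar, method=toy_sim):
    """Return the complement of the similarity computed by `method`."""
    # `method` must be a similarity function, since dist is derived as 1 - sim.
    return 1.0 - sim(src, tar, method)


print(sim('cat', 'hat'), dist('cat', 'hat'))
```

Deriving dist as the complement of sim is why the method argument has to be a similarity function in both cases.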
-
class abydos.distance.Levenshtein(mode='lev', cost=(1, 1, 1, 1), normalizer=<built-in function max>, taper=False, **kwargs)
Bases: abydos.distance._distance._Distance
Levenshtein distance.
This is the standard edit distance measure. Cf. [Lev65][Lev66].
Optimal string alignment (aka restricted Damerau-Levenshtein distance) [Boy11] is also supported.
The ordinary Levenshtein & Optimal String Alignment distances both employ the Wagner-Fischer dynamic programming algorithm [WF74].
Levenshtein edit distance ordinarily has unit insertion, deletion, and substitution costs.
New in version 0.3.6.
Changed in version 0.4.0: Added taper option
Initialize Levenshtein instance.
- Parameters
mode (str) -- Specifies a mode for computing the Levenshtein distance:
lev (default) computes the ordinary Levenshtein distance, in which edits may include inserts, deletes, and substitutions
osa computes the Optimal String Alignment distance, in which edits may include inserts, deletes, substitutions, and transpositions but substrings may only be edited once
cost (tuple) -- A 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and transpositions, respectively (by default: (1, 1, 1, 1))
normalizer (function) -- A function that takes a list and computes a normalization term by which the edit distance is divided (max by default). Another good option is the sum function.
taper (bool) -- Enables cost tapering. Following [ZD96], it causes edits at the start of the string to "just [exceed] twice the minimum penalty for replacement or deletion at the end of the string".
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
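The two modes can be illustrated with a minimal Wagner-Fischer sketch. It is written independently of the library and omits the taper option; the function name levenshtein_sketch is hypothetical, while the mode and cost arguments mirror the parameters described above.

```python
def levenshtein_sketch(src, tar, mode='lev', cost=(1, 1, 1, 1)):
    """Wagner-Fischer edit distance; mode='osa' also allows transpositions."""
    ins_cost, del_cost, sub_cost, trans_cost = cost
    rows, cols = len(src) + 1, len(tar) + 1
    # d[i][j] holds the distance between src[:i] and tar[:j].
    d = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * del_cost
    for j in range(1, cols):
        d[0][j] = j * ins_cost
    for i in range(1, rows):
        for j in range(1, cols):
            same = src[i - 1] == tar[j - 1]
            d[i][j] = min(
                d[i - 1][j] + del_cost,                       # delete
                d[i][j - 1] + ins_cost,                       # insert
                d[i - 1][j - 1] + (0 if same else sub_cost),  # match/substitute
            )
            # OSA: swap two adjacent characters, each substring edited once.
            if (mode == 'osa' and i > 1 and j > 1
                    and src[i - 1] == tar[j - 2]
                    and src[i - 2] == tar[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + trans_cost)
    return d[-1][-1]


print(levenshtein_sketch('ATCG', 'TAGC'))              # 3
print(levenshtein_sketch('ATCG', 'TAGC', mode='osa'))  # 2
```

With unit costs the result is an int; passing float costs yields a float.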
-
alignment(src, tar)
Return the Levenshtein alignment of two strings.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
A tuple containing the Levenshtein distance and the two strings, aligned.
- Return type
tuple
Examples
>>> cmp = Levenshtein()
>>> cmp.alignment('cat', 'hat')
(1.0, 'cat', 'hat')
>>> cmp.alignment('Niall', 'Neil')
(3.0, 'N-iall', 'Nei-l-')
>>> cmp.alignment('aluminum', 'Catalan')
(7.0, '-aluminum', 'Catalan--')
>>> cmp.alignment('ATCG', 'TAGC')
(3.0, 'ATCG-', '-TAGC')
>>> cmp = Levenshtein(mode='osa')
>>> cmp.alignment('ATCG', 'TAGC')
(2.0, 'ATCG', 'TAGC')
>>> cmp.alignment('ACTG', 'TAGC')
(4.0, 'ACT-G-', '--TAGC')
New in version 0.4.1.
-
dist(src, tar)
Return the normalized Levenshtein distance between two strings.
The Levenshtein distance is normalized by dividing the Levenshtein distance (calculated by either of the two supported methods) by the greater of the number of characters in src times the cost of a delete and the number of characters in tar times the cost of an insert. For the case in which all operations have \(cost = 1\), this is equivalent to the greater of the length of the two strings src & tar.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
The normalized Levenshtein distance between src & tar
- Return type
float
Examples
>>> cmp = Levenshtein()
>>> round(cmp.dist('cat', 'hat'), 12)
0.333333333333
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.6
>>> cmp.dist('aluminum', 'Catalan')
0.875
>>> cmp.dist('ATCG', 'TAGC')
0.75
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
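The normalization described for dist can be sketched as below. The helper name normalized_distance is hypothetical, and it is an assumption of this sketch that the normalizer receives exactly the two weighted lengths named in the description; the absolute distance is passed in rather than computed.

```python
def normalized_distance(dist_abs_value, src, tar,
                        cost=(1, 1, 1, 1), normalizer=max):
    """Divide an absolute edit distance by a normalization term."""
    ins_cost, del_cost = cost[0], cost[1]
    # Assumption: the normalizer is applied to a list of two terms,
    # len(src) * delete cost and len(tar) * insert cost.
    term = normalizer([len(src) * del_cost, len(tar) * ins_cost])
    return dist_abs_value / term if term else 0.0


print(normalized_distance(3, 'Niall', 'Neil'))                  # 0.6
print(normalized_distance(7, 'aluminum', 'Catalan'))            # 0.875
print(normalized_distance(3, 'Niall', 'Neil', normalizer=sum))  # 3 / 9
```

Using sum instead of max divides by the combined weighted length, so it yields smaller values for the same pair of strings.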
-
dist_abs(src, tar)
Return the Levenshtein distance between two strings.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
The Levenshtein distance between src & tar
- Return type
int (may return a float if cost has float values)
Examples
>>> cmp = Levenshtein()
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
3
>>> cmp.dist_abs('aluminum', 'Catalan')
7
>>> cmp.dist_abs('ATCG', 'TAGC')
3
>>> cmp = Levenshtein(mode='osa')
>>> cmp.dist_abs('ATCG', 'TAGC')
2
>>> cmp.dist_abs('ACTG', 'TAGC')
4
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
-
abydos.distance.levenshtein(src, tar, mode='lev', cost=(1, 1, 1, 1))
Return the Levenshtein distance between two strings.
This is a wrapper of Levenshtein.dist_abs().
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
mode (str) -- Specifies a mode for computing the Levenshtein distance:
lev (default) computes the ordinary Levenshtein distance, in which edits may include inserts, deletes, and substitutions
osa computes the Optimal String Alignment distance, in which edits may include inserts, deletes, substitutions, and transpositions but substrings may only be edited once
cost (tuple) -- A 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and transpositions, respectively (by default: (1, 1, 1, 1))
- Returns
The Levenshtein distance between src & tar
- Return type
int (may return a float if cost has float values)
Examples
>>> levenshtein('cat', 'hat')
1
>>> levenshtein('Niall', 'Neil')
3
>>> levenshtein('aluminum', 'Catalan')
7
>>> levenshtein('ATCG', 'TAGC')
3
>>> levenshtein('ATCG', 'TAGC', mode='osa')
2
>>> levenshtein('ACTG', 'TAGC', mode='osa')
4
New in version 0.1.0.
Deprecated since version 0.4.0: This will be removed in 0.6.0. Use the Levenshtein.dist_abs method instead.
-
abydos.distance.dist_levenshtein(src, tar, mode='lev', cost=(1, 1, 1, 1))
Return the normalized Levenshtein distance between two strings.
This is a wrapper of Levenshtein.dist().
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
mode (str) -- Specifies a mode for computing the Levenshtein distance:
lev (default) computes the ordinary Levenshtein distance, in which edits may include inserts, deletes, and substitutions
osa computes the Optimal String Alignment distance, in which edits may include inserts, deletes, substitutions, and transpositions but substrings may only be edited once
cost (tuple) -- A 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and transpositions, respectively (by default: (1, 1, 1, 1))
- Returns
The normalized Levenshtein distance between src & tar
- Return type
float
Examples
>>> round(dist_levenshtein('cat', 'hat'), 12)
0.333333333333
>>> round(dist_levenshtein('Niall', 'Neil'), 12)
0.6
>>> dist_levenshtein('aluminum', 'Catalan')
0.875
>>> dist_levenshtein('ATCG', 'TAGC')
0.75
New in version 0.1.0.
Deprecated since version 0.4.0: This will be removed in 0.6.0. Use the Levenshtein.dist method instead.
-
abydos.distance.sim_levenshtein(src, tar, mode='lev', cost=(1, 1, 1, 1))
Return the Levenshtein similarity of two strings.
This is a wrapper of Levenshtein.sim().
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
mode (str) -- Specifies a mode for computing the Levenshtein distance:
lev (default) computes the ordinary Levenshtein distance, in which edits may include inserts, deletes, and substitutions
osa computes the Optimal String Alignment distance, in which edits may include inserts, deletes, substitutions, and transpositions but substrings may only be edited once
cost (tuple) -- A 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and transpositions, respectively (by default: (1, 1, 1, 1))
- Returns
The Levenshtein similarity between src & tar
- Return type
float
Examples
>>> round(sim_levenshtein('cat', 'hat'), 12)
0.666666666667
>>> round(sim_levenshtein('Niall', 'Neil'), 12)
0.4
>>> sim_levenshtein('aluminum', 'Catalan')
0.125
>>> sim_levenshtein('ATCG', 'TAGC')
0.25
New in version 0.1.0.
Deprecated since version 0.4.0: This will be removed in 0.6.0. Use the Levenshtein.sim method instead.
-
class abydos.distance.DamerauLevenshtein(cost=(1, 1, 1, 1), normalizer=<built-in function max>, **kwargs)
Bases: abydos.distance._distance._Distance
Damerau-Levenshtein distance.
This computes the Damerau-Levenshtein distance [Dam64]. Damerau-Levenshtein code is based on Java code by Kevin L. Stern [Ste14], under the MIT license: https://github.com/KevinStern/software-and-algorithms/blob/master/src/main/java/blogspot/software_and_algorithms/stern_library/string/DamerauLevenshteinAlgorithm.java
Initialize DamerauLevenshtein instance.
- Parameters
cost (tuple) -- A 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and transpositions, respectively (by default: (1, 1, 1, 1))
normalizer (function) -- A function that takes a list and computes a normalization term by which the edit distance is divided (max by default). Another good option is the sum function.
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
-
dist(src, tar)
Return the normalized Damerau-Levenshtein distance between two strings.
Damerau-Levenshtein distance normalized to the interval [0, 1].
The Damerau-Levenshtein distance is normalized by dividing the Damerau-Levenshtein distance by the greater of the number of characters in src times the cost of a delete and the number of characters in tar times the cost of an insert. For the case in which all operations have \(cost = 1\), this is equivalent to the greater of the length of the two strings src & tar.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
The normalized Damerau-Levenshtein distance
- Return type
float
Examples
>>> cmp = DamerauLevenshtein()
>>> round(cmp.dist('cat', 'hat'), 12)
0.333333333333
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.6
>>> cmp.dist('aluminum', 'Catalan')
0.875
>>> cmp.dist('ATCG', 'TAGC')
0.5
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
-
dist_abs(src, tar)
Return the Damerau-Levenshtein distance between two strings.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
The Damerau-Levenshtein distance between src & tar
- Return type
int (may return a float if cost has float values)
- Raises
ValueError -- Unsupported cost assignment; the cost of two transpositions must not be less than the cost of an insert plus a delete.
Examples
>>> cmp = DamerauLevenshtein()
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
3
>>> cmp.dist_abs('aluminum', 'Catalan')
7
>>> cmp.dist_abs('ATCG', 'TAGC')
2
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
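For comparison with the restricted (OSA) variant, the unrestricted Damerau-Levenshtein distance can be sketched with the textbook dynamic programme below. This is not the Stern-derived implementation the class is based on: it is a simplified version with unit costs only, and damerau_levenshtein_sketch is a hypothetical name.

```python
def damerau_levenshtein_sketch(src, tar):
    """Unrestricted Damerau-Levenshtein distance with unit costs."""
    max_dist = len(src) + len(tar)
    # One extra sentinel row and column, filled with a value too large to win.
    d = [[max_dist] * (len(tar) + 2) for _ in range(len(src) + 2)]
    for i in range(len(src) + 1):
        d[i + 1][1] = i
    for j in range(len(tar) + 1):
        d[1][j + 1] = j
    last_row = {}  # character -> last (1-based) position seen in src
    for i in range(1, len(src) + 1):
        last_match_col = 0  # last column in this row where characters matched
        for j in range(1, len(tar) + 1):
            row = last_row.get(tar[j - 1], 0)
            col = last_match_col
            if src[i - 1] == tar[j - 1]:
                sub_cost = 0
                last_match_col = j
            else:
                sub_cost = 1
            d[i + 1][j + 1] = min(
                d[i][j] + sub_cost,  # match or substitute
                d[i + 1][j] + 1,     # insert
                d[i][j + 1] + 1,     # delete
                # transposition, plus the edits between the swapped characters
                d[row][col] + (i - row - 1) + 1 + (j - col - 1),
            )
        last_row[src[i - 1]] = i
    return d[len(src) + 1][len(tar) + 1]


print(damerau_levenshtein_sketch('ATCG', 'TAGC'))  # 2
print(damerau_levenshtein_sketch('CA', 'ABC'))     # 2
```

The pair 'CA' and 'ABC' separates the two variants: the unrestricted distance is 2 (transpose, then insert between the swapped characters), whereas Optimal String Alignment gives 3 because a substring may be edited only once.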
-
abydos.distance.damerau_levenshtein(src, tar, cost=(1, 1, 1, 1))
Return the Damerau-Levenshtein distance between two strings.
This is a wrapper of DamerauLevenshtein.dist_abs().
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
cost (tuple) -- A 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and transpositions, respectively (by default: (1, 1, 1, 1))
- Returns
The Damerau-Levenshtein distance between src & tar
- Return type
int (may return a float if cost has float values)
Examples
>>> damerau_levenshtein('cat', 'hat')
1
>>> damerau_levenshtein('Niall', 'Neil')
3
>>> damerau_levenshtein('aluminum', 'Catalan')
7
>>> damerau_levenshtein('ATCG', 'TAGC')
2
New in version 0.1.0.
Deprecated since version 0.4.0: This will be removed in 0.6.0. Use the DamerauLevenshtein.dist_abs method instead.
-
abydos.distance.dist_damerau(src, tar, cost=(1, 1, 1, 1))
Return the normalized Damerau-Levenshtein distance between two strings.
This is a wrapper of DamerauLevenshtein.dist().
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
cost (tuple) -- A 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and transpositions, respectively (by default: (1, 1, 1, 1))
- Returns
The normalized Damerau-Levenshtein distance
- Return type
float
Examples
>>> round(dist_damerau('cat', 'hat'), 12)
0.333333333333
>>> round(dist_damerau('Niall', 'Neil'), 12)
0.6
>>> dist_damerau('aluminum', 'Catalan')
0.875
>>> dist_damerau('ATCG', 'TAGC')
0.5
New in version 0.1.0.
Deprecated since version 0.4.0: This will be removed in 0.6.0. Use the DamerauLevenshtein.dist method instead.
-
abydos.distance.sim_damerau(src, tar, cost=(1, 1, 1, 1))
Return the Damerau-Levenshtein similarity of two strings.
This is a wrapper of DamerauLevenshtein.sim().
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
cost (tuple) -- A 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and transpositions, respectively (by default: (1, 1, 1, 1))
- Returns
The normalized Damerau-Levenshtein similarity
- Return type
float
Examples
>>> round(sim_damerau('cat', 'hat'), 12)
0.666666666667
>>> round(sim_damerau('Niall', 'Neil'), 12)
0.4
>>> sim_damerau('aluminum', 'Catalan')
0.125
>>> sim_damerau('ATCG', 'TAGC')
0.5
New in version 0.1.0.
Deprecated since version 0.4.0: This will be removed in 0.6.0. Use the DamerauLevenshtein.sim method instead.
-
class abydos.distance.ShapiraStorerI(cost=(1, 1), prime=False, **kwargs)
Bases: abydos.distance._distance._Distance
Shapira & Storer I edit distance with block moves, greedy algorithm.
Shapira & Storer's greedy edit distance [SS07] is similar to Levenshtein edit distance, but with two important distinctions:
It considers blocks of characters, if they occur in both the source and target strings, so the edit distance between 'abcab' and 'abc' is only 1, since the substring 'ab' occurs in both and can be inserted as a block into 'abc'.
It allows three edit operations: insert, delete, and move (but not substitute). Thus the distance between 'abcde' and 'deabc' is only 1 because the block 'abc' can be moved in 1 move operation, rather than being deleted and inserted in 2 separate operations.
If prime is set to True at initialization, this employs the greedy' algorithm, which limits replacements of blocks in the two strings to matching occurrences of the LCS.
New in version 0.4.0.
Initialize ShapiraStorerI instance.
- Parameters
prime (bool) -- If True, employs the greedy' algorithm rather than greedy
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
-
dist(src, tar)
Return the normalized Shapira & Storer I distance.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
The normalized Shapira & Storer I distance between src & tar
- Return type
float
Examples
>>> cmp = ShapiraStorerI()
>>> round(cmp.dist('cat', 'hat'), 12)
0.333333333333
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.333333333333
>>> cmp.dist('aluminum', 'Catalan')
0.6
>>> cmp.dist('ATCG', 'TAGC')
0.25
New in version 0.4.0.
-
dist_abs(src, tar)
Return the Shapira & Storer I edit distance between two strings.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
The Shapira & Storer I edit distance between src & tar
- Return type
int
Examples
>>> cmp = ShapiraStorerI()
>>> cmp.dist_abs('cat', 'hat')
2
>>> cmp.dist_abs('Niall', 'Neil')
3
>>> cmp.dist_abs('aluminum', 'Catalan')
9
>>> cmp.dist_abs('ATCG', 'TAGC')
2
New in version 0.4.0.
-
class abydos.distance.Marking(**kwargs)
Bases: abydos.distance._distance._Distance
Ehrenfeucht & Haussler's marking distance.
This edit distance [EH88] is the number of marked characters in one word that must be masked in order for that word to consist entirely of substrings of another word.
It is normalized by the length of the first word.
New in version 0.4.0.
Initialize Marking instance.
- Parameters
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
-
dist(src, tar)
Return the normalized marking distance of two strings.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
marking distance
- Return type
float
Examples
>>> cmp = Marking()
>>> cmp.dist('cat', 'hat')
0.3333333333333333
>>> cmp.dist('Niall', 'Neil')
0.6
>>> cmp.dist('aluminum', 'Catalan')
0.625
>>> cmp.dist('ATCG', 'TAGC')
0.5
>>> cmp.dist('cbaabdcb', 'abcba')
0.25
New in version 0.4.0.
-
dist_abs(src, tar)
Return the marking distance of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
marking distance
- Return type
int
Examples
>>> cmp = Marking()
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
3
>>> cmp.dist_abs('aluminum', 'Catalan')
5
>>> cmp.dist_abs('ATCG', 'TAGC')
2
>>> cmp.dist_abs('cbaabdcb', 'abcba')
2
New in version 0.4.0.
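One way to read the marking distance is as a greedy left-to-right parse: take the longest run of the source that is a substring of the target, mark the next character, and repeat. The sketch below follows that reading, which is an assumption rather than the library's code, although it reproduces the example values above. Both function names are hypothetical.

```python
def marking_distance_sketch(src, tar):
    """Count the characters of src that must be marked (greedy parse)."""
    marks = 0
    pos = 0
    while pos < len(src):
        # Extend the longest run starting at pos that is a substring of tar.
        end = pos
        while end < len(src) and src[pos:end + 1] in tar:
            end += 1
        if end < len(src):
            marks += 1  # src[end] cannot extend the run, so it is marked
            end += 1
        pos = end
    return marks


def marking_dist_sketch(src, tar):
    """Normalize by the length of the first word, as stated above."""
    return marking_distance_sketch(src, tar) / len(src) if src else 0.0


print(marking_distance_sketch('cbaabdcb', 'abcba'))  # 2
print(marking_dist_sketch('cbaabdcb', 'abcba'))      # 0.25
```

The measure is not symmetric: marking_distance_sketch('Niall', 'Neil') is 3, while the reverse order gives 2.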
-
class abydos.distance.MarkingMetric(**kwargs)
Bases: abydos.distance._marking.Marking
Ehrenfeucht & Haussler's marking metric.
This metric [EH88] is the base 2 logarithm of the product of the marking distances between each term plus 1 computed in both orders. For strings x and y, this is:
\[dist_{MarkingMetric}(x, y) = \log_2((diff(x, y)+1)(diff(y, x)+1))\]
The function diff is Ehrenfeucht & Haussler's marking distance, Marking.
New in version 0.4.0.
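The formula can be evaluated directly once the two marking distances are known. The sketch below takes them as plain integers; the helper name marking_metric_sketch is hypothetical, and the input values in the usage lines are assumptions chosen to be consistent with the Marking and MarkingMetric examples in this section.

```python
import math


def marking_metric_sketch(diff_xy, diff_yx):
    """Return log2((diff(x, y) + 1) * (diff(y, x) + 1))."""
    return math.log2((diff_xy + 1) * (diff_yx + 1))


# 'cat' vs. 'hat': one marked character in each direction.
print(marking_metric_sketch(1, 1))  # 2.0
# 'Niall' vs. 'Neil': 3 marks in one direction, assumed 2 in the other.
print(marking_metric_sketch(3, 2))  # log2(12)
```

Adding 1 to each term keeps the logarithm defined when either marking distance is 0, and identical strings give log2(1) = 0.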
Initialize MarkingMetric instance.
- Parameters
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
-
dist(src, tar)
Return the normalized marking distance of two strings.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
marking distance
- Return type
float
Examples
>>> cmp = Marking()
>>> cmp.dist('cat', 'hat')
0.3333333333333333
>>> cmp.dist('Niall', 'Neil')
0.6
>>> cmp.dist('aluminum', 'Catalan')
0.625
>>> cmp.dist('ATCG', 'TAGC')
0.5
>>> cmp.dist('cbaabdcb', 'abcba')
0.25
New in version 0.4.0.
-
dist_abs(src, tar)
Return the marking metric of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
marking distance
- Return type
float
Examples
>>> cmp = MarkingMetric()
>>> cmp.dist_abs('cat', 'hat')
2.0
>>> cmp.dist_abs('Niall', 'Neil')
3.584962500721156
>>> cmp.dist_abs('aluminum', 'Catalan')
4.584962500721156
>>> cmp.dist_abs('ATCG', 'TAGC')
3.169925001442312
>>> cmp.dist_abs('cbaabdcb', 'abcba')
2.584962500721156
New in version 0.4.0.
-
class abydos.distance.YujianBo(cost=(1, 1, 1, 1), **kwargs)
Bases: abydos.distance._levenshtein.Levenshtein
Yujian-Bo normalized Levenshtein distance.
Yujian-Bo's normalization of Levenshtein distance [YB07], given Levenshtein distance \(GLD(X, Y)\) between two strings X and Y, is
\[dist_{N-GLD}(X, Y) = \frac{2 \cdot GLD(X, Y)}{|X| + |Y| + GLD(X, Y)}\]
New in version 0.4.0.
Initialize YujianBo instance.
- Parameters
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
-
dist(src, tar)
Return the Yujian-Bo normalized edit distance between strings.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
The Yujian-Bo normalized edit distance between src & tar
- Return type
float
Examples
>>> cmp = YujianBo()
>>> round(cmp.dist('cat', 'hat'), 12)
0.285714285714
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.5
>>> cmp.dist('aluminum', 'Catalan')
0.6363636363636364
>>> cmp.dist('ATCG', 'TAGC')
0.5454545454545454
New in version 0.4.0.
-
dist_abs
(src, tar)[source]¶ Return the Yujian-Bo normalized edit distance between two strings.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
The Yujian-Bo normalized edit distance between src & tar
- Return type
float
Examples
>>> cmp = YujianBo()
>>> cmp.dist_abs('cat', 'hat')
0.2857142857142857
>>> cmp.dist_abs('Niall', 'Neil')
0.5
>>> cmp.dist_abs('aluminum', 'Catalan')
0.6363636363636364
>>> cmp.dist_abs('ATCG', 'TAGC')
0.5454545454545454
New in version 0.4.0.
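The Yujian-Bo normalization can be reproduced independently of the library. The following is a minimal sketch, assuming unit costs for inserts, deletes, and substitutions and no transpositions. The function names levenshtein and yujian_bo are illustrative and are not the library's implementation.

```python
def levenshtein(src: str, tar: str) -> int:
    """Unit-cost Levenshtein distance via a two-row dynamic program."""
    prev = list(range(len(tar) + 1))
    for i, s_ch in enumerate(src, start=1):
        curr = [i]
        for j, t_ch in enumerate(tar, start=1):
            curr.append(min(
                prev[j] + 1,                   # delete s_ch
                curr[j - 1] + 1,               # insert t_ch
                prev[j - 1] + (s_ch != t_ch),  # substitute or match
            ))
        prev = curr
    return prev[-1]


def yujian_bo(src: str, tar: str) -> float:
    """Apply 2*GLD / (|X| + |Y| + GLD); two empty strings give 0.0."""
    if not src and not tar:
        return 0.0
    gld = levenshtein(src, tar)
    return 2 * gld / (len(src) + len(tar) + gld)


print(round(yujian_bo('cat', 'hat'), 12))  # 0.285714285714
print(yujian_bo('Niall', 'Neil'))          # 0.5
```

With GLD('cat', 'hat') = 1, the formula gives 2/7, matching the first documented example.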
-
class
abydos.distance.
HigueraMico
(**kwargs)[source]¶ Bases:
abydos.distance._distance._Distance
The Higuera-Micó contextual normalized edit distance.
This is presented in [delHigueraMico08].
This measure is not normalized to a particular range. Indeed, for a string of infinite length and a string of length 0, the contextual normalized edit distance would be infinity. But so long as the relative difference in string lengths is not too great, the distance will generally remain below 1.0.
Notes
The "normalized" version of this distance, implemented in the dist method, is merely the minimum of the distance and 1.0.
New in version 0.4.0.
Initialize HigueraMico instance.
- Parameters
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
-
dist
(src, tar)[source]¶ Return the bounded Higuera-Micó distance between two strings.
This is the distance bounded to the range [0, 1].
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
The bounded Higuera-Micó distance between src & tar
- Return type
float
Examples
>>> cmp = HigueraMico()
>>> cmp.dist('cat', 'hat')
0.3333333333333333
>>> cmp.dist('Niall', 'Neil')
0.5333333333333333
>>> cmp.dist('aluminum', 'Catalan')
0.7916666666666667
>>> cmp.dist('ATCG', 'TAGC')
0.6000000000000001
New in version 0.4.0.
-
dist_abs
(src, tar)[source]¶ Return the Higuera-Micó distance between two strings.
This is a straightforward implementation of Higuera & Micó's pseudocode from [delHigueraMico08], ported to NumPy.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
The Higuera-Micó distance between src & tar
- Return type
float
Examples
>>> cmp = HigueraMico()
>>> cmp.dist_abs('cat', 'hat')
0.3333333333333333
>>> cmp.dist_abs('Niall', 'Neil')
0.5333333333333333
>>> cmp.dist_abs('aluminum', 'Catalan')
0.7916666666666667
>>> cmp.dist_abs('ATCG', 'TAGC')
0.6000000000000001
New in version 0.4.0.
-
class
abydos.distance.
Indel
(**kwargs)[source]¶ Bases:
abydos.distance._levenshtein.Levenshtein
Indel distance.
This is equivalent to Levenshtein distance, when only inserts and deletes are possible.
New in version 0.3.6.
Initialize Indel instance.
- Parameters
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
-
dist
(src, tar)[source]¶ Return the normalized indel distance between two strings.
This is equivalent to normalized Levenshtein distance, when only inserts and deletes are possible.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
Normalized indel distance
- Return type
float
Examples
>>> cmp = Indel()
>>> round(cmp.dist('cat', 'hat'), 12)
0.333333333333
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.333333333333
>>> round(cmp.dist('Colin', 'Cuilen'), 12)
0.454545454545
>>> cmp.dist('ATCG', 'TAGC')
0.5
New in version 0.3.6.
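When only inserts and deletes are allowed, the edit distance equals the combined length of the two strings minus twice the length of their longest common subsequence. The sketch below uses that identity rather than the library's Levenshtein machinery. The function names are illustrative. Dividing by the sum of the string lengths is an assumption, chosen because it reproduces the documented values.

```python
def lcs_length(src: str, tar: str) -> int:
    """Length of the longest common subsequence (two-row DP)."""
    prev = [0] * (len(tar) + 1)
    for s_ch in src:
        curr = [0]
        for j, t_ch in enumerate(tar, start=1):
            if s_ch == t_ch:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_distance(src: str, tar: str) -> int:
    """Delete every character outside the LCS, then insert the rest."""
    return len(src) + len(tar) - 2 * lcs_length(src, tar)


def indel_normalized(src: str, tar: str) -> float:
    """Assumed normalization: divide by the sum of the string lengths."""
    total = len(src) + len(tar)
    return indel_distance(src, tar) / total if total else 0.0


print(indel_distance('Colin', 'Cuilen'))               # 5
print(round(indel_normalized('Colin', 'Cuilen'), 12))  # 0.454545454545
```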
-
abydos.distance.
indel
(src, tar)[source]¶ Return the indel distance between two strings.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
Indel distance
- Return type
int
Examples
>>> indel('cat', 'hat')
2
>>> indel('Niall', 'Neil')
3
>>> indel('Colin', 'Cuilen')
5
>>> indel('ATCG', 'TAGC')
4
New in version 0.3.0.
Deprecated since version 0.4.0: This will be removed in 0.6.0. Use the Indel.dist_abs method instead.
-
abydos.distance.
dist_indel
(src, tar)[source]¶ Return the normalized indel distance between two strings.
This is equivalent to normalized Levenshtein distance, when only inserts and deletes are possible.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
Normalized indel distance
- Return type
float
Examples
>>> round(dist_indel('cat', 'hat'), 12)
0.333333333333
>>> round(dist_indel('Niall', 'Neil'), 12)
0.333333333333
>>> round(dist_indel('Colin', 'Cuilen'), 12)
0.454545454545
>>> dist_indel('ATCG', 'TAGC')
0.5
New in version 0.3.0.
Deprecated since version 0.4.0: This will be removed in 0.6.0. Use the Indel.dist method instead.
-
abydos.distance.
sim_indel
(src, tar)[source]¶ Return the normalized indel similarity of two strings.
This is equivalent to normalized Levenshtein similarity, when only inserts and deletes are possible.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
Normalized indel similarity
- Return type
float
Examples
>>> round(sim_indel('cat', 'hat'), 12)
0.666666666667
>>> round(sim_indel('Niall', 'Neil'), 12)
0.666666666667
>>> round(sim_indel('Colin', 'Cuilen'), 12)
0.545454545455
>>> sim_indel('ATCG', 'TAGC')
0.5
New in version 0.3.0.
Deprecated since version 0.4.0: This will be removed in 0.6.0. Use the Indel.sim method instead.
-
class
abydos.distance.
SAPS
(cost=(1, -1, -4, 6, -2, -1, -3), normalizer=<built-in function max>, tokenizer=None, **kwargs)[source]¶ Bases:
abydos.distance._distance._Distance
Syllable Alignment Pattern Searching similarity.
This is the alignment and similarity calculation described on p. 917-918 of [RY05].
New in version 0.4.0.
Initialize SAPS instance.
- Parameters
cost (tuple) --
A 7-tuple representing the cost of the seven possible match types:
syllable-internal match
syllable-internal mismatch
syllable-initial match or mismatch with syllable-internal
syllable-initial match
syllable-initial mismatch
syllable-internal gap
syllable-initial gap
(by default: (1, -1, -4, 6, -2, -1, -3))
normalizer (function) -- A function that takes a list and computes a normalization term by which the edit distance is divided (max by default). Another good option is the sum function.
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
-
sim
(src, tar)[source]¶ Return the normalized SAPS similarity between two strings.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
The normalized SAPS similarity between src & tar
- Return type
float
Examples
>>> cmp = SAPS()
>>> round(cmp.sim('cat', 'hat'), 12)
0.0
>>> round(cmp.sim('Niall', 'Neil'), 12)
0.2
>>> cmp.sim('aluminum', 'Catalan')
0.0
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
-
sim_score
(src, tar)[source]¶ Return the SAPS similarity between two strings.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
The SAPS similarity between src & tar
- Return type
int
Examples
>>> cmp = SAPS()
>>> cmp.sim_score('cat', 'hat')
0
>>> cmp.sim_score('Niall', 'Neil')
3
>>> cmp.sim_score('aluminum', 'Catalan')
-11
>>> cmp.sim_score('ATCG', 'TAGC')
-1
>>> cmp.sim_score('Stevenson', 'Stinson')
16
New in version 0.4.0.
-
class
abydos.distance.
MetaLevenshtein
(tokenizer=None, corpus=None, metric=None, normalizer=<built-in function max>, **kwargs)[source]¶ Bases:
abydos.distance._distance._Distance
Meta-Levenshtein distance.
Meta-Levenshtein distance [MYCappe08] combines Soft-TFIDF with Levenshtein alignment.
New in version 0.4.0.
Initialize MetaLevenshtein instance.
- Parameters
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
corpus (UnigramCorpus) -- A unigram corpus UnigramCorpus. If None, a corpus will be created from the two words when a similarity function is called.
metric (_Distance) -- A string distance measure class for making soft matches, by default Jaro-Winkler.
normalizer (function) -- A function that takes a list and computes a normalization term by which the edit distance is divided (max by default). Another good option is the sum function.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
New in version 0.4.0.
-
dist
(src, tar)[source]¶ Return the normalized Meta-Levenshtein distance between two strings.
The Levenshtein distance is normalized by dividing the Levenshtein distance (calculated by any of the three supported methods) by the greater of the number of characters in src times the cost of a delete and the number of characters in tar times the cost of an insert. For the case in which all operations have \(cost = 1\), this is equivalent to the greater of the length of the two strings src & tar.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
The normalized Meta-Levenshtein distance between src & tar
- Return type
float
Examples
>>> cmp = MetaLevenshtein()
>>> round(cmp.dist('cat', 'hat'), 12)
0.205186754296
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.507780131444
>>> cmp.dist('aluminum', 'Catalan')
0.8675933954313434
>>> cmp.dist('ATCG', 'TAGC')
0.8077801314441113
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
-
dist_abs
(src, tar)[source]¶ Return the Meta-Levenshtein distance of two strings.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
Meta-Levenshtein distance
- Return type
float
Examples
>>> cmp = MetaLevenshtein()
>>> cmp.dist_abs('cat', 'hat')
0.6155602628882225
>>> cmp.dist_abs('Niall', 'Neil')
2.538900657220556
>>> cmp.dist_abs('aluminum', 'Catalan')
6.940747163450747
>>> cmp.dist_abs('ATCG', 'TAGC')
3.2311205257764453
New in version 0.4.0.
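The role of the normalizer parameter can be illustrated with the example values documented for this class. The sketch below assumes that the list passed to the normalizer holds the two string lengths. That is consistent with the documented dist and dist_abs examples under the default max normalizer, but the library may build this list differently. The helper normalize is hypothetical.

```python
def normalize(distance: float, lengths: list, normalizer=max) -> float:
    """Divide a raw distance by a term computed from the lengths list."""
    term = normalizer(lengths)
    return distance / term if term else 0.0


# dist_abs values quoted from the documented examples for this class.
cat_hat = 0.6155602628882225
niall_neil = 2.538900657220556

print(round(normalize(cat_hat, [len('cat'), len('hat')]), 12))
# 0.205186754296
print(round(normalize(niall_neil, [len('Niall'), len('Neil')]), 12))
# 0.507780131444

# Passing sum instead divides by the combined length, giving smaller values.
print(normalize(cat_hat, [len('cat'), len('hat')], normalizer=sum))
```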
-
class
abydos.distance.
Covington
(weights=(0, 5, 10, 30, 60, 100, 40, 50), **kwargs)[source]¶ Bases:
abydos.distance._distance._Distance
Covington distance.
Covington distance [Cov96]
New in version 0.4.0.
Initialize Covington instance.
- Parameters
weights (tuple) --
An 8-tuple of costs for each kind of match or mismatch described in Covington's paper:
exact consonant or glide match
exact vowel match
vowel-vowel length mismatch or i and y or u and w
vowel-vowel mismatch
consonant-consonant mismatch
consonant-vowel mismatch
skip preceded by a skip
skip not preceded by a skip
The weights used in Covington's first approximation can be used by supplying the tuple (0.0, 0.0, 0.5, 0.5, 0.5, 1.0, 0.5, 0.5)
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
-
alignment
(src, tar)[source]¶ Return the top Covington alignment of two strings.
This returns only the top alignment in a standard (score, source alignment, target alignment) tuple format.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
Covington score & alignment
- Return type
tuple(float, str, str)
Examples
>>> cmp = Covington()
>>> cmp.alignment('hart', 'kordis')
(240, 'hart--', 'kordis')
>>> cmp.alignment('niy', 'genu')
(170, '--niy', 'genu-')
New in version 0.4.1.
-
alignments
(src, tar, top_n=None)[source]¶ Return the Covington alignments of two strings.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
top_n (int) -- The number of alignments to return. If None, all alignments will be returned. If 0, all alignments with the top score will be returned.
- Returns
Covington alignments
- Return type
list
Examples
>>> cmp = Covington()
>>> cmp.alignments('hart', 'kordis', top_n=1)[0]
Alignment(src='hart--', tar='kordis', score=240)
>>> cmp.alignments('niy', 'genu', top_n=1)[0]
Alignment(src='--niy', tar='genu-', score=170)
New in version 0.4.0.
-
dist
(src, tar)[source]¶ Return the normalized Covington distance of two strings.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
Normalized Covington distance
- Return type
float
Examples
>>> cmp = Covington()
>>> cmp.dist('cat', 'hat')
0.19117647058823528
>>> cmp.dist('Niall', 'Neil')
0.25555555555555554
>>> cmp.dist('aluminum', 'Catalan')
0.43333333333333335
>>> cmp.dist('ATCG', 'TAGC')
0.45454545454545453
New in version 0.4.0.
-
dist_abs
(src, tar)[source]¶ Return the Covington distance of two strings.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
Covington distance
- Return type
float
Examples
>>> cmp = Covington()
>>> cmp.dist_abs('cat', 'hat')
65
>>> cmp.dist_abs('Niall', 'Neil')
115
>>> cmp.dist_abs('aluminum', 'Catalan')
325
>>> cmp.dist_abs('ATCG', 'TAGC')
200
New in version 0.4.0.
-
class
abydos.distance.
ALINE
(epsilon=0, c_skip=-10, c_sub=35, c_exp=45, c_vwl=10, mode='local', phones='aline', normalizer=<built-in function max>, **kwargs)[source]¶ Bases:
abydos.distance._distance._Distance
ALINE alignment, similarity, and distance.
ALINE alignment was developed by [Kon00][Kon02][DHC+08], and establishes an alignment algorithm based on multivalued phonetic features and feature salience weights. Along with the alignment itself, the algorithm produces a term similarity score.
[DHC+08] develops ALINE's similarity score into a similarity measure & distance measure:
\[sim_{ALINE} = \frac{2 \cdot score_{ALINE}(src, tar)} {score_{ALINE}(src, src) + score_{ALINE}(tar, tar)}\]
However, the average of the two self-similarity scores is not guaranteed to be greater than or equal to the similarity score between the two strings, so this formula can exceed 1. To guarantee that the similarity measure is bounded to [0, 1], this formula is not used here by default. Instead, Kondrak's similarity measure is employed:
\[sim_{ALINE} = \frac{score_{ALINE}(src, tar)} {\max(score_{ALINE}(src, src), score_{ALINE}(tar, tar))}\]
New in version 0.4.0.
Initialize ALINE instance.
- Parameters
epsilon (float) -- The portion (out of 1.0) of the maximum ALINE score, above which alignments are returned. If set to 0, only the alignments matching the maximum alignment score are returned. If set to 1, all alignments scoring 0 or higher are returned.
c_skip (int) -- The cost of an insertion or deletion
c_sub (int) -- The cost of a substitution
c_exp (int) -- The cost of an expansion or contraction
c_vwl (int) -- The additional cost of a vowel substitution, expansion, or contraction
mode (str) -- Alignment mode, which can be local (default), global, half-local, or semi-global
phones (str) -- Phonetic symbol set, which can be:
aline selects Kondrak's original symbols set
ipa selects IPA symbols
normalizer (function) -- A function that takes a list and computes a normalization term by which the edit distance is divided (max by default). For the normalization proposed by Downey, et al. (2008), set this to:
lambda x: sum(x)/len(x)
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
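The difference between the two normalizations described for this class comes down to the denominator. The sketch below works on three precomputed ALINE scores. The score values are hypothetical and chosen only to make the arithmetic visible, and the function names are illustrative rather than part of the library.

```python
def sim_downey(score_xy: float, score_xx: float, score_yy: float) -> float:
    """Downey et al.: divide by the mean of the two self-similarity scores."""
    return 2 * score_xy / (score_xx + score_yy)


def sim_kondrak(score_xy: float, score_xx: float, score_yy: float) -> float:
    """Kondrak: divide by the larger of the two self-similarity scores."""
    return score_xy / max(score_xx, score_yy)


# Hypothetical scores: the cross score exceeds the mean self-score (55.0).
s_xy, s_xx, s_yy = 60.0, 10.0, 100.0

print(sim_downey(s_xy, s_xx, s_yy))   # greater than 1.0
print(sim_kondrak(s_xy, s_xx, s_yy))  # 0.6

# The Downey form equals dividing by the normalizer lambda x: sum(x)/len(x).
mean = lambda x: sum(x) / len(x)
print(s_xy / mean([s_xx, s_yy]))
```

Dividing by the maximum keeps the result at or below 1 whenever the cross score does not exceed the larger self-score, which the mean does not guarantee.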
-
alignment
(src, tar)[source]¶ Return the top ALINE alignment of two strings.
The top ALINE alignment is the first alignment with the best score. The purpose of this function is to have a single tuple as a return value.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
ALINE alignment and its score
- Return type
tuple(float, str, str)
Examples
>>> cmp = ALINE()
>>> cmp.alignment('cat', 'hat')
(50.0, 'c ‖ a t ‖', 'h ‖ a t ‖')
>>> cmp.alignment('niall', 'neil')
(90.0, '‖ n i a ll ‖', '‖ n e i l ‖')
>>> cmp.alignment('aluminum', 'catalan')
(81.5, '‖ a l u m ‖ inum', 'cat ‖ a l a n ‖')
>>> cmp.alignment('atcg', 'tagc')
(65.0, '‖ a t c ‖ g', 't ‖ a g c ‖')
New in version 0.4.1.
-
alignments
(src, tar, score_only=False)[source]¶ Return the ALINE alignments of two strings.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
score_only (bool) -- Return the score only, not the alignments
- Returns
ALINE alignments and their scores or the top score
- Return type
list(tuple(float, str, str)) or float
Examples
>>> cmp = ALINE()
>>> cmp.alignments('cat', 'hat')
[(50.0, 'c ‖ a t ‖', 'h ‖ a t ‖')]
>>> cmp.alignments('niall', 'neil')
[(90.0, '‖ n i a ll ‖', '‖ n e i l ‖')]
>>> cmp.alignments('aluminum', 'catalan')
[(81.5, '‖ a l u m ‖ inum', 'cat ‖ a l a n ‖')]
>>> cmp.alignments('atcg', 'tagc')
[(65.0, '‖ a t c ‖ g', 't ‖ a g c ‖'), (65.0, 'a ‖ tc - g ‖', '‖ t a g ‖ c')]
New in version 0.4.0.
Changed in version 0.4.1: Renamed from .alignment to .alignments
-
c_features
= {'aspirated', 'lateral', 'manner', 'nasal', 'place', 'retroflex', 'syllabic', 'voice'}¶
-
feature_weights
= {'affricate': 0.9, 'alveolar': 0.85, 'approximant': 0.6, 'back': 0.0, 'bilabial': 1.0, 'central': 0.5, 'dental': 0.9, 'fricative': 0.8, 'front': 1.0, 'glottal': 0.1, 'high': 1.0, 'high vowel': 0.4, 'labiodental': 0.95, 'low': 0.0, 'low vowel': 0.0, 'mid': 0.5, 'mid vowel': 0.2, 'minus': 0.0, 'palatal': 0.7, 'palato-alveolar': 0.75, 'pharyngeal': 0.3, 'plus': 1.0, 'retroflex': 0.8, 'stop': 1.0, 'tap': 0.5, 'trill': 0.55, 'uvular': 0.5, 'velar': 0.6}¶
-
phones_ipa
= {'a': {'aspirated': 'minus', 'back': 'front', 'high': 'low', 'lateral': 'minus', 'long': 'minus', 'manner': 'low vowel', 'nasal': 'minus', 'retroflex': 'minus', 'round': 'minus', 'syllabic': 'plus', 'voice': 'plus'}, 'b': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'stop', 'nasal': 'minus', 'place': 'bilabial', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'c': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'stop', 'nasal': 'minus', 'place': 'palatal', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 'd': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'stop', 'nasal': 'minus', 'place': 'alveolar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'e': {'aspirated': 'minus', 'back': 'front', 'high': 'mid', 'lateral': 'minus', 'long': 'minus', 'manner': 'mid vowel', 'nasal': 'minus', 'retroflex': 'minus', 'round': 'minus', 'syllabic': 'plus', 'voice': 'plus'}, 'f': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'labiodental', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 'g': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'stop', 'nasal': 'minus', 'place': 'velar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'h': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'glottal', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 'i': {'aspirated': 'minus', 'back': 'front', 'high': 'high', 'lateral': 'minus', 'long': 'minus', 'manner': 'high vowel', 'nasal': 'minus', 'retroflex': 'minus', 'round': 'minus', 'syllabic': 'plus', 'voice': 'plus'}, 'j': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'approximant', 'nasal': 'minus', 'place': 'palatal', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'k': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'stop', 'nasal': 'minus', 'place': 'velar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 
'l': {'aspirated': 'minus', 'lateral': 'plus', 'manner': 'approximant', 'nasal': 'minus', 'place': 'alveolar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'm': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'stop', 'nasal': 'plus', 'place': 'bilabial', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'n': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'stop', 'nasal': 'plus', 'place': 'alveolar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'o': {'aspirated': 'minus', 'back': 'back', 'high': 'mid', 'lateral': 'minus', 'long': 'minus', 'manner': 'mid vowel', 'nasal': 'minus', 'retroflex': 'minus', 'round': 'plus', 'syllabic': 'plus', 'voice': 'plus'}, 'p': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'stop', 'nasal': 'minus', 'place': 'bilabial', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 'q': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'stop', 'nasal': 'minus', 'place': 'uvular', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 'r': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'trill', 'nasal': 'minus', 'place': 'alveolar', 'retroflex': 'plus', 'syllabic': 'minus', 'voice': 'plus'}, 's': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'alveolar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 't': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'stop', 'nasal': 'minus', 'place': 'alveolar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 'u': {'aspirated': 'minus', 'back': 'back', 'high': 'high', 'lateral': 'minus', 'long': 'minus', 'manner': 'high vowel', 'nasal': 'minus', 'retroflex': 'minus', 'round': 'plus', 'syllabic': 'plus', 'voice': 'plus'}, 'v': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'labiodental', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'w': {'aspirated': 'minus', 'double': 'bilabial', 
'lateral': 'minus', 'manner': 'approximant', 'nasal': 'minus', 'place': 'velar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'x': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'velar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 'y': {'aspirated': 'minus', 'back': 'front', 'high': 'high', 'lateral': 'minus', 'long': 'minus', 'manner': 'high vowel', 'nasal': 'minus', 'retroflex': 'minus', 'round': 'plus', 'syllabic': 'plus', 'voice': 'plus'}, 'z': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'alveolar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'æ': {'aspirated': 'minus', 'back': 'front', 'high': 'low', 'lateral': 'minus', 'long': 'minus', 'manner': 'low vowel', 'nasal': 'minus', 'retroflex': 'minus', 'round': 'minus', 'syllabic': 'plus', 'voice': 'plus'}, 'ç': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'palatal', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 'ð': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'dental', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'ø': {'aspirated': 'minus', 'back': 'front', 'high': 'mid', 'lateral': 'minus', 'long': 'minus', 'manner': 'mid vowel', 'nasal': 'minus', 'retroflex': 'minus', 'round': 'plus', 'syllabic': 'plus', 'voice': 'plus'}, 'ħ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'pharyngeal', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 'ŋ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'stop', 'nasal': 'plus', 'place': 'velar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'œ': {'aspirated': 'minus', 'back': 'front', 'high': 'mid', 'lateral': 'minus', 'long': 'minus', 'manner': 'mid vowel', 'nasal': 'minus', 'retroflex': 'minus', 'round': 'plus', 'syllabic': 
'plus', 'voice': 'plus'}, 'ɒ': {'aspirated': 'minus', 'back': 'back', 'high': 'low', 'lateral': 'minus', 'long': 'minus', 'manner': 'low vowel', 'nasal': 'minus', 'retroflex': 'minus', 'round': 'minus', 'syllabic': 'plus', 'voice': 'plus'}, 'ɔ': {'aspirated': 'minus', 'back': 'back', 'high': 'mid', 'lateral': 'minus', 'long': 'minus', 'manner': 'mid vowel', 'nasal': 'minus', 'retroflex': 'minus', 'round': 'plus', 'syllabic': 'plus', 'voice': 'plus'}, 'ɖ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'stop', 'nasal': 'minus', 'place': 'retroflex', 'retroflex': 'plus', 'syllabic': 'minus', 'voice': 'plus'}, 'ə': {'aspirated': 'minus', 'back': 'central', 'high': 'mid', 'lateral': 'minus', 'long': 'minus', 'manner': 'mid vowel', 'nasal': 'minus', 'retroflex': 'minus', 'round': 'minus', 'syllabic': 'plus', 'voice': 'plus'}, 'ɛ': {'aspirated': 'minus', 'back': 'front', 'high': 'mid', 'lateral': 'minus', 'long': 'minus', 'manner': 'mid vowel', 'nasal': 'minus', 'retroflex': 'minus', 'round': 'minus', 'syllabic': 'plus', 'voice': 'plus'}, 'ɟ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'stop', 'nasal': 'minus', 'place': 'palatal', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'ɢ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'stop', 'nasal': 'minus', 'place': 'uvular', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'ɣ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'velar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'ɦ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'glottal', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'ɨ': {'aspirated': 'minus', 'back': 'central', 'high': 'high', 'lateral': 'minus', 'long': 'minus', 'manner': 'high vowel', 'nasal': 'minus', 'retroflex': 'minus', 'round': 'minus', 'syllabic': 'plus', 'voice': 'plus'}, 'ɬ': {'aspirated': 'minus', 'lateral': 'plus', 'manner': 
'fricative', 'nasal': 'minus', 'place': 'alveolar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 'ɮ': {'aspirated': 'minus', 'lateral': 'plus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'alveolar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'ɰ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'approximant', 'nasal': 'minus', 'place': 'velar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'ɱ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'stop', 'nasal': 'plus', 'place': 'labiodental', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'ɲ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'stop', 'nasal': 'plus', 'place': 'palatal', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'ɳ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'stop', 'nasal': 'plus', 'place': 'retroflex', 'retroflex': 'plus', 'syllabic': 'minus', 'voice': 'plus'}, 'ɴ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'stop', 'nasal': 'plus', 'place': 'uvular', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'ɸ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'bilabial', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 'ɹ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'approximant', 'nasal': 'minus', 'place': 'alveolar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'ɻ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'approximant', 'nasal': 'minus', 'place': 'retroflex', 'retroflex': 'plus', 'syllabic': 'minus', 'voice': 'plus'}, 'ɽ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'tap', 'nasal': 'minus', 'place': 'retroflex', 'retroflex': 'plus', 'syllabic': 'minus', 'voice': 'plus'}, 'ɾ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'tap', 'nasal': 'minus', 'place': 'alveolar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'ʀ': {'aspirated': 'minus', 
'lateral': 'minus', 'manner': 'trill', 'nasal': 'minus', 'place': 'uvular', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'ʁ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'uvular', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'ʂ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'retroflex', 'retroflex': 'plus', 'syllabic': 'minus', 'voice': 'minus'}, 'ʃ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'palato-alveolar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 'ʈ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'stop', 'nasal': 'minus', 'place': 'retroflex', 'retroflex': 'plus', 'syllabic': 'minus', 'voice': 'minus'}, 'ʉ': {'aspirated': 'minus', 'back': 'central', 'high': 'high', 'lateral': 'minus', 'long': 'minus', 'manner': 'high vowel', 'nasal': 'minus', 'retroflex': 'minus', 'round': 'plus', 'syllabic': 'plus', 'voice': 'plus'}, 'ʋ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'approximant', 'nasal': 'minus', 'place': 'labiodental', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'ʐ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'retroflex', 'retroflex': 'plus', 'syllabic': 'minus', 'voice': 'plus'}, 'ʒ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'palato-alveolar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'ʔ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'stop', 'nasal': 'minus', 'place': 'glottal', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 'ʕ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'pharyngeal', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'ʙ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'trill', 'nasal': 'minus', 
'place': 'bilabial', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'ʝ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'palatal', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'ʰ': {'aspirated': 'plus', 'supplemental': True}, 'ː': {'long': 'plus', 'supplemental': True}, 'β': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'bilabial', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'θ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'dental', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 'χ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'uvular', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}}¶
-
phones_kondrak
= {'A': {'aspirated': 'plus', 'supplemental': True}, 'B': {'back': 'back', 'supplemental': True}, 'C': {'back': 'central', 'supplemental': True}, 'D': {'place': 'dental', 'supplemental': True}, 'F': {'back': 'front', 'supplemental': True}, 'H': {'long': 'plus', 'supplemental': True}, 'N': {'nasal': 'plus', 'supplemental': True}, 'P': {'place': 'palatal', 'supplemental': True}, 'R': {'round': 'plus', 'supplemental': True}, 'S': {'manner': 'fricative', 'supplemental': True}, 'V': {'place': 'palato-alveolar', 'supplemental': True}, 'a': {'back': 'central', 'high': 'low', 'lateral': 'minus', 'manner': 'low vowel', 'nasal': 'minus', 'place': 'velar', 'retroflex': 'minus', 'round': 'minus', 'syllabic': 'plus', 'voice': 'plus'}, 'b': {'lateral': 'minus', 'manner': 'stop', 'nasal': 'minus', 'place': 'bilabial', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'c': {'lateral': 'minus', 'manner': 'affricate', 'nasal': 'minus', 'place': 'alveolar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 'd': {'lateral': 'minus', 'manner': 'stop', 'nasal': 'minus', 'place': 'alveolar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'e': {'back': 'front', 'high': 'mid', 'lateral': 'minus', 'manner': 'mid vowel', 'nasal': 'minus', 'place': 'palatal', 'retroflex': 'minus', 'round': 'minus', 'syllabic': 'plus', 'voice': 'plus'}, 'f': {'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'labiodental', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 'g': {'lateral': 'minus', 'manner': 'stop', 'nasal': 'minus', 'place': 'velar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'h': {'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'glottal', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 'i': {'back': 'front', 'high': 'high', 'lateral': 'minus', 'manner': 'high vowel', 'nasal': 'minus', 'place': 'palatal', 'retroflex': 'minus', 'round': 'plus', 'syllabic': 'plus', 'voice': 
'plus'}, 'j': {'lateral': 'minus', 'manner': 'affricate', 'nasal': 'minus', 'place': 'alveolar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'k': {'lateral': 'minus', 'manner': 'stop', 'nasal': 'minus', 'place': 'velar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 'l': {'lateral': 'plus', 'manner': 'approximant', 'nasal': 'minus', 'place': 'alveolar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'm': {'lateral': 'minus', 'manner': 'stop', 'nasal': 'plus', 'place': 'bilabial', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'n': {'lateral': 'minus', 'manner': 'stop', 'nasal': 'plus', 'place': 'alveolar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'o': {'back': 'back', 'high': 'mid', 'lateral': 'minus', 'manner': 'mid vowel', 'nasal': 'minus', 'place': 'velar', 'retroflex': 'minus', 'round': 'plus', 'syllabic': 'plus', 'voice': 'plus'}, 'p': {'lateral': 'minus', 'manner': 'stop', 'nasal': 'minus', 'place': 'bilabial', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 'q': {'lateral': 'minus', 'manner': 'stop', 'nasal': 'minus', 'place': 'glottal', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 'r': {'lateral': 'minus', 'manner': 'approximant', 'nasal': 'minus', 'place': 'retroflex', 'retroflex': 'plus', 'syllabic': 'minus', 'voice': 'plus'}, 's': {'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'alveolar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 't': {'lateral': 'minus', 'manner': 'stop', 'nasal': 'minus', 'place': 'alveolar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 'u': {'back': 'back', 'high': 'high', 'lateral': 'minus', 'manner': 'high vowel', 'nasal': 'minus', 'place': 'velar', 'retroflex': 'minus', 'round': 'plus', 'syllabic': 'plus', 'voice': 'plus'}, 'v': {'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'labiodental', 'retroflex': 'minus', 'syllabic': 'plus', 
'voice': 'plus'}, 'w': {'back': 'back', 'double': 'bilabial', 'high': 'high', 'lateral': 'minus', 'manner': 'high vowel', 'nasal': 'minus', 'place': 'velar', 'retroflex': 'minus', 'round': 'plus', 'syllabic': 'plus', 'voice': 'plus'}, 'x': {'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'velar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 'y': {'back': 'front', 'high': 'high', 'lateral': 'minus', 'manner': 'high vowel', 'nasal': 'minus', 'place': 'velar', 'retroflex': 'minus', 'round': 'minus', 'syllabic': 'plus', 'voice': 'plus'}, 'z': {'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'alveolar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}}¶
-
salience
= {'aspirated': 5, 'back': 5, 'high': 5, 'lateral': 10, 'long': 1, 'manner': 50, 'nasal': 10, 'place': 40, 'retroflex': 10, 'round': 5, 'syllabic': 5, 'voice': 10}¶
-
sim
(src, tar)[source]¶ Return the normalized ALINE similarity of two strings.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
Normalized ALINE similarity
- Return type
float
Examples
>>> cmp = ALINE()
>>> cmp.dist('cat', 'hat')
0.4117647058823529
>>> cmp.dist('niall', 'neil')
0.33333333333333337
>>> cmp.dist('aluminum', 'catalan')
0.5925
>>> cmp.dist('atcg', 'tagc')
0.45833333333333337
New in version 0.4.0.
-
sim_score
(src, tar)[source]¶ Return the ALINE alignment score of two strings.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
ALINE alignment score
- Return type
float
Examples
>>> cmp = ALINE()
>>> cmp.sim_score('cat', 'hat')
50.0
>>> cmp.sim_score('niall', 'neil')
90.0
>>> cmp.sim_score('aluminum', 'catalan')
81.5
>>> cmp.sim_score('atcg', 'tagc')
65.0
New in version 0.4.0.
-
v_features
= {'back', 'high', 'long', 'nasal', 'retroflex', 'round', 'syllabic'}¶
-
class
abydos.distance.
FlexMetric
(normalizer=<built-in function max>, indel_costs=None, subst_costs=None, **kwargs)[source]¶ Bases:
abydos.distance._distance._Distance
FlexMetric distance.
FlexMetric distance [Kem05]
New in version 0.4.0.
Initialize FlexMetric instance.
- Parameters
normalizer (function) -- A function that takes a list and computes a normalization term by which the edit distance is divided (max by default). Another good option is the sum function.
indel_costs (list of tuples) -- A list of insertion and deletion costs. Each list element should be a tuple consisting of an iterable (sets are best) and a float value. The iterable consists of those letters whose insertion or deletion has a cost equal to the float value.
subst_costs (list of tuples) -- A list of substitution costs. Each list element should be a tuple consisting of an iterable (sets are best) and a float value. The iterable consists of the letters in each letter class, which may be substituted for each other at cost equal to the float value.
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
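The shape of indel_costs and subst_costs can be illustrated with a small stand-alone sketch. The cost tables, the helper names, the default cost of 1.0 for unlisted letters, and the first-match-wins lookup order are all assumptions made for illustration; they are not taken from the FlexMetric implementation.

```python
# Hypothetical cost tables in the documented shape: a list of
# (iterable of letters, float cost) tuples.
indel_costs = [({'h', 'w'}, 0.5)]
subst_costs = [({'c', 'k', 'q'}, 0.2), ({'s', 'z'}, 0.1)]


def indel_cost(ch, table, default=1.0):
    """Return the insertion/deletion cost of ch (assumed default: 1.0)."""
    for letters, cost in table:
        if ch in letters:
            return cost
    return default


def subst_cost(a, b, table, default=1.0):
    """Return the cost of substituting a for b.

    Letters in the same class substitute at that class's cost;
    identical letters cost nothing (an assumption of this sketch).
    """
    if a == b:
        return 0.0
    for letters, cost in table:
        if a in letters and b in letters:
            return cost
    return default


print(indel_cost('h', indel_costs))       # cost from the table
print(subst_cost('c', 'k', subst_costs))  # same class: class cost
print(subst_cost('c', 't', subst_costs))  # no shared class: default
```

Sets keep the membership test cheap, which is presumably why the parameter description recommends them.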
-
dist
(src, tar)[source]¶ Return the normalized FlexMetric distance of two strings.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
Normalized FlexMetric distance
- Return type
float
Examples
>>> cmp = FlexMetric()
>>> cmp.dist('cat', 'hat')
0.26666666666666666
>>> cmp.dist('Niall', 'Neil')
0.3
>>> cmp.dist('aluminum', 'Catalan')
0.8375
>>> cmp.dist('ATCG', 'TAGC')
0.5499999999999999
New in version 0.4.0.
-
dist_abs
(src, tar)[source]¶ Return the FlexMetric distance of two strings.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
FlexMetric distance
- Return type
float
Examples
>>> cmp = FlexMetric()
>>> cmp.dist_abs('cat', 'hat')
0.8
>>> cmp.dist_abs('Niall', 'Neil')
1.5
>>> cmp.dist_abs('aluminum', 'Catalan')
6.7
>>> cmp.dist_abs('ATCG', 'TAGC')
2.1999999999999997
New in version 0.4.0.
-
class
abydos.distance.
BISIM
(qval=2, **kwargs)[source]¶ Bases:
abydos.distance._distance._Distance
BI-SIM similarity.
BI-SIM similarity [KD03] is an n-gram based, edit-distance derived similarity measure.
New in version 0.4.0.
Initialize BISIM instance.
- Parameters
qval (int) -- The number of characters to consider in each n-gram (q-gram). By default this is 2, hence BI-SIM. But TRI-SIM can be calculated by setting this to 3.
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
-
sim
(src, tar)[source]¶ Return the BI-SIM similarity of two strings.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
BI-SIM similarity
- Return type
float
Examples
>>> cmp = BISIM()
>>> cmp.sim('cat', 'hat')
0.5
>>> cmp.sim('Niall', 'Neil')
0.4
>>> cmp.sim('aluminum', 'Catalan')
0.3125
>>> cmp.sim('ATCG', 'TAGC')
0.375
New in version 0.4.0.
-
class
abydos.distance.
DiscountedLevenshtein
(mode='lev', normalizer=<built-in function max>, discount_from=1, discount_func='log', vowels='aeiou', **kwargs)[source]¶ Bases:
abydos.distance._levenshtein.Levenshtein
Discounted Levenshtein distance.
This is a variant of Levenshtein distance for which edits later in a string have discounted cost, on the theory that earlier edits are less likely than later ones.
New in version 0.4.1.
Initialize DiscountedLevenshtein instance.
- Parameters
mode (str) --
Specifies a mode for computing the discounted Levenshtein distance:
lev (default) computes the ordinary Levenshtein distance, in which edits may include inserts, deletes, and substitutions
osa computes the Optimal String Alignment distance, in which edits may include inserts, deletes, substitutions, and transpositions but substrings may only be edited once
normalizer (function) -- A function that takes a list and computes a normalization term by which the edit distance is divided (max by default). Another good option is the sum function.
discount_from (int or str) -- If an int is supplied, this is the position of the first character whose edit cost will be discounted. If the str coda is supplied, discounting will start with the first non-vowel after the first vowel (the first syllable coda).
discount_func (str or function) -- The two supported str arguments are log, for a logarithmic discount function, and exp, for an exponential discount function. See the notes below for information on how to supply your own discount function.
vowels (str) -- These are the letters to consider as vowels when discount_from is set to coda. It defaults to the English vowels 'aeiou', but it would be reasonable to localize this to other languages or to add orthographic semi-vowels like 'y', 'w', and even 'h'.
**kwargs -- Arbitrary keyword arguments
Notes
This class is highly experimental and will need additional tuning.
The discount function can be passed as a callable function. It should expect an integer as its only argument and return a float, ideally less than or equal to 1.0. The argument represents the degree of discounting to apply.
New in version 0.4.1.
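A custom discount function only has to satisfy the contract in the notes above: one integer argument, a float result that is ideally at most 1.0. The two functions below are hypothetical examples of that contract. The formulas behind the built-in log and exp options are not given in this section, so these are not claimed to match them.

```python
import math


def reciprocal_log_discount(degree):
    """Hypothetical discount: 1.0 at degree 0, shrinking slowly."""
    return 1.0 / (1.0 + math.log(1.0 + degree))


def geometric_discount(degree, ratio=0.8):
    """Hypothetical discount: multiplies by ratio for each degree."""
    return ratio ** degree


for degree in range(4):
    print(degree, reciprocal_log_discount(degree), geometric_discount(degree))
```

Either function could then be supplied as discount_func when constructing the instance. A function taking extra arguments, such as ratio here, would need those arguments fixed first (for example with functools.partial), since only one integer is passed in.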
-
dist
(src, tar)[source]¶ Return the normalized discounted Levenshtein distance between two strings.
The Levenshtein distance is normalized by dividing the Levenshtein distance (calculated by either of the two supported modes) by the greater of the number of characters in src times the cost of a delete and the number of characters in tar times the cost of an insert. For the case in which all operations have \(cost = 1\), this is equivalent to the greater of the length of the two strings src & tar.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
The normalized Levenshtein distance between src & tar
- Return type
float
Examples
>>> cmp = DiscountedLevenshtein()
>>> cmp.dist('cat', 'hat')
0.3513958291799864
>>> cmp.dist('Niall', 'Neil')
0.5909885886270658
>>> cmp.dist('aluminum', 'Catalan')
0.8348163322045603
>>> cmp.dist('ATCG', 'TAGC')
0.7217609721523955
New in version 0.4.1.
-
dist_abs
(src, tar)[source]¶ Return the Levenshtein distance between two strings.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
The Levenshtein distance between src & tar
- Return type
int or float
Examples
>>> cmp = DiscountedLevenshtein()
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
2.526064024369237
>>> cmp.dist_abs('aluminum', 'Catalan')
5.053867269967515
>>> cmp.dist_abs('ATCG', 'TAGC')
2.594032108779918

>>> cmp = DiscountedLevenshtein(mode='osa')
>>> cmp.dist_abs('ATCG', 'TAGC')
1.7482385137517997
>>> cmp.dist_abs('ACTG', 'TAGC')
3.342270622531718
New in version 0.4.1.
-
class
abydos.distance.
PhoneticEditDistance
(mode='lev', cost=(1, 1, 1, 0.33333), normalizer=<built-in function max>, weights=None, **kwargs)[source]¶ Bases:
abydos.distance._levenshtein.Levenshtein
Phonetic edit distance.
This is a variation on Levenshtein edit distance, intended for strings in IPA, that compares individual phones based on their featural similarity.
New in version 0.4.1.
Initialize PhoneticEditDistance instance.
- Parameters
mode (str) --
Specifies a mode for computing the edit distance:
lev (default) computes the ordinary Levenshtein distance, in which edits may include inserts, deletes, and substitutions
osa computes the Optimal String Alignment distance, in which edits may include inserts, deletes, substitutions, and transpositions but substrings may only be edited once
cost (tuple) -- A 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and transpositions, respectively (by default: (1, 1, 1, 0.33333)). Note that transpositions cost a relatively low 0.33333. If this were 1.0, no phones would ever be transposed under the normal weighting, since even quite dissimilar phones such as [a] and [p] still agree in nearly 63% of their features.
normalizer (function) -- A function that takes a list and computes a normalization term by which the edit distance is divided (max by default). Another good option is the sum function.
weights (None or list or tuple or dict) -- If None, all features are of equal significance and a simple normalized hamming distance of the features is calculated. If a list or tuple of numeric values is supplied, the values are inferred as the weights for each feature, in order of the features listed in abydos.phones._phones._FEATURE_MASK. If a dict is supplied, its key values should match keys in abydos.phones._phones._FEATURE_MASK to which each weight (value) should be assigned. Missing values in all cases are assigned a weight of 0 and will be omitted from the comparison.
**kwargs -- Arbitrary keyword arguments
New in version 0.4.1.
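The effect of the weights parameter can be sketched as a weighted, normalized Hamming distance over feature values. Everything concrete below is hypothetical: the four feature names, the dict-based phone representation, and the helper name phone_distance stand in for the real feature set in abydos.phones._phones._FEATURE_MASK, which this section does not list.

```python
# Hypothetical, ordered feature inventory (not the real _FEATURE_MASK).
FEATURES = ('syllabic', 'voice', 'nasal', 'labial')


def phone_distance(a, b, weights=None):
    """Weighted normalized Hamming distance between two feature dicts."""
    if weights is None:
        # All features count equally.
        w = {f: 1.0 for f in FEATURES}
    elif isinstance(weights, dict):
        # Features missing from the dict get weight 0.
        w = {f: float(weights.get(f, 0.0)) for f in FEATURES}
    else:
        # A list or tuple is matched to features in order;
        # features beyond its end get weight 0.
        w = {f: 0.0 for f in FEATURES}
        for f, value in zip(FEATURES, weights):
            w[f] = float(value)
    total = sum(w.values())
    if total == 0:
        return 0.0
    return sum(w[f] for f in FEATURES if a[f] != b[f]) / total


p = {'syllabic': '-', 'voice': '-', 'nasal': '-', 'labial': '+'}
b = {'syllabic': '-', 'voice': '+', 'nasal': '-', 'labial': '+'}
print(phone_distance(p, b))                            # one of four features differs
print(phone_distance(p, b, {'voice': 2, 'nasal': 2}))  # voice carries half the weight
```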
-
dist
(src, tar)[source]¶ Return the normalized phonetic edit distance between two strings.
The edit distance is normalized by dividing the edit distance (calculated by either of the two supported methods) by the greater of the number of characters in src times the cost of a delete and the number of characters in tar times the cost of an insert. For the case in which all operations have \(cost = 1\), this is equivalent to the greater of the length of the two strings src & tar.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
The normalized phonetic edit distance between src & tar
- Return type
float
Examples
>>> cmp = PhoneticEditDistance()
>>> round(cmp.dist('cat', 'hat'), 12)
0.059139784946
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.232258064516
>>> cmp.dist('aluminum', 'Catalan')
0.3084677419354839
>>> cmp.dist('ATCG', 'TAGC')
0.2983870967741935
New in version 0.4.1.
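The normalization described above can be checked against the documented numbers with a short stand-alone sketch. The helper names are illustrative, and the absolute distances are simply copied from the dist_abs examples for this class rather than computed.

```python
def normalization_term(src, tar, cost=(1, 1, 1, 0.33333), normalizer=max):
    """Apply the normalizer to the two candidate terms.

    cost is (insert, delete, substitute, transpose), as documented.
    """
    ins_cost, del_cost = cost[0], cost[1]
    return normalizer([len(src) * del_cost, len(tar) * ins_cost])


def normalize(dist_abs, src, tar, **kwargs):
    """Divide an absolute distance by the normalization term."""
    term = normalization_term(src, tar, **kwargs)
    return dist_abs / term if term else 0.0


# Absolute distances taken from the documented dist_abs examples.
print(round(normalize(0.17741935483870974, 'cat', 'hat'), 12))
print(round(normalize(1.161290322580645, 'Niall', 'Neil'), 12))
```

Passing normalizer=sum divides by the combined length instead, which yields smaller normalized values for the same absolute distance.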
-
dist_abs
(src, tar)[source]¶ Return the phonetic edit distance between two strings.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
The phonetic edit distance between src & tar
- Return type
int (may return a float if cost has float values)
Examples
>>> cmp = PhoneticEditDistance()
>>> cmp.dist_abs('cat', 'hat')
0.17741935483870974
>>> cmp.dist_abs('Niall', 'Neil')
1.161290322580645
>>> cmp.dist_abs('aluminum', 'Catalan')
2.467741935483871
>>> cmp.dist_abs('ATCG', 'TAGC')
1.193548387096774

>>> cmp = PhoneticEditDistance(mode='osa')
>>> cmp.dist_abs('ATCG', 'TAGC')
0.46236225806451603
>>> cmp.dist_abs('ACTG', 'TAGC')
1.2580645161290323
New in version 0.4.1.
-
class
abydos.distance.
Hamming
(diff_lens=True, **kwargs)[source]¶ Bases:
abydos.distance._distance._Distance
Hamming distance.
Hamming distance [Ham50] equals the number of character positions at which two strings differ. For strings of unequal lengths, it is not normally defined. By default, this implementation calculates the Hamming distance of the first n characters where n is the lesser of the two strings' lengths and adds to this the difference in string lengths.
New in version 0.3.6.
Initialize Hamming instance.
- Parameters
diff_lens (bool) -- If True (default), this returns the Hamming distance for those characters that have a matching character in both strings plus the difference in the strings' lengths. This is equivalent to extending the shorter string with obligatorily non-matching characters. If False, an exception is raised in the case of strings of unequal lengths.
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
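The diff_lens behaviour can be sketched in a few lines of plain Python. The function name hamming_sketch is illustrative; this is a minimal restatement of the description above, not the library's code.

```python
def hamming_sketch(src, tar, diff_lens=True):
    """Count differing positions, plus the length difference if allowed."""
    if not diff_lens and len(src) != len(tar):
        raise ValueError('Undefined for sequences of unequal length.')
    # zip stops at the shorter string, so only shared positions are compared.
    mismatches = sum(a != b for a, b in zip(src, tar))
    # Each unmatched trailing character counts as one mismatch.
    return mismatches + abs(len(src) - len(tar))


print(hamming_sketch('cat', 'hat'))
print(hamming_sketch('Niall', 'Neil'))
```

For 'Niall' and 'Neil', two of the four shared positions differ and one character is left over, giving 3.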
-
dist
(src, tar)[source]¶ Return the normalized Hamming distance between two strings.
Hamming distance normalized to the interval [0, 1].
The Hamming distance is normalized by dividing it by the greater of the number of characters in src & tar (unless diff_lens is set to False, in which case an exception is raised).
The arguments are identical to those of the hamming() function.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
Normalized Hamming distance
- Return type
float
Examples
>>> cmp = Hamming()
>>> round(cmp.dist('cat', 'hat'), 12)
0.333333333333
>>> cmp.dist('Niall', 'Neil')
0.6
>>> cmp.dist('aluminum', 'Catalan')
1.0
>>> cmp.dist('ATCG', 'TAGC')
1.0
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
-
dist_abs
(src, tar)[source]¶ Return the Hamming distance between two strings.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
The Hamming distance between src & tar
- Return type
int
- Raises
ValueError -- Undefined for sequences of unequal length; set diff_lens to True for Hamming distance between strings of unequal lengths.
Examples
>>> cmp = Hamming()
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
3
>>> cmp.dist_abs('aluminum', 'Catalan')
8
>>> cmp.dist_abs('ATCG', 'TAGC')
4
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
-
abydos.distance.
hamming
(src, tar, diff_lens=True)[source]¶ Return the Hamming distance between two strings.
This is a wrapper for Hamming.dist_abs().
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
diff_lens (bool) -- If True (default), this returns the Hamming distance for those characters that have a matching character in both strings plus the difference in the strings' lengths. This is equivalent to extending the shorter string with obligatorily non-matching characters. If False, an exception is raised in the case of strings of unequal lengths.
- Returns
The Hamming distance between src & tar
- Return type
int
Examples
>>> hamming('cat', 'hat')
1
>>> hamming('Niall', 'Neil')
3
>>> hamming('aluminum', 'Catalan')
8
>>> hamming('ATCG', 'TAGC')
4
New in version 0.1.0.
Deprecated since version 0.4.0: This will be removed in 0.6.0. Use the Hamming.dist_abs method instead.
-
abydos.distance.
dist_hamming
(src, tar, diff_lens=True)[source]¶ Return the normalized Hamming distance between two strings.
This is a wrapper for Hamming.dist().
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
diff_lens (bool) -- If True (default), this returns the Hamming distance for those characters that have a matching character in both strings plus the difference in the strings' lengths. This is equivalent to extending the shorter string with obligatorily non-matching characters. If False, an exception is raised in the case of strings of unequal lengths.
- Returns
The normalized Hamming distance
- Return type
float
Examples
>>> round(dist_hamming('cat', 'hat'), 12)
0.333333333333
>>> dist_hamming('Niall', 'Neil')
0.6
>>> dist_hamming('aluminum', 'Catalan')
1.0
>>> dist_hamming('ATCG', 'TAGC')
1.0
New in version 0.1.0.
Deprecated since version 0.4.0: This will be removed in 0.6.0. Use the Hamming.dist method instead.
-
abydos.distance.
sim_hamming
(src, tar, diff_lens=True)[source]¶ Return the normalized Hamming similarity of two strings.
This is a wrapper for Hamming.sim().
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
diff_lens (bool) -- If True (default), this returns the Hamming distance for those characters that have a matching character in both strings plus the difference in the strings' lengths. This is equivalent to extending the shorter string with obligatorily non-matching characters. If False, an exception is raised in the case of strings of unequal lengths.
- Returns
The normalized Hamming similarity
- Return type
float
Examples
>>> round(sim_hamming('cat', 'hat'), 12)
0.666666666667
>>> sim_hamming('Niall', 'Neil')
0.4
>>> sim_hamming('aluminum', 'Catalan')
0.0
>>> sim_hamming('ATCG', 'TAGC')
0.0
New in version 0.1.0.
Deprecated since version 0.4.0: This will be removed in 0.6.0. Use the Hamming.sim method instead.
-
class
abydos.distance.
MLIPNS
(threshold=0.25, max_mismatches=2, **kwargs)[source]¶ Bases:
abydos.distance._distance._Distance
MLIPNS similarity.
Modified Language-Independent Product Name Search (MLIPNS) is described in [SA10]. This function returns only 1.0 (similar) or 0.0 (not similar). LIPNS similarity is identical to normalized Hamming similarity.
New in version 0.3.6.
Initialize MLIPNS instance.
- Parameters
threshold (float) -- A number [0, 1] indicating the maximum similarity score, below which the strings are considered 'similar' (0.25 by default)
max_mismatches (int) -- A number indicating the allowable number of mismatches to remove before declaring two strings not similar (2 by default)
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
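One way to read the threshold and max_mismatches parameters is sketched below: compare the ratio of mismatches to length against the threshold, and if it is too high, discard one mismatched position and try again, up to max_mismatches times. This is an interpretation of the description above that happens to reproduce the documented results for this class; the function name and loop structure are assumptions, not the library's code.

```python
def mlipns_sketch(src, tar, threshold=0.25, max_mismatches=2):
    """Return 1.0 (similar) or 0.0 (not similar)."""
    if src == tar:
        return 1.0
    if not src or not tar:
        return 0.0
    # Hamming distance, counting the length difference as mismatches.
    ham = sum(a != b for a, b in zip(src, tar)) + abs(len(src) - len(tar))
    length = max(len(src), len(tar))
    # One initial check plus one check after each removed mismatch.
    for _ in range(max_mismatches + 1):
        if length < 1 or ham / length <= threshold:
            return 1.0
        # Remove one mismatched position and re-evaluate.
        ham -= 1
        length -= 1
    return 1.0 if length < 1 else 0.0


print(mlipns_sketch('cat', 'hat'))
print(mlipns_sketch('Niall', 'Neil'))
```

For 'cat' and 'hat', the initial ratio 1/3 exceeds 0.25, but after removing the single mismatch the ratio is 0, so the strings count as similar.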
-
sim
(src, tar)[source]¶ Return the MLIPNS similarity of two strings.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
MLIPNS similarity
- Return type
float
Examples
>>> sim_mlipns('cat', 'hat')
1.0
>>> sim_mlipns('Niall', 'Neil')
0.0
>>> sim_mlipns('aluminum', 'Catalan')
0.0
>>> sim_mlipns('ATCG', 'TAGC')
0.0
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
-
abydos.distance.
dist_mlipns
(src, tar, threshold=0.25, max_mismatches=2)[source]¶ Return the MLIPNS distance between two strings.
This is a wrapper for MLIPNS.dist().
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
threshold (float) -- A number [0, 1] indicating the maximum similarity score, below which the strings are considered 'similar' (0.25 by default)
max_mismatches (int) -- A number indicating the allowable number of mismatches to remove before declaring two strings not similar (2 by default)
- Returns
MLIPNS distance
- Return type
float
Examples
>>> dist_mlipns('cat', 'hat')
0.0
>>> dist_mlipns('Niall', 'Neil')
1.0
>>> dist_mlipns('aluminum', 'Catalan')
1.0
>>> dist_mlipns('ATCG', 'TAGC')
1.0
New in version 0.1.0.
Deprecated since version 0.4.0: This will be removed in 0.6.0. Use the MLIPNS.dist method instead.
-
abydos.distance.
sim_mlipns
(src, tar, threshold=0.25, max_mismatches=2)[source]¶ Return the MLIPNS similarity of two strings.
This is a wrapper for MLIPNS.sim().
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
threshold (float) -- A number [0, 1] indicating the maximum similarity score, below which the strings are considered 'similar' (0.25 by default)
max_mismatches (int) -- A number indicating the allowable number of mismatches to remove before declaring two strings not similar (2 by default)
- Returns
MLIPNS similarity
- Return type
float
Examples
>>> sim_mlipns('cat', 'hat')
1.0
>>> sim_mlipns('Niall', 'Neil')
0.0
>>> sim_mlipns('aluminum', 'Catalan')
0.0
>>> sim_mlipns('ATCG', 'TAGC')
0.0
New in version 0.1.0.
Deprecated since version 0.4.0: This will be removed in 0.6.0. Use the MLIPNS.sim method instead.
-
class
abydos.distance.
RelaxedHamming
(tokenizer=None, maxdist=2, discount=0.2, **kwargs)[source]¶ Bases:
abydos.distance._distance._Distance
Relaxed Hamming distance.
This is a variant of Hamming distance in which positionally close matches are considered partially matching.
New in version 0.4.1.
Initialize RelaxedHamming instance.
- Parameters
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
maxdist (int) -- The maximum distance to consider for discounting.
discount (float) -- The discount factor multiplied by the distance from the source string position.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
New in version 0.4.1.
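How maxdist and discount interact can be sketched character by character. In this reading, a mismatched source character that occurs in the target within maxdist positions costs discount times its offset, and any other mismatch costs 1. The function name, the two-sided search window, and the omission of tokenizers are assumptions of the sketch; it reproduces the documented results for the default parameters but is not the library's code.

```python
def relaxed_hamming_sketch(src, tar, maxdist=2, discount=0.2):
    """Hamming distance with partial credit for nearby matches."""
    length = max(len(src), len(tar))
    score = 0.0
    for pos in range(length):
        s = src[pos] if pos < len(src) else None
        t = tar[pos] if pos < len(tar) else None
        if s == t:
            continue  # exact positional match costs nothing
        cost = 1.0
        if s is not None:
            # Look for s in tar at increasing offsets on either side.
            for offset in range(1, maxdist + 1):
                left, right = pos - offset, pos + offset
                found_left = 0 <= left < len(tar) and tar[left] == s
                found_right = right < len(tar) and tar[right] == s
                if found_left or found_right:
                    cost = min(1.0, discount * offset)
                    break
        score += cost
    return score


print(relaxed_hamming_sketch('ATCG', 'TAGC'))
print(relaxed_hamming_sketch('Niall', 'Neil'))
```

In 'ATCG' versus 'TAGC', every character is one position away from its match, so each contributes the discounted cost rather than a full mismatch.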
-
dist
(src, tar)[source]¶ Return the normalized relaxed Hamming distance between strings.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
Normalized relaxed Hamming distance
- Return type
float
Examples
>>> cmp = RelaxedHamming()
>>> round(cmp.dist('cat', 'hat'), 12)
0.333333333333
>>> cmp.dist('Niall', 'Neil')
0.27999999999999997
>>> cmp.dist('aluminum', 'Catalan')
0.8
>>> cmp.dist('ATCG', 'TAGC')
0.2
New in version 0.4.1.
-
dist_abs
(src, tar)[source]¶ Return the relaxed Hamming distance between two strings.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
Relaxed Hamming distance
- Return type
float
Examples
>>> cmp = RelaxedHamming()
>>> cmp.dist_abs('cat', 'hat')
1.0
>>> cmp.dist_abs('Niall', 'Neil')
1.4
>>> cmp.dist_abs('aluminum', 'Catalan')
6.4
>>> cmp.dist_abs('ATCG', 'TAGC')
0.8
New in version 0.4.1.
-
class
abydos.distance.
Tichy
(cost=(1, 1), **kwargs)[source]¶ Bases:
abydos.distance._distance._Distance
Tichy edit distance.
Tichy described an algorithm, implemented below, in [Tic84]. Following this, [Cor03] identifies an interpretation of this algorithm's output as a distance measure, which is largely followed by the methods below.
Tichy's algorithm locates substrings of a string S to be copied in order to create a string T. The only other operations used by his algorithm for string reconstruction are add operations.
Notes
While [Cor03] counts only move operations to calculate distance, I give the option (enabled by default) of counting add operations as part of the distance measure. To ignore the cost of add operations, set the cost value to (1, 0), for example, when initializing the object. Further, in the case that S and T are identical, a distance of 0 will be returned, even though this would still be counted as a single move operation spanning the whole of string S.
New in version 0.4.0.
Initialize Tichy instance.
- Parameters
cost (tuple) -- A 2-tuple representing the cost of the two possible edits: block moves and adds (by default: (1, 1))
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
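The block-move-and-add idea can be sketched greedily: at each position of T, copy the longest prefix of the remaining text that occurs somewhere in S (one move), or add a single character when nothing matches. The function names and the greedy longest-match strategy are assumptions of this sketch; it agrees with the documented results for this class but is not presented as Tichy's exact procedure or the library's code.

```python
def tichy_sketch(src, tar, cost=(1, 1)):
    """Return a block-move distance from src (S) to tar (T)."""
    if src == tar:
        return 0  # identical strings are defined to have distance 0
    move_cost, add_cost = cost
    moves = adds = 0
    pos = 0
    while pos < len(tar):
        # Grow the block while it still occurs somewhere in src.
        length = 0
        while pos + length < len(tar) and tar[pos:pos + length + 1] in src:
            length += 1
        if length:
            moves += 1      # copy the block from src
            pos += length
        else:
            adds += 1       # character absent from src: add it
            pos += 1
    return moves * move_cost + adds * add_cost


def tichy_dist_sketch(src, tar, cost=(1, 1)):
    """Normalize by the length of tar, as described for dist()."""
    return tichy_sketch(src, tar, cost) / len(tar) if tar else 0.0


print(tichy_sketch('cat', 'hat'))               # add 'h', move 'at'
print(tichy_sketch('cat', 'hat', cost=(1, 0)))  # adds ignored
```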
-
dist
(src, tar)[source]¶ Return the normalized Tichy edit distance between two strings.
The Tichy distance is normalized by dividing the distance by the length of the tar string.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
The normalized Tichy distance between src & tar
- Return type
float
Examples
>>> cmp = Tichy()
>>> round(cmp.dist('cat', 'hat'), 12)
0.666666666667
>>> round(cmp.dist('Niall', 'Neil'), 12)
1.0
>>> cmp.dist('aluminum', 'Catalan')
0.8571428571428571
>>> cmp.dist('ATCG', 'TAGC')
1.0
New in version 0.4.0.
-
dist_abs
(src, tar)[source]¶ Return the Tichy distance between two strings.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
The Tichy distance between src & tar
- Return type
int (may return a float if cost has float values)
Examples
>>> cmp = Tichy()
>>> cmp.dist_abs('cat', 'hat')
2
>>> cmp.dist_abs('Niall', 'Neil')
4
>>> cmp.dist_abs('aluminum', 'Catalan')
6
>>> cmp.dist_abs('ATCG', 'TAGC')
4
New in version 0.4.0.
-
class
abydos.distance.
BlockLevenshtein
(cost=(1, 1, 1, 1), normalizer=<built-in function max>, **kwargs)[source]¶ Bases:
abydos.distance._levenshtein.Levenshtein
Levenshtein distance with block operations.
In addition to character-level insert, delete, and replace operations, this version of the Levenshtein distance supports block-level insert, delete, and replace, provided that the block occurs in both input strings.
New in version 0.4.0.
Initialize BlockLevenshtein instance.
- Parameters
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
-
dist
(src, tar)[source]¶ Return the normalized block Levenshtein distance between strings.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
The normalized Levenshtein distance with blocks between src & tar
- Return type
float
Examples
>>> cmp = BlockLevenshtein()
>>> round(cmp.dist('cat', 'hat'), 12)
0.333333333333
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.6
>>> cmp.dist('aluminum', 'Catalan')
0.875
>>> cmp.dist('ATCG', 'TAGC')
0.75
New in version 0.4.0.
-
dist_abs
(src, tar)[source]¶ Return the block Levenshtein edit distance between two strings.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
The block Levenshtein edit distance between src & tar
- Return type
int
Examples
>>> cmp = BlockLevenshtein()
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
3
>>> cmp.dist_abs('aluminum', 'Catalan')
7
>>> cmp.dist_abs('ATCG', 'TAGC')
3
New in version 0.4.0.
-
class
abydos.distance.
CormodeLZ
(**kwargs)[source]¶ Bases:
abydos.distance._distance._Distance
Cormode's LZ distance.
Cormode's LZ distance [CPSV00][Cor03]
New in version 0.4.0.
Initialize CormodeLZ instance.
- Parameters
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
-
dist
(src, tar)[source]¶ Return the normalized Cormode's LZ distance of two strings.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
Normalized Cormode's LZ distance
- Return type
float
Examples
>>> cmp = CormodeLZ()
>>> cmp.dist('cat', 'hat')
0.3333333333333333
>>> cmp.dist('Niall', 'Neil')
0.8
>>> cmp.dist('aluminum', 'Catalan')
0.625
>>> cmp.dist('ATCG', 'TAGC')
0.75
New in version 0.4.0.
-
dist_abs
(src, tar)[source]¶ Return Cormode's LZ distance between two strings.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
Cormode's LZ distance
- Return type
float
Examples
>>> cmp = CormodeLZ()
>>> cmp.dist_abs('cat', 'hat')
2
>>> cmp.dist_abs('Niall', 'Neil')
5
>>> cmp.dist_abs('aluminum', 'Catalan')
6
>>> cmp.dist_abs('ATCG', 'TAGC')
4
New in version 0.4.0.
-
class
abydos.distance.
JaroWinkler
(qval=1, mode='winkler', long_strings=False, boost_threshold=0.7, scaling_factor=0.1, **kwargs)[source]¶ Bases:
abydos.distance._distance._Distance
Jaro-Winkler distance.
Jaro(-Winkler) distance is a string edit distance initially proposed by Jaro and extended by Winkler [Jar89][Win90].
This is a Python implementation based on the C code for strcmp95: http://web.archive.org/web/20110629121242/http://www.census.gov/geo/msb/stand/strcmp.c [WMJL94]. The above file is a US Government publication and, accordingly, in the public domain.
New in version 0.3.6.
Initialize JaroWinkler instance.
- Parameters
qval (int) -- The length of each q-gram (defaults to 1: character-wise matching)
mode (str) --
Indicates which variant of this distance metric to compute:
winkler -- computes the Jaro-Winkler distance (default), which increases the score for matches near the start of the word
jaro -- computes the Jaro distance
long_strings (bool) -- Set to True to "Increase the probability of a match when the number of matched characters is large. This option allows for a little more tolerance when the strings are large. It is not an appropriate test when comparing fixed length fields such as phone and social security numbers." (Used in 'winkler' mode only.)
boost_threshold (float) -- A value between 0 and 1, below which the Winkler boost is not applied (defaults to 0.7). (Used in 'winkler' mode only.)
scaling_factor (float) -- A value between 0 and 0.25, indicating by how much to boost scores for matching prefixes (defaults to 0.1). (Used in 'winkler' mode only.)
New in version 0.4.0.
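The roles of mode, boost_threshold, and scaling_factor can be illustrated with a compact character-wise sketch (qval=1) of the textbook Jaro and Jaro-Winkler formulas. The function names, the search range of half the longer length minus one, and the prefix cap of four characters are assumptions drawn from the standard formulation, and long_strings is not modelled. The sketch agrees with the documented similarity values for this class, but it is not the library's code.

```python
def jaro_sketch(src, tar):
    """Character-wise Jaro similarity."""
    if src == tar:
        return 1.0
    if not src or not tar:
        return 0.0
    search_range = max(max(len(src), len(tar)) // 2 - 1, 0)
    src_flags = [False] * len(src)
    tar_flags = [False] * len(tar)
    matches = 0
    for i, ch in enumerate(src):
        lo = max(0, i - search_range)
        hi = min(len(tar), i + search_range + 1)
        for j in range(lo, hi):
            if not tar_flags[j] and tar[j] == ch:
                src_flags[i] = tar_flags[j] = True
                matches += 1
                break
    if not matches:
        return 0.0
    # Matched characters, each in its own string's order.
    src_matched = [c for c, f in zip(src, src_flags) if f]
    tar_matched = [c for c, f in zip(tar, tar_flags) if f]
    transpositions = sum(a != b for a, b in zip(src_matched, tar_matched)) // 2
    return (matches / len(src) + matches / len(tar)
            + (matches - transpositions) / matches) / 3


def jaro_winkler_sketch(src, tar, boost_threshold=0.7, scaling_factor=0.1):
    """Jaro similarity with the Winkler prefix boost."""
    weight = jaro_sketch(src, tar)
    if weight > boost_threshold:
        prefix = 0
        for a, b in zip(src[:4], tar[:4]):
            if a != b:
                break
            prefix += 1
        weight += prefix * scaling_factor * (1.0 - weight)
    return weight


print(round(jaro_sketch('Niall', 'Neil'), 12))
print(round(jaro_winkler_sketch('Niall', 'Neil'), 12))
```

'Niall' and 'Neil' share the one-character prefix 'N', so the Winkler variant raises the Jaro score by 0.1 times the remaining distance to 1. The corresponding distance is one minus the similarity.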
-
sim
(src, tar)[source]¶ Return the Jaro or Jaro-Winkler similarity of two strings.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
Jaro or Jaro-Winkler similarity
- Return type
float
- Raises
ValueError -- Unsupported boost_threshold assignment; boost_threshold must be between 0 and 1.
ValueError -- Unsupported scaling_factor assignment; scaling_factor must be between 0 and 0.25.
Examples
>>> round(sim_jaro_winkler('cat', 'hat'), 12)
0.777777777778
>>> round(sim_jaro_winkler('Niall', 'Neil'), 12)
0.805
>>> round(sim_jaro_winkler('aluminum', 'Catalan'), 12)
0.60119047619
>>> round(sim_jaro_winkler('ATCG', 'TAGC'), 12)
0.833333333333

>>> round(sim_jaro_winkler('cat', 'hat', mode='jaro'), 12)
0.777777777778
>>> round(sim_jaro_winkler('Niall', 'Neil', mode='jaro'), 12)
0.783333333333
>>> round(sim_jaro_winkler('aluminum', 'Catalan', mode='jaro'), 12)
0.60119047619
>>> round(sim_jaro_winkler('ATCG', 'TAGC', mode='jaro'), 12)
0.833333333333
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
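The relationship between the two modes can be sketched in a few lines. This is a minimal illustration and not the library's implementation: `winkler_boost` is a hypothetical helper, the four-character prefix cap is the conventional Jaro-Winkler choice and is assumed here, and the `long_strings` adjustment is left out.

```python
def winkler_boost(jaro_sim, src, tar, boost_threshold=0.7, scaling_factor=0.1):
    """Apply the Winkler prefix boost to a precomputed Jaro similarity.

    Hypothetical helper for illustration only. Assumes a cap of four
    prefix characters and ignores the long_strings adjustment.
    """
    if jaro_sim < boost_threshold:
        return jaro_sim  # below the threshold, no boost is applied
    prefix = 0
    for s_char, t_char in zip(src[:4], tar[:4]):
        if s_char != t_char:
            break
        prefix += 1
    # Move the score toward 1.0 in proportion to the shared prefix length.
    return jaro_sim + prefix * scaling_factor * (1.0 - jaro_sim)


# Jaro similarities taken from the mode='jaro' examples above.
print(round(winkler_boost(0.783333333333, 'Niall', 'Neil'), 3))
print(round(winkler_boost(0.777777777778, 'cat', 'hat'), 12))
```

'Niall' and 'Neil' share a one-character prefix, so the sketch raises the Jaro value to about 0.805; 'cat' and 'hat' share no prefix, so their score is unchanged.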
-
abydos.distance.
dist_jaro_winkler
(src, tar, qval=1, mode='winkler', long_strings=False, boost_threshold=0.7, scaling_factor=0.1)[source]¶ Return the Jaro or Jaro-Winkler distance between two strings.
This is a wrapper for JaroWinkler.dist().
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
qval (int) -- The length of each q-gram (defaults to 1: character-wise matching)
mode (str) --
Indicates which variant of this distance metric to compute:
winkler -- computes the Jaro-Winkler distance (default), which increases the score for matches near the start of the word
jaro -- computes the Jaro distance
long_strings (bool) -- Set to True to "Increase the probability of a match when the number of matched characters is large. This option allows for a little more tolerance when the strings are large. It is not an appropriate test when comparing fixed length fields such as phone and social security numbers." (Used in 'winkler' mode only.)
boost_threshold (float) -- A value between 0 and 1, below which the Winkler boost is not applied (defaults to 0.7). (Used in 'winkler' mode only.)
scaling_factor (float) -- A value between 0 and 0.25, indicating by how much to boost scores for matching prefixes (defaults to 0.1). (Used in 'winkler' mode only.)
- Returns
Jaro or Jaro-Winkler distance
- Return type
float
Examples
>>> round(dist_jaro_winkler('cat', 'hat'), 12)
0.222222222222
>>> round(dist_jaro_winkler('Niall', 'Neil'), 12)
0.195
>>> round(dist_jaro_winkler('aluminum', 'Catalan'), 12)
0.39880952381
>>> round(dist_jaro_winkler('ATCG', 'TAGC'), 12)
0.166666666667

>>> round(dist_jaro_winkler('cat', 'hat', mode='jaro'), 12)
0.222222222222
>>> round(dist_jaro_winkler('Niall', 'Neil', mode='jaro'), 12)
0.216666666667
>>> round(dist_jaro_winkler('aluminum', 'Catalan', mode='jaro'), 12)
0.39880952381
>>> round(dist_jaro_winkler('ATCG', 'TAGC', mode='jaro'), 12)
0.166666666667
New in version 0.1.0.
Deprecated since version 0.4.0: This will be removed in 0.6.0. Use the JaroWinkler.dist method instead.
-
abydos.distance.
sim_jaro_winkler
(src, tar, qval=1, mode='winkler', long_strings=False, boost_threshold=0.7, scaling_factor=0.1)[source]¶ Return the Jaro or Jaro-Winkler similarity of two strings.
This is a wrapper for JaroWinkler.sim().
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
qval (int) -- The length of each q-gram (defaults to 1: character-wise matching)
mode (str) --
Indicates which variant of this distance metric to compute:
winkler -- computes the Jaro-Winkler distance (default), which increases the score for matches near the start of the word
jaro -- computes the Jaro distance
long_strings (bool) -- Set to True to "Increase the probability of a match when the number of matched characters is large. This option allows for a little more tolerance when the strings are large. It is not an appropriate test when comparing fixed length fields such as phone and social security numbers." (Used in 'winkler' mode only.)
boost_threshold (float) -- A value between 0 and 1, below which the Winkler boost is not applied (defaults to 0.7). (Used in 'winkler' mode only.)
scaling_factor (float) -- A value between 0 and 0.25, indicating by how much to boost scores for matching prefixes (defaults to 0.1). (Used in 'winkler' mode only.)
- Returns
Jaro or Jaro-Winkler similarity
- Return type
float
Examples
>>> round(sim_jaro_winkler('cat', 'hat'), 12)
0.777777777778
>>> round(sim_jaro_winkler('Niall', 'Neil'), 12)
0.805
>>> round(sim_jaro_winkler('aluminum', 'Catalan'), 12)
0.60119047619
>>> round(sim_jaro_winkler('ATCG', 'TAGC'), 12)
0.833333333333

>>> round(sim_jaro_winkler('cat', 'hat', mode='jaro'), 12)
0.777777777778
>>> round(sim_jaro_winkler('Niall', 'Neil', mode='jaro'), 12)
0.783333333333
>>> round(sim_jaro_winkler('aluminum', 'Catalan', mode='jaro'), 12)
0.60119047619
>>> round(sim_jaro_winkler('ATCG', 'TAGC', mode='jaro'), 12)
0.833333333333
New in version 0.1.0.
Deprecated since version 0.4.0: This will be removed in 0.6.0. Use the JaroWinkler.sim method instead.
-
class
abydos.distance.
Strcmp95
(long_strings=False, **kwargs)[source]¶ Bases:
abydos.distance._distance._Distance
Strcmp95.
This is a Python translation of the C code for strcmp95: http://web.archive.org/web/20110629121242/http://www.census.gov/geo/msb/stand/strcmp.c [WMJL94]. The above file is a US Government publication and, accordingly, in the public domain.
This is based on the Jaro-Winkler distance, but also attempts to correct for some common typos and frequently confused characters. It is also limited to uppercase ASCII characters, so it is appropriate to American names, but not much else.
New in version 0.3.6.
Initialize Strcmp95 instance.
- Parameters
long_strings (bool) -- Set to True to increase the probability of a match when the number of matched characters is large. This option allows for a little more tolerance when the strings are large. It is not an appropriate test when comparing fixed length fields such as phone and social security numbers.
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
-
sim
(src, tar)[source]¶ Return the strcmp95 similarity of two strings.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
Strcmp95 similarity
- Return type
float
Examples
>>> cmp = Strcmp95()
>>> cmp.sim('cat', 'hat')
0.7777777777777777
>>> cmp.sim('Niall', 'Neil')
0.8454999999999999
>>> cmp.sim('aluminum', 'Catalan')
0.6547619047619048
>>> cmp.sim('ATCG', 'TAGC')
0.8333333333333334
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
-
abydos.distance.
dist_strcmp95
(src, tar, long_strings=False)[source]¶ Return the strcmp95 distance between two strings.
This is a wrapper for Strcmp95.dist().
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
long_strings (bool) -- Set to True to increase the probability of a match when the number of matched characters is large. This option allows for a little more tolerance when the strings are large. It is not an appropriate test when comparing fixed length fields such as phone and social security numbers.
- Returns
Strcmp95 distance
- Return type
float
Examples
>>> round(dist_strcmp95('cat', 'hat'), 12)
0.222222222222
>>> round(dist_strcmp95('Niall', 'Neil'), 12)
0.1545
>>> round(dist_strcmp95('aluminum', 'Catalan'), 12)
0.345238095238
>>> round(dist_strcmp95('ATCG', 'TAGC'), 12)
0.166666666667
New in version 0.1.0.
Deprecated since version 0.4.0: This will be removed in 0.6.0. Use the Strcmp95.dist method instead.
-
abydos.distance.
sim_strcmp95
(src, tar, long_strings=False)[source]¶ Return the strcmp95 similarity of two strings.
This is a wrapper for Strcmp95.sim().
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
long_strings (bool) -- Set to True to increase the probability of a match when the number of matched characters is large. This option allows for a little more tolerance when the strings are large. It is not an appropriate test when comparing fixed length fields such as phone and social security numbers.
- Returns
Strcmp95 similarity
- Return type
float
Examples
>>> sim_strcmp95('cat', 'hat')
0.7777777777777777
>>> sim_strcmp95('Niall', 'Neil')
0.8454999999999999
>>> sim_strcmp95('aluminum', 'Catalan')
0.6547619047619048
>>> sim_strcmp95('ATCG', 'TAGC')
0.8333333333333334
New in version 0.1.0.
Deprecated since version 0.4.0: This will be removed in 0.6.0. Use the Strcmp95.sim method instead.
-
class
abydos.distance.
IterativeSubString
(hamacher=0.6, normalize_strings=False, **kwargs)[source]¶ Bases:
abydos.distance._distance._Distance
Iterative-SubString correlation.
Iterative-SubString (I-Sub) correlation [SSK05]
This is a straightforward port of the primary author's Java implementation: http://www.image.ece.ntua.gr/~gstoil/software/I_Sub.java
New in version 0.4.0.
Initialize IterativeSubString instance.
- Parameters
hamacher (float) -- The constant factor for the Hamacher product
normalize_strings (bool) -- Normalize the strings by removing the characters in '._ ' and lower casing
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
-
corr
(src, tar)[source]¶ Return the Iterative-SubString correlation of two strings.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
Iterative-SubString correlation
- Return type
float
Examples
>>> cmp = IterativeSubString()
>>> cmp.corr('cat', 'hat')
-1.0
>>> cmp.corr('Niall', 'Neil')
-0.9
>>> cmp.corr('aluminum', 'Catalan')
-1.0
>>> cmp.corr('ATCG', 'TAGC')
-1.0
New in version 0.4.0.
-
sim
(src, tar)[source]¶ Return the Iterative-SubString similarity of two strings.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
Iterative-SubString similarity
- Return type
float
Examples
>>> cmp = IterativeSubString()
>>> cmp.sim('cat', 'hat')
0.0
>>> cmp.sim('Niall', 'Neil')
0.04999999999999999
>>> cmp.sim('aluminum', 'Catalan')
0.0
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
-
class
abydos.distance.
AMPLE
(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
AMPLE similarity.
The AMPLE similarity [DLZ05][AZvanGemund07] is defined in getAverageSequenceWeight() in the AverageSequenceWeightEvaluator.java file of AMPLE's source code. For two sets X and Y and a population N, it is
\[sim_{AMPLE}(X, Y) = \big|\frac{|X \cap Y|}{|X|} - \frac{|Y \setminus X|}{|N \setminus X|}\big|\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{AMPLE} = \big|\frac{a}{a+b}-\frac{c}{c+d}\big|\]
Notes
This measure is asymmetric. The first ratio considers how similar the two strings are, while the second considers how dissimilar the second string is. As a result, both very similar and very dissimilar strings will score high on this measure, provided the unique aspects are present chiefly in the latter string.
New in version 0.4.0.
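The 2x2 confusion-table form can be made concrete with a small sketch. The helper names below are hypothetical, and the tokenization is an assumption made for illustration: character bigrams padded with '$' and '#', counted as multisets, over a population of 28**2 = 784 possible bigrams. It is not the library's code path.

```python
from collections import Counter


def confusion_counts(src, tar, n=28 ** 2):
    """Return (a, b, c, d) for padded character bigrams of src and tar.

    Assumptions for illustration: bigram tokens, '$' and '#' as start and
    stop padding, and a population of n possible bigrams.
    """
    def bigrams(word):
        padded = '$' + word + '#'
        return Counter(padded[i:i + 2] for i in range(len(padded) - 1))

    x, y = bigrams(src), bigrams(tar)
    a = sum((x & y).values())  # tokens in both strings
    b = sum((x - y).values())  # tokens only in src
    c = sum((y - x).values())  # tokens only in tar
    d = n - a - b - c          # tokens in neither string
    return a, b, c, d


def ample_sim(a, b, c, d):
    """AMPLE similarity |a/(a+b) - c/(c+d)| from confusion-table counts."""
    return abs(a / (a + b) - c / (c + d))


counts = confusion_counts('cat', 'hat')
print(counts)
print(ample_sim(*counts))
```

Under these assumptions 'cat' and 'hat' give a=2, b=2, c=2, d=778, and the resulting value is approximately 0.4974.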
Initialize AMPLE instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
-
sim
(src, tar)[source]¶ Return the AMPLE similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
AMPLE similarity
- Return type
float
Examples
>>> cmp = AMPLE()
>>> cmp.sim('cat', 'hat')
0.49743589743589745
>>> cmp.sim('Niall', 'Neil')
0.32947729220222793
>>> cmp.sim('aluminum', 'Catalan')
0.10209049255441008
>>> cmp.sim('ATCG', 'TAGC')
0.006418485237483954
New in version 0.4.0.
-
class
abydos.distance.
AZZOO
(sigma=0.5, alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
AZZOO similarity.
For two sets X and Y, and alphabet N, and a parameter \(\sigma\), AZZOO similarity [CTY06] is
\[sim_{AZZOO_{\sigma}}(X, Y) = \sum_{i} s_i\]
where \(s_i = 1\) if \(X_i = Y_i = 1\), \(s_i = \sigma\) if \(X_i = Y_i = 0\), and \(s_i = 0\) otherwise.
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{AZZOO} = a + \sigma \cdot d\]
New in version 0.4.0.
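The per-position definition of \(s_i\) can be sketched directly over two binary membership vectors. `azzoo_score` is a hypothetical name used only for this illustration, and the vectors are toy data.

```python
def azzoo_score(x_bits, y_bits, sigma=0.5):
    """Sum s_i over positions: 1 for a 1-1 match, sigma for a 0-0 match."""
    score = 0.0
    for x_i, y_i in zip(x_bits, y_bits):
        if x_i == 1 and y_i == 1:
            score += 1.0    # token present in both sets (counts toward a)
        elif x_i == 0 and y_i == 0:
            score += sigma  # token absent from both sets (counts toward d)
    return score


x = [1, 1, 0, 0, 0]
y = [1, 0, 1, 0, 0]
print(azzoo_score(x, y))  # a = 1, d = 2, so 1 + 0.5 * 2 = 2.0
```

The same value follows from the confusion-table form, a + sigma * d, with a = 1 and d = 2.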
Initialize AZZOO instance.
- Parameters
sigma (float) -- Sigma designates the contribution to similarity given by the 0-0 samples in the set.
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
-
sim
(src, tar)[source]¶ Return the AZZOO similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
AZZOO similarity
- Return type
float
Examples
>>> cmp = AZZOO()
>>> cmp.sim('cat', 'hat')
0.9923857868020305
>>> cmp.sim('Niall', 'Neil')
0.9860759493670886
>>> cmp.sim('aluminum', 'Catalan')
0.9710327455919395
>>> cmp.sim('ATCG', 'TAGC')
0.9809885931558935
New in version 0.4.0.
-
sim_score
(src, tar)[source]¶ Return the AZZOO similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
AZZOO similarity
- Return type
float
Examples
>>> cmp = AZZOO()
>>> cmp.sim_score('cat', 'hat')
391.0
>>> cmp.sim_score('Niall', 'Neil')
389.5
>>> cmp.sim_score('aluminum', 'Catalan')
385.5
>>> cmp.sim_score('ATCG', 'TAGC')
387.0
New in version 0.4.0.
-
class
abydos.distance.
Anderberg
(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Anderberg's D.
For two sets X and Y and a population N, Anderberg's D [And73] is
\[\begin{split}t_1 = max(|X \cap Y|, |X \setminus Y|)+ max(|Y \setminus X|, |(N \setminus X) \setminus Y|)+\\ max(|X \cap Y|, |Y \setminus X|)+ max(|X \setminus Y|, |(N \setminus X) \setminus Y|)\\ \\ t_2 = max(|Y|, |N \setminus Y|)+max(|X|, |N \setminus X|)\\ \\ sim_{Anderberg}(X, Y) = \frac{t_1-t_2}{2|N|}\end{split}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{Anderberg} = \frac{(max(a,b)+max(c,d)+max(a,c)+max(b,d))- (max(a+c,b+d)+max(a+b,c+d))}{2n}\]
Notes
There are various references to another "Anderberg similarity", \(sim_{Anderberg} = \frac{8a}{8a+b+c}\), but I cannot substantiate the claim that this appears in [And73]. In any case, if you want to use this measure, you may instantiate WeightedJaccard with weight=8.
Anderberg states that "[t]his quantity is the actual reduction in the error probability (also the actual increase in the correct prediction) as a consequence of using predictor information" [And73]. It ranges [0, 0.5], so a sim method ranging [0, 1] is provided in addition to sim_score, which gives the value D itself.
It is difficult to term this measure a similarity score. Identical strings often fail to gain high scores. Also, strings that would otherwise be considered quite similar often earn lower scores than those that are less similar.
New in version 0.4.0.
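A minimal sketch of D computed from confusion-table counts follows. `anderberg_d` is a hypothetical name, and t2 is written from the set-form definition, max(|Y|, |N \ Y|) + max(|X|, |N \ X|), with |X| = a+b and |Y| = a+c. The counts in the first call are assumed values corresponding to padded bigrams of 'cat' and 'hat' in a 784-token population.

```python
def anderberg_d(a, b, c, d):
    """Anderberg's D from 2x2 confusion-table counts (illustrative sketch)."""
    n = a + b + c + d
    t1 = max(a, b) + max(c, d) + max(a, c) + max(b, d)
    t2 = max(a + c, b + d) + max(a + b, c + d)
    return (t1 - t2) / (2 * n)


print(anderberg_d(2, 2, 2, 778))  # d dominates every max, so t1 == t2
print(anderberg_d(5, 0, 0, 5))    # perfectly aligned, balanced sets
```

The first call returns 0.0 because the large d term dominates every maximum, which illustrates why even similar strings can score zero. The second call reaches the upper bound of 0.5.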
Initialize Anderberg instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
-
sim
(src, tar)[source]¶ Return the normalized Anderberg's D similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Normalized Anderberg similarity
- Return type
float
Examples
>>> cmp = Anderberg()
>>> cmp.sim('cat', 'hat')
0.0
>>> cmp.sim('Niall', 'Neil')
0.0
>>> cmp.sim('aluminum', 'Catalan')
0.0
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
-
sim_score
(src, tar)[source]¶ Return the Anderberg's D similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Anderberg similarity
- Return type
float
Examples
>>> cmp = Anderberg()
>>> cmp.sim_score('cat', 'hat')
0.0
>>> cmp.sim_score('Niall', 'Neil')
0.0
>>> cmp.sim_score('aluminum', 'Catalan')
0.0
>>> cmp.sim_score('ATCG', 'TAGC')
0.0
New in version 0.4.0.
-
class
abydos.distance.
AndresMarzoDelta
(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Andres & Marzo's Delta correlation.
For two sets X and Y and a population N, Andres & Marzo's \(\Delta\) correlation [AndresM04] is
\[corr_{AndresMarzo_\Delta}(X, Y) = \Delta = \frac{|X \cap Y| + |(N \setminus X) \setminus Y| - 2\sqrt{|X \setminus Y| \cdot |Y \setminus X|}}{|N|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{AndresMarzo_\Delta} = \Delta = \frac{a+d-2\sqrt{b \cdot c}}{n}\]
New in version 0.4.0.
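As a worked instance of the confusion-table formula, the sketch below evaluates Delta for one set of counts. The function name is hypothetical, and the counts (a=2, b=2, c=2, d=778) are assumed values corresponding to padded bigrams of 'cat' and 'hat' in a 784-token population. The rescaling (1 + corr) / 2 is also an assumption about how a correlation in [-1, 1] maps to a similarity in [0, 1].

```python
from math import sqrt


def andres_marzo_delta(a, b, c, d):
    """Andres & Marzo's Delta from 2x2 confusion-table counts (sketch)."""
    n = a + b + c + d
    return (a + d - 2 * sqrt(b * c)) / n


corr = andres_marzo_delta(2, 2, 2, 778)
print(corr)            # (2 + 778 - 2 * 2) / 784
print((1 + corr) / 2)  # assumed rescaling of the correlation to [0, 1]
```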
Initialize AndresMarzoDelta instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
-
corr
(src, tar)[source]¶ Return the Andres & Marzo's Delta correlation of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Andres & Marzo's Delta correlation
- Return type
float
Examples
>>> cmp = AndresMarzoDelta()
>>> cmp.corr('cat', 'hat')
0.9897959183673469
>>> cmp.corr('Niall', 'Neil')
0.9822344346552608
>>> cmp.corr('aluminum', 'Catalan')
0.9618259496215341
>>> cmp.corr('ATCG', 'TAGC')
0.9744897959183674
New in version 0.4.0.
-
sim
(src, tar)[source]¶ Return the Andres & Marzo's Delta similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Andres & Marzo's Delta similarity
- Return type
float
Examples
>>> cmp = AndresMarzoDelta()
>>> cmp.sim('cat', 'hat')
0.9948979591836735
>>> cmp.sim('Niall', 'Neil')
0.9911172173276304
>>> cmp.sim('aluminum', 'Catalan')
0.980912974810767
>>> cmp.sim('ATCG', 'TAGC')
0.9872448979591837
New in version 0.4.0.
-
class
abydos.distance.
BaroniUrbaniBuserI
(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Baroni-Urbani & Buser I similarity.
For two sets X and Y and a population N, the Baroni-Urbani & Buser I similarity [BUB76] is
\[sim_{BaroniUrbaniBuserI}(X, Y) = \frac{\sqrt{|X \cap Y| \cdot |(N \setminus X) \setminus Y|} + |X \cap Y|} {\sqrt{|X \cap Y| \cdot |(N \setminus X) \setminus Y|} + |X \cap Y| + |X \setminus Y| + |Y \setminus X|}\]
This is the second, but more commonly used and referenced, of the two similarities proposed by Baroni-Urbani & Buser.
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{BaroniUrbaniBuserI} = \frac{\sqrt{ad}+a}{\sqrt{ad}+a+b+c}\]
New in version 0.4.0.
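The confusion-table form translates almost directly into code. The sketch below uses a hypothetical function name and assumed counts (a=2, b=2, c=2, d=778, corresponding to padded bigrams of 'cat' and 'hat' in a 784-token population). It does not guard against a zero denominator, which occurs when a, b, and c are all zero.

```python
from math import sqrt


def baroni_urbani_buser_i(a, b, c, d):
    """Baroni-Urbani & Buser I similarity from confusion-table counts."""
    root = sqrt(a * d)  # geometric mean of shared presence and shared absence
    return (root + a) / (root + a + b + c)


print(baroni_urbani_buser_i(2, 2, 2, 778))
print(baroni_urbani_buser_i(0, 5, 5, 774))  # no shared tokens gives 0.0
```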
Initialize BaroniUrbaniBuserI instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
-
sim
(src, tar)[source]¶ Return the Baroni-Urbani & Buser I similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Baroni-Urbani & Buser I similarity
- Return type
float
Examples
>>> cmp = BaroniUrbaniBuserI()
>>> cmp.sim('cat', 'hat')
0.9119837740878104
>>> cmp.sim('Niall', 'Neil')
0.8552823175014205
>>> cmp.sim('aluminum', 'Catalan')
0.656992712054851
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
-
class
abydos.distance.
BaroniUrbaniBuserII
(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Baroni-Urbani & Buser II correlation.
For two sets X and Y and a population N, the Baroni-Urbani & Buser II correlation [BUB76] is
\[corr_{BaroniUrbaniBuserII}(X, Y) = \frac{\sqrt{|X \cap Y| \cdot |(N \setminus X) \setminus Y|} + |X \cap Y| - |X \setminus Y| - |Y \setminus X|} {\sqrt{|X \cap Y| \cdot |(N \setminus X) \setminus Y|} + |X \cap Y| + |X \setminus Y| + |Y \setminus X|}\]
This is the first, but less commonly used and referenced, of the two similarities proposed by Baroni-Urbani & Buser.
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{BaroniUrbaniBuserII} = \frac{\sqrt{ad}+a-b-c}{\sqrt{ad}+a+b+c}\]
New in version 0.4.0.
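A short sketch of the correlation from confusion-table counts is given below. The function name is hypothetical, the counts (a=2, b=2, c=2, d=778) are assumed values corresponding to padded bigrams of 'cat' and 'hat' in a 784-token population, and the (1 + corr) / 2 rescaling is an assumption about how the correlation maps to a [0, 1] similarity. Under that assumption the rescaled value is algebraically (sqrt(ad) + a) / (sqrt(ad) + a + b + c).

```python
from math import sqrt


def baroni_urbani_buser_ii(a, b, c, d):
    """Baroni-Urbani & Buser II correlation from confusion-table counts."""
    root = sqrt(a * d)
    return (root + a - b - c) / (root + a + b + c)


corr = baroni_urbani_buser_ii(2, 2, 2, 778)
print(corr)
print((1 + corr) / 2)                          # assumed rescaling to [0, 1]
print(baroni_urbani_buser_ii(0, 5, 5, 774))    # no shared tokens gives -1.0
```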
Initialize BaroniUrbaniBuserII instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
-
corr
(src, tar)[source]¶ Return the Baroni-Urbani & Buser II correlation of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Baroni-Urbani & Buser II correlation
- Return type
float
Examples
>>> cmp = BaroniUrbaniBuserII()
>>> cmp.corr('cat', 'hat')
0.8239675481756209
>>> cmp.corr('Niall', 'Neil')
0.7105646350028408
>>> cmp.corr('aluminum', 'Catalan')
0.31398542410970204
>>> cmp.corr('ATCG', 'TAGC')
-1.0
New in version 0.4.0.
-
sim
(src, tar)[source]¶ Return the Baroni-Urbani & Buser II similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Baroni-Urbani & Buser II similarity
- Return type
float
Examples
>>> cmp = BaroniUrbaniBuserII()
>>> cmp.sim('cat', 'hat')
0.9119837740878105
>>> cmp.sim('Niall', 'Neil')
0.8552823175014204
>>> cmp.sim('aluminum', 'Catalan')
0.656992712054851
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
-
class
abydos.distance.
BatageljBren
(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Batagelj & Bren distance.
For two sets X and Y and a population N, the Batagelj & Bren distance [BB95], Batagelj & Bren's \(Q_0\), is
\[dist_{BatageljBren}(X, Y) = \frac{|X \setminus Y| \cdot |Y \setminus X|} {|X \cap Y| \cdot |(N \setminus X) \setminus Y|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[dist_{BatageljBren} = \frac{bc}{ad}\]
New in version 0.4.0.
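The ratio is unbounded, which a small sketch makes visible. The function name is hypothetical, the counts in the first call (a=2, b=2, c=2, d=778) are assumed values corresponding to padded bigrams of 'cat' and 'hat' in a 784-token population, and returning infinity when ad = 0 is an assumption about how the undefined case might be handled.

```python
def batagelj_bren_abs(a, b, c, d):
    """Unnormalized Batagelj & Bren distance bc / ad (illustrative sketch)."""
    if a * d == 0:
        # Assumption: treat the undefined ratio as an infinite distance.
        return float('inf')
    return (b * c) / (a * d)


print(batagelj_bren_abs(2, 2, 2, 778))  # 4 / 1556
print(batagelj_bren_abs(0, 5, 5, 774))  # no shared tokens
```

With no shared tokens (a = 0) the sketch yields inf, which is why a separately normalized distance is useful alongside the raw value.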
Initialize BatageljBren instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
-
dist
(src, tar)[source]¶ Return the normalized Batagelj & Bren distance of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Normalized Batagelj & Bren distance
- Return type
float
Examples
>>> cmp = BatageljBren()
>>> cmp.dist('cat', 'hat')
3.2789465400556106e-06
>>> cmp.dist('Niall', 'Neil')
9.874917709019092e-06
>>> cmp.dist('aluminum', 'Catalan')
9.276668350823718e-05
>>> cmp.dist('ATCG', 'TAGC')
1.0
New in version 0.4.0.
-
dist_abs
(src, tar)[source]¶ Return the Batagelj & Bren distance of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Batagelj & Bren distance
- Return type
float
Examples
>>> cmp = BatageljBren()
>>> cmp.dist_abs('cat', 'hat')
0.002570694087403599
>>> cmp.dist_abs('Niall', 'Neil')
0.007741935483870968
>>> cmp.dist_abs('aluminum', 'Catalan')
0.07282184655396619
>>> cmp.dist_abs('ATCG', 'TAGC')
inf
New in version 0.4.0.
-
class
abydos.distance.
BaulieuI
(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Baulieu I distance.
For two sets X and Y and a population N, Baulieu I distance [Bau89] is
\[dist_{BaulieuI}(X, Y) = \frac{|X| \cdot |Y| - |X \cap Y|^2}{|X| \cdot |Y|}\]
This is Baulieu's 12th dissimilarity coefficient.
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[dist_{BaulieuI} = \frac{(a+b)(a+c)-a^2}{(a+b)(a+c)}\]
New in version 0.4.0.
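Because this distance depends only on a, b, and c, it is easy to check by hand. The sketch below uses a hypothetical function name and assumed counts for padded bigrams: a=2, b=2, c=2 for 'cat' and 'hat', and a=2, b=4, c=3 for 'Niall' and 'Neil'.

```python
def baulieu_i_dist(a, b, c):
    """Baulieu I distance from confusion-table counts (d is not used)."""
    size_product = (a + b) * (a + c)  # |X| * |Y|
    return (size_product - a * a) / size_product


print(baulieu_i_dist(2, 2, 2))  # (16 - 4) / 16 = 0.75
print(baulieu_i_dist(2, 4, 3))  # (30 - 4) / 30
```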
Initialize BaulieuI instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
-
dist
(src, tar)[source]¶ Return the Baulieu I distance of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Baulieu I distance
- Return type
float
Examples
>>> cmp = BaulieuI()
>>> cmp.dist('cat', 'hat')
0.75
>>> cmp.dist('Niall', 'Neil')
0.8666666666666667
>>> cmp.dist('aluminum', 'Catalan')
0.9861111111111112
>>> cmp.dist('ATCG', 'TAGC')
1.0
New in version 0.4.0.
-
class
abydos.distance.
BaulieuII
(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Baulieu II similarity.
For two sets X and Y and a population N, Baulieu II similarity [Bau89] is
\[sim_{BaulieuII}(X, Y) = \frac{|X \cap Y|^2 \cdot |(N \setminus X) \setminus Y|^2} {|X| \cdot |Y| \cdot |N \setminus X| \cdot |N \setminus Y|}\]
This is based on Baulieu's 13th dissimilarity coefficient.
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{BaulieuII} = \frac{a^2d^2}{(a+b)(a+c)(b+d)(c+d)}\]
New in version 0.4.0.
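A minimal sketch of the confusion-table form follows. The function name is hypothetical, and the counts (a=2, b=2, c=2, d=778) are assumed values corresponding to padded bigrams of 'cat' and 'hat' in a 784-token population. Zero marginals are not handled.

```python
def baulieu_ii_sim(a, b, c, d):
    """Baulieu II similarity from 2x2 confusion-table counts (sketch)."""
    numerator = (a * d) ** 2
    denominator = (a + b) * (a + c) * (b + d) * (c + d)
    return numerator / denominator


print(baulieu_ii_sim(2, 2, 2, 778))
print(baulieu_ii_sim(0, 5, 5, 774))  # no shared tokens gives 0.0
```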
Initialize BaulieuII instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
-
sim
(src, tar)[source]¶ Return the Baulieu II similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Baulieu II similarity
- Return type
float
Examples
>>> cmp = BaulieuII()
>>> cmp.sim('cat', 'hat')
0.24871959237343852
>>> cmp.sim('Niall', 'Neil')
0.13213719608444902
>>> cmp.sim('aluminum', 'Catalan')
0.013621892326789235
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
-
class
abydos.distance.
BaulieuIII
(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Baulieu III distance.
For two sets X and Y and a population N, Baulieu III distance [Bau89] is
\[dist_{BaulieuIII}(X, Y) = \frac{|N|^2 - 4(|X \cap Y| \cdot |(N \setminus X) \setminus Y| - |X \setminus Y| \cdot |Y \setminus X|)}{2 \cdot |N|^2}\]
This is based on Baulieu's 20th dissimilarity coefficient.
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[dist_{BaulieuIII} = \frac{n^2 - 4(ad-bc)}{2n^2}\]
Notes
This distance is exactly half of Baulieu's 20th dissimilarity coefficient. According to [Bau89], the 20th dissimilarity should take values in the range [0.0, 1.0], meeting the article's (P1) property, but the formula as given ranges over [0.0, 2.0], so dividing by 2 corrects the formula to meet the article's expectations.
New in version 0.4.0.
Initialize BaulieuIII instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- dist(src, tar)
Return the Baulieu III distance of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Baulieu III distance
- Return type
float
Examples
>>> cmp = BaulieuIII()
>>> cmp.dist('cat', 'hat')
0.4949500208246564
>>> cmp.dist('Niall', 'Neil')
0.4949955747605165
>>> cmp.dist('aluminum', 'Catalan')
0.49768591017891195
>>> cmp.dist('ATCG', 'TAGC')
0.5000813463140358
New in version 0.4.0.
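To see why the halved formula stays within [0.0, 1.0], it helps to evaluate it at its extremes. This is a sketch under stated assumptions: `baulieu_iii_dist` is a hypothetical helper taking confusion-table counts directly, n is assumed to be positive, and the counts used for 'cat'/'hat' and 'ATCG'/'TAGC' are inferred from the documented outputs rather than taken from the library.

```python
def baulieu_iii_dist(a, b, c, d):
    """Baulieu III distance from 2x2 confusion table counts (hypothetical helper)."""
    n = a + b + c + d  # assumed to be positive
    return (n * n - 4 * (a * d - b * c)) / (2 * n * n)


print(baulieu_iii_dist(2, 2, 2, 778))  # assumed counts for 'cat' vs. 'hat'
print(baulieu_iii_dist(0, 5, 5, 774))  # assumed counts for 'ATCG' vs. 'TAGC'
# Extremes: perfect agreement with a == d, and perfect disagreement with b == c.
print(baulieu_iii_dist(5, 0, 0, 5))
print(baulieu_iii_dist(0, 5, 5, 0))
```

Values hover near 0.5 for short strings because d dominates the table, and the result only exceeds 0.5 when bc > ad.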
- class abydos.distance.BaulieuIV(alphabet=None, tokenizer=None, intersection_type='crisp', positive_irrational=2.718281828459045, **kwargs)
Bases: abydos.distance._token_distance._TokenDistance
Baulieu IV distance.
For two sets X and Y, a population N, and a positive irrational number k, Baulieu IV distance [Bau97] is
\[dist_{BaulieuIV}(X, Y) = \frac{|X \setminus Y| + |Y \setminus X| - (|X \cap Y| + \frac{1}{2}) \cdot (|(N \setminus X) \setminus Y| + \frac{1}{2}) \cdot |(N \setminus X) \setminus Y| \cdot k}{|N|}\]
This is Baulieu's 22nd dissimilarity coefficient.
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[dist_{BaulieuIV} = \frac{b+c-(a+\frac{1}{2})(d+\frac{1}{2})dk}{n}\]
Notes
The default value of k is Euler's number \(e\), but other irrationals such as \(\pi\) or \(\sqrt{2}\) could be substituted at initialization via the positive_irrational argument.
New in version 0.4.0.
Initialize BaulieuIV instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- dist(src, tar)
Return the normalized Baulieu IV distance of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Normalized Baulieu IV distance
- Return type
float
Examples
>>> cmp = BaulieuIV()
>>> cmp.dist('cat', 'hat')
0.49999799606535283
>>> cmp.dist('Niall', 'Neil')
0.49999801148659684
>>> cmp.dist('aluminum', 'Catalan')
0.49999883126809364
>>> cmp.dist('ATCG', 'TAGC')
0.4999996033268451
New in version 0.4.0.
- dist_abs(src, tar)
Return the Baulieu IV distance of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Baulieu IV distance
- Return type
float
Examples
>>> cmp = BaulieuIV()
>>> cmp.dist_abs('cat', 'hat')
-5249.96272285802
>>> cmp.dist_abs('Niall', 'Neil')
-5209.561726488335
>>> cmp.dist_abs('aluminum', 'Catalan')
-3073.6070822721244
>>> cmp.dist_abs('ATCG', 'TAGC')
-1039.2151656463932
New in version 0.4.0.
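The large negative values returned by dist_abs follow directly from the confusion-table formula, since the product term grows with d squared. Below is a rough sketch of the unnormalized form only: `baulieu_iv_dist_abs` is a hypothetical helper, the counts are assumptions inferred from the documented outputs, and the normalization applied by dist is not reproduced here.

```python
import math


def baulieu_iv_dist_abs(a, b, c, d, k=math.e):
    """Unnormalized Baulieu IV distance from 2x2 counts (hypothetical helper)."""
    n = a + b + c + d  # assumed to be positive
    return (b + c - (a + 0.5) * (d + 0.5) * d * k) / n


print(baulieu_iv_dist_abs(2, 2, 2, 778))  # assumed counts for 'cat' vs. 'hat'
print(baulieu_iv_dist_abs(0, 5, 5, 774))  # assumed counts for 'ATCG' vs. 'TAGC'
# Swapping in another positive irrational scales the dominant product term.
print(baulieu_iv_dist_abs(2, 2, 2, 778, k=math.pi))
```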
- class abydos.distance.BaulieuV(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)
Bases: abydos.distance._token_distance._TokenDistance
Baulieu V distance.
For two sets X and Y and a population N, Baulieu V distance [Bau97] is
\[dist_{BaulieuV}(X, Y) = \frac{|X \setminus Y| + |Y \setminus X| + 1}{|X \cap Y| + |X \setminus Y| + |Y \setminus X| + 1}\]
This is Baulieu's 23rd dissimilarity coefficient. This coefficient fails Baulieu's (P2) property, that \(D(a,0,0,0) = 0\). Rather, \(D(a,0,0,0) > 0\), but \(\lim_{a \to \infty} D(a,0,0,0) = 0\).
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[dist_{BaulieuV} = \frac{b+c+1}{a+b+c+1}\]
New in version 0.4.0.
Initialize BaulieuV instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- dist(src, tar)
Return the Baulieu V distance of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Baulieu V distance
- Return type
float
Examples
>>> cmp = BaulieuV()
>>> cmp.dist('cat', 'hat')
0.7142857142857143
>>> cmp.dist('Niall', 'Neil')
0.8
>>> cmp.dist('aluminum', 'Catalan')
0.9411764705882353
>>> cmp.dist('ATCG', 'TAGC')
1.0
New in version 0.4.0.
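The (P2) failure is easy to observe numerically: for identical inputs the distance is 1/(a+1), which shrinks toward 0 but never reaches it. A small sketch, assuming a hypothetical helper `baulieu_v_dist` that takes the counts a, b, c directly (the 'cat'/'hat' counts are inferred from the documented output):

```python
def baulieu_v_dist(a, b, c):
    """Baulieu V distance from confusion table counts (hypothetical helper)."""
    return (b + c + 1) / (a + b + c + 1)


print(baulieu_v_dist(2, 2, 2))  # assumed counts for 'cat' vs. 'hat'
# D(a,0,0,0) for growing a: positive, and approaching 0 only in the limit.
for a in (1, 9, 99):
    print(a, baulieu_v_dist(a, 0, 0))
```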
- class abydos.distance.BaulieuVI(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)
Bases: abydos.distance._token_distance._TokenDistance
Baulieu VI distance.
For two sets X and Y and a population N, Baulieu VI distance [Bau97] is
\[dist_{BaulieuVI}(X, Y) = \frac{|X \setminus Y| + |Y \setminus X|} {|X \cap Y| + |X \setminus Y| + |Y \setminus X| + 1}\]
This is Baulieu's 24th dissimilarity coefficient. This coefficient fails Baulieu's (P3) property, that \(D(a,b,c,d) = 1\) for some (a,b,c,d). Rather, \(D(a,b,c,d) < 1\), but \(\lim_{b \to \infty, c \to \infty} D(a,b,c,d) = 1\) for \(a = 0\).
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[dist_{BaulieuVI} = \frac{b+c}{a+b+c+1}\]
New in version 0.4.0.
Initialize BaulieuVI instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- dist(src, tar)
Return the Baulieu VI distance of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Baulieu VI distance
- Return type
float
Examples
>>> cmp = BaulieuVI()
>>> cmp.dist('cat', 'hat')
0.5714285714285714
>>> cmp.dist('Niall', 'Neil')
0.7
>>> cmp.dist('aluminum', 'Catalan')
0.8823529411764706
>>> cmp.dist('ATCG', 'TAGC')
0.9090909090909091
New in version 0.4.0.
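Because of the +1 in the denominator, completely disjoint inputs never reach a distance of exactly 1, as the 'ATCG'/'TAGC' example shows. The sketch below illustrates this with a hypothetical helper `baulieu_vi_dist`; the counts passed in are assumptions inferred from the documented outputs.

```python
def baulieu_vi_dist(a, b, c):
    """Baulieu VI distance from confusion table counts (hypothetical helper)."""
    return (b + c) / (a + b + c + 1)


print(baulieu_vi_dist(2, 2, 2))  # assumed counts for 'cat' vs. 'hat'
print(baulieu_vi_dist(0, 5, 5))  # assumed counts for 'ATCG' vs. 'TAGC'
# With a = 0 the value creeps toward 1 as b and c grow, without reaching it.
print(baulieu_vi_dist(0, 10**6, 10**6))
```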
- class abydos.distance.BaulieuVII(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)
Bases: abydos.distance._token_distance._TokenDistance
Baulieu VII distance.
For two sets X and Y and a population N, Baulieu VII distance [Bau97] is
\[dist_{BaulieuVII}(X, Y) = \frac{|X \setminus Y| + |Y \setminus X|} {|N| + |X \cap Y| \cdot (|X \cap Y| - 4)^2}\]
This is Baulieu's 25th dissimilarity coefficient. This coefficient fails Baulieu's (P4) property, that \(D(a+1,b,c,d) \leq D(a,b,c,d)\), with equality holding iff \(D(a,b,c,d) = 0\).
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[dist_{BaulieuVII} = \frac{b+c}{n + a \cdot (a-4)^2}\]
New in version 0.4.0.
Initialize BaulieuVII instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- dist(src, tar)
Return the Baulieu VII distance of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Baulieu VII distance
- Return type
float
Examples
>>> cmp = BaulieuVII()
>>> cmp.dist('cat', 'hat')
0.005050505050505051
>>> cmp.dist('Niall', 'Neil')
0.008838383838383838
>>> cmp.dist('aluminum', 'Catalan')
0.018891687657430732
>>> cmp.dist('ATCG', 'TAGC')
0.012755102040816327
New in version 0.4.0.
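The (P4) failure comes from the term a(a-4)^2, which shrinks as a moves from 2 toward 4, so adding a shared token can increase the distance. A minimal sketch, assuming a hypothetical helper `baulieu_vii_dist` and counts for 'cat' vs. 'hat' inferred from the documented output:

```python
def baulieu_vii_dist(a, b, c, d):
    """Baulieu VII distance from 2x2 confusion table counts (hypothetical helper)."""
    n = a + b + c + d  # assumed to be positive
    return (b + c) / (n + a * (a - 4) ** 2)


base = baulieu_vii_dist(2, 2, 2, 778)    # 4 / (784 + 8)
bumped = baulieu_vii_dist(3, 2, 2, 778)  # 4 / (785 + 3)
print(base, bumped)
# One more shared token raised the distance, which is the (P4) failure.
print(bumped > base)
```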
- class abydos.distance.BaulieuVIII(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)
Bases: abydos.distance._token_distance._TokenDistance
Baulieu VIII distance.
For two sets X and Y and a population N, Baulieu VIII distance [Bau97] is
\[dist_{BaulieuVIII}(X, Y) = \frac{(|X \setminus Y| - |Y \setminus X|)^2}{|N|^2}\]
This is Baulieu's 26th dissimilarity coefficient. This coefficient fails Baulieu's (P5) property, that \(D(a,b+1,c,d) \geq D(a,b,c,d)\), with equality holding if \(D(a,b,c,d) = 1\).
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[dist_{BaulieuVIII} = \frac{(b-c)^2}{n^2}\]
New in version 0.4.0.
Initialize BaulieuVIII instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- dist(src, tar)
Return the Baulieu VIII distance of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Baulieu VIII distance
- Return type
float
Examples
>>> cmp = BaulieuVIII()
>>> cmp.dist('cat', 'hat')
0.0
>>> cmp.dist('Niall', 'Neil')
1.6269262807163682e-06
>>> cmp.dist('aluminum', 'Catalan')
1.6227838857560144e-06
>>> cmp.dist('ATCG', 'TAGC')
0.0
New in version 0.4.0.
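Since only the difference b-c enters the numerator, any pair with equally many unmatched tokens on each side has distance 0, which explains the 0.0 results in the examples. The sketch below uses a hypothetical helper `baulieu_viii_dist`; the counts are assumptions inferred from the documented outputs.

```python
def baulieu_viii_dist(a, b, c, d):
    """Baulieu VIII distance from 2x2 confusion table counts (hypothetical helper)."""
    n = a + b + c + d  # assumed to be positive
    return (b - c) ** 2 / n ** 2


print(baulieu_viii_dist(2, 2, 2, 778))  # 'cat' vs. 'hat' (assumed): b == c
print(baulieu_viii_dist(2, 4, 3, 775))  # 'Niall' vs. 'Neil' (assumed): 1 / 784**2
# (P5) failure: raising b from 3 to 4 with c = 4 lowers the distance to 0.
print(baulieu_viii_dist(2, 3, 4, 775), baulieu_viii_dist(2, 4, 4, 775))
```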
- class abydos.distance.BaulieuIX(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)
Bases: abydos.distance._token_distance._TokenDistance
Baulieu IX distance.
For two sets X and Y and a population N, Baulieu IX distance [Bau97] is
\[dist_{BaulieuIX}(X, Y) = \frac{|X \setminus Y| + 2 \cdot |Y \setminus X|}{|N| + |Y \setminus X|}\]
This is Baulieu's 27th dissimilarity coefficient. This coefficient fails Baulieu's (P7) property, that \(D(a,b,c,d) = D(a,c,b,d)\).
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[dist_{BaulieuIX} = \frac{b+2c}{a+b+2c+d}\]
New in version 0.4.0.
Initialize BaulieuIX instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- dist(src, tar)
Return the Baulieu IX distance of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Baulieu IX distance
- Return type
float
Examples
>>> cmp = BaulieuIX()
>>> cmp.dist('cat', 'hat')
0.007633587786259542
>>> cmp.dist('Niall', 'Neil')
0.012706480304955527
>>> cmp.dist('aluminum', 'Catalan')
0.027777777777777776
>>> cmp.dist('ATCG', 'TAGC')
0.019011406844106463
New in version 0.4.0.
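The (P7) failure means the result depends on argument order: tokens unique to the target are weighted twice. A short sketch of this asymmetry, using a hypothetical helper `baulieu_ix_dist` and counts for 'Niall' vs. 'Neil' that are assumptions inferred from the documented output:

```python
def baulieu_ix_dist(a, b, c, d):
    """Baulieu IX distance from 2x2 confusion table counts (hypothetical helper)."""
    # The denominator a+b+2c+d is assumed to be positive.
    return (b + 2 * c) / (a + b + 2 * c + d)


forward = baulieu_ix_dist(2, 4, 3, 775)   # 'Niall' vs. 'Neil' (assumed counts)
backward = baulieu_ix_dist(2, 3, 4, 775)  # same table with b and c swapped
print(forward, backward)
print(forward == backward)  # False: the measure is not symmetric
```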
- class abydos.distance.BaulieuX(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)
Bases: abydos.distance._token_distance._TokenDistance
Baulieu X distance.
For two sets X and Y and a population N, Baulieu X distance [Bau97] is
\[dist_{BaulieuX}(X, Y) = \frac{|X \setminus Y| + |Y \setminus X| + \max(|X \setminus Y|, |Y \setminus X|)}{|N| + \max(|X \setminus Y|, |Y \setminus X|)}\]
This is Baulieu's 28th dissimilarity coefficient. This coefficient fails Baulieu's (P8) property, that \(D\) is a rational function whose numerator and denominator are both (total) linear.
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[dist_{BaulieuX} = \frac{b+c+\max(b,c)}{n+\max(b,c)}\]
New in version 0.4.0.
Initialize BaulieuX instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- dist(src, tar)
Return the Baulieu X distance of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Baulieu X distance
- Return type
float
Examples
>>> cmp = BaulieuX()
>>> cmp.dist('cat', 'hat')
0.007633587786259542
>>> cmp.dist('Niall', 'Neil')
0.013959390862944163
>>> cmp.dist('aluminum', 'Catalan')
0.029003783102143757
>>> cmp.dist('ATCG', 'TAGC')
0.019011406844106463
New in version 0.4.0.
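Replacing the fixed double weight on c with max(b, c) restores symmetry at the cost of linearity. The following sketch, with a hypothetical helper `baulieu_x_dist` and counts inferred from the documented outputs (an assumption), shows that swapping b and c leaves the value unchanged:

```python
def baulieu_x_dist(a, b, c, d):
    """Baulieu X distance from 2x2 confusion table counts (hypothetical helper)."""
    n = a + b + c + d  # assumed to be positive
    larger = max(b, c)  # the non-linear term behind the (P8) failure
    return (b + c + larger) / (n + larger)


print(baulieu_x_dist(2, 4, 3, 775))  # 'Niall' vs. 'Neil' (assumed counts)
print(baulieu_x_dist(2, 3, 4, 775))  # b and c swapped: same value
print(baulieu_x_dist(2, 2, 2, 778))  # 'cat' vs. 'hat' (assumed counts)
```

When b equals c, as assumed for 'cat' vs. 'hat', the formula reduces to (b+2c)/(n+c), which is why that example matches the Baulieu IX value.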
- class abydos.distance.BaulieuXI(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)
Bases: abydos.distance._token_distance._TokenDistance
Baulieu XI distance.
For two sets X and Y and a population N, Baulieu XI distance [Bau97] is
\[dist_{BaulieuXI}(X, Y) = \frac{|X \setminus Y| + |Y \setminus X|} {|X \setminus Y| + |Y \setminus X| + |(N \setminus X) \setminus Y|}\]
This is Baulieu's 29th dissimilarity coefficient. This coefficient fails Baulieu's (P4) property, that \(D(a+1,b,c,d) \leq D(a,b,c,d)\), with equality holding iff \(D(a,b,c,d) = 0\).
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[dist_{BaulieuXI} = \frac{b+c}{b+c+d}\]
New in version 0.4.0.
Initialize BaulieuXI instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- dist(src, tar)
Return the Baulieu XI distance of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Baulieu XI distance
- Return type
float
Examples
>>> cmp = BaulieuXI()
>>> cmp.dist('cat', 'hat')
0.005115089514066497
>>> cmp.dist('Niall', 'Neil')
0.008951406649616368
>>> cmp.dist('aluminum', 'Catalan')
0.01913265306122449
>>> cmp.dist('ATCG', 'TAGC')
0.012755102040816327
New in version 0.4.0.
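Here the (P4) failure takes a different form than in Baulieu VII: the intersection size a does not appear in the formula at all, so adding shared tokens leaves a non-zero distance unchanged. A sketch under stated assumptions (the helper `baulieu_xi_dist` is hypothetical, its zero-denominator fallback is an assumption, and the counts are inferred from the documented outputs):

```python
def baulieu_xi_dist(a, b, c, d):
    """Baulieu XI distance from 2x2 confusion table counts (hypothetical helper)."""
    denom = b + c + d
    # Assumption: return 0.0 when there is nothing outside the intersection.
    if denom == 0:
        return 0.0
    return (b + c) / denom  # note that a is never used


print(baulieu_xi_dist(2, 2, 2, 778))  # 'cat' vs. 'hat' (assumed counts)
print(baulieu_xi_dist(3, 2, 2, 778))  # one more shared token: identical result
print(baulieu_xi_dist(2, 4, 3, 775))  # 'Niall' vs. 'Neil' (assumed counts)
```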
- class abydos.distance.BaulieuXII(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)
Bases: abydos.distance._token_distance._TokenDistance
Baulieu XII distance.
For two sets X and Y and a population N, Baulieu XII distance [Bau97] is
\[dist_{BaulieuXII}(X, Y) = \frac{|X \setminus Y| + |Y \setminus X|} {|X \cap Y| + |X \setminus Y| + |Y \setminus X| - 1}\]
This is Baulieu's 30th dissimilarity coefficient. This coefficient fails Baulieu's (P5) property, that \(D(a,b+1,c,d) \geq D(a,b,c,d)\), with equality holding if \(D(a,b,c,d) = 1\).
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[dist_{BaulieuXII} = \frac{b+c}{a+b+c-1}\]
Notes
In the special case of comparisons where the intersection (a) contains 0 members, the size of the intersection is set to 1, resulting in a distance of 1.0. This prevents the distance from exceeding 1.0 and similarity from becoming negative.
New in version 0.4.0.
Initialize BaulieuXII instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- dist(src, tar)
Return the Baulieu XII distance of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Baulieu XII distance
- Return type
float
Examples
>>> cmp = BaulieuXII()
>>> cmp.dist('cat', 'hat')
0.8
>>> cmp.dist('Niall', 'Neil')
0.875
>>> cmp.dist('aluminum', 'Catalan')
1.0
>>> cmp.dist('ATCG', 'TAGC')
1.0
New in version 0.4.0.
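The special case described in the Notes can be made concrete with a small sketch. `baulieu_xii_dist` is a hypothetical helper, not the library's code; the clamp of a to 1 follows the Notes, while the 0.0 fallback for a zero denominator and the example counts are assumptions.

```python
def baulieu_xii_dist(a, b, c):
    """Baulieu XII distance from confusion table counts (hypothetical helper)."""
    # Per the Notes: an empty intersection is treated as having size 1.
    if a == 0:
        a = 1
    denom = a + b + c - 1
    # Assumption: a == 1 with b == c == 0 leaves 0/0; return 0.0 in that case.
    if denom == 0:
        return 0.0
    return (b + c) / denom


print(baulieu_xii_dist(2, 2, 2))  # 'cat' vs. 'hat' (assumed counts)
print(baulieu_xii_dist(0, 5, 5))  # 'ATCG' vs. 'TAGC' (assumed counts): clamped
# Without the clamp, the raw formula would exceed 1 for a = 0, b = c = 1:
print((1 + 1) / (0 + 1 + 1 - 1))
```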
- class abydos.distance.BaulieuXIII(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)
Bases: abydos.distance._token_distance._TokenDistance
Baulieu XIII distance.
For two sets X and Y and a population N, Baulieu XIII distance [Bau97] is
\[dist_{BaulieuXIII}(X, Y) = \frac{|X \setminus Y| + |Y \setminus X|} {|X \cap Y| + |X \setminus Y| + |Y \setminus X| + |X \cap Y| \cdot (|X \cap Y| - 4)^2}\]
This is Baulieu's 31st dissimilarity coefficient. This coefficient fails Baulieu's (P4) property, that \(D(a+1,b,c,d) \leq D(a,b,c,d)\), with equality holding iff \(D(a,b,c,d) = 0\).
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[dist_{BaulieuXIII} = \frac{b+c}{a+b+c+a \cdot (a-4)^2}\]
New in version 0.4.0.
Initialize BaulieuXIII instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- dist(src, tar)
Return the Baulieu XIII distance of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Baulieu XIII distance
- Return type
float
Examples
>>> cmp = BaulieuXIII()
>>> cmp.dist('cat', 'hat')
0.2857142857142857
>>> cmp.dist('Niall', 'Neil')
0.4117647058823529
>>> cmp.dist('aluminum', 'Catalan')
0.6
>>> cmp.dist('ATCG', 'TAGC')
1.0
New in version 0.4.0.
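Unlike Baulieu VII, this coefficient does not involve the population size, so the non-monotonic term a(a-4)^2 has a much larger effect on the result. The sketch below is illustrative only: `baulieu_xiii_dist` is a hypothetical helper, its zero-denominator fallback is an assumption, and the counts are inferred from the documented outputs.

```python
def baulieu_xiii_dist(a, b, c):
    """Baulieu XIII distance from confusion table counts (hypothetical helper)."""
    denom = a + b + c + a * (a - 4) ** 2
    # Assumption: two empty token sets are treated as identical.
    if denom == 0:
        return 0.0
    return (b + c) / denom


print(baulieu_xiii_dist(2, 2, 2))  # 'cat' vs. 'hat' (assumed counts): 4 / 14
print(baulieu_xiii_dist(2, 4, 3))  # 'Niall' vs. 'Neil' (assumed counts): 7 / 17
# (P4) failure: going from a = 2 to a = 3 raises the distance from 4/14 to 4/10.
print(baulieu_xiii_dist(3, 2, 2))
```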
- class abydos.distance.BaulieuXIV(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)
Bases: abydos.distance._token_distance._TokenDistance
Baulieu XIV distance.
For two sets X and Y and a population N, Baulieu XIV distance [Bau97] is
\[dist_{BaulieuXIV}(X, Y) = \frac{|X \setminus Y| + 2 \cdot |Y \setminus X|}{|X \cap Y| + |X \setminus Y| + 2 \cdot |Y \setminus X|}\]
This is Baulieu's 32nd dissimilarity coefficient. This coefficient fails Baulieu's (P7) property, that \(D(a,b,c,d) = D(a,c,b,d)\).
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[dist_{BaulieuXIV} = \frac{b+2c}{a+b+2c}\]
New in version 0.4.0.
Initialize BaulieuXIV instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- dist(src, tar)
Return the Baulieu XIV distance of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Baulieu XIV distance
- Return type
float
Examples
>>> cmp = BaulieuXIV()
>>> cmp.dist('cat', 'hat')
0.75
>>> cmp.dist('Niall', 'Neil')
0.8333333333333334
>>> cmp.dist('aluminum', 'Catalan')
0.9565217391304348
>>> cmp.dist('ATCG', 'TAGC')
1.0
New in version 0.4.0.
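As a rough illustration of the order dependence noted under (P7), the confusion-table form can be evaluated with b and c exchanged. `baulieu_xiv_dist` is a hypothetical helper; the zero-denominator fallback and the counts (inferred from the documented outputs) are assumptions.

```python
def baulieu_xiv_dist(a, b, c):
    """Baulieu XIV distance from confusion table counts (hypothetical helper)."""
    denom = a + b + 2 * c
    # Assumption: two empty token sets are treated as identical.
    if denom == 0:
        return 0.0
    return (b + 2 * c) / denom


print(baulieu_xiv_dist(2, 4, 3))  # 'Niall' vs. 'Neil' (assumed counts): 10 / 12
print(baulieu_xiv_dist(2, 3, 4))  # arguments reversed: 11 / 13
print(baulieu_xiv_dist(0, 5, 5))  # no shared tokens: 1.0
```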
- class abydos.distance.BaulieuXV(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)
Bases: abydos.distance._token_distance._TokenDistance
Baulieu XV distance.
For two sets X and Y and a population N, Baulieu XV distance [Bau97] is
\[dist_{BaulieuXV}(X, Y) = \frac{|X \setminus Y| + |Y \setminus X| + \max(|X \setminus Y|, |Y \setminus X|)}{|X \cap Y| + |X \setminus Y| + |Y \setminus X| + \max(|X \setminus Y|, |Y \setminus X|)}\]
This is Baulieu's 33rd dissimilarity coefficient. This coefficient fails Baulieu's (P8) property, that \(D\) is a rational function whose numerator and denominator are both (total) linear.
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[dist_{BaulieuXV} = \frac{b+c+\max(b, c)}{a+b+c+\max(b, c)}\]
New in version 0.4.0.
Initialize BaulieuXV instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- dist(src, tar)
Return the Baulieu XV distance of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Baulieu XV distance
- Return type
float
Examples
>>> cmp = BaulieuXV()
>>> cmp.dist('cat', 'hat')
0.75
>>> cmp.dist('Niall', 'Neil')
0.8461538461538461
>>> cmp.dist('aluminum', 'Catalan')
0.9583333333333334
>>> cmp.dist('ATCG', 'TAGC')
1.0
New in version 0.4.0.
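A brief sketch of the confusion-table form, showing that the max term keeps the measure symmetric in b and c. The helper name `baulieu_xv_dist`, its zero-denominator fallback, and the counts (inferred from the documented outputs) are all assumptions made for illustration.

```python
def baulieu_xv_dist(a, b, c):
    """Baulieu XV distance from confusion table counts (hypothetical helper)."""
    larger = max(b, c)
    denom = a + b + c + larger
    # Assumption: two empty token sets are treated as identical.
    if denom == 0:
        return 0.0
    return (b + c + larger) / denom


print(baulieu_xv_dist(2, 4, 3))  # 'Niall' vs. 'Neil' (assumed counts): 11 / 13
print(baulieu_xv_dist(2, 3, 4))  # arguments reversed: same value
print(baulieu_xv_dist(2, 2, 2))  # 'cat' vs. 'hat' (assumed counts): 6 / 8
```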
- class abydos.distance.BeniniI(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)
Bases: abydos.distance._token_distance._TokenDistance
Benini I correlation.
For two sets X and Y and a population N, Benini I correlation, Benini's Index of Attraction, [Ben01] is
\[corr_{BeniniI}(X, Y) = \frac{|X \cap Y| \cdot |(N \setminus X) \setminus Y| - |X \setminus Y| \cdot |Y \setminus X|}{|Y| \cdot |N \setminus X|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{BeniniI} = \frac{ad-bc}{(a+c)(c+d)}\]
New in version 0.4.0.
Initialize BeniniI instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src, tar)
Return the Benini I correlation of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Benini I correlation
- Return type
float
Examples
>>> cmp = BeniniI()
>>> cmp.corr('cat', 'hat')
0.49743589743589745
>>> cmp.corr('Niall', 'Neil')
0.3953727506426735
>>> cmp.corr('aluminum', 'Catalan')
0.11485180412371133
>>> cmp.corr('ATCG', 'TAGC')
-0.006418485237483954
New in version 0.4.0.
- sim(src, tar)
Return the Benini I similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Benini I similarity
- Return type
float
Examples
>>> cmp = BeniniI()
>>> cmp.sim('cat', 'hat')
0.7487179487179487
>>> cmp.sim('Niall', 'Neil')
0.6976863753213367
>>> cmp.sim('aluminum', 'Catalan')
0.5574259020618557
>>> cmp.sim('ATCG', 'TAGC')
0.496790757381258
New in version 0.4.0.
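The relationship between corr and sim can be reproduced from the documented values. In the sketch below, `benini_i_corr` and `benini_i_sim` are hypothetical helpers; the mapping sim = (1 + corr) / 2 is an assumption inferred from the examples, not taken from the library source, as are the counts and the zero-denominator fallback.

```python
def benini_i_corr(a, b, c, d):
    """Benini I correlation from 2x2 confusion table counts (hypothetical helper)."""
    denom = (a + c) * (c + d)
    # Assumption: report no correlation when a marginal is empty.
    if denom == 0:
        return 0.0
    return (a * d - b * c) / denom


def benini_i_sim(a, b, c, d):
    """Map the correlation onto a similarity scale (assumed: (1 + corr) / 2)."""
    return (1 + benini_i_corr(a, b, c, d)) / 2


print(benini_i_corr(2, 2, 2, 778), benini_i_sim(2, 2, 2, 778))  # 'cat' vs. 'hat'
print(benini_i_corr(0, 5, 5, 774), benini_i_sim(0, 5, 5, 774))  # 'ATCG' vs. 'TAGC'
```

A negative correlation, as in the second call, arises whenever bc exceeds ad and maps to a similarity just below 0.5.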
- class abydos.distance.BeniniII(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)
Bases: abydos.distance._token_distance._TokenDistance
Benini II correlation.
For two sets X and Y and a population N, Benini II correlation, Benini's Index of Repulsion, [Ben01] is
\[corr_{BeniniII}(X, Y) = \frac{|X \cap Y| \cdot |(N \setminus X) \setminus Y| - |X \setminus Y| \cdot |Y \setminus X|} {\min(|Y| \cdot |N \setminus X|, |X| \cdot |N \setminus Y|)}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{BeniniII} = \frac{ad-bc}{\min((a+c)(c+d), (a+b)(b+d))}\]
New in version 0.4.0.
Initialize BeniniII instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
-
corr
(src, tar)[source]¶ Return the Benini II correlation of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Benini II correlation
- Return type
float
Examples
>>> cmp = BeniniII()
>>> cmp.corr('cat', 'hat')
0.49743589743589745
>>> cmp.corr('Niall', 'Neil')
0.3953727506426735
>>> cmp.corr('aluminum', 'Catalan')
0.11485180412371133
>>> cmp.corr('ATCG', 'TAGC')
-0.006418485237483954
New in version 0.4.0.
-
sim
(src, tar)[source]¶ Return the Benini II similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Benini II similarity
- Return type
float
Examples
>>> cmp = BeniniII()
>>> cmp.sim('cat', 'hat')
0.7487179487179487
>>> cmp.sim('Niall', 'Neil')
0.6976863753213367
>>> cmp.sim('aluminum', 'Catalan')
0.5574259020618557
>>> cmp.sim('ATCG', 'TAGC')
0.496790757381258
New in version 0.4.0.
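The BeniniII formula can be checked by hand with a small standalone sketch. This is not the library's implementation: the helper names `qgrams`, `confusion_table`, and `benini_ii_corr` are hypothetical, and two assumptions are inferred from the doctest values rather than from the source code, namely padded bigrams with a population of 784, and d computed as the population minus the number of distinct q-grams in either string.

```python
from collections import Counter


def qgrams(s, q=2):
    """Count q-grams of s, padded with q-1 '$' start and '#' stop symbols."""
    padded = '$' * (q - 1) + s + '#' * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def confusion_table(src, tar, population=784):
    """Return (a, b, c, d); population=784 is an assumed default."""
    x, y = qgrams(src), qgrams(tar)
    a = sum((x & y).values())      # q-grams in both strings
    b = sum((x - y).values())      # q-grams only in src
    c = sum((y - x).values())      # q-grams only in tar
    d = population - len(x | y)    # assumed: distinct q-grams in neither
    return a, b, c, d


def benini_ii_corr(src, tar):
    """(ad - bc) / min((a+c)(c+d), (a+b)(b+d)); no zero-division guard."""
    a, b, c, d = confusion_table(src, tar)
    return (a * d - b * c) / min((a + c) * (c + d), (a + b) * (b + d))


print(benini_ii_corr('cat', 'hat'))
```

Under these assumptions, 'cat' and 'hat' give a=2, b=2, c=2, d=778, so the correlation is 1552/3120, in line with the documented value of about 0.4974.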
-
class
abydos.distance.
Bennet
(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Bennet's S correlation.
For two sets X and Y and a population N, Bennet's \(S\) correlation [BAG54] is
\[corr_{Bennet}(X, Y) = S = \frac{p_o - p_e^S}{1 - p_e^S}\]
where
\[ \begin{align}\begin{aligned}p_o = \frac{|X \cap Y| + |(N \setminus X) \setminus Y|}{|N|}\\p_e^S = \frac{1}{2}\end{aligned}\end{align} \]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[ \begin{align}\begin{aligned}p_o = \frac{a+d}{n}\\p_e^S = \frac{1}{2}\end{aligned}\end{align} \]
New in version 0.4.0.
Initialize Bennet instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
-
corr
(src, tar)[source]¶ Return the Bennet's S correlation of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Bennet's S correlation
- Return type
float
Examples
>>> cmp = Bennet()
>>> cmp.corr('cat', 'hat')
0.989795918367347
>>> cmp.corr('Niall', 'Neil')
0.9821428571428572
>>> cmp.corr('aluminum', 'Catalan')
0.9617834394904459
>>> cmp.corr('ATCG', 'TAGC')
0.9744897959183674
New in version 0.4.0.
-
sim
(src, tar)[source]¶ Return the Bennet's S similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Bennet's S similarity
- Return type
float
Examples
>>> cmp = Bennet()
>>> cmp.sim('cat', 'hat')
0.9948979591836735
>>> cmp.sim('Niall', 'Neil')
0.9910714285714286
>>> cmp.sim('aluminum', 'Catalan')
0.9808917197452229
>>> cmp.sim('ATCG', 'TAGC')
0.9872448979591837
New in version 0.4.0.
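Because \(p_e^S\) is fixed at 1/2, Bennet's S reduces to \(2p_o - 1\), and the similarity values in the examples are consistent with mapping the correlation to \((1 + S)/2\). The sketch below illustrates this; the function names are hypothetical, and the population of 784 and the way d is counted are assumptions inferred from the doctest values, not taken from the library source.

```python
from collections import Counter


def qgrams(s, q=2):
    """Count q-grams of s, padded with q-1 '$' start and '#' stop symbols."""
    padded = '$' * (q - 1) + s + '#' * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def confusion_table(src, tar, population=784):
    """Return (a, b, c, d); population=784 is an assumed default."""
    x, y = qgrams(src), qgrams(tar)
    a = sum((x & y).values())
    b = sum((x - y).values())
    c = sum((y - x).values())
    d = population - len(x | y)    # assumed: distinct q-grams in neither
    return a, b, c, d


def bennet_corr(src, tar):
    """Bennet's S with chance agreement fixed at 1/2."""
    a, b, c, d = confusion_table(src, tar)
    p_o = (a + d) / (a + b + c + d)   # observed agreement
    p_e = 0.5
    return (p_o - p_e) / (1 - p_e)


def bennet_sim(src, tar):
    """Assumed mapping of the correlation from [-1, 1] onto [0, 1]."""
    return (1 + bennet_corr(src, tar)) / 2


print(bennet_corr('cat', 'hat'), bennet_sim('cat', 'hat'))
```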
-
class
abydos.distance.
BraunBlanquet
(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Braun-Blanquet similarity.
For two sets X and Y and a population N, the Braun-Blanquet similarity [BB32] is
\[sim_{BraunBlanquet}(X, Y) = \frac{|X \cap Y|}{max(|X|, |Y|)}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{BraunBlanquet} = \frac{a}{max(a+b, a+c)}\]
New in version 0.4.0.
Initialize BraunBlanquet instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
-
sim
(src, tar)[source]¶ Return the Braun-Blanquet similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Braun-Blanquet similarity
- Return type
float
Examples
>>> cmp = BraunBlanquet()
>>> cmp.sim('cat', 'hat')
0.5
>>> cmp.sim('Niall', 'Neil')
0.3333333333333333
>>> cmp.sim('aluminum', 'Catalan')
0.1111111111111111
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
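The Braun-Blanquet formula needs only the two token multisets, so it is easy to reproduce outside the library. The following is a minimal sketch, assuming bigrams padded with '$' and '#'; the names `qgrams` and `braun_blanquet_sim` are hypothetical and not part of abydos.

```python
from collections import Counter


def qgrams(s, q=2):
    """Count q-grams of s, padded with q-1 '$' start and '#' stop symbols."""
    padded = '$' * (q - 1) + s + '#' * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def braun_blanquet_sim(src, tar):
    """|X & Y| / max(|X|, |Y|) over multisets of q-grams."""
    x, y = qgrams(src), qgrams(tar)
    shared = sum((x & y).values())          # a
    return shared / max(sum(x.values()), sum(y.values()))


print(braun_blanquet_sim('cat', 'hat'))     # 2 shared of 4 bigrams each
```

Dividing by the larger of the two sizes makes this the stricter counterpart of overlap measures that divide by the smaller size.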
-
class
abydos.distance.
Canberra
(tokenizer=None, intersection_type='crisp', **kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Canberra distance.
For two sets X and Y, the Canberra distance [LW66][LW67b] is
\[dist_{Canberra}(X, Y) = \frac{|X \triangle Y|}{|X|+|Y|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[dist_{Canberra} = \frac{b+c}{(a+b)+(a+c)}\]
New in version 0.4.0.
Initialize Canberra instance.
- Parameters
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
-
dist
(src, tar)[source]¶ Return the Canberra distance of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Canberra distance
- Return type
float
Examples
>>> cmp = Canberra()
>>> cmp.dist('cat', 'hat')
0.5
>>> cmp.dist('Niall', 'Neil')
0.6363636363636364
>>> cmp.dist('aluminum', 'Catalan')
0.8823529411764706
>>> cmp.dist('ATCG', 'TAGC')
1.0
New in version 0.4.0.
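The Canberra distance in its set form is the size of the symmetric difference over the combined size of both token multisets. A minimal sketch follows, assuming bigrams padded with '$' and '#'; `qgrams` and `canberra_dist` are hypothetical names, not library functions.

```python
from collections import Counter


def qgrams(s, q=2):
    """Count q-grams of s, padded with q-1 '$' start and '#' stop symbols."""
    padded = '$' * (q - 1) + s + '#' * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def canberra_dist(src, tar):
    """(b + c) / ((a + b) + (a + c)) over multisets of q-grams."""
    x, y = qgrams(src), qgrams(tar)
    b = sum((x - y).values())   # q-grams only in src
    c = sum((y - x).values())   # q-grams only in tar
    return (b + c) / (sum(x.values()) + sum(y.values()))


print(canberra_dist('Niall', 'Neil'))   # 7 unshared of 11 bigrams in total
```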
-
class
abydos.distance.
Cao
(**kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Cao's CY dissimilarity.
Given \(X_{ij}\) (the number of individuals of species \(j\) in sample \(i\)), \(X_{kj}\) (the number of individuals of species \(j\) in sample \(k\)), and \(N\) (the total number of species present in both samples), Cao dissimilarity (CYd) [CBW97] is:
\[dist_{Cao}(X, Y) = CYd = \frac{1}{N}\sum\Bigg(\frac{(X_{ij} + X_{kj})log_{10}\big( \frac{X_{ij}+X_{kj}}{2}\big)-X_{ij}log_{10}X_{kj}-X_{kj}log_{10}X_{ij}} {X_{ij}+X_{kj}}\Bigg)\]
In the above formula, whenever \(X_{ij} = 0\) or \(X_{kj} = 0\), the value 0.1 is substituted.
Since this measure ranges from 0 to \(\infty\), a similarity measure, CYs, ranging from 0 to 1 was also developed.
\[sim_{Cao}(X, Y) = CYs = 1 - \frac{Observed~CYd}{Maximum~CYd}\]
where
\[Observed~CYd = \sum\Bigg(\frac{(X_{ij} + X_{kj})log_{10}\big( \frac{X_{ij}+X_{kj}}{2}\big)-X_{ij}log_{10}X_{kj}-X_{kj}log_{10}X_{ij}} {X_{ij}+X_{kj}}\Bigg)\]
and with \(a\) (the number of species present in both samples), \(b\) (the number of species present in sample \(i\) only), and \(c\) (the number of species present in sample \(k\) only),
\[Maximum~CYd = D_1 + D_2 + D_3\]
with
\[ \begin{align}\begin{aligned}D_1 = \sum_{j=1}^b \Bigg(\frac{(X_{ij} + 0.1) log_{10} \big( \frac{X_{ij}+0.1}{2}\big)-X_{ij}log_{10}0.1-0.1log_{10}X_{ij}} {X_{ij}+0.1}\Bigg)\\D_2 = \sum_{j=1}^c \Bigg(\frac{(X_{kj} + 0.1) log_{10} \big( \frac{X_{kj}+0.1}{2}\big)-X_{kj}log_{10}0.1-0.1log_{10}X_{kj}} {X_{kj}+0.1}\Bigg)\\D_3 = \sum_{j=1}^a \frac{a}{2} \Bigg(\frac{(D_i + 1) log_{10} \big(\frac{D_i+1}{2}\big)-log_{10}D_i}{D_i+1} + \frac{(D_k + 1) log_{10} \big(\frac{D_k+1}{2}\big)-log_{10}D_k}{D_k+1}\Bigg)\end{aligned}\end{align} \]
with
\[ \begin{align}\begin{aligned}D_i = \frac{\sum X_{ij} - \frac{a}{2}}{\frac{a}{2}}\\D_k = \frac{\sum X_{kj} - \frac{a}{2}}{\frac{a}{2}}\end{aligned}\end{align} \]
for
\[ \begin{align}\begin{aligned}X_{ij} \geq 1\\X_{kj} \geq 1\end{aligned}\end{align} \]
New in version 0.4.1.
Initialize Cao instance.
- Parameters
**kwargs -- Arbitrary keyword arguments
New in version 0.4.1.
-
dist_abs
(src, tar)[source]¶ Return Cao's CY dissimilarity (CYd) of two strings.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
Cao's CY dissimilarity
- Return type
float
Examples
>>> cmp = Cao()
>>> cmp.dist_abs('cat', 'hat')
0.3247267992925765
>>> cmp.dist_abs('Niall', 'Neil')
0.4132886536450973
>>> cmp.dist_abs('aluminum', 'Catalan')
0.5530666041976232
>>> cmp.dist_abs('ATCG', 'TAGC')
0.6494535985851531
New in version 0.4.1.
-
sim
(src, tar)[source]¶ Return Cao's CY similarity (CYs) of two strings.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
Cao's CY similarity
- Return type
float
Examples
>>> cmp = Cao()
>>> cmp.sim('cat', 'hat')
0.0
>>> cmp.sim('Niall', 'Neil')
0.0
>>> cmp.sim('aluminum', 'Catalan')
0.0
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.1.
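The CYd sum in the Cao class description can be worked through with a short sketch that treats each distinct bigram as a species and substitutes 0.1 for zero counts. This is only an illustration: `qgrams` and `cao_dist_abs` are hypothetical names, and taking \(N\) as the total number of bigrams in both strings is an assumption inferred from the `dist_abs` doctest values, not a statement about the library's code.

```python
import math
from collections import Counter


def qgrams(s, q=2):
    """Count q-grams of s, padded with q-1 '$' start and '#' stop symbols."""
    padded = '$' * (q - 1) + s + '#' * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def cao_dist_abs(src, tar):
    """Sum the per-token CYd terms and divide by the assumed N."""
    x, y = qgrams(src), qgrams(tar)
    total = 0.0
    for tok in x.keys() | y.keys():
        xi = x[tok] or 0.1   # zero counts are replaced by 0.1
        yi = y[tok] or 0.1
        total += (
            (xi + yi) * math.log10((xi + yi) / 2)
            - xi * math.log10(yi)
            - yi * math.log10(xi)
        ) / (xi + yi)
    n = sum(x.values()) + sum(y.values())   # assumed meaning of N
    return total / n


print(cao_dist_abs('cat', 'hat'))
```

A token with equal counts in both strings contributes 0 to the sum, so only unshared or unevenly shared bigrams increase the dissimilarity.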
-
class
abydos.distance.
ChaoDice
(**kwargs)[source]¶ Bases:
abydos.distance._chao_jaccard.ChaoJaccard
Chao's Dice similarity.
Chao's Dice similarity [CCCS04]
New in version 0.4.1.
Initialize ChaoDice instance.
- Parameters
**kwargs -- Arbitrary keyword arguments
New in version 0.4.1.
-
sim
(src, tar)[source]¶ Return the normalized Chao's Dice similarity of two strings.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
Normalized Chao's Dice similarity
- Return type
float
Examples
>>> import random
>>> random.seed(0)
>>> cmp = ChaoDice()
>>> cmp.sim('cat', 'hat')
0.36666666666666664
>>> cmp.sim('Niall', 'Neil')
0.27868852459016397
>>> cmp.sim('aluminum', 'Catalan')
0.0
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.1.
-
sim_score
(src, tar)[source]¶ Return the Chao's Dice similarity of two strings.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
Chao's Dice similarity
- Return type
float
Examples
>>> import random
>>> random.seed(0)
>>> cmp = ChaoDice()
>>> cmp.sim_score('cat', 'hat')
0.36666666666666664
>>> cmp.sim_score('Niall', 'Neil')
0.27868852459016397
>>> cmp.sim_score('aluminum', 'Catalan')
0.0
>>> cmp.sim_score('ATCG', 'TAGC')
0.0
New in version 0.4.1.
-
class
abydos.distance.
ChaoJaccard
(**kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Chao's Jaccard similarity.
Chao's Jaccard similarity [CCCS04]
New in version 0.4.1.
Initialize ChaoJaccard instance.
- Parameters
**kwargs -- Arbitrary keyword arguments
New in version 0.4.1.
-
sim
(src, tar)[source]¶ Return normalized Chao's Jaccard similarity of two strings.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
Normalized Chao's Jaccard similarity
- Return type
float
Examples
>>> import random
>>> random.seed(0)
>>> cmp = ChaoJaccard()
>>> cmp.sim('cat', 'hat')
0.22448979591836735
>>> cmp.sim('Niall', 'Neil')
0.1619047619047619
>>> cmp.sim('aluminum', 'Catalan')
0.0
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.1.
-
sim_score
(src, tar)[source]¶ Return Chao's Jaccard similarity of two strings.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
Chao's Jaccard similarity
- Return type
float
Examples
>>> import random
>>> random.seed(0)
>>> cmp = ChaoJaccard()
>>> cmp.sim_score('cat', 'hat')
0.22448979591836735
>>> cmp.sim_score('Niall', 'Neil')
0.1619047619047619
>>> cmp.sim_score('aluminum', 'Catalan')
0.0
>>> cmp.sim_score('ATCG', 'TAGC')
0.0
New in version 0.4.1.
-
class
abydos.distance.
Chebyshev
(alphabet=0, tokenizer=None, intersection_type='crisp', **kwargs)[source]¶ Bases:
abydos.distance._minkowski.Minkowski
Chebyshev distance.
Chebyshev distance is the chessboard distance, equivalent to Minkowski distance in \(L^\infty\)-space.
New in version 0.3.6.
Initialize Chebyshev instance.
- Parameters
alphabet (collection or int) -- The values or size of the alphabet
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
-
dist
(*args, **kwargs)[source]¶ Raise exception when called.
- Parameters
*args -- Variable length argument list
**kwargs -- Arbitrary keyword arguments
- Raises
NotImplementedError -- Method disabled for Chebyshev distance
New in version 0.3.6.
-
dist_abs
(src, tar)[source]¶ Return the Chebyshev distance between two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
The Chebyshev distance
- Return type
float
Examples
>>> cmp = Chebyshev()
>>> cmp.dist_abs('cat', 'hat')
1.0
>>> cmp.dist_abs('Niall', 'Neil')
1.0
>>> cmp.dist_abs('Colin', 'Cuilen')
1.0
>>> cmp.dist_abs('ATCG', 'TAGC')
1.0
>>> cmp = Chebyshev(qval=1)
>>> cmp.dist_abs('ATCG', 'TAGC')
0.0
>>> cmp.dist_abs('ATCGATTCGGAATTTC', 'TAGCATAATCGCCG')
3.0
New in version 0.3.0.
Changed in version 0.3.6: Encapsulated in class
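As the \(L^\infty\) case of Minkowski distance, the Chebyshev distance is the largest absolute difference between the two q-gram count vectors. A minimal sketch, assuming padded q-grams (no padding when q is 1); `qgrams` and `chebyshev_dist_abs` are hypothetical names rather than library functions.

```python
from collections import Counter


def qgrams(s, q=2):
    """Count q-grams of s, padded with q-1 '$' start and '#' stop symbols."""
    padded = '$' * (q - 1) + s + '#' * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def chebyshev_dist_abs(src, tar, q=2):
    """Largest per-token count difference between the two strings."""
    x, y = qgrams(src, q), qgrams(tar, q)
    diffs = (abs(x[t] - y[t]) for t in x.keys() | y.keys())
    return float(max(diffs, default=0))


# With q=1 the strings differ most in 'T': six occurrences versus three.
print(chebyshev_dist_abs('ATCGATTCGGAATTTC', 'TAGCATAATCGCCG', q=1))
```

Because only the single largest difference counts, two strings that are anagrams of each other have distance 0.0 at q=1, as in the 'ATCG' and 'TAGC' example.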
-
abydos.distance.
chebyshev
(src, tar, qval=2, alphabet=0)[source]¶ Return the Chebyshev distance between two strings.
This is a wrapper for the Chebyshev.dist_abs() method.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
qval (int) -- The length of each q-gram
alphabet (collection or int) -- The values or size of the alphabet
- Returns
The Chebyshev distance
- Return type
float
Examples
>>> chebyshev('cat', 'hat')
1.0
>>> chebyshev('Niall', 'Neil')
1.0
>>> chebyshev('Colin', 'Cuilen')
1.0
>>> chebyshev('ATCG', 'TAGC')
1.0
>>> chebyshev('ATCG', 'TAGC', qval=1)
0.0
>>> chebyshev('ATCGATTCGGAATTTC', 'TAGCATAATCGCCG', qval=1)
3.0
New in version 0.3.0.
Deprecated since version 0.4.0: This will be removed in 0.6.0. Use the Chebyshev.dist_abs method instead.
-
class
abydos.distance.
Chord
(tokenizer=None, intersection_type='crisp', **kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Chord distance.
For two sets X and Y drawn from an alphabet S, the chord distance [Orloci67] is
\[dist_{chord}(X, Y) = \sqrt{\sum_{i \in S}\Big(\frac{X_i}{\sqrt{\sum_{j \in X} X_j^2}} - \frac{Y_i}{\sqrt{\sum_{j \in Y} Y_j^2}}\Big)^2}\]
New in version 0.4.0.
Initialize Chord instance.
- Parameters
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
-
dist
(src, tar)[source]¶ Return the normalized Chord distance of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Normalized chord distance
- Return type
float
Examples
>>> cmp = Chord()
>>> cmp.dist('cat', 'hat')
0.707106781186547
>>> cmp.dist('Niall', 'Neil')
0.796775770420944
>>> cmp.dist('aluminum', 'Catalan')
0.94519820240106
>>> cmp.dist('ATCG', 'TAGC')
1.0
New in version 0.4.0.
-
dist_abs
(src, tar)[source]¶ Return the Chord distance of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Chord distance
- Return type
float
Examples
>>> cmp = Chord()
>>> cmp.dist_abs('cat', 'hat')
1.0
>>> cmp.dist_abs('Niall', 'Neil')
1.126811100699571
>>> cmp.dist_abs('aluminum', 'Catalan')
1.336712116966249
>>> cmp.dist_abs('ATCG', 'TAGC')
1.414213562373095
New in version 0.4.0.
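The chord distance is the Euclidean distance between the two count vectors after each has been scaled to unit length, so it can be at most \(\sqrt{2}\) for non-negative counts. The sketch below illustrates this; `qgrams`, `chord_dist_abs`, and `chord_dist` are hypothetical names, and dividing by \(\sqrt{2}\) to normalize is an assumption that matches the doctest values rather than a confirmed detail of the library.

```python
import math
from collections import Counter


def qgrams(s, q=2):
    """Count q-grams of s, padded with q-1 '$' start and '#' stop symbols."""
    padded = '$' * (q - 1) + s + '#' * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def chord_dist_abs(src, tar):
    """Euclidean distance between the unit-length q-gram count vectors."""
    x, y = qgrams(src), qgrams(tar)
    x_norm = math.sqrt(sum(v * v for v in x.values()))
    y_norm = math.sqrt(sum(v * v for v in y.values()))
    return math.sqrt(sum(
        (x[t] / x_norm - y[t] / y_norm) ** 2 for t in x.keys() | y.keys()
    ))


def chord_dist(src, tar):
    """Assumed normalization: divide by the maximum value sqrt(2)."""
    return chord_dist_abs(src, tar) / math.sqrt(2)


print(chord_dist_abs('Niall', 'Neil'), chord_dist('Niall', 'Neil'))
```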
-
class
abydos.distance.
Clark
(**kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Clark's coefficient of divergence.
For two sets X and Y and a population N, Clark's coefficient of divergence [Cla52] is:
\[dist_{Clark}(X, Y) = \sqrt{\frac{\sum_{i=0}^{|N|} \big(\frac{x_i-y_i}{x_i+y_i}\big)^2}{|N|}}\]
New in version 0.4.1.
Initialize Clark instance.
- Parameters
**kwargs -- Arbitrary keyword arguments
New in version 0.4.1.
-
dist
(src, tar)[source]¶ Return Clark's coefficient of divergence of two strings.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
Clark's coefficient of divergence
- Return type
float
Examples
>>> cmp = Clark()
>>> cmp.dist('cat', 'hat')
0.816496580927726
>>> cmp.dist('Niall', 'Neil')
0.8819171036881969
>>> cmp.dist('aluminum', 'Catalan')
0.9660917830792959
>>> cmp.dist('ATCG', 'TAGC')
1.0
New in version 0.4.1.
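Clark's coefficient averages the squared relative differences of the token counts and takes the square root. A minimal sketch follows, under the assumption (inferred from the doctest values) that the population \(N\) is the set of distinct bigrams occurring in either string; `qgrams` and `clark_dist` are hypothetical names.

```python
import math
from collections import Counter


def qgrams(s, q=2):
    """Count q-grams of s, padded with q-1 '$' start and '#' stop symbols."""
    padded = '$' * (q - 1) + s + '#' * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def clark_dist(src, tar):
    """Root mean squared relative difference over the observed tokens."""
    x, y = qgrams(src), qgrams(tar)
    tokens = x.keys() | y.keys()   # every token has x + y >= 1
    total = sum(((x[t] - y[t]) / (x[t] + y[t])) ** 2 for t in tokens)
    return math.sqrt(total / len(tokens))


print(clark_dist('cat', 'hat'))   # 4 of 6 distinct bigrams are unshared
```

A token present in only one string contributes exactly 1 to the sum, so for 'cat' and 'hat' the result is \(\sqrt{4/6}\).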
-
class
abydos.distance.
Clement
(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Clement similarity.
For two sets X and Y and a population N, Clement similarity [Cle76] is defined as
\[sim_{Clement}(X, Y) = \frac{|X \cap Y|}{|X|}\Big(1-\frac{|X|}{|N|}\Big) + \frac{|(N \setminus X) \setminus Y|}{|N \setminus X|} \Big(1-\frac{|N \setminus X|}{|N|}\Big)\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{Clement} = \frac{a}{a+b}\Big(1 - \frac{a+b}{n}\Big) + \frac{d}{c+d}\Big(1 - \frac{c+d}{n}\Big)\]
New in version 0.4.0.
Initialize Clement instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
-
sim
(src, tar)[source]¶ Return the Clement similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Clement similarity
- Return type
float
Examples
>>> cmp = Clement()
>>> cmp.sim('cat', 'hat')
0.5025379382522239
>>> cmp.sim('Niall', 'Neil')
0.33840586363079933
>>> cmp.sim('aluminum', 'Catalan')
0.12119877280918714
>>> cmp.sim('ATCG', 'TAGC')
0.006336616803332366
New in version 0.4.0.
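The Clement formula is a weighted sum of two ratios, one for tokens in the source and one for tokens outside it. The sketch below evaluates the 2x2 form directly; `qgrams`, `confusion_table`, and `clement_sim` are hypothetical names, and the population of 784 and the way d is counted are assumptions inferred from the doctest values, not the library's code.

```python
from collections import Counter


def qgrams(s, q=2):
    """Count q-grams of s, padded with q-1 '$' start and '#' stop symbols."""
    padded = '$' * (q - 1) + s + '#' * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def confusion_table(src, tar, population=784):
    """Return (a, b, c, d); population=784 is an assumed default."""
    x, y = qgrams(src), qgrams(tar)
    a = sum((x & y).values())
    b = sum((x - y).values())
    c = sum((y - x).values())
    d = population - len(x | y)    # assumed: distinct q-grams in neither
    return a, b, c, d


def clement_sim(src, tar):
    """a/(a+b) * (1 - (a+b)/n) + d/(c+d) * (1 - (c+d)/n)."""
    a, b, c, d = confusion_table(src, tar)
    n = a + b + c + d
    return (a / (a + b)) * (1 - (a + b) / n) + (d / (c + d)) * (1 - (c + d) / n)


print(clement_sim('cat', 'hat'))
```

With a large population the first term dominates, which is why the example values stay close to the plain ratio a/(a+b).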
-
class
abydos.distance.
CohenKappa
(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Cohen's Kappa similarity.
For two sets X and Y and a population N, Cohen's kappa similarity [Coh60] is
\[sim_{Cohen_\kappa}(X, Y) = \kappa = \frac{p_o - p_e^\kappa}{1 - p_e^\kappa}\]
where
\[\begin{split}\begin{array}{l} p_o = \frac{|X \cap Y| + |(N \setminus X) \setminus Y|}{|N|}\\ \\ p_e^\kappa = \frac{|X|}{|N|} \cdot \frac{|Y|}{|N|} + \frac{|N \setminus X|}{|N|} \cdot \frac{|N \setminus Y|}{|N|} \end{array}\end{split}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[\begin{split}\begin{array}{l} p_o = \frac{a+d}{n}\\ \\ p_e^\kappa = \frac{a+b}{n} \cdot \frac{a+c}{n} + \frac{c+d}{n} \cdot \frac{b+d}{n} \end{array}\end{split}\]
New in version 0.4.0.
Initialize CohenKappa instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
-
sim
(src, tar)[source]¶ Return Cohen's Kappa similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Cohen's Kappa similarity
- Return type
float
Examples
>>> cmp = CohenKappa()
>>> cmp.sim('cat', 'hat')
0.9974358974358974
>>> cmp.sim('Niall', 'Neil')
0.9955041746949261
>>> cmp.sim('aluminum', 'Catalan')
0.9903412749517064
>>> cmp.sim('ATCG', 'TAGC')
0.993581514762516
New in version 0.4.0.
-
class
abydos.distance.
Cole
(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Cole correlation.
For two sets X and Y and a population N, the Cole correlation [Col49] has three formulae:
If \(|X \cap Y| \cdot |(N \setminus X) \setminus Y| \geq |X \setminus Y| \cdot |Y \setminus X|\) then
\[corr_{Cole}(X, Y) = \frac{|X \cap Y| \cdot |(N \setminus X) \setminus Y| - |X \setminus Y| \cdot |Y \setminus X|} {(|X \cap Y| + |X \setminus Y|) \cdot (|X \setminus Y| + |(N \setminus X) \setminus Y|)}\]
If \(|(N \setminus X) \setminus Y| \geq |X \cap Y|\) then
\[corr_{Cole}(X, Y) = \frac{|X \cap Y| \cdot |(N \setminus X) \setminus Y| - |X \setminus Y| \cdot |Y \setminus X|} {(|X \cap Y| + |X \setminus Y|) \cdot (|X \cap Y| + |Y \setminus X|)}\]
Otherwise
\[corr_{Cole}(X, Y) = \frac{|X \cap Y| \cdot |(N \setminus X) \setminus Y| - |X \setminus Y| \cdot |Y \setminus X|} {(|X \setminus Y| + |(N \setminus X) \setminus Y|) \cdot (|Y \setminus X| + |(N \setminus X) \setminus Y|)}\]
Cole terms this measurement the Coefficient of Interspecific Association.
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[\begin{split}corr_{Cole} = \left\{ \begin{array}{ll} \frac{ad-bc}{(a+b)(b+d)} & \textup{if} ~ad \geq bc \\ \\ \frac{ad-bc}{(a+b)(a+c)} & \textup{if} ~d \geq a \\ \\ \frac{ad-bc}{(b+d)(c+d)} & \textup{otherwise} \end{array} \right.\end{split}\]
New in version 0.4.0.
Initialize Cole instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
-
corr
(src, tar)[source]¶ Return the Cole correlation of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Cole correlation
- Return type
float
Examples
>>> cmp = Cole()
>>> cmp.corr('cat', 'hat')
0.49743589743589745
>>> cmp.corr('Niall', 'Neil')
0.3290543431750107
>>> cmp.corr('aluminum', 'Catalan')
0.10195910195910196
>>> cmp.corr('ATCG', 'TAGC')
-1.0
New in version 0.4.0.
-
sim
(src, tar)[source]¶ Return the Cole similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Cole similarity
- Return type
float
Examples
>>> cmp = Cole()
>>> cmp.sim('cat', 'hat')
0.7487179487179487
>>> cmp.sim('Niall', 'Neil')
0.6645271715875054
>>> cmp.sim('aluminum', 'Catalan')
0.550979550979551
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
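The three-way case split of the Cole correlation is easiest to follow in code. The sketch below applies the 2x2 form from the class description; `qgrams`, `confusion_table`, `cole_corr`, and `cole_sim` are hypothetical names, the population of 784 and the counting of d are assumptions inferred from the doctest values, and the \((1 + corr)/2\) mapping is likewise inferred from the examples rather than from the library source.

```python
from collections import Counter


def qgrams(s, q=2):
    """Count q-grams of s, padded with q-1 '$' start and '#' stop symbols."""
    padded = '$' * (q - 1) + s + '#' * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def confusion_table(src, tar, population=784):
    """Return (a, b, c, d); population=784 is an assumed default."""
    x, y = qgrams(src), qgrams(tar)
    a = sum((x & y).values())
    b = sum((x - y).values())
    c = sum((y - x).values())
    d = population - len(x | y)    # assumed: distinct q-grams in neither
    return a, b, c, d


def cole_corr(src, tar):
    """Pick the denominator according to the sign of ad - bc and d vs a."""
    a, b, c, d = confusion_table(src, tar)
    num = a * d - b * c
    if num >= 0:                   # positive association
        denom = (a + b) * (b + d)
    elif d >= a:                   # negative association, d at least a
        denom = (a + b) * (a + c)
    else:
        denom = (b + d) * (c + d)
    return num / denom


def cole_sim(src, tar):
    """Assumed mapping of the correlation from [-1, 1] onto [0, 1]."""
    return (1 + cole_corr(src, tar)) / 2


print(cole_corr('ATCG', 'TAGC'))   # no shared bigrams: -25 / 25
```

For 'ATCG' and 'TAGC' the second branch applies, since a=0, b=c=5 and d=774, giving exactly -1.0.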
-
class
abydos.distance.
ConsonniTodeschiniI
(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Consonni & Todeschini I similarity.
For two sets X and Y and a population N, Consonni & Todeschini I similarity [CT12] is
\[sim_{ConsonniTodeschiniI}(X, Y) = \frac{log(1+|X \cap Y|+|(N \setminus X) \setminus Y|)} {log(1+|N|)}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{ConsonniTodeschiniI} = \frac{log(1+a+d)}{log(1+n)}\]
New in version 0.4.0.
Initialize ConsonniTodeschiniI instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
-
sim
(src, tar)[source]¶ Return the Consonni & Todeschini I similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Consonni & Todeschini I similarity
- Return type
float
Examples
>>> cmp = ConsonniTodeschiniI()
>>> cmp.sim('cat', 'hat')
0.9992336018090547
>>> cmp.sim('Niall', 'Neil')
0.998656222829757
>>> cmp.sim('aluminum', 'Catalan')
0.9971098629456009
>>> cmp.sim('ATCG', 'TAGC')
0.9980766131469967
New in version 0.4.0.
-
class
abydos.distance.
ConsonniTodeschiniII
(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Consonni & Todeschini II similarity.
For two sets X and Y and a population N, Consonni & Todeschini II similarity [CT12] is
\[sim_{ConsonniTodeschiniII}(X, Y) = \frac{log(1+|N|) - log(1+|X \setminus Y|+|Y \setminus X|)} {log(1+|N|)}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{ConsonniTodeschiniII} = \frac{log(1+n)-log(1+b+c)}{log(1+n)}\]
New in version 0.4.0.
Initialize ConsonniTodeschiniII instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
-
sim
(src, tar)[source]¶ Return the Consonni & Todeschini II similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Consonni & Todeschini II similarity
- Return type
float
Examples
>>> cmp = ConsonniTodeschiniII()
>>> cmp.sim('cat', 'hat')
0.7585487129939101
>>> cmp.sim('Niall', 'Neil')
0.6880377723094788
>>> cmp.sim('aluminum', 'Catalan')
0.5841297898633079
>>> cmp.sim('ATCG', 'TAGC')
0.640262668568961
New in version 0.4.0.
-
class
abydos.distance.
ConsonniTodeschiniIII
(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Consonni & Todeschini III similarity.
For two sets X and Y and a population N, Consonni & Todeschini III similarity [CT12] is
\[sim_{ConsonniTodeschiniIII}(X, Y) = \frac{log(1+|X \cap Y|)}{log(1+|N|)}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{ConsonniTodeschiniIII} = \frac{log(1+a)}{log(1+n)}\]
New in version 0.4.0.
Initialize ConsonniTodeschiniIII instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
sim(src, tar)
Return the Consonni & Todeschini III similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Consonni & Todeschini III similarity
- Return type
float
Examples
>>> cmp = ConsonniTodeschiniIII()
>>> cmp.sim('cat', 'hat')
0.1648161441769704
>>> cmp.sim('Niall', 'Neil')
0.1648161441769704
>>> cmp.sim('aluminum', 'Catalan')
0.10396755253417303
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
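As a rough illustration of the confusion-table form of Consonni & Todeschini III, the sketch below evaluates log(1+a)/log(1+n) directly from counts. The helper name ct3_sim is hypothetical and not part of abydos, and the population size n = 784 is an assumption chosen because it is consistent with the documented value for 'cat' and 'hat' when a = 2.

```python
from math import log


def ct3_sim(a: int, n: int) -> float:
    """Consonni & Todeschini III from counts (hypothetical helper, not abydos API).

    a is the intersection size and n the population size.
    """
    if n <= 0:
        # Assumed convention for an empty population.
        return 0.0
    return log(1 + a) / log(1 + n)


# Assumed counts: two shared tokens in a population of 784.
print(ct3_sim(2, 784))
# No shared tokens gives log(1) = 0 in the numerator.
print(ct3_sim(0, 784))
```

Because the denominator depends only on n, any two pairs with the same intersection size and population get the same score, which matches the identical documented values for the first two examples.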
class abydos.distance.ConsonniTodeschiniIV(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)
Bases: abydos.distance._token_distance._TokenDistance
Consonni & Todeschini IV similarity.
For two sets X and Y and a population N, Consonni & Todeschini IV similarity [CT12] is
\[sim_{ConsonniTodeschiniIV}(X, Y) = \frac{log(1+|X \cap Y|)}{log(1+|X \cup Y|)}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{ConsonniTodeschiniIV} = \frac{log(1+a)}{log(1+a+b+c)}\]
New in version 0.4.0.
Initialize ConsonniTodeschiniIV instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
sim(src, tar)
Return the Consonni & Todeschini IV similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Consonni & Todeschini IV similarity
- Return type
float
Examples
>>> cmp = ConsonniTodeschiniIV()
>>> cmp.sim('cat', 'hat')
0.5645750340535796
>>> cmp.sim('Niall', 'Neil')
0.4771212547196623
>>> cmp.sim('aluminum', 'Catalan')
0.244650542118226
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
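The confusion-table form of Consonni & Todeschini IV can be sketched in the same way, here with the union size a+b+c as the normalizer instead of the population. The helper name ct4_sim and the value returned for two empty sets are assumptions for illustration only.

```python
from math import log


def ct4_sim(a: int, b: int, c: int) -> float:
    """Consonni & Todeschini IV from counts (hypothetical helper, not abydos API).

    a = |X and Y|, b = |X minus Y|, c = |Y minus X|.
    """
    union = a + b + c
    if union == 0:
        # Assumed convention: two empty sets are treated as identical.
        return 1.0
    return log(1 + a) / log(1 + union)


# Assumed bigram counts: 'cat' vs 'hat' share 2 bigrams and differ in 2 each.
print(ct4_sim(2, 2, 2))
# Assumed bigram counts for 'Niall' vs 'Neil': 2 shared, 4 and 3 unshared.
print(ct4_sim(2, 4, 3))
```

Since no population term appears, this variant does not depend on the alphabet size.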
class abydos.distance.ConsonniTodeschiniV(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)
Bases: abydos.distance._token_distance._TokenDistance
Consonni & Todeschini V correlation.
For two sets X and Y and a population N, Consonni & Todeschini V correlation [CT12] is
\[corr_{ConsonniTodeschiniV}(X, Y) = \frac{log(1+|X \cap Y| \cdot |(N \setminus X) \setminus Y|)- log(1+|X \setminus Y| \cdot |Y \setminus X|)} {log(1+\frac{|N|^2}{4})}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{ConsonniTodeschiniV} = \frac{log(1+ad)-log(1+bc)}{log(1+\frac{n^2}{4})}\]
New in version 0.4.0.
Initialize ConsonniTodeschiniV instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
corr(src, tar)
Return the Consonni & Todeschini V correlation of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Consonni & Todeschini V correlation
- Return type
float
Examples
>>> cmp = ConsonniTodeschiniV()
>>> cmp.corr('cat', 'hat')
0.48072545510682463
>>> cmp.corr('Niall', 'Neil')
0.4003930264973547
>>> cmp.corr('aluminum', 'Catalan')
0.21794239483504532
>>> cmp.corr('ATCG', 'TAGC')
-0.2728145951429799
New in version 0.4.0.
sim(src, tar)
Return the Consonni & Todeschini V similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Consonni & Todeschini V similarity
- Return type
float
Examples
>>> cmp = ConsonniTodeschiniV()
>>> cmp.sim('cat', 'hat')
0.7403627275534124
>>> cmp.sim('Niall', 'Neil')
0.7001965132486774
>>> cmp.sim('aluminum', 'Catalan')
0.6089711974175227
>>> cmp.sim('ATCG', 'TAGC')
0.36359270242851005
New in version 0.4.0.
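The following sketch evaluates the Consonni & Todeschini V correlation from confusion-table counts. The helper names are hypothetical. The mapping sim = (1 + corr) / 2 is an assumption inferred from the documented example values, not taken from the abydos source, and the counts used assume a population of 784.

```python
from math import log


def ct5_corr(a: int, b: int, c: int, d: int) -> float:
    """Consonni & Todeschini V correlation from counts (hypothetical helper)."""
    n = a + b + c + d
    if n == 0:
        # Assumed convention for an empty population.
        return 0.0
    return (log(1 + a * d) - log(1 + b * c)) / log(1 + n * n / 4)


def ct5_sim(a: int, b: int, c: int, d: int) -> float:
    """Rescale the correlation from [-1, 1] to [0, 1] (assumed mapping)."""
    return (1 + ct5_corr(a, b, c, d)) / 2


# Assumed counts for 'cat' vs 'hat': a=2, b=2, c=2, d=778.
print(ct5_corr(2, 2, 2, 778))
print(ct5_sim(2, 2, 2, 778))
```

The correlation turns negative as soon as bc exceeds ad, for instance when the two strings share no tokens.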
class abydos.distance.Cosine(tokenizer=None, intersection_type='crisp', **kwargs)
Bases: abydos.distance._token_distance._TokenDistance
Cosine similarity.
For two sets X and Y, the cosine similarity, Otsuka-Ochiai coefficient, or Ochiai coefficient [Ots36][Och57] is
\[sim_{cosine}(X, Y) = \frac{|X \cap Y|}{\sqrt{|X| \cdot |Y|}}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{cosine} = \frac{a}{\sqrt{(a+b)(a+c)}}\]
Notes
This measure is also known as the Fowlkes-Mallows index [FM83] for two classes and G-measure, the geometric mean of precision & recall.
New in version 0.3.6.
Initialize Cosine instance.
- Parameters
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
sim(src, tar)
Return the cosine similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Cosine similarity
- Return type
float
Examples
>>> cmp = Cosine()
>>> cmp.sim('cat', 'hat')
0.5
>>> cmp.sim('Niall', 'Neil')
0.3651483716701107
>>> cmp.sim('aluminum', 'Catalan')
0.11785113019775793
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
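To make the set formula concrete, the sketch below computes cosine similarity over q-gram multisets using only the standard library. It is not the abydos implementation: the helper names are hypothetical, and padding each word with '$' and '#' before splitting into bigrams is an assumption that happens to reproduce the example values above.

```python
from collections import Counter
from math import sqrt


def qgrams(word: str, q: int = 2) -> Counter:
    """Multiset of q-grams, padded with '$' and '#' (assumed convention)."""
    padded = "$" * (q - 1) + word + "#" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def cosine_sim(src: str, tar: str, q: int = 2) -> float:
    """|X and Y| / sqrt(|X| * |Y|) over q-gram multisets (hypothetical helper)."""
    x, y = qgrams(src, q), qgrams(tar, q)
    if not x or not y:
        # Assumed convention: identical only if both token sets are empty.
        return 1.0 if not x and not y else 0.0
    overlap = sum((x & y).values())  # multiset intersection size
    return overlap / sqrt(sum(x.values()) * sum(y.values()))


print(cosine_sim("cat", "hat"))
print(cosine_sim("Niall", "Neil"))
```

For 'cat' and 'hat' the two padded words share 2 of their 4 bigrams each, giving 2 / sqrt(16). The distance values documented for dist_cosine equal one minus the corresponding similarity.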
abydos.distance.dist_cosine(src, tar, qval=2)
Return the cosine distance between two strings.
This is a wrapper for Cosine.dist().
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
qval (int) -- The length of each q-gram
- Returns
Cosine distance
- Return type
float
Examples
>>> dist_cosine('cat', 'hat')
0.5
>>> dist_cosine('Niall', 'Neil')
0.6348516283298893
>>> dist_cosine('aluminum', 'Catalan')
0.882148869802242
>>> dist_cosine('ATCG', 'TAGC')
1.0
New in version 0.1.0.
Deprecated since version 0.4.0: This will be removed in 0.6.0. Use the Cosine.dist method instead.
abydos.distance.sim_cosine(src, tar, qval=2)
Return the cosine similarity of two strings.
This is a wrapper for Cosine.sim().
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
qval (int) -- The length of each q-gram
- Returns
Cosine similarity
- Return type
float
Examples
>>> sim_cosine('cat', 'hat')
0.5
>>> sim_cosine('Niall', 'Neil')
0.3651483716701107
>>> sim_cosine('aluminum', 'Catalan')
0.11785113019775793
>>> sim_cosine('ATCG', 'TAGC')
0.0
New in version 0.1.0.
Deprecated since version 0.4.0: This will be removed in 0.6.0. Use the Cosine.sim method instead.
class abydos.distance.Dennis(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)
Bases: abydos.distance._token_distance._TokenDistance
Dennis similarity.
For two sets X and Y and a population N, Dennis similarity [Den65] is
\[sim_{Dennis}(X, Y) = \frac{|X \cap Y| - \frac{|X| \cdot |Y|}{|N|}} {\sqrt{\frac{|X|\cdot|Y|}{|N|}}}\]
This is the fourth of Dennis' association measures, and that which she claims is the best of the four.
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{Dennis} = \frac{a-\frac{(a+b)(a+c)}{n}}{\sqrt{\frac{(a+b)(a+c)}{n}}}\]
New in version 0.4.0.
Initialize Dennis instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
corr(src, tar)
Return the Dennis correlation of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Dennis correlation
- Return type
float
Examples
>>> cmp = Dennis()
>>> cmp.corr('cat', 'hat')
0.494897959183673
>>> cmp.corr('Niall', 'Neil')
0.358162114559075
>>> cmp.corr('aluminum', 'Catalan')
0.107041854561785
>>> cmp.corr('ATCG', 'TAGC')
-0.006377551020408
New in version 0.4.0.
sim(src, tar)
Return the normalized Dennis similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Normalized Dennis similarity
- Return type
float
Examples
>>> cmp = Dennis()
>>> cmp.sim('cat', 'hat')
0.6632653061224487
>>> cmp.sim('Niall', 'Neil')
0.5721080763727167
>>> cmp.sim('aluminum', 'Catalan')
0.4046945697078567
>>> cmp.sim('ATCG', 'TAGC')
0.32908163265306134
New in version 0.4.0.
sim_score(src, tar)
Return the Dennis similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Dennis similarity
- Return type
float
Examples
>>> cmp = Dennis()
>>> cmp.sim_score('cat', 'hat')
13.857142857142858
>>> cmp.sim_score('Niall', 'Neil')
10.028539207654113
>>> cmp.sim_score('aluminum', 'Catalan')
2.9990827802847835
>>> cmp.sim_score('ATCG', 'TAGC')
-0.17857142857142858
New in version 0.4.0.
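The unnormalized Dennis score compares the observed intersection with the size expected under independence. The sketch below works from confusion-table counts. The helper name dennis_score is hypothetical, and the counts assume a population of 784, a value consistent with the documented sim_score for 'cat' and 'hat'.

```python
from math import sqrt


def dennis_score(a: int, b: int, c: int, d: int) -> float:
    """Unnormalized Dennis similarity from counts (hypothetical helper)."""
    n = a + b + c + d
    if n == 0:
        # Assumed convention for an empty population.
        return 0.0
    expected = (a + b) * (a + c) / n  # intersection size expected by chance
    if expected == 0:
        # Assumed convention when either set is empty.
        return 0.0
    return (a - expected) / sqrt(expected)


# Assumed counts for 'cat' vs 'hat': a=2, b=2, c=2, d=778.
print(dennis_score(2, 2, 2, 778))
```

With these counts the expected intersection is 16/784 = 1/49, so the score is (2 - 1/49) / (1/7) = 97/7. The score is unbounded above, which is why the class also offers the normalized sim and corr methods.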
class abydos.distance.Dice(tokenizer=None, intersection_type='crisp', **kwargs)
Bases: abydos.distance._tversky.Tversky
Sørensen–Dice coefficient.
For two sets X and Y, the Sørensen–Dice coefficient [Dic45][Sorensen48][Cze09][MDobrzanskiZ50] is
\[sim_{Dice}(X, Y) = \frac{2 \cdot |X \cap Y|}{|X| + |Y|}\]
This is the complement of Bray & Curtis dissimilarity [BC57], also known as the Lance & Williams dissimilarity [LW67a].
This is identical to the Tanimoto similarity coefficient [Tan58] and the Tversky index [Tve77] for \(\alpha = \beta = 0.5\).
In the Ruby text library this is identified as White similarity, after [Whid.].
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{Dice} = \frac{2a}{2a+b+c}\]
Notes
In terms of a confusion matrix, this is equivalent to \(F_1\) score ConfusionTable.f1_score().
The multiset variant is termed Gleason similarity [Gle20].
New in version 0.3.6.
Initialize Dice instance.
- Parameters
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
sim(src, tar)
Return the Sørensen–Dice coefficient of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Sørensen–Dice similarity
- Return type
float
Examples
>>> cmp = Dice()
>>> cmp.sim('cat', 'hat')
0.5
>>> cmp.sim('Niall', 'Neil')
0.36363636363636365
>>> cmp.sim('aluminum', 'Catalan')
0.11764705882352941
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
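The equivalence between the Sørensen–Dice coefficient and the F1 score can be checked numerically from confusion-table counts. This is only a sketch: dice_sim and f1_score are hypothetical helpers rather than abydos functions, and the values returned for empty inputs are assumed conventions.

```python
def dice_sim(a: int, b: int, c: int) -> float:
    """Sørensen–Dice coefficient 2a / (2a + b + c) (hypothetical helper)."""
    denom = 2 * a + b + c
    # Assumed convention: two empty sets are treated as identical.
    return 2 * a / denom if denom else 1.0


def f1_score(a: int, b: int, c: int) -> float:
    """Harmonic mean of precision a/(a+b) and recall a/(a+c)."""
    if a == 0:
        # Assumed convention: no true positives gives a score of zero.
        return 0.0
    precision = a / (a + b)
    recall = a / (a + c)
    return 2 * precision * recall / (precision + recall)


# Assumed bigram counts for 'Niall' vs 'Neil': a=2, b=4, c=3.
print(dice_sim(2, 4, 3), f1_score(2, 4, 3))
```

Both expressions reduce to 4/11 for these counts, which agrees with the documented value for 'Niall' and 'Neil'.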
abydos.distance.dist_dice(src, tar, qval=2)
Return the Sørensen–Dice distance between two strings.
This is a wrapper for Dice.dist().
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
qval (int) -- The length of each q-gram
- Returns
Sørensen–Dice distance
- Return type
float
Examples
>>> dist_dice('cat', 'hat')
0.5
>>> dist_dice('Niall', 'Neil')
0.6363636363636364
>>> dist_dice('aluminum', 'Catalan')
0.8823529411764706
>>> dist_dice('ATCG', 'TAGC')
1.0
New in version 0.1.0.
Deprecated since version 0.4.0: This will be removed in 0.6.0. Use the Dice.dist method instead.
abydos.distance.sim_dice(src, tar, qval=2)
Return the Sørensen–Dice coefficient of two strings.
This is a wrapper for Dice.sim().
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
qval (int) -- The length of each q-gram
- Returns
Sørensen–Dice similarity
- Return type
float
Examples
>>> sim_dice('cat', 'hat')
0.5
>>> sim_dice('Niall', 'Neil')
0.36363636363636365
>>> sim_dice('aluminum', 'Catalan')
0.11764705882352941
>>> sim_dice('ATCG', 'TAGC')
0.0
New in version 0.1.0.
Deprecated since version 0.4.0: This will be removed in 0.6.0. Use the Dice.sim method instead.
class abydos.distance.DiceAsymmetricI(tokenizer=None, intersection_type='crisp', **kwargs)
Bases: abydos.distance._token_distance._TokenDistance
Dice's Asymmetric I similarity.
For two sets X and Y, Dice's Asymmetric I similarity [Dic45] is
\[sim_{DiceAsymmetricI}(X, Y) = \frac{|X \cap Y|}{|X|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{DiceAsymmetricI} = \frac{a}{a+b}\]
Notes
In terms of a confusion matrix, this is equivalent to precision or positive predictive value ConfusionTable.precision().
New in version 0.4.0.
Initialize DiceAsymmetricI instance.
- Parameters
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
sim(src, tar)
Return the Dice's Asymmetric I similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Dice's Asymmetric I similarity
- Return type
float
Examples
>>> cmp = DiceAsymmetricI()
>>> cmp.sim('cat', 'hat')
0.5
>>> cmp.sim('Niall', 'Neil')
0.3333333333333333
>>> cmp.sim('aluminum', 'Catalan')
0.1111111111111111
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
class abydos.distance.DiceAsymmetricII(tokenizer=None, intersection_type='crisp', **kwargs)
Bases: abydos.distance._token_distance._TokenDistance
Dice's Asymmetric II similarity.
For two sets X and Y, Dice's Asymmetric II similarity [Dic45] is
\[sim_{DiceAsymmetricII}(X, Y) = \frac{|X \cap Y|}{|Y|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{DiceAsymmetricII} = \frac{a}{a+c}\]
Notes
In terms of a confusion matrix, this is equivalent to recall, sensitivity, or true positive rate ConfusionTable.recall().
New in version 0.4.0.
Initialize DiceAsymmetricII instance.
- Parameters
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
sim(src, tar)
Return the Dice's Asymmetric II similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Dice's Asymmetric II similarity
- Return type
float
Examples
>>> cmp = DiceAsymmetricII()
>>> cmp.sim('cat', 'hat')
0.5
>>> cmp.sim('Niall', 'Neil')
0.4
>>> cmp.sim('aluminum', 'Catalan')
0.125
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
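The two asymmetric measures differ only in which string supplies the denominator, so swapping the arguments swaps the results. The sketch below shows this with a minimal bigram tokenizer. The helper names and the '$' / '#' padding are assumptions made for illustration and are not the abydos implementation.

```python
from collections import Counter


def qgrams(word: str, q: int = 2) -> Counter:
    """Multiset of q-grams, padded with '$' and '#' (assumed convention)."""
    padded = "$" * (q - 1) + word + "#" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def dice_asym_i(src: str, tar: str) -> float:
    """|X and Y| / |X|: the share of the source's tokens found in the target."""
    x, y = qgrams(src), qgrams(tar)
    return sum((x & y).values()) / sum(x.values())


def dice_asym_ii(src: str, tar: str) -> float:
    """|X and Y| / |Y|: the share of the target's tokens found in the source."""
    x, y = qgrams(src), qgrams(tar)
    return sum((x & y).values()) / sum(y.values())


print(dice_asym_i("Niall", "Neil"), dice_asym_ii("Niall", "Neil"))
print(dice_asym_i("Neil", "Niall"), dice_asym_ii("Neil", "Niall"))
```

'Niall' yields 6 padded bigrams and 'Neil' yields 5, with 2 in common, so the two measures give 2/6 and 2/5 respectively.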
class abydos.distance.Digby(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)
Bases: abydos.distance._token_distance._TokenDistance
Digby correlation.
For two sets X and Y and a population N, Digby's approximation of the tetrachoric correlation coefficient [Dig83] is
\[corr_{Digby}(X, Y) = \frac{(|X \cap Y| \cdot |(N \setminus X) \setminus Y|)^\frac{3}{4}- (|X \setminus Y| \cdot |Y \setminus X|)^\frac{3}{4}} {(|X \cap Y| \cdot |(N \setminus X) \setminus Y|)^\frac{3}{4} + (|X \setminus Y| \cdot |Y \setminus X|)^\frac{3}{4}}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{Digby} = \frac{(ad)^\frac{3}{4}-(bc)^\frac{3}{4}}{(ad)^\frac{3}{4}+(bc)^\frac{3}{4}}\]
New in version 0.4.0.
Initialize Digby instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
corr(src, tar)
Return the Digby correlation of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Digby correlation
- Return type
float
Examples
>>> cmp = Digby()
>>> cmp.corr('cat', 'hat')
0.9774244829419212
>>> cmp.corr('Niall', 'Neil')
0.9491281473458171
>>> cmp.corr('aluminum', 'Catalan')
0.7541039303781305
>>> cmp.corr('ATCG', 'TAGC')
-1.0
New in version 0.4.0.
sim(src, tar)
Return the Digby similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Digby similarity
- Return type
float
Examples
>>> cmp = Digby()
>>> cmp.sim('cat', 'hat')
0.9887122414709606
>>> cmp.sim('Niall', 'Neil')
0.9745640736729085
>>> cmp.sim('aluminum', 'Catalan')
0.8770519651890653
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
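A minimal sketch of the Digby correlation from confusion-table counts follows, with the exponent 3/4 applied to the products ad and bc as in the set formulation. The helper name digby_corr, the zero returned for a degenerate table, and the population of 784 behind the sample counts are all assumptions.

```python
def digby_corr(a: int, b: int, c: int, d: int) -> float:
    """Digby's tetrachoric approximation from counts (hypothetical helper)."""
    ad = (a * d) ** 0.75
    bc = (b * c) ** 0.75
    if ad + bc == 0:
        # Assumed convention when both products vanish.
        return 0.0
    return (ad - bc) / (ad + bc)


# Assumed counts for 'cat' vs 'hat': a=2, b=2, c=2, d=778.
print(digby_corr(2, 2, 2, 778))
# With no shared tokens (a=0) the first product vanishes.
print(digby_corr(0, 5, 5, 774))
```

Because d is large relative to b and c for short strings over a large alphabet, ad dominates bc whenever any tokens are shared, which is why the documented correlations sit close to 1.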
class abydos.distance.Dispersion(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)
Bases: abydos.distance._token_distance._TokenDistance
Dispersion correlation.
For two sets X and Y and a population N, the dispersion correlation [Cor17] is
\[corr_{dispersion}(X, Y) = \frac{|X \cap Y| \cdot |(N \setminus X) \setminus Y| - |X \setminus Y| \cdot |Y \setminus X|} {|N|^2}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{dispersion} = \frac{ad-bc}{n^2}\]
New in version 0.4.0.
Initialize Dispersion instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
corr(src, tar)
Return the Dispersion correlation of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Dispersion correlation
- Return type
float
Examples
>>> cmp = Dispersion()
>>> cmp.corr('cat', 'hat')
0.002524989587671803
>>> cmp.corr('Niall', 'Neil')
0.002502212619741774
>>> cmp.corr('aluminum', 'Catalan')
0.0011570449105440383
>>> cmp.corr('ATCG', 'TAGC')
-4.06731570179092e-05
New in version 0.4.0.
sim(src, tar)
Return the Dispersion similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Dispersion similarity
- Return type
float
Examples
>>> cmp = Dispersion()
>>> cmp.sim('cat', 'hat')
0.5012624947938359
>>> cmp.sim('Niall', 'Neil')
0.5012511063098709
>>> cmp.sim('aluminum', 'Catalan')
0.500578522455272
>>> cmp.sim('ATCG', 'TAGC')
0.499979663421491
New in version 0.4.0.
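The dispersion correlation is simple enough to evaluate by hand, and the sketch below does so from confusion-table counts. The helper names are hypothetical. The similarity mapping (1 + corr) / 2 is an assumption inferred from the documented example values, and the sample counts assume a population of 784.

```python
def dispersion_corr(a: int, b: int, c: int, d: int) -> float:
    """Dispersion correlation (ad - bc) / n^2 (hypothetical helper)."""
    n = a + b + c + d
    if n == 0:
        # Assumed convention for an empty population.
        return 0.0
    return (a * d - b * c) / (n * n)


def dispersion_sim(a: int, b: int, c: int, d: int) -> float:
    """Rescale the correlation from [-1, 1] to [0, 1] (assumed mapping)."""
    return (1 + dispersion_corr(a, b, c, d)) / 2


# Assumed counts for 'cat' vs 'hat': a=2, b=2, c=2, d=778.
print(dispersion_corr(2, 2, 2, 778))
print(dispersion_sim(2, 2, 2, 778))
```

For these counts the correlation is (1556 - 4) / 784^2 = 1552 / 614656. Dividing by n^2 keeps every value very small for short strings, so the similarities cluster tightly around 0.5.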
class abydos.distance.Doolittle(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)
Bases: abydos.distance._token_distance._TokenDistance
Doolittle similarity.
For two sets X and Y and a population N, the Doolittle similarity [Doo84] is
\[sim_{Doolittle}(X, Y) = \frac{(|X \cap Y| \cdot |N| - |X| \cdot |Y|)^2} {|X| \cdot |Y| \cdot |N \setminus Y| \cdot |N \setminus X|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{Doolittle} = \frac{(an-(a+b)(a+c))^2}{(a+b)(a+c)(b+d)(c+d)}\]
New in version 0.4.0.
Initialize Doolittle instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
sim(src, tar)
Return the Doolittle similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Doolittle similarity
- Return type
float
Examples
>>> cmp = Doolittle()
>>> cmp.sim('cat', 'hat')
0.24744247205785666
>>> cmp.sim('Niall', 'Neil')
0.13009912077202224
>>> cmp.sim('aluminum', 'Catalan')
0.011710186806836291
>>> cmp.sim('ATCG', 'TAGC')
4.1196952743799446e-05
New in version 0.4.0.
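The Doolittle numerator can be simplified, since an - (a+b)(a+c) = ad - bc when n = a+b+c+d. The sketch below computes the similarity from counts and checks that identity numerically. The helper name doolittle_sim, the value returned for a degenerate table, and the population of 784 behind the sample counts are assumptions.

```python
def doolittle_sim(a: int, b: int, c: int, d: int) -> float:
    """Doolittle similarity from confusion-table counts (hypothetical helper)."""
    n = a + b + c + d
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if denom == 0:
        # Assumed convention when any marginal total is zero.
        return 0.0
    return (a * n - (a + b) * (a + c)) ** 2 / denom


# Assumed counts for 'cat' vs 'hat': a=2, b=2, c=2, d=778.
a, b, c, d = 2, 2, 2, 778
n = a + b + c + d
print(doolittle_sim(a, b, c, d))
# The numerator term equals ad - bc, so this prints True.
print(a * n - (a + b) * (a + c) == a * d - b * c)
```

Written with ad - bc in the numerator, the expression is the square of the phi coefficient of the 2x2 table, so it always lies in [0, 1].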
class abydos.distance.Dunning(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)
Bases: abydos.distance._token_distance._TokenDistance
Dunning similarity.
For two sets X and Y and a population N, Dunning log-likelihood [Dun93], following [CGHH91], is
\[\begin{split}sim_{Dunning}(X, Y) = \lambda = |X \cap Y| \cdot log_2(|X \cap Y|) +\\ |X \setminus Y| \cdot log_2(|X \setminus Y|) + |Y \setminus X| \cdot log_2(|Y \setminus X|) +\\ |(N \setminus X) \setminus Y| \cdot log_2(|(N \setminus X) \setminus Y|) -\\ (|X| \cdot log_2(|X|) + |Y| \cdot log_2(|Y|) +\\ |N \setminus Y| \cdot log_2(|N \setminus Y|) + |N \setminus X| \cdot log_2(|N \setminus X|))\end{split}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[\begin{split}sim_{Dunning} = \lambda = a \cdot log_2(a) +\\ b \cdot log_2(b) + c \cdot log_2(c) + d \cdot log_2(d) - \\ ((a+b) \cdot log_2(a+b) + (a+c) \cdot log_2(a+c) +\\ (b+d) \cdot log_2(b+d) + (c+d) \cdot log_2(c+d))\end{split}\]
Notes
To avoid NaNs, every logarithm is calculated as the logarithm of 1 greater than the value in question. (Python's math.log1p function is used.)
New in version 0.4.0.
Initialize Dunning instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
sim(src, tar)
Return the normalized Dunning similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Normalized Dunning similarity
- Return type
float
Examples
>>> cmp = Dunning()
>>> cmp.sim('cat', 'hat')
0.33462839191969423
>>> cmp.sim('Niall', 'Neil')
0.19229445539929793
>>> cmp.sim('aluminum', 'Catalan')
0.03220862737070572
>>> cmp.sim('ATCG', 'TAGC')
0.0010606026735052122
New in version 0.4.0.
sim_score(src, tar)
Return the Dunning similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Dunning similarity
- Return type
float
Examples
>>> cmp = Dunning()
>>> cmp.sim('cat', 'hat')
0.33462839191969423
>>> cmp.sim('Niall', 'Neil')
0.19229445539929793
>>> cmp.sim('aluminum', 'Catalan')
0.03220862737070572
>>> cmp.sim('ATCG', 'TAGC')
0.0010606026735052122
New in version 0.4.0.
class abydos.distance.Euclidean(alphabet=0, tokenizer=None, intersection_type='crisp', **kwargs)
Bases: abydos.distance._minkowski.Minkowski
Euclidean distance.
Euclidean distance is the straight-line or as-the-crow-flies distance, equivalent to Minkowski distance in \(L^2\)-space.
New in version 0.3.6.
Initialize Euclidean instance.
- Parameters
alphabet (collection or int) -- The values or size of the alphabet
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
dist(src, tar)
Return the normalized Euclidean distance between two strings.
The normalized Euclidean distance is a distance metric in \(L^2\)-space, normalized to [0, 1].
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
The normalized Euclidean distance
- Return type
float
Examples
>>> cmp = Euclidean()
>>> round(cmp.dist('cat', 'hat'), 12)
0.57735026919
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.683130051064
>>> round(cmp.dist('Colin', 'Cuilen'), 12)
0.727606875109
>>> cmp.dist('ATCG', 'TAGC')
1.0
New in version 0.3.0.
Changed in version 0.3.6: Encapsulated in class
dist_abs(src, tar, normalized=False)
Return the Euclidean distance between two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
normalized (bool) -- Normalizes to [0, 1] if True
- Returns
The Euclidean distance
- Return type
float
Examples
>>> cmp = Euclidean()
>>> cmp.dist_abs('cat', 'hat')
2.0
>>> round(cmp.dist_abs('Niall', 'Neil'), 12)
2.645751311065
>>> cmp.dist_abs('Colin', 'Cuilen')
3.0
>>> round(cmp.dist_abs('ATCG', 'TAGC'), 12)
3.162277660168
New in version 0.3.0.
Changed in version 0.3.6: Encapsulated in class
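The unnormalized Euclidean distance can be read as the L2 norm of the difference between two q-gram count vectors. The sketch below implements that reading with the standard library only. The helper names and the '$' / '#' padding are assumptions rather than the abydos implementation, although the results agree with the dist_abs examples above.

```python
from collections import Counter
from math import sqrt


def qgrams(word: str, q: int = 2) -> Counter:
    """Multiset of q-grams, padded with '$' and '#' (assumed convention)."""
    padded = "$" * (q - 1) + word + "#" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def euclidean_abs(src: str, tar: str, q: int = 2) -> float:
    """L2 norm of the difference of two q-gram count vectors (hypothetical helper)."""
    x, y = qgrams(src, q), qgrams(tar, q)
    # A Counter returns 0 for missing keys, so iterating the key union is safe.
    return sqrt(sum((x[g] - y[g]) ** 2 for g in x.keys() | y.keys()))


print(euclidean_abs("cat", "hat"))
print(euclidean_abs("Colin", "Cuilen"))
```

'cat' and 'hat' differ in four bigrams ('$c', 'ca', '$h', 'ha'), each contributing 1 to the sum of squares, so the distance is sqrt(4).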
abydos.distance.euclidean(src, tar, qval=2, normalized=False, alphabet=0)
Return the Euclidean distance between two strings.
This is a wrapper for Euclidean.dist_abs().
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
qval (int) -- The length of each q-gram
normalized (bool) -- Normalizes to [0, 1] if True
alphabet (collection or int) -- The values or size of the alphabet
- Returns
The Euclidean distance
- Return type
float
Examples
>>> euclidean('cat', 'hat')
2.0
>>> round(euclidean('Niall', 'Neil'), 12)
2.645751311065
>>> euclidean('Colin', 'Cuilen')
3.0
>>> round(euclidean('ATCG', 'TAGC'), 12)
3.162277660168
New in version 0.3.0.
Deprecated since version 0.4.0: This will be removed in 0.6.0. Use the Euclidean.dist_abs method instead.
-
abydos.distance.
dist_euclidean
(src, tar, qval=2, alphabet=0)[source]¶ Return the normalized Euclidean distance between two strings.
This is a wrapper for Euclidean.dist().
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
qval (int) -- The length of each q-gram
alphabet (collection or int) -- The values or size of the alphabet
- Returns
The normalized Euclidean distance
- Return type
float
Examples
>>> round(dist_euclidean('cat', 'hat'), 12)
0.57735026919
>>> round(dist_euclidean('Niall', 'Neil'), 12)
0.683130051064
>>> round(dist_euclidean('Colin', 'Cuilen'), 12)
0.727606875109
>>> dist_euclidean('ATCG', 'TAGC')
1.0
New in version 0.3.0.
Deprecated since version 0.4.0: This will be removed in 0.6.0. Use the Euclidean.dist method instead.
-
abydos.distance.
sim_euclidean
(src, tar, qval=2, alphabet=0)[source]¶ Return the normalized Euclidean similarity of two strings.
This is a wrapper for Euclidean.sim().
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
qval (int) -- The length of each q-gram
alphabet (collection or int) -- The values or size of the alphabet
- Returns
The normalized Euclidean similarity
- Return type
float
Examples
>>> round(sim_euclidean('cat', 'hat'), 12)
0.42264973081
>>> round(sim_euclidean('Niall', 'Neil'), 12)
0.316869948936
>>> round(sim_euclidean('Colin', 'Cuilen'), 12)
0.272393124891
>>> sim_euclidean('ATCG', 'TAGC')
0.0
New in version 0.3.0.
Deprecated since version 0.4.0: This will be removed in 0.6.0. Use the Euclidean.sim method instead.
-
class
abydos.distance.
Eyraud
(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Eyraud similarity.
For two sets X and Y and a population N, the Eyraud similarity [Eyr38] is
\[sim_{Eyraud}(X, Y) = \frac{|X \cap Y| - |X| \cdot |Y|} {|X| \cdot |Y| \cdot |N \setminus Y| \cdot |N \setminus X|}\]
For lack of access to the original, this formula is based on the concurring formulae presented in [Shi93] and [Hubalek08].
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{Eyraud} = \frac{a-(a+b)(a+c)}{(a+b)(a+c)(b+d)(c+d)}\]
New in version 0.4.0.
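The 2x2 confusion table terms can be made concrete with a small sketch. It assumes padded bigram tokens and a population of n = 784; both are assumptions picked because they reproduce the sim_score example for 'cat' and 'hat', and the helper names are hypothetical.

```python
from collections import Counter


def padded_bigrams(word):
    """Count 2-grams of word, padded with '$' and '#' (assumed padding)."""
    padded = "$" + word + "#"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def confusion_table(src, tar, n=784):
    """Return (a, b, c, d); n=784 is an assumed population size."""
    x, y = padded_bigrams(src), padded_bigrams(tar)
    a = sum((x & y).values())  # tokens in both strings
    b = sum((x - y).values())  # tokens only in src
    c = sum((y - x).values())  # tokens only in tar
    d = n - (a + b + c)        # tokens in neither string
    return a, b, c, d


def eyraud_score(a, b, c, d):
    """Eyraud similarity in 2x2 confusion table terms."""
    return (a - (a + b) * (a + c)) / ((a + b) * (a + c) * (b + d) * (c + d))


table = confusion_table("cat", "hat")
print(table)                 # (2, 2, 2, 778)
print(eyraud_score(*table))  # a small negative value, about -1.44e-06
```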
Initialize Eyraud instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
-
sim
(src, tar)[source]¶ Return the normalized Eyraud similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Normalized Eyraud similarity
- Return type
float
Examples
>>> cmp = Eyraud()
>>> cmp.sim('cat', 'hat')
1.438198553583169e-06
>>> cmp.sim('Niall', 'Neil')
1.5399964580081465e-06
>>> cmp.sim('aluminum', 'Catalan')
1.6354719962967386e-06
>>> cmp.sim('ATCG', 'TAGC')
1.6478781097519779e-06
New in version 0.4.0.
-
sim_score
(src, tar)[source]¶ Return the Eyraud similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Eyraud similarity
- Return type
float
Examples
>>> cmp = Eyraud()
>>> cmp.sim_score('cat', 'hat')
-1.438198553583169e-06
>>> cmp.sim_score('Niall', 'Neil')
-1.5399964580081465e-06
>>> cmp.sim_score('aluminum', 'Catalan')
-1.6354719962967386e-06
>>> cmp.sim_score('ATCG', 'TAGC')
-1.6478781097519779e-06
New in version 0.4.0.
-
class
abydos.distance.
FagerMcGowan
(tokenizer=None, intersection_type='crisp', **kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Fager & McGowan similarity.
For two sets X and Y, the Fager & McGowan similarity [Fag57][FM63] is
\[sim_{FagerMcGowan}(X, Y) = \frac{|X \cap Y|}{\sqrt{|X|\cdot|Y|}} - \frac{1}{2\sqrt{\max(|X|, |Y|)}}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{FagerMcGowan} = \frac{a}{\sqrt{(a+b)(a+c)}} - \frac{1}{2\sqrt{\max(a+b, a+c)}}\]
New in version 0.4.0.
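A minimal sketch of the 2x2 form follows, together with a clamp to the unit interval of the kind sim() is described as applying. The input counts are assumptions (padded bigrams: a=2, b=2, c=2 for 'cat' vs 'hat', and a=0, b=5, c=5 for 'ATCG' vs 'TAGC'), and the function names are hypothetical.

```python
from math import sqrt


def fager_mcgowan_score(a, b, c):
    """Fager & McGowan similarity from 2x2 counts (d is not used)."""
    return a / sqrt((a + b) * (a + c)) - 1 / (2 * sqrt(max(a + b, a + c)))


def clamp01(value):
    """Clamp a score to the unit interval, as a simple normalization."""
    return max(0.0, min(1.0, value))


shared = fager_mcgowan_score(2, 2, 2)    # 0.5 - 0.25
disjoint = fager_mcgowan_score(0, 5, 5)  # 0 - 1 / (2 * sqrt(5)), negative
print(shared, clamp01(shared))
print(disjoint, clamp01(disjoint))
```

With no shared tokens the raw score is negative, which is why the clamped value drops to 0.0.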
Initialize FagerMcGowan instance.
- Parameters
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
-
sim
(src, tar)[source]¶ Return the normalized Fager & McGowan similarity of two strings.
As this similarity ranges over \((-\infty, 1.0)\), this normalization simply clamps the value to the range [0.0, 1.0].
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Normalized Fager & McGowan similarity
- Return type
float
Examples
>>> cmp = FagerMcGowan()
>>> cmp.sim('cat', 'hat')
0.25
>>> cmp.sim('Niall', 'Neil')
0.16102422643817918
>>> cmp.sim('aluminum', 'Catalan')
0.0
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
-
sim_score
(src, tar)[source]¶ Return the Fager & McGowan similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Fager & McGowan similarity
- Return type
float
Examples
>>> cmp = FagerMcGowan()
>>> cmp.sim_score('cat', 'hat')
0.25
>>> cmp.sim_score('Niall', 'Neil')
0.16102422643817918
>>> cmp.sim_score('aluminum', 'Catalan')
-0.048815536468908724
>>> cmp.sim_score('ATCG', 'TAGC')
-0.22360679774997896
New in version 0.4.0.
-
class
abydos.distance.
Faith
(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Faith similarity.
For two sets X and Y and a population N, the Faith similarity [Fai83] is
\[sim_{Faith}(X, Y) = \frac{|X \cap Y| + \frac{|(N \setminus X) \setminus Y|}{2}}{|N|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{Faith} = \frac{a+\frac{d}{2}}{n}\]
New in version 0.4.0.
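The sketch below evaluates the 2x2 form directly. The tables are assumptions: (2, 2, 2, 778) for 'cat' vs 'hat' and (0, 5, 5, 774) for 'ATCG' vs 'TAGC', which correspond to padded bigrams over a population of 784.

```python
def faith_sim(a, b, c, d):
    """Faith similarity: matches count fully, joint absences count half."""
    n = a + b + c + d
    return (a + d / 2) / n


print(faith_sim(2, 2, 2, 778))  # about 0.4987
print(faith_sim(0, 5, 5, 774))  # about 0.4936, despite no shared tokens
```

Because d is far larger than a, b, and c for short strings, the d/2 term dominates and every value sits just below 0.5.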
Initialize Faith instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
-
sim
(src, tar)[source]¶ Return the Faith similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Faith similarity
- Return type
float
Examples
>>> cmp = Faith()
>>> cmp.sim('cat', 'hat')
0.4987244897959184
>>> cmp.sim('Niall', 'Neil')
0.4968112244897959
>>> cmp.sim('aluminum', 'Catalan')
0.4910828025477707
>>> cmp.sim('ATCG', 'TAGC')
0.49362244897959184
New in version 0.4.0.
-
class
abydos.distance.
Fidelity
(tokenizer=None, **kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Fidelity.
For two multisets X and Y drawn from an alphabet S, fidelity is
\[sim_{Fidelity}(X, Y) = \Bigg( \sum_{i \in S} \sqrt{\Big|\frac{X_i}{|X|} \cdot \frac{Y_i}{|Y|}\Big|} \Bigg)^2\]
where \(X_i\) and \(Y_i\) are the counts of token \(i\) in X and Y.
New in version 0.4.0.
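The sum can be sketched over q-gram multisets as follows. The padded_bigrams helper and its '$'/'#' padding are assumptions chosen to match the example values for this class; only tokens present in both strings contribute a non-zero term.

```python
from collections import Counter
from math import sqrt


def padded_bigrams(word):
    """Count 2-grams of word, padded with '$' and '#' (assumed padding)."""
    padded = "$" + word + "#"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def fidelity(src, tar):
    """Squared sum of geometric means of the relative token frequencies."""
    x, y = padded_bigrams(src), padded_bigrams(tar)
    size_x, size_y = sum(x.values()), sum(y.values())
    shared = x.keys() & y.keys()  # other tokens contribute sqrt(0) = 0
    total = sum(sqrt((x[t] / size_x) * (y[t] / size_y)) for t in shared)
    return total ** 2


print(fidelity("cat", "hat"))    # two shared bigrams out of four each: 0.25
print(fidelity("ATCG", "TAGC"))  # no shared bigrams
```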
Initialize Fidelity instance.
- Parameters
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
New in version 0.4.0.
-
sim
(src, tar)[source]¶ Return the fidelity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
fidelity
- Return type
float
Examples
>>> cmp = Fidelity()
>>> cmp.sim('cat', 'hat')
0.25
>>> cmp.sim('Niall', 'Neil')
0.1333333333333333
>>> cmp.sim('aluminum', 'Catalan')
0.013888888888888888
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
-
class
abydos.distance.
Fleiss
(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Fleiss correlation.
For two sets X and Y and a population N, Fleiss correlation [Fle75] is
\[corr_{Fleiss}(X, Y) = \frac{(|X \cap Y| \cdot |(N \setminus X) \setminus Y| - |X \setminus Y| \cdot |Y \setminus X|) \cdot (|X| \cdot |N \setminus X| + |Y| \cdot |N \setminus Y|)} {2 \cdot |X| \cdot |N \setminus X| \cdot |Y| \cdot |N \setminus Y|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{Fleiss} = \frac{(ad-bc)((a+b)(c+d)+(a+c)(b+d))}{2(a+b)(c+d)(a+c)(b+d)}\]
This is Fleiss' \(M(A_1)\), \(ad-bc\) divided by the harmonic mean of the marginals \(p_1q_1 = (a+b)(c+d)\) and \(p_2q_2 = (a+c)(b+d)\).
New in version 0.4.0.
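The equivalence between the expanded fraction and "ad-bc divided by the harmonic mean of the marginals" can be checked numerically. The table (2, 4, 3, 775) is an assumption: it corresponds to padded bigrams of 'Niall' and 'Neil' over a population of 784.

```python
from statistics import harmonic_mean


def fleiss_expanded(a, b, c, d):
    """Fleiss correlation written out as a single fraction."""
    p1q1 = (a + b) * (c + d)
    p2q2 = (a + c) * (b + d)
    return (a * d - b * c) * (p1q1 + p2q2) / (2 * p1q1 * p2q2)


def fleiss_harmonic(a, b, c, d):
    """The same quantity as (ad - bc) over the harmonic mean of p1q1, p2q2."""
    p1q1 = (a + b) * (c + d)
    p2q2 = (a + c) * (b + d)
    return (a * d - b * c) / harmonic_mean([p1q1, p2q2])


print(fleiss_expanded(2, 4, 3, 775))  # about 0.3622
print(fleiss_harmonic(2, 4, 3, 775))  # same value up to rounding
```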
Initialize Fleiss instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
-
corr
(src, tar)[source]¶ Return the Fleiss correlation of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Fleiss correlation
- Return type
float
Examples
>>> cmp = Fleiss()
>>> cmp.corr('cat', 'hat')
0.49743589743589745
>>> cmp.corr('Niall', 'Neil')
0.3621712520061204
>>> cmp.corr('aluminum', 'Catalan')
0.10839724112919989
>>> cmp.corr('ATCG', 'TAGC')
-0.006418485237483954
New in version 0.4.0.
-
sim
(src, tar)[source]¶ Return the Fleiss similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Fleiss similarity
- Return type
float
Examples
>>> cmp = Fleiss()
>>> cmp.sim('cat', 'hat')
0.7487179487179487
>>> cmp.sim('Niall', 'Neil')
0.6810856260030602
>>> cmp.sim('aluminum', 'Catalan')
0.5541986205645999
>>> cmp.sim('ATCG', 'TAGC')
0.496790757381258
New in version 0.4.0.
-
class
abydos.distance.
FleissLevinPaik
(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Fleiss-Levin-Paik similarity.
For two sets X and Y and a population N, Fleiss-Levin-Paik similarity [FLP03] is
\[sim_{FleissLevinPaik}(X, Y) = \frac{2|(N \setminus X) \setminus Y|} {2|(N \setminus X) \setminus Y| + |X \setminus Y| + |Y \setminus X|}\]
This is [Mor12]'s 'd Specific Agreement'.
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{FleissLevinPaik} = \frac{2d}{2d + b + c}\]
New in version 0.4.0.
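A short sketch of the 2x2 form shows why the values for this measure cluster near 1. The tables are assumptions: (2, 2, 2, 778) for 'cat' vs 'hat' and (0, 5, 5, 774) for 'ATCG' vs 'TAGC', from padded bigrams over a population of 784.

```python
def fleiss_levin_paik_sim(b, c, d):
    """Agreement on joint absences; the shared count a plays no role."""
    return 2 * d / (2 * d + b + c)


print(fleiss_levin_paik_sim(2, 2, 778))  # about 0.9974
print(fleiss_levin_paik_sim(5, 5, 774))  # about 0.9936, with nothing shared
```

Since the measure only rewards tokens absent from both strings, it says little about short strings drawn from a large alphabet.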
Initialize FleissLevinPaik instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
-
sim
(src, tar)[source]¶ Return the Fleiss-Levin-Paik similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Fleiss-Levin-Paik similarity
- Return type
float
Examples
>>> cmp = FleissLevinPaik()
>>> cmp.sim('cat', 'hat')
0.9974358974358974
>>> cmp.sim('Niall', 'Neil')
0.9955041746949261
>>> cmp.sim('aluminum', 'Catalan')
0.9903412749517064
>>> cmp.sim('ATCG', 'TAGC')
0.993581514762516
New in version 0.4.0.
-
class
abydos.distance.
ForbesI
(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Forbes I similarity.
For two sets X and Y and a population N, the Forbes I similarity [For07][Moz36] is
\[sim_{ForbesI}(X, Y) = \frac{|N| \cdot |X \cap Y|}{|X| \cdot |Y|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{ForbesI} = \frac{na}{(a+b)(a+c)}\]
New in version 0.4.0.
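One way to read the 2x2 form is as the observed overlap divided by the overlap expected if the two token sets were independent. The sketch below computes both views; the table (2, 2, 2, 778) is an assumption corresponding to padded bigrams of 'cat' and 'hat' over a population of 784.

```python
def forbes_i_score(a, b, c, d):
    """Forbes I similarity in 2x2 confusion table terms."""
    n = a + b + c + d
    return n * a / ((a + b) * (a + c))


def expected_overlap(a, b, c, d):
    """Overlap expected under independence of the two margins."""
    n = a + b + c + d
    return (a + b) * (a + c) / n


table = (2, 2, 2, 778)
print(forbes_i_score(*table))          # 98.0
print(table[0] / expected_overlap(*table))  # the same ratio, up to rounding
```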
Initialize ForbesI instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
-
sim
(src, tar)[source]¶ Return the normalized Forbes I similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Normalized Forbes I similarity
- Return type
float
Examples
>>> cmp = ForbesI()
>>> cmp.sim('cat', 'hat')
0.5
>>> cmp.sim('Niall', 'Neil')
0.3333333333333333
>>> cmp.sim('aluminum', 'Catalan')
0.11125283446712018
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
-
sim_score
(src, tar)[source]¶ Return the Forbes I similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Forbes I similarity
- Return type
float
Examples
>>> cmp = ForbesI()
>>> cmp.sim_score('cat', 'hat')
98.0
>>> cmp.sim_score('Niall', 'Neil')
52.266666666666666
>>> cmp.sim_score('aluminum', 'Catalan')
10.902777777777779
>>> cmp.sim_score('ATCG', 'TAGC')
0.0
New in version 0.4.0.
-
class
abydos.distance.
ForbesII
(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Forbes II correlation.
For two sets X and Y and a population N, the Forbes II correlation, as described in [For25], is
\[corr_{ForbesII}(X, Y) = \frac{|X \setminus Y| \cdot |Y \setminus X| - |X \cap Y| \cdot |(N \setminus X) \setminus Y|} {|X| \cdot |Y| - |N| \cdot \min(|X|, |Y|)}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{ForbesII} = \frac{bc-ad}{(a+b)(a+c) - n \cdot \min(a+b, a+c)}\]
New in version 0.4.0.
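The following sketch evaluates the 2x2 form. The mapping from correlation to similarity, (1 + corr) / 2, is an assumption inferred from the example values for corr() and sim() in this class, not a documented definition. The tables are likewise assumed (padded bigrams over a population of 784).

```python
def forbes_ii_corr(a, b, c, d):
    """Forbes II correlation in 2x2 confusion table terms."""
    n = a + b + c + d
    return (b * c - a * d) / ((a + b) * (a + c) - n * min(a + b, a + c))


def corr_to_sim(corr):
    """Map a correlation in [-1, 1] onto [0, 1] (assumed mapping)."""
    return (1 + corr) / 2


corr = forbes_ii_corr(2, 4, 3, 775)  # 'Niall' vs 'Neil'
print(corr)               # about 0.3954
print(corr_to_sim(corr))  # about 0.6977
```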
Initialize ForbesII instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
-
corr
(src, tar)[source]¶ Return the Forbes II correlation of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Forbes II correlation
- Return type
float
Examples
>>> cmp = ForbesII()
>>> cmp.corr('cat', 'hat')
0.49743589743589745
>>> cmp.corr('Niall', 'Neil')
0.3953727506426735
>>> cmp.corr('aluminum', 'Catalan')
0.11485180412371133
>>> cmp.corr('ATCG', 'TAGC')
-0.006418485237483954
New in version 0.4.0.
-
sim
(src, tar)[source]¶ Return the Forbes II similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Forbes II similarity
- Return type
float
Examples
>>> cmp = ForbesII()
>>> cmp.sim('cat', 'hat')
0.7487179487179487
>>> cmp.sim('Niall', 'Neil')
0.6976863753213367
>>> cmp.sim('aluminum', 'Catalan')
0.5574259020618557
>>> cmp.sim('ATCG', 'TAGC')
0.496790757381258
New in version 0.4.0.
-
class
abydos.distance.
Fossum
(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Fossum similarity.
For two sets X and Y and a population N, the Fossum similarity [FK66] is
\[sim_{Fossum}(X, Y) = \frac{|N| \cdot \Big(|X \cap Y|-\frac{1}{2}\Big)^2}{|X| \cdot |Y|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{Fossum} = \frac{n(a-\frac{1}{2})^2}{(a+b)(a+c)}\]
New in version 0.4.0.
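A minimal sketch of the 2x2 form is given below. The tables are assumptions: (2, 2, 2, 778) for 'cat' vs 'hat' and (0, 5, 5, 774) for 'ATCG' vs 'TAGC', from padded bigrams over a population of 784.

```python
def fossum_score(a, b, c, d):
    """Fossum similarity in 2x2 confusion table terms."""
    n = a + b + c + d
    return n * (a - 0.5) ** 2 / ((a + b) * (a + c))


print(fossum_score(2, 2, 2, 778))  # 784 * 2.25 / 16
print(fossum_score(0, 5, 5, 774))  # 784 * 0.25 / 25, non-zero although a = 0
```

Because of the squared (a - 1/2) term, strings with no shared tokens still receive a small positive score.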
Initialize Fossum instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
-
sim
(src, tar)[source]¶ Return the normalized Fossum similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Normalized Fossum similarity
- Return type
float
Examples
>>> cmp = Fossum()
>>> cmp.sim('cat', 'hat')
0.1836734693877551
>>> cmp.sim('Niall', 'Neil')
0.08925619834710742
>>> cmp.sim('aluminum', 'Catalan')
0.0038927335640138415
>>> cmp.sim('ATCG', 'TAGC')
0.01234567901234568
New in version 0.4.0.
-
sim_score
(src, tar)[source]¶ Return the Fossum similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Fossum similarity
- Return type
float
Examples
>>> cmp = Fossum()
>>> cmp.sim_score('cat', 'hat')
110.25
>>> cmp.sim_score('Niall', 'Neil')
58.8
>>> cmp.sim_score('aluminum', 'Catalan')
2.7256944444444446
>>> cmp.sim_score('ATCG', 'TAGC')
7.84
New in version 0.4.0.
-
class
abydos.distance.
GeneralizedFleiss
(alphabet=None, tokenizer=None, intersection_type='crisp', mean_func='arithmetic', marginals='a', proportional=False, **kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Generalized Fleiss correlation.
For two sets X and Y and a population N, Generalized Fleiss correlation is based on observations from [Fle75].
\[corr_{GeneralizedFleiss}(X, Y) = \frac{|X \cap Y| \cdot |(N \setminus X) \setminus Y| - |X \setminus Y| \cdot |Y \setminus X|} {\mu_{products~of~marginals}}\]
The mean function \(\mu\) may be any of the mean functions in abydos.stats. The products of marginals may be one of the following:
    a: \(|X| \cdot |N \setminus X|\) & \(|Y| \cdot |N \setminus Y|\)
    b: \(|X| \cdot |Y|\) & \(|N \setminus X| \cdot |N \setminus Y|\)
    c: \(|X| \cdot |N \setminus Y|\) & \(|Y| \cdot |N \setminus X|\)
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{GeneralizedFleiss} = \frac{ad-bc}{\mu_{products~of~marginals}}\]
And the products of marginals are:
    a: \(p_1q_1 = (a+b)(c+d)\) & \(p_2q_2 = (a+c)(b+d)\)
    b: \(p_1p_2 = (a+b)(a+c)\) & \(q_1q_2 = (c+d)(b+d)\)
    c: \(p_1q_2 = (a+b)(b+d)\) & \(p_2q_1 = (a+c)(c+d)\)
New in version 0.4.0.
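To illustrate how the choice of mean and of marginal pairs enters the formula, here is a sketch that uses the Python standard library's statistics module in place of abydos.stats (a substitution made for self-containment). The function names and the table (2, 4, 3, 775), which corresponds to padded bigrams of 'Niall' and 'Neil' over a population of 784, are assumptions.

```python
from statistics import geometric_mean, harmonic_mean, mean


def marginal_products(a, b, c, d, marginals="a"):
    """Return the pair of marginal products selected by 'a', 'b', or 'c'."""
    pairs = {
        "a": ((a + b) * (c + d), (a + c) * (b + d)),  # p1q1 & p2q2
        "b": ((a + b) * (a + c), (c + d) * (b + d)),  # p1p2 & q1q2
        "c": ((a + b) * (b + d), (a + c) * (c + d)),  # p1q2 & p2q1
    }
    return pairs[marginals]


def generalized_fleiss(a, b, c, d, mean_func=mean, marginals="a"):
    """(ad - bc) divided by a mean of the selected marginal products."""
    return (a * d - b * c) / mean_func(marginal_products(a, b, c, d, marginals))


table = (2, 4, 3, 775)
print(generalized_fleiss(*table))                            # arithmetic mean
print(generalized_fleiss(*table, mean_func=geometric_mean))  # geometric mean
print(generalized_fleiss(*table, mean_func=harmonic_mean))   # harmonic mean
```

For a positive numerator the harmonic mean yields the largest value and the arithmetic mean the smallest, since the harmonic mean of two positive numbers never exceeds their geometric or arithmetic mean.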
Initialize GeneralizedFleiss instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
mean_func (str or function) -- Specifies the mean function to use. A function taking a list of numbers as its only required argument may be supplied, or one of the following strings will select the specified mean function from abydos.stats:
    arithmetic employs amean(), and this measure will be identical to MaxwellPilliner with otherwise default parameters
    geometric employs gmean(), and this measure will be identical to PearsonPhi with otherwise default parameters
    harmonic employs hmean(), and this measure will be identical to Fleiss with otherwise default parameters
    ag employs the arithmetic-geometric mean agmean()
    gh employs the geometric-harmonic mean ghmean()
    agh employs the arithmetic-geometric-harmonic mean aghmean()
    contraharmonic employs the contraharmonic mean cmean()
    identric employs the identric mean imean()
    logarithmic employs the logarithmic mean lmean()
    quadratic employs the quadratic mean qmean()
    heronian employs the Heronian mean heronian_mean()
    hoelder employs the Hölder mean hoelder_mean()
    lehmer employs the Lehmer mean lehmer_mean()
    seiffert employs Seiffert's mean seiffert_mean()
marginals (str) -- Specifies the pairs of marginals to multiply and calculate the resulting mean of. Can be:
    a: \(p_1q_1 = (a+b)(c+d)\) & \(p_2q_2 = (a+c)(b+d)\)
    b: \(p_1p_2 = (a+b)(a+c)\) & \(q_1q_2 = (c+d)(b+d)\)
    c: \(p_1q_2 = (a+b)(b+d)\) & \(p_2q_1 = (a+c)(c+d)\)
proportional (bool) -- If true, each of the values, \(a, b, c, d\) and the marginals will be divided by the total \(a+b+c+d=n\).
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
-
corr
(src, tar)[source]¶ Return the Generalized Fleiss correlation of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Generalized Fleiss correlation
- Return type
float
Examples
>>> cmp = GeneralizedFleiss()
>>> cmp.corr('cat', 'hat')
0.49743589743589745
>>> cmp.corr('Niall', 'Neil')
0.35921989956790845
>>> cmp.corr('aluminum', 'Catalan')
0.10803030303030303
>>> cmp.corr('ATCG', 'TAGC')
-0.006418485237483954
New in version 0.4.0.
-
sim
(src, tar)[source]¶ Return the Generalized Fleiss similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Generalized Fleiss similarity
- Return type
float
Examples
>>> cmp = GeneralizedFleiss()
>>> cmp.sim('cat', 'hat')
0.7487179487179487
>>> cmp.sim('Niall', 'Neil')
0.6796099497839543
>>> cmp.sim('aluminum', 'Catalan')
0.5540151515151515
>>> cmp.sim('ATCG', 'TAGC')
0.496790757381258
New in version 0.4.0.
-
class
abydos.distance.
Gilbert
(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Gilbert correlation.
For two sets X and Y and a population N, the Gilbert correlation [Gil84] is
\[corr_{Gilbert}(X, Y) = \frac{2(|X \cap Y| \cdot |(N \setminus X) \setminus Y| - |X \setminus Y| \cdot |Y \setminus X|)} {|N|^2 - |X \cap Y|^2 + |X \setminus Y|^2 + |Y \setminus X|^2 - |(N \setminus X) \setminus Y|^2}\]
For lack of access to the original, this formula is based on the concurring formulae presented in [Pei84] and [Doo84].
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{Gilbert} = \frac{2(ad-bc)}{n^2-a^2+b^2+c^2-d^2}\]
New in version 0.4.0.
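The 2x2 form can be sketched as below, with ad - bc in the numerator as in the set-based formula. The table (2, 2, 2, 778) is an assumption corresponding to padded bigrams of 'cat' and 'hat' over a population of 784.

```python
def gilbert_corr(a, b, c, d):
    """Gilbert correlation in 2x2 confusion table terms."""
    n = a + b + c + d
    return 2 * (a * d - b * c) / (n ** 2 - a ** 2 + b ** 2 + c ** 2 - d ** 2)


print(gilbert_corr(2, 2, 2, 778))  # 3104 / 9376, about 0.3311
```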
Initialize Gilbert instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
-
corr
(src, tar)[source]¶ Return the Gilbert correlation of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Gilbert correlation
- Return type
float
Examples
>>> cmp = Gilbert()
>>> cmp.corr('cat', 'hat')
0.3310580204778157
>>> cmp.corr('Niall', 'Neil')
0.21890122402504983
>>> cmp.corr('aluminum', 'Catalan')
0.057094811018577836
>>> cmp.corr('ATCG', 'TAGC')
-0.003198976327575176
New in version 0.4.0.
-
sim
(src, tar)[source]¶ Return the Gilbert similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Gilbert similarity
- Return type
float
Examples
>>> cmp = Gilbert()
>>> cmp.sim('cat', 'hat')
0.6655290102389079
>>> cmp.sim('Niall', 'Neil')
0.6094506120125249
>>> cmp.sim('aluminum', 'Catalan')
0.5285474055092889
>>> cmp.sim('ATCG', 'TAGC')
0.4984005118362124
New in version 0.4.0.
-
class
abydos.distance.
GilbertWells
(alphabet=None, tokenizer=None, **kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Gilbert & Wells similarity.
For two sets X and Y and a population N, the Gilbert & Wells similarity [GW66] is
\[sim_{GilbertWells}(X, Y) = \ln \frac{|N|^3}{2\pi |X| \cdot |Y| \cdot |N \setminus Y| \cdot |N \setminus X|} + 2\ln \frac{|N|! \cdot |X \cap Y|! \cdot |X \setminus Y|! \cdot |Y \setminus X|! \cdot |(N \setminus X) \setminus Y|!} {|X|! \cdot |Y|! \cdot |N \setminus Y|! \cdot |N \setminus X|!}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{GilbertWells} = \ln \frac{n^3}{2\pi (a+b)(a+c)(b+d)(c+d)} + 2\ln \frac{n!a!b!c!d!}{(a+b)!(a+c)!(b+d)!(c+d)!}\]
Notes
Most lists of similarity & distance measures, including [Hubalek08][CCT10][Mor12], have a quite different formula, which would be \(\ln a - \ln n - \ln \frac{a+b}{n} - \ln \frac{a+c}{n} = \ln\frac{an}{(a+b)(a+c)}\). However, neither this formula nor anything similar or equivalent to it appears anywhere within the cited work, [GW66]. See UnknownF for this alternative measure.
New in version 0.4.0.
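Evaluating the factorials directly becomes unwieldy for large n, so the sketch below works in log space with math.lgamma, using the identity ln k! = lgamma(k + 1). This is one possible way to evaluate the 2x2 formula, not necessarily how the class does it. The table (2, 2, 2, 778) is an assumption corresponding to padded bigrams of 'cat' and 'hat' over a population of 784.

```python
from math import lgamma, log, pi


def log_factorial(k):
    """Natural log of k!, via the gamma function."""
    return lgamma(k + 1)


def gilbert_wells_score(a, b, c, d):
    """Gilbert & Wells similarity in 2x2 terms, computed in log space."""
    n = a + b + c + d
    marginals = (a + b, a + c, b + d, c + d)  # all must be positive
    first = 3 * log(n) - log(2 * pi) - sum(log(m) for m in marginals)
    second = (
        log_factorial(n)
        + sum(log_factorial(k) for k in (a, b, c, d))
        - sum(log_factorial(m) for m in marginals)
    )
    return first + 2 * second


print(gilbert_wells_score(2, 2, 2, 778))  # about 20.176
```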
Initialize GilbertWells instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
New in version 0.4.0.
-
sim
(src, tar)[source]¶ Return the normalized Gilbert & Wells similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Normalized Gilbert & Wells similarity
- Return type
float
Examples
>>> cmp = GilbertWells()
>>> cmp.sim('cat', 'hat')
0.4116913723876516
>>> cmp.sim('Niall', 'Neil')
0.2457247406857589
>>> cmp.sim('aluminum', 'Catalan')
0.05800001636414742
>>> cmp.sim('ATCG', 'TAGC')
0.028716013247135602
New in version 0.4.0.
-
sim_score
(src, tar)[source]¶ Return the Gilbert & Wells similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Gilbert & Wells similarity
- Return type
float
Examples
>>> cmp = GilbertWells()
>>> cmp.sim_score('cat', 'hat')
20.17617447734673
>>> cmp.sim_score('Niall', 'Neil')
16.717742356982733
>>> cmp.sim_score('aluminum', 'Catalan')
5.495096667524002
>>> cmp.sim_score('ATCG', 'TAGC')
1.6845961909440712
New in version 0.4.0.
-
class
abydos.distance.
GiniI
(alphabet=None, tokenizer=None, intersection_type='crisp', normalizer='proportional', **kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Gini I correlation.
For two sets X and Y and a population N, Gini I correlation [Gin12], using the formula from [GK59], is
\[corr_{GiniI}(X, Y) = \frac{\frac{|X \cap Y|+|(N \setminus X) \setminus Y|}{|N|} - (\frac{|X| \cdot |Y|}{|N|^2} + \frac{|N \setminus Y| \cdot |N \setminus X|}{|N|^2})} {\sqrt{(1-(\frac{|X|}{|N|}^2+\frac{|N \setminus X|}{|N|}^2)) \cdot (1-(\frac{|Y|}{|N|}^2 + \frac{|N \setminus Y|}{|N|}^2))}}\]
In 2x2 confusion table terms, where a+b+c+d=n, after each term has been converted to a proportion by dividing by n, this is
\[corr_{GiniI} = \frac{(a+d)-((a+b)(a+c) + (b+d)(c+d))} {\sqrt{(1-((a+b)^2+(c+d)^2))\cdot(1-((a+c)^2+(b+d)^2))}}\]
New in version 0.4.0.
Initialize GiniI instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
normalizer (str) -- Specifies the normalization type. See normalizer description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
-
corr
(src, tar)[source]¶ Return the Gini I correlation of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Gini I correlation
- Return type
float
Examples
>>> cmp = GiniI()
>>> cmp.corr('cat', 'hat')
0.49722814498933254
>>> cmp.corr('Niall', 'Neil')
0.39649090262533215
>>> cmp.corr('aluminum', 'Catalan')
0.14887105223941113
>>> cmp.corr('ATCG', 'TAGC')
-0.006418485237489576
New in version 0.4.0.
-
sim
(src, tar)[source]¶ Return the normalized Gini I similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Normalized Gini I similarity
- Return type
float
Examples
>>> cmp = GiniI()
>>> cmp.sim('cat', 'hat')
0.7486140724946663
>>> cmp.sim('Niall', 'Neil')
0.6982454513126661
>>> cmp.sim('aluminum', 'Catalan')
0.5744355261197056
>>> cmp.sim('ATCG', 'TAGC')
0.4967907573812552
New in version 0.4.0.
-
class
abydos.distance.
GiniII
(alphabet=None, tokenizer=None, intersection_type='crisp', normalizer='proportional', **kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Gini II correlation.
For two sets X and Y and a population N, Gini II correlation [Gin15], using the formula from [GK59], is
\[corr_{GiniII}(X, Y) = \frac{\frac{|X \cap Y| + |(N \setminus X) \setminus Y|}{|N|} - (\frac{|X| \cdot |Y|}{|N|^2} + \frac{|N \setminus Y| \cdot |N \setminus X|}{|N|^2})} {1 - |\frac{|Y \setminus X| - |X \setminus Y|}{|N|}| - (\frac{|X| \cdot |Y|}{|N|^2} + \frac{|N \setminus Y| \cdot |N \setminus X|}{|N|^2})}\]
In 2x2 confusion table terms, where a+b+c+d=n, after each term has been converted to a proportion by dividing by n, this is
\[corr_{GiniII} = \frac{(a+d) - ((a+b)(a+c) + (b+d)(c+d))} {1 - |b-c| - ((a+b)(a+c) + (b+d)(c+d))}\]New in version 0.4.0.
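As a rough illustration, both Gini correlations can be evaluated directly from a 2x2 table. The helper names below are hypothetical and not part of abydos. The sketch assumes exact proportions (counts divided by n), assumes that Gini I subtracts both chance terms just as Gini II does, and returns 0.0 for a zero denominator, which is a choice made here rather than documented behavior.

```python
from math import sqrt


def _proportions(a, b, c, d):
    """Convert 2x2 counts to proportions that sum to 1."""
    n = a + b + c + d
    return a / n, b / n, c / n, d / n


def gini_i(a, b, c, d):
    """Gini I correlation from 2x2 counts (hypothetical helper)."""
    a, b, c, d = _proportions(a, b, c, d)
    chance = (a + b) * (a + c) + (b + d) * (c + d)
    den = sqrt(
        (1 - ((a + b) ** 2 + (c + d) ** 2))
        * (1 - ((a + c) ** 2 + (b + d) ** 2))
    )
    return ((a + d) - chance) / den if den else 0.0


def gini_ii(a, b, c, d):
    """Gini II correlation from 2x2 counts (hypothetical helper)."""
    a, b, c, d = _proportions(a, b, c, d)
    chance = (a + b) * (a + c) + (b + d) * (c + d)
    den = 1 - abs(b - c) - chance
    return ((a + d) - chance) / den if den else 0.0


# Arbitrary illustrative counts, not taken from any abydos run.
print(gini_i(2, 1, 3, 10), gini_ii(2, 1, 3, 10))
```

When b equals c the two denominators coincide, which is consistent with GiniI and GiniII reporting the same value for 'cat' and 'hat' in the examples.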
Initialize GiniII instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
normalizer (str) -- Specifies the normalization type. See normalizer description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
-
corr
(src, tar)[source]¶ Return the Gini II correlation of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Gini II correlation
- Return type
float
Examples
>>> cmp = GiniII()
>>> cmp.corr('cat', 'hat')
0.49722814498933254
>>> cmp.corr('Niall', 'Neil')
0.4240703425535771
>>> cmp.corr('aluminum', 'Catalan')
0.15701415701415936
>>> cmp.corr('ATCG', 'TAGC')
-0.006418485237489576
New in version 0.4.0.
-
sim
(src, tar)[source]¶ Return the normalized Gini II similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Normalized Gini II similarity
- Return type
float
Examples
>>> cmp = GiniII()
>>> cmp.sim('cat', 'hat')
0.7486140724946663
>>> cmp.sim('Niall', 'Neil')
0.7120351712767885
>>> cmp.sim('aluminum', 'Catalan')
0.5785070785070797
>>> cmp.sim('ATCG', 'TAGC')
0.4967907573812552
New in version 0.4.0.
-
class
abydos.distance.
Goodall
(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Goodall similarity.
For two sets X and Y and a population N, Goodall similarity [Goo67][AC77] is an angular transformation of Sokal & Michener's simple matching coefficient
\[sim_{Goodall}(X, Y) = \frac{2}{\pi} \sin^{-1}\Big( \sqrt{\frac{|X \cap Y| + |(N \setminus X) \setminus Y|}{|N|}} \Big)\]In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{Goodall} =\frac{2}{\pi} \sin^{-1}\Big( \sqrt{\frac{a + d}{n}} \Big)\]New in version 0.4.0.
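The angular transformation can be sketched from the 2x2 counts alone. The function name is hypothetical, and the counts a=2, b=2, c=2, d=778 are an assumed confusion table for 'cat' versus 'hat' (padded bigrams against a 784-token population), chosen because they are consistent with the documented example value.

```python
from math import asin, pi, sqrt


def goodall_sim(a, b, c, d):
    """Goodall similarity from 2x2 counts (hypothetical helper)."""
    n = a + b + c + d
    # Simple matching coefficient, then the arcsine (angular) transform.
    return 2 / pi * asin(sqrt((a + d) / n))


# Assumed counts for 'cat' vs. 'hat'; see the note above.
print(goodall_sim(2, 2, 2, 778))
```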
Initialize Goodall instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
-
sim
(src, tar)[source]¶ Return the Goodall similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Goodall similarity
- Return type
float
Examples
>>> cmp = Goodall()
>>> cmp.sim('cat', 'hat')
0.9544884026871964
>>> cmp.sim('Niall', 'Neil')
0.9397552079794624
>>> cmp.sim('aluminum', 'Catalan')
0.9117156301536503
>>> cmp.sim('ATCG', 'TAGC')
0.9279473952929225
New in version 0.4.0.
-
class
abydos.distance.
GoodmanKruskalLambda
(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Goodman & Kruskal's Lambda similarity.
For two sets X and Y and a population N, Goodman & Kruskal's lambda [GK54] is
\[sim_{GK_\lambda}(X, Y) = \frac{\frac{1}{2}((max(|X \cap Y|, |X \setminus Y|)+ max(|Y \setminus X|, |(N \setminus X) \setminus Y|)+ max(|X \cap Y|, |Y \setminus X|)+ max(|X \setminus Y|, |(N \setminus X) \setminus Y|))- (max(|X|, |N \setminus X|)+max(|Y|, |N \setminus Y|)))} {|N|-\frac{1}{2}(max(|X|, |N \setminus X|)+ max(|Y|, |N \setminus Y|))}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{GK_\lambda} = \frac{\frac{1}{2}((max(a,b)+max(c,d)+max(a,c)+max(b,d))- (max(a+b,c+d)+max(a+c,b+d)))} {n-\frac{1}{2}(max(a+b,c+d)+max(a+c,b+d))}\]
New in version 0.4.0.
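A minimal sketch of the 2x2 form follows. The function name and the zero-denominator fallback are assumptions made for illustration, and the counts passed in the first call are an assumed table for 'cat' versus 'hat', not output captured from abydos.

```python
def gk_lambda(a, b, c, d):
    """Symmetric Goodman & Kruskal lambda from 2x2 counts (hypothetical helper)."""
    n = a + b + c + d
    # Sum of the largest cell in each row and each column.
    cell_max = max(a, b) + max(c, d) + max(a, c) + max(b, d)
    # Sum of the largest row marginal and the largest column marginal.
    marg_max = max(a + b, c + d) + max(a + c, b + d)
    den = n - marg_max / 2
    return (cell_max - marg_max) / 2 / den if den else 0.0


print(gk_lambda(2, 2, 2, 778))   # assumed counts for 'cat' vs. 'hat'
print(gk_lambda(10, 0, 0, 10))   # perfect association
```

When d dominates the table and a is no larger than b or c, the cell maxima add up to the marginal maxima and the numerator vanishes, which is consistent with the 0.0 values in the examples.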
Initialize GoodmanKruskalLambda instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
-
sim
(src, tar)[source]¶ Return Goodman & Kruskal's Lambda similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Goodman & Kruskal's Lambda similarity
- Return type
float
Examples
>>> cmp = GoodmanKruskalLambda()
>>> cmp.sim('cat', 'hat')
0.0
>>> cmp.sim('Niall', 'Neil')
0.0
>>> cmp.sim('aluminum', 'Catalan')
0.0
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
-
class
abydos.distance.
GoodmanKruskalLambdaR
(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Goodman & Kruskal Lambda-r correlation.
For two sets X and Y and a population N, Goodman & Kruskal \(\lambda_r\) correlation [GK54] is
\[corr_{GK_{\lambda_r}}(X, Y) = \frac{|X \cap Y| + |(N \setminus X) \setminus Y| - \frac{1}{2}(max(|X|, |N \setminus X|) + max(|Y|, |N \setminus Y|))} {|N| - \frac{1}{2}(max(|X|, |N \setminus X|) + max(|Y|, |N \setminus Y|))}\]In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{GK_{\lambda_r}} = \frac{a + d - \frac{1}{2}(max(a+b,c+d)+max(a+c,b+d))} {n - \frac{1}{2}(max(a+b,c+d)+max(a+c,b+d))}\]New in version 0.4.0.
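The 2x2 form translates into a few lines of Python. The helper name is hypothetical, the zero-denominator fallback is an assumption, and the counts are assumed confusion tables for the documented string pairs (padded bigrams, 784-token population) rather than values read out of abydos.

```python
def gk_lambda_r(a, b, c, d):
    """Goodman & Kruskal lambda-r from 2x2 counts (hypothetical helper)."""
    n = a + b + c + d
    # Half the sum of the largest row marginal and largest column marginal.
    half_marg = (max(a + b, c + d) + max(a + c, b + d)) / 2
    den = n - half_marg
    return (a + d - half_marg) / den if den else 0.0


print(gk_lambda_r(2, 2, 2, 778))  # assumed counts for 'cat' vs. 'hat'
print(gk_lambda_r(0, 5, 5, 774))  # assumed counts for 'ATCG' vs. 'TAGC'
```

The documented sim values are consistent with mapping this correlation to [0, 1] as (1 + corr) / 2.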
Initialize GoodmanKruskalLambdaR instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
-
corr
(src, tar)[source]¶ Return Goodman & Kruskal Lambda-r correlation of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Goodman & Kruskal Lambda-r correlation
- Return type
float
Examples
>>> cmp = GoodmanKruskalLambdaR()
>>> cmp.corr('cat', 'hat')
0.0
>>> cmp.corr('Niall', 'Neil')
-0.2727272727272727
>>> cmp.corr('aluminum', 'Catalan')
-0.7647058823529411
>>> cmp.corr('ATCG', 'TAGC')
-1.0
New in version 0.4.0.
-
sim
(src, tar)[source]¶ Return Goodman & Kruskal Lambda-r similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Goodman & Kruskal Lambda-r similarity
- Return type
float
Examples
>>> cmp = GoodmanKruskalLambdaR()
>>> cmp.sim('cat', 'hat')
0.5
>>> cmp.sim('Niall', 'Neil')
0.36363636363636365
>>> cmp.sim('aluminum', 'Catalan')
0.11764705882352944
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
-
class
abydos.distance.
GoodmanKruskalTauA
(alphabet=None, tokenizer=None, intersection_type='crisp', normalizer='proportional', **kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Goodman & Kruskal's Tau A similarity.
For two sets X and Y and a population N, Goodman & Kruskal's \(\tau_a\) similarity [GK54], by analogy with \(\tau_b\), is
\[sim_{GK_{\tau_a}}(X, Y) = \frac{\frac{\frac{|X \cap Y|}{|N|}^2 + \frac{|Y \setminus X|}{|N|}^2}{\frac{|Y|}{|N|}}+ \frac{\frac{|X \setminus Y|}{|N|}^2 + \frac{|(N \setminus X) \setminus Y|}{|N|}^2} {\frac{|N \setminus Y|}{|N|}} - (\frac{|X|}{|N|}^2 + \frac{|N \setminus X|}{|N|}^2)} {1 - (\frac{|X|}{|N|}^2 + \frac{|N \setminus X|}{|N|}^2)}\]
In 2x2 confusion table terms, where a+b+c+d=n, after each term has been converted to a proportion by dividing by n, this is
\[sim_{GK_{\tau_a}} = \frac{ \frac{a^2 + c^2}{a+c} + \frac{b^2 + d^2}{b+d} - ((a+b)^2 + (c+d)^2)} {1 - ((a+b)^2 + (c+d)^2)}\]New in version 0.4.0.
Initialize GoodmanKruskalTauA instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
normalizer (str) -- Specifies the normalization type. See normalizer description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
-
sim
(src, tar)[source]¶ Return Goodman & Kruskal's Tau A similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Goodman & Kruskal's Tau A similarity
- Return type
float
Examples
>>> cmp = GoodmanKruskalTauA()
>>> cmp.sim('cat', 'hat')
0.3304969657208484
>>> cmp.sim('Niall', 'Neil')
0.22137604585914503
>>> cmp.sim('aluminum', 'Catalan')
0.05991264724130685
>>> cmp.sim('ATCG', 'TAGC')
4.119695274745721e-05
New in version 0.4.0.
-
class
abydos.distance.
GoodmanKruskalTauB
(alphabet=None, tokenizer=None, intersection_type='crisp', normalizer='proportional', **kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Goodman & Kruskal's Tau B similarity.
For two sets X and Y and a population N, Goodman & Kruskal's \(\tau_b\) similarity [GK54] is
\[sim_{GK_{\tau_b}}(X, Y) = \frac{\frac{\frac{|X \cap Y|}{|N|}^2 + \frac{|X \setminus Y|}{|N|}^2}{\frac{|X|}{|N|}}+ \frac{\frac{|Y \setminus X|}{|N|}^2 + \frac{|(N \setminus X) \setminus Y|}{|N|}^2} {\frac{|N \setminus X|}{|N|}} - (\frac{|Y|}{|N|}^2 + \frac{|N \setminus Y|}{|N|}^2)} {1 - (\frac{|Y|}{|N|}^2 + \frac{|N \setminus Y|}{|N|}^2)}\]In 2x2 confusion table terms, where a+b+c+d=n, after each term has been converted to a proportion by dividing by n, this is
\[sim_{GK_{\tau_b}} = \frac{ \frac{a^2 + b^2}{a+b} + \frac{c^2 + d^2}{c+d} - ((a+c)^2 + (b+d)^2)} {1 - ((a+c)^2 + (b+d)^2)}\]New in version 0.4.0.
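The two tau formulas in 2x2 terms differ only in swapping the roles of b and c, which the sketch below makes explicit. Function names and zero-division fallbacks are assumptions made for illustration. The sketch works on exact proportions that sum to 1, so it is not expected to reproduce the library's example values digit for digit.

```python
def gk_tau_b(a, b, c, d):
    """Goodman & Kruskal tau_b from 2x2 counts (hypothetical helper)."""
    n = a + b + c + d
    a, b, c, d = a / n, b / n, c / n, d / n  # counts -> proportions
    col_sq = (a + c) ** 2 + (b + d) ** 2
    if a + b == 0 or c + d == 0 or col_sq == 1:
        return 0.0  # degenerate table; fallback chosen for this sketch
    fit = (a * a + b * b) / (a + b) + (c * c + d * d) / (c + d)
    return (fit - col_sq) / (1 - col_sq)


def gk_tau_a(a, b, c, d):
    """tau_a is the same expression with b and c exchanged."""
    return gk_tau_b(a, c, b, d)


# Arbitrary illustrative counts, not taken from any abydos run.
print(gk_tau_a(4, 0, 4, 8), gk_tau_b(4, 0, 4, 8))
```

For a 2x2 table of exact proportions the two directions coincide, so both calls above print the same value.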
Initialize GoodmanKruskalTauB instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
normalizer (str) -- Specifies the normalization type. See normalizer description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
-
sim
(src, tar)[source]¶ Return Goodman & Kruskal's Tau B similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Goodman & Kruskal's Tau B similarity
- Return type
float
Examples
>>> cmp = GoodmanKruskalTauB()
>>> cmp.sim('cat', 'hat')
0.3304969657208484
>>> cmp.sim('Niall', 'Neil')
0.2346006486710202
>>> cmp.sim('aluminum', 'Catalan')
0.06533810992392582
>>> cmp.sim('ATCG', 'TAGC')
4.119695274745721e-05
New in version 0.4.0.
-
class
abydos.distance.
GowerLegendre
(alphabet=None, tokenizer=None, intersection_type='crisp', theta=0.5, **kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Gower & Legendre similarity.
For two sets X and Y and a population N, the Gower & Legendre similarity [GL86] is
\[sim_{GowerLegendre}(X, Y) = \frac{|X \cap Y| + |(N \setminus X) \setminus Y|} {|X \cap Y| + |(N \setminus X) \setminus Y| + \theta \cdot |X \triangle Y|}\]In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{GowerLegendre} = \frac{a+d}{a+\theta(b+c)+d}\]New in version 0.4.0.
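The effect of theta is easy to see in a small sketch. The function name is hypothetical and the counts are an assumed confusion table for 'cat' versus 'hat' (padded bigrams, 784-token population), chosen because they are consistent with the documented example.

```python
def gower_legendre(a, b, c, d, theta=0.5):
    """Gower & Legendre similarity from 2x2 counts (hypothetical helper)."""
    den = a + d + theta * (b + c)
    return (a + d) / den if den else 0.0


print(gower_legendre(2, 2, 2, 778))             # default theta=0.5
print(gower_legendre(2, 2, 2, 778, theta=1.0))  # mismatches at full weight
```

With theta=1 the denominator becomes n, so the measure reduces to the simple matching coefficient (a+d)/n.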
Initialize GowerLegendre instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
theta (float) -- The weight to place on the symmetric difference.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
-
sim
(src, tar)[source]¶ Return the Gower & Legendre similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Gower & Legendre similarity
- Return type
float
Examples
>>> cmp = GowerLegendre()
>>> cmp.sim('cat', 'hat')
0.9974424552429667
>>> cmp.sim('Niall', 'Neil')
0.9955156950672646
>>> cmp.sim('aluminum', 'Catalan')
0.9903536977491961
>>> cmp.sim('ATCG', 'TAGC')
0.993581514762516
New in version 0.4.0.
-
class
abydos.distance.
GuttmanLambdaA
(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Guttman's Lambda A similarity.
For two sets X and Y and a population N, Guttman's \(\lambda_a\) similarity [Gut41] is
\[sim_{Guttman_{\lambda_a}}(X, Y) = \frac{max(|X \cap Y|, |Y \setminus X|) + max(|X \setminus Y|, |(N \setminus X) \setminus Y|) - max(|X|, |N \setminus X|)} {|N| - max(|X|, |N \setminus X|)}\]In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{Guttman_{\lambda_a}} = \frac{max(a, c) + max(b, d) - max(a+b, c+d)}{n - max(a+b, c+d)}\]New in version 0.4.0.
Initialize GuttmanLambdaA instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
-
sim
(src, tar)[source]¶ Return the Guttman Lambda A similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Guttman's Lambda A similarity
- Return type
float
Examples
>>> cmp = GuttmanLambdaA()
>>> cmp.sim('cat', 'hat')
0.0
>>> cmp.sim('Niall', 'Neil')
0.0
>>> cmp.sim('aluminum', 'Catalan')
0.0
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
-
class
abydos.distance.
GuttmanLambdaB
(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Guttman's Lambda B similarity.
For two sets X and Y and a population N, Guttman's \(\lambda_b\) similarity [Gut41] is
\[sim_{Guttman_{\lambda_b}}(X, Y) = \frac{max(|X \cap Y|, |X \setminus Y|) + max(|Y \setminus X|, |(N \setminus X) \setminus Y|) - max(|Y|, |N \setminus Y|)} {|N| - max(|Y|, |N \setminus Y|)}\]In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{Guttman_{\lambda_b}} = \frac{max(a, b) + max(c, d) - max(a+c, b+d)}{n - max(a+c, b+d)}\]New in version 0.4.0.
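Guttman's two lambdas are mirror images of each other: exchanging b and c in the 2x2 form of one gives the other. The sketch below uses hypothetical helper names and an assumed 0.0 fallback for a zero denominator.

```python
def guttman_lambda_a(a, b, c, d):
    """Guttman's lambda_a from 2x2 counts (hypothetical helper)."""
    n = a + b + c + d
    row_max = max(a + b, c + d)
    den = n - row_max
    return (max(a, c) + max(b, d) - row_max) / den if den else 0.0


def guttman_lambda_b(a, b, c, d):
    """lambda_b is lambda_a with b and c exchanged."""
    return guttman_lambda_a(a, c, b, d)


# Arbitrary illustrative counts, not taken from any abydos run.
print(guttman_lambda_a(6, 1, 2, 3), guttman_lambda_b(6, 1, 2, 3))
```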
Initialize GuttmanLambdaB instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
-
sim
(src, tar)[source]¶ Return the Guttman Lambda B similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Guttman's Lambda B similarity
- Return type
float
Examples
>>> cmp = GuttmanLambdaB()
>>> cmp.sim('cat', 'hat')
0.0
>>> cmp.sim('Niall', 'Neil')
0.0
>>> cmp.sim('aluminum', 'Catalan')
0.0
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
-
class
abydos.distance.
GwetAC
(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Gwet's AC correlation.
For two sets X and Y and a population N, Gwet's AC correlation [Gwe08] is
\[corr_{Gwet_{AC}}(X, Y) = AC = \frac{p_o - p_e^{AC}}{1 - p_e^{AC}}\]where
\[ \begin{align}\begin{aligned}\begin{array}{lll} p_o &=&\frac{|X \cap Y| + |(N \setminus X) \setminus Y|}{|N|}\\p_e^{AC}&=&\frac{1}{2}\Big(\frac{|X|+|Y|}{|N|}\cdot \frac{|N \setminus X| + |N \setminus Y|}{|N|}\Big) \end{array}\end{aligned}\end{align} \]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[ \begin{align}\begin{aligned}\begin{array}{lll} p_o&=&\frac{a+d}{n}\\p_e^{AC}&=&\frac{1}{2}\Big(\frac{2a+b+c}{n}\cdot \frac{2d+b+c}{n}\Big) \end{array}\end{aligned}\end{align} \]New in version 0.4.0.
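A short sketch of the 2x2 form of Gwet's AC follows. The function name is hypothetical, and the counts are an assumed confusion table for 'cat' versus 'hat' (padded bigrams, 784-token population), chosen because they are consistent with the documented example.

```python
def gwet_ac(a, b, c, d):
    """Gwet's AC correlation from 2x2 counts (hypothetical helper)."""
    n = a + b + c + d
    p_o = (a + d) / n  # observed agreement
    # Chance agreement; never exceeds 0.5, so the denominator stays positive.
    p_e = 0.5 * ((2 * a + b + c) / n) * ((2 * d + b + c) / n)
    return (p_o - p_e) / (1 - p_e)


print(gwet_ac(2, 2, 2, 778))  # assumed counts for 'cat' vs. 'hat'
```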
Initialize GwetAC instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
-
corr
(src, tar)[source]¶ Return Gwet's AC correlation of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Gwet's AC correlation
- Return type
float
Examples
>>> cmp = GwetAC()
>>> cmp.corr('cat', 'hat')
0.9948456319360438
>>> cmp.corr('Niall', 'Neil')
0.990945276504824
>>> cmp.corr('aluminum', 'Catalan')
0.9804734301840141
>>> cmp.corr('ATCG', 'TAGC')
0.9870811678360627
New in version 0.4.0.
-
sim
(src, tar)[source]¶ Return Gwet's AC similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Gwet's AC similarity
- Return type
float
Examples
>>> cmp = GwetAC()
>>> cmp.sim('cat', 'hat')
0.9974228159680218
>>> cmp.sim('Niall', 'Neil')
0.995472638252412
>>> cmp.sim('aluminum', 'Catalan')
0.9902367150920071
>>> cmp.sim('ATCG', 'TAGC')
0.9935405839180314
New in version 0.4.0.
-
class
abydos.distance.
Hamann
(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Hamann correlation.
For two sets X and Y and a population N, the Hamann correlation [Ham61] is
\[corr_{Hamann}(X, Y) = \frac{|X \cap Y| + |(N \setminus X) \setminus Y| - |X \setminus Y| - |Y \setminus X|}{|N|}\]In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{Hamann} = \frac{a+d-b-c}{n}\]New in version 0.4.0.
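The correlation and its [0, 1] normalization can be sketched as follows. The helper names are hypothetical, and the counts are an assumed confusion table for 'cat' versus 'hat' (padded bigrams, 784-token population), chosen because they are consistent with the documented examples.

```python
def hamann_corr(a, b, c, d):
    """Hamann correlation from 2x2 counts (hypothetical helper)."""
    return (a + d - b - c) / (a + b + c + d)


def hamann_sim(a, b, c, d):
    """Map the correlation from [-1, 1] to [0, 1]."""
    return (1 + hamann_corr(a, b, c, d)) / 2


print(hamann_corr(2, 2, 2, 778))  # assumed counts for 'cat' vs. 'hat'
print(hamann_sim(2, 2, 2, 778))
```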
Initialize Hamann instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
-
corr
(src, tar)[source]¶ Return the Hamann correlation of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Hamann correlation
- Return type
float
Examples
>>> cmp = Hamann()
>>> cmp.corr('cat', 'hat')
0.9897959183673469
>>> cmp.corr('Niall', 'Neil')
0.9821428571428571
>>> cmp.corr('aluminum', 'Catalan')
0.9617834394904459
>>> cmp.corr('ATCG', 'TAGC')
0.9744897959183674
New in version 0.4.0.
-
sim
(src, tar)[source]¶ Return the normalized Hamann similarity of two strings.
Hamann correlation, which has a range of [-1, 1], is normalized to [0, 1] by adding 1 and dividing by 2.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Normalized Hamann similarity
- Return type
float
Examples
>>> cmp = Hamann()
>>> cmp.sim('cat', 'hat')
0.9948979591836735
>>> cmp.sim('Niall', 'Neil')
0.9910714285714286
>>> cmp.sim('aluminum', 'Catalan')
0.9808917197452229
>>> cmp.sim('ATCG', 'TAGC')
0.9872448979591837
New in version 0.4.0.
-
class
abydos.distance.
HarrisLahey
(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Harris & Lahey similarity.
For two sets X and Y and a population N, Harris & Lahey similarity [HL78] is
\[sim_{HarrisLahey}(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}\cdot \frac{|N \setminus Y| + |N \setminus X|}{2|N|}+ \frac{|(N \setminus X) \setminus Y|}{|N \setminus (X \cap Y)|}\cdot \frac{|X| + |Y|}{2|N|}\]In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{HarrisLahey} = \frac{a}{a+b+c}\cdot\frac{2d+b+c}{2n}+ \frac{d}{d+b+c}\cdot\frac{2a+b+c}{2n}\]
Notes
Most catalogs of similarity coefficients [War08][Mor12][Xia13] omit the \(n\) terms in the denominators, but the worked example in [HL78] makes it clear that the original measure includes them.
New in version 0.4.0.
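The weighting of occurrence and non-occurrence agreement, including the n terms in the denominators, can be sketched as below. The function name and the 0.0 fallbacks for empty ratios are assumptions. The counts are an assumed confusion table for 'cat' versus 'hat' (padded bigrams, 784-token population), chosen because they are consistent with the documented example.

```python
def harris_lahey(a, b, c, d):
    """Harris & Lahey similarity from 2x2 counts (hypothetical helper)."""
    n = a + b + c + d
    occurrence = a / (a + b + c) if a + b + c else 0.0
    non_occurrence = d / (b + c + d) if b + c + d else 0.0
    # Each agreement ratio is weighted by the opposite marginal share.
    return (
        occurrence * (2 * d + b + c) / (2 * n)
        + non_occurrence * (2 * a + b + c) / (2 * n)
    )


print(harris_lahey(2, 2, 2, 778))  # assumed counts for 'cat' vs. 'hat'
```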
Initialize HarrisLahey instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
-
sim
(src, tar)[source]¶ Return the Harris & Lahey similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Harris & Lahey similarity
- Return type
float
Examples
>>> cmp = HarrisLahey()
>>> cmp.sim('cat', 'hat')
0.3367085964820711
>>> cmp.sim('Niall', 'Neil')
0.22761577457069784
>>> cmp.sim('aluminum', 'Catalan')
0.07244410503054725
>>> cmp.sim('ATCG', 'TAGC')
0.006296204706372345
New in version 0.4.0.
-
class
abydos.distance.
Hassanat
(tokenizer=None, **kwargs)[source]¶ Bases:
abydos.distance._token_distance._TokenDistance
Hassanat distance.
For two multisets X and Y drawn from an alphabet S, Hassanat distance [Has14] is
\[dist_{Hassanat}(X, Y) = \sum_{i \in S} D(X_i, Y_i)\]where
\[\begin{split}D(X_i, Y_i) = \left\{\begin{array}{ll} 1-\frac{1+min(X_i, Y_i)}{1+max(X_i, Y_i)}&, min(X_i, Y_i) \geq 0 \\ \\ 1-\frac{1+min(X_i, Y_i)+|min(X_i, Y_i)|} {1+max(X_i, Y_i)+|min(X_i, Y_i)|}&, min(X_i, Y_i) < 0 \end{array}\right.\end{split}\]New in version 0.4.0.
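The per-token term D can be sketched with plain Counters. Everything named below is hypothetical: the padded-bigram tokenizer only imitates a q-gram tokenizer with '$' and '#' as start and stop symbols, and dividing by the number of distinct tokens is an assumed normalization that happens to be consistent with the documented dist examples.

```python
from collections import Counter


def padded_qgrams(word, q=2):
    """Multiset of q-grams of word, padded with '$' and '#' (assumed scheme)."""
    padded = "$" * (q - 1) + word + "#" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def hassanat_abs(src, tar):
    """Sum of D(X_i, Y_i) over all tokens seen in either string."""
    x, y = padded_qgrams(src), padded_qgrams(tar)
    total = 0.0
    for token in x.keys() | y.keys():
        lo, hi = sorted((x[token], y[token]))
        # For negative minima the formula shifts both values by |min|;
        # token counts are never negative, so shift is 0 here.
        shift = abs(lo) if lo < 0 else 0
        total += 1 - (1 + lo + shift) / (1 + hi + shift)
    return total


def hassanat_norm(src, tar):
    """Assumed normalization: divide by the number of distinct tokens."""
    tokens = padded_qgrams(src).keys() | padded_qgrams(tar).keys()
    return hassanat_abs(src, tar) / len(tokens) if tokens else 0.0


print(hassanat_abs("cat", "hat"), hassanat_norm("cat", "hat"))
```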
Initialize Hassanat instance.
- Parameters
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
New in version 0.4.0.
- dist(src, tar)
Return the normalized Hassanat distance of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Normalized Hassanat distance
- Return type
float
Examples
>>> cmp = Hassanat()
>>> cmp.dist('cat', 'hat')
0.3333333333333333
>>> cmp.dist('Niall', 'Neil')
0.3888888888888889
>>> cmp.dist('aluminum', 'Catalan')
0.4777777777777778
>>> cmp.dist('ATCG', 'TAGC')
0.5
New in version 0.4.0.
- dist_abs(src, tar)
Return the Hassanat distance of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Hassanat distance
- Return type
float
Examples
>>> cmp = Hassanat()
>>> cmp.dist_abs('cat', 'hat')
2.0
>>> cmp.dist_abs('Niall', 'Neil')
3.5
>>> cmp.dist_abs('aluminum', 'Catalan')
7.166666666666667
>>> cmp.dist_abs('ATCG', 'TAGC')
5.0
New in version 0.4.0.
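The Hassanat formula can be traced by hand with a few lines of plain Python. The following is a minimal sketch, not the library's implementation: the helper names bigrams, hassanat_term, and hassanat_abs are hypothetical, and the tokenization (character bigrams padded with '$' and '#') is an assumption chosen for illustration.

```python
from collections import Counter


def bigrams(word, start="$", stop="#"):
    """Count padded character bigrams (hypothetical helper, assumed tokenization)."""
    padded = start + word + stop
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def hassanat_term(x, y):
    """Per-token term D(X_i, Y_i) for two token counts."""
    lo, hi = min(x, y), max(x, y)
    if lo >= 0:
        return 1 - (1 + lo) / (1 + hi)
    # Token counts are never negative; this branch only mirrors the
    # second case of the definition.
    return 1 - (1 + lo + abs(lo)) / (1 + hi + abs(lo))


def hassanat_abs(src, tar):
    """Sum the per-token terms over every token seen in either string."""
    x, y = bigrams(src), bigrams(tar)
    # A Counter returns 0 for tokens it has not seen.
    return sum(hassanat_term(x[t], y[t]) for t in x.keys() | y.keys())


print(hassanat_abs("cat", "hat"))
```

Under these assumptions, 'cat' and 'hat' differ in four bigrams, each contributing 1 - 1/2, while the two shared bigrams contribute 0, which agrees with the dist_abs value of 2.0 shown in the examples.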
- class abydos.distance.HawkinsDotson(alphabet=None, tokenizer=None, intersection_type='crisp', **kwargs)
Bases: abydos.distance._token_distance._TokenDistance
Hawkins & Dotson similarity.
For two sets X and Y and a population N, Hawkins & Dotson similarity [HD73] is the mean of the occurrence agreement and non-occurrence agreement
\[sim_{HawkinsDotson}(X, Y) = \frac{1}{2}\cdot\Big( \frac{|X \cap Y|}{|X \cup Y|}+ \frac{|(N \setminus X) \setminus Y|}{|N \setminus (X \cap Y)|} \Big)\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{HawkinsDotson} = \frac{1}{2}\cdot\Big(\frac{a}{a+b+c}+\frac{d}{b+c+d}\Big)\]
New in version 0.4.0.
Initialize HawkinsDotson instance.
- Parameters
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See the alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See the intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src, tar)
Return the Hawkins & Dotson similarity of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Hawkins & Dotson similarity
- Return type
float
Examples
>>> cmp = HawkinsDotson()
>>> cmp.sim('cat', 'hat')
0.6641091219096334
>>> cmp.sim('Niall', 'Neil')
0.606635407786303
>>> cmp.sim('aluminum', 'Catalan')
0.5216836734693877
>>> cmp.sim('ATCG', 'TAGC')
0.49362244897959184
New in version 0.4.0.
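The confusion-table form of the Hawkins & Dotson similarity is easy to evaluate directly. The sketch below assumes the four counts are already known; the function name hawkins_dotson is hypothetical, and the counts in the usage line (a population of 784 possible tokens, of which 2 are shared and 2 are unique to each string) are assumptions chosen for illustration rather than values taken from the library.

```python
def hawkins_dotson(a, b, c, d):
    """Mean of occurrence agreement and non-occurrence agreement.

    a: tokens in both sets, b and c: tokens in only one set,
    d: tokens of the population in neither set.
    Assumes a + b + c > 0 and b + c + d > 0.
    """
    occurrence = a / (a + b + c)
    non_occurrence = d / (b + c + d)
    return (occurrence + non_occurrence) / 2


# Assumed counts: n = 784, so d = 784 - (2 + 2 + 2) = 778.
print(hawkins_dotson(2, 2, 2, 778))
```

With these assumed counts the result is about 0.664, in line with the value shown for 'cat' and 'hat' in the examples. Because d is usually much larger than b + c, the second term stays close to 1 and the score rarely falls far below 0.5.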
- class abydos.distance.Hellinger(tokenizer=None, **kwargs)
Bases: abydos.distance._token_distance._TokenDistance
Hellinger distance.
For two multisets X and Y drawn from an alphabet S, Hellinger distance [Hel09] is
\[dist_{Hellinger}(X, Y) = \sqrt{2 \cdot \sum_{i \in S} (\sqrt{|X_i|} - \sqrt{|Y_i|})^2}\]
New in version 0.4.0.
Initialize Hellinger instance.
- Parameters
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
**kwargs -- Arbitrary keyword arguments
- Other Parameters
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
New in version 0.4.0.
- dist(src, tar)
Return the normalized Hellinger distance of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Normalized Hellinger distance
- Return type
float
Examples
>>> cmp = Hellinger()
>>> cmp.dist('cat', 'hat')
0.8164965809277261
>>> cmp.dist('Niall', 'Neil')
0.881917103688197
>>> cmp.dist('aluminum', 'Catalan')
0.9128709291752769
>>> cmp.dist('ATCG', 'TAGC')
1.0
New in version 0.4.0.
- dist_abs(src, tar)
Return the Hellinger distance of two strings.
- Parameters
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns
Hellinger distance
- Return type
float
Examples
>>> cmp = Hellinger()
>>> cmp.dist_abs('cat', 'hat')
2.8284271247461903
>>> cmp.dist_abs('Niall', 'Neil')
3.7416573867739413
>>> cmp.dist_abs('aluminum', 'Catalan')
5.477225575051661
>>> cmp.dist_abs('ATCG', 'TAGC')
4.47213595499958
New in version 0.4.0.
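The unnormalized Hellinger distance can be reproduced from token counts alone. This is a minimal sketch under stated assumptions, not the library's code: the helpers bigrams and hellinger_abs are hypothetical names, and the strings are assumed to be tokenized into character bigrams padded with '$' and '#'.

```python
from collections import Counter
from math import sqrt


def bigrams(word, start="$", stop="#"):
    """Count padded character bigrams (hypothetical helper, assumed tokenization)."""
    padded = start + word + stop
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def hellinger_abs(src, tar):
    """Square root of twice the summed squared differences of root counts."""
    x, y = bigrams(src), bigrams(tar)
    # A Counter returns 0 for tokens it has not seen.
    total = sum((sqrt(x[t]) - sqrt(y[t])) ** 2 for t in x.keys() | y.keys())
    return sqrt(2 * total)


print(hellinger_abs("cat", "hat"))
```

Under these assumptions, 'cat' and 'hat' differ in four bigrams that each contribute 1 to the sum, giving sqrt(2 * 4), roughly 2.828, which agrees with the dist_abs example.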
- class abydos.distance.HendersonHeron(**kwargs)
Bases: abydos.distance._token_distance._TokenDistance
Henderson-Heron dissimilarity.
For two sets X and Y and a population N, Henderson-Heron dissimilarity [HH77] is:
New in version 0.4.1.
Initialize HendersonHeron instance.
- Parameters
**kwargs -- Arbitrary keyword arguments
New in version 0.4.1.
- dist(src, tar)
Return the Henderson-Heron dissimilarity of two strings.
- Parameters
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns
Henderson-Heron dissimilarity
- Return type
float
Examples
>>> cmp = HendersonHeron()
>>> cmp.dist('cat', 'hat')
0.00011668873858680838
>>> cmp.dist('Niall', 'Neil')
0.00048123075776606097
>>> cmp.dist('aluminum', 'Catalan')
0.08534181060514882
>>> cmp.dist('ATCG', 'TAGC')
0.9684367974410505
New in version 0.4.1.
- class abydos.distance.HornMorisita(**kwargs)
Bases: abydos.distance._token_distance._TokenDistance
Horn-Morisita index of overlap.
Horn-Morisita index of overlap [Hor66], given two populations X and Y drawn from S species, is:
\[sim_{Horn-Morisita}(X, Y) = C_{\lambda} = \frac{2\sum_{i=1}^S x_i y_i} {(\hat{\lambda}_x + \hat{\lambda}_y