abydos.fingerprint package
The fingerprint package implements string fingerprints such as:
Basic fingerprinters originating in OpenRefine <http://openrefine.org>:
- String (String)
- Q-Gram (QGram)
- Phonetic (Phonetic)
Fingerprints developed by Pollock & Zamora:
- Skeleton key (SkeletonKey)
- Omission key (OmissionKey)
Fingerprints developed by Cisłak & Grabowski:
- Occurrence (Occurrence)
- Occurrence halved (OccurrenceHalved)
- Count (Count)
- Position (Position)
The Synoname toolcode (SynonameToolcode)
Each fingerprint class has a fingerprint method that takes a string and returns the string's fingerprint:
>>> sk = SkeletonKey()
>>> sk.fingerprint('orange')
'ORNGAE'
>>> sk.fingerprint('strange')
'STRNGAE'
class abydos.fingerprint.String
Bases: abydos.fingerprint._fingerprint._Fingerprint
String Fingerprint.
The fingerprint of a string is a string consisting of all of the unique words in a string, alphabetized & concatenated with intervening joiners. This fingerprint is described at [Ope12].
fingerprint(phrase, joiner=' ')
Return string fingerprint.
Parameters:
- phrase (str) -- The string from which to calculate the fingerprint
- joiner (str) -- The string that will be placed between each word
Returns: The fingerprint of the phrase
Return type: str
Example
>>> sf = String()
>>> sf.fingerprint('The quick brown fox jumped over the lazy dog.')
'brown dog fox jumped lazy over quick the'
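The description above (unique words, alphabetized, joined) can be sketched in plain Python. This is a minimal illustration and not the library's implementation: the name sketch_string_fingerprint is hypothetical, and the lowercasing and removal of punctuation are assumptions inferred from the example output.

```python
def sketch_string_fingerprint(phrase, joiner=' '):
    """Return the unique words of phrase, sorted and joined (illustrative only)."""
    # Assumption: case is folded and punctuation is discarded before splitting.
    cleaned = ''.join(ch for ch in phrase.lower() if ch.isalnum() or ch.isspace())
    # A set removes duplicate words; sorted() alphabetizes them.
    return joiner.join(sorted(set(cleaned.split())))


print(sketch_string_fingerprint('The quick brown fox jumped over the lazy dog.'))
# prints: brown dog fox jumped lazy over quick the
```

Because duplicates and word order are discarded, phrases that differ only in case, punctuation, or word order map to the same fingerprint.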
abydos.fingerprint.str_fingerprint(phrase, joiner=' ')
Return string fingerprint.
This is a wrapper for String.fingerprint().
Parameters:
- phrase (str) -- The string from which to calculate the fingerprint
- joiner (str) -- The string that will be placed between each word
Returns: The fingerprint of the phrase
Return type: str
Example
>>> str_fingerprint('The quick brown fox jumped over the lazy dog.')
'brown dog fox jumped lazy over quick the'
class abydos.fingerprint.QGram
Bases: abydos.fingerprint._fingerprint._Fingerprint
Q-Gram Fingerprint.
A q-gram fingerprint is a string consisting of all of the unique q-grams in a string, alphabetized & concatenated. This fingerprint is described at [Ope12].
fingerprint(phrase, qval=2, start_stop='', joiner='')
Return Q-Gram fingerprint.
Parameters:
- phrase (str) -- The string from which to calculate the q-gram fingerprint
- qval (int) -- The length of each q-gram (by default 2)
- start_stop (str) -- The start & stop symbol(s) to concatenate on either end of the phrase, as defined in tokenizer.QGrams
- joiner (str) -- The string that will be placed between each q-gram
Returns: The q-gram fingerprint of the phrase
Return type: str
Examples
>>> qf = QGram()
>>> qf.fingerprint('The quick brown fox jumped over the lazy dog.')
'azbrckdoedeleqerfoheicjukblampnfogovowoxpequrortthuiumvewnxjydzy'
>>> qf.fingerprint('Christopher')
'cherhehrisopphristto'
>>> qf.fingerprint('Niall')
'aliallni'
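The q-gram fingerprint can be sketched the same way. The following is a simplified illustration, not the library's code: sketch_qgram_fingerprint is a hypothetical name, the start_stop padding is left out, and dropping spaces and punctuation before slicing is an assumption inferred from the example outputs above.

```python
def sketch_qgram_fingerprint(phrase, qval=2, joiner=''):
    """Return the unique q-grams of phrase, sorted and joined (illustrative only)."""
    # Assumption: lowercase the phrase and keep only letters and digits.
    cleaned = ''.join(ch for ch in phrase.lower() if ch.isalnum())
    # Slide a window of length qval over the cleaned text; the set drops repeats.
    qgrams = {cleaned[i:i + qval] for i in range(len(cleaned) - qval + 1)}
    return joiner.join(sorted(qgrams))


print(sketch_qgram_fingerprint('Niall'))        # prints: aliallni
print(sketch_qgram_fingerprint('Christopher'))  # prints: cherhehrisopphristto
```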
abydos.fingerprint.qgram_fingerprint(phrase, qval=2, start_stop='', joiner='')
Return Q-Gram fingerprint.
This is a wrapper for QGram.fingerprint().
Parameters:
- phrase (str) -- The string from which to calculate the q-gram fingerprint
- qval (int) -- The length of each q-gram (by default 2)
- start_stop (str) -- The start & stop symbol(s) to concatenate on either end of the phrase, as defined in tokenizer.QGrams
- joiner (str) -- The string that will be placed between each q-gram
Returns: The q-gram fingerprint of the phrase
Return type: str
Examples
>>> qgram_fingerprint('The quick brown fox jumped over the lazy dog.')
'azbrckdoedeleqerfoheicjukblampnfogovowoxpequrortthuiumvewnxjydzy'
>>> qgram_fingerprint('Christopher')
'cherhehrisopphristto'
>>> qgram_fingerprint('Niall')
'aliallni'
class abydos.fingerprint.Phonetic
Bases: abydos.fingerprint._string.String
Phonetic Fingerprint.
A phonetic fingerprint is identical to a standard string fingerprint, as implemented in String, but performs the fingerprinting function after converting the string to its phonetic form, as determined by some phonetic algorithm. This fingerprint is described at [Ope12].
fingerprint(phrase, phonetic_algorithm=<function double_metaphone>, joiner=' ', *args, **kwargs)
Return the phonetic fingerprint of a phrase.
Parameters:
- phrase (str) -- The string from which to calculate the phonetic fingerprint
- phonetic_algorithm (function) -- A phonetic algorithm that takes a string and returns a string (presumably a phonetic representation of the original string). By default, this function uses double_metaphone().
- joiner (str) -- The string that will be placed between each word
- *args -- Variable length argument list
- **kwargs -- Arbitrary keyword arguments
Returns: The phonetic fingerprint of the phrase
Return type: str
Examples
>>> pf = Phonetic()
>>> pf.fingerprint('The quick brown fox jumped over the lazy dog.')
'0 afr fks jmpt kk ls prn tk'
>>> from abydos.phonetic import soundex
>>> pf.fingerprint('The quick brown fox jumped over the lazy dog.',
... phonetic_algorithm=soundex)
'b650 d200 f200 j513 l200 o160 q200 t000'
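The two-step idea (encode each word phonetically, then take the string fingerprint of the encoded words) might look like the sketch below. It is illustrative only: sketch_phonetic_fingerprint and drop_vowels are hypothetical names, and drop_vowels is a toy stand-in for a real phonetic algorithm such as Double Metaphone or Soundex.

```python
def drop_vowels(word):
    """Toy stand-in for a phonetic algorithm: keep the first letter, drop later vowels."""
    word = word.lower()
    return word[:1] + ''.join(ch for ch in word[1:] if ch not in 'aeiou')


def sketch_phonetic_fingerprint(phrase, phonetic_algorithm=drop_vowels, joiner=' '):
    """Encode each word, then return the sorted unique codes (illustrative only)."""
    # Step 1: encode every whitespace-separated word with the chosen algorithm.
    encoded = [phonetic_algorithm(word) for word in phrase.split()]
    # Step 2: strip non-alphanumerics, drop empty codes, deduplicate, and sort.
    cleaned = (''.join(ch for ch in code if ch.isalnum()) for code in encoded)
    return joiner.join(sorted({code for code in cleaned if code}))


print(sketch_phonetic_fingerprint('The quick brown fox'))
# prints: brwn fx qck th
```

Any callable that maps a string to a string can be passed as phonetic_algorithm, which mirrors the pluggable parameter documented above.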
abydos.fingerprint.phonetic_fingerprint(phrase, phonetic_algorithm=<function double_metaphone>, joiner=' ', *args, **kwargs)
Return the phonetic fingerprint of a phrase.
This is a wrapper for Phonetic.fingerprint().
Parameters:
- phrase (str) -- The string from which to calculate the phonetic fingerprint
- phonetic_algorithm (function) -- A phonetic algorithm that takes a string and returns a string (presumably a phonetic representation of the original string). By default, this function uses double_metaphone().
- joiner (str) -- The string that will be placed between each word
- *args -- Variable length argument list
- **kwargs -- Arbitrary keyword arguments
Returns: The phonetic fingerprint of the phrase
Return type: str
Examples
>>> phonetic_fingerprint('The quick brown fox jumped over the lazy dog.')
'0 afr fks jmpt kk ls prn tk'
>>> from abydos.phonetic import soundex
>>> phonetic_fingerprint('The quick brown fox jumped over the lazy dog.',
... phonetic_algorithm=soundex)
'b650 d200 f200 j513 l200 o160 q200 t000'
class abydos.fingerprint.OmissionKey
Bases: abydos.fingerprint._fingerprint._Fingerprint
Omission Key.
The omission key of a word is defined in [PZ84].
fingerprint(word)
Return the omission key.
Parameters:
- word (str) -- The word to transform into its omission key
Returns: The omission key
Return type: str
Examples
>>> ok = OmissionKey()
>>> ok.fingerprint('The quick brown fox jumped over the lazy dog.')
'JKQXZVWYBFMGPDHCLNTREUIOA'
>>> ok.fingerprint('Christopher')
'PHCTSRIOE'
>>> ok.fingerprint('Niall')
'LNIA'
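For readers without [PZ84] at hand, the examples above are consistent with the following reading of the omission key: the word's consonants are emitted in a fixed order (taken here to run from the least to the most frequently omitted consonant), followed by its vowels in order of first appearance. The sketch below is an assumption-laden illustration, not the library's code; sketch_omission_key and the constant name are hypothetical.

```python
# Assumed fixed consonant order for the omission key (Y counts as a consonant).
CONSONANTS_BY_OMISSION = 'JKQXZVWYBFMGPDHCLNTSR'


def sketch_omission_key(word):
    """Return an omission key for word (illustrative only)."""
    # Keep only the letters A-Z, uppercased.
    letters = [ch for ch in word.upper() if 'A' <= ch <= 'Z']
    present = set(letters)
    # Consonants appear in the fixed order above, each at most once.
    key = ''.join(ch for ch in CONSONANTS_BY_OMISSION if ch in present)
    # Vowels follow in order of first appearance in the word.
    for ch in letters:
        if ch in 'AEIOU' and ch not in key:
            key += ch
    return key


print(sketch_omission_key('Christopher'))  # prints: PHCTSRIOE
print(sketch_omission_key('Niall'))        # prints: LNIA
```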
abydos.fingerprint.omission_key(word)
Return the omission key.
This is a wrapper for OmissionKey.fingerprint().
Parameters:
- word (str) -- The word to transform into its omission key
Returns: The omission key
Return type: str
Examples
>>> omission_key('The quick brown fox jumped over the lazy dog.')
'JKQXZVWYBFMGPDHCLNTREUIOA'
>>> omission_key('Christopher')
'PHCTSRIOE'
>>> omission_key('Niall')
'LNIA'
class abydos.fingerprint.SkeletonKey
Bases: abydos.fingerprint._fingerprint._Fingerprint
Skeleton Key.
The skeleton key of a word is defined in [PZ84].
fingerprint(word)
Return the skeleton key.
Parameters:
- word (str) -- The word to transform into its skeleton key
Returns: The skeleton key
Return type: str
Examples
>>> sk = SkeletonKey()
>>> sk.fingerprint('The quick brown fox jumped over the lazy dog.')
'THQCKBRWNFXJMPDVLZYGEUIOA'
>>> sk.fingerprint('Christopher')
'CHRSTPIOE'
>>> sk.fingerprint('Niall')
'NLIA'
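The examples above are consistent with this reading of the skeleton key: the first letter of the word, then the remaining unique consonants in order of appearance, then the unique vowels in order of appearance. The sketch below illustrates that reading under stated assumptions (ASCII letters only, Y treated as a consonant); sketch_skeleton_key is a hypothetical name and the code is not the library's implementation.

```python
def sketch_skeleton_key(word):
    """Return a skeleton key for word (illustrative only)."""
    # Keep only the letters A-Z, uppercased.
    letters = [ch for ch in word.upper() if 'A' <= ch <= 'Z']
    if not letters:
        return ''
    first = letters[0]
    consonants, vowels = [], []
    for ch in letters[1:]:
        if ch == first:
            continue  # the first letter is never repeated in the key
        bucket = vowels if ch in 'AEIOU' else consonants
        if ch not in bucket:
            bucket.append(ch)  # keep only the first occurrence
    return first + ''.join(consonants) + ''.join(vowels)


print(sketch_skeleton_key('orange'))   # prints: ORNGAE
print(sketch_skeleton_key('strange'))  # prints: STRNGAE
```

Placing consonants ahead of vowels means that words differing only in their vowels still share a long common key prefix.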
abydos.fingerprint.skeleton_key(word)
Return the skeleton key.
This is a wrapper for SkeletonKey.fingerprint().
Parameters:
- word (str) -- The word to transform into its skeleton key
Returns: The skeleton key
Return type: str
Examples
>>> skeleton_key('The quick brown fox jumped over the lazy dog.')
'THQCKBRWNFXJMPDVLZYGEUIOA'
>>> skeleton_key('Christopher')
'CHRSTPIOE'
>>> skeleton_key('Niall')
'NLIA'
class abydos.fingerprint.Occurrence
Bases: abydos.fingerprint._fingerprint._Fingerprint
Occurrence Fingerprint.
Based on the occurrence fingerprint from [CislakG17].
fingerprint(word, n_bits=16, most_common=('e', 't', 'a', 'o', 'i', 'n', 's', 'h', 'r', 'd', 'l', 'c', 'u', 'm', 'w', 'f'))
Return the occurrence fingerprint.
Parameters:
- word (str) -- The word to fingerprint
- n_bits (int) -- Number of bits in the fingerprint returned
- most_common (list) -- The most common tokens in the target language, ordered by frequency
Returns: The occurrence fingerprint
Return type: int
Examples
>>> of = Occurrence()
>>> bin(of.fingerprint('hat'))
'0b110000100000000'
>>> bin(of.fingerprint('niall'))
'0b10110000100000'
>>> bin(of.fingerprint('colin'))
'0b1110000110000'
>>> bin(of.fingerprint('atcg'))
'0b110000000010000'
>>> bin(of.fingerprint('entreatment'))
'0b1110010010000100'
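The bit patterns above can be read as one presence bit per letter of most_common, with the most frequent letter in the most significant position. A minimal sketch of that reading follows; sketch_occurrence_fingerprint is a hypothetical name, and the sketch assumes n_bits does not exceed len(most_common), so any padding the library applies for larger values is not modeled.

```python
MOST_COMMON = ('e', 't', 'a', 'o', 'i', 'n', 's', 'h',
               'r', 'd', 'l', 'c', 'u', 'm', 'w', 'f')


def sketch_occurrence_fingerprint(word, n_bits=16, most_common=MOST_COMMON):
    """Return one presence bit per common letter (illustrative only)."""
    present = set(word)
    fingerprint = 0
    for letter in most_common[:n_bits]:
        # Shift left to make room, then set the new lowest bit if the letter occurs.
        fingerprint = (fingerprint << 1) | int(letter in present)
    return fingerprint


print(bin(sketch_occurrence_fingerprint('hat')))    # prints: 0b110000100000000
print(bin(sketch_occurrence_fingerprint('niall')))  # prints: 0b10110000100000
```

bin() omits leading zeros, which is why the strings above are shorter than 16 digits whenever 'e' is absent from the word.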
abydos.fingerprint.occurrence_fingerprint(word, n_bits=16, most_common=('e', 't', 'a', 'o', 'i', 'n', 's', 'h', 'r', 'd', 'l', 'c', 'u', 'm', 'w', 'f'))
Return the occurrence fingerprint.
This is a wrapper for Occurrence.fingerprint().
Parameters:
- word (str) -- The word to fingerprint
- n_bits (int) -- Number of bits in the fingerprint returned
- most_common (list) -- The most common tokens in the target language, ordered by frequency
Returns: The occurrence fingerprint
Return type: int
Examples
>>> bin(occurrence_fingerprint('hat'))
'0b110000100000000'
>>> bin(occurrence_fingerprint('niall'))
'0b10110000100000'
>>> bin(occurrence_fingerprint('colin'))
'0b1110000110000'
>>> bin(occurrence_fingerprint('atcg'))
'0b110000000010000'
>>> bin(occurrence_fingerprint('entreatment'))
'0b1110010010000100'
class abydos.fingerprint.OccurrenceHalved
Bases: abydos.fingerprint._fingerprint._Fingerprint
Occurrence Halved Fingerprint.
Based on the occurrence halved fingerprint from [CislakG17].
fingerprint(word, n_bits=16, most_common=('e', 't', 'a', 'o', 'i', 'n', 's', 'h', 'r', 'd', 'l', 'c', 'u', 'm', 'w', 'f'))
Return the occurrence halved fingerprint.
Parameters:
- word (str) -- The word to fingerprint
- n_bits (int) -- Number of bits in the fingerprint returned
- most_common (list) -- The most common tokens in the target language, ordered by frequency
Returns: The occurrence halved fingerprint
Return type: int
Examples
>>> ohf = OccurrenceHalved()
>>> bin(ohf.fingerprint('hat'))
'0b1010000000010'
>>> bin(ohf.fingerprint('niall'))
'0b10010100000'
>>> bin(ohf.fingerprint('colin'))
'0b1001010000'
>>> bin(ohf.fingerprint('atcg'))
'0b10100000000000'
>>> bin(ohf.fingerprint('entreatment'))
'0b1111010000110000'
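The halved variant appears to spend two bits per letter: one for presence in the first half of the word and one for presence in the second half, so 16 bits cover the eight most common letters. The sketch below follows that reading and reproduces the examples above; sketch_occurrence_halved_fingerprint is a hypothetical name, an even n_bits is assumed, and the split point len(word) // 2 is an assumption inferred from the example outputs.

```python
MOST_COMMON = ('e', 't', 'a', 'o', 'i', 'n', 's', 'h',
               'r', 'd', 'l', 'c', 'u', 'm', 'w', 'f')


def sketch_occurrence_halved_fingerprint(word, n_bits=16, most_common=MOST_COMMON):
    """Return two presence bits per common letter, one per word half (illustrative only)."""
    half = len(word) // 2
    first_half, second_half = set(word[:half]), set(word[half:])
    fingerprint = 0
    for letter in most_common[:n_bits // 2]:
        # High bit of the pair: first half; low bit of the pair: second half.
        pair = (int(letter in first_half) << 1) | int(letter in second_half)
        fingerprint = (fingerprint << 2) | pair
    return fingerprint


print(bin(sketch_occurrence_halved_fingerprint('hat')))  # prints: 0b1010000000010
```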
abydos.fingerprint.occurrence_halved_fingerprint(word, n_bits=16, most_common=('e', 't', 'a', 'o', 'i', 'n', 's', 'h', 'r', 'd', 'l', 'c', 'u', 'm', 'w', 'f'))
Return the occurrence halved fingerprint.
This is a wrapper for OccurrenceHalved.fingerprint().
Parameters:
- word (str) -- The word to fingerprint
- n_bits (int) -- Number of bits in the fingerprint returned
- most_common (list) -- The most common tokens in the target language, ordered by frequency
Returns: The occurrence halved fingerprint
Return type: int
Examples
>>> bin(occurrence_halved_fingerprint('hat'))
'0b1010000000010'
>>> bin(occurrence_halved_fingerprint('niall'))
'0b10010100000'
>>> bin(occurrence_halved_fingerprint('colin'))
'0b1001010000'
>>> bin(occurrence_halved_fingerprint('atcg'))
'0b10100000000000'
>>> bin(occurrence_halved_fingerprint('entreatment'))
'0b1111010000110000'
class abydos.fingerprint.Count
Bases: abydos.fingerprint._fingerprint._Fingerprint
Count Fingerprint.
Based on the count fingerprint from [CislakG17].
fingerprint(word, n_bits=16, most_common=('e', 't', 'a', 'o', 'i', 'n', 's', 'h', 'r', 'd', 'l', 'c', 'u', 'm', 'w', 'f'))
Return the count fingerprint.
Parameters:
- word (str) -- The word to fingerprint
- n_bits (int) -- Number of bits in the fingerprint returned
- most_common (list) -- The most common tokens in the target language, ordered by frequency
Returns: The count fingerprint
Return type: int
Examples
>>> cf = Count()
>>> bin(cf.fingerprint('hat'))
'0b1010000000001'
>>> bin(cf.fingerprint('niall'))
'0b10001010000'
>>> bin(cf.fingerprint('colin'))
'0b101010000'
>>> bin(cf.fingerprint('atcg'))
'0b1010000000000'
>>> bin(cf.fingerprint('entreatment'))
'0b1111010000100000'
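The count fingerprint appears to store a two-bit occurrence count per letter, so 16 bits cover the eight most common letters. The sketch below reproduces the examples above under two loudly stated assumptions: n_bits is even, and counts saturate at 3 (a choice of this sketch; the library may treat letters occurring more than three times differently). The name sketch_count_fingerprint is hypothetical.

```python
from collections import Counter

MOST_COMMON = ('e', 't', 'a', 'o', 'i', 'n', 's', 'h',
               'r', 'd', 'l', 'c', 'u', 'm', 'w', 'f')


def sketch_count_fingerprint(word, n_bits=16, most_common=MOST_COMMON):
    """Return a two-bit count per common letter (illustrative only)."""
    counts = Counter(word)
    fingerprint = 0
    for letter in most_common[:n_bits // 2]:
        # Assumption: counts above 3 are clamped so they fit in two bits.
        fingerprint = (fingerprint << 2) | min(counts[letter], 3)
    return fingerprint


print(bin(sketch_count_fingerprint('entreatment')))  # prints: 0b1111010000100000
```

In 'entreatment', 'e' and 't' each occur three times (11), 'a' once (01) and 'n' twice (10), which accounts for the bit pairs in the output.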
abydos.fingerprint.count_fingerprint(word, n_bits=16, most_common=('e', 't', 'a', 'o', 'i', 'n', 's', 'h', 'r', 'd', 'l', 'c', 'u', 'm', 'w', 'f'))
Return the count fingerprint.
This is a wrapper for Count.fingerprint().
Parameters:
- word (str) -- The word to fingerprint
- n_bits (int) -- Number of bits in the fingerprint returned
- most_common (list) -- The most common tokens in the target language, ordered by frequency
Returns: The count fingerprint
Return type: int
Examples
>>> bin(count_fingerprint('hat'))
'0b1010000000001'
>>> bin(count_fingerprint('niall'))
'0b10001010000'
>>> bin(count_fingerprint('colin'))
'0b101010000'
>>> bin(count_fingerprint('atcg'))
'0b1010000000000'
>>> bin(count_fingerprint('entreatment'))
'0b1111010000100000'
class abydos.fingerprint.Position
Bases: abydos.fingerprint._fingerprint._Fingerprint
Position Fingerprint.
Based on the position fingerprint from [CislakG17].
fingerprint(word, n_bits=16, most_common=('e', 't', 'a', 'o', 'i', 'n', 's', 'h', 'r', 'd', 'l', 'c', 'u', 'm', 'w', 'f'), bits_per_letter=3)
Return the position fingerprint.
Parameters:
- word (str) -- The word to fingerprint
- n_bits (int) -- Number of bits in the fingerprint returned
- most_common (list) -- The most common tokens in the target language, ordered by frequency
- bits_per_letter (int) -- The number of bits used to encode each letter's position
Returns: The position fingerprint
Return type: int
Examples
>>> pf = Position()
>>> bin(pf.fingerprint('hat'))
'0b1110100011111111'
>>> bin(pf.fingerprint('niall'))
'0b1111110101110010'
>>> bin(pf.fingerprint('colin'))
'0b1111111110010111'
>>> bin(pf.fingerprint('atcg'))
'0b1110010001111111'
>>> bin(pf.fingerprint('entreatment'))
'0b101011111111'
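The position examples are consistent with the following reading: each common letter gets bits_per_letter bits holding the index of its first occurrence (capped at the largest value that fits), letters absent from the word get all ones, and the last letter receives whatever bits remain. The sketch below is inferred from the example outputs and is not the library's code; sketch_position_fingerprint is a hypothetical name.

```python
MOST_COMMON = ('e', 't', 'a', 'o', 'i', 'n', 's', 'h',
               'r', 'd', 'l', 'c', 'u', 'm', 'w', 'f')


def sketch_position_fingerprint(word, n_bits=16, most_common=MOST_COMMON,
                                bits_per_letter=3):
    """Return first-occurrence positions packed into n_bits (illustrative only)."""
    max_pos = 2 ** bits_per_letter - 1
    first_pos = {}
    for pos, letter in enumerate(word):
        # Record only the first occurrence, capped to fit in bits_per_letter bits.
        first_pos.setdefault(letter, min(pos, max_pos))
    fingerprint = 0
    for letter in most_common:
        if n_bits <= 0:
            break
        # The final letter may get fewer bits than bits_per_letter.
        width = min(bits_per_letter, n_bits)
        # Absent letters are encoded as all ones within the available width.
        value = min(first_pos.get(letter, max_pos), 2 ** width - 1)
        fingerprint = (fingerprint << width) | value
        n_bits -= width
    return fingerprint


print(bin(sketch_position_fingerprint('hat')))  # prints: 0b1110100011111111
```

With the defaults, 16 bits hold five full three-bit fields (e, t, a, o, i) plus a single leftover bit for 'n'.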
abydos.fingerprint.position_fingerprint(word, n_bits=16, most_common=('e', 't', 'a', 'o', 'i', 'n', 's', 'h', 'r', 'd', 'l', 'c', 'u', 'm', 'w', 'f'), bits_per_letter=3)
Return the position fingerprint.
This is a wrapper for Position.fingerprint().
Parameters:
- word (str) -- The word to fingerprint
- n_bits (int) -- Number of bits in the fingerprint returned
- most_common (list) -- The most common tokens in the target language, ordered by frequency
- bits_per_letter (int) -- The number of bits used to encode each letter's position
Returns: The position fingerprint
Return type: int
Examples
>>> bin(position_fingerprint('hat'))
'0b1110100011111111'
>>> bin(position_fingerprint('niall'))
'0b1111110101110010'
>>> bin(position_fingerprint('colin'))
'0b1111111110010111'
>>> bin(position_fingerprint('atcg'))
'0b1110010001111111'
>>> bin(position_fingerprint('entreatment'))
'0b101011111111'
class abydos.fingerprint.SynonameToolcode
Bases: abydos.fingerprint._fingerprint._Fingerprint
Synoname Toolcode.
Cf. [JPGTrust91][Gro91].
fingerprint(lname, fname='', qual='', normalize=0)
Build the Synoname toolcode.
Parameters:
- lname (str) -- Last name
- fname (str) -- First name (can be blank)
- qual (str) -- Qualifier
- normalize (int) -- Normalization mode (0, 1, or 2)
Returns: The transformed names and the synoname toolcode
Return type: tuple
Examples
>>> st = SynonameToolcode()
>>> st.fingerprint('hat')
('hat', '', '0000000003$$h')
>>> st.fingerprint('niall')
('niall', '', '0000000005$$n')
>>> st.fingerprint('colin')
('colin', '', '0000000005$$c')
>>> st.fingerprint('atcg')
('atcg', '', '0000000004$$a')
>>> st.fingerprint('entreatment')
('entreatment', '', '0000000011$$e')
>>> st.fingerprint('Ste.-Marie', 'Count John II', normalize=2)
('ste.-marie ii', 'count john', '0200491310$015b049a127c$smcji')
>>> st.fingerprint('Michelangelo IV', '', 'Workshop of')
('michelangelo iv', '', '3000550015$055b$mi')
abydos.fingerprint.synoname_toolcode(lname, fname='', qual='', normalize=0)
Build the Synoname toolcode.
This is a wrapper for SynonameToolcode.fingerprint().
Parameters:
- lname (str) -- Last name
- fname (str) -- First name (can be blank)
- qual (str) -- Qualifier
- normalize (int) -- Normalization mode (0, 1, or 2)
Returns: The transformed names and the synoname toolcode
Return type: tuple
Examples
>>> synoname_toolcode('hat')
('hat', '', '0000000003$$h')
>>> synoname_toolcode('niall')
('niall', '', '0000000005$$n')
>>> synoname_toolcode('colin')
('colin', '', '0000000005$$c')
>>> synoname_toolcode('atcg')
('atcg', '', '0000000004$$a')
>>> synoname_toolcode('entreatment')
('entreatment', '', '0000000011$$e')
>>> synoname_toolcode('Ste.-Marie', 'Count John II', normalize=2)
('ste.-marie ii', 'count john', '0200491310$015b049a127c$smcji')
>>> synoname_toolcode('Michelangelo IV', '', 'Workshop of')
('michelangelo iv', '', '3000550015$055b$mi')