abydos.fingerprint module

The fingerprint module implements string fingerprints such as:

string fingerprint
q-gram fingerprint
phonetic fingerprint
Pollock & Zamora’s skeleton key
Pollock & Zamora’s omission key
Cisłak & Grabowski’s occurrence fingerprint
Cisłak & Grabowski’s occurrence halved fingerprint
Cisłak & Grabowski’s count fingerprint
Cisłak & Grabowski’s position fingerprint
Synoname Toolcode

abydos.fingerprint.count_fingerprint(word, n_bits=16, most_common=('e', 't', 'a', 'o', 'i', 'n', 's', 'h', 'r', 'd', 'l', 'c', 'u', 'm', 'w', 'f'))

Return the count fingerprint.

Based on the count fingerprint from [CislakG17].

Parameters:
    word (str) – the word to fingerprint
    n_bits (int) – number of bits in the fingerprint returned
    most_common (list) – the most common tokens in the target language, ordered by frequency
Returns:	the count fingerprint
Return type:	int

>>> bin(count_fingerprint('hat'))
'0b1010000000001'
>>> bin(count_fingerprint('niall'))
'0b10001010000'
>>> bin(count_fingerprint('colin'))
'0b101010000'
>>> bin(count_fingerprint('atcg'))
'0b1010000000000'
>>> bin(count_fingerprint('entreatment'))
'0b1111010000100000'
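
The bit layout behind the examples above can be illustrated with a minimal sketch. It is not the library's implementation: the name count_fingerprint_sketch is hypothetical, and it assumes two bits per letter for the first n_bits // 2 letters of most_common, with counts above 3 wrapping around.

```python
from collections import Counter

MOST_COMMON = ('e', 't', 'a', 'o', 'i', 'n', 's', 'h',
               'r', 'd', 'l', 'c', 'u', 'm', 'w', 'f')


def count_fingerprint_sketch(word, n_bits=16, most_common=MOST_COMMON):
    """Pack a 2-bit count for each of the first n_bits // 2 letters."""
    counts = Counter(word)
    fingerprint = 0
    for letter in most_common[:n_bits // 2]:
        # Assumption: only the low two bits of each count are kept,
        # so a letter occurring four times would wrap to 0.
        fingerprint = (fingerprint << 2) | (counts[letter] & 3)
    return fingerprint


print(bin(count_fingerprint_sketch('hat')))
print(bin(count_fingerprint_sketch('entreatment')))
```

Under these assumptions the sketch reproduces the documented values for the five example words; behavior for other n_bits values or for very frequent letters may differ from the library.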

abydos.fingerprint.occurrence_fingerprint(word, n_bits=16, most_common=('e', 't', 'a', 'o', 'i', 'n', 's', 'h', 'r', 'd', 'l', 'c', 'u', 'm', 'w', 'f'))

Return the occurrence fingerprint.

Based on the occurrence fingerprint from [CislakG17].

Parameters:
    word (str) – the word to fingerprint
    n_bits (int) – number of bits in the fingerprint returned
    most_common (list) – the most common tokens in the target language, ordered by frequency
Returns:	the occurrence fingerprint
Return type:	int

>>> bin(occurrence_fingerprint('hat'))
'0b110000100000000'
>>> bin(occurrence_fingerprint('niall'))
'0b10110000100000'
>>> bin(occurrence_fingerprint('colin'))
'0b1110000110000'
>>> bin(occurrence_fingerprint('atcg'))
'0b110000000010000'
>>> bin(occurrence_fingerprint('entreatment'))
'0b1110010010000100'
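
As a rough illustration of how the values above arise, the following sketch sets one bit per letter of most_common, most significant bit first, whenever that letter occurs in the word. The name occurrence_fingerprint_sketch is hypothetical, and the sketch only covers the case where n_bits does not exceed the length of most_common.

```python
MOST_COMMON = ('e', 't', 'a', 'o', 'i', 'n', 's', 'h',
               'r', 'd', 'l', 'c', 'u', 'm', 'w', 'f')


def occurrence_fingerprint_sketch(word, n_bits=16, most_common=MOST_COMMON):
    """Set one bit for each of the first n_bits letters present in word."""
    letters = set(word)
    fingerprint = 0
    for letter in most_common[:n_bits]:
        # Shift left to make room, then record presence in the low bit.
        fingerprint = (fingerprint << 1) | int(letter in letters)
    return fingerprint


print(bin(occurrence_fingerprint_sketch('hat')))
print(bin(occurrence_fingerprint_sketch('entreatment')))
```

Because the fingerprint is an integer, two words can be compared cheaply by XOR-ing their fingerprints and counting the set bits.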

abydos.fingerprint.occurrence_halved_fingerprint(word, n_bits=16, most_common=('e', 't', 'a', 'o', 'i', 'n', 's', 'h', 'r', 'd', 'l', 'c', 'u', 'm', 'w', 'f'))

Return the occurrence halved fingerprint.

Based on the occurrence halved fingerprint from [CislakG17].

Parameters:
    word (str) – the word to fingerprint
    n_bits (int) – number of bits in the fingerprint returned
    most_common (list) – the most common tokens in the target language, ordered by frequency
Returns:	the occurrence halved fingerprint
Return type:	int

>>> bin(occurrence_halved_fingerprint('hat'))
'0b1010000000010'
>>> bin(occurrence_halved_fingerprint('niall'))
'0b10010100000'
>>> bin(occurrence_halved_fingerprint('colin'))
'0b1001010000'
>>> bin(occurrence_halved_fingerprint('atcg'))
'0b10100000000000'
>>> bin(occurrence_halved_fingerprint('entreatment'))
'0b1111010000110000'
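
The examples above are consistent with the following minimal sketch, which is an illustration rather than the library's code. It assumes the word is split at len(word) // 2 and that each of the first n_bits // 2 letters gets two bits: the high bit for presence in the first half, the low bit for presence in the second half. The name occurrence_halved_sketch is hypothetical.

```python
MOST_COMMON = ('e', 't', 'a', 'o', 'i', 'n', 's', 'h',
               'r', 'd', 'l', 'c', 'u', 'm', 'w', 'f')


def occurrence_halved_sketch(word, n_bits=16, most_common=MOST_COMMON):
    """Record, per letter, presence in the first and second half of word."""
    half = len(word) // 2  # assumption: the second half gets the odd letter
    first, second = set(word[:half]), set(word[half:])
    fingerprint = 0
    for letter in most_common[:n_bits // 2]:
        pair = (int(letter in first) << 1) | int(letter in second)
        fingerprint = (fingerprint << 2) | pair
    return fingerprint


print(bin(occurrence_halved_sketch('hat')))
print(bin(occurrence_halved_sketch('entreatment')))
```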

abydos.fingerprint.omission_key(word)

Return the omission key.

The omission key of a word is defined in [PZ84].

Parameters:	word (str) – the word to transform into its omission key
Returns:	the omission key
Return type:	str

>>> omission_key('The quick brown fox jumped over the lazy dog.')
'JKQXZVWYBFMGPDHCLNTREUIOA'
>>> omission_key('Christopher')
'PHCTSRIOE'
>>> omission_key('Niall')
'LNIA'
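
For readers without access to [PZ84], the construction can be sketched as follows: the consonants present in the word are emitted in a fixed order (the letters most often omitted in misspellings come last), followed by the word's vowels in order of first occurrence. This is a minimal sketch, not the library's code; the name omission_key_sketch is hypothetical, and the consonant order used here is inferred to be the one from the paper because it reproduces the examples above.

```python
import unicodedata

# Assumed fixed consonant ordering (Y is treated as a consonant).
CONSONANT_ORDER = 'JKQXZVWYBFMGPDHCLNTSR'
VOWELS = 'AEIOU'


def omission_key_sketch(word):
    """Return consonants in fixed order, then vowels by first occurrence."""
    word = unicodedata.normalize('NFKD', word.upper())
    letters = [c for c in word if 'A' <= c <= 'Z']  # drop non-letters
    present = set(letters)
    key = ''.join(c for c in CONSONANT_ORDER if c in present)
    for c in letters:
        if c in VOWELS and c not in key:
            key += c
    return key


print(omission_key_sketch('Christopher'))
print(omission_key_sketch('Niall'))
```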

abydos.fingerprint.phonetic_fingerprint(phrase, phonetic_algorithm=<function double_metaphone>, joiner=' ', *args)

Return the phonetic fingerprint of a phrase.

A phonetic fingerprint is identical to a standard string fingerprint, as implemented in abydos.fingerprint.str_fingerprint(), except that each word is first converted to its phonetic form by some phonetic algorithm before the fingerprint is computed. This fingerprint is described in [Ope12].

Parameters:
    phrase (str) – the string from which to calculate the phonetic fingerprint
    phonetic_algorithm (function) – a phonetic algorithm that takes a string and returns a string (presumably a phonetic representation of the original string); by default, abydos.phonetic.double_metaphone() is used
    joiner (str) – the string that will be placed between each word
    args – additional arguments to pass to the phonetic algorithm, along with the phrase itself
Returns:	the phonetic fingerprint of the phrase
Return type:	str

>>> phonetic_fingerprint('The quick brown fox jumped over the lazy dog.')
'0 afr fks jmpt kk ls prn tk'
>>> from abydos.phonetic import soundex
>>> phonetic_fingerprint('The quick brown fox jumped over the lazy dog.',
... phonetic_algorithm=soundex)
'b650 d200 f200 j513 l200 o160 q200 t000'
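
The idea described above (encode each word phonetically, then fingerprint the result) can be sketched without the library. Everything in this block is illustrative: toy_phonetic is a hypothetical stand-in encoder, not a real phonetic algorithm, and phonetic_fingerprint_sketch is a hypothetical name.

```python
import string


def toy_phonetic(word):
    """Hypothetical stand-in encoder: keep the first letter, drop later vowels."""
    letters = [c for c in word.lower() if c in string.ascii_lowercase]
    return ''.join(c for i, c in enumerate(letters)
                   if i == 0 or c not in 'aeiou')


def phonetic_fingerprint_sketch(phrase, phonetic_algorithm=toy_phonetic,
                                joiner=' '):
    """Encode each word, then sort the unique codes and join them."""
    codes = {phonetic_algorithm(word).lower() for word in phrase.split()}
    codes.discard('')  # ignore words that encode to nothing
    return joiner.join(sorted(codes))


print(phonetic_fingerprint_sketch('The quick brown fox'))
```

Any callable that maps a string to a string could be passed as phonetic_algorithm in this sketch; word order and repetition in the input do not affect the result.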

abydos.fingerprint.position_fingerprint(word, n_bits=16, most_common=('e', 't', 'a', 'o', 'i', 'n', 's', 'h', 'r', 'd', 'l', 'c', 'u', 'm', 'w', 'f'), bits_per_letter=3)

Return the position fingerprint.

Based on the position fingerprint from [CislakG17].

Parameters:
    word (str) – the word to fingerprint
    n_bits (int) – number of bits in the fingerprint returned
    most_common (list) – the most common tokens in the target language, ordered by frequency
    bits_per_letter (int) – the bits to assign for letter position
Returns:	the position fingerprint
Return type:	int

>>> bin(position_fingerprint('hat'))
'0b1110100011111111'
>>> bin(position_fingerprint('niall'))
'0b1111110101110010'
>>> bin(position_fingerprint('colin'))
'0b1111111110010111'
>>> bin(position_fingerprint('atcg'))
'0b1110010001111111'
>>> bin(position_fingerprint('entreatment'))
'0b101011111111'

abydos.fingerprint.qgram_fingerprint(phrase, qval=2, start_stop='', joiner='')

Return the q-gram fingerprint.

A q-gram fingerprint is a string consisting of all of the unique q-grams in a string, alphabetized & concatenated. This fingerprint is described in [Ope12].

Parameters:
    phrase (str) – the string from which to calculate the q-gram fingerprint
    qval (int) – the length of each q-gram (by default 2)
    start_stop (str) – the start & stop symbol(s) to concatenate on either end of the phrase, as defined in abydos.util.qgram()
    joiner (str) – the string that will be placed between each q-gram
Returns:	the q-gram fingerprint of the phrase
Return type:	str

>>> qgram_fingerprint('The quick brown fox jumped over the lazy dog.')
'azbrckdoedeleqerfoheicjukblampnfogovowoxpequrortthuiumvewnxjydzy'
>>> qgram_fingerprint('Christopher')
'cherhehrisopphristto'
>>> qgram_fingerprint('Niall')
'aliallni'
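
The definition above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the library's implementation: the name qgram_fingerprint_sketch is hypothetical, the phrase is assumed to be lowercased and stripped of everything except letters and digits before q-grams are taken, and the start_stop padding rule shown is an assumption.

```python
import unicodedata


def qgram_fingerprint_sketch(phrase, qval=2, start_stop='', joiner=''):
    """Return the sorted, de-duplicated q-grams of phrase, joined."""
    phrase = unicodedata.normalize('NFKD', phrase.lower())
    phrase = ''.join(c for c in phrase if c.isalnum())
    if start_stop:
        # Assumption: pad with qval - 1 copies of the start and stop symbols.
        phrase = (start_stop[0] * (qval - 1) + phrase
                  + start_stop[-1] * (qval - 1))
    qgrams = {phrase[i:i + qval] for i in range(len(phrase) - qval + 1)}
    return joiner.join(sorted(qgrams))


print(qgram_fingerprint_sketch('Christopher'))
print(qgram_fingerprint_sketch('Niall'))
```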

abydos.fingerprint.skeleton_key(word)

Return the skeleton key.

The skeleton key of a word is defined in [PZ84].

Parameters:	word (str) – the word to transform into its skeleton key
Returns:	the skeleton key
Return type:	str

>>> skeleton_key('The quick brown fox jumped over the lazy dog.')
'THQCKBRWNFXJMPDVLZYGEUIOA'
>>> skeleton_key('Christopher')
'CHRSTPIOE'
>>> skeleton_key('Niall')
'NLIA'
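
For readers without access to [PZ84], the construction can be sketched as follows: the key is the word's first letter, then the remaining unique consonants in order of first occurrence, then the unique vowels in order of first occurrence. This is a minimal sketch consistent with the examples above, not the library's code; the name skeleton_key_sketch is hypothetical.

```python
import unicodedata

VOWELS = 'AEIOU'


def skeleton_key_sketch(word):
    """Return first letter + unique consonants + unique vowels."""
    word = unicodedata.normalize('NFKD', word.upper())
    letters = [c for c in word if 'A' <= c <= 'Z']  # drop non-letters
    start = letters[0] if letters else ''
    consonants = ''
    vowels = ''
    for c in letters[1:]:
        if c == start:
            continue  # the first letter is never repeated
        if c in VOWELS:
            if c not in vowels:
                vowels += c
        elif c not in consonants:
            consonants += c
    return start + consonants + vowels


print(skeleton_key_sketch('Christopher'))
print(skeleton_key_sketch('Niall'))
```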

abydos.fingerprint.str_fingerprint(phrase, joiner=' ')

Return the string fingerprint.

The fingerprint of a string is a string consisting of all of the unique words in a string, alphabetized & concatenated with intervening joiners. This fingerprint is described in [Ope12].

Parameters:
    phrase (str) – the string from which to calculate the fingerprint
    joiner (str) – the string that will be placed between each word
Returns:	the fingerprint of the phrase
Return type:	str

>>> str_fingerprint('The quick brown fox jumped over the lazy dog.')
'brown dog fox jumped lazy over quick the'
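
The definition above amounts to a short normalization pipeline, sketched here for illustration. The name str_fingerprint_sketch is hypothetical, and the exact normalization (lowercasing, NFKD decomposition, dropping characters that are neither alphanumeric nor whitespace) is an assumption chosen to reproduce the example above.

```python
import unicodedata


def str_fingerprint_sketch(phrase, joiner=' '):
    """Return the sorted unique words of phrase, joined by joiner."""
    phrase = unicodedata.normalize('NFKD', phrase.strip().lower())
    # Keep letters, digits, and whitespace; punctuation is dropped.
    phrase = ''.join(c for c in phrase if c.isalnum() or c.isspace())
    return joiner.join(sorted(set(phrase.split())))


print(str_fingerprint_sketch('The quick brown fox jumped over the lazy dog.'))
```

Two phrases that differ only in case, punctuation, word order, or repeated words map to the same fingerprint, which is what makes it useful as a clustering key.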

abydos.fingerprint.synoname_toolcode(lname, fname='', qual='', normalize=0)

Build the Synoname toolcode.

Cf. [JPGTrust91][Gro91].

Parameters:
    lname (str) – last name
    fname (str) – first name (can be blank)
    qual (str) – qualifier
    normalize (int) – normalization mode (0, 1, or 2)
Returns:	the transformed last and first names and the synoname toolcode
Return type:	tuple

>>> synoname_toolcode('hat')
('hat', '', '0000000003$$h')
>>> synoname_toolcode('niall')
('niall', '', '0000000005$$n')
>>> synoname_toolcode('colin')
('colin', '', '0000000005$$c')
>>> synoname_toolcode('atcg')
('atcg', '', '0000000004$$a')
>>> synoname_toolcode('entreatment')
('entreatment', '', '0000000011$$e')

>>> synoname_toolcode('Ste.-Marie', 'Count John II', normalize=2)
('ste.-marie ii', 'count john', '0200491310$015b049a127c$smcji')
>>> synoname_toolcode('Michelangelo IV', '', 'Workshop of')
('michelangelo iv', '', '3000550015$055b$mi')