abydos.fingerprint module

The fingerprint module implements string fingerprints such as:
  • string fingerprint
  • q-gram fingerprint
  • phonetic fingerprint
  • Pollock & Zamora’s skeleton key
  • Pollock & Zamora’s omission key
  • Cisłak & Grabowski’s occurrence fingerprint
  • Cisłak & Grabowski’s occurrence halved fingerprint
  • Cisłak & Grabowski’s count fingerprint
  • Cisłak & Grabowski’s position fingerprint
  • Synoname Toolcode
abydos.fingerprint.count_fingerprint(word, n_bits=16, most_common=('e', 't', 'a', 'o', 'i', 'n', 's', 'h', 'r', 'd', 'l', 'c', 'u', 'm', 'w', 'f'))

Return the count fingerprint.

Based on the count fingerprint from [CislakG17].

Parameters:
  • word (str) – the word to fingerprint
  • n_bits (int) – number of bits in the fingerprint returned
  • most_common (list) – the most common tokens in the target language, ordered by frequency
Returns:

the count fingerprint

Return type:

int

>>> bin(count_fingerprint('hat'))
'0b1010000000001'
>>> bin(count_fingerprint('niall'))
'0b10001010000'
>>> bin(count_fingerprint('colin'))
'0b101010000'
>>> bin(count_fingerprint('atcg'))
'0b1010000000000'
>>> bin(count_fingerprint('entreatment'))
'0b1111010000100000'
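
The doctest outputs above are consistent with a simple packing scheme: two bits per letter, holding that letter's count in the word, for the first n_bits // 2 entries of most_common. The following is a minimal sketch of that reading, not the library's code. The name count_bits is hypothetical, and saturating counts at 3 is an assumption (the examples above never exceed 3, so they do not settle how larger counts are handled).

```python
from collections import Counter

MOST_COMMON = ('e', 't', 'a', 'o', 'i', 'n', 's', 'h',
               'r', 'd', 'l', 'c', 'u', 'm', 'w', 'f')


def count_bits(word, n_bits=16, most_common=MOST_COMMON):
    """Hypothetical sketch: pack per-letter counts into 2-bit fields."""
    counts = Counter(word)
    fingerprint = 0
    # Each letter consumes two bits, so only n_bits // 2 letters fit.
    for letter in most_common[: n_bits // 2]:
        # Assumption: counts above 3 saturate at 3.
        fingerprint = (fingerprint << 2) | min(counts[letter], 3)
    return fingerprint


print(bin(count_bits('hat')))          # 0b1010000000001
print(bin(count_bits('entreatment')))  # 0b1111010000100000
```
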
abydos.fingerprint.occurrence_fingerprint(word, n_bits=16, most_common=('e', 't', 'a', 'o', 'i', 'n', 's', 'h', 'r', 'd', 'l', 'c', 'u', 'm', 'w', 'f'))

Return the occurrence fingerprint.

Based on the occurrence fingerprint from [CislakG17].

Parameters:
  • word (str) – the word to fingerprint
  • n_bits (int) – number of bits in the fingerprint returned
  • most_common (list) – the most common tokens in the target language, ordered by frequency
Returns:

the occurrence fingerprint

Return type:

int

>>> bin(occurrence_fingerprint('hat'))
'0b110000100000000'
>>> bin(occurrence_fingerprint('niall'))
'0b10110000100000'
>>> bin(occurrence_fingerprint('colin'))
'0b1110000110000'
>>> bin(occurrence_fingerprint('atcg'))
'0b110000000010000'
>>> bin(occurrence_fingerprint('entreatment'))
'0b1110010010000100'
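
Read against the doctest outputs above, the occurrence fingerprint appears to set one bit per entry of most_common, most significant bit first, whenever that letter occurs anywhere in the word. A minimal sketch under that assumption follows. The name occurrence_bits is hypothetical and this is not the library's implementation.

```python
MOST_COMMON = ('e', 't', 'a', 'o', 'i', 'n', 's', 'h',
               'r', 'd', 'l', 'c', 'u', 'm', 'w', 'f')


def occurrence_bits(word, n_bits=16, most_common=MOST_COMMON):
    """Hypothetical sketch: one presence bit per common letter."""
    present = set(word)
    fingerprint = 0
    for letter in most_common[:n_bits]:
        # Shift in a 1 if the letter occurs in the word, else a 0.
        fingerprint = (fingerprint << 1) | int(letter in present)
    return fingerprint


print(bin(occurrence_bits('hat')))    # 0b110000100000000
print(bin(occurrence_bits('colin')))  # 0b1110000110000
```
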
abydos.fingerprint.occurrence_halved_fingerprint(word, n_bits=16, most_common=('e', 't', 'a', 'o', 'i', 'n', 's', 'h', 'r', 'd', 'l', 'c', 'u', 'm', 'w', 'f'))

Return the occurrence halved fingerprint.

Based on the occurrence halved fingerprint from [CislakG17].

Parameters:
  • word (str) – the word to fingerprint
  • n_bits (int) – number of bits in the fingerprint returned
  • most_common (list) – the most common tokens in the target language, ordered by frequency
Returns:

the occurrence halved fingerprint

Return type:

int

>>> bin(occurrence_halved_fingerprint('hat'))
'0b1010000000010'
>>> bin(occurrence_halved_fingerprint('niall'))
'0b10010100000'
>>> bin(occurrence_halved_fingerprint('colin'))
'0b1001010000'
>>> bin(occurrence_halved_fingerprint('atcg'))
'0b10100000000000'
>>> bin(occurrence_halved_fingerprint('entreatment'))
'0b1111010000110000'
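
The outputs above suggest that the halved variant spends two bits per letter: the first records whether the letter occurs in the first half of the word, the second whether it occurs in the second half. The sketch below assumes the split point is len(word) // 2, which matches the examples shown. The name occurrence_halved_bits is hypothetical and this is not the library's code.

```python
MOST_COMMON = ('e', 't', 'a', 'o', 'i', 'n', 's', 'h',
               'r', 'd', 'l', 'c', 'u', 'm', 'w', 'f')


def occurrence_halved_bits(word, n_bits=16, most_common=MOST_COMMON):
    """Hypothetical sketch: presence bits for each half of the word."""
    middle = len(word) // 2  # assumption: shorter half comes first
    first_half = set(word[:middle])
    second_half = set(word[middle:])
    fingerprint = 0
    # Two bits per letter, so only n_bits // 2 letters fit.
    for letter in most_common[: n_bits // 2]:
        fingerprint = (fingerprint << 1) | int(letter in first_half)
        fingerprint = (fingerprint << 1) | int(letter in second_half)
    return fingerprint


print(bin(occurrence_halved_bits('hat')))  # 0b1010000000010
```
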
abydos.fingerprint.omission_key(word)

Return the omission key.

The omission key of a word is defined in [PZ84].

Parameters:
  • word (str) – the word to transform into its omission key
Returns:

the omission key

Return type:

str

>>> omission_key('The quick brown fox jumped over the lazy dog.')
'JKQXZVWYBFMGPDHCLNTREUIOA'
>>> omission_key('Christopher')
'PHCTSRIOE'
>>> omission_key('Niall')
'LNIA'
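
As a rough illustration of how such a key can be built: the unique consonants of the word are emitted in a fixed order, followed by the unique vowels in order of first appearance. The sketch below is not the library's code. The function name is hypothetical, and the consonant ordering constant is an assumption chosen to be consistent with the doctest outputs above; [PZ84] remains the authoritative definition.

```python
# Assumed consonant ordering; it reproduces the examples above.
CONSONANT_ORDER = 'JKQXZVWYBFMGPDHCLNTSR'
VOWELS = 'AEIOU'


def omission_key_sketch(word):
    """Hypothetical sketch of an omission key."""
    letters = [c for c in word.upper() if c.isalpha()]
    present = set(letters)
    # Consonants: fixed order, each at most once.
    key = ''.join(c for c in CONSONANT_ORDER if c in present)
    # Vowels: order of first appearance, each at most once.
    for c in letters:
        if c in VOWELS and c not in key:
            key += c
    return key


print(omission_key_sketch('Christopher'))  # PHCTSRIOE
```
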
abydos.fingerprint.phonetic_fingerprint(phrase, phonetic_algorithm=<function double_metaphone>, joiner=' ', *args)

Return the phonetic fingerprint of a phrase.

A phonetic fingerprint is identical to a standard string fingerprint, as implemented in abydos.fingerprint.str_fingerprint(), except that each word is first converted to its phonetic form by the chosen phonetic algorithm. This fingerprint is described in [Ope12].

Parameters:
  • phrase (str) – the string from which to calculate the phonetic fingerprint
  • phonetic_algorithm (function) – a phonetic algorithm that takes a string and returns a string (presumably a phonetic representation of the original string). By default, this function uses abydos.phonetic.double_metaphone().
  • joiner (str) – the string that will be placed between each word
  • args – additional arguments to pass to the phonetic algorithm, along with the phrase itself
Returns:

the phonetic fingerprint of the phrase

Return type:

str

>>> phonetic_fingerprint('The quick brown fox jumped over the lazy dog.')
'0 afr fks jmpt kk ls prn tk'
>>> from abydos.phonetic import soundex
>>> phonetic_fingerprint('The quick brown fox jumped over the lazy dog.',
...                      phonetic_algorithm=soundex)
'b650 d200 f200 j513 l200 o160 q200 t000'
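
The two-stage idea (encode each word phonetically, then take the ordinary string fingerprint of the result) can be sketched without any phonetic library. Everything below is illustrative: toy_phonetic is a hypothetical stand-in for a real algorithm such as Double Metaphone or Soundex, and the helper names are not part of abydos.

```python
def str_fingerprint_sketch(phrase, joiner=' '):
    """Unique lowercase words, sorted and joined (simplified)."""
    cleaned = ''.join(c for c in phrase.lower()
                      if c.isalnum() or c.isspace())
    return joiner.join(sorted(set(cleaned.split())))


def toy_phonetic(word):
    """Hypothetical encoder: keep the first character, drop later vowels."""
    word = word.lower()
    return word[:1] + ''.join(c for c in word[1:] if c not in 'aeiou')


def phonetic_fingerprint_sketch(phrase, phonetic_algorithm=toy_phonetic,
                                joiner=' '):
    """Encode each word, then fingerprint the encoded phrase."""
    encoded = ' '.join(phonetic_algorithm(w) for w in phrase.split())
    return str_fingerprint_sketch(encoded, joiner)


print(phonetic_fingerprint_sketch(
    'The quick brown fox jumped over the lazy dog.'))
# brwn dg fx jmpd lzy ovr qck th
```

Passing the encoder as a parameter mirrors the phonetic_algorithm argument documented above, so any callable that maps a string to a string can be substituted.
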
abydos.fingerprint.position_fingerprint(word, n_bits=16, most_common=('e', 't', 'a', 'o', 'i', 'n', 's', 'h', 'r', 'd', 'l', 'c', 'u', 'm', 'w', 'f'), bits_per_letter=3)

Return the position fingerprint.

Based on the position fingerprint from [CislakG17].

Parameters:
  • word (str) – the word to fingerprint
  • n_bits (int) – number of bits in the fingerprint returned
  • most_common (list) – the most common tokens in the target language, ordered by frequency
  • bits_per_letter (int) – the bits to assign for letter position
Returns:

the position fingerprint

Return type:

int

>>> bin(position_fingerprint('hat'))
'0b1110100011111111'
>>> bin(position_fingerprint('niall'))
'0b1111110101110010'
>>> bin(position_fingerprint('colin'))
'0b1111111110010111'
>>> bin(position_fingerprint('atcg'))
'0b1110010001111111'
>>> bin(position_fingerprint('entreatment'))
'0b101011111111'
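
The outputs above can be reproduced by giving each common letter a bits_per_letter-wide field that stores the index of its first occurrence, with the all-ones value marking an absent letter, and by truncating the last field when fewer bits remain. The sketch below encodes that reading, which is inferred from the doctests rather than taken from the library. The name position_bits is hypothetical.

```python
MOST_COMMON = ('e', 't', 'a', 'o', 'i', 'n', 's', 'h',
               'r', 'd', 'l', 'c', 'u', 'm', 'w', 'f')


def position_bits(word, n_bits=16, most_common=MOST_COMMON,
                  bits_per_letter=3):
    """Hypothetical sketch: pack first-occurrence positions into fields."""
    absent = 2 ** bits_per_letter - 1  # all ones: letter not found
    first_pos = {}
    for pos, letter in enumerate(word):
        if letter not in first_pos:
            # Positions beyond the field's range saturate.
            first_pos[letter] = min(pos, absent)
    fingerprint = 0
    remaining = n_bits
    for letter in most_common:
        if remaining <= 0:
            break
        # The final field may be narrower than bits_per_letter.
        width = min(bits_per_letter, remaining)
        value = min(first_pos.get(letter, absent), 2 ** width - 1)
        fingerprint = (fingerprint << width) | value
        remaining -= width
    return fingerprint


print(bin(position_bits('hat')))  # 0b1110100011111111
```

With the defaults, 16 bits hold five full 3-bit fields (e, t, a, o, i) plus a single leftover bit for n.
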
abydos.fingerprint.qgram_fingerprint(phrase, qval=2, start_stop='', joiner='')

Return the q-gram fingerprint.

A q-gram fingerprint is a string consisting of all of the unique q-grams in a string, alphabetized and concatenated. This fingerprint is described in [Ope12].

Parameters:
  • phrase (str) – the string from which to calculate the q-gram fingerprint
  • qval (int) – the length of each q-gram (by default 2)
  • start_stop (str) – the start & stop symbol(s) to concatenate on either end of the phrase, as defined in abydos.util.qgram()
  • joiner (str) – the string that will be placed between each q-gram
Returns:

the q-gram fingerprint of the phrase

Return type:

str

>>> qgram_fingerprint('The quick brown fox jumped over the lazy dog.')
'azbrckdoedeleqerfoheicjukblampnfogovowoxpequrortthuiumvewnxjydzy'
>>> qgram_fingerprint('Christopher')
'cherhehrisopphristto'
>>> qgram_fingerprint('Niall')
'aliallni'
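
The definition above translates almost directly into code. The following is a minimal sketch, not the library's implementation: the function name is hypothetical, the start_stop padding is omitted, and stripping everything except letters and digits before extracting q-grams is an assumption that happens to match the examples shown.

```python
def qgram_fingerprint_sketch(phrase, qval=2, joiner=''):
    """Hypothetical sketch: sorted unique q-grams, joined."""
    # Assumption: lowercase and drop spaces and punctuation first.
    cleaned = ''.join(c for c in phrase.lower() if c.isalnum())
    # Slide a window of length qval over the cleaned string.
    qgrams = {cleaned[i:i + qval]
              for i in range(len(cleaned) - qval + 1)}
    return joiner.join(sorted(qgrams))


print(qgram_fingerprint_sketch('Niall'))        # aliallni
print(qgram_fingerprint_sketch('Christopher'))  # cherhehrisopphristto
```
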
abydos.fingerprint.skeleton_key(word)

Return the skeleton key.

The skeleton key of a word is defined in [PZ84].

Parameters:
  • word (str) – the word to transform into its skeleton key
Returns:

the skeleton key

Return type:

str

>>> skeleton_key('The quick brown fox jumped over the lazy dog.')
'THQCKBRWNFXJMPDVLZYGEUIOA'
>>> skeleton_key('Christopher')
'CHRSTPIOE'
>>> skeleton_key('Niall')
'NLIA'
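
As a rough illustration: the key starts with the word's first letter, continues with the remaining unique consonants in order of appearance, and ends with the unique vowels in order of appearance. The sketch below follows that description and reproduces the examples above, but it is not the library's code. The function name is hypothetical, and treating only A, E, I, O, U as vowels is an assumption; [PZ84] remains the authoritative definition.

```python
VOWELS = set('AEIOU')  # assumption: Y counts as a consonant


def skeleton_key_sketch(word):
    """Hypothetical sketch of a skeleton key."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ''
    first = letters[0]
    consonants, vowels = [], []
    for c in letters[1:]:
        if c == first:
            continue  # the leading letter is never repeated
        bucket = vowels if c in VOWELS else consonants
        if c not in bucket:
            bucket.append(c)
    return first + ''.join(consonants) + ''.join(vowels)


print(skeleton_key_sketch('Christopher'))  # CHRSTPIOE
```
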
abydos.fingerprint.str_fingerprint(phrase, joiner=' ')

Return the string fingerprint.

The fingerprint of a string is a string consisting of all of the unique words in a string, alphabetized and concatenated with intervening joiners. This fingerprint is described in [Ope12].

Parameters:
  • phrase (str) – the string from which to calculate the fingerprint
  • joiner (str) – the string that will be placed between each word
Returns:

the fingerprint of the phrase

Return type:

str

>>> str_fingerprint('The quick brown fox jumped over the lazy dog.')
'brown dog fox jumped lazy over quick the'
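
The steps in the description (split into words, deduplicate, sort, join) can be sketched in a few lines. This is a simplified illustration rather than the library's implementation: the function name is hypothetical, and the normalization used here (lowercasing and dropping punctuation) is an assumption that matches the example above.

```python
def str_fingerprint_sketch(phrase, joiner=' '):
    """Hypothetical sketch: unique words, sorted and joined."""
    # Assumption: lowercase and keep only letters, digits, whitespace.
    cleaned = ''.join(c for c in phrase.lower()
                      if c.isalnum() or c.isspace())
    # A set removes duplicates; sorted() alphabetizes.
    return joiner.join(sorted(set(cleaned.split())))


print(str_fingerprint_sketch(
    'The quick brown fox jumped over the lazy dog.'))
# brown dog fox jumped lazy over quick the
```

Because word order, case, punctuation, and repetition are all discarded, phrases such as 'Dog, the lazy' and 'the lazy dog' map to the same fingerprint, which is what makes it useful for clustering near-duplicate strings.
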
abydos.fingerprint.synoname_toolcode(lname, fname='', qual='', normalize=0)

Build the Synoname toolcode.

Cf. [JPGTrust91][Gro91].

Parameters:
  • lname (str) – last name
  • fname (str) – first name (can be blank)
  • qual (str) – qualifier
  • normalize (int) – normalization mode (0, 1, or 2)
Returns:

the transformed last name, the transformed first name, and the Synoname toolcode

Return type:

tuple

>>> synoname_toolcode('hat')
('hat', '', '0000000003$$h')
>>> synoname_toolcode('niall')
('niall', '', '0000000005$$n')
>>> synoname_toolcode('colin')
('colin', '', '0000000005$$c')
>>> synoname_toolcode('atcg')
('atcg', '', '0000000004$$a')
>>> synoname_toolcode('entreatment')
('entreatment', '', '0000000011$$e')
>>> synoname_toolcode('Ste.-Marie', 'Count John II', normalize=2)
('ste.-marie ii', 'count john', '0200491310$015b049a127c$smcji')
>>> synoname_toolcode('Michelangelo IV', '', 'Workshop of')
('michelangelo iv', '', '3000550015$055b$mi')