abydos.phonetic package

abydos.phonetic.

The phonetic module implements phonetic algorithms including:

  • Robert C. Russell’s Index
  • American Soundex
  • Refined Soundex
  • Daitch-Mokotoff Soundex
  • Kölner Phonetik
  • NYSIIS
  • Match Rating Algorithm
  • Metaphone
  • Double Metaphone
  • Caverphone
  • Alpha Search Inquiry System
  • Fuzzy Soundex
  • Phonex
  • Phonem
  • Phonix
  • SfinxBis
  • phonet
  • Standardized Phonetic Frequency Code
  • Statistics Canada
  • Lein
  • Roger Root
  • Oxford Name Compression Algorithm (ONCA)
  • Eudex phonetic hash
  • Haase Phonetik
  • Reth-Schek Phonetik
  • FONEM
  • Parmar-Kumbharana
  • Davidson’s Consonant Code
  • SoundD
  • PSHP Soundex/Viewex Coding
  • an early version of Henry Code
  • Norphone
  • Dolby Code
  • Phonetic Spanish
  • Spanish Metaphone
  • MetaSoundex
  • SoundexBR
  • NRL English-to-phoneme
  • Beider-Morse Phonetic Matching
abydos.phonetic.russell_index(word)

Return the Russell Index (integer output) of a word.

This follows Robert C. Russell’s Index algorithm, as described in [Rus18].

Parameters:word (str) – the word to transform
Returns:the Russell Index value
Return type:int
>>> russell_index('Christopher')
3813428
>>> russell_index('Niall')
715
>>> russell_index('Smith')
3614
>>> russell_index('Schmidt')
3614
abydos.phonetic.russell_index_num_to_alpha(num)

Convert the Russell Index integer to an alphabetic string.

This follows Robert C. Russell’s Index algorithm, as described in [Rus18].

Parameters:num (int) – a Russell Index integer value
Returns:the Russell Index as an alphabetic string
Return type:str
>>> russell_index_num_to_alpha(3813428)
'CRACDBR'
>>> russell_index_num_to_alpha(715)
'NAL'
>>> russell_index_num_to_alpha(3614)
'CMAD'
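
The digit-to-letter correspondence can be read off the examples above (1→A, 2→B, 3→C, 4→D, 5→L, 6→M, 7→N, 8→R). A minimal sketch of that conversion follows. The helper name index_to_alpha is hypothetical, the mapping is inferred from the examples, and this is not necessarily how abydos implements it.

```python
# Illustrative sketch only: index_to_alpha is a hypothetical helper, and the
# mapping below is inferred from the documented examples.
_DIGIT_TO_LETTER = str.maketrans('12345678', 'ABCDLMNR')


def index_to_alpha(num):
    """Translate each digit of a Russell Index integer to its letter."""
    return str(num).translate(_DIGIT_TO_LETTER)


print(index_to_alpha(3813428))  # CRACDBR
print(index_to_alpha(715))      # NAL
print(index_to_alpha(3614))     # CMAD
```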
abydos.phonetic.russell_index_alpha(word)

Return the Russell Index (alphabetic output) for the word.

This follows Robert C. Russell’s Index algorithm, as described in [Rus18].

Parameters:word (str) – the word to transform
Returns:the Russell Index value as an alphabetic string
Return type:str
>>> russell_index_alpha('Christopher')
'CRACDBR'
>>> russell_index_alpha('Niall')
'NAL'
>>> russell_index_alpha('Smith')
'CMAD'
>>> russell_index_alpha('Schmidt')
'CMAD'
abydos.phonetic.soundex(word, max_length=4, var='American', reverse=False, zero_pad=True)

Return the Soundex code for a word.

Parameters:
  • word (str) – the word to transform
  • max_length (int) – the length of the code returned (defaults to 4)
  • var (str) –

    the variant of the algorithm to employ (defaults to ‘American’):

    • ‘American’ follows the American Soundex algorithm, as described at [UnitedStates07] and in [Knu98]; this is also called Miracode
    • ‘special’ follows the rules from the 1880-1910 US Census retrospective re-analysis, in which h & w are not treated as blocking consonants but as vowels. Cf. [Rep13].
    • ‘Census’ follows the rules laid out in GIL 55 [UnitedStates97] by the US Census, including coding prefixed and unprefixed versions of some names
  • reverse (bool) – reverse the word before computing the selected Soundex (defaults to False); this results in “Reverse Soundex”, which is useful for blocking in cases where the initial elements may be in error.
  • zero_pad (bool) – pad the end of the return value with 0s to achieve a max_length string
Returns:

the Soundex value

Return type:

str

>>> soundex("Christopher")
'C623'
>>> soundex("Niall")
'N400'
>>> soundex('Smith')
'S530'
>>> soundex('Schmidt')
'S530'
>>> soundex('Christopher', max_length=-1)
'C623160000000000000000000000000000000000000000000000000000000000'
>>> soundex('Christopher', max_length=-1, zero_pad=False)
'C62316'
>>> soundex('Christopher', reverse=True)
'R132'
>>> soundex('Ashcroft')
'A261'
>>> soundex('Asicroft')
'A226'
>>> soundex('Ashcroft', var='special')
'A226'
>>> soundex('Asicroft', var='special')
'A226'
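
To make the parameters above concrete, here is a minimal, self-contained sketch of the American variant. It is an illustration under stated assumptions, not the abydos implementation: soundex_sketch is a hypothetical name, only the letters A-Z are considered, max_length is assumed to be positive, and the 'special' and 'Census' variants are not covered.

```python
# Illustrative sketch only: soundex_sketch is a hypothetical helper.
_CODES = {
    letter: digit
    for digit, letters in {
        '1': 'BFPV', '2': 'CGJKQSXZ', '3': 'DT',
        '4': 'L', '5': 'MN', '6': 'R',
    }.items()
    for letter in letters
}


def soundex_sketch(word, max_length=4, reverse=False, zero_pad=True):
    # Assumption: anything outside A-Z is simply dropped.
    letters = [c for c in word.upper() if 'A' <= c <= 'Z']
    if reverse:
        letters.reverse()  # "Reverse Soundex": encode the word backwards
    if not letters:
        return ''
    code = letters[0]
    prev = _CODES.get(letters[0], '')
    for c in letters[1:]:
        if c in 'HW':
            # h and w do not separate letters that share a code
            continue
        digit = _CODES.get(c, '')
        if digit and digit != prev:
            code += digit
        prev = digit  # vowels reset prev, so a repeated code is emitted again
    code = code[:max_length]
    return code.ljust(max_length, '0') if zero_pad else code


print(soundex_sketch('Ashcroft'))                   # A261
print(soundex_sketch('Christopher', reverse=True))  # R132
```

The h/w rule is what separates 'Ashcroft' (A261) from 'Asicroft' (A226) in the American variant: the vowel i resets the previous code, while h does not.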
abydos.phonetic.refined_soundex(word, max_length=-1, zero_pad=False, retain_vowels=False)

Return the Refined Soundex code for a word.

This is Soundex, but with more character classes. It was defined at [Boy98].

Parameters:
  • word – the word to transform
  • max_length – the length of the code returned (defaults to unlimited)
  • zero_pad – pad the end of the return value with 0s to achieve a max_length string
  • retain_vowels – retain vowels (as 0) in the resulting code
Returns:

the Refined Soundex value

Return type:

str

>>> refined_soundex('Christopher')
'C393619'
>>> refined_soundex('Niall')
'N87'
>>> refined_soundex('Smith')
'S386'
>>> refined_soundex('Schmidt')
'S386'
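
The “more character classes” can be sketched as follows. This is a simplified illustration, not the abydos code: refined_soundex_sketch is a hypothetical name, the letter-to-digit table is the commonly published Refined Soundex mapping, and max_length and zero_pad handling are omitted.

```python
import itertools

# Illustrative sketch only: refined_soundex_sketch is a hypothetical helper.
# Table: vowels/H/W/Y=0, BP=1, FV=2, CKS=3, GJ=4, QXZ=5, DT=6, L=7, MN=8, R=9.
_REFINED = dict(zip('ABCDEFGHIJKLMNOPQRSTUVWXYZ',
                    '01360240043788015936020505'))


def refined_soundex_sketch(word, retain_vowels=False):
    letters = [c for c in word.upper() if c in _REFINED]
    if not letters:
        return ''
    # Every letter, including the first, contributes a digit.
    digits = [_REFINED[c] for c in letters]
    # Collapse runs of the same digit before dropping the vowel marker.
    collapsed = [d for d, _ in itertools.groupby(digits)]
    if not retain_vowels:
        collapsed = [d for d in collapsed if d != '0']
    return letters[0] + ''.join(collapsed)


print(refined_soundex_sketch('Christopher'))  # C393619
print(refined_soundex_sketch('Schmidt'))      # S386
```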
abydos.phonetic.dm_soundex(word, max_length=6, zero_pad=True)

Return the Daitch-Mokotoff Soundex code for a word.

Based on Daitch-Mokotoff Soundex [Mok97], this returns the values for a word as a set. A collection is necessary since there can be multiple values for a single word.

Parameters:
  • word – the word to transform
  • max_length – the length of the code returned (defaults to 6; must be between 6 and 64)
  • zero_pad – pad the end of the return value with 0s to achieve a max_length string
Returns:

the Daitch-Mokotoff Soundex value

Return type:

set

>>> sorted(dm_soundex('Christopher'))
['494379', '594379']
>>> dm_soundex('Niall')
{'680000'}
>>> dm_soundex('Smith')
{'463000'}
>>> dm_soundex('Schmidt')
{'463000'}
>>> sorted(dm_soundex('The quick brown fox', max_length=20,
... zero_pad=False))
['35457976754', '3557976754']
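
Because each word yields a set of codes, a natural way to compare two words is to check whether their sets overlap. The sketch below uses the values from the examples above as literals; share_code is a hypothetical helper, not part of abydos.

```python
# Illustrative sketch only: share_code is a hypothetical helper.
def share_code(codes_a, codes_b):
    """Return True if two collections of codes have at least one in common."""
    return not set(codes_a).isdisjoint(codes_b)


# Values copied from the doctest examples above.
christopher = {'494379', '594379'}
smith = {'463000'}
schmidt = {'463000'}

print(share_code(smith, schmidt))      # True
print(share_code(christopher, smith))  # False
```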
abydos.phonetic.fuzzy_soundex(word, max_length=5, zero_pad=True)

Return the Fuzzy Soundex code for a word.

Fuzzy Soundex is an algorithm derived from Soundex, defined in [HM02].

Parameters:
  • word (str) – the word to transform
  • max_length (int) – the length of the code returned (defaults to 5)
  • zero_pad (bool) – pad the end of the return value with 0s to achieve a max_length string
Returns:

the Fuzzy Soundex value

Return type:

str

>>> fuzzy_soundex('Christopher')
'K6931'
>>> fuzzy_soundex('Niall')
'N4000'
>>> fuzzy_soundex('Smith')
'S5300'
abydos.phonetic.lein(word, max_length=4, zero_pad=True)

Return the Lein code for a word.

This is Lein name coding, described in [MKTM77].

Parameters:
  • word (str) – the word to transform
  • max_length (int) – the maximum length (default 4) of the code to return
  • zero_pad (bool) – pad the end of the return value with 0s to achieve a max_length string
Returns:

the Lein code

Return type:

str

>>> lein('Christopher')
'C351'
>>> lein('Niall')
'N300'
>>> lein('Smith')
'S210'
>>> lein('Schmidt')
'S521'
abydos.phonetic.phonex(word, max_length=4, zero_pad=True)

Return the Phonex code for a word.

Phonex is an algorithm derived from Soundex, defined in [LR96].

Parameters:
  • word (str) – the word to transform
  • max_length (int) – the length of the code returned (defaults to 4)
  • zero_pad (bool) – pad the end of the return value with 0s to achieve a max_length string
Returns:

the Phonex value

Return type:

str

>>> phonex('Christopher')
'C623'
>>> phonex('Niall')
'N400'
>>> phonex('Schmidt')
'S253'
>>> phonex('Smith')
'S530'
abydos.phonetic.phonix(word, max_length=4, zero_pad=True)

Return the Phonix code for a word.

Phonix is a Soundex-like algorithm defined in [Gad90].

This implementation is based on:

  • [Pfe00]
  • [Chr11]
  • [Kollar]

Parameters:
  • word (str) – the word to transform
  • max_length (int) – the length of the code returned (defaults to 4)
  • zero_pad (bool) – pad the end of the return value with 0s to achieve a max_length string
Returns:

the Phonix value

Return type:

str

>>> phonix('Christopher')
'K683'
>>> phonix('Niall')
'N400'
>>> phonix('Smith')
'S530'
>>> phonix('Schmidt')
'S530'
abydos.phonetic.pshp_soundex_first(fname, max_length=4, german=False)

Calculate the PSHP Soundex/Viewex Coding of a first name.

This coding is based on [HBD76].

Reference was also made to the German version of the same: [HBD79].

A separate function, pshp_soundex_last(), is used for last names.

Parameters:
  • fname (str) – the first name to encode
  • max_length (int) – the length of the code returned (defaults to 4)
  • german (bool) – set to True if the name is German (different rules apply)
Returns:

the PSHP Soundex/Viewex Coding

Return type:

str

>>> pshp_soundex_first('Smith')
'S530'
>>> pshp_soundex_first('Waters')
'W352'
>>> pshp_soundex_first('James')
'J700'
>>> pshp_soundex_first('Schmidt')
'S500'
>>> pshp_soundex_first('Ashcroft')
'A220'
>>> pshp_soundex_first('John')
'J500'
>>> pshp_soundex_first('Colin')
'K400'
>>> pshp_soundex_first('Niall')
'N400'
>>> pshp_soundex_first('Sally')
'S400'
>>> pshp_soundex_first('Jane')
'J500'
abydos.phonetic.pshp_soundex_last(lname, max_length=4, german=False)

Calculate the PSHP Soundex/Viewex Coding of a last name.

This coding is based on [HBD76].

Reference was also made to the German version of the same: [HBD79].

A separate function, pshp_soundex_first(), is used for first names.

Parameters:
  • lname (str) – the last name to encode
  • max_length (int) – the length of the code returned (defaults to 4)
  • german (bool) – set to True if the name is German (different rules apply)
Returns:

the PSHP Soundex/Viewex Coding

Return type:

str

>>> pshp_soundex_last('Smith')
'S530'
>>> pshp_soundex_last('Waters')
'W350'
>>> pshp_soundex_last('James')
'J500'
>>> pshp_soundex_last('Schmidt')
'S530'
>>> pshp_soundex_last('Ashcroft')
'A225'
abydos.phonetic.nysiis(word, max_length=6, modified=False)

Return the NYSIIS code for a word.

The New York State Identification and Intelligence System algorithm is defined in [Taf70].

The modified version of this algorithm is described in Appendix B of [LA77].

Parameters:
  • word (str) – the word to transform
  • max_length (int) – the maximum length (default 6) of the code to return
  • modified (bool) – indicates whether to use USDA modified NYSIIS
Returns:

the NYSIIS value

Return type:

str

>>> nysiis('Christopher')
'CRASTA'
>>> nysiis('Niall')
'NAL'
>>> nysiis('Smith')
'SNAT'
>>> nysiis('Schmidt')
'SNAD'
>>> nysiis('Christopher', max_length=-1)
'CRASTAFAR'
>>> nysiis('Christopher', max_length=8, modified=True)
'CRASTAFA'
>>> nysiis('Niall', max_length=8, modified=True)
'NAL'
>>> nysiis('Smith', max_length=8, modified=True)
'SNAT'
>>> nysiis('Schmidt', max_length=8, modified=True)
'SNAD'
abydos.phonetic.mra(word)

Return the MRA personal numeric identifier (PNI) for a word.

A description of the Western Airlines Surname Match Rating Algorithm can be found on page 18 of [MKTM77].

Parameters:word (str) – the word to transform
Returns:the MRA PNI
Return type:str
>>> mra('Christopher')
'CHRPHR'
>>> mra('Niall')
'NL'
>>> mra('Smith')
'SMTH'
>>> mra('Schmidt')
'SCHMDT'
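
The encoding step of the Match Rating Algorithm is compact enough to sketch. The version below follows the commonly published rules (drop non-leading vowels, collapse repeated letters, keep the first three and last three characters of longer codes). mra_sketch is a hypothetical name, only A-Z input is handled, and this is not presented as the abydos implementation.

```python
import re

# Illustrative sketch only: mra_sketch is a hypothetical helper.
def mra_sketch(word):
    # Assumption: anything outside A-Z is dropped.
    word = ''.join(c for c in word.upper() if 'A' <= c <= 'Z')
    if not word:
        return ''
    # Keep the first letter; drop vowels everywhere else.
    word = word[0] + ''.join(c for c in word[1:] if c not in 'AEIOU')
    # Collapse runs of the same letter into one.
    word = re.sub(r'(.)\1+', r'\1', word)
    # Longer codes keep only their first three and last three characters.
    if len(word) > 6:
        word = word[:3] + word[-3:]
    return word


print(mra_sketch('Christopher'))  # CHRPHR
print(mra_sketch('Niall'))        # NL
```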
abydos.phonetic.caverphone(word, version=2)

Return the Caverphone code for a word.

A description of version 1 of the algorithm can be found in [Hoo02].

A description of version 2 of the algorithm can be found in [Hoo04].

Parameters:
  • word (str) – the word to transform
  • version (int) – the version of Caverphone to employ for encoding (defaults to 2)
Returns:

the Caverphone value

Return type:

str

>>> caverphone('Christopher')
'KRSTFA1111'
>>> caverphone('Niall')
'NA11111111'
>>> caverphone('Smith')
'SMT1111111'
>>> caverphone('Schmidt')
'SKMT111111'
>>> caverphone('Christopher', 1)
'KRSTF1'
>>> caverphone('Niall', 1)
'N11111'
>>> caverphone('Smith', 1)
'SMT111'
>>> caverphone('Schmidt', 1)
'SKMT11'
abydos.phonetic.alpha_sis(word, max_length=14)

Return the IBM Alpha Search Inquiry System code for a word.

The Alpha Search Inquiry System code is defined in [IBMCorporation73]. This implementation is based on the description in [MKTM77].

A collection is necessary since there can be multiple values for a single word, and the collection must be ordered since the first value is the primary coding.

Parameters:
  • word (str) – the word to transform
  • max_length (int) – the length of the code returned (defaults to 14)
Returns:

the Alpha SIS value

Return type:

tuple

>>> alpha_sis('Christopher')
('06401840000000', '07040184000000', '04018400000000')
>>> alpha_sis('Niall')
('02500000000000',)
>>> alpha_sis('Smith')
('03100000000000',)
>>> alpha_sis('Schmidt')
('06310000000000',)
abydos.phonetic.davidson(lname, fname='.', omit_fname=False)

Return Davidson’s Consonant Code.

This is based on the name compression system described in [Dav62].

[Dol70] identifies this as having been the name compression algorithm used by SABRE.

Parameters:
  • lname (str) – Last name (or word) to be encoded
  • fname (str) – First name (optional), of which the first character is included in the code.
  • omit_fname (bool) – Set to True to completely omit the first character of the first name
Returns:

Davidson’s Consonant Code

Return type:

str

>>> davidson('Gough')
'G   .'
>>> davidson('pneuma')
'PNM .'
>>> davidson('knight')
'KNGT.'
>>> davidson('trice')
'TRC .'
>>> davidson('judge')
'JDG .'
>>> davidson('Smith', 'James')
'SMT J'
>>> davidson('Wasserman', 'Tabitha')
'WSRMT'
abydos.phonetic.dolby(word, max_length=-1, keep_vowels=False, vowel_char='*')

Return the Dolby Code of a name.

This follows “A Spelling Equivalent Abbreviation Algorithm For Personal Names” from [Dol70] and [C+69].

Parameters:
  • word – the word to encode
  • max_length – maximum length of the returned Dolby code – this also activates the fixed-length code mode if it is greater than 0
  • keep_vowels – if True, retains all vowel markers
  • vowel_char – the vowel marker character (defaults to *)
Returns:

the Dolby Code

Return type:

str

>>> dolby('Hansen')
'H*NSN'
>>> dolby('Larsen')
'L*RSN'
>>> dolby('Aagaard')
'*GR'
>>> dolby('Braaten')
'BR*DN'
>>> dolby('Sandvik')
'S*NVK'
>>> dolby('Hansen', max_length=6)
'H*NS*N'
>>> dolby('Larsen', max_length=6)
'L*RS*N'
>>> dolby('Aagaard', max_length=6)
'*G*R  '
>>> dolby('Braaten', max_length=6)
'BR*D*N'
>>> dolby('Sandvik', max_length=6)
'S*NF*K'
>>> dolby('Smith')
'SM*D'
>>> dolby('Waters')
'W*DRS'
>>> dolby('James')
'J*MS'
>>> dolby('Schmidt')
'SM*D'
>>> dolby('Ashcroft')
'*SKRFD'
>>> dolby('Smith', max_length=6)
'SM*D  '
>>> dolby('Waters', max_length=6)
'W*D*RS'
>>> dolby('James', max_length=6)
'J*M*S '
>>> dolby('Schmidt', max_length=6)
'SM*D  '
>>> dolby('Ashcroft', max_length=6)
'*SKRFD'
abydos.phonetic.spfc(word)

Return the Standardized Phonetic Frequency Code (SPFC) of a word.

Standardized Phonetic Frequency Code is roughly Soundex-like. This implementation is based on pages 19-21 of [MKTM77].

Parameters:word (str) – the word to transform
Returns:the SPFC value
Return type:str
>>> spfc('Christopher Smith')
'01160'
>>> spfc('Christopher Schmidt')
'01160'
>>> spfc('Niall Smith')
'01660'
>>> spfc('Niall Schmidt')
'01660'
>>> spfc('L.Smith')
'01960'
>>> spfc('R.Miller')
'65490'
>>> spfc(('L', 'Smith'))
'01960'
>>> spfc(('R', 'Miller'))
'65490'
abydos.phonetic.roger_root(word, max_length=5, zero_pad=True)

Return the Roger Root code for a word.

This is Roger Root name coding, described in [MKTM77].

Parameters:
  • word (str) – the word to transform
  • max_length (int) – the maximum length (default 5) of the code to return
  • zero_pad (bool) – pad the end of the return value with 0s to achieve a max_length string
Returns:

the Roger Root code

Return type:

str

>>> roger_root('Christopher')
'06401'
>>> roger_root('Niall')
'02500'
>>> roger_root('Smith')
'00310'
>>> roger_root('Schmidt')
'06310'
abydos.phonetic.statistics_canada(word, max_length=4)

Return the Statistics Canada code for a word.

The original description of this algorithm could not be located, and may only have been specified in an unpublished technical report. The coding does not appear to be in use by Statistics Canada any longer. In its place, this is an implementation of the “Census modified Statistics Canada name coding procedure”.

The modified version of this algorithm is described in Appendix B of [MKTM77].

Parameters:
  • word (str) – the word to transform
  • max_length (int) – the maximum length (default 4) of the code to return
Returns:

the Statistics Canada name code value

Return type:

str

>>> statistics_canada('Christopher')
'CHRS'
>>> statistics_canada('Niall')
'NL'
>>> statistics_canada('Smith')
'SMTH'
>>> statistics_canada('Schmidt')
'SCHM'
abydos.phonetic.sound_d(word, max_length=4)

Return the SoundD code.

SoundD is defined in [VB12].

Parameters:
  • word (str) – the word to transform
  • max_length (int) – the length of the code returned (defaults to 4)
Returns:

the SoundD code

Return type:

str

>>> sound_d('Gough')
'2000'
>>> sound_d('pneuma')
'5500'
>>> sound_d('knight')
'5300'
>>> sound_d('trice')
'3620'
>>> sound_d('judge')
'2200'
abydos.phonetic.parmar_kumbharana(word)

Return the Parmar-Kumbharana encoding of a word.

This is based on the phonetic algorithm proposed in [PK14].

Parameters:word (str) – the word to transform
Returns:the Parmar-Kumbharana encoding
Return type:str
>>> parmar_kumbharana('Gough')
'GF'
>>> parmar_kumbharana('pneuma')
'NM'
>>> parmar_kumbharana('knight')
'NT'
>>> parmar_kumbharana('trice')
'TRS'
>>> parmar_kumbharana('judge')
'JJ'
abydos.phonetic.metaphone(word, max_length=-1)

Return the Metaphone code for a word.

Based on Lawrence Philips’ Pick BASIC code from 1990 [Phi90b], as described in [Phi90a]. This incorporates some corrections to the above code, particularly some of those suggested by Michael Kuhn in [Kuh95].

Parameters:
  • word (str) – the word to transform
  • max_length (int) – the maximum length of the returned Metaphone code (defaults to 64, but in Philips’ original implementation this was 4)
Returns:

the Metaphone value

Return type:

str

>>> metaphone('Christopher')
'KRSTFR'
>>> metaphone('Niall')
'NL'
>>> metaphone('Smith')
'SM0'
>>> metaphone('Schmidt')
'SKMTT'
abydos.phonetic.double_metaphone(word, max_length=-1)

Return the Double Metaphone code for a word.

Based on Lawrence Philips’ (Visual) C++ code from 1999 [Phi00].

Parameters:
  • word – the word to transform
  • max_length – the maximum length of the returned Double Metaphone codes (defaults to 64, but in Philips’ original implementation this was 4)
Returns:

the Double Metaphone value(s)

Return type:

tuple

>>> double_metaphone('Christopher')
('KRSTFR', '')
>>> double_metaphone('Niall')
('NL', '')
>>> double_metaphone('Smith')
('SM0', 'XMT')
>>> double_metaphone('Schmidt')
('XMT', 'SMT')
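
When comparing two (primary, secondary) tuples, the empty secondary code seen in some examples above needs care: two unrelated words would otherwise "match" on the empty string. A minimal sketch using the documented values as literals follows; codes_match is a hypothetical helper, not part of abydos.

```python
# Illustrative sketch only: codes_match is a hypothetical helper.
def codes_match(codes_a, codes_b):
    """Return True if two code tuples share at least one non-empty code."""
    a = {code for code in codes_a if code}
    b = {code for code in codes_b if code}
    return bool(a & b)


# Values copied from the doctest examples above.
christopher = ('KRSTFR', '')
niall = ('NL', '')
smith = ('SM0', 'XMT')
schmidt = ('XMT', 'SMT')

print(codes_match(smith, schmidt))      # True: both contain 'XMT'
print(codes_match(christopher, niall))  # False: the shared '' is ignored
```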
abydos.phonetic.eudex(word, max_length=8)

Return the eudex phonetic hash of a word.

This implementation of eudex phonetic hashing is based on the specification (not the reference implementation) at [Tic].

Further details can be found at [Tic16].

Parameters:
  • word (str) – the word to transform
  • max_length (int) – the length in bits of the code returned (default 8)
Returns:

the eudex hash

Return type:

int

>>> eudex('Colin')
432345564238053650
>>> eudex('Christopher')
433648490138894409
>>> eudex('Niall')
648518346341351840
>>> eudex('Smith')
720575940412906756
>>> eudex('Schmidt')
720589151732307997
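
Since the hash is returned as an int, one simple way to compare two hashes is to count the bits in which they differ. The sketch below is an illustration only: hamming_distance is a hypothetical helper, plain Hamming distance is an assumption about how a caller might compare hashes, and nothing on this page states that abydos compares them this way.

```python
# Illustrative sketch only: hamming_distance is a hypothetical helper.
def hamming_distance(hash_a, hash_b):
    """Count the bit positions in which two integer hashes differ."""
    return bin(hash_a ^ hash_b).count('1')


# Values copied from the doctest examples above.
smith = 720575940412906756
schmidt = 720589151732307997

print(hamming_distance(smith, smith))    # 0
print(hamming_distance(smith, schmidt))
```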
abydos.phonetic.bmpm(word, language_arg=0, name_mode='gen', match_mode='approx', concat=False, filter_langs=False)

Return the Beider-Morse Phonetic Matching encoding(s) of a term.

The Beider-Morse Phonetic Matching algorithm is described in [BM08]. The reference implementation is licensed under GPLv3.

Parameters:
  • word (str) – the word to transform
  • language_arg (str) –

    the language of the term; supported values include:

    • ‘any’
    • ‘arabic’
    • ‘cyrillic’
    • ‘czech’
    • ‘dutch’
    • ‘english’
    • ‘french’
    • ‘german’
    • ‘greek’
    • ‘greeklatin’
    • ‘hebrew’
    • ‘hungarian’
    • ‘italian’
    • ‘latvian’
    • ‘polish’
    • ‘portuguese’
    • ‘romanian’
    • ‘russian’
    • ‘spanish’
    • ‘turkish’
  • name_mode (str) –

    the name mode of the algorithm:

    • ‘gen’ – general (default)
    • ‘ash’ – Ashkenazi
    • ‘sep’ – Sephardic
  • match_mode (str) – matching mode: ‘approx’ or ‘exact’
  • concat (bool) – concatenation mode
  • filter_langs (bool) – filter out incompatible languages
Returns:

the BMPM value(s)

Return type:

str

>>> bmpm('Christopher')
'xrQstopir xrQstYpir xristopir xristYpir xrQstofir xrQstYfir xristofir
xristYfir xristopi xritopir xritopi xristofi xritofir xritofi tzristopir
tzristofir zristopir zristopi zritopir zritopi zristofir zristofi zritofir
zritofi'
>>> bmpm('Niall')
'nial niol'
>>> bmpm('Smith')
'zmit'
>>> bmpm('Schmidt')
'zmit stzmit'
>>> bmpm('Christopher', language_arg='German')
'xrQstopir xrQstYpir xristopir xristYpir xrQstofir xrQstYfir xristofir
xristYfir'
>>> bmpm('Christopher', language_arg='English')
'tzristofir tzrQstofir tzristafir tzrQstafir xristofir xrQstofir xristafir
xrQstafir'
>>> bmpm('Christopher', language_arg='German', name_mode='ash')
'xrQstopir xrQstYpir xristopir xristYpir xrQstofir xrQstYfir xristofir
xristYfir'
>>> bmpm('Christopher', language_arg='German', match_mode='exact')
'xriStopher xriStofer xristopher xristofer'
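
The examples above show the result as a single string of space-separated alternatives. To compare two names, a caller might split each result and look for shared alternatives. This is a sketch using the documented values as literals; bmpm_variants is a hypothetical helper, not part of abydos.

```python
# Illustrative sketch only: bmpm_variants is a hypothetical helper.
def bmpm_variants(encoded):
    """Split a space-separated BMPM result into a set of alternatives."""
    return set(encoded.split())


# Values copied from the doctest examples above.
smith = 'zmit'
schmidt = 'zmit stzmit'
niall = 'nial niol'

print(bmpm_variants(smith) & bmpm_variants(schmidt))  # {'zmit'}
print(bmpm_variants(smith) & bmpm_variants(niall))    # set()
```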
abydos.phonetic.nrl(word)

Return the Naval Research Laboratory phonetic encoding of a word.

This is defined by [EJMS76].

Parameters:word (str) – the word to transform
Returns:the NRL phonetic encoding
Return type:str
>>> nrl('the')
'DHAX'
>>> nrl('round')
'rAWnd'
>>> nrl('quick')
'kwIHk'
>>> nrl('eaten')
'IYtEHn'
>>> nrl('Smith')
'smIHTH'
>>> nrl('Larsen')
'lAArsEHn'
abydos.phonetic.metasoundex(word, lang='en')

Return the MetaSoundex code for a word.

This is based on [KV17]. Only English (‘en’) and Spanish (‘es’) languages are supported, as in the original.

Parameters:
  • word (str) – the word to transform
  • lang (str) – either ‘en’ for English or ‘es’ for Spanish
Returns:

the MetaSoundex code

Return type:

str

>>> metasoundex('Smith')
'4500'
>>> metasoundex('Waters')
'7362'
>>> metasoundex('James')
'1520'
>>> metasoundex('Schmidt')
'4530'
>>> metasoundex('Ashcroft')
'0261'
>>> metasoundex('Perez', lang='es')
'094'
>>> metasoundex('Martinez', lang='es')
'69364'
>>> metasoundex('Gutierrez', lang='es')
'83994'
>>> metasoundex('Santiago', lang='es')
'4638'
>>> metasoundex('Nicolás', lang='es')
'6754'
abydos.phonetic.onca(word, max_length=4, zero_pad=True)

Return the Oxford Name Compression Algorithm (ONCA) code for a word.

This is the Oxford Name Compression Algorithm, based on [Gil97].

I can find no complete description of the “anglicised version of the NYSIIS method” identified as the first step in this algorithm, so this is likely not a precisely correct implementation, in that it employs the standard NYSIIS algorithm.

Parameters:
  • word (str) – the word to transform
  • max_length (int) – the maximum length (default 4) of the code to return
  • zero_pad (bool) – pad the end of the return value with 0s to achieve a max_length string
Returns:

the ONCA code

Return type:

str

>>> onca('Christopher')
'C623'
>>> onca('Niall')
'N400'
>>> onca('Smith')
'S530'
>>> onca('Schmidt')
'S530'
abydos.phonetic.fonem(word)

Return the FONEM code of a word.

FONEM is a phonetic algorithm designed for French (particularly surnames in Saguenay, Canada), defined in [BBL81].

Guillaume Plique’s JavaScript implementation [Pli18] at https://github.com/Yomguithereal/talisman/blob/master/src/phonetics/french/fonem.js was also consulted for this implementation.

Parameters:word (str) – the word to transform
Returns:the FONEM code
Return type:str
>>> fonem('Marchand')
'MARCHEN'
>>> fonem('Beaulieu')
'BOLIEU'
>>> fonem('Beaumont')
'BOMON'
>>> fonem('Legrand')
'LEGREN'
>>> fonem('Pelletier')
'PELETIER'
abydos.phonetic.henry_early(word, max_length=3)

Calculate the early version of the Henry code for a word.

The early version of Henry coding is given in [LegareLC72]. This is different from the later version defined in [Hen76].

Parameters:
  • word (str) – the word to transform
  • max_length (int) – the length of the code returned (defaults to 3)
Returns:

the early Henry code

Return type:

str

>>> henry_early('Marchand')
'MRC'
>>> henry_early('Beaulieu')
'BL'
>>> henry_early('Beaumont')
'BM'
>>> henry_early('Legrand')
'LGR'
>>> henry_early('Pelletier')
'PLT'
abydos.phonetic.koelner_phonetik(word)

Return the Kölner Phonetik (numeric output) code for a word.

Based on the algorithm defined by [Pos69].

While the output code is numeric, it is still a str because 0s can lead the code.

Parameters:word (str) – the word to transform
Returns:the Kölner Phonetik value as a numeric string
Return type:str
>>> koelner_phonetik('Christopher')
'478237'
>>> koelner_phonetik('Niall')
'65'
>>> koelner_phonetik('Smith')
'862'
>>> koelner_phonetik('Schmidt')
'862'
>>> koelner_phonetik('Müller')
'657'
>>> koelner_phonetik('Zimmermann')
'86766'
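
The reason for returning a str rather than an int can be shown with a short round trip. The code value below is a hypothetical example with a leading zero, not the output for any particular word.

```python
# Illustrative only: '065' is a hypothetical code value with a leading zero.
code = '065'

as_int = int(code)
print(as_int)               # 65
print(str(as_int) == code)  # False: the leading zero is lost

# The original can only be restored if its length is known separately.
print(str(as_int).zfill(len(code)) == code)  # True
```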
abydos.phonetic.koelner_phonetik_num_to_alpha(num)

Convert a Kölner Phonetik code from numeric to alphabetic.

Parameters:num (str) – a numeric Kölner Phonetik representation (can be a str or an int)
Returns:an alphabetic representation of the same word
Return type:str
>>> koelner_phonetik_num_to_alpha('862')
'SNT'
>>> koelner_phonetik_num_to_alpha('657')
'NLR'
>>> koelner_phonetik_num_to_alpha('86766')
'SNRNN'
abydos.phonetic.koelner_phonetik_alpha(word)

Return the Kölner Phonetik (alphabetic output) code for a word.

Parameters:word (str) – the word to transform
Returns:the Kölner Phonetik value as an alphabetic string
Return type:str
>>> koelner_phonetik_alpha('Smith')
'SNT'
>>> koelner_phonetik_alpha('Schmidt')
'SNT'
>>> koelner_phonetik_alpha('Müller')
'NLR'
>>> koelner_phonetik_alpha('Zimmermann')
'SNRNN'
abydos.phonetic.haase_phonetik(word, primary_only=False)

Return the Haase Phonetik (numeric output) code for a word.

Based on the algorithm described at [Pra15].

Based on the original [HH00].

While each output code is numeric, it is nevertheless a str; the codes are returned together in a tuple.

Parameters:
  • word (str) – the word to transform
  • primary_only (bool) – if True, only the primary code is returned
Returns:

the Haase Phonetik value as a numeric string

Return type:

tuple

>>> haase_phonetik('Joachim')
('9496',)
>>> haase_phonetik('Christoph')
('4798293', '8798293')
>>> haase_phonetik('Jörg')
('974',)
>>> haase_phonetik('Smith')
('8692',)
>>> haase_phonetik('Schmidt')
('8692', '4692')
abydos.phonetic.reth_schek_phonetik(word)

Return Reth-Schek Phonetik code for a word.

This algorithm is proposed in [vonRethS77].

Since I couldn’t secure a copy of that document (maybe I’ll look for it next time I’m in Germany), this implementation is based on what I could glean from the implementations published by the German Record Linkage Center (www.record-linkage.de):

  • Privacy-preserving Record Linkage (PPRL) (in R) [Ruk18]
  • Merge ToolBox (in Java) [SBB04]

Rules that are unclear:

  • Should ‘C’ become ‘G’ or ‘Z’? (PPRL has both, ‘Z’ rule blocked)
  • Should ‘CC’ become ‘G’? (PPRL has blocked ‘CK’ that may be typo)
  • Should ‘TUI’ -> ‘ZUI’ rule exist? (PPRL has rule, but I can’t think of a German word with ‘-tui-’ in it.)
  • Should we really change ‘SCH’ -> ‘CH’ and then ‘CH’ -> ‘SCH’?
Parameters:word (str) – the word to transform
Returns:the Reth-Schek Phonetik code
Return type:str
>>> reth_schek_phonetik('Joachim')
'JOAGHIM'
>>> reth_schek_phonetik('Christoph')
'GHRISDOF'
>>> reth_schek_phonetik('Jörg')
'JOERG'
>>> reth_schek_phonetik('Smith')
'SMID'
>>> reth_schek_phonetik('Schmidt')
'SCHMID'
abydos.phonetic.phonem(word)

Return the Phonem code for a word.

Phonem is defined in [GM88].

This version is based on the Perl implementation documented at [Wil05]. It includes some enhancements presented in the Java port at [dcm4che].

Phonem is intended chiefly for German names/words.

Parameters:word (str) – the word to transform
Returns:the Phonem value
Return type:str
>>> phonem('Christopher')
'CRYSDOVR'
>>> phonem('Niall')
'NYAL'
>>> phonem('Smith')
'SMYD'
>>> phonem('Schmidt')
'CMYD'
abydos.phonetic.phonet(word, mode=1, lang='de')

Return the phonet code for a word.

phonet (“Hannoveraner Phonetik”) was developed by Jörg Michael and documented in [Mic99].

This is a port of Jesper Zedlitz’s code, which is licensed LGPL [Zed15].

That is, in turn, based on Michael’s C code, which is also licensed LGPL [Mic07].

Parameters:
  • word (str) – the word to transform
  • mode (int) – the phonet variant to employ (1 or 2)
  • lang (str) – ‘de’ (default) for German or ‘none’ for no language
Returns:

the phonet value

Return type:

str

>>> phonet('Christopher')
'KRISTOFA'
>>> phonet('Niall')
'NIAL'
>>> phonet('Smith')
'SMIT'
>>> phonet('Schmidt')
'SHMIT'
>>> phonet('Christopher', mode=2)
'KRIZTUFA'
>>> phonet('Niall', mode=2)
'NIAL'
>>> phonet('Smith', mode=2)
'ZNIT'
>>> phonet('Schmidt', mode=2)
'ZNIT'
>>> phonet('Christopher', lang='none')
'CHRISTOPHER'
>>> phonet('Niall', lang='none')
'NIAL'
>>> phonet('Smith', lang='none')
'SMITH'
>>> phonet('Schmidt', lang='none')
'SCHMIDT'
abydos.phonetic.soundex_br(word, max_length=4, zero_pad=True)

Return the SoundexBR encoding of a word.

This is based on [Mar15].

Parameters:
  • word (str) – the word to transform
  • max_length (int) – the length of the code returned (defaults to 4)
  • zero_pad (bool) – pad the end of the return value with 0s to achieve a max_length string
Returns:

the SoundexBR code

Return type:

str

>>> soundex_br('Oliveira')
'O416'
>>> soundex_br('Almeida')
'A453'
>>> soundex_br('Barbosa')
'B612'
>>> soundex_br('Araújo')
'A620'
>>> soundex_br('Gonçalves')
'G524'
>>> soundex_br('Goncalves')
'G524'
abydos.phonetic.phonetic_spanish(word, max_length=-1)

Return the PhoneticSpanish coding of word.

This follows the coding described in [AmonME12] and [delPAngelesEGGM15].

Parameters:
  • word (str) – the word to transform
  • max_length (int) – the length of the code returned (defaults to unlimited)
Returns:

the PhoneticSpanish code

Return type:

str

>>> phonetic_spanish('Perez')
'094'
>>> phonetic_spanish('Martinez')
'69364'
>>> phonetic_spanish('Gutierrez')
'83994'
>>> phonetic_spanish('Santiago')
'4638'
>>> phonetic_spanish('Nicolás')
'6454'
abydos.phonetic.spanish_metaphone(word, max_length=6, modified=False)

Return the Spanish Metaphone of a word.

This is a quick rewrite of the Spanish Metaphone Algorithm, as presented at https://github.com/amsqr/Spanish-Metaphone and discussed in [MLM12].

Modified version based on [delPAngelesBailonM16].

Parameters:
  • word (str) – the word to transform
  • max_length (int) – the length of the code returned (defaults to 6)
  • modified (bool) – Set to True to use del Pilar Angeles & Bailón-Miguel’s modified version of the algorithm
Returns:

the Spanish Metaphone code

Return type:

str

>>> spanish_metaphone('Perez')
'PRZ'
>>> spanish_metaphone('Martinez')
'MRTNZ'
>>> spanish_metaphone('Gutierrez')
'GTRRZ'
>>> spanish_metaphone('Santiago')
'SNTG'
>>> spanish_metaphone('Nicolás')
'NKLS'
abydos.phonetic.sfinxbis(word, max_length=-1)

Return the SfinxBis code for a word.

SfinxBis is a Soundex-like algorithm defined in [Axe09].

This implementation follows the reference implementation: [Sjoo09].

SfinxBis is intended chiefly for Swedish names.

Parameters:
  • word (str) – the word to transform
  • max_length (int) – the length of the code returned (defaults to unlimited)
Returns:

the SfinxBis value

Return type:

tuple

>>> sfinxbis('Christopher')
('K68376',)
>>> sfinxbis('Niall')
('N4',)
>>> sfinxbis('Smith')
('S53',)
>>> sfinxbis('Schmidt')
('S53',)
>>> sfinxbis('Johansson')
('J585',)
>>> sfinxbis('Sjöberg')
('#162',)
abydos.phonetic.norphone(word)

Return the Norphone code.

The reference implementation by Lars Marius Garshol is available in [Gar15].

Norphone was designed for Norwegian, but this implementation has been extended to support Swedish vowels as well. This function incorporates the “not implemented” rules from the above file’s rule set.

Parameters:word (str) – the word to transform
Returns:the Norphone code
Return type:str
>>> norphone('Hansen')
'HNSN'
>>> norphone('Larsen')
'LRSN'
>>> norphone('Aagaard')
'ÅKRT'
>>> norphone('Braaten')
'BRTN'
>>> norphone('Sandvik')
'SNVK'