Rui Abreu, Peter Zoeteweij, and Arjan J. C. van Gemund. An evaluation of similarity coefficients for software fault localization. In 2006 12th Pacific Rim International Symposium on Dependable Computing (PRDC'06). 2007. doi:10.1109/PRDC.2006.18.


Jason Adams. Ruby port of uealite stemmer. 2017. URL:


William A. Ainsworth. A system for converting text into speech. IEEE Transactions on Audio and Electroacoustics, AU-21(3):288–290, June 1973. doi:10.1109/TAU.1973.1162452.


Iván Amón, Francisco Moreno, and Jaime Echeverri. Algoritmo fonético para detección de cadenas de texto duplicadas en el idioma español. Revista Ingenierías Universidad de Medellín, 11(20):127–138, June 2012. URL:&script=sci_abstract&tlng=es.


Michael R. Anderberg. Cluster Analysis for Applications. Academic Press, New York, 1973. doi:10.1016/C2013-0-06161-0.


Marti J. Anderson and Russell B. Millar. Spatial variation and effects of habitat on temperate reef fish assemblages in northeastern new zealand. Journal of Experimental Marine Biology and Ecology, 305:191–221, 2004. doi:10.1016/j.jembe.2003.12.011.


A. Martín Andrés and P. Femia Marzo. Delta: a new measure of agreement between two raters. British Journal of Mathematical and Statistical Psychology, 57(1):1–20, May 2004. doi:10.1348/000711004849268.


Brian Austin and Rita R. Colwell. Evaluation of some coefficients for use in numerical taxonomy of microorganisms. International Journal of Systematic Bacteriology, 27(3):204–210, July 1977. doi:10.1099/00207713-27-3-204.


Pål Axelsson. Sfinxbis. Technical Report, Swedish Alliance for Middleware Infrastructure, April 2009. URL:


Cesare Baroni-Urbani and Mauro W. Buser. Similarity of binary data. Systematic Biology, 25(3):251–259, September 1976. doi:10.2307/2412493.


Ilaria Bartolini, Paolo Ciaccia, and Marco Patella. String matching with metric trees using an approximate distance. In Alberto H. F. Laender and Arlindo L. Oliveira, editors, SPIRE 2002: String Processing and Information Retrieval, 271–283. Berlin, Heidelberg, 2002. Springer Berlin Heidelberg. URL:, doi:10.1007/3-540-45735-6_24.


Vladimir Batagelj and Matevž Bren. Comparing resemblance measures. Journal of Classification, 12(1):73–90, March 1995. doi:10.1007/BF01202268.


Forrest B. Baulieu. A classification of presence/absence based dissimilarity coefficients. Journal of Classification, 6(1):233–246, 1989. doi:10.1007/BF01908601.


Forrest B. Baulieu. Two variant axiom systems for presence/absence based dissimilarity coefficients. Journal of Classification, 14(1):159–170, 1997. doi:10.1007/s003579900009.


Alexander Beider and Stephen P. Morse. Beider-morse phonetic matching: an alternative to soundex with fewer false hits. International Review of Jewish Genealogy, Summer 2008. URL:


Rudolfo Benini. Principii di Demografia. Number 29 in Manuali Barbera di Scienze Giuridiche Sociali e Politiche. G. Barbera, Firenze, 1901. URL:


E. M. Bennett, R. Alpert, and A. C. Goldstein. Communications through limited-response questioning. Public Opinion Quarterly, 18(3):303–308, 1954. doi:10.1086/266520.


Anil Kumar Bhattacharyya. On a measure of divergence between two multinomial populations. Sankhyā: The Indian Journal of Statistics (1933-1960), 7(4):401–406, July 1946. doi:10.2307/25047882.


Gerard Bouchard and Christian Pouyez. Name variations and computerized record linkage. Historical Methods: A Journal of Quantitative and Interdisciplinary History, 13(2):119–125, 1980. doi:10.1080/01615440.1980.10594037.


Gérard Bouchard, Patrick Brard, and Yolande Lavoie. Fonem: un code de transcription phonétique pour la reconstitution automatique des familles saguenayennes. Population, 1981. URL:_0032-4663_1981_num_36_6_17248, doi:10.2307/1532326.


Carolyn B. Boyce. Information on the refined soundex algorithm. November 1998. URL:


Leonid Boytsov. Indexing methods for approximate dictionary searching: comparative analysis. Journal of Experimental Algorithmics, 16:1.1:1.1–1.1:1.91, May 2011. doi:10.1145/1963190.1963191.


George W. Brainerd. The place of chronological ordering in archaeological analysis. American Antiquity, 16(4):301–313, April 1951. doi:10.2307/276979.


Josias Braun-Blanquet. Plant Sociology: The Study of Plant Communities. McGraw-Hill Book Company, New York, 1932. URL:


J. Roger Bray and John T. Curtis. An ordination of upland forest communities of southern wisconsin. Ecological Monographs, 27(4):325–349, February 1957. URL:, doi:10.2307/1942268.


Andrei Z. Broder. On the resemblance and containment of documents. In Compression and Complexity of Sequences: Proceedings, Positano, Amalfitan Coast, Salerno, Italy, June 11-13, 1997, 21–29. 1997. doi:10.1109/SEQUEN.1997.666900.


Michael Burrows and David J. Wheeler. A block sorting lossless data compression algorithm. SRC Research Report 124, Digital Equipment Corporation, Palo Alto, May 1994. URL:


Yong Cao, Anthony W. Bark, and W. Peter Williams. Similarity measure bias in river benthic aufwuchs community analysis. Water Environment Research, 69(1):95–106, 1997. doi:10.2175/106143097x125227.


Jörg Caumanns. A fast and simple stemming algorithm for german words. Technical Report, Free University of Berlin, 1999. URL:


Sung-Hyuk Cha. Taxonomy of nominal type histogram distance measures. In Proceedings of the American Conference on Applied Mathematics (MATH '08). 2008. URL:


Sung-Hyuk Cha, Charles C. Tappert, and Sungsoo Yoon. Enhancing binary feature vector similarity measures. Journal of Pattern Recognition Research, 1(1):63–77, 2006. doi:10.13176/11.20.


Anne Chao, Robin L. Chazdon, Robert K. Colwell, and Tsung-Jen Shen. A new statistical approach for assessing similarity of species composition with incidence and abundance data. Ecology Letters, 8(2):148–159, 2004. doi:10.1111/j.1461-0248.2004.00707.x.


Seung-Seok Choi, Sung-Hyuk Cha, and Charles C. Tappert. A survey of binary similarity and distance measures. Systemics, Cybernetics and Informatics, 8(1):43–48, 2010.


Peter Christen. A comparison of personal name matching: techniques and practical issues. Technical Report TR-CS-06-02, Australian National University, Canberra, Australia, 2006. URL:


Peter Christen. Febrl (freely extensible biomedical record linkage) – December 2011. URL:


Kenneth Church, William Gale, Patrick Hanks, and Donald Hindle. Using statistics in lexical analysis. In Lexical Acquisition: Exploiting On-Line Resources to Build up a Lexicon, pages 115–164. Lawrence Erlbaum, Hillsdale, NJ, 1991.


Richard Churchill. URL:


Rudi Cilibrasi and Paul Michael Béla Vitanyi. Clustering by compression. IEEE Transactions on Information Theory, 51(4):1523–1545, April 2005. URL:, doi:10.1109/TIT.2005.844059.


Aleksander Cisłak and Szymon Grabowski. Lightweight fingerprints for fast approximate keyword matching using bitwise operations. CoRR, 2017. URL:


Philip J. Clark. An extension of the coefficient of divergence for use with multiple characters. Copeia, 1952(2):61–64, June 1952. doi:10.2307/1438532.


Paul W. Clement. A formula for computing inter-observer agreement. Psychological Reports, 39(1):257–258, 1976. doi:10.2466/pr0.1976.39.1.257.


Rosetta Code. Longest common subsequence. 2018. URL:_common_subsequence#Dynamic_Programming_6.


Rosetta Code. Run-length encoding. 2018. URL:_encoding#Python.


Adam Cohen. Fuzzywuzzy: fuzzy string matching in python. July 2011. URL:


Jacob Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46, 1960. doi:10.1177/001316446002000104.


William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg. A comparison of string distance metrics for name-matching tasks. In IIWEB'03 Proceedings of the 2003 International Conference on Information, 73–78. 2003. URL:


William W. Cohen, Pradeep Ravikumar, Stephen E. Fienberg, and Kathryn Rivard. Secondstring. 2003. URL:


Lamont C. Cole. The measurement of interspecific association. Ecology, 30(4):411–424, 1949. doi:10.2307/1932444.


Viviana Consonni and Roberto Todeschini. New similarity coefficients for binary data. MATCH Communications in Mathematical and in Computer Chemistry, 68:581–592, 2012.


Graham Cormode. Sequence Distance Embeddings. PhD thesis, The University of Warwick, 2003. URL:_THESIS_Cormode_2003.pdf.


Graham Cormode, Mike Paterson, Süleyman Cenk Sahinalp, and Uzi Vishkin. Communication complexity of document exchange. In SODA '00 Proceedings of the eleventh annual ACM-SIAM symposium on Discrete algorithms, 197–200. 2000.


IBM Corporation. Alpha Search Inquiry System, General Information Manual. White Plains, NY, 1973.


IBM Corporation. IBM SPSS Statistics Algorithms. IBM Corporation, 25 edition, 2017. URL:_SPSS_Statistics_Algorithms.pdf.


Michael A. Covington. An algorithm to align words for historical comparison. Computational Linguistics, 22(4):481–496, December 1996.


Lee J. Cronbach. Coefficient alpha and the internal structure of tests. Psychometrika, 16(3):297–334, September 1951. doi:10.1007/BF02310555.


Jay L. Cunningham and others. A study of the organization and search of bibliographic holdings in on-line computer systems: phase i. Technical Report, University of California, Berkeley, Institute of Library Research, March 1969. URL:


Jan Czekanowski. Zur differentialdiagnose der neandertalgruppe. Korrespondenz-Blatt der Deutschen Gesellschaft für Anthropologie, Ethnologie und Urgeschichte, 40:44–47, 1909.


Ido Dagan, Lillian Lee, and Fernando C. N. Pereira. Similarity-based models of word cooccurrence probabilities. Machine Learning, 34(1–3):43–69, February 1999. doi:10.1023/A:1007537716579.


Andrew Dalke. Arithmetic coder (python recipe). 2005. URL:


Valentin Dallmeier, Christian Lindig, and Andreas Zeller. Lightweight. In ECOOP'05 Proceedings of the 19th European conference on Object-Oriented Programming. 2005. URL:, doi:10.1007/11531142\_23.


Fred J. Damerau. A technique for computer detection and correction of spelling errors. Communications of the ACM, 7(3):171–176, March 1964. doi:10.1145/363958.363994.


Leon Davidson. Retrieval of misspelled names in an airlines passenger record system. Communications of the ACM, 5(3):169–171, March 1962. doi:10.1145/366862.366913.


dcm4che. DICOM toolkit & library: URL:


Sally F. Dennis. The construction of a thesaurus automatically from a sample of text. In Mary Elizabeth Stevens, Vincent E. Giuliano, and Laurence B. Heilprin, editors, Statistical Association Techniques for Mechanized Documentation: Symposium Proceedings, number 269 in National Bureau of Standards Miscellaneous Publication, 61–148. Washington, D.C., December 1965. United States Department of Commerce. URL:


Michel Marie Deza and Elena Deza. Encyclopedia of Distances. Springer-Verlag, Berlin, 4 edition, 2016.


Lee R. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945. URL:, doi:10.2307/1932409.


P. G. N. Digby. Approximating the tetrachoric correlation coefficient. Biometrics, 39(3):753–757, September 1983. doi:10.2307/2531104.


James L. Dolby. An algorithm for variable-length proper-name compression. Journal of Library Automation, 3(4):257–275, 1970. URL:, doi:10.6017/ital.v3i4.5259.


Mayrick H. Doolittle. The verification of predictions. The American Meteorological Journal, 2:327–329, 1884. URL:


Sean S. Downey, Brian Hallmark, Murray P. Cox, Peter Norquest, and J. Stephen Lansing. Computational feature-sensitive reconstruction of language relationships: developing the aline distance for comparative historical linguistic reconstruction. Journal of Quantitative Linguistics, 15(4):340–369, November 2008. doi:10.1080/09296170802326681.


Harold E. Driver and Alfred L. Kroeber. Quantitative expression of cultural relationships. University of California Publications in American Archaeology and Ethnology, 31(4):211–256, 1932. URL:


Ted Dunning. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61–74, 1993. URL:


Andrzej Ehrenfeucht and David Haussler. A new distance metric on strings computable in linear time. Discrete Applied Mathematics, 20(3):191–203, 1988. doi:10.1016/0166-218X(88)90076-5.


Horst Eidenberger. Categorization and Machine Learning: The Modeling of Human Understanding in Computers. atpress, 2014.


Heinz Ellenberg. Grundlagen Der Vegetationsgliederung. Teil 1. Aufgaben Und Methoden Der Vegetationskunde. Verlag Eugen Ulmer, Stuttgart, 1956.


Honey S. Elovitz, Rodney W. Johnson, Astrid McHugh, and John E. Shore. Automatic translation of English text to phonetics by means of letter-to-sound rules. NRL Report 7948, document AD/A021 929, Naval Research Laboratory, Washington, D.C., 1976.


Klas Erikson. Approximate swedish name matching - survey and test of different algorithms. Nada report TRITA-NA-E9721, KTH, Royal Institute of Technology, Stockholm, Sweden, 1997. URL:


Henri Eyraud. Les principes de la mesure des corrélations. Annales de l'Université de Lyon, III Series, Section A, 1:30–47, 1938.


Edward W. Fager. Determination and analysis of recurrent groups. Ecology, 38(4):586–595, October 1957. doi:10.2307/1943124.


Edward W. Fager and John A. McGowan. Zooplankton species groups in the north pacific. Science, 140(3566):453–460, 1963. doi:10.1126/science.140.3566.453.


Daniel P. Faith. Asymmetric binary similarity measures. Oecologia, 57(3):287–290, March 1983. doi:10.1007/BF00377169.


Joseph L. Fleiss. Measuring agreement between two judges on the presence or absence of a trait. Biometrics, 31(3):651–659, 1975. doi:10.2307/2529549.


Joseph L. Fleiss, Bruce Levin, and Myunghee Cho Paik. Statistical Methods for Rates and Proportions. Wiley Series in Probability and Statistics. John Wiley & Sons, Hoboken, 3rd edition, 2003.


Stephen A. Forbes. On the local distribution of certain illinois fishes: an essay in statistical ecology. Bulletin of the Illinois State Laboratory of Natural History, 7:273–303, 1907.


Stephen A. Forbes. Method of determining and measuring the associative relations of species. Science, 61(1585):518–524, 1925.


Earl G. Fossum and Gilbert Kaskey. Optimization and standardization of information retrieval language and systems. Technical Report, Directorate of Information Sciences, Air Force Office of Scientific Research, Office of Aerospace Research, United States Air Force, Washington, D.C., 1966. URL:_AD0630797.


E. B. Fowlkes and Colin L. Mallows. A method for comparing two hierarchical clusterings. Journal of the American Statistical Association, 78(383):553–569, 1983. doi:10.1080/01621459.1983.10478008.


Michael Fürnrohr, Birgit Rimmelspacher, and Tilman von Roncador. Zusammenführung von datenbeständen ohne numerische identifikatoren: ein verfahren im rahmen der testuntersuchungen zu einem registergestützten zensus. Bayern in Zahlen, 2002(7):308–321, 2002. URL:__hrung_von_datenbest__nden_ohne_numerische_identifikatoren.pdf.


T. N. Gadd. Phonix: the algorithm. Program, 24(4):363–366, 1990. doi:10.1108/eb047069.


Lars Marius Garshol. Norphone comparator. 2015. URL:


Georg Wilde and Carsten Meyer. Nicht wörtlich genommen, 'schreibweisentolerante' suchroutine in dbase implementiert. c't Magazin für Computer Technik, pages 126–131, October 1988.


N. Gilbert and Terry C. E. Wells. Analysis of quadrat data. Journal of Ecology, 54(3):675–685, November 1966. doi:10.2307/2257810.


Grove K. Gilbert. Finley's tornado predictions. American Meteorological Journal, 1:166–172, 1884.


Leicester E. Gill. Ox-link: the oxford medical record linkage system. In Record Linkage Techniques. Washington, D.C., March 1997. Federal Committee on Statistical Methodology, Office of Management and Budget. URL:


Corrado Gini. Variabilità e mutabilità. Contributo allo Studio delle Distribuzioni e delle Relazioni Statistiche. C. Cuppini, Bologna, 1912.


Corrado Gini. Nuovi contributi alla teoria delle relazioni statistiche. Atti del Reale Istituto Veneto di Scienze, Lettere ed Arti, Series 8, 74(2):1903–1942, 1915.


Henry Allan Gleason. Some applications of the quadrat method. Bulletin of the Torrey Botanical Club, 47(1):21–33, January 1920. doi:10.2307/2480223.


David W. Goodall. The distribution of the matching coefficient. Biometrics, 23(4):647–656, December 1967. doi:10.2307/2528419.


Leo A. Goodman and William H. Kruskal. Measures of association for cross classification i. Journal of the American Statistical Association, 49(268):732–764, 1954. doi:10.2307/2281536.


Leo A. Goodman and William H. Kruskal. Measures of association for cross classification ii: further discussion and references. Journal of the American Statistical Association, 54(285):123–163, March 1959. doi:10.2307/2282143.


Osamu Gotoh. An improved algorithm for matching biological sequences. Journal of Molecular Biology, 162(3):705–708, 1982. URL:, doi:10.1016/0022-2836(82)90398-9.


John C. Gower. A general coefficient of similarities and some of its properties. Biometrics, 27(4):857–871, December 1971. doi:10.2307/2528823.


John C. Gower and Pierre Legendre. Metric and Euclidean properties of dissimilarity coefficients. Journal of Classification, 3(1):5–48, February 1986. doi:10.1007/BF01896809.


Luis Gravano, Panagiotis G. Ipeirotis, H. V. Jagadish, Nick Koudas, S. Muthukrishnan, and Divesh Srivastava. Approximate string joins in a database (almost) for free. In Proceedings of the 27th VLDB Conference, Roma, Italy, 2001. 2001.


Aaron D. Gross. Getty synoname: the development of software for personal name pattern matching. In Intelligent Text and Image Handling - Volume 2, RIAO '91, 754–763. Paris, France, 1991. LE CENTRE DE HAUTES ETUDES INTERNATIONALES D'INFORMATIQUE DOCUMENTAIRE. URL:


J. P. Guilford. Fundamental Statistics in Psychology and Education. McGraw-Hill Book Company, New York, New York. URL:


Gloria J. A. Guth. Surname spellings and computerized record linkage. Historical Methods Newsletter, 10(1):10–19, 1976. doi:10.1080/00182494.1976.10112645.


Louis Guttman. An outline of the statistical theory of prediction. In Paul Horst, editor, The Prediction of Personal Adjustment, number 48, pages 253–311. Social Science Research Council, 1941. URL:;view=1up;seq=271.


Kilem Li Gwet. Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology, 61(1):29–48, 2008. doi:10.1348/000711006X126600.


Martin Haase and Kai Heitmann. Die erweiterte kölner phonetik. 2000.


Ulrich Hamann. Merkmalbestand und verwandtschaftsbeziehungen der farinosae: ein beitrag zum system der monokotyledonen. Willdenowia, 2:639–768, 1961.


R. W. Hamming. Error detecting and error correcting codes. The Bell System Technical Journal, 29(2):147–160, April 1950. URL:, doi:10.1002/j.1538-7305.1950.tb00463.x.


Donna Harman. How effective is stemming? Journal of the American Society for Information Science, 42(1):7–15, 1991. URL:&rep=rep1&type=pdf, doi:10.1002/(SICI)1097-4571(199101)42:1%3C7::AID-ASI2%3E3.0.CO;2-P.


Francis C. Harris and Benjamin B. Lahey. A method for combining occurrence and nonoccurrence interobserver agreement scores. Journal of Applied Behavior Analysis, 11(4):523–527, 1978. doi:10.1901/jaba.1978.11-523.


Ahmad Basheer Hassanat. Dimensionality invariant similarity measure. Journal of American Science, 10(8):221–226, 2014. URL:


Robert P. Hawkins and Victor A. Dotson. Reliability scores that delude: an alice in wonderland trip through the misleading characteristics of inter-observer agreement scores in interval recording. Technical Report, Western Michigan University, 1973. URL:


Ernst Hellinger. Neue begründung der theorie quadratischer formen von unendlichvielen veränderlichen. Journal Für Die Reine Und Angewandte Mathematik, 1909(136):210–271, 1909. doi:10.1515/crll.1909.136.210.


Robert A. Henderson and Malcolm L. Heron. A probabilistic method of paleobiogeographic analysis. Lethaia, 10(1):1–15, 1977. doi:10.1111/j.1502-3931.1977.tb00584.x.


Louis Henry. Projet de transcription phonétique des noms de famille. Annales de Démographie Historique, 1976:201–214, 1976. URL:_0066-2062_1976_num_1976_1_1313.


Theodore Hershberg, Alan Burstein, and Robert Dockhorn. Record linkage. Historical Methods Newsletter, 9(2–3):137–163, 1976. doi:10.1080/00182494.1976.10112639.


Theodore Hershberg, Alan Burstein, and Robert Dockhorn. Verkettung von daten: record linkage am beispiel des philadelphia social history project. In Wilhelm Heinz Schröder, editor, Moderne Stadtgeschichte, volume 8, pages 35–73. Klett-Cotta, 1979. URL:


David Holmes and M. Catherine McCabe. Improving precision and recall for soundex retrieval. In Proceedings. International Conference on Information Technology: Coding and Computing, 22–26. April 2002. URL:, doi:10.1109/ITCC.2002.1000354.


David Hood. Caverphone: phonetic matching algorithm. Technical Report CTP060902, University of Otago, Dunedin, New Zealand, September 2002. URL:


David Hood. Caverphone revisited. Technical Report CTP150804, University of Otago, Dunedin, New Zealand, December 2004. URL:


Henry S. Horn. Measurement of "overlap" in comparative ecological studies. The American Naturalist, 100(914):419–424, September 1966. doi:10.2307/2459242.


Zdenek Hubálek. Coefficients of association and similarity, based on binary (presence-absence) data: an evaluation. Biological Reviews, 57(4):669–689, November 1982. doi:10.1111/j.1469-185X.1982.tb00376.x.


Stuart H. Hurlbert. A coefficient of interspecific association. Ecology, 50(1):1–9, January 1969. doi:10.2307/1934657.


Paul Jaccard. Distribution de la flore alpine dans le bassin des dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:241–272, 1901. URL:


Matthew A. Jaro. Advances in record linkage methodology as applied to the 1985 census of tampa florida. Journal of the American Statistical Association, 84(406):414–420, 1989. doi:10.1080/01621459.1989.10478785.


Marie-Claire Jenkins and Dan Smith. Conservative stemming for search and indexing. Technical Report, University of East-Anglia, Norwich, UK, 2005. URL:


Sergio Jimenez, Claudio Becerra, and Alexander Gelbukh. SOFTCARDINALITY-CORE: improving text overlap with distributional measures for semantic textual similarity. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task, 194–201. Atlanta, GA, June 2013. Association for Computational Linguistics. URL:


Stephen C. Johnson. Hierarchical clustering schemes. Psychometrika, 32(3):241–254, September 1967. doi:10.1007/BF02289588.


James A. Jones and Mary Jean Harrold. Empirical evaluation of the tarantula automatic fault-localization technique. In ASE '05 Proceedings of the 20th IEEE/ACM international Conference on Automated software engineering, 273–282. New York, November 2005. ACM. doi:10.1145/1101908.1101949.


Sebastian Kempken. Bewertung historischer und regionaler schreibvarianten mit hilfe von abstandsmaßen. Master's thesis, Universität Duisburg-Essen, December 2005. URL:


Maurice G. Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81–93, June 1938. doi:10.2307/2332226.


Ronald N. Kent and Sharon L. Foster. Direct observational procedure: methodological issues in naturalistic settings. In Anthony R. Ciminero, Karen S. Calhoun, and Henry E. Adams, editors, Handbook of Behavioral Assessment, chapter 9, pages 279–328. John Wiley & Sons, New York, 1977. URL:


Donald E. Knuth. The Art of Computer Programming: Volume 3, Sorting and Searching, pages 394. Addison-Wesley, 1998.


Maroš Kollár. Text::phonetic::phonix. URL:


Grzegorz Kondrak. A new algorithm for the alignment of phonetic sequences. In NAACL 2000 Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference. 2000. doi:10.0000/


Grzegorz Kondrak. Algorithms for Language Reconstruction. PhD thesis, University of Toronto, 2002. URL:


Grzegorz Kondrak and Bonnie J. Dorr. A similarity-based approach and evaluation methodology for reduction of drug name confusion. Technical Report, University of Maryland, Institute for Advanced Computer Studies, 2003. URL:


Keerthi Koneru and Cihan Varol. Privacy preserving record linkage using metasoundex algorithm. In 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), 443–447. December 2017. URL:, doi:10.1109/ICMLA.2017.0-121.


G. Frederic Kuder and Marion Webster Richardson. The theory of the estimation of test reliability. Psychometrika, 2(3):151–160, September 1937. doi:10.1007/bf02288391.


Michael Kuhn. Metaphone searches. November 1995. URL:


John L. Kuhns. The continuum of coefficients of association. In Mary Elizabeth Stevens, Vincent E. Giuliano, and Laurence B. Heilprin, editors, Statistical Association Methods for Mechanized Documentation, number 269 in National Bureau of Standards Miscellaneous Publication, 33–40. 1964.


Maciej Kula. Simple minhash implementation in python. June 2015. URL:


Stanisław Kulczyński. Die pflanzenassoziationen der pieninen. Bulletin International de l'Academie Polonaise des Sciences et des Lettres, Classe des Sciences Mathematiques et Naturelles, B (Sciences Naturelles), pages 57–203, 1927.


Wladimir Köppen. Die aufeinanderfolge der periodischen witterungserscheinungen nach den grundsätzen der wahrscheinlichkeitsrechnung. In Repertorium für Meteorologie, volume 2, pages 189–238. Akademiia Nauk, 1870. URL:&pg=RA1-PA187#v=onepage&q&f=false.


Andrew J. Lait and Brian Randell. An assessment of name matching algorithms. Technical Report, University of Newcastle upon Tyne, Newcastle upon Tyne, UK, 1996. URL:


Godfrey N. Lance and William T. Williams. Computer programs for hierarchical polythetic classification ("similarity analysis"). Computer Journal, 1966. doi:10.1093/comjnl/9.1.60.


Godfrey N. Lance and William T. Williams. A general theory of classificatory sorting strategies. ii. clustering systems. Computer Journal, 10(3):271–277, January 1967. URL:, doi:10.1093/comjnl/10.3.271.


Godfrey N. Lance and William T. Williams. Mixed-data classificatory programs i. agglomerative systems. Australian Computer Journal, 1:15–20, 1967.


Joerg Lang. Inner workings of the German analyzer in Lucene. November 2013. URL:


Pierre Legendre and Louis Legendre. Numerical Ecology. Number 20 in Developments in Environmental Modelling. Elsevier, Amsterdam, 2nd edition, 1998.


Vladimir I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Doklady Akademii Nauk SSSR, 163(4):845–848, 1965. URL:


Vladimir I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8):707–710, February 1966. URL:


Chin-Yew Lin. Rouge: a package for automatic evaluation of summaries. In Text Summarization Branches Out. 2004. URL:


Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, and Chris Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2:419–444, 2002. doi:10.1162/153244302760200687.


Julie Beth Lovins. Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11(1–2):22–31, June 1968. URL:


Billy T. Lynch and William L. Arends. Selection of a surname coding procedure for the srs record linkage system. Technical Report, Statistical Reporting Service, US Department of Agriculture, Washington, D.C., February 1977. URL:


Jacques Légaré, Yolande Lavoie, and Hubert Charbonneau. The early canadian population: problems in automatic record linkage. Canadian Historical Review, 53(4):427–442, December 1972. doi:10.3138/CHR-053-04-03.


Daniel Marcelino. Soundexbr: soundex (phonetic) algorithm for Brazilian portuguese. July 2015. URL:


Brian W. Matthews. Comparison of the predicted and observed secondary structure of t4 phage lysozyme. Biochimica et Biophysica Acta (BBA) - Protein Structure, 405(2):442–451, 1975.


Kameo Matusita. Decision rules, based on the distance, for problems of fit, two samples, and estimation. The Annals of Mathematical Statistics, 26(4):631–640, December 1955. doi:10.2307/2236376.


A. E. Maxwell and A. E. G. Pilliner. Deriving coefficients of reliability and agreement for ratings. The British Journal of Mathematical and Statistical Psychology, 21(1):105–116, May 1968. doi:10.1111/j.2044-8317.1968.tb00401.x.


Bayard H. McConnaughey. The determination and analysis of plankton communities. Lembaga Penelitian Laut, pages 1–40, 1964.


Jörg Michael. Doppelgänger gesucht – ein programm für die kontextsensitive phonetische stringumwandlung. c't Magazin für Computer Technik, pages 252, 1999. URL:


Jörg Michael. Phonet.c. August 2007. URL:


Ellis L. Michael. Marine ecology and the coefficient of association: a plea in behalf of quantitative biology. The Journal of Ecology, 8(1):54–59, 1920. doi:10.2307/2255213.


Hermann Minkowski. Geometrie der Zahlen. B. G. Teubner, Leipzig, 1910. URL:


Gary Mokotoff. Soundexing and genealogy. 1997. URL:


Alvaro E. Monge and Charles P. Elkan. The field matching problem: algorithms and applications. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD'96, 267–270. AAAI Press, 1996. URL:


Gwendolyn B. Moore, John L. Kuhns, Jeffrey L. Trefftzs, and Christine A. Montgomery. Accessing Individual Records from Personal Data Files Using Non-Unique Identifiers. Number 500-2 in Special Publication. National Bureau of Standards, Washington, D.C., February 1977. URL:


Erwan Moreau, François Yvon, and Olivier Cappé. Robust similarity measures for named entities matching. In COLING '08 Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1, 593–600. August 2008.


Masaaki Morisita. Measuring of interspecific association and similarity between communities. In Memoirs of the Faculty of Science, volume 3 of Series E (Biology), pages 65–80. Kyushu University, 1959.


James F. Morris. A Quantitative Methodology for Vetting "Dark Network" Intelligence Sources for Social Network Analysis. PhD thesis, Air Force Institute of Technology, 2012. URL:


Alejandro Mosquera, Elena Lloret, and Paloma Moreda. Towards facilitating the accessibility of web 2.0 texts through text normalisation. In Proceedings of the LREC workshop: Natural Language Processing for Improving Textual Accessibility (NLP4ITA), Istanbul, Turkey, 9–14. 2012. URL:


J. Motyka, B. Dobrzański, and S. Zawadzki. Wstępne badania nad łąkami południowo-wschodniej Lubelszczyzny (preliminary studies on meadows in the south-east of the province Lublin). Annales Universitatis Mariae Curie-Skłodowska, Sectio E, 5(13):367–447, 1950.


M. D. Mountford. An index of similarity and its application to classificatory problems. In P. W. Murphy, editor, Progress in Soil Zoology: Papers from a Colloquium on Research Methods Organized by the Soil Zoology Committee of the International Society of Soil Science, 43–50. London, July 1962. Butterworths. URL:_in_soil_zoology.


Alan Mozley. The statistical analysis of the distribution of pond molluscs in western Canada. The American Naturalist, 1936. doi:10.1086/280660.


Rashid Naseem, Onaiza Maqbool, and Siraj Muhammad. Improved similarity measures for software clustering. In Proceedings of the Euromicro Conference on Software Maintenance and Reengineering, CSMR. March 2011. doi:10.1109/CSMR.2011.9.


Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970. URL:, doi:10.1016/0022-2836(70)90057-4.


Akira Ochiai. Zoogeographical studies on the soleoid fishes found in Japan and its neighbouring regions-II. Bulletin of the Japanese Society of Scientific Fisheries, 22(9):526–530, 1957. URL:_9_526/_pdf/-char/en, doi:10.2331/suisan.22.526.


Library of Congress. Classification and Shelflisting Manual. Library of Congress, 2013. URL:


OpenRefine. Clustering in depth. 2012. URL:


Laszlo Orlóci. An agglomerative method for classification of plant communities. The Journal of Ecology, 55(1):193–206, March 1967. doi:10.2307/2257725.


Yanosuke Otsuka. The faunal character of the Japanese pleistocene marine mollusca, as evidence of the climate having become colder during the pleistocene in Japan. Bulletin of the Biogeographical Society of Japan, 6(16):165–170, 1936.


Hakan Ozbay. Ozbay metric. 2015. URL:


Chris D. Paice. Another stemmer. In ACM SIGIR Forum, volume 24, 56–61. Fall 1990. URL:, doi:10.1145/101306.101310.


Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002, 311–318. 2002. URL:


Vimal P. Parmar and CK Kumbharana. Study existing various phonetic algorithms and designing and development of a working model for the new developed algorithm and comparison by implementing it with existing algorithm(s). International Journal of Computer Applications, 98(19):45–49, 2014. doi:10.5120/17295-7795.


Rebecca Passonneau. Measuring agreement on set-valued items (masi) for semantic and pragmatic annotation. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC'06), 831–836. May 2006.


Karl Pearson. Mathematical contributions to the theory of evolution. vii. on the correlation of characters not quantitatively measurable. Philosophical Transactions of the Royal Society, 195 A:1–47, 1900. doi:10.1098/rsta.1900.0022.


Karl Pearson and David Heron. On theories of association. Biometrika, 9(1/2):159–315, 1913. doi:10.2307/2331805.


Pavel Pecina. Lexical association measures and collocation extraction. Language Resources & Evaluation, 44(1/2):137–158, 2010. doi:10.2307/40666353.


Charles S. Peirce. The numerical measure of the success of predictions. Science, 4(93):453–454, 1884. doi:10.1126/science.ns-4.93.453-a.


Lionel S. Penrose. Distance, size and shape. Annals of Eugenics, 17(1):337–343, January 1952. doi:10.1111/j.1469-1809.1952.tb02527.x.


Ulrich Pfeifer. Wait 1.8 - soundex.c. 2000. URL:


Lawrence Philips. Hanging on the metaphone. Computer Language, 7(12):39–44, December 1990.


Lawrence Philips. Metaphone. December 1990. URL:


Lawrence Philips. The double metaphone search algorithm. C/C++ Users Journal, 18(6):38–43, June 2000.


Guillaume Plique. Talisman. 2018. URL:


Joseph J. Pollock and Antonio Zamora. Automatic spelling correction in scientific and scholarly text. Communications of the ACM, 27(4):358–368, April 1984. URL:, doi:10.1145/358027.358048.


Martin F. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, July 1980. URL:, doi:10.1108/eb046814.


Martin F. Porter. The english (porter2) stemming algorithm. September 2002. URL:


Hans Joachim Postel. Die kölner phonetik: ein verfahren zur identifizierung von personennamen auf der grundlage der gestaltanalyse. IBM-Nachrichten, 19:925–931, 1969.


Jörg Prante. Elasticsearch – 2015. URL:


M. Růžička. Anwendung mathematisch-statistischer methoden in der geobotanik (synthetische bearbeitung von aufnahmen). Biologia, Bratislava, 13:647–661, 1958.


Dragomir Radev, Simone Teufel, Horacio Saggion, Wai Lam, John Blitzer, Arda Çelebi, Hong Qi, Elliott Drabek, and Danyu Liu. Evaluation of text summarization in a cross-lingual information retrieval framework. Technical Report, Johns Hopkins, 2001. URL:


William M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850, December 1971. doi:10.2307/2284239.


John W. Ratcliff and David E. Metzener. Pattern matching: the gestalt approach. Dr. Dobbs Journal, 1988. URL:


David M. Raup and Rex E. Crick. Measurement of faunal similarity in paleontology. Journal of Paleontology, 53(5):1213–1227, September 1979. doi:10.2307/1304099.


Mustapha Raïssouli, Fatima Leazizi, and Mohamed Chergui. Arithmetic-geometric-harmonic mean of three positive operators. Journal of Inequalities in Pure and Applied Mathematics, 2009. URL:\_08\_JIPAM/014\_08.pdf.


Tony Rees. Taxamatch, an algorithm for near ('fuzzy') matching on scientific names in taxonomic databases. PLoS ONE, 9(9):1–27, September 2014. doi:10.1371/journal.pone.0107510.


Tony Rees and Barbara Boehmer. The mdld (modified damerau-levenshtein distance) algorithm. November 2013. URL:


Dominic John Repici. Understanding classic soundex algorithms. 2013. URL:\#SoundExAndCensus.


Nicholas Ring and Alexandra L. Uitdenbogerd. Finding `lucy in disguise': the misheard lyric matching problem. In Gary Geunbae Lee, Dawei Song, Chin-Yew Lin, Akiko Aizawa, Kazuko Kuriyama, Masaharu Yoshioka, and Tetsuya Sakai, editors, Information Retrieval Technology, 157–167. Berlin, Heidelberg, 2009. Springer Berlin Heidelberg. doi:10.1007/978-3-642-04769-5_14.


David W. Roberts. Ordination on the basis of fuzzy set theory. Vegetatio, 66(3):123–131, 1986. doi:10.1007/BF00039905.


A. H. Robinson and Colin Cherry. Results of a prototype television bandwidth compression scheme. In Proceedings of the IEEE, volume 55, 356–364. IEEE, 1967. doi:10.1109/PROC.1967.5493.


W. S. Robinson. A method for chronologically ordering archaeological deposits. American Antiquity, 16(4):293–301, April 1951. doi:10.2307/276978.


David J. Rogers and Taffee T. Tanimoto. A computer program for classifying plants. Science, 132(3434):1115–1118, October 1960. doi:10.1126/science.132.3434.1115.


Eugene Rogot and Irving D. Goldberg. A proposed index for measuring agreement in test-retest studies. Journal of Chronic Diseases, 1966. doi:10.1016/0021-9681(66)90032-4.


Gong Ruibin and Chan Kai Yun. An adaptive model for phonetic string search. In Knowledge-Based Intelligent Information and Engineering Systems, 9th International Conference, KES 2005 Melbourne, Australia, September 14-16, 2005 Proceedings, Part III, volume 3683 of Lecture Notes in Artificial Intelligence, 915–921. 2005.


Dorothea Rukasz. Pprl – privacy preserving record linkage. 2018. URL:


Daniel E. Russ, Kwan-Yuet Ho, Calvin A. Johnson, and Melissa C. Friesen. Computer-based coding of occupation codes for epidemiological analysis. In 2014 IEEE 27th International Symposium on Computer-Based Medical Systems, 347–350. 2014. doi:10.1109/CBMS.2014.79.


Paul F. Russell and T. Ramachandra Rao. On habitat and association of species of anopheline larvae in south-eastern madras. Journal of the Malaria Institute of India, 3(1):153–178, 1940.


Robert C. Russell. Index. 1918. URL:


Jacques Savoy. IR multilingual resources at unine. 2005. URL:


Kevin Schürer. Creating a nationally representative individual and household sample for great britain, 1851 to 1901 - the victorian panel study (vps). Historical Social Research / Historische Sozialforschung, 32(2):211–331, 2007. doi:10.2307/20762213.


Robyn Schinke, Mark Greengrass, Alexander M. Robertson, and Peter Willett. A stemming algorithm for latin text databases. Journal of Documentation, 52(2):172–187, 1996. doi:10.1108/eb026966.


Rainer Schnell, Tobias Bachteler, and Stefan Bender. A toolbox for record linkage. Australian Journal of Statistics, 33(1-2):125–133, 2004. URL:


William A. Scott. Reliability of content analysis: the case of nominal scale coding. Public Opinion Quarterly, 19(3):321–325, 1955. doi:10.1086/266577.


Heinz-Jürgen Seiffert. Problem 887. Nieuw Archief voor Wiskunde, 11(4):176, 1993.


SequentiX. Distance measures. 2018. URL:\_measures.htm.


Boumedyen A. N. Shannaq and Victor V. Alexandrov. Using product similarity for adding business. Global Journal of Computer Science and Technology, 10(12):2–8, October 2010. URL:


Dana Shapira and James A. Storer. Edit distance with move operations. Journal of Discrete Algorithms, 5(2):380–392, June 2007. doi:10.1016/j.jda.2005.01.010.


Guang R. Shi. Multivariate data analysis in palaeoecology and palaeobiogeography—a review. Palaeogeography, Palaeoclimatology, Palaeoecology, 105(3-4):199–234, 1993. doi:10.1016/0031-0182(93)90084-v.


Grigori Sidorov, Alexander Gelbukh, Helena Gómez-Adorno, and David Pinto. Soft similarity and soft cosine measure: similarity of features in vector space model. Computación y Sistemas, 2014. URL:, doi:10.13053/CyS-18-3-2043.


Edward H. Simpson. Measurement of diversity. Nature, 163:688, April 1949. URL:, doi:10.1038/163688a0.


Allan Sjöö. Swamisfinxbix. 2009. URL:


Temple F. Smith and Michael S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981. URL:, doi:10.1016/0022-2836(81)90087-5.


Chakkrit Snae and Bernard Diaz. An interface for mining genealogical nominal data using the concept of linkage and a hybrid name matching algorithm. Journal of 3D-Forum Society, 16(1):142–147, 2002. URL:\_Journal.pdf.


Robert R. Sokal and Charles D. Michener. A statistical method for evaluating systematic relationships. The University of Kansas Science Bulletin, 38, part 2(22):1409–1438, March 1958. URL:\_133648\_astatisticalmethodforevaluatin1902.


Robert R. Sokal and Peter H. A. Sneath. Principles of Numerical Taxonomy. W. H. Freeman and Company, San Francisco, 1963.


Wayne Song. Typo-distance. 2011. URL:


Theodor Sorgenfrei. Molluscan Assemblages from the Marine Middle Miocene of South Jutland and Their Environments. Number 79 in 2. Danmarks Geologiske Undersøgelse, 1–503, 1958.


United States. Using the Census Soundex. Number 55 in General Information Leaflet. National Archives and Records Administration, Washington, D.C., 1997. URL:


United States. Soundex system: the soundex indexing system. 2007. URL:


J. F. Steffensen. On certain measures of dependence between statistical variables. Biometrika, 26(1/2):251–255, May 1934. doi:10.2307/2332058.


Sam Steingold and Michal Laclavík. An information theoretic metric for multi-class categorization. Technical Report, Magnetic Media Online, 2015. URL:


Kevin L. Stern. 2014. URL:\_and\_algorithms/stern\_library/string/


H. Edmund Stiles. The association factor in information retrieval. Journal of the ACM, 8(2):271–279, April 1961. doi:10.1145/321062.321074.


Giorgos Stoilos, Giorgos Stamou, and Stefanos Kollias. A string metric for ontology alignment. In ISWC'05 Proceedings of the 4th international conference on The Semantic Web, 624–637. Galway, Ireland, November 2005. doi:10.1007/11574620_45.


A. Stuart. The estimation and comparison of strengths of association in contingency tables. Biometrika, 40(1/2):105–110, June 1953. doi:10.2307/2333101.


Dezydery Szymkiewicz. Une contribution statistique à la géographie floristique. Acta Societatis Botanicorum Poloniae, 11(3):249–265, 1934. URL:, doi:10.5586/asbp.1934.012.


Thorvald Sørensen. A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on danish commons. Kongelige Danske Videnskabernes Selskab, 5(4):1–34, 1948. URL:\_S\%C3\%B8rensen,\%20Thorvald.pdf.


Robert L. Taft. Name Search Techniques. Special report (New York State Identification and Intelligence System). Bureau of Systems Development, New York State Identification and Intelligence System, 1970.


T. T. Tanimoto. An elementary mathematical theory of classification and prediction. Technical Report, IBM, 1958.


Kazimierz Tarwid. Szacowanie zbieznosci nisz ekologicznych gatunkow droga oceny prawdopodobienstwa spotykania sie ich w polowach. Ekologia Polska, Seria B, pages 115–130, 1960.


Walter F. Tichy. The string-to-string correction problem with block moves. ACM Transactions on Computer Systems, 2(4):309–321, November 1984. doi:10.1145/357401.357404.


Ticki. Eudex: a blazingly fast phonetic reduction/hashing algorithm. URL:


Ticki. The eudex algorithm. December 2016. URL:


Rodham E. Tulloss. Assessment of similarity indices for undesirable properties and a new tripartite similarity index based on cost functions. In Mary E. Palm and Ignacio H. Chapela, editors, Mycology in Sustainable Development: Expanding Concepts, Vanishing Borders, pages 122–143. Parkway Publishers, Inc., Boone, NC, 1997.


W. A. Turner, G. Charton, F. Laville, and B. Michelet. Packaging information for peer review: new co-word analysis techniques. In Handbook of Quantitative Studies of Science and Technology. New Holland, 1988.


Amos Tversky. Features of similarity. Psychological Review, 84(4):327–352, 1977. URL:, doi:10.1037/0033-295x.84.4.327.


Esko Ukkonen. Approximate string-matching with q-grams and maximal matches. Theoretical Computer Science, 92(1):191–211, 1992. doi:10.1016/0304-3975(92)90143-4.


William B. Upholt. Estimation of DNA sequence divergence from comparison of restriction endonuclease digests. Nucleic Acids Research, 4(5):1257–1265, January 1977. doi:10.1093/nar/4.5.1257.


Cihan Varol and Coskun Bayrak. Hybrid matching algorithm for personal names. Journal of Data and Information Quality, 3(4):8:1–8:18, September 2012. doi:10.1145/2348828.2348830.


Robert A. Wagner and Michael J. Fischer. The string-to-string correction problem. Journal of the ACM, 21(1):168–173, January 1974. doi:10.1145/321796.321811.


Matthijs J. Warrens. Similarity Coefficients for Binary Data: Properties of Coefficients, Coefficient Matrices, Multi-way Metrics and Multivariate Coefficients. PhD thesis, Universiteit Leiden, Leiden, June 2008. URL:\_thesis.pdf.


Simon White. How to strike a match. Web, Nd. The oldest version on Internet Archive was archived in 2004. URL:


R. H. Whittaker. A study of summer foliage insect communities in the great smoky mountains. Ecological Monographs, 22(1):1–44, January 1952. doi:10.2307/1948527.


Robert H. Whittaker. Ordination of Plant Communities. Volume 5 of Handbook of Vegetation Science. Springer Netherlands, 1982.


Wikibooks. Algorithm implementation/strings/longest common substring. 2018. URL:\_Implementation/Strings/Longest\_common\_substring\#Python.


Martin Wilz. Aspekte der kodierung phonetischer Ähnlichkeiten in deutschen eigennamen. Master's thesis, Universität zu Köln, Köln, 2005. URL:\_Files/Allgemeine\_Dateien/Martin\_Wilz.pdf.


William E. Winkler. String comparator metrics and enhanced decision rules in the fellegi-sunter model of record linkage. Technical Report, U.S. Bureau of the Census, Statistical Research Division, Washington, D.C., 1990. URL:


William E. Winkler, George McLaughlin, Matthew A. Jaro, and Maureen Lynch. Strcmp95.c. January 1994. URL:


Hua Xiang. Similarity-based Virtual Screening: Effect of the Choice of Similarity Measure. PhD thesis, The University of Sheffield, 2013. URL:\_Final.pdf.


Ruiyu Yang, Yuxiang Jiang, Matthew W. Hahn, Elizabeth A. Houseworth, and Predrag Radivojac. New metrics for learning and inference on sets, ontologies, and functions. March 2016. URL:


Frank Yates. Contingency tables involving small numbers and the $\chi^2$ test. Supplement to the Journal of the Royal Statistical Society, 1(2):217–235, 1934. doi:10.2307/2983604.


William John Youden. Index for rating diagnostic tests. Cancer, 3(1):32–35, 1950. doi:10.1002/1097-0142(1950)3:1<32::aid-cncr2820030106>;2-3.


Li Yujian and Liu Bo. A normalized levenshtein distance metric. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6):1091–1095, 2007. doi:10.1109/TPAMI.2007.1078.


G. Udny Yule. On the methods of measuring association between two attributes. Journal of the Royal Statistical Society, 1912. doi:10.2307/2340126.


G. Udny Yule and Maurice G. Kendall. An Introduction to the Theory of Statistics. Griffin, London, 14th edition, 1968.


Siderite Zackwehdex. Super fast and accurate string distance algorithm: sift4. 2014. URL:


Jesper Zedlitz. Phonet4java. 2015. URL:


Justin Zobel and Philip Dart. Phonetic string matching: lessons from information retrieval. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '96, 166–172. New York, NY, USA, 1996. ACM. doi:10.1145/243199.243258.


Colin de la Higuera and Luisa Micó. A contextual normalised edit distance. In First International Workshop on Similarity Search and Applications (sisap 2008). 2008. doi:10.1109/SISAP.2008.17.


María del Pilar Angeles and Noemi Bailón-Miguel. Performance of spanish encoding functions during record linkage. In DATA ANALYTICS 2016: The Fifth International Conference on Data Analysis, 1–7. 2016. URL:\#page=14.


María del Pilar Angeles, Adrián Espino-Gamez, and Jonathan Gil-Moncada. Comparison of a modified spanish phonetic, soundex, and phonex coding functions during data matching process. In 2015 International Conference on Informatics, Electronics Vision (ICIEV), 1–5. June 2015. URL:\_Comparison\_of\_a\_Modified\_Spanish\_Phonetic\_Soundex\_and\_Phonex\_coding\_functions\_during\_data\_matching\_process, doi:10.1109/ICIEV.2015.7334028.


The J. Paul Getty Trust. Synoname. 1991. URL:


Eddy van der Maarel. On the use of ordination model in phytosociology. Vegetatio Acta Geobotanica, 19(1–6):21–46, January 1969.


Hans-Peter von Reth and Hans-Jörg Schek. Eine zugriffsmethode für die phonetische Ähnlichkeitssuche. Technical Report 77.03.002, IBM Deutschland GmbH., 1977.