abydos.clustering module
The clustering module implements clustering algorithms such as:
- mean pair-wise similarity
abydos.clustering.mean_pairwise_similarity(collection, metric=<function sim>, mean_func=<function hmean>, symmetric=False)
Calculate the mean pairwise similarity of a collection of strings.
Takes the mean of the pairwise similarities between each pair of members of a collection, optionally in both directions (for asymmetric similarity metrics).
Parameters:
- collection (list) – a collection of terms or a string that can be split
- metric (function) – a similarity metric function
- mean_func (function) – a mean function that takes a list of values and returns a float
- symmetric (bool) – set to True if all pairwise similarities should be calculated in both directions
Returns: the mean pairwise similarity of a collection of strings
Return type: float
>>> round(mean_pairwise_similarity(['Christopher', 'Kristof',
... 'Christobal']), 12)
0.519801980198
>>> round(mean_pairwise_similarity(['Niall', 'Neal', 'Neil']), 12)
0.545454545455
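The pairing logic described above can be sketched with the standard library alone. This is a minimal illustration, not the abydos source: the names `ratio` and `mean_pairwise_similarity_sketch` are hypothetical, `difflib.SequenceMatcher` stands in for the default `sim` metric, and `statistics.harmonic_mean` stands in for `hmean`, so its results will not match the doctest values above. Splitting a string on whitespace and rejecting collections with fewer than two terms are also assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import harmonic_mean


def ratio(src, tar):
    """Stand-in similarity in [0, 1] (an assumption, not the abydos default)."""
    return SequenceMatcher(None, src, tar).ratio()


def mean_pairwise_similarity_sketch(collection, metric=ratio,
                                    mean_func=harmonic_mean, symmetric=False):
    """Illustrative re-implementation; the function name is hypothetical."""
    # Assumption: a string argument is split on whitespace into terms.
    if hasattr(collection, 'split'):
        collection = collection.split()
    terms = list(collection)
    # Assumption: fewer than two terms is treated as an error.
    if len(terms) < 2:
        raise ValueError('at least two terms are required')

    values = []
    # Visit each unordered pair of terms exactly once.
    for src, tar in combinations(terms, 2):
        values.append(metric(src, tar))
        if symmetric:
            # Also score the reverse direction, for asymmetric metrics.
            values.append(metric(tar, src))
    return mean_func(values)


print(mean_pairwise_similarity_sketch(['Niall', 'Neal', 'Neil']))
```

With a symmetric metric, `symmetric=True` only duplicates each value and leaves the mean unchanged. The flag matters when `metric(a, b)` and `metric(b, a)` can differ.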
abydos.clustering.pairwise_similarity_statistics(src_collection, tar_collection, metric=<function sim>, mean_func=<function amean>, symmetric=False)
Calculate the pairwise similarity statistics of two collections of strings.
Calculate pairwise similarities among members of two collections, returning the maximum, minimum, mean (computed by a supplied function; the arithmetic mean by default), and population standard deviation of those similarities.
Parameters:
- src_collection (list) – a collection of terms or a string that can be split
- tar_collection (list) – a collection of terms or a string that can be split
- metric (function) – a similarity metric function
- mean_func (function) – a mean function that takes a list of values and returns a float
- symmetric (bool) – set to True if all pairwise similarities should be calculated in both directions
Returns: the max, min, mean, and standard deviation of similarities
Return type: tuple
>>> tuple(round(_, 12) for _ in pairwise_similarity_statistics(
... ['Christopher', 'Kristof', 'Christobal'], ['Niall', 'Neal', 'Neil']))
(0.2, 0.0, 0.118614718615, 0.075070477184)
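The cross-collection statistics can be sketched in the same way. Again this is an illustration under stated assumptions rather than the abydos source: `ratio` and `pairwise_similarity_statistics_sketch` are hypothetical names, `difflib.SequenceMatcher` replaces the default `sim` metric, and the standard deviation here is `statistics.pstdev`, which is centered on the arithmetic mean regardless of `mean_func`. Results will therefore differ from the doctest values above.

```python
from difflib import SequenceMatcher
from itertools import product
from statistics import mean, pstdev


def ratio(src, tar):
    """Stand-in similarity in [0, 1] (an assumption, not the abydos default)."""
    return SequenceMatcher(None, src, tar).ratio()


def pairwise_similarity_statistics_sketch(src_collection, tar_collection,
                                          metric=ratio, mean_func=mean,
                                          symmetric=False):
    """Illustrative re-implementation; the function name is hypothetical."""
    # Assumption: string arguments are split on whitespace into terms.
    if hasattr(src_collection, 'split'):
        src_collection = src_collection.split()
    if hasattr(tar_collection, 'split'):
        tar_collection = tar_collection.split()

    values = []
    # Score every source term against every target term.
    for src, tar in product(list(src_collection), list(tar_collection)):
        values.append(metric(src, tar))
        if symmetric:
            # Also score the reverse direction, for asymmetric metrics.
            values.append(metric(tar, src))

    # Assumption: population standard deviation about the arithmetic mean.
    return (max(values), min(values), mean_func(values), pstdev(values))


stats = pairwise_similarity_statistics_sketch(
    ['Christopher', 'Kristof', 'Christobal'], ['Niall', 'Neal', 'Neil'])
print(stats)
```

Unlike the single-collection function, every source term is paired with every target term, so two collections of three terms each yield nine similarities (eighteen with `symmetric=True`).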