abydos.stats package
abydos.stats.
The stats module defines functions for calculating various statistical data about linguistic objects.
Functions are provided for calculating the following means:
- arithmetic mean (amean())
- geometric mean (gmean())
- harmonic mean (hmean())
- quadratic mean (qmean())
- contraharmonic mean (cmean())
- logarithmic mean (lmean())
- identric (exponential) mean (imean())
- Seiffert's mean (seiffert_mean())
- Lehmer mean (lehmer_mean())
- Heronian mean (heronian_mean())
- Hölder (power/generalized) mean (hoelder_mean())
- arithmetic-geometric mean (agmean())
- geometric-harmonic mean (ghmean())
- arithmetic-geometric-harmonic mean (aghmean())
And for calculating:
- midrange (midrange())
- median (median())
- mode (mode())
- variance (var())
- standard deviation (std())
Some examples of the basic functions:
>>> nums = [16, 49, 55, 49, 6, 40, 23, 47, 29, 85, 76, 20]
>>> amean(nums)
41.25
>>> aghmean(nums)
32.42167170892585
>>> heronian_mean(nums)
37.931508950381925
>>> mode(nums)
49
>>> std(nums)
22.876935255113754
Two pairwise functions are provided:
- mean pairwise similarity (mean_pairwise_similarity()), which returns the mean similarity (using a supplied similarity function) among the items in a collection
- pairwise similarity statistics (pairwise_similarity_statistics()), which returns the max, min, mean, and standard deviation of pairwise similarities between two collections
The confusion table class (ConfusionTable) can be constructed in a number of ways:
- four values, representing true positives, true negatives, false positives, and false negatives, can be passed to the constructor
- a list or tuple with four values, representing true positives, true negatives, false positives, and false negatives, can be passed to the constructor
- a dict with keys 'tp', 'tn', 'fp', 'fn', each assigned to the values for true positives, true negatives, false positives, and false negatives can be passed to the constructor
The ConfusionTable class has methods:
- to_tuple() extracts the ConfusionTable values as a tuple: (\(w\), \(x\), \(y\), \(z\))
- to_dict() extracts the ConfusionTable values as a dict: {'tp':\(w\), 'tn':\(x\), 'fp':\(y\), 'fn':\(z\)}
- true_pos() returns the number of true positives
- true_neg() returns the number of true negatives
- false_pos() returns the number of false positives
- false_neg() returns the number of false negatives
- correct_pop() returns the correct population
- error_pop() returns the error population
- test_pos_pop() returns the test positive population
- test_neg_pop() returns the test negative population
- cond_pos_pop() returns the condition positive population
- cond_neg_pop() returns the condition negative population
- population() returns the total population
- precision() returns the precision
- precision_gain() returns the precision gain
- recall() returns the recall
- specificity() returns the specificity
- npv() returns the negative predictive value
- fallout() returns the fallout
- fdr() returns the false discovery rate
- accuracy() returns the accuracy
- accuracy_gain() returns the accuracy gain
- balanced_accuracy() returns the balanced accuracy
- informedness() returns the informedness
- markedness() returns the markedness
- pr_amean() returns the arithmetic mean of precision & recall
- pr_gmean() returns the geometric mean of precision & recall
- pr_hmean() returns the harmonic mean of precision & recall
- pr_qmean() returns the quadratic mean of precision & recall
- pr_cmean() returns the contraharmonic mean of precision & recall
- pr_lmean() returns the logarithmic mean of precision & recall
- pr_imean() returns the identric mean of precision & recall
- pr_seiffert_mean() returns Seiffert's mean of precision & recall
- pr_lehmer_mean() returns the Lehmer mean of precision & recall
- pr_heronian_mean() returns the Heronian mean of precision & recall
- pr_hoelder_mean() returns the Hölder mean of precision & recall
- pr_agmean() returns the arithmetic-geometric mean of precision & recall
- pr_ghmean() returns the geometric-harmonic mean of precision & recall
- pr_aghmean() returns the arithmetic-geometric-harmonic mean of precision & recall
- fbeta_score() returns the \(F_{\beta}\) score
- f2_score() returns the \(F_2\) score
- fhalf_score() returns the \(F_{\frac{1}{2}}\) score
- e_score() returns the \(E\) score
- f1_score() returns the \(F_1\) score
- f_measure() returns the F measure
- g_measure() returns the G measure
- mcc() returns Matthews correlation coefficient
- significance() returns the significance
- kappa_statistic() returns the Kappa statistic
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.f1_score()
0.8275862068965516
>>> ct.mcc()
0.5367450401216932
>>> ct.specificity()
0.75
>>> ct.significance()
66.26190476190476
The ConfusionTable class also supports checking for equality with another ConfusionTable and casting to string with str():
>>> (ConfusionTable({'tp':120, 'tn':60, 'fp':20, 'fn':30}) ==
... ConfusionTable(120, 60, 20, 30))
True
>>> str(ConfusionTable(120, 60, 20, 30))
'tp:120, tn:60, fp:20, fn:30'
- class abydos.stats.ConfusionTable(tp=0, tn=0, fp=0, fn=0)
Bases: object
ConfusionTable object.
This object is initialized by passing four integers (or a list or tuple of four integers, or a dict with the keys 'tp', 'tn', 'fp', and 'fn') representing the cells of a confusion table: true positives, true negatives, false positives, and false negatives.
The object possesses methods for the calculation of various statistics based on the confusion table.
- accuracy()
Return accuracy.
Accuracy is defined as \(\frac{tp + tn}{population}\)
Cf. https://en.wikipedia.org/wiki/Accuracy
Returns: The accuracy of the confusion table
Return type: float
Example
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.accuracy()
0.782608695652174
- accuracy_gain()
Return gain in accuracy.
The gain in accuracy is defined as: \(G(accuracy) = \frac{accuracy}{random~ accuracy}\)
Cf. https://en.wikipedia.org/wiki/Gain_(information_retrieval)
Returns: The gain in accuracy of the confusion table
Return type: float
Example
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.accuracy_gain()
1.4325259515570934
- balanced_accuracy()
Return balanced accuracy.
Balanced accuracy is defined as \(\frac{sensitivity + specificity}{2}\)
Cf. https://en.wikipedia.org/wiki/Accuracy
Returns: The balanced accuracy of the confusion table
Return type: float
Example
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.balanced_accuracy()
0.775
- cond_neg_pop()
Return condition negative population.
Returns: The condition negative population of the confusion table
Return type: int
Example
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.cond_neg_pop()
80
- cond_pos_pop()
Return condition positive population.
Returns: The condition positive population of the confusion table
Return type: int
Example
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.cond_pos_pop()
150
- correct_pop()
Return correct population.
Returns: The correct population of the confusion table
Return type: int
Example
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.correct_pop()
180
- e_score(beta=1)
Return \(E\)-score.
This is Van Rijsbergen's effectiveness measure: \(E=1-F_{\beta}\).
Cf. https://en.wikipedia.org/wiki/Information_retrieval#F-measure
Parameters: beta (float) -- The \(\beta\) parameter in the above formula
Returns: The \(E\)-score of the confusion table
Return type: float
Example
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.e_score()
0.17241379310344818
- error_pop()
Return error population.
Returns: The error population of the confusion table
Return type: int
Example
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.error_pop()
50
- f1_score()
Return \(F_{1}\) score.
\(F_{1}\) score is the harmonic mean of precision and recall: \(2 \cdot \frac{precision \cdot recall}{precision + recall}\)
Cf. https://en.wikipedia.org/wiki/F1_score
Returns: The \(F_{1}\) of the confusion table
Return type: float
Example
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.f1_score()
0.8275862068965516
- f2_score()
Return \(F_{2}\) score.
The \(F_{2}\) score emphasizes recall over precision in comparison to the \(F_{1}\) score.
Cf. https://en.wikipedia.org/wiki/F1_score
Returns: The \(F_{2}\) score of the confusion table
Return type: float
Example
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.f2_score()
0.8108108108108109
- f_measure()
Return \(F\)-measure.
\(F\)-measure is the harmonic mean of precision and recall: \(2 \cdot \frac{precision \cdot recall}{precision + recall}\)
Cf. https://en.wikipedia.org/wiki/F1_score
Returns: The \(F\)-measure of the confusion table
Return type: float
Example
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.f_measure()
0.8275862068965516
- fallout()
Return fall-out.
Fall-out is defined as \(\frac{fp}{fp + tn}\)
AKA false positive rate (FPR)
Cf. https://en.wikipedia.org/wiki/Information_retrieval#Fall-out
Returns: The fall-out of the confusion table
Return type: float
Example
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.fallout()
0.25
- false_neg()
Return false negatives.
Returns: The false negatives of the confusion table
Return type: int
Example
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.false_neg()
30
- false_pos()
Return false positives.
Returns: The false positives of the confusion table
Return type: int
Example
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.false_pos()
20
- fbeta_score(beta=1.0)
Return \(F_{\beta}\) score.
\(F_{\beta}\) for a positive real value \(\beta\) "measures the effectiveness of retrieval with respect to a user who attaches \(\beta\) times as much importance to recall as precision" (van Rijsbergen 1979)
\(F_{\beta}\) score is defined as: \((1 + \beta^2) \cdot \frac{precision \cdot recall} {((\beta^2 \cdot precision) + recall)}\)
Cf. https://en.wikipedia.org/wiki/F1_score
Parameters: beta (float) -- The \(\beta\) parameter in the above formula
Returns: The \(F_{\beta}\) of the confusion table
Return type: float
Raises: AttributeError -- Beta must be a positive real value
Examples
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.fbeta_score()
0.8275862068965518
>>> ct.fbeta_score(beta=0.1)
0.8565371024734982
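The \(F_{\beta}\) formula can be checked by hand from the raw counts. The following is a minimal sketch in plain Python, not the library's implementation: the function name fbeta_from_counts is hypothetical, and raising ValueError for a non-positive beta is a choice made for this sketch (the documented method raises AttributeError).

```python
def fbeta_from_counts(tp, fp, fn, beta=1.0):
    """Compute F-beta from raw counts (illustrative sketch, not abydos code)."""
    if beta <= 0:
        # Assumption of this sketch; the documented method raises AttributeError.
        raise ValueError("beta must be a positive real value")
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)


# Same counts as the documented example: tp=120, fp=20, fn=30.
print(fbeta_from_counts(120, 20, 30))            # approximately 0.8276
print(fbeta_from_counts(120, 20, 30, beta=0.1))  # approximately 0.8565
```

A beta below 1 pulls the score toward precision (about 0.857 here), while a beta above 1 pulls it toward recall (0.8). The sketch does not guard against empty denominators.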
- fdr()
Return false discovery rate (FDR).
False discovery rate is defined as \(\frac{fp}{fp + tp}\)
Cf. https://en.wikipedia.org/wiki/False_discovery_rate
Returns: The false discovery rate of the confusion table
Return type: float
Example
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.fdr()
0.14285714285714285
- fhalf_score()
Return \(F_{0.5}\) score.
The \(F_{0.5}\) score emphasizes precision over recall in comparison to the \(F_{1}\) score
Cf. https://en.wikipedia.org/wiki/F1_score
Returns: The \(F_{0.5}\) score of the confusion table
Return type: float
Example
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.fhalf_score()
0.8450704225352114
- g_measure()
Return G-measure.
\(G\)-measure is the geometric mean of precision and recall: \(\sqrt{precision \cdot recall}\)
This is identical to the Fowlkes–Mallows (FM) index for two clusters.
Cf. https://en.wikipedia.org/wiki/F1_score#G-measure
Cf. https://en.wikipedia.org/wiki/Fowlkes%E2%80%93Mallows_index
Returns: The \(G\)-measure of the confusion table
Return type: float
Example
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.g_measure()
0.828078671210825
- informedness()
Return informedness.
Informedness is defined as \(sensitivity + specificity - 1\).
AKA Youden's J statistic ([You50])
AKA DeltaP'
Cf. https://en.wikipedia.org/wiki/Youden%27s_J_statistic
Returns: The informedness of the confusion table
Return type: float
Example
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.informedness()
0.55
- kappa_statistic()
Return κ statistic.
The κ statistic is defined as: \(\kappa = \frac{accuracy - random~ accuracy} {1 - random~ accuracy}\)
The κ statistic compares the performance of the classifier relative to the performance of a random classifier. \(\kappa\) = 0 indicates performance identical to random. \(\kappa\) = 1 indicates perfect predictive success. \(\kappa\) = -1 indicates perfect predictive failure.
Returns: The κ statistic of the confusion table
Return type: float
Example
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.kappa_statistic()
0.5344129554655871
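The text above does not spell out how random accuracy is obtained. The sketch below assumes the usual Cohen-style chance agreement computed from the table's marginal totals; that is an assumption of this example rather than a statement about the library's internals, although it reproduces the documented value for the example table. The function name kappa_from_counts is hypothetical.

```python
def kappa_from_counts(tp, tn, fp, fn):
    """Kappa from confusion-table counts (illustrative sketch, not abydos code)."""
    population = tp + tn + fp + fn
    accuracy = (tp + tn) / population
    # Assumption: chance agreement is derived from the marginal totals
    # (test positive * condition positive + test negative * condition negative).
    random_accuracy = (
        (tp + fp) * (tp + fn) + (tn + fp) * (tn + fn)
    ) / population ** 2
    return (accuracy - random_accuracy) / (1 - random_accuracy)


print(kappa_from_counts(120, 60, 20, 30))  # approximately 0.5344
print(kappa_from_counts(50, 50, 0, 0))     # 1.0, perfect predictive success
print(kappa_from_counts(0, 0, 50, 50))     # -1.0, perfect predictive failure
```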
- markedness()
Return markedness.
Markedness is defined as \(precision + npv - 1\)
Returns: The markedness of the confusion table
Return type: float
Example
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.markedness()
0.5238095238095237
- mcc()
Return Matthews correlation coefficient (MCC).
The Matthews correlation coefficient is defined in [Mat75] as: \(\frac{(tp \cdot tn) - (fp \cdot fn)} {\sqrt{(tp + fp)(tp + fn)(tn + fp)(tn + fn)}}\)
This is equivalent to the geometric mean of informedness and markedness, defined above.
Cf. https://en.wikipedia.org/wiki/Matthews_correlation_coefficient
Returns: The Matthews correlation coefficient of the confusion table
Return type: float
Example
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.mcc()
0.5367450401216932
- npv()
Return negative predictive value (NPV).
NPV is defined as \(\frac{tn}{tn + fn}\)
Cf. https://en.wikipedia.org/wiki/Negative_predictive_value
Returns: The negative predictive value of the confusion table
Return type: float
Example
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.npv()
0.6666666666666666
- population()
Return population, N.
Returns: The population (N) of the confusion table
Return type: int
Example
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.population()
230
- pr_aghmean()
Return arithmetic-geometric-harmonic mean of precision & recall.
Iterates over arithmetic, geometric, & harmonic means until they converge to a single value (rounded to 12 digits), following the method described in [Raissouli:2009].
Returns: The arithmetic-geometric-harmonic mean of the confusion table's precision & recall
Return type: float
Example
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.pr_aghmean()
0.8280786712108288
- pr_agmean()
Return arithmetic-geometric mean of precision & recall.
Iterates between arithmetic & geometric means until they converge to a single value (rounded to 12 digits)
Cf. https://en.wikipedia.org/wiki/Arithmetic-geometric_mean
Returns: The arithmetic-geometric mean of the confusion table's precision & recall
Return type: float
Example
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.pr_agmean()
0.8283250315702829
- pr_amean()
Return arithmetic mean of precision & recall.
The arithmetic mean of precision and recall is defined as: \(\frac{precision + recall}{2}\)
Cf. https://en.wikipedia.org/wiki/Arithmetic_mean
Returns: The arithmetic mean of the confusion table's precision & recall
Return type: float
Example
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.pr_amean()
0.8285714285714285
- pr_cmean()
Return contraharmonic mean of precision & recall.
The contraharmonic mean is: \(\frac{precision^{2} + recall^{2}}{precision + recall}\)
Cf. https://en.wikipedia.org/wiki/Contraharmonic_mean
Returns: The contraharmonic mean of the confusion table's precision & recall
Return type: float
Example
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.pr_cmean()
0.8295566502463055
- pr_ghmean()
Return geometric-harmonic mean of precision & recall.
Iterates between geometric & harmonic means until they converge to a single value (rounded to 12 digits)
Cf. https://en.wikipedia.org/wiki/Geometric-harmonic_mean
Returns: The geometric-harmonic mean of the confusion table's precision & recall
Return type: float
Example
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.pr_ghmean()
0.8278323841238441
- pr_gmean()
Return geometric mean of precision & recall.
The geometric mean of precision and recall is defined as: \(\sqrt{precision \cdot recall}\)
Cf. https://en.wikipedia.org/wiki/Geometric_mean
Returns: The geometric mean of the confusion table's precision & recall
Return type: float
Example
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.pr_gmean()
0.828078671210825
- pr_heronian_mean()
Return Heronian mean of precision & recall.
The Heronian mean of precision and recall is defined as: \(\frac{precision + \sqrt{precision \cdot recall} + recall}{3}\)
Cf. https://en.wikipedia.org/wiki/Heronian_mean
Returns: The Heronian mean of the confusion table's precision & recall
Return type: float
Example
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.pr_heronian_mean()
0.8284071761178939
- pr_hmean()
Return harmonic mean of precision & recall.
The harmonic mean of precision and recall is defined as: \(\frac{2 \cdot precision \cdot recall}{precision + recall}\)
Cf. https://en.wikipedia.org/wiki/Harmonic_mean
Returns: The harmonic mean of the confusion table's precision & recall
Return type: float
Example
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.pr_hmean()
0.8275862068965516
- pr_hoelder_mean(exp=2)
Return Hölder (power/generalized) mean of precision & recall.
The power mean of precision and recall is defined as: \(\sqrt[exp]{\frac{precision^{exp} + recall^{exp}}{2}}\) for \(exp \ne 0\), and the geometric mean for \(exp = 0\)
Cf. https://en.wikipedia.org/wiki/Generalized_mean
Parameters: exp (float) -- The exponent of the Hölder mean
Returns: The Hölder mean for the given exponent of the confusion table's precision & recall
Return type: float
Example
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.pr_hoelder_mean()
0.8290638930598233
- pr_imean()
Return identric (exponential) mean of precision & recall.
The identric mean is: precision if precision = recall, otherwise \(\frac{1}{e} \cdot \sqrt[precision - recall]{\frac{precision^{precision}} {recall^{recall}}}\)
Cf. https://en.wikipedia.org/wiki/Identric_mean
Returns: The identric mean of the confusion table's precision & recall
Return type: float
Example
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.pr_imean()
0.8284071826325543
- pr_lehmer_mean(exp=2.0)
Return Lehmer mean of precision & recall.
The Lehmer mean is: \(\frac{precision^{exp} + recall^{exp}} {precision^{exp-1} + recall^{exp-1}}\)
Cf. https://en.wikipedia.org/wiki/Lehmer_mean
Parameters: exp (float) -- The exponent of the Lehmer mean
Returns: The Lehmer mean for the given exponent of the confusion table's precision & recall
Return type: float
Example
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.pr_lehmer_mean()
0.8295566502463055
- pr_lmean()
Return logarithmic mean of precision & recall.
The logarithmic mean is: 0 if either precision or recall is 0, the precision if they are equal, otherwise \(\frac{precision - recall} {\ln(precision) - \ln(recall)}\)
Cf. https://en.wikipedia.org/wiki/Logarithmic_mean
Returns: The logarithmic mean of the confusion table's precision & recall
Return type: float
Example
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.pr_lmean()
0.8282429171492667
- pr_qmean()
Return quadratic mean of precision & recall.
The quadratic mean of precision and recall is defined as: \(\sqrt{\frac{precision^{2} + recall^{2}}{2}}\)
Cf. https://en.wikipedia.org/wiki/Quadratic_mean
Returns: The quadratic mean of the confusion table's precision & recall
Return type: float
Example
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.pr_qmean()
0.8290638930598233
- pr_seiffert_mean()
Return Seiffert's mean of precision & recall.
Seiffert's mean of precision and recall is: \(\frac{precision - recall}{4 \cdot \arctan \sqrt{\frac{precision}{recall}} - \pi}\)
It is defined in [Sei93].
Returns: Seiffert's mean of the confusion table's precision & recall
Return type: float
Example
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.pr_seiffert_mean()
0.8284071696048312
- precision()
Return precision.
Precision is defined as \(\frac{tp}{tp + fp}\)
AKA positive predictive value (PPV)
Cf. https://en.wikipedia.org/wiki/Precision_and_recall
Cf. https://en.wikipedia.org/wiki/Information_retrieval#Precision
Returns: The precision of the confusion table
Return type: float
Example
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.precision()
0.8571428571428571
- precision_gain()
Return gain in precision.
The gain in precision is defined as: \(G(precision) = \frac{precision}{random~ precision}\)
Cf. https://en.wikipedia.org/wiki/Gain_(information_retrieval)
Returns: The gain in precision of the confusion table
Return type: float
Example
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.precision_gain()
1.3142857142857143
- recall()
Return recall.
Recall is defined as \(\frac{tp}{tp + fn}\)
AKA sensitivity
AKA true positive rate (TPR)
Cf. https://en.wikipedia.org/wiki/Precision_and_recall
Cf. https://en.wikipedia.org/wiki/Sensitivity_(test)
Cf. https://en.wikipedia.org/wiki/Information_retrieval#Recall
Returns: The recall of the confusion table
Return type: float
Example
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.recall()
0.8
- significance()
Return the significance, \(\chi^{2}\).
Significance is defined as: \(\chi^{2} = \frac{(tp \cdot tn - fp \cdot fn)^{2} (tp + tn + fp + fn)} {(tp + fp)(tp + fn)(tn + fp)(tn + fn)}\)
Also: \(\chi^{2} = MCC^{2} \cdot n\), where \(n\) is the total population
Cf. https://en.wikipedia.org/wiki/Pearson%27s_chi-square_test
Returns: The significance of the confusion table
Return type: float
Example
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.significance()
66.26190476190476
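The relationship between \(\chi^{2}\) and MCC can be confirmed numerically. The sketch below recomputes both from the raw counts in plain Python; the function names are hypothetical and the code is an illustration of the two formulas, not the library's implementation.

```python
import math


def mcc_from_counts(tp, tn, fp, fn):
    """Matthews correlation coefficient from raw counts (illustrative sketch)."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator


def chi_squared_from_counts(tp, tn, fp, fn):
    """Chi-squared significance from raw counts (illustrative sketch)."""
    n = tp + tn + fp + fn
    numerator = (tp * tn - fp * fn) ** 2 * n
    denominator = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    return numerator / denominator


counts = (120, 60, 20, 30)  # tp, tn, fp, fn
chi2 = chi_squared_from_counts(*counts)
mcc = mcc_from_counts(*counts)
print(chi2)                    # approximately 66.26
print(mcc ** 2 * sum(counts))  # agrees with chi2 up to floating-point rounding
```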
- specificity()
Return specificity.
Specificity is defined as \(\frac{tn}{tn + fp}\)
AKA true negative rate (TNR)
Cf. https://en.wikipedia.org/wiki/Specificity_(tests)
Returns: The specificity of the confusion table
Return type: float
Example
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.specificity()
0.75
- test_neg_pop()
Return test negative population.
Returns: The test negative population of the confusion table
Return type: int
Example
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.test_neg_pop()
90
- test_pos_pop()
Return test positive population.
Returns: The test positive population of the confusion table
Return type: int
Example
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.test_pos_pop()
140
- to_dict()
Cast to dict.
Returns: The confusion table as a dict
Return type: dict
Example
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> import pprint
>>> pprint.pprint(ct.to_dict())
{'fn': 30, 'fp': 20, 'tn': 60, 'tp': 120}
- to_tuple()
Cast to tuple.
Returns: The confusion table as a 4-tuple (tp, tn, fp, fn)
Return type: tuple
Example
>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.to_tuple()
(120, 60, 20, 30)
- abydos.stats.amean(nums)
Return arithmetic mean.
The arithmetic mean is defined as: \(\frac{\sum{nums}}{|nums|}\)
Cf. https://en.wikipedia.org/wiki/Arithmetic_mean
Parameters: nums (list) -- A series of numbers
Returns: The arithmetic mean of nums
Return type: float
Examples
>>> amean([1, 2, 3, 4])
2.5
>>> amean([1, 2])
1.5
>>> amean([0, 5, 1000])
335.0
- abydos.stats.gmean(nums)
Return geometric mean.
The geometric mean is defined as: \(\sqrt[|nums|]{\prod\limits_{i} nums_{i}}\)
Cf. https://en.wikipedia.org/wiki/Geometric_mean
Parameters: nums (list) -- A series of numbers
Returns: The geometric mean of nums
Return type: float
Examples
>>> gmean([1, 2, 3, 4])
2.213363839400643
>>> gmean([1, 2])
1.4142135623730951
>>> gmean([0, 5, 1000])
0.0
- abydos.stats.hmean(nums)
Return harmonic mean.
The harmonic mean is defined as: \(\frac{|nums|}{\sum\limits_{i}\frac{1}{nums_i}}\)
Following the behavior of Wolfram|Alpha:
- If exactly one of the values in nums is 0, return 0.
- If more than one value in nums is 0, return NaN.
Cf. https://en.wikipedia.org/wiki/Harmonic_mean
Parameters: nums (list) -- A series of numbers
Returns: The harmonic mean of nums
Return type: float
Raises: AttributeError -- hmean requires at least one value
Examples
>>> hmean([1, 2, 3, 4])
1.9200000000000004
>>> hmean([1, 2])
1.3333333333333333
>>> hmean([0, 5, 1000])
0
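The zero-handling rules described for hmean can be sketched as follows. This is a plain-Python illustration of the documented behavior, not the library's code; the name harmonic_mean_sketch is hypothetical, and it raises ValueError on empty input where the documented function raises AttributeError.

```python
def harmonic_mean_sketch(nums):
    """Harmonic mean with the zero-handling rules described above (sketch)."""
    if len(nums) < 1:
        # Assumption of this sketch; the documented function raises AttributeError.
        raise ValueError("at least one value is required")
    zero_count = sum(1 for x in nums if x == 0)
    if zero_count == 1:
        return 0             # exactly one zero: the mean collapses to 0
    if zero_count > 1:
        return float("nan")  # more than one zero: undefined
    return len(nums) / sum(1 / x for x in nums)


print(harmonic_mean_sketch([1, 2, 3, 4]))  # approximately 1.92
print(harmonic_mean_sketch([0, 5, 1000]))  # 0
print(harmonic_mean_sketch([0, 0, 5]))     # nan
```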
- abydos.stats.agmean(nums)
Return arithmetic-geometric mean.
Iterates between arithmetic & geometric means until they converge to a single value (rounded to 12 digits).
Cf. https://en.wikipedia.org/wiki/Arithmetic-geometric_mean
Parameters: nums (list) -- A series of numbers
Returns: The arithmetic-geometric mean of nums
Return type: float
Examples
>>> agmean([1, 2, 3, 4])
2.3545004777751077
>>> agmean([1, 2])
1.4567910310469068
>>> agmean([0, 5, 1000])
2.9753977059954195e-13
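The iteration described for agmean can be sketched in a few lines. The version below is an illustration under stated assumptions, not the library's implementation: it starts from the arithmetic and geometric means of the whole list, stops on an absolute tolerance instead of comparing values rounded to 12 digits, and caps the number of iterations. The name agmean_sketch and its tolerance and max_iterations parameters are hypothetical, and math.prod requires Python 3.8 or later.

```python
import math


def agmean_sketch(nums, tolerance=1e-12, max_iterations=100):
    """Arithmetic-geometric mean by repeated averaging (illustrative sketch)."""
    arith = sum(nums) / len(nums)
    # Assumes non-negative inputs so that the n-th root is real.
    geom = math.prod(nums) ** (1 / len(nums))
    for _ in range(max_iterations):
        if abs(arith - geom) <= tolerance:
            break
        # Replace the pair with its own arithmetic and geometric means.
        arith, geom = (arith + geom) / 2, math.sqrt(arith * geom)
    return arith


print(agmean_sketch([1, 2]))        # approximately 1.456791
print(agmean_sketch([1, 2, 3, 4]))  # approximately 2.354500
```

The gap between the two means shrinks very quickly, so only a handful of iterations is needed unless the geometric mean is zero.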
- abydos.stats.ghmean(nums)
Return geometric-harmonic mean.
Iterates between geometric & harmonic means until they converge to a single value (rounded to 12 digits).
Cf. https://en.wikipedia.org/wiki/Geometric-harmonic_mean
Parameters: nums (list) -- A series of numbers
Returns: The geometric-harmonic mean of nums
Return type: float
Examples
>>> ghmean([1, 2, 3, 4])
2.058868154613003
>>> ghmean([1, 2])
1.3728805006183502
>>> ghmean([0, 5, 1000])
0.0
>>> ghmean([0, 0])
0.0
>>> ghmean([0, 0, 5])
nan
- abydos.stats.aghmean(nums)
Return arithmetic-geometric-harmonic mean.
Iterates over arithmetic, geometric, & harmonic means until they converge to a single value (rounded to 12 digits), following the method described in [Raissouli:2009].
Parameters: nums (list) -- A series of numbers
Returns: The arithmetic-geometric-harmonic mean of nums
Return type: float
Examples
>>> aghmean([1, 2, 3, 4])
2.198327159900212
>>> aghmean([1, 2])
1.4142135623731884
>>> aghmean([0, 5, 1000])
335.0
- abydos.stats.cmean(nums)
Return contraharmonic mean.
The contraharmonic mean is: \(\frac{\sum\limits_i x_i^2}{\sum\limits_i x_i}\)
Cf. https://en.wikipedia.org/wiki/Contraharmonic_mean
Parameters: nums (list) -- A series of numbers
Returns: The contraharmonic mean of nums
Return type: float
Examples
>>> cmean([1, 2, 3, 4])
3.0
>>> cmean([1, 2])
1.6666666666666667
>>> cmean([0, 5, 1000])
995.0497512437811
- abydos.stats.imean(nums)
Return identric (exponential) mean.
The identric mean of two numbers x and y is: x if x = y, otherwise \(\frac{1}{e} \sqrt[x-y]{\frac{x^x}{y^y}}\)
Cf. https://en.wikipedia.org/wiki/Identric_mean
Parameters: nums (list) -- A series of numbers
Returns: The identric mean of nums
Return type: float
Raises: AttributeError -- imean supports no more than two values
Examples
>>> imean([1, 2])
1.4715177646857693
>>> imean([1, 0])
nan
>>> imean([2, 4])
2.9430355293715387
- abydos.stats.lmean(nums)
Return logarithmic mean.
The logarithmic mean of an arbitrarily long series is defined by http://www.survo.fi/papers/logmean.pdf as: \(L(x_1, x_2, ..., x_n) = (n-1)! \sum\limits_{i=1}^n \frac{x_i} {\prod\limits_{\substack{j = 1\\j \ne i}}^n \ln \frac{x_i}{x_j}}\)
Cf. https://en.wikipedia.org/wiki/Logarithmic_mean
Parameters: nums (list) -- A series of numbers
Returns: The logarithmic mean of nums
Return type: float
Raises: AttributeError -- No two values in the nums list may be equal
Examples
>>> lmean([1, 2, 3, 4])
2.2724242417489258
>>> lmean([1, 2])
1.4426950408889634
- abydos.stats.qmean(nums)
Return quadratic mean.
The quadratic mean is defined as: \(\sqrt{\sum\limits_{i} \frac{nums_i^2}{|nums|}}\)
Cf. https://en.wikipedia.org/wiki/Quadratic_mean
Parameters: nums (list) -- A series of numbers
Returns: The quadratic mean of nums
Return type: float
Examples
>>> qmean([1, 2, 3, 4])
2.7386127875258306
>>> qmean([1, 2])
1.5811388300841898
>>> qmean([0, 5, 1000])
577.3574860228857
- abydos.stats.heronian_mean(nums)
Return Heronian mean.
The Heronian mean is: \(\frac{\sum\limits_{i, j}\sqrt{{x_i \cdot x_j}}} {|nums| \cdot \frac{|nums| + 1}{2}}\) for \(j \ge i\)
Cf. https://en.wikipedia.org/wiki/Heronian_mean
Parameters: nums (list) -- A series of numbers
Returns: The Heronian mean of nums
Return type: float
Examples
>>> heronian_mean([1, 2, 3, 4])
2.3888282852609093
>>> heronian_mean([1, 2])
1.4714045207910316
>>> heronian_mean([0, 5, 1000])
179.28511301977582
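The sum over pairs with \(j \ge i\) in the formula above maps directly onto itertools.combinations_with_replacement, which yields each unordered pair once and includes every element paired with itself. The sketch below is an illustration of the formula, not the library's implementation; the name heronian_mean_sketch is hypothetical, and non-negative inputs are assumed.

```python
import math
from itertools import combinations_with_replacement


def heronian_mean_sketch(nums):
    """Generalized Heronian mean over all pairs with j >= i (illustrative sketch)."""
    n = len(nums)
    # n * (n + 1) / 2 pairs in total, including each element paired with itself.
    total = sum(
        math.sqrt(x * y) for x, y in combinations_with_replacement(nums, 2)
    )
    return total / (n * (n + 1) / 2)


print(heronian_mean_sketch([1, 2]))        # (1 + sqrt(2) + 2) / 3, about 1.4714
print(heronian_mean_sketch([1, 2, 3, 4]))  # approximately 2.3888
```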
- abydos.stats.hoelder_mean(nums, exp=2)
Return Hölder (power/generalized) mean.
The Hölder mean is defined as: \(\sqrt[p]{\frac{1}{|nums|} \cdot \sum\limits_i{x_i^p}}\) for \(p \ne 0\), and the geometric mean for \(p = 0\), where \(p\) is the exponent exp
Cf. https://en.wikipedia.org/wiki/Generalized_mean
Parameters: - nums (list) -- A series of numbers
- exp (numeric) -- The exponent of the Hölder mean
Returns: The Hölder mean of nums for the given exponent
Return type: float
Examples
>>> hoelder_mean([1, 2, 3, 4])
2.7386127875258306
>>> hoelder_mean([1, 2])
1.5811388300841898
>>> hoelder_mean([0, 5, 1000])
577.3574860228857
- abydos.stats.lehmer_mean(nums, exp=2)
Return Lehmer mean.
The Lehmer mean is: \(\frac{\sum\limits_i{x_i^p}}{\sum\limits_i{x_i^{p-1}}}\), where \(p\) is the exponent exp
Cf. https://en.wikipedia.org/wiki/Lehmer_mean
Parameters: - nums (list) -- A series of numbers
- exp (numeric) -- The exponent of the Lehmer mean
Returns: The Lehmer mean of nums for the given exponent
Return type: float
Examples
>>> lehmer_mean([1, 2, 3, 4])
3.0
>>> lehmer_mean([1, 2])
1.6666666666666667
>>> lehmer_mean([0, 5, 1000])
995.0497512437811
- abydos.stats.seiffert_mean(nums)
Return Seiffert's mean.
Seiffert's mean of two numbers x and y is: \(\frac{x - y}{4 \cdot \arctan \sqrt{\frac{x}{y}} - \pi}\)
It is defined in [Sei93].
Parameters: nums (list) -- A series of numbers
Returns: Seiffert's mean of nums
Return type: float
Raises: AttributeError -- seiffert_mean supports no more than two values
Examples
>>> seiffert_mean([1, 2])
1.4712939827611637
>>> seiffert_mean([1, 0])
0.3183098861837907
>>> seiffert_mean([2, 4])
2.9425879655223275
>>> seiffert_mean([2, 1000])
336.84053300118825
- abydos.stats.median(nums)
Return median.
With numbers sorted by value, the median is the middle value (if there is an odd number of values) or the arithmetic mean of the two middle values (if there is an even number of values).
Cf. https://en.wikipedia.org/wiki/Median
Parameters: nums (list) -- A series of numbers
Returns: The median of nums
Return type: int or float
Examples
>>> median([1, 2, 3])
2
>>> median([1, 2, 3, 4])
2.5
>>> median([1, 2, 2, 4])
2
- abydos.stats.midrange(nums)
Return midrange.
The midrange is the arithmetic mean of the maximum & minimum of a series.
Cf. https://en.wikipedia.org/wiki/Midrange
Parameters: nums (list) -- A series of numbers
Returns: The midrange of nums
Return type: float
Examples
>>> midrange([1, 2, 3])
2.0
>>> midrange([1, 2, 2, 3])
2.0
>>> midrange([1, 2, 1000, 3])
500.5
- abydos.stats.mode(nums)
Return the mode.
The mode of a series is the most common element of that series.
Cf. https://en.wikipedia.org/wiki/Mode_(statistics)
Parameters: nums (list) -- A series of numbers
Returns: The mode of nums
Return type: int or float
Example
>>> mode([1, 2, 2, 3])
2
- abydos.stats.std(nums, mean_func=<function amean>, ddof=0)
Return the standard deviation.
The standard deviation of a series of values is the square root of the variance.
Cf. https://en.wikipedia.org/wiki/Standard_deviation
Parameters: - nums (list) -- A series of numbers
- mean_func (function) -- A mean function (amean by default)
- ddof (int) -- The delta degrees of freedom (0 by default); as the examples show, the variance is divided by N - ddof
Returns: The standard deviation of the values in the series
Return type: float
Examples
>>> std([1, 1, 1, 1])
0.0
>>> round(std([1, 2, 3, 4]), 12)
1.11803398875
>>> round(std([1, 2, 3, 4], ddof=1), 12)
1.290994448736
- abydos.stats.var(nums, mean_func=<function amean>, ddof=0)
Calculate the variance.
The variance (\(\sigma^2\)) of a series of numbers (\(x_i\)) with mean \(\mu\) and population \(N\) is:
\(\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2\).
Cf. https://en.wikipedia.org/wiki/Variance
Parameters: - nums (list) -- A series of numbers
- mean_func (function) -- A mean function (amean by default)
- ddof (int) -- The delta degrees of freedom (0 by default); as the examples show, the sum of squared deviations is divided by N - ddof
Returns: The variance of the values in the series
Return type: float
Examples
>>> var([1, 1, 1, 1])
0.0
>>> var([1, 2, 3, 4])
1.25
>>> round(var([1, 2, 3, 4], ddof=1), 12)
1.666666666667
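The effect of ddof is easiest to see in code. The sketch below assumes the divisor is N - ddof, which is consistent with the documented examples (5/4 = 1.25 for ddof=0 and 5/3 for ddof=1); it fixes the mean to the arithmetic mean rather than accepting a mean_func, and the names variance_sketch and std_sketch are hypothetical.

```python
def variance_sketch(nums, ddof=0):
    """Variance with a delta-degrees-of-freedom correction (illustrative sketch)."""
    mean = sum(nums) / len(nums)
    squared_deviations = sum((x - mean) ** 2 for x in nums)
    # Assumption: ddof=0 gives the population variance, ddof=1 the sample variance.
    return squared_deviations / (len(nums) - ddof)


def std_sketch(nums, ddof=0):
    """Standard deviation as the square root of the variance (sketch)."""
    return variance_sketch(nums, ddof) ** 0.5


print(variance_sketch([1, 2, 3, 4]))                    # 1.25
print(round(variance_sketch([1, 2, 3, 4], ddof=1), 12)) # 1.666666666667
print(round(std_sketch([1, 2, 3, 4]), 12))              # 1.11803398875
```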
- abydos.stats.mean_pairwise_similarity(collection, metric=<function sim>, mean_func=<function hmean>, symmetric=False)
Calculate the mean pairwise similarity of a collection of strings.
Takes the mean of the pairwise similarity between each member of a collection, optionally in both directions (for asymmetric similarity metrics).
Parameters: - collection (list) -- A collection of terms or a string that can be split
- metric (function) -- A similarity metric function
- mean_func (function) -- A mean function that takes a list of values and returns a float
- symmetric (bool) -- Set to True if all pairwise similarities should be calculated in both directions
Returns: The mean pairwise similarity of a collection of strings
Return type: float
Raises: - ValueError -- mean_func must be a function
- ValueError -- metric must be a function
- ValueError -- collection is neither a string nor iterable type
- ValueError -- collection has fewer than two members
Examples
>>> round(mean_pairwise_similarity(['Christopher', 'Kristof',
... 'Christobal']), 12)
0.519801980198
>>> round(mean_pairwise_similarity(['Niall', 'Neal', 'Neil']), 12)
0.545454545455
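The overall shape of this computation can be sketched without the library. Everything below is illustrative rather than the library's implementation: toy_similarity is a hypothetical stand-in for a real metric, the default mean is arithmetic rather than the documented hmean, and reading symmetric=True as "evaluate both orderings of each pair" is an assumption based on the description above.

```python
from itertools import combinations, permutations


def toy_similarity(src, tar):
    """Hypothetical metric: Jaccard similarity of the two strings' character sets."""
    a, b = set(src), set(tar)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def arithmetic_mean(values):
    return sum(values) / len(values)


def mean_pairwise_similarity_sketch(collection, metric=toy_similarity,
                                    mean_func=arithmetic_mean, symmetric=False):
    """Mean of a metric over all pairs in a collection (illustrative sketch)."""
    if isinstance(collection, str):
        collection = collection.split()  # a string is split into terms
    collection = list(collection)
    if len(collection) < 2:
        raise ValueError("collection has fewer than two members")
    # Assumption: symmetric=True scores both (a, b) and (b, a).
    if symmetric:
        pairs = permutations(collection, 2)
    else:
        pairs = combinations(collection, 2)
    return mean_func([metric(src, tar) for src, tar in pairs])


print(mean_pairwise_similarity_sketch(["abc", "abd", "xyz"]))  # 0.5 / 3, about 0.1667
```

For a metric that is already symmetric, such as the toy one here, both settings of symmetric give the same mean; the flag only matters when metric(a, b) can differ from metric(b, a).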
- abydos.stats.pairwise_similarity_statistics(src_collection, tar_collection, metric=<function sim>, mean_func=<function amean>, symmetric=False)
Calculate the pairwise similarity statistics of a collection of strings.
Calculate pairwise similarities among members of two collections, returning the maximum, minimum, mean (according to a supplied function, arithmetic mean, by default), and (population) standard deviation of those similarities.
Parameters: - src_collection (list) -- A collection of terms or a string that can be split
- tar_collection (list) -- A collection of terms or a string that can be split
- metric (function) -- A similarity metric function
- mean_func (function) -- A mean function that takes a list of values and returns a float
- symmetric (bool) -- Set to True if all pairwise similarities should be calculated in both directions
Returns: The max, min, mean, and standard deviation of similarities
Return type: tuple
Raises: - ValueError -- mean_func must be a function
- ValueError -- metric must be a function
- ValueError -- src_collection is neither a string nor iterable
- ValueError -- tar_collection is neither a string nor iterable
Example
>>> tuple(round(_, 12) for _ in pairwise_similarity_statistics(
... ['Christopher', 'Kristof', 'Christobal'], ['Niall', 'Neal', 'Neil']))
(0.2, 0.0, 0.118614718615, 0.075070477184)