abydos.stats package¶

abydos.stats.

The stats module defines functions for calculating various statistical data about linguistic objects.

Functions are provided for calculating the following means:

arithmetic mean (amean())

geometric mean (gmean())

harmonic mean (hmean())

quadratic mean (qmean())

contraharmonic mean (cmean())

logarithmic mean (lmean())

identric (exponential) mean (imean())

Seiffert's mean (seiffert_mean())

Lehmer mean (lehmer_mean())

Heronian mean (heronian_mean())

Hölder (power/generalized) mean (hoelder_mean())

arithmetic-geometric mean (agmean())

geometric-harmonic mean (ghmean())

arithmetic-geometric-harmonic mean (aghmean())

And for calculating:

midrange (midrange())

median (median())

mode (mode())

variance (var())

standard deviation (std())

Some examples of the basic functions:

>>> nums = [16, 49, 55, 49, 6, 40, 23, 47, 29, 85, 76, 20]
>>> amean(nums)
41.25
>>> aghmean(nums)
32.42167170892585
>>> heronian_mean(nums)
37.931508950381925
>>> mode(nums)
49
>>> std(nums)
22.876935255113754

Two pairwise functions are provided:

mean pairwise similarity (mean_pairwise_similarity()), which returns the mean of the similarities (using a supplied similarity function) between each pair of items in a collection

pairwise similarity statistics (pairwise_similarity_statistics()), which returns the max, min, mean, and standard deviation of pairwise similarities between two collections

The confusion table class (ConfusionTable) can be constructed in a number of ways:

four values, representing true positives, true negatives, false positives, and false negatives, can be passed to the constructor

a list or tuple with four values, representing true positives, true negatives, false positives, and false negatives, can be passed to the constructor

a dict with the keys 'tp', 'tn', 'fp', and 'fn', mapped to the values for true positives, true negatives, false positives, and false negatives respectively, can be passed to the constructor

The ConfusionTable class has methods:

to_tuple() extracts the ConfusionTable values as a tuple: (\(w\), \(x\), \(y\), \(z\))

to_dict() extracts the ConfusionTable values as a dict: {'tp':\(w\), 'tn':\(x\), 'fp':\(y\), 'fn':\(z\)}

true_pos() returns the number of true positives

true_neg() returns the number of true negatives

false_pos() returns the number of false positives

false_neg() returns the number of false negatives

correct_pop() returns the correct population

error_pop() returns the error population

test_pos_pop() returns the test positive population

test_neg_pop() returns the test negative population

cond_pos_pop() returns the condition positive population

cond_neg_pop() returns the condition negative population

population() returns the total population

precision() returns the precision

precision_gain() returns the precision gain

recall() returns the recall

specificity() returns the specificity

npv() returns the negative predictive value

fallout() returns the fallout

fdr() returns the false discovery rate

accuracy() returns the accuracy

accuracy_gain() returns the accuracy gain

balanced_accuracy() returns the balanced accuracy

informedness() returns the informedness

markedness() returns the markedness

pr_amean() returns the arithmetic mean of precision & recall

pr_gmean() returns the geometric mean of precision & recall

pr_hmean() returns the harmonic mean of precision & recall

pr_qmean() returns the quadratic mean of precision & recall

pr_cmean() returns the contraharmonic mean of precision & recall

pr_lmean() returns the logarithmic mean of precision & recall

pr_imean() returns the identric mean of precision & recall

pr_seiffert_mean() returns Seiffert's mean of precision & recall

pr_lehmer_mean() returns the Lehmer mean of precision & recall

pr_heronian_mean() returns the Heronian mean of precision & recall

pr_hoelder_mean() returns the Hölder mean of precision & recall

pr_agmean() returns the arithmetic-geometric mean of precision & recall

pr_ghmean() returns the geometric-harmonic mean of precision & recall

pr_aghmean() returns the arithmetic-geometric-harmonic mean of precision & recall

fbeta_score() returns the \(F_{\beta}\) score

f2_score() returns the \(F_2\) score

fhalf_score() returns the \(F_{\frac{1}{2}}\) score

e_score() returns the \(E\) score

f1_score() returns the \(F_1\) score

f_measure() returns the F measure

g_measure() returns the G measure

mcc() returns Matthews correlation coefficient

significance() returns the significance

kappa_statistic() returns the Kappa statistic

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.f1_score()
0.8275862068965516
>>> ct.mcc()
0.5367450401216932
>>> ct.specificity()
0.75
>>> ct.significance()
66.26190476190476

The ConfusionTable class also supports checking for equality with another ConfusionTable and casting to string with str():

>>> (ConfusionTable({'tp':120, 'tn':60, 'fp':20, 'fn':30}) ==
... ConfusionTable(120, 60, 20, 30))
True
>>> str(ConfusionTable(120, 60, 20, 30))
'tp:120, tn:60, fp:20, fn:30'

class abydos.stats.ConfusionTable(tp=0, tn=0, fp=0, fn=0)[source]¶

Bases: object

ConfusionTable object.

This object is initialized by passing four integers (or a list or tuple of four integers, or a dict with the keys 'tp', 'tn', 'fp', and 'fn') representing the cells of a confusion table: true positives, true negatives, false positives, and false negatives.

The object possesses methods for the calculation of various statistics based on the confusion table.

accuracy()[source]¶

Return accuracy.

Accuracy is defined as \(\frac{tp + tn}{population}\)

Cf. https://en.wikipedia.org/wiki/Accuracy

Returns:	The accuracy of the confusion table
Return type:	float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.accuracy()
0.782608695652174

accuracy_gain()[source]¶

Return gain in accuracy.

The gain in accuracy is defined as: \(G(accuracy) = \frac{accuracy}{random~ accuracy}\)

Cf. https://en.wikipedia.org/wiki/Gain_(information_retrieval)

Returns:	The gain in accuracy of the confusion table
Return type:	float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.accuracy_gain()
1.4325259515570934

balanced_accuracy()[source]¶

Return balanced accuracy.

Balanced accuracy is defined as \(\frac{sensitivity + specificity}{2}\)

Cf. https://en.wikipedia.org/wiki/Accuracy

Returns:	The balanced accuracy of the confusion table
Return type:	float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.balanced_accuracy()
0.775

cond_neg_pop()[source]¶

Return condition negative population.

Returns:	The condition negative population of the confusion table
Return type:	int

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.cond_neg_pop()
80

cond_pos_pop()[source]¶

Return condition positive population.

Returns:	The condition positive population of the confusion table
Return type:	int

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.cond_pos_pop()
150

correct_pop()[source]¶

Return correct population.

Returns:	The correct population of the confusion table
Return type:	int

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.correct_pop()
180

e_score(beta=1)[source]¶

Return \(E\)-score.

This is Van Rijsbergen's effectiveness measure: \(E=1-F_{\beta}\).

Cf. https://en.wikipedia.org/wiki/Information_retrieval#F-measure

Parameters:	beta (float) -- The \(\beta\) parameter in the above formula
Returns:	The \(E\)-score of the confusion table
Return type:	float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.e_score()
0.17241379310344818

error_pop()[source]¶

Return error population.

Returns:	The error population of the confusion table
Return type:	int

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.error_pop()
50

f1_score()[source]¶

Return \(F_{1}\) score.

\(F_{1}\) score is the harmonic mean of precision and recall: \(2 \cdot \frac{precision \cdot recall}{precision + recall}\)

Cf. https://en.wikipedia.org/wiki/F1_score

Returns:	The \(F_{1}\) score of the confusion table
Return type:	float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.f1_score()
0.8275862068965516

f2_score()[source]¶

Return \(F_{2}\) score.

The \(F_{2}\) score emphasizes recall over precision in comparison to the \(F_{1}\) score

Cf. https://en.wikipedia.org/wiki/F1_score

Returns:	The \(F_{2}\) score of the confusion table
Return type:	float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.f2_score()
0.8108108108108109

f_measure()[source]¶

Return \(F\)-measure.

\(F\)-measure is the harmonic mean of precision and recall: \(2 \cdot \frac{precision \cdot recall}{precision + recall}\)

Cf. https://en.wikipedia.org/wiki/F1_score

Returns:	The \(F\)-measure of the confusion table
Return type:	float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.f_measure()
0.8275862068965516

fallout()[source]¶

Return fall-out.

Fall-out is defined as \(\frac{fp}{fp + tn}\)

AKA false positive rate (FPR)

Cf. https://en.wikipedia.org/wiki/Information_retrieval#Fall-out

Returns:	The fall-out of the confusion table
Return type:	float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.fallout()
0.25

false_neg()[source]¶

Return false negatives.

Returns:	The false negatives of the confusion table
Return type:	int

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.false_neg()
30

false_pos()[source]¶

Return false positives.

Returns:	The false positives of the confusion table
Return type:	int

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.false_pos()
20

fbeta_score(beta=1.0)[source]¶

Return \(F_{\beta}\) score.

\(F_{\beta}\) for a positive real value \(\beta\) "measures the effectiveness of retrieval with respect to a user who attaches \(\beta\) times as much importance to recall as precision" (van Rijsbergen 1979)

\(F_{\beta}\) score is defined as: \((1 + \beta^2) \cdot \frac{precision \cdot recall} {((\beta^2 \cdot precision) + recall)}\)

Cf. https://en.wikipedia.org/wiki/F1_score

Parameters:	beta (float) -- The \(\beta\) parameter in the above formula
Returns:	The \(F_{\beta}\) score of the confusion table
Return type:	float
Raises:	`AttributeError` -- Beta must be a positive real value

Examples

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.fbeta_score()
0.8275862068965518
>>> ct.fbeta_score(beta=0.1)
0.8565371024734982
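
The \(F_{\beta}\) formula above can be checked by hand from precision and recall. What follows is a minimal sketch, not the library's implementation: fbeta_sketch is a hypothetical helper name, and raising AttributeError for a non-positive beta is assumed here only to mirror the documented behaviour.

```python
def fbeta_sketch(precision, recall, beta=1.0):
    """Compute F-beta from precision and recall (illustrative helper)."""
    if beta <= 0:
        # Assumption: mirror the documented requirement that beta be positive.
        raise AttributeError('Beta must be a positive real value.')
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Cell counts from the documented example, ConfusionTable(120, 60, 20, 30)
tp, tn, fp, fn = 120, 60, 20, 30
precision = tp / (tp + fp)  # 120 / 140
recall = tp / (tp + fn)     # 120 / 150

f1 = fbeta_sketch(precision, recall)               # same role as f1_score()
f2 = fbeta_sketch(precision, recall, beta=2)       # same role as f2_score()
fhalf = fbeta_sketch(precision, recall, beta=0.5)  # same role as fhalf_score()
```

With beta above 1 the score moves toward recall (0.8 here), and with beta below 1 it moves toward precision (about 0.857 here), which is why f2 is the smallest of the three values and fhalf the largest.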

fdr()[source]¶

Return false discovery rate (FDR).

False discovery rate is defined as \(\frac{fp}{fp + tp}\)

Cf. https://en.wikipedia.org/wiki/False_discovery_rate

Returns:	The false discovery rate of the confusion table
Return type:	float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.fdr()
0.14285714285714285

fhalf_score()[source]¶

Return \(F_{0.5}\) score.

The \(F_{0.5}\) score emphasizes precision over recall in comparison to the \(F_{1}\) score

Cf. https://en.wikipedia.org/wiki/F1_score

Returns:	The \(F_{0.5}\) score of the confusion table
Return type:	float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.fhalf_score()
0.8450704225352114

g_measure()[source]¶

Return G-measure.

\(G\)-measure is the geometric mean of precision and recall: \(\sqrt{precision \cdot recall}\)

This is identical to the Fowlkes–Mallows (FM) index for two clusters.

Cf. https://en.wikipedia.org/wiki/F1_score#G-measure

Cf. https://en.wikipedia.org/wiki/Fowlkes%E2%80%93Mallows_index

Returns:	The \(G\)-measure of the confusion table
Return type:	float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.g_measure()
0.828078671210825

informedness()[source]¶

Return informedness.

Informedness is defined as \(sensitivity + specificity - 1\).

AKA Youden's J statistic ([You50])

AKA DeltaP'

Cf. https://en.wikipedia.org/wiki/Youden%27s_J_statistic

Returns:	The informedness of the confusion table
Return type:	float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.informedness()
0.55

kappa_statistic()[source]¶

Return κ statistic.

The κ statistic is defined as: \(\kappa = \frac{accuracy - random~ accuracy} {1 - random~ accuracy}\)

The κ statistic compares the performance of the classifier relative to the performance of a random classifier. \(\kappa\) = 0 indicates performance identical to random. \(\kappa\) = 1 indicates perfect predictive success. \(\kappa\) = -1 indicates perfect predictive failure.

Returns:	The κ statistic of the confusion table
Return type:	float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.kappa_statistic()
0.5344129554655871
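
The formula above does not spell out how random accuracy is obtained. The sketch below assumes the usual chance-agreement term of Cohen's kappa, (test positive × condition positive + test negative × condition negative) / population², which agrees with the documented example to within floating-point tolerance. kappa_sketch is a hypothetical helper, not part of the package.

```python
def kappa_sketch(tp, tn, fp, fn):
    """Kappa statistic from the four cells of a confusion table (sketch)."""
    n = tp + tn + fp + fn
    accuracy = (tp + tn) / n
    # Assumption: chance agreement as in Cohen's kappa, built from the
    # test positive/negative and condition positive/negative populations.
    random_accuracy = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return (accuracy - random_accuracy) / (1 - random_accuracy)


kappa = kappa_sketch(120, 60, 20, 30)

# The limiting cases described above:
perfect_success = kappa_sketch(5, 5, 0, 0)  # every item classified correctly
perfect_failure = kappa_sketch(0, 0, 5, 5)  # every item classified wrongly
```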

markedness()[source]¶

Return markedness.

Markedness is defined as \(precision + npv - 1\)

Returns:	The markedness of the confusion table
Return type:	float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.markedness()
0.5238095238095237

mcc()[source]¶

Return Matthews correlation coefficient (MCC).

The Matthews correlation coefficient is defined in [Mat75] as: \(\frac{(tp \cdot tn) - (fp \cdot fn)} {\sqrt{(tp + fp)(tp + fn)(tn + fp)(tn + fn)}}\)

This is equivalent to the geometric mean of informedness and markedness, defined above.

Cf. https://en.wikipedia.org/wiki/Matthews_correlation_coefficient

Returns:	The Matthews correlation coefficient of the confusion table
Return type:	float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.mcc()
0.5367450401216932
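
As a worked check of the two formulations above, the following sketch computes the coefficient directly from the cell counts and again as the geometric mean of informedness and markedness. It uses only the formulas given in this section; the variable names are illustrative.

```python
import math

# Cell counts from the documented example, ConfusionTable(120, 60, 20, 30)
tp, tn, fp, fn = 120, 60, 20, 30

# Direct formula
mcc_direct = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

# Via informedness and markedness
precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)
npv = tn / (tn + fn)
informedness = recall + specificity - 1   # 0.55
markedness = precision + npv - 1          # about 0.5238

# Informedness and markedness share the numerator tp*tn - fp*fn, so their
# product is never negative; the square root therefore drops the sign that
# the direct formula keeps.
mcc_from_means = math.sqrt(informedness * markedness)
```

Both routes give approximately 0.5367 for this table. For a classifier that performs worse than chance, only the direct formula reports the negative sign.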

npv()[source]¶

Return negative predictive value (NPV).

NPV is defined as \(\frac{tn}{tn + fn}\)

Cf. https://en.wikipedia.org/wiki/Negative_predictive_value

Returns:	The negative predictive value of the confusion table
Return type:	float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.npv()
0.6666666666666666

population()[source]¶

Return population, N.

Returns:	The population (N) of the confusion table
Return type:	int

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.population()
230

pr_aghmean()[source]¶

Return arithmetic-geometric-harmonic mean of precision & recall.

Iterates over arithmetic, geometric, & harmonic means until they converge to a single value (rounded to 12 digits), following the method described in [Raissouli:2009].

Returns:	The arithmetic-geometric-harmonic mean of the confusion table's precision & recall
Return type:	float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.pr_aghmean()
0.8280786712108288

pr_agmean()[source]¶

Return arithmetic-geometric mean of precision & recall.

Iterates between arithmetic & geometric means until they converge to a single value (rounded to 12 digits)

Cf. https://en.wikipedia.org/wiki/Arithmetic-geometric_mean

Returns:	The arithmetic-geometric mean of the confusion table's precision & recall
Return type:	float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.pr_agmean()
0.8283250315702829

pr_amean()[source]¶

Return arithmetic mean of precision & recall.

The arithmetic mean of precision and recall is defined as: \(\frac{precision + recall}{2}\)

Cf. https://en.wikipedia.org/wiki/Arithmetic_mean

Returns:	The arithmetic mean of the confusion table's precision & recall
Return type:	float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.pr_amean()
0.8285714285714285

pr_cmean()[source]¶

Return contraharmonic mean of precision & recall.

The contraharmonic mean is: \(\frac{precision^{2} + recall^{2}}{precision + recall}\)

Cf. https://en.wikipedia.org/wiki/Contraharmonic_mean

Returns:	The contraharmonic mean of the confusion table's precision & recall
Return type:	float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.pr_cmean()
0.8295566502463055

pr_ghmean()[source]¶

Return geometric-harmonic mean of precision & recall.

Iterates between geometric & harmonic means until they converge to a single value (rounded to 12 digits)

Cf. https://en.wikipedia.org/wiki/Geometric-harmonic_mean

Returns:	The geometric-harmonic mean of the confusion table's precision & recall
Return type:	float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.pr_ghmean()
0.8278323841238441

pr_gmean()[source]¶

Return geometric mean of precision & recall.

The geometric mean of precision and recall is defined as: \(\sqrt{precision \cdot recall}\)

Cf. https://en.wikipedia.org/wiki/Geometric_mean

Returns:	The geometric mean of the confusion table's precision & recall
Return type:	float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.pr_gmean()
0.828078671210825

pr_heronian_mean()[source]¶

Return Heronian mean of precision & recall.

The Heronian mean of precision and recall is defined as: \(\frac{precision + \sqrt{precision \cdot recall} + recall}{3}\)

Cf. https://en.wikipedia.org/wiki/Heronian_mean

Returns:	The Heronian mean of the confusion table's precision & recall
Return type:	float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.pr_heronian_mean()
0.8284071761178939

pr_hmean()[source]¶

Return harmonic mean of precision & recall.

The harmonic mean of precision and recall is defined as: \(\frac{2 \cdot precision \cdot recall}{precision + recall}\)

Cf. https://en.wikipedia.org/wiki/Harmonic_mean

Returns:	The harmonic mean of the confusion table's precision & recall
Return type:	float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.pr_hmean()
0.8275862068965516

pr_hoelder_mean(exp=2)[source]¶

Return Hölder (power/generalized) mean of precision & recall.

The power mean of precision and recall is defined as: \(\sqrt[exp]{\frac{1}{2} \cdot (precision^{exp} + recall^{exp})}\) for \(exp \ne 0\), and the geometric mean for \(exp = 0\)

Cf. https://en.wikipedia.org/wiki/Generalized_mean

Parameters:	exp (float) -- The exponent of the Hölder mean
Returns:	The Hölder mean for the given exponent of the confusion table's precision & recall
Return type:	float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.pr_hoelder_mean()
0.8290638930598233

pr_imean()[source]¶

Return identric (exponential) mean of precision & recall.

The identric mean is: precision if precision = recall, otherwise \(\frac{1}{e} \cdot \sqrt[precision - recall]{\frac{precision^{precision}} {recall^{recall}}}\)

Cf. https://en.wikipedia.org/wiki/Identric_mean

Returns:	The identric mean of the confusion table's precision & recall
Return type:	float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.pr_imean()
0.8284071826325543

pr_lehmer_mean(exp=2.0)[source]¶

Return Lehmer mean of precision & recall.

The Lehmer mean is: \(\frac{precision^{exp} + recall^{exp}} {precision^{exp-1} + recall^{exp-1}}\)

Cf. https://en.wikipedia.org/wiki/Lehmer_mean

Parameters:	exp (float) -- The exponent of the Lehmer mean
Returns:	The Lehmer mean for the given exponent of the confusion table's precision & recall
Return type:	float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.pr_lehmer_mean()
0.8295566502463055

pr_lmean()[source]¶

Return logarithmic mean of precision & recall.

The logarithmic mean is: 0 if either precision or recall is 0, the precision if they are equal, otherwise \(\frac{precision - recall} {\ln(precision) - \ln(recall)}\)

Cf. https://en.wikipedia.org/wiki/Logarithmic_mean

Returns:	The logarithmic mean of the confusion table's precision & recall
Return type:	float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.pr_lmean()
0.8282429171492667

pr_qmean()[source]¶

Return quadratic mean of precision & recall.

The quadratic mean of precision and recall is defined as: \(\sqrt{\frac{precision^{2} + recall^{2}}{2}}\)

Cf. https://en.wikipedia.org/wiki/Quadratic_mean

Returns:	The quadratic mean of the confusion table's precision & recall
Return type:	float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.pr_qmean()
0.8290638930598233

pr_seiffert_mean()[source]¶

Return Seiffert's mean of precision & recall.

Seiffert's mean of precision and recall is: \(\frac{precision - recall}{4 \cdot arctan \sqrt{\frac{precision}{recall}} - \pi}\)

It is defined in [Sei93].

Returns:	Seiffert's mean of the confusion table's precision & recall
Return type:	float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.pr_seiffert_mean()
0.8284071696048312

precision()[source]¶

Return precision.

Precision is defined as \(\frac{tp}{tp + fp}\)

AKA positive predictive value (PPV)

Cf. https://en.wikipedia.org/wiki/Precision_and_recall

Cf. https://en.wikipedia.org/wiki/Information_retrieval#Precision

Returns:	The precision of the confusion table
Return type:	float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.precision()
0.8571428571428571

precision_gain()[source]¶

Return gain in precision.

The gain in precision is defined as: \(G(precision) = \frac{precision}{random~ precision}\)

Cf. https://en.wikipedia.org/wiki/Gain_(information_retrieval)

Returns:	The gain in precision of the confusion table
Return type:	float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.precision_gain()
1.3142857142857143

recall()[source]¶

Return recall.

Recall is defined as \(\frac{tp}{tp + fn}\)

AKA sensitivity

AKA true positive rate (TPR)

Cf. https://en.wikipedia.org/wiki/Precision_and_recall

Cf. https://en.wikipedia.org/wiki/Sensitivity_(test)

Cf. https://en.wikipedia.org/wiki/Information_retrieval#Recall

Returns:	The recall of the confusion table
Return type:	float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.recall()
0.8

significance()[source]¶

Return the significance, \(\chi^{2}\).

Significance is defined as: \(\chi^{2} = \frac{(tp \cdot tn - fp \cdot fn)^{2} (tp + tn + fp + fn)} {(tp + fp)(tp + fn)(tn + fp)(tn + fn)}\)

Also: \(\chi^{2} = MCC^{2} \cdot n\), where \(n\) is the total population

Cf. https://en.wikipedia.org/wiki/Pearson%27s_chi-square_test

Returns:	The significance of the confusion table
Return type:	float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.significance()
66.26190476190476
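
The two expressions for \(\chi^{2}\) above can be compared numerically. This is a small worked example using the documented cell counts; the variable names are illustrative and nothing here calls the package itself.

```python
# Cell counts from the documented example, ConfusionTable(120, 60, 20, 30)
tp, tn, fp, fn = 120, 60, 20, 30
n = tp + tn + fp + fn  # total population, 230

denominator = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)

# Direct definition of the significance
chi_sq = (tp * tn - fp * fn) ** 2 * n / denominator

# Equivalent form via the Matthews correlation coefficient
mcc = (tp * tn - fp * fn) / denominator ** 0.5
chi_sq_from_mcc = mcc ** 2 * n
```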

specificity()[source]¶

Return specificity.

Specificity is defined as \(\frac{tn}{tn + fp}\)

AKA true negative rate (TNR)

Cf. https://en.wikipedia.org/wiki/Specificity_(tests)

Returns:	The specificity of the confusion table
Return type:	float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.specificity()
0.75

test_neg_pop()[source]¶

Return test negative population.

Returns:	The test negative population of the confusion table
Return type:	int

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.test_neg_pop()
90

test_pos_pop()[source]¶

Return test positive population.

Returns:	The test positive population of the confusion table
Return type:	int

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.test_pos_pop()
140

to_dict()[source]¶

Cast to dict.

Returns:	The confusion table as a dict
Return type:	dict

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> import pprint
>>> pprint.pprint(ct.to_dict())
{'fn': 30, 'fp': 20, 'tn': 60, 'tp': 120}

to_tuple()[source]¶

Cast to tuple.

Returns:	The confusion table as a 4-tuple (tp, tn, fp, fn)
Return type:	tuple

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.to_tuple()
(120, 60, 20, 30)

true_neg()[source]¶

Return true negatives.

Returns:	The true negatives of the confusion table
Return type:	int

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.true_neg()
60

true_pos()[source]¶

Return true positives.

Returns:	The true positives of the confusion table
Return type:	int

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.true_pos()
120

abydos.stats.amean(nums)[source]¶

Return arithmetic mean.

The arithmetic mean is defined as: \(\frac{\sum{nums}}{|nums|}\)

Cf. https://en.wikipedia.org/wiki/Arithmetic_mean

Parameters:	nums (list) -- A series of numbers
Returns:	The arithmetic mean of nums
Return type:	float

Examples

>>> amean([1, 2, 3, 4])
2.5
>>> amean([1, 2])
1.5
>>> amean([0, 5, 1000])
335.0

abydos.stats.gmean(nums)[source]¶

Return geometric mean.

The geometric mean is defined as: \(\sqrt[|nums|]{\prod\limits_{i} nums_{i}}\)

Cf. https://en.wikipedia.org/wiki/Geometric_mean

Parameters:	nums (list) -- A series of numbers
Returns:	The geometric mean of nums
Return type:	float

Examples

>>> gmean([1, 2, 3, 4])
2.213363839400643
>>> gmean([1, 2])
1.4142135623730951
>>> gmean([0, 5, 1000])
0.0

abydos.stats.hmean(nums)[source]¶

Return harmonic mean.

The harmonic mean is defined as: \(\frac{|nums|}{\sum\limits_{i}\frac{1}{nums_i}}\)

Following the behavior of Wolfram|Alpha:

If exactly one of the values in nums is 0, return 0.

If more than one value in nums is 0, return NaN.

Cf. https://en.wikipedia.org/wiki/Harmonic_mean

Parameters:	nums (list) -- A series of numbers
Returns:	The harmonic mean of nums
Return type:	float
Raises:	`AttributeError` -- hmean requires at least one value

Examples

>>> hmean([1, 2, 3, 4])
1.9200000000000004
>>> hmean([1, 2])
1.3333333333333333
>>> hmean([0, 5, 1000])
0
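
The zero-handling rules above can be sketched as follows. This is an illustration of the documented rules only, not the package's source: hmean_sketch is a hypothetical name, and any further special cases the real function may apply are not modelled.

```python
import math


def hmean_sketch(nums):
    """Harmonic mean with the two documented zero-handling rules (sketch)."""
    if len(nums) < 1:
        # Assumption: mirror the documented exception type and message.
        raise AttributeError('hmean requires at least one value')
    zeros = sum(1 for x in nums if x == 0)
    if zeros == 1:
        return 0          # exactly one zero: the mean collapses to 0
    if zeros > 1:
        return math.nan   # more than one zero: undefined
    return len(nums) / sum(1 / x for x in nums)
```

For example, hmean_sketch([1, 2, 3, 4]) evaluates 4 / (1 + 1/2 + 1/3 + 1/4), which is 1.92 up to floating-point rounding.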

abydos.stats.agmean(nums)[source]¶

Return arithmetic-geometric mean.

Iterates between arithmetic & geometric means until they converge to a single value (rounded to 12 digits).

Cf. https://en.wikipedia.org/wiki/Arithmetic-geometric_mean

Parameters:	nums (list) -- A series of numbers
Returns:	The arithmetic-geometric mean of nums
Return type:	float

Examples

>>> agmean([1, 2, 3, 4])
2.3545004777751077
>>> agmean([1, 2])
1.4567910310469068
>>> agmean([0, 5, 1000])
2.9753977059954195e-13
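
A minimal sketch of the iteration described above, assuming the series is first reduced to its arithmetic and geometric means and the two values are then averaged repeatedly. agmean_sketch is a hypothetical name, math.prod requires Python 3.8 or later, and non-negative inputs are assumed.

```python
import math


def agmean_sketch(nums):
    """Arithmetic-geometric mean by fixed-point iteration (sketch)."""
    a = sum(nums) / len(nums)               # arithmetic mean
    g = math.prod(nums) ** (1 / len(nums))  # geometric mean
    # Replace the pair by its own arithmetic and geometric means until the
    # two agree when rounded to 12 digits.
    while round(a, 12) != round(g, 12):
        a, g = (a + g) / 2, math.sqrt(a * g)
    return a
```

When the series contains a zero, the geometric mean is 0 and stays 0, so the arithmetic side is simply halved until it rounds to 0. This is consistent with the very small non-zero result shown for [0, 5, 1000] above.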

abydos.stats.ghmean(nums)[source]¶

Return geometric-harmonic mean.

Iterates between geometric & harmonic means until they converge to a single value (rounded to 12 digits).

Cf. https://en.wikipedia.org/wiki/Geometric-harmonic_mean

Parameters:	nums (list) -- A series of numbers
Returns:	The geometric-harmonic mean of nums
Return type:	float

Examples

>>> ghmean([1, 2, 3, 4])
2.058868154613003
>>> ghmean([1, 2])
1.3728805006183502
>>> ghmean([0, 5, 1000])
0.0

>>> ghmean([0, 0])
0.0
>>> ghmean([0, 0, 5])
nan

abydos.stats.aghmean(nums)[source]¶

Return arithmetic-geometric-harmonic mean.

Iterates over arithmetic, geometric, & harmonic means until they converge to a single value (rounded to 12 digits), following the method described in [Raissouli:2009].

Parameters:	nums (list) -- A series of numbers
Returns:	The arithmetic-geometric-harmonic mean of nums
Return type:	float

Examples

>>> aghmean([1, 2, 3, 4])
2.198327159900212
>>> aghmean([1, 2])
1.4142135623731884
>>> aghmean([0, 5, 1000])
335.0

abydos.stats.cmean(nums)[source]¶

Return contraharmonic mean.

The contraharmonic mean is: \(\frac{\sum\limits_i x_i^2}{\sum\limits_i x_i}\)

Cf. https://en.wikipedia.org/wiki/Contraharmonic_mean

Parameters:	nums (list) -- A series of numbers
Returns:	The contraharmonic mean of nums
Return type:	float

Examples

>>> cmean([1, 2, 3, 4])
3.0
>>> cmean([1, 2])
1.6666666666666667
>>> cmean([0, 5, 1000])
995.0497512437811

abydos.stats.imean(nums)[source]¶

Return identric (exponential) mean.

The identric mean of two numbers x and y is: x if x = y, otherwise \(\frac{1}{e} \sqrt[x-y]{\frac{x^x}{y^y}}\)

Cf. https://en.wikipedia.org/wiki/Identric_mean

Parameters:	nums (list) -- A series of numbers
Returns:	The identric mean of nums
Return type:	float
Raises:	`AttributeError` -- imean supports no more than two values

Examples

>>> imean([1, 2])
1.4715177646857693
>>> imean([1, 0])
nan
>>> imean([2, 4])
2.9430355293715387

abydos.stats.lmean(nums)[source]¶

Return logarithmic mean.

The logarithmic mean of an arbitrarily long series is defined by http://www.survo.fi/papers/logmean.pdf as: \(L(x_1, x_2, ..., x_n) = (n-1)! \sum\limits_{i=1}^n \frac{x_i} {\prod\limits_{\substack{j = 1\\j \ne i}}^n \ln \frac{x_i}{x_j}}\)

Cf. https://en.wikipedia.org/wiki/Logarithmic_mean

Parameters:	nums (list) -- A series of numbers
Returns:	The logarithmic mean of nums
Return type:	float
Raises:	`AttributeError` -- No two values in the nums list may be equal

Examples

>>> lmean([1, 2, 3, 4])
2.2724242417489258
>>> lmean([1, 2])
1.4426950408889634
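
The general formula above translates fairly directly into code. The sketch below is an illustration under stated assumptions: lmean_sketch is a hypothetical name, all inputs are assumed positive, and the AttributeError for repeated values is included only to mirror the documented behaviour.

```python
import math


def lmean_sketch(nums):
    """Logarithmic mean of a series of distinct positive numbers (sketch)."""
    if len(nums) != len(set(nums)):
        # Equal values would make ln(x_i / x_j) zero in a denominator.
        raise AttributeError('No two values in the nums list may be equal')
    total = 0.0
    for i, x_i in enumerate(nums):
        denominator = 1.0
        for j, x_j in enumerate(nums):
            if j != i:
                denominator *= math.log(x_i / x_j)
        total += x_i / denominator
    return math.factorial(len(nums) - 1) * total
```

For two values the sum reduces to the familiar \(\frac{x - y}{\ln x - \ln y}\); for [1, 2] this is \(\frac{1}{\ln 2}\).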

abydos.stats.qmean(nums)[source]¶

Return quadratic mean.

The quadratic mean is defined as: \(\sqrt{\sum\limits_{i} \frac{nums_i^2}{|nums|}}\)

Cf. https://en.wikipedia.org/wiki/Quadratic_mean

Parameters:	nums (list) -- A series of numbers
Returns:	The quadratic mean of nums
Return type:	float

Examples

>>> qmean([1, 2, 3, 4])
2.7386127875258306
>>> qmean([1, 2])
1.5811388300841898
>>> qmean([0, 5, 1000])
577.3574860228857

abydos.stats.heronian_mean(nums)[source]¶

Return Heronian mean.

The Heronian mean is: \(\frac{\sum\limits_{i, j}\sqrt{{x_i \cdot x_j}}} {|nums| \cdot \frac{|nums| + 1}{2}}\) for \(j \ge i\)

Cf. https://en.wikipedia.org/wiki/Heronian_mean

Parameters:	nums (list) -- A series of numbers
Returns:	The Heronian mean of nums
Return type:	float

Examples

>>> heronian_mean([1, 2, 3, 4])
2.3888282852609093
>>> heronian_mean([1, 2])
1.4714045207910316
>>> heronian_mean([0, 5, 1000])
179.28511301977582
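
The sum over pairs with \(j \ge i\) in the formula above can be written with itertools.combinations_with_replacement, which yields each such pair exactly once. This is a sketch for illustration; heronian_mean_sketch is a hypothetical name and non-negative inputs are assumed.

```python
import math
from itertools import combinations_with_replacement


def heronian_mean_sketch(nums):
    """Heronian mean over all pairs (x_i, x_j) with j >= i (sketch)."""
    n = len(nums)
    pair_sum = sum(
        math.sqrt(x * y) for x, y in combinations_with_replacement(nums, 2)
    )
    # There are n * (n + 1) / 2 such pairs, including each value with itself.
    return pair_sum / (n * (n + 1) / 2)
```

For [1, 2] the pairs are (1, 1), (1, 2), and (2, 2), giving \(\frac{1 + \sqrt{2} + 2}{3}\), which is the two-value form \(\frac{x + \sqrt{x \cdot y} + y}{3}\).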

abydos.stats.hoelder_mean(nums, exp=2)[source]¶

Return Hölder (power/generalized) mean.

The Hölder mean is defined as: \(\sqrt[p]{\frac{1}{|nums|} \cdot \sum\limits_i{x_i^p}}\) for \(p \ne 0\), and the geometric mean for \(p = 0\), where \(p\) is the exponent exp and the \(x_i\) are the members of nums

Cf. https://en.wikipedia.org/wiki/Generalized_mean

Parameters:	nums (list) -- A series of numbers
	exp (numeric) -- The exponent of the Hölder mean
Returns:	The Hölder mean of nums for the given exponent
Return type:	float

Examples

>>> hoelder_mean([1, 2, 3, 4])
2.7386127875258306
>>> hoelder_mean([1, 2])
1.5811388300841898
>>> hoelder_mean([0, 5, 1000])
577.3574860228857
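
The definition above, including the special case for an exponent of 0, can be sketched in a few lines. hoelder_mean_sketch is a hypothetical name, math.prod requires Python 3.8 or later, and positive inputs are assumed for non-positive exponents.

```python
import math


def hoelder_mean_sketch(nums, exp=2):
    """Hölder (power) mean; falls back to the geometric mean for exp == 0."""
    if exp == 0:
        return math.prod(nums) ** (1 / len(nums))
    return (sum(x ** exp for x in nums) / len(nums)) ** (1 / exp)


quadratic = hoelder_mean_sketch([1, 2, 3, 4])           # exp=2: quadratic mean
arithmetic = hoelder_mean_sketch([1, 2, 3, 4], exp=1)   # exp=1: arithmetic mean
geometric = hoelder_mean_sketch([1, 2, 3, 4], exp=0)    # exp=0: geometric mean
harmonic = hoelder_mean_sketch([1, 2, 3, 4], exp=-1)    # exp=-1: harmonic mean
```

The default exponent of 2 explains why the examples above match those given for qmean().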

abydos.stats.lehmer_mean(nums, exp=2)[source]¶

Return Lehmer mean.

The Lehmer mean is: \(\frac{\sum\limits_i{x_i^p}}{\sum\limits_i{x_i^{p-1}}}\)

Cf. https://en.wikipedia.org/wiki/Lehmer_mean

Parameters:	nums (list) -- A series of numbers
	exp (numeric) -- The exponent of the Lehmer mean
Returns:	The Lehmer mean of nums for the given exponent
Return type:	float

Examples

>>> lehmer_mean([1, 2, 3, 4])
3.0
>>> lehmer_mean([1, 2])
1.6666666666666667
>>> lehmer_mean([0, 5, 1000])
995.0497512437811

abydos.stats.seiffert_mean(nums)[source]¶

Return Seiffert's mean.

Seiffert's mean of two numbers x and y is: \(\frac{x - y}{4 \cdot arctan \sqrt{\frac{x}{y}} - \pi}\)

It is defined in [Sei93].

Parameters:	nums (list) -- A series of numbers
Returns:	Seiffert's mean of nums
Return type:	float
Raises:	`AttributeError` -- seiffert_mean supports no more than two values

Examples

>>> seiffert_mean([1, 2])
1.4712939827611637
>>> seiffert_mean([1, 0])
0.3183098861837907
>>> seiffert_mean([2, 4])
2.9425879655223275
>>> seiffert_mean([2, 1000])
336.84053300118825

abydos.stats.median(nums)[source]¶

Return median.

With numbers sorted by value, the median is the middle value (if there is an odd number of values) or the arithmetic mean of the two middle values (if there is an even number of values).

Cf. https://en.wikipedia.org/wiki/Median

Parameters:	nums (list) -- A series of numbers
Returns:	The median of nums
Return type:	int or float

Examples

>>> median([1, 2, 3])
2
>>> median([1, 2, 3, 4])
2.5
>>> median([1, 2, 2, 4])
2

abydos.stats.midrange(nums)[source]¶

Return midrange.

The midrange is the arithmetic mean of the maximum & minimum of a series.

Cf. https://en.wikipedia.org/wiki/Midrange

Parameters:	nums (list) -- A series of numbers
Returns:	The midrange of nums
Return type:	float

Examples

>>> midrange([1, 2, 3])
2.0
>>> midrange([1, 2, 2, 3])
2.0
>>> midrange([1, 2, 1000, 3])
500.5

abydos.stats.mode(nums)[source]¶

Return the mode.

The mode of a series is the most common element of that series

Cf. https://en.wikipedia.org/wiki/Mode_(statistics)

Parameters:	nums (list) -- A series of numbers
Returns:	The mode of nums
Return type:	int or float

Example

>>> mode([1, 2, 2, 3])
2

abydos.stats.std(nums, mean_func=<function amean>, ddof=0)[source]¶

Return the standard deviation.

The standard deviation of a series of values is the square root of the variance.

Cf. https://en.wikipedia.org/wiki/Standard_deviation

Parameters:	nums (list) -- A series of numbers
	mean_func (function) -- A mean function (amean by default)
	ddof (int) -- The delta degrees of freedom; the divisor used is N - ddof (0 by default)
Returns:	The standard deviation of the values in the series
Return type:	float

Examples

>>> std([1, 1, 1, 1])
0.0
>>> round(std([1, 2, 3, 4]), 12)
1.11803398875
>>> round(std([1, 2, 3, 4], ddof=1), 12)
1.290994448736

abydos.stats.var(nums, mean_func=<function amean>, ddof=0)[source]¶

Calculate the variance.

The variance (\(\sigma^2\)) of a series of numbers (\(x_i\)) with mean \(\mu\) and population \(N\) is:

\(\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2\).

Cf. https://en.wikipedia.org/wiki/Variance

Parameters:	nums (list) -- A series of numbers
	mean_func (function) -- A mean function (amean by default)
	ddof (int) -- The delta degrees of freedom; the divisor used is N - ddof (0 by default)
Returns:	The variance of the values in the series
Return type:	float

Examples

>>> var([1, 1, 1, 1])
0.0
>>> var([1, 2, 3, 4])
1.25
>>> round(var([1, 2, 3, 4], ddof=1), 12)
1.666666666667
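
The examples above are consistent with ddof being subtracted from the population size in the divisor, so that ddof=0 gives the population variance and ddof=1 the sample variance. The sketch below illustrates that reading with the arithmetic mean fixed as the mean function; var_sketch and std_sketch are hypothetical names.

```python
def var_sketch(nums, ddof=0):
    """Variance with divisor N - ddof, using the arithmetic mean (sketch)."""
    mean = sum(nums) / len(nums)
    return sum((x - mean) ** 2 for x in nums) / (len(nums) - ddof)


def std_sketch(nums, ddof=0):
    """Standard deviation as the square root of the variance (sketch)."""
    return var_sketch(nums, ddof) ** 0.5
```

For [1, 2, 3, 4] the squared deviations from the mean 2.5 sum to 5, giving 5/4 = 1.25 with ddof=0 and 5/3 with ddof=1.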

abydos.stats.mean_pairwise_similarity(collection, metric=<function sim>, mean_func=<function hmean>, symmetric=False)[source]¶

Calculate the mean pairwise similarity of a collection of strings.

Takes the mean of the pairwise similarity between each member of a collection, optionally in both directions (for asymmetric similarity metrics).

Parameters:	collection (list) -- A collection of terms or a string that can be split
	metric (function) -- A similarity metric function
	mean_func (function) -- A mean function that takes a list of values and returns a float
	symmetric (bool) -- Set to True if all pairwise similarities should be calculated in both directions
Returns:	The mean pairwise similarity of a collection of strings
Return type:	float
Raises:	`ValueError` -- mean_func must be a function
	`ValueError` -- metric must be a function
	`ValueError` -- collection is neither a string nor iterable type
	`ValueError` -- collection has fewer than two members

Examples

>>> round(mean_pairwise_similarity(['Christopher', 'Kristof',
... 'Christobal']), 12)
0.519801980198
>>> round(mean_pairwise_similarity(['Niall', 'Neal', 'Neil']), 12)
0.545454545455

abydos.stats.pairwise_similarity_statistics(src_collection, tar_collection, metric=<function sim>, mean_func=<function amean>, symmetric=False)[source]¶

Calculate the pairwise similarity statistics of two collections of strings.

Calculate pairwise similarities among members of two collections, returning the maximum, minimum, mean (according to a supplied function, arithmetic mean, by default), and (population) standard deviation of those similarities.

Parameters:	src_collection (list) -- A collection of terms or a string that can be split
	tar_collection (list) -- A collection of terms or a string that can be split
	metric (function) -- A similarity metric function
	mean_func (function) -- A mean function that takes a list of values and returns a float
	symmetric (bool) -- Set to True if all pairwise similarities should be calculated in both directions
Returns:	The max, min, mean, and standard deviation of similarities
Return type:	tuple
Raises:	`ValueError` -- mean_func must be a function
	`ValueError` -- metric must be a function
	`ValueError` -- src_collection is neither a string nor iterable
	`ValueError` -- tar_collection is neither a string nor iterable

Example

>>> tuple(round(_, 12) for _ in pairwise_similarity_statistics(
... ['Christopher', 'Kristof', 'Christobal'], ['Niall', 'Neal', 'Neil']))
(0.2, 0.0, 0.118614718615, 0.075070477184)
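
The statistics described above can be sketched as follows, assuming every member of the source collection is compared with every member of the target collection. The names are hypothetical, ratio_sim is a difflib-based stand-in rather than the default metric, and the arithmetic mean is fixed for brevity, so results will not match the example above.

```python
from difflib import SequenceMatcher
from itertools import product


def ratio_sim(src, tar):
    """Stand-in similarity in [0, 1]; NOT the package's default metric."""
    return SequenceMatcher(None, src, tar).ratio()


def pairwise_similarity_statistics_sketch(src_collection, tar_collection,
                                          metric=ratio_sim, symmetric=False):
    """Return (max, min, mean, population std) of cross-collection scores."""
    if isinstance(src_collection, str):
        src_collection = src_collection.split()
    if isinstance(tar_collection, str):
        tar_collection = tar_collection.split()
    scores = []
    for src, tar in product(src_collection, tar_collection):
        scores.append(metric(src, tar))
        if symmetric:
            scores.append(metric(tar, src))
    mean = sum(scores) / len(scores)
    # Population standard deviation: divisor is the number of scores.
    std = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5
    return max(scores), min(scores), mean, std


# Usage with a trivial exact-match metric, so the numbers are easy to follow:
# scores are [1.0, 0.0, 0.0, 0.0] for the pairs (a,a), (a,c), (b,a), (b,c).
stats = pairwise_similarity_statistics_sketch(
    ['a', 'b'], ['a', 'c'], metric=lambda s, t: 1.0 if s == t else 0.0
)
```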