abydos.stats package

The stats module defines functions for calculating various statistical data about linguistic objects.

Functions are provided for calculating the following means:

  • arithmetic mean (amean())

  • geometric mean (gmean())

  • harmonic mean (hmean())

  • quadratic mean (qmean())

  • contraharmonic mean (cmean())

  • logarithmic mean (lmean())

  • identric (exponential) mean (imean())

  • Seiffert's mean (seiffert_mean())

  • Lehmer mean (lehmer_mean())

  • Heronian mean (heronian_mean())

  • Hölder (power/generalized) mean (hoelder_mean())

  • arithmetic-geometric mean (agmean())

  • geometric-harmonic mean (ghmean())

  • arithmetic-geometric-harmonic mean (aghmean())

And for calculating:

  • midrange (midrange())

  • median (median())

  • mode (mode())

  • variance (var())

  • standard deviation (std())

Some examples of the basic functions:

>>> nums = [16, 49, 55, 49, 6, 40, 23, 47, 29, 85, 76, 20]
>>> amean(nums)
41.25
>>> aghmean(nums)
32.42167170892585
>>> heronian_mean(nums)
37.931508950381925
>>> mode(nums)
49
>>> std(nums)
22.876935255113754

Two pairwise functions are provided:

  • mean pairwise similarity (mean_pairwise_similarity()), which returns the mean similarity (using a supplied similarity function) between each pair of members of a collection

  • pairwise similarity statistics (pairwise_similarity_statistics()), which returns the max, min, mean, and standard deviation of pairwise similarities between two collections
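
For example (these doctests are repeated from the documentation of the two functions further below; both use Levenshtein similarity by default):

>>> round(mean_pairwise_similarity(['Christopher', 'Kristof',
... 'Christobal']), 12)
0.519801980198
>>> tuple(round(_, 12) for _ in pairwise_similarity_statistics(
... ['Christopher', 'Kristof', 'Christobal'], ['Niall', 'Neal', 'Neil']))
(0.2, 0.0, 0.118614718615, 0.075070477184)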

The confusion table class (ConfusionTable) can be constructed in a number of ways:

  • four values, representing true positives, true negatives, false positives, and false negatives, can be passed to the constructor

  • a list or tuple with four values, representing true positives, true negatives, false positives, and false negatives, can be passed to the constructor

  • a dict with keys 'tp', 'tn', 'fp', 'fn', each assigned to the values for true positives, true negatives, false positives, and false negatives, can be passed to the constructor
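
For example, all three forms produce equal tables (as in the class's own examples below):

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct == ConfusionTable((120, 60, 20, 30))
True
>>> ct == ConfusionTable({'tp': 120, 'tn': 60, 'fp': 20, 'fn': 30})
True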

The ConfusionTable class provides methods for calculating a variety of statistics from the table's values, for example:

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.f1_score()
0.8275862068965518
>>> ct.mcc()
0.5367450401216932
>>> ct.specificity()
0.75
>>> ct.significance()
66.26190476190476

The ConfusionTable class also supports checking for equality with another ConfusionTable and casting to string with str():

>>> (ConfusionTable({'tp':120, 'tn':60, 'fp':20, 'fn':30}) ==
... ConfusionTable(120, 60, 20, 30))
True
>>> str(ConfusionTable(120, 60, 20, 30))
'tp:120, tn:60, fp:20, fn:30'

class abydos.stats.ConfusionTable(tp=0, tn=0, fp=0, fn=0)[source]

Bases: object

ConfusionTable object.

This object is initialized by passing either four integers (or a tuple of four integers) representing the cells of a confusion table: true positives, true negatives, false positives, and false negatives.

The object possesses methods for the calculation of various statistics based on the confusion table.

Initialize ConfusionTable.

Parameters
  • tp (int or a tuple, list, or dict) -- True positives; if a tuple or list is supplied, it must include 4 values in the order [tp, tn, fp, fn]. If a dict is supplied, it must have 4 keys, namely 'tp', 'tn', 'fp', & 'fn'.

  • tn (int) -- True negatives

  • fp (int) -- False positives

  • fn (int) -- False negatives

Raises

AttributeError -- ConfusionTable requires a 4-tuple when being created from a tuple.

Examples

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct == ConfusionTable((120, 60, 20, 30))
True
>>> ct == ConfusionTable([120, 60, 20, 30])
True
>>> ct == ConfusionTable({'tp': 120, 'tn': 60, 'fp': 20, 'fn': 30})
True

New in version 0.1.0.

accuracy()[source]

Return accuracy.

Accuracy is defined as

\[\frac{tp + tn}{population}\]

Cf. https://en.wikipedia.org/wiki/Accuracy

Returns

The accuracy of the confusion table

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.accuracy()
0.782608695652174

New in version 0.1.0.

accuracy_gain()[source]

Return gain in accuracy.

The gain in accuracy is defined as

\[G(accuracy) = \frac{accuracy}{random~ accuracy}\]

Cf. https://en.wikipedia.org/wiki/Gain_(information_retrieval)

Returns

The gain in accuracy of the confusion table

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.accuracy_gain()
1.4325259515570934

New in version 0.1.0.

actual_entropy()[source]

Return the actual entropy.

Implementation based on https://github.com/Magnetic/proficiency-metric

Returns

The actual entropy of the confusion table

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.actual_entropy()
0.6460905050608101

New in version 0.4.0.

balanced_accuracy()[source]

Return balanced accuracy.

Balanced accuracy is defined as

\[\frac{sensitivity + specificity}{2}\]

Cf. https://en.wikipedia.org/wiki/Accuracy

Returns

The balanced accuracy of the confusion table

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.balanced_accuracy()
0.775

New in version 0.1.0.

cond_neg_pop()[source]

Return condition negative population.

Returns

The condition negative population of the confusion table

Return type

int

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.cond_neg_pop()
80

New in version 0.1.0.

cond_pos_pop()[source]

Return condition positive population.

Returns

The condition positive population of the confusion table

Return type

int

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.cond_pos_pop()
150

New in version 0.1.0.

correct_pop()[source]

Return correct population.

Returns

The correct population of the confusion table

Return type

int

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.correct_pop()
180

New in version 0.1.0.

d_measure()[source]

Return D-measure.

\(D\)-measure is defined as

\[1-\frac{1}{\frac{1}{precision}+\frac{1}{recall}-1}\]

Returns

The \(D\)-measure of the confusion table

Return type

float

Examples

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.d_measure()
0.2941176470588237

New in version 0.4.0.

dependency()[source]

Return dependency.

Implementation based on https://github.com/Magnetic/proficiency-metric

Returns

The dependency of the confusion table

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.dependency()
0.12618094145262454

New in version 0.4.0.

diagnostic_odds_ratio()[source]

Return diagnostic odds ratio.

Diagnostic odds ratio is defined as

\[\frac{tp \cdot tn}{fp \cdot fn}\]

Cf. https://en.wikipedia.org/wiki/Diagnostic_odds_ratio

Returns

The diagnostic odds ratio of the confusion table

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.diagnostic_odds_ratio()
12.0

New in version 0.4.0.

e_score(beta=1.0)[source]

Return \(E\)-score.

This is Van Rijsbergen's effectiveness measure: \(E=1-F_{\beta}\).

Cf. https://en.wikipedia.org/wiki/Information_retrieval#F-measure

Parameters

beta (float) -- The \(\beta\) parameter in the above formula

Returns

The \(E\)-score of the confusion table

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.e_score()
0.17241379310344818

New in version 0.1.0.

error_pop()[source]

Return error population.

Returns

The error population of the confusion table

Return type

int

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.error_pop()
50

New in version 0.1.0.

error_rate()[source]

Return error rate.

Error rate is defined as

\[\frac{fp + fn}{population}\]

Returns

The error rate of the confusion table

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.error_rate()
0.21739130434782608

New in version 0.4.0.

f1_score()[source]

Return \(F_{1}\) score.

\(F_{1}\) score is the harmonic mean of precision and recall

\[2 \cdot \frac{precision \cdot recall}{precision + recall}\]

Cf. https://en.wikipedia.org/wiki/F1_score

Returns

The \(F_{1}\) of the confusion table

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.f1_score()
0.8275862068965518

New in version 0.1.0.

f2_score()[source]

Return \(F_{2}\).

The \(F_{2}\) score emphasizes recall over precision in comparison to the \(F_{1}\) score

Cf. https://en.wikipedia.org/wiki/F1_score

Returns

The \(F_{2}\) of the confusion table

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.f2_score()
0.8108108108108109

New in version 0.1.0.

f_measure()[source]

Return \(F\)-measure.

\(F\)-measure is the harmonic mean of precision and recall

\[2 \cdot \frac{precision \cdot recall}{precision + recall}\]

Cf. https://en.wikipedia.org/wiki/F1_score

Returns

The \(F\)-measure of the confusion table

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.f_measure()
0.8275862068965516

New in version 0.1.0.

Deprecated since version 0.4.0: This will be removed in 0.6.0. Use the ConfusionTable.pr_hmean method instead.

fallout()[source]

Return fall-out.

Fall-out is defined as

\[\frac{fp}{fp + tn}\]

AKA false positive rate (FPR)

Cf. https://en.wikipedia.org/wiki/Information_retrieval#Fall-out

Returns

The fall-out of the confusion table

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.fallout()
0.25

New in version 0.1.0.

false_neg()[source]

Return false negatives.

AKA Type II error

Returns

The false negatives of the confusion table

Return type

int

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.false_neg()
30

New in version 0.1.0.

false_omission_rate()[source]

Return false omission rate (FOR).

FOR is defined as

\[\frac{fn}{tn + fn}\]

Cf. https://en.wikipedia.org/wiki/False_omission_rate

Returns

The false omission rate of the confusion table

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.false_omission_rate()
0.3333333333333333

New in version 0.4.0.

false_pos()[source]

Return false positives.

AKA Type I error

Returns

The false positives of the confusion table

Return type

int

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.false_pos()
20

New in version 0.1.0.

fbeta_score(beta=1.0)[source]

Return \(F_{\beta}\) score.

\(F_{\beta}\) for a positive real value \(\beta\) "measures the effectiveness of retrieval with respect to a user who attaches \(\beta\) times as much importance to recall as precision" (van Rijsbergen 1979)

\(F_{\beta}\) score is defined as

\[(1 + \beta^2) \cdot \frac{precision \cdot recall} {((\beta^2 \cdot precision) + recall)}\]

Cf. https://en.wikipedia.org/wiki/F1_score

Parameters

beta (float) -- The \(\beta\) parameter in the above formula

Returns

The \(F_{\beta}\) of the confusion table

Return type

float

Raises

AttributeError -- Beta must be a positive real value

Examples

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.fbeta_score()
0.8275862068965518
>>> ct.fbeta_score(beta=0.1)
0.8565371024734982
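
Assuming that f2_score() and fhalf_score() correspond to \(\beta = 2\) and \(\beta = 0.5\) respectively (as their descriptions below suggest), the following rounded comparisons should hold:

>>> round(ct.fbeta_score(beta=2.0), 8) == round(ct.f2_score(), 8)
True
>>> round(ct.fbeta_score(beta=0.5), 8) == round(ct.fhalf_score(), 8)
True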

New in version 0.1.0.

fdr()[source]

Return false discovery rate (FDR).

False discovery rate is defined as

\[\frac{fp}{fp + tp}\]

Cf. https://en.wikipedia.org/wiki/False_discovery_rate

Returns

The false discovery rate of the confusion table

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.fdr()
0.14285714285714285

New in version 0.1.0.

fhalf_score()[source]

Return \(F_{0.5}\) score.

The \(F_{0.5}\) score emphasizes precision over recall in comparison to the \(F_{1}\) score

Cf. https://en.wikipedia.org/wiki/F1_score

Returns

The \(F_{0.5}\) score of the confusion table

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.fhalf_score()
0.8450704225352114

New in version 0.1.0.

fnr()[source]

Return false negative rate.

False negative rate is defined as

\[\frac{fn}{tp + fn}\]

AKA miss rate

Cf. https://en.wikipedia.org/wiki/False_negative_rate

Returns

The false negative rate of the confusion table

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> round(ct.fnr(), 8)
0.2

New in version 0.4.0.

g_measure()[source]

Return G-measure.

\(G\)-measure is the geometric mean of precision and recall:

\[\sqrt{precision \cdot recall}\]

This is identical to the Fowlkes–Mallows (FM) index for two clusters.

Cf. https://en.wikipedia.org/wiki/F1_score#G-measure

Cf. https://en.wikipedia.org/wiki/Fowlkes%E2%80%93Mallows_index

Returns

The \(G\)-measure of the confusion table

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.g_measure()
0.828078671210825

New in version 0.1.0.

Deprecated since version 0.4.0: This will be removed in 0.6.0. Use the ConfusionTable.pr_gmean method instead.

igr()[source]

Return information gain ratio.

Implementation based on https://github.com/Magnetic/proficiency-metric

Cf. https://en.wikipedia.org/wiki/Information_gain_ratio

Returns

The information gain ratio of the confusion table

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.igr()
0.22019657299448012

New in version 0.4.0.

informedness()[source]

Return informedness.

Informedness is defined as

\[sensitivity + specificity - 1\]

AKA Youden's J statistic ([You50])

AKA DeltaP'

Cf. https://en.wikipedia.org/wiki/Youden%27s_J_statistic

Returns

The informedness of the confusion table

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.informedness()
0.55

New in version 0.1.0.

jaccard()[source]

Return Jaccard index.

The Jaccard index of a confusion table is

\[\frac{tp}{tp+fp+fn}\]

Returns

The Jaccard index of the confusion table

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.jaccard()
0.7058823529411765

New in version 0.4.0.

joint_entropy()[source]

Return the joint entropy.

Implementation based on https://github.com/Magnetic/proficiency-metric

Returns

The joint entropy of the confusion table

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.joint_entropy()
1.1680347446270396

New in version 0.4.0.

kappa_statistic()[source]

Return κ statistic.

The κ statistic is defined as

\[\kappa = \frac{accuracy - random~ accuracy} {1 - random~ accuracy}\]

The κ statistic compares the performance of the classifier relative to the performance of a random classifier. \(\kappa\) = 0 indicates performance identical to random. \(\kappa\) = 1 indicates perfect predictive success. \(\kappa\) = -1 indicates perfect predictive failure.

Returns

The κ statistic of the confusion table

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.kappa_statistic()
0.5344129554655871

New in version 0.1.0.

lift()[source]

Return lift.

Implementation based on https://github.com/Magnetic/proficiency-metric

Returns

The lift of the confusion table

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.lift()
1.3142857142857143

New in version 0.4.0.

markedness()[source]

Return markedness.

Markedness is defined as

\[precision + npv - 1\]

AKA DeltaP

Returns

The markedness of the confusion table

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.markedness()
0.5238095238095237

New in version 0.1.0.

mcc()[source]

Return Matthews correlation coefficient (MCC).

The Matthews correlation coefficient is defined in [Mat75] as:

\[\frac{(tp \cdot tn) - (fp \cdot fn)} {\sqrt{(tp + fp)(tp + fn)(tn + fp)(tn + fn)}}\]

This is equivalent to the geometric mean of informedness and markedness, defined above.

Cf. https://en.wikipedia.org/wiki/Matthews_correlation_coefficient

Returns

The Matthews correlation coefficient of the confusion table

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.mcc()
0.5367450401216932
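
As a quick numerical check of the equivalence noted above, the geometric mean of the values reported by informedness() and markedness() should reproduce this result (rounded to avoid floating-point noise):

>>> round((ct.informedness() * ct.markedness()) ** 0.5, 8)
0.53674504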

New in version 0.1.0.

mutual_information()[source]

Return the mutual information.

Implementation based on https://github.com/Magnetic/proficiency-metric

Cf. https://en.wikipedia.org/wiki/Mutual_information

Returns

The mutual information of the confusion table

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.mutual_information()
0.14738372372641576

New in version 0.4.0.

neg_likelihood_ratio()[source]

Return negative likelihood ratio.

Negative likelihood ratio is defined as

\[\frac{1-recall}{specificity}\]

Cf. https://en.wikipedia.org/wiki/Likelihood_ratios_in_diagnostic_testing

Returns

The negative likelihood ratio of the confusion table

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.neg_likelihood_ratio()
0.2666666666666666

New in version 0.4.0.

npv()[source]

Return negative predictive value (NPV).

NPV is defined as

\[\frac{tn}{tn + fn}\]

AKA inverse precision

Cf. https://en.wikipedia.org/wiki/Negative_predictive_value

Returns

The negative predictive value of the confusion table

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.npv()
0.6666666666666666

New in version 0.1.0.

phi_coefficient()[source]

Return φ coefficient.

The \(\phi\) coefficient is defined as

\[\phi = \frac{tp \cdot tn - fp \cdot fn} {\sqrt{(tp + fp) \cdot (tp + fn) \cdot (tn + fp) \cdot (tn + fn)}}\]

Returns

The φ coefficient of the confusion table

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.phi_coefficient()
0.5367450401216932

New in version 0.4.0.

population()[source]

Return population, N.

Returns

The population (N) of the confusion table

Return type

int

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.population()
230

New in version 0.1.0.

pos_likelihood_ratio()[source]

Return positive likelihood ratio.

Positive likelihood ratio is defined as

\[\frac{recall}{1-specificity}\]

Cf. https://en.wikipedia.org/wiki/Likelihood_ratios_in_diagnostic_testing

Returns

The positive likelihood ratio of the confusion table

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.pos_likelihood_ratio()
3.2

New in version 0.4.0.

pr_aghmean()[source]

Return arithmetic-geometric-harmonic mean of precision & recall.

Iterates over arithmetic, geometric, & harmonic means until they converge to a single value (rounded to 12 digits), following the method described in [RaissouliLC09].

Returns

The arithmetic-geometric-harmonic mean of the confusion table's precision & recall

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.pr_aghmean()
0.8280786712108288

New in version 0.1.0.

pr_agmean()[source]

Return arithmetic-geometric mean of precision & recall.

Iterates between arithmetic & geometric means until they converge to a single value (rounded to 12 digits)

Cf. https://en.wikipedia.org/wiki/Arithmetic-geometric_mean

Returns

The arithmetic-geometric mean of the confusion table's precision & recall

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.pr_agmean()
0.8283250315702829

New in version 0.1.0.

pr_amean()[source]

Return arithmetic mean of precision & recall.

The arithmetic mean of precision and recall is defined as

\[\frac{precision + recall}{2}\]

Cf. https://en.wikipedia.org/wiki/Arithmetic_mean

Returns

The arithmetic mean of the confusion table's precision & recall

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.pr_amean()
0.8285714285714285
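
This should agree with the mean computed directly from precision() and recall() (a quick check, rounded to avoid floating-point noise):

>>> round((ct.precision() + ct.recall()) / 2, 8) == round(ct.pr_amean(), 8)
True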

New in version 0.1.0.

pr_cmean()[source]

Return contraharmonic mean of precision & recall.

The contraharmonic mean is

\[\frac{precision^{2} + recall^{2}}{precision + recall}\]

Cf. https://en.wikipedia.org/wiki/Contraharmonic_mean

Returns

The contraharmonic mean of the confusion table's precision & recall

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.pr_cmean()
0.8295566502463055

New in version 0.1.0.

pr_ghmean()[source]

Return geometric-harmonic mean of precision & recall.

Iterates between geometric & harmonic means until they converge to a single value (rounded to 12 digits)

Cf. https://en.wikipedia.org/wiki/Geometric-harmonic_mean

Returns

The geometric-harmonic mean of the confusion table's precision & recall

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.pr_ghmean()
0.8278323841238441

New in version 0.1.0.

pr_gmean()[source]

Return geometric mean of precision & recall.

The geometric mean of precision and recall is defined as:

\[\sqrt{precision \cdot recall}\]

Cf. https://en.wikipedia.org/wiki/Geometric_mean

Returns

The geometric mean of the confusion table's precision & recall

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.pr_gmean()
0.828078671210825

New in version 0.1.0.

pr_heronian_mean()[source]

Return Heronian mean of precision & recall.

The Heronian mean of precision and recall is defined as

\[\frac{precision + \sqrt{precision \cdot recall} + recall}{3}\]

Cf. https://en.wikipedia.org/wiki/Heronian_mean

Returns

The Heronian mean of the confusion table's precision & recall

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.pr_heronian_mean()
0.8284071761178939

New in version 0.1.0.

pr_hmean()[source]

Return harmonic mean of precision & recall.

The harmonic mean of precision and recall is defined as

\[\frac{2 \cdot precision \cdot recall}{precision + recall}\]

Cf. https://en.wikipedia.org/wiki/Harmonic_mean

Returns

The harmonic mean of the confusion table's precision & recall

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.pr_hmean()
0.8275862068965516

New in version 0.1.0.

pr_hoelder_mean(exp=2)[source]

Return Hölder (power/generalized) mean of precision & recall.

The power mean of precision and recall is defined as

\[\sqrt[exp]{\frac{precision^{exp} + recall^{exp}}{2}}\]

for \(exp \ne 0\), and the geometric mean for \(exp = 0\)

Cf. https://en.wikipedia.org/wiki/Generalized_mean

Parameters

exp (float) -- The exponent of the Hölder mean

Returns

The Hölder mean for the given exponent of the confusion table's precision & recall

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.pr_hoelder_mean()
0.8290638930598233
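
With the default exponent of 2, this is the quadratic mean of precision & recall, so it should match pr_qmean() (a quick rounded check):

>>> round(ct.pr_hoelder_mean(exp=2), 8) == round(ct.pr_qmean(), 8)
True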

New in version 0.1.0.

pr_imean()[source]

Return identric (exponential) mean of precision & recall.

The identric mean is: precision if precision = recall, otherwise

\[\frac{1}{e} \cdot \sqrt[precision - recall]{\frac{precision^{precision}} {recall^{recall}}}\]

Cf. https://en.wikipedia.org/wiki/Identric_mean

Returns

The identric mean of the confusion table's precision & recall

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.pr_imean()
0.8284071826325543

New in version 0.1.0.

pr_lehmer_mean(exp=2.0)[source]

Return Lehmer mean of precision & recall.

The Lehmer mean is

\[\frac{precision^{exp} + recall^{exp}} {precision^{exp-1} + recall^{exp-1}}\]

Cf. https://en.wikipedia.org/wiki/Lehmer_mean

Parameters

exp (float) -- The exponent of the Lehmer mean

Returns

The Lehmer mean for the given exponent of the confusion table's precision & recall

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.pr_lehmer_mean()
0.8295566502463055

New in version 0.1.0.

pr_lmean()[source]

Return logarithmic mean of precision & recall.

The logarithmic mean is: 0 if either precision or recall is 0, the precision if they are equal, otherwise

\[\frac{precision - recall} {ln(precision) - ln(recall)}\]

Cf. https://en.wikipedia.org/wiki/Logarithmic_mean

Returns

The logarithmic mean of the confusion table's precision & recall

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.pr_lmean()
0.8282429171492667

New in version 0.1.0.

pr_qmean()[source]

Return quadratic mean of precision & recall.

The quadratic mean of precision and recall is defined as

\[\sqrt{\frac{precision^{2} + recall^{2}}{2}}\]

Cf. https://en.wikipedia.org/wiki/Quadratic_mean

Returns

The quadratic mean of the confusion table's precision & recall

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.pr_qmean()
0.8290638930598233

New in version 0.1.0.

pr_seiffert_mean()[source]

Return Seiffert's mean of precision & recall.

Seiffert's mean of precision and recall is

\[\frac{precision - recall}{4 \cdot arctan \sqrt{\frac{precision}{recall}} - \pi}\]

It is defined in [Sei93].

Returns

Seiffert's mean of the confusion table's precision & recall

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.pr_seiffert_mean()
0.8284071696048312

New in version 0.1.0.

precision()[source]

Return precision.

Precision is defined as

\[\frac{tp}{tp + fp}\]

AKA positive predictive value (PPV)

Cf. https://en.wikipedia.org/wiki/Precision_and_recall

Cf. https://en.wikipedia.org/wiki/Information_retrieval#Precision

Returns

The precision of the confusion table

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.precision()
0.8571428571428571

New in version 0.1.0.

precision_gain()[source]

Return gain in precision.

The gain in precision is defined as

\[G(precision) = \frac{precision}{random~ precision}\]

Cf. https://en.wikipedia.org/wiki/Gain_(information_retrieval)

Returns

The gain in precision of the confusion table

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.precision_gain()
1.3142857142857143

New in version 0.1.0.

pred_neg_pop()[source]

Return predicted negative population.

Returns

The predicted negative population of the confusion table

Return type

int

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.pred_neg_pop()
90

New in version 0.1.0.

Changed in version 0.4.0: renamed from test_neg_pop

pred_pos_pop()[source]

Return predicted positive population.

Returns

The predicted positive population of the confusion table

Return type

int

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.pred_pos_pop()
140

New in version 0.1.0.

Changed in version 0.4.0: renamed from test_pos_pop

predicted_entropy()[source]

Return the predicted entropy.

Implementation based on https://github.com/Magnetic/proficiency-metric

Returns

The predicted entropy of the confusion table

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.predicted_entropy()
0.6693279632926457

New in version 0.4.0.

prevalence()[source]

Return prevalence.

Prevalence is defined as

\[\frac{condition~ positive}{population}\]

Cf. https://en.wikipedia.org/wiki/Prevalence

Returns

The prevalence of the confusion table

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.prevalence()
0.6521739130434783

New in version 0.4.0.

proficiency()[source]

Return the proficiency.

Implementation based on https://github.com/Magnetic/proficiency-metric [SLaclavik15]

AKA uncertainty coefficient

Cf. https://en.wikipedia.org/wiki/Uncertainty_coefficient

Returns

The proficiency of the confusion table

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.proficiency()
0.228116219897929

New in version 0.4.0.

recall()[source]

Return recall.

Recall is defined as

\[\frac{tp}{tp + fn}\]

AKA sensitivity

AKA true positive rate (TPR)

Cf. https://en.wikipedia.org/wiki/Precision_and_recall

Cf. https://en.wikipedia.org/wiki/Sensitivity_(test)

Cf. https://en.wikipedia.org/wiki/Information_retrieval#Recall

Returns

The recall of the confusion table

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.recall()
0.8

New in version 0.1.0.

significance()[source]

Return the significance, \(\chi^{2}\).

Significance is defined as

\[\chi^{2} = \frac{(tp \cdot tn - fp \cdot fn)^{2} (tp + tn + fp + fn)} {(tp + fp)(tp + fn)(tn + fp)(tn + fn)}\]

Also: \(\chi^{2} = MCC^{2} \cdot n\)

Cf. https://en.wikipedia.org/wiki/Pearson%27s_chi-square_test

Returns

The significance of the confusion table

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.significance()
66.26190476190476
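
The relation \(\chi^{2} = MCC^{2} \cdot n\) noted above can be checked numerically (rounded to avoid floating-point noise):

>>> round(ct.mcc() ** 2 * ct.population(), 6)
66.261905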

New in version 0.1.0.

specificity()[source]

Return specificity.

Specificity is defined as

\[\frac{tn}{tn + fp}\]

AKA true negative rate (TNR)

AKA inverse recall

Cf. https://en.wikipedia.org/wiki/Specificity_(tests)

Returns

The specificity of the confusion table

Return type

float

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.specificity()
0.75

New in version 0.1.0.

to_dict()[source]

Cast to dict.

Returns

The confusion table as a dict

Return type

dict

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> import pprint
>>> pprint.pprint(ct.to_dict())
{'fn': 30, 'fp': 20, 'tn': 60, 'tp': 120}

New in version 0.1.0.

to_tuple()[source]

Cast to tuple.

Returns

The confusion table as a 4-tuple (tp, tn, fp, fn)

Return type

tuple

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.to_tuple()
(120, 60, 20, 30)

New in version 0.1.0.

true_neg()[source]

Return true negatives.

Returns

The true negatives of the confusion table

Return type

int

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.true_neg()
60

New in version 0.1.0.

true_pos()[source]

Return true positives.

Returns

The true positives of the confusion table

Return type

int

Example

>>> ct = ConfusionTable(120, 60, 20, 30)
>>> ct.true_pos()
120

New in version 0.1.0.

abydos.stats.amean(nums)[source]

Return arithmetic mean.

The arithmetic mean is defined as

\[\frac{\sum{nums}}{|nums|}\]

Cf. https://en.wikipedia.org/wiki/Arithmetic_mean

Parameters

nums (list) -- A series of numbers

Returns

The arithmetic mean of nums

Return type

float

Examples

>>> amean([1, 2, 3, 4])
2.5
>>> amean([1, 2])
1.5
>>> amean([0, 5, 1000])
335.0

New in version 0.1.0.

abydos.stats.gmean(nums)[source]

Return geometric mean.

The geometric mean is defined as

\[\sqrt[|nums|]{\prod\limits_{i} nums_{i}}\]

Cf. https://en.wikipedia.org/wiki/Geometric_mean

Parameters

nums (list) -- A series of numbers

Returns

The geometric mean of nums

Return type

float

Examples

>>> gmean([1, 2, 3, 4])
2.213363839400643
>>> gmean([1, 2])
1.4142135623730951
>>> gmean([0, 5, 1000])
0.0

New in version 0.1.0.

abydos.stats.hmean(nums)[source]

Return harmonic mean.

The harmonic mean is defined as

\[\frac{|nums|}{\sum\limits_{i}\frac{1}{nums_i}}\]

Following the behavior of Wolfram|Alpha:

  • If one of the values in nums is 0, return 0.

  • If more than one value in nums is 0, return NaN.

Cf. https://en.wikipedia.org/wiki/Harmonic_mean

Parameters

nums (list) -- A series of numbers

Returns

The harmonic mean of nums

Return type

float

Raises

ValueError -- hmean requires at least one value

Examples

>>> hmean([1, 2, 3, 4])
1.9200000000000004
>>> hmean([1, 2])
1.3333333333333333
>>> hmean([0, 5, 1000])
0
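
Under the behavior described above, supplying more than one zero should yield NaN (an illustration assuming the usual float repr):

>>> hmean([0, 0, 5])
nan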

New in version 0.1.0.

abydos.stats.agmean(nums, prec=12)[source]

Return arithmetic-geometric mean.

Iterates between arithmetic & geometric means until they converge to a single value (rounded to prec digits, 12 by default).

Cf. https://en.wikipedia.org/wiki/Arithmetic-geometric_mean

Parameters
  • nums (list) -- A series of numbers

  • prec (int) -- Digits of precision when testing convergence

Returns

The arithmetic-geometric mean of nums

Return type

float

Examples

>>> agmean([1, 2, 3, 4])
2.3545004777751077
>>> agmean([1, 2])
1.4567910310469068
>>> agmean([0, 5, 1000])
2.9753977059954195e-13

New in version 0.1.0.

abydos.stats.ghmean(nums, prec=12)[source]

Return geometric-harmonic mean.

Iterates between geometric & harmonic means until they converge to a single value (rounded to prec digits, 12 by default).

Cf. https://en.wikipedia.org/wiki/Geometric-harmonic_mean

Parameters
  • nums (list) -- A series of numbers

  • prec (int) -- Digits of precision when testing convergence

Returns

The geometric-harmonic mean of nums

Return type

float

Examples

>>> ghmean([1, 2, 3, 4])
2.058868154613003
>>> ghmean([1, 2])
1.3728805006183502
>>> ghmean([0, 5, 1000])
0.0
>>> ghmean([0, 0])
0.0
>>> ghmean([0, 0, 5])
nan

New in version 0.1.0.

abydos.stats.aghmean(nums, prec=12)[source]

Return arithmetic-geometric-harmonic mean.

Iterates over arithmetic, geometric, & harmonic means until they converge to a single value (rounded to prec digits, 12 by default), following the method described in [RaissouliLC09].

Parameters
  • nums (list) -- A series of numbers

  • prec (int) -- Digits of precision when testing convergence

Returns

The arithmetic-geometric-harmonic mean of nums

Return type

float

Examples

>>> aghmean([1, 2, 3, 4])
2.198327159900212
>>> aghmean([1, 2])
1.4142135623731884
>>> aghmean([0, 5, 1000])
335.0

New in version 0.1.0.

abydos.stats.cmean(nums)[source]

Return contraharmonic mean.

The contraharmonic mean is

\[\frac{\sum\limits_i x_i^2}{\sum\limits_i x_i}\]

Cf. https://en.wikipedia.org/wiki/Contraharmonic_mean

Parameters

nums (list) -- A series of numbers

Returns

The contraharmonic mean of nums

Return type

float

Examples

>>> cmean([1, 2, 3, 4])
3.0
>>> cmean([1, 2])
1.6666666666666667
>>> cmean([0, 5, 1000])
995.0497512437811

New in version 0.1.0.

abydos.stats.imean(nums)[source]

Return identric (exponential) mean.

The identric mean of two numbers x and y is: x if x = y, otherwise

\[\frac{1}{e} \sqrt[x-y]{\frac{x^x}{y^y}}\]

Cf. https://en.wikipedia.org/wiki/Identric_mean

Parameters

nums (list) -- A series of numbers

Returns

The identric mean of nums

Return type

float

Raises

ValueError -- imean supports no more than two values

Examples

>>> imean([1, 2])
1.4715177646857693
>>> imean([1, 0])
nan
>>> imean([2, 4])
2.9430355293715387

New in version 0.1.0.

abydos.stats.lmean(nums)[source]

Return logarithmic mean.

The logarithmic mean of an arbitrarily long series is defined by http://www.survo.fi/papers/logmean.pdf as

\[\begin{split}L(x_1, x_2, ..., x_n) = (n-1)! \sum\limits_{i=1}^n \frac{x_i} {\prod\limits_{\substack{j = 1\\j \ne i}}^n ln \frac{x_i}{x_j}}\end{split}\]

Cf. https://en.wikipedia.org/wiki/Logarithmic_mean

Parameters

nums (list) -- A series of numbers

Returns

The logarithmic mean of nums

Return type

float

Raises

ValueError -- No two values in the nums list may be equal

Examples

>>> lmean([1, 2, 3, 4])
2.2724242417489258
>>> lmean([1, 2])
1.4426950408889634

New in version 0.1.0.

abydos.stats.qmean(nums)[source]

Return quadratic mean.

The quadratic mean is defined as

\[\sqrt{\sum\limits_{i} \frac{nums_i^2}{|nums|}}\]

Cf. https://en.wikipedia.org/wiki/Quadratic_mean

Parameters

nums (list) -- A series of numbers

Returns

The quadratic mean of nums

Return type

float

Examples

>>> qmean([1, 2, 3, 4])
2.7386127875258306
>>> qmean([1, 2])
1.5811388300841898
>>> qmean([0, 5, 1000])
577.3574860228857

New in version 0.1.0.

abydos.stats.heronian_mean(nums)[source]

Return Heronian mean.

The Heronian mean is:

\[\frac{\sum\limits_{i, j}\sqrt{{x_i \cdot x_j}}} {|nums| \cdot \frac{|nums| + 1}{2}}\]

for \(j \ge i\)

Cf. https://en.wikipedia.org/wiki/Heronian_mean

Parameters

nums (list) -- A series of numbers

Returns

The Heronian mean of nums

Return type

float

Examples

>>> heronian_mean([1, 2, 3, 4])
2.3888282852609093
>>> heronian_mean([1, 2])
1.4714045207910316
>>> heronian_mean([0, 5, 1000])
179.28511301977582

New in version 0.1.0.

abydos.stats.hoelder_mean(nums, exp=2)[source]

Return Hölder (power/generalized) mean.

The Hölder mean is defined as:

\[\sqrt[p]{\frac{1}{|nums|} \cdot \sum\limits_i{x_i^p}}\]

for \(p \ne 0\), and the geometric mean for \(p = 0\)

Cf. https://en.wikipedia.org/wiki/Generalized_mean

Parameters
  • nums (list) -- A series of numbers

  • exp (numeric) -- The exponent of the Hölder mean

Returns

The Hölder mean of nums for the given exponent

Return type

float

Examples

>>> hoelder_mean([1, 2, 3, 4])
2.7386127875258306
>>> hoelder_mean([1, 2])
1.5811388300841898
>>> hoelder_mean([0, 5, 1000])
577.3574860228857
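
Per the definition above, an exponent of 0 should reduce to the geometric mean (compare gmean() above; a rounded check):

>>> round(hoelder_mean([1, 2, 3, 4], exp=0), 8) == round(gmean([1, 2, 3, 4]), 8)
True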

New in version 0.1.0.

abydos.stats.lehmer_mean(nums, exp=2)[source]

Return Lehmer mean.

The Lehmer mean is

\[\frac{\sum\limits_i{x_i^p}}{\sum\limits_i{x_i^{p-1}}}\]

Cf. https://en.wikipedia.org/wiki/Lehmer_mean

Parameters
  • nums (list) -- A series of numbers

  • exp (numeric) -- The exponent of the Lehmer mean

Returns

The Lehmer mean of nums for the given exponent

Return type

float

Examples

>>> lehmer_mean([1, 2, 3, 4])
3.0
>>> lehmer_mean([1, 2])
1.6666666666666667
>>> lehmer_mean([0, 5, 1000])
995.0497512437811

New in version 0.1.0.

abydos.stats.seiffert_mean(nums)[source]

Return Seiffert's mean.

Seiffert's mean of two numbers x and y is

\[\frac{x - y}{4 \cdot arctan \sqrt{\frac{x}{y}} - \pi}\]

It is defined in [Sei93].

Parameters

nums (list) -- A series of numbers

Returns

Seiffert's mean of nums

Return type

float

Raises

ValueError -- seiffert_mean supports no more than two values

Examples

>>> seiffert_mean([1, 2])
1.4712939827611637
>>> seiffert_mean([1, 0])
0.3183098861837907
>>> seiffert_mean([2, 4])
2.9425879655223275
>>> seiffert_mean([2, 1000])
336.84053300118825

New in version 0.1.0.

abydos.stats.median(nums)[source]

Return median.

With numbers sorted by value, the median is the middle value (if there is an odd number of values) or the arithmetic mean of the two middle values (if there is an even number of values).

Cf. https://en.wikipedia.org/wiki/Median

Parameters

nums (list) -- A series of numbers

Returns

The median of nums

Return type

int or float

Examples

>>> median([1, 2, 3])
2
>>> median([1, 2, 3, 4])
2.5
>>> median([1, 2, 2, 4])
2

New in version 0.1.0.

abydos.stats.midrange(nums)[source]

Return midrange.

The midrange is the arithmetic mean of the maximum & minimum of a series.

Cf. https://en.wikipedia.org/wiki/Midrange

Parameters

nums (list) -- A series of numbers

Returns

The midrange of nums

Return type

float

Examples

>>> midrange([1, 2, 3])
2.0
>>> midrange([1, 2, 2, 3])
2.0
>>> midrange([1, 2, 1000, 3])
500.5

New in version 0.1.0.

abydos.stats.mode(nums)[source]

Return the mode.

The mode of a series is the most common element of that series

Cf. https://en.wikipedia.org/wiki/Mode_(statistics)

Parameters

nums (list) -- A series of numbers

Returns

The mode of nums

Return type

int or float

Example

>>> mode([1, 2, 2, 3])
2

New in version 0.1.0.

abydos.stats.std(nums, mean_func=<function amean>, ddof=0)[source]

Return the standard deviation.

The standard deviation of a series of values is the square root of the variance.

Cf. https://en.wikipedia.org/wiki/Standard_deviation

Parameters
  • nums (list) -- A series of numbers

  • mean_func (function) -- A mean function (amean by default)

  • ddof (int) -- The degrees of freedom (0 by default)

Returns

The standard deviation of the values in the series

Return type

float

Examples

>>> std([1, 1, 1, 1])
0.0
>>> round(std([1, 2, 3, 4]), 12)
1.11803398875
>>> round(std([1, 2, 3, 4], ddof=1), 12)
1.290994448736
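
Since std() is the square root of var(), the two should agree (a quick rounded check using the values above):

>>> round(var([1, 2, 3, 4]) ** 0.5, 8) == round(std([1, 2, 3, 4]), 8)
True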

New in version 0.3.0.

abydos.stats.var(nums, mean_func=<function amean>, ddof=0)[source]

Calculate the variance.

The variance (\(\sigma^2\)) of a series of numbers (\(x_i\)) with mean \(\mu\) and population \(N\) is:

\[\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2\]

Cf. https://en.wikipedia.org/wiki/Variance

Parameters
  • nums (list) -- A series of numbers

  • mean_func (function) -- A mean function (amean by default)

  • ddof (int) -- The degrees of freedom (0 by default)

Returns

The variance of the values in the series

Return type

float

Examples

>>> var([1, 1, 1, 1])
0.0
>>> var([1, 2, 3, 4])
1.25
>>> round(var([1, 2, 3, 4], ddof=1), 12)
1.666666666667

New in version 0.3.0.

abydos.stats.mean_pairwise_similarity(collection, metric=<function sim_levenshtein>, mean_func=<function hmean>, symmetric=False)[source]

Calculate the mean pairwise similarity of a collection of strings.

Takes the mean of the pairwise similarities between each pair of members of a collection, optionally in both directions (for asymmetric similarity metrics).

Parameters
  • collection (list) -- A collection of terms or a string that can be split

  • metric (function) -- A similarity metric function

  • mean_func (function) -- A mean function that takes a list of values and returns a float

  • symmetric (bool) -- Set to True if all pairwise similarities should be calculated in both directions

Returns

The mean pairwise similarity of a collection of strings

Return type

float

Raises
  • ValueError -- mean_func must be a function

  • ValueError -- metric must be a function

  • ValueError -- collection is neither a string nor iterable type

  • ValueError -- collection has fewer than two members

Examples

>>> round(mean_pairwise_similarity(['Christopher', 'Kristof',
... 'Christobal']), 12)
0.519801980198
>>> round(mean_pairwise_similarity(['Niall', 'Neal', 'Neil']), 12)
0.545454545455

New in version 0.1.0.

abydos.stats.pairwise_similarity_statistics(src_collection, tar_collection, metric=<function sim_levenshtein>, mean_func=<function amean>, symmetric=False)[source]

Calculate pairwise similarity statistics between two collections of strings.

Calculate pairwise similarities among members of two collections, returning the maximum, minimum, mean (according to a supplied function, arithmetic mean by default), and (population) standard deviation of those similarities.

Parameters
  • src_collection (list) -- A collection of terms or a string that can be split

  • tar_collection (list) -- A collection of terms or a string that can be split

  • metric (function) -- A similarity metric function

  • mean_func (function) -- A mean function that takes a list of values and returns a float

  • symmetric (bool) -- Set to True if all pairwise similarities should be calculated in both directions

Returns

The max, min, mean, and standard deviation of similarities

Return type

tuple

Raises
  • ValueError -- mean_func must be a function

  • ValueError -- metric must be a function

  • ValueError -- src_collection is neither a string nor iterable

  • ValueError -- tar_collection is neither a string nor iterable

Example

>>> tuple(round(_, 12) for _ in pairwise_similarity_statistics(
... ['Christopher', 'Kristof', 'Christobal'], ['Niall', 'Neal', 'Neil']))
(0.2, 0.0, 0.118614718615, 0.075070477184)

New in version 0.3.0.