abydos.compression package

abydos.compression.

The compression package defines compression and compression-related functions for use within Abydos, including implementations of the following:

  • Arithmetic for arithmetic coding

  • BWT for Burrows-Wheeler Transform

  • RLE for Run-Length Encoding

Each class exposes encode and decode methods for performing and reversing its encoding. For example, the Burrows-Wheeler Transform can be performed by creating a BWT object and then calling BWT.encode() on a string:

>>> bwt = BWT()
>>> bwt.encode('^BANANA')
'ANNB^AA\x00'

class abydos.compression.Arithmetic(text=None)[source]

Bases: object

Arithmetic Coder.

This is based on Andrew Dalke's public domain implementation [Dal05]. It has been ported to use the fractions.Fraction class.

New in version 0.3.6.

Initialize arithmetic coder object.

Parameters

text (str) -- The training text

New in version 0.3.6.

decode(longval, nbits)[source]

Decode the number to a string using the given statistics.

Parameters
  • longval (int) -- The first part of an encoded tuple from encode

  • nbits (int) -- The second part of an encoded tuple from encode

Returns

The arithmetically decoded text

Return type

str

Example

>>> ac = Arithmetic('the quick brown fox jumped over the lazy dog')
>>> ac.decode(16720586181, 34)
'align'

New in version 0.1.0.

Changed in version 0.3.6: Encapsulated in class

encode(text)[source]

Encode a text using arithmetic coding.

Text and the 0-order probability statistics -> longval, nbits

The encoded number is Fraction(longval, 2**nbits)

Parameters

text (str) -- A string to encode

Returns

The arithmetically coded text

Return type

tuple

Example

>>> ac = Arithmetic('the quick brown fox jumped over the lazy dog')
>>> ac.encode('align')
(16720586181, 34)

New in version 0.1.0.

Changed in version 0.3.6: Encapsulated in class

get_probs()[source]

Return the probs dictionary.

Returns

The dictionary of probabilities

Return type

dict

New in version 0.3.6.

set_probs(probs)[source]

Set the probs dictionary.

Parameters

probs (dict) -- The dictionary of probabilities

New in version 0.3.6.

train(text)[source]

Generate a probability dict from the provided text.

Text to 0-order probability statistics as a dict

Parameters

text (str) -- The text data over which to calculate probability statistics. This must not contain the NUL (0x00) character because that is used to indicate the end of data.

Example

>>> ac = Arithmetic()
>>> ac.train('the quick brown fox jumped over the lazy dog')
>>> ac.get_probs()
{' ': (Fraction(0, 1), Fraction(8, 45)),
 'o': (Fraction(8, 45), Fraction(4, 15)),
 'e': (Fraction(4, 15), Fraction(16, 45)),
 'u': (Fraction(16, 45), Fraction(2, 5)),
 't': (Fraction(2, 5), Fraction(4, 9)),
 'r': (Fraction(4, 9), Fraction(22, 45)),
 'h': (Fraction(22, 45), Fraction(8, 15)),
 'd': (Fraction(8, 15), Fraction(26, 45)),
 'z': (Fraction(26, 45), Fraction(3, 5)),
 'y': (Fraction(3, 5), Fraction(28, 45)),
 'x': (Fraction(28, 45), Fraction(29, 45)),
 'w': (Fraction(29, 45), Fraction(2, 3)),
 'v': (Fraction(2, 3), Fraction(31, 45)),
 'q': (Fraction(31, 45), Fraction(32, 45)),
 'p': (Fraction(32, 45), Fraction(11, 15)),
 'n': (Fraction(11, 15), Fraction(34, 45)),
 'm': (Fraction(34, 45), Fraction(7, 9)),
 'l': (Fraction(7, 9), Fraction(4, 5)),
 'k': (Fraction(4, 5), Fraction(37, 45)),
 'j': (Fraction(37, 45), Fraction(38, 45)),
 'i': (Fraction(38, 45), Fraction(13, 15)),
 'g': (Fraction(13, 15), Fraction(8, 9)),
 'f': (Fraction(8, 9), Fraction(41, 45)),
 'c': (Fraction(41, 45), Fraction(14, 15)),
 'b': (Fraction(14, 15), Fraction(43, 45)),
 'a': (Fraction(43, 45), Fraction(44, 45)),
 '\x00': (Fraction(44, 45), Fraction(1, 1))}

New in version 0.1.0.

Changed in version 0.3.6: Encapsulated in class

abydos.compression.ac_decode(longval, nbits, probs)[source]

Decode the number to a string using the given statistics.

This is a wrapper for Arithmetic.decode().

Parameters
  • longval (int) -- The first part of an encoded tuple from ac_encode

  • nbits (int) -- The second part of an encoded tuple from ac_encode

  • probs (dict) -- A probability statistics dictionary generated by Arithmetic.train()

Returns

The arithmetically decoded text

Return type

str

Example

>>> pr = ac_train('the quick brown fox jumped over the lazy dog')
>>> ac_decode(16720586181, 34, pr)
'align'

New in version 0.1.0.

Deprecated since version 0.4.0: This will be removed in 0.6.0. Use the Arithmetic.decode method instead.

abydos.compression.ac_encode(text, probs)[source]

Encode a text using arithmetic coding with the provided probabilities.

This is a wrapper for Arithmetic.encode().

Parameters
  • text (str) -- A string to encode

  • probs (dict) -- A probability statistics dictionary generated by Arithmetic.train()

Returns

The arithmetically coded text

Return type

tuple

Example

>>> pr = ac_train('the quick brown fox jumped over the lazy dog')
>>> ac_encode('align', pr)
(16720586181, 34)

New in version 0.1.0.

Deprecated since version 0.4.0: This will be removed in 0.6.0. Use the Arithmetic.encode method instead.

abydos.compression.ac_train(text)[source]

Generate a probability dict from the provided text.

This is a wrapper for Arithmetic.train().

Parameters

text (str) -- The text data over which to calculate probability statistics. This must not contain the NUL (0x00) character because that's used to indicate the end of data.

Returns

A probability dict

Return type

dict

Example

>>> ac_train('the quick brown fox jumped over the lazy dog')
{' ': (Fraction(0, 1), Fraction(8, 45)),
 'o': (Fraction(8, 45), Fraction(4, 15)),
 'e': (Fraction(4, 15), Fraction(16, 45)),
 'u': (Fraction(16, 45), Fraction(2, 5)),
 't': (Fraction(2, 5), Fraction(4, 9)),
 'r': (Fraction(4, 9), Fraction(22, 45)),
 'h': (Fraction(22, 45), Fraction(8, 15)),
 'd': (Fraction(8, 15), Fraction(26, 45)),
 'z': (Fraction(26, 45), Fraction(3, 5)),
 'y': (Fraction(3, 5), Fraction(28, 45)),
 'x': (Fraction(28, 45), Fraction(29, 45)),
 'w': (Fraction(29, 45), Fraction(2, 3)),
 'v': (Fraction(2, 3), Fraction(31, 45)),
 'q': (Fraction(31, 45), Fraction(32, 45)),
 'p': (Fraction(32, 45), Fraction(11, 15)),
 'n': (Fraction(11, 15), Fraction(34, 45)),
 'm': (Fraction(34, 45), Fraction(7, 9)),
 'l': (Fraction(7, 9), Fraction(4, 5)),
 'k': (Fraction(4, 5), Fraction(37, 45)),
 'j': (Fraction(37, 45), Fraction(38, 45)),
 'i': (Fraction(38, 45), Fraction(13, 15)),
 'g': (Fraction(13, 15), Fraction(8, 9)),
 'f': (Fraction(8, 9), Fraction(41, 45)),
 'c': (Fraction(41, 45), Fraction(14, 15)),
 'b': (Fraction(14, 15), Fraction(43, 45)),
 'a': (Fraction(43, 45), Fraction(44, 45)),
 '\x00': (Fraction(44, 45), Fraction(1, 1))}

New in version 0.1.0.

Deprecated since version 0.4.0: This will be removed in 0.6.0. Use the Arithmetic.train method instead.

class abydos.compression.BWT(terminator='x00')[source]

Bases: object

Burrows-Wheeler Transform.

The Burrows-Wheeler transform is an attempt at placing similar characters together to improve compression. Cf. [BW94].

New in version 0.3.6.

Initialize BWT instance.

Parameters

terminator (str) -- A character added to signal the end of the string

New in version 0.4.0.

decode(code)[source]

Return a word decoded from BWT form.

Parameters
  • code (str) -- The word to transform from BWT form

  • terminator (str) -- A character added to signal the end of the string

Returns

Word decoded by BWT

Return type

str

Raises

ValueError -- Specified terminator absent from code.

Examples

>>> bwt = BWT()
>>> bwt.decode('n\x00ilag')
'align'
>>> bwt.decode('annb\x00aa')
'banana'
>>> bwt = BWT('@')
>>> bwt.decode('annb@aa')
'banana'

New in version 0.1.0.

Changed in version 0.3.6: Encapsulated in class

encode(word)[source]

Return the Burrows-Wheeler transformed form of a word.

Parameters

word (str) -- The word to transform using BWT

Returns

Word encoded by BWT

Return type

str

Raises

ValueError -- Specified terminator absent from code.

Examples

>>> bwt = BWT()
>>> bwt.encode('align')
'n\x00ilag'
>>> bwt.encode('banana')
'annb\x00aa'
>>> bwt = BWT('@')
>>> bwt.encode('banana')
'annb@aa'

New in version 0.1.0.

Changed in version 0.3.6: Encapsulated in class

abydos.compression.bwt_decode(code, terminator='\x00')[source]

Return a word decoded from BWT form.

This is a wrapper for BWT.decode().

Parameters
  • code (str) -- The word to transform from BWT form

  • terminator (str) -- A character added to signal the end of the string

Returns

Word decoded by BWT

Return type

str

Examples

>>> bwt_decode('n\x00ilag')
'align'
>>> bwt_decode('annb\x00aa')
'banana'
>>> bwt_decode('annb@aa', '@')
'banana'

New in version 0.1.0.

Deprecated since version 0.4.0: This will be removed in 0.6.0. Use the BWT.decode method instead.

abydos.compression.bwt_encode(word, terminator='\x00')[source]

Return the Burrows-Wheeler transformed form of a word.

This is a wrapper for BWT.encode().

Parameters
  • word (str) -- The word to transform using BWT

  • terminator (str) -- A character added to signal the end of the string

Returns

Word encoded by BWT

Return type

str

Examples

>>> bwt_encode('align')
'n\x00ilag'
>>> bwt_encode('banana')
'annb\x00aa'
>>> bwt_encode('banana', '@')
'annb@aa'

New in version 0.1.0.

Deprecated since version 0.4.0: This will be removed in 0.6.0. Use the BWT.encode method instead.

class abydos.compression.RLE[source]

Bases: object

Run-Length Encoding.

Cf. [RC67].

Based on http://rosettacode.org/wiki/Run-length_encoding#Python [Cod18b]. This is licensed GFDL 1.2.

Digits 0-9 cannot be in text.

New in version 0.3.6.

decode(text)[source]

Perform decoding of run-length-encoding (RLE).

Parameters

text (str) -- A text string to decode

Returns

Word decoded by RLE

Return type

str

Examples

>>> rle = RLE()
>>> bwt = BWT()
>>> bwt.decode(rle.decode('n\x00ilag'))
'align'
>>> rle.decode('align')
'align'
>>> bwt.decode(rle.decode('annb\x00aa'))
'banana'
>>> rle.decode('banana')
'banana'
>>> bwt.decode(rle.decode('ab\x00abbab5a'))
'aaabaabababa'
>>> rle.decode('3abaabababa')
'aaabaabababa'

New in version 0.1.0.

Changed in version 0.3.6: Encapsulated in class

encode(text)[source]

Perform encoding of run-length-encoding (RLE).

Parameters

text (str) -- A text string to encode

Returns

Word decoded by RLE

Return type

str

Examples

>>> rle = RLE()
>>> bwt = BWT()
>>> rle.encode(bwt.encode('align'))
'n\x00ilag'
>>> rle.encode('align')
'align'
>>> rle.encode(bwt.encode('banana'))
'annb\x00aa'
>>> rle.encode('banana')
'banana'
>>> rle.encode(bwt.encode('aaabaabababa'))
'ab\x00abbab5a'
>>> rle.encode('aaabaabababa')
'3abaabababa'

New in version 0.1.0.

Changed in version 0.3.6: Encapsulated in class

abydos.compression.rle_decode(text, use_bwt=True)[source]

Perform decoding of run-length-encoding (RLE).

This is a wrapper for RLE.decode().

Parameters
  • text (str) -- A text string to decode

  • use_bwt (bool) -- Indicates whether to perform BWT decoding after RLE decoding

Returns

Word decoded by RLE

Return type

str

Examples

>>> rle_decode('n\x00ilag')
'align'
>>> rle_decode('align', use_bwt=False)
'align'
>>> rle_decode('annb\x00aa')
'banana'
>>> rle_decode('banana', use_bwt=False)
'banana'
>>> rle_decode('ab\x00abbab5a')
'aaabaabababa'
>>> rle_decode('3abaabababa', False)
'aaabaabababa'

New in version 0.1.0.

Deprecated since version 0.4.0: This will be removed in 0.6.0. Use the RLE.decode method instead.

abydos.compression.rle_encode(text, use_bwt=True)[source]

Perform encoding of run-length-encoding (RLE).

This is a wrapper for RLE.encode().

Parameters
  • text (str) -- A text string to encode

  • use_bwt (bool) -- Indicates whether to perform BWT encoding before RLE encoding

Returns

Word decoded by RLE

Return type

str

Examples

>>> rle_encode('align')
'n\x00ilag'
>>> rle_encode('align', use_bwt=False)
'align'
>>> rle_encode('banana')
'annb\x00aa'
>>> rle_encode('banana', use_bwt=False)
'banana'
>>> rle_encode('aaabaabababa')
'ab\x00abbab5a'
>>> rle_encode('aaabaabababa', False)
'3abaabababa'

New in version 0.1.0.

Deprecated since version 0.4.0: This will be removed in 0.6.0. Use the RLE.encode method instead.