python-blosc2: a Python wrapper for the extremely fast Blosc2 compression library

Author

The Blosc development team

Contact

blosc@blosc.org

GitHub

https://github.com/Blosc/python-blosc2

URL

http://python-blosc2.blosc.org

Code of Conduct

Contributor Covenant

What it is

Blosc (http://blosc.org) is a high-performance compressor optimized for binary data. It has been designed to transmit data to the processor cache faster than the traditional, non-compressed, direct memory fetch approach via a memcpy() call.

Blosc works well for compressing numerical arrays that contain data with relatively low entropy, like sparse data, time series, grids with regularly spaced values, etc.

python-blosc2 is a Python package that wraps C-Blosc2, the newest version of the Blosc compressor. python-blosc2 already reproduces the API of python-blosc, so the former can be used as a drop-in replacement for the latter. However, there are a few exceptions to full compatibility, which are listed here: https://github.com/Blosc/python-blosc2/blob/main/RELEASE_NOTES.md#changes-from-python-blosc-to-python-blosc2

In addition, python-blosc2 aims to leverage the new C-Blosc2 API so as to support super-chunks, serialization and all the features introduced in C-Blosc2. This is a work in progress and will be done incrementally in future releases.

Note: python-blosc2 is meant to be backward compatible with python-blosc data. That means that it can read data generated with python-blosc, but the opposite is not true (i.e. there is no forward compatibility).

Installing

Blosc now provides Python wheels for the main operating systems (Windows, macOS and Linux) and platforms. You can install the binary packages from PyPI using pip:

pip install blosc2

Documentation

The documentation is here:

https://python-blosc2.readthedocs.io/en/latest/

Also, some examples are available on:

https://github.com/Blosc/python-blosc2/tree/main/examples

Building

python-blosc2 bundles the C-Blosc2 sources (as a git submodule) and can be built with:

git clone https://github.com/Blosc/python-blosc2/
cd python-blosc2
git submodule update --init --recursive
python -m pip install -r requirements.txt
python setup.py build_ext --inplace

That's all. You can now proceed to the Testing section.

Testing

After compiling, you can quickly check that the package is sane by running the test suite with pytest:

python -m pip install -r requirements-tests.txt
python -m pytest  # add -v for verbose mode

Benchmarking

If you are curious, you can run a small benchmark that compares a plain NumPy array copy against compression with the different compressors available in your Blosc build:

PYTHONPATH=. python bench/compress_numpy.py

Just to whet your appetite, here are some speed figures for an Intel box (i9-10940X @ 3.30GHz) with 64 GB RAM running Clear Linux:

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
python-blosc2 version: 0.1.7
Blosc version: 2.0.0.rc2 ($Date:: 2021-05-26 #$)
Compressors available: ['blosclz', 'lz4', 'lz4hc', 'zlib', 'zstd']
Compressor library versions:
  blosclz: 2.4.0
  lz4: 1.9.3
  lz4hc: 1.9.3
  zlib: 1.2.11.zlib-ng
  zstd: 1.5.0
Python version: 3.7.9 (default, Aug 31 2020, 12:42:55)
[GCC 7.3.0]
Platform: Linux-5.12.6-1043.native-x86_64 (#1 SMP Sat May 22 04:04:10 PDT 2021)
Linux dist: Clear Linux OS
Processor: not recognized
Byte-ordering: little
Detected cores: 28
Number of threads to use by default: 8
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Creating NumPy arrays with 10**8 int64/float64 elements:
  *** np.copyto() *** Time for memcpy():    0.083 s (8.93 GB/s)

Times for compressing/decompressing:

*** the arange linear distribution ***
  *** blosclz, noshuffle  ***  0.219 s (3.41 GB/s) / 0.083 s (8.93 GB/s)    cr:   2.0x
  *** blosclz, shuffle    ***  0.027 s (27.26 GB/s) / 0.035 s (21.38 GB/s)  cr: 469.7x
  *** blosclz, bitshuffle ***  0.078 s (9.56 GB/s) / 0.135 s (5.53 GB/s)    cr: 488.2x
  *** lz4    , noshuffle  ***  0.223 s (3.33 GB/s) / 0.075 s (9.92 GB/s)    cr:   2.0x
  *** lz4    , shuffle    ***  0.025 s (29.69 GB/s) / 0.035 s (21.18 GB/s)  cr: 279.2x
  *** lz4    , bitshuffle ***  0.079 s (9.43 GB/s) / 0.138 s (5.40 GB/s)    cr:  87.7x
  *** lz4hc  , noshuffle  ***  1.273 s (0.59 GB/s) / 0.076 s (9.85 GB/s)    cr:   2.0x
  *** lz4hc  , shuffle    ***  0.108 s (6.87 GB/s) / 0.032 s (23.37 GB/s)   cr: 155.9x
  *** lz4hc  , bitshuffle ***  0.359 s (2.08 GB/s) / 0.037 s (19.88 GB/s)   cr: 239.5x
  *** zlib   , noshuffle  ***  2.732 s (0.27 GB/s) / 0.146 s (5.09 GB/s)    cr:   5.3x
  *** zlib   , shuffle    ***  0.129 s (5.78 GB/s) / 0.046 s (16.11 GB/s)   cr: 273.8x
  *** zlib   , bitshuffle ***  0.179 s (4.17 GB/s) / 0.058 s (12.78 GB/s)   cr: 457.9x
  *** zstd   , noshuffle  ***  1.912 s (0.39 GB/s) / 0.113 s (6.61 GB/s)    cr:   7.9x
  *** zstd   , shuffle    ***  0.223 s (3.34 GB/s) / 0.031 s (24.18 GB/s)   cr: 644.9x
  *** zstd   , bitshuffle ***  0.242 s (3.07 GB/s) / 0.038 s (19.61 GB/s)   cr: 985.6x

*** the linspace linear distribution ***
  *** blosclz, noshuffle  ***  0.099 s (7.55 GB/s) / 0.031 s (23.76 GB/s)   cr:   1.0x
  *** blosclz, shuffle    ***  0.050 s (15.02 GB/s) / 0.036 s (20.98 GB/s)  cr:  33.5x
  *** blosclz, bitshuffle ***  0.087 s (8.53 GB/s) / 0.147 s (5.08 GB/s)    cr:  55.4x
  *** lz4    , noshuffle  ***  0.085 s (8.77 GB/s) / 0.031 s (23.86 GB/s)   cr:   1.0x
  *** lz4    , shuffle    ***  0.038 s (19.53 GB/s) / 0.034 s (21.78 GB/s)  cr:  40.5x
  *** lz4    , bitshuffle ***  0.081 s (9.24 GB/s) / 0.146 s (5.09 GB/s)    cr:  59.5x
  *** lz4hc  , noshuffle  ***  1.902 s (0.39 GB/s) / 0.075 s (9.92 GB/s)    cr:   1.1x
  *** lz4hc  , shuffle    ***  0.237 s (3.14 GB/s) / 0.031 s (24.09 GB/s)   cr:  44.7x
  *** lz4hc  , bitshuffle ***  0.438 s (1.70 GB/s) / 0.035 s (21.03 GB/s)   cr:  58.0x
  *** zlib   , noshuffle  ***  2.078 s (0.36 GB/s) / 0.267 s (2.79 GB/s)    cr:   1.6x
  *** zlib   , shuffle    ***  0.239 s (3.11 GB/s) / 0.053 s (13.98 GB/s)   cr:  44.6x
  *** zlib   , bitshuffle ***  0.275 s (2.71 GB/s) / 0.065 s (11.45 GB/s)   cr:  66.9x
  *** zstd   , noshuffle  ***  2.792 s (0.27 GB/s) / 0.099 s (7.55 GB/s)    cr:   1.2x
  *** zstd   , shuffle    ***  0.374 s (1.99 GB/s) / 0.037 s (20.18 GB/s)   cr:  70.5x
  *** zstd   , bitshuffle ***  0.367 s (2.03 GB/s) / 0.053 s (14.10 GB/s)   cr:  51.2x

*** the random distribution ***
  *** blosclz, noshuffle  ***  0.245 s (3.04 GB/s) / 0.105 s (7.12 GB/s)    cr:   2.1x
  *** blosclz, shuffle    ***  0.098 s (7.59 GB/s) / 0.038 s (19.56 GB/s)   cr:   4.0x
  *** blosclz, bitshuffle ***  0.163 s (4.57 GB/s) / 0.139 s (5.35 GB/s)    cr:   4.0x
  *** lz4    , noshuffle  ***  0.240 s (3.10 GB/s) / 0.040 s (18.65 GB/s)   cr:   2.1x
  *** lz4    , shuffle    ***  0.109 s (6.83 GB/s) / 0.039 s (19.28 GB/s)   cr:   4.0x
  *** lz4    , bitshuffle ***  0.144 s (5.18 GB/s) / 0.139 s (5.35 GB/s)    cr:   4.6x
  *** lz4hc  , noshuffle  ***  1.222 s (0.61 GB/s) / 0.035 s (21.25 GB/s)   cr:   2.8x
  *** lz4hc  , shuffle    ***  0.453 s (1.65 GB/s) / 0.038 s (19.66 GB/s)   cr:   4.0x
  *** lz4hc  , bitshuffle ***  0.419 s (1.78 GB/s) / 0.041 s (17.97 GB/s)   cr:   4.5x
  *** zlib   , noshuffle  ***  4.050 s (0.18 GB/s) / 0.208 s (3.58 GB/s)    cr:   3.2x
  *** zlib   , shuffle    ***  0.654 s (1.14 GB/s) / 0.074 s (10.06 GB/s)   cr:   4.7x
  *** zlib   , bitshuffle ***  0.610 s (1.22 GB/s) / 0.078 s (9.51 GB/s)    cr:   4.6x
  *** zstd   , noshuffle  ***  2.214 s (0.34 GB/s) / 0.125 s (5.95 GB/s)    cr:   4.0x
  *** zstd   , shuffle    ***  0.874 s (0.85 GB/s) / 0.039 s (19.01 GB/s)   cr:   4.4x
  *** zstd   , bitshuffle ***  0.858 s (0.87 GB/s) / 0.054 s (13.71 GB/s)   cr:   4.6x
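
The GB/s figures in the listing above can be reproduced from the array size and the elapsed time. The sketch below assumes that "GB" means 2**30 bytes, which is what the reported numbers suggest rather than something the benchmark states; the throughput() helper is a hypothetical name introduced only for this example.

```python
def throughput(n_elements: int, itemsize: int, seconds: float) -> float:
    """Return the throughput in units of 2**30 bytes per second."""
    return n_elements * itemsize / 2**30 / seconds

# 10**8 elements of 8 bytes each, as stated in the benchmark header.
# 0.219 s is the compression time for blosclz, noshuffle, arange.
compress_speed = throughput(10**8, 8, 0.219)
print(f"{compress_speed:.2f} GB/s")
```

The result lands close to the 3.41 GB/s shown for that row; small differences are expected because the listing rounds times to three decimals.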

For the sake of comparison, here are the results for an ARM box: an Apple MacBook Air M1 (2021) with 8 GB of RAM:

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
python-blosc2 version: 0.1.6.dev0
Blosc version: 2.0.0.rc2 ($Date:: 2021-05-26 #$)
Compressors available: ['blosclz', 'lz4', 'lz4hc', 'zlib', 'zstd']
Compressor library versions:
  blosclz: 2.4.0
  lz4: 1.9.3
  lz4hc: 1.9.3
  zlib: 1.2.11.zlib-ng
  zstd: 1.5.0
Python version: 3.9.5 (default, May  3 2021, 19:12:05)
[Clang 12.0.5 (clang-1205.0.22.9)]
Platform: Darwin-20.4.0-arm64 (Darwin Kernel Version 20.4.0: Fri Mar  5 01:14:02 PST 2021; root:xnu-7195.101.1~3/RELEASE_ARM64_T8101)
Processor: arm
Byte-ordering: little
Detected cores: 8
Number of threads to use by default: 8
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Creating NumPy arrays with 10**8 int64/float64 elements:
  *** np.copyto() *** Time for memcpy():    0.030 s (25.04 GB/s)

Times for compressing/decompressing:

*** the arange linear distribution ***
  *** blosclz, noshuffle  ***  0.253 s (2.95 GB/s) / 0.109 s (6.83 GB/s)    cr:   2.0x
  *** blosclz, shuffle    ***  0.036 s (20.44 GB/s) / 0.024 s (31.08 GB/s)  cr: 469.7x
  *** blosclz, bitshuffle ***  0.123 s (6.04 GB/s) / 0.238 s (3.13 GB/s)    cr: 488.2x
  *** lz4    , noshuffle  ***  0.332 s (2.24 GB/s) / 0.072 s (10.39 GB/s)   cr:   2.0x
  *** lz4    , shuffle    ***  0.035 s (21.18 GB/s) / 0.030 s (24.93 GB/s)  cr: 279.2x
  *** lz4    , bitshuffle ***  0.126 s (5.91 GB/s) / 0.239 s (3.12 GB/s)    cr:  87.7x
  *** lz4hc  , noshuffle  ***  2.365 s (0.32 GB/s) / 0.080 s (9.35 GB/s)    cr:   2.0x
  *** lz4hc  , shuffle    ***  0.136 s (5.48 GB/s) / 0.047 s (15.89 GB/s)   cr: 155.9x
  *** lz4hc  , bitshuffle ***  0.545 s (1.37 GB/s) / 0.168 s (4.42 GB/s)    cr: 239.5x
  *** zlib   , noshuffle  ***  4.875 s (0.15 GB/s) / 0.279 s (2.67 GB/s)    cr:   5.3x
  *** zlib   , shuffle    ***  0.213 s (3.50 GB/s) / 0.091 s (8.20 GB/s)    cr: 273.8x
  *** zlib   , bitshuffle ***  0.344 s (2.16 GB/s) / 0.213 s (3.50 GB/s)    cr: 457.9x
  *** zstd   , noshuffle  ***  2.961 s (0.25 GB/s) / 0.168 s (4.44 GB/s)    cr:   7.9x
  *** zstd   , shuffle    ***  0.265 s (2.82 GB/s) / 0.035 s (21.46 GB/s)   cr: 644.9x
  *** zstd   , bitshuffle ***  0.392 s (1.90 GB/s) / 0.158 s (4.73 GB/s)    cr: 985.6x

*** the linspace linear distribution ***
  *** blosclz, noshuffle  ***  0.372 s (2.00 GB/s) / 0.029 s (25.42 GB/s)   cr:   1.0x
  *** blosclz, shuffle    ***  0.065 s (11.46 GB/s) / 0.035 s (21.13 GB/s)  cr:  33.5x
  *** blosclz, bitshuffle ***  0.148 s (5.03 GB/s) / 0.250 s (2.98 GB/s)    cr:  55.4x
  *** lz4    , noshuffle  ***  0.109 s (6.84 GB/s) / 0.037 s (19.89 GB/s)   cr:   1.0x
  *** lz4    , shuffle    ***  0.052 s (14.27 GB/s) / 0.038 s (19.65 GB/s)  cr:  40.5x
  *** lz4    , bitshuffle ***  0.138 s (5.42 GB/s) / 0.250 s (2.99 GB/s)    cr:  59.5x
  *** lz4hc  , noshuffle  ***  3.962 s (0.19 GB/s) / 0.070 s (10.61 GB/s)   cr:   1.1x
  *** lz4hc  , shuffle    ***  0.366 s (2.04 GB/s) / 0.037 s (19.99 GB/s)   cr:  44.7x
  *** lz4hc  , bitshuffle ***  0.764 s (0.97 GB/s) / 0.159 s (4.69 GB/s)    cr:  58.0x
  *** zlib   , noshuffle  ***  3.290 s (0.23 GB/s) / 0.502 s (1.49 GB/s)    cr:   1.6x
  *** zlib   , shuffle    ***  0.403 s (1.85 GB/s) / 0.103 s (7.23 GB/s)    cr:  44.6x
  *** zlib   , bitshuffle ***  0.533 s (1.40 GB/s) / 0.228 s (3.27 GB/s)    cr:  66.9x
  *** zstd   , noshuffle  ***  3.747 s (0.20 GB/s) / 0.192 s (3.89 GB/s)    cr:   1.2x
  *** zstd   , shuffle    ***  0.483 s (1.54 GB/s) / 0.057 s (13.17 GB/s)   cr:  70.5x
  *** zstd   , bitshuffle ***  0.634 s (1.17 GB/s) / 0.204 s (3.65 GB/s)    cr:  51.2x

*** the random distribution ***
  *** blosclz, noshuffle  ***  0.410 s (1.82 GB/s) / 0.135 s (5.50 GB/s)    cr:   2.1x
  *** blosclz, shuffle    ***  0.087 s (8.53 GB/s) / 0.029 s (25.29 GB/s)   cr:   4.0x
  *** blosclz, bitshuffle ***  0.169 s (4.40 GB/s) / 0.236 s (3.15 GB/s)    cr:   4.0x
  *** lz4    , noshuffle  ***  0.359 s (2.08 GB/s) / 0.060 s (12.50 GB/s)   cr:   2.1x
  *** lz4    , shuffle    ***  0.075 s (9.88 GB/s) / 0.029 s (25.40 GB/s)   cr:   4.0x
  *** lz4    , bitshuffle ***  0.155 s (4.81 GB/s) / 0.239 s (3.12 GB/s)    cr:   4.6x
  *** lz4hc  , noshuffle  ***  2.053 s (0.36 GB/s) / 0.045 s (16.71 GB/s)   cr:   2.8x
  *** lz4hc  , shuffle    ***  0.797 s (0.93 GB/s) / 0.051 s (14.63 GB/s)   cr:   4.0x
  *** lz4hc  , bitshuffle ***  0.795 s (0.94 GB/s) / 0.177 s (4.21 GB/s)    cr:   4.5x
  *** zlib   , noshuffle  ***  5.562 s (0.13 GB/s) / 0.367 s (2.03 GB/s)    cr:   3.2x
  *** zlib   , shuffle    ***  0.934 s (0.80 GB/s) / 0.148 s (5.03 GB/s)    cr:   4.7x
  *** zlib   , bitshuffle ***  0.959 s (0.78 GB/s) / 0.262 s (2.85 GB/s)    cr:   4.6x
  *** zstd   , noshuffle  ***  3.841 s (0.19 GB/s) / 0.228 s (3.27 GB/s)    cr:   4.0x
  *** zstd   , shuffle    ***  1.078 s (0.69 GB/s) / 0.069 s (10.76 GB/s)   cr:   4.4x
  *** zstd   , bitshuffle ***  1.044 s (0.71 GB/s) / 0.201 s (3.71 GB/s)    cr:   4.6x

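As can be seen, it is perfectly possible for python-blosc2 to go faster than a plain memcpy(). On the Intel box, for instance, blosclz with shuffle compresses the arange data at 27.26 GB/s, against 8.93 GB/s for the plain copy.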
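
The large gap between the noshuffle and shuffle rows comes from reordering bytes before compression. The sketch below illustrates the idea only: it is not Blosc's implementation, it uses Python's standard zlib module as a stand-in codec, and byte_shuffle() and byte_unshuffle() are hypothetical helpers written for this example.

```python
import struct
import zlib

def byte_shuffle(buf: bytes, typesize: int) -> bytes:
    """Group the i-th byte of every element together (a byte transpose)."""
    return b"".join(buf[i::typesize] for i in range(typesize))

def byte_unshuffle(buf: bytes, typesize: int) -> bytes:
    """Invert byte_shuffle(), restoring the original element layout."""
    n = len(buf) // typesize
    out = bytearray(len(buf))
    for i in range(typesize):
        out[i::typesize] = buf[i * n:(i + 1) * n]
    return bytes(out)

# 100,000 consecutive little-endian int64 values, similar to an arange.
n = 100_000
raw = struct.pack(f"<{n}q", *range(n))
shuffled = byte_shuffle(raw, 8)

# Compress both layouts with the same codec and level.
plain_size = len(zlib.compress(raw, 1))
shuffled_size = len(zlib.compress(shuffled, 1))
print(f"noshuffle: {plain_size} bytes, shuffle: {shuffled_size} bytes")
```

After shuffling, the mostly-zero high bytes of every element sit next to each other in long runs, which a general-purpose codec handles far better than the interleaved original.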

Start using compression in your data workflows and feel the experience of doing more with less!

License

The software is licensed under a 3-Clause BSD license. A copy of the python-blosc2 license can be found in LICENSE. A copy of all licenses can be found in LICENSES/.

Mailing list

Discussion about this module is welcome in the Blosc list:

blosc@googlegroups.com

http://groups.google.es/group/blosc

Twitter

Please follow @Blosc2 to stay informed about the latest developments.


Enjoy data!