Introducing libbytesize

Problem area

Many projects have to deal with representing sizes of storage or memory, generally as sizes in bytes. What may seem to be a trivial thing turns into hundreds of lines of code if the following things are to be covered properly:

  • using binary (GiB,…) and decimal (GB) units correctly
  • handling sizes bigger than MAXUINT64 (which is 16 EiB – 1)
  • parsing users’ input correctly with:
    • binary and decimal units
    • numeric values in various formats (traditional, scientific)
  • handling localization and internationalization correctly
    • different radix characters used in different languages
    • units being translated and typed in by users in their native format (even with non-Latin scripts)
  • handling negative sizes
    • it sometimes makes sense to work with these, for example when some storage space is missing somewhere

Of course, not all projects working with sizes in bytes have hundreds of lines for dealing with the above points, but the result is then a bad user experience. In some cases, valid localized inputs are not accepted or not parsed correctly, or users always get the English format and units no matter what the current locale and language configuration is. One of the biggest problems I see in many projects is that binary and decimal units are not used and differentiated correctly. If something shows the value 10 G, does it mean 10 GiB and thus 10240 MiB, or is it 10 GB and thus 10000 MB? Sometimes one can find this piece of information in the documentation (e.g. man pages), but often one just has to guess and try. In rare cases, fortunately, one can even be really surprised by the documented behaviour, as with the lvm utilities, where g means GiB and G means GB. We should generally be doing a much better job of handling sizes correctly and consistently in all projects that have to handle them. However, it’s obvious that having a few hundred lines of code in every such project is nonsense.
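To put numbers on the "10 G" ambiguity, here is a tiny plain-Python calculation. It uses only integer arithmetic and has nothing to do with any particular library; the constant names are just ones I picked for illustration.

```python
# Binary (IEC) prefixes are powers of 1024, decimal (SI) prefixes powers of 1000.
KiB, MiB, GiB = 1024, 1024**2, 1024**3
KB, MB, GB = 1000, 1000**2, 1000**3
EiB = 1024**6

ten_gib_in_mib = 10 * GiB // MiB   # "10 G" read as 10 GiB
ten_gb_in_mb = 10 * GB // MB       # "10 G" read as 10 GB

# How many bytes the two readings of "10 G" differ by.
difference_in_bytes = 10 * GiB - 10 * GB

# The largest unsigned 64-bit integer is exactly 16 EiB - 1.
MAXUINT64 = 2**64 - 1

print(ten_gib_in_mib, ten_gb_in_mb, difference_in_bytes)
print(MAXUINT64 == 16 * EiB - 1)
```

The two readings differ by more than 700 MB, which is hardly a rounding error when partitioning a disk.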

An existing solution

One of the projects that I can gladly call a good example of how to deal with sizes in bytes is the Blivet Python package, used mainly by the Anaconda OS installer (Fedora, RHEL,…). It addresses all the concerns mentioned above in a proper and well-tested way in its class called simply Size. As the title of this post reveals, I’m trying to introduce a new library here, so the obvious question is: why invent and write something new when a good and well-tested solution already exists? The answer lies in the description of Blivet: it is written in Python, which makes its implementation of the Size class hardly usable from any other language or environment.

One step further

The obvious way to move towards a widely reusable solution was to rewrite Blivet’s Size class in C so that it can be used from this low-level language and from the many other languages that make it easy to use C libraries. However, again, what may seem to be an easy thing to do is not that simple at all. Blivet’s Python implementation is based on Python’s Decimal type, a numeric type supporting arbitrary (user-configurable) precision and arbitrarily big numbers. Also, dealing with strings and their processing is way simpler in Python than in C.
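To give an idea of why Decimal is such a comfortable base for this kind of work, here is a short standard-library sketch. It is only an illustration of the type itself, not code taken from Blivet, and the variable names are mine.

```python
from decimal import Decimal, localcontext

# Building the value from a string keeps "1.5" exact (a binary float may not).
size = Decimal("1.5") * 1024**3          # 1.5 GiB expressed in bytes
exact_bytes = int(size)

# Scientific notation is accepted, and values far beyond the unsigned
# 64-bit range are no problem. The working precision is configurable.
with localcontext() as ctx:
    ctx.prec = 50                        # number of significant digits
    huge = Decimal(2**64) * Decimal("1e3")

print(exact_bytes)
print(int(huge) == 2**64 * 1000)
```

C has no built-in counterpart to this, which is where the extra work in a C implementation comes from.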

Nevertheless, C also has some nice libraries for working with big and highly precise numbers, namely the GMP and MPFR libraries, which were created as part of the GNU project and are used by many tools and libraries doing some serious maths. So it soon became clear that writing a C implementation of the Size class shouldn’t be an overly complicated task. And that turned out to be the case.

Here it is

The result is the libbytesize library, which uses GMP and MPFR together with GObject Introspection to provide a nice object-oriented API facilitating work with sizes in bytes. It properly takes care of all the potential issues mentioned at the beginning of this post and is widely usable thanks to the broad support for GObject Introspection in many high-level languages. The library provides a single class called (warning: here comes the surprise) Size, which right now is basically a very thin wrapper around the mpz_t type provided by the GMP library for arbitrarily big integers, and thus it actually stores byte sizes as whole numbers of bytes. That is its precision limit, but since no storage provides or works with fractions of bytes, it’s no real limitation at all.

There are (at this point) four constructors [1]:

  • bs_size_new() which creates a new instance initialized to 0 B,
  • bs_size_new_from_bytes() which creates a new instance initialized to a given number of bytes,
  • bs_size_new_from_str() which creates a new instance initialized to the number of bytes the given string (e.g. "10 GiB") represents,
  • bs_size_new_from_size() which is a copy constructor.
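To make it a bit more concrete what a from-string constructor has to cope with, here is a deliberately naive pure-Python sketch. It is not how libbytesize parses strings: parse_size(), its radix parameter and the UNIT_FACTORS table are hypothetical names made up for this illustration. It also assumes a space between the number and an untranslated unit.

```python
from decimal import Decimal

# Hypothetical unit table: binary units next to their decimal look-alikes.
UNIT_FACTORS = {
    "B": 1,
    "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4,
    "KB": 1000, "MB": 1000**2, "GB": 1000**3, "TB": 1000**4,
}

def parse_size(text, radix="."):
    """Return the number of bytes that a string like "10 GiB" represents."""
    number, unit = text.split()          # ValueError unless exactly two parts
    if unit not in UNIT_FACTORS:
        raise ValueError(f"unknown unit: {unit!r}")
    # Normalize the locale's radix character before handing over to Decimal.
    value = Decimal(number.replace(radix, "."))
    # int() drops any fraction of a byte (truncating towards zero).
    return int(value * UNIT_FACTORS[unit])

print(parse_size("10 GiB"))
print(parse_size("1,5 GiB", radix=","))  # radix character used in many locales
print(parse_size("1e3 KB"))              # scientific notation
print(parse_size("-2 MB"))               # negative size
```

Even this toy version needs a unit table, radix handling and a decision about fractions of bytes. Translated unit names and non-Latin scripts are not covered at all.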

Then there are some query functions, the most important of which are the following two:

  • bs_size_convert_to() which can be used to convert a given size to some particular unit and
  • bs_size_human_readable() which gives a human-readable representation of a given size – i.e. with such a unit that the resulting number is neither too big nor too small
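The idea behind a human-readable representation can be sketched in a few lines of plain Python. This is only my illustration of the principle, using binary units and at most two decimal places by assumption. The human_readable() helper below is hypothetical and makes no claim about how the library picks units or formats numbers.

```python
from decimal import Decimal

BINARY_UNITS = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB", "ZiB", "YiB"]

def human_readable(size_in_bytes, max_places=2):
    """Pick the biggest binary unit that keeps the number below 1024."""
    sign = "-" if size_in_bytes < 0 else ""
    value = Decimal(abs(size_in_bytes))
    index = 0
    while value >= 1024 and index < len(BINARY_UNITS) - 1:
        value /= 1024
        index += 1
    text = format(value, f".{max_places}f")
    if "." in text:
        # Drop trailing zeros so that "10.00" becomes "10".
        text = text.rstrip("0").rstrip(".")
    return f"{sign}{text} {BINARY_UNITS[index]}"

print(human_readable(10 * 1024**3))
print(human_readable(1536))
```

A real implementation also has to deal with localized radix characters and translated units, which this sketch ignores completely.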

Last but not least, there are many methods for doing arithmetic and logical operations with sizes in bytes. It’s probably wise to mention here that not all arithmetic operations implemented for the mpz_t type are implemented for sizes. Some of them just don’t make sense: multiplication of a size by a size (what is GiB**2?), exponentiation, (square) root and others. However, there are some extra ones that don’t really make much sense for generic numbers but are quite useful when working with sizes, namely bs_size_round_to_nearest(), which rounds a given size (up or down) to the nearest multiple of another size. This is useful, for example, if you need to know how much space an LVM LV of a requested size will take in a VG with some particular extent size.
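The LVM case boils down to simple integer arithmetic, shown here as a plain-Python sketch. The round_to_multiple() helper and its direction parameter are hypothetical names of mine, not the library’s API, and the 4 MiB extent size is just an assumed example value.

```python
MiB = 1024**2

def round_to_multiple(size, unit, direction="up"):
    """Round size (in bytes) up or down to a multiple of unit (in bytes)."""
    if unit <= 0:
        raise ValueError("unit must be positive")
    if direction not in ("up", "down"):
        raise ValueError("direction must be 'up' or 'down'")
    quotient, remainder = divmod(size, unit)
    if remainder == 0:
        return size                      # already a whole number of units
    return (quotient + 1) * unit if direction == "up" else quotient * unit

extent_size = 4 * MiB                    # assumed VG extent size
requested = 1001 * MiB                   # requested LV size

print(round_to_multiple(requested, extent_size, "up") // MiB)
print(round_to_multiple(requested, extent_size, "down") // MiB)
```

With these numbers the LV would actually occupy 1004 MiB, because a VG can only hand out whole extents.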

Since GObject Introspection allows for overrides and the new library is expected to be used by Blivet instead of its own Python-only implementation of the Size class, there already are Python overrides making work with libbytesize’s Size class really simple. Here is an example Python interpreter session demonstrating the simplicity of use:

>>> from gi.repository.ByteSize import Size
>>> s = Size("10 GiB")
>>> str(s)
'10 GiB'
>>> repr(s)
'Size (10 GiB)'
>>> s2 = Size(10 * 1024**3)
>>> s2
Size (10 GiB)
>>> s + s2
Size (20 GiB)
>>> s - s2
Size (0 B)
>>> s3 = Size(s2)
>>> sum([s, s2, s3])
Size (30 GiB)
>>> -s2
Size (-10 GiB)
>>> abs(-s2)
Size (10 GiB)

And here come the dogs

I mean docs. The project is hosted on GitHub together with its documentation. The current release is 0.2, where the leading zero means that it is not a stable release yet. The API is unlikely to change in any significant way for the (stable) release 1.0, but since the library is not being used in any big project right now, we are leaving ourselves some room for potential changes. So if you find the API of the library wrong, feel free to let us know and we might change it to your liking! If you want to get a quick but still quite comprehensive overview of the library’s API, have a look at the header file it provides.

The last thing I’d like to mention here is that the library is packaged for the Fedora GNU/Linux distribution so if you happen to be using this distribution, you can easily start playing with the library by typing this into your shell:

$ sudo dnf install libbytesize python-libbytesize ipython
$ ipython

Using ipython also gives you TAB-completion. See the interpreter session example above to get a better idea of what to type in. Have fun and don’t forget to share your ideas in the comments!

  1. bs is the "namespace" prefix and size is the class prefix