Foundations of LVM for mere mortals

The original plan was to write a blog post about LVM Thin Provisioning because that is a technology, and at the same time a feature of LVM, that deserves a lot more attention and a much bigger number of deployments than it currently gets. However, while I was thinking about the contents of such a blog post and talking to my colleagues and some other people about LVM Thin Provisioning, I realized that I’d be jumping onto a fast-moving train in the middle of its journey. While thin provisioning is absolutely a great thing, its benefits and features are most visible when compared to traditional 1 LVM. But in my experience that comparison quickly leads to wondering why LVM hasn’t been doing things that way from the very beginning, and then to how traditional LVM actually works.

So I thought maybe I could just give a link to some existing blog post, wiki page or something else on the web that readers could read first before diving into the depths of thin provisioning. However, asking my favorite search engines about "introduction to LVM" didn’t really bring the results I was hoping for. Not that nothing was found or that it was all wrong or outdated; the problem was that it was all oriented towards practical use of LVM – which commands one should type into a terminal to create an LVM "stack" on their system – with only really brief descriptions of the basic terms like Physical Volume, Logical Volume and Volume Group. Don’t get me wrong, those pages are very useful, clear and helpful if one wants to do something with LVM on their system and use it as a black box that provides various features. They are just not so great for understanding how it all works, where the limitations come from and how amazing it is. That’s why I decided to write this blog post first. And before I actually start with the real thing, I’d like to make one thing clear right away: I’m not saying everything written here is 100% correct, precise and complete. This is just how I understand and see things to the extent I care about them.

The Device Mapper

If you’re wondering why I’m totally changing the topic now, please be assured that it’s not a wrong copy-paste or anything like that. The Device Mapper (DM) is the heart of LVM 2. Actually, one could say that LVM is just a set of tools and shortcuts for DM. But a really big and powerful set of tools and shortcuts that make things an order of magnitude simpler and more usable for mere mortals.

The Device Mapper is a kernel driver that provides all the flexibility and abstraction LVM then offers to users. It does exactly what its name suggests – it maps devices to other devices based on maps (called tables) configured by the user/administrator. Now, instead of going deep into the implementation details 3, let’s have a look at some examples that best describe the functionality the Device Mapper provides. The user interface for the DM is the dmsetup tool, which uses the libdevmapper library doing ioctl() calls on the special (character) device node /dev/mapper/control. The basic commands of the dmsetup tool are: ls to list all devices (mappings) provided by the DM; table to list all the tables (maps) currently defined in the DM; create to create a new DM device based on a given table (map); and remove to remove/destroy a given DM device (mapping). The simplest map type (so-called target) is the linear mapping which can be created like this:

# dmsetup create my_DM_dev --table '0 102400 linear /dev/sda 0'

First of all note that I’m running the command as root, which is necessary because it creates a new device node. my_DM_dev is the name of the newly created device and then the table follows. It says that 102400 sectors (yes, DM 4 works with 512 B sectors) starting at sector 0 of the newly created device should be linearly mapped to the device /dev/sda starting from its sector 0. The result is that I now have a new device /dev/mapper/my_DM_dev of size 50 MiB (102400 * 512 B) and if I try e.g. echo hello > /dev/mapper/my_DM_dev, I can check that the string hello is actually written to the very beginning of /dev/sda. Of course, one would probably do things like mkfs.xfs on such a device instead of echoing a string into it, but it doesn’t matter now. The important thing is that our new /dev/mapper/my_DM_dev device really is mapped to the beginning of /dev/sda. I can now go on and create another similar device, this time 200 MiB in size:

# dmsetup create my_DM_dev2 --table '0 409600 linear /dev/sda 102400'

It should now be clear that this device is mapped to the area of /dev/sda right after where the my_DM_dev device is mapped to. So effectively I created something like two partitions on my disk 5. Cool as hell, you say? I absolutely agree that this is not an easy and nice way of splitting a disk into separate units used for file systems or whatever. Well, at least removing the devices is as simple as dmsetup remove my_DM_dev (assuming they are not in use). Anyway, you could do a much better thing with any of the available partitioning tools with 10 % of the effort, right? But, how about this?:

# dmsetup create my_DM_dev
0 409600 linear /dev/sda 0
409600 409600 linear /dev/sdb 0

Here I’m not using the --table option because that only allows me to specify a one-line table. But the real strength of the Device Mapper is in things more complex and complicated than a trivial linear mapping to some area of a disk. Without the option the dmsetup tool allows the table to be specified on standard input, potentially on multiple lines (and terminated with the EOF character, i.e. Ctrl+D). 6 Although the numbers might be a bit confusing at first glance, it should be clear that I created a 400 MiB device, the first 200 MiB of which are linearly mapped to /dev/sda and the second 200 MiB are mapped to /dev/sdb. I can check the result e.g. by running:

# lsblk /dev/mapper/my_DM_dev
my_DM_dev 253:2    0  400M  0 dm

which tells me that the size really is 400 MiB. Also note the major number 253 which is common to all DM devices. By running dmsetup table my_DM_dev I can make sure that the device really uses both parts of the table/mapping I entered.
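
Assuming the setup above, the output of that check would look roughly like the following – dmsetup typically reports the underlying devices as major:minor pairs, here with /dev/sda being 8:0 and /dev/sdb being 8:16 on my system:

# dmsetup table my_DM_dev
0 409600 linear 8:0 0
409600 409600 linear 8:16 0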

So now I have a device that spans over two physical devices (disks in my case) which is something that’s not possible with standard partitions. Quite neat, isn’t it? And we can go way further with things like this:

# dmsetup create my_DM_dev
0 409600 linear /dev/sda 0
409600 409600 striped 2 32 /dev/sda 409600 /dev/sdb 0

The first line is the same as in the previous example – a simple linear mapping to the beginning of /dev/sda. But what about the second line? Instead of linear we are creating a striped mapping using 2 devices with a chunk size of 32 sectors (16 KiB), the devices being /dev/sda from its sector 409600 (where the linear part of the mapping ends) and /dev/sdb from its start sector. So we have a device that does its I/O operations in a traditional way if they go to its first 200 MiB, but stripes the I/O operations over two disks when they go to its second 200 MiB (just like RAID0 would do). We could test the difference by doing:

# dd if=/dev/mapper/my_DM_dev of=/dev/null bs=10M count=20
# dd if=/dev/mapper/my_DM_dev of=/dev/null bs=10M count=20 skip=20

i.e. reading the first 200 MiB and comparing the results to reading the second 200 MiB. Of course, the second run of dd should report something like double the speed of the first run. But it’s not really that trivial to test this with all the caches, readahead and other mechanisms that are on the way to the actual physical disk, especially if you are using a VM for testing this like I do. Anyway, the point is that we can combine various types of mappings supported by the Device Mapper and create devices with really interesting properties. You may ask what these "hybrids" could be good for. Well, if you have some extra information, such as the fact that your file system stores its inodes and journal at the beginning of the device it is created on, you could for example create a device that has its beginning (whatever that means for the file system) mirrored (the mirror mapping) with the rest being linear or even striped. I’m not a file system expert, but my naïve mind thinks it could be an interesting win-win solution for many use cases. Also a very important fact which may not be quite obvious from the above is that DM devices are just block devices and thus can be used in tables (maps) of other DM devices, creating really complex hierarchies with tailored properties and features.
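
Just to sketch what such a hybrid could look like (take this as an illustration rather than a recipe – the exact mirror target parameters depend on the kernel version and the sizes here are arbitrary), the first 4 MiB could be mirrored across both disks with the rest mapped linearly to /dev/sda:

# dmsetup create my_hybrid_dev
0 8192 mirror core 1 1024 2 /dev/sda 0 /dev/sdb 0
8192 401408 linear /dev/sda 8192

Here core 1 1024 asks for an in-memory mirror log with a region size of 1024 sectors, the 2 introduces the two mirror legs, and the second line maps the remaining 196 MiB linearly to /dev/sda, so /dev/sdb only holds a copy of the mirrored beginning.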

The examples above show that the Device Mapper really provides a very flexible way to work with physical storage devices by creating an abstraction layer that allows the user/administrator to create various combinations best suiting their use case. What the examples on the other hand don’t show are the capabilities of the DM regarding the types of mappings (targets) it supports. Apart from the linear, striped and mirror mappings mentioned above there are also error, zero, cache, crypt, delay, flakey, multipath, raid, snapshot, thin and thin-pool targets, some of which are mainly useful for testing file systems and applications (like zero, error and delay), while the others are extremely useful for real use cases. For example the crypt mapping is what is actually behind what most people know as the LUKS encryption. LUKS is one of the formats/algorithms the crypt target supports for encryption and decryption of the blocks written to and read from physical devices. The snapshot mapping allows a device to have a snapshot with a copy-on-write layer. That’s useful for doing some potentially unsafe operations because one can just revert the device to the snapshot if things go wrong, but it’s also useful for putting a read-write (cow) layer over a read-only (snapshot) device. Actually, this is how live CDs/DVDs/USBs/… are created (at least for Fedora).
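
To give at least a taste of the testing-oriented targets, a zero mapping needs no backing device at all – the following one-liner (the device name is, of course, arbitrary) gives you a 1 GiB device that reads back all zeroes and silently discards writes, which is handy for experimenting with file systems and tools:

# dmsetup create my_zero_dev --table '0 2097152 zero'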

As the example with the striped mapping shows, the more complex targets require more parameters when being established. Also, if a hierarchy of DM devices is created, it of course needs to be created in an order such that whenever a device is used in a table for a to-be-created device, it already exists in the system. All this means that it’s quite hard to use the dmsetup utility to create such setups, even if one just puts the resulting commands or maps (tables) into a file loaded at system boot. Which brings me back to the example where "something like two partitions" was created on the disk /dev/sda. The two fundamental differences between partitions and these DM devices created on the disk are that:

  1. if I rebooted the system my mappings would be lost and
  2. no tools, libraries or anything else would be able to tell that I have split the disk into two separate chunks just by looking at the disk and reading from it.

Actually both of these facts (and shortcomings) are caused by the same thing – the lack of meta data. If I create two partitions on a disk, the information is written to the MBR or GPT table(s) of that disk and everybody can read it, including the kernel and various tools run during system boot. Whereas if I split the disk into two chunks using DM, the information about this split is only held in RAM and never written to any persistent storage. Of course it’s possible to reserve some disk space and store the information about how things should be set up (i.e. meta data) there, probably in multiple copies over multiple disks because it is highly valuable information. And then read that information from those special places and set the DM devices up when asked. You know what this is called? Logical Volume Management. LVM does and provides many more things on top of this, but the basic functionality really is managing meta data and setting up, tearing down and replacing (dmsetup reload can be used to change the table online) Device Mapper devices at the right times.
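
Just to illustrate what that manual bookkeeping would look like without LVM (a sketch only – this is exactly the kind of work LVM does for you), one could dump the table to a file and recreate the mapping by hand after every reboot:

# dmsetup table my_DM_dev > /etc/my_DM_dev.table
(and then, after a reboot)
# dmsetup create my_DM_dev < /etc/my_DM_dev.table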

The Logical Volume Management

Logical Volume Management (LVM) is a set of tools and specifications that build an abstract (logical) layer on top of the physical storage. It has three types of basic building blocks: Physical Volumes (PVs), Volume Groups (VGs) and Logical Volumes (LVs). Physical Volumes are block devices (disks, partitions, iSCSI LUNs, DM devices,…) that are "given" to LVM to work with. PVs are grouped into pools called Volume Groups. By adding a PV into a VG (either when creating a new VG or extending an existing one), the space on the PV is split into chunks called Physical Extents (PEs), the size of which is specified by the VG (the current default is 4 MiB), and those PEs are added to the pool (VG). It might be useful to point out one important thing here – Volume Groups are not devices, there’s no way to use the space from them directly and they have no device nodes in the /dev file system tree. They are logical units of management and atomicity. Once a PV is added to a VG, it cannot just be moved into a different VG. It has to be removed from the first VG and then added to the other VG; there’s no notion of operations across Volume Groups. Last but not least there are Logical Volumes (LVs), which are block devices allocated in/from the VGs (as pools). Logical Volumes are logically split into Logical Extents that are mapped to Physical Extents (on Physical Volumes). This e.g. allows LVs to span multiple PVs (disks, partitions, iSCSI LUNs, DM devices,…) either linearly, using striping, mirroring or in other ways supported by LVM.
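
If you want to see these building blocks (and their extents) on a real system, the reporting commands pvs, vgs and lvs are the place to look. The columns below are just a reasonable pick of mine, the full list is in the man pages:

# vgs -o vg_name,pv_count,lv_count,vg_extent_size,vg_extent_count,vg_free_count
# pvs -o pv_name,vg_name,pv_size,pv_pe_count,pv_pe_alloc_count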

"…linearly, using striping, mirroring…", doesn’t it sound familiar? Yes, LVs are in fact Device Mapper devices or more precisely mappings/tables. They become DM devices once activated which means that the particular mappings are created/established in the DM. But DM mappings are not persistent so something has to take care about preserving the information (meta data) about the mappings (start sectors, sizes, devices, start sectors on those devices, etc.). And that something is Volume Groups. In fact, by adding a PV into a VG, meta data is written to the beginning of the PV (by default 1 MiB of space is reserved for that) 7. Expecting some condensed binary format there? No, it’s just plain text with newlines, comments and everything. You can try running dd if=/dev/sda bs=1M count=1|less which will give you some garbage in the beginning (reserved space for MBR or whatever), but then you’ll see a plain text definition of the VG the PV /dev/sda belongs to, with all the PVs, LVs and everything LVM needs to remember. The nicer way to get this data is to run vgcfgbackup -f lvm_meta.txt which extracts only the relevant part of the first 1 MiB, but you can check that it is the very same plain text data as on the disk. Actually a backup of meta data is created whenever any changes to the LVM configuration are done (if not configured otherwise in /etc/lvm/lvm.conf) with the results being stored in the /etc/lvm/archive/ directory. Of course a backup without a way to restore the state from it would be just a sad memory and using dd to write the metadata to the right place would be cumbersome. That’s why there’s also the vgcfgrestore tool that does the exact opposite of what vgcfgbackup does. So if you understand LVM well, you can actually dump the meta data, edit it in you favorite text editor and load it back. Pretty cool, isn’t it? Well, you probably don’t want to do that, and unless you encounter a bug there’s no reason to do so 8, but it’s really nice to have such an option (especially if you know one of the LVM gurus).

LVM meta data could be a topic for a separate blog post, but I hope the text of this post up to this line explains what its purpose and basic contents are. But the purpose of LVM is not to study how nice its meta data is, right?! It’s the abstraction of physical storage it provides that makes it so great. So what can be done with LVM and how does LVM do it in the background? Let’s start with the simplest example possible, assuming I have two 1 GiB disks:

# pvcreate /dev/sda
# pvcreate /dev/sdb
# vgcreate test /dev/sda /dev/sdb
# lvcreate -n myLV -L2G test

which creates two PVs, then the test VG using those two PVs, and then tries to create a 2 GiB LV myLV spanning both disks. But this results in the following error: "Volume group "test" has insufficient free space (510 extents): 512 required.". Wait, what? The disks have 1024 MiB of space and the default extent size is 4 MiB, so how come there are only 510 extents and not 512? LVM stores the meta data at the beginning of the PVs, remember? So the first 1 MiB of each PV is reserved for meta data, which means that the usable space for Physical Extents is 1023 MiB per PV and thus there are 255 (whole) extents on each PV, all together providing 2040 MiB of space. So, let’s try it again:

# lvcreate -n myLV -L2040M test

This time we get "Logical volume "myLV" created.". So it fit in and worked! There are multiple ways to specify the size – either with the -L option and the precise size (which is then rounded to a multiple of the extent size), or with the -l option and the number of extents, or with the -l option in combination with one of xy%VG, xy%PVS or xy%FREE. As you can probably see, all the ways other than -l with a number of extents are just syntactic sugar, as they all boil down to the calculation of the number of extents to use. But they are useful, especially in more complicated cases as we will soon see.
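
For example, the following is just an alternative way of creating the same LV as above (assuming the VG is otherwise empty) that uses all the free extents in the VG without me having to do the math:

# lvcreate -n myLV -l 100%FREE test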

So we have the myLV LV in the test VG spanning over PVs /dev/sda and /dev/sdb. And we know LVM uses the Device Mapper to create the block device for the LV so let’s take a closer look. By running dmsetup table we can see that there are two tables:

test-myLV: 0 2088960 linear 8:0 2048
test-myLV: 2088960 2088960 linear 8:16 2048

that define the test-myLV DM device. It’s easy to see that the device name is in the format VGname-LVname. The VG test has no DM device of its own because it is not a device and DM has no idea what a Volume Group is. However, there is the /dev/test directory with the /dev/test/myLV symlink that in my case points to /dev/dm-2, which is the real device node for the test-myLV DM device. Other LVs would of course have similar symlinks and device nodes. On top of all that, there’s also the /dev/mapper/test-myLV symlink pointing to /dev/dm-2 (in my case), as the /dev/mapper directory is used for symlinks using DM device names. The symlinks, unlike the device nodes, are persistent in the sense that they are the same after a reboot. Names of the device nodes may differ and thus shouldn’t be used in /etc/fstab and similar places. One more note about the tables – the 8:0 and 8:16 are MAJOR:MINOR numbers, device identifiers used instead of device node paths (devices are identified by them in the kernel). We could also run vgcfgbackup -f lvm_meta.txt test and have a look at the meta data to see how the myLV LV is defined there. It of course contains all the information LVM needs to create the DM device for the LV anytime it is asked to do so.
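
If you want to translate those MAJOR:MINOR pairs back to device names, lsblk can do it for you. On my system the output looks roughly like this (your numbers and tree may of course differ):

# lsblk -o NAME,MAJ:MIN /dev/sda /dev/sdb
NAME        MAJ:MIN
sda           8:0
└─test-myLV 253:2
sdb           8:16
└─test-myLV 253:2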

I can now use lvremove test/myLV to remove the myLV LV. Can you guess what happens behind the scenes then? Of course the DM device is torn down (removed) and the meta data fields defining the LV are removed. But I still have the PVs and the VG test so I can now create a striped LV in it:

# lvcreate -n myLV -L2040M --stripes=2 test

LVM informs me that the default stripe size of 64 KiB was used and creates the LV defined in the DM with this table:

test-myLV: 0 4177920 striped 2 128 8:0 2048 8:16 2048

which shouldn’t be surprising in any way. Just note that both PVs are used starting from their sector 2048 (1 MiB) because the meta data is written before that (the same holds for the linear LV above). There might be people who prefer having 100% control over what happens to their storage, but I think for the majority of us, the mere mortals, it’s much easier to just specify a name, size and the number of stripes than to write the DM table with everything in the right order and all the numbers, especially if there are supposed to be multiple such devices and thus the starting sectors have to be calculated. So LVM not only makes things persistent, it also makes setting everything up a lot easier. And the more complicated the thing is on the Device Mapper layer, the bigger the difference is. As another example, let’s have a look at what happens when I create a mirrored LV (we will get to the details of the command later):

# lvcreate -n myLV -L1016M --type=mirror -m1 --mirrorlog=mirrored test
  Logical volume "myLV" created.
# dmsetup table |grep test-myLV
  test-myLV: 0 2080768 mirror disk 2 253:4 1024 2 253:5 0 253:6 0 1 handle_errors
  test-myLV_mimage_1: 0 2080768 linear 8:16 2048
  test-myLV_mimage_0: 0 2080768 linear 8:0 2048
  test-myLV_mlog: 0 8192 mirror core 1 1024 2 253:2 0 253:3 0 1 handle_errors
  test-myLV_mlog_mimage_1: 0 8192 linear 8:16 2082816
  test-myLV_mlog_mimage_0: 0 8192 linear 8:0 2082816

Quite a lot of stuff LVM did for me, right? In the lvcreate command I specified the type of the LV to be mirror with 1 mirror (so the data is in two places – the origin and the mirror) and that I want the mirrorlog to be mirrored. The mirrorlog is the log/journal DM needs to keep track of what’s already mirrored (in sync) and what still needs to be mirrored. The default is disk, which means there is only a single copy of the log on one of the PVs, whereas mirrored means each PV gets a copy of the mirrorlog. 9 So LVM created two devices for the mirrorlog (test-myLV_mlog_mimage), the device that keeps them in sync (test-myLV_mlog), the devices that actually contain the data (test-myLV_mimage) and finally the mirror device that is the actual LV I wanted to create (test-myLV). The existence of the mirrorlog is also the reason why I could only create a 1016 MiB LV and not a 1020 MiB LV, as one (a person understanding that the LVM meta data has to go somewhere) might expect.
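
By the way, if you want to watch the mirror getting into sync, the copy_percent field of lvs reports the progress, with output along these lines (the column header may be shown as Copy% or Cpy%Sync depending on the LVM version):

# lvs -o name,size,copy_percent test
  LV   LSize    Cpy%Sync
  myLV 1016.00m    42.00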

Of course LVM knows about all the DM devices it creates for the mirrored LV myLV and with a little bit of extra effort I can get the information from it too:

# lvs -a -oname,vg_name,size test
  LV                   VG   LSize
  myLV                 test 1016.00m
  [myLV_mimage_0]      test 1016.00m
  [myLV_mimage_1]      test 1016.00m
  [myLV_mlog]          test    4.00m
  [myLV_mlog_mimage_0] test    4.00m
  [myLV_mlog_mimage_1] test    4.00m

The -a option tells LVM I want to see information about all LVs, even the hidden/internal ones 10. The difference between the outputs with and without the -a option can be quite big. For example, on the server we use as a KVM host at work, there are 6 LVs, but including the hidden/internal ones, there are 51 of them! Quite a difference, isn’t it? The reason for that is the setup we have there – LVM Thin Provisioning (more on that in the next post!) using LVM RAID (and that in the next-next post) in a RAID5 configuration on top of 5 disks. Well, I would really hate writing all those 51 DM tables, making sure they are correct and set up in the correct order. So however easy LVM’s task might have looked at the beginning of this section, I’m quite sure everybody gets the point now. And LVM goes many steps further with all this. As a last example, see what happens if I try to move Physical Extents from one disk to another one:

# lvcreate -n myLV -L1020M test /dev/sda
  Logical volume "myLV" created.
# pvmove /dev/sda /dev/sdb -b && lvs -a -oname,vg_name,size,segtype test
  LV        VG     LSize    Type
  myLV      test   1020.00m linear
  [pvmove0] test   1020.00m mirror

LVM creates a temporary mirror LV (and thus a DM device) that makes sure the extents are safely mirrored from the disk /dev/sda to the disk /dev/sdb. It’s then easy to remove the extents from the disk /dev/sda once the operation finishes successfully. And you know what’s really great about this all? I can have the test/myLV LV mounted and in use all the time, and if anything goes wrong, I can just fire the pvmove again with no data loss! Ok, now that I see it, I must admit I lied. Here’s one more example demonstrating a similar wonderful thing:

# lvcreate -n myLV -L1016M test
  Logical volume "myLV" created.
# lvconvert --type=mirror -m1 test/myLV
# lvs -oname,vg_name,size,segtype
  LV   VG     LSize    Type
  root fedora    8.51g linear
  swap fedora    1.00g linear
  myLV test   1016.00m mirror

Remember I mentioned DM tables can be replaced/reloaded? Well, here you have it – I converted a linear LV into a mirrored LV to get another copy of the data, making it more reliable. And I could have had it mounted and in use all the time, again! Want a real-life example of all this? Last week I bought a new SSD and a new HDD for my workstation. And without rebooting it once, I moved my system (including the root LV mounted at / and the home LV mounted at /home) to the new SSD with an extra LUKS encryption layer and later converted it, together with my data LV living on the HDD, into mirror LVs. So I went from a setup with the system on a single unencrypted disk and data on a single disk that could fail any time, lose my data and make me unproductive for days, to a setup using encryption and two disks for the system as well as for the data. And I didn’t have to reboot the system or go offline for a second! Really, really nice, LVM! And btw, this is why Fedora defaults to LVM partitioning/setup. It just provides much more flexibility than standard disk partitions with file systems.
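
Just to sketch the idea of such a migration (heavily simplified, with made-up device names, and glossing over things like the boot loader), the online part boils down to something like:

# cryptsetup luksFormat /dev/sdc1
# cryptsetup luksOpen /dev/sdc1 ssd_crypt
# pvcreate /dev/mapper/ssd_crypt
# vgextend fedora /dev/mapper/ssd_crypt
# pvmove /dev/sda2 /dev/mapper/ssd_crypt
# vgreduce fedora /dev/sda2

Converting the LVs into mirrors afterwards is then the same lvconvert --type=mirror -m1 step shown above, run for each LV once the second disk is in the VG.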

I hope this long post sheds some light on how LVM works and what its foundations are. Next time we will have a look at LVM Thin Provisioning, which takes the flexibility and the set of features to a completely different level. And honestly, it’s the next blog post that will give real meaning to this one, so don’t give up here and start looking forward to reading more stuff on this blog!

P.S. – comments from the LVM team

I gathered all my bravery and sent the link to this post to Red Hat’s LVM team mailing list. To my pleasant surprise I got very nice feedback with comments I’d like to share here:

  1. The mirrorlog defaulting to disk and thus existing in a single copy is a good strategy for the majority of cases, especially in the most common case with two disks. If the mirrored mode is used and one disk fails, what is left is a single copy with a log that says which parts of it are in sync. But in sync with what, when the other disk is gone? A complete resync is needed anyway when a new disk is later added, rewriting the mirrorlog. In the case of more than two disks, having the mirrorlog mirrored on multiple disks may save the system from resyncing the working disks. But that in general doesn’t outweigh the overhead of writing the log to multiple locations.
  2. If the -m1 option is used and the --type=mirror is omitted, LVM decides (based on the configuration) whether to create a mirror or raid1 DM device. The new versions of LVM default to raid1 which has some very nice and interesting advantages over the (much older) mirror target. But that’s something I’ll get to in one of the future blog posts (about LVM RAID).
  3. There’s no need to call pvcreate. vgcreate initializes the devices if they are not initialized and it even accepts and applies almost all of the command line options accepted by pvcreate.
  4. The ‘error’ target is being used not just for testing, but also e.g. for provisioning sparse devices and for replacing failed RAID legs.


  1. aka "fat LVM" :)

  2. LVM2 to be more precise, the original LVM implementation worked and lived on its own (in 2.4 kernels)

  3. readers are welcome to study the kernel sources if they want to (I haven’t been that brave yet)

  4. just like everything else storage-related in kernel

  5. There are at least two fundamental differences between that and my two DM devices, but we will get to them later.

  6. I could of course also put the table into a file and then use < my_table and let dmsetup read it from there.

  7. unless LVM is told not to do so (it might be wise to put meta data only on some of the PVs in some cases due to different speeds and access times of the PVs)

  8. I had to do it once when I was experimenting with LVM cache in writeback mode using a used SSD drive I bought on eBay on my production system (hint: not a good idea). I had to get rid of the cache, but LVM commands didn’t allow me to do so because the cache was dirty, but attempts to flush the cache only resulted in I/O errors.

  9. I honestly don’t know why mirrored is not the default as that’s what people usually expect to happen, I think. There’s also a third option – core – which means that the mirrorlog is only kept in RAM and gets lost on reboot, which results in a complete resync of the origin and mirror after boot.

  10. I had to specify the output options too, because otherwise the output was too wide.

Introducing libbytesize

Problem area

Many projects have to deal with representing sizes of storage or memory. In general, sizes in bytes. What may seem to be a trivial thing turns into hundreds of lines of code if the following things are to be covered properly:

  • using binary (GiB,…) and decimal (GB) units correctly
  • handling sizes bigger than MAXUINT64 (which is 16 EiB – 1)
  • parsing users’ input correctly with:
    • binary and decimal units
    • numeric values in various formats (traditional, scientific)
  • handling localization and internationalization correctly
    • different radix characters used in different languages
    • units being translated and typed in by users in their native format (even with non-latin scripts)
  • handling negative sizes
    • it sometimes makes sense to work with these, for example when some storage space is missing somewhere

Of course, not all projects working with sizes in bytes have hundreds of lines of code for dealing with the above points, but the result is then a bad user experience. In some cases, valid localized inputs are not accepted and correctly parsed, or users always get the English format and units no matter what the current locale and language configuration is. One of the biggest problems I see in many projects is that binary and decimal units are not used and differentiated correctly. If something shows the value 10 G, does it mean 10 GiB and thus 10240 MiB, or is it 10 GB and thus 10000 MB? Sometimes one can find this piece of information in the documentation (e.g. man pages), but often one just has to guess and try. Fortunately, it is only rarely that one can be really surprised by the documented behaviour, as for example in the case of the lvm utilities where g means GiB and G means GB. We should generally be doing a much better job of handling sizes correctly and consistently in all projects that have to handle them. However, it’s obvious that having a few hundred lines of code in every such project is nonsense.

An existing solution

One of the projects that I can gladly call a good example of how to deal with sizes in bytes is the Blivet python package used mainly by the Anaconda OS (Fedora, RHEL,…) installer. It has all the concerns mentioned above addressed in a proper and well-tested way in its class called simply Size. As the title of this post reveals, I’m trying to introduce a new library here, so the obvious question is: why invent and write something new when a good and well-tested solution already exists? The answer lies in the description of Blivet – it is written in Python, which makes its implementation of the Size class hardly usable from any other language/environment.

One step further

The obvious way to move further towards a widely reusable solution was to rewrite Blivet’s Size class in C so that it can be used from this low-level language and from the many other languages that very often facilitate the use of C libraries. However, again, what may seem to be an easy thing to do is not at all that simple. Blivet’s Python implementation is based on Python’s Decimal type, which is a numeric type supporting unlimited precision and arbitrarily big numbers. Also, dealing with strings and their processing is way simpler in Python than in C.

Nevertheless, C also has some nice libraries for working with big and highly precise numbers, namely the GMP and MPFR libraries that were created as part of the GNU project and which are for example used by many tools and libraries doing some serious maths. So it soon became clear that writing a C implementation of the Size class shouldn’t be an overly complicated task. And it turned out to be the case.

Here it is

The result is the libbytesize library that uses GMP and MPFR together with GObject Introspection to provide a nice object-oriented API facilitating work with sizes in bytes. It properly takes care of all the potential issues mentioned in the beginning of this post and is widely usable due to the broad support of GObject Introspection in many high-level languages. The library provides a single class called (warning: here comes the surprise) Size, which right now is basically a very thin wrapper around the mpz_t type provided by the GMP library for arbitrarily big integer numbers and thus it actually stores byte sizes as numbers of bytes. That is technically a precision limitation, but since no storage provides or works with fractions of bytes, it’s no real limitation at all.

There are (at this point) four constructors 1:

  • bs_size_new() which creates a new instance initialized to 0 B,
  • bs_size_new_from_bytes() which creates a new instance initialized to a given number of bytes,
  • bs_size_new_from_str() which creates a new instance initialized to the number of bytes the given string (e.g. "10 GiB") represents,
  • bs_size_new_from_size() which is a copy constructor.

Then there are some query functions the most important of which are the following two:

  • bs_size_convert_to() which can be used to convert a given size to some particular unit and
  • bs_size_human_readable() which gives a human-readable representation of a given size – i.e. with a unit chosen such that the resulting number is neither too big nor too small

Last but not least there are many methods for doing arithmetic and logical operations with sizes in bytes. It’s probably wise to mention here that not all arithmetic operations implemented for the mpz_t type are implemented for sizes. Some of them just don’t make sense – multiplication of a size by a size (what is GiB**2?), raising to a power, (square) roots and others. However, there are some extra ones that don’t really make much sense for generic numbers, but are quite useful when working with sizes, namely bs_size_round_to_nearest(), which rounds a given size (up or down) to the nearest multiple of another size. That is handy, for example, if you need to know how much space an LVM LV of a requested size will take in a VG with some particular extent size.

Since GObject Introspection allows for having overrides and the new library is expected to be used by Blivet instead of its own Python-only implementation of the Size class, there already are Python overrides making work with libbytesize’s Size class really simple. Here is an example Python interpreter session demonstrating the simplicity of use:

>>> from gi.repository.ByteSize import Size
>>> s = Size("10 GiB")
>>> str(s)
'10 GiB'
>>> repr(s)
'Size (10 GiB)'
>>> s2 = Size(10 * 1024**3)
>>> s2
Size (10 GiB)
>>> s + s2
Size (20 GiB)
>>> s - s2
Size (0 B)
>>> s3 = Size(s2)
>>> sum([s, s2, s3])
Size (30 GiB)
>>> -s2
Size (-10 GiB)
>>> abs(-s2)
Size (10 GiB)

And here come the dogs

I mean docs. The project is hosted on GitHub together with its documentation. The current release is 0.2, where the zero in the beginning means that it is not a stable release yet. The API is unlikely to change in any significant way for the (stable) release 1.0, but since the library is not being used in any big project right now, we are leaving ourselves some "manipulation space" for potential changes. So if you find the API of the library wrong, feel free to let us know and we might change it according to your suggestions! If you want to get a quick but still quite comprehensive overview of the library’s API, have a look at the header file it provides.

The last thing I’d like to mention here is that the library is packaged for the Fedora GNU/Linux distribution so if you happen to be using this distribution, you can easily start playing with the library by typing this into your shell:

$ sudo dnf install libbytesize python-libbytesize ipython
$ ipython

Using ipython also gives you TAB completion. See the above interpreter session example to get a better idea about what to type in then. Have fun and don’t forget to share your ideas in the comments!

  1. bs is the "namespace" prefix and size is the class prefix

libblockdev reaches the 1.0 milestone!

A year ago, I started working on a new storage library for low-level operations with various types of block devices — libblockdev. Today, I’m happy to announce that the library reached the 1.0 milestone which means that it covers all the functionality that has been stated in the initial goals and it’s going to keep the API stable.

A little bit of a background

Are you asking the question: "Why yet another piece of code implementing what’s already been implemented in many other places?" That’s, of course, a very good and probably crucial question. The answer is that I and the people who were at the birth of the idea think that this is the first time such a thing is implemented in a way that makes it usable for a wide range of tools, applications, libraries, etc. Let’s start with the requirements every widely usable implementation should meet:

  1. it should be written in C so that it is usable for code written in low-level languages
  2. it should be a library, as DBus is not usable together with chroot() and things like that, and running subprocesses is suboptimal (slow, eating a lot of random data entropy, requiring parsing of the output, etc.)
  3. it should provide bindings for as many languages as possible, in particular the widely used high-level languages like Python, Ruby, etc.
  4. it shouldn’t be a single monolithic piece required by every user code no matter how much of the library it actually needs
  5. it should have a stable API
  6. it should support all major storage technologies (LVM, MD RAID, BTRFS, LUKS,…)

If we take the candidates potentially covering the low-level operations with block devices — Blivet, ssm and udisks2 (now being replaced by storaged) — we can easily come to the conclusion that none of them meets the requirements above. Blivet 1 covers the functionality in a great way, but it’s written in Python and thus hardly usable from code written in other languages. ssm 2 is also written in Python; moreover, it’s an application and it doesn’t cover all the technologies (it doesn’t try to). udisks2 3 and now storaged 4 provide a DBus API and don’t provide, for example, functions related to BTRFS (or even LVM in the case of udisks2).

The libblockdev library is:
  • written in C,
  • using GLib and providing bindings for all languages supporting GObject introspection (Python, Perl, Ruby, Haskell*,…),
  • modular — using separate plugins for all technologies (LVM, Btrfs,…),
  • covering all technologies Blivet supports 5 plus some more,

by which it fulfills all the requirements mentioned above. It’s only a wish, but a strong one, that every new piece of code written for low-level manipulation with block devices 6 should be written as part of the libblockdev library, tested and reused in as many places as possible, instead of being written again and again in many, many places with new, old, weird, surprising and custom bugs.


As mentioned above, the library loads plugins that provide the functionality, each related to one storage technology. Right now, there are lvm, btrfs, swap, loop, crypto, mpath, dm, mdraid, kbd and s390 plugins. 7 The library itself basically only provides a thin wrapper around its plugins so that it can all be easily used via GObject introspection and so that it is easy to set up logging (and probably more in the future). However, each of the plugins can be used as a standalone shared library in case that’s desired. The plugins are loaded when the bd_init() function is called 8 and changes (loading more/fewer plugins) can later be done with the bd_reinit() function. It is also possible to reload a plugin in a long-running process if it gets updated, for example. If a function provided by a plugin that was not loaded is called, the call fails with an error, but doesn’t crash, and thus it is up to the caller code to deal with such a situation.

The libblockdev library is stateless from the perspective of block device manipulations. I.e., it has some internal state (like tracking whether the library has been initialized or not), but it doesn’t hold any state information about the block devices. So if you e.g. use it to create some LVM volume groups and then try to create a logical volume in a different, non-existing VG, it just fails at the point where LVM realizes that such a volume group doesn’t exist. That makes the library a lot simpler and "almost thread-safe", with the word "almost" being there just because some of the technologies don’t provide any other API than running various utilities as subprocesses, which cannot generally be considered thread-safe. 9

Scope (provided functionality)

The first goal for the library was to replace Blivet’s devicelibs subpackage that provided all the low-level functions for manipulations with block devices. That fact also defined the original scope of the library. Later, we realized that we would like to add LVM cache and bcache support to Blivet, and the scope of the library got extended to the current state. The supported technologies are defined by the list of plugins the library uses (see above) and the full list of the functions can be seen either in the project’s features.rst file or by browsing the documentation.

Tests and reliability

Right now, there are 135 tests run manually and by a Jenkins instance hooked up to the project’s Git repository. The tests use loop devices to test the vast majority of the functions the library provides 10. They must be run as root, but that’s unavoidable if they should really test the functionality and not just some mocked-up stubs that we would believe behave like a real system.

The library is used by Fedora 22’s installation process, as F22’s Blivet was ported to use libblockdev before the Beta release. There have been a few bugs reported against the library (the majority of them related to FW RAID setups), with all of them being fixed and covered by tests for those particular use cases (based on data gathered from the logs in bug reports).

Future plans

Although the initial goals are all covered by version 1.0 of the library, there are already many suggestions for additional functionality and also extensions of some of the functions that are already implemented (extra arguments, etc.). The most important goal for the near future is to fix reported bugs in the current version and promote the library as much as possible so that the wish mentioned above gets fulfilled. The plan for the slightly more distant future (let’s say 6-8 months) is to work on additional functionality targeting version 2.0, which will break the API for the purpose of extending and improving it.

To be more concrete, one of the planned new plugins is for example the fs plugin that will provide various functions related to file systems. One of such functions will definitely be the mkfs() function that will take a list (or dictionary) of extra options passed to the particular mkfs utility on top of the options constructed by the implementation of the function. The reason for that is the fact that some file systems support many configuration options during their creation and it would be cumbersome to cover them all with function parameters. In relation to that, at least some (if not all) of the LVM functions will also get such an extra argument so that they are useful even in very specific use cases that require fine-tuning of the parameters not covered by the functions’ arguments.

Another potential feature is to add some clever and nice way of progress reporting to functions that are expected to take a lot of time to finish – like lvresize(), pvmove(), resizefs() and others. It’s not always possible to track the progress because even the underlying tools/libraries don’t report it, but where possible, libblockdev should be able to pass that information to its callers, ideally in some unified way.

So a lot of work behind, much more ahead. It’s a challenging world, but I like taking challenges.

  1. a python package used by the Anaconda installer as a storage backend

  2. System Storage Manager

  3. daemon used by e.g. gnome-disks and the whole GNOME "storage stack"

  4. a fork of udisks2 adding an LVM API and being actively developed

  5. the first goal for the library was to replace Blivet’s devicelibs subpackage

  6. at higher than the most low-level layers, of course

  7. I hope that with the exception of kbd which stands for Kernel Block Devices the related technologies are clear, but don’t hesitate to ask in the comments if not.

  8. or e.g. BlockDev.init(plugins) in Python over the GObject introspection

  9. use Google and "fork shared library" for further reading

  10. 119 out of 132 to be more precise

Snakes++: Anaconda goes Python 3

Anaconda is the OS installer used by Fedora and RHEL GNU/Linux
distributions and all their derivatives. It’s written in the Python programming
language and many people say it’s one of the biggest and most complex pieces of
software written in this dynamic programming language. At the same time it is
one of the oldest big Python projects. [1]

[1] the first commit in Anaconda’s git repository is from Apr 24 1999, but
that’s the beginning of the GUI code being developed so the core actually
predates even these "IT-old times"

Fedora, Anaconda, Python 3

Over time, the Python language has been evolving and so has Anaconda’s
codebase, getting not only new features and bug fixes, but also code improvements
using new language features. [2] Such evolution has been happening in small
steps over time, but in recent years the community around the Python
language has been slowly migrating to a new backwards-incompatible version of
the language — Python 3. Python 3 is the version of Python that will get
future improvements and generally the vast majority of focus and work; there
will only be bugfixes for Python 2 in the future. [3] Users of Fedora may have
noticed that there was a proposal for a major Python 3 as Default change that
suggested migrating core components (more or less everything that’s available on
a live medium) to Python 3 for Fedora 22. Since some developers and partially
also QA ignored it (intentionally or not), the deadline was missed and the
change was postponed to Fedora 23. Together with this, Anaconda’s deadline
for the "Python 3 switch" (see below) was postponed to the time when Fedora 22
gets released, as we identified three key facts during the discussions about the
original feature proposal (for F22):

  1. no matter how clear the target is, people ignore it as long as things are not broken (at which point they start complaining :-P)
  2. it’s hard and inefficient to maintain both Python 2 and Python 3 versions of such a big project as Anaconda
  3. QA doesn’t have enough capacity to test both versions and thus switching between them during the release cycle would break things, moving the whole process at least a few weeks back

[2] an interesting fact is that the beginning of the history of Anaconda’s
sources predates the existence of the True and False (boolean) values
in Python

"Python 3 only" vs. "Python 2/3 compa­tible"

Python 3 is a major step and making some code compatible with both Python 2 and
Python 3 usually requires adding ifs checking which version of Python the
code is run under and doing one thing or another based on such a check. There is a
very useful module called six [5] that provides many functions, lists and
checks that hide these ifs, but even when using this module, the code gets
more complicated, less readable and harder to debug (and thus maintain) by
making it Python 2/3 compatible. While for libraries (or more precisely Python
packages) it is worth it, as it makes them usable for a wider variety of user
code, for applications written in Python 2 it is easier and in many ways better
to just switch to Python 3.


For the reasons described above the Red Hat Installer team, as the group of
Anaconda’s developers and maintainers, decided to make all their libraries Python
2/3 compatible and to move Anaconda itself to Python 3. The only exception is the
pyblock (CPython) library that was developed in 2005 to provide quite a wide
range of functionality (not only) for the Anaconda installer, but which has
over time been replaced more and more by other libraries, utilities and other
means and ended up being used only by the installer. Thus instead of porting the
whole library to Python 3 we decided to drop it and implement the few required
functions in the (new) libblockdev library [6] that was being born at that
time.

[6] using GLib and GObject introspection as shown in some of my other
posts and thus being both Python 2 and Python 3 compatible

Yum vs. DNF

Not everything used by the Anaconda installer is, of course, developed and
maintained by the Installer team. There were a few "external" Python libraries
that needed to be made Python 2/3 compatible and then there was Yum,
used by Anaconda for one of the key things — package installation. Yum is
usually used as the yum utility, but it is also a Python package, which is
the way Anaconda has been making use of it. However, Yum has been slowly
replaced by a new project called DNF that started as a fork of Yum and that
has been replacing Yum code with either new code or calls to newly born (C)
libraries. It has been decided that Yum will never get Python 3 support as
it will stay in maintenance mode with new features and development being
focused on DNF. The result for Anaconda was that with the switch to Python 3 it
would also have to say "goodbye" to Yum and use DNF instead. Fortunately, the
author and first maintainer of DNF — Aleš Kozumplík — gave his former team
great help and wrote the vast majority of Anaconda’s DNFPayload
class. Still, the switch from Yum to DNF [7] in the installer was expected to
be a problematic thing and was one of the "official reasons" why the switch to
Python 3 was postponed to Fedora 23. [8]

[7] by default (previously it was only activated by the inst.dnf=1 boot option)
[8] although so far the DNFPayload seems to be working great, with only a
few smaller (non-key) features missing and being added over time, and it is
the default for Fedora 22 (with inst.nodnf turning it off)

We need you

As can be seen from the above, there are lots of things behind something that
might look like "just a switch to a newer version of Python". And where there is
a lot of stuff behind something, there’s also a lot of stuff that can go wrong. Be
it Anaconda’s code (no, the 2to3 utility really doesn’t do it all
magically) or any of the libraries it uses, there is quite a lot to test and
check. That’s why we decided to give ourselves a head start and did the switch to
Python 3 in a separate branch of the project, using Copr to do unofficial
builds and composes. [9] At first, we had only been creating Rawhide composes,
but that turned out to be not enough as we spent as much time hitting and
debugging unrelated Rawhide issues as Python 3 related ones. That’s why
we decided to spend extra time on it, ported the f22-branch code to Python 3
and started creating F22 Python 3 composes that are stable and do not suffer
from issues caused by unstable Rawhide packages.

The images are at
and we would like to encourage everybody having some free "testing time" to
download and test them by doing various more or less weird Fedora 22
installations. Please report issues at Anaconda’s GitHub project page and if
you have a patch, please submit a pull request against our development branch.

Last but not least, big THANKS for any help!


bcache and/vs. LVM cache

What’s going on here?

One of the bottlenecks of today’s computers is storage. While CPUs, buses and
other components of computers have really nice throughput values going up to
several GiB/s, disks are really slow compared to them. HDDs give a few hundred
MiB/s at most when performing sequential reads/writes and much less when doing
random I/O operations. While SSDs are much faster than HDDs, especially in doing
random I/O operations, they are much more expensive and thus not so great for big
amounts of data. As usual in today’s world, the key word for a win-win solution
is the word "hybrid". In this case a combination of HDD and SSD (or just their
technologies in a single piece of hardware) using a lot of HDD-based space
together with a small SSD-based space as a cache providing fast access to
(typically) the most frequently used data. There are many hardware solutions that
provide such hybrid disks, but they have the same drawbacks as hardware RAIDs —
they are not at all flexible and really good just for a particular use case. And
just as the solution for better flexibility and a broader range of use cases with
RAID is a software RAID, with hybrid disks software comes into this game (to win
it, maybe?) with multiple approaches. The two most widely used and
probably also most advanced are bcache and LVM-cache (or dm-cache, as
explained below). So what are these two and how do they differ? Let’s focus on each
separately and then compare them a bit.


What is it?

bcache or Block (level) cache is a software cache technology being developed
and maintained as part of the Linux kernel codebase which, as its name suggests,
provides cache functionality on top of arbitrary (pairs of) block devices. As
with any other cache technology bcache needs some backing space (holding data
that should be cached), typically on a slow device, and some cache space,
typically on a fast device. Combined with the fact that bcache is a block
cache, we get the fact that both the backing space and the cache space can be
arbitrary block devices — i.e. disks, partitions, iSCSI LUNs, MD RAID devices,
etc.

Deployment options

The simplest solution is to use an HDD (let’s say /dev/sda) together with an
SSD (let’s say /dev/sdb), create a bcache on top of them (as described
below) and then partition the bcache device for the system. A bit more
complicated solution is to create partitions on the HDD and SSD and create one
or more bcache devices on top of the partitions that are desired to be cached. Why
should one even think about this more complicated solution? Because it provides much
better flexibility. While creating bcache on top of the whole HDD and SSD
devices gives us basically the same as a hybrid disk (except that we need two SATA
ports and we can choose from more HDD and SSD sizes), creating bcache(s) on top of
partitions allows us e.g. to have some data (e.g. system data) directly on the SSD
and some other data in a bcache (HDD+SSD), or even to have multiple bcache devices
with different backing space and cache space sizes or even caching policies (see
below for details).

Setting up

So, let’s say we have an HDD (/dev/sda) and an SSD (/dev/sdb) and we
have some partitions created on them — let’s say /dev/sda1 on the whole HDD
(to be used for /mnt/data) and /dev/sdb1 used for the system (/) plus
/dev/sdb2 (dedicated to the cache) on the SSD.

First of all we need to install the tools that will allow us to create,
configure and monitor the bcache. These are typically a part of a package
called bcache-tools or similar. So on my Fedora 21 system, I need to run the
following command to get it (# means it should be run as root):

# dnf install bcache-tools

Another tool we will need is the wipefs tool which is part of the
util-linux package that should already be installed in the system.

With all the necessary tools available, we can now proceed to the bcache
creation. But before we start creating something new, we first need to wipe all
old weird things from the block devices (in our case partitions) we want to
use (WARNING: this removes all file system and other signatures from the
/dev/sda1 and /dev/sdb2 partitions):

# wipefs -a /dev/sda1
# wipefs -a /dev/sdb2

Cleaned up. Now, as is usual with basically all storage technologies, we need to
write some metadata to the devices we want to use for our bcache so that the
code providing the cache technology can identify such devices as bcache devices
and so that it can store some configuration, status, etc. data there. Let’s do
it then:

# make-bcache -B /dev/sda1

This command writes bcache metadata for the backing device (space) to the
partition /dev/sda1 (which is on the HDD). Believe it or not, this is
all we need to create a bcache device. If udev is running and the appropriate
udev rules are effective (if not, we have to do it manually [1]), we should
now be able to see the /dev/bcache0 device node and the /dev/bcache/
directory (try listing it to see what’s inside) in our file system hierarchy,
which we could start using. Really? Is that everything that needs to be done?
Well, it’s not that easy. Remember that every cache technology needs backing
space and cache space, and with the command above we have only defined the
backing device (space). So we now of course have to define the cache device
(space), again by writing some metadata into it:

# make-bcache -C /dev/sdb2

The result is that we now have the metadata written to both the backing device
(space) and the cache device (space). However, these devices don’t know about
each other and the caching code (i.e. the kernel in case of bcache) has no idea
about our intention of using /dev/sdb2 as a cache device for
/dev/sda1. Remember that the first make-bcache run created the
/dev/bcache0 device that was usable from the first moment? Well, it was
usable as a bcache device, but without any cache device, which is not really
useful. The last missing step is to attach the cache device to our bcache device
bcache0 by writing the Set UUID from the make-bcache -C run to the
appropriate file:

# echo C_Set_UUID_VALUE > /sys/block/bcache0/bcache/attach
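
If the Set UUID printed by make-bcache -C has already scrolled away, it can be
read back from the cache device’s superblock; the bcache-super-show tool shipped
with bcache-tools should print it (look for the cset.uuid line):

# bcache-super-show /dev/sdb2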

From now on we can enjoy the speed, noise and other improvements provided by the
use of our cache. The /dev/bcache0 device is just a common block device and
the easiest thing to do with it is to run e.g. mkfs.xfs on it, mount the
file system to e.g. /mnt/data and copy some data to it. If we later want to
detach the cache device from the bcache device, we just use the detach file
instead of the attach file in the same directory under /sys.

As I’ve mentioned in the beginning of this post, SW-based cache solutions
provide more flexibility than HW solutions. One area of such flexibility is
configuration, because it is quite easy to make a SW solution configurable and
extensible compared to a HW solution. The configuration of our bcache can be
controlled by reading and writing files under the /sys file system. The
easiest and most useful example is changing the cache mode: the default is
writethrough, which is the safest one, but which on the other hand doesn’t
save the backing device (HDD) from many random write operations. Another typical
mode is writeback, which keeps the data in the cache (SSD) and only once in a
while writes it back to the backing device. To change the mode we simply run
the following command:

# echo writeback > /sys/block/bcache0/bcache/cache_mode

However, this change is only temporary and we have to do the same after every
boot of the system if we want to always use the writeback mode (of course we
can do this in a udev rule, systemd service, init script or whatever we
prefer instead of doing it manually after each boot).
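
To check which mode is currently active, the same file can simply be read back;
the active mode should be the one shown in square brackets among the available
modes:

# cat /sys/block/bcache0/bcache/cache_mode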

[1] by running # echo /dev/sda1 > /sys/fs/bcache/register

Monitoring and maintenance

Even though it is usually possible to see (and even hear [2]) the difference once
bcache is created and used instead of just the bare HDD, people are curious
and always want to know something more. A typical question is: "How well is the
new solution performing?"
In the case of a cache, the clearest performance metric
is the ratio of read/write hits and misses. Of course, the more hits compared to
misses, the better. To find out more about the current state, status and stats of
a bcache, another tool from the bcache-tools package can be used:

# bcache-status -s

In the output we should see quite a lot of interesting information and we can
for example also check that the desired cache mode is being used. There are
other configuration options and other stats that might be important for many
users, but these are left to the kind reader for further exploration.
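
If bcache-status is not available on your system, the hit/miss counters can, as
far as I know, also be read directly from sysfs (the exact set of files may
differ between kernel versions):

# cat /sys/block/bcache0/bcache/stats_total/cache_hits
# cat /sys/block/bcache0/bcache/stats_total/cache_misses
# cat /sys/block/bcache0/bcache/stats_total/cache_hit_ratio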

[2] if the writeback mode is used, many writes to the backing device are
spared and the rest are serialized as much as possible, which makes the HDD
quite a lot less noisy because the R/W head doesn’t move around randomly

LVM cache (dm-cache)


We have seen in the previous part of this post that bcache is quite a powerful
and flexible solution for using HDD and SSD in a combination giving us great
performance (of the SSD) and big capacity (of the HDD). So one may ask why we
even bother with a description of some other solution. What could possibly be
better with LVM cache (dm-cache) compared to bcache?

A little bit about terminology

First of all, let’s start with a clarification of why I have up until now always
referred to this technology as "LVM cache (dm-cache)". Some people know,
some may not, that LVM (which stands for Logical Volume Management) is a
technology for abstract volume management in user space built on top of the
Device Mapper functionality (in both user space and the kernel). As a result,
everything that can be done with LVM can be done by using the Device Mapper
directly (even though it is typically incomparably more complex) and anything
that LVM does needs to have the underlying (or low-level, if you prefer) support
in the Device Mapper. The same applies to the caching technology, which is
provided by the cache Device Mapper target and made "consumable" by LVM cache.

Okay, okay, but why?

Now, let’s get back to the big question from the first paragraph of this
section. The answer is clear and simple to people who like LVM: LVM cache
is to bcache what LVM is to plain partitions. For people who don’t
like, use or simply don’t get LVM, an example of quite a big difference could
be the best argument. The first step we did in order to set our bcache up was
wiping all signatures from the block devices we wanted to use for both the backing
space and the cache space. That means that any file systems that potentially
existed on those block devices would be removed, leaving the data unreadable and
practically lost. With LVM cache it is possible to take an existing LV
(Logical Volume) with an existing (even mounted) file system and convert it to a
cached LV without any need to move the data to some temporary place and even
without any downtime [3]. And the same applies if we, for example, later decide
that we want to stripe the cache pool over two SSDs (RAID 0) to get more cache
space and really nice performance, or on the other hand mirror the backing device
to get better reliability (or both, of course). So we may easily start with some
basic setup and improve it later as we get more HW available or different
requirements. LVM cache also provides better control and even more
flexibility by allowing the user to manually define the data and metadata parts of
the cache space with various different parameters (e.g. a mirrored metadata part
on more reliable devices combined with a striped data part for more space and
better performance).

[3] a typical approach to convert a block device into a "bcached" block
device is to freeze the data on it, move/copy it somewhere else, set the
bcache up and move the data back

Setting up

Let’s assume we have the same HW as in the case of bcache, an HDD and an SSD,
but this time let’s also assume that we already have LVM set up on the HDD (or
even multiple HDDs, that makes no difference for the commands we are going to
use) and that the SSD provides 20 GiB of space. Setting up LVM on top of HDD(s)
would be a nice topic for another blog post, so let me know in the comments if
you are interested in such a topic. Now we want to demonstrate one of the
benefits of LVM cache over bcache, so let’s assume all the basic LVM
setup work is done and we have an LV (Logical Volume) with some file system
and data on it, using the HDD for its physical extents [4], named DataLV
and part of the data VG (Volume Group) (the backing space is called Origin
in LVM’s terminology). We will basically follow the steps described in the
lvmcache(7) man page (another benefit over bcache from my point of view).
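
Just to make the assumed starting point concrete, the existing layout could be
inspected with something like this (the output of course depends on your actual
setup):

# pvs
# lvs -o lv_name,vg_name,lv_size,segtype,devices data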

As the first step, we need to add the SSD (/dev/sdb) to the same volume
group our data-holding LV (DataLV) is in. To do that, we need to
tell LVM that the /dev/sdb block device should become an LVM member device
(we could use a partition on /dev/sdb if we wanted to combine partitions and
LVM on our disks):

# pvcreate /dev/sdb

If that fails because of some old metadata (disk label, file system
signature…) left on the disk, we can either use the wipefs tool (as
in the case of bcache) or add the --force option to the pvcreate command.

Once LVM marks the /dev/sdb device as an LVM member device [5], we can
add it to the data VG:

# vgextend data /dev/sdb

The data VG now sees the SSD as free space for allocation if we create
more LVs in it or grow some existing ones. But we want to use it as cache
space, right? Well, LVM only knows PVs (Physical Volumes), VGs and
LVs. However, LVs can be of various types (linear, striped, mirror, RAID, thin,
thin pool,…) which can be changed online. So let’s start with the creation of a
good old LV with the size we want for our cache space and with its PEs
(Physical Extents) allocated on the SSD:

# lvcreate -n DataLVcache -L19.9G data /dev/sdb

I believe an attentive reader now asks why only 19.9 GiB when we have 20 GiB
of space on the SSD. The reason is that we are going the "hard" (more
controlled) way and we need some space for a separate metadata volume, which we
can now create:

# lvcreate -n DataLVcacheMeta -L20M data /dev/sdb

with the size of 20 MiB, because the LVM documentation (the man page) says it
should be 1000 times smaller than the cache data LV, with a minimum size of
8 MiB. If we wanted the DataLVcache and/or DataLVcacheMeta to be more
special (like mirrored), we could have created them as such right away. Or
we can convert them later if we want to. But for now, let’s just follow our
simple (and probably most common) case. The next step is to
"engage" the data cache LV and the metadata cache LV in a single LV called a
cache pool. A cache pool is an LV that provides the cache space for the backing
space, with the metadata being written and kept in it. And as such, it is created
from the data cache LV, or more precisely converted:

# lvconvert --type cache-pool --cachemode writethrough --poolmetadata data/DataLVcacheMeta data/DataLVcache

As you can see, we specify the cache mode on cache pool creation. The bad thing
about it is that it cannot be changed later, but the good thing is that
it is persistent. And honestly, other than when playing with various technologies,
how often does one need to change the cache mode? If it’s really needed, the cache
pool can simply be created again with a different cache mode.

It’s been a long way here, I know, but we are almost at the end now, I
promise. The only missing step is to finally make our DataLV cached. And as
usual with LVM, it is a conversion:

# lvconvert --type cache --cachepool data/DataLVcache data/DataLV

And with that, we are done. We can now continue using the DataLV logical
volume, but from now on as a cached volume using the cache space on the SSD.
Unfortunately, there seems to be no nice tool shipped with LVM that would
give us all the cool stats just like bcache-status does for bcache. The
only such tool I’m aware of is the lvcache tool written by Lars
Kellogg-Stedman, available from his git repository. Hopefully this will change
when LVM cache starts to be more widely deployed and used.

[4] LVM’s units of physical space allocation
[5] try running wipefs (without the "-a" option!) on it
[6] with lvcreate --type cache-pool -L20G -n DataLVcache data /dev/sdb


I know it probably seemed really complicated and much harder to set up LVM cache
than to set up bcache, but if we wanted to, we could have dropped the
separate data and metadata cache LV creation and done it in a single step,
creating the cache pool right away. [6] I just wanted to demonstrate the extra
control and possibilities LVM cache provides. Without that, the LVM cache
setup would really be very similar to the bcache setup, but still with the
big advantage of doing everything online without any need to move data somewhere
else and back.

I don’t think that either of the two SW cache technologies presented in this blog
post is better than the other. Just as I mentioned at the very beginning
of the LVM cache description, LVM cache is to bcache what LVM is to
partitions. So if somebody has some advanced knowledge and likes having things
configured the exact, complex way they think is best for their use case, or
if somebody needs to deploy a cache online without any downtime, then LVM cache
is probably the better choice. On the other hand, if somebody just wants to make
use of their SSD by setting up an SW cache on a fresh pair of SSD and HDD and they
don’t want to bother with all the LVM stuff and commands, then bcache is
probably the better choice.

And as usual, having two independent and separate solutions for a single problem
leads to many new and great ideas that are shared in the end, because what gets
implemented in one of them usually sooner or later makes it to the other too,
typically even somewhat improved. Let’s just hope that this will also apply to
bcache and LVM cache and that both technologies get deployed widely enough
to be massively supported, maintained and further developed.


  • As Barry Shilliday pointed out in the comments, the LVM cache setup can be
    done even more easily, with the cache pool creation and the conversion of the LV
    into a cached LV in a single step:
    # lvcreate --type cache -L 19.9G -n DataLV_cachepool data/DataLV /dev/sdb
  • I was informed that the lvs command now supports options to list some
    cache stats. The # lvs -o help 2>&1 | grep cache command lists all the cache
    stats/settings that can be printed out by commands like:
    # lvs -o cache_read_hits,cache_read_misses data/DataLV.
  • According to the updated lvmcache(7) man page:
    The cache mode can be changed on an existing LV with the command:
           lvconvert --cachemode writethrough|writeback VG/CacheLV

Introducing blivet-gui — a new GUI storage management tool


Let’s be honest and start directly with the inconvenient truth 1: storage management is hard. There are many technologies using many different ways to make our data available faster, more reliably, with high availability, over long distances, etc. What’s even harder is that all those technologies can be combined together to get a combination of their advantages. Do you think having tens of iSCSI LUNs combined in an MD RAID exporting RAID devices used as LVM PVs, together with two SSD drives combined in another MD RAID providing fast dm-cached LVs, with an XFS file system on top of all this exported as a GlusterFS, sounds crazy, overly complicated and totally unusable? 2 Five words: “Welcome to the enterprise storage!” Something like that is a perfectly usable solution providing all the features mentioned at the beginning of this post.

The Anaconda installer and blivet

For a storage expert or at least a senior system administrator it is quite easy to run a few commands in a shell to set up the stack described above. But we are in the age of nice, shiny graphical user interfaces that are expected, and thus required, to support many more or less crazy combinations and configuration options. And when does storage configuration usually happen? When the system is installed. That’s the reason why the most comprehensive and feature-complete storage management (or, more precisely, storage configuration) UI was over the years implemented in the Anaconda installer, the RHEL and Fedora 3 OS installer. Just a small demonstration: the only two technologies from the example described above currently unsupported by the Anaconda installer are GlusterFS and dm-cache 4, with the list of supported technologies being at least three times as long as the list of abbreviations in that example.

Having probably the only code in the world supporting so many technologies, it started to become evident that it would be useful to put such code into a library usable by other projects. And thus the code, majorly rewritten by David Lehman in 2009, was in early 2013 split into a library (or more precisely a Python package) which was, due to its nature, given the name blivet: “[blivet] is an undecipherable figure, an optical illusion and an impossible object” 5, because it seems impossible to provide a high-level and easy to understand API for something as complicated as (enterprise) storage.

Birth of blivet-gui

When Vojtěch Trefný approached me last year telling me that he didn’t like the Anaconda installer’s partitioning UI and that he had seen a bachelor thesis topic focused on improving it (by implementing a visualisation of the changes that will happen on the disks), I was really happy to see somebody who doesn’t “just complain”, but actually wants to make some effort to make things better. Since the topic focused on visualisation had already been taken by somebody else 6, we needed to come up with something new. And that’s how the idea of a new storage management tool based on the blivet library was born. We gave Vojtěch a link to the blivet documentation and he then, to my surprise, came to me (as the supervisor of the thesis) after a few months with the announcement that he had it done and in a shape that fulfilled the requirements specified in the bachelor thesis’ description: partitions and LVM support, user and developer documentation, and the whole thing being embeddable into another application. After defending the thesis and the implementation, Vojtěch has continued working on the blivet-gui tool, adding new features (LUKS, kickstart generation,…), fixing bugs and improving the UX (clickable objects in the visualisation,…), with me being partially a QA, a source of some ideas and comments about the implementation, and in the most recent days a press agent spreading the word about what a great and promising tool blivet-gui already is 7.

Now and the future

The blivet-gui tool is under heavy development and for now it doesn’t support all the features the blivet library supports. But it was intentionally announced 8 and made publicly known in this shape because I’m sure Fedora and the whole open-source world is a great community with a lot of clever and productive people who can help with development, QA, documentation, ideas, feature requests and everything else. We are a community of great people ready to do great things that can literally help thousands and millions of people all around the world. And we all know that storage management is a hard thing to do, even harder when trying to do it right. So let’s go, everybody can help to make blivet-gui more user- or developer-friendly, secure, reliable and feature-complete! Tell us what you like or hate about it most, tell us what you miss in it most, tell us what other storage management tools do better. Every rational and non-aggressive input is welcome, patches being the most welcome, of course. :) Cloning the blivet-gui repository at GitHub is a great way to start.


The purpose of the blivet-gui tool is not to replace GParted, gnome-disks-utility or any other storage management tool. It is just another option for storage management providing a different (although not disjoint) feature set than other storage management tools.

  1. reaching for the Nobel Prize for peace, of course
  2. I really like how storage gives you a chance to use 20 cryptic abbreviations in a single sentence
  3. and their derivatives
  4. both already being discussed and planned by the Anaconda-storage sub-team
  6. not implemented yet though

Minimalistic example of the GLib’s GBoxedType usage


As I’ve explained in one of the previous posts, it is possible to use the
advantages of GObject introspection even with plain C non-GObject
code. It is okay to write C functions taking arguments and returning values,
call g-ir-scanner and g-ir-compiler on them and then call them from
Python or any other language supporting GObject introspection. However, that’s
not entirely true, as per se it only works with elementary types like numbers and
strings, arrays of such values, plus structs with no pointer fields.

So what if some functions need to take or return complex values, not only numbers
or strings? And why does it only work with structs that have no pointer fields?
Let’s start with the second question. Imagine the following situation: a caller
(e.g. Python) calls a function that returns a struct containing a number, a string
and a pointer to another struct, and the ownership transfer (extra metadata for
GObject introspection) is set to full, which means the caller takes ownership of
the returned value. What should happen if the caller wants to copy or delete such a
value? In case of a number, a string or an array of such values it is simple. The
same applies to a simple struct with no pointers (the introspection data documents
the struct’s fields and their types). But with a pointer field the introspection
machinery has no way to know how deep a copy should be or how to free what the
pointer refers to.

GBoxedType declaration example

So the problem is the missing code for copying and freeing complex values, and
the first question that comes to mind is: "Can’t I simply tell the caller how to
copy and free such values?"
And that’s what GLib’s GBoxedType is all
about. It is a wrapper type around plain C structs which provides the information
on how to copy and free such values. Let’s have a look at a minimalistic example
showing how such a type can be declared:

#include <glib-object.h>
#include <glib.h>

#define TEST_TYPE_DATA (test_data_get_type ())
GType test_data_get_type ();

typedef struct _TestData TestData;

struct _TestData {
    gchar *item1;
    gchar *item2;
};

/**
 * test_data_copy: (skip)
 *
 * Creates a copy of @data.
 */
TestData* test_data_copy (TestData *data);

/**
 * test_data_free: (skip)
 *
 * Frees @data.
 */
void test_data_free (TestData *data);

First, the glib-object.h and glib.h header files need to be included
because they define the types and functions necessary for the definition of a new
GBoxedType. Then a macro and a function for getting the GType of the new boxed
type need to be declared so that the type system can work with the type. Of course,
there has to be a definition of the actual struct holding the data. It can be done
in two steps as in the above example, or in one step as:

typedef struct TestData {
    type1 field1;
    type2 field2;
} TestData;

defining the struct type and the "non-struct" type [1] both at once, but the GLib
coding style recommends the two-step definition. And the core is the two functions
for creating a new copy of and freeing a value of the new boxed type, with not
very surprising signatures.

[1] in C these are two different type namespaces

With the functions above defined, it will be possible to call functions
that return a TestData* value and get the values of the item1 and
item2 fields. It would also be possible to create a new TestData object
and assign values to its fields. However, it is often useful to declare and
define one more function:

/**
 * test_data_new: (constructor)
 * @str1: string to become the .item1 field
 * @str2: string to become the .item2 field
 *
 * Returns: (transfer full): new data
 */
TestData* test_data_new (gchar *str1, gchar *str2);

It is a constructor function that, given values for the fields, returns a new
object of type TestData. It is only a convenience function here, where it
just passes the values to the struct’s fields, but as you can imagine, it
can do a lot more if needed.

GBoxedType definition example

The implementation of the functions declared above is really
straightforward. The only exception is the test_data_get_type function that
creates and registers the type in the type system:

GType test_data_get_type (void) {
    static GType type = 0;

    if (G_UNLIKELY (!type))
        type = g_boxed_type_register_static ("TestData",
                                             (GBoxedCopyFunc) test_data_copy,
                                             (GBoxedFreeFunc) test_data_free);

    return type;
}
It defines a static variable type of type GType and if it is not yet set
(i.e. it is 0), it assigns it a new value created by
g_boxed_type_register_static with arguments that are quite clear, I’d
say. The use of the G_UNLIKELY macro tells the compiler that this condition will
hardly ever evaluate to TRUE, which is a simple but useful optimization.
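
The remaining three functions are the straightforward part. Just to make the
example complete, here is a minimal sketch of how they might be implemented in
test_data.c; this is my illustration rather than the only possible implementation,
assuming the TestData struct shown above and deep-copying the strings:

#include <glib.h>

#include "test_data.h"

/* Convenience constructor: allocate a new TestData and duplicate the given
 * strings into its fields. */
TestData* test_data_new (gchar *str1, gchar *str2) {
    TestData *data = g_new0 (TestData, 1);

    data->item1 = g_strdup (str1);
    data->item2 = g_strdup (str2);

    return data;
}

/* Create a deep copy of @data (the strings are duplicated too). */
TestData* test_data_copy (TestData *data) {
    return test_data_new (data->item1, data->item2);
}

/* Free @data together with the strings it owns. */
void test_data_free (TestData *data) {
    g_free (data->item1);
    g_free (data->item2);
    g_free (data);
}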


With the functions and types declared in the test_data.h and defined in the
test_data.c files the working introspectable library can be created with the
following commands:

$ gcc -c -o test_data.o -fPIC `pkg-config --cflags glib-2.0 gobject-2.0` test_data.c
$ gcc -shared -o libtest_data.so test_data.o `pkg-config --libs glib-2.0 gobject-2.0`
$ LD_LIBRARY_PATH=. g-ir-scanner `pkg-config --libs --cflags glib-2.0 gobject-2.0` --identifier-prefix=Test --symbol-prefix=test --namespace Test --nsversion=1.0 --library test_data --warn-all -o Test-1.0.gir test_data.c test_data.h
$ g-ir-compiler Test-1.0.gir > Test-1.0.typelib

The first two commands call gcc to produce the shared dynamic
library that can then be loaded e.g. by Python. The third line is the invocation
of the g-ir-scanner utility that produces an XML file containing the
introspection (meta)data. It gets compiler and linker flags for the required
libraries, prefixes for identifiers (like types,
constants,…) and symbols (functions), the namespace name and version, the name of
the library that should be scanned, paths to the sources that should be
scanned, and the -o Test-1.0.gir option that specifies the output file name. The
name of the file should match the namespace-nsversion.gir pattern. And finally,
the last command compiles the Test-1.0.gir file into its binary representation
which is expected to match the same name pattern with the .typelib extension.
If you are reproducing the steps above, feel free to have a look at the produced
Test-1.0.gir file, as it is quite easily readable and understandable, I’d
say. And if you are a hardcore hacker, feel free to have a look at the
.typelib file too, of course. Just remember that running cat on it may
"nicely" change your terminal’s runtime configuration [2].

[2] use reset to get the defaults back in such cases


Having the definitions, declarations and introspection (meta)data available in
both the XML and binary forms, it’s time to test the result. The easiest way is
running ipython as it provides TAB-TAB completion. It just has to be
told where to find the .typelib file and, of course, the library that it needs
to load. Both are in the current directory, so something like:

$ GI_TYPELIB_PATH=. LD_LIBRARY_PATH=. ipython

runs ipython in a properly set up environment. To test the library and
newly defined struct/class/object type it has to be loaded from the
gi.repository. Then it can be instantiated with the constructor or without it
and fields can be introspected (TAB-TAB) and used:

In [1]: from gi.repository import Test
In [2]: td = Test.Data()
In [3]: td.item1 = "ahoj"
In [4]: td.item2 = "cau"

In [5]: td.item1
Out[5]: 'ahoj'

In [6]: td.item2
Out[6]: 'cau'

In [7]: td2 = Test.Data.new("nazdar", "zdar")

In [8]: td2. # hit TAB-TAB
td2.copy   td2.item1  td2.item2

In [8]: td2.item2
Out[8]: 'zdar'


That’s not entirely bad, is it? One doesn’t get an introspectable struct
completely for free if it is not trivial, but defining three (copy, free, new)
of the four functions described above is good practice anyway. So in the end
it’s all about adding one more function and two declarations (the TYPE macro
and the get_type function prototype) and calling two utilities producing the
introspection data. Quite easy compared to writing language-tailored
bindings for every language that comes to mind. And with these constructs one
gets bindings for all the languages supporting GObject introspection. To define
a new method of the type, the function just needs to have the test_data prefix
and take a TestData* value as its first argument. Let me know in the comments if
there is anything unclear. If I know the answer, I’ll reply ASAP and possibly
update the post with such information.