
Foundations of LVM for mere mortals

The original plan was to write a blog post about LVM Thin Provisioning because that is a technology and at the same time a feature of LVM that deserves a lot more attention and many more deployments than it currently gets. However, while I was thinking about the contents of such a blog post and while I was talking to my colleagues and some other people about LVM Thin Provisioning, I realized that I’d be jumping on a fast-moving train in the middle of its journey. While thin provisioning is absolutely a great thing, its benefits and features are most visible when compared to traditional LVM 1. But in my experience that comparison quickly leads to wondering why LVM hasn’t been doing things that way from the very beginning and then to the question of how traditional LVM actually works.

So I thought maybe I could just give a link to some existing blog post, wiki page or something else on the web that readers could read first before diving into the depths of thin provisioning. However, asking my favorite search engines about "introduction to LVM" didn’t really bring the results I was hoping for. Not that nothing was found or that it was all wrong or outdated; the problem was that it was all oriented towards the practical use of LVM – which commands one should type into a terminal to create an LVM "stack" on their system – with only really brief descriptions of the basic terms like Physical Volume, Logical Volume and Volume Group. Don’t get me wrong, those pages are very useful, clear and helpful if one wants to do something with LVM on their system and use it as a black box that provides various features. They are just not so great for understanding how it all works, where the limitations come from and how amazing it is. That’s why I decided to write this blog post first. And before I actually start with the real thing, I’d like to make one thing clear right away: I’m not saying everything written here is 100% correct, precise and complete. This is just how I understand and see things to the extent I care about them.

The Device Mapper

If you’re wondering now why I’m totally changing the topic, please be assured that it’s not a copy-paste mistake or anything like that. The Device Mapper (DM) is the heart of LVM 2. Actually, one could say that LVM is just a set of tools and shortcuts for DM. But a really big and powerful set of tools and shortcuts that makes things an order of magnitude simpler and more usable for mere mortals.

The Device Mapper is a kernel driver that provides all the flexibility and abstraction LVM then offers to users. It does exactly what its name suggests – it maps devices to other devices based on the maps (called tables) configured by the user/administrator. Now, instead of going deep into the implementation details 3, let’s have a look at some examples that best describe the functionality the Device Mapper provides. The user interface for the DM is the dmsetup tool, which uses the libdevmapper library to do ioctl() calls on the special (character) device node /dev/mapper/control. The basic commands of the dmsetup tool are: ls to list all devices (mappings) provided by the DM; table to list all the tables (maps) currently defined in the DM; create to create a new DM device based on a given table (map); and remove to remove/destroy a given DM device (mapping). The simplest map type (so-called target) is the linear mapping which can be created like this:

# dmsetup create my_DM_dev --table '0 102400 linear /dev/sda 0'

First of all note that I’m running the command as root, which is necessary because it creates a new device node. my_DM_dev is the name of the newly created device and then the table follows. It says that 102400 sectors (yes, DM 4 works with 512 B sectors) starting at sector 0 of the newly created device should be linearly mapped to the device /dev/sda starting from its sector 0. The result is that I now have a new device /dev/mapper/my_DM_dev of size 50 MiB (102400 * 512 B) and if I try e.g. echo hello > /dev/mapper/my_DM_dev, I can check that the string hello is actually written to the very beginning of /dev/sda. Of course, one would probably do things like mkfs.xfs on such a device instead of echoing a string in there, but it doesn’t matter now. The important thing is that our new /dev/mapper/my_DM_dev device really is mapped to the beginning of /dev/sda. I can now go on and create another similar device, this time 200 MiB in size:

# dmsetup create my_DM_dev2 --table '0 409600 linear /dev/sda 102400'

It should now be clear that this device is mapped to the area of /dev/sda right after the area the my_DM_dev device is mapped to. So effectively I created something like two partitions on my disk 5. Cool as hell, you say? I absolutely agree that this is not an easy and nice way of splitting a disk into separate units used for file systems or whatever. Well, at least removing the devices is as simple as dmsetup remove my_DM_dev (assuming they are not in use). Anyway, you could do a much better job with any of the available partitioning tools with 10 % of the effort, right? But, how about this?:

# dmsetup create my_DM_dev
0 409600 linear /dev/sda 0
409600 409600 linear /dev/sdb 0

Here I’m not using the --table option because that only allows me to specify a one-line table. But the real strength of the Device Mapper is in things more complex and complicated than a trivial linear mapping to some area of a disk. Without the option, the dmsetup tool allows the table to be specified on standard input, potentially on multiple lines and terminated with the EOF character (i.e. Ctrl+D) 6. Although the numbers might be a bit confusing at first glance, it should be clear that I created a 400 MiB device, the first 200 MiB of which are linearly mapped to /dev/sda and the second 200 MiB are mapped to /dev/sdb. I can check the result e.g. by running:

# lsblk /dev/mapper/my_DM_dev
NAME      MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
my_DM_dev 253:2    0  400M  0 dm

which tells me that the size really is 400 MiB. Also note the major number 253 which is common to all DM devices. By running dmsetup table my_DM_dev I can make sure that the device really uses both parts of the table/mapping I entered.
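
For completeness, the output of that command looks roughly like this on my system (the 8:0 and 8:16 major:minor numbers stand for /dev/sda and /dev/sdb here and may differ on yours):

# dmsetup table my_DM_dev
0 409600 linear 8:0 0
409600 409600 linear 8:16 0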

So now I have a device that spans two physical devices (disks in my case), which is something that’s not possible with standard partitions. Quite neat, isn’t it? And we can go way further with things like this:

# dmsetup create my_DM_dev
0 409600 linear /dev/sda 0
409600 409600 striped 2 32 /dev/sda 409600 /dev/sdb 0

The first line is the same as in the previous example – a simple linear mapping to the beginning of /dev/sda. But what about the second line? Instead of linear we are creating a striped mapping using 2 devices with a chunk size of 32 sectors (16 KiB), with the devices being /dev/sda from its sector 409600 (where the linear part of the mapping ends) and /dev/sdb from its first sector. So we have a device that does its I/O operations in the traditional way if they go to its first 200 MiB, but stripes the I/O operations over two disks when they go to its second 200 MiB (just like RAID0 would do). We could test the difference by doing:

# dd if=/dev/mapper/my_DM_dev of=/dev/null bs=10M count=20
# dd if=/dev/mapper/my_DM_dev of=/dev/null bs=10M count=20 skip=20

i.e. reading the first 200 MiB and comparing the results to reading the second 200 MiB. Of course, the second run of dd should report something like double the speed of the first run. But it’s not really that trivial to test this with all the caches, readahead and other mechanisms that are on the way to the actual physical disk, especially if you are using a VM for testing this like I do. Anyway, the point is that we can combine various types of mappings supported by the Device Mapper and create devices with really interesting properties. You may ask what these "hybrids" could be good for. Well, if you have some extra information, such as knowing that your file system stores its inodes and journal at the beginning of the device it is created on, you could for example create a device that has its beginning (whatever that means for the file system) mirrored (the mirror mapping) with the rest being linear or even striped. I’m not a file system expert, but my naïve mind thinks it could be an interesting win-win solution for many use cases. Also, a very important fact which may not be quite obvious from the above is that DM devices are just block devices and thus can be used in the tables (maps) of other DM devices, creating really complex hierarchies with tailored properties and features.
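
Just to sketch that last point (the device names here are made up and the numbers reuse the 200 MiB-sized areas from the examples above): a DM device can simply appear as the backing device in another DM device’s table:

# dmsetup create my_lower --table '0 409600 linear /dev/sda 0'
# dmsetup create my_upper --table '0 409600 linear /dev/mapper/my_lower 0'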

The examples above show that the Device Mapper really provides a very flexible way to work with physical storage devices by creating an abstraction layer that allows the user/administrator to create various combinations best suiting their use case. What the examples on the other hand don’t show are the capabilities of the DM regarding the types of mappings (targets) it supports. Apart from the linear, striped and mirror mappings mentioned above there are also error, zero, cache, crypt, delay, flakey, multipath, raid, snapshot, thin and thin-pool targets, some of which are mainly useful for testing file systems and applications (like zero, error and delay), with the other ones being extremely useful for real use cases. For example the crypt mapping is what is actually behind what most people know as LUKS encryption. LUKS is one of the formats/algorithms the crypt target supports for encryption and decryption of the blocks written to and read from physical devices. The snapshot mapping allows a device to have a snapshot with a copy-on-write layer. That’s useful for doing some potentially unsafe operations, because one can just revert the device to the snapshot if things go wrong, but it’s also useful for putting a read-write (cow) layer over a read-only (snapshot) device. Actually, this is how live CDs/DVDs/USBs/… are created (at least for Fedora).
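
As a tiny illustration of the testing-oriented targets (a sketch; the device names are made up), they don’t even need a backing device:

# dmsetup create my_zeroes --table '0 204800 zero'    # 100 MiB device that reads as zeros and discards writes
# dmsetup create my_errors --table '0 204800 error'   # 100 MiB device that fails every I/O operation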

As the example with the striped mapping shows, the more complex targets require more parameters when being established. Also, if a hierarchy of DM devices is created, it of course needs to be created in an order such that whenever a device is used in the table of a to-be-created device, it already exists in the system. All this means that it’s quite hard to use the dmsetup utility to create such setups, even if one just puts the resulting commands or maps (tables) into a file loaded at system boot. Which brings me back to the example where "something like two partitions" was created on the disk /dev/sda. The two fundamental differences between partitions and these DM devices created on the disk are the facts that:

  1. if I rebooted the system my mappings would be lost and
  2. no tools, libraries or anything else would be able to tell that I have split the disk into two separate chunks just by looking at the disk and reading from it.

Actually both of these facts (and shortcomings) are caused by the same thing – the lack of meta data. If I create two partitions on a disk, the information is written to the MBR or GPT table(s) of that disk and everybody can read it, including the kernel and various tools run during system boot. Whereas if I split the disk into two chunks using DM, the information about this split is only held in RAM and never written to any persistent storage. Of course it’s possible to reserve some disk space and store the information about how things should be set up (i.e. meta data) there, probably in multiple copies over multiple disks because it is highly valuable information. And then read that information from those special places and set the DM devices up when asked. You know what this is called? Logical Volume Management. LVM does and provides many more things on top of this, but the basic functionality really is managing meta data and setting up, tearing down and replacing (dmsetup reload can be used to change the table online) Device Mapper devices at the right times.
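
Just to show what such an online table replacement looks like on the DM level (a minimal sketch reusing the my_DM_dev device from above, with a made-up new table), the sequence is roughly:

# dmsetup suspend my_DM_dev
# dmsetup reload my_DM_dev --table '0 409600 linear /dev/sdb 0'
# dmsetup resume my_DM_dev    # the new table takes effect here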

The Logical Volume Management

Logical Volume Management (LVM) is a set of tools and specifications that build an abstract (logical) layer on top of the physical storage. It has three types of basic building blocks: Physical Volumes (PVs), Volume Groups (VGs) and Logical Volumes (LVs). Physical Volumes are block devices (disks, partitions, iSCSI LUNs, DM devices,…) that are "given" to LVM to work with. PVs are grouped into pools called Volume Groups. By adding a PV into a VG (either when creating a new VG or extending an existing one), the space on the PV is split into chunks called Physical Extents (PEs), the size of which is specified by the VG (the current default is 4 MiB), and those PEs are added to the pool (VG). It might be useful to point out one important thing here – Volume Groups are not devices, there’s no way to use the space from them directly and they have no device nodes in the /dev file system tree. They are logical units of management and atomicity. Once a PV is added to a VG, it cannot be just moved into a different VG. It has to be removed from the first VG and then added to the other VG; there’s no notion of operations across Volume Groups. Last but not least there are Logical Volumes (LVs), which are block devices allocated in/from the VGs (as pools). Logical Volumes are logically split into Logical Extents that are mapped to Physical Extents (on Physical Volumes). This e.g. allows LVs to span multiple PVs (disks, partitions, iSCSI LUNs, DM devices,…) either linearly, using striping, mirroring or in other ways supported by LVM.
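
To make Physical Extents a bit more tangible: once a VG exists (we will create one called test in a moment), the extent size and the extent counts can be inspected with the usual display tools, for example:

# vgdisplay test      # shows the PE Size and the Total/Alloc/Free PE counts of the pool
# pvdisplay /dev/sda  # shows how many PEs this PV contributes and how many of them are free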

"…linearly, using striping, mirroring…", doesn’t it sound familiar? Yes, LVs are in fact Device Mapper devices or more precisely mappings/tables. They become DM devices once activated which means that the particular mappings are created/established in the DM. But DM mappings are not persistent so something has to take care about preserving the information (meta data) about the mappings (start sectors, sizes, devices, start sectors on those devices, etc.). And that something is Volume Groups. In fact, by adding a PV into a VG, meta data is written to the beginning of the PV (by default 1 MiB of space is reserved for that) 7. Expecting some condensed binary format there? No, it’s just plain text with newlines, comments and everything. You can try running dd if=/dev/sda bs=1M count=1|less which will give you some garbage in the beginning (reserved space for MBR or whatever), but then you’ll see a plain text definition of the VG the PV /dev/sda belongs to, with all the PVs, LVs and everything LVM needs to remember. The nicer way to get this data is to run vgcfgbackup -f lvm_meta.txt which extracts only the relevant part of the first 1 MiB, but you can check that it is the very same plain text data as on the disk. Actually a backup of meta data is created whenever any changes to the LVM configuration are done (if not configured otherwise in /etc/lvm/lvm.conf) with the results being stored in the /etc/lvm/archive/ directory. Of course a backup without a way to restore the state from it would be just a sad memory and using dd to write the metadata to the right place would be cumbersome. That’s why there’s also the vgcfgrestore tool that does the exact opposite of what vgcfgbackup does. So if you understand LVM well, you can actually dump the meta data, edit it in you favorite text editor and load it back. Pretty cool, isn’t it? Well, you probably don’t want to do that, and unless you encounter a bug there’s no reason to do so 8, but it’s really nice to have such an option (especially if you know one of the LVM gurus).

LVM meta data could be a topic for a separate blog post, but I hope the text of this post up to this line explains what its purpose and basic contents are. But the purpose of LVM is not to study how nice its meta data is, right?! It’s the abstraction of physical storage it provides that makes it so great. So what can be done with LVM and how does LVM do it in the background? Let’s start with the simplest example possible, assuming I have two 1 GiB disks:

# pvcreate /dev/sda
# pvcreate /dev/sdb
# vgcreate test /dev/sda /dev/sdb
# lvcreate -n myLV -L2G test

which creates two PVs, then the test VG using those two PVs, and then tries to create a 2 GiB LV myLV spanning both disks. But this results in the following error: "Volume group "test" has insufficient free space (510 extents): 512 required.". Wait, what? The disks have 1024 MiB of space and the default extent size is 4 MiB, so how come there are only 510 extents and not 512? LVM stores the meta data at the beginning of the PVs, remember? So the first 1 MiB of both PVs is reserved for meta data, which means that the usable space for Physical Extents is 1023 MiB and thus there are 255 extents on each PV, all together providing 2040 MiB of space. So, let’s try it again:

# lvcreate -n myLV -L2040M test

This time we get "Logical volume "myLV" created.". So it fit in and worked! There are multiple ways to specify the size – either with the -L option and the precise size (which is then rounded to a multiple of the extent size), or with the -l option and the number of extents, or with the -l option in combination with one of xy%VG, xy%PVS or xy%FREE. As you can probably see, all the ways other than -l with a number of extents are just syntactic sugar as they all boil down to the calculation of the number of extents to use. But they are useful, especially in more complicated cases as we will soon see.
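
For example, any one of these could have been used instead of the command above (just a sketch using the test VG; run one at a time, of course, since the name would clash):

# lvcreate -n myLV -l 510 test        # exactly 510 extents (510 * 4 MiB = 2040 MiB)
# lvcreate -n myLV -l 100%FREE test   # all the free extents currently in the VG
# lvcreate -n myLV -l 50%VG test      # half of all the extents in the VG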

So we have the myLV LV in the test VG spanning the PVs /dev/sda and /dev/sdb. And we know LVM uses the Device Mapper to create the block device for the LV, so let’s take a closer look. By running dmsetup table we can see that there are two tables:

test-myLV: 0 2088960 linear 8:0 2048
test-myLV: 2088960 2088960 linear 8:16 2048

that define the test-myLV DM device. It’s easy to see that the device name is in the format VGname-LVname. The VG test has no DM device of its own because it is not a device and DM has no idea what a Volume Group is. However, there is the /dev/test directory with the /dev/test/myLV symlink, which in my case points to /dev/dm-2, the real device node for the test-myLV DM device. Other LVs would of course have similar symlinks and device nodes. On top of all that, there’s also the /dev/mapper/test-myLV symlink pointing to /dev/dm-2 (in my case), as the /dev/mapper folder is used for symlinks using DM device names. The symlinks, unlike the device nodes, are persistent in the sense that they are the same after reboot. Names of the device nodes may differ and thus shouldn’t be used in /etc/fstab and similar places. One more note about the tables – the 8:0 and 8:16 are MAJOR:MINOR numbers, which are device identifiers used instead of device node paths (devices are identified by them in the kernel). We could also run vgcfgbackup -f lvm_meta.txt test and have a look at the meta data to see how the myLV LV is defined there. It of course contains all the information LVM needs to create the DM device for the LV anytime it is asked to do so.
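
If you wonder where the 8:0 and 8:16 come from, something like this shows the major:minor numbers of the devices involved (a sketch; the dm-2 node is of course specific to my system):

# lsblk -o NAME,MAJ:MIN /dev/sda /dev/sdb /dev/test/myLV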

I can now use lvremove test/myLV to remove the myLV LV. Can you guess what happens behind the scenes? Of course, the DM device is torn down (removed) and the meta data fields defining the LV are removed. But I still have the PVs and the VG test, so I can now create a striped LV in it:

# lvcreate -n myLV -L2040M --stripes=2 test

LVM informs me that the default stripesize of 64 KiB was used and creates the LV defined in the DM with this table:

test-myLV: 0 4177920 striped 2 128 8:0 2048 8:16 2048

which shouldn’t be surprising in any way. Just note that both PVs are used starting from their sector 2048 (1 MiB) because the space before that is where the meta data is written (it is the same in the case of the linear LV above). There might be people who prefer having 100% control of what happens to their storage, but I think for the majority of us, the mere mortals, it’s much easier to just specify a name, a size and the number of stripes than to write the DM table with everything in the right order and all the numbers, especially if there are supposed to be multiple such devices and thus the starting sectors have to be calculated. So LVM not only makes things persistent, it also makes setting everything up a lot easier. And the more complicated the thing is on the Device Mapper layer, the bigger the difference is. As another example, let’s have a look at what happens when I create a mirrored LV (we will get to the details of the command later):

# lvcreate -n myLV -L1016M --type=mirror -m1 --mirrorlog=mirrored test
  Logical volume "myLV" created.
# dmsetup table |grep test-myLV
  test-myLV: 0 2080768 mirror disk 2 253:4 1024 2 253:5 0 253:6 0 1 handle_errors
  test-myLV_mimage_1: 0 2080768 linear 8:16 2048
  test-myLV_mimage_0: 0 2080768 linear 8:0 2048
  test-myLV_mlog: 0 8192 mirror core 1 1024 2 253:2 0 253:3 0 1 handle_errors
  test-myLV_mlog_mimage_1: 0 8192 linear 8:16 2082816
  test-myLV_mlog_mimage_0: 0 8192 linear 8:0 2082816

Quite a lot of stuff LVM did for me, right? In the lvcreate command I specified the type of the LV to be mirror with 1 mirror (so the data is in two places – the origin and the mirror) and that I want the mirrorlog to be mirrored. The mirrorlog is the log/journal DM needs to keep track of what’s already mirrored (in sync) and what still needs to be mirrored. The default is disk, which means there is only a single copy of the log on one of the PVs, whereas mirrored means each PV gets a copy of the mirrorlog 9. So LVM created two devices for the mirrorlog (test-myLV_mlog_mimage), the device that keeps them in sync (test-myLV_mlog), the devices that actually contain the data (test-myLV_mimage) and finally the mirror device that is the actual LV I wanted to create (test-myLV). The existence of the mirrorlog is also the reason why I could only create a 1016 MiB LV and not a 1020 MiB LV as one (a person understanding that the LVM meta data has to go somewhere) could expect.
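
By the way, a small hedged tip: the progress of the initial synchronization of such a mirrored LV can be watched with the copy_percent (Cpy%Sync) field of lvs, for example:

# lvs -o name,copy_percent test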

Of course LVM knows about all the DM devices it creates for the mirrored LV myLV and with a little bit of extra effort I can get the information from it too:

# lvs -a -oname,vg_name,size test
  LV                   VG   LSize
  myLV                 test 1016.00m
  [myLV_mimage_0]      test 1016.00m
  [myLV_mimage_1]      test 1016.00m
  [myLV_mlog]          test    4.00m
  [myLV_mlog_mimage_0] test    4.00m
  [myLV_mlog_mimage_1] test    4.00m

The -a option tells LVM I want to see information about all LVs, even the hidden/internal ones 10. The difference between the outputs with and without the -a option can be quite big. For example, on the server we use as a KVM host at work, there are 6 LVs, but including the hidden/internal ones, there are 51 of them! Quite a difference, isn’t it? The reason for that is the setup we have there – LVM Thin Provisioning (more on that in the next post!) using LVM RAID (and that in the next-next post) in a RAID5 configuration on top of 5 disks. Well, I would really hate writing all those 51 DM tables, making sure they are correct and set up in the correct order. So however easy LVM’s task might have looked at the beginning of this section, I’m quite sure everybody gets the point now. And LVM goes many steps further with all this. As a last example, see what happens if I try to move Physical Extents from one disk to another:

# lvcreate -n myLV -L1020M test /dev/sda
  Logical volume "myLV" created.
# pvmove /dev/sda /dev/sdb -b && lvs -a -oname,vg_name,size,segtype test
  LV        VG     LSize    Type
  myLV      test   1020.00m linear
  [pvmove0] test   1020.00m mirror

LVM creates a temporary mirror LV (and thus a DM device) that makes sure the extents are safely mirrored from the disk /dev/sda to the disk /dev/sdb. It’s then easy to remove the extents from the disk /dev/sda on successful completion. And you know what’s really great about all this? I can have the test/myLV LV mounted and in use the whole time, and if anything goes wrong, I can just fire off the pvmove again with no data loss! Ok, now that I see it, I must admit I lied. Here’s one more example demonstrating a similar wonderful thing:

# lvcreate -n myLV -L1016M test
  Logical volume "myLV" created.
# lvconvert --type=mirror -m1 test/myLV
# lvs -oname,vg_name,size,segtype
  LV   VG     LSize    Type
  root fedora    8.51g linear
  swap fedora    1.00g linear
  myLV test   1016.00m mirror

Remember I mentioned DM tables can be replaced/reloaded? Well, here you have it – I converted a linear LV into a mirrored LV to get another copy of the data, making it more reliable. And I could have had it mounted and in use the whole time, again! Want a real life example of all this? Last week I bought a new SSD and a new HDD for my workstation. And without rebooting it once, I moved my system (including the root LV mounted at / and the home LV mounted at /home) to the new SSD with an extra LUKS encryption layer and later converted it, together with my data LV living on the HDD, into mirror LVs. So I went from a setup with the system on a single unencrypted disk and data on a single disk that could fail at any time, lose my data and make me unproductive for days, to a setup using encryption and two disks for the system as well as for the data. And I didn’t have to reboot the system or go offline for a second! Really, really nice, LVM! And btw, this is why Fedora defaults to LVM partitioning/setup. It just provides much more flexibility than standard disk partitions with file systems.
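
Just to give a very rough idea of the kind of sequence that boils down to (a heavily simplified sketch with made-up device names – /dev/sdc being the new SSD, /dev/sda2 the old PV – and the LUKS part only hinted at; definitely not a copy-paste recipe):

# cryptsetup luksFormat /dev/sdc && cryptsetup open /dev/sdc new_ssd
# pvcreate /dev/mapper/new_ssd             # make the opened LUKS device a PV
# vgextend fedora /dev/mapper/new_ssd      # and add it to the VG
# pvmove /dev/sda2 /dev/mapper/new_ssd     # move all extents off the old disk, online
# vgreduce fedora /dev/sda2                # drop the old PV from the VG
# lvconvert --type=mirror -m1 fedora/home  # later, mirror an LV onto a second disk/PV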

I hope this long post sheds some light on how LVM works and what its foundations are. Next time we will have a look at LVM Thin Provisioning, which takes the flexibility and set of features to a completely different level. And honestly, it’s the next blog post that will give the real sense to this one, so don’t give up here and start looking forward to reading more stuff on this blog!

P.S. – comments from the LVM team

I got together all my bravery and sent the link to this post to Red Hat’s LVM team mailing list. To my pleasant surprise I got very nice feedback with comments I’d like to share here:

  1. The mirrorlog defaulting to disk and thus being in a single copy is a good strategy for the majority of cases, especially in the most common case with two disks. If the mirrored mode is used and one disk fails, what is left is a single copy with a log that says which parts of it are in sync. But in sync with what, when the other disk is gone? A complete resync is needed anyway when a new disk is later added, rewriting the mirrorlog. In the case of more than two disks, having the mirrorlog mirrored on multiple disks may save the system from resyncing the working disks. But that in general doesn’t outweigh the overhead of writing the log to multiple locations.
  2. If the -m1 option is used and --type=mirror is omitted, LVM decides (based on the configuration) whether to create a mirror or raid1 DM device. New versions of LVM default to raid1, which has some very nice and interesting advantages over the (much older) mirror target. But that’s something I’ll get to in one of the future blog posts (about LVM RAID).
  3. There’s no need to call pvcreate. vgcreate initializes the devices if they are not initialized and it even accepts and applies almost all of the command line options accepted by pvcreate.
  4. The ‘error’ target is being used not just for testing, but also e.g. for provisioning sparse devices and for replacing failed RAID legs.

Footnotes


  1. aka "fat LVM" :)

  2. LVM2 to be more precise, the original LVM implementation worked and lived on its own (in 2.4 kernels)

  3. readers are welcome to study the kernel sources if they want to (I haven’t been that brave yet)

  4. just like everything else storage-related in kernel

  5. There are at least two fundamental differences between that and my two DM devices, but we will get to them later.

  6. I could of course also put the table into a file and then use < my_table and let dmsetup read it from there.

  7. unless LVM is told not to do so (it might be wise to put meta data only to some of the PVs in some cases due to different speeds and access times of the PVs)

  8. I had to do it once when I was experimenting with LVM cache in writeback mode on my production system, using a used SSD drive I bought on eBay (hint: not a good idea). I had to get rid of the cache, but LVM commands didn’t allow me to do so because the cache was dirty, and attempts to flush the cache only resulted in I/O errors.

  9. I honestly don’t know why mirrored is not the default, as that’s what people usually expect to happen, I think. There’s also a third option – core – which means that the mirrorlog is only kept in RAM and gets lost on reboot, which results in a complete resync of the origin and mirror after boot.

  10. I had to specify the output options too, because otherwise the output was too wide.