Shared LVM driver for iSCSI, Fibre Channel, or other block-device-sharing storage systems

Authors: Mihály Héder, Tamás Marlok, MTA SZTAKI

Overview

In this wiki page we introduce the Shared LVM storage driver for OpenNebula. This solution is useful in cases where one wants to use a SAN that is accessible via Fibre Channel, iSCSI, ATA over Ethernet, or by other means as a block device from a Linux host. For instance, we have used this solution on the iSCSI-enabled Dell MD3600i and Fujitsu Eternus DX90, and also on an old EMC Clariion CX300 via Fibre Channel. Furthermore, we know of cases in which this solution works over ATA over Ethernet.

Dedicated SAN storage hardware, especially units with redundant controllers, is preferred mainly for its high performance and reliability, and in some cases for other features like advanced backup, mirroring to a disaster recovery site, snapshots, etc. The question, then, is how one uses these systems in an OpenNebula cloud. The major problem is that virtual machines are created and destroyed at a rather fast pace in a cloud. Another issue to solve is that, ideally, one should be able to live-migrate VMs between hosts.

Unfortunately, however, the majority of SAN storage systems - especially in the cost-effective price range - do not have an API through which OpenNebula could allocate virtual disks (or LUNs, in other terminology) on the fly. One can introduce a "storage frontend" Linux server into the infrastructure that attaches all the SANs as one or a few sizeable block devices per storage system. These block devices can then be further partitioned by LVM, and the volumes can be offered as block devices through Linux's iSCSI server implementation, tgtd. This storage frontend server can be instructed by the OpenNebula frontend to create/remove LVs as necessary and to manage the iSCSI server configuration at the same time. We know of cloud deployments that work this way fairly well.

Still, there are many problems with this setup. On the performance side, there is an inherent penalty that results from attaching the SAN through iSCSI or other means on the storage frontend, which is then wrapped again in a second iSCSI session. Also, in this setup all storage traffic goes through a single server, which diminishes the advantage of having multiple storage controllers in the SAN system itself (theoretically one could create a failover storage frontend service, but this further complicates matters). Actually, if one plans a new cloud investment and wishes to use the linux+lvm+tgtd scenario, it is more sensible to put all the disks in the Linux server right away and not buy a SAN at all. There is also a chance that in the future some storage solutions will offer proper (e.g. authorized, quota-limited, easily programmable) and standardized (i.e. not VMware-only) APIs for VM disk allocation at a fair price, in which case the OpenNebula project should develop a dedicated storage driver for those future APIs.

Yet we have to use our current, regular SAN systems now, and we want to do so without an intermediate Linux box. To achieve this, we attach the block device in question on all of our VM hosts and also on the OpenNebula frontend. On top of this block device we create one or more LVM volume groups. These volume groups are then shared across all the real servers, hence the name of this scheme: Shared LVM. For an overview, see the drawing below:
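As an illustration, attaching an iSCSI LUN on every host might look like the following (a sketch using open-iscsi; the portal address and target IQN are placeholders, not from this page, and FC or AoE attachment works differently):

```shell
# Run on the frontend and on every VM host (names are hypothetical)
iscsiadm -m discovery -t sendtargets -p 192.0.2.10        # discover targets on the SAN
iscsiadm -m node -T iqn.2001-05.com.example:vmstore -l    # log in; the LUN appears as e.g. /dev/sdb
```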

The advantage of this setup is that there is only one iSCSI connection, one LVM layer and one file system layer (compare this to a solution where a file system lives in an image file, which in turn is stored on a file system accessed over the network), which makes it rather efficient. This setup also supports easy live migration - depending on the virtio disk cache configuration.

CLVM setup versus Read-Only LVM metadata

When using LVM on a single shared block device with default settings, the problem arises that nothing guarantees the LVM metadata is not modified concurrently on multiple hosts. If this happens - even though the chance of a concurrent modification is rather low, as creating/removing an LV usually takes only 1-2 seconds - the LVM metadata might get corrupted, which in turn can lead to data loss.

However, shared LVM on a SAN is nothing new. Red Hat introduced Clustered LVM support in its enterprise server product a long time ago. In a general setup (that is, not in OpenNebula), where any of the hosts might modify the LVM partitions, CLVM should be used, as it implements clustered locking and metadata distribution. CLVM implements distributed locking via its own clvm daemon plus Red Hat's cluster manager, cman. If you plan to use CLVM, please read: https://access.redhat.com/knowledge/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Logical_Volume_Manager_Administration/LVM_Cluster_Overview.html

However, the complexity and the dependencies that CLVM brings into the setup can be problematic with certain Linux distributions and setups. In OpenNebula - because we know that the frontend is the only host that needs to modify the LVM metadata - we can avoid CLVM altogether and instead set local locking on the frontend and read-only metadata on all the other hosts.

You can change the locking to read-only in /etc/lvm/lvm.conf:

    # Type of locking to use. Defaults to local file-based locking (1).
    # Turn locking off by setting to 0 (dangerous: risks metadata corruption
    # if LVM2 commands get run concurrently).
    # Type 2 uses the external shared library locking_library.
    # Type 3 uses built-in clustered locking.
    # Type 4 uses read-only locking which forbids any operations that might 
    # change metadata.
    locking_type = 4

(Keep in mind, however, that as long as LVM locking is set to read-only on a machine, you won't be able to modify any volume groups on it, not even those on local disks. This is not a big issue, unless you planned to deploy VMs to both local LVM volumes AND shared devices.)

Once the locking is read-only, you won't be able to run LVM commands that modify anything, e.g.:

root@node1:~# lvcreate --name "test" --size 10G shared_vg
  Write locks are prohibited with read-only locking.
  Can't get lock for shared_vg

Note that you can still activate/deactivate logical volumes, as activation state is per-machine information and is not stored in the metadata. Once you set read-only locking on all hosts except the single frontend that uses local locking, there will be no concurrent modifications.
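For instance, activation commands still succeed under read-only locking, because they do not write metadata (the VG/LV names below are placeholders):

```shell
# Allowed even with locking_type = 4:
lvchange -ay shared_vg/one-42-disk-0   # activate the LV on this host
lvchange -an shared_vg/one-42-disk-0   # deactivate it again
```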

Important: it is crucial to understand that both CLVM and read-only VGs have an effect at the LVM layer only. One could still write to the disk at the block-device level, e.g. with dd from a root account.

On a side note, if someone still needs a clustered locking solution but without CLVM, in theory there is a third option that we have never tested: the LVM locking directory could be placed on a shared file system, which may already be present on all hosts because of the OpenNebula deployment. But we have not tested this approach, so try it with extra caution.
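A minimal sketch of this untested idea, assuming a shared file system is mounted on every host (the lock directory path is hypothetical):

```
# /etc/lvm/lvm.conf - keep file-based locking, but point the lock
# directory at a shared mount so all hosts contend for the same locks
locking_type = 1
locking_dir  = "/var/lib/one/lvm_locks"
```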

Using the Shared LVM driver

The frontend and the nodes have to connect to the same LUN. Many protocols can be used to do that, depending on the storage type (e.g. iSCSI, iSCSI+multipath, AoE, etc.). We then have to create an LVM volume group (VG) on top of the LUN (the LVM metadata lies on the physical device, so it is enough to execute the command on the frontend).
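For example, assuming the shared LUN appears as /dev/sdb (the device name and VG name are placeholders):

```shell
# On the frontend only - the metadata is written onto the shared device itself
pvcreate /dev/sdb
vgcreate shared_vg /dev/sdb

# On each node, verify that the same VG is visible
vgs shared_vg
```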

It is recommended to change the locking type to read-only on the nodes, so they cannot make any changes (even by accident) to the LVM metadata (lvm.conf: locking_type = 4, see above). Alternatively, you can set up CLVM on the nodes and the frontend, making the VG cluster-aware (Red Hat-style).

After we have the same (and active) VG on all nodes and frontend(s), we can create the shared_lvm datastore. In order to do that, we have to copy the driver files into the <one_dir>/remotes/datastore/shared_lvm and <one_dir>/remotes/tm/shared_lvm directories, plus we have to enable the driver in oned.conf. You can find the files in the download section at the bottom of this page.

For example:

DATASTORE_MAD = [
    executable = "one_datastore",
    arguments  = "-t 10 -d fs,vmware,iscsi,lvm,shared_lvm"]

TM_MAD = [
    executable = "one_tm",
    arguments  = "-t 10 -d dummy,lvm,shared,qcow2,ssh,vmware,iscsi,shared_lvm" ]
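With the drivers enabled, the datastore itself can be registered from the frontend. The exact attributes depend on the driver version, so treat this template as a sketch (VG_NAME in particular is a hypothetical attribute, not taken from this page):

```shell
cat > shared_lvm_ds.conf <<'EOF'
NAME    = shared_lvm_ds
DS_MAD  = shared_lvm
TM_MAD  = shared_lvm
VG_NAME = shared_vg
EOF
onedatastore create shared_lvm_ds.conf
```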

In the shared LVM driver, all LVM metadata changes are executed on the frontend, so as long as you have only one frontend, you don't have to worry about concurrent metadata writes, because local locking is in effect there. Before a node does anything involving a block device (deploying a VM, suspending a VM, etc.), it executes an lvs command (to refresh its view of the LVM metadata - just to be sure) and changes the relevant LV's state to active.
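Conceptually, this node-side preparation amounts to two commands (the VG and LV names are placeholders):

```shell
lvs shared_vg >/dev/null               # re-reads the LVM metadata from the shared LUN
lvchange -ay shared_vg/one-42-disk-0   # make the VM's LV active on this host
```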

Notes

  • Not just the LVM metadata writes, but also the disk cloning (dd command) is executed by the frontend. This means that the frontend is very bandwidth-hungry when deploying new VMs or saving disks. The more bandwidth you have, the less you have to wait for a new VM.
  • Before live migration, none of the transfer manager (tm) scripts are invoked (except the system tm's premigrate script), so we cannot execute an lvchange command on the destination host. If the volume group (VG) is not active on that host, the live migration won't work. With CLVM this is not an issue, because CLVM propagates a new LV to the cluster members and makes it active on those hosts.
  • We wrote a wrapper script which calls the parent tm's premigrate script for each disk used by the VM. Thus we can make all of the LVs active on the destination host.
  • (Live migration still needs a shared system datastore, but we think that with the premigrate scripts we would be able to make it work on the ssh datastore too.) This feature is still under testing; we will release it once we find it ready.
  • (Not Shared LVM-specific) Newer virtio versions are able to cache disk operations. If enabled, this can break live migration. However, according to our measurements, caching does not give extra performance - it actually seems to slow down VM disk I/O a bit - therefore we keep it turned off.
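In libvirt-based deployments, this cache setting shows up in the VM's domain XML. A disk configured with caching disabled would look roughly like this (the device path is a placeholder):

```xml
<disk type='block' device='disk'>
  <!-- cache='none' bypasses the host page cache, which is the safe
       choice for live migration on shared block storage -->
  <driver name='qemu' type='raw' cache='none'/>
  <source dev='/dev/shared_vg/one-42-disk-0'/>
  <target dev='vda' bus='virtio'/>
</disk>
```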

Differences between this and the OpenNebula LVM driver (summary):

  • Shared LVM uses LV cloning instead of creating snapshots (as mentioned). This means longer VM deployment but (much) faster write operations. Many people (including us) experienced snapshot-related error messages when trying to deploy a VM; without snapshots this is not an issue anymore. Update: in OpenNebula 4.0 the official driver will use cloning instead of snapshotting.
  • Shared LVM executes all LVM commands that cause metadata changes on the frontend (also mentioned before). Thus (assuming you only have one frontend) you don't need CLVM to prevent concurrent LVM metadata writes. In this case you should change the locking type on the nodes to read-only (refer to the recommendation in the previous chapter).

Download

shared_lvm · Last modified: 2013/07/10 12:28 by Mihály Héder