DEMYSTIFYING STORAGE VIRTUALISATION
Virtualisation of a data centre is incomplete without storage virtualisation, which has turned into a top priority for today’s CIOs and data centre administrators. Fortunately, Linux has integrated storage virtualisation well into the kernel and user space.
Storage is the most expensive aspect of a data centre. Compared to server and networking technologies, standardisation in storage has lagged behind. This makes storage management an administrator’s nightmare. Storage is also the slowest component and could lead to performance bottlenecks. Storage virtualisation is all about shielding where the data is stored, the way it is stored, and the type of storage.
The rationale and benefits of storage virtualisation are the same as those of server virtualisation. Server and desktop virtualisation have been the primary focus of data centre administrators and CIOs for more than a decade. Storage virtualisation has become mainstream only in the last few years.
In enterprises, information storage needs are increasing at the rate of 30 per cent annually. The lack of a clear picture regarding how resources are used results in data centres operating at only 30-40 per cent of their full capacity. The cost of increasing storage capacity is not restricted to the hardware purchased; it also includes additional operational costs for power, cooling and administration. The principal financial motivation for storage virtualisation is to reduce costs without degrading performance and without adding complexity.
Virtualisation technologies like thin provisioning, data deduplication and automated tiering maximise storage utilisation and also enhance application throughput. Virtualisation also enhances the interoperability between vendor products. Though storage virtualisation technologies have advanced significantly over the last few years, logical volume managers (LVMs) have been around for more than a decade.
Different types of storage virtualisation
Virtualisation abstracts the physical storage devices and presents them as logical storage units for the application. The abstraction layer could be present in the host, the network or the storage device.
Host/server-based virtualisation through Logical Volume Managers (LVM): The Logical Volume Manager is a software layer between the file system and the physical disks. This layer shields the complexity that lies below. Physically, the data might be stored on one or multiple disks based on the size and RAID technology used. Even a file of size 10 MB may span more than one disk. These operations might take a toll on the server’s performance if they are all software based. For business-critical applications that need higher performance, methods such as RAID are handled with hardware assistance from storage vendors.
To understand the LVM concept, we must first familiarise ourselves with the concepts of Physical Volumes, Volume Groups and Logical Volumes. A Physical Volume (PV) is either a full hard disk or a portion of it. A Volume Group (VG) is formed by combining one or more PVs. Logical Volumes (LVs) are carved out of a VG, and the file systems reside on them. The interesting part of Logical Volumes is that we can create a file system that is larger than the largest hard disk in the pool, since an LV can span multiple disks.
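As a quick, hypothetical illustration of these concepts, a PV, a VG and an LV could be created from the shell as shown below; the device names /dev/sdb and /dev/sdc and the names vg_data and lv_data are assumptions, not part of any particular setup.

    # Initialise two whole disks as Physical Volumes
    pvcreate /dev/sdb /dev/sdc
    # Combine the PVs into a single Volume Group
    vgcreate vg_data /dev/sdb /dev/sdc
    # Carve a 200 GB Logical Volume out of the VG; it may span both disks
    lvcreate -L 200G -n lv_data vg_data
    # Create a file system on the LV and mount it
    mkfs.ext4 /dev/vg_data/lv_data
    mount /dev/vg_data/lv_data /mnt/data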
LVM has been supported since the 2.4.x series of the Linux kernel, so it is available on popular Linux distributions such as Ubuntu, Red Hat and Linux Mint. There are two versions of LVM: LVM1 and LVM2. Of the two, LVM2 is more popular and depends on a kernel module called the Device Mapper (see Figure 3). The Enterprise Volume Management System also depends on the Device Mapper for the basic function of mapping a block device. On Linux, dmsetup, a command line wrapper to access Device Mapper functionality, is available.
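The mappings that LVM2 sets up through the Device Mapper can be inspected with dmsetup; the volume names below refer to the hypothetical vg_data/lv_data example above.

    # List all Device Mapper devices on the system
    dmsetup ls
    # Show the mapping table (start sector, length, target type) of one device
    dmsetup table vg_data-lv_data
    # Display status, UUID and open count for the same device
    dmsetup info vg_data-lv_data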
Network-based virtualisation: Technologies such as Storage Area Networks (SANs) allow hard disks and tapes to be made available to multiple servers. Almost all enterprise operating systems have the ability to connect to a SAN, either over IP or over Fibre Channel. This also simplifies the management of storage hardware from different vendors.
SCSI has been a standard protocol for computing devices to communicate with peripherals. The same communication has been extended to run over the well-established Internet protocol suite, TCP/IP. This is iSCSI, the low-cost alternative to Fibre Channel. Due to the inherent computational overhead of TCP/IP, iSCSI communication can cause a high load on the CPU. To overcome this limitation, some vendors came up with TCP offload engines. One of the open source implementations of the iSCSI protocol is the open-iscsi project, a high-performance implementation of RFC 3720.
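As an illustration, a typical open-iscsi session on an initiator could look like the sketch below; the portal address 192.168.1.50 and the target IQN are assumed values.

    # Discover the targets exported by the portal
    iscsiadm -m discovery -t sendtargets -p 192.168.1.50
    # Log in to one of the discovered targets
    iscsiadm -m node -T iqn.2011-12.com.example:storage.lun1 -p 192.168.1.50 --login
    # The new SCSI disk now appears as a local block device
    lsblk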
Storage device virtualisation: This is where the latest technology updates are happening. Storage vendors are bringing innovative solutions to market at a rapid pace. Virtualisation on storage devices can be at the block level or the file level. Block-level virtualisation can be found in intelligent disk subsystems. The servers access the storage through the I/O channels by means of LUN masking and RAID. The common protocols that run between the server and the storage are SCSI, FCoE and iSCSI.
Virtualisation at the file level is available via NAS servers, which take over the responsibility of file system management. Some of the free or open source NAS products are Openfiler, FreeNAS and CryptoNAS. Openfiler has been covered in a series that started in the August 2011 issue of LINUX For You. FreeNAS has extensive features for networking, services, drive management and monitoring. It also has advanced features such as snapshots and thin provisioning. CryptoNAS has been developed with disk encryption as the focus. It comes in two flavours: CryptoNAS-Server and CryptoNAS-CD. CryptoNAS was earlier called CryptoBox.
There are several advantages to storage device virtualisation. Servers are freed from the CPU-intensive tasks of virtualisation operations, and applications perform better as RAID operations are done by storage controllers. The administration of storage devices can be done in isolation and nearer to the physical devices. Heterogeneous (multi-vendor) storage devices can be deployed to get the best of each vendor’s technology. While this is an advantage, the proprietary features of some vendors’ products can sometimes make setting up a solution an uphill task.
Storage virtualisation in enterprise environments: In a data centre, when you migrate from physical servers to virtual servers, you gain on CPU utilisation. Physical servers that were utilised in the range of 20 to 60 per cent are now consolidated as VMs on a single physical machine, maximising server utilisation. Higher server utilisation also means that the read-write access density on the storage increases. How well this challenge is addressed decides the success of storage virtualisation.
In enterprise environments, there is the constant challenge of ensuring the performance of business-critical applications while meeting the ever-increasing demand for storage capacity, and keeping the overall costs as low as possible.
Hierarchical storage management: Hierarchical storage management (also known as data lifecycle management) is a concept in which the most recent data is stored in storage subsystems that are quick to access, on types of media that offer the fastest access. As the data ages, it can be moved to archival systems that take longer to retrieve from (see Figure 5). By doing this, data centre designers can optimise the price-performance-capacity equation. Hierarchical storage management was in operation even before virtualisation came into force; by virtualising the storage, it becomes simpler to automate this process completely.
In an enterprise, applications are tiered based on business criticality. Applications such as Enterprise Resource Planning and Customer Order Management are the most important and most demanding, and need storage types such as SSDs to meet their performance needs.
Applications that assist in business decisions, such as analytics, could form the second tier, which would need storage with lower latency, but the media type to be used would be based on a cost-benefit analysis. Enterprise applications such as Content Management Systems could fall into a category where capacity is a higher priority than I/O performance. Such data is typically stored on media such as disks with RAID capabilities. Email is an application type that could fall into both the business-critical as well as the capacity-intensive categories.
For statutory reasons and also to maintain the history of business, some data needs to be archived. The need for retrieving such data is rare and a slight delay in its retrieval does not adversely impact business. Such data is typically archived either on tapes or virtual tapes.
Let us now understand some key technologies that drive data centres to full virtualisation.
Data deduplication
Deduplication is the process of eliminating redundancy in data, thus reducing the number of storage devices required. This has an impact not only on capacity, but also on network bandwidth when data is backed up over the network. There are two types of deduplication: inline and post-process. Inline deduplication kicks in before data is written on to the disk. Post-process deduplication analyses the data at a later time, without interfering with the write process. Each has its own advantages and disadvantages.
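The basic idea can be sketched with a few shell commands: split the data into fixed-size chunks, fingerprint each chunk, and store only one copy per unique fingerprint. This is purely an illustration of the principle, not how a production deduplication engine works; the file name and the 4 KB chunk size are assumptions.

    # Split a file into 4 KB chunks (illustration only)
    split -b 4096 bigfile.img chunk_
    # Fingerprint every chunk; identical chunks produce identical hashes
    sha256sum chunk_* | sort > fingerprints.txt
    # List the fingerprints that occur more than once, i.e., redundant chunks
    awk '{print $1}' fingerprints.txt | uniq -d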
Thin provisioning
Traditional provisioning requires the full disk space to be available when configuring capacity for an application. There is no definite way to resolve the dilemma of how much extra space needs to be allocated for future growth, so administrators typically allocate storage space based on guesswork. This full allocation not only leaves unused capacity unavailable to other applications, but also increases the operational costs of keeping the storage devices running.
In contrast, thin provisioning allows administrators to allocate disk space that is even greater than the capacity currently available in the data centre. This works especially well when used in conjunction with the storage configurations of multiple applications. It is just like the over-subscription of network bandwidth practised by telecom operators, or the over-commitment practised by insurance companies. The actual allocation of disk space happens just-in-time, and hence there is no hogging of unused disk space by a few applications.
Thin provisioning does come with a few limitations. One of the scenarios for which thin provisioning is not recommended is when applications expect the data to be in contiguous blocks for I/O optimisation.
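On Linux, LVM2 supports thin provisioning through thin pools. The following is a minimal sketch, assuming the hypothetical vg_data volume group from the earlier example; the sizes are arbitrary.

    # Create a 100 GB thin pool inside the volume group
    lvcreate -L 100G --thinpool pool0 vg_data
    # Create a thin volume that advertises 500 GB, far more than the pool holds;
    # blocks are allocated just-in-time as data is actually written
    lvcreate -V 500G --thin -n lv_app vg_data/pool0
    # Keep an eye on the pool's real usage to avoid running out of physical space
    lvs -o lv_name,lv_size,data_percent vg_data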
Automatic storage tiering
Automatic storage tiering, or auto tiering, is built on the concept of hierarchical storage management. Auto tiering is the dynamic movement of data between disks of different types to meet performance, cost and capacity requirements. The movement from one tier to another is triggered based on policies set by the storage administrator.
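As a purely conceptual illustration of such a policy, the one-liner below demotes files that have not been accessed for 90 days from a fast tier to an archive tier. Real auto tiering works at the block level inside the storage array, but the policy idea is similar; the directory paths and the 90-day threshold are assumptions.

    # Run periodically (e.g., from cron): demote cold files to the archive tier
    find /tier1/data -type f -atime +90 -exec mv {} /tier3/archive/ \;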
Multi-tenancy
Multi-tenancy is about ensuring that the data accessed by an application has a boundary, so that only applications with the right permissions can access it. This enables the same storage to be used by different applications within a single enterprise, or by different enterprises, without compromising data security. Multi-tenancy has the most utility value in the cloud environment, where the service provider hosts data services for different companies in the same industry.
Future trends
Two prominent trends that will emerge in 2012 are the adoption of SSD/flash as part of the Tier-1 storage layer, and increased storage utilisation. Another trend that could go beyond 2012 is the convergence of servers and storage devices into single systems.