Elements of server disk subsystems

The disk and file subsystems of a computer are usually not the subject of much user attention. The hard drive is a fairly reliable thing and functions as if on its own, without attracting the attention of the average user at all.

Having mastered the basic techniques for working with files and folders, such a user performs them almost automatically, without ever thinking about the additional tools that exist for servicing the hard drive. Disk management is left entirely to the operating system.

The difficulty begins either when the file system shows a clear drop in performance or when it starts to fail. Another reason to study this topic more closely is installing several hard drives in one PC at the same time.

Like any complex device, a hard drive needs regular maintenance. Although Windows 7 partially takes care of these concerns, it cannot solve all the problems on its own, and slowdowns are guaranteed over time. At a minimum, you need to be able to do the following things:

  • Clean the file system of garbage. Garbage includes temporary files, accumulated browser cookies, duplicate data, etc.
  • Defragment the hard drive. The Windows file system is built in such a way that what the user sees as a single file is actually a set of fragments scattered across the magnetic surface of the drive and linked into a chain: each fragment points to the next one. To read a file as a whole, these parts must be gathered together, which requires many read operations from different places on the surface. The same thing happens when writing. Defragmentation collects all these pieces into one place.
  • View and edit partition information.
  • Open access to hidden and system files and folders.
  • If necessary, work with several hard drives at once.

And also perform some other useful actions. In our note we will not discuss the entire range of these issues, but will focus only on a few.

How to read partition information?

For those who are not in the know, let us explain: in Windows there is such a thing as a “snap-in”.

This is a console file with the .msc extension that can be launched like a regular executable (it is actually opened by the Microsoft Management Console, mmc.exe). All snap-ins have a uniform interface and are built on COM technology, the basis of the internal structure of this operating system.

The Disk Management window is also a snap-in. You can launch it by typing its name, diskmgmt.msc, in the "Run" dialog.
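For those who prefer scripts, the same snap-in can be started programmatically. Here is a minimal sketch in Python; it assumes a Windows machine where mmc.exe (the program that actually opens .msc files) is on the search path.

    import subprocess

    # Launch the Disk Management snap-in; equivalent to typing
    # "diskmgmt.msc" in the Run dialog on Windows.
    subprocess.run(["mmc.exe", "diskmgmt.msc"], check=True)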

As a result, we will see the snap-in window itself, with the heading "Disk Management".

This interface is intuitive and simple. In the top panel of the window we see a list of all volumes (or partitions) available on the hard drive, with accompanying information about them, such as:

  • Partition name.
  • Partition type.
  • Its full capacity.
  • Its status (different partitions may have different statuses).
  • The remaining free space, expressed in gigabytes and as a percentage of the total.

And other information. The bottom panel contains a list of drives and partitions. It is from here that you can perform operations on volumes and drives. To do this, right-click on the volume name and select a specific operation from the “Actions” submenu.

The main advantage of this interface is that everything is gathered in one place; there is no need to wander through different menus and windows to get things done.

Volume Operations

Let's look at some non-obvious operations with partitions. First, let's discuss the transition from the MBR format to the GPT format. These are two different partitioning (and boot record) schemes. MBR is the classic but now outdated one.

It has obvious limitations both in volume size (no more than 2 TB) and in the number of volumes: no more than four are supported. Do not confuse a volume with a partition; these are somewhat different concepts, and you can read about the differences on the Internet. The GPT format is built on GUIDs (globally unique identifiers) and does not have these restrictions.

So if you have a large disk, feel free to convert MBR to GPT. Keep in mind, however, that all data on the disk will be destroyed in the process, so it must be copied elsewhere first.
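If you prefer to script the conversion, a hedged sketch using the built-in diskpart utility might look like the following. The disk number is a made-up example, administrator rights are required, and, as noted above, the clean step erases the disk.

    import subprocess
    import tempfile

    # Hypothetical example: convert physical disk 1 from MBR to GPT.
    # WARNING: "clean" wipes the whole disk; back the data up first.
    script = "select disk 1\nclean\nconvert gpt\n"

    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write(script)
        script_path = f.name

    subprocess.run(["diskpart", "/s", script_path], check=True)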

Virtualization technology has penetrated everywhere. It didn't bypass the file system either. If you wish, you can create and mount so-called “virtual disks”.

Such a “device” is a regular file with the .vhd extension and can be used like a regular physical device - both for reading and writing.
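For illustration, the commands that create and attach such a virtual disk can be collected into a script and fed to diskpart /s, as in the previous sketch; the path and size below are made-up example values.

    # Hypothetical diskpart script: create a 1 GB expandable .vhd and attach it
    # (run via "diskpart /s" with administrator rights, as shown earlier).
    VHD_SCRIPT = (
        'create vdisk file="C:\\vdisks\\example.vhd" maximum=1024 type=expandable\n'
        'select vdisk file="C:\\vdisks\\example.vhd"\n'
        "attach vdisk\n"
    )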

This opens up additional opportunities for cataloging information. This concludes our story. Disk management in Windows 7 is a fairly broad topic, and if you dive into it, you can discover a lot of new things.

16.01.1997 Patrick Corrigan, Mickey Applebaum

The configuration options for server disk subsystems are varied, and, as a result, confusion is inevitable. To help you understand this difficult issue, we decided to look at the main technologies and the economic justification for their use.

DISC


With server disk subsystems, you have many options to choose from, but the abundance makes it difficult to find the system that is best for you. The situation is complicated by the fact that during the selection process you will have to wade through a considerable amount of misinformation and marketing hype.

Our review of the main server disk subsystem technologies, and a discussion of when they are worth using in terms of cost, performance, reliability and fault tolerance, should help you make sense of the issue.

DISK INTERFACES

Whether you are specifying a new server or upgrading an existing one, the disk interface is a critical issue. Most drives today use SCSI or IDE interfaces. We will look at both technologies, describe their implementations, and discuss their operation.

SCSI is a standardized ANSI interface that has several variations. The original SCSI specification, now called SCSI-1, uses an 8-bit data channel with a maximum transfer rate of 5 MB/s. SCSI-2 allows several variations, including Fast SCSI with an 8-bit data channel and transfer rates of up to 10 MB/s; Wide SCSI with a 16-bit data channel and transfer rates of up to 10 MB/s; and Fast/Wide SCSI with a 16-bit data channel and transfer rates of up to 20 MB/s (see Table 1).

TABLE 1 - SCSI OPTIONS

                   Maximum performance   Channel width   Frequency   Number of devices*
SCSI-1             5 MB/s                8 bits          5 MHz       8
SCSI-2:
  Fast SCSI        10 MB/s               8 bits          10 MHz      8
  Wide SCSI        10 MB/s               16 bits         5 MHz       16
  Fast/Wide SCSI   20 MB/s               16 bits         10 MHz      16

* supported devices include the HBA

With the advent of the “wide” 16-bit Fast/Wide SCSI, the 8-bit versions were sometimes called “narrow” - Narrow SCSI. Recently, several more SCSI implementations have appeared: Ultra SCSI, Wide Ultra SCSI and SCSI-3. Compared to more common options, these interfaces have some performance advantages, but since they are not yet very widespread (the number of devices using these interfaces is very limited), we will not discuss them in our article.

The SCSI-1 cabling system is a linear bus that can connect up to eight devices, including the host bus adapter (HBA). This bus design is called single-ended SCSI, and the cable length can reach nine meters. SCSI-2 (which has almost completely replaced SCSI-1) supports both single-ended SCSI and differential SCSI. Differential SCSI uses a different signaling method than single-ended SCSI and supports up to 16 devices on a cable up to 25 meters long. It provides better noise immunity, which in many cases means better performance.

One of the problems with differential SCSI is device compatibility. For example, today there are a limited number of varieties of differential SCSI-compatible tape drives and CD-ROM drives. Differential devices and HBAs are typically slightly more expensive than single-ended devices, but they have the advantage of supporting more devices per channel, longer cable runs, and in some cases, better performance.

When choosing SCSI devices, you should be aware of compatibility issues. Single-ended SCSI and differential SCSI can use the same wiring, but single-ended and differential devices cannot be combined. Wide SCSI uses a different cabling system than Narrow SCSI, so it is not possible to use Wide SCSI and Narrow SCSI devices on the same channel.

HOW SCSI WORKS

In SCSI, the device controller (for example, a disk controller) and the interface with the computer are different devices. The computer interface, HBA, adds an additional interface bus to the computer for connecting multiple device controllers: up to seven device controllers on a SCSI channel with a single-ended output signal and up to 15 on a differential channel. Technically, each controller can support up to four devices. However, at the high transfer rates of today's high-capacity drives, the device controller is usually built into the drive to reduce noise and electrical interference. This means you can have up to seven drives on a single-ended SCSI link and up to 15 on a differential SCSI link.

One of the advantages of SCSI is its handling of multiple, overlapping commands. This support for overlapped I/O lets SCSI drives interleave their read and write operations with those of other drives in the system, so different drives can process commands in parallel rather than one after another.

Since all the intelligence of a SCSI disk interface resides in the HBA, the HBA controls OS access to the disks. As a result, the HBA, rather than the computer, resolves translation and device access conflicts. In general, this means that, provided that correctly written and installed drivers are used, the computer and the OS do not see any difference between the devices.

In addition, because the HBA controls access between the computer's internal expansion bus and the SCSI bus, it can resolve access conflicts between them and provide advanced capabilities such as disconnect/reconnect. Disconnect/reconnect allows the OS to send a seek, read, or write command to a specific device and then leave that drive to carry out the command on its own, so that another drive on the same channel can receive a command in the meantime. This can significantly improve the throughput of disk channels with more than two disks, especially when the data is distributed or scattered across the disks. Another advanced feature is synchronous data transfer, which increases overall disk channel throughput and data integrity.

IDE

IDE is a de facto standard widely used in PCs based on x86 processors. It is only a general recommendation to manufacturers, so each one is free to develop its own IDE implementation for its devices and adapters. As a result, products from different manufacturers, and even different models from the same manufacturer, turned out to be incompatible with each other. Once the specification became established, this problem largely disappeared, but incompatibilities are still possible.

Unlike SCSI, IDE delegates the intelligent functions to the disk, not the HBA. The HBA for the IDE has virtually no intelligence and simply directly connects the computer bus to the disks. Without an intermediate interface, the number of devices on one IDE channel is limited to two, and the cable length is limited to three meters.

Since all the intelligence of IDE devices resides in the devices themselves, one of the devices on the channel is designated as the channel master; the built-in controller on the second is disabled, and it becomes the channel slave. The master device controls access to both devices over the IDE channel and performs all I/O operations for them. This is one source of conflict between devices caused by different manufacturers' implementations of the IDE interface. For example, one drive may be designed to work with a specific controller circuit, while the master device it is connected to may use a different type of controller. In addition, the newer Enhanced IDE (EIDE) drives use an expanded command set and translation tables to support larger capacities and higher performance. If they are connected as a slave to an older, standard IDE master drive, not only do they lose their advanced features, but they may not give you all of their available capacity. Worse, they may report their full capacity to the OS without being able to use it, which can lead to data corruption on the disk.

The possibility of data corruption is due to the fact that each OS perceives disk configuration information differently. For example, DOS and system BIOS only allow a maximum disk capacity of 528 MB. NetWare and other 32-bit systems do not have these limitations and are able to read the entire IDE disk directly through its electronics. When you create multiple partitions from different operating systems on the same disk, each of them sees the capacity and configuration differently, and this can lead to overlapping partition tables, which, in turn, significantly increases the risk of data loss on the disk.

The original IDE architecture cannot recognize drives larger than 528 MB and can only support two devices per channel at a maximum transfer rate of about 3 MB/s. To overcome some of the limitations of IDE, the EIDE architecture was introduced in 1994. EIDE supports greater capacity and performance, but its transfer rates of 9 to 16 MB/s are still slower than SCSI transfer rates. Additionally, unlike the 15 devices per channel possible with SCSI, it can support a maximum of four devices (two on each of its two channels). Note also that neither IDE nor EIDE implements multitasking functionality, so they cannot provide the same level of performance as SCSI interfaces in a typical server environment.

Although the IDE standard was originally developed for disks, it now supports tape devices and CD-ROMs. However, sharing a channel with a CD-ROM or tape device can negatively impact disk performance. Overall, SCSI's performance and expandability advantages make it superior to IDE or EIDE for most high-end server applications that require high performance. However, for entry-level applications where performance or extensibility is not a big concern, an IDE or EIDE will suffice. At the same time, if you need disk redundancy, then an IDE, due to the potential problems associated with the master-slave approach, is not the best option. In addition, you should be wary of possible overlapping partition tables and master-slave device incompatibility issues.

However, there are several cases where the IDE and EIDE interfaces can be used in high-end servers. It is common practice, for example, to use a small IDE disk for the DOS partition on NetWare servers. The use of CD-ROM drives with an IDE interface for loading software is also widely practiced.

REDUNDANT DISK SYSTEMS

Another important issue to consider when defining a server specification is redundancy. There are several methods for increasing the reliability of a multi-disk subsystem. Most of these redundancy schemes are variations of RAID (Redundant Array of Inexpensive, or Independent, Disks). The original RAID specification was intended to replace the large, expensive drives of mainframes and minicomputers with arrays of small, low-cost drives of the kind developed for personal computers; hence the word "inexpensive." Unfortunately, it is rare to find anything inexpensive in RAID systems.

RAID is a series of redundant disk array implementations that provide varying levels of protection and data transfer rates. Since RAID involves disk arrays, SCSI, which can support up to 15 devices per channel, is the most suitable interface. There are six standard RAID levels, from zero to five, although some manufacturers advertise proprietary redundancy schemes that they call RAID-6, RAID-7 or higher. (RAID-2 and RAID-4 are not used in network servers, so we will not discuss them.)

Of all the RAID levels, level zero has the highest performance and the least protection. It requires at least two devices, and data is written to both drives in a coordinated manner, with the drives appearing as one physical device. The process of writing data across multiple disks is called drive spanning, and the actual method of writing this data is called data striping. With striping, data is written to all disks block by block; this process is called block interleaving. The block size is determined by the operating system, but typically ranges from 2 KB to 64 KB. Depending on the design of the disk controller and HBA, these sequential writes may overlap, resulting in increased performance. Thus, RAID-0 by itself can improve performance, but it provides no protection against failures. If one disk fails, the entire subsystem fails, usually resulting in complete data loss.
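To make the block-interleaving idea concrete, here is a minimal Python sketch of how logical blocks map onto the drives of a RAID-0 array; the drive count and block size are arbitrary example values, not part of any standard.

    BLOCK_SIZE = 64 * 1024   # example block ("stripe unit") size, in bytes
    NUM_DISKS = 2            # RAID-0 needs at least two drives

    def locate_block(logical_block: int) -> tuple[int, int]:
        """Map a logical block number to (disk index, block offset on that disk)."""
        return logical_block % NUM_DISKS, logical_block // NUM_DISKS

    # Blocks 0, 1, 2, 3 land on disks 0, 1, 0, 1:
    # sequential I/O is spread over both drives.
    for b in range(4):
        disk, offset = locate_block(b)
        print(f"logical block {b} -> disk {disk}, block {offset}")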

A variation on data striping is data scattering. As with striping, data is written sequentially to the multiple disks being spanned. Unlike striping, however, the writes do not have to go to every disk; if a disk is busy or full, the data can be written to the next available disk, which allows disks to be added to an existing volume. As with RAID-0, the combination of drive spanning and data scattering improves performance and increases volume size, but provides no protection against failures.

RAID-1, known as disk mirroring, uses pairs of identical disks, with each disk in the pair being a mirror image of the other. Data is written to both disks of the identical, or nearly identical, pair: if one disk fails, the system continues to work with the mirror disk. If the mirrored disks share a single HBA, the performance of this configuration will be lower than that of a single disk, since data must be written to each disk in turn.

Novell has narrowed the definition of mirroring and added the concept of duplexing. In Novell terminology, mirroring refers to disk pairs connected to a server or computer through a single HBA, while duplexing refers to mirrored disk pairs connected through separate HBAs. Duplexing provides redundancy for the entire disk channel, including the HBA, cables and disks, and gives some performance improvement as well.

RAID-3 requires a minimum of three identical drives. It is often referred to as n-minus-1 (n-1) technology because the usable capacity equals that of the total number of disks in the array (n) minus the one parity disk. RAID-3 uses a writing technique called bit interleaving, in which data is written to all disks bit by bit. For each byte written to the n disks, a parity bit is written to the parity disk. This is an exceptionally slow process, because before parity information can be generated and written to the parity disk, data must be written to each of the n disks of the array. RAID-3 performance can be improved by synchronizing the spindles of the disks so that they work in lockstep. However, due to its performance limitations, the use of RAID-3 has declined sharply, and very few RAID-3-based server products are sold today.

RAID-5 is the most popular RAID implementation on the network server market. Like RAID-3, it requires at least three identical disks. However, unlike RAID-3, RAID-5 stripes data blocks without using a dedicated parity disk. Both the data and the checksum are written throughout the array. This method allows independent disk reads and writes, and also allows the operating system or RAID controller to perform multiple concurrent I/O operations.

In RAID-5 configurations, the disk is accessed only when parity information or data is read from/written to it. As a result, RAID-5 has higher performance than RAID-3. In practice, RAID-5 performance can sometimes match or even exceed that of single-disk systems. Such performance improvements, of course, depend on many factors, including how the RAID array is implemented and what native capabilities the server's operating system has. RAID-5 also provides the highest level of data integrity of any standard RAID implementation because both data and parity information are written in stripes. Because RAID-5 uses block striping rather than bit striping, spin synchronization does not provide any performance benefit.
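As an illustration of distributed parity, the toy Python model below rotates the parity block across a three-disk array and computes it as the XOR of the data blocks in each stripe. The rotation rule is an assumption made for the sketch, not any vendor's actual layout.

    from functools import reduce

    NUM_DISKS = 3  # RAID-5 needs at least three drives

    def parity_disk(stripe: int) -> int:
        """Disk that holds the parity block of a given stripe (rotating placement)."""
        return (NUM_DISKS - 1 - stripe) % NUM_DISKS

    def make_stripe(stripe: int, data_blocks: list) -> list:
        """Lay out NUM_DISKS - 1 data blocks plus their XOR parity across the disks."""
        assert len(data_blocks) == NUM_DISKS - 1
        parity = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), data_blocks)
        layout, data = [], iter(data_blocks)
        for disk in range(NUM_DISKS):
            layout.append(parity if disk == parity_disk(stripe) else next(data))
        return layout

    print(make_stripe(0, [b"AAAA", b"BBBB"]))  # parity lands on disk 2
    print(make_stripe(1, [b"CCCC", b"DDDD"]))  # parity rotates to disk 1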

Some manufacturers have added extensions to their RAID-5 systems. One of these extensions is the presence of a “hot-spare” disk built into the array. If a disk failure occurs, the hot spare disk immediately replaces the failed disk and copies the data back to itself using parity recovery in the background. However, remember that rebuilding a RAID-5 disk results in a serious drop in server performance. (For more information about hot-swap and hot-spare drives, see the sidebar "Hot Drive Features.")

RAID systems can be organized either using software loaded on the server and using its processor to operate, or using a specialized RAID controller.

Software-based RAID systems consume a significant portion of the system processor resources, as well as system memory, which greatly reduces server performance. Software RAID systems are sometimes included as an operating system feature (as is done in Microsoft Windows NT Server) or as a third-party add-on (as is done in NetWare and the Macintosh operating system).

Hardware-based RAID systems use a dedicated RAID array controller; it typically has its own processor, cache memory, and ROM software to perform disk I/O and parity functions. Having a dedicated controller to perform these operations frees up the server's processor to perform other functions. Additionally, because the adapter's processor and software are specifically tuned to perform RAID functions, they provide greater disk I/O performance and better data integrity than software-based RAID systems. Unfortunately, hardware-based RAID controllers tend to be more expensive than their software-based competitors.

MIRRORING, DUPLEXING AND SPANNING

Some operating systems, including NetWare and Windows NT Server, allow disk mirroring across multiple disk channels, providing an additional level of redundancy. As mentioned earlier, Novell calls this approach disk duplexing. Combined with drive spanning, duplexing can deliver better performance than single-disk systems and can generally outperform hardware RAID-5 implementations. Because each half of a mirrored pair uses a separate disk channel, writes to the two disks can be performed simultaneously, unlike when the disks share one HBA. Duplexing also allows split seeks, in which read requests are divided between the disk channels for faster execution. This feature can double disk read performance by letting both channels seek different blocks of the same data set in parallel. It also reduces the performance impact of writes, since one channel can read data while the other writes.

NetWare supports up to eight disk channels (some SCSI adapters provide multiple channels), which means you can have multiple channels for each duplexed pair. You can even organize up to eight separately mirrored channels. Windows NT Server also provides software mirroring and duplexing, but does not yet support parallel writes or split seeks.

When choosing a redundant disk system, you must consider four main factors: performance, cost, reliability, and failure protection.

When it comes to performance, the built-in capabilities of the server operating system are a major factor, especially when disk redundancy comes into play. As stated earlier, NetWare disk duplexing combined with drive spanning provides better performance than hardware or software RAID. However, the performance of hardware RAID is generally superior to that of Windows NT Server's built-in disk services. Generally speaking, the technology and performance of RAID systems have been improving continuously for several years.

Another potential performance issue for RAID systems is data recovery in the event of a disaster. Until recently, if a disk failed, you had to shut down the RAID array to restore it. Also, if you wanted to change the size of the array (increase or decrease its capacity), you had to take a full backup of the system, and then reconfigure and reinitialize the array, erasing all data during the process. In both cases, the system is unavailable for quite a long time.

To solve this problem, Compaq developed the Smart Array-II controller, which allows you to increase the capacity of the array without reinitializing the existing array configuration. Other manufacturers, including Distributed Processing Technology (DPT), have announced that their controllers will perform similar functions in the not-too-distant future. Many of the new arrays have utilities for various operating systems, with which the array can be restored after replacing a damaged device without shutting down the server. However, keep in mind that these utilities eat up a lot of server resources and thus negatively affect system performance. To avoid this kind of difficulty, system restoration should be carried out during non-working hours.

There has been a recurring debate in industry and RAID vendor publications about the cost differences between mirroring, duplexing and RAID implementations. Mirroring and duplexing require a 100% doubling of disks and (for duplexing) HBAs, while RAID implementations need one HBA and/or RAID controller plus only one more disk than the capacity you want to end up with. By this argument, RAID is cheaper because it requires fewer disks. This may be true if you can tolerate the performance limitations of the software-based RAID implementations included in operating systems such as Windows NT. In most cases, however, a dedicated RAID controller is required to achieve adequate performance.

SCSI drives and standard adapters are relatively inexpensive, while a high-quality RAID controller can cost up to $4,500. To determine the cost of your system, you must consider the optimal configurations for all components. For example, if you need approximately 16 GB of addressable disk space, you can implement a mirror configuration with two 9 GB drives per channel and gain some excess capacity. In the case of RAID-5, for performance and reliability reasons, it is better to go with five 4 GB drives to increase the number of spindles for striping data and thereby the overall performance of the array.

When using an external disk subsystem, the cost of a mirrored configuration will be approximately $10,500 for 18 GB of available space. This figure is based on actual retail prices: $2,000 for one drive, $250 for one HBA, and $300 for each external drive subsystem plus cables. A RAID-5 system configured with 16 GB of addressable space using five 4 GB drives would cost about $12,800. This figure is based on actual retail prices for a DPT RAID-5 array.

Many RAID systems include "proprietary" components designed by the manufacturer. At a minimum, the case and backplane are proprietary. HBAs and RAID controllers are also often proprietary. Some manufacturers also use non-standard drive carriers and buses. Some sell them separately at a reasonable price, others only together with a disk and, as a rule, at a high price. The latter approach can be costly when you need to repair or expand your system. Another way a vendor can paint you into a corner is by providing disk administration and monitoring software that only works with specific components. By avoiding non-standard components whenever possible, costs can usually be kept down.

When comparing the reliability of redundant disk systems, there are two factors to consider: the possibility of system failure or failure of any component, and the likelihood of data loss due to component failure. (Unfortunately, RAID or mirroring cannot save you from the main cause of data loss - user error!)

The probability P of a failure over a given period of operation can be estimated as

P = t / Tc,

where t is the operating time and Tc is the combined mean time between failures of the components.

Assuming a year of failure-free operation (8760 hours) and a hypothetical disk Tc of 300,000 hours, the failure rate becomes 3%, or slightly less than one in 34. As the number of components increases, the probability of any component failure increases. Both RAID and mirroring increase the chance of failure but reduce the chance of data loss.
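The arithmetic behind that estimate, using the hypothetical MTBF figure from the text:

    t = 8760        # one year of continuous operation, in hours
    Tc = 300_000    # hypothetical mean time between failures of one disk, in hours

    p = t / Tc
    print(f"P = {p:.4f}  (~{p:.1%}, roughly 1 in {round(1 / p)})")  # ~2.9%, about 1 in 34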

Table 2, taken from a Storage Dimensions newsletter entitled "Fault-tolerant storage systems for always-on networks," shows the probability of failure calculated using the above formula versus the probability of data loss for four spanned drives, a five-drive RAID array and eight mirrored drives. (It assumes all drives are the same size and all three systems provide the same usable capacity. To obtain the fact sheet, visit the Storage Dimensions page: http://www.storagedimensions.com/raidwin/wp-ovrvw.html.)

TABLE 2 - ESTIMATES OF PROBABILITY OF FAILURE

While mirroring combined with drive spanning has a greater statistical chance of a disk failure, it also has a significantly lower probability of losing data when a disk does fail. Additionally, with a properly designed redundant system, recovery times can be significantly shorter.

This example doesn't take many factors into account. To obtain a statistically correct figure, the mean time between failures of all disk system components, including HBAs, cables, power cords, fans and power supplies, must be calculated. Of course, these calculations tell only what can happen given the reliability of the assumed components, but it is not at all certain that this will happen.

When choosing a disk system, you must clearly know which components are not duplicated. In RAID systems, this may include HBAs, RAID controllers, power supplies, power cables, and cables. One of the benefits of redundancy with separate disk subsystems on each channel is the elimination of most single points where failures can occur.

CONCLUSION

In general, SCSI devices are a better choice for a server's disk subsystem than IDE or EIDE drives. SCSI drives with capacities up to 9 GB per disk are easy to obtain, while today's EIDE drives have a maximum capacity of about 2.5 GB. When using multiple dual-channel HBAs, the total SCSI capacity can easily exceed 100 GB, while the EIDE limit is 10 GB. SCSI also has better performance; moreover, SCSI does not suffer from the problems that the master-slave approach of IDE/EIDE entails.

If you need disk redundancy, there are several options. NetWare duplexing combined with drive spanning provides both excellent performance and protection from failures. Hardware RAID is also a good choice, but typically at lower performance and higher cost. If you are using Windows NT and performance is important to you, then hardware RAID may be the better choice.

Patrick Corrigan is President and Senior Consultant/Analyst at The Corrigan Group, a consulting and training company. He can be contacted at: [email protected] or via Compuserve: 75170.146. Mickey Applebaum is a Senior Network Consultant at GSE Erudite Software. He can be contacted at: [email protected]

INTRODUCTION TO THE FUNCTIONS OF DISK SUBSYSTEMS

"Hot" functions of disk subsystems

The terms hot-swap, hot spare, and hot-rebuild, which are widely used to describe the specific functions of disk subsystems, are often misunderstood.

Hot swap is a feature that allows you to remove a failed disk from a disk subsystem without shutting down the system. Hot swap support is a hardware feature of your disk subsystem, not RAID.

In hot-swappable systems, hard drives are typically mounted on a sled that allows the ground pins between the drive and the chassis to remain connected longer than the power and controller lines. This protects the drive from damage due to static discharge or electrical arcing between the contacts. Hot-swappable disks can be used in both RAID arrays and mirrored disk systems.

"Hot recovery" means the system's ability to restore the original disk configuration automatically after replacing a failed disk.

Hot spare drives are built into a RAID array and typically remain idle until needed. At some point after the hot spare drive replaces the failed drive, you need to replace the failed drive and restore the array configuration.

A disk system with hot-swap capability and hot spare disks does not necessarily have the ability to perform hot recovery. Hot Swap simply allows you to quickly, safely and easily remove/install a drive. Hot spare would appear to provide hot rebuild in that it allows the failed drive in the RAID array to be replaced immediately, but the failed drive still must be replaced and then commanded to rebuild. Today, all RAID systems available for the PC platform require user intervention at some level to begin data restoration - at least at the level of loading the NLM module on the NetWare server or pressing the start button in the NT Server application menu.



The goal of fault-tolerant architectures is to ensure the operation of an information system with low maintenance costs and zero downtime. Insufficient system availability can result in huge financial losses for the company. This amount consists of the cost of reduced employee productivity due to a system failure, the cost of work that cannot be completed until the system is restored, and the cost of repairing failed system elements. Therefore, when implementing enterprise-critical applications, it is worth considering that the cost of downtime due to system failures fully justifies the investment of considerable resources in installing fault-tolerant architectures.

To build a fault-tolerant system, you need to pay attention to several of its main components. The reliability of the disk subsystem is critical. Let's look at the main characteristics of fault-tolerant disk subsystems and take a closer look at their implementation using RAID technology.

What is behind the fault tolerance of the disk subsystem?

A fault-tolerant system automatically detects failed components, then very quickly determines the cause of the failure and reconfigures those components.

The key to creating a fault-tolerant system is to provide protective redundancy based on both hardware and software. This redundancy implements error detection algorithms that are used in conjunction with diagnostic algorithms to identify the cause of the error.

There are three main methods for detecting errors. The first is Initial Testing, which is carried out by the manufacturer before the final integration of the system. At this stage, hardware defects that could arise during the production and assembly of system components are identified.

The second method - Concurrent Online Testing - refers to the time of normal operation of the system. This method looks mainly for errors that may have appeared after installing the system. One of the most well-known methods of operational testing is parity checking. It ensures that every byte of data transmitted through a computer system reaches its next component intact. The parity method only detects the presence of an error and cannot determine which bit is lost. Therefore, it is used in conjunction with an Error Correction Code, which determines exactly what data is lost, allowing the system to quickly recover it.

Finally, the third error detection method is Redundancy Testing. It verifies that the system's fault tolerance controls are functioning correctly.

A fault-tolerant system must be able to failover to an alternate device in the event of a failure, and also inform the administrator of any configuration changes so that he can restore failed components before their duplicates fail. To do this, the system must send messages to the administrator console, log all errors on disk for periodic review, and also be able to send an external message if a failure occurs while the administrator is not at his workplace.

When choosing a fault-tolerant system, you must also consider its ability to adapt to new technologies, since computers and disk devices with higher performance are appearing at a fantastic rate.

Finally, users should not forget that the best implementation of fault tolerance requires them to periodically back up data to magnetic tape or optical disk to ensure its safety in the event of a catastrophe greater than the failure of any system component. Fault tolerance is unlikely to save you in the event of a fire, earthquake or terrorist bomb.

Disk RAID subsystems

While system operation can be disrupted by many factors, such as power loss or overheating, nothing matters more than protecting the data on your drives. A disk failure causes a long period of system downtime, because the data must be reconstructed before program execution can resume.

In 1987, three researchers at the University of California at Berkeley published a paper describing methods for providing fault tolerance using arrays of small (3.5- and 5.25-inch) disk drives that together could match the performance of the Single Large Expensive Disk (SLED) used in mainframes. This technology is called RAID, Redundant Array of Inexpensive Disks. Below we will look at the main characteristics of the six RAID levels.

The RAID levels have different performance characteristics and different costs. The fastest is RAID 1 (mirroring/duplexing), followed by RAID 3 or RAID 5 (depending on record size). The cost of each method depends on the total amount of disk space required; for small to medium amounts of data, for example, mirroring may be less expensive than RAID 3 or 5.

When choosing a fault-tolerant disk subsystem, you should also consider the software for automatic data recovery after a failure. If we are talking about a LAN file server, it is important that the data can be recovered with minimal effort on the part of the LAN administrator and with minimal disruption for the server's users. For RAID 1, for example, recovery simply means copying the data from the mirror drive to the repaired or replacement drive. For RAID 3, 4 and 5 systems, manufacturers supply software that reconstructs the data from the XOR segments. These programs run in the background, allowing users to keep working during recovery. RAID systems with built-in intelligent processors can perform reconstruction much faster than their counterparts that rely on software executed by the main system processor.
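A minimal sketch of the XOR reconstruction such software performs: the contents of a failed drive are recovered by XOR-ing the corresponding blocks of all surviving drives and the parity block. This is a toy model for illustration, not vendor code.

    def xor_blocks(blocks: list) -> bytes:
        """XOR equally sized blocks together, byte by byte."""
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                out[i] ^= byte
        return bytes(out)

    # Three data blocks of one stripe plus their parity block.
    d0, d1, d2 = b"\x01\x02", b"\x10\x20", b"\x0a\x0b"
    parity = xor_blocks([d0, d1, d2])

    # If the disk holding d1 fails, its block is rebuilt from the survivors + parity.
    rebuilt_d1 = xor_blocks([d0, d2, parity])
    assert rebuilt_d1 == d1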

Traditional RAID systems have undeniable advantages, but they also create many problems. Different RAID levels provide different performance and cost, leaving administrators to find the best option for their particular system. Today's RAID disk subsystems are quite complex to manage and configure. Increasing the amount of disk space and reconfiguring the subsystem is also a labor-intensive and time-consuming process.

To cope with these problems, new disk array technologies are being developed that can automatically configure themselves to different levels that no longer fit into the traditional framework of specified RAID levels. We will look at products of this type from Hewlett-Packard and EMC.

Hewlett-Packard AutoRAID

After four years of hard work, Hewlett-Packard's storage division has developed a new technology that realizes the redundancy inherent in traditional RAID while eliminating many of its shortcomings. The AutoRAID disk subsystem automatically selects the RAID level that meets user requirements and also implements a number of other important features.

The core of the technology is a set of disk subsystem controller algorithms for managing the addresses of data blocks. Traditional disk arrays, such as RAID 4 or 5, use static, predefined algorithms to translate host computer data block addresses to disk addresses. The AutoRAID developers abandoned this approach and preferred to use dynamic algorithms for intelligently mapping any block address on the host to any disk in the array. This display may change during system operation.

Dynamic algorithms allow the controller to move data stored in the disk array to any location on any disk without affecting the data or the way the host computer addresses it. This makes it possible to convert one RAID level into another. Using what it knows about the performance characteristics of the different RAID levels, the disk subsystem dynamically adapts to best meet the needs of the host computer.

Another important feature of this approach is that disks of different sizes and performance can easily be mixed in one subsystem. Some traditional disk arrays have similar capabilities, but configuring such a subsystem is a complex and time-consuming process. Configuration in AutoRAID is quick and easy. One of the administrator's tasks when configuring any disk array is creating virtual disks from the available physical space. Users work with virtual disks, which the subsystem controller presents to them as physical ones. When configuring a traditional disk array, the administrator must know the characteristics of each physical disk in order to group them into virtual disks. AutoRAID relieves the administrator of these complexities: it is enough to know the total amount of storage in the disk array. The administrator specifies the amount of space required for each virtual disk, and the mapping algorithms automatically group the physical disks to make the most efficient use of the available space and provide the best performance.

Reconfiguring the subsystem is also not difficult. One of the most common reasons for reconfiguration is the need to increase disk space. Traditional RAID subsystems solve this problem in two ways. The first is to add enough disks to create a new redundancy group. This method can be quite expensive. In the second case, the administrator saves all data to a backup disk, adds new disks, reconfigures the entire subsystem, and restores the data. Obviously, this process will require a lot of time, during which the system is not functioning.

Reconfiguration to add additional disk space looks much simpler. The administrator just needs to install new disks and create another virtual disk. This work is done interactively and takes a few seconds.

This ease of reconfiguration relies on AutoRAID's dynamic mapping technology. Each disk in the array is treated as a sequence of blocks. When new disks are added, their blocks join the shared pool of available storage. The mapping algorithms allow the controller to use each block independently, helping to achieve better system performance, cost and availability.
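A toy illustration of the dynamic-mapping idea as described here (an assumption-level model, not HP's actual algorithm): the host's block addresses go through a lookup table that the controller is free to change, so data can be relocated without the host noticing.

    # Maps a host (logical) block number to a (disk, physical block) pair.
    block_map = {
        0: (0, 10),
        1: (1, 42),
        2: (2, 7),
    }

    def resolve(host_block: int) -> tuple:
        """Translate the address the host uses into an actual disk location."""
        return block_map[host_block]

    def relocate(host_block: int, new_location: tuple) -> None:
        """Move a block (e.g. during balancing); the host's address never changes."""
        block_map[host_block] = new_location

    relocate(1, (3, 5))        # data migrated to a newly added disk
    assert resolve(1) == (3, 5)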

A unique feature of AutoRAID technology is the automatic and direct use of new disks to improve the performance of the disk subsystem. When a new disk is installed, the data is evenly redistributed across all disks in the subsystem. This process is called balancing and occurs in the background between host computer operations. Distributing data evenly across all disks creates more opportunities to perform multiple operations on data simultaneously. For transaction processing systems, increasing the number of parallel operations means increasing overall performance.

Another innovation of this technology builds on the balancing method: the so-called active hot spare. The functionality of an active hot spare is the same as that of a hot spare in a traditional array. If any disk fails, the subsystem controller immediately initiates a rebuild process that reconstructs the lost data on the spare disk and restores redundancy to the subsystem. In conventional arrays, the spare disk is not used until something happens to the system, because it holds the spare space for recovered data. Sometimes temporary storage is created on a hot spare disk, but it must be freed as soon as any disk fails.

HP AutoRAID technology uses hot spares to improve subsystem performance. The balancing process distributes user data across all system disks, including the hot spare disk (the more disks used for data, the greater the performance). In this case, on each disk, part of the space is reserved for data recovery in the event of a failure. During the system rebuild process, the reconstructed data will be stored on the backup section of each of the array disks.

EMC RAID-S

Storage system manufacturer EMC is offering a new implementation of RAID technology, RAID-S, that provides improved performance and data protection and eliminates many of the shortcomings of traditional RAID systems.

RAID-S cannot be categorized into any one RAID level. Leveraging new advances in hardware, software and data mapping, EMC combines the positive aspects of RAID 4 and 5 and RAID 6 with new technologies to create a new data protection design. RAID-S disk arrays are designed for use in mainframe-class systems.

RAID-S will enable users to build storage systems that provide the best balance between performance, data protection and system availability. RAID-S allows you to choose the RAID level that best suits your organization's needs. In addition, EMC allows you to combine RAID-S technology, RAID 1 disk array and other company disk storage systems in one system.

For example, a large bank may operate online transaction processing systems to serve its customers, as well as batch processing systems to handle administrative tasks. Each application has its own requirements for storing and accessing data. EMC disk systems will provide each of them with the necessary level of availability and data protection.

Levels of RAID excellence

RAID 0. RAID 0 is not inherently a fault-tolerant system, but it can significantly improve performance. In a conventional system, data is written to the disk sequentially until its capacity is exhausted. RAID 0 distributes data across the array disks as follows. If, for example, four disks are used, then data is written to the first track of the first disk, then the first track of the second disk, the first track of the third and the first track of the fourth. The data is then written to the second track of the first disc, and so on. This data distribution allows you to simultaneously read and write data on four disks and thereby increases system performance. On the other hand, if one of the disks fails, data will also have to be recovered on all four disks.

RAID 1. RAID 1 implements data mirroring/duplexing, creating a second copy of the data on a separate drive for each drive in the array. Duplexing, in addition to the data on the disk, also duplicates the adapter card and cable, providing even greater redundancy. The method of storing two copies of data is a reliable way to implement a fault-tolerant disk subsystem, and it has found widespread use in modern architectures.

RAID 2. RAID 2 distributes data on the array's disks bit by bit: the first bit is written on the first disk, the second bit is written on the second disk, etc. Redundancy is provided by several additional disks where the error correction code is written. This implementation is more expensive because it requires more redundancy overhead: an array with 16 to 32 primary disks must have three additional disks to store the correction code. RAID 2 provides high performance and reliability, but its use is limited primarily to the scientific research computer market due to high minimum disk space requirements. Network file servers do not currently use this method.

RAID 3. RAID 3 distributes data across the array's disks byte by byte: the first byte is written to the first disk, the second byte to the second disk, and so on. Redundancy is provided by one additional disk, to which the modulo-2 sum (XOR) of the data on the main disks is written. In this way, RAID 3 breaks up the records of data files, storing them on multiple disks simultaneously and providing very fast reads and writes. The XOR segments on the additional disk make it possible to detect any malfunction of the disk subsystem, and special software determines which drive in the array has failed. Byte-by-byte data distribution allows data for files with very long records to be read or written from several disks simultaneously, but only one read or write operation can be performed at a time.

RAID 4. RAID 4 is similar to RAID 3 except that data is distributed across the disks in blocks. One additional disk is again used to store the XOR segments. This implementation is useful for files with very short records and a higher frequency of reads than writes, since multiple reads can be performed simultaneously if the disk block size is appropriate. However, only one write operation at a time is still possible, since all writes use the same dedicated parity disk to store the checksum.

RAID 5. RAID 5, like RAID 4, uses block-based data distribution, but XOR segments are distributed across all disks in the array. This allows multiple write operations to be performed simultaneously. RAID 5 is also useful for files with short records.

Live Migration

A live data migration strategy allows, in particular, to store the most active data in RAID 1, which has the highest performance, and less active data in the lower-cost RAID 5. In most systems, actively used data makes up a small portion of the total information stored. Thus, the bulk of the data will be stored on RAID 5. This technology provides system administrators with two key advantages. First, it frees them from agonizing over which RAID level to choose. Second, the disk subsystem continuously optimizes the performance and cost of disk storage, just as would be the case if an administrator spent all his or her time tuning the system.
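Schematically, such a policy might look like the sketch below; the threshold and file names are made-up illustrations of the idea, not any vendor's actual logic.

    HOT_THRESHOLD = 100  # made-up access-count threshold

    def choose_tier(access_count: int) -> str:
        """Place active data on RAID 1, less active data on cheaper RAID 5 storage."""
        return "RAID 1" if access_count >= HOT_THRESHOLD else "RAID 5"

    access_counts = {"journal.dat": 5400, "archive_1995.dat": 3}
    placement = {name: choose_tier(count) for name, count in access_counts.items()}
    # {'journal.dat': 'RAID 1', 'archive_1995.dat': 'RAID 5'}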

Features of RAID-S implementation:

    RAID-S calculates the error correction code for redundancy at the disk driver level rather than at the subsystem controller level. This unloads the controller, freeing it to process I/O requests, and thereby improves the performance of the disk subsystem.

    In RAID-S, data is not chunked across physical disks as in traditional RAID implementations, but remains intact on a single disk. This allows existing tools for monitoring and configuring the I/O subsystem to be used without additional staff training.

    Because data is not distributed across disks, even if several disks fail at the same time, the information on the remaining volumes of the RAID-S group will still be available to applications on the host computer.

    RAID-S implements advanced technology and is prepared to easily integrate future technologies, protecting users' long-term investments.

The material is divided into three parts: A - theory, B - practice, C - creating a multiboot flash drive.

A. General theory (popular).

1. Hardware.

All the physical devices we use every day to store information (hard drives, CD-ROMs, flash drives, even floppy disks) are block I/O devices. They can be connected to a computer via various interfaces: IDE, SATA, eSATA, USB. The operating system provides the user and the application programmer with a single, transparent way to read and write information from and to these media.

Drivers communicate directly with the hardware. A driver is a program loaded into the operating system. It is a layer between the OS and devices, providing the OS with a standard software interface for I/O block devices.

2. Data on a physical disk.

These devices are called block devices because information is written and read on them in blocks (sectors, clusters) of a fixed size. The block size is a multiple of 512 bytes. The block approach is necessary to ensure high speed of the disk subsystem.

The disk itself is low-level formatted at the factory. A disk consists of cylinders. A cylinder is the set of concentric tracks at the same position on all platters; the first cylinders are at the outer edge of the platters, the last ones near the center. Each cylinder is divided into sectors, and the blocks on the disk are organized within sectors. In addition to the data itself, error-control information is recorded in each block. The controller inside the hard drive works with this information; it is not visible from the outside. The driver sends commands to the disk controller at the level of "read 10 blocks starting at cylinder 10, sector 20".

All useful data recorded on a medium is organized into partitions. In Windows, each partition is usually represented as a logical drive (C, D, E, ...). On removable media (flash drives, CDs, floppy disks) there is usually just one partition; on internal hard drives, by contrast, there are usually several. The data within a partition is organized by a file system.

Each partition can independently set its own block size - the cluster size. It regulates the speed/economy balance. A block is the smallest addressable unit of disk layout. A cluster combines several blocks - it is the minimum addressable unit in a partition.

Thus, the following logical hierarchy is established (from bottom to top): block, sector, cylinder - cluster - partition - file, directory.

In most file systems, a file occupies one or more clusters. If the file is smaller than the cluster size, it still occupies the whole cluster: the number of bytes allocated to any file on the disk is a multiple of the cluster size. Some file systems can place pieces of several files into one cluster (packing), but for now this is the exception. Thus, the larger the cluster size, the higher the speed and the more space is lost in partially filled clusters.
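The space a file actually consumes is its size rounded up to a whole number of clusters; a small sketch of the trade-off (the sizes are example values):

    import math

    def allocated_size(file_size: int, cluster_size: int) -> int:
        """Bytes actually reserved on disk: a whole number of clusters."""
        return math.ceil(file_size / cluster_size) * cluster_size

    for cluster in (4096, 32768, 65536):
        used = allocated_size(1500, cluster)
        print(f"1500-byte file, {cluster:>5}-byte cluster -> "
              f"{used} bytes on disk, {used - 1500} wasted")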

3. Physical disk partitioning.

The partition size is also measured in blocks. This is why, when dividing a disk into partitions, the size expressed in bytes can be slightly adjusted by the program.

Since there can be several partitions on a disk, they have to be listed somewhere, with the limits and properties of each partition. For this purpose a partition table is used, located at the beginning of the physical disk (the beginning of the disk is its first block in addressing order). In the classic case it is part of the MBR (master boot record), which occupies the entire first block. The whole partition table is allocated 64 bytes. Each table entry occupies 16 bytes and holds the start and end addresses of the partition, the partition type, the number of sectors in the partition, and the partition's "boot" flag. Thus, the maximum number of partitions on a disk is limited to four (16 × 4 = 64).
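For illustration, that table can be read directly from the first sector. The sketch below assumes the standard MBR entry layout (four 16-byte entries at offset 446, with the boot flag, partition type, starting LBA and sector count at fixed offsets) and a UNIX-style device path; error handling is omitted.

    import struct

    def read_mbr_partitions(device_path: str):
        """Parse the four primary-partition entries from an MBR (first 512 bytes)."""
        with open(device_path, "rb") as dev:        # e.g. "/dev/sda", needs root
            mbr = dev.read(512)
        assert mbr[510:512] == b"\x55\xaa", "no MBR signature"

        partitions = []
        for i in range(4):
            entry = mbr[446 + i * 16: 446 + (i + 1) * 16]
            boot_flag, ptype, start_lba, num_sectors = struct.unpack(
                "<B3xB3xII", entry)
            if ptype != 0:                          # type 0 marks an unused slot
                partitions.append({
                    "index": i + 1,
                    "bootable": boot_flag == 0x80,
                    "type": hex(ptype),
                    "start_lba": start_lba,
                    "sectors": num_sectors,
                    "size_mb": num_sectors * 512 // (1024 * 1024),
                })
        return partitions

    # print(read_mbr_partitions("/dev/sda"))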

This came about historically, but over time it became obvious that four partitions are not always enough, and a solution was found. The partitions listed in the disk header (in the MBR) are called primary partitions; there can still be at most four of them. In addition, the concept of an extended partition was introduced. An extended partition contains one or more sub-partitions and does not itself contain a file system; it occupies one of the primary slots.

Because the extended partition's sub-partitions are not listed in the disk partition table, they cannot be marked as bootable. The bootable partition is the one from which the operating system starts to load; it is flagged in its partition table entry, so only one of the four primary partitions can be marked. An extended partition cannot be bootable, since it has no file system.

The layout of the extended partition is described at its beginning. By analogy with the MBR, there is an EBR (extended boot record) located in its first sector; it describes the layout of the logical drives within the extended partition.

An optical disk and flash drive usually have only one partition, since smaller divisions do not make sense there. Typically, when burning a CD, the ISO 9660 file system is used. A disc image with this file system is called an ISO image. It is often used in isolation from the physical disk as a container for data transfer, since any image is a bit-by-bit exact copy of the physical media.

4. File system.

Each disk partition intended for data storage (that is, all partitions except the extended one) is formatted in accordance with some file system. Formatting is the process of creating a file system structure in some space on a disk - a partition. The file system organizes user data in the form of files located in some hierarchy of directories (folders, directories).

The structure of directories and files in a partition in the classic case is described in the file table. As a rule, the table takes up some space at the beginning of the section. After the table the data itself is written. This creates a system where the structure is described separately and the data (files) are stored separately.

If a file is deleted from the disk, it is removed from the file table and the space it occupied is marked as free; the data itself is not physically wiped. When the disk is written to, new data goes into free space, so if you create a new file after deleting an old one, it may well be written over the deleted one. A quick format (used in the vast majority of cases) likewise rewrites only the table. File recovery after deletion or formatting is based on these features.

During operation, physical damage may appear on the disk: some blocks become unreadable. Such blocks are called bad sectors (bad blocks). If a bad block is hit while reading the disk, an I/O error occurs. Depending on where bad blocks appear and how many there are, either part of the contents of files or part of the file table may be lost.

When a write to a bad block is attempted, the disk controller should detect the problem, allocate a spare area on the disk surface for this block, and take the old location out of use (bad block relocation). It does this on its own, unnoticed by the OS and drivers, and this works for as long as the reserve of spare space lasts.

5. Working with disk.

The operating system provides the ability to work with disks at the file, partition and device levels; how each level is accessed depends on the particular OS. What is common in any case is that a physical disk, and any of its partitions, can be accessed like an ordinary binary file: you can write data to it and read data from it. These capabilities are especially useful for creating and restoring disk images and for cloning disks.

In UNIX operating systems, all storage devices are represented as files in the /dev directory:

    sda, sdb, sdc, ... - physical disks (HDD, including external ones, flash drives, IDE drives);

    fd0, fd1 - floppy drives.

Partitions on each disk are available as sda1, sda2, sda3, ...

Disks are numbered in the order in which the BIOS sees them. Partition numbering is in the order in which partitions were created on the disk.

To make an image (a bit-for-bit copy of the information located on a disk or partition) of an entire disk (for example, the first one in BIOS order - sda), you need to read the data from /dev/sda into any other file created for the image, using a program that copies file contents sequentially. To restore the image to the disk, the same program is used in the opposite direction: the data is read from the image file and written to /dev/sda. In the same way you can create or restore an image of a single partition (for example, the first partition of the first disk - sda1) by accessing /dev/sda1 instead of /dev/sda.
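On UNIX systems this is normally done with the dd utility; purely as an illustration of the idea, here is a minimal Python sketch of such a sequential copy (the device and file paths are examples, and reading a block device requires root privileges):

    import shutil

    BLOCK = 1024 * 1024  # copy in 1 MiB chunks

    def copy_stream(src_path: str, dst_path: str) -> None:
        """Sequentially copy the contents of src_path into dst_path."""
        with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
            shutil.copyfileobj(src, dst, length=BLOCK)

    # Make an image of the whole first disk (needs root):
    # copy_stream("/dev/sda", "/backup/sda.img")
    # Restore the image back to the disk:
    # copy_stream("/backup/sda.img", "/dev/sda")
    # The same works for a single partition, e.g. /dev/sda1.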

6. Mounting.

To "turn" a disk device into a collection of files and directories that can be accessed, it must be mounted. There is no such thing as a mount in Windows. There, partitions are simply connected to logical drives (C:, D:, E, ...). Information about which letter to assign to which drive is stored in the OS itself.

In UNIX, the concept of mounting is fundamental to working with disks and provides much more flexibility than in Windows. Mounting is the process of binding some source of a disk image (either the disk itself or a file with its image) to some directory in the UNIX file system. The file system in UNIX starts from one point - from the root directory (/), and no logical drives C, D, E exist.

When a UNIX OS boots, the disk partition marked as the root partition is mounted at the root directory /. The OS service directories located at the root of the file system are created on that partition; other partitions can then be mounted onto those directories, or files can simply be written directly to the root partition.

The key point is that a disk image source (a block device, an image file, or a directory of an already mounted file system) can be mounted onto any directory at any nesting level of the file system that starts at /. Thus, different logical partitions of a physical disk are represented as directories of a single file system, in contrast to Windows, where each logical drive is treated as an autonomous file system with its own root.

To mount, you must specify the file system of the image, mounting options, and the directory to which it will be bound.

Thanks to this flexibility, you can bind one directory to several different places in the file system, make a disk image and mount it without writing it to a physical disk, or open an ISO image - and all of this without third-party utilities.
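As an example of the last point, here is a small Python sketch that loop-mounts an ISO image by calling the standard mount(8) utility; the image path and mount point are assumptions, and both commands require root privileges:

    import subprocess

    def mount_iso(image: str, mount_point: str) -> None:
        """Loop-mount an ISO image read-only at mount_point (requires root)."""
        subprocess.run(
            ["mount", "-t", "iso9660", "-o", "loop,ro", image, mount_point],
            check=True,
        )

    def unmount(mount_point: str) -> None:
        subprocess.run(["umount", mount_point], check=True)

    # Example (paths are assumptions):
    # mount_iso("/tmp/backup.iso", "/mnt/iso")
    # ... work with the files under /mnt/iso ...
    # unmount("/mnt/iso")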

7. MBR - boot area.

At the beginning of a physical disk there is usually an MBR (master boot record) - the boot area of the disk. When the computer starts, the BIOS determines which disk is the boot disk and looks for the MBR on it. If it is found, control is transferred to it; if not, an error is displayed saying that no boot disk was found.

In the MBR, in addition to the partition table (described above), there is program code that is loaded into memory and executed. It is this program that must determine the boot partition on the disk and transfer control to it. Control transfer occurs in a similar way: the first block (512 bytes) of the boot partition is placed in RAM and executed. It contains the program code that initiates the loading of the OS.

Because control at boot time passes from the BIOS to a program recorded on the disk, the choice of boot partition can be made more flexible. This is exactly what the GRUB and LILO boot loaders, widely used in the UNIX world, do (there is little reason to use the latter, LILO, on modern computers). With GRUB, the user can be offered a choice of which partition to boot from and how.

The GRUB code is too large to fit in the MBR, so it is installed on a separate partition (usually the one mounted at /boot) with a FAT, FFS or Ext2 file system. The MBR holds only a small piece of code that loads the GRUB code from that partition and transfers control to it.

GRUB determines, on its own or with the user's help, which partition to boot from. In the case of a Windows partition, control is simply handed over to it, just as a regular MBR would do. In the case of Linux, the boot loader does more: it loads the OS kernel into memory and transfers control to it.

Making a backup copy of the boot area of a disk is as easy as backing up an entire disk or a single partition. The point is that the MBR occupies the first 512 bytes of the /dev/sda disk. So, to back up the MBR, read the first 512 bytes of /dev/sda into a file; to restore it, write that file back to the beginning of /dev/sda.
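A minimal Python sketch of this backup and restore might look as follows; the device and file names are examples, both operations require root privileges, and restoring the MBR also overwrites the partition table stored in it:

    MBR_BYTES = 512

    def backup_mbr(device: str = "/dev/sda",
                   out_file: str = "mbr.bin") -> None:
        """Save the first 512 bytes of the device (boot code + partition table)."""
        with open(device, "rb") as disk, open(out_file, "wb") as backup:
            backup.write(disk.read(MBR_BYTES))

    def restore_mbr(device: str = "/dev/sda",
                    in_file: str = "mbr.bin") -> None:
        """Write the saved 512 bytes back to the beginning of the device."""
        with open(in_file, "rb") as backup, open(device, "r+b") as disk:
            disk.write(backup.read(MBR_BYTES))

    # backup_mbr()    # needs root
    # restore_mbr()   # careful: this also overwrites the partition table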

When we talk about disk subsystem resources, there are three of them: the amount of space, the read/write speed in MB/s, and the read/write speed in the number of input/output operations per second (IOPS, or simply I/O).

Let's talk about capacity first. I will list the considerations to keep in mind and then give an example calculation.

The considerations are as follows:

Disk space is occupied by the virtual machine disk files themselves. Therefore, you need to understand how much space they need;

If we plan to use thin disks for all or some of the VMs, we should plan their initial size and subsequent growth. (Hereinafter, thin disks mean the corresponding type of vmdk file, that is, the thin provisioning feature as implemented in ESX(i); thin provisioning can also be implemented on the storage system independently of ESX(i), and that storage-side functionality is not what is meant here.);

By default, the hypervisor creates a swap file for each VM, equal in size to the amount of its RAM. This swap file is located in the VM folder (by default) or on a separate LUN;

If you plan to use snapshots of the state, then you should also plan a place for them. The following considerations can be taken as a starting point:

If snapshots will exist for a short period after creation, for example, only for the duration of the backup, then we reserve ten percent of the VM disk size for them;

If snapshots will be used with average or unpredictable intensity, then it makes sense to allocate about 30% of the VM disk size for them;

If snapshots are used actively for some VMs (typical when VMs are used for testing and development), the space they occupy can be several times larger than the nominal size of the virtual disks. It is difficult to give exact recommendations here, but doubling the size of each VM can be taken as a starting point. (Hereinafter, a snapshot means the corresponding ESX(i) feature; snapshots can also be implemented on a storage system independently of ESX(i), and that storage-side functionality is not what is meant here.)

An example formula looks like this:

Space for a group of VMs = Number of VMs x (Disk size x T + Disk size x T x S + Memory capacity - Memory capacity x R), where:

T is the thin disk coefficient. If thin disks are not used, it equals 1. If they are used, it is hard to give a general estimate: it depends on the nature of the applications inside the VM. In essence, a thin disk occupies less space on the storage system than its nominal size, and this coefficient shows what share of the nominal size the virtual machine disks actually occupy;

S is the snapshot allowance: 10, 30 or 200 percent, depending on how long the snapshots live and how actively they are used;

R is the share of reserved memory. Reserved memory is not included in the VM swap file, so the swap file is created smaller: its size equals the amount of VM memory minus the amount of reserved memory.
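The same formula can be written as a small Python helper; this is only a sketch, and the function name vm_group_capacity and its parameter names simply mirror the quantities and coefficients described above:

    def vm_group_capacity(n_vms: int, disk_gb: float, mem_gb: float,
                          t: float = 1.0, s: float = 0.1,
                          r: float = 0.0) -> float:
        """Estimated space in GB for a group of VMs:
        N x (disk x T + disk x T x S + memory - memory x R)."""
        occupied = disk_gb * t        # space the VM disks actually take
        snapshots = occupied * s      # allowance for snapshots
        swap = mem_gb - mem_gb * r    # per-VM swap file
        return n_vms * (occupied + snapshots + swap)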

For sample input data, see Table 1.3.

Table 1.3. Data for disk subsystem capacity planning

    Group of VMs           Number of VMs   Disk size, GB   Memory, GB   T      S      R
    Infrastructure         15              20              2            1      10%    0
    Application servers    20              40              2            1      10%    0
    Critical servers       10              100             6            1      10%    0.5
    Test and temporary     20              20              2            30%    200%   0

We get an estimate of the required volume:

Infrastructure group - 15 x (20 + 20 x 10% + 2 - 2 x 0) = 360 GB;

Application servers - 20 x (40 + 40 x 10% + 2 - 2 x 0) = 920 GB;

Critical servers - 10 x (100 + 100 x 10% + 6 - 6 x 0.5) = 1130 GB;

Test and temporary - 20 x (20 x 30% + (20 x 30%) x 200% + 2 - 2 x 0) = 400 GB.
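These figures can be reproduced with the helper sketched above; the input values are taken straight from the example:

    groups = {
        "Infrastructure":      dict(n_vms=15, disk_gb=20,  mem_gb=2, t=1.0, s=0.1, r=0.0),
        "Application servers": dict(n_vms=20, disk_gb=40,  mem_gb=2, t=1.0, s=0.1, r=0.0),
        "Critical servers":    dict(n_vms=10, disk_gb=100, mem_gb=6, t=1.0, s=0.1, r=0.5),
        "Test and temporary":  dict(n_vms=20, disk_gb=20,  mem_gb=2, t=0.3, s=2.0, r=0.0),
    }
    total = 0.0
    for name, params in groups.items():
        gb = vm_group_capacity(**params)
        total += gb
        print(f"{name}: {gb:.0f} GB")
    print(f"Total: {total:.0f} GB")   # 360 + 920 + 1130 + 400 = 2810 GB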

Therefore, we can create two 1.4 TB LUNs and distribute the virtual machines between them roughly equally, or create 4-5 LUNs of 600-800 GB each and place the machines of different groups on different LUNs. Both options (and anything in between) are acceptable; the choice is made based on other preferences (for example, organizational ones).

Another disk subsystem resource is performance. For virtual machines, throughput in MB/s is not a reliable criterion, because when a large number of VMs access the same disks, the access pattern becomes essentially random. For a virtual infrastructure, the more important characteristic is the number of input/output operations per second (IOPS). The disk subsystem of our infrastructure must be able to deliver more of these operations than the virtual machines request.

In the general case, a guest OS request travels the following path to the physical disks:

1. The guest OS passes the request to the driver of the SAS/SCSI controller (which the hypervisor emulates for it).

2. The driver passes it to the SAS/SCSI virtual controller itself.

3. The hypervisor intercepts it, combines it with requests from other VMs and passes the common queue to the physical controller driver (HBA in the case of FC and hardware iSCSI or Ethernet controller in the case of NFS and software iSCSI).

4. The driver sends the request to the controller.

5. The controller transmits it to the storage system via a data network.

6. The storage controller accepts the request. This request is a read or write operation from some LUN or NFS volume.

7. A LUN is a "virtual partition" on a RAID array of physical disks. That is, the request is transmitted by the storage controller to the disks of this RAID array.

Where the disk subsystem bottleneck may be:

Most likely, at the level of physical disks. The number of physical disks in a RAID array is important. The more there are, the better read-write operations can be parallelized. Also, the faster (in I/O terms) the disks themselves, the better;

Different levels of RAID arrays have different performance levels. It is difficult to give complete recommendations, because in addition to speed, RAID types also differ in cost and reliability. However, the basic considerations are:

RAID-10 is the fastest, but uses disk space least efficiently, taking 50% to support fault tolerance;

RAID-6 is the most reliable, but suffers from low write performance (30-40% of RAID-10 at 100% write), although reading from it is as fast as from RAID-10;

RAID-5 is a compromise. Write performance is better than RAID-6 (but worse than RAID-10), and storage efficiency is higher (only one disk's worth of capacity is given up for fault tolerance). However, with modern high-capacity disks and large RAID groups, RAID-5 suffers from very long rebuilds after a disk failure, during which the array is unprotected against another failure (effectively degrading to RAID-0) and loses performance sharply;

RAID-0, or “RAID with zero fault tolerance,” cannot be used to store significant data;

Storage system settings, in particular storage controller cache. Studying the storage system documentation is important for its correct configuration and operation;

Data network. This is especially relevant if you plan to use IP storage: iSCSI or NFS. I am not saying that they should not be used - many have been running such systems for a long time. I am saying that you need to make sure the network has enough bandwidth for the load you plan to move into the virtual environment.

The resulting speed of the disk subsystem follows from the speed of the disks and from how the controller parallelizes accesses to them (that is, from the RAID type and similar functions). The ratio of read operations to write operations also matters; we take this ratio from statistics or from the documentation of the applications running in our VMs.

Let's look at an example. Suppose our VMs will generate a load of up to 1000 IOPS, 67% of which are reads and 33% writes. How many disks, and of what type, will we need with RAID-10 and with RAID-5?

In a RAID-10 array, all disks serve read operations, but effectively only half of them serve writes, because each data block is written to two disks at once. In a RAID-5 array, all disks serve reads, but every written block carries the overhead of calculating and updating the parity: one write to a RAID-5 array can be thought of as causing four operations on the disks.

For RAID-10:

Read - 1000 x 0.67 = 670 IOPS;

Write - 1000 x 0.33 = 330, x 2 (since each block is written to two disks) = 660 IOPS.

In total, the disks must deliver 1330 IOPS. Dividing 1330 by the IOPS figure stated in the specifications of a single disk gives the required number of disks in a RAID-10 array for this load.

For RAID-5:

Read - 1000 x 0.67 = 670 IOPS;

Write - 1000 x 0.33 = 330, x 4 = 1320 IOPS.

In total, the disks must deliver 1990 IOPS.

According to manufacturers' documentation, one 15k RPM SAS drive handles 150-180 IOPS, and one 7.2k RPM SATA drive handles 70-100 IOPS. However, there is an opinion that it is better to plan around more conservative figures: 50-60 for SATA and 100-120 for SAS.

Let's finish the example.

When using RAID-10 and SATA, we need 22-26 disks.

When using RAID-5 and SAS, we need 16-19 disks.
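The same arithmetic can be written down as a short Python sketch; the write penalties and the per-disk IOPS figures are the assumptions discussed above, and rounding the disk count up to a whole disk gives estimates within the rough ranges just quoted:

    import math

    # Write penalty: RAID-10 sends each write to two disks,
    # RAID-5 turns one logical write into roughly four disk operations.
    WRITE_PENALTY = {"RAID-10": 2, "RAID-5": 4}

    def required_disk_iops(read_iops: float, write_iops: float, raid: str) -> float:
        """IOPS the physical disks must deliver to serve the VM load."""
        return read_iops + write_iops * WRITE_PENALTY[raid]

    def disks_needed(read_iops: float, write_iops: float, raid: str,
                     iops_per_disk: float) -> int:
        """Disk count rounded up, since a fraction of a disk cannot be installed."""
        return math.ceil(required_disk_iops(read_iops, write_iops, raid) / iops_per_disk)

    print(required_disk_iops(670, 330, "RAID-10"))   # 1330 IOPS from the disks
    print(required_disk_iops(670, 330, "RAID-5"))    # 1990 IOPS from the disks
    print(disks_needed(670, 330, "RAID-10", 60))     # 23 SATA disks (conservative figure)
    print(disks_needed(670, 330, "RAID-5", 120))     # 17 SAS disks (conservative figure)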

Obviously, the calculations given here are quite approximate. Storage systems use various mechanisms, first of all caching, to optimize their operation. But this information is useful as a starting point for understanding how the disk subsystem is sized.

The methods for obtaining the required number of IOPS for a VM and the read-to-write ratio remain behind the scenes. For an existing infrastructure (when transferring it to virtual machines), this data can be obtained using special information collection tools, for example VMware Capacity Planner. For the planned infrastructure - from application documentation and personal experience.