Root filesystem on RAID
by Martin Schulze
What should you do if it is unacceptable to use a single disk or partition for
the root filesystem? Don't use one. Use two or three. This article
provides a solution to this problem.
RAID Introduction
RAID stands for "Redundant Array of Inexpensive Disks". It is meant to speed up disk access and to secure the data at the same time. RAID, though, is not new: it was invented in 1987 at the University of California, Berkeley. Before Linux it was only available in the form of special hardware that was quite expensive, so it could only be used in high-end computing centers.
During the last decades the performance of processors has increased by five to ten times each year, depending on which statistic you believe. During the same period disk capacity has only doubled or tripled, and prices have only halved, every one to two years. Disk electronics have not kept up with current processor speeds. The result is that I/O has become the bottleneck of modern computers. Just try to compile the famous XFree86 source on a plain Pentium II-233 with a regular SCSI disk layout.
By the time people at Berkeley realized this, they could also foresee that there would be no epoch-making new technology for hard disks in the near future. Since disks would remain magnetic and mechanical devices, and since the laws of physics permit only slight improvements there, other solutions were needed.
This resulted in the definition of several RAID levels. Nowadays they are used not only in high-end computer rooms but also in the so-called middle-end sector. Since some fellow kernel hackers decided to implement RAID for the Linux kernel, this technique can also be used on low-end PCs, and regular people can enjoy its advantages.
RAID levels share the following properties:
- Several different physical disks are combined and accessed as a compound element. Under Linux this is done by the driver for multiple devices, also known as /dev/md*.
- The stored data is distributed over all disks in a well defined way.
- The data is stored in a redundant way over the disks, so in case of failure data can be recovered.
By dividing data into equal chunks and distributing them over all affected disks, one gets higher I/O performance than with a single fast disk. The reason for this lies in the ability to request data from the disks in parallel. The easiest way to use this is called striping mode, or RAID level 0, although it doesn't provide any redundancy.
Redundancy is achieved in different ways. The simplest way is to store the same data on two equal disks. This is defined as RAID-1, which is also known as mirroring. Of course, one only gets a performance increase when at least four disks are used.
More efficient redundancy is obtained when the data is not duplicated as a whole, but a checksum is generated and stored together with the regular data. If one disk fails, its data can be reconstructed from all remaining data chunks of that stripe together with the stored checksum. The easiest way to calculate such a checksum is to XOR all data chunks in a stripe. This is defined in RAID levels 4 and 5. The unofficial level 6 uses another chunk for a second, different checksum algorithm, which results in two redundant disks and even better protection against breakdowns.
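To illustrate the XOR idea, here is a minimal sketch with three made-up four-bit chunks, using bash arithmetic; a real RAID implementation of course operates on whole disk blocks:

D1=$((2#1010)); D2=$((2#0110)); D3=$((2#1100))  # three data chunks (made-up values)
P=$((D1 ^ D2 ^ D3))                             # parity chunk, stored on a fourth disk
# If the disk holding D2 fails, its content is recovered by
# XORing the surviving chunks with the parity:
echo $((D1 ^ D3 ^ P))                           # prints 6, i.e. binary 0110 = D2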
How to set up RAID
Using filesystems with RAID has many advantages. First, there is speed: RAID combines several disks and reads/writes chunks from the disks in parallel. Second, you are able to get filesystems bigger than your largest disk (useful for /var/spool/news/, /pub/ etc.). Third, there is the possibility to achieve redundancy, so a disk failure won't end in data loss. For technical information on RAID please refer to <ftp://ftp.infodrom.north.de/pub/doc/tech/raid/>.
To do RAID with Linux you need a kernel with appropriate support. First of all this refers to support for the multiple devices driver (CONFIG_BLK_DEV_MD). Linux 2.0.x supports linear and striping modes (the latter is also known as RAID-0). Linux kernel 2.1.63 and later also support RAID levels 1, 4 and 5. If you want to use these levels with 2.0.x, you'll have to install the kernel patch that is mentioned at the end of this article.
To use any of them you need to activate the appropriate driver in the kernel. I'd suggest you compile a kernel of your own anyway. Additionally you need special tools installed. For linear mode and RAID level 0 you need the mdutils package, which should be included in your distribution. To use RAID level 1, 4 or 5 you need the raidtools package, which supersedes the mdutils package.
Striping works most efficiently if you use partitions of exactly the same size. Linux' RAID driver will work with different sizes, too, but less efficiently than you might imagine after reading some RAID documents: once a certain amount of disk space is used, the driver can no longer stripe over all disks. It will, however, use the maximum possible number of disks at any time.
After setting up RAID and combining several disks into a compound device, you no longer access the disks directly through /dev/sd*. Instead you use the multiple devices driver, which provides /dev/md*. These devices are block devices just like normal disks, so you simply create a filesystem on them and mount them.
The default setup of the Linux kernel provides up to four such
compound devices. Each MD can contain up to eight physical disks
(block devices). If your setup requires more disks per device
or more compound devices, you have to edit include/linux/md.h within
the Linux kernel source tree, especially MAX_REAL and MAX_MD_DEV. For
testing purposes you can use some loopback devices instead of physical
disks.
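Such a test setup could look like the following sketch (file names and sizes are arbitrary examples; your kernel needs loopback device support):

dd if=/dev/zero of=/tmp/disk0.img bs=1024k count=10
dd if=/dev/zero of=/tmp/disk1.img bs=1024k count=10
losetup /dev/loop0 /tmp/disk0.img
losetup /dev/loop1 /tmp/disk1.img
# /dev/loop0 and /dev/loop1 can now be combined into an MD device
# just like real partitions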
Swapping over RAID
The Linux kernel includes native support for distributing swap space over several disks. Instead of setting up a RAID-0 device and directing swap to it, you simply add all swap partitions to /etc/fstab and use 'swapon -a' to activate all of them. The kernel uses striping (RAID-0) on them. Here's a sample setup:
/dev/sda3    none    swap    sw
/dev/sdb3    none    swap    sw
/dev/sdc3    none    swap    sw
Setting up RAID
Setting up RAID for normal filesystems such as /var, /home or /usr is quite simple. After you have partitioned your disks, you need to tell the RAID subsystem how you would like to organize the partitions. This is written down in /etc/mdtab for later activation. It can be done by issuing the following command:
mdcreate -c4k raid0 /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1
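This appends a line for /dev/md0 to /etc/mdtab, similar to the following sketch (the last field in the options column is a checksum computed by mdcreate; the value shown here is only a placeholder):

/dev/md0	raid0,4k,0,xxxxxxxx	/dev/sda1 /dev/sdb1 /dev/sdc1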
If you want to use RAID level 1, 4 or 5 you have to use an additional configuration file that reflects the disk setup. Figure 1 gives an example.
raiddev /dev/md1
    raid-level      5
    nr-raid-disks   5
    chunk-size      8

Figure 1: Sample RAID-5 configuration file /etc/raid/raid5.conf
These levels are more complicated and need a special signature on top of the compound device. This signature is generated by the mkraid command. So the remaining setup looks like:
mkraid /etc/raid/raid5.conf
mdcreate raid5 /dev/md1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1
Now two RAIDs have been created; the first consists of three partitions while the second uses five. Different chunk sizes have been selected depending on the data that will be stored on them. The next step is to activate these devices with:
mdadd -ar
From now on you may refer to /dev/md0 and /dev/md1 as block devices that may contain your filesystems. In order to use these devices you have to issue this command during the boot sequence. Please check out the startup sequence of your distribution. Some of them (e.g. Debian) have already included this.
After the kernel knows how disks are organized you can create your
filesystem on the new devices and add them to /etc/fstab just as
usual.
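A minimal sketch, assuming /dev/md0 is meant to carry a news spool (the mount point is only an example):

mke2fs /dev/md0
mkdir -p /var/spool/news
mount /dev/md0 /var/spool/news

with a matching line in /etc/fstab:

/dev/md0    /var/spool/news    ext2    defaults    0    2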
Root filesystem on RAID
Using RAID for the root filesystem is a little bit tricky. The problem is that LILO can't read and boot the kernel if it is not stored linearly on the disk (as it is on an ext2 or msdos filesystem). The solution is to put the kernel on a different partition with a normal filesystem and activate RAID after the kernel has booted.
This way LILO would boot the kernel but the kernel itself would be unable to mount the root filesystem because its RAID subsystem isn't initialized yet. Now you're in trouble, right? No.
For late 2.1.x kernels there's also a kernel parameter that can be used to set up RAID devices at boot time, so that the root filesystem can reside on one. This is:
md=<md device number>,<raid level>,<chunk size factor>,<fault level>,dev0,dev1,...,devn
This needs to be passed to LILO using the append="" option or typed directly at the LILO prompt during boot. You'll find more information in Documentation/md.txt in the Linux source tree. This facility is not the subject of this article, so please refer to the kernel documentation instead.
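A hypothetical lilo.conf fragment for a RAID-0 root over two partitions might look like this; the numeric field values here are illustrative only, so check Documentation/md.txt for their exact meaning:

append="md=0,0,4,0,/dev/sda1,/dev/sdb1"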
For stable kernels (2.0.x) and not-so-late development kernels (2.1.x) you need a mechanism to call some programs after the kernel is loaded but before it tries to mount the root filesystem; at the very least, mdadd has to be run at that point.
The only way to achieve this is to use the initial ramdisk, also known as initrd. General information about initrd may be found in Documentation/initrd.txt inside the kernel source tree.
You probably have to compile your own kernel, although you could try the one that comes with your distribution if it includes all the required facilities. If it is modular, you'll need to add module loading to the described solution. Figure 2 shows the additional kernel compilation options that are needed for the described setup. Besides these, you have to include support for your SCSI card etc. If you're uncertain about the options, please refer to the Kernel-HOWTO and use the '?' key to display a description of the referenced driver.
...
*
* Additional Block Devices
*
Loopback device support (CONFIG_BLK_DEV_LOOP) [m/n/Y/?] y
Multiple devices driver support (CONFIG_BLK_DEV_MD) [Y/n/?] y
Linear (append) mode (CONFIG_MD_LINEAR) [N/y/m/?] n
RAID-0 (striping) mode (CONFIG_MD_STRIPED) [Y/m/n/?] y
RAM disk support (CONFIG_BLK_DEV_RAM) [Y/m/n/?] y
Initial RAM disk (initrd) support (CONFIG_BLK_DEV_INITRD) [Y/n/?] y
...
Figure 2: Configuring the kernel
If the Linux kernel uses initrd, it mounts the given ramdisk as root filesystem and executes /linuxrc if it can be found. After that, the kernel continues its boot process and mounts the real root filesystem. The old initrd root is moved to /initrd if that directory exists, or unmounted otherwise. If it is only moved, the ramdisk remains in memory, so on systems with little memory you should make the kernel remove it entirely once it is no longer needed.
The initrd file is a "simple" root disk. It has to contain all the files that are needed to execute /linuxrc. If that is a shell script, this includes a working shell and all tools the script uses. To execute dynamically linked programs it also includes a working libc together with ld.so. Alternatively, you can link the included programs statically and do without a shared libc; since that usually doesn't save any space, it's not worth the effort.
After you have initialized RAID from /linuxrc, you need to tell the kernel where its real root filesystem resides, since at that point it is still configured to use the initrd as root filesystem. Fortunately, our fellow kernel hackers have designed another easy interface to set the root filesystem.
This facility makes use of the /proc filesystem. The device number of the new root filesystem needs to be echoed to /proc/sys/kernel/real-root-dev; the kernel continues with that setting after /linuxrc has finished.
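For example, to select /dev/md0 as the real root filesystem (the number is explained below):

echo 0x900 > /proc/sys/kernel/real-root-dev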
As LILO normally isn't able to boot from a non-linear block device (such as RAID), you need to reserve a small partition for the kernel and the initial ramdisk. I've decided to use a 10MB partition that I mount as /boot. This also makes it possible to put some binaries there and to access them from a rescue floppy. One may wonder why this should ever be needed, since Linux runs stable as hell, but for the sake of safety...
10MB is plenty of space for just one kernel and a ramdisk of approximately 1MB. Currently my system only uses 2.5MB of it, so there is enough room to play with. Since /boot uses a normal filesystem (e.g. ext2), your /etc/lilo.conf can keep pointing to /boot/vmlinuz.
You have to decide what needs to be done in your /linuxrc script. You only need to activate RAID and tell the kernel where your root filesystem resides. Figure 3 shows a sample /linuxrc program.
#! /bin/ash

# make /proc available for real-root-dev below
/bin/mount -n -t proc proc /proc

# activate all RAID devices listed in /etc/mdtab
/sbin/mdadd -ar

# tell the kernel to use /dev/md0 (major 9, minor 0) as root filesystem
echo 0x900 > /proc/sys/kernel/real-root-dev

/bin/umount -n /proc
Figure 3: Sample /linuxrc file
You may use any block device as the root filesystem. In the given example 0x900 is used; this stands for major number 9 and minor number 0 (the device number is major*256+minor, written in hexadecimal), which is the encoding for /dev/md0.
Next you have to make a list of binaries and additional files needed. Of course, you need some device files in /dev/ as well. To get the /linuxrc script working at all, you also need to have /dev/tty1. The other devices depend on your /etc/mdtab file. You will at least need /dev/md0.
The above example uses these binaries: ash, mount, umount and mdadd. You need these additional files: mdtab, fstab, mtab and, for safety, passwd.
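Putting it all together, the tree installed on the initial ramdisk could look roughly like this sketch, based on the files listed above:

/tmp/initrd
|-- linuxrc
|-- bin/    (ash, mount, umount)
|-- sbin/   (mdadd)
|-- etc/    (mdtab, fstab, mtab, passwd)
|-- dev/    (tty1, md* and disk devices; see below)
|-- lib/    (libc and the dynamic linker; see below)
`-- proc/   (empty mount point)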
The mdtab file I use is shown in figure 4.
/dev/md0	raid0,4k,0,93f5553f	/dev/hda2 /dev/hdb2
/dev/md1	raid0,8k,0,3ffaa1d8	/dev/hda4 /dev/hdb4

Figure 4: Sample initrd /etc/mdtab file
Therefore these block devices have to be created on the initial ramdisk:
/dev/hda2 /dev/hda4 /dev/hdb2 /dev/hdb4 /dev/md0 /dev/md1 /dev/md2 /dev/md3
You have to use the mknod command to create these device files. You'll find their major and minor numbers by investigating your real /dev directory or by reading Documentation/devices.txt in the kernel source tree. The following commands create tty1 and md0:
mknod dev/tty1 c 4 1
mknod dev/md0 b 9 0
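The remaining device files from the list above are created the same way; the major/minor numbers below follow Documentation/devices.txt (IDE disks use major 3, with /dev/hdb partitions starting at minor 64):

mknod dev/hda2 b 3 2
mknod dev/hda4 b 3 4
mknod dev/hdb2 b 3 66
mknod dev/hdb4 b 3 68
mknod dev/md1 b 9 1
mknod dev/md2 b 9 2
mknod dev/md3 b 9 3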
How does one create the initial ramdisk?
The best thing you can do is to create the directory /tmp/initrd and install everything you need into it. Once you're finished, you determine the disk space used (du -s) and then create the initrd file. The following commands create an initial ramdisk of 1MB. To use them, your kernel has to include support for the loopback device.
dd if=/dev/zero of=/tmp/initrd.bin bs=1024k count=1
mke2fs /tmp/initrd.bin
mount -o loop /tmp/initrd.bin /mnt
Since you make use of dynamically linked binaries, you need to make sure that the dynamic linker and the shared libraries are installed, too. You need to copy at least /lib/libc*.so, /lib/ld-linux.so.2 and /lib/ld-2.0.*.so. You'll also need an appropriate /etc/ld.so.conf file; appropriate in this context means that "/lib" should be the only line in it. You'll also need to create a new library cache /etc/ld.so.cache with "ldconfig -r /tmp/initrd". Of course you also have to install the needed binaries in the appropriate directories, /sbin and /bin.
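A short sketch of these steps, assuming the glibc 2.0 file names mentioned above and /tmp/initrd as the staging directory:

mkdir -p /tmp/initrd/lib /tmp/initrd/etc
cp /lib/libc*.so /lib/ld-linux.so.2 /lib/ld-2.0.*.so /tmp/initrd/lib/
# "/lib" must be the only line in the initrd's ld.so.conf
echo /lib > /tmp/initrd/etc/ld.so.conf
# build the library cache relative to the initrd root
ldconfig -r /tmp/initrd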
Don't forget to create the /proc directory, or mount will fail. The fstab and mtab files can be empty; they will only be read, no program will write to them, but they have to exist on the initial ramdisk. For the /etc/passwd file it's sufficient to include only the root user.
After you have copied everything from /tmp/initrd to the ramdisk mounted on /mnt (see above), unmount it (e.g. with "umount /mnt") and move the file to /boot/initrd.bin. Now you need to tell LILO to load the kernel and the ramdisk. That's no problem; just use a record in /etc/lilo.conf similar to the one shown in figure 5.
image=/boot/vmlinuz
initrd=/boot/initrd.bin
label=linux
read-only
Figure 5: /etc/lilo.conf
Issue the command "lilo" and you're nearly done. Since the RAID subsystem is now configured at boot time, before any /etc/init.d scripts are run, you should disable the mdadd call in the /etc/init.d scripts.
As you might have guessed already, this setup implies that you have a
running Linux system installed on some non-RAID disk. You can at
least install a small base system on your swap partitions, compile the
kernel on a different machine, set up RAID on the target machine,
move the files and continue the installation afterwards.
More resources
- RAID documentation
- Kernel HOWTO
- Bootprompt HOWTO
- Root RAID Cookbook
- Patch to support RAID 1, 4 and 5 in 2.0.x kernels
- Utilities to manage RAID
The author
Martin Schulze studies computer science in Oldenburg, Germany. He has used Linux for several years and tries to improve it where he can. Nowadays he maintains several machines in his home city and is involved with many Linux projects, such as being the RAID maintainer for Debian GNU/Linux. He can be reached via e-mail at joey@infodrom.org.
Source: Linux Journal 6/99