
3 Preliminary considerations

Before we take a closer look at the basic configuration of a two-node cluster, there are some basic considerations. If you have already made your selection or have special requirements, you can safely skip this chapter - but you do so at your own risk.

I myself am not one for reading through endless introductions, and I know colleagues who read the introductions very carefully and then didn’t know what to do when it came time to implement them.

For me, the important thing about these preliminary considerations is to avoid unnecessary work, so that you do not end up with downtime at the end of a test run or, even worse, in a running production cluster.

And nothing is more deadly to a cluster than not being available.

That’s why the old do-it-yourself motto applies here, too:

Measure first, then cut!

3.1 Disk drive – physically vs. LVM

So, let’s first take a look at how the disk drive should be "designed".

Let’s start with a "physical disk device", i.e. an additional disk partition alongside the "classical" partitions such as swap, root (/) and /home.

This solution has the advantage that there is no additional "virtualization layer" holding up processing operations, which could otherwise lead to performance degradation.

The disadvantage is that a subsequent increase or decrease in size can only be carried out with increased effort, because hardware may actually have to be replaced.

Using the Logical Volume Manager, or LVM for short, gives you more headroom, but adds a virtualization layer, which on very tight systems can lead to the aforementioned performance degradation.

Both types of disks work with DRBD!

In the systems I have set up, I generally use Logical Volume Manager because the advantage of adding disks after the fact outweighs the disadvantage of performance degradation.
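
As an illustration only - a minimal sketch, assuming a spare disk /dev/sdb and the freely chosen names vg_drbd and lv_drbd0 - a backing volume for a DRBD resource could be prepared with LVM roughly like this:

    pvcreate /dev/sdb                        # register the disk as an LVM physical volume
    vgcreate vg_drbd /dev/sdb                # create a volume group on it
    lvcreate -L 50G -n lv_drbd0 vg_drbd      # logical volume that will serve as the DRBD backing device

If more space is needed later, the logical volume can be enlarged without touching the hardware, e.g. with lvextend -L +20G /dev/vg_drbd/lv_drbd0 - exactly the headroom mentioned above.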

3.2 Filesystem on the disk device

In principle, a DRBD device could also be used as a raw device; whether and which file system "runs" on the DRBD device does not really matter to DRBD itself. Nevertheless, I would like to take a closer look at how the different file systems work in order to help you decide. All file systems have specific advantages and disadvantages that result from the way they work. For perhaps understandable reasons, I won’t go into tree structures or the like in more detail at this point. If you are interested in these specific points, you should consult the relevant technical literature or www.wikipedia.com.

3.2.1 UFS / ext2

The good old ’UNIX File System’ - because that’s what UFS stands for - was developed in the early 1980s and was the standard file system for all UNIX derivatives until the early 1990s. Today, however, it is only used in isolated cases.

However, the basic concept was passed on to the following file system generations:

• all data is stored in blocks on the hard drive, and

• to get to a data block, the address of the memory block is stored in an area called the "superblock", which is accessed first by the operating system.

In this way a tree structure is obtained, because each stored file is assigned a specific "inode number".

If a search is made for a specific file within the file system, the entire file tree must always be searched, which can take a comparatively long time for larger file trees with many substructures.

The "second extended file system" (ext2) essentially adopts this structure, but so-called "plugins" - i.e. extensions - can be added to handle fragmentation, compression and recovery of deleted data.

3.2.2 ext3 / ext4

The ext3 and ext4 file systems have evolved from the ext2 with the addition of a so-called journal and the ability to change the size of the file system while the file system is in use.

In a journaling file system, all changes are recorded in a special memory area called journal before the actual write to the selected block takes place. This makes it easier to reconstruct the writes if, for example, the system crashes or the power goes out during the write operation.
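
To illustrate this - it is not a required step in a DRBD setup - the journaling behaviour of ext3/ext4 can be influenced at mount time; the device /dev/drbd0 and the mount point /mnt/data are assumed names here:

    mount -o data=ordered /dev/drbd0 /mnt/data    # default: only metadata goes through the journal
    mount -o data=journal /dev/drbd0 /mnt/data    # all data is written to the journal first (safer, but slower)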

Another improvement of ext3 and ext4 over ext2 was the increase in the maximum file system size from 16 TB to 32 TB for ext3 and to 1 EB (= exabyte) for ext4. Such device sizes could not even be imagined when UFS was developed.

In addition, there are extensions regarding the number of files and directories as well as the maximum size of an individual file, which was still limited to 2 TB for ext2, can be between 16 GB and 2 TB for ext3 (depending on the block size), and for ext4 is finally limited only by the size of the disk partition.
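
A short, hedged example of the online resizing mentioned above, assuming an ext4 file system on the logical volume /dev/vg_drbd/lv_drbd0 from section 3.1:

    mkfs.ext4 /dev/vg_drbd/lv_drbd0          # create the file system once
    lvextend -L +20G /dev/vg_drbd/lv_drbd0   # enlarge the underlying volume
    resize2fs /dev/vg_drbd/lv_drbd0          # grow ext4 while it is mounted and in use

Note that if the file system sits on a DRBD device, the DRBD resource itself also has to be enlarged (drbdadm resize <resource>) before the file system can make use of the new space.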

3.2.3 xfs

The file system xfs, originally developed by Silicon Graphics (SGI) exclusively for the in-house UNIX system "IRIX", is one of the oldest file systems. But just because something is getting on in years doesn’t mean it has to be "bad". It sets standards with maximum values of 16 EB per file system, a maximum number of 2⁶³ files and a size per file of 8 EB.

It also has significant advantages over ext3 and BtrFS, especially in terms of speed.

Some time ago, I had a case where about 100 GB needed to be copied from one host to another. The source filesystem was a BtrFS and the copy ran - to save LAN resources - over a TAR that was compressed on the source machine, pushed through an SSH tunnel and decompressed again on the target machine.
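
A sketch of such a copy pipeline - the host name targethost and the two paths are of course placeholders:

    cd /srcdata
    tar czf - . | ssh targethost "tar xzf - -C /dstdata"   # compress on the source, unpack on the target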

This work took a little over an hour - probably because the file system had many subdirectories.

After the work on the source file system was finished (among other things, it was enlarged to 200 GB), I spontaneously decided to use xfs as the new file system.

Restoring the data took just 20 minutes!

Needless to say, I have been a self-confessed fan of this file system since that time, especially because the throughput has been confirmed in normal operation.
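
For completeness, a hedged sketch of how an xfs file system could be created and later grown (the device and mount point are assumed names); note that xfs can be enlarged online, but not shrunk:

    mkfs.xfs /dev/vg_drbd/lv_drbd0           # create the file system
    mount /dev/vg_drbd/lv_drbd0 /mnt/data
    xfs_growfs /mnt/data                     # grow to the size of the (previously enlarged) volume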

3.2.4 BtrFS

The BtrFS - spelled out "B-tree file system", not "Better FS" or even "Butter FS" - follows a completely different approach than the file systems available in the Linux environment so far. It is based in part on the ideas behind ZFS, which had been developed about seven years earlier by Sun Microsystems (since merged into ORACLE). BtrFS has built-in RAID, volume management, checksum-based protection against data transfer errors, and uses copy-on-write.

Copy-on-write is a method in which a copy only becomes a real copy once it is changed by one of the parties involved. As long as none of the parties has changed its copy, it is sufficient to store the original once in the file system.

The integrated RAID system distinguishes between occupied and free data blocks, so when a failed RAID volume is reconstructed, only the occupied disk space needs to be mirrored, which saves an enormous amount of time. In addition, this RAID works with larger data blocks than classic RAID methods. In a RAID1, not all data blocks of a disk are mirrored - regardless of whether they are occupied or not - but only the occupied blocks are distributed across all available disks. In this way, a RAID1 can be formed from an odd number of disks with different capacities without losing storage space.

The B-tree structure - after which the file system is named - is taken from a central concept of xfs.

BtrFS is now used by SuSE as the file system of the future, while RedHat announced in August 2017 that it would discontinue long-term support for BtrFS in RHEL. So far, however, it has remained at this announcement, and Fedora, for example, uses BtrFS automatically unless a different file system is chosen during installation.

In addition to the experience described above, the following considerations should be made with regard to using BtrFS on production servers in conjunction with DRBD:

1 Most Linux servers purchased as 19-inch devices are equipped with a hardware RAID controller. This means that the RAID functionality of BtrFS is not needed here, since the hard disks attached to the RAID array must have the same capacity anyway - otherwise the hardware RAID controller will not work properly or part of the storage space cannot be used.

2 The copy-on-write functionality explained above adds an additional virtualization layer on top of what DRBD already does by distributing the data to multiple cluster nodes. A BtrFS array spanning multiple cluster nodes, however, is not supported.

3 In certain cases, using BtrFS on a DRBD device makes perfect sense if you don’t want to miss out on the multiple functionalities of BtrFS.

An example is the possibility to create a snapshot of the file system. You should, however, be careful before automatically creating such snapshots, as this quickly consumes disk space that you might need for other purposes.
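
As an illustration, a manually created (rather than automatic) snapshot could look like this, assuming a BtrFS mounted at /mnt/data with a subvolume .snapshots created for this purpose - both names are placeholders:

    btrfs subvolume snapshot -r /mnt/data /mnt/data/.snapshots/before-update   # read-only snapshot
    btrfs subvolume delete /mnt/data/.snapshots/before-update                  # remove it when no longer needed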

3.2.5 OCFS2

The Oracle Cluster File System 2 is a file system developed by Oracle for open source clusters that enables simultaneous access from several cluster nodes to one disk device in a cluster array (concurrent access). The coordination works via the Distributed Lock Manager.

The required packages are included as of SLES 11 SP3.
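
A hedged sketch of how such a file system could be created for a two-node cluster - /dev/drbd0 is an assumed device name, the DRBD resource would have to run in dual-primary mode and the OCFS2 cluster stack must already be configured:

    mkfs.ocfs2 -N 2 -L shared_data /dev/drbd0   # two node slots, label "shared_data"
    mount /dev/drbd0 /mnt/shared                # executed on both cluster nodes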

3.2.6 Conclusion

You should therefore also think very carefully about which file system you want to use during the basic considerations for setting up a cluster. If you have made the wrong decision, the file system can only be changed with increased effort. In this book I show a "recipe" (cf. text number 8 - without enlarging the LVM volume) for how you can help yourself here. However, you must be aware that this can only be achieved with a downtime of the cluster array. If you compare the file systems presented here and do not need the special features that OCFS2 or BtrFS bring, xfs is the method of choice.
