High Availability systems under Linux

ArticleCategory: [Choose a category for your article]

System Administration

AuthorImage:[Here we need a little image form you]

TranslationInfo:[Author and translation history]

AboutTheAuthor:[A small biography about the author]

Atif is a chameleon. He changes his roles, from System Administrator, to programmer, to teacher, to project manager, to whatever is required to get the job done.
Occasionally you can find him programming, or writing documentation on his laptop while sitting on the toilet seat.
Atif thinks that he owes a lot to Linux and to the open-source community and its projects for being his teachers.
More about him can be found at his homepage

Abstract:[Here you write a little summary]

When designing a mission-critical system, whether flowcharting it on paper or physically building it with boxes and cables, one has to ask the following questions:

How important is the service that will run on these machines to you?
How many other services depend on the service you are going to run on these machines (think NIS/NFS/DB/LDAP server)?
What happens if a part of the machine fails (power supply, network cable, hard disk, etc.)?
What happens when the machine fails completely?

When I ask myself these questions, I get the same answer most of the time:
I will get fired :)

On the other hand, when I ask myself the question "Will the operating system fail?", I always get this answer:
No. You are not running 32 bit extensions for a 16 bit patch to an 8 bit operating system originally coded for a 4 bit microprocessor, written by a 2 bit company, that can't stand 1 bit of competition. (I got this from a .sig.)

Now for some serious discussion.

ArticleIllustration:[This is the title picture for your article]

High Availability systems under Linux -- Image borrowed from http://lwn.net/Gallery/i/suits.gif

ArticleBody:[The article body]

Why HA?

Even though I trust Linux blindly, I don't trust the companies that make the machines, power supplies, network cards, motherboards, etc., and I am always afraid that if one of these fails, my system will be unusable. The service will then be unavailable, and furthermore I may take down other company services that are not directly related to mine. For example:

Some service that I don't even know exists, and that I have nothing to do with, may start misbehaving because it cannot resolve billingSys106.company.com. Hmmm, let me think what the reason could be. Oh, I was responsible for DNS and decided, against company regulations, to run it on Linux. :)
Or someone cannot use the SAP system because my LDAP server is down. Oh wait a sec, didn't I fight for 3 months to move the SAP authentication to LDAP?
Or no weenies can log into their Windows workstations. Hey, we just have a Unix box down, why should your NT setup be disturbed by that? Oh! Last time, when nobody was watching, I moved the NT domain controller to Linux+Samba with authentication against LDAP.

The same can of course happen on a Windows server, but there won't be a lot of hoo-ha about it because the dummies are used to it. But I warn you: if this happens on a Linux box, there will be a lot of "you just cannot trust Linux" from the management.

In one of the companies I worked for, the NFS server was feeding data to a corporate web server, an intranet server, a database server, and many other services whose failure would bring the company to a halt.
Of course using NFS was a bad choice, but let's just go with it for the sake of an example.
This server was made HA using Sun's cluster solution, which would cost you both your arms and legs.
Another very important service was the intranet, used by more than 1500 people.

Now let's discuss this concept in a little depth.

What is HA?

High Availability is what it says it is.
Something that is Highly Available.

It is a service that is really important to keep your company functional.
Examples:

Intranet site
File server
Mail service
DNS service

These services can fail due to two factors.

Software misbehavior
Hardware misbehavior

Management usually takes a lot of care about hardware misbehavior when ordering machines; for example, every machine gets redundant power supplies, RAID 5, etc.
What is often overlooked is software misbehavior.
Believe it or not, I have also seen Linux boxes hang because of a sudden problem with a network card, an overheating CPU, etc.

The big boss is not really interested in whether the power supply went down or the system halted due to a faulty network card.
The only thing your boss, employees and customers are interested in is that the "service" is available.
Note that I have highlighted the term service.
Of course the service runs on a machine, and redirecting the service and its requests to another healthy machine is the art of High Availability.

Example implementations of HA

In this example we will create, on paper, an Active/Passive cluster running an Apache server that serves the intranet.
To create this small cluster, we will use one good machine with lots of RAM and many CPUs, and another one with just enough RAM and CPU to run the service.
The first machine will be the master node, while the second will be the backup node.
The job of the backup node is to take over the services from the master node if it thinks that the master node is not responding.

How will this work?

Let's think about how our users access the intranet.
They type http://intranet/ in their browser and the DNS server resolves that name to 10.0.0.100 (an example IP address).

What if we ran the intranet service on two servers with different IP addresses, and just asked the DNS server to point to the second one if the master node goes down?
Sure, that's one possibility, but there are issues with DNS caching on the clients, and perhaps you want to run the DNS server itself on an HA cluster.
Another possibility: if the master node fails, the backup (slave) node takes over its IP address and starts serving the requests.
This method is called IP takeover, and it is the method we will use in our examples. All browsers will still access http://intranet/, which still resolves to 10.0.0.100 even if the master node fails, without any changes to the DNS.
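
As a rough illustration of what IP takeover amounts to on the backup node, the sketch below only prints the commands instead of running them. It assumes the iproute2 "ip" tool, the iputils "arping" tool, a /24 netmask and eth0 as the public interface. None of these choices come from heartbeat, which uses its own scripts for this job, and takeover_plan is a hypothetical helper name.

```shell
#!/bin/sh
# Dry-run sketch of IP takeover: print the commands a backup node
# could run to claim the service address. Nothing is executed.
# Assumptions (not from heartbeat): iproute2 "ip", iputils "arping".

takeover_plan() {
    service_ip=$1   # floating service address, e.g. 10.0.0.100
    prefix=$2       # netmask length, e.g. 24
    iface=$3        # public interface, e.g. eth0

    # 1. Add the service address as an extra address on the interface.
    echo "ip addr add ${service_ip}/${prefix} dev ${iface}"
    # 2. Send unsolicited ARP so that neighbours update their caches
    #    and stop sending traffic to the failed master's MAC address.
    echo "arping -U -c 3 -I ${iface} ${service_ip}"
}

takeover_plan 10.0.0.100 24 eth0
```

The second step matters as much as the first: without it, clients on the same network may keep sending packets to the dead machine until their ARP cache entries expire.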

How do clusters talk?

How does the master or the slave know that the other node in the cluster has failed?
The nodes talk to each other over a serial cable and over a crossover Ethernet cable (for redundancy, since either cable may fail) and check each other's heartbeat (yes, like the heartbeat you have). If your heartbeat stops, then you are probably dead.
The program that monitors the heartbeats of the cluster nodes is called... guess... heartbeat.
heartbeat is available at http://www.linux-ha.org/download/
The program for IP address takeover is called fake and is integrated in heartbeat.

If you do not have an extra network card to put in each of the two machines, you may run heartbeat over a serial cable (null modem) only.
On the other hand network cards are cheap, so add another one for redundancy.
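
The decision each node makes about its peer can be sketched in a few lines. This is only an illustration of the idea, not heartbeat's code: peer_state is a hypothetical helper, and the threshold corresponds to the deadtime value that appears later in ha.cf.

```shell
#!/bin/sh
# Sketch of the decision a backup node makes: declare the peer dead when
# no heartbeat has been seen for more than "deadtime" seconds.
# peer_state is a hypothetical helper, not part of heartbeat.

peer_state() {
    last_seen=$1   # time of the last heartbeat received (seconds)
    now=$2         # current time (seconds)
    deadtime=$3    # allowed silence before the peer is declared dead

    if [ $((now - last_seen)) -gt "$deadtime" ]; then
        echo dead
    else
        echo alive
    fi
}

peer_state 100 105 10   # 5 seconds of silence, prints "alive"
peer_state 100 111 10   # 11 seconds of silence, prints "dead"
```

With two independent links, a node is only considered dead when the silence threshold is exceeded on every link, which is why the second cable is worth its small price.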

Preparing the Cluster nodes

As previously mentioned, we will use one cool machine and another not so cool machine.
Both machines will be equipped with 2 network cards and at least one serial port.
We will need one crossover Cat 5 RJ45 (Ethernet) cable and a null modem (crossover serial) cable.

We will use the first network card (eth0) on both machines for their Internet IP addresses.
We will use the second network card (eth1) on both machines for a private network carrying the UDP heartbeat.

We will give both machines their Internet IP addresses and names.
For example, on eth0 of the two nodes:
clustnode1 with IP address 10.0.0.1
clustnode2 with IP address 10.0.0.2

Now we reserve a floating IP address (this is the service IP address mentioned earlier):
10.0.0.100 (intranet). We don't need to assign it to any machine at the moment.

Next we configure the second network card on each machine and give it an IP address from a range that is not otherwise in use.
For example, on eth1 of the two nodes, with netmask 255.255.255.0:
clustnode1 with IP address 192.168.1.1
clustnode2 with IP address 192.168.1.2

Next we connect the serial cable to serial port 1 or 2 of the machines and make sure that they can talk to each other.
(Make sure that you connect to the same port on each machine; it's easier that way.)
See http://www.linux-ha.org/download/GettingStarted.html
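
One simple way to check the serial link is to write a line on one node and read it on the other. The sketch below wraps both halves in small functions; send_test_line and receive_test_line are hypothetical names, and /dev/ttyS0 in the usage comments is an assumption that you should replace with the port you actually cabled.

```shell
#!/bin/sh
# Sketch of a serial link check. The device path is passed in so the
# functions can be tried on any file; on real nodes it would be the
# serial device, e.g. /dev/ttyS0 (an assumption, use your own port).

# Run on the sending node: write one test line to the port.
send_test_line() {
    echo "serial test from $(uname -n)" > "$1"
}

# Run on the receiving node first: read one line from the port.
receive_test_line() {
    head -n 1 < "$1"
}

# On clustnode2:  receive_test_line /dev/ttyS0
# On clustnode1:  send_test_line /dev/ttyS0
```

If the receiving side prints the line, repeat the test in the other direction before trusting the cable.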

Installing heartbeat

Installing the software is straightforward; heartbeat is available as rpm and tar.gz, in both binary and source packages.
If you have problems installing the software, then you probably should not be taking the responsibility of installing an HA system (it won't be HA, perhaps it will be NA).
There is an excellent Getting Started with Linux-HA guide, so I won't replicate the information here.

Configuring the cluster

Configure heartbeat

For example, if the heartbeat configuration files are in /etc/ha.d,
edit the file /etc/ha.d/authkeys with your favourite editor:

#/etc/ha.d/authkeys
auth 1
1 crc
#end /etc/ha.d/authkeys

You can later move to md5 or sha1 when you are more comfortable; for the first test, leave the authentication method at crc (entry 1). Keep in mind that crc only detects corrupted packets and provides no real authentication.
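
When you do move away from crc, the file keeps the same two-line shape with a method name and a shared secret. The following is a minimal sketch, assuming the sha1 method; write_authkeys is a hypothetical helper and the secret shown in the usage comment is a placeholder.

```shell
#!/bin/sh
# Sketch: write an authkeys file that uses sha1 with a shared secret.
# write_authkeys is a hypothetical helper; the secret is a placeholder.

# The body runs in a subshell so the umask change does not leak out.
write_authkeys() (
    file=$1     # target path, e.g. /etc/ha.d/authkeys
    secret=$2   # shared secret, must be identical on both nodes

    umask 077   # never create the file world-readable
    {
        echo "auth 1"
        echo "1 sha1 ${secret}"
    } > "$file"
    chmod 600 "$file"   # heartbeat expects the key file to be mode 600
)

# Example (as root, on both nodes):
#   write_authkeys /etc/ha.d/authkeys 'ChangeThisSecret'
```

The file must be identical on both nodes, otherwise each node will discard the other's heartbeats.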

Edit /etc/ha.d/ha.cf:

debugfile /var/log/ha-debug
logfile /var/log/ha-log
logfacility     local0
# seconds of silence before the other node is declared dead
deadtime 10
# serial port used for heartbeats; change this to the port you cabled
serial  /dev/ttyS3
# interface for UDP heartbeats; remove this line if you are not using a second network card
udp     eth1
node    clustnode1
node    clustnode2
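
The names on the node lines have to match what each machine reports with uname -n. A quick check along these lines can save some head scratching; node_is_listed is a hypothetical helper, not a heartbeat tool.

```shell
#!/bin/sh
# Sketch: check that a node name appears on a "node" line in ha.cf.
# By default the name is this machine's own name as given by uname -n.
# node_is_listed is a hypothetical helper.

node_is_listed() {
    cf=$1                      # path to ha.cf
    name=${2:-$(uname -n)}     # node name to look for
    # Succeed if some line has "node" as first word and the name as second.
    awk -v n="$name" '$1 == "node" && $2 == n { found = 1 }
                      END { exit !found }' "$cf"
}

# Example:
#   node_is_listed /etc/ha.d/ha.cf || echo "this host is missing from ha.cf"
```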

Edit the file /etc/ha.d/haresources:

#masternode ip-address service-name
clustnode1 10.0.0.100 httpd

This defines clustnode1 as the master node: when clustnode1 goes down, clustnode2 takes over the service, but when clustnode1 comes back up again, it reclaims its service. That is why we are using a good and a not so good machine (clustnode1 is the good machine).
The second item defines the IP address that is taken over together with the service, and the third item defines the name of the service.
When a node takes over the service, it will try to execute

/etc/ha.d/httpd start

If it does not find that file, it will try

/etc/rc.d/init.d/httpd start

The same is true when giving up a service. If clustnode2 is giving up the service, it will try

/etc/ha.d/httpd stop

If it does not find that file, it will try

/etc/rc.d/init.d/httpd stop
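
The lookup order described above can be sketched as a small function. This is an illustration only, not heartbeat's code: find_resource_script is a hypothetical helper, and its two default directories are simply the paths named in this article.

```shell
#!/bin/sh
# Sketch of the lookup order described above: try the heartbeat
# directory first, then the init script directory.
# find_resource_script is a hypothetical helper, not heartbeat code.

find_resource_script() {
    name=$1
    ha_dir=${2:-/etc/ha.d}
    init_dir=${3:-/etc/rc.d/init.d}

    for dir in "$ha_dir" "$init_dir"; do
        # Only accept a regular file that is executable.
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            echo "$dir/$name"
            return 0
        fi
    done
    return 1   # no script found in either place
}

# Example: start the service with whichever script is found first.
#   script=$(find_resource_script httpd) && "$script" start
```

Because the same script is called with start and stop, any init-style script that understands those two arguments can be used as a service.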

When you are finished with the configuration on clustnode1, you can copy the files to clustnode2.
In the directory /etc/ha.d/rc.d you will find scripts such as ip-request, which do the job of assigning the IP address and so on.
Now run /etc/rc.d/init.d/heartbeat start on both machines.
Install a different index page on each machine, to be served by the HTTP server.

For example, on clustnode1:

echo hello world from clustnode1 >/yourWwwDocRoot/index.html

and on clustnode2:

echo hello world from clustnode2 >/yourWwwDocRoot/index.html

Make sure that on both nodes the httpd service does not start automatically at boot: remove the links from the rcN directories or, even better, move the startup script "httpd" or "apache" from /etc/rc.d/init.d/ to /etc/ha.d/rc.d/ on both machines.
If everything is set up correctly and heartbeat is running and communicating, then clustnode1 will have the IP address 10.0.0.100 and will be replying to the HTTP requests.
Try it a couple of times and make sure that it's replying. If everything seems OK, shut down clustnode1; within about 10 seconds, clustnode2 will take over the service and the IP address.
Your maximum downtime will be roughly 10 seconds, the deadtime set in ha.cf, plus whatever time the service needs to start.
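
To watch the failover happen, you can poll the service address from a third machine while you shut down clustnode1. The sketch below assumes curl is installed; watch_service is a hypothetical helper, and the PROBE variable is only there so the loop can be tried out without a cluster.

```shell
#!/bin/sh
# Sketch: poll the service at a fixed interval and print what answers.
# Assumes curl is installed; PROBE can override the fetch command.

watch_service() {
    url=$1              # e.g. http://10.0.0.100/
    count=$2            # number of probes
    interval=${3:-1}    # seconds between probes
    # --max-time keeps a dead server from stalling the loop.
    probe=${PROBE:-curl -s --max-time 1}
    i=0
    while [ "$i" -lt "$count" ]; do
        reply=$($probe "$url") || reply="NO ANSWER"
        echo "$(date +%T) $reply"
        i=$((i + 1))
        sleep "$interval"
    done
}

# Example: watch_service http://10.0.0.100/ 30
```

With the two different index pages installed, the output should switch from the clustnode1 greeting to a few unanswered probes and then to the clustnode2 greeting.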

What about data integrity issues?

When the httpd service moves from node1 to node2, it does not see the same data. I lose all the files that I was creating with my httpd CGIs.

Two answers:
1. You should never write to files from your CGIs (use a network database instead; MySQL is pretty good).
2. You can attach the two nodes to a central external SCSI storage and make sure that only one node talks to it at a time. Also make sure that you change the SCSI ID of the host card on machine A to 6 and leave machine B at 7, or vice versa.
I have tried this with Adaptec 2940 SCSI cards, and they let me change the SCSI ID. Most cheap cards will not let you do that.
Some RAID controllers are sold as cluster-aware controllers, but make sure that the vendor will allow you to change the host ID of the card without buying the Microsoft cluster kit.
I had two NetRaid adapters from HP, and they definitely do not support Linux. I had to break them to have a good feeling about the money spent.

The next step would be to buy Fibre Channel cards, a Fibre Channel hub and Fibre Channel storage to create a small SAN. These are definitely more costly than shared SCSI, but they are a good investment.
You can run GFS (Global File System, see Resources below) over FC, which gives all machines transparent access to the storage as if it were local.

We are using GFS in a production environment across 8 machines, 2 of which are in an HA configuration similar to the one I have described above.

What about an active/active cluster?

You can easily build an Active/Active cluster if you have a good storage system that allows concurrent access, for example Fibre Channel with GFS.
If you are content with network filesystems such as NFS, you may use that, but I would not suggest it.

Anyway, you can map serviceA to clustnode1 and serviceB to clustnode2. Here is an example from my haresources file:

clustnode2 172.23.2.13 mysql
clustnode1 172.23.2.14 ldap
clustnode2 172.23.2.15 cyrus

I use GFS for storage, so I don't have a problem with concurrent access to data and can run as many services as these machines can manage.
Here clustnode2 is the master for mysql and cyrus, while clustnode1 is the master for ldap.
If clustnode2 goes down, then clustnode1 takes over all the IP addresses and the services.
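
To see at a glance which services a node is the preferred master for, the haresources file can be read with a one-line awk filter. This sketch assumes the three-column form used in this article (node, IP address, service); services_of is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: list the services a node is the preferred master for, given a
# haresources file in the three-column form used in this article.
# services_of is a hypothetical helper.

services_of() {
    node=$1
    file=$2
    # Comment lines start with "#" and so never match a node name;
    # print the service column of every line owned by the node.
    awk -v n="$node" '$1 == n { print $3 }' "$file"
}

# Example with the file shown above:
#   services_of clustnode2 /etc/ha.d/haresources   # mysql and cyrus
#   services_of clustnode1 /etc/ha.d/haresources   # ldap
```

When sizing the machines, remember that after a failure the surviving node runs the services from both lists at once.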

Resources

Linux-HA.org: The home page of Linux HA
Kimberlite clustering technology: A Kimberlite Cluster provides support for two server nodes connected to a shared SCSI or Fibre Channel storage subsystem, in an active-active failover environment. The software provides the ability to detect when either node leaves the cluster, and will automatically trigger recovery scripts which perform the procedures necessary to restart applications on the remaining node. When the node rejoins the cluster, applications can be moved back to it, manually or automatically, if required. Sample recovery scripts are provided. Kimberlite is designed to deliver the highest levels of data integrity and be extremely robust. It is suitable for deployment in any environment that requires high availability for un-modified Linux applications.
Ultra Monkey: Ultra Monkey is a project to create load balanced and highly available services on a local area network using Open Source components on the Linux Operating System. At this stage the focus is on producing a scalable, highly available web farm, though the technology is easily expandable to other services such as email and FTP.
Linux Virtual Server: The Linux Virtual Server is a highly scalable and highly available server built on a cluster of real servers, with the load balancer running on the Linux operating system. The architecture of the cluster is transparent to end users. End users only see a single virtual server.
4U cluster / 4U SAN (shameless plug): 4U cluster and 4U SAN are the HA cluster and SAN implementations by our company, 4Unet.
If you are an ISP, carrier, or telecom company and need High Availability solutions designed and implemented, 4Unet is the right place to ask.
Note: 4Unet is an integrator; it does not sell clusters or SANs, it implements them for its customers. All technologies used for these clusters and SANs are open source.
4Unet's target customers are only ISPs, carriers, and telecom companies.
Global File System: The Global File System (GFS) is a shared disk cluster file system for Linux. GFS supports journaling and recovery from client failures. GFS cluster nodes physically share the same storage by means of Fibre Channel or shared SCSI devices. The file system appears to be local on each node and GFS synchronizes file access across the cluster. GFS is fully symmetric, that is, all nodes are equal and there is no server which may be a bottleneck or single point of failure. GFS uses read and write caching while maintaining full UNIX file system semantics.