Saturday, November 20, 2010

ubuntu - ksm vs apparmor

Lately I was supposed to do some ksm tests to check if it can be used on a multi-user terminal server to reduce the amount of ram consumed by multiple eclipse instances. I started with ubuntu Lucid. I enabled ksm (echo "1" > /sys/kernel/mm/ksm/run). I also launched multiple eclipse and firefox instances. Then I checked /sys/kernel/mm/ksm/pages_shared which showed 0 all the time, even when the machine started swapping.
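
To watch the counters while testing, a small helper like the one below can be handy. It is only a sketch: the function name ksm_status is made up for this post, and it assumes the usual KSM sysfs files (run, pages_shared, pages_sharing, pages_unshared, full_scans) live under /sys/kernel/mm/ksm. The directory can be overridden, so the helper can be tried on a machine without KSM.

```shell
#!/bin/sh
# ksm_status [DIR]: print the main KSM counters as name=value lines.
# DIR defaults to the kernel's KSM sysfs directory (assumed location).
ksm_status() {
    dir="${1:-/sys/kernel/mm/ksm}"
    for name in run pages_shared pages_sharing pages_unshared full_scans; do
        if [ -r "$dir/$name" ]; then
            echo "$name=$(cat "$dir/$name")"
        else
            # Counter missing or unreadable on this kernel
            echo "$name=unavailable"
        fi
    done
}
```

Calling ksm_status every few seconds (for example in a "while sleep 5" loop) shows whether pages_shared ever moves off 0.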

Then I upgraded to Maverick and also had no luck with ksm - the situation was the same. I also ran a ksm testing procedure described here:

http://dustinkirkland.wordpress.com/2010/02/06/ksm-now-enabled-in-ubuntu-lucid/

Still no activity shown by ksm.

As a last resort, I rebooted with kernel option "apparmor=0" and voila - ksm started running!!!

However - I could not find this connection between apparmor and ksm anywhere - so this post is just to inform you - BEFORE YOU RUN KSM ON UBUNTU, ALWAYS TURN APPARMOR OFF

Saturday, November 6, 2010

KVM vs soft lockups

About half a year ago I started using proxmox with KVM for virtualizing my systems. The product itself is brilliant in my opinion, but still suffers from kvm bugs.

I found that some guests would freeze after live migration from one physical host to another. Dmesg on the VMs showed soft lockups on cpus:

BUG: soft lockup - CPU#3 stuck for 10s

Having googled the problem, I found that it appears on guests using virtio drivers (virtio-net and virtio-disk). The probability is even higher if you run an smp guest.

To resolve this problem, just switch back to ide emulation and e1000 for network.
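
To see how often a guest is hit before and after switching drivers, the lockup messages can be counted per cpu. This is a rough sketch: count_soft_lockups is a made-up helper name, and it assumes the messages look like the line quoted above.

```shell
#!/bin/sh
# count_soft_lockups: read kernel log text on stdin and print
# "CPU<n>: <count>" for every cpu that reported a soft lockup.
count_soft_lockups() {
    awk '/BUG: soft lockup - CPU#/ {
             cpu = $0
             sub(/.*CPU#/, "", cpu)      # drop everything up to the cpu number
             sub(/[^0-9].*/, "", cpu)    # drop everything after the number
             hits[cpu]++
         }
         END { for (cpu in hits) print "CPU" cpu ": " hits[cpu] }' | sort
}
```

On a guest it would be used as: dmesg | count_soft_lockups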

Thursday, May 20, 2010

Systemimager and KVM guests - tips

Recently I've been figuring out how to deploy kvm guests with systemimager and pxe. This is a continuation of my previous story.

If you go for kvm you will probably end up using virtio network and virtio-hdd drivers with your guests as they provide the best performance. Using these drivers implies some strange hdd naming: your installation ends up on /dev/vda instead of /dev/sda. Unfortunately systemimager (or rather systeminstaller, one of its components) will not recognize vda disks properly, which leads to some complications.

Normally, when you run si_prepareclient on your golden-client, partitioning configuration is placed in file /etc/systemimager/autoinstallscript.conf in section <config><disk> (this is in fact done by systeminstaller). Then this section is used by si_mkautoinstallscript to generate parted commands in your <imagename>.master script. In case of a /dev/vda disk, the <config><disk> section is empty, therefore you don't get the partitioning section in the .master script. Due to this, deployment of your kvm guest image will fail as no partitions will be created. You may either want to debug systeminstaller (try to debug perl code ;-) or just work around it, which is what I chose.

What you need to do
First, manually add the <disk> section on the golden client to /etc/systemimager/autoinstallscript.conf after you run si_prepareclient. The file might then look like the one below; the <disk> section is the part added by hand.

This one is taken from redhat with no LVM, the hdd size is 12gb, and 2gb go for swap:
<config>
<disk dev="/dev/vda" label_type="msdos" unit_of_measurement="MB">
<!--
This disk's output was brought to you by the partition tool "parted",
and by the numbers 4 and 5 and the letter Q.
-->
<part num="1" size="101" p_type="primary" p_name="-" flags="boot" />
<part num="2" size="10676" p_type="primary" p_name="-" flags="-" />
<part num="3" size="2097" p_type="primary" p_name="-" flags="-" />
</disk>


<fsinfo line="10" real_dev="/dev/vda2" mount_dev="LABEL=/" mp="/" fs="ext3" options="defaults" dump="1" pass="1" />
<fsinfo line="20" real_dev="/dev/vda1" mount_dev="LABEL=/boot" mp="/boot" fs="ext3" options="defaults" dump="1" pass="2" />
<fsinfo line="30" real_dev="tmpfs" mp="/dev/shm" fs="tmpfs" options="defaults" dump="0" pass="0" />
<fsinfo line="40" real_dev="devpts" mp="/dev/pts" fs="devpts" options="gid=5,mode=620" dump="0" pass="0" />
<fsinfo line="50" real_dev="sysfs" mp="/sys" fs="sysfs" options="defaults" dump="0" pass="0" />
<fsinfo line="60" real_dev="proc" mp="/proc" fs="proc" options="defaults" dump="0" pass="0" />
<fsinfo line="70" real_dev="/dev/vda3" mount_dev="LABEL=SWAP-vda3" mp="swap" fs="swap" options="defaults" dump="0" pass="0" />

<boel devstyle="udev"/>

</config>

Second: pull the image from the golden-client (si_getimage).

Third: check that your /var/lib/systemimager/scripts/<imagename>.master contains the "parted" section (grep parted <imagename>.master or something like this should verify this).

Fourth: have your image deployed to a new vm:
si_mkclientnetboot --image yourimage --flavor yourimage
The --flavor is needed to use a kernel with virtio drivers for deployment - it needs to recognize your /dev/vda and virtio ethernet.

That's all. Reboot your new vm with pxe boot and watch your image being deployed.
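
The check from the third step can be wrapped in a tiny helper so it can be reused for every image. A sketch only; master_has_parted is a made-up name, and it simply looks for the word parted outside of comment lines.

```shell
#!/bin/sh
# master_has_parted FILE: succeed if the autoinstall script FILE contains
# at least one non-comment line mentioning parted, fail otherwise.
master_has_parted() {
    grep -v '^[[:space:]]*#' "$1" | grep -q 'parted'
}
```

Example: master_has_parted /var/lib/systemimager/scripts/yourimage.master || echo "no partitioning commands - check the disk section"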
Comments welcome ;-)

Wednesday, May 12, 2010

My virtualization choice for 2010

Recently I've come to the point where I need to switch to a different virtualization platform in my datacenter. I've been using vmware ESX/ESXi but while my infrastructure grew, maintaining a farm of ESX/ESXi servers has become - let's say it shortly - a pain ;-). Lots of competitive platforms have emerged recently which are in my opinion better than ESXi (even if free), so I decided to switch to one of them. There was a lot of hesitation in my choice - so here is a story about it.

In fact, if you don't want to go into vmware, you are left with a choice between XEN and KVM. So the first thing I did was just enabling them on my plain systems. I found Xen to be rock-solid and fast and this was definitely my favourite. I tested some xen guests with my production load and the performance loss was minimal. My production distro was working flawlessly. On the downside - pxe is not supported by xen out-of-the-box (you can have it with pypxeboot package). Also xen seems a bit behind the mainstream with its old kernel (still 2.6.18). KVM was worse on performance, which I think was due to virtio drivers. With my production load I mainly hit limits with virtio-net rendering large number of interrupts - system cpu would jump high. I also needed to switch to newer version of my distro to have virtio support out of the box. On the other hand, KVM receives lots of development and bleeding edge features (e.g. page merging).

Both Xen and KVM share similar downsides:
-> lack of documentation, especially on networking. I have a multiple vlan environment. To get it to work, I spent several hours browsing through forums and putting scraps of information together ;-)
-> lack of fine management interface. Both kvm and xen on ubuntu/redhat, use virt-manager from redhat which I don't like. There is also ovirt but afaik it's not intended for production use now. There are probably some more on sourceforge, but I did not test. If you know any - just comment.

So now that the hypervisor was chosen, I just needed to find a distro on which to run Xen. I decided to go with Citrix XenServer. It's a ready-to-run appliance. Multi-vlan environment is handled out of the box and documentation is provided. The console is nice too, though windows-based.

...

But finally I ended up with KVM. ;-) Citrix and Redhat have been the main contributors to xen. Recently redhat 6 beta came out, followed by a press release stating that redhat is totally dropping xen in release 6 and going with kvm instead. So right now I think Citrix, being the only developer of xen, is in a position similar to vmware. So if I choose Xen, I am going to lock myself into a deal with one company. And now that redhat officially goes with kvm, I suppose they will commit lots of code to it. So I decided I can live for some time with worse performance than Xen and wait for redhat 6 and their stability patches to kvm.

As a platform for KVM I chose proxmox which totally ROCKS in my opinion. It's debian under the hood with pretty recent kernel (2.6.32 AFAIR). When you install it, it's just ready to go. Being debian, proxmox can be deployed in numbers easily with systemimager/preseed. I also expect puppet integration to be a snap. You don't need to download any license keys as it is with XenServer and Vmware. Vlan setup turned out to be easy, gui is web-based and works flawlessly. So right now proxmox is my 100% hit.

Thursday, May 6, 2010

Consistency in the datacenter

One of the main challenges of system administration is dealing with chaos.
This is in fact one of the features of the world we live in. Every construction degrades over time. Every organized structure turns into chaos if not maintained appropriately.

At its birth, the server is just like an innocent child. Clean, lean, configured, doing what it's designed for. Over time, developers, sysadmins, etc. (whoever has access to it) introduce changes to this setup. Some of them are made by you and hopefully are recorded somewhere. Some of them you probably don't authorize and don't even know they were performed. One day you find yourself saying "Where the fuck is this load coming from?" And you discover that something that was once designed to be a plain mysql server now runs apache, nfs and a whole bunch of other stuff.

Of course the life of a sysadmin is hard. You've got lots of machines and hardly have time to record every single thing you've done on each of them. You want certain aspects of your systems to be the same on every single server. For example you want every machine to get its time from your company's ntp server, log to a remote logging server, get its hostname from local dns, etc.

This can be called configuration consistency management. Usually when talking about large server farms people focus on rapid deployment (1000 machines in an hour - large numbers are impressive). But in fact it is not the deployment itself that poses the greatest challenge. Maintaining configuration over time is much harder.

So how do you do it? Depends on what farm you are running. If you run a homogeneous computational cluster, you probably have just one type of setup that every worker node in the cluster should have. If you work for a dotcom, you are probably dealing with a larger number of configs (database servers, www servers, cache servers etc.). Also if the dotcom runs more than one website, the number of aforementioned server types is multiplied. Add different linux distributions, bsd, solaris to the mix and you find yourself in the middle of chaos.

In case of a homogeneous server farm, you are likely to have just one configuration to maintain. Your focus is not on imposing server configuration, but rather on maintaining a consistent server image over time. When you deal with different configurations and different websites, you probably want to control only certain aspects of every server (ntp for example), while others are left untouched and can be modified directly by people working on a certain website (users, groups, etc).

The tool I recommend for running a homogeneous system farm is systemimager (http://systemimager.sourceforge.net). It is a suite which combines automated deployment (via pxe & rsync/torrent) and further server management. You install one cluster node and get its image to the systemimager server. The image is located in a directory into which you can chroot and make changes. You deploy the image on the nodes with pxe. A small linux distro loads itself into a ramdisk first. The distro creates filesystems and rsyncs the image from the systemimager server. And here comes the most interesting part about consistency. After you have deployed all your systems and you need to update them, there's no need to go to all servers, run scripts, etc. You just chroot into your image, install software, add users, whatever. Then you rsync your image to the nodes. And voila - all your nodes are upgraded and in a consistent state. Of course systemimager is smart enough not to overwrite anything in /home, /tmp, /var etc. Downside - AFAIK it currently fully supports only linux.

If you run lots of different system types, systemimager is probably not the best choice for you, as you would need to have a separate image for every single node type. Puppet (http://reductivelabs.com) is probably better here. Puppet is a language to describe system configuration. You can specify which packages must be present, users to be added, services to be running. A process "puppetd" runs on every node and applies this configuration periodically. You can have classes of nodes for different hardware types, different services. What's important - you can model only certain aspects of your configuration and leave others untouched.

Sunday, May 2, 2010

Systemimager & puppet

I've recently read this article on puppet wiki describing how to deploy systems for puppet. The method uses kickstart and cobbler. Here I describe how to provision systems for puppet with systemimager.

The goal
Do a bare metal provisioning of a huge number of servers with systemimager.
Have them automagically registered in puppet.

Why I do it
Systemimager has certain advantages over deploying servers with kickstart or preseed:
-> It is distro-independent
-> It scales (installations can be done with torrent which enables you to deploy several hundred nodes in around 10 minutes - there is a paper on this on systemimager website)
-> Deployed image is always the same as opposed to kickstart. In kickstart the resulting OS is based on the state (version of packages) in repos from which it has been installed. So unless you keep a local mirror on which you control package versions, you cannot really ensure that OSes you are deploying are equal. So if you decide to re-deploy a system a month later you might find that it's different from your previous installations.
-> Images can be easily modified further on and changes can be pushed to the clients without interrupting their work (if you want to update a client image, just chroot into it and do apt-get upgrade or yum update - as simple as that ;-). Then propagate the change to clients with systemimager tools. They will sync the changes you've done in your image to clients.
-> You can easily monitor progress/errors of your installation with systemimager-monitor
-> With systemimager you get not only a deployment solution, but also cluster management tools like parallel shell, file syncing and - most important - si_updateclient utility. Suppose you have deployed your image to servers and forgot to put your software in /opt. You chroot into your reference image on systemimager-server and untar your software. Then you run si_updateclient on the client and voila, changes are synced - your package is installed on the client. This nicely complements puppet which is not designed to transfer large data with profiles.

Assumptions
-> I assume that a flavour of linux is to be deployed (any distro *NOT* using grub2 will do). Examples are based on centos.
-> You have a systemimager-server installed and running. This requires dhcp, pxe-boot and storage for images, all of them set up for systemimager. Systemimager has a set of wizards for it.
-> You have puppetmaster instance already in place.

Procedure overview
-> Manually install basic, mini linux on one of your new servers. You only install basic release of your linux flavour (just like "Base installation" in centos). This speeds up deployments as the image has fewer files.
-> Prepare it for puppet
-> Have its image retrieved by systemimager
-> Deploy the image to other systems
-> Register all systems in puppet

Step 1: Install linux reference image
No comments here - just use your distro iso ;-)

Step 2: Modify the OS to operate with puppet
First install ntpd and configure it. Puppet uses certificates for security. It's likely that hwclock on new servers does not show the correct time, so the csr generated by puppet might be dated somewhere in the future or far in the past. Even if it is signed by the puppetmaster, it will not be valid at the time of deployment.
Install puppet client and configure it to point to your puppetmaster & start at boottime.
Install systemimager-client.
Edit /etc/systemimager/updateclient.local.exclude and add /var/lib/puppet/ (if you do further management using systemimager suite, contents of this directory will be left untouched).
Configure passwordless ssh to your clients from systemimager-server. Generate ssh keys without a passphrase (or use a passphrase and ssh-agent to cache it) on systemimager-server. Copy /root/.ssh/id_rsa.pub to /root/.ssh/authorized_keys on your clients.
Do further modifications as you like.
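
Installing the key can be done by hand, but a small helper avoids duplicate entries when you repeat the step. This is just a sketch; add_authorized_key is a made-up name and both paths are passed in, so it works on a live client as well as inside a chrooted image directory.

```shell
#!/bin/sh
# add_authorized_key PUBKEY_FILE AUTH_FILE
# Append the public key to AUTH_FILE unless the exact line is already there.
add_authorized_key() {
    pubkey_file="$1"
    auth_file="$2"
    key=$(cat "$pubkey_file") || return 1
    mkdir -p "$(dirname "$auth_file")"
    touch "$auth_file"
    chmod 600 "$auth_file"
    # -F: fixed string, -x: the whole line must match
    if ! grep -qxF -e "$key" "$auth_file"; then
        echo "$key" >> "$auth_file"
    fi
}
```

Example: add_authorized_key /root/.ssh/id_rsa.pub /var/lib/systemimager/images/img-name/root/.ssh/authorized_keys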

Step 3: Retrieve golden client image with systemimager
Please, see systemimager manual for details. This is general procedure:

On the systemimager-server:
/etc/init.d/systemimager-server-rsyncd start

On the "golden client":
si_prepareclient --server systemimager-server-ip

On the systemimager-server:
si_getimage --image img-name --golden-client client-ip-addr

The image is stored in a plain dir in /var/lib/systemimager/images/. You can chroot into it and adjust if you forgot something in step 2.

Step 4: Deploy the image to other systems
On the systemimager-server prepare other clients to pxe-boot:
si_mkclientnetboot --netboot --clients ip-list-of-nodes --image img-name

This command generates dhcp,pxe,tftp configuration for your clients so that they install the image next time they boot.

Reboot your new servers and watch them deploying the image ;-) (You might have time for a cup of coffee here unless you are using torrents for deployment which is extremely fast ;-)

After the last node is deployed, run:
si_mkclientnetboot --localboot --clients ip-list-of-nodes
This makes nodes boot from local hdd instead of pxe.
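
As a reminder of the order of commands in steps 3 and 4, the helper below prints them with your values filled in. It does not run anything; deployment_plan is a made-up name and the "server#" / "golden-client#" prefixes are only labels showing where each command belongs.

```shell
#!/bin/sh
# deployment_plan SERVER_IP GOLDEN_IP IMAGE CLIENTS
# Print (not run) the commands from steps 3 and 4 with the values filled in.
deployment_plan() {
    server="$1"; golden="$2"; image="$3"; clients="$4"
    echo "server# /etc/init.d/systemimager-server-rsyncd start"
    echo "golden-client# si_prepareclient --server $server"
    echo "server# si_getimage --image $image --golden-client $golden"
    echo "server# si_mkclientnetboot --netboot --clients $clients --image $image"
    echo "server# si_mkclientnetboot --localboot --clients $clients"
}
```

Example: deployment_plan 10.0.0.1 10.0.0.50 centos-base "10.0.0.51 10.0.0.52" (all addresses and the image name here are placeholders).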

Step 5: Register new clients with puppet
After reboot, all you need to do is to sign new nodes' certificates as they appear. They are ready for puppet configuration. If you have problems at this stage (not all clients appear in puppet etc.), you may use parallel shell from systemimager to troubleshoot (just like: si_psh --hosts 'host_list' 'puppetd --verbose -o'). This is what the passwordless login in your image is for.


Summary
I think the procedure described here is a more versatile replacement for kickstart and preseed installations. Besides, systemimager is a great cluster management tool by itself.
-> It's faster and less complicated.
-> You don't need a local copy of your repo.
-> Easy to fine tune your images (no scripting for this as it is with kickstart).
-> Systemimager configures pxe, dhcp, tftp stuff for you.
-> If you have a homogenic cluster (HPC worker nodes are a good example) not so big, you may find that you don't even need puppet to manage it. Systemimager will do.

I mention systemimager in some of my posts. Please, check them out on the tag cloud.
Comments are very welcome as usual ;-)

Tuesday, April 27, 2010

Notes on simple design

Clusters and server farms are complex stuff. So are the applications that run on them (these are truisms, aren't they?). So why add even more complexity with management? I mean the sysadmin's job is to simplify things as much as possible. The sysadmin's work should leave the smallest possible footprint on the whole setup. Simple cluster designs may not have a high-tech look&feel. They don't run the bleeding edge, untested software with funky features. They rather follow the KISS rule and are based on primitive designs. They don't require as much attention as complicated clusters.

Here are some interesting quotes I've come across:

"Autopilot: Automatic Data Center Management"
Michael Isard
Microsoft Research
"We believe that simplicity is as important as fault-tolerance when building a large-scale reliable, maintainable system. Often this means applying conservative design principles: in many cases we rejected a complex solution that was more efficient, or more elegant in some way, in favor of a simpler design that was good enough. This requires constant discipline to avoid unnecessary optimization, and unnecessary generality. “Simplicity” must always be considered in the context of the entire system, since a solution that looks simpler to a component developer may cause nightmares for integration, testing, or operational staff."

21st Large Installation System Administration Conference (LISA ’07)
On Designing and Deploying Internet-Scale Services
James Hamilton
"Keep things simple and robust. Complicated algorithms and component interactions multiply the difficulty of debugging, deploying, etc. Simple and nearly stupid is almost always better in a high-scale service - the number of interacting failure modes is already daunting before complex optimizations are delivered. Our general rule is that optimizations that bring an order of magnitude improvement are worth considering, but percentage or even small factor gains aren’t worth it."

Here goes what I do to keep the setup simple:

Automation
When automating (for example server deployment), I usually avoid writing one big script which does everything. Deploying a server usually takes several steps and a few different scenarios. You choose different installations based on machine types, change machine's vlan, make it network boot, supply the image etc. Scenarios also differ. Sometimes a new server comes and you need to introduce it to the cluster and sometimes you need to redeploy a machine which is broken. Sometimes you redeploy a machine and assign it to a different server farm. A script to tackle all of this would be error prone and also difficult to test in production (you don't test erroneous things on production, right?).
Instead, I break the automated action into small pieces. For each of them I write a small script (one script to take care of networking setup, one for pxe-booting a server, one for updating cluster membership etc.). Then I test these scripts. It's easy as they are small and unlikely to have errors. I also try to base them on primitive protocols e.g. using telnet or ssh with an expect library. Then I try to glue them together into bigger stuff.
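
The gluing part can stay primitive too. Below is a minimal sketch of such glue, not my actual scripts: run_steps and the three step functions are made-up placeholders. Each step is run in order and the whole thing stops at the first failure, so you always know which small piece broke.

```shell
#!/bin/sh
# run_steps STEP...: run each step (a command or function name) in order
# and stop at the first one that fails, reporting which one it was.
run_steps() {
    for step in "$@"; do
        echo "running: $step"
        if ! "$step"; then
            echo "failed: $step" >&2
            return 1
        fi
    done
    echo "all steps done"
}

# Hypothetical placeholder steps; in real life each one would be its own
# small, separately tested script.
setup_network()     { echo "  (network setup would happen here)"; }
enable_pxe_boot()   { echo "  (pxe setup would happen here)"; }
update_membership() { echo "  (cluster membership update would happen here)"; }
```

Example: run_steps setup_network enable_pxe_boot update_membership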

Operating systems
Operating systems are simple when they are standard installations. Having a large number of servers, one should not customize any element of them by hand. Instead I found it is quite good to package all your scripts into a software package like rpm or deb and roll it out. For example, you may have a package called "my-custom-setup-.rpm" which places custom scripts into /usr/local. It also installs all the dependencies - rpm handles that. In general, it's also a nice idea to have a local copy of your distro's repo.
To distribute /etc/ contents, puppet from Reductivelabs is a great tool. I usually use puppet to ensure the latest version of my custom rpm is installed and to make changes to /etc. (puppet is not good at distributing large files, so syncing binaries with it is no good - rpms handle this much better)

Networking
If possible, it's desirable to have all the machines within a single network without vlans. Vlans introduce complexity into server deployment process. Usually you deploy a server within one vlan where you have PXE and tftp. When deployment is finished you need to change the vlan, which involves changing it on your switch. You run a large number of machines, so you need to automate this process. You also need to keep a mapping between your servers and switch ports. After you change the vlan, you cut the access to your server. So now you need to access its management processor to reset it to get a new address. You also need to know management address of every server. In most cases having vlans is a must and cannot be avoided, but if you don't need it, don't use it ;-)

Agents
To effectively manage the servers, you usually install some agents on them. For me the best agent is sshd with pubkey authentication. It's rock stable, which gives you two things: you don't have problems with accessing your servers. If you have - then you can safely assume that the whole machine is down for some reason (swapped out, turned off, etc.) - this is the second thing. OK - I also run puppet clients. But the general idea is that management agents should be simple and known-for-years stuff.

Reliability
Clusters' reliability is based on multitude of nodes. As a rule - I don't guarantee my users any failover mechanism. It's up to the application to handle it. I don't build any HA into systems themselves.

Information
It's common that in the lifecycle of a cluster some machines come and go, some migrate between clusters. It's crucial to keep this information somewhere. A simple, "primitive" approach to this could be like this one. DNS is a database, simple, widely-used, guaranteed to work (replication built-in ;-). The client is built into any linux.

It's very easy to become "innovative" in the bad sense when it comes to clusters. I mean being "bleeding edge" often means "complicated". Many ready to use datacenter solutions are all-in-one bulky software, which promises everything, but in fact is very complicated and tailored to specific systems and platforms (take a look at HP SIM, IBM Director).

In fact, most sysadmins I know end up with custom tools built on opensource components with simplicity in mind.

Sunday, April 18, 2010

Degrees of control

This post was inspired by what happened to me lately at work. A guy from security came in and told me it would be great if we could allow only certain packages to be installed on our linux boxes. Everything that is not specified in the machine's profile would be automatically erased.
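
Just to illustrate what the security guy asked for: finding the packages that are installed but not in the profile is a simple list comparison. This is a sketch with made-up names (extra_packages and the two list files), and it only reports the extra packages, it does not erase anything.

```shell
#!/bin/sh
# extra_packages INSTALLED_FILE ALLOWED_FILE
# Print package names present in INSTALLED_FILE but not in ALLOWED_FILE
# (one package name per line in both files).
extra_packages() {
    installed_sorted=$(mktemp) || return 1
    allowed_sorted=$(mktemp) || return 1
    sort -u "$1" > "$installed_sorted"
    sort -u "$2" > "$allowed_sorted"
    # comm -23: keep only lines unique to the first file
    comm -23 "$installed_sorted" "$allowed_sorted"
    rm -f "$installed_sorted" "$allowed_sorted"
}
```

On a redhat-like box the installed list could come from: rpm -qa --qf '%{NAME}\n' > installed.txt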

When I look at this situation, I come to think that there are times and setups when you want to control every change that happens on your server farm and sometimes you only want to control some parameters of your machines.

So there are basically two approaches:
-> "God-mode": you have a reference server image to which you introduce changes and then sync your servers to this image (changes entered manually on your servers are overwritten)
-> "modelling-mode": you say: this server must have an httpd & postfix running, also group apache needs to be present, etc. You care only about httpd, postfix and apache group - the rest can be modified freely.

You can use approach 1 if you run a homogeneous server farm, such as an HPC cluster where you have a headnode and a number of similar worker nodes. This approach does not deal well with situations where you have a mixture of different OSes, hardware and machine types. On the upside - you always know what you are running. The security guy is always happy ;-) Also - tools used here are quite simple. All you gotta do is to sync your clients with the reference image.

You use approach 2 if you run a more diverse environment (who'd suppose ;-). I mean here a bunch of large websites, serving different domains, several database configurations, proxies etc. - here you might easily have over 10 installation types, each of them possibly running different OS, hardware etc. When you think about it, it is easy to realize that controlling this mess with approach 1 is impossible. Especially when there are several admins, each controlling his domain of expertise. It's likely that your database admins don't know your configuration tools. Also they know databases better than you. So it's reasonable only to assure that package postgres or mysql is installed on their machines and leave other system tuning up to your fellows.

Some words about tools that can be used here:

For approach 1:
-> systemimager - a cluster deployment and management suite. You store images of your servers in a central repository. They are plain directories so you can chroot into them, install some software, add users, etc. and then propagate changes to your clients. All of this is done with rsync so you don't interrupt your farm members' work.

-> starting machines with common nfs root - machines mount a common root filesystem from an NFS server. What you change on the nfs share is immediately propagated to clients

For approach 2:
I recommend running puppet + nagios. With puppet you ensure that certain aspects of your servers are the way you want them (i.e. apache installed, user apache present etc.). However puppet fails on reporting, so you need to monitor how puppet imposes your configuration with nagios checks. All the rest is in the hands of your fellow admins.
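
A nagios check for this can be as simple as looking at how long ago puppet last touched one of its files. The sketch below is only one possible way: check_puppet_freshness is a made-up name, the file to watch is passed as a parameter (I assume something like /var/lib/puppet/state/state.yaml, which may differ on your install), and it relies on find supporting -mmin (GNU and BSD find do).

```shell
#!/bin/sh
# check_puppet_freshness FILE WARN_MIN CRIT_MIN
# Nagios-style check: return 0 (OK), 1 (WARNING), 2 (CRITICAL) or
# 3 (UNKNOWN) depending on how many minutes ago FILE was last modified.
check_puppet_freshness() {
    file="$1"; warn="$2"; crit="$3"
    if [ ! -e "$file" ]; then
        echo "UNKNOWN: $file not found"
        return 3
    fi
    # find prints the file name only if it is older than the given minutes
    if [ -n "$(find "$file" -mmin +"$crit")" ]; then
        echo "CRITICAL: $file not updated for more than $crit minutes"
        return 2
    elif [ -n "$(find "$file" -mmin +"$warn")" ]; then
        echo "WARNING: $file not updated for more than $warn minutes"
        return 1
    fi
    echo "OK: $file is fresh"
    return 0
}
```

Wrapped in a script that ends with "exit $?" after the call, it can be hooked into nagios (e.g. via nrpe) like any other plugin.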

Comments and suggestions highly welcome ;-)

Thursday, April 15, 2010

Using DNS for inventory tracking

This post will be quite short - rather a tip ;-).
Recently I started using DNS to store information about my assets. It turned out to be a primitive but very handy way to do it.

You usually keep track of your hardware in some kind of spreadsheet or a database. They can be updated automatically or manually. Ready tools to do it are ocsinventory or puppet storeconfigs. This is cool, but to access the data you need to launch your browser, enter fields etc. etc. It all takes time, particularly when you need to look up information on only one host.

I found things turn much simpler if you put vital hardware data into your DNS (hinfo, txt custom fields are just for that). You can access them later on using "host -t " command from any unix-like os.

For one of my hosts the output might look like this:

host -t txt compute002
descriptive text "location: Rack103a-5"
descriptive text "role: compute-node"
descriptive text "hardware: ibm, 8gb ram, 2xIntel Xeon 5540"
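
If you keep the "key: value" convention shown above, pulling a single field out of the output is a one-liner. A sketch only: txt_field is a made-up helper name, and it assumes the key contains no characters that are special to sed.

```shell
#!/bin/sh
# txt_field KEY: read "host -t txt" output on stdin and print the value
# stored as "KEY: value" in any of the TXT records.
txt_field() {
    sed -n 's/.*descriptive text "'"$1"': \(.*\)"$/\1/p'
}
```

Example: host -t txt compute002 | txt_field location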

Wednesday, April 14, 2010

Linux clusters - first aid kit

Here is a list of tools to build and manage linux clusters and farms. Some of them I used a lot, some I only know to be respected among fellow sysadmins. All of them are free and/or GPL-ed.

Distros

-> OSCAR - quoting the webpage "OSCAR allows users, regardless of their experience level with a *nix environment, to install a Beowulf type high performance computing cluster. It also contains everything needed to administer and program this type of HPC cluster.". If you are in the HPC business and need MPI, scientific libraries etc. installed out of the box - this one is for you ;-)

-> ROCKS - quote "Rocks is an open-source Linux cluster distribution that enables end users to easily build computational clusters, grid endpoints and visualization tiled-display walls. Hundreds of researchers from around the world have used Rocks to deploy their own clusters". AFAIK this one is based on redhat. Job queueing system, MPI and scientific apps included in so called "rolls" (see the doc).

Administrator toolbox

-> cobbler - deployment tool using kickstart/preseed. Redhat-centric.

-> systemimager - one of my personal favourites. A VERY massive server deployment tool using rsync and torrents. It supports many cool features like cloning a server without the need to bring it down. Hardware independent imaging possible with a few hacks. When you run a homogenic server farm it's also a very convenient server management tool - images are stored as directory structures - it's possible to chroot in them, change, and sync changes to live systems! ;-). Some killer apps like parallel multithreaded shell and file distribution tool are also included.

-> capistrano - an automation tool. Have not used personally, but I plan to check it out some day

-> TORQUE resource manager - none of HPC clusters can live without a proper queueing system. Torque maintains your job queue, determines free resources (cpus, ram) on your compute nodes and distributes computations on cluster nodes.

-> mcollective - A job execution framework. I have not used it personally, but plan to give it a shot. From the webpage: "The Marionette Collective aka. mcollective is a framework to build server orchestration or parallel job execution systems. Primarily we'll use it as a means to programmatically execute actions on clusters of servers."

-> puppet - a language to describe your datacenter. Just one thing to say about it - A MUST-HAVE. It's hard to describe all the things this tool does (definitely, expect a detailed post on it soon). It is a language to describe server configuration. You can group machines in classes, describe what services to run and which users to have on various types of nodes with various operating systems. If you seek a simpler replacement for cfengine - this is the choice ;-)

-> ganglia - a monitoring framework designed for clusters. Supports aggregation functions for multiple nodes.

-> freeIPA - authentication framework based on kerberos and ldap.

If you think the list is not complete, please comment ;-)

Sunday, April 11, 2010

How to migrate linux between different hardware

Any experienced systems administrator must have come across this issue in his life. Your server became outdated 3 years ago, with all RAM and disk slots already filled. You need more power. Gotta buy a new server, set it up, install the software, configure it and run the services on the new server.

But wait a minute... The old redhat 3.x you are using presently is hardly available now. And what about the code your fellow admin wrote 3 years ago and left soon afterwards (the code still works but you have no idea about its dependencies etc.)?

The best idea is to clone the server and deploy the clone on the new one. But a clone is usually hardware dependent - which means you cannot redeploy it on servers which require different drivers.

Fortunately, linux handles hardware changes quite nicely and it is easy to set up hardware independent imaging.

However some conditions apply:

-> on the new server, the only things that will change are probably the modules used, along with initrd. These things are hardware dependent.

-> you will be running grub1 on the destination server (grub2 is not supported by systemimager AFAIK - see later)

-> you might experience some minor problems with udev, which may fail to start during boot. In my experience this is usually not a major problem.

Software:

-> systemimager
-> an iso of your linux distribution

Hardware:
-> old server (ServerA)
-> new server (ServerB)
-> a third server (ServerC) to run systemimager-server - just for migration time

Procedure overview:

We first get the image from ServerA and transfer it to systemimager-server on ServerC. We also install a basic, plain operating system on serverB.
Then we overwrite serverB with ServerA's image excluding parts which are hardware dependent (modprobe.conf, modprobe.d). We regenerate initrd and reboot ;-)


Diving in

-> install your linux distro onto ServerB from iso

-> On ServerC download and install systemimager-server package (instructions available on the systemimager website. My ubuntu has it out-of-the-box in apt). The server on which you install should have enough disk capacity to hold ServerA's filesystem. Start systemimager-server-rsyncd service.

-> download and install systemimager-client on ServerA and ServerB

-> on serverA turn off firewall, shut down production services (possibly - as many daemons as you can). Then run:
si_prepareclient --server serverC

-> on serverC:
si_getimage --image image-serverA --golden-client serverA

this will start image retrieval which is done with rsync

-> while the image is being cloned, log in to serverB and specify which files must not be overwritten. On serverB - edit the file /etc/systemimager/updateclient.local.exclude and add the following lines:

/etc/modprobe.d/

/etc/modprobe.conf


-> when image retrieval is finished, on serverB run:

si_updateclient --server serverC --image image-serverA --no-bootloader

this will transfer the image from serverC onto your new server serverB

-> to support the new hardware on serverB, you need to regenerate initrd and check that all grub entries are correct. mkinitrd reads /etc/modprobe* to determine which hardware modules are needed to start the system. These files were not overwritten by cloning because you excluded them earlier. (redhat provides an easy command for the whole process called new-kernel-pkg). On redhat it also works to reinstall the kernel with the rpm --force option.

-> after this is finished, reboot.
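The serverB part of the steps above can be sketched as a small shell helper. This is only a rough sketch under assumptions: add_exclude and serverb_plan are names I made up for illustration, the mkinitrd line assumes the redhat-style "mkinitrd -f image kernel-version" form (other distros use different tools), and the kernel version passed in is whatever the image brought along under /lib/modules. serverb_plan only prints the commands, it does not run them.

```shell
#!/bin/sh
# Sketch of the serverB side of the migration. Host and image names
# (serverC, image-serverA) are the ones used in this post.

# Append a path to an exclude file unless that exact line is already there.
# (hypothetical helper, not part of systemimager)
add_exclude() {
    file=$1
    path=$2
    touch "$file"
    grep -qxF -- "$path" "$file" || printf '%s\n' "$path" >> "$file"
}

# Print the remaining serverB commands in order (dry run only).
# $1 = kernel version that came with the image (assumption: you look it
# up under /lib/modules after the sync has finished).
serverb_plan() {
    kver=$1
    echo "si_updateclient --server serverC --image image-serverA --no-bootloader"
    # redhat-style initrd regeneration; adjust for your distro
    echo "mkinitrd -f /boot/initrd-$kver.img $kver"
    echo "reboot"
}
```

On a real serverB you would call add_exclude with /etc/systemimager/updateclient.local.exclude as the first argument, once for /etc/modprobe.d/ and once for /etc/modprobe.conf, and then review the output of serverb_plan before typing anything in.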

Conclusion

That's it. In practice this has worked for me several times. I would mostly reimage an IBM eServer (from /dev/sda) onto an HP ProLiant (on /dev/cciss/c0d0) and vice-versa. It worked well on dell blades. I have also successfully done V2V and P2V migrations on vmware, xen and kvm. The systems were redhat 5 and 4, and debian.
However I expect some problems might arise if you tried to jump too far in kernel versions (for example run an ancient redhat with a modern kernel on new hardware). An idea to handle this is to exclude not only the modprobe* stuff from imaging but also the whole /boot partition and /lib/modules/*. This is still to be tested ;-)
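If you want to play with that untested idea, the exclude file on serverB would simply get two more entries. A minimal sketch, assuming the extra entries follow the same style as the two modprobe lines from the procedure (extend_excludes is a hypothetical name, and I have not verified this variant on a real migration):

```shell
#!/bin/sh
# Untested variant: also keep serverB's own /boot and kernel modules.
# $1 = path to the exclude file, normally
#      /etc/systemimager/updateclient.local.exclude on serverB.
extend_excludes() {
    for path in /etc/modprobe.d/ /etc/modprobe.conf /boot/ /lib/modules/; do
        # add each entry only once, so the function is safe to re-run
        grep -qxF -- "$path" "$1" 2>/dev/null || printf '%s\n' "$path" >> "$1"
    done
}
```

With these entries in place serverB would presumably keep booting the kernel from its own fresh install, so the userland comes from ServerA's image and the kernel side stays native to the new hardware.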