A virtual computer in Amazon’s EC2 service, called an instance, has an ephemeral existence. Instances are created as needed, but when they are terminated they disappear. This comes as no surprise to those of us who work with virtualization on a regular basis, but it can be a bit of a shock to an administrator who is used to physical systems backed by real disk drives. When an instance is halted, it and its local storage (including its root filesystem) disappear. There are ways around this, but those are beyond what I want to talk about today.
The common way to protect an instance from its own disappearing act is to take a snapshot of it and create a new machine image in S3. These images are bootable and can be used to create new instances, so it is common practice to create one periodically. This is usually a three-step process: bundle up the file system (using ec2-bundle-vol), copy the bundle up to S3 (using ec2-upload-bundle), then register the bundle as a machine image.
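For anyone following along, the three steps look roughly like this. This is a sketch, not my actual script: the key and certificate paths, account ID, bucket, and prefix are all placeholders, and the final step uses ec2-register, the API tools' command for turning an uploaded bundle into an AMI.

```bash
# Step 1: bundle the running instance's root filesystem
# (key, cert, account ID, and prefix below are placeholders)
ec2-bundle-vol -k /path/to/pk.pem -c /path/to/cert.pem -u 111122223333 \
    -d /mnt -p my-server-image

# Step 2: copy the bundle parts and manifest up to an S3 bucket
ec2-upload-bundle -b my-ami-bucket -m /mnt/my-server-image.manifest.xml \
    -a "$AWS_ACCESS_KEY" -s "$AWS_SECRET_KEY"

# Step 3: register the uploaded bundle as a machine image (AMI)
ec2-register my-ami-bucket/my-server-image.manifest.xml
```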
I have performed this process countless times. In fact, it is so routine that I wrote a script to perform all the steps for me with appropriate file and S3 bucket names. Some time back, one of our instances developed the unfortunate symptom of refusing to be bundled. Since it wasn’t a customer site, I kept putting the issue on the back burner. Today I finally started tracking down the real problem.
In order to understand what went wrong, we first have to understand how bundles are created. The script ec2-bundle-vol first creates a file system inside a large file and mounts that file as a separate file system. Then rsync is used to copy the root file system into this new file system. Everything is copied except for a list of excluded paths: the mount point for the new file system is obviously one of them, but there are a number of others, and the actual list varies based on the operating system. Once the rsync is finished, the file is unmounted and then split into pieces of exactly 10 megabytes each. Now the bundle is ready for uploading.
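In rough shell terms, the process looks something like the following. This is a simplified sketch of what the tool does, not its actual code; the image size, paths, and exclusion list are illustrative.

```bash
# Create a large image file and make a filesystem inside it
dd if=/dev/zero of=/mnt/image bs=1M count=10240
mkfs.ext3 -F /mnt/image

# Loopback-mount the image so it can be populated
mkdir -p /mnt/img-mnt
mount -o loop /mnt/image /mnt/img-mnt

# Copy the root filesystem into the image, skipping the excluded paths
# (the real exclusion list is longer and varies by operating system)
rsync -a / /mnt/img-mnt \
    --exclude=/proc --exclude=/sys --exclude=/mnt

# Unmount the image and split it into 10 MB pieces for upload
umount /mnt/img-mnt
split -b 10M /mnt/image /mnt/image.part.
```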
So I began troubleshooting my problem. The error message wasn’t very enlightening: it simply said “execution failed” and listed the (very long) rsync command. I turned on more verbose output from the script and started seeing rsync errors, but still no insight into why those errors were happening. A Google search for the error didn’t produce many useful hits, but it did point me at a curious forum posting from someone who complained about the very same problem. His workaround was to stop named before running ec2-bundle-vol, and he claimed this made the problem go away. At first I wasn’t sure I believed it, but I did note that some of the whining from rsync was about paths in named’s chroot-ed environment.

Then it hit me! Those paths were all in /var/named/chroot/proc. This is a filesystem that is stealthily bound to /proc via “mount -n --bind”. And /proc is one of the paths that is intentionally excluded from the bundle’s rsync (at least on Linux systems). I reasoned that if /proc is being excluded by default, then named’s copy of /proc should also be excluded; the contents of /proc would be entirely useless in a machine image anyway, since a fresh /proc just gets mounted over that path at boot. I added that path to the exclusion list and ec2-bundle-vol ran just fine.

So why did it run successfully when named was stopped, as the forum poster claimed? Because the service script that mounts the file system on “start” also unmounts it on “stop”. It wasn’t the act of stopping the named process that solved the problem, it was the act of unmounting the stealthy file system.
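The reason the mount is so easy to miss is that “mount -n” skips writing an entry to /etc/mtab, so the bind mount never shows up in plain mount output; the kernel’s own list in /proc/mounts still has it. The mount command below is effectively what the bind chroot setup runs; the grep is just one way to confirm the mount is there.

```bash
# How the chroot-ed named gets its copy of /proc; -n keeps the mount
# out of /etc/mtab, so plain `mount` output won't list it
mount -n --bind /proc /var/named/chroot/proc

# The kernel's own mount table still shows it
grep named /proc/mounts
```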
So if you are running named in a chroot-ed environment on Linux, your instance may not be bundle-able until you add /var/named/chroot/proc to your list of directories to exclude.
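If it helps, the extra exclusion goes in ec2-bundle-vol’s -e option, which takes a comma-separated list of absolute paths. Something along these lines, with the other flags as in your normal bundling command (the key paths, account ID, and prefix here are placeholders):

```bash
# Same bundling command as usual, with the chroot-ed /proc added to the
# exclusion list (-e takes a comma-separated list of absolute paths)
ec2-bundle-vol -k /path/to/pk.pem -c /path/to/cert.pem -u 111122223333 \
    -d /mnt -p my-server-image \
    -e /var/named/chroot/proc
```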