Classic case of a change hurting you later

So this morning I got into work to find that a database machine had crashed. This particular machine is an old legacy box (Debian 4) that acts as a gateway between multiple systems, so it has mounts to lots of other systems. It came back up cleanly, but two mounts did not remount properly (one NFS and one CIFS). After a bit of investigation I found the following:

1. The NFS mount did not come back because it lives on a multi-protocol file server and points at a directory buried deep in a path. The folks who own the directory decided they did not like BusRouting and renamed it to ‘Bus Routing’. So not only did the original path disappear, but NFS mounts really don’t like spaces in path names (in fstab a space separates fields, so it would have to be escaped as \040). The simple fix was to remove the space and then educate the folks, but the larger issue is that on multi-protocol servers people can easily make an innocuous-looking name change that breaks other systems.

2. The CIFS mount problem was a classic case: the server was rebuilt a month ago, and the admin who rebuilt it did not know which user account was used for the mount, so he did not migrate that account to the new server. In addition, we moved from Windows 2003 to Windows 2008, which has additional security, so on older Linux machines you need to use the IP address instead of the machine name to connect using mount.cifs. The fix was to recreate the mount user account on the Windows machine and change the fstab file on Linux to use the IP.

These problems lasted for a month and were not caught sooner because (a) we do not have good automated monitoring systems, (b) the mounts were used by processes that do not run very often, and (c) the DBA who normally monitors these items has been on a large project for the last 2 months and is not even reading his email (with no backfill).
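
For what it's worth, even a very small check would have caught both of these. Below is a minimal sketch, not what we actually run, of comparing the network mounts listed in fstab against what is really mounted. It assumes Python 3 is available on the host and that the live mount list can be read from /proc/mounts; the function names are made up for illustration, and entries marked noauto are skipped on the assumption that they are not supposed to be mounted at boot.

```python
import re

# Filesystem types treated as "network mounts" in this sketch (an assumption).
NETWORK_FS_TYPES = {"nfs", "nfs4", "cifs"}


def _unescape(field):
    # fstab and /proc/mounts write special characters as octal escapes,
    # e.g. a space in a path appears as \040.
    return re.sub(r"\\([0-7]{3})", lambda m: chr(int(m.group(1), 8)), field)


def parse_mount_table(text):
    """Return {mount_point: (fs_type, options)} from fstab-style text."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 3:
            continue  # not a complete entry
        options = fields[3].split(",") if len(fields) > 3 else []
        entries[_unescape(fields[1])] = (fields[2], options)
    return entries


def missing_network_mounts(fstab_text, mounts_text):
    """List network mount points that fstab expects but that are not mounted."""
    expected = parse_mount_table(fstab_text)
    mounted = parse_mount_table(mounts_text)
    return sorted(
        mount_point
        for mount_point, (fs_type, options) in expected.items()
        if fs_type in NETWORK_FS_TYPES
        and "noauto" not in options
        and mount_point not in mounted
    )


def check_local_host(fstab_path="/etc/fstab", mounts_path="/proc/mounts"):
    """Hypothetical entry point: read both files on this host and compare."""
    with open(fstab_path) as fstab, open(mounts_path) as mounts:
        return missing_network_mounts(fstab.read(), mounts.read())
```

Calling check_local_host() from cron and mailing any non-empty result to a shared mailbox (rather than to one person) would be one cheap way to use it. It only tells you a mount is absent, not why, so a renamed directory or a missing user account still has to be tracked down by hand.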

The root cause of these issues is a lack of appropriate resources for the amount of work we have. Because of this, we are not able to properly document systems and monitor them.