Why does our servers' uptime never reach 70 days?

Blog post by our Sys Admin Grégoire Doumergue

The uptime contest 

As a sysadmin I’ve read this funny conversation very often:

<Admin1> Booh, I need to reboot this FreeBSD 5.2 box, it has a 1102-day uptime 🙁

<Admin2> Are you kidding, you n00b? My Debian 2.0 has been up for 1302 days!

<Admin3> …

Uptime is probably the easiest reliability and stability metric that we compare between admins. It's somewhat fascinating to read that a server has been up for thousands of days when we know how difficult it is for such a complex thing to keep running. (There's also that legend about a Novell server found walled up after years of running … well done, Novell marketing crew.) But we all know uptime doesn't mean a thing. What is actually running on this OS? A simple DNS forwarder, or a heavy J2EE application server? What is the average load, is it really used, and by how many people? And the most important question, the one that made me write this blog entry: how many security holes are left in this old 2.2 Linux kernel that has been running for years? Just check how many vulnerabilities are disclosed in the Linux kernel source every year: old kernels are an enjoyable playground for black hats.

 

Yeah, reboot!

So to me, it is very important that the operating system kernel – along with other critical software, like OpenSSH, Apache & co. – is upgraded as often as possible. And for that, there is no escape: we have to reboot the system. Uptime: 0 min. Booh.
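To know when a reboot is actually due, a quick check is to compare the running kernel with the newest one installed on disk. Here is a minimal Python sketch, assuming a Debian-style layout where installed kernels appear as /boot/vmlinuz-<version>; the lexicographic comparison is only a rough heuristic, not proper version ordering.

#!/usr/bin/env python3
# Rough check: does the running kernel match the newest kernel found in /boot?
import glob
import os
import platform

def newest_installed_kernel():
    # Installed kernels usually appear as /boot/vmlinuz-<version> on Debian-like systems.
    kernels = [os.path.basename(p).replace("vmlinuz-", "")
               for p in glob.glob("/boot/vmlinuz-*")]
    # max() on strings is a rough heuristic, not true version ordering.
    return max(kernels) if kernels else None

running = platform.release()        # e.g. "5.10.0-28-amd64"
newest = newest_installed_kernel()

if newest and newest != running:
    print(f"Reboot pending: running {running}, newest installed is {newest}")
else:
    print(f"Running kernel {running} looks like the newest installed one")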

But now that I upgrade my servers often, I'm not sad anymore; I'm even a little proud. Proud to watch a server boot and bring all of its services back up, without fear.

Yeah, reboot, and reboot them all. Don't forget one of the IT security principles: a platform is only as strong as its weakest component. So if one system is left with a 300-day-old vulnerability, it doesn't matter how well you keep the other systems up to date.

Beyond the obvious security reason, there are many other reasons periodic reboots are helpful and important:

– I’ve faced this situation too many times: in an emergency, I had to reboot a server, there was no alternative. And at the exact moment the “reboot” command awaits my final confirmation, I wonder: “this system has not rebooted for over a year. Will it come back up correctly?” Add to that the pressure of an angry customer and a nightly emergency … I can’t work like that anymore. Rebooting your systems as often as possible makes you more confident that all services will be delivered again after a crash (see the small check sketched after this list).

– Shutting down a server, even for a few minutes, is by definition a loss of service. That’s why forcing yourself to reboot your servers also forces you to think about better availability: automatic or manual failover, web session sharing, and so on.

– Speaking of load balancing and failover: testing them regularly is as important as testing backups (you do test your backups, right?). A straight reboot can be a good way to make sure the failover mechanism actually works.

– You can also test your monitoring. A server that works too well may end up outside the monitoring scope because it never fails. Make it fail on purpose and you’ll see whether your monitoring software notices.
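For the confidence point above, a small script that checks your critical services after a reboot goes a long way. Here is a minimal Python sketch, assuming systemd; the service names are just placeholders for whatever your platform really runs.

#!/usr/bin/env python3
# Post-reboot sanity check: are the services we care about active again?
import subprocess

SERVICES = ["ssh", "nginx", "postgresql"]   # placeholders, adapt to your platform

failed = []
for svc in SERVICES:
    # `systemctl is-active --quiet` exits non-zero when the unit is not active.
    if subprocess.run(["systemctl", "is-active", "--quiet", svc]).returncode != 0:
        failed.append(svc)

if failed:
    print("NOT OK, still down after reboot: " + ", ".join(failed))
else:
    print("OK, all monitored services are back up")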

Why 70 days? Why not 20 or 90?

My personal rule about kernel upgrades is to trigger them at least every two months. This frequency can vary according to the number of servers you have, your personal workload, and how up to date you need your platform to be.

I easily remember when I have to run my security upgrades: it’s at the beginning of odd months. Of course, if a big, brutal and scary CVE security alarm appears in the wild, I don’t wait a single day.
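If you like the odd-month rule, a tiny helper can remind you when the upgrade window opens. A minimal Python sketch, assuming you call it from cron or by hand; the one-week window is my own arbitrary choice.

#!/usr/bin/env python3
# Prints whether today falls in the upgrade window (first week of an odd month).
import datetime

today = datetime.date.today()
if today.month % 2 == 1 and today.day <= 7:
    print("Upgrade window: plan kernel upgrades and reboots this week")
else:
    print("Not in the upgrade window yet")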

Some advice

If you agree with me, great! Here are some tips I’ve learned during my first upgrade-and-reboot sessions:

First of all, document your upgrade procedures: even if your init and monitoring systems are configured correctly, write down what you need to manually reinstall, recompile, verify or start. For example, I have to recompile the VirtualBox and AoE kernel module drivers each time the kernel is upgraded. Write down which servers can be rebooted during the day, and which ones have to be rebooted at night. Write down the priority order of your infrastructure’s reboots.
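Part of that documentation can even be executable. Here is a minimal Python sketch that checks whether the manually maintained modules mentioned above (VirtualBox’s vboxdrv and the AoE driver) are available for the running kernel; the module list comes from my own setup, adjust it to yours.

#!/usr/bin/env python3
# Post-upgrade checklist helper: are the hand-built modules available for this kernel?
import subprocess

MODULES = ["vboxdrv", "aoe"]   # the modules I rebuild after each kernel upgrade

for mod in MODULES:
    # `modprobe -n` (dry run) fails when the module isn't built for the running kernel.
    result = subprocess.run(["modprobe", "-n", mod], capture_output=True, text=True)
    status = "ok" if result.returncode == 0 else "MISSING, recompile needed"
    print(f"{mod}: {status}")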

Also, track the state of each server: use, for example, a simple spreadsheet to note which servers still need to be upgraded and then rebooted. Here is an example:

(Screenshot: a spreadsheet with one row per server and one column per month, marking when each server was upgraded and rebooted.)

Here you can see that server4 was installed between September and November 2012. In January I couldn’t upgrade server1 for some reason (marketing; it’s always marketing’s fault), and I haven’t upgraded server2 and server3 yet.
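If you prefer plain files over a spreadsheet, the same log fits nicely in a CSV. A minimal Python sketch with made-up values that only mirror the table described above; the dates and statuses are purely illustrative.

#!/usr/bin/env python3
# Write the per-server upgrade log as a CSV: one row per server, one column per month.
import csv

rows = [
    ["server",  "2012-09", "2012-11",   "2013-01"],
    ["server1", "ok",      "ok",        "blocked (marketing)"],
    ["server2", "ok",      "ok",        ""],
    ["server3", "ok",      "ok",        ""],
    ["server4", "",        "installed", "ok"],
]

with open("upgrade-log.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

print("Wrote upgrade-log.csv")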

Last but not least: communicate. Send a short email to your co-workers and customers, even if you are 100% sure your work won’t disturb theirs. That’s exactly what you want to tell them: first, even though you take a server down, the service is still delivered. And second, you show them that you take care of your servers, even when they’re fine.
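Even that heads-up email can be scripted so you never skip it. A minimal Python sketch using the standard smtplib; the addresses, SMTP host and wording are placeholders.

#!/usr/bin/env python3
# Send a short maintenance heads-up before an upgrade-and-reboot session.
import smtplib
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Maintenance: kernel upgrade and reboot of server2 tonight"
msg["From"] = "ops@example.com"      # placeholder
msg["To"] = "team@example.com"       # placeholder
msg.set_content(
    "server2 will be upgraded and rebooted at 22:00 UTC.\n"
    "The service stays available through the failover pair.\n"
)

with smtplib.SMTP("localhost") as smtp:   # assumes a local mail relay
    smtp.send_message(msg)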

That’s how we handle security at SquidSolutions: the more proactive, the better. And it starts at the lowest level of our operating systems: the kernel. We are also lucky to use Linux, whose developer community reacts very quickly to security threats. So let’s use this power.