When Things Go From Good to Sad... to Seriously Serious

Written By : Travis Graham

(this was written by our intrepid Director of Operations, Travis Graham, who was too busy to proofread and edit it, so I did it for him. I swear he wrote, like, 95% of it. I added the top tips and the thing about open office hours. I certainly didn’t have to fix any typos or weird comma things. – Kevin)

Some of the most common alerts that come in are HTTP timeout, memory utilization, server down, and SSH timeout alerts. Knowing the customer and how they’ve built out their application infrastructure gives me an initial idea of where to look. The hands-on experience of dealing with thousands of these alerts and fixing them creates a “run book” of sorts: where to start investigating and what to look for in order to fix the problem before it becomes seriously serious. Here are some tips for finding and fixing the causes of these basic alerts.

HTTP Alerts

The most common cause of http alerts is a backed-up passenger global queue, which causes the http check to fail.

The first thing I like to check is “top” to get an overview of the currently running passenger processes: total number running, their uptime, and memory footprint. If there are no passenger processes, I check to see if a recent deploy has gone out and look to see if apache was restarted or is even running at all. Sometimes a typo makes its way into the code base, which causes apache to fail to start. You are deploying to staging, aren’t you!?

A couple quick tips on top: You can change the column the list is sorted on by using < and >, and see what command is actually being run by hitting c. Knowing which URL is misbehaving in your app is priceless and being able to sort by CPU and memory usage is nice too.

If there are passenger processes and they look normal across the board, I jump to check passenger-status to see if the global queue has backed up. Most often, this is the case, and a simple apache restart will clear the global queue, and the site will start loading again. If an apache restart clears things up, you’re good to go. If the global queue fills up quickly after restarting apache, check the number of requests each passenger process has served and make sure they are incrementing. If they aren’t incrementing and you see “Sessions: 1”, this means something about the request is either long running or blocking. Debugging passenger processes is a post in itself; so, more on that at a later date.

Sometimes, there may be an abundance of connections to the server and apache gets overloaded. You may just be getting a legitimate increase in traffic, so I would start by increasing MaxClients in your apache config and see if that gets your app back up and running. If you think the traffic might be malicious in nature, you’ll want to track down the connections and count them to see if there’s an extremely high number of connections from a single IP or a few IPs. If so, you’ll want to see if it’s a valid user that’s abusing your app or someone up to no good. Running “netstat -an | grep -e ':80 ' -e ':443' | awk '{ print $5 }' | awk -F: '{ print $1 }' | sort | uniq -c | sort -n | tail -n 20” will give a sorted list of the 20 remote addresses with the most http/https connections to the server. You can then investigate the IPs and make a judgement call on blocking connections. Running “sudo iptables -I INPUT -s IP_ADDRESS -j DROP”, where IP_ADDRESS is the offending address, will drop all traffic coming from that address.
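If you find yourself running that netstat pipeline a lot, it can be wrapped in a small shell function. What follows is only a sketch, not something lifted from our run book: the function name top_http_clients and its optional limit argument are made up for illustration, and it assumes Linux-style “netstat -an” output where the fourth column is the local address and the fifth is the remote address. Unlike the one-liner, it matches on the local port only and skips LISTEN sockets.

```shell
# Hypothetical helper: count connections per remote address for local
# ports 80 and 443. Reads `netstat -an`-style lines on stdin and prints
# "count address", highest count first.
# Assumes Linux column layout: $4 = local address, $5 = remote address.
top_http_clients() {
    limit="${1:-20}"    # how many addresses to show (default 20)
    awk '
        # TCP lines only, local port 80 or 443, and not a LISTEN socket
        $1 ~ /^tcp/ && $4 ~ /:(80|443)$/ && $5 !~ /:\*$/ {
            ip = $5
            sub(/:[0-9]+$/, "", ip)    # strip the remote port
            count[ip]++
        }
        END { for (ip in count) print count[ip], ip }
    ' | sort -rn | head -n "$limit"
}

# Example (not run here): netstat -an | top_http_clients 20
```

Reading from stdin means you can also feed it a saved netstat capture after the fact, which is handy when you are writing up what happened.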

Memory Alerts

Oftentimes, passenger processes have gone rogue, have become leaky, and need to be killed. We have a moonshine plugin that monitors passenger processes for memory utilization and ensures they are still being maintained by the passenger application spawner. If a passenger process is no longer being managed by the spawner, it won’t be killed when traffic decreases or, depending on your passenger settings, when it maxes out on requests served or times out.

If your application hasn’t been broken out into separate roles per server, you may have overloaded what your current server is capable of. Oftentimes, a database gets big or resque processes start to grow as they fork. Restarting these services can free memory that hasn’t had a chance to be released and will buy some time to find a longer-term solution, such as debugging memory leaks, planning downtime for a server upgrade, or separating your database out onto its own server.

Another common memory hog is a late night cronjob that kicks off a rake task. Just because it’s 2am doesn’t mean you can load the whole database into memory to update a few million rows. Don’t let nasty rake tasks keep you or your ops team up at night. Find and fix the memory hungry code or break the single task into smaller parts.

One thing to be careful of is the very real chance the OOM killer starts killing off important processes. While working to reduce memory usage, sometimes you need to ensure data integrity. I like to set the OOM killer to ignore things like mysql or redis-server if the server looks close to OOMing and I need more time. You may also want to protect your SSH session; otherwise, you’ll be disconnected and won’t be able to fix the problem without rebooting the server. You can do this using “echo -17 > /proc/$PID/oom_adj” as root, where $PID is the PID of the process you want to protect. Once you have your processes protected, kill off anything else nonessential that might be using memory to buy enough time for a clean shutdown or restart.
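One caveat worth knowing: on kernels 2.6.36 and later, oom_adj is deprecated in favor of oom_score_adj, where -1000 plays the role that -17 does for oom_adj. Here is a rough sketch of a helper that uses whichever file is present. The name protect_from_oom and its optional second argument (an alternate proc root, there only so the function can be tried against a scratch directory) are hypothetical, and on a real server it has to run as root.

```shell
# Hypothetical helper: exempt one process from the OOM killer.
# Usage: protect_from_oom PID [PROC_ROOT]
# PROC_ROOT defaults to /proc; the override exists only for dry runs.
protect_from_oom() {
    pid="$1"
    proc_root="${2:-/proc}"
    if [ -e "$proc_root/$pid/oom_score_adj" ]; then
        # Newer interface: -1000 disables OOM killing for this process
        printf '%s\n' '-1000' > "$proc_root/$pid/oom_score_adj"
    elif [ -e "$proc_root/$pid/oom_adj" ]; then
        # Older interface: -17 disables OOM killing for this process
        printf '%s\n' '-17' > "$proc_root/$pid/oom_adj"
    else
        echo "no OOM control file found for PID $pid" >&2
        return 1
    fi
}

# Example (not run here, needs root): protect_from_oom "$(pidof mysqld)"
```

The setting lasts only as long as the process does, so a restarted mysql or a fresh SSH session needs protecting again.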

Server Down

Yep, not much you can do to rescue this type of alert unless it’s a false positive. If the server is down, restart it and begin digging into the logs. The most common cause of a server down alert is memory growing so rapidly that the server OOMs before a memory alert can come in. These alerts are hard to pin to a cause, but good monitoring can be a life saver. We use Scout to monitor our servers, so we can check our Scout graphs for a rapid increase in memory usage or possibly disk usage that took the server down. Traffic graphs and passenger graphs can also show what led up to the server going down, such as a sudden increase in traffic that caused too many passenger processes to spin up. Checking through the server logs and apache logs for a common thread can be tedious, and when a server OOMs and goes down, a lot of valuable information might be lost because the logs can’t be written to.

If your server is up but your app isn’t responding, you might be getting DoS’d or flooded. If you can connect to your server, possibly via KVM or IPMI, you can still troubleshoot the cause and block traffic that’s preventing your ping check from being green. You’ll want to use the netstat command I mentioned earlier, but tailor it to fit the need of the moment.

SSH Alerts

Normally, SSH alerts take some time to investigate because you have to wait for the server to become available again to check out what’s happening. You might get lucky and make it in on the first or second attempt. Jump to the log directory and check the auth log or security log for failed SSH attempts. If you have a script kiddie that happens to be doing a dictionary brute force attack on your server, you’ll see a plethora of failed attempts for either the same user with bad passwords, or an alphabetical list of random usernames. We use SSH keys and no passwords to increase the security of our servers, so a valid user trying to log in and failing that many times is very unlikely. You’ll see the IP address of the machine trying to connect, so you’ll be able to block it in the firewall. Using iptables on the server is an easy way to quickly block the offending IP address from making future attempts until iptables is restarted or a reboot happens. Use the iptables command mentioned earlier to block this traffic.
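Eyeballing the auth log works, but a quick tally per IP makes the worst offender obvious. This is a sketch under a few assumptions: the function name failed_ssh_by_ip is made up, and it expects the usual OpenSSH log lines containing “Failed password” or “Invalid user” followed by “from” and the address. The log path varies by distro (commonly /var/log/auth.log or /var/log/secure), so the function just reads whatever you pipe into it.

```shell
# Hypothetical helper: tally failed SSH log lines per source address.
# Reads auth-log lines on stdin and prints "count address", highest first.
# It counts log lines, not distinct attempts: one bad login can produce
# both an "Invalid user" line and a "Failed password" line.
failed_ssh_by_ip() {
    awk '
        /sshd/ && (/Failed password/ || /Invalid user/) {
            # Scan from the end so a username of "from" cannot confuse us
            for (i = NF - 1; i >= 1; i--) {
                if ($i == "from") { count[$(i + 1)]++; break }
            }
        }
        END { for (ip in count) print count[ip], ip }
    ' | sort -rn
}

# Example (not run here): sudo cat /var/log/auth.log | failed_ssh_by_ip | head
```

Whatever lands at the top of that list is the address to look up before handing it to iptables.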

A long-term fix would be installing and configuring something like fail2ban and letting it automatically block IPs based on the rules you set up. This can be a premature optimization, so I would wait until SSH alerts become a problem before spending time setting it up. More often than not, the quick block using iptables is sufficient. If you do go this route, be sure to configure your hosts.allow correctly by whitelisting your IP so you don’t lock yourself out late at night while making a typo run on your password.

To wrap things up with a word of advice: all these solutions to the common problems we see depend on having monitoring set up and alerting when a set of conditions has been met. Sometimes, you’re able to set the thresholds for alerts low enough that you have time to react and fix things. Sometimes, it’s just not possible to be proactive. The best way ahead is with a plan and remaining calm. If you don’t do the basic investigative steps to correctly identify the problem, you may make mistakes and “fix” things that are not the real problem and possibly cause more harm. While things can be stressful, knowing the basic steps of investigation and paths to fix common problems goes a long way to keeping your app up and running and your customers happy.

This is also probably a good time to mention that we’re doing open office hours on July 12th from 2-3PM Eastern if you’ve got questions about anything in this post or pretty much anything else.