Red(is) Alert!

Written by: Kevin Lawver

October 22, 2012

We have a lot of customers who use Redis as a part of their infrastructure. Most of them start off using it with Resque to process background jobs. Then, they figure they’ve got keys and values they want to keep somewhere, and well, Redis is sitting there looking all friendly and you’ve already got it configured for Resque, so why not use it for storing things?

And that’s when things get interesting. You see, Redis keeps everything in memory and saves to disk periodically based on configuration. Unfortunately, Redis forks a new process during a save, and in the worst case that process can need as much memory as the main process. So, in essence, you always have to use less than 50% of the available memory on your server when using Redis.

Why? Well, as soon as your Redis process grows large enough to hit the magical 50% memory limit, it can no longer save in the background. When this happens, you have two options: run a foreground save, which leaves Redis unavailable until the save is finished (on large installations that can be more than a minute), or delete keys until background saves work again.
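To make the 50% rule concrete, here is a minimal Python sketch of the arithmetic. The function name and the numbers are made up for illustration, and it assumes the save process needs as much memory as the main process and that nothing else on the box is competing for RAM.

```python
def can_background_save(used_memory_bytes, total_memory_bytes):
    """Return True if a forked save process could fit in memory
    alongside the main Redis process.

    Assumption: the child needs as much memory as the parent, so the
    two together need 2x used_memory.
    """
    return 2 * used_memory_bytes <= total_memory_bytes


GIB = 1024 ** 3

# An 8 GiB box with 3 GiB in Redis still has room to save.
print(can_background_save(3 * GIB, 8 * GIB))
# The same box with 5 GiB in Redis does not.
print(can_background_save(5 * GIB, 8 * GIB))
```

In practice you would want to alert well before this function flips to False, not when it does.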

The neckbeards in the audience may be feeling superior at this point, thinking to themselves, “It’s your own fault. You didn’t allocate memory correctly in the first place” or “that’s what monitoring is for.” And you’d be partially correct. The problem is that web apps grow over time, and sometimes get popular very quickly. That means you may end up with more jobs than you have workers to promptly handle - which means the queue grows - which means Redis grows - which means someone’s waking up in the middle of the night to babysit a sick Redis that can’t save because it ate too much. Or, true story, someone (let’s say Resque) decides to log stack traces to Redis and you somehow trigger the world’s biggest stack trace over and over again, causing Redis to explode, leaving little bits of stack traces dripping off the ceiling.

There are things you can do to avoid these disasters, and most of them are not hard (although they may be expensive). There are also some things to watch out for when monitoring Redis, to make sure you’re looking at the right numbers, because on top of a tendency to be a naughty little imp when it gets full, it also lies.

Over-Provision Everything

The argument from Redis fans is that if you only configured your servers correctly to begin with, then you wouldn’t have any problems with it. And in exactly one way, they’re right. So go ahead and provision your Redis server like your future in-laws told you to buy diamonds and houses - always get the one twice as big as you think you’ll need because it will eventually feel small.

Get more memory than you think you’ll need, more disk, and have more workers available than you think you’ll need. Just do it.

Always Have a Backup Plan

One way to get around foreground saving on the primary server is to have a standby that you can foreground save on without affecting workers or users. This allows you to babysit a sad Redis without making any other part of your infrastructure sad. Yes, you’re paying for another server that’s the same size as the primary (that part is really important - it needs to be able to take over in case the primary goes away), but your peace of mind is worth it, right?

Also, if you need to restart Redis for any reason and you have a lot of data, restarting can take a long time. With a standby you can avoid the downtime: break the slave relationship, have your app talk to the slave Redis, restart the master, and then make it a slave of the old slave. You can then reverse that process if you need to restart the former slave.
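The restart dance is easier to get right at 3am if it’s written down. Here is a rough Python sketch that only builds the ordered list of steps; it doesn’t talk to Redis at all. The function name, the hosts, and the "app" and "restart redis-server" steps are placeholders for however you actually do those things. SLAVEOF is the Redis command for setting up and breaking replication.

```python
def restart_plan(master, slave):
    """Return the ordered steps for restarting `master` with a standby.

    `master` and `slave` are (host, port) tuples. Each step is a
    (target, action) pair; nothing is sent to Redis here.
    """
    return [
        # 1. Break the slave relationship so the standby accepts writes.
        (slave, "SLAVEOF NO ONE"),
        # 2. Hypothetical step: repoint the app however you normally do.
        ("app", "point clients at %s:%d" % slave),
        # 3. Restart the old master.
        (master, "restart redis-server"),
        # 4. Make the old master a slave of the old slave.
        (master, "SLAVEOF %s %d" % slave),
    ]


for target, action in restart_plan(("10.0.0.1", 6379), ("10.0.0.2", 6379)):
    print(target, "->", action)
```

To restart the former slave later, call the same function with the two arguments swapped.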

The other backup plan you need to have is to know what you can delete without causing irreparable harm to your app. Know how to get those keys and delete them (like Resque stack traces). When you hit that 50% memory mark, this should be the first thing you do to try to get back to your happy place.

Monitoring the Right Things at the Right Time

The INFO command returns a lot of information, and a lot of it is actually useful. Some of it, though, is wrong. For example, never trust the used_memory_human entry, because I’ve seen it be off by more than 500 MB. Always look at used_memory for monitoring and do the conversion yourself if you need to.
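Doing the conversion yourself is only a few lines. This is a minimal Python sketch that parses the text INFO returns (key:value lines, with section headers starting with #) and converts used_memory to megabytes. The helper names and the sample output are made up for illustration.

```python
def parse_info(text):
    """Parse the text returned by INFO into a dict of strings."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Memory".
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        info[key] = value
    return info


def to_megabytes(num_bytes):
    """Convert bytes to megabytes (1 MB = 1024 * 1024 bytes here)."""
    return num_bytes / (1024.0 * 1024.0)


# Made-up sample output, trimmed to two fields.
sample = "# Memory\r\nused_memory:1073741824\r\nused_memory_human:1.00G\r\n"
info = parse_info(sample)
print(to_megabytes(int(info["used_memory"])))  # prints 1024.0
```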

You need to set up alerts for yourself (using Scout maybe) that trigger before Redis gets to 50% of available memory, because once Redis passes that mark, things are going to be very sad.

You should also monitor last_save_time and alert if it hasn’t saved in a reasonable amount of time, where reasonable is defined by you. Another option to get this info is to monitor the Redis log file for failed saves (we use the Log Watcher plugin in Scout to watch for “Can’t save in background”).
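Both checks are simple enough to sketch in Python. The function names and the sample log lines below are made up, and the sketch assumes last_save_time is a Unix timestamp and that the log message uses a plain ASCII apostrophe. How you fetch the INFO value and read the log file is up to your monitoring setup.

```python
import time


def save_is_stale(last_save_time, max_age_seconds, now=None):
    """True if the last successful save is older than max_age_seconds.

    last_save_time is assumed to be a Unix timestamp taken from INFO.
    """
    if now is None:
        now = time.time()
    return (now - last_save_time) > max_age_seconds


def failed_save_lines(log_lines, needle="Can't save in background"):
    """Return the log lines that mention a failed background save."""
    return [line for line in log_lines if needle in line]


# Made-up log lines for illustration.
log = [
    "[1234] 22 Oct 03:00:01 # Can't save in background: fork: Cannot allocate memory",
    "[1234] 22 Oct 03:05:01 * Background saving started by pid 5678",
]
print(len(failed_save_lines(log)))

# "Reasonable" here is 15 minutes; pick your own number.
print(save_is_stale(last_save_time=1000, max_age_seconds=900, now=2000))
```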

If you’re using Redis for Resque (and you probably are), you need to closely watch the size of your queues, know your average velocity on completing jobs, and do some math to figure out how efficient your workers are. Why? Because you need to know when your workers will no longer be able to keep up with the pace of incoming jobs with enough time to spin up more workers if you need them. You don’t want to get into a state where you have queues on fire (putting Redis into a bad state) and not enough workers to put out that fire.
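The math doesn’t have to be fancy. Here is a back-of-the-envelope Python sketch: given how fast jobs arrive and how fast workers finish them, it estimates how long you have before a queue hits a size you’ve decided is too big. The function name, the parameters, and the numbers are all illustrative assumptions, and it treats both rates as constant, which they never are.

```python
def seconds_until_full(queue_size, max_queue_size,
                       jobs_in_per_sec, jobs_done_per_sec):
    """Estimate seconds until the queue reaches max_queue_size.

    Returns 0.0 if the queue is already at or past the limit.
    Returns None if workers are keeping up (the queue is not growing).
    """
    if queue_size >= max_queue_size:
        return 0.0
    growth = jobs_in_per_sec - jobs_done_per_sec
    if growth <= 0:
        return None
    return (max_queue_size - queue_size) / float(growth)


# 10,000 jobs queued, limit of 100,000, 50 jobs/s coming in and
# 20 jobs/s completing: 90,000 jobs of headroom at 30 jobs/s.
print(seconds_until_full(10000, 100000, 50, 20))  # prints 3000.0
```

If that number is smaller than the time it takes you to spin up more workers, you’re already late.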

Look Into Append-Only File

Redis has an alternate persistence strategy called AOF (append-only file) that eases some of the pain of forking the process to save a snapshot, but not all of it, since the folks at Redis suggest using both. If you’d like to read more on Redis persistence, antirez wrote an epic blog post about it.

Separate Functions

Don’t run your workers on your Redis server. Workers use up RAM that Redis needs to be able to save things. Workers should have their own servers, just like your app server is a separate thing from your Redis server (please tell me it’s separate).

If you’re happily using Redis for Resque, and then want to use Redis for some other nefarious purpose, get yourself another Redis server. Don’t combine them. The two uses are completely different and grow differently.

Have an Expiration Strategy

Redis is in a weird place between “hot” caching like memcached, where things fall out of the bottom of the cache when memory is needed, and disk-based data stores, where you don’t need to keep everything in memory. So, the storage strategy in Redis is a little hard to get your head around. You have it because it’s persistent, which is good. But, you also have to have enough memory to keep everything in memory all the time. So, you can’t keep everything in it forever, because it’s also not horizontally scalable (yet).

You need to come up with a plan for how you’re going to expire things out of Redis and either forget them forever or move them somewhere less volatile and less “expensive” (disk is cheaper than RAM). You could do it by activity, by time, whatever, just have a plan and have it built into the code so it can be run if you need to free up memory to make Redis happy again.
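As one example of a plan, here is a minimal Python sketch of expiring by activity: given when each key was last used, pick the ones that have been idle too long. The function name and the key names are made up, and how you track last activity is up to your app. The sketch only chooses the keys; deleting or archiving them is left to you.

```python
def keys_to_expire(last_activity, max_idle_seconds, now):
    """Return keys whose last activity is older than max_idle_seconds.

    last_activity maps key name -> Unix timestamp of last use. How
    that timestamp gets recorded is an assumption left to your app.
    """
    cutoff = now - max_idle_seconds
    return sorted(key for key, seen in last_activity.items() if seen < cutoff)


# Made-up keys and timestamps for illustration.
activity = {
    "session:alice": 9500,
    "session:bob": 1000,
    "session:carol": 4000,
}
# Anything idle for more than an hour (3600 seconds) as of t=10000.
print(keys_to_expire(activity, 3600, now=10000))
```

If expiring by time is enough for your data, Redis can also do it for you: the EXPIRE command sets a time-to-live on a key.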

Because Redis will get sad. It just will.

In Conclusion

Redis is a great in-memory database. It’s extremely flexible, developer-friendly and easy to get started with. The key with any piece of software is to use it correctly, know what it looks like when it’s about to fail, and know how to keep it well-fed and happy. I could have written (and probably will write) this blog post about any of the software our customers use, because we’ve had problems with all of them. I hope this post saves you some time and makes you a happier and healthier Redis user!