What if Redis Stops Working - How Do I Keep My App Running

An interesting question came up on the CacheManager repository in issue 146: What do I do if the Redis server dies and might be offline for a while? Can we just continue working with the in-memory cache and wait for Redis to come back online? And if so, how would I do that with CacheManager?

This issue and its possible solutions are really interesting to look at because they are a great showcase for a very common problem in distributed systems and microservices: you have to rely on services you do not control yourself, and those services might be unavailable from time to time...

Let's Just Disable the Redis Cache Layer

The original intent of the reported issue was that the user wanted to temporarily disable the Redis cache layer while the Redis server is down.

With the StackExchange.Redis multiplexer, it is easy to listen to two events, ConnectionFailed and ConnectionRestored, so I could disable all Redis-related things for the duration of an outage.

But CacheManager doesn't allow you to change the cache layers after it has been instantiated, and there is also no way to remove layers later on. Is that something I should add? I have hesitated to do so for a while, for reasons we will get to. Let's see what happens if we implement something similar outside of CacheManager.

Solution? Manage Two Cache Instances

My basic idea was to have two separate cache instances configured: one with Redis and the other with in-memory storage only. Whenever Redis dies, we switch to the in-memory fallback. When the connection to Redis comes back online, we switch back...

Here is a small test program which does exactly that:

using System;
using System.Threading;
using CacheManager.Core;
using StackExchange.Redis;

internal class Program
{
    private static void Main(string[] args)
    {
        var cacheKeeper = new CacheKeeper<int>();

        while (true)
        {
            var value = cacheKeeper.Cache.AddOrUpdate("key", 1, (v) => v + 1);
            Console.WriteLine("Current value is " + value);
            Thread.Sleep(500);
        }
    }

    public class CacheKeeper<T>
    {
        private readonly ICacheManager<T> _distributed;
        private readonly ICacheManager<T> _inMemory;
        // volatile, because the multiplexer raises the connection events
        // on a different thread than the one reading this flag.
        private volatile bool _distributedEnabled = true;

        public CacheKeeper()
        {
            var multiplexer = ConnectionMultiplexer.Connect("localhost");

            multiplexer.ConnectionFailed += (sender, args) =>
            {
                _distributedEnabled = false;

                Console.WriteLine("Connection failed, disabling redis...");
            };

            multiplexer.ConnectionRestored += (sender, args) =>
            {
                _distributedEnabled = true;

                Console.WriteLine("Connection restored, redis is back...");
            };

            _distributed = CacheFactory.Build<T>(
                s => s
                    .WithJsonSerializer()
                    .WithDictionaryHandle()
                        .WithExpiration(ExpirationMode.Absolute, TimeSpan.FromSeconds(5))
                    .And
                    .WithRedisConfiguration("redis", multiplexer)
                    .WithRedisCacheHandle("redis"));

            _inMemory = CacheFactory.Build<T>(
                s => s
                    .WithDictionaryHandle()
                        .WithExpiration(ExpirationMode.Sliding, TimeSpan.FromSeconds(5)));
        }

        public ICacheManager<T> Cache
        {
            get
            {
                if (_distributedEnabled)
                {
                    return _distributed;
                }

                return _inMemory;
            }
        }
    }
}

While running this, you should periodically kill the Redis server and, shortly after, start it again to force the events to trigger. After a while you should see console messages like this:

Current value is 1
Current value is 2
Current value is 3
Connection failed, disabling redis...
Current value is 1
Current value is 2
Current value is 3
Current value is 4
Current value is 5
Connection restored, redis is back...
Current value is 1
Current value is 2
Current value is 3
Current value is 4
Current value is 5

Yay, seems to work, right?

*But wait, we have a counter; isn't that counter supposed to continue counting? Instead, it starts over and over...*

Well, if you think about it... we just switched the storage of the counter, and the two caches do not know about each other.

What if we would share the in-memory instance?

Yes, at least when Redis goes down, it would still continue to increment the same counter. But what should happen when Redis comes back up? You would have to sync that data back to Redis somehow...
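To make that concrete, here is a rough sketch of what such a sync-back could look like inside the CacheKeeper from above. Note that ICacheManager&lt;T&gt; does not let you enumerate keys, so the fallback would have to track every key it touched by hand; the _keysTouched set and SyncBackToRedis are hypothetical helpers I made up for this sketch, not CacheManager APIs.

```csharp
// Hypothetical sketch, not a CacheManager feature: replay everything
// that was written to the in-memory fallback back into Redis once the
// ConnectionRestored event fires.
private readonly ConcurrentDictionary<string, byte> _keysTouched
    = new ConcurrentDictionary<string, byte>();

private void SyncBackToRedis()
{
    foreach (var key in _keysTouched.Keys)
    {
        // Get returns default(T) if the in-memory entry already expired,
        // so we might push back "empty" values here.
        var localValue = _inMemory.Get(key);

        // Last writer wins: another app instance running the same replay
        // can silently overwrite this value in Redis, which is exactly
        // the divergence problem described below.
        _distributed.Put(key, localValue);
    }

    _keysTouched.Clear();
}
```

Even this naive version already has to answer hard questions: who wins when two app instances replay different values for the same key, and what about entries that expired locally in the meantime?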

I also did not configure Redis to persist the cached data, meaning that when Redis comes back online, it is empty, too. If I had configured Redis to persist the data, the counter's value would eventually be the value it had before Redis died.

But if you continue to count using the in-memory storage, each instance of the app could have a totally different value, as they all work detached from Redis. And now it gets really difficult to synchronize the data back to Redis...

What happens if that sync back to Redis fails, or your app stops working in the meantime?

The data would be lost.


Long story long, this gets really messy really quickly!

Conclusions

You can implement some fallback cache mechanism for those occasions where Redis is not available but you still want to keep your app running properly. But this simple example already illustrates a lot of issues you might run into pretty quickly when you try to continue working with the data "normally".

If you have to rely on the cached data and that data changes state, you have to deal with stale data and handle synchronizing it back and forth with Redis. This gets really complicated really quickly! And it is questionable whether it is actually worth the effort to implement such a complex system, because you can set up Redis with failover or clustering, which should give you really good uptime!

And that's why I would say you should not try to do something like this, ever!

Distributed systems should still handle problems like this gracefully, though. If you, for example, rely on Redis as a persistent store, you have to rely on Redis being available. When Redis goes down, you have to deal with it, the same as when your file system is gone or your SQL Server is down.

A lot of those systems disable all writes and put their cluster into a maintenance state. Then you can display a message to the user to let them know that something bad is going on... ;)

I think the only reasonable use case would be to fall back to a read-only in-memory cache. You would still have access to the (possibly stale) data and could at least show something to the user.
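A minimal sketch of that idea, building on the two-instance setup from above. The ReadOnlyCache wrapper is hypothetical and not a CacheManager feature; it only covers Get and Put here, while a real implementation would have to cover the full ICacheManager&lt;T&gt; surface (AddOrUpdate, Remove, and so on).

```csharp
// Hypothetical sketch: during an outage, hand out a wrapper that only
// serves reads from the in-memory copy and rejects all writes,
// effectively putting the cache into maintenance mode.
public class ReadOnlyCache<T>
{
    private readonly ICacheManager<T> _inner;

    public ReadOnlyCache(ICacheManager<T> inner)
    {
        _inner = inner;
    }

    // Reads pass through to the (possibly stale) in-memory data.
    public T Get(string key) => _inner.Get(key);

    // Writes are rejected while the cache is in maintenance mode;
    // callers have to be prepared to handle this.
    public void Put(string key, T value) =>
        throw new InvalidOperationException(
            "Cache is in read-only maintenance mode, writes are disabled.");
}
```

The caller then has to decide what a rejected write means for the user, for example showing the maintenance message mentioned above instead of failing silently.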

That being said, CacheManager doesn't have a flag to disable writes, and you also cannot share in-memory cache instances between CacheManager instances. This sounds like it could be a reasonable new feature to support things like maintenance windows.

What are your thoughts? Let me know in the comments!