For months, it was always the same web server in the cluster that struggled with high load. We kept wondering whether our load balancer was distributing the load unfairly, but found nothing. The server had higher CPU and memory usage than the others, but did not serve more HTTP requests and did not have more I/O.
My investigation of the memory usage was entangled with other problems. For example, Linux's memory accounting has some quirks and sometimes attributes memory allocations to the wrong process. I found a way to reset the memory accounting; this also freed a good amount of memory. At first, it looked like an acceptable workaround, but the problem returned quickly, and my trick ended up freeing less memory each time I used it.
Eventually, the problem became bad enough to threaten the stability of our cluster; sometimes, the server ceased to work, leading to mass website outages. Only a reboot could fix it. Turning off that one server completely would only shift the problem to another server, which spoiled our plans to engage an exorcist: the server wasn't cursed, it really was a systematic technical problem in search of a mundane solution.
The first breakthrough was a closer look at the Linux kernel's memory allocator statistics. They showed a huge spike in 16- and 32-byte allocations, holding gigabytes of memory. This smelled like a memory leak, so I turned on the leak detector (KMEMLEAK). Which reported: nothing. No memory leaks.
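To give an idea of what such a look at the allocator statistics can involve, here is a minimal sketch in Python. It assumes the statistics come from `/proc/slabinfo` (format version 2.1), which is one common source and not necessarily the one I used; the helper names `parse_slabinfo` and `top_caches` are hypothetical, and the sample numbers are made up for illustration.

```python
from typing import List, NamedTuple


class SlabCache(NamedTuple):
    name: str
    active_objs: int
    num_objs: int
    objsize: int  # bytes per object

    @property
    def bytes_allocated(self) -> int:
        # Approximate footprint: allocated objects times object size
        # (ignores per-slab overhead).
        return self.num_objs * self.objsize


def parse_slabinfo(text: str) -> List[SlabCache]:
    """Parse the text of /proc/slabinfo (assumed format version 2.1)."""
    caches = []
    for line in text.splitlines():
        # Skip the version line, the column header and blank lines.
        if not line.strip() or line.startswith(("slabinfo", "#")):
            continue
        fields = line.split()
        caches.append(
            SlabCache(fields[0], int(fields[1]), int(fields[2]), int(fields[3]))
        )
    return caches


def top_caches(text: str, n: int = 5) -> List[SlabCache]:
    """Return the n caches holding the most memory, largest first."""
    ranked = sorted(parse_slabinfo(text), key=lambda c: c.bytes_allocated, reverse=True)
    return ranked[:n]


# Made-up sample in the slabinfo layout; not real measurements.
SAMPLE = """\
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
dentry 500000 500000 192 21 1 : tunables 0 0 0 : slabdata 23810 23810 0
kmalloc-32 100000000 100000000 32 128 1 : tunables 0 0 0 : slabdata 781250 781250 0
kmalloc-16 50000000 50000000 16 256 1 : tunables 0 0 0 : slabdata 195313 195313 0
"""

if __name__ == "__main__":
    for cache in top_caches(SAMPLE, 3):
        print(f"{cache.name:12} {cache.bytes_allocated / 2**20:10.1f} MiB")
```

On a live machine, the text would come from reading `/proc/slabinfo`, which usually requires root. A ranking dominated by the small `kmalloc-*` caches is the kind of pattern described above.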
An even closer look inside the memory allocator revealed that most of those allocations were from the NFS client, caching file names. Clearing the cache, however, did not free them. This had to be a leak. After spending a few days reading the kernel source code, I finally had a eureka moment: in January, the kernel developers accepted a change which optimized NFS directory listings by adding more file names to the cache, but it had a bug which sometimes leaked memory. KMEMLEAK was unable to find this leak because the filesystem cache is excluded from KMEMLEAK, making it blind to bugs like this.
After I reverted the optimization patch, the problem was gone for good. I sent my fix to the Linux kernel developers, who merged it into all affected branches.
And why did this affect only one web server? I don't know for sure. It's probably our hash ring algorithm, which always assigns that one website with hundreds of thousands of files to the "cursed" web server, which therefore accumulates the leaks more quickly. Eventually, it will affect all servers, but much more slowly.
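For readers unfamiliar with hash rings, here is a minimal sketch in Python of how such a scheme pins a site to one server and moves it to a different one when that server is removed. This is a generic consistent-hashing illustration, not our production algorithm; the class `HashRing`, the server names, the site name and the choice of SHA-256 with 64 virtual nodes are all assumptions made for the example.

```python
import bisect
import hashlib
from typing import List


def _hash(key: str) -> int:
    # Stable 64-bit hash; SHA-256 is an arbitrary choice for this sketch.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Generic consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, servers: List[str], vnodes: int = 64) -> None:
        # Each server is placed on the ring at `vnodes` pseudo-random points.
        self._ring = sorted(
            (_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def server_for(self, site: str) -> str:
        # A site belongs to the first ring point after its own hash,
        # wrapping around at the end of the ring.
        idx = bisect.bisect_right(self._points, _hash(site)) % len(self._ring)
        return self._ring[idx][1]


if __name__ == "__main__":
    servers = ["web1", "web2", "web3", "web4"]  # hypothetical names
    site = "bigsite.example"                    # hypothetical site

    owner = HashRing(servers).server_for(site)
    print("owner:", owner)

    # Taking the owner out of the cluster just moves the site elsewhere.
    remaining = [s for s in servers if s != owner]
    print("after removal:", HashRing(remaining).server_for(site))
```

The mapping depends only on the site name and the set of servers, so the same site lands on the same server every time, and removing that server hands the site (and its workload) to a neighbour on the ring.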