So, your team has worked hard on designing a high-performance software system. One of the key components is a thread or service designed to buffer and protect the application from slow operations by caching the responses of lookups to slower or more distant systems (frequently used credential or permission data coming from database servers, for example).

On the test-bench, everything looks good and then the code goes into production. Everything scales nicely up to peak loads, and then the whole system begins to stall and stutter. Performance ratchets in a staccato sinusoid, working just fine here, and coming to an almost complete halt there. After a while, as overall load falls and end-users become frustrated, it all sorts itself out and runs smoothly, until just after the next peak load.

You direct the team to investigate, to profile code, to monitor logs and performance. Nobody can find a link. Your engineers come up with plans to optimize the cache service and eke every microsecond of performance out of it.

Mysteriously, the problem just gets worse, not better, despite thousands of hours and weeks or months of investigation and tuning. Someone discovers that disabling hyperthreading on the server mitigates the problem significantly, but nobody’s closer to a solution.

Congratulations, you’ve hit a common, but very rarely understood problem in high-performance systems design. It’s bitten almost every high-performance systems shop, yet almost nobody has truly solved the problem, because hardly anyone has understood the true cause at the time. Most engineers end up working around it, because they never quite know where to look.

Let’s save the day and tell your team where to look. It might not be the source of your particular problem, but ruling it out early can save you tens of thousands of dollars in development, and far more than that in frustrated customers.

Your problem revolves around the modern adaptive scheduler, and the fact that it is doing what it is designed to do, and not what your team are expecting it to do.

Here’s one possible configuration:

    Application ⇄ Cache ⇄ Database

This is a pass-through configuration. You might instead have the cache as a separate lookup out of the direct chain of communication with the database, but the problem can affect that configuration as well (even if the cache entity just serves as a separate thread or process operating as an abstraction layer).

The pass-through configuration is one of the most common, as the cache thread/process buffers and abstracts database access, giving you more versatility and hiding implementation details from an application that shouldn’t need to get its hands any dirtier than necessary. It’s also something you can bolt into an existing design fairly easily.

All in all, this sort of design looks eminently sensible and elegant. The application makes requests, a separate thread or process (the cache box in the diagram above) takes care of things as asynchronously as the application allows, returns cache-hits (if it is caching) and otherwise abstracts access to the data-source (or database, in this diagram, which might represent multiple databases on a number of remote and potentially distant servers).
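As a concrete sketch of that pass-through shape, here is a minimal version in Python. The class name, the `db_lookup` callback, and the per-request reply queue are illustrative assumptions, not details from any real design; the point is simply that one middle thread answers hits locally and falls through to the slow backend on misses.

```python
import queue
import threading

class PassThroughCache:
    """Hypothetical pass-through cache thread: the application submits keys,
    the worker answers hits from memory and falls through to the database
    (here, any slow `db_lookup` callable) on misses."""

    def __init__(self, db_lookup):
        self._db_lookup = db_lookup          # slow backend call (assumption)
        self._cache = {}
        self._requests = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def get(self, key):
        """Synchronous from the application's point of view."""
        reply = queue.Queue(maxsize=1)
        self._requests.put((key, reply))
        return reply.get()

    def _run(self):
        while True:
            # Blocks when idle -- at low load this thread is a "good citizen".
            key, reply = self._requests.get()
            if key not in self._cache:
                self._cache[key] = self._db_lookup(key)
            reply.put(self._cache[key])
```

Note that the worker blocks on its request queue whenever it is idle, which matters for the scheduler behaviour discussed below: the trouble only starts once it stops blocking.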

Despite the simplicity and elegance, the problem starts to appear. Everything scales smoothly to peak loads, and then performance stalls, then suddenly resumes, then stalls again, with no apparent pattern until the load finally comes off significantly.

Your engineers try to profile things, and everything points to inefficiency in the cache-component. Hundreds or thousands of hours are burned tuning and tweaking and improving its efficiency. Perhaps even adding more intermediate components.

The counter-intuitive catch is that the more efficient the middle component(s) becomes, the worse the problem gets. Why?

Because the component is CPU-bound. Think about that for a moment.

Your devs are caching all of this lovely data for speed, eliminating everything they can that might block throughput. By extension, that means everything that might cause the process (or thread) itself to block and yield its timeslice early.

So, the component runs to the end of its timeslice often, not yielding back to the scheduler. As you achieve peak loads, it is doing that nearly constantly.

As far as the scheduler is concerned, that makes it a bad citizen.

And what does the scheduler do with bad citizens? It reduces their effective priority, giving them fewer and fewer of the available timeslices, and instead handing those timeslices to good citizens, like your database or your application.

The application is a good citizen, but it’s now starving, because it can’t get the darn data that it needs. In the scheduler’s eye, though, the application is being very well-behaved about CPU resources and is being allocated them aplenty – it just cannot use them. The components at the ends of the data-transfer pipelines get more cycles than they can use, because they cannot elicit timely responses from the piece(s) in the middle.

The cache-component, on the other hand, is desperate for cycles, and as soon as it gets a timeslice (usually because the application has ultimately blocked waiting for data that just wasn’t arriving) it tries to get through the backlog of work, shovelling data back to the application as fast as it can, and overrunning its timeslices again, reinforcing the scheduler’s opinion that it is naughty and that punishment needs to continue. A burst of work happens, the application gets things done, and things stall again. And again. And again, ad nauseam.

Eventually, the overall load on the system falls. Connections time out. Users log off. The overall system stabilizes as the cache-component’s work finally slows and it starts running out of work more often than it runs out of timeslices. The scheduler sees that it is behaving better and its effective priority is returned to normal slowly.

The problem goes away. Until tomorrow’s peak load. Or the evening’s.

Meanwhile, your engineering team is losing hair and sleep and developing ulcers, because all the obvious signs point to delays or blocking in the cache-component, and they’re crawling all over the code to eliminate those, and after each hard slog the problem just seems worse than it was.

Even increasing the number of logical or physical processors and processor cycles only seems to serve as a temporary balm, allowing the system to scale a little further at the expense of even more severe performance stuttering.

This is basic resource starvation (specifically scheduling starvation), akin to the dining philosophers problem. The trouble is that it’s hard to recognize it as such: the dining philosophers are usually used to illustrate mutual-exclusion problems, so few people ever think to look in the right place. This is essentially a queuing problem, not a multithreaded deadlock.

So, what’s the solution here?

There are several, actually. One is to force the middle components to yield to the scheduler between discrete operations.
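That yield can be sketched as follows. The batch size and queue shape are assumptions for illustration; the essential part is the explicit, voluntary yield between bounded chunks of work (`os.sched_yield` on POSIX systems, with `time.sleep(0)` as a portable fallback hint).

```python
import os
import time

def drain_batch(backlog, budget=64):
    """Process at most `budget` queued items, then voluntarily yield the CPU.

    A component that yields between discrete operations looks like a good
    citizen to the scheduler, so its effective priority is not demoted."""
    done = 0
    while backlog and done < budget:
        item = backlog.pop(0)
        # ... perform the cache lookup / fill for `item` here ...
        done += 1
    # Hand the timeslice back before the scheduler has to take it from us.
    if hasattr(os, "sched_yield"):
        os.sched_yield()          # POSIX voluntary yield
    else:
        time.sleep(0)             # portable yield hint
    return done
```

The trade-off is a little extra latency per batch in exchange for the scheduler never classifying the component as a CPU hog.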

Another is to set up processor affinity and use round-robin cooperative scheduling, though manually yielding becomes more important than ever, because if you never yield, no other process ever gets a turn on that processing core.
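Setting the affinity side of that up might look like the sketch below, assuming a Linux-style `os.sched_setaffinity`; the core number is an assumption about your particular topology, and non-Linux platforms will need their own mechanism.

```python
import os

def pin_to_core(core):
    """Restrict the calling process to a single CPU core, so the middle
    component no longer competes with the application for scheduling
    decisions on shared cores. Linux-only: os.sched_setaffinity is not
    available on every platform."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {core})   # 0 = the calling process
        return os.sched_getaffinity(0) == {core}
    return False                          # not supported on this OS
```

Remember the caveat above: once a component owns a core under cooperative round-robin scheduling, it must still yield explicitly, or nothing else ever runs there.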

There are doubtless other relatively simple solutions depending on your exact needs and configurations, many of which become obvious once you actually understand what the problem is. You can leave those to your engineers to discover, once they’re aware of the true cause of your application performance problems.

A problem understood is a problem half solved.

