Featuring...

So, your team has worked hard on designing a high-performance software system. One of the key components is a thread or service designed to buffer and protect the application from slow operations by caching the response of lookups from slower or more distant systems (like frequently used credential data or permissions, for example that might be coming from database servers).

On the test-bench, everything looks good and then the code goes into production. Everything scales nicely up to peak loads, and then the whole system begins to stall and stutter. Performance ratchets in a staccato sinusoid, working just fine here, and coming to an almost complete halt there. After a while, as overall load falls and end-users become frustrated, it all sorts itself out and runs smoothly, until just after the next peak load.

You direct the team to investigate, to profile code, to monitor logs and performance. Nobody can find a link. Your engineers come up with plans to optimize the cache service and eke every microsecond of performance out of it.

Mysteriously, the problem just gets worse, not better despite thousands of hours and weeks or months of investigation, and tuning. Someone discovers that disabling hyperthreading on the server mitigates the problem significantly, but nobody’s closer to a solution.

Congratulations, you’ve hit a common, but very rarely understood problem in high-performance systems design. It’s bitten almost every high-performance systems shop, yet almost nobody has truly solved the problem, because hardly anyone has understood the true cause at the time. Most engineers end up working around it, because they never quite know where to look.

Let’s save the day and tell your team where to look. It might not be the source of your particular problem, but ruling it out early can save you tens of thousands of dollars in development, and far more than that in frustrated customers.

Your problem revolves around the modern adaptive scheduler, and the fact that it is doing what it is designed to do, and not what your team are expecting it to do.

Here’s one possible configuration:

acd1

This is a pass-through configuration. You might instead have the cache as a separate lookup out of the direct chain of communication with the database, but the problem can affect that configuration as well (Even if the cache entity just serves as a separate thread or process operating as an abstraction layer).

The pass-through configuration is one of the most common, as the cache thread/process buffers and abstracts database access, allowing you more versatility and hiding implementation details from an application that really shouldn’t need to get its hands any dirtier than they need to be. It’s also something you can bolt into an existing design fairly easily.

All in all, this sort of design looks eminently sensible and elegant. The application makes requests, a separate thread or process (the cache box in the diagram above) takes care of things as asynchronously as the application allows, returns cache-hits (if it is caching) and otherwise abstracts access to the data-source (or database, in this diagram, which might represent multiple databases on a number of remote and potentially distant servers).

Despite the simplicity and elegance, the problem starts to appear. Everything scales smoothly to peak loads, and then performance stalls, then suddenly resumes, then stalls again, with no apparent pattern until the load finally comes off significantly.

Your engineers try to profile things, and everything points to inefficiency in the cache-component. Hundreds or thousands of hours are burned tuning and tweaking and improving its efficiency. Perhaps even adding more intermediate components.

The counter-intuitive catch is that the more efficient the middle component(s) becomes, the worse the problem gets. Why?

Because the component is CPU-bound. Think about that for a moment.

Your devs are caching all of this lovely data for speed, eliminating everything from it that they can that might block throughput. By extension, that means everything that might cause the process (or thread) itself to block and yield its timeslice early.

So, the component runs to the end of its timeslice often, not yielding back to the scheduler. As you achieve peak loads, it is doing that nearly constantly.

As far as the scheduler is concerned, that makes it a bad citizen.

And what does the scheduler do with bad citizens? It reduces their effective priority, giving them fewer and fewer of the available timeslices, and instead handing those timeslices to good citizens, like your database or your application.

The application is a good citizen, but it’s now starving, because it can’t get the darn data that it needs. In the scheduler’s eye, though, the application is being very well-behaved about CPU resources and is being allocated them aplenty – it just cannot use them. The components at the ends of the data-transfer pipelines get more cycles than they can use, because they cannot elicit timely responses from the piece(s) in the middle.

The cache-component, on the other hand, is desperate for cycles, and as soon as it gets a timeslice (usually because the application has ultimately blocked waiting for data that just wasn’t arriving) it tries to get through the backlog of work, shovelling data back to the application as fast as it can, and overrunning its timeslices again, reinforcing the opinion of the scheduler that it is naughty and that punishment needs to continue. A burst of work happens, the application gets things done and things stall again. And again. And again, ad nauseum.

Eventually, the overall load on the system falls. Connections time out. Users log off. The overall system stabilizes as the cache-component’s work finally slows and it starts running out of work more often than it runs out of timeslices. The scheduler sees that it is behaving better and its effective priority is returned to normal slowly.

The problem goes away. Until tomorrow’s peak load. Or the evening’s.

Meanwhile, your engineering team is losing hair and sleep and developing ulcers, because all the obvious signs point to delays or blocking in the cache-component, and they’re crawling all over the code to eliminate those, and after each hard slog the problem just seems worse than it was.

Even increasing the number of logical or physical processors and processor cycles only seems to serve as a temporary balm, allowing the system to scale a little further at the expense of even more severe performance stuttering.

This is basic resource starvation (specifically scheduling starvation), akin to the dining philosophers problem. The problem is that it’s hard to see that that is what it is, and the dining philosophers are usually used to illustrate mutual exclusion problems, so few people ever think to look in the right place, since this is essentially a queuing problem, not a multithread-deadlock.

So, what’s the solution here?

There are several, actually. One is to force the middle components to yield to the scheduler between discrete operations.

Another is to set up processor affinity and use round-robin cooperative scheduling, though manually yielding becomes more important than ever, because if you never yield, no other process ever gets a turn on that processing core.

There are doubtless other relatively simple solutions depending on your exact needs and configurations, many of which become obvious once you actually understand what the problem is. You can leave those to your engineers to discover, once they’re aware of the true cause of your application performance problems.

A problem understood is a problem half solved.

  • Share/Bookmark
Categories: How To, Software, Technology.

Possibly related posts

Performance that doesn’t perform, Cool tools – Game Booster, Snowglobe and traffic flooding, Third-party Second Life viewers, Power-supply tips

8 Responses to “When high-performance programming goes bad – scheduler starvation”

  1. Nightbird Glineux says:

    Naughty code that must be punished. OK. *grin*

  2. That was fun. For a second I thought I had accidentally clicked through to an nginx forum *smiles*

  3. Tateru Nino says:

    It’s a little out of my usual line, but I used to manage development teams and even did a stint as a CTO for a while. May as well share some of the fruits.

  4. That’s a good thing, I enjoyed the post (by coincidence we’re in the middle of a cache tuning/cdn implementation exercise :)

  5. Thoria says:

    Excellent, thank you! That’s worth keeping in mind, and is also a good illustration of unintended consequences. One wonders if a recent communication from the technical side of Linden Lab about hyper-threading performance problems is dealing with a situation similar to this.

  6. Nightbird Glineux says:

    Question: Does the scheduler care if it gives cycles to the application (or the database) that it can’t use? In other words, would the scheduler prefer for the CPU be idle than give cycles to the naughty *giggle* cache manager?

  7. Tateru Nino says:

    It doesn’t really care. Higher priority applications get more cycles – ‘badly behaved’ (which can include ‘working as designed’) processes tend to get pushed down to the point where they only get idle cycles, if they’re consistently running to the end of their allocated timeslices.

    That’s what generates the staccato behaviour – the system starves, causing the whole pipeline to stop, the ends eventually block, leaving idle cycles, which then bring the pipeline to life. All the parts start working again, and then scheduler starvation stalls the components in the middle again, bringing everything to an eventual halt. Rinse and repeat.

  8. CaptainCrunch says:

    A very similar problem appears when you have multiple execution-contexts that have to share signalling or state data.

    If you have the option, changing execution priority can help a lot, as can moving to a different scheduling engine. Using a Real-Time scheduler and flagging the cache-process as such will do a lot for maintaining a steady maximum throughput.

    To be frank, I have never have much trusted fair-share style scheduling systems – while they work well for the common workloads, they really only work well for the common workloads.

    Give them a workload that has a feeder component, or one that is heavily DMA and/or IO bound and things get strange.

    Additionally, most of the development and modelling that has been done for Scheduler performance is based on Multiple-Process workloads, not on the newer Multiple-Thread environment.

    The tuning paramaters (and the associated bottlenecks and scheduler (and dont forget the memory manager) prioritisation quirks) change subtly (but significantly) when you are looking at using inter-process communication (pipes, sockets & the like) rather than intra-process communication (shared memory regions, mutex-locks, etc).

    Part of this will be to do with Process vs. Process prioritisation as opposed to Process vs. (Thread vs. Thread vs. Thread) prioritisation.

    Marshall K McKusick has some very interesting things to say about thread scheduling in his ACMpresentation from 2004.

    MKK’s scheduling work has been the basis for that done by many other people. In particular, he has the following to say about how the legacy queuing systems work:

    The system tailors this short-term scheduling algorithm to favor interactive
    jobs by raising the scheduling priority of threads that are blocked waiting for
    I/O for one or more seconds and by lowering the priority of threads that
    accumulate significant amounts of CPU time.

    Your post is wonderfully composed, and does a very good job of exposing a rarely approached portion of the system to a much wider audience – thankyou.


Your Ad Here

Leave a Reply

Commenters are to be civil, courteous and respectful to others, insofar as it is possible to do so. Beyond that, you're not required to agree with the opinions expressed by me or by others. Think for yourselves!
First time commenters will wind-up in the moderation queue and your comment won't appear right away. Ditto for anything that gets flagged by the anti-spam rules.

  • ipv6-test.com
  • Support us

    Writing is my day job. Site advertising pays for the hosting, but nothing else. Help keep us in coffee and keyboards

    ... or donate in Second Life at this location.

  • NCI - free education and information for new Second Life users