Date: | March 14, 2006 / year-entry #93 |
Tags: | code |
Orig Link: | https://blogs.msdn.microsoft.com/oldnewthing/20060314-00/?p=31933 |
Comments: | 23 |
Summary: | At the PDC, one person came to talk to me for advice on a resource management problem they were having. To simplify, their system generated dozens of work items, each of which required significant resource consumption. For the sake of illustration, let's say that each of the work items was a single-threaded computationally-intensive operation that... |
At the PDC, one person came to talk to me for advice on a resource management problem they were having. To simplify, their system generated dozens of work items, each of which required significant resource consumption. For the sake of illustration, let's say that each of the work items was a single-threaded, computationally-intensive operation that required 180MB of memory. The target machine was, say, a quad-processor machine with 1GB of RAM. (This was a standalone system, so it could assume that no other programs were running on the computer.)

Their first design involved creating a thread for each work item and letting them all fight it out for resources. This didn't work out so great, because all of the work items would be fighting over the four CPUs and requiring several times the available system memory, thrashing both the scheduler (more runnable threads than CPUs) and the memory manager (working set larger than available memory). The result was horrible.

The second design was to serialize the work items: run one work item, then when it completes, run the next work item, and so on until all the work items were complete. This ran much better because only one work item was active at a time, so it could monopolize the CPU and memory without interference from other work items. However, this didn't scale, because the performance of the program didn't improve after adding processors or memory. Since only one work item ran at a time, all of the extra CPUs and memory simply sat idle.

The solution to this problem is to make sure you maximize one of the limiting resources. Here, we have two limiting resources: CPU and memory. Since each work item required an entire CPU, running more work items simultaneously than available CPUs would result in scheduler thrashing, so the first cap on the number of work items was determined by the number of CPUs. Next, since each work item required 180MB of memory, you could run five of them in a 1GB machine before you started thrashing the memory manager. Therefore, this work item list will saturate the CPUs first, at four work items.

Once you figure out how many work items you can run at once, it's a simple matter of running only that many. A master scheduler thread maintained a list of work to be done and fired off the first four, then waited for them to complete, using the WaitForMultipleObjects function, firing off a new work item as each one finished.

In real life, some of the work items were really child processes, and there were dependencies among the work items, but those complications could be accommodated in the basic design.
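Here is a minimal sketch of that master-scheduler loop, assuming each work item can be represented by a placeholder DoWorkItem thread routine and hard-coding the figures from the example above (4 CPUs, 180MB per item, 1GB of RAM):

```cpp
#include <windows.h>
#include <vector>

// Placeholder for the real single-threaded, 180MB, CPU-bound computation.
DWORD WINAPI DoWorkItem(void* param)
{
    return 0;
}

void RunWorkItems(const std::vector<void*>& items)
{
    const DWORD cpuCap = 4;                      // one work item per CPU
    const DWORD memCap = 1024 / 180;             // five 180MB items fit in 1GB
    const DWORD maxActive = min(cpuCap, memCap); // CPUs saturate first: 4

    std::vector<HANDLE> active;
    size_t next = 0;

    while (next < items.size() || !active.empty()) {
        // Keep the machine full: top up to the cap while work remains.
        while (next < items.size() && active.size() < maxActive) {
            HANDLE h = CreateThread(nullptr, 0, DoWorkItem,
                                    items[next++], 0, nullptr);
            if (h) active.push_back(h);
        }
        if (active.empty()) break; // nothing could be started

        // Wait for any one running item to finish, then recycle its slot.
        DWORD which = WaitForMultipleObjects(
            static_cast<DWORD>(active.size()), active.data(), FALSE, INFINITE);
        size_t done = which - WAIT_OBJECT_0;
        if (done < active.size()) {
            CloseHandle(active[done]);
            active.erase(active.begin() + done);
        }
    }
}
```

The key design point is the bWaitAll = FALSE wait: the scheduler wakes up as soon as any one item finishes, so a new item can be started immediately and the CPUs never sit idle while work remains.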
Comments (23)
But this solution still doesn’t scale up when adding CPUs. ;-)
(The fix would be fairly simple: Figure out the number of "open" CPUs and amount of available memory at runtime, then figure out how many tasks can run in your available memory, and finally limit the total number of running tasks to min(NR_CPUs, NR_tasks_from_mem).)
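A rough sketch of that runtime sizing might look like the following; the 180MB-per-item figure comes from the post, and the ComputeTaskLimit name is just for illustration:

```cpp
#include <windows.h>

DWORD ComputeTaskLimit()
{
    // NR_CPUs: how many processors the system reports at runtime.
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    DWORD nrCpus = si.dwNumberOfProcessors;

    // NR_tasks_from_mem: how many ~180MB work items fit in available memory.
    MEMORYSTATUSEX ms = { sizeof(ms) };
    GlobalMemoryStatusEx(&ms);
    DWORDLONG perItem = 180ull * 1024 * 1024;
    DWORDLONG nrTasksFromMem = ms.ullAvailPhys / perItem;

    // min(NR_CPUs, NR_tasks_from_mem), but never less than one task.
    DWORDLONG limit = nrTasksFromMem < nrCpus ? nrTasksFromMem : nrCpus;
    return limit ? static_cast<DWORD>(limit) : 1;
}
```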
BryanK:
Raymond already mentioned your fix. In his example case, he found your NR_CPUs was 4 and NR_tasks_from_mem was 5. 4 was the lower of the two, so the example limited it to 4 work item threads at a time.
It’s also possible to find the sweet spot experimentally. I was on a project that involved writing a LOT of data to an (Oracle) SQL server. We knew that the server had multiple CPUs, but didn’t know how many were available. (Other things affected performance too, such as network and disk bandwidth.) We found our sweet spot writing 10,000 records between commits, using 7 threads.
Why not use a counting semaphore for this? That’s how I solved a similar problem without requiring a central resource allocation thread.
BryanK: Urgh.
Figuring it out at runtime and then using all available resources is OK in this situation when, as Raymond mentioned, it’s a standalone system.
As a general solution, that’s horrible.
If it’s running on any kind of serious server with multiple services running (and your most powerful servers, which are the ones you’ll want to use for this sort of job, will probably be so tasked) then figuring out how parallel to be should probably be a choice for the user or admin.
I may not want the process sucking up all 4 cpus or all the physical RAM in the machine if there’s other stuff that I’d like to run with "reasonable" responsiveness. Especially if there are other large jobs that other users may be running at the same time.
Also, on a HT machine, trying to run 4 threads simultaneously might give you worse performance than running 2 at a time due to shared cpu caches (especially if the threads each have a 180MB working set!)
On a virtualised OS, things could be even more weird with regards to the resources the VM actually has (and what other things are competing for them) and how much it reports it has.
There’s a reason that "make" takes a "-j n" option for parallelisability instead of figuring it out itself, and it’s not just because it’s easier to write that way!
:)
Raymond, I’m surprised you’re not recommending NT’s thread pool as the solution.
A thread pool is not a solution; it is one possible implementation of a solution. The problem here is to figure out how to efficiently divide up the work at a relatively high level, not the nitty-gritty about whether to use thread pools or semaphores or whatever.
The thread pool is poorly-suited for this scenario; it’s designed for a collection of independent short-duration tasks. The problem here is at a much higher level.
We did this very easily under VMS using job queues. I’ve always wondered why so-called more advanced OSes don’t provide this fundamental capability.
Raymond, could you elaborate on why a thread pool is "poorly-suited" here? It’s definitely the first thing that came to my mind — in the case where you have plenty of ram, make a pool with nthreads = nprocs.
Couldn’t one use the i/o completion port for this? Just set it up properly, and queue the work items to the port. It’s a pity (no offense meant, Raymond) that MS has got the ports patented, it’s such a simple idea really, and useful too…
Bryan, it scales just fine, simply change "n" from 4 to whatever. He didn’t say that it scaled automatically, just that it could scale as more processors were added. There is no "n" to change in the serial version.
Am I missing something or is this kind of problem exactly (maybe even *only*) what semaphores were invented for? Once you figure out the count, fire off as many threads as you want and have the first thing they do be wait on a semaphore of count 4 (in this case) and signal it when they’re done.
Strikes me as a lot easier than writing a thread scheduler… is there some reason that wouldn’t work?
Deals with separate processes too, just use a named semaphore.
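A minimal sketch of that counting-semaphore approach, assuming four slots is the right cap; the semaphore name "Local\WorkItemSlots" and the function names are made up for the example:

```cpp
#include <windows.h>

HANDLE g_sem;  // counting semaphore, initial and maximum count of 4

DWORD WINAPI WorkerThread(void* param)
{
    WaitForSingleObject(g_sem, INFINITE);   // grab one of the four slots
    // ... do the CPU- and memory-hungry work here ...
    ReleaseSemaphore(g_sem, 1, nullptr);    // give the slot back
    return 0;
}

void StartAllWorkItems(int count)
{
    // A named semaphore lets work items in separate processes share the
    // same count; the name here is purely illustrative.
    g_sem = CreateSemaphore(nullptr, 4, 4, TEXT("Local\\WorkItemSlots"));

    // Fire off every work item at once; the semaphore keeps only four
    // of them actually running at any moment.
    for (int i = 0; i < count; i++) {
        HANDLE h = CreateThread(nullptr, 0, WorkerThread, nullptr, 0, nullptr);
        if (h) CloseHandle(h);  // fire and forget for this sketch
    }
}
```

Note that with dozens of 180MB work items this still creates all the threads (and potentially commits all the memory) up front, which is one reason a central scheduler that delays creating the work items can be preferable.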
Coderjoe: I know that Raymond solved the problem for 4 CPUs and 5-tasks-worth-of-memory.
I was responding to his comment about "The second design was to serialize the work items", where he said "However, this didn’t scale because the performance of the program didn’t improve after adding processors or memory." The final solution he posted here (limit yourself to 4 tasks) *still doesn’t scale up when you add CPUs*. That’s all I was saying. ;-)
The only way to scale up when the user adds CPUs is to determine your task-count limit at runtime. Or let the user tell you how many tasks to run at once (as in make’s -j option).
Adam — I know, any solution to this problem is going to be really bad in the general case. (Doubly so if you have no idea how much RAM/CPU the OS is using at any given moment. You can sort of figure out the RAM usage, but CPU usage is harder.) There is no good, general solution to that that I can think of, other than letting the admin tell the program how much to do at a time. But given the constraints here, it’d work OK.
As far as HT virtual CPUs: Yes, and that’s why we run our SolidWorks machines with HT disabled. It runs noticeably faster (part of this is because SolidWorks is single threaded anyway, but part of it is the shared caches and the shorter pipeline on each virtual CPU). In the specific case mentioned here, the admin could disable HT if it became an issue. (And obviously this may not work if the machine needs to do other stuff, but see above: this is already a crappy solution for that case.)
Yes, an I/O completion port is a much simpler option. The completion port takes care of the wait behavior for the 4 running threads and allows a 5th thread to run, which makes the fullest use of the CPU resources.
Fortunately, Windows has this strong resource management built in for us!!
Anyway, Raymond Chen’s post gives a brief idea of how to implement the I/O completion port behavior ourselves :-)
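For what it’s worth, a sketch of using a completion port purely as a work queue might look like this; ProcessWorkItem, the choice of 4 worker threads, and the use of key 0 as a quit signal are assumptions for the example, not anything from the original post:

```cpp
#include <windows.h>

HANDLE g_port;

// Placeholder for one resource-hungry work item.
void ProcessWorkItem(ULONG_PTR item)
{
}

DWORD WINAPI PortWorker(void*)
{
    DWORD bytes;
    ULONG_PTR key;
    OVERLAPPED* ov;
    // Workers pull queued items off the port until they see the quit key.
    while (GetQueuedCompletionStatus(g_port, &bytes, &key, &ov, INFINITE)) {
        if (key == 0) break;      // 0 reserved here as a "quit" signal
        ProcessWorkItem(key);
    }
    return 0;
}

void RunViaCompletionPort(ULONG_PTR* items, int count)
{
    // Not associated with any file handle; used purely as a work queue,
    // with a concurrency value of 4 to match the four CPUs.
    g_port = CreateIoCompletionPort(INVALID_HANDLE_VALUE, nullptr, 0, 4);

    HANDLE threads[4];
    for (int i = 0; i < 4; i++)
        threads[i] = CreateThread(nullptr, 0, PortWorker, nullptr, 0, nullptr);

    for (int i = 0; i < count; i++)
        PostQueuedCompletionStatus(g_port, 0, items[i], nullptr);

    for (int i = 0; i < 4; i++)   // one quit signal per worker
        PostQueuedCompletionStatus(g_port, 0, 0, nullptr);

    WaitForMultipleObjects(4, threads, TRUE, INFINITE);
}
```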
They have dozens of work items requiring 180MB and you "fix" it with a scheduler thread? Where did they get them in the first place? I’d make sure those "work items" are profiled and optimised first!
Oh, Jeffrey, but they’re patented! That’s the whole idea. Look up USPTO#06223207.
GregM: True. But the user can’t change "n"; they don’t have the source code. (Or a compiler, necessarily.) All they can do is add CPUs, and when they do that, they see that the program doesn’t scale up.
(It also doesn’t scale down: If the user removes a CPU, they’ll get all the same thrashing happening again.)
In this case, yes, the user and the developer were the same company. But who wants to bet that the people aware of the reasons for this limitation will still be with the company when "the server’s too slow!" reports start coming in? Someone, eventually, is going to want to replace this server with one that has more CPUs (or add CPUs to the existing server), and that person may not know that they have to bump "n" and recompile the software.
OTOH, if the company eventually starts running something else on this server, then a hardcoded limit would be better than trying to figure out the limit at runtime. There doesn’t seem to be a good automatic solution to this kind of problem. Making the limit changeable via the registry or a config file would probably be the best option.
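A sketch of that configurable-limit idea, assuming a made-up registry key (Software\Contoso\WorkScheduler) and falling back to one task per CPU when no override is set:

```cpp
#include <windows.h>

DWORD GetConfiguredTaskLimit()
{
    DWORD limit = 0;

    HKEY key;
    if (RegOpenKeyEx(HKEY_LOCAL_MACHINE,
                     TEXT("Software\\Contoso\\WorkScheduler"),
                     0, KEY_QUERY_VALUE, &key) == ERROR_SUCCESS) {
        DWORD value, size = sizeof(value), type;
        if (RegQueryValueEx(key, TEXT("MaxConcurrentWorkItems"), nullptr,
                            &type, reinterpret_cast<BYTE*>(&value),
                            &size) == ERROR_SUCCESS && type == REG_DWORD) {
            limit = value;   // admin override wins
        }
        RegCloseKey(key);
    }

    if (limit == 0) {        // no override: default to one work item per CPU
        SYSTEM_INFO si;
        GetSystemInfo(&si);
        limit = si.dwNumberOfProcessors;
    }
    return limit;
}
```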
One thing not to forget is that Raymond’s theatrical, just-for-an-example work items just need CPU time and memory. Real world tasks may often wait around for disk access, network messages, database queries, etc… If your items do a lot of this type of stuff then things become more complicated if you’re looking for the most efficient scheduler.
I wonder if a scheduler that controlled the running of multiple “memory/cpu-limited” work items alongside multiple “disk-limited” and “network-limited” work items would be a way to go.
Another idea could be to assign scores to each work item, for example Work Item A is 50% CPU and 50% disk, Work Item B is 90% CPU, 10% network (maybe store this info in a database or something). Then with those scores (and another value like the “size” of the item relative to the other items) somehow workout what to run when. Of course this could become so complicated you may never be able to understand why the system isn’t running correctly.
I also believed that having more threads than the number of available CPUs leads to degradation of the overall performance. However, a simple test shows it’s not true.
In my test, a job consists of simple arithmetic calculations on an array of 100000 floating-point numbers. Each job thread runs a loop with a number of calculations. Depending on the command-line parameters, the test program can start a given number of threads, each with a specified number of iterations.
The funny thing is that, say, one thread with 160000000 iterations provides lower performance than 16 threads with 10000000 iterations each. On a single-CPU machine.
You can easily reproduce the case, or if somebody is interested I can send the source code of my program (written in Delphi).
Can anybody explain this?
P.S. The overall performance actually degrades when the number of simultaneous worker threads is larger than 32 or so.
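For anyone who wants to try it, here is a rough C++ reconstruction of the kind of test described above (the original was in Delphi); the array size, the arithmetic, and the timing are assumptions chosen only to make the sketch self-contained:

```cpp
#include <windows.h>
#include <vector>
#include <stdio.h>
#include <stdlib.h>

struct Job { long iterations; };

DWORD WINAPI JobThread(void* param)
{
    Job* job = static_cast<Job*>(param);
    std::vector<double> data(100000, 1.0);          // 100000 floating-point numbers
    for (long i = 0; i < job->iterations; i++)
        for (size_t j = 0; j < data.size(); j++)
            data[j] = data[j] * 1.000001 + 0.5;     // arbitrary arithmetic
    return 0;
}

int main(int argc, char** argv)
{
    int threads = argc > 1 ? atoi(argv[1]) : 1;     // e.g. 1 vs. 16
    long iterations = argc > 2 ? atol(argv[2]) : 1000;

    Job job = { iterations };
    std::vector<HANDLE> handles;
    DWORD start = GetTickCount();
    for (int i = 0; i < threads; i++) {
        HANDLE h = CreateThread(nullptr, 0, JobThread, &job, 0, nullptr);
        if (h) handles.push_back(h);
    }
    WaitForMultipleObjects(static_cast<DWORD>(handles.size()),
                           handles.data(), TRUE, INFINITE);
    printf("%d thread(s): %lu ms\n", threads, GetTickCount() - start);
    for (size_t i = 0; i < handles.size(); i++) CloseHandle(handles[i]);
    return 0;
}
```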
"Funny is that, say, one thread with 160000000 iterations provides lower performance than 16 threads with 10000000 iterations each. On a single-CPU machine."
What springs to mind is that more threads = more time for your program when other threads are runnable (and they will be), assuming all threads are given an equal time-slice. However I would be surprised if it was that simple, or even if it was correct!
Yes, this is one of the first conclusions everybody comes to when I explain the case :-)
But I do not think this is that simple, because in fact there are no applications that are actively working at the time the test is running. Of course there are hundreds of threads in the idle system, but they are mostly sleeping.
In any case, in practice the test shows that you can at least ignore the overhead of thread context switching, as long as the number of threads in your program is reasonable.
But I’m still curious why it is so, because it goes against the theory :-) So if you have another explanation, please share it.