Date: November 22, 2006 / year-entry #392
Tags: history
Orig Link: https://blogs.msdn.microsoft.com/oldnewthing/20061122-50/?p=28933
Comments: 35
Summary: Back in the Windows 95 days, people swore that increasing the value of MaxBPs in the system.ini file fixed application errors. People usually made up some pseudo-scientific explanation for why this fixed crashes. These explanations were complete rot. These breakpoints had nothing to do with Windows applications. They were used by 32-bit device drivers to communicate...
Back in the Windows 95 days, people swore that increasing the value of MaxBPs in the system.ini file fixed application errors. People usually made up some pseudo-scientific explanation for why this fixed crashes. These explanations were complete rot.

These breakpoints had nothing to do with Windows applications. They were used by 32-bit device drivers to communicate with code in MS-DOS boxes, typically the 16-bit driver they were trying to take over from or were otherwise coordinating their activities with. A bunch of these are allocated at system startup when drivers settle themselves in, and on occasion, a driver might patch a breakpoint temporarily into DOS memory, removing it when the breakpoint is hit (or when the breakpoint is no longer needed). Increasing this value had no effect on Windows applications.

I fantasized about adding a "Performance" page to Tweak UI with an option to increase the number of "PlaceBOs". I would make up some nonsense text about this setting controlling how high in memory the system should place its "breakpoint opcodes". Placing them higher would free up memory for other purposes and reduce the frequency of "Out of memory" errors. Or something like that.

I was reminded of this story by my pals in product support who were trying to come up with a polite way of explaining to their customer that there is no /7GB switch.

(And by the way, the technical reason why the user-mode address space is limited to eight terabytes was given by commenter darwou: the absence of a 16-byte atomic compare-and-exchange instruction means that bits need to be sacrificed to encode the sequence number which avoids the ABA problem.)
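To make darwou's sequence-number trick concrete, here is a minimal sketch in C11 of a lock-free stack whose head packs a sequence counter into the spare bits of a 64-bit word. The 21-bit field width and all the names are invented for illustration; this is not the actual Windows SList layout, just the general shape of the idea.

```c
#include <stdatomic.h>
#include <stdint.h>

#define SEQ_BITS  21                        /* made-up width, for illustration */
#define ADDR_BITS (64 - SEQ_BITS)
#define ADDR_MASK ((UINT64_C(1) << ADDR_BITS) - 1)

typedef struct node { struct node *next; } node;

static _Atomic uint64_t stack_head;         /* packed: sequence | pointer bits */

static uint64_t pack(node *p, uint64_t seq) {
    return ((uint64_t)(uintptr_t)p & ADDR_MASK) | (seq << ADDR_BITS);
}
static node *unpack_ptr(uint64_t v) {
    return (node *)(uintptr_t)(v & ADDR_MASK);
}
static uint64_t unpack_seq(uint64_t v) {
    return v >> ADDR_BITS;
}

static void push(node *n) {
    uint64_t old = atomic_load(&stack_head);
    uint64_t desired;
    do {
        n->next = unpack_ptr(old);
        desired = pack(n, unpack_seq(old) + 1);   /* bump sequence on every update */
    } while (!atomic_compare_exchange_weak(&stack_head, &old, desired));
}

static node *pop(void) {
    uint64_t old = atomic_load(&stack_head);
    for (;;) {
        node *head = unpack_ptr(old);
        if (!head)
            return NULL;
        uint64_t desired = pack(head->next, unpack_seq(old) + 1);
        /* If the stack went A -> B -> A in the meantime, the pointer bits
           match but the sequence number does not, so the exchange fails
           instead of silently corrupting the list. That is the ABA problem
           the sequence bits exist to prevent. */
        if (atomic_compare_exchange_weak(&stack_head, &old, desired))
            return head;
    }
}
```

With a 16-byte compare-and-exchange available, the sequence number could live in its own 64 bits alongside a full-width pointer and no address bits would need to be sacrificed; that is the connection to the 8TB limit.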
Comments (35)
Comments are closed.
I’ve a similar story for you: I play in a band, and while recording once, our singer was making (self-admitted) ludicrous/impossible requests for strange effects.
To properly illustrate this, the soundman told us the story of the NND. To deal with these requests he says "I have just the thing!", hauls out some equipment, tangles in some leads, and has the offending requestor operate the dial "very slowly and carefully; the subtlest change on the dial can have huge effects". Then he sets the recording playing and lets the person behind the bonkers request adjust the dial, lovingly and with full creative control.
This fine device, the NND, is of course the No Noticeable Difference machine.
It’s the old story – random reinforcement breeds superstition. I recall people telling me in the Windows 95 days that changing some DUN registry keys would speed up my cable modem.
Funny how some people try to rationalize it, though. Sometimes I wonder if they came up with the explanations themselves using some bizarre analogies, or if one of their friends fed it to them just to see how gullible they were.
"random reinforcement breeds superstition" – That was a fun tactic in the game Black and White. You randomly praise or punish your creature and it ends up neurotic.
http://support.microsoft.com/kb/131041/en-us
This kind of article makes one believe it helps apps.
"This kind of article makes one believe it helps apps."
Only if you miss the explanation of how it affects the device drivers.
I’m more curious about the reference to ‘SYSINI.INI’ (which would appear to be a symptom of a Trojan infection, as opposed to SYSTEM.INI, as documented in SYSINI.WRI)…
It might actually be a viable business model…
http://www.siltechcables.com/wawcs0115596/tn-philosophy.html
"Part of the beauty of the design of Siltech cables, however, is beyond our design efforts. Nature itself ensures that our cables improve with age, due to the conductive qualities of silver."
ahh has it been 10 years already?
http://www.ftc.gov/os/1996/10/c3688cmp.htm
even made the top 3 worst pc tech of all time:
http://www.pcworld.com/article/id,125772-page,2/article.html
:)
One day I found that someone had pulled the WiFi ADSL router from the shelf below the printer and extended its antennae. Since our network is wired and I had disabled WiFi in the router in order to prevent "bandwidth sharers", I folded the antennae and returned the router to its original location. A few days later I found it outside the shelf again with the antennae extended. It seems that one of the times the boss restarted the router (the ADSL connection is a bit unstable there), he noticed the antennae and thought that they could affect the ADSL "reception". I explained to him that they aren’t used for ADSL but for the local network, and returned it again to its position.
After the third time I found the router outside, I left it there. It makes the boss happy, doesn’t affect the network since WiFi isn’t used, and the reset button is more accessible in that location :-) . Unfortunately, the ADSL connection continues to drop from time to time…
The favourite pointless ritual, that will cure all ills, in Apple land is ‘Repair Permissions’. It doesn’t matter what issue you have, or what forum you post in, someone will suggest that repairing permissions will fix it. Unsanity provided a pretty good debunking over a year ago: http://www.unsanity.org/archives/000410.php
Still the myth persists: I was having problems with iDVD. When I posted my query on Apple’s user discussion board, mentioning that an OpenGL error was being reported, what was the first suggested remedy? Repair Permissions! I gave up hope at that point.
Re: knowledgebase articles: they are (or were) written by a variety of sources, some of which are also susceptible to this sort of urban legend.
Funny meta-reference: Since Wikipedia doesn’t want original research, I recently found myself googling for KB and other technical references *that I had written* to cite in the Wikipedia article I was rewriting…
Why not detect support for the cmpxchg16b instruction and increase the address space if it exists?
That would be great, especially since all new processors from both AMD and Intel now have cmpxchg16b (it was missing from the first versions of the Athlon 64s and Opterons).
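Detecting the instruction at runtime is straightforward. A minimal sketch, assuming GCC/Clang's <cpuid.h> (MSVC has a __cpuid intrinsic in <intrin.h> instead); CPUID leaf 1 reports CMPXCHG16B support in ECX bit 13:

```c
#include <stdio.h>
#include <cpuid.h>   /* GCC/Clang; MSVC would use __cpuid from <intrin.h> */

/* CPUID leaf 1 reports CMPXCHG16B support in ECX bit 13. */
static int has_cmpxchg16b(void) {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;               /* CPUID leaf 1 not available */
    return (ecx >> 13) & 1;     /* the CMPXCHG16B feature bit */
}

int main(void) {
    printf("cmpxchg16b supported: %s\n", has_cmpxchg16b() ? "yes" : "no");
    return 0;
}
```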
I find much joy when people "hear" that a distinct digital bitpattern sounds better on some hardware than the identical bitpattern reproduced on other hardware.
Seriously, Raymond, why not use all those new instructions? What is the point of IHVs putting so much effort and money into new hardware capabilities when the hardware still has to run a mix of legacy code?
We have cmpxchg16b and we don’t use it. It is a wasted silicone.
We have MONITOR/MWAIT thread synchronization primitives (for spinlocks, semaphores and such) and yet kernel doesn’t use them. Again wasted silicone which could be put to good use otherwise.
Next year we are going to have CPUs with a CRC32 instruction. How long will it take for Microsoft software to pick it up, if at all?
It is a shame that some of us pay for more features in our hardware and then get slapped in the face by lazy software vendors who find it "too complicated" to test more than one code path and who have never heard of automatic CPU dispatch.
Talk about optimizing for edge cases just for the sake of it! Just because the instructions exist doesn’t mean they’re free to call!
"Talk about optimizing for edge cases just for the sake of it!"
Well, excuse me, but that instruction is now part of most 64-bit CPUs on the market, and it will continue to be present in all future architectures. So it is not an edge case.
Furthermore, it is not an optimization; it enables a feature (atomic operations on 128-bit values) which would not be possible to implement otherwise.
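For the curious, here is a minimal sketch of what cmpxchg16b buys you, using the GCC/Clang __int128 extension; it assumes an x86-64 target compiled with -mcx16 so the builtin can lower to LOCK CMPXCHG16B rather than a library fallback.

```c
#include <stdbool.h>

typedef unsigned __int128 u128;   /* GCC/Clang extension on x86-64 */

/* 128-bit atomic compare-and-exchange. With -mcx16 this compiles to
   LOCK CMPXCHG16B; on failure, *expected is updated to the value seen. */
static bool cas128(volatile u128 *target, u128 *expected, u128 desired) {
    return __atomic_compare_exchange_n(target, expected, desired,
                                       false,               /* strong CAS */
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}
```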
"It’s not because the instructions exist that they’re free to call!"
FYI, the majority of new instructions either:
- take the same number of cycles but do more work per cycle, or
- have the same performance (number of clocks) but do not incur penalties, create dependencies, or cause stalls.
Rationale:
New instructions are made to be used. If you don’t use them, you are anal, and here is an analogy:
You asked the city mayor to build a bypass so you can drive quickly from one end of the city to the other.
When it is finished you start using it, but instead of driving fast like you should on a highway, you crawl along as if you were in the middle of a downtown traffic jam.
That’s because the first hardware is using monster oxygen-free monofilament crystalline transparent copper cables that cost $200/metre. Sheesh, don’t you know anything about audio feng shui?
Deaf & Blind > I find much joy when people "hear" that a distinct digital bitpattern sounds better on some hardware than the identical bitpattern reproduced on other hardware.
I find it amusing and frustrating when people make statements such as this one with the knowledge that digital is ‘precise and exact’ to back up that assertion, yet fail to take into account the analogue path and other digital and analogue signal processing that may differ between the hardware.
For a perceived performance boost, just install the XP Royale theme – it’s like go faster stripes for your computer. It really does make one’s computer feel sleeker and more powerful.
Back in the 90s I did actually put go faster stripes on my computer. Red insulating tape.
Naturally, I also changed the LED display on my front panel and pointed it out to my friends: "See? Up from 66 to 67MHz now, with the go faster stripes!"
Sadly, nobody would pay me to make a similar upgrade to their machines.
[just install the XP Royale theme]
Is this available from MS anymore? I know it used to be on the New Zealand downloads page (here: http://www.microsoft.com/downloads/details.aspx?FamilyID=15373c73-d5f6-4af0-b583-d633cb021612&DisplayLang=en) but it’s been gone for a while. Why was it removed so soon after being publicly available?
[I’ll let you guess. -Raymond]
It may cause compatibility issues, I guess.
The current architecture of x64 CPUs only allows 48 bits of address space to be used, which would yield 128TB of user address space if it were all available. I’m sure that by the time people actually need more than 8TB of address space, sufficiently many people will have the necessary CPU instruction that MS will implement it, allowing people to use the full 128TB of address space.
Um, Gabe, no, it doesn’t. ;-)
You’re confusing physical address space with linear address space. The x64 allows pointers to be 64 bits wide; therefore, you can access 2**64 different bytes in a program. The fact that *current* x64 implementations only have 48 address lines on their memory bus is an implementation detail. Writing a program that expects pointers to be 48 bits long, or only have the lower 48 bits be nonzero, based on the address bus size, is a bug.
(Besides, in the same way the Pentium Pro moved from a 32-bit address bus to a 36-bit address bus, the x64 processors could change their physical addressing range at any time. Now the PPro had some issues with actually addressing all that, because they went higher than their linear addresses, but the x64 change wouldn’t.)
> Next year we are going to have CPUs with CRC32 instruction. How long will
> it take for Microsoft software to pick it up if at all?
It’s never that simple. My company’s software, for example, runs on everything from Win95 to Vista, and yes, we have customers that really *do* run the darned thing on Win95, on boxes like a Pentium 100 with 64 megs of RAM. When you have to support a very wide range of systems, you have to focus more on the lowest common denominators than you do on the highest-end systems, because there are a lot more lowest-common-denominator boxes than there are high-end boxes. Adding in specialized support for "X" random feature of the higher-end boxes requires a lot of additional testing to make sure you didn’t break anything on the low-end boxes too; and in a company as large as Raymond’s, that requires approval processes and committees too.
I wouldn’t blame Microsoft in the slightest for not using cmpxchg16b or crc32 in Vista, or even for ignoring those for some years from now. They have bigger fish to fry than immediately adding in support for every little pet instruction that Intel and AMD toss into their chips.
> We have cmpxchg16b and we don’t use it. It is a wasted silicone.
Mmm… wasted silicone. *imagines jiggly processors walking down the street*
Sean W. said:
"It’s never that simple"
Well, actually it is that simple: the compiler generates Pentium code for you, so what is stopping you from using SIMD instructions in inline assembler or separate .asm files?
You just have to check whether the feature is present and whether you are running under an OS which supports it. You can get 4x speedups by using the SSE and SSE2 instruction set extensions. One good example of a program which still works on Windows 95 and a Pentium but flies on the latest hardware is VirtualDub by Avery Lee.
You can also get the Intel C/C++ Compiler and let it perform automatic CPU dispatch for you. It can also vectorize most common code, and it optimizes a lot better (again, up to 4x faster code) than even the latest MSVC 2005 compiler (which, by the way, has been considerably improved over 2003 and VC6).
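A minimal sketch of the check-and-dispatch pattern Igor describes, with invented function names: a scalar fallback plus an SSE path chosen once at startup (the SSE function must itself be compiled with SSE enabled, e.g. -msse on GCC/Clang).

```c
#include <stddef.h>
#include <cpuid.h>      /* GCC/Clang feature detection */
#include <xmmintrin.h>  /* SSE intrinsics */

/* Plain scalar fallback that runs anywhere. */
static void add_scalar(float *dst, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

/* SSE path: four floats per iteration. */
static void add_sse(float *dst, const float *a, const float *b, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(dst + i, _mm_add_ps(_mm_loadu_ps(a + i),
                                          _mm_loadu_ps(b + i)));
    for (; i < n; i++)          /* leftover tail */
        dst[i] = a[i] + b[i];
}

typedef void (*add_fn)(float *, const float *, const float *, size_t);

/* Pick the implementation once at startup: CPUID leaf 1, EDX bit 25 = SSE. */
static add_fn pick_add(void) {
    unsigned int eax, ebx, ecx, edx;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) && (edx & (1u << 25)))
        return add_sse;
    return add_scalar;
}
```

Calling pick_add() once and routing all calls through the returned pointer is essentially what the Intel compiler's automatic CPU dispatch does for you behind the scenes.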
Sean W. said:
"than immediately adding in support for every little pet instruction"
If you believe that an instruction which will perform a CRC32 checksum on a block of data is just a "little pet instruction", then I don’t see a point in discussing this matter with you anymore.
(continues to read CUDA Programming Guide)
"Mmm… wasted silicone"
Obviously a typo.
> If you believe that the instruction which will
> perform CRC32 checksum on a block of data is
> just a "little pet instruction", then I don’t
> see a point in discussing this matter with you anymore.
Your loss.
CRC32 has some value as an instruction, but the fact is that the code to compute it is well-known and well-optimized, and, as a general rule, every time a new specialized instruction like this gets added, it gets phased out a few years later because the more general solution is usually faster or better or both. Remember the x86 string instructions? How about aad/aam? They were all added in their day to make things go faster — and have been successively phased out ever since (the string instructions are slower than separate instructions on modern processors, and things like aad/aam don’t even exist on x64). Time and time and time again, the industry has failed to learn (and been forced to relearn) the lessons of RISC: That a few instructions that work really well in the general case are a lot better than a lot of instructions that are only usable in special cases.
CRC-32 runs just fine in software and has for decades now. Including it in hardware gives at best a marginal speedup to only a handful of applications, at a significant cost in silicon: Place ya bets you can’t find any common application in use today whose main bottleneck is computing CRC-32s.
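For reference, the well-known software routine in question is the classic table-driven CRC-32; the sketch below follows the standard zlib-style formulation (reflected IEEE 802.3 polynomial: one table lookup, one XOR, and one shift per byte). It is worth noting that the CRC32 instruction Intel eventually shipped computes CRC-32C (the Castagnoli polynomial), not this polynomial, so it could not simply replace code like this in ZIP or Ethernet anyway.

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t crc_table[256];

/* Build the lookup table for the reflected IEEE 802.3 polynomial. */
static void crc32_init(void) {
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i;
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? 0xEDB88320u ^ (c >> 1) : c >> 1;
        crc_table[i] = c;
    }
}

/* One table lookup, one XOR, and one shift per input byte. */
static uint32_t crc32(const void *buf, size_t len) {
    const uint8_t *p = buf;
    uint32_t c = 0xFFFFFFFFu;
    while (len--)
        c = crc_table[(c ^ *p++) & 0xFF] ^ (c >> 8);
    return c ^ 0xFFFFFFFFu;
}
```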
> You just have to check if the feature is
> present and if you are running under OS which
> supports it. You can get 4x speedups by using
> SSE and SSE2 instruction set extensions.
And what about the hidden support and maintenance costs? Every time you add a conditional to your code, you’re adding a fork: Some people will take the left path, and some people will take the right; and if you have a bug in the left side but not in the right, or they compute different values in rare circumstances (and yes, that DOES happen), you’re ensuring that one group of people will see different output from your software than another. Each time a feature like this is added, you increase your support costs, so you have to seriously think and discuss it before you do it. There’s already too much software out there where if it crashes, people say, "You probably need to disable SSE/MMX/SSE2/extension-du-jour to make it work." I don’t blame Microsoft — or any other company — for ignoring new instructions, because the performance potentially gained from them has to be balanced against the potential support and maintenance costs. It’s benefit of the users vs. cost to the company, and sometimes you have to land on the company side of that equation. Nothing is free, and if you’ve been reading Raymond’s blog carefully, you should know that by now.
> Obviously a typo.
Well, *DUH*.
Sean W. said:
"the fact is that the code to compute it is well-known and well-optimized"
Do you really think any optimized code for CRC32 calculation could beat a single instruction?
Sean W. said:
"Remember the x86 string instructions?"
Those still exist, and the P4 has a fast-strings mode which is triggered when certain conditions are met.
Sean W. said:
"the lessons of RISC"
So you are a RISC guy? Well, maybe those few fast instructions are fine for simple tasks, but how many of them does it take to implement a complex algorithm? Code becomes big, falls out of the instruction cache, and decoding bandwidth is a limited resource after all. I’ll take CISC over RISC anytime.
Sean W. said:
"you can’t find any common application"
Off the top of my head: any archiver, and also the TCP/IP stack if the hardware does not have checksum offloading.
Sean W. said:
"and if you have a bug in the left side but not in the right"
That is why I advocated use of the Intel compiler: it generates a runtime auto-dispatcher and the different code paths from a single source, so you don’t really have to maintain different code paths.
Sean W. said:
"or they compute different values in rare circumstances"
That is indeed very rare. I witnessed it once in my own code: it was SSE code which had a 4x smaller error as a side-effect of vectorization, so it was a good thing.
Sean W. said:
"the performance potentially gained from them"
The performance can be up to 4x greater; isn’t that worth some trouble?
You can give as many excuses for being lazy as you want but they are just that — excuses and lousy at best.
My point is that the software is lagging behind the hardware big time. It is time for software developers to take some responsibility for their sluggish code considering all those GFLOPS at their disposal.
BryanK, you are confusing linear address space with virtual address space. Although you can create a pointer to any theoretical address within the 64-bit address space, current x64 chips are designed so you can only create a mapping within the top or bottom 128TB of that address space.
In other words, when you tell an Opteron how to map a physical address to a virtual address, that virtual address must fit within 48 bits and the physical address must fit within 40 bits. Xeon processors also have 48-bit virtual address spaces but only 36-bit (33 address lines and 64 data lines) physical address spaces.
The reason you can’t use the full address space is that the page tables just aren’t big enough. In theory, a flat table mapping every 64-bit address into 4k pages would require 2^52 entries of 8 bytes each, about 2^55 bytes of memory per process. Since this isn’t practical, page tables are broken down into 4 levels of 512 entries each. Adding a 5th level would let you use an additional 9 bits of virtual address space at the cost of using more memory to store page tables and more cycles to access them. At some point enough programs will need more than 128TB of address space and the tradeoffs will be worth it, but that time is far off in the future.
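To make the arithmetic concrete: four levels of 9-bit indexes plus a 12-bit page offset is exactly 48 bits. A minimal sketch of the decomposition (the sample address is arbitrary):

```c
#include <stdint.h>
#include <stdio.h>

/* x64 4-level page-table index extraction: 4 levels of 9 bits each
   plus a 12-bit page offset = 48 bits of virtual address. */
int main(void) {
    uint64_t va = 0x00007FFDEADBEEF0ull;   /* arbitrary canonical address */
    unsigned pml4 = (va >> 39) & 0x1FF;    /* bits 47..39 */
    unsigned pdpt = (va >> 30) & 0x1FF;    /* bits 38..30 */
    unsigned pd   = (va >> 21) & 0x1FF;    /* bits 29..21 */
    unsigned pt   = (va >> 12) & 0x1FF;    /* bits 20..12 */
    unsigned off  =  va        & 0xFFF;    /* bits 11..0  */
    printf("PML4=%u PDPT=%u PD=%u PT=%u offset=0x%X\n",
           pml4, pdpt, pd, pt, off);
    return 0;
}
```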
BryanK, there is no getting around that 48-bit limit. If you attempt to use an address that has bits 48-63 set to anything other than the same as bit 47, you get a GP fault. This is different from a page fault, which you get when you access an address which is in your address space but not currently mapped to physical memory.
The x64 architecture supports 2MB pages, but all that does is eliminate one level of page tables (making memory access faster at the expense of making page faults slower). Windows has for a while allowed you to let it use large pages for the kernel, and now you can request large pages in a user program if you have sufficient permission.
> Do you really think any optimized code for CRC32 calculation could beat a single instruction?
Yes, if the single instruction takes 1000 clocks to run.
Now obviously this isn’t the case with the crc32 instruction, but in general, yes, an optimized set of smaller instructions *can* beat a single larger instruction.
(As for your dislike of RISC: That’s funny, because effectively *every processor* is now RISC, including Intel-compatibles. They convert the CISC instructions of your program to RISC "microcode", and then they run the microcode.)
Gabe: You’re right, I was thinking you were talking about the physical address bus size being 48 bits. So never mind there.
But linear addresses *are* virtual addresses. ;-) The fact that some of them aren’t valid (because there’s no corresponding PTE) does mean that you can’t use all of them, but the processor just takes your linear address and looks it up in the current set of page tables. The linear versus virtual distinction only exists if you decide that invalid page faults from accessing pages that the OS hasn’t given you are somehow "different" from invalid page faults that come from accessing pages that the processor decided to not support.
But yes, processors that force the page table architecture to 4 levels of 512 64-bit entries each are going to page-fault if you give them a linear address outside the 48 bits that they give you.
I remember that the i686 had a "page size extension" mode that (IIRC anyway) made the pages 4M in size instead of 4K. This might help with the 48-bit limit (it’d raise it to 58 bits, because the extra 10 bits from the page size wouldn’t require any more page-table levels). OTOH, a lot of non-kernel code that runs on Windows probably depends on 4K pages now, which probably means these large pages won’t ever be turned on. (If the x86-64 even supports that extension in 64-bit mode…)
But that might be a way to give more *usable* linear/virtual address space to a program, if it was supported. (And if it didn’t require tons of bookkeeping in the kernel.)
BryanK said:
“As for your dislike of RISC”
I am well aware that it is RISC under the hood. But my point was that if RISC was good enough, there would be no need for high-level languages.
If you haven’t noticed, high-level languages hide the complexity and poor manageability of assembler code and enable you to focus on the problem by giving you larger, already optimized, building blocks which you use to solve it.
IMHO, the same goes for CISC vs. RISC. Imagine if you had to calculate, say, a square root using only addition and subtraction. Yes, CISC’s DIV would take many more cycles to complete than RISC’s ADD/SUB, but in the end CISC would still be the winner.
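To make that literal, here is a minimal sketch of an integer square root computed with nothing but addition, subtraction, and comparison (the classic odd-number method). It is correct but takes O(sqrt(n)) iterations, which is exactly the tradeoff being described: many simple instructions versus one complex one.

```c
#include <stdint.h>

/* Integer square root using only addition and subtraction:
   floor(sqrt(n)) equals the count of odd numbers 1, 3, 5, ...
   that can be subtracted from n before it would go negative. */
static uint32_t isqrt(uint32_t n) {
    uint32_t odd = 1, count = 0;
    while (n >= odd) {
        n -= odd;       /* subtract 1, 3, 5, 7, ... */
        odd += 2;
        count += 1;
    }
    return count;       /* e.g. isqrt(16) == 4, isqrt(15) == 3 */
}
```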
BryanK said:
“in general, yes, an optimized set of smaller instructions *can* beat a single larger instruction.”
Well, correct me if I am wrong but aren’t those “larger” instructions actually being made to beat that set of “smaller” instructions?!?
Let’s say that CRC32 uses two registers, one as a pointer while the other holds the block length in bytes, and that it returns the CRC32 checksum in the accumulator.
Perhaps it can be encoded in just two bytes, but let’s for the sake of argument say that it takes 3 bytes, or 4 bytes in 64-bit mode if you count the REX prefix.
Now compare that to the code size of the shortest and fastest possible version of CRC32 written in pure assembler.
Then take the latest Intel Optimization Manual and read the part where it says that you have to have fewer than 18 instructions in a tight loop in order to keep maximal decoding bandwidth.
After that, write another loop around your CRC32 code to call it N times for, say, blocks of 2048 bytes. Your already-too-big code just got bigger and slower; the single-instruction form, however, has only now gotten its first loop, which fits in the instruction cache and is streamed from the buffer instead of being decoded again and again.
Any bells ringing yet?
Sure thing, Raymond. Sorry about that; it is just that I am pretty sensitive when it comes to assembler and code optimization, because that is what I do for a living.
Out of deference to Raymond, I will let the subject drop, despite what I could say to Igor about optimization. This *should* be about placebo settings (oh, how I wish I could add one in my own apps too!) and nothing else.
But I would like to add one more thing, because I *did* get directly, personally insulted up there: Igor, please don’t call me lazy just because I prefer algorithmic optimizations to raw assembly optimizations, okay? Rewriting your whole app in assembly is a little silly when finding your bottlenecks and carefully replacing those with algorithmic improvements will give you much higher speedups, and without switching languages. I use assembly only where appropriate and necessary these days; I wrote whole apps in 100% carefully-tuned assembly when I was younger, and I sure as heck know better than to do that kind of silliness anymore.
"Work smarter, not harder." — Scrooge McDuck
"Premature optimization is the root of all evil." — Edsger Dijkstra