Date: September 14, 2004 / year-entry #336
Tags: other
Orig Link: https://blogs.msdn.microsoft.com/oldnewthing/20040914-00/?p=37873
Comments: 67
Summary: The x86 architecture does things that almost no other modern architecture does, but due to its overwhelming popularity, people think that the x86 way is the normal way and that everybody else is weird. Let's get one thing straight: The x86 architecture is the weirdo. The x86 has a small number (8) of general-purpose registers;...
The x86 architecture does things that almost no other modern architecture does, but due to its overwhelming popularity, people think that the x86 way is the normal way and that everybody else is weird. Let's get one thing straight: The x86 architecture is the weirdo.

- The x86 has a small number (8) of general-purpose registers; the other modern processors have far more. (PPC, MIPS, and Alpha each have 32; ia64 has 128.)
- The x86 uses the stack to pass function parameters; the others use registers.
- The x86 forgives access to unaligned data, silently fixing up the misalignment. The others raise a misalignment exception, which can optionally be emulated by the supervisor at an amazingly huge performance penalty.
- The x86 has variable-sized instructions. The others use fixed-sized instructions. (PPC, MIPS, and Alpha each have fixed-sized 32-bit instructions; ia64 has fixed-sized 41-bit instructions. Yes, 41-bit instructions.)
- The x86 has a strict memory model, where external memory access matches the order in which memory accesses are issued by the code stream. The others have weak memory models, requiring explicit memory barriers to ensure that issues to the bus are made (and completed) in a specific order.
- The x86 supports atomic load-modify-store operations. None of the others do.
- The x86 passes function return addresses on the stack. The others use a link register.

Bear this in mind when you write what you think is portable code. Like many things, the culture you grow up with is the one that feels "normal" to you, even if, in the grand scheme of things, it is one of the more bizarre ones out there.
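To make the memory-model point concrete, here is a minimal C11 sketch (mine, not part of the original post; the producer/consumer names are made up) of publishing data through a flag. On x86 the strict ordering often hides a missing barrier; on PPC, Alpha, or ia64 the accesses can be reordered unless you ask for ordering explicitly.

    /* Minimal sketch (not from the post): publishing data through a flag.
     * On x86 a plain store/load pair usually "just works" because of the
     * strict memory model; on weakly-ordered CPUs the explicit release/
     * acquire ordering (or fences) is required. */
    #include <stdatomic.h>

    static int payload;              /* ordinary data being published */
    static atomic_int ready = 0;     /* publication flag */

    void producer(void)
    {
        payload = 42;
        /* Release: the payload write may not be reordered after this store. */
        atomic_store_explicit(&ready, 1, memory_order_release);
    }

    int consumer(void)
    {
        /* Acquire: reads after this load may not be reordered before it. */
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;                        /* spin until the flag is set */
        return payload;              /* guaranteed to observe 42 */
    }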
Comments (67)
Comments are closed. |
"The x86 architecture does things that almost no other modern architecture does"
x86 is not a modern architecture, though. It should be compared to things like VAX and 68K. Through a quirk of history it’s happened to live on, but it’s still not modern. Its implementations might be, of course, but the ISA is prehistoric.
"The x86 supports atomic load-modify-store operations. None of the others do. "
What is fetchadd on Itanium?
Very enlightening. I wonder what the security implications of some of these differences are. For example, if return addresses are in registers rather than on the stack, then would a buffer overflow bug be much harder to exploit?
"The x86 uses the stack to pass function parameters; the others use registers."
How would you pass more than 32 (less the number of registers you can’t use; there’s a zero register for instance) arguments to a function?
x86 spills to stack more often than other architectures (not enough architectural registers), but it’s not mandatory.
Rats I missed fetchadd.
For the details on parameter passing, I refer readers to my earlier series on calling conventions. Typically, excess parameters are spilled onto the stack.
"How would you pass more than 32…arguments to a function?"
Maybe one register could hold the starting location of the argument list and another holds the number of arguments…
I was about to post about how some older processors I dealt with in my younger days supported the things you say that only x86 does, but DrPizza already did that.
Of course, the system I spent most of my time worrying about these kinds of things on — the Commodore 64 with a Motorola 6502 chip — was 8-bit anyway, so alignment was never much of a problem! It did (sort of) come up with indirect addressing, because then it needed a 16-bit value which IIRC couldn’t straddle a page boundary since the address would overflow and you’d end up with the value at the start of the current page rather than the next one. Indexed indirect made things even more interesting, because even if you were 16-bit aligned the index offset couldn’t cross a page boundary. (Apologies if I got any of this wrong. It’s been a long time since I’ve had to think about this stuff.)
These days I tend to stick to high-level languages (with an appropriate amount of concern for issues like alignment) so I don’t know a great deal about more modern CPUs, but it’s still interesting so it’s nice to hear from someone who does.
There are situations on x86 where an explicit memory barrier of some sort is needed.
http://www.microsoft.com/whdc/driver/kernel/MPmem-barrier.mspx
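One such situation, as a hedged sketch (the function and variable names are mine): x86 does allow a store to be reordered with a later load from a different location, so a Dekker-style handshake needs a full barrier even there.

    /* Sketch: even x86 permits store-load reordering, so this Dekker-style
     * "I want to enter" handshake needs a full fence (MFENCE or a LOCKed
     * instruction).  thread1_try_enter would be the mirror image. */
    #include <stdatomic.h>
    #include <stdbool.h>

    static atomic_int flag0 = 0, flag1 = 0;

    bool thread0_try_enter(void)
    {
        atomic_store_explicit(&flag0, 1, memory_order_relaxed);
        /* Without this fence the load of flag1 can be satisfied before the
         * store to flag0 becomes visible, and both threads may enter. */
        atomic_thread_fence(memory_order_seq_cst);
        if (atomic_load_explicit(&flag1, memory_order_relaxed)) {
            atomic_store_explicit(&flag0, 0, memory_order_relaxed);  /* back off */
            return false;
        }
        return true;   /* keep flag0 set for the duration of the critical section */
    }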
DrPizza: Calling conventions generally allow only 4-6 arguments to be passed in registers; beyond that, they’re placed on the stack.
Adrian: The return address may still have to be stored on the stack in a non-leaf function, since there is only one link register. However, the link register and the abundance of registers to which the return address can be moved do reduce the number of target functions somewhat.
Ben: The 6502 was produced by MOS Technology/Commodore Semiconductor Group, not by Motorola. MOS was the company that made mask fixing commonplace.
I’ve written on both MOS and CBM on Everything2. Perhaps not the most accurate articles I’ve written, and not on the best site for info, but dead-tree information sources pretty much back me up.
BTW, that’s to Ben Cooke.
Though, many of the odd behaviors of the x86 are improvements over the even odder behaviors of even older machines, which these "modern" chips are going back to.
IBM 360s passed parameters via registers and used a return link register because they just didn’t have a stack. (Return link registers make nested calls tough, and recursive calls very tough)
Ben: are you sure that wasn’t the 6509 or the 6510? :)
Cooney: And now x86 assembly language turns out to be a reasonably good compact byte-code for a hardware JITter feeding a RISC-like core.
(Uh, machine code, not assembly language.)
"DrPizza: Calling conventions generally allow only 4-6 arguments to be passed in registers; beyond that, they’re placed on the stack. "
Which makes them much like x86, then.
Spilling to the stack is used by pretty much all architectures. And presumably the return value must be stored somewhere other than a register, because you want to make calls more than one function deep.
I don’t know how many are currently used, but back in the early days, MIPS machines passed the first 3 function arguments in registers, and all the rest were passed on the stack. I remember having to modify a number of Unix commands and libraries to handle varargs properly (well, "properly" isn’t really correct, since what the early MIPS compilers did was not ANSI standard).
That’s the problem. In pre-prototype (K&R) C, all functions must be assumed to be varargs functions, and that’s hard if you don’t pass arguments on the stack. Newer architectures have the benefit of prototypes, so they can default to register-based calling conventions, passing on the stack only when they have to (through spillage) or when using varargs.
DrPizza: It’s been forever and a day since I’ve done any asm, but that doesn’t seem like a problem to me. Since only one function is returning at any one time, all functions can store their return values in the same register without trouble.
I seem to recall that one common calling convention for real-mode x86 involves storing the return value in AX.
How do you do locks without atomic load-modify-store?
josh: There are two alternatives I know of.
The most common is the CAS (Compare And Set) instruction. Its parameters are a memory address, an expected value and a new value. It atomically loads from the memory address and then stores the new value if the current value matches the expected value. It sets a condition flag indicating what happened.
The other is LL/SC (Load Linked, Store Conditional). These are two separate instructions. LL atomically loads from a memory address and remembers the address. SC stores a new value to that address so long as it is known not to have been modified since the LL. (False positives are possible; the granularity of modification tracking is unspecified but is likely to match the cache line size.)
You can implement a generalised load-modify-store by looping through atomic-load, modify, CAS or LL, modify, SC until the CAS or SC succeeds.
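In C-ish terms, the retry loop looks roughly like this (a sketch of the idea, not anyone's production code); a C11 compare-exchange lowers to CAS on CAS machines and to an LL/SC pair on LL/SC machines:

    /* Sketch: a generalised load-modify-store (fetch-and-add here) built
     * from a compare-exchange retry loop. */
    #include <stdatomic.h>

    int fetch_and_add(atomic_int *p, int delta)
    {
        int expected = atomic_load(p);
        /* Retry until nobody modified *p between our load and our store.
         * On failure (or a spurious LL/SC failure), 'expected' is reloaded
         * with the current value and we try again. */
        while (!atomic_compare_exchange_weak(p, &expected, expected + delta))
            ;
        return expected;   /* the value *p held just before our update */
    }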
Clearly everyone else has better memories than me. Writing Motorola was a brain fart since I was thinking at the same time about the 68000 series that DrPizza mentioned. I have here somewhere data sheets from Commodore Semiconductor Group about the CPU in the Commodore 64 although they’re buried somewhere in my big heap of old stuff in my attic.
Simon Cooke (too many similar names around here – I overlap with two people!) mentioned two other model numbers that I remember from somewhere, so now I’m left trying to remember what I remember each one from.
I know I’ve dealt with a 6502 and a 6510 at some point, but I can’t remember which numbers go with which things. (or maybe I’m confusing myself with non-CPU chip numbers from the same period)
Wasn’t 6510 used in the C64? And isn’t 6502 just an older version of it?
Raymond,
I would like to state that I think we are witnessing one of those moments where it is good to be the weirdo.
James
6502 was used on VIC-20 (and Apple II, and a number of others, as I recall). 6510 was used on Commodore 64.
Don’t forget SEH. On the x86, because of the unreliable unwinding, SEH requires registrations (small objects allocated on the stack and chained in a singly-linked list) to associate frame handlers. On all other architectures, there’s a single calling convention, so unwinding is always reliable and the system can get the frame handler simply by matching the program counter against a table.
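For reference, the per-frame registration being described has roughly this shape (a sketch; the real definitions and the handler signature live in the Windows headers, and the head of the chain is reached through FS:[0]):

    /* Rough shape of an x86 SEH registration record (sketch, not the SDK
     * definition).  Each guarded frame pushes one onto the stack and links
     * it into the per-thread chain whose head is at FS:[0]. */
    struct seh_registration {
        struct seh_registration *next;     /* previous record in the chain */
        void                    *handler;  /* frame handler for this scope */
    };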
Years ago I read a (DDJ? Byte?) magazine article in which the author referred to the x86 architecture as a "code museum" for its layers of "architectural history".
I always liked this term.
Don’t forget the 8502 in the C=128, with its fast mode. "Fast" being a blazing 2 MHz. ;)
Don’t forget the FPU stack *shudder*.
On the other hand, with x86-64 x86 is getting rid of some of the worst stuff. For example the low number of general purpose registers. Also, I remember reading that at least in Windows, usage of x87/MMX/3DNow! has been obsoleted in favor of SSE/SSE2 (only SSE registers are saved in context switches or something?) so assuming x86-64 programs become popular, x86 chip makers can decrease the amount of resources for obsolete specialty instructions in the future.
"DrPizza: It’s been forever and a day since I’ve done any asm, but that doesn’t seem like a problem to me. Since only one function is returning at any one time, all functions can store their return values in the same register without trouble. "
But what do they put back into the register when they return?
Say you’re at 0x100 and you call a function foo at 0x200 and you have 4 byte instructions. The return value will be presumably 0x104. Say foo+0x08 then calls a function bar. You duly set the return value register to 0x20c and execute bar. bar returns to 0x20c. Now what gets put into the return value register? How does it know where to return to?
The caller’s return value must be preserved across function calls, which means it’s got to be put somewhere other than a register, and surely that place is the stack.
DrPizza: I think you’re confusing return address and return value.
A nitpick:
* PPC has variable length instructions (VLE — 2 byte instructions).
* MIPS is god’s revenge on debuggers, as there are 16- 32- and 64- bit modes (although the latter is just a register issue and still has 32bit instructions). IIRC, functions can be either 16bit or 32bit ABI mode, and both can appear in the same compiled item.
Raymond Chen has a nice entry about the weird x86 architecture. Don’t miss the references to memory barriers in the feedback!…
Norman,
I have run into very few applications where I needed to use memcpy to deal with poorly aligned structures. When the code is properly written, alignment is usually not an issue.
Although Raymond’s description of the instruction set modern x86 processors accept is correct, it isn’t actually what the chips have. For example the underlying hardware has quite a few more registers and is far more like a RISC chip. The x86 instructions are converted into the "RISC" instructions. Additionally by analysing register usage and data dependencies it is possible to make use of the extra registers.
There is an excellent talk by Bob Colwell that I highly recommend watching. It is the 7th item on http://www.stanford.edu/class/ee380/winter-schedule-20032004.html
(Click on the camera thingy on the right)
In response to some of the comments in favor of x86, in particular the observations that automatic alignment fixup is handy and that under the hood the x86 has a multi-register architecture I would make the following observation in reply:
Sure. But it wastes a ton of silicon making all that happen.
More modern architectures can have smaller dies for the same computing power, which means lower power consumption and reduced cooling requirements. If those processors were produced in the same quantities as x86’s they’d be cheaper too.
There is a reason that most embedded applications don’t use x86.
With every process shrink the amount of chip real estate devoted to the x86 decoder goes down as a proportion of the whole chip. It also turns out that many of the modern RISC chips take a similar approach as well (they also have decades of legacy instructions) although it isn’t quite to the same degree.
As for power/cooling, it is perfectly possible to make cheaper chips that do x86 instructions and meet those targets. See Via and Transmeta. Intel just happen to optimise for performance and marketing at the expense of power and cooling. (They spent $300m on the initial marketing for the Pentium 4 in 2001.)
Even looking at the ARM (a favourite in the embedded space), there are other design goals (eg code density with the thumb extensions).
For anyone who is interested in this low level chip stuff, I highly recommend the comp.arch newsgroup.
James Curran: "IBM 360s passed parameters via registers and used a return link register because they just didn’t have a stack. (Return link registers make nested calls tough, and recursive calls very tough)"
"Passed"? 360 architecture is still alive and well, although it’s now called zArchitecture (via S/370 and ESA/390). You can still happily run OS/360 Programs on the latest z/OS, that’s compatibility that Windows programmers can only dream of. But it’s true there was no stack, until S/370 and all later generations. The stack however is a bit different to what most programmers expect, since you can’t allocate areas for local variables, parameter lists and so on, only hardware registers and return addresses are saved, and normally instructions do not reference the stack. OTOH it has a few tricks up its sleeve like allowing subroutine linkage over calls which transfer control synchronously to code in other address spaces.
But we never had problems calling nested or recursive routines anyway since most programs use a simple call convention established by the OS.
In regards to Tony’s comment that a lot of silicon is wasted on alignment and register renaming, there is actually an interesting trade-off in x86’s favor. Additional registers can be added to the micro-architecture without changing the ISA. So, newer generation chips (when silicon real estate is cheaper) can reduce register contention without needing to recompile any code. Also, as pipeline depth, branch prediction, and other micro-architectural changes are made between chip generations, recompiles aren’t necessary to avoid register dependency conflicts. The hardware can find more instruction-level parallelism than the compiler could (since the compiler was years older).
DrPizza: Yes, you are right in that something needs to store the return address *somewhere*. That may be the stack, or you could have register windows.
Register windows give you access to only a certain number of registers at a time – say, 32 out of 512. When you enter a function you tell it you need ‘n’ registers, and they get shifted to make room.
Eventually, you may spill out the other end, but the worst case for that is as bad as storing on the stack anyway.
Now, a disclaimer.
I don’t know whether any processor actually uses this technique – this is merely a vague recollection from my university days. But it sounds pretty clever. :)
"DrPizza: I think you’re confusing return address and return value. "
No, I just miswrote "value" when I meant "address".
KJK::Hyperion: the mixture of calling conventions on x86 isn’t a fundamental property of the architecture, it’s just how Windows does things (because back in the day, it increased performance). Modern Linux uses only one calling convention for instance, and has a table based exception dispatch ABI.
Quote:
Until the day that internet packets automatically get their contents aligned differently depending on what kind of architecture is going to read them, and disk files get their contents realigned, and the layouts of .BMP and other structures get dynamic realignments, unaligned data will have to be worked with one way or another. Hardware can do it faster.
Even on an architecture without alignment fixups, it’s still possible to access unaligned structure members. The compiler might need an __unaligned qualifier or might do it automatically – you just end up breaking an unaligned access into two aligned accesses and some shifts, which is probably as fast as hardware could do it.
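For instance, a misaligned 32-bit little-endian read can be done with byte loads and shifts, since byte accesses are always aligned; this is roughly what a compiler emits for an __unaligned access on an alignment-strict CPU. A sketch (the function name is mine):

    #include <stdint.h>

    /* Read a 32-bit little-endian value from an arbitrarily aligned pointer
     * using only byte accesses. */
    uint32_t read_u32_unaligned(const void *p)
    {
        const uint8_t *b = p;
        return (uint32_t)b[0]
             | ((uint32_t)b[1] << 8)
             | ((uint32_t)b[2] << 16)
             | ((uint32_t)b[3] << 24);
    }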
For all its ugliness, x86 survives because, as someone once put it: "x86 *owns* the binary".
On a related note, anyone clocked this?
http://www.wired.com/news/technology/0,1282,64914,00.html
Steve P: Register windows aren’t just theoretical. SPARC has fixed-size register windows. IA64 has variable-size register windows, as described in http://weblogs.asp.net/oldnewthing/archive/2004/01/13/58199.aspx
Greetings,
Quote:
Until the day that internet packets automatically get their contents aligned differently depending on what kind of architecture is going to read them, and disk files get their contents realigned, and the layouts of .BMP and other structures get dynamic realignments, unaligned data will have to be worked with one way or another. Hardware can do it faster.
I think internet datapackets is an orthogonal issue.
1) As far as I know, alignment issues are only for certain primitive data types, for example, integer. You can read a character on any alignment.
2) Packing of C structures is handled for you automatically. So if you have a structure with a 3-byte character array followed by an integer, the compiler will pad the structure so that the integer is placed on the proper alignment (see the sketch further down in this comment).
Therefore binary data that you write out to disk will be properly aligned when you read it back.
3) As far as I know, unaligned data access on x86 is slower than aligned access. So in fact, if your goal is speed, alignment is a good thing. As for internet data packets, if you look at many of the low-level TCP/IP sockets structures, they are padded with dummy bytes to make alignment work.
Also, when I do low-level TCP/IP communications, as data is received on a socket, it gets copied into the appropriate data structures. If it was on a machine that required alignment, since my structures are aligned for me by the compiler, everything works.
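Here is the padding from point 2 as a small sketch (the struct and field names are made up):

    #include <stddef.h>
    #include <stdio.h>

    /* A 3-byte character array followed by an int: the compiler inserts a
     * pad byte so the int lands on a 4-byte boundary. */
    struct rec {
        char tag[3];   /* bytes 0..2 */
                       /* byte 3: padding added by the compiler */
        int  n;        /* bytes 4..7 on a typical ABI */
    };

    int main(void)
    {
        printf("offsetof n = %zu, sizeof = %zu\n",
               offsetof(struct rec, n), sizeof(struct rec));   /* typically 4 and 8 */
        return 0;
    }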
Regards,
Dave
Data Alignment, one more thing
I forgot to mention that the C library routine, malloc, also pads out any memory it returns so that you are guaranteed alignment.
If you request one byte, the chunk you get will be larger than one byte; it will be padded to the next alignment size. That is why lots of small malloc requests are inefficient.
Take a look at the back of the K&R C book where it discusses malloc. Your requested memory chunk size is multiplied and rounded to ensure proper alignment.
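The rounding amounts to something like this (a sketch of the idea, not the actual library source; 'Align' stands in for the most restrictive type on the platform):

    #include <stddef.h>

    typedef long Align;   /* assumed to be the most restrictive alignment */

    /* Round a request up to a multiple of the alignment (assumes the
     * alignment is a power of two, as it is on common platforms). */
    size_t round_up(size_t nbytes)
    {
        return (nbytes + sizeof(Align) - 1) & ~(sizeof(Align) - 1);
    }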
Regards,
Dave
Also observe that the SSE extensions to x86 have alignment restrictions, and the instructions raise exceptions if used on unaligned data. There is a special unaligned move instruction if you really need it, but it’s slower than the regular aligned move instruction, and you have to explicitly choose to use it.
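In intrinsics terms (the function name is mine): _mm_load_ps maps to the aligned movaps and faults on a misaligned address, while _mm_loadu_ps maps to the slower movups and accepts any address. A sketch:

    #include <xmmintrin.h>

    /* Aligned vs. unaligned SSE loads: _mm_load_ps requires a 16-byte
     * aligned pointer and faults otherwise; _mm_loadu_ps tolerates any
     * alignment but is slower. */
    __m128 load_four_floats(const float *p, int known_aligned)
    {
        return known_aligned ? _mm_load_ps(p)     /* p must be 16-byte aligned */
                             : _mm_loadu_ps(p);   /* any alignment accepted */
    }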
Ben Hutchings: I was assuming CAS would count as load-modify-store, which apparently it doesn’t. :/ (I mean you’re loading a value, performing an arithmetic operation, and then potentially storing a different value back…) LL/SC is an interesting concept that I was not aware of. Thanks for the explanation. :)
mpz: I don’t know about SSE, but x87/MMX/3DNow! registers can be swapped on demand. You can skip saving them on a context switch and only save when another thread actually tries to use them.
Isaac, Mike:
Ahh, the VIC-20. That would have been where I encountered the 6502. I’m glad there are lots of people with good memories around here to set me straight: I won’t make *that* mistake again! :)
I never had the pleasure of owning a C128, but I remember that they were quite nutty machines. Didn’t they also have a second processor inside for running CP/M or something? I remember a friend showing me how he could boot either the Commodore 128’s own kernel, the C64’s kernel (and presumably use some other hardware borrowed from the C64) and with the help of a floppy disk also boot CP/M and change into a high-resolution text mode to make it more usable. (40×25 isn’t very much for a real command-line OS, of course.)
I jumped straight from C64 to Amiga, though, and aside from a small amount of tinkering I never did much "real programming" of the 68k chip in my A500. Instead, I learned a bunch of high-level languages, eventually learning C for the first time. Those were the days. :)
[QUOTE]
That may be the stack, or you could have register windows… I don’t know whether any processor actually uses this technique.
[/QUOTE]
According to Raymond’s earlier entries on calling conventions, it sounds like IA-64 does something along those lines:
http://weblogs.asp.net/oldnewthing/archive/2004/01/13/58199.aspx
Regarding SSE instructions, Tony Cox already mentioned that they have alignment restrictions that are not fixed up by the hardware. Specifically, the data needs to be 16-byte aligned. I’ve found _aligned_malloc to be quite handy in this regard. As far as favoring SSE due to context switching rules, I don’t know. And since there’s been a lot of talk about register usage, I think the VC++ .NET compiler will make use of SSE registers to pass parameters if certain optimizations are enabled (not sure though).
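Usage sketch of that (MSVC-specific, error handling omitted; the function name is mine):

    #include <stddef.h>
    #include <malloc.h>    /* _aligned_malloc / _aligned_free (MSVC CRT) */

    /* Get a 16-byte-aligned buffer suitable for movaps/_mm_load_ps; it must
     * later be released with _aligned_free, not free. */
    float *alloc_sse_buffer(size_t count)
    {
        return (float *)_aligned_malloc(count * sizeof(float), 16);
    }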
SSE/2 registers need ‘special’ operating system support to be saved across context switches (fxsave). MMX overlaps with the floating point registers so anything that saves floating point states across context switches (fsave/fnsave) will automagically save MMX states.
Just wanted to add: the 6502 was used in the original C64, the 6510 was used for the slightly revamped C64-C.
A minor nit: a typical malloc implementation reserves extra memory if needed to ensure that the *next* request is correctly aligned.
It might be argued that on an architecture like x86 which allows unaligned memory access, a conforming malloc implementation could ignore alignment entirely. I’m not much of a language lawyer, but "correctly aligned" could be taken to mean "aligned so as to be accessible without crashing". In practice, I’d be shocked to see a malloc implementation which imposes such a performance penalty.
Point of information: The PowerPC handles misaligned reads into and writes out of general purpose registers without exceptions (modulo page-crossing reads) but with some performance penalty.
Someone asked how you can implement atomic operations without load-modify-store; PowerPC uses the reservation instructions, lwarx and stwcx (http://www.go-ecs.com/ppc/ppctek1.htm and http://publibn.boulder.ibm.com/doc_link/en_US/a_doc_lib/aixassem/alangref/lwarx.htm). The special store instruction only succeeds if nobody else modified the location since the special load instruction, and reports whether the store succeeded.
9/15/2004 7:26 PM David
> It might be argued that on an architecture
> like x86 which allows unaligned memory
> access, a conforming malloc implementation
> could ignore alignment entirely.
That is correct. Furthermore the compiler can ignore alignment when generating object code. The reason ordinary compilers do not ignore alignment is exactly as you said, ordinary compilers are written with at least some degree of respect for performance (execution efficiency), which is obtainable except for structure layouts that are imposed by external requirements.
Even on CPUs which impose alignment in hardware, compilers and their associated malloc’s would be free to ignore alignment in laying out data though of course they would have to provide fixups to copy the data to aligned locations when necessary. In fact when external requirements impose layouts different from what the CPU requires, compilers and/or runtimes already have to provide fixups this way (or else C programs have to call memcpy() etc.). This is the reason why it is often faster for fixups to be done in the CPU instead of in software.
What do you mean 41-bit instructions? How can an instruction span 5 bytes and an extra bit?
The ia64 encodes 3 instructions in a 128-bit bundle. Subtract 5 bits of template overhead and you get 41 bits per instruction (3 × 41 = 123, plus the 5-bit template = 128).
I think we may have things backwards.
Parameter passing and return addresses are determined by the compiler, not the architecture. Because the x86 processors have few registers, the compiler’s choices are restricted – but it is still up to the compiler. If an IA64 compiler wants to pass parameters on the stack, there is nothing to stop it.
The Win32 ABI specifies the valid calling conventions. You can’t just make up a new one – it has to play friendly with SEH.
Note that causing an exception counts as "interfacing with the rest of the OS", so anywhere that can raise an exception needs to have glue. In order for the OS to be able to unwind exception frames, there are very specific rules about function prologues and epilogues so that the OS’s exception dispatcher can unwind a partially-executed function properly.
Sure, you can "glue" it, but since every memory access can potentially result in an exception (STATUS_IN_PAGE_ERROR for example), you’re going to have to erect glue around every memory access. That’s an awful lot of glue.
Instead of doing it en masse, Windows 95 did it incrementally.
Belated answers to exercises and other questions.