Date: | December 16, 2005 / year-entry #389 |
Tags: | other |
Orig Link: | https://blogs.msdn.microsoft.com/oldnewthing/20051216-10/?p=32943 |
Comments: | 25 |
A common problem when answering technical questions is that people sometimes ask a question that can't or shouldn't be answered because it is based upon a misunderstanding. What's particularly frustrating is when they insist that you answer their question as posed, even when you try to explain to them that their question is itself flawed.
It's as if somebody asked you the question, "Do I have to use the remote control to lock my kangaroo?" You could answer the question literally ("No"), but the person asking the question would walk away with the wrong conclusion ("Wow, kangaroos are self-locking!"). Robert Flaming recalls a similar analogy I made with balsa wood and nails.
Here's an example of a question that betrays misunderstanding.
When you turn on hyperthreading, each individual physical processor acts as if it were two virtual processors. From Task Manager's point of view, the computer has four virtual processors.
The two virtual processors associated with each physical processor are completely equivalent. It's not like one is physical and one is virtual. They are both virtual and compete equally for a share of the one physical CPU.
When you set processor affinities, you set them to virtual processors. To find out which virtual processors are associated with the same physical processor, you can call the GetLogicalProcessorInformation function.
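As a rough illustration of what that query gives you: each RelationProcessorCore record returned by GetLogicalProcessorInformation carries a ProcessorMask whose set bits are the virtual processors that share one physical core. The sketch below assumes those masks have already been collected into a list; the function name `siblings_by_core` and the sample masks are hypothetical, and the numbering on a real machine may well differ.

```python
def siblings_by_core(core_masks):
    """Turn per-core affinity masks into lists of virtual processor indices."""
    groups = []
    for mask in core_masks:
        indices = []
        bit = 0
        while mask:
            if mask & 1:
                indices.append(bit)
            mask >>= 1
            bit += 1
        groups.append(indices)
    return groups


# Hypothetical masks for two physical processors with hyperthreading enabled.
sample_masks = [0b0011, 0b1100]
print(siblings_by_core(sample_masks))
```

Every index in a group is an equal peer; nothing in the mask marks one of them as "the physical one".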
Comments (25)
But of course in many cases turning on Hyperthreading reduces your performance due to Intel’s botched implementation.
For the heavy simulation workload we do, I did extensive benchmarks and we actually got poorer performance when HT was turned on, especially because users would treat it like a 4-way rather than a 2-way machine.
You are much better off if you can somehow communicate to the users "yes it says 4 cpus, but only use 2". If you treat it as a 4-way machine, tasks often take longer to run in the end.
So maybe you should have told them "don’t turn it on at all".
The correct answer to such a question is ‘mu’. See http://en.wikipedia.org/wiki/Mu_%28Japanese_word%29.
(Just happy to be able to provide a useful comment once, even if it’s not technical :) )
"you can call the GetLogicalProcessorInformation function"
The documentation says that function requires Windows Vista, Windows XP Professional x64 Edition, Windows Server "Longhorn" or Windows Server 2003. So I can "try to call" or in five years "will be able to call" with decent success, but in today’s Windows installed base it will be rare that I "can call" GetLogicalProcessorInformation.
I’m glad that future generations will have this function though. It’s pretty darned hard to scope out HT processors right now. Intel has some ugly code that plays with SetProcessAffinityMask in order to divine whether the "processors" are real or HT, but it would have been nice if they had added a CPUID function to just tell us whether HT was enabled.
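One hedged way to cope with the availability problem is to probe for the export at run time and fall back when it is missing. The sketch below uses Python's ctypes for brevity; the helper name `kernel32_has` is hypothetical, and on non-Windows systems it simply returns None.

```python
import ctypes
import sys


def kernel32_has(export_name):
    """Return True/False if kernel32.dll exports the name, or None off Windows."""
    if sys.platform != "win32":
        return None
    kernel32 = ctypes.WinDLL("kernel32")
    # Attribute lookup on a ctypes library resolves the export lazily and
    # raises AttributeError when the DLL does not provide it.
    return hasattr(kernel32, export_name)


if kernel32_has("GetLogicalProcessorInformation"):
    print("topology API available")
else:
    print("fall back to the CPUID/affinity approach")
```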
vince: We see the same thing with the 3D CAD program that we use. When HT is on, it runs slower.
Of course, it’s also single-threaded, so maybe that has something to do with it.
Vince: I already covered that topic last year http://blogs.msdn.com/oldnewthing/archive/2004/09/13/228780.aspx for
You can always answer "jein" (a German word that blends ‘ja’ and ‘nein’, "yes and no", at the same time.)
Or you can start with "the standard expert answer" which is "well, it depends …"
:-)
Raymond: I’m not sure about vince, but our 3D CAD software slowdown happens on both 2K Pro+SP4 and XP Pro+SP2. According to your earlier post, XP understands HT and can schedule processes appropriately.
Our issue doesn’t appear to be related to scheduling, just something strange that happens in the CPU when HT is on and it’s getting used heavily by one thread.
"Our issue doesn’t appear to be related to scheduling, just something strange that happens in the CPU when HT is on and it’s getting used heavily by one thread."
Caches are very important, and HT has to share one cache between two virtual processors.
Normally, when something else must run briefly on your UP (uni-processor) machine, the CAD thread is stopped, the other thread is started, it finishes, and then the CAD thread starts again. Each time this happens, the code & data for the CAD thread are (likely to be) flushed from the cache, and the cache warms up again when the CAD thread is re-started.
Now, on the HT system, the OS knows it should prefer an empty physical CPU to one with a thread on it, but it can’t find such a CPU, so it starts the short-lived thread on another virtual processor sharing with your CAD thread. The CAD thread isn’t stopped, but it is sharing its cache with the other thread. This causes a lot more misses than normal, and in highly optimised inner loops (which your engineers may have used in heavy calculations) this makes things many times slower.
(For example, suppose your code does millions of "random" accesses in a 400×400 array of int32s. With 1MB data cache such an algorithm can be tuned to fit in the cache and run very fast. However if half the cache is being used by another thread, half your accesses go back to RAM, which is an order of magnitude slower. The code will run very slowly until that other thread goes away and you get all of the cache back.)
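The arithmetic behind that example can be checked quickly. This is only a back-of-the-envelope sketch: it treats "1MB" as 1024 KiB and ignores cache associativity, line size, and whatever else lives in the cache, all of which are simplifying assumptions.

```python
def working_set_bytes(rows, cols, element_size):
    """Bytes occupied by a dense rows x cols array."""
    return rows * cols * element_size


KIB = 1024
cache_bytes = 1024 * KIB                      # the 1MB data cache from the example
array_bytes = working_set_bytes(400, 400, 4)  # 400x400 array of int32s

print(array_bytes / KIB)                # array size in KiB
print(array_bytes <= cache_bytes)       # does it fit in the whole cache?
print(array_bytes <= cache_bytes // 2)  # does it fit in half the cache?
```

The array comes to 625 KiB, so it fits in the full cache but not in a half share of it.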
The OS can’t really detect this, so the only thing to do about it is to turn off HT. The same can happen on a real multi-processor machine, and on any system where some resources are shared. But it’s annoyingly common on HT, which is why the technology hasn’t been as big a money-spinner as Intel hoped.
If the hamburger came from Hamburg, where the heck is CHEESEBURG?
Incidentally, the classic Unix version of this question is something like "I can use stat() to tell a soft link from a file. How can I tell a hard link from a file?" (You can’t; every entry pointing to a file, including the original one, is a hard link.)
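A small sketch of that point, assuming a POSIX-style filesystem that supports both link types. Note that it uses lstat rather than stat, because stat follows a symbolic link to its target; the file names and the helper `link_demo` are made up for the illustration.

```python
import os
import stat
import tempfile


def link_demo():
    """Create a file, a hard link and a soft link, then compare their metadata."""
    with tempfile.TemporaryDirectory() as d:
        original = os.path.join(d, "original.txt")
        hard = os.path.join(d, "hard.txt")
        soft = os.path.join(d, "soft.txt")
        with open(original, "w") as f:
            f.write("hello\n")
        os.link(original, hard)     # second directory entry for the same inode
        os.symlink(original, soft)  # separate file that stores a path
        st_orig = os.lstat(original)
        st_hard = os.lstat(hard)
        st_soft = os.lstat(soft)
        return {
            "soft_is_symlink": stat.S_ISLNK(st_soft.st_mode),
            "hard_is_symlink": stat.S_ISLNK(st_hard.st_mode),
            "same_inode": st_orig.st_ino == st_hard.st_ino,
            "link_count": st_orig.st_nlink,
        }


if __name__ == "__main__":
    print(link_demo())
```

The two hard-linked names report the same inode and a link count of 2, and nothing in the metadata says which name came first.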
Intel did add a CPUID flag to check whether HT is available on the processor. The problem is that it might be on the processor but disabled by the BIOS or the OS.
So the code on Intel’s site is what you need. It just so happens that I was porting this code to my app this week. This is what I came up with:
————————
function AvailableProcessorCount: DWORD;
//returns total number of processors available to system including logical hyperthreaded processors
var
  i: Integer;
  ProcessAffinityMask, SystemAffinityMask: DWORD;
  Mask: DWORD;
begin
  if GetProcessAffinityMask(GetCurrentProcess, ProcessAffinityMask, SystemAffinityMask) then begin
    Result := 0;
    for i := 0 to 31 do begin
      Mask := 1 shl i;
      if (ProcessAffinityMask and Mask)<>0 then begin
        inc(Result);
      end;
    end;
  end else begin
    //can't get the affinity mask so we just report the total number of processors
    Result := OperatingSystemInfo.ProcessorCount;
  end;
end; (* AvailableProcessorCount *)

function AvailableProcessorCoreCount: DWORD;
(* Returns total number of processors available to system excluding logical hyperthreaded processors.

   We only have to do significant work for Intel processors since they are the only ones which implement
   hyperthreading.

   It's not 100% clear whether the hyperthreading bit (CPUID(1) -> EDX[28]) will be set for processors
   with multiple cores but without hyperthreading. My reading of the documentation is that it will be
   set but the code is conservative and performs the APIC ID decoding if either:
     1. The hyperthreading bit is set, or
     2. The processor reports >1 cores on the physical package.

   If either of these conditions hold then we proceed to read the APIC ID for each logical processor
   recognised by the OS. This ID can be decoded to the form (PACKAGE_ID, CORE_ID, LOGICAL_ID) where
   PACKAGE_ID identifies the physical processor package, CORE_ID identifies a physical core on that
   package and LOGICAL_ID identifies a hyperthreaded processor on that core.

   The job of this routine is therefore to count the number of unique cores, that is the number of
   unique pairs (PACKAGE_ID, CORE_ID).

   If the chip is not an Intel processor, or if it is Intel but doesn't have multiple logical processors
   on a physical package then the routine simply returns AvailableProcessorCount. *)

  function GetMaxBasicCPUIDLeaf: DWORD;
  begin
    asm
      PUSH EBX
      MOV EAX,0
      CPUID
      MOV Result,EAX
      POP EBX
    end;
  end; (* GetMaxBasicCPUIDLeaf *)

  function ProcessorPackageSupportsLogicalProcessors: Boolean;
  const
    HT_BIT = $10000000;
    FAMILY_ID = $00000F00;
    EXT_FAMILY_ID = $00F00000;
    PENTIUM4_ID = $00000F00;
  var
    VendorID: array [1..12] of char;
    RegEDX: DWORD;
    ProcessorSupportsHT: Boolean;
  begin
    ZeroMemory(@VendorID, SizeOf(VendorID));
    RegEDX := 0;
    Result := False;//may be overwritten later
    asm
      PUSH EBX
      //call CPUID with EAX=0 and record the result in VendorID
      MOV EAX,0
      CPUID
      //test the maximum basic CPUID leaf and quit if it's less than 1 which we need below
      CMP EAX,1
      JL @@quit
      //record Vendor ID
      MOV [DWORD PTR VendorID+0],EBX
      MOV [DWORD PTR VendorID+4],EDX
      MOV [DWORD PTR VendorID+8],ECX
      //call CPUID with EAX=1 and record the EDX register
      MOV EAX,1
      CPUID
      MOV RegEDX,EDX
    @@quit:
      POP EBX
    end;
    if VendorID='GenuineIntel' then begin
      if (RegEDX and HT_BIT)<>0 then begin
        Result := True;
      end;
    end;
  end; (* ProcessorPackageSupportsLogicalProcessors *)

  function GetLogicalProcessorCountPerPackage: DWORD;
  const
    NUM_LOGICAL_BITS = $00FF0000;
  var
    RegEBX: DWORD;
  begin
    asm
      PUSH EBX
      MOV EAX,1
      CPUID
      MOV RegEBX,EBX
      POP EBX
    end;
    Result := ((RegEBX and NUM_LOGICAL_BITS) shr 16);
  end; (* GetLogicalProcessorCountPerPackage *)

  function GetMaxCoresPerPackage: DWORD;
  var
    RegEAX: DWORD;
  begin
    if GetMaxBasicCPUIDLeaf>=4 then begin
      asm
        PUSH EBX
        MOV EAX,4
        MOV ECX,0
        CPUID
        MOV RegEAX,EAX
        POP EBX
      end;
      Result := (RegEAX shr 26) + 1;
    end else begin
      Result := 1;
    end;
  end; (* GetMaxCoresPerPackage *)

  function GetAPIC_ID: DWORD;
  var
    RegEBX: DWORD;
  begin
    asm
      PUSH EBX
      MOV EAX,1
      CPUID
      MOV RegEBX,EBX
      POP EBX
    end;
    Result := RegEBX shr 24;
  end; (* GetAPIC_ID *)

var
  i: Integer;
  PackCoreList: TIntegerList;
  ThreadHandle: THandle;
  LogicalProcessorCountPerPackage, MaxCoresPerPackage, LogicalPerCore,
  APIC_ID, PACKAGE_ID, CORE_ID, LOGICAL_ID, PACKAGE_CORE_ID,
  CORE_ID_MASK, CORE_ID_SHIFT, LOGICAL_ID_MASK, LOGICAL_ID_SHIFT,
  ProcessAffinityMask, SystemAffinityMask, ThreadAffinityMask, Mask: DWORD;
begin
  Result := 0;
  Try
    //see Intel documentation (Y:IntelIA32_manuals) for details on logical processor topology
    if OperatingSystemInfo.PlatformID=VER_PLATFORM_WIN32_NT then begin
      MaxCoresPerPackage := GetMaxCoresPerPackage;
      if ProcessorPackageSupportsLogicalProcessors or (MaxCoresPerPackage>1) then begin
        LogicalProcessorCountPerPackage := GetLogicalProcessorCountPerPackage;
        LogicalPerCore := LogicalProcessorCountPerPackage div MaxCoresPerPackage;
        LOGICAL_ID_MASK := $FF;
        LOGICAL_ID_SHIFT := 0;
        i := 1;
        while i<LogicalPerCore do begin
          i := i*2;
          LOGICAL_ID_MASK := LOGICAL_ID_MASK shl 1;
          inc(LOGICAL_ID_SHIFT);
        end;
        CORE_ID_SHIFT := 0;
        if MaxCoresPerPackage>1 then begin
          CORE_ID_MASK := LOGICAL_ID_MASK;
          i := 1;
          while i<MaxCoresPerPackage do begin
            i := i*2;
            CORE_ID_MASK := CORE_ID_MASK shl 1;
            inc(CORE_ID_SHIFT);
          end;
        end else begin
          CORE_ID_MASK := $FF;
        end;
        LOGICAL_ID_MASK := not LOGICAL_ID_MASK;
        CORE_ID_MASK := not CORE_ID_MASK;
        if GetProcessAffinityMask(GetCurrentProcess, ProcessAffinityMask, SystemAffinityMask) then begin
          ThreadHandle := GetCurrentThread;
          ThreadAffinityMask := SetThreadAffinityMask(ThreadHandle, ProcessAffinityMask);//get the current thread affinity
          if ThreadAffinityMask<>0 then begin
            Try
              PackCoreList := TIntegerList.Create;
              Try
                for i := 0 to 31 do begin
                  Mask := 1 shl i;
                  if (ProcessAffinityMask and Mask)<>0 then begin
                    if SetThreadAffinityMask(ThreadHandle, Mask)<>0 then begin
                      Sleep(0);//allow OS to reschedule thread onto the selected processor
                      APIC_ID := GetAPIC_ID;
                      LOGICAL_ID := APIC_ID and LOGICAL_ID_MASK;
                      CORE_ID := (APIC_ID and CORE_ID_MASK) shr LOGICAL_ID_SHIFT;
                      PACKAGE_ID := APIC_ID shr (LOGICAL_ID_SHIFT + CORE_ID_SHIFT);
                      PACKAGE_CORE_ID := APIC_ID and (not LOGICAL_ID_MASK);//mask out LOGICAL_ID
                      //identifies the processor core - it's not a value defined by Intel, rather it's defined by us!
                      if PackCoreList.IndexOf(PACKAGE_CORE_ID)=-1 then begin
                        //count the number of unique processor cores
                        PackCoreList.Add(PACKAGE_CORE_ID)
                      end;
                    end;
                  end;
                end;
                Result := PackCoreList.Count;
              Finally
                FreeAndNil(PackCoreList);
              End;
            Finally
              //restore thread affinity
              SetThreadAffinityMask(ThreadHandle, ThreadAffinityMask);
            End;
          end;
        end;
      end;
    end;
  Except
    ;//some processors don't support CPUID and so will raise exceptions when it is called
  End;
  if Result=0 then begin
    //if we haven't modified Result above, then assume that all logical processors are true physical processor cores
    Result := AvailableProcessorCount;
  end;
end; (* AvailableProcessorCoreCount *)
————————
It works (I think) but what a pain in the backside!
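For readers who want to follow the decoding step without the inline assembly, here is a minimal sketch of just the bit arithmetic, assuming 8-bit initial APIC IDs laid out as (PACKAGE_ID, CORE_ID, LOGICAL_ID) as described in the comment block above. The function names and the sample IDs are hypothetical; real IDs would have to come from CPUID on each logical processor.

```python
def field_width(count):
    """Bits needed to number `count` items (rounds up to a power of two)."""
    width = 0
    while (1 << width) < count:
        width += 1
    return width


def decode_apic_id(apic_id, logical_per_core, cores_per_package):
    """Split an APIC ID into (package_id, core_id, logical_id)."""
    logical_bits = field_width(logical_per_core)
    core_bits = field_width(cores_per_package)
    logical_id = apic_id & ((1 << logical_bits) - 1)
    core_id = (apic_id >> logical_bits) & ((1 << core_bits) - 1)
    package_id = apic_id >> (logical_bits + core_bits)
    return package_id, core_id, logical_id


def count_physical_cores(apic_ids, logical_per_core, cores_per_package):
    """Count the unique (package_id, core_id) pairs among the given APIC IDs."""
    cores = {
        decode_apic_id(a, logical_per_core, cores_per_package)[:2]
        for a in apic_ids
    }
    return len(cores)


# Hypothetical IDs: two single-core packages, two logical processors each.
print(count_physical_cores([0, 1, 6, 7], logical_per_core=2, cores_per_package=1))
```

Using a set of (package, core) pairs plays the same role as PackCoreList in the Delphi listing: siblings that differ only in LOGICAL_ID collapse into one entry.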
Dual-*Xenon* machine? Dude, quit hogging the XBox 360s and sell one of those on EBay!
"Incidentally, the classic Unix version of this question is something like ‘I can use stat() to tell a soft link from a file. How can I tell a hard link from a file?’ (You can’t; every entry pointing to a file, including the original one, is a hard link.)"
The Unix way is very powerful, but it takes a little time to wrap one’s head around it. "Every attribute of a file EXCEPT its name and path are associated directly with the file through the inode? But the thing I care about most often – the file name – is dereferenced? WTF!"
One word dude: Opteron
"mu" is a good answer for people who know the definition. The lawyers (and I, even though I’m not a lawyer) use, "The question presumes facts not in evidence."
"Every attribute of a file EXCEPT its name and path are associated directly with the file through the inode? But the thing I care about most often – the file name – is dereferenced? WTF!"
Path & Name have a *:1 relationship to inodes. Inodes are constant size, hence you can’t store all of the paths to a file in its inode.
The wikipedia HT page has an excellent illustration of the HT performance problem. It comes down to the false assumption by the HT pipeline that all data are immediately accessible in L1 cache. When the cache is shared and data access is delayed, the HT pipeline stalls in a disastrous way without retiring any u-ops.
Questions of this type are called "Karlsson’s questions". In the book by Astrid Lindgren, the Swedish author, Karlsson asks Freken Bok: "Have you stopped drinking brandy in the morning?" If she answers "Yes, I have!", you can tell she drank brandy before. And if Freken Bok answers no, this evidently means she is still drinking. This is not a yes/no question; you should explain to the person that the premise is incorrect.
VAS – since the German word for cheese is kaese or käse, the place that my "name" will take you to should help.
Not that I know what this has to do with hyperthreading, but what the heck.
@Peter:
The canonical example in the US of the question you posed is "Have you stopped beating your wife yet?"
All this talk of dual-core machines reminds me: any explanation on why QueryPerformanceCounter is horribly horribly b0rken on multicore machines? Even though the docs claim otherwise?
I already covered that topic earlier this year. http://blogs.msdn.com/oldnewthing/archive/2005/09/02/459952.aspx
Additional considerations on hyperthreading performance can be found at http://msdn.microsoft.com/msdnmag/issues/05/06/HyperThreading/default.aspx
A raging email thread on one of our internal aliases led me to an old blog entry about self-locking kangaroos