Date: | May 16, 2005 / year-entry #120 |
Tags: | code |
Orig Link: | https://blogs.msdn.microsoft.com/oldnewthing/20050516-30/?p=35633 |
Comments: | 24 |
Summary: | Getting rid of getline was a big help, but 480ms is still not quite peppy enough. You need to respond to user actions within a tenth of a second for things to seem responsive. Profiling the latest endeavor reveals that 40% of our CPU time is spent in codecvt::in. Some debugging reveals that codecvt::in ultimately... |
Getting rid of getline was a big help, but 480ms is still not quite peppy enough. You need to respond to user actions within a tenth of a second for things to seem responsive.
Profiling the latest endeavor reveals that 40% of our CPU time is spent in codecvt::in. Some debugging reveals that codecvt::in ultimately converts only a character or two at a time. Let's get rid of it and do the conversion ourselves with MultiByteToWideChar:
#define CP_BIG5 950
Dictionary::Dictionary()
{
 MappedTextFile mtf(TEXT("cedict.b5"));
 ...
}
Instead of using the codecvt facet, we now convert each line with a single call to MultiByteToWideChar. With this change, the dictionary load time has dropped to 240ms (or 300ms if you include the time it takes to destroy the dictionary). That's twice as fast as the previous version, but still not quite close enough to the 100ms goal. We still have some work ahead of us. [Raymond is currently on vacation; this message was pre-recorded.] |
Comments (24)
Comments are closed. |
I can’t help but think that so far, all you have demonstrated is how inefficient the MS STL implementation is.
Perhaps it’s not possible to make it much better. But I don’t really believe that.
C++ and its lack of support for Unicode is a big, big pain… You cannot rely on anything.
Raymond is definitely making some great headway. Having targeted the single-character-at-a-time conversion…
I’ve run the four versions, but get no discernible difference in performance:
version 3: 210ms
version 4: 210ms
(with Borland CPP builder)
niven – sounds like you’re using GetTickCount() to do your perf timing.
You might want to use the instruction cycle timer or the performance counter instead.
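For reference, a portable way to get the same sub-millisecond resolution is std::chrono::steady_clock, which on Windows is typically implemented on top of the performance counter. A minimal timing helper (the busy loop in the test merely stands in for the real dictionary load):

```cpp
#include <chrono>

// Time one run of some work with the steady high-resolution clock.
// std::chrono::steady_clock typically wraps the OS performance counter
// (QueryPerformanceCounter on Windows), so its resolution is far finer
// than GetTickCount's roughly 10-15ms tick granularity.
template <typename Work>
double TimeMs(Work work)
{
    auto start = std::chrono::steady_clock::now();
    work();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```

With this, two runs of the same load can be compared meaningfully even when both finish inside a single GetTickCount tick.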
Just for reference, over at http://blogs.msdn.com/ricom/ the unoptimized managed version (with plain old stream ReadLine) clocks in at 124ms, which suggests to me that small allocations are the weakest part of the standard library.
Dictionary::Dictionary()
{
  MappedTextFile mtf(TEXT("cedict.b5"));
  const CHAR* pchBuf = mtf.Buffer();
  const CHAR* pchEnd = pchBuf + mtf.Length();
  std::vector<wchar_t> buf;
  while (pchBuf < pchEnd) {
    const CHAR* pchEOL = std::find(pchBuf, pchEnd, '\n');
    if (*pchBuf != '#') {
      size_t cchBuf = pchEOL - pchBuf;
      buf.resize(cchBuf + 1); // +1 to prevent zero-length buffer, which is undefined
      DWORD cchResult = MultiByteToWideChar(CP_BIG5, 0,
                                            pchBuf, cchBuf, &buf[0], cchBuf);
      if (cchResult) {
        wstring line(&buf[0], cchResult);
        DictionaryEntry de;
        if (de.Parse(line)) {
          v.push_back(de);
        }
      }
    }
    pchBuf = pchEOL + 1;
  }
}
Some notes about code in your example:
wstring’s constructor could throw
DictionaryEntry’s constructor could throw
de.Parse could throw
v.push_back could throw
Any of these events will give you a memory leak, because nobody will call ‘delete[] buf’.
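That leak only exists when the buffer is a raw new[]-ed array; the standard fix is RAII, which is exactly what the std::vector variant above buys you. A minimal sketch of the principle (ConvertLine and its byte-widening loop are hypothetical stand-ins for the real Big5 conversion):

```cpp
#include <vector>
#include <string>
#include <stdexcept>

// Because the buffer is owned by a std::vector, its destructor runs
// during stack unwinding, so a throw from any later call (Parse,
// push_back, the wstring constructor) can no longer leak it.
std::wstring ConvertLine(const char* pch, size_t cch)
{
    std::vector<wchar_t> buf(cch + 1); // freed even if something below throws
    for (size_t i = 0; i < cch; ++i)   // stand-in for the real conversion
        buf[i] = static_cast<wchar_t>(static_cast<unsigned char>(pch[i]));
    if (cch == 0)
        throw std::runtime_error("empty line"); // simulated failure path
    return std::wstring(&buf[0], cch); // buf is destroyed on return, too
}
```

The same holds for std::unique_ptr<wchar_t[]>; either way, no explicit delete[] is needed on any path.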
And finally (I hope):
it is a good idea to allocate the buffer on the stack (with ‘_alloca()’); this will give a significant speed boost depending on which CRT you link against.
To prevent stack overuse (and overflow), each ‘while’ iteration should be contained in a separate stack frame, for example by putting the loop body in a separate __fastcall __declspec(noinline) function. On modern PCs the cost of a function call is insignificant compared with the lock/unlock in malloc (even when it is amortized by some mechanism).
Basically — just use your mind, Luke… :)
Microsoft’s implementation of _alloca is ass-backwards as can be. They just reused (literally, same address, different symbol names) the code they call on functions with large stacks (which just touches the stack in 4k increments) to implement it. Instead of fixing _alloca, they added another function _resetstkoflw (that looks like it was written by a summer intern) which you have to call in a SEH block that tests for stack overflow.
It would be interesting to see how much of a further improvement this code gets just by using Doug Lea’s malloc instead of using VC’s allocator and maybe even add STLPort to the mix instead of VC’s STL/string classes.
2. _alloca’s implementation looks reasonable to me, i.e. the intrinsic does almost nothing and leaves the stack-growth logic to the underlying system.
3. In this specific example the stack should never overflow, there should be no need to call ‘_resetstkoflw’, and using _alloca is OK.
4. It is inconvenient to force the system to increase the stack size, because AFAIK Windows never reclaims committed stack memory while its owner thread is still running.
> It would be interesting to see how much of a
> further improvement this code gets just by
> using Doug Lea’s malloc instead of using VC’s
> allocator and maybe even add STLPort to the
> mix instead of VC’s STL/string classes.
It will be slower. _alloca() is an intrinsic and should compile down to one to three assembly instructions…
My guess is that in the end we will have something like this:
open MemoryMappedFile
Allocate wcharBuffer // big enough for full file
MultiByteToWideChar // one single conversion
parse the wcharBuffer and just remember
pointers inside the wcharBuffer, no allocation,
no nothing.
And I bet Raymond has it all written already, just trying to show us how it is done :-)
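The parse-in-place half of that outline can be sketched portably: split an already-converted wide buffer into lines as non-owning views, so no per-line string is allocated. This assumes C++17 std::wstring_view and treats the one-shot Big5 conversion (a single MultiByteToWideChar call on Windows) as already done:

```cpp
#include <string_view>
#include <vector>

// Record one non-owning view per line of an already-converted buffer.
// Nothing is copied; each view is just a pointer+length into buf,
// so the per-line allocations disappear entirely.
std::vector<std::wstring_view> SplitLines(std::wstring_view buf)
{
    std::vector<std::wstring_view> lines;
    size_t pos = 0;
    while (pos < buf.size()) {
        size_t eol = buf.find(L'\n', pos);
        if (eol == std::wstring_view::npos) eol = buf.size();
        if (buf[pos] != L'#')               // skip comment lines, as before
            lines.push_back(buf.substr(pos, eol - pos));
        pos = eol + 1;
    }
    return lines;
}
```

The catch, of course, is that every view is only valid as long as the big buffer stays alive, so the buffer must be owned by the Dictionary itself.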
Mihai,
Raymond had it written sometime last year (in September or October, IIRC).
He writes that far ahead.
I wonder if ‘\n’ could be part of a multibyte character? If yes, this code is invalid.
AFAIK, Windows supports only DBCS (i.e. up to 2 bytes per character) — who knows if it is possible for a DBCS character to have ‘\n’ in its second byte?
Hold it, I thought I just had to use exceptions and the world became a better place without me having to do anymore work.
*snicker*
Once again showing that switching between return values and exceptions just means that the developer has to check for something different.
> Once again showing that switching between
> return values and exceptions just means that
> the developer has to check for something
> different.
No — that means the developer must always be **prepared**. :) That is different.
Other notes:
– it is a good idea to call ‘v.reserve(N)’ with some meaningful value N (in case ‘v’ is a std::vector)
– if the size of v could potentially grow large, I’d strongly suggest using ‘deque’ instead (in case ‘v’ is a std::vector)
In a production environment, wouldn’t it be more efficient to write a converter to preprocess the file, and make the whole thing unicode to begin with?
I’m not 100% sure (I don’t have my DBCS manual here) but I believe that no characters < 0x1f are legal leading or trailing bytes in Big5
From <http://www.jbrowse.com/text/charsets.html#Big5>
The term ‘Big5’ has been abused quite a lot over the years. The original Big5 character repertoire is no longer used and the name ‘Big5’ usually means one of the many extensions. The main extensions are Microsoft’s CP950, Big5-ETen, and Big5-HKSCS.
Big5 is both a character set and an encoding. As an encoding, it is a DBCS encoding with lead bytes in the range 0xa1-0xf9 and trails 0x40-0xfe.
Ah ha… Here’s a table from MSDN listing the lead and trail byte ranges for the various Far East code pages used by Windows:
http://web.archive.org/web/20000303195204/http://msdn.microsoft.com/library/books/devintl/S24B2_b2.HTM
(Link at archive.org because it seems it’s no longer on microsoft.com anymore.)
> Anyway, it is worth pointing out that, for example, in ‘CP949’ ‘\n’ could be part of a DBCS character
I don’t think your source is correct. According to http://www.microsoft.com/globaldev/reference/dbcs/949.mspx , lead bytes start at 0x81, and trail bytes (as far as I can tell) go no lower than 0x41.
http://lists.freebsd.org/pipermail/freebsd-bugs/2003-August/002657.html (random page found via Google) agrees:
+ * CP949 contains four character sets:
+ *
+ * <0x00-0x7f> ASCII equivalent
+ * <0xa1-0xfd><0xa1-0xfe> EUC instance of KS X 1001:2002
+ * <0x81-0xa0><0x41-0xfe> Hangul syllables GAGG-JWAK
+ * <0xa1-0xc6><0x41-0xa0> Hangul syllables JWAT-HIH (ends with C652)
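Those ranges can be checked mechanically. A sketch hard-coding the trail-byte ranges quoted in this thread (Big5 trails 0x40-0xfe; CP949 trails no lower than 0x41), confirming that ‘\n’ (0x0a) can never be a trail byte, so scanning the raw bytes for ‘\n’ is safe in both encodings:

```cpp
// Trail-byte ranges as quoted in the discussion above; treat them as
// that discussion's claim rather than an authoritative code-page spec.
bool CanBeTrailByteBig5(unsigned char b)   // CP950
{
    return b >= 0x40 && b <= 0xfe;
}

bool CanBeTrailByteCP949(unsigned char b)
{
    return b >= 0x41 && b <= 0xfe;
}
```

Since 0x0a falls below both lower bounds (and also below every quoted lead-byte range), a ‘\n’ byte always really is a line break.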