Date: | May 19, 2005 / year-entry #123 |
Tags: | code |
Orig Link: | https://blogs.msdn.microsoft.com/oldnewthing/20050519-00/?p=35603 |
Comments: | 25 |
Summary: | Designing a memory allocator that exploits our memory allocation pattern. |
After our latest round of optimization, the 100ms barrier teased us,
just milliseconds away.
Profiling the resulting program reveals that
60% of the CPU is spent in the memory allocator. Can we do something about that? Indeed, we can.

Notice that the memory allocation pattern for the strings in our dictionary is quite special: Once a string is allocated into the dictionary, it is never modified or freed while the dictionary is in use. When the dictionary is freed, all the strings are deleted at once. This means that we can design an allocator tailored to this usage pattern.
I don't know whether there is a standard name for this thing,
so I'm just going to call it a StringPool.

We implement it by using the same type of fast allocator that the CLR uses: A single pointer. [25 May 2005: The blog server software corrupts the diagram, sorry.]
To allocate memory, we just increment m_pchNext. It doesn't get much faster than that.

Note also that this memory arrangement has very good locality. Instead of scattering the strings all over the heap, they are collected into one location. Furthermore, they are stored in memory in exactly the order we're going to access them, which means no wasted page faults or cache lines. (Well, you don't know that's the order we're going to access them, but it's true. This is one of those "performance-guided designs" I mentioned a little while ago.)

```cpp
class StringPool
{
public:
 StringPool();
 ~StringPool();
 LPWSTR AllocString(const WCHAR* pszBegin, const WCHAR* pszEnd);

private:
 union HEADER {
  struct {
   HEADER* m_phdrPrev;
   SIZE_T  m_cb;
  };
  WCHAR alignment;
 };
 enum {
  MIN_CBCHUNK = 32000,
  MAX_CHARALLOC = 1024*1024
 };

private:
 WCHAR*  m_pchNext;   // first available byte
 WCHAR*  m_pchLimit;  // one past last available byte
 HEADER* m_phdrCur;   // current block
 DWORD   m_dwGranularity;
};
```
Each block of memory we allocate begins with a HEADER, which links the block to the previous one and remembers how big the block is.
Exercise: Why is HEADER a union containing a structure rather than just being a structure? What is the significance of the alignment member?

```cpp
inline RoundUp(DWORD cb, DWORD units)
{
 return ((cb + units - 1) / units) * units;
}

StringPool::StringPool()
 : m_pchNext(NULL), m_pchLimit(NULL), m_phdrCur(NULL)
{
 SYSTEM_INFO si;
 GetSystemInfo(&si);
 m_dwGranularity = RoundUp(sizeof(HEADER) + MIN_CBCHUNK,
                           si.dwAllocationGranularity);
}
```
At construction, we compute the size of our chunks.
We base it on the system allocation granularity, choosing
the next multiple of the system allocation granularity
that is at least sizeof(HEADER) + MIN_CBCHUNK in size.

```cpp
LPWSTR StringPool::AllocString(const WCHAR* pszBegin, const WCHAR* pszEnd)
{
 size_t cch = pszEnd - pszBegin + 1;
 LPWSTR psz = m_pchNext;
 if (m_pchNext + cch <= m_pchLimit) {
  m_pchNext += cch;
  lstrcpynW(psz, pszBegin, cch);
  return psz;
 }
 if (cch > MAX_CHARALLOC) goto OOM;
 DWORD cbAlloc = RoundUp(cch * sizeof(WCHAR) + sizeof(HEADER),
                         m_dwGranularity);
 BYTE* pbNext = reinterpret_cast<BYTE*>(
   VirtualAlloc(NULL, cbAlloc, MEM_COMMIT, PAGE_READWRITE));
 if (!pbNext) {
OOM:
  static std::bad_alloc OOM;
  throw(OOM);
 }
 m_pchLimit = reinterpret_cast<WCHAR*>(pbNext + cbAlloc);
 HEADER* phdrCur = reinterpret_cast<HEADER*>(pbNext);
 phdrCur->m_phdrPrev = m_phdrCur;
 phdrCur->m_cb = cbAlloc;
 m_phdrCur = phdrCur;
 m_pchNext = reinterpret_cast<WCHAR*>(phdrCur + 1);
 return AllocString(pszBegin, pszEnd);
}
```

To allocate a string, we first try to carve it out of the remainder of the current chunk. This nearly always succeeds. If the string doesn't fit in the chunk, we allocate a new chunk based on our allocation granularity. To avoid integer overflow in the computation of the desired chunk size, we check against a fixed "maximum allocation" and go straight to the out-of-memory handler if it's too big.
Once we have a new chunk, we link it into our list of
chunks and call ourselves recursively; the retried allocation
is satisfied from the new chunk, which is guaranteed to be big enough.

There is a subtlety here. Notice that we do not update
any of our member variables until the VirtualAlloc has succeeded.
If the allocation fails and we throw, the string pool is left
in the same consistent state it was in before the call.
```cpp
StringPool::~StringPool()
{
 HEADER* phdr = m_phdrCur;
 while (phdr) {
  HEADER hdr = *phdr;                 // copy the header out of the chunk...
  VirtualFree(phdr, 0, MEM_RELEASE);  // ...before releasing the chunk it lives in
  phdr = hdr.m_phdrPrev;              // (MEM_RELEASE requires a size of zero)
 }
}
```
To destroy the string pool, we walk our list of chunks and
free each one. Note the importance of copying the HEADER out of the
chunk before we free it: the header lives inside the chunk, so once
the chunk has been released we can no longer read it.

Using this string pool requires only small changes to the rest of our program.

```cpp
struct DictionaryEntry
{
 bool Parse(const WCHAR* begin, const WCHAR* end, StringPool& pool);
 // note: no method to free the strings; the pool owns them
 LPWSTR m_pszTrad;
 LPWSTR m_pszPinyin;
 LPWSTR m_pszEnglish;
};
```
We no longer need to free the strings when a DictionaryEntry is destroyed, since the pool frees them all in bulk.

```cpp
bool DictionaryEntry::Parse(
   const WCHAR* begin, const WCHAR* end,
   StringPool& pool)
{
 const WCHAR* pch = std::find(begin, end, L' ');
 if (pch >= end) return false;
 m_pszTrad = pool.AllocString(begin, pch);
 begin = std::find(pch, end, L'[') + 1;
 if (begin >= end) return false;
 pch = std::find(begin, end, L']');
 if (pch >= end) return false;
 m_pszPinyin = pool.AllocString(begin, pch);
 begin = std::find(pch, end, L'/') + 1;
 if (begin >= end) return false;
 for (pch = end; *--pch != L'/'; ) { }
 if (begin >= pch) return false;
 m_pszEnglish = pool.AllocString(begin, pch);
 return true;
}

Dictionary::Dictionary()
{
 ...
  if (de.Parse(buf, buf + cchResult, m_pool)) {
 ...
}
```
And finally, we pass our string pool to DictionaryEntry::Parse so that the parsed strings are allocated from the pool.
With these changes, the dictionary loads in 70ms (or 80ms if you include the time it takes to destroy the dictionary). This is 70% faster than the previous version, and over three times as fast if you include the destruction time.
And now that we've reached our 100ms goal, it's a good time to stop.
We've gotten the running time of dictionary loading down from
an uncomfortable 2080ms to a peppier 70ms, a nearly 30-fold improvement,
by repeatedly profiling and focusing on where the most time is
being spent.
(I have some more simple tricks that shave a few
more milliseconds off the startup time.
Perhaps I'll bring them into play if other changes to startup
push us over the 100ms boundary.
As things stand, the largest CPU consumers are
MultiByteToWideChar and lstrcpynW, so that's where I would look next.)
That's the end of the first stage. The next stage will be displaying the dictionary in an owner-data listview, but you'll have to wait until next month.
Comments (25)
Comments are closed.
You missed a closing tag somewhere – the whole thing is blue.
I’ve used this technique for symbol tables. It makes it very easy to remove symbols from the table in bulk by moving the end of the symbol table back (and the hash entry table).
This might be a dumb question, but why is MIN_CBCHUNK set to 32,000 instead of 32K?
Maybe you should decide whether the type of a memory size is DWORD, SIZE_T or size_t. Then use that as the return type and argument type of RoundUp rather than relying on the "implicit int" rule which was removed from C++ about, oh, 15 years ago.
"I don’t know whether there is a standard name for this thing"
Wouldn’t it be called a stack? In this case, you are always pushing and never popping. When you want to pop all, you simply reset the stack pointer.
Many years ago the version of Pascal I used supported this method of allocation in the language. I can’t remember all the keywords, but it had a way of marking the stack and then releasing all the allocations back to the mark.
All this work has already been done… Simply use HeapCreate() and HeapAlloc()!
This storage scheme is an instance of an "arena"-based allocator.
I first encountered it in Fraser and Hanson's book "A Retargetable C Compiler", in which the implementation of the lcc compiler is described in detail.
Indeed, this is an excellent technique for implementing programming languages. The MSFT script engines use this technique extensively.
Building a parse tree, for example, requires many many small allocations all of which will be freed at exactly the same time: when the code generator is done walking the parse tree.
Is there an advantage to using reinterpret_cast&lt;type&gt; instead of the good old standard cast (type)?
Ulric asked:
"Is there an advantage to using reinterpret_cast&lt;type&gt; instead of the good old standard cast (type)?"
Yes.
The advantage is that you have clearly marked "What I’m doing now is likely ugly! Be aware!". Old-style C casts are easily missed, making it seem like something common, when in fact casting in a strongly typed language is something exceptional and should be marked so.
Well, it’s time for me to surrender. Sort of :)
Raymond pulls out all the stops in his sixth version….
"Exercise: Why is HEADER a union containing a structure rather than just being a structure? What is the significance of the alignment member?"
I didn’t follow the code very closely because C++ makes my head hurt, but I’m assuming the reason is alignment issues.
The CPU is really picky about the way you ask for values stored at an address. If you ask it to look up any address that isn’t cleanly divisible by some specific number, it throws an exception. Compilers hide this from you, and lower-level facilities like malloc() have system-dependent information built into the library that prevents this from happening.
When you implement your own memory allocation scheme, you get exposed to the ugly underbelly of CPU memory addressing.
The union trick gets the compiler to round up to a certain CPU happy multiple, so the CPU doesn’t end up with an address that makes it choke.
I’m curious as to why you did not use placement new / delete to implement the pool allocator? That way you wouldn’t have dramatically different syntax (stuff like pool.AllocString() for example).
Why not do the following?
1. Preprocess the file into a stream of (chinese pinyin english )* (in Unicode)
2. Read the file into memory
3. Append an extra NUL terminator just in case
4. Loop through the data, count number of dictionary entries
5. Allocate precise number of dictionary entries
6. Loop through the data, populate the pointers in the dictionary entries
Actually, even that can be improved.
1. Preprocess the file into a stream of (chinese pinyin english )* (leave in Big5)
2. Make the file a resource and compile it into the executable
3. Load the resource into memory
4. Allocate the precise number of dictionary entries (since you preprocessed it, you probably know this)
5. Loop through the data, populate the dictionary entries with pointers
Note that leaving it as Big5 here will substantially reduce your working set. I’ll argue that the extra calls to MultiByteToWideChar in your GUI are worth the smaller dataset of leaving the data (which is mostly 7-bit anyway) in Big5.
Plus, making the dictionary part of the executable makes the program easier to distribute.
In summary, this removes the overhead of
(a) MultiByteToWideChar
(b) String allocations
(c) Parsing
Sounds like it would be pretty fast to me.
To clarify, when I say cast back when talking about reinterpret_cast I mean reinterpret_cast back; you cannot reinterpret_cast to one type and static_cast back to the other and expect that to work in general.
What you did is called an arena allocator (I also first encountered it in Hanson’s excellent book). A pool allocator for the most part is a freelist of fixed-sized objects. I’ve seen that claim before about the CLR allocator and I don’t believe it’s that simple, given that it has to support pinned objects.
You can throw OOM with "throw std::bad_alloc()" instead of creating a static object.

You should stop using DWORD/int when you mean size_t, ptrdiff_t, container::size_type, or container::distance_type. Please *DO NOT* use a windows header definition of those types or their bastardized all-caps version; they’ve historically botched the actual type by typedeffing it to a type that just happens to be the same size. Not so important in C (except when creating a pointer to it) but this is extremely important in C++’s very subtle overload resolution (VC has a stupid quirk where int[op]long returns int instead of long, now you know where it comes from).

Stop going past the end of the array, and stop assuming relational operators work when you’re not comparing elements of an array.

Stop reinterpret_casting to/from void pointer; static_cast is the thing you want. My guideline on reinterpret_cast is that there are only 2 places where it can be used: 1. casting a pointer to/from an integer big enough to hold it so you can do non-portable, implementation-defined things to it; 2. casting a pointer to a suitably aligned pointer and back. The second case is where you can reinterpret_cast to void *, but ONLY if you cast back to the original type.

This style of code has been implementation defined ever since C correctly replaced all questionable instances of char* with void*:
foo *f;
unsigned char *uc;
uc = (unsigned char*)f;
// aka uc = reinterpret_cast<unsigned char*>(f);
it’s always portable and correct to do this:
uc = (unsigned char*)(void*)f;
// aka uc = static_cast<unsigned char*>(static_cast<void*>(f));
You may think I’m being anal (and I am), but compilers are increasingly becoming standards compliant in terms of supported and unsupported behavior so I think it’s much better to learn what C/C++ actually guarantees over what compiler switches your current compiler guarantees.
The goto is unnecessary — you can just replace it by throw(OOM) which is about the same number of characters, and put the OOM declaration a bit further up.
</font>
Just tried to close the blue font tag :-)
First, big thanks for the very interesting series.
Just one thing – I wonder, when you originally wrote the dictionary application, did you actually follow all the steps you’ve shown here? What I mean is, did you start with the pure STL version, and then slightly moved on to what we see now or, for example, the memory-mapped file was there from the very beginning, and you just wrote the STL stream version for the blog’s sake?
Sorry, I didn’t ask this earlier but [Raymond was currently on vacation].