Date: January 19, 2006 / year-entry #26
Tags: code
Orig Link: https://blogs.msdn.microsoft.com/oldnewthing/20060119-11/?p=32603
Comments: 22
Summary: There are many algorithms for fast string searching, but the running time of a string search is inherently O(n), where n is the length of the string being searched: If m is the length of the string being searched for (which I will call the "target string"), then any algorithm that accesses fewer than n/m elements of the string...
There are many algorithms for fast string searching, but the running time of a string search is inherently O(n), where n is the length of the string being searched: if m is the length of the string being searched for (which I will call the "target string"), then any algorithm that accesses fewer than n/m elements of the string being searched will have a gap of m unaccessed elements, which is enough room to hide the target string.

More advanced string searching algorithms can take advantage of characteristics of the target string, but in the general case, where the target string is of moderate size and is not pathological, all that the fancy search algorithms give you over the naive search algorithm is a somewhat smaller multiplicative constant.

In the overwhelming majority of cases, then, a naive search algorithm is adequate. As long as you're searching for normal strings and not pathological edge cases (say, hunting for a long run of a repeated character inside an even longer run of that same character), the naive algorithm does just fine. That's why nearly all library functions that do string searching use the naive algorithm. The naive algorithm is the correct algorithm over 99% of the time.
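To make the argument concrete, here is roughly what the naive algorithm looks like; this is a minimal sketch for illustration, not the code of any particular runtime's strstr:

```cpp
#include <cstddef>
#include <cstring>

// Naive substring search: try each starting position in the haystack and
// compare the target string against it. Worst case O(n*m), but for ordinary,
// non-self-similar strings the comparison almost always fails after a
// character or two, so in practice it behaves like O(n) with a tiny constant.
const char* naive_search(const char* haystack, const char* needle)
{
    std::size_t m = std::strlen(needle);
    if (m == 0) return haystack;                    // empty target matches immediately
    for (; *haystack; ++haystack) {
        if (std::strncmp(haystack, needle, m) == 0) // compare target at this offset
            return haystack;
    }
    return nullptr;                                 // no match
}
```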
Comments (22)
If you’re going to be searching the SAME long string over and over again, it is VERY much faster to build an index. (Assuming the string is non-pathological.)
Compare SQL’s LIKE ‘%butter%’ vs. CONTAINS(…, ‘butter’, …) on a large data set.
In the past, I needed a very fast way of searching a constant string for many substrings. I built a suffix trie from the data. This allows searching in O(m) time. You sacrifice memory, but when you are looking for short substrings in large text the speed difference is amazing.
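For the curious, a suffix array gives much the same query speed as the suffix trie described above while using considerably less memory. The sketch below is deliberately simple (naive O(n² log n) construction; real implementations build the array in O(n)), and the class name is mine, not from any library:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Index a fixed text once, then answer "does `pattern` occur?" in O(m log n)
// per query by binary-searching the lexicographically sorted suffixes.
class SuffixIndex {
public:
    explicit SuffixIndex(std::string text) : text_(std::move(text)) {
        suffixes_.resize(text_.size());
        for (std::size_t i = 0; i < text_.size(); ++i) suffixes_[i] = i;
        std::sort(suffixes_.begin(), suffixes_.end(),
                  [this](std::size_t a, std::size_t b) {
                      return std::string_view(text_).substr(a)
                           < std::string_view(text_).substr(b);
                  });
    }

    bool contains(std::string_view pattern) const {
        auto it = std::lower_bound(
            suffixes_.begin(), suffixes_.end(), pattern,
            [this](std::size_t suffix, std::string_view p) {
                return std::string_view(text_).substr(suffix, p.size()) < p;
            });
        return it != suffixes_.end() &&
               std::string_view(text_).substr(*it, pattern.size()) == pattern;
    }

private:
    std::string text_;                   // the text being indexed
    std::vector<std::size_t> suffixes_;  // suffix start offsets, sorted
};
```

The up-front construction cost only pays off when the same long text is searched many times, which is exactly the scenario described in the two comments above.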
"The naive algorithm is the correct algorithm over 99% of the time."
99% of projects? 99% of search functions/methods? 99% of calls made to search functions/methods?
Maybe I just take Mr Lippert too seriously. ("250% of what, exactly?", Sep 1/05) :D
I still like Boyer-Moore because it only touches each byte once. There’s a certain elegance there that I really appreciate.
Of course, that’s all irrelevant because ultimately, Raymond’s spot on – most of the time there’s no advantage to using the more complex one.
Although you can make things better by optimizing the search for the first character of the string :)
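For reference, the "skip ahead" idea under discussion looks roughly like this. This is Boyer-Moore-Horspool, the simplified cousin of full Boyer-Moore, sketched for illustration rather than taken from any library:

```cpp
#include <array>
#include <cstddef>
#include <string_view>

// Boyer-Moore-Horspool: compare the target string against the current window
// from right to left, and on a mismatch skip ahead based on the last byte of
// the window. The average case touches far fewer than n characters, but the
// worst case is still O(n*m).
std::size_t horspool_search(std::string_view haystack, std::string_view needle)
{
    const std::size_t n = haystack.size(), m = needle.size();
    if (m == 0) return 0;
    if (m > n) return std::string_view::npos;

    // Bad-character table: how far the window may shift for each possible byte.
    std::array<std::size_t, 256> shift;
    shift.fill(m);
    for (std::size_t i = 0; i + 1 < m; ++i)
        shift[static_cast<unsigned char>(needle[i])] = m - 1 - i;

    std::size_t pos = 0;
    while (pos + m <= n) {
        std::size_t i = m;
        while (i > 0 && haystack[pos + i - 1] == needle[i - 1]) --i;
        if (i == 0) return pos;  // all m characters matched
        pos += shift[static_cast<unsigned char>(haystack[pos + m - 1])];
    }
    return std::string_view::npos;
}
```

Note that a byte can be examined more than once across overlapping windows, which is why the "touches each byte once" claim above does not quite hold (see the correction further down).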
"trying too hard" really isn’t the right phrase to be using here.
Both this and the previous blog posting on splay trees would have been far more accurately titled "The cost of not understanding the implications of your choices", although this isn’t as catchy as yours.
The trend so far (if I may extrapolate from two data points) seems to be that you have an axe to grind with people who uncritically copy algorithms from textbooks. Well, yeah, people like that are pretty common, but copying algorithms uncritically is usually not among their chief failings…
People will occasionally say things like "Why doesn’t XYZ use [insert fancy algorithm here]? It’s clearly superior to the brute-force algorithm you’re currently using." These people are trying too hard. In many cases, the fancy algorithm turns out to be worse than the dumb one.
I think this case is different from splay trees. If a smarter algorithm is used, is it really slower in the average case, or does it remain equal in the average case while wrapping up the pathological edge cases?
If you’re writing a library (write once and use many times), and if performance of the average case doesn’t deteriorate, then you still get medium-sized gains at cheap cost by choosing a better algorithm.
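To illustrate what "pathological" means here, a hypothetical worst case for the naive algorithm is a target string that almost matches at every position; the numbers below are made up for demonstration:

```cpp
#include <string>

// Worst case for the naive algorithm: at nearly every one of the ~1,000,000
// starting positions the comparison runs ~1,000 characters deep before
// failing, for roughly a billion character comparisons instead of a million.
int main()
{
    std::string haystack(1'000'000, 'a');   // one million copies of 'a'
    std::string needle(999, 'a');
    needle += 'b';                          // "aaa...ab", 1,000 characters

    // std::string::find is typically a straightforward search, so this single
    // call does on the order of n*m work.
    auto pos = haystack.find(needle);       // npos: the target never occurs
    return pos == std::string::npos ? 0 : 1;
}
```

A smarter algorithm (KMP, for example) handles this input in O(n + m) while staying linear on ordinary inputs, which is the trade-off being weighed in the comment above.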
It’s not just about getting all enamored of fancy algorithms; it’s about recognizing when you’re overkilling the problem. Using a fancy hash table to keep track of 5 elements, for example.
One of the things about the Hashtable class in .NET is that it’s just so easy to use. And "everybody" knows that Hashtables are as fast as you can get.
So it’s really annoying when you see people using Hashtables to store a dozen or so values, when a SortedList would be even faster due to its *much* lower overhead…
The problem is that people just don’t think about the problem enough…
Especially annoying is people using an unsorted list ‘because N will be small’ and then the app getting used in situations where N doesn’t stay small enough.
Especially since they have a HashSet class which exposes the exact same methods used with the same contracts and there aren’t enough instances to worry about the overhead.
In regards to using Hashtable vs SortedList in .NET, most people have no way to know which to use. It isn’t like the documentation is really clear on when to use each, nor does anyone really have the time to performance-test non-critical code.
Really all they need to do is add a couple of "when should I use this" lines to each class / method, and the rate of correct usage will go way up.
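The same small-N point can be made outside .NET. As a rough C++ analogue (not the Hashtable or SortedList classes being discussed), a sorted vector of key/value pairs often beats a hash table for a dozen or so entries simply because it has almost no per-element overhead:

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// A tiny "flat map": keys kept in a sorted vector, looked up by binary search.
// No hashing, no per-node allocations, and the whole container usually fits in
// a cache line or two, which is what matters when N is a dozen or so.
template <typename K, typename V>
class FlatMap {
public:
    void insert(K key, V value) {
        auto it = std::lower_bound(
            data_.begin(), data_.end(), key,
            [](const std::pair<K, V>& p, const K& k) { return p.first < k; });
        if (it != data_.end() && it->first == key)
            it->second = std::move(value);  // overwrite existing key
        else
            data_.insert(it, {std::move(key), std::move(value)});
    }

    const V* find(const K& key) const {
        auto it = std::lower_bound(
            data_.begin(), data_.end(), key,
            [](const std::pair<K, V>& p, const K& k) { return p.first < k; });
        return (it != data_.end() && it->first == key) ? &it->second : nullptr;
    }

private:
    std::vector<std::pair<K, V>> data_;  // kept sorted by key
};
```

None of this contradicts the documentation point above: the reason to prefer one container over another for small N is exactly the kind of guidance that rarely makes it into the class documentation.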
The best example of overkill is the way the Windows Update Agent API works. For any Windows computer to scan for updates, it has to download a CAB file which at the moment contains around fifty thousand(!) XML files totaling 50 MB, then decompress them, parse them, etc.
Try that on a Pentium II or III with a dial-up connection. Instead, all the data could have been stored in a few tab-separated files and then compressed.
My second favorite overkill example is the way IE stores cache and history data. It was so complicated that IE was for years unable to maintain its own cache; something was always left behind, forgotten.
The information about the Windows Update Agent (WUA) API that "AC" mentioned above is very misleading.
The Windows Update Agent (WUA) API is designed to search the local computer’s cache of update metadata. When WUA searches for updates, it actually uses a highly optimized protocol that only picks up new and changed updates, prunes out entire branches of updates based on OS and other applicability rules, omits languages that are not installed on the local machine, and then caches the results in a client-side database. In the usual case, almost nothing is downloaded when WUA performs a search; virtually all of the time "searching for updates" is actually spent running the various applicability rules, e.g. verifying that all of your previously installed patches are still installed and valid.
In order to support API callers on computers where there is no trusted server (such as a computer with no Internet connection), a feature was added to let the API caller supply the giant CAB mentioned above. This CAB is effectively a replacement for the entire WU server, and it includes all updates in all languages because it cannot be known at CAB-generation time which platform(s) the API caller will care about. The data is formatted in XML because that is the same format used by the "real" client-server protocol. The individual XML files in question are never written to disk, so the fact that there are 50,000 of them (as opposed to one giant XML element that contained everything) is beside the point.
Funny that AC suggests that using a tab-delimited file for the CAB scenario would somehow have been less work for the WUA team. That is definitely not the case.
Simon Cooke incorrectly claims Boyer-Moore touches each byte exactly once. Perhaps he is thinking of KMP; Boyer-Moore is one of the ‘skip m’ algorithms but it has worst cases that examine more than n characters total, even for no matches.
Also worth considering about the performance of an m-skipping algorithm: if m is smaller than the cache-line size, you’ll still be touching every cache line of the (large) string, so any performance gain may be minuscule if the memory traffic dominates.
I invented a skip-m style algorithm for regular-expression searching 15 years ago, but I never did anything with it beyond publishing it on my website, because as Raymond says, it’s just not that interesting except for very long search strings.
Given that "the world is moving towards webapps", it helps to implement the non-naive approach nowadays. Assuming you’re going to be making some internet-accessible application, you had better be ready for some fairly pathological arguments to be coming your way.
Ken: "Funny that AC suggests that using a tab-delimited file for the CAB scenario would somehow have been less work for the WUA team. That is definitely not the case."
It’s not about the WUA team’s work, it’s about how much each computer has to do to find out which updates it needs. Your arguments are all in the direction of "we made it work nicely with the server, and if it doesn’t have the server, then it was easiest for the team to pack everything as 50,000 XML files". I can imagine that this was the easiest thing for the team to do, but, sorry, it is still an example of "overkill" design. Instead of nicely preparing the data on the source side, once for all millions of users, it remains in the form that was "easiest".
I really consider "database to XML files" as one of the most inefficient forms for the export of database tables, and "tables to tab separated files" as one of the most efficient. Why would anybody need the names of the fields more than once per table, during the processing of the table?
What library writers generally don’t do is document the complexity of the algorithms they use. The only such specification I’ve seen so far is the C++ standard. Usually you have to dig through the sources.
The implementation details aren’t documented because the programmers are ashamed of having implemented an unoptimized algorithm in a library used by millions.
The algorithm *is* optimized. It’s optimized for non-self-similar strings.