Do hash tables work in constant time? (daniel-lemire.com)
20 points by fogus on Aug 18, 2009 | hide | past | favorite | 23 comments


This is fantastic: he starts with the claim that pure theory is wasteful and that experiments must be run to validate models.

He asserts hashtables are not amortized O(1). OK, interesting, maybe something with caching? Latency of disk access in huge data stores? No. He concludes hashtables are not O(1) because integer multiplication is not O(1). No mention of putting an upper bound on the hash key size, like the size of addressable memory, or the number of electrons in the universe; just that it's not O(1).

Lucky for us, he's backed it up with a clever and insightful lack of experiments.

Edit: spelling and grammar.


I'm giving an example (vectorization) where experiments would be needed:

"Am I being pedantic? Does the time required to multiply integers on modern machine depend on the size of the integers? It certainly does if you are using vectorization. And vectorization is used in commercial databases!"

Sorry, I did not run the experiments this time around.


You have to admit it's a little silly to start with "This is one of the fundamental reason why pure theory is wasteful." and then not back it up with numbers.

I'm a head-in-the-clouds kind of guy; I'd have been fine if you'd said: here's all this extra stuff you might be overlooking. But when you frame the discussion with concrete measurements... I go looking for concrete measurements.

That said, I broke the rule about not saying something in writing that I wouldn't say face to face. (Well, I would, but only after I knew you better.)

I apologize for being snarky. That was uncalled for.


Implementing vectorized hash tables is no joke, so I don't think it is silly to have merely evoked the possibility rather than implementing it. It would have taken me days of work, possibly, and even then, people could have questioned my code. I do have a day job, doing academic research and teaching Computer Science. Maybe one day I will research vectorized hash tables, but the result will not be a mere blog post.


fwiw:

    $ gcc-4.2 foo.c; time ./a.out
    foo.c: In function ‘main’:
    foo.c:16: warning: incompatible implicit declaration of built-in function ‘printf’
    0
    real    0m0.095s
    user    0m0.093s
    sys     0m0.002s

    $ gcc-4.2 -ftree-vectorize -ffast-math -ftree-vectorizer-verbose=3 -funsafe-math-optimizations -O3 foo.c; time ./a.out
    foo.c: In function ‘main’:
    foo.c:16: warning: incompatible implicit declaration of built-in function ‘printf’
    foo.c:6: note: not vectorized: unsupported use in stmt.
    foo.c:11: note: LOOP VECTORIZED.
    foo.c:2: note: vectorized 1 loops in function.
    0
    real    0m0.018s
    user    0m0.017s
    sys     0m0.001s

    int main () {
        int a[256], b[256], c[256];
        int count = 100000;
        int i;
        for (i = 0; i < 256; i++) {
            a[i] = i;
            b[i] = 255 - i;
        }
        do {
            for (i = 0; i < 256; i++) {
                c[i] = a[i] * b[i];
            }
        } while (count--);
        printf("%i", c[0]); /* avoid c being optimized away */
    }

So, I'd assert that vectorized multiplication is a speed win rather than a loss. Upping the count a few orders of magnitude indicates that the cost of vectorized multiplication grows more slowly than that of regular multiplication.

So, I claim the vectorization of multiplication is probably not detrimental to the Big O of a hashtable.

ymmv.


Once we constrain the size of our integers - which most hashtable implementations do implicitly by only using numbers that are natively supported by hardware - integer multiplication is once again a constant.

From the other direction, if a hashtable implementation uses arbitrarily sized numbers to compute hashes, it probably has bad design.


Please consider vectorization. Natively, modern processors can multiply several 16-bit integers in the time it takes to multiply one pair of 64-bit integers.


I am considering vectorization.

There is always a maximum number of native values that can be operated on at one time by the processor. The maximum cycle costs are also known. Between the two of those, we have a maximum time spent on the computation, and it is a constant.

Considering the size of numbers during algorithm analysis matters when the numbers are huge - that is, when the numbers are larger than the maximum values that can be represented natively by the processor. When that happens, "numbers" become a data structure unto themselves and must be analyzed as such.


I don't get it, somebody please correct me if I'm wrong.

Hashtables derive their efficiency from the number of keys that you can encode for a given size hash. If the collision chance is low the number of 'chained' items in each hash bucket approaches one.

To take the length of the key and the number of bits in it as a factor in the complexity of the hash table shifts the problem to a sub-problem that has nothing to do with the original one. As far as the hash table algorithm is concerned that is a different algorithm, whose fixed cost per item is taken into account.

Sure, it is slower to compute a longer hash. But that is not where the majority of the savings are. The majority of the savings are in not having to traverse the N items in the keyspace to find a given key.

Constant factors - and this is one of them - are simply left out because in the greater scheme of things they are but a very small item. Any constant factor, be it 5, 10 or 50 as viewed from the far side of the code calling the hash store is simply that, a constant.

You might be able to optimize on it by using a clever algorithm, you might be able to work it down all the way to '1'.

But that will not change the complexity at all.


This seems like a clumsy way to make a simpler and more generic point: any key/value data structure (at least one that relies on inspecting the key in its entirety) will scale as O(N log N) at very large sizes. The reason is that a key needed to identify a unique entry in a table of size M must be at least log2(M) bits long. So the primitive operations on the keys are bounded by log2(M) and are not constant.

This is the same reason that "radix sort" (http://en.wikipedia.org/wiki/Radix_sort) isn't a linear operation even though it doesn't seem to have a log(N) dependence on the input set. The number of radix buckets scales as log(N).

That said, of course, this doesn't mean much in practice, where primitive operations on keys live in L1 cache and the pointer indirections require DRAM cycles.


My post was specifically about vectorization. In which case, I do claim that it matters.


Okay, he is saying that the time required for an arithmetic operation is proportional to the number of bits operated on.

This is true, and it is applicable to other concepts besides hash tables. Can you sort n numbers in O(n log n)? Well a comparison is not really constant-time, in the wall-clock sense of time. If the n numbers are distinct, then each is going to need at least O(log n) bits. So a comparison is O(log n) by the clock, and our sort becomes O(n log n log n).

Part of the solution is to be more precise. For example, when we say that Merge Sort is O(n log n), we usually do not mean that it takes that long to run; we mean that it can require that many comparisons. Counting primitive operations can be useful, even if those operations are not constant-time from a wall-clock point of view. And as long as we are clear on what the primitive operations are, everything is fine.

By the way, there is another problem with the statement that hash-table insertion is amortized constant-time: what if all the keys end up being in the same bucket? The true statement is that HT insertion is amortized constant-time for average data. (So it's constant-time on average on average. :-)


You should not be using n for the number of bits and n for the size of the input.

It may take longer to multiply 128 bit data structures than 64 bit data structures, but it is constant with respect to the size of the hash table.


My post specifically refers to vectorization where you may use the fact that you can multiply 4 pairs of 16-bit integers in the time it takes to multiply a pair of 64-bit integers. So, you could operate four 16-bit hash tables in the time it takes to operate a 64-bit hash table.

That's all somewhat theoretical sure, but the point is to challenge your assumptions. (People do use vectorization, right now.)


"That's all somewhat theoretical sure, but the point is to challenge your assumptions. (People do use vectorization, right now.)"

You're assuming that people are not aware that multiplication is not always a constant-time operation. That assumption generally doesn't hold among people with any sort of academic CS background. Multiplication algorithms - and the fact that they were typically O(N) in number of bits - were covered in my intro machine architectures course.

In practice, there are many, many other things that can negatively impact hash table performance in a much bigger way. Like cache misses - one cache miss costs far more than an integer multiplication. Or interpreter overhead from using, say, Ruby instead of C (bad example, since Ruby's hashtables are implemented in C, but the general point holds). Or the difference between -O0 and -O3. Or how early versions of Java Hashtables would often devolve into linked lists because the hash function only looked at the first 8 characters of a string.


I am not sure optimization, cache misses, and interpreter overhead would obviously impact the computational complexity.

"You're assuming that people are not aware that multiplication is not always a constant-time operation."

I made no such assumption. But I suggest people consider the case of vectorization where, all of a sudden, it may matter quite a bit.


You're making a point about where we take cognitive shortcuts. Short version: don't assume the classroom example assumptions always hold in the real world.


> but it is constant with respect to the size of the hash table.

Exactly. That is the whole key to complexity analysis.

Otherwise you end up drowning in details without getting more meaningful answers.

You're analyzing the hash table algorithm, not the hashing algorithm used to produce the keys.


> You should not be using n for the number of bits and n for the size of the input.

I don't see that I ever did use it for a number of bits. I said, to have n numbers with distinct values, you need O(log n) bits in each number. n is not a number of bits.


This is disappointing. I was really hoping for something more exciting about the number of buckets vs. the number of elements or something like that.

In any event, he was correct when he said: "The problems that result from picking the wrong model of computation are well known and addressed by most textbooks. I have not discovered anything new." There are problems that you can "solve in polynomial time" by encoding the problem as some HUGE number (where the number of bits is exponential in the size of the problem) and doing arithmetic. This is where you really see the breakdown of the constant time arithmetic model.


Excerpt from article: Multiplications of numbers in [0,m) can almost be done in time O(log m). For the purposes of hash tables, the number m must be close to the number of different keys n. Thus—formally—hash tables often run in time O(log n).

Yes, this guy is confused about Big-O notation — nothing to see here, please move along :)


It seems to me he is using the same variable to represent two different quantities:

m - the size of the hash table
m - the size of each element in the hash table

The 2nd m should be a different variable, and it is constant with respect to the first m, which is the one we care about.


The definition of a hash table is independent of the hash function. You should separately consider the complexity of your hash function.



