smarks 3 hours ago

(from Feb 2021)

Some of the JDK's unmodifiable collections, such as those from `Set.of()` and `Map.of()`, also randomize iteration order. At the time they were added, Go and Python randomized iteration order as well. More recently, however, Python moved away from randomization: early in the Python 3.x releases, dict iteration order was randomized; in 3.6, iteration order became insertion order, but only as an implementation detail; in 3.7, this was elevated to a guarantee.

https://docs.python.org/3/library/stdtypes.html#dict

(search for "dictionary order")
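
Concretely, the 3.7 behavior (a quick sketch; `list(d)` just iterates the keys):

  d = {}
  d["banana"] = 1
  d["apple"] = 2
  d["cherry"] = 3
  print(list(d))   # always ['banana', 'apple', 'cherry'] on 3.7+;
                   # the same on 3.6 as an implementation detail,
                   # arbitrary before that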

There are occasional complaints about the randomized order of Java's unmodifiable collections, as it can cause intermittent test failures or even failures in production. However, the advantage of the randomized order is that we've been able to reorganize the internal implementations of these data structures -- twice -- without worrying about compatibility problems caused by changing the ordering. Historically this has been a problem with other structures such as `HashMap`. Even though `HashMap`'s order isn't guaranteed, it's predictable and stable, and changing its ordering has definitely flushed out bugs.
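
The same kind of order instability is easy to demonstrate with Python's sets, whose iteration order still depends on per-process salted hashes -- a sketch of how an order-dependent assertion becomes an intermittent failure:

  s = {"a", "b", "c"}
  assert sorted(s) == ["a", "b", "c"]   # order-independent: always passes
  assert list(s) == ["a", "b", "c"]     # order-dependent: passes or fails
                                        # depending on the run's hash salt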

  • o11c 43 minutes ago

    IMO the real problem is that sometimes you really do want a deterministic order (not caring which one), but the container doesn't offer any API that provides it.

    And since hash-first languages provide a broken API (user types implement only hashing, which gives you no usable order), there's no way to provide it for arbitrary types. Compare-first languages (like C++) generally provide it, but paying the price of tree-based structures (assuming a B-tree can't negate it) isn't actually necessary.

    Rather than either of those choices, I argue that languages really should be "key-first" - rather than having classes implement hashing or comparison themselves, they should return a tuple of the data that will be fed into said hash or comparison. Besides being a nicer API, this also prevents many common bugs.
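
    A hypothetical sketch of "key-first" in Python, where `sort_key` is a made-up hook name: the class exposes a single key tuple, and equality, ordering, and hashing are all derived from it, so they can never disagree:

      from functools import total_ordering

      @total_ordering
      class Version:
          def __init__(self, major, minor, patch):
              self.major, self.minor, self.patch = major, minor, patch

          # The one "key-first" hook; everything below derives from it.
          def sort_key(self):
              return (self.major, self.minor, self.patch)

          def __eq__(self, other):
              return self.sort_key() == other.sort_key()

          def __lt__(self, other):
              return self.sort_key() < other.sort_key()

          def __hash__(self):
              return hash(self.sort_key())

    This rules out the classic bug where `__hash__` reads fields that `__eq__` ignores (or vice versa) -- one of the common bugs a key-first design prevents.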

lsy 7 hours ago

What's the advantage of specifying it as random over specifying it as sequential in some way? Aren't both just specifying a behavior, where one behavior is potentially more useful than the other? I guess I understand the principled point that a set of hash keys is not an array. But it seems like more complexity than is necessary, and it's even possible to fall victim to Hyrum's law besides: you can imagine someone using random hash ordering to implicitly randomize something without needing to call an RNG explicitly.

It seems like the most principled approach might be a re-specification of the data structure: why is "iteration" even possible over a hash table? Shouldn't the keys be specified as a "set" on which only set-appropriate operations like map can be performed?

  • kstrauser 4 hours ago

    One advantage to randomization, and why various languages did it in the first place, is that it prevents miscreants from DoSing your service if they know its implementation.

    Say you're operating a service and you share its source on GitHub such that anyone can see it. Your language doesn't randomize hash values. Hash lookups are O(1), right? Well, not if a clever attacker can send you a set of values where they know each one will hash into the same bucket. Then you end up with like 9 empty buckets and 1 with 100,000 items in it. Oops!
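
    A toy version of the attack (a sketch with a deliberately predictable hash, not any real language's table):

      # Toy hash table: 10 buckets, bucket chosen by hash(key) % 10.
      buckets = [[] for _ in range(10)]

      def insert(key):
          buckets[hash(key) % 10].append(key)

      # CPython's hash of a small int is the int itself, so an attacker
      # who knows that can aim every key at bucket 0:
      for n in range(100_000):
          insert(n * 10)

      print([len(b) for b in buckets])   # bucket 0 holds all 100,000 items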

    Old Python, pre-randomization, had a dict.items() method that would yield all the items in bucket order, conceptually kind of like:

      def items(self):
          # each bucket is a plain list of entries; iteration just
          # walks the buckets in table order
          for bucket in self._buckets:
              for item in bucket:
                  yield item
    
    The order of those buckets would be repeatable between runs, so bucket foo would always come first, then bucket bar. Then Python added hash randomization, so that the resulting hash value was something like hash(salt+key) instead of just hash(key). Now there was no way to tell in advance which bucket an item would get filed into, and the buckets would end up more or less balanced in size.
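
    Conceptually, the salted version looks something like this (a sketch; CPython's real salting lives inside the string hash itself, keyed by PYTHONHASHSEED, rather than being prepended like this):

      import os

      SALT = os.urandom(16)   # chosen once per process

      def bucket_index(key, nbuckets):
          # An attacker who can't learn SALT can no longer predict
          # which bucket a given key lands in.
          return hash((SALT, key)) % nbuckets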

    Newer Pythons (since 3.6? 3.7?) do something altogether different, and I can't explain exactly how their ordered dicts work, except to say I sat through a presentation on them and thought it was freaking genius, even though I couldn't re-implement it myself without sitting down with their docs.
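
    (As I understand it, the core trick is splitting the table in two: a sparse array of indices for lookup, plus a dense array of entries appended in insertion order. Iteration walks only the dense array, so insertion order falls out for free. A rough sketch:)

      indices = [None] * 8        # sparse: hash slot -> position in entries
      entries = []                # dense: (hash, key, value), insertion order

      def put(key, value):
          slot = hash(key) % len(indices)   # real code probes on collision
          indices[slot] = len(entries)
          entries.append((hash(key), key, value))

      def iterate():
          # never touches the sparse table, so order is insertion order
          for _, key, value in entries:
              yield key, value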

  • prattmic an hour ago

    In addition to the DoS aspect mentioned in a sibling comment, the primary reason you would do this is to avoid constraining the implementation. If you want to change the design to improve performance, for example, being forced to match the implicit ordering of the old implementation may be very difficult.

    It certainly may be useful to define a specific ordering. Maps ordered by insertion time and maps ordered by key order are fairly common. But maps with these constraints probably won't be as fast as those with arbitrary ordering, so there is a trade-off decision to be made. A map constrained to the arbitrary ordering of the initial implementation is the worst of both worlds: difficult to make faster despite not having a particularly useful ordering.
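
    For instance, a key-ordered map can be approximated in Python with the stdlib's bisect, but every insert then pays to keep the keys sorted (a sketch, not a production container):

      import bisect

      class KeyOrderedMap:
          def __init__(self):
              self._keys = []      # kept sorted at all times
              self._data = {}

          def __setitem__(self, key, value):
              if key not in self._data:
                  bisect.insort(self._keys, key)   # O(n) shifting per insert
              self._data[key] = value

          def __iter__(self):
              return iter(self._keys)              # always key order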

    As a concrete example, I am currently in the middle of rewriting Go's map implementation to use a more efficient design (https://go.dev/issue/54766). If Go hadn't randomized map iteration order, this work would likely be impossible. It is unlikely we could completely change the design while keeping iteration order identical to the old implementation (assuming we want the new version to be faster, at least).

  • fwip 7 hours ago

    I think removing methods from `HashMap` was not in scope for their work - they were working on the JDK, and presumably didn't have the bandwidth to patch every single Java library used at Google to use set-based methods instead of iteration-based ones.

    Also, I think in practice you'll find you want to extract items from a set in some order, so you need some kind of transformation from set->list, e.g. you want to print them out.
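
    In Python terms that's a one-liner -- sort on the way out, and the hash order stops mattering:

      items = {"b", "c", "a"}
      for item in sorted(items):   # deterministic, whatever the hashing does
          print(item)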

    Edit: Forgot to address your main question. If the specification requires a specific ordering, the implementation is forever bound to do that, even if other implementations would be more efficient/secure/other-desirable-properties. By introducing randomness, you reduce the risk of people accidentally relying on unspecified behavior, and are more able to change your implementation later.

    • zanecodes 3 hours ago

      One solution could be adding a `sort` argument (a function that compares two `<T>` items and returns `true` or `false` depending on their order) to every function on an unordered collection of `<T>` items in which an order must be chosen, or requiring that the items in such collections implement a `TotalOrder` interface or something similar. In languages that permit side effects, that covers any function that iterates over the items of an unordered collection. This isn't very ergonomic in languages that don't have an equivalent of Traits or typeclasses, though.
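
      A rough Python sketch of both options (`iter_in_order` and the `before` comparator are made-up names; Python's `sorted` takes a key function, so `functools.cmp_to_key` bridges the boolean-comparator style described above):

        from functools import cmp_to_key
        from typing import Protocol

        # The "TotalOrder interface", spelled as a structural protocol.
        class TotalOrder(Protocol):
            def __lt__(self, other) -> bool: ...

        def iter_in_order(items, before=None):
            if before is not None:
                # Option 1: caller supplies "does a sort before b?".
                cmp = lambda a, b: -1 if before(a, b) else (1 if before(b, a) else 0)
                return iter(sorted(items, key=cmp_to_key(cmp)))
            # Option 2: items implement TotalOrder (i.e. __lt__).
            return iter(sorted(items))

        print(list(iter_in_order({"b", "a", "c"})))                       # ['a', 'b', 'c']
        print(list(iter_in_order({1, 3, 2}, before=lambda a, b: a > b)))  # [3, 2, 1]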

mise_en_place 7 hours ago

Alternatively, maintain backward compatibility as much as possible. Function overloading might be too clever. Extra verbosity in exchange for clarity might be a worthwhile tradeoff. WET is acceptable in these scenarios.