smarks 3 hours ago

(from Feb 2021)

Some of the JDK's unmodifiable collections, such as those from `Set.of()` and `Map.of()`, also randomize iteration order. At the time they were added, Go and Python randomized iteration order as well. More recently, however, Python moved away from randomization: early in the Python 3.x releases, dict iteration order was randomized; in 3.6, iteration order became insertion order, but only as an implementation detail; in 3.7, this was elevated to a guarantee.

https://docs.python.org/3/library/stdtypes.html#dict

(search for "dictionary order")
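
Concretely, the 3.7 behavior (a quick sketch; `list(d)` just iterates the keys):

  d = {}
  d["banana"] = 1
  d["apple"] = 2
  d["cherry"] = 3
  print(list(d))   # always ['banana', 'apple', 'cherry'] on 3.7+;
                   # the same on 3.6 as an implementation detail,
                   # arbitrary before that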

There are occasional complaints about the randomized order of Java's unmodifiable collections, as it can cause intermittent test failures or even failures in production. However, the advantage of the randomized order is that we've been able to reorganize the internal implementations of these data structures -- twice -- without worrying about compatibility problems caused by changing the ordering. Historically this has been a problem with other structures such as `HashMap`. Even though `HashMap`'s order isn't guaranteed, it's predictable and stable, and changing its ordering has definitely flushed out bugs.
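
The same kind of order instability is easy to demonstrate with Python's sets, whose iteration order still depends on per-process salted hashes -- a sketch of how an order-dependent assertion becomes an intermittent failure:

  s = {"a", "b", "c"}
  assert sorted(s) == ["a", "b", "c"]   # order-independent: always passes
  assert list(s) == ["a", "b", "c"]     # order-dependent: passes or fails
                                        # depending on the run's hash salt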

  • o11c 43 minutes ago

    IMO the real problem is that sometimes you really do want a deterministic order (not caring which one), but the container doesn't offer any API that provides it.

    And since hash-first languages provide a broken API (user types implement only hashing, which gives you no usable order), there's no way to provide it for arbitrary types. Compare-first languages (like C++) generally provide it, but paying the price of tree-based structures (assuming a B-tree can't negate it) isn't actually necessary.

    Rather than either of those choices, I argue that languages really should be "key-first" - rather than having classes implement hashing or comparison themselves, they should return a tuple of the data that will be fed into said hash or comparison. Besides being a nicer API, this also prevents many common bugs.
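
    A hypothetical sketch of "key-first" in Python, where `sort_key` is a made-up hook name: the class exposes a single key tuple, and equality, ordering, and hashing are all derived from it, so they can never disagree:

      from functools import total_ordering

      @total_ordering
      class Version:
          def __init__(self, major, minor, patch):
              self.major, self.minor, self.patch = major, minor, patch

          # The one "key-first" hook; everything below derives from it.
          def sort_key(self):
              return (self.major, self.minor, self.patch)

          def __eq__(self, other):
              return self.sort_key() == other.sort_key()

          def __lt__(self, other):
              return self.sort_key() < other.sort_key()

          def __hash__(self):
              return hash(self.sort_key())

    This rules out the classic bug where `__hash__` reads fields that `__eq__` ignores (or vice versa) -- one of the common bugs a key-first design prevents.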

lsy 7 hours ago

What's the advantage of specifying it as random over specifying it as sequential in some way? Aren't both just specifying a behavior, where one behavior is potentially more useful than the other? I guess I understand the principled point that a set of hash keys is not an array. But it seems like more complexity than is necessary, and it's even possible to fall victim to Hyrum's law besides: you can imagine someone using random hash ordering to implicitly randomize something without needing to call an RNG explicitly.

It seems like the most principled approach might be a re-specification of the data structure: why is "iteration" even possible over a hash table? Shouldn't the keys be specified as a "set" on which only set-appropriate operations like map can be performed?

  • kstrauser 4 hours ago

    One advantage to randomization, and why various languages did it in the first place, is that it prevents miscreants from DoSing your service if they know its implementation.

    Say you're operating a service and you share its source on GitHub such that anyone can see it. Your language doesn't randomize hash values. Hash lookups are O(1), right? Well, not if a clever attacker can send you a set of values where they know each one will hash into the same bucket. Then you end up with like 9 empty buckets and 1 with 100,000 items in it. Oops!
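
    A toy version of the attack (a sketch with a deliberately predictable hash, not any real language's table):

      # Toy hash table: 10 buckets, bucket chosen by hash(key) % 10.
      buckets = [[] for _ in range(10)]

      def insert(key):
          buckets[hash(key) % 10].append(key)

      # CPython's hash of a small int is the int itself, so an attacker
      # who knows that can aim every key at bucket 0:
      for n in range(100_000):
          insert(n * 10)

      print([len(b) for b in buckets])   # bucket 0 holds all 100,000 items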

    Old Python, pre-randomization, had a dict.items() method that would yield all the items in bucket order, conceptually kind of like:

      def items(self):
          # each bucket is a plain list of entries; iteration just
          # walks the buckets in table order
          for bucket in self._buckets:
              for item in bucket:
                  yield item
    
    The order of those buckets would be repeatable between runs, so bucket foo would always come first, then bucket bar. Then Python added hash randomization, so that the resulting hash value was something like hash(salt+key) instead of just hash(key). Now there was no way to tell in advance which bucket an item would get filed into, and the buckets would end up more or less balanced in size.
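
    Conceptually, the salted version looks something like this (a sketch; CPython's real salting lives inside the string hash itself, keyed by PYTHONHASHSEED, rather than being prepended like this):

      import os

      SALT = os.urandom(16)   # chosen once per process

      def bucket_index(key, nbuckets):
          # An attacker who can't learn SALT can no longer predict
          # which bucket a given key lands in.
          return hash((SALT, key)) % nbuckets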

    Newer Pythons (since 3.6? 3.7?) do something altogether different, and I can't explain exactly how their ordered dicts work, except to say I sat through a presentation on them and thought it was freaking genius, even though I couldn't re-implement it myself without sitting down with their docs.
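
    (As I understand it, the core trick is splitting the table in two: a sparse array of indices for lookup, plus a dense array of entries appended in insertion order. Iteration walks only the dense array, so insertion order falls out for free. A rough sketch:)

      indices = [None] * 8        # sparse: hash slot -> position in entries
      entries = []                # dense: (hash, key, value), insertion order

      def put(key, value):
          slot = hash(key) % len(indices)   # real code probes on collision
          indices[slot] = len(entries)
          entries.append((hash(key), key, value))

      def iterate():
          # never touches the sparse table, so order is insertion order
          for _, key, value in entries:
              yield key, value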

  • prattmic an hour ago

    In addition to the DoS aspect mentioned in a sibling comment, the primary reason you would do this is to avoid constraining the implementation. If you want to change the design to improve performance, for example, being forced to match the implicit ordering of the old implementation may be very difficult.

    It certainly may be useful to define a specific ordering. Maps ordered by insertion time and maps ordered by key order are fairly common. But maps with these constraints probably won't be as fast as those with arbitrary ordering, so there is a trade-off decision to be made. A map constrained to the arbitrary ordering of the initial implementation is the worst of both worlds: difficult to make faster despite not having a particularly useful ordering.
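
    For instance, a key-ordered map can be approximated in Python with the stdlib's bisect, but every insert then pays to keep the keys sorted (a sketch, not a production container):

      import bisect

      class KeyOrderedMap:
          def __init__(self):
              self._keys = []      # kept sorted at all times
              self._data = {}

          def __setitem__(self, key, value):
              if key not in self._data:
                  bisect.insort(self._keys, key)   # O(n) shifting per insert
              self._data[key] = value

          def __iter__(self):
              return iter(self._keys)              # always key order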

    As a concrete example, I am currently in the middle of rewriting Go's map implementation to use a more efficient design (https://go.dev/issue/54766). If Go hadn't randomized map iteration order, this work would likely be impossible. It is unlikely we could completely change the design while keeping iteration order identical to the old implementation (assuming we want the new version to be faster, at least).

  • fwip 7 hours ago

    I think removing methods from `HashMap` was not in scope for their work - they were working on the JDK, and presumably didn't have the bandwidth to patch every single Java library used at Google to use set-based methods instead of iteration-based ones.

    Also, I think in practice you'll find you want to extract items from a set in some order, so you need some kind of transformation from set->list, e.g. you want to print them out.
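
    In Python terms that's a one-liner -- sort on the way out, and the hash order stops mattering:

      items = {"b", "c", "a"}
      for item in sorted(items):   # deterministic, whatever the hashing does
          print(item)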

    Edit: Forgot to address your main question. If the specification requires a specific ordering, the implementation is forever bound to do that, even if other implementations would be more efficient/secure/other-desirable-properties. By introducing randomness, you reduce the risk of people accidentally relying on unspecified behavior, and are more able to change your implementation later.

    • zanecodes 3 hours ago

      One solution could be adding a `sort` argument (a function that compares two `<T>` items and returns `true` or `false` depending on their order) to every function on an unordered collection of `<T>` items in which an order must be chosen, or requiring that the items in such collections implement a `TotalOrder` interface or something similar. In languages that permit side effects, that covers any function that iterates over the items of an unordered collection. This isn't very ergonomic in languages that don't have an equivalent of Traits or typeclasses, though.
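
      A rough Python sketch of both options (`iter_in_order` and the `before` comparator are made-up names; Python's `sorted` takes a key function, so `functools.cmp_to_key` bridges the boolean-comparator style described above):

        from functools import cmp_to_key
        from typing import Protocol

        # The "TotalOrder interface", spelled as a structural protocol.
        class TotalOrder(Protocol):
            def __lt__(self, other) -> bool: ...

        def iter_in_order(items, before=None):
            if before is not None:
                # Option 1: caller supplies "does a sort before b?".
                cmp = lambda a, b: -1 if before(a, b) else (1 if before(b, a) else 0)
                return iter(sorted(items, key=cmp_to_key(cmp)))
            # Option 2: items implement TotalOrder (i.e. __lt__).
            return iter(sorted(items))

        print(list(iter_in_order({"b", "a", "c"})))                       # ['a', 'b', 'c']
        print(list(iter_in_order({1, 3, 2}, before=lambda a, b: a > b)))  # [3, 2, 1]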

mise_en_place 7 hours ago

Alternatively, maintain backward compatibility as much as possible. Function overloading might be too clever. Extra verbosity in exchange for clarity might be a worthwhile tradeoff. WET is acceptable in these scenarios.