
Parallelize the thing

The breadth-first algorithm can be parallelized at the second loop:

    while caches[n].is_empty() {
        // ...
        for lev in 2..=n {
            // can be parallelized here
            // ...
        }
    }

Measurements show that (at least for a square network) the cache is written to only about once every 1000 iterations. Simple, naive locking of the output hash table may therefore already have acceptable performance.
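
Here is a minimal sketch of that naive-locking variant, assuming rayon for the parallel loop; `Key`, `Value`, and `compute_entries` are placeholders standing in for the real cache types and per-level work:

    use std::collections::HashMap;
    use std::sync::Mutex;
    use rayon::prelude::*;

    type Key = u64;
    type Value = u64;

    // Placeholder for the real per-level work; returns entries to cache.
    fn compute_entries(lev: usize) -> Vec<(Key, Value)> {
        vec![(lev as Key, lev as Value)]
    }

    fn fill_cache(n: usize) -> HashMap<Key, Value> {
        let cache = Mutex::new(HashMap::new());
        (2..=n).into_par_iter().for_each(|lev| {
            let entries = compute_entries(lev);
            // Writes are rare (roughly once per 1000 iterations), so
            // contention on this single lock should stay low.
            if !entries.is_empty() {
                cache.lock().unwrap().extend(entries);
            }
        });
        cache.into_inner().unwrap()
    }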

This could be further improved by using a concurrent hash map, for example dashmap (https://lib.rs/crates/dashmap).
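
For comparison, a sketch of the same loop with dashmap, reusing the placeholder `compute_entries` from above. DashMap shards its internal locks, so concurrent inserts from different threads rarely contend:

    use dashmap::DashMap;
    use rayon::prelude::*;

    fn fill_cache_dashmap(n: usize) -> DashMap<u64, u64> {
        let cache: DashMap<u64, u64> = DashMap::new();
        (2..=n).into_par_iter().for_each(|lev| {
            // No explicit lock: DashMap synchronizes internally, per shard.
            for (k, v) in compute_entries(lev) {
                cache.insert(k, v);
            }
        });
        cache
    }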

Alternatively, use N independent output caches (one per thread), and then merge them into the common cache. If this is done sequentially, it will likely kill most of the parallelization gain. But the merging/reduction can probably be done in parallel in the form of a binary tree: e.g. starting with 32 result caches, reduce them to 16, then 8, and so on, until a single cache remains, which is then merged into the main cache.
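
One way to get such a tree-shaped merge with little code is rayon's `reduce`, which combines partial results pairwise as its work-stealing tasks join. A sketch, again with placeholder per-level work:

    use std::collections::HashMap;
    use rayon::prelude::*;

    fn build_and_merge(n: usize, main_cache: &mut HashMap<u64, u64>) {
        let merged = (2..=n)
            .into_par_iter()
            .map(|lev| {
                // Each task fills its own local map; no locking at all.
                let mut local = HashMap::new();
                local.insert(lev as u64, lev as u64); // placeholder work
                local
            })
            .reduce(HashMap::new, |mut a, b| {
                // rayon merges partial maps pairwise, so e.g. 32 partial
                // caches collapse as 32 -> 16 -> 8 -> ... -> 1.
                a.extend(b);
                a
            });
        main_cache.extend(merged);
    }

Whether this beats the locking variants depends on how expensive the merges are relative to the per-level work, so it is worth measuring all three approaches.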