Jon Turney wrote:
...
Using std::set_intersection() on values from std::map() here is probably
a mistake. It's simple to write, but the performance is not good.

A faster alternative which avoids set_intersection calls in a loop is possibly to use one large data structure which maps filenames to sets of packages. Using multimap<string, string> instead of the straightforward map<string, set<string>> needs possibly less memory (not tested). But for multimap it is required that file/package name pairs are not inserted twice.

I attached a small standalone POC source file using multimap. It would also detect collisions in the already installed packages.

Thanks for the ideas. It seems I really didn't think that carefully about this...

It seems like maybe building a map from filename to the set of package names which contain it, and then finding all the filenames where that set has more than one member would be a possible better implementation.

Of course this is the more obvious method and easier to implement. It is somewhat slower and needs more memory.

Meantime I did a quick artificial test with 10000 "packages" with 100 "files" each and 2 collisions. This resulted in a (multi)map size if 1000000(+2), total size of strings was ~80MB. Results:

Data structure: execution time, memory (working set)
multimap<string, string>: 1.8 seconds, 238MB
map<string, set<string>>: 2.0 seconds, 318MB

The execution time is needed for the map insertions. The final scan for collisions is very fast in both cases.

Reply via email to