Jon Turney wrote:
...
Using std::set_intersection() on values from std::map() here is
probably
a mistake. It's simple to write, but the performance is not good.
A faster alternative which avoids set_intersection calls in a loop is
possibly to use one large data structure which maps filenames to sets
of packages. Using multimap<string, string> instead of the
straightforward map<string, set<string>> needs possibly less memory
(not tested). But for multimap it is required that file/package name
pairs are not inserted twice.
I attached a small standalone POC source file using multimap. It
would also detect collisions in the already installed packages.
Thanks for the ideas. It seems I really didn't think that carefully
about this...
It seems like maybe building a map from filename to the set of package
names which contain it, and then finding all the filenames where that
set has more than one member would be a possible better implementation.
Of course this is the more obvious method and easier to implement. It is
somewhat slower and needs more memory.
Meantime I did a quick artificial test with 10000 "packages" with 100
"files" each and 2 collisions. This resulted in a (multi)map size if
1000000(+2), total size of strings was ~80MB. Results:
Data structure: execution time, memory (working set)
multimap<string, string>: 1.8 seconds, 238MB
map<string, set<string>>: 2.0 seconds, 318MB
The execution time is needed for the map insertions. The final scan for
collisions is very fast in both cases.