zacharymorn commented on pull request #128: URL: https://github.com/apache/lucene/pull/128#issuecomment-850170387
> the lines come as each check finishes, so you can see what is fast/slow. It seems postings is slowest, preceded by doc values, and everything else is super fast.

> Can't wait to see the next CheckIndex time in nightly benchmarks after we push this :)

I was finally able to rebuild my local index with `wikibigall`, generating 15 segments, and performed 2 test runs with different `threadCount` values. With a `threadCount` of 11 it took 359.293 sec in total to finish, and with a `threadCount` of 1 it took 378.583 sec in total, so about a 5% time saving. I feel a faster machine would see a better saving, but in general the speedup seems to be limited by the skewed distribution of checking time across the different segment parts (e.g. the postings check can account for around 85% ~ 90% of the total time spent in the first segment check).

> Good news first! I wrote a simple little Python tool to randomly flip a random bit in a random file in a provided directory.

> It looks like corruption is indeed still detected with this PR, wonderful!

The Python tool looks very cool, and thanks for testing with it! One issue, though, is that this bit flipping causes checksum integrity check failures *before* the concurrent segment part checks kick in, so it may not actually exercise the changes here? I think we may need to write a semantically buggy segment file that still passes checksum verification, to see that the error still gets detected and propagated correctly?

Given the two points above, I feel I should maybe also look into parallelizing across segments while keeping things single-threaded / simple within each segment (many of the learnings here can be applied there anyway)? The other approach to get a better speedup could be to split up the postings check so that it can be handled by multiple threads as well, but I'm not sure yet whether it is easily parallelizable and need to look into it further.

I'm also good with merging this in and seeing how it performs.
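As a rough sanity check on the timings above, here is a minimal Python sketch of the arithmetic. It is only an illustration under stated assumptions: the function names are hypothetical, and it assumes the 85% ~ 90% postings share observed for the first segment holds for every segment, and that segments are still checked one after another with only the parts inside a segment running concurrently.

```python
def observed_saving(serial_sec: float, parallel_sec: float) -> float:
    """Fraction of wall-clock time saved by the concurrent run."""
    return (serial_sec - parallel_sec) / serial_sec


def best_case_saving(largest_part_fraction: float) -> float:
    """Upper bound on the saving when the parts of a segment run concurrently.

    Assumption: the slowest part (here the postings check) is not split up,
    so the wall time of a segment check cannot drop below that part's time.
    """
    return 1.0 - largest_part_fraction


# Timings reported above: threadCount = 1 vs threadCount = 11.
serial_sec, parallel_sec = 378.583, 359.293
print(f"observed saving: {observed_saving(serial_sec, parallel_sec):.1%}")

# Postings share of the segment check time, taken from the first segment only.
for postings_share in (0.85, 0.90):
    bound = best_case_saving(postings_share)
    print(f"postings share {postings_share:.0%} -> best-case saving {bound:.1%}")
```

Under these assumptions the ceiling is only around 10% ~ 15%, so the measured saving of about 5% is not far from what this approach can deliver. Going beyond that would need either concurrency across segments or a postings check that is itself split across threads.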