zacharymorn commented on pull request #128:
URL: https://github.com/apache/lucene/pull/128#issuecomment-850170387


   >  the lines come as each check finishes, so you can see what is fast/slow. 
It seems postings is slowest, preceded by doc values, and everything else is 
super fast.
   > Can't wait to see the next CheckIndex time in nightly benchmarks after we 
push this :)
   
   I was finally able to rebuild my local index from `wikibigall`, which 
generated 15 segments, and did two test runs with different threadCount 
values. With threadCount 11 the check took 359.293 sec in total, and with 
threadCount 1 it took 378.583 sec, so about a 5% time saving. I suspect a 
faster machine would see a larger saving, but in general the speedup seems 
limited by the skewed distribution of checking time across the parts of a 
segment (e.g. the postings check can account for around 85% to 90% of the 
total time spent checking the first segment). 
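
   To make the arithmetic concrete, here is a rough back-of-the-envelope 
sketch. It assumes the parts of a segment overlap perfectly and that the 85% 
to 90% postings share (measured only on the first segment) holds everywhere; 
the class and method names are made up for illustration and are not part of 
CheckIndex.

```java
final class SpeedupEstimate {
    /** Percentage of wall-clock time saved going from baseline to improved. */
    public static double savingPercent(double baselineSec, double improvedSec) {
        return 100.0 * (baselineSec - improvedSec) / baselineSec;
    }

    /**
     * Upper bound on the speedup when the parts of one segment are checked
     * concurrently: the segment cannot finish before its slowest part does,
     * so the bound is 1 / (fraction of time taken by the slowest part).
     */
    public static double maxSpeedup(double slowestPartFraction) {
        return 1.0 / slowestPartFraction;
    }

    public static void main(String[] args) {
        // Timings from the two runs above (threadCount 1 vs threadCount 11).
        System.out.printf("observed saving: %.1f%%%n", savingPercent(378.583, 359.293));
        // Assumed postings share of per-segment check time.
        System.out.printf("bound at 85%%: %.2fx%n", maxSpeedup(0.85));
        System.out.printf("bound at 90%%: %.2fx%n", maxSpeedup(0.90));
    }
}
```

   The observed saving works out to roughly 5.1%, and under these 
assumptions the best case is only about a 10% to 15% saving, which is why 
more threads per segment cannot help much while postings dominates.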
   
   > Good news first! I wrote a simple little Python tool to randomly flip a 
random bit in a random file in a provided directory.
   > It looks like corruption is indeed still detected with this PR, wonderful!
   
   The Python tool looks very cool, and thanks for testing with it! One issue 
though: the bit flipping causes checksum integrity check failures *before* 
the concurrent segment part checks kick in, so it may not exercise the 
changes here. I think we may actually need to write a semantically buggy 
segment file that still passes checksum verification, to confirm that the 
error still gets detected and propagated correctly. 
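
   As an illustration of why a single flipped bit never reaches the 
per-part checks, here is a small stand-alone sketch. It uses 
`java.util.zip.CRC32` directly as a stand-in for the file checksum; it is 
not Lucene's verification code, and the class name, helper methods, and 
"file contents" are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

final class BitFlipChecksumDemo {
    /** CRC32 over the whole byte array. */
    public static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    /** Returns a copy of data with exactly one bit inverted. */
    public static byte[] flipBit(byte[] data, int bitIndex) {
        byte[] copy = data.clone();
        copy[bitIndex / 8] ^= (byte) (1 << (bitIndex % 8));
        return copy;
    }

    public static void main(String[] args) {
        // Made-up payload standing in for a segment file body.
        byte[] original = "pretend segment file contents".getBytes(StandardCharsets.UTF_8);
        long stored = crc32(original);           // checksum recorded at write time
        byte[] corrupted = flipBit(original, 42); // simulate the bit-flip tool
        System.out.println("checksum still matches: " + (crc32(corrupted) == stored));
    }
}
```

   Because CRC32 detects every single-bit error, this kind of corruption is 
always rejected at the checksum stage. Exercising the concurrent code path 
would need a file whose checksum is recomputed after the bad content is 
written.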
   
   Given the two points above, I feel maybe I should also look into 
parallelizing across segments and keeping things single-threaded / simple 
within each segment (much of what was learned here can be applied there 
anyway)? The other approach to get a better speedup could be to split up the 
postings check so that it can be handled by multiple threads as well, but 
I'm not sure yet whether it is easily parallelizable, and I'd need to look 
into it further. 
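
   A minimal sketch of the across-segments idea, assuming each segment can 
be checked independently by a single thread. `checkAll`, the `checkOne` 
callback, and the segment names are hypothetical and do not reflect the 
actual CheckIndex API; the only point is one task per segment, with results 
collected in segment order and failures propagated to the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class PerSegmentCheckSketch {
    /**
     * Submits one single-threaded check per segment to a fixed pool and
     * returns the results in the original segment order.
     */
    public static <S, R> List<R> checkAll(List<S> segments, Function<S, R> checkOne, int threadCount) {
        ExecutorService pool = Executors.newFixedThreadPool(threadCount);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (S segment : segments) {
                Callable<R> task = () -> checkOne.apply(segment);
                futures.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                try {
                    results.add(future.get()); // waits in submission order
                } catch (ExecutionException e) {
                    // A failed check surfaces here instead of being lost on a worker thread.
                    throw new IllegalStateException("segment check failed", e.getCause());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while checking", e);
                }
            }
            return results;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        List<String> segments = List.of("_0", "_1", "_2"); // made-up segment names
        List<String> report = checkAll(segments, name -> name + ": OK", 2);
        report.forEach(System.out::println);
    }
}
```

   With this shape the total time is bounded below by the slowest segment 
rather than by the postings check of every segment in turn, so it should 
help most when the index has many segments of similar size.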
   
   I'm also good with merging this in and seeing how it performs.




