sherman commented on issue #12203:
URL: https://github.com/apache/lucene/issues/12203#issuecomment-1506041033

   Hi, @rmuir!
   
   >are you sure docvalues is really the slow part of your merge. I actually 
think doing this for terms/postings would be more bang-for-the-buck?
   
   I am not claiming that doc values are the heaviest part of the force 
merge process. In my case, rewriting the doc values from the original 
segment (10 million docs) took 318 seconds, which is comparable to the time it 
takes to merge posting lists. The fully parallel write (without the final 
metadata update) took 23 seconds!
   
   >docvalues is a bit harder and trickier: typically docvalues are only a tiny 
fraction of merge costs, compared to postings (especially merging the terms 
seems to be very intensive).
   
   >there are some real traps here with docvalues, especially string fields 
(SORTED/SORTED_SET). In order to merge these fields, it has to remap the 
ordinals which requires an additional datastructure to do. Doing this for many 
fields at once without being careful could spike memory (and possibly for 
little benefit as again these fields are typically much faster to merge than 
indexed ones).
   
   Hmm. After examining the codec code in version 9.x, I came to the opposite 
conclusion. Please correct me if I'm wrong, but it appears that each doc values 
field's data consists of two files: meta and data. Moreover, it seems that each 
doc values field is written separately, without sharing data with the others.
   
   Perhaps I wasn't clear earlier, but what I meant was to write multiple doc 
values fields using the original codec, if that's possible. For instance, if I 
have two fields, I would have four files (two data files and two meta files). 
Then, I could copy those data files at the byte level, using something like `cat 
file1 > all_fields; cat file2 >> all_fields`. As for the metadata files, I 
would need to fix the absolute numbers (i.e., the offsets). Writing the data 
files is a parallel operation; updating the metadata is single-threaded.
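
   To make the idea concrete, here is a rough sketch in plain Java. It does not 
use Lucene's codec API at all: the class name, the method names and the 
`fieldN.dat` file names are made up for illustration, and the per-field payloads 
stand in for whatever the real codec would write. It only shows the shape of 
the approach: per-field files written in parallel, then one single-threaded 
pass that appends them and records each field's base offset.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper, not part of Lucene.
final class DocValuesConcatSketch {

    // Parallel step: every field's bytes go to their own temporary file.
    static List<Path> writeInParallel(Path dir, List<byte[]> payloads) throws Exception {
        int threads = Math.max(1, Math.min(payloads.size(), 4));
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Path>> futures = new ArrayList<>();
            for (int i = 0; i < payloads.size(); i++) {
                final int n = i;
                futures.add(pool.submit(
                        () -> Files.write(dir.resolve("field" + n + ".dat"), payloads.get(n))));
            }
            List<Path> files = new ArrayList<>();
            for (Future<Path> f : futures) {
                files.add(f.get()); // keeps the original field order
            }
            return files;
        } finally {
            pool.shutdown();
        }
    }

    // Single-threaded step: append the files (like `cat`) and return the
    // offset at which each field's bytes start in the combined file.
    static long[] concatenate(List<Path> fieldFiles, Path combined) throws IOException {
        long[] baseOffsets = new long[fieldFiles.size()];
        try (FileChannel out = FileChannel.open(combined, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            long base = 0;
            for (int i = 0; i < fieldFiles.size(); i++) {
                try (FileChannel in = FileChannel.open(fieldFiles.get(i), StandardOpenOption.READ)) {
                    long size = in.size();
                    long copied = 0;
                    while (copied < size) {
                        copied += in.transferTo(copied, size - copied, out);
                    }
                    baseOffsets[i] = base;
                    base += size;
                }
            }
        }
        return baseOffsets;
    }

    // Metadata fix-up: an offset relative to a per-field file becomes absolute.
    static long rebase(long baseOffset, long relativeOffset) {
        return baseOffset + relativeOffset;
    }
}
```

   The metadata pass then only has to add the matching base offset to every 
offset stored for a field, which is what `rebase` stands for here. Whether the 
real meta files can be patched this simply is exactly the part I am unsure 
about.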
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

