sherman commented on issue #12203: URL: https://github.com/apache/lucene/issues/12203#issuecomment-1506041033
Hi, @rmuir!

> are you sure docvalues is really the slow part of your merge. I actually think doing this for terms/postings would be more bang-for-the-buck?

I am not claiming that doc values are the heaviest part of the force merge process. In my case, rewriting the doc values from the original segment (10 million docs) took 318 seconds, which is comparable to the time it takes to merge the posting lists. The fully parallel write (without the final metadata update) took 23 seconds!

> docvalues is a bit harder and trickier: typically docvalues are only a tiny fraction of merge costs, compared to postings (especially merging the terms seems to be very intensive).
>
> there are some real traps here with docvalues, especially string fields (SORTED/SORTED_SET). In order to merge these fields, it has to remap the ordinals which requires an additional datastructure to do. Doing this for many fields at once without being careful could spike memory (and possibly for little benefit as again these fields are typically much faster to merge than indexed ones).

Hmm. After examining the codec code in version 9.x, I came to the opposite conclusion. Please correct me if I'm wrong, but it appears that the data for each doc values field consists of two files: meta and data. Moreover, it seems that each doc values field is written separately, without sharing data with the others.

Perhaps I wasn't clear earlier, but what I meant was to write multiple doc values fields using the original codec, if that's possible. For instance, if I have two fields, I would have four files (two data files and two meta files). Then I could copy those data files at the byte level, using something like `cat file1 > all_fields; cat file2 >> all_fields`. As for the metadata files, I would need to fix the absolute numbers (i.e., the offsets). Writing the data files is a parallel operation; updating the metadata is single-threaded.
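To make the proposal concrete, here is a minimal Python sketch of the two phases described above: per-field data files written in parallel, then a single-threaded pass that concatenates them byte for byte and shifts each field's offsets by the position where its data lands in the combined file. This is a toy layout, not Lucene's actual doc values format or API; `write_field`, `merge_fields`, the `.data` suffix, and the `(offset, length)` metadata entries are all hypothetical names chosen for illustration.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def write_field(directory: Path, name: str, values: list[bytes]) -> dict:
    """Write one field's data file; return metadata with file-relative offsets."""
    path = directory / f"{name}.data"  # hypothetical per-field file name
    entries = []
    offset = 0
    with path.open("wb") as out:
        for value in values:
            out.write(value)
            entries.append((offset, len(value)))
            offset += len(value)
    return {"field": name, "path": path, "entries": entries}


def merge_fields(directory: Path, fields: dict[str, list[bytes]]) -> tuple[Path, dict]:
    # Phase 1 (parallel): every field is written independently of the others.
    with ThreadPoolExecutor() as pool:
        metas = list(pool.map(lambda kv: write_field(directory, *kv), fields.items()))

    # Phase 2 (single-threaded): byte-level concatenation, like `cat`,
    # plus rebasing each field's offsets onto the combined file.
    combined = directory / "all_fields.data"
    merged_meta = {}
    base = 0
    with combined.open("wb") as out:
        for meta in metas:
            with meta["path"].open("rb") as src:
                shutil.copyfileobj(src, out)
            merged_meta[meta["field"]] = [
                (base + offset, length) for offset, length in meta["entries"]
            ]
            base += meta["path"].stat().st_size
    return combined, merged_meta
```

The sketch only covers the case where the stored offsets are the sole cross-reference into the data file. It does not address the ordinal remapping for SORTED/SORTED_SET fields raised in the quoted reply.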
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org