[GitHub] [lucene] romseygeek commented on a diff in pull request #12222: fix FragEnd bug in BaseFragmentsBuilder
romseygeek commented on code in PR #12222:
URL: https://github.com/apache/lucene/pull/12222#discussion_r1164021939

## lucene/highlighter/src/test/org/apache/lucene/search/vectorhighlight/TestSimpleFragmentsBuilder.java:

@@ -226,7 +226,7 @@ public void testDiscreteMultiValueHighlighting() throws Exception {
     result = sfb.createFragments(reader, 0, F, ffl, 3);
     assertEquals(2, result.length);
     assertEquals("text to highlight", result[0]);
-    assertEquals("highlight other text", result[1]);
+    assertEquals("highlight other", result[1]);

Review Comment:
Do you know why this has changed? The fragment length is still 32 characters, so I'd expect to get the full text string back.
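For readers following the archive, here is a minimal, self-contained sketch of the vector-highlighter pipeline that the test's `sfb.createFragments(reader, 0, F, ffl, 3)` call sits at the end of. This is not the test code itself: the field name, document values, and query below are illustrative assumptions; only fragCharSize = 32 and maxNumFragments = 3 come from the discussion above.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.vectorhighlight.FastVectorHighlighter;
import org.apache.lucene.search.vectorhighlight.FieldFragList;
import org.apache.lucene.search.vectorhighlight.FieldPhraseList;
import org.apache.lucene.search.vectorhighlight.FieldQuery;
import org.apache.lucene.search.vectorhighlight.FieldTermStack;
import org.apache.lucene.search.vectorhighlight.SimpleFragListBuilder;
import org.apache.lucene.search.vectorhighlight.SimpleFragmentsBuilder;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class DiscreteMultiValueHighlightSketch {
  public static void main(String[] args) throws Exception {
    Directory dir = new ByteBuffersDirectory();
    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
      // The FastVectorHighlighter needs stored values plus term vectors with positions and offsets.
      FieldType ft = new FieldType(TextField.TYPE_STORED);
      ft.setStoreTermVectors(true);
      ft.setStoreTermVectorPositions(true);
      ft.setStoreTermVectorOffsets(true);
      Document doc = new Document();
      doc.add(new Field("f", "some text to highlight", ft)); // illustrative values, not the test's data
      doc.add(new Field("f", "highlight other text", ft));
      writer.addDocument(doc);
    }
    try (IndexReader reader = DirectoryReader.open(dir)) {
      FieldQuery fq = new FastVectorHighlighter().getFieldQuery(new TermQuery(new Term("f", "highlight")));
      FieldTermStack stack = new FieldTermStack(reader, 0, "f", fq);
      FieldPhraseList fpl = new FieldPhraseList(stack, fq);
      // fragCharSize = 32 and maxNumFragments = 3, the numbers discussed in the review comment.
      FieldFragList ffl = new SimpleFragListBuilder().createFieldFragList(fpl, 32);
      SimpleFragmentsBuilder sfb = new SimpleFragmentsBuilder();
      sfb.setDiscreteMultiValueHighlighting(true); // build fragments per field value
      for (String fragment : sfb.createFragments(reader, 0, "f", ffl, 3)) {
        System.out.println(fragment);
      }
    }
  }
}
```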
[GitHub] [lucene] romseygeek commented on a diff in pull request #12222: fix FragEnd bug in BaseFragmentsBuilder
romseygeek commented on code in PR #12222:
URL: https://github.com/apache/lucene/pull/12222#discussion_r1164024337

## lucene/highlighter/src/test/org/apache/lucene/search/vectorhighlight/TestSimpleFragmentsBuilder.java:

@@ -226,7 +226,7 @@ public void testDiscreteMultiValueHighlighting() throws Exception {
     result = sfb.createFragments(reader, 0, F, ffl, 3);
     assertEquals(2, result.length);
     assertEquals("text to highlight", result[0]);
-    assertEquals("highlight other text", result[1]);
+    assertEquals("highlight other", result[1]);

Review Comment:
Reading again, I see now that you covered this in your opening comment, sorry! I'm not sure it's the correct behaviour though. Maybe we need to take into account whether this is the final possible text?
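As a purely hypothetical illustration of "taking the final possible text into account" (the names and logic below are not from BaseFragmentsBuilder), one way to compute a fragment window is to clamp its end to the value's length and, where possible, shift the start back, so a trailing value shorter than fragCharSize is returned whole rather than cut at a computed boundary:

```java
// Hypothetical sketch only; not the PR's implementation.
final class FragWindowSketch {

  /** Returns {start, end} offsets of a fragment of at most fragCharSize chars. */
  static int[] fragmentWindow(int matchStart, int matchEnd, int fragCharSize, int valueLength) {
    int start = Math.max(0, matchStart - (fragCharSize - (matchEnd - matchStart)) / 2);
    int end = start + fragCharSize;
    if (end > valueLength) {
      // This is the final possible text: clamp to the value's end and, if possible,
      // move the start back so the fragment still uses the full character budget.
      end = valueLength;
      start = Math.max(0, end - fragCharSize);
    }
    return new int[] {start, end};
  }

  public static void main(String[] args) {
    String value = "highlight other text";            // 20 chars, well under fragCharSize = 32
    int[] w = fragmentWindow(0, "highlight".length(), 32, value.length());
    System.out.println(value.substring(w[0], w[1]));  // prints the whole value
  }
}
```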
[GitHub] [lucene] sherman commented on issue #12203: Scalable merge/compaction of big doc values segments.
sherman commented on issue #12203:
URL: https://github.com/apache/lucene/issues/12203#issuecomment-1506041033

Hi, @rmuir!

> are you sure docvalues is really the slow part of your merge. I actually think doing this for terms/postings would be more bang-for-the-buck?

I am not claiming that doc values are the heaviest part of the force-merge process. In my case, rewriting the doc values of the original segment (10 million docs) took 318 seconds, which is comparable to the time it takes to merge the posting lists. Fully parallel writing (without the final metadata update) took 23 seconds!

> docvalues is a bit harder and trickier: typically docvalues are only a tiny fraction of merge costs, compared to postings (especially merging the terms seems to be very intensive).
> there are some real traps here with docvalues, especially string fields (SORTED/SORTED_SET). In order to merge these fields, it has to remap the ordinals which requires an additional datastructure to do. Doing this for many fields at once without being careful could spike memory (and possibly for little benefit as again these fields are typically much faster to merge than indexed ones).

Hmm. After examining the codec code in version 9.x, I came to the opposite conclusion. Please correct me if I'm wrong, but it appears that each doc values field consists of two files: meta and data. Moreover, each doc values field seems to be written separately, without sharing data between fields. Perhaps I wasn't clear earlier, but what I meant was to write multiple doc values fields using the original codec, if that's possible. For instance, if I have two fields, I would have four files (two data files and two meta files). I could then concatenate the data files at the byte level, with something like `cat file1 > all_fields; cat file2 >> all_fields`. As for the metadata files, I would need to fix the absolute numbers (i.e., the offsets). Writing the data files is a parallel operation; updating the metadata is single-threaded.
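To make the concatenate-and-fix-offsets idea concrete, here is a rough sketch; it is not Lucene codec code, and the file layout, names, and offset map are illustrative assumptions. It only shows the byte-level append step and where the offset fix-up would happen; rewriting the per-field metadata is codec-specific and left as a comment.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.LinkedHashMap;
import java.util.Map;

public class ConcatDocValuesDataSketch {

  /** Appends each per-field data file to {@code combined}; returns field -> base offset. */
  static Map<String, Long> concat(Path combined, Map<String, Path> perFieldData) throws IOException {
    Map<String, Long> baseOffsets = new LinkedHashMap<>();
    try (FileChannel out = FileChannel.open(combined,
        StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
      for (Map.Entry<String, Path> e : perFieldData.entrySet()) {
        baseOffsets.put(e.getKey(), out.position()); // where this field's bytes start in the combined file
        try (FileChannel in = FileChannel.open(e.getValue(), StandardOpenOption.READ)) {
          long transferred = 0;
          long size = in.size();
          while (transferred < size) { // transferTo may copy fewer bytes than requested
            transferred += in.transferTo(transferred, size - transferred, out);
          }
        }
      }
    }
    // The single-threaded step would follow here: rewrite each field's meta entries,
    // adding baseOffsets.get(field) to every absolute data offset (codec-specific).
    return baseOffsets;
  }
}
```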
[GitHub] [lucene] zacharymorn commented on pull request #12194: [GITHUB-11915] Make Lucene smarter about long runs of matches via new API on DISI
zacharymorn commented on PR #12194:
URL: https://github.com/apache/lucene/pull/12194#issuecomment-1506345579

Hi @jpountz @mikemccand @rmuir @uschindler @gsmiller, I have added some tests over the last few days and believe this PR is ready for review now. Could you please take a look and let me know if you have any suggestions? By the way, I'm not particularly sure about my approach for conjunctions and for leveraging skip data, and I'm open to alternatives!