[GitHub] [lucene] romseygeek commented on a diff in pull request #12222: fix FragEnd bug in BaseFragmentsBuilder
romseygeek commented on code in PR #12222:
URL: https://github.com/apache/lucene/pull/12222#discussion_r1164021939

## lucene/highlighter/src/test/org/apache/lucene/search/vectorhighlight/TestSimpleFragmentsBuilder.java:

@@ -226,7 +226,7 @@ public void testDiscreteMultiValueHighlighting() throws Exception {
     result = sfb.createFragments(reader, 0, F, ffl, 3);
     assertEquals(2, result.length);
     assertEquals("text to highlight", result[0]);
-    assertEquals("highlight other text", result[1]);
+    assertEquals("highlight other", result[1]);

Review Comment:
Do you know why this has changed? The fragment length is still 32 characters, so I'd expect to get the full text string back.
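For readers following the archive, here is a minimal, self-contained sketch of the vector-highlighter pipeline that the test's `sfb.createFragments(reader, 0, F, ffl, 3)` call sits at the end of. This is not the test code itself: the field name, document values, and query below are illustrative assumptions; only fragCharSize = 32 and maxNumFragments = 3 come from the discussion above.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.vectorhighlight.FastVectorHighlighter;
import org.apache.lucene.search.vectorhighlight.FieldFragList;
import org.apache.lucene.search.vectorhighlight.FieldPhraseList;
import org.apache.lucene.search.vectorhighlight.FieldQuery;
import org.apache.lucene.search.vectorhighlight.FieldTermStack;
import org.apache.lucene.search.vectorhighlight.SimpleFragListBuilder;
import org.apache.lucene.search.vectorhighlight.SimpleFragmentsBuilder;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class DiscreteMultiValueHighlightSketch {
  public static void main(String[] args) throws Exception {
    Directory dir = new ByteBuffersDirectory();
    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
      // The FastVectorHighlighter needs stored values plus term vectors with positions and offsets.
      FieldType ft = new FieldType(TextField.TYPE_STORED);
      ft.setStoreTermVectors(true);
      ft.setStoreTermVectorPositions(true);
      ft.setStoreTermVectorOffsets(true);
      Document doc = new Document();
      doc.add(new Field("f", "some text to highlight", ft)); // illustrative values, not the test's data
      doc.add(new Field("f", "highlight other text", ft));
      writer.addDocument(doc);
    }
    try (IndexReader reader = DirectoryReader.open(dir)) {
      FieldQuery fq = new FastVectorHighlighter().getFieldQuery(new TermQuery(new Term("f", "highlight")));
      FieldTermStack stack = new FieldTermStack(reader, 0, "f", fq);
      FieldPhraseList fpl = new FieldPhraseList(stack, fq);
      // fragCharSize = 32 and maxNumFragments = 3, the numbers discussed in the review comment.
      FieldFragList ffl = new SimpleFragListBuilder().createFieldFragList(fpl, 32);
      SimpleFragmentsBuilder sfb = new SimpleFragmentsBuilder();
      sfb.setDiscreteMultiValueHighlighting(true); // build fragments per field value
      for (String fragment : sfb.createFragments(reader, 0, "f", ffl, 3)) {
        System.out.println(fragment);
      }
    }
  }
}
```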
[GitHub] [lucene] romseygeek commented on a diff in pull request #12222: fix FragEnd bug in BaseFragmentsBuilder
romseygeek commented on code in PR #12222:
URL: https://github.com/apache/lucene/pull/12222#discussion_r1164024337

## lucene/highlighter/src/test/org/apache/lucene/search/vectorhighlight/TestSimpleFragmentsBuilder.java:

@@ -226,7 +226,7 @@ public void testDiscreteMultiValueHighlighting() throws Exception {
     result = sfb.createFragments(reader, 0, F, ffl, 3);
     assertEquals(2, result.length);
     assertEquals("text to highlight", result[0]);
-    assertEquals("highlight other text", result[1]);
+    assertEquals("highlight other", result[1]);

Review Comment:
Reading again, I see now that you covered this in your opening comment, sorry! I'm not sure it's the correct behaviour though. Maybe we need to take into account whether this is the final possible text?
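As a purely hypothetical illustration of "taking the final possible text into account" (the names and logic below are not from BaseFragmentsBuilder), one way to compute a fragment window is to clamp its end to the value's length and, where possible, shift the start back, so a trailing value shorter than fragCharSize is returned whole rather than cut at a computed boundary:

```java
// Hypothetical sketch only; not the PR's implementation.
final class FragWindowSketch {

  /** Returns {start, end} offsets of a fragment of at most fragCharSize chars. */
  static int[] fragmentWindow(int matchStart, int matchEnd, int fragCharSize, int valueLength) {
    int start = Math.max(0, matchStart - (fragCharSize - (matchEnd - matchStart)) / 2);
    int end = start + fragCharSize;
    if (end > valueLength) {
      // This is the final possible text: clamp to the value's end and, if possible,
      // move the start back so the fragment still uses the full character budget.
      end = valueLength;
      start = Math.max(0, end - fragCharSize);
    }
    return new int[] {start, end};
  }

  public static void main(String[] args) {
    String value = "highlight other text";            // 20 chars, well under fragCharSize = 32
    int[] w = fragmentWindow(0, "highlight".length(), 32, value.length());
    System.out.println(value.substring(w[0], w[1]));  // prints the whole value
  }
}
```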
[GitHub] [lucene] sherman commented on issue #12203: Scalable merge/compaction of big doc values segments.
sherman commented on issue #12203:
URL: https://github.com/apache/lucene/issues/12203#issuecomment-1506041033

Hi, @rmuir!

> are you sure docvalues is really the slow part of your merge. I actually think doing this for terms/postings would be more bang-for-the-buck?

I am not claiming that doc values are the heaviest part of the force-merge process. In my case, rewriting the doc values of the original segment (10 million docs) took 318 seconds, which is comparable to the time it takes to merge the posting lists. Fully parallel writing (without the final metadata update) took 23 seconds!

> docvalues is a bit harder and trickier: typically docvalues are only a tiny fraction of merge costs, compared to postings (especially merging the terms seems to be very intensive).
> there are some real traps here with docvalues, especially string fields (SORTED/SORTED_SET). In order to merge these fields, it has to remap the ordinals which requires an additional datastructure to do. Doing this for many fields at once without being careful could spike memory (and possibly for little benefit as again these fields are typically much faster to merge than indexed ones).

Hmm. After examining the codec code in version 9.x, I came to the opposite conclusion. Please correct me if I'm wrong, but it appears that each doc values field consists of two files: meta and data. Moreover, each doc values field seems to be written separately, without sharing data between fields. Perhaps I wasn't clear earlier, but what I meant was to write multiple doc values fields using the original codec, if that's possible. For instance, if I have two fields, I would have four files (two data files and two meta files). I could then concatenate the data files at the byte level, with something like `cat file1 > all_fields; cat file2 >> all_fields`. As for the metadata files, I would need to fix the absolute numbers (i.e., the offsets). Writing the data files is a parallel operation; updating the metadata is single-threaded.
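To make the concatenate-and-fix-offsets idea concrete, here is a rough sketch; it is not Lucene codec code, and the file layout, names, and offset map are illustrative assumptions. It only shows the byte-level append step and where the offset fix-up would happen; rewriting the per-field metadata is codec-specific and left as a comment.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.LinkedHashMap;
import java.util.Map;

public class ConcatDocValuesDataSketch {

  /** Appends each per-field data file to {@code combined}; returns field -> base offset. */
  static Map<String, Long> concat(Path combined, Map<String, Path> perFieldData) throws IOException {
    Map<String, Long> baseOffsets = new LinkedHashMap<>();
    try (FileChannel out = FileChannel.open(combined,
        StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
      for (Map.Entry<String, Path> e : perFieldData.entrySet()) {
        baseOffsets.put(e.getKey(), out.position()); // where this field's bytes start in the combined file
        try (FileChannel in = FileChannel.open(e.getValue(), StandardOpenOption.READ)) {
          long transferred = 0;
          long size = in.size();
          while (transferred < size) { // transferTo may copy fewer bytes than requested
            transferred += in.transferTo(transferred, size - transferred, out);
          }
        }
      }
    }
    // The single-threaded step would follow here: rewrite each field's meta entries,
    // adding baseOffsets.get(field) to every absolute data offset (codec-specific).
    return baseOffsets;
  }
}
```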
[GitHub] [lucene] zacharymorn commented on pull request #12194: [GITHUB-11915] Make Lucene smarter about long runs of matches via new API on DISI
zacharymorn commented on PR #12194:
URL: https://github.com/apache/lucene/pull/12194#issuecomment-1506345579

Hi @jpountz @mikemccand @rmuir @uschindler @gsmiller, I have added some tests over the last few days and believe this PR is ready for review now. Could you please take a look and let me know if you have any suggestions? By the way, I'm not particularly sure about my approach for conjunctions and for leveraging skip data, and I'm open to alternatives!