[jira] [Created] (LUCENE-10035) Simple text codec add multi level skip list data
wuda created LUCENE-10035:
-

             Summary: Simple text codec add multi level skip list data
                 Key: LUCENE-10035
                 URL: https://issues.apache.org/jira/browse/LUCENE-10035
             Project: Lucene - Core
          Issue Type: New Feature
          Components: core/codecs
    Affects Versions: main (9.0)
            Reporter: wuda

Add skip list data (including impacts) to the simple text codec to help understand the index format. For debugging, curiosity, transparency only!!

When a term's docFreq is greater than or equal to SimpleTextSkipWriter.BLOCK_SIZE (default value is 8), the simple text codec will write a skip list, and the *.pst (simple text term dictionary file)* file will look like this:
{code:java}
field title
  term args
    doc 2
      freq 2
      pos 7
      pos 10
## we omit docs for better view ..
    doc 98
      freq 2
      pos 2
      pos 6
    skipList
?
      level 1
        skipDoc 65
        skipDocFP 949
        impacts
          impact
            freq 1
            norm 2
          impact
            freq 2
            norm 12
          impact
            freq 3
            norm 13
        impacts_end
?
      level 0
        skipDoc 17
        skipDocFP 284
        impacts
          impact
            freq 1
            norm 2
          impact
            freq 2
            norm 12
        impacts_end
        skipDoc 34
        skipDocFP 624
        impacts
          impact
            freq 1
            norm 2
          impact
            freq 2
            norm 12
          impact
            freq 3
            norm 14
        impacts_end
        skipDoc 65
        skipDocFP 949
        impacts
          impact
            freq 1
            norm 2
          impact
            freq 2
            norm 12
          impact
            freq 3
            norm 13
        impacts_end
        skipDoc 90
        skipDocFP 1311
        impacts
          impact
            freq 1
            norm 2
          impact
            freq 2
            norm 10
          impact
            freq 3
            norm 13
          impact
            freq 4
            norm 14
        impacts_end
END
checksum 000829315543
{code}
Compared with the previous format, we add the *skipList, level, skipDoc, skipDocFP, impacts, impact, freq, norm* nodes. At the same time, the simple text codec can support advanceShallow at search time.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
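To make the role of the level 0 entries in the listing concrete, here is a minimal Java sketch of how a reader might use them to skip ahead. It is not the code from the patch: the class `SkipListSketch`, its methods, and the interpretation of `skipDocFP` as "the file pointer to resume reading from after `skipDoc`" are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Lucene: reads skip entries from SimpleText-style tokens. */
final class SkipListSketch {

  /** One skip entry: a doc ID and the file pointer written next to it. */
  static final class SkipEntry {
    final int skipDoc;
    final long skipDocFP;

    SkipEntry(int skipDoc, long skipDocFP) {
      this.skipDoc = skipDoc;
      this.skipDocFP = skipDocFP;
    }
  }

  /** Collects (skipDoc, skipDocFP) pairs from the whitespace-separated tokens of one level. */
  static List<SkipEntry> parse(String levelText) {
    String[] tokens = levelText.trim().split("\\s+");
    List<SkipEntry> entries = new ArrayList<>();
    for (int i = 0; i + 3 < tokens.length; i++) {
      if (tokens[i].equals("skipDoc") && tokens[i + 2].equals("skipDocFP")) {
        entries.add(
            new SkipEntry(Integer.parseInt(tokens[i + 1]), Long.parseLong(tokens[i + 3])));
      }
    }
    return entries;
  }

  /**
   * Returns the file pointer of the last entry whose skipDoc is below the target,
   * or -1 if no entry can be skipped (assumed semantics, see the lead-in).
   */
  static long skipPointerFor(List<SkipEntry> entries, int target) {
    long fp = -1L;
    for (SkipEntry e : entries) {
      if (e.skipDoc >= target) {
        break; // entries are assumed to be sorted by skipDoc
      }
      fp = e.skipDocFP;
    }
    return fp;
  }
}
```

With the level 0 values from the listing, a target of doc 70 would select the entry `skipDoc 65 / skipDocFP 949` under these assumptions, so a reader could jump there instead of scanning every `doc` line from the start of the term.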
[GitHub] [lucene] wuda0112 opened a new pull request #224: LUCENE-10035: Simple text codec add multi level skip list data
wuda0112 opened a new pull request #224:
URL: https://github.com/apache/lucene/pull/224

   Add skip list data (including impacts) to the simple text codec to help understand the index format. For debugging, curiosity, transparency only!!

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[jira] [Updated] (LUCENE-10035) Simple text codec add multi level skip list data
[ https://issues.apache.org/jira/browse/LUCENE-10035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

wuda updated LUCENE-10035:
--
    Description: 
Add skip list data (including impacts) to the simple text codec to help understand the index format. For debugging, curiosity, transparency only!!

When a term's docFreq is greater than or equal to SimpleTextSkipWriter.BLOCK_SIZE (default value is 8), the simple text codec will write a skip list, and the *.pst (simple text term dictionary file)* file will look like this:
{code:java}
field title
  term args
    doc 2
      freq 2
      pos 7
      pos 10
## we omit docs for better view ..
    doc 98
      freq 2
      pos 2
      pos 6
    skipList
?
      level 1
        skipDoc 65
        skipDocFP 949
        impacts
          impact
            freq 1
            norm 2
          impact
            freq 2
            norm 12
          impact
            freq 3
            norm 13
        impacts_end
?
      level 0
        skipDoc 17
        skipDocFP 284
        impacts
          impact
            freq 1
            norm 2
          impact
            freq 2
            norm 12
        impacts_end
        skipDoc 34
        skipDocFP 624
        impacts
          impact
            freq 1
            norm 2
          impact
            freq 2
            norm 12
          impact
            freq 3
            norm 14
        impacts_end
        skipDoc 65
        skipDocFP 949
        impacts
          impact
            freq 1
            norm 2
          impact
            freq 2
            norm 12
          impact
            freq 3
            norm 13
        impacts_end
        skipDoc 90
        skipDocFP 1311
        impacts
          impact
            freq 1
            norm 2
          impact
            freq 2
            norm 10
          impact
            freq 3
            norm 13
          impact
            freq 4
            norm 14
        impacts_end
END
checksum 000829315543
{code}
Compared with the previous format, we add the *skipList, level, skipDoc, skipDocFP, impacts, impact, freq, norm* nodes. At the same time, the simple text codec can support advanceShallow at search time.

> Simple text codec add multi level skip list data
> --
>
>                 Key: LUCENE-10035
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10035
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/codecs
>    Affects Versions: main (9.0)
>            Reporter: wuda
>            Priority: Major
>              Labels: Impact, MultiLevelSkipList, SimpleTextCodec
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Simple text codec add skip list data( include impact) to help understand
> index format,For debugging, curiosity, trans
[jira] [Commented] (LUCENE-10016) VectorReader.search needs rethought, o.a.l.search integration?
[ https://issues.apache.org/jira/browse/LUCENE-10016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17386761#comment-17386761 ]

ASF subversion and git services commented on LUCENE-10016:
--

Commit 0ec93b632ce0be880a1e68902bccd07bae65602d in lucene's branch refs/heads/main from Michael Sokolov
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=0ec93b6 ]

LUCENE-10016: fix test case to use the same similarity in both cases

> VectorReader.search needs rethought, o.a.l.search integration?
> --
>
>                 Key: LUCENE-10016
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10016
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Robert Muir
>            Priority: Blocker
>             Fix For: 9.0
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> There's no search integration (e.g. queries) for the current vector values,
> no documentation/examples that I can find.
> Instead the codec has this method:
> {code}
> TopDocs search(String field, float[] target, int k, int fanout)
> {code}
> First, the "fanout" parameter needs to go, this is specific to HNSW impl, get
> it out of here.
> Second, How am I supposed to skip over deleted documents? How can I use
> filters? How should i search across multiple segments?
[jira] [Commented] (LUCENE-10034) Vectors NeighborQueue MIN/MAX heap reversed?
[ https://issues.apache.org/jira/browse/LUCENE-10034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17386765#comment-17386765 ]

Michael Sokolov commented on LUCENE-10034:
--

I have trouble with any "distance" that has d(x, x) != 0, and I thought similarity was more general, but I'm not sure what you're proposing. I mean one thing we could try is to enforce that these must *be* distances in the sense that bigger values of d(x, y) mean x, y are less similar, i.e. further away from each other. But if we do that, then dot-product calculations (or euclidean ones if we define it in the opposite way) will have to do extra work to conform to it that we don't require today.

> Vectors NeighborQueue MIN/MAX heap reversed?
> 
>
>                 Key: LUCENE-10034
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10034
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Mayya Sharipova
>            Priority: Trivial
>
> NeighborQueue is defined as follows:
> {code:java}
> NeighborQueue(int initialSize, boolean reversed) {
>   if (reversed) {
>     heap = LongHeap.create(LongHeap.Order.MAX, initialSize);
>   } else {
>     heap = LongHeap.create(LongHeap.Order.MIN, initialSize);
>   }
> }
> {code}
> Should it be reversed? Should it instead be using a MIN heap for reversed
> functions such as EUCLIDEAN distance, as we are interested in neighbors with
> min euclidean distances?
> I apologize if I missed some broader context where this definition makes
> sense.
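As general background on the MIN/MAX question, the sketch below shows a common pattern in bounded top-k collection: to keep the k smallest distances, a max-heap is used so that the worst candidate kept so far sits on top and can be evicted cheaply. It uses only `java.util.PriorityQueue`; the class `TopKSketch` and its method are hypothetical, and the thread does not confirm that this is the rationale behind `NeighborQueue`.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.PriorityQueue;

/** Hypothetical illustration, not Lucene code: bounded top-k for a "smaller is better" metric. */
final class TopKSketch {

  /** Returns the k smallest distances in ascending order (fewer if the input is shorter). */
  static float[] kSmallest(float[] distances, int k) {
    if (k <= 0) {
      return new float[0];
    }
    // Max-heap: peek() is the largest, i.e. the worst, distance currently kept.
    PriorityQueue<Float> heap = new PriorityQueue<>(Comparator.<Float>reverseOrder());
    for (float d : distances) {
      if (heap.size() < k) {
        heap.add(d);
      } else if (d < heap.peek()) {
        heap.poll(); // evict the current worst
        heap.add(d);
      }
    }
    float[] out = new float[heap.size()];
    int i = 0;
    for (float d : heap) {
      out[i++] = d;
    }
    Arrays.sort(out);
    return out;
  }
}
```

For a "bigger is better" score such as dot product, the same pattern would use a min-heap instead, which is one way a heap order can look reversed relative to the metric.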