[jira] [Created] (LUCENE-10035) Simple text codec add multi level skip list data

2021-07-24 Thread wuda (Jira)
wuda created LUCENE-10035:
-

 Summary: Simple text codec add  multi level skip list data 
 Key: LUCENE-10035
 URL: https://issues.apache.org/jira/browse/LUCENE-10035
 Project: Lucene - Core
  Issue Type: New Feature
  Components: core/codecs
Affects Versions: main (9.0)
Reporter: wuda


Add skip list data (including impacts) to the simple text codec to help
understand the index format. For debugging, curiosity, and transparency only!
When a term's docFreq is greater than or equal to
SimpleTextSkipWriter.BLOCK_SIZE (default value is 8), the simple text codec
will write a skip list, and the *.pst (simple text term dictionary file)* file
will look like this:
{code:java}
field title
  term args
    doc 2
      freq 2
      pos 7
      pos 10
## we omit docs for better view ..
    doc 98
      freq 2
      pos 2
      pos 6
    skipList
?
      level 1
        skipDoc 65
        skipDocFP 949
        impacts
          impact
            freq 1
            norm 2
          impact
            freq 2
            norm 12
          impact
            freq 3
            norm 13
        impacts_end
?
      level 0
        skipDoc 17
        skipDocFP 284
        impacts
          impact
            freq 1
            norm 2
          impact
            freq 2
            norm 12
        impacts_end
        skipDoc 34
        skipDocFP 624
        impacts
          impact
            freq 1
            norm 2
          impact
            freq 2
            norm 12
          impact
            freq 3
            norm 14
        impacts_end
        skipDoc 65
        skipDocFP 949
        impacts
          impact
            freq 1
            norm 2
          impact
            freq 2
            norm 12
          impact
            freq 3
            norm 13
        impacts_end
        skipDoc 90
        skipDocFP 1311
        impacts
          impact
            freq 1
            norm 2
          impact
            freq 2
            norm 10
          impact
            freq 3
            norm 13
          impact
            freq 4
            norm 14
        impacts_end
END
checksum 000829315543

{code}
Compared with the previous format, we add the *skipList, level, skipDoc,
skipDocFP, impacts, impact, freq, norm* nodes. At the same time, the simple
text codec can support advanceShallow at search time.
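
To make the role of the skipDoc and impacts nodes concrete, here is a minimal
Java sketch. It is not Lucene code. It copies the level 0 entries from the dump
above into a hypothetical SkipEntry class and assumes that each skipDoc is the
last doc ID covered by its block. Under that assumption, a shallow advance to a
target doc only needs to find the first entry whose skipDoc is at or past the
target, and that entry's impacts bound the freq values inside the block. The
names SkipListSketch, SkipEntry, shallowAdvance and maxFreq are made up for
illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative model of the level 0 skip entries in the dump; not Lucene code. */
final class SkipListSketch {

  /** One skip entry: assumed block upper bound, file pointer, and impacts. */
  static final class SkipEntry {
    final int skipDoc;
    final long skipDocFP;
    final int[][] impacts; // each row is {freq, norm}

    SkipEntry(int skipDoc, long skipDocFP, int[][] impacts) {
      this.skipDoc = skipDoc;
      this.skipDocFP = skipDocFP;
      this.impacts = impacts;
    }
  }

  /** Level 0 entries copied by hand from the sample dump. */
  static List<SkipEntry> sampleLevel0() {
    List<SkipEntry> level0 = new ArrayList<>();
    level0.add(new SkipEntry(17, 284, new int[][] {{1, 2}, {2, 12}}));
    level0.add(new SkipEntry(34, 624, new int[][] {{1, 2}, {2, 12}, {3, 14}}));
    level0.add(new SkipEntry(65, 949, new int[][] {{1, 2}, {2, 12}, {3, 13}}));
    level0.add(new SkipEntry(90, 1311, new int[][] {{1, 2}, {2, 10}, {3, 13}, {4, 14}}));
    return level0;
  }

  /** Returns the first entry with skipDoc >= target, or null if there is none. */
  static SkipEntry shallowAdvance(List<SkipEntry> level0, int target) {
    for (SkipEntry entry : level0) {
      if (entry.skipDoc >= target) {
        return entry;
      }
    }
    return null; // target lies past the last skip entry
  }

  /** Largest freq recorded in the entry's impacts. */
  static int maxFreq(SkipEntry entry) {
    int max = 0;
    for (int[] impact : entry.impacts) {
      max = Math.max(max, impact[0]);
    }
    return max;
  }

  public static void main(String[] args) {
    SkipEntry entry = shallowAdvance(sampleLevel0(), 40);
    System.out.println(entry.skipDoc + " " + maxFreq(entry)); // prints "65 3"
  }
}
```

A real reader would start at the highest level and descend, and a target past
the last skip entry would fall back to scanning the remaining docs. The linear
scan here only keeps the sketch short.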



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] wuda0112 opened a new pull request #224: LUCENE-10035: Simple text codec add multi level skip list data

2021-07-24 Thread GitBox


wuda0112 opened a new pull request #224:
URL: https://github.com/apache/lucene/pull/224


   Add skip list data (including impacts) to the simple text codec to help
understand the index format. For debugging, curiosity, and transparency only!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org






[jira] [Updated] (LUCENE-10035) Simple text codec add multi level skip list data

2021-07-24 Thread wuda (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuda updated LUCENE-10035:
--
Description: 
Add skip list data (including impacts) to the simple text codec to help
understand the index format. For debugging, curiosity, and transparency only!
When a term's docFreq is greater than or equal to
SimpleTextSkipWriter.BLOCK_SIZE (default value is 8), the simple text codec
will write a skip list, and the *.pst (simple text term dictionary file)* file
will look like this:
{code:java}
field title
  term args
    doc 2
      freq 2
      pos 7
      pos 10
## we omit docs for better view ..
    doc 98
      freq 2
      pos 2
      pos 6
    skipList
?
      level 1
        skipDoc 65
        skipDocFP 949
        impacts
          impact
            freq 1
            norm 2
          impact
            freq 2
            norm 12
          impact
            freq 3
            norm 13
        impacts_end
?
      level 0
        skipDoc 17
        skipDocFP 284
        impacts
          impact
            freq 1
            norm 2
          impact
            freq 2
            norm 12
        impacts_end
        skipDoc 34
        skipDocFP 624
        impacts
          impact
            freq 1
            norm 2
          impact
            freq 2
            norm 12
          impact
            freq 3
            norm 14
        impacts_end
        skipDoc 65
        skipDocFP 949
        impacts
          impact
            freq 1
            norm 2
          impact
            freq 2
            norm 12
          impact
            freq 3
            norm 13
        impacts_end
        skipDoc 90
        skipDocFP 1311
        impacts
          impact
            freq 1
            norm 2
          impact
            freq 2
            norm 10
          impact
            freq 3
            norm 13
          impact
            freq 4
            norm 14
        impacts_end
END
checksum 000829315543

{code}
Compared with the previous format, we add the *skipList, level, skipDoc,
skipDocFP, impacts, impact, freq, norm* nodes. At the same time, the simple
text codec can support advanceShallow at search time.

 

 


> Simple text codec add  multi level skip list data 
> --
>
> Key: LUCENE-10035
> URL: https://issues.apache.org/jira/browse/LUCENE-10035
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: core/codecs
>Affects Versions: main (9.0)
>Reporter: wuda
>Priority: Major
>  Labels: Impact, MultiLevelSkipList, SimpleTextCodec
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Simple text codec add skip list data( include impact) to help understand 
> index format,For debugging, curiosity, trans

[jira] [Commented] (LUCENE-10016) VectorReader.search needs rethought, o.a.l.search integration?

2021-07-24 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17386761#comment-17386761
 ] 

ASF subversion and git services commented on LUCENE-10016:
--

Commit 0ec93b632ce0be880a1e68902bccd07bae65602d in lucene's branch 
refs/heads/main from Michael Sokolov
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=0ec93b6 ]

LUCENE-10016: fix test case to use the same similarity in both cases


> VectorReader.search needs rethought, o.a.l.search integration?
> --
>
> Key: LUCENE-10016
> URL: https://issues.apache.org/jira/browse/LUCENE-10016
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Blocker
> Fix For: 9.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> There's no search integration (e.g. queries) for the current vector values, 
> no documentation/examples that I can find.
> Instead the codec has this method:
> {code}
> TopDocs search(String field, float[] target, int k, int fanout)
> {code}
> First, the "fanout" parameter needs to go: it is specific to the HNSW impl, so
> get it out of here.
> Second, how am I supposed to skip over deleted documents? How can I use
> filters? How should I search across multiple segments?
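
The questions about deleted documents and filters can be pictured with a small
Java sketch. It is a brute-force illustration under stated assumptions, not the
codec's API and not a proposal from this issue: the class FilteredKnnSketch,
its search method and the acceptDocs predicate are hypothetical names. The
point it shows is that a per-document accept check (live docs, or a filter)
has to be applied while candidates are collected, not after the top k have
been chosen.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.IntPredicate;

/** Hypothetical exhaustive vector search with a per-document filter. */
final class FilteredKnnSketch {

  /**
   * Returns up to k doc IDs with the highest dot product against target,
   * considering only docs for which acceptDocs returns true.
   */
  static int[] search(float[][] vectors, float[] target, int k, IntPredicate acceptDocs) {
    List<Integer> candidates = new ArrayList<>();
    for (int doc = 0; doc < vectors.length; doc++) {
      if (acceptDocs.test(doc)) { // skip deleted or filtered-out docs up front
        candidates.add(doc);
      }
    }
    // Sort by descending score; the stable sort keeps doc ID order on ties.
    candidates.sort(Comparator.comparingDouble((Integer doc) -> -dot(vectors[doc], target)));
    return candidates.stream().limit(k).mapToInt(Integer::intValue).toArray();
  }

  /** Plain dot product of two vectors of equal length. */
  static double dot(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  public static void main(String[] args) {
    float[][] vectors = {{1f, 0f}, {0.9f, 0.1f}, {0f, 1f}};
    // Doc 0 is the best match but is treated as deleted here.
    int[] hits = search(vectors, new float[] {1f, 0f}, 2, doc -> doc != 0);
    System.out.println(hits[0] + " " + hits[1]); // prints "1 2"
  }
}
```

Searching across multiple segments would then amount to running such a search
per segment and merging the per-segment results, which this sketch does not
attempt.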






[jira] [Commented] (LUCENE-10034) Vectors NeighborQueue MIN/MAX heap reversed?

2021-07-24 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17386765#comment-17386765
 ] 

Michael Sokolov commented on LUCENE-10034:
--

I have trouble with any "distance" that has d(x, x) != 0, and I thought
similarity was more general, but I'm not sure what you're proposing. I mean, one
thing we could try is to enforce that these must *be* distances in the sense
that bigger values of d(x, y) mean x and y are less similar, i.e. further away
from each other. But if we do that, then dot-product calculations (or Euclidean
ones if we define it the opposite way) will have to do extra work to conform to
it, work that we don't require today.
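
A short Java sketch of the two points above, under stated assumptions: the
class DistanceSketch and the adapter dotProductAsDistance are hypothetical, and
negation is only one possible way to make larger values mean "less similar".
It shows that squared Euclidean distance gives d(x, x) = 0, while a negated dot
product orders neighbors correctly but still has d(x, x) != 0, and that the
negation itself is the kind of extra step the comment refers to.

```java
/** Illustrative comparison of a true distance and an adapted similarity. */
final class DistanceSketch {

  /** Squared Euclidean distance: zero for identical vectors, larger means further apart. */
  static float squaredEuclidean(float[] x, float[] y) {
    float sum = 0f;
    for (int i = 0; i < x.length; i++) {
      float diff = x[i] - y[i];
      sum += diff * diff;
    }
    return sum;
  }

  /** Dot product: larger means more similar. */
  static float dotProduct(float[] x, float[] y) {
    float sum = 0f;
    for (int i = 0; i < x.length; i++) {
      sum += x[i] * y[i];
    }
    return sum;
  }

  /** Hypothetical adapter: negate so that larger values mean less similar. */
  static float dotProductAsDistance(float[] x, float[] y) {
    return -dotProduct(x, y);
  }

  public static void main(String[] args) {
    float[] x = {1f, 2f};
    System.out.println(squaredEuclidean(x, x));     // prints "0.0"
    System.out.println(dotProductAsDistance(x, x)); // prints "-5.0"
  }
}
```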

> Vectors NeighborQueue MIN/MAX heap reversed?
> 
>
> Key: LUCENE-10034
> URL: https://issues.apache.org/jira/browse/LUCENE-10034
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Mayya Sharipova
>Priority: Trivial
>
> NeighborQueue is defined as follows:
> {code:java}
> NeighborQueue(int initialSize, boolean reversed) {
>   if (reversed) {
>     heap = LongHeap.create(LongHeap.Order.MAX, initialSize);
>   } else {
>     heap = LongHeap.create(LongHeap.Order.MIN, initialSize);
>   }
> }
> {code}
> Should it be reversed? Should it instead use a MIN heap for reversed
> functions such as EUCLIDEAN distance, since we are interested in neighbors
> with the minimum Euclidean distances?
> I apologize if I missed some broader context where this definition makes 
> sense. 
>  
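
For context on the MIN versus MAX question, here is a general top-k pattern in
Java, offered as a sketch and not as a statement of how NeighborQueue is meant
to work. When a bounded queue keeps the k smallest distances, a max-heap is a
common choice, because the worst kept candidate sits on top and can be evicted
cheaply. The class TopKSketch and the method kSmallest are hypothetical, and
java.util.PriorityQueue stands in for Lucene's LongHeap.

```java
import java.util.Collections;
import java.util.PriorityQueue;

/** Illustrative bounded top-k collection using java.util.PriorityQueue. */
final class TopKSketch {

  /** Returns the k smallest distances in ascending order. */
  static float[] kSmallest(float[] distances, int k) {
    if (k <= 0) {
      return new float[0];
    }
    // Max-heap: peek() is the largest (worst) distance currently kept.
    PriorityQueue<Float> maxHeap = new PriorityQueue<>(Collections.reverseOrder());
    for (float d : distances) {
      if (maxHeap.size() < k) {
        maxHeap.add(d);
      } else if (d < maxHeap.peek()) {
        maxHeap.poll(); // evict the current worst
        maxHeap.add(d);
      }
    }
    // poll() yields largest first, so fill the result from the back.
    float[] result = new float[maxHeap.size()];
    for (int i = result.length - 1; i >= 0; i--) {
      result[i] = maxHeap.poll();
    }
    return result;
  }

  public static void main(String[] args) {
    float[] nearest = kSmallest(new float[] {5f, 1f, 4f, 2f, 3f}, 3);
    System.out.println(java.util.Arrays.toString(nearest)); // prints "[1.0, 2.0, 3.0]"
  }
}
```

Whether this is the reason for the MAX order in the constructor quoted above
is not stated in this thread.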


