[GitHub] [lucene] MarcusSorealheis commented on a diff in pull request #12208: Explain term automaton queries

2023-04-05 Thread via GitHub


MarcusSorealheis commented on code in PR #12208:
URL: https://github.com/apache/lucene/pull/12208#discussion_r1158060254


##
lucene/sandbox/src/test/org/apache/lucene/sandbox/search/TestTermAutomatonQuery.java:
##
@@ -842,6 +844,95 @@ public void testRewriteSimplePhrase() throws Exception {
 IOUtils.close(w, r, dir);
   }
 
+  public void testExplainNonMatchingDocument() throws Exception {
+TermAutomatonQuery q = new TermAutomatonQuery("field");
+int initState = q.createState();
+int s1 = q.createState();
+int s2 = q.createState();
+q.addTransition(initState, s1, "xml");
+q.addTransition(s1, s2, "json");
+q.setAccept(s2, true);
+q.finish();
+
+Directory dir = newDirectory();
+RandomIndexWriter w = new RandomIndexWriter(random(), dir);
+Document doc = new Document();
+doc.add(newTextField("field", "protobuf", Field.Store.NO));
+w.addDocument(doc);
+
+IndexReader r = w.getReader();
+IndexSearcher searcher = newSearcher(r);
+Query rewrite = q.rewrite(searcher);
+assertTrue(rewrite instanceof PhraseQuery);

Review Comment:
   This is another error. I should probably test for `TermQuery`.
   
   Edit: changed to sensical query option. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] MarcusSorealheis commented on a diff in pull request #12208: Explain term automaton queries

2023-04-05 Thread via GitHub


MarcusSorealheis commented on code in PR #12208:
URL: https://github.com/apache/lucene/pull/12208#discussion_r1158142144


##
lucene/sandbox/src/test/org/apache/lucene/sandbox/search/TestTermAutomatonQuery.java:
##
@@ -842,6 +844,95 @@ public void testRewriteSimplePhrase() throws Exception {
 IOUtils.close(w, r, dir);
   }
 
+  public void testExplainNonMatchingDocument() throws Exception {
+TermAutomatonQuery q = new TermAutomatonQuery("field");
+int initState = q.createState();
+int s1 = q.createState();
+int s2 = q.createState();
+q.addTransition(initState, s1, "xml");
+q.addTransition(s1, s2, "json");
+q.setAccept(s2, true);
+q.finish();
+
+Directory dir = newDirectory();
+RandomIndexWriter w = new RandomIndexWriter(random(), dir);
+Document doc = new Document();
+doc.add(newTextField("field", "protobuf", Field.Store.NO));
+w.addDocument(doc);
+
+IndexReader r = w.getReader();
+IndexSearcher searcher = newSearcher(r);
+Query rewrite = q.rewrite(searcher);
+assertTrue(rewrite instanceof PhraseQuery);
+
+TopDocs topDocs = searcher.search(rewrite, 10);
+assertEquals(0, topDocs.totalHits.value);
+
+for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
+  Explanation explanation = searcher.explain(rewrite, scoreDoc.doc);
+  assertNotNull("Explanation should not be null", explanation);
+  assertTrue(
+  "Explanation score should match the actual score",
+  Float.compare(scoreDoc.score, (Float) explanation.getValue()) == 0);
+}
+
+IOUtils.close(w, r, dir);
+  }
+
+  // TODO: improve experience of working with explain

Review Comment:
   it's sort of open ended, tbh. Many things to improve over time.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] MarcusSorealheis commented on a diff in pull request #12208: Explain term automaton queries

2023-04-05 Thread via GitHub


MarcusSorealheis commented on code in PR #12208:
URL: https://github.com/apache/lucene/pull/12208#discussion_r1158142961


##
lucene/sandbox/src/test/org/apache/lucene/sandbox/search/TestTermAutomatonQuery.java:
##
@@ -842,6 +844,95 @@ public void testRewriteSimplePhrase() throws Exception {
 IOUtils.close(w, r, dir);
   }
 
+  public void testExplainNonMatchingDocument() throws Exception {
+TermAutomatonQuery q = new TermAutomatonQuery("field");
+int initState = q.createState();
+int s1 = q.createState();
+int s2 = q.createState();
+q.addTransition(initState, s1, "xml");
+q.addTransition(s1, s2, "json");
+q.setAccept(s2, true);
+q.finish();
+
+Directory dir = newDirectory();
+RandomIndexWriter w = new RandomIndexWriter(random(), dir);
+Document doc = new Document();
+doc.add(newTextField("field", "protobuf", Field.Store.NO));
+w.addDocument(doc);
+
+IndexReader r = w.getReader();
+IndexSearcher searcher = newSearcher(r);
+Query rewrite = q.rewrite(searcher);
+assertTrue(rewrite instanceof PhraseQuery);
+
+TopDocs topDocs = searcher.search(rewrite, 10);
+assertEquals(0, topDocs.totalHits.value);
+
+for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
+  Explanation explanation = searcher.explain(rewrite, scoreDoc.doc);
+  assertNotNull("Explanation should not be null", explanation);
+  assertTrue(
+  "Explanation score should match the actual score",
+  Float.compare(scoreDoc.score, (Float) explanation.getValue()) == 0);
+}
+
+IOUtils.close(w, r, dir);
+  }
+
+  // TODO: improve experience of working with explain
+

Review Comment:
   Fixed in: 
[f91ef20](https://github.com/apache/lucene/pull/12208/commits/f91ef20e5ea706b00c3bf438d9d0b937c146d747)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] MarcusSorealheis commented on a diff in pull request #12208: Explain term automaton queries

2023-04-05 Thread via GitHub


MarcusSorealheis commented on code in PR #12208:
URL: https://github.com/apache/lucene/pull/12208#discussion_r1158143400


##
lucene/sandbox/src/test/org/apache/lucene/sandbox/search/TestTermAutomatonQuery.java:
##
@@ -842,6 +844,95 @@ public void testRewriteSimplePhrase() throws Exception {
 IOUtils.close(w, r, dir);
   }
 
+  public void testExplainNonMatchingDocument() throws Exception {
+TermAutomatonQuery q = new TermAutomatonQuery("field");
+int initState = q.createState();
+int s1 = q.createState();
+int s2 = q.createState();
+q.addTransition(initState, s1, "xml");
+q.addTransition(s1, s2, "json");
+q.setAccept(s2, true);
+q.finish();
+
+Directory dir = newDirectory();
+RandomIndexWriter w = new RandomIndexWriter(random(), dir);
+Document doc = new Document();
+doc.add(newTextField("field", "protobuf", Field.Store.NO));
+w.addDocument(doc);
+
+IndexReader r = w.getReader();
+IndexSearcher searcher = newSearcher(r);
+Query rewrite = q.rewrite(searcher);
+assertTrue(rewrite instanceof PhraseQuery);
+
+TopDocs topDocs = searcher.search(rewrite, 10);
+assertEquals(0, topDocs.totalHits.value);
+
+for (ScoreDoc scoreDoc : topDocs.scoreDocs) {

Review Comment:
   fixed in : 
[f91ef20](https://github.com/apache/lucene/pull/12208/commits/f91ef20e5ea706b00c3bf438d9d0b937c146d747)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] MarcusSorealheis commented on a diff in pull request #12208: Explain term automaton queries

2023-04-05 Thread via GitHub


MarcusSorealheis commented on code in PR #12208:
URL: https://github.com/apache/lucene/pull/12208#discussion_r1158228357


##
lucene/sandbox/src/test/org/apache/lucene/sandbox/search/TestTermAutomatonQuery.java:
##
@@ -842,6 +844,95 @@ public void testRewriteSimplePhrase() throws Exception {
 IOUtils.close(w, r, dir);
   }
 
+  public void testExplainNonMatchingDocument() throws Exception {
+TermAutomatonQuery q = new TermAutomatonQuery("field");
+int initState = q.createState();
+int s1 = q.createState();
+int s2 = q.createState();
+q.addTransition(initState, s1, "xml");
+q.addTransition(s1, s2, "json");
+q.setAccept(s2, true);
+q.finish();
+
+Directory dir = newDirectory();
+RandomIndexWriter w = new RandomIndexWriter(random(), dir);
+Document doc = new Document();
+doc.add(newTextField("field", "protobuf", Field.Store.NO));
+w.addDocument(doc);
+
+IndexReader r = w.getReader();
+IndexSearcher searcher = newSearcher(r);
+Query rewrite = q.rewrite(searcher);
+assertTrue(rewrite instanceof PhraseQuery);

Review Comment:
   We should. I made that change in this test, only. 
   
   However, I want to point out that the TermAutomatonQuery's explain will 
rewrite to a variety of other queries depending on the what the client 
supplies. I think `PhraseQuery` is merely one of the primitive options when 
explain is called. The explain is not an exact representation of the true query 
today. That's something that I think we should look to add because users still 
benefit from where the work is as of now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] MarcusSorealheis commented on a diff in pull request #12208: Explain term automaton queries

2023-04-05 Thread via GitHub


MarcusSorealheis commented on code in PR #12208:
URL: https://github.com/apache/lucene/pull/12208#discussion_r1158233897


##
lucene/sandbox/src/test/org/apache/lucene/sandbox/search/TestTermAutomatonQuery.java:
##
@@ -842,6 +844,95 @@ public void testRewriteSimplePhrase() throws Exception {
 IOUtils.close(w, r, dir);
   }
 
+  public void testExplainNonMatchingDocument() throws Exception {
+TermAutomatonQuery q = new TermAutomatonQuery("field");
+int initState = q.createState();
+int s1 = q.createState();
+int s2 = q.createState();
+q.addTransition(initState, s1, "xml");
+q.addTransition(s1, s2, "json");
+q.setAccept(s2, true);
+q.finish();
+
+Directory dir = newDirectory();
+RandomIndexWriter w = new RandomIndexWriter(random(), dir);
+Document doc = new Document();
+doc.add(newTextField("field", "protobuf", Field.Store.NO));
+w.addDocument(doc);
+
+IndexReader r = w.getReader();
+IndexSearcher searcher = newSearcher(r);
+Query rewrite = q.rewrite(searcher);
+assertTrue(rewrite instanceof PhraseQuery);
+
+TopDocs topDocs = searcher.search(rewrite, 10);
+assertEquals(0, topDocs.totalHits.value);
+
+for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
+  Explanation explanation = searcher.explain(rewrite, scoreDoc.doc);
+  assertNotNull("Explanation should not be null", explanation);
+  assertTrue(
+  "Explanation score should match the actual score",
+  Float.compare(scoreDoc.score, (Float) explanation.getValue()) == 0);
+}
+
+IOUtils.close(w, r, dir);
+  }
+
+  // TODO: improve experience of working with explain
+
+  public void testExplainMatchingDocuments() throws Exception {
+TermAutomatonQuery q = new TermAutomatonQuery("field");
+
+int initState = q.createState();
+int s1 = q.createState();
+int s2 = q.createState();
+int s3 = q.createState();
+q.addTransition(initState, s1, "foo");
+q.addAnyTransition(s1, s2);
+q.addTransition(s2, s3, "bar");
+q.setAccept(s3, true);
+q.finish();
+
+Directory dir = newDirectory();
+RandomIndexWriter w = new RandomIndexWriter(random(), dir);
+Document doc1 = new Document();
+doc1.add(newTextField("field", "foo x bar", Field.Store.NO));
+w.addDocument(doc1);
+
+Document doc2 = new Document();
+doc2.add(newTextField("field", "foo y bar", Field.Store.NO));
+w.addDocument(doc2);
+
+IndexReader r = w.getReader();
+IndexSearcher searcher = newSearcher(r);
+
+QueryUtils.check(random(), q, searcher);
+TopDocs topDocs = searcher.search(q, 10);
+// Verify that the docs matched fasho
+assertEquals(2, topDocs.totalHits.value);
+
+// Iterate through score docs and assert aspects of explanation
+for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
+  Explanation explanation = searcher.explain(q, scoreDoc.doc);

Review Comment:
   Can you clarify what you mean to be sure? Rewrite is behaving as intended 
here as I understand. 



##
lucene/sandbox/src/test/org/apache/lucene/sandbox/search/TestTermAutomatonQuery.java:
##
@@ -842,6 +844,95 @@ public void testRewriteSimplePhrase() throws Exception {
 IOUtils.close(w, r, dir);
   }
 
+  public void testExplainNonMatchingDocument() throws Exception {
+TermAutomatonQuery q = new TermAutomatonQuery("field");
+int initState = q.createState();
+int s1 = q.createState();
+int s2 = q.createState();
+q.addTransition(initState, s1, "xml");
+q.addTransition(s1, s2, "json");
+q.setAccept(s2, true);
+q.finish();
+
+Directory dir = newDirectory();
+RandomIndexWriter w = new RandomIndexWriter(random(), dir);
+Document doc = new Document();
+doc.add(newTextField("field", "protobuf", Field.Store.NO));
+w.addDocument(doc);
+
+IndexReader r = w.getReader();
+IndexSearcher searcher = newSearcher(r);
+Query rewrite = q.rewrite(searcher);
+assertTrue(rewrite instanceof PhraseQuery);
+
+TopDocs topDocs = searcher.search(rewrite, 10);
+assertEquals(0, topDocs.totalHits.value);
+
+for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
+  Explanation explanation = searcher.explain(rewrite, scoreDoc.doc);
+  assertNotNull("Explanation should not be null", explanation);
+  assertTrue(
+  "Explanation score should match the actual score",
+  Float.compare(scoreDoc.score, (Float) explanation.getValue()) == 0);
+}
+
+IOUtils.close(w, r, dir);
+  }
+
+  // TODO: improve experience of working with explain
+
+  public void testExplainMatchingDocuments() throws Exception {
+TermAutomatonQuery q = new TermAutomatonQuery("field");
+
+int initState = q.createState();
+int s1 = q.createState();
+int s2 = q.createState();
+int s3 = q.createState();
+q.addTransition(initState, s1, "foo");
+q.addAnyTransition(s1, s2);
+q.addTransition(s2, s3, "bar");
+q.setAccept(s3, true);
+q.finish()

[GitHub] [lucene] dantuzi commented on a diff in pull request #12169: Introduced the Word2VecSynonymFilter

2023-04-05 Thread via GitHub


dantuzi commented on code in PR #12169:
URL: https://github.com/apache/lucene/pull/12169#discussion_r1158329797


##
lucene/test-framework/src/java/org/apache/lucene/tests/analysis/BaseTokenStreamTestCase.java:
##
@@ -221,6 +223,12 @@ public static void assertTokenStreamContents(
   flagsAtt = ts.getAttribute(FlagsAttribute.class);
 }
 
+BoostAttribute boostAtt = null;

Review Comment:
   I'm going to update my PR removing the BoostAttribute as you suggest



##
lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/word2vec/Dl4jModelReader.java:
##
@@ -0,0 +1,122 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.analysis.synonym.word2vec;
+
+import java.io.BufferedInputStream;
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.io.UnsupportedEncodingException;
+import java.nio.charset.StandardCharsets;
+import java.util.Base64;
+import java.util.Locale;
+import java.util.zip.ZipEntry;
+import java.util.zip.ZipInputStream;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.TermAndVector;
+
+/**
+ * Word2VecModelReader is a Word2VecModelReader that reads the file generated 
by the library
+ * Deeplearning4j
+ *
+ * Dl4j Word2Vec documentation:
+ * 
https://deeplearning4j.konduit.ai/v/en-1.0.0-beta7/language-processing/word2vec 
Example to
+ * generate a model using dl4j:
+ * 
https://github.com/eclipse/deeplearning4j-examples/blob/master/dl4j-examples/src/main/java/org/deeplearning4j/examples/advanced/modelling/embeddingsfromcorpus/word2vec/Word2VecRawTextExample.java
+ *
+ * @lucene.experimental
+ */
+public class Dl4jModelReader implements Word2VecModelReader {
+
+  private static final String MODEL_FILE_NAME_PREFIX = "syn0";
+
+  private final String word2vecModelFilePath;
+  private final ZipInputStream word2VecModelZipFile;
+
+  public Dl4jModelReader(String word2vecModelFilePath, InputStream stream) {
+this.word2vecModelFilePath = word2vecModelFilePath;
+this.word2VecModelZipFile = new ZipInputStream(new 
BufferedInputStream(stream));
+  }
+
+  @Override
+  public Word2VecModel read() throws IOException {
+
+ZipEntry entry;
+while ((entry = word2VecModelZipFile.getNextEntry()) != null) {
+  String fileName = entry.getName();
+  if (fileName.startsWith(MODEL_FILE_NAME_PREFIX)) {
+BufferedReader reader =
+new BufferedReader(new InputStreamReader(word2VecModelZipFile, 
StandardCharsets.UTF_8));
+
+String header = reader.readLine();
+String[] headerValues = header.split(" ");
+int dictionarySize = Integer.parseInt(headerValues[0]);
+int vectorDimension = Integer.parseInt(headerValues[1]);
+
+Word2VecModel model = new Word2VecModel(dictionarySize, 
vectorDimension);
+reader
+.lines()
+.forEach(
+line -> {
+  String[] tokens = line.split(" ");
+  BytesRef term = decodeTerm(tokens[0]);
+
+  float[] vector = new float[tokens.length - 1];
+
+  if (vectorDimension != vector.length) {
+throw new RuntimeException(
+String.format(
+Locale.ROOT,
+"Word2Vec model file corrupted. "
++ "Declared vectors of size %d but found 
vector of size %d for word %s (%s)",
+vectorDimension,
+vector.length,
+tokens[0],
+term.utf8ToString()));
+  }
+
+  for (int i = 1; i < tokens.length; i++) {
+vector[i - 1] = Float.parseFloat(tokens[i]);
+  }
+  model.addTermAndVector(new TermAndVector(term, vector));
+});
+return model;
+  }
+}
+throw new UnsupportedEncodingException(

Review Comment:
   When we use the library DL4J to train a model and we export it, we obtain a 
compressed zip file.
   This zip contains multiple files but we are onl

[GitHub] [lucene] zhaih commented on a diff in pull request #12208: Explain term automaton queries

2023-04-05 Thread via GitHub


zhaih commented on code in PR #12208:
URL: https://github.com/apache/lucene/pull/12208#discussion_r1158810796


##
lucene/sandbox/src/test/org/apache/lucene/sandbox/search/TestTermAutomatonQuery.java:
##
@@ -842,6 +844,95 @@ public void testRewriteSimplePhrase() throws Exception {
 IOUtils.close(w, r, dir);
   }
 
+  public void testExplainNonMatchingDocument() throws Exception {
+TermAutomatonQuery q = new TermAutomatonQuery("field");
+int initState = q.createState();
+int s1 = q.createState();
+int s2 = q.createState();
+q.addTransition(initState, s1, "xml");
+q.addTransition(s1, s2, "json");
+q.setAccept(s2, true);
+q.finish();
+
+Directory dir = newDirectory();
+RandomIndexWriter w = new RandomIndexWriter(random(), dir);
+Document doc = new Document();
+doc.add(newTextField("field", "protobuf", Field.Store.NO));
+w.addDocument(doc);
+
+IndexReader r = w.getReader();
+IndexSearcher searcher = newSearcher(r);
+Query rewrite = q.rewrite(searcher);
+assertTrue(rewrite instanceof PhraseQuery);
+
+TopDocs topDocs = searcher.search(rewrite, 10);
+assertEquals(0, topDocs.totalHits.value);
+
+for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
+  Explanation explanation = searcher.explain(rewrite, scoreDoc.doc);
+  assertNotNull("Explanation should not be null", explanation);
+  assertTrue(
+  "Explanation score should match the actual score",
+  Float.compare(scoreDoc.score, (Float) explanation.getValue()) == 0);
+}
+
+IOUtils.close(w, r, dir);
+  }
+
+  // TODO: improve experience of working with explain
+
+  public void testExplainMatchingDocuments() throws Exception {
+TermAutomatonQuery q = new TermAutomatonQuery("field");
+
+int initState = q.createState();
+int s1 = q.createState();
+int s2 = q.createState();
+int s3 = q.createState();
+q.addTransition(initState, s1, "foo");
+q.addAnyTransition(s1, s2);
+q.addTransition(s2, s3, "bar");
+q.setAccept(s3, true);
+q.finish();
+
+Directory dir = newDirectory();
+RandomIndexWriter w = new RandomIndexWriter(random(), dir);
+Document doc1 = new Document();
+doc1.add(newTextField("field", "foo x bar", Field.Store.NO));
+w.addDocument(doc1);
+
+Document doc2 = new Document();
+doc2.add(newTextField("field", "foo y bar", Field.Store.NO));
+w.addDocument(doc2);
+
+IndexReader r = w.getReader();
+IndexSearcher searcher = newSearcher(r);
+
+QueryUtils.check(random(), q, searcher);
+TopDocs topDocs = searcher.search(q, 10);
+// Verify that the docs matched fasho
+assertEquals(2, topDocs.totalHits.value);
+
+// Iterate through score docs and assert aspects of explanation
+for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
+  Explanation explanation = searcher.explain(q, scoreDoc.doc);

Review Comment:
   Oh I just want to make sure we're testing the target class. 
   Since I think this test should test specifically the explain impl of 
`TermAutomtonQuery` but not `PhraseQuery` or `MultiPhraseQuery` as that should 
be tested somewhere else.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] zhaih commented on a diff in pull request #12208: Explain term automaton queries

2023-04-05 Thread via GitHub


zhaih commented on code in PR #12208:
URL: https://github.com/apache/lucene/pull/12208#discussion_r1158817727


##
lucene/sandbox/src/test/org/apache/lucene/sandbox/search/TestTermAutomatonQuery.java:
##
@@ -842,6 +844,95 @@ public void testRewriteSimplePhrase() throws Exception {
 IOUtils.close(w, r, dir);
   }
 
+  public void testExplainNonMatchingDocument() throws Exception {
+TermAutomatonQuery q = new TermAutomatonQuery("field");
+int initState = q.createState();
+int s1 = q.createState();
+int s2 = q.createState();
+q.addTransition(initState, s1, "xml");
+q.addTransition(s1, s2, "json");
+q.setAccept(s2, true);
+q.finish();
+
+Directory dir = newDirectory();
+RandomIndexWriter w = new RandomIndexWriter(random(), dir);
+Document doc = new Document();
+doc.add(newTextField("field", "protobuf", Field.Store.NO));
+w.addDocument(doc);
+
+IndexReader r = w.getReader();
+IndexSearcher searcher = newSearcher(r);
+Query rewrite = q.rewrite(searcher);
+assertTrue(rewrite instanceof PhraseQuery);

Review Comment:
   > TermAutomatonQuery's explain will rewrite to a variety of other queries
   
   To be more precise, `IndexSearcher` will try to rewrite the query first and 
then if the rewrite still returns the `TermAutomatonQuery`, then it will create 
`TermAutomatonWeight` and call the explain you implemented.
   So if we want to test the specific implementation (which is what I think we 
should do here), we need to make sure we created a query that does not rewrite 
to something else.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent opened a new issue, #12224: Find forever home for `KnnGraphTester`

2023-04-05 Thread via GitHub


benwtrent opened a new issue, #12224:
URL: https://github.com/apache/lucene/issues/12224

   ### Description
   
   `KnnGraphTester` is a good utility for creating a vector index, running 
searches, and then giving results. 
   
   Consequently, it can be used for benchmarking and assessing recall quality 
vs. brute-force nearest-neighbors. 
   
   However, right now, it lives within `test` in Lucene. This seems 
non-optimal. 
   
   A natural place could be that we move this utility class into 
[luceneutil](https://github.com/mikemccand/luceneutil).
   
   I know @jpountz has mentioned this to me in passing before. 
   
   What say you @msokolov?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] MarcusSorealheis commented on a diff in pull request #12208: Explain term automaton queries

2023-04-05 Thread via GitHub


MarcusSorealheis commented on code in PR #12208:
URL: https://github.com/apache/lucene/pull/12208#discussion_r1159010540


##
lucene/sandbox/src/test/org/apache/lucene/sandbox/search/TestTermAutomatonQuery.java:
##
@@ -842,6 +844,95 @@ public void testRewriteSimplePhrase() throws Exception {
 IOUtils.close(w, r, dir);
   }
 
+  public void testExplainNonMatchingDocument() throws Exception {
+TermAutomatonQuery q = new TermAutomatonQuery("field");
+int initState = q.createState();
+int s1 = q.createState();
+int s2 = q.createState();
+q.addTransition(initState, s1, "xml");
+q.addTransition(s1, s2, "json");
+q.setAccept(s2, true);
+q.finish();
+
+Directory dir = newDirectory();
+RandomIndexWriter w = new RandomIndexWriter(random(), dir);
+Document doc = new Document();
+doc.add(newTextField("field", "protobuf", Field.Store.NO));
+w.addDocument(doc);
+
+IndexReader r = w.getReader();
+IndexSearcher searcher = newSearcher(r);
+Query rewrite = q.rewrite(searcher);
+assertTrue(rewrite instanceof PhraseQuery);
+
+TopDocs topDocs = searcher.search(rewrite, 10);
+assertEquals(0, topDocs.totalHits.value);
+
+for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
+  Explanation explanation = searcher.explain(rewrite, scoreDoc.doc);
+  assertNotNull("Explanation should not be null", explanation);
+  assertTrue(
+  "Explanation score should match the actual score",
+  Float.compare(scoreDoc.score, (Float) explanation.getValue()) == 0);
+}
+
+IOUtils.close(w, r, dir);
+  }
+
+  // TODO: improve experience of working with explain
+
+  public void testExplainMatchingDocuments() throws Exception {
+TermAutomatonQuery q = new TermAutomatonQuery("field");
+
+int initState = q.createState();
+int s1 = q.createState();
+int s2 = q.createState();
+int s3 = q.createState();
+q.addTransition(initState, s1, "foo");
+q.addAnyTransition(s1, s2);
+q.addTransition(s2, s3, "bar");
+q.setAccept(s3, true);
+q.finish();
+
+Directory dir = newDirectory();
+RandomIndexWriter w = new RandomIndexWriter(random(), dir);
+Document doc1 = new Document();
+doc1.add(newTextField("field", "foo x bar", Field.Store.NO));
+w.addDocument(doc1);
+
+Document doc2 = new Document();
+doc2.add(newTextField("field", "foo y bar", Field.Store.NO));
+w.addDocument(doc2);
+
+IndexReader r = w.getReader();
+IndexSearcher searcher = newSearcher(r);
+
+QueryUtils.check(random(), q, searcher);
+TopDocs topDocs = searcher.search(q, 10);
+// Verify that the docs matched fasho
+assertEquals(2, topDocs.totalHits.value);
+
+// Iterate through score docs and assert aspects of explanation
+for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
+  Explanation explanation = searcher.explain(q, scoreDoc.doc);

Review Comment:
   I finally think I understand. My approach was to override rewrite to prevent 
being rewritten to a primitive class. Not sure if that's the right way to 
approach, so please advise.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] MarcusSorealheis commented on pull request #12208: Explain term automaton queries

2023-04-05 Thread via GitHub


MarcusSorealheis commented on PR #12208:
URL: https://github.com/apache/lucene/pull/12208#issuecomment-1498148310

   > Sorry this is not what I meant. So first, the intend of the added unit 
test should test the explain code you added, but not rewriting nor explain 
implementation of TermQuery or PhraseQuery. Then, look at how IndexSearcher 
implement explain 
([src](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java#L831)),
 it will first rewrite the query, and then call the explain for the _rewritten_ 
query's weight. So, if we want to test the code you just added, we need to make 
sure the rewritten query is actually still TermAutomatonQuery. Last, let's look 
at the rewrite logic 
[here](https://github.com/apache/lucene/blob/main/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/TermAutomatonQuery.java#L529),
 I haven't dive too deep there but basically there's a comment
   > 
   > ```
   > // Try for either PhraseQuery or MultiPhraseQuery, which only works when 
the automaton is a
   > // sausage:
   > ```
   > 
   > So I guess we should be able to figure out a way to let it not rewrite to 
other query type.
   > 
   > So to summary, I think in the test we should:
   > 
   > 1. create a query that won't rewrite to other type
   > 2. call searcher.rewrite and assert the query type to double check
   > 3. check what you claimed (like NoMatchingDocument or 2 documents matched)
   > 4. check the explain implementation,
   >4.1. if there's no document matched, then make sure the explain returns 
`NO_MATCH` for any documents
   >4.2. if there's document matched, make sure the document matched 
returns proper explanation.
   > 
   > I hope it makes sense
   
   It does make sense. To be fair, it probably made sense from the first time 
you said it. I think it needed to percolate in my brain for a bit. I think I've 
got it now. Or I hope I do. I don't want to get into modifying rewrite because 
that would be a big change, out of scope for this effort, do the pref testing 
required a regression risks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] MarcusSorealheis commented on a diff in pull request #12208: Explain term automaton queries

2023-04-05 Thread via GitHub


MarcusSorealheis commented on code in PR #12208:
URL: https://github.com/apache/lucene/pull/12208#discussion_r1159037972


##
lucene/sandbox/src/test/org/apache/lucene/sandbox/search/TestTermAutomatonQuery.java:
##
@@ -842,6 +843,91 @@ public void testRewriteSimplePhrase() throws Exception {
 IOUtils.close(w, r, dir);
   }
 
+  /* Implement a custom term automaton query to ensure that rewritten queries

Review Comment:
   @zhaih This felt like the simplest way to force the rewrite to be a 
`TermAutomatonQuery`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] Stupiddd opened a new issue, #12225: BulkOperationPacked decode() bug

2023-04-05 Thread via GitHub


Stupiddd opened a new issue, #12225:
URL: https://github.com/apache/lucene/issues/12225

   ### Description
   
   public void decode(
 byte[] blocks, int blocksOffset, long[] values, int valuesOffset, int 
iterations) {
   long nextValue = 0L;
   int bitsLeft = bitsPerValue;
   for (int i = 0; i < iterations * byteBlockCount; ++i) {
 final long bytes = blocks[blocksOffset++] & 0xFFL;
 if (bitsLeft > 8) {
   // just buffer
   bitsLeft -= 8;
   nextValue |= bytes << bitsLeft;
 } else {
   // flush
   int bits = 8 - bitsLeft;
   values[valuesOffset++] = nextValue | (bytes >>> bits);
   while (bits >= bitsPerValue) {
 bits -= bitsPerValue;
 values[valuesOffset++] = (bytes >>> bits) & mask;
   }
   // then buffer
   bitsLeft = bitsPerValue - bits;
   nextValue = (bytes & ((1L << bits) - 1)) << bitsLeft;
 }
   }
   assert bitsLeft == bitsPerValue;
 }
   maybe use for 【iterations * byteValueCount】
   
   ### Version and environment details
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] MarcusSorealheis commented on pull request #12208: Explain term automaton queries

2023-04-05 Thread via GitHub


MarcusSorealheis commented on PR #12208:
URL: https://github.com/apache/lucene/pull/12208#issuecomment-1498479368

   When I changed the tests to force `TermAutomatonQuery`, I noticed the scores 
changed. Thanks for that. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dnhatn commented on issue #12225: BulkOperationPacked decode() bug

2023-04-05 Thread via GitHub


dnhatn commented on issue #12225:
URL: https://github.com/apache/lucene/issues/12225#issuecomment-1498523614

   @Stupiddd It appears that this issue you are related to Lucene code. Could 
you please open an issue on the Lucene repository? Also, providing a more 
detailed description of the issue will be helpful in understanding and 
reproducing it.
   Thanks very much for your interest in Elasticsearch.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dnhatn closed issue #12225: BulkOperationPacked decode() bug

2023-04-05 Thread via GitHub


dnhatn closed issue #12225: BulkOperationPacked decode() bug
URL: https://github.com/apache/lucene/issues/12225


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dnhatn commented on issue #12225: BulkOperationPacked decode() bug

2023-04-05 Thread via GitHub


dnhatn commented on issue #12225:
URL: https://github.com/apache/lucene/issues/12225#issuecomment-1498525686

   @Stupiddd My apologies, I mistakenly assumed that the issue was related to 
Elasticsearch repository. Can you please provide us with more details regarding 
the issue? This will assist us in understanding and reproducing the problem. 
Thank you for your cooperation!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] zhaih commented on a diff in pull request #12208: Explain term automaton queries

2023-04-05 Thread via GitHub


zhaih commented on code in PR #12208:
URL: https://github.com/apache/lucene/pull/12208#discussion_r1159361096


##
lucene/sandbox/src/test/org/apache/lucene/sandbox/search/TestTermAutomatonQuery.java:
##
@@ -842,6 +844,95 @@ public void testRewriteSimplePhrase() throws Exception {
 IOUtils.close(w, r, dir);
   }
 
+  public void testExplainNonMatchingDocument() throws Exception {
+TermAutomatonQuery q = new TermAutomatonQuery("field");
+int initState = q.createState();
+int s1 = q.createState();
+int s2 = q.createState();
+q.addTransition(initState, s1, "xml");
+q.addTransition(s1, s2, "json");
+q.setAccept(s2, true);
+q.finish();
+
+Directory dir = newDirectory();
+RandomIndexWriter w = new RandomIndexWriter(random(), dir);
+Document doc = new Document();
+doc.add(newTextField("field", "protobuf", Field.Store.NO));
+w.addDocument(doc);
+
+IndexReader r = w.getReader();
+IndexSearcher searcher = newSearcher(r);
+Query rewrite = q.rewrite(searcher);
+assertTrue(rewrite instanceof PhraseQuery);
+
+TopDocs topDocs = searcher.search(rewrite, 10);
+assertEquals(0, topDocs.totalHits.value);
+
+for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
+  Explanation explanation = searcher.explain(rewrite, scoreDoc.doc);
+  assertNotNull("Explanation should not be null", explanation);
+  assertTrue(
+  "Explanation score should match the actual score",
+  Float.compare(scoreDoc.score, (Float) explanation.getValue()) == 0);
+}
+
+IOUtils.close(w, r, dir);
+  }
+
+  // TODO: improve experience of working with explain
+
+  public void testExplainMatchingDocuments() throws Exception {
+TermAutomatonQuery q = new TermAutomatonQuery("field");
+
+int initState = q.createState();
+int s1 = q.createState();
+int s2 = q.createState();
+int s3 = q.createState();
+q.addTransition(initState, s1, "foo");
+q.addAnyTransition(s1, s2);
+q.addTransition(s2, s3, "bar");
+q.setAccept(s3, true);
+q.finish();
+
+Directory dir = newDirectory();
+RandomIndexWriter w = new RandomIndexWriter(random(), dir);
+Document doc1 = new Document();
+doc1.add(newTextField("field", "foo x bar", Field.Store.NO));
+w.addDocument(doc1);
+
+Document doc2 = new Document();
+doc2.add(newTextField("field", "foo y bar", Field.Store.NO));
+w.addDocument(doc2);
+
+IndexReader r = w.getReader();
+IndexSearcher searcher = newSearcher(r);
+
+QueryUtils.check(random(), q, searcher);
+TopDocs topDocs = searcher.search(q, 10);
+// Verify that the docs matched fasho
+assertEquals(2, topDocs.totalHits.value);
+
+// Iterate through score docs and assert aspects of explanation
+for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
+  Explanation explanation = searcher.explain(q, scoreDoc.doc);

Review Comment:
   That's fine I think, thank you!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] zhaih commented on a diff in pull request #12208: Explain term automaton queries

2023-04-05 Thread via GitHub


zhaih commented on code in PR #12208:
URL: https://github.com/apache/lucene/pull/12208#discussion_r1159363193


##
lucene/sandbox/src/test/org/apache/lucene/sandbox/search/TestTermAutomatonQuery.java:
##
@@ -842,6 +843,92 @@ public void testRewriteSimplePhrase() throws Exception {
 IOUtils.close(w, r, dir);
   }
 
+  /* Implement a custom term automaton query to ensure that rewritten queries
+   *  do not get rewritten to primitive queries. The custom extension will 
allow
+   *  the following explain tests to evaluate Explain.
+   * */
+
+  private static class CustomTermAutomatonQuery extends TermAutomatonQuery {
+public CustomTermAutomatonQuery(String field) {
+  super(field);
+}
+
+@Override
+public Query rewrite(IndexSearcher searcher) throws IOException {
+  return this;
+}
+  }
+
+  public void testExplainNoMatchingDocument() throws Exception {
+CustomTermAutomatonQuery q = new CustomTermAutomatonQuery("field");
+int initState = q.createState();
+int s1 = q.createState();
+q.addTransition(initState, s1, "xml");
+q.setAccept(s1, true);
+q.finish();
+
+Directory dir = newDirectory();
+RandomIndexWriter w = new RandomIndexWriter(random(), dir);
+Document doc = new Document();
+doc.add(newTextField("field", "protobuf", Field.Store.NO));
+w.addDocument(doc);
+
+IndexReader r = w.getReader();
+IndexSearcher searcher = newSearcher(r);
+Query rewrite = q.rewrite(searcher);
+assertTrue(rewrite instanceof TermAutomatonQuery);
+
+TopDocs topDocs = searcher.search(rewrite, 10);
+assertEquals(0, topDocs.totalHits.value);

Review Comment:
   Let's try to call `searcher.explain` here and try to assert it explains a 
NO_MATCH? otherwise this test case is just testing search?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] zhaih commented on a diff in pull request #12208: Explain term automaton queries

2023-04-05 Thread via GitHub


zhaih commented on code in PR #12208:
URL: https://github.com/apache/lucene/pull/12208#discussion_r1159363756


##
lucene/sandbox/src/test/org/apache/lucene/sandbox/search/TestTermAutomatonQuery.java:
##
@@ -842,6 +843,91 @@ public void testRewriteSimplePhrase() throws Exception {
 IOUtils.close(w, r, dir);
   }
 
+  /* Implement a custom term automaton query to ensure that rewritten queries

Review Comment:
   I'm ok with it :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org