[GitHub] [lucene] dantuzi commented on a diff in pull request #12169: Introduced the Word2VecSynonymFilter

via GitHub Wed, 05 Apr 2023 04:21:56 -0700


dantuzi commented on code in PR #12169:
URL: https://github.com/apache/lucene/pull/12169#discussion_r1158329797



##########
lucene/test-framework/src/java/org/apache/lucene/tests/analysis/BaseTokenStreamTestCase.java:
##########
@@ -221,6 +223,12 @@ public static void assertTokenStreamContents(
       flagsAtt = ts.getAttribute(FlagsAttribute.class);
     }
 
+    BoostAttribute boostAtt = null;

Review Comment:
   I'm going to update my PR to remove the `BoostAttribute`, as you suggested.



##########
lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/word2vec/Dl4jModelReader.java:
##########
@@ -0,0 +1,122 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.analysis.synonym.word2vec;
+
+import java.io.BufferedInputStream;
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.io.UnsupportedEncodingException;
+import java.nio.charset.StandardCharsets;
+import java.util.Base64;
+import java.util.Locale;
+import java.util.zip.ZipEntry;
+import java.util.zip.ZipInputStream;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.TermAndVector;
+
+/**
+ * Word2VecModelReader is a Word2VecModelReader that reads the file generated by the library
+ * Deeplearning4j
+ *
+ * <p>Dl4j Word2Vec documentation:
+ * https://deeplearning4j.konduit.ai/v/en-1.0.0-beta7/language-processing/word2vec Example to
+ * generate a model using dl4j:
+ * https://github.com/eclipse/deeplearning4j-examples/blob/master/dl4j-examples/src/main/java/org/deeplearning4j/examples/advanced/modelling/embeddingsfromcorpus/word2vec/Word2VecRawTextExample.java
+ *
+ * @lucene.experimental
+ */
+public class Dl4jModelReader implements Word2VecModelReader {
+
+  private static final String MODEL_FILE_NAME_PREFIX = "syn0";
+
+  private final String word2vecModelFilePath;
+  private final ZipInputStream word2VecModelZipFile;
+
+  public Dl4jModelReader(String word2vecModelFilePath, InputStream stream) {
+    this.word2vecModelFilePath = word2vecModelFilePath;
+    this.word2VecModelZipFile = new ZipInputStream(new BufferedInputStream(stream));
+  }
+
+  @Override
+  public Word2VecModel read() throws IOException {
+
+    ZipEntry entry;
+    while ((entry = word2VecModelZipFile.getNextEntry()) != null) {
+      String fileName = entry.getName();
+      if (fileName.startsWith(MODEL_FILE_NAME_PREFIX)) {
+        BufferedReader reader =
+            new BufferedReader(new InputStreamReader(word2VecModelZipFile, StandardCharsets.UTF_8));
+
+        String header = reader.readLine();
+        String[] headerValues = header.split(" ");
+        int dictionarySize = Integer.parseInt(headerValues[0]);
+        int vectorDimension = Integer.parseInt(headerValues[1]);
+
+        Word2VecModel model = new Word2VecModel(dictionarySize, vectorDimension);
+        reader
+            .lines()
+            .forEach(
+                line -> {
+                  String[] tokens = line.split(" ");
+                  BytesRef term = decodeTerm(tokens[0]);
+
+                  float[] vector = new float[tokens.length - 1];
+
+                  if (vectorDimension != vector.length) {
+                    throw new RuntimeException(
+                        String.format(
+                            Locale.ROOT,
+                            "Word2Vec model file corrupted. "
+                                + "Declared vectors of size %d but found vector of size %d for word %s (%s)",
+                            vectorDimension,
+                            vector.length,
+                            tokens[0],
+                            term.utf8ToString()));
+                  }
+
+                  for (int i = 1; i < tokens.length; i++) {
+                    vector[i - 1] = Float.parseFloat(tokens[i]);
+                  }
+                  model.addTermAndVector(new TermAndVector(term, vector));
+                });
+        return model;
+      }
+    }
+    throw new UnsupportedEncodingException(

Review Comment:
   When we train a model with the DL4J library and export it, we obtain a compressed zip file.
   This zip contains multiple files, but we are only interested in the `syn0` file. The exception is thrown if the given zip does not contain any `syn0` file.
   I guess `IllegalArgumentException` would be a better fit.
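
A minimal sketch of the behaviour described above, assuming the reader only needs to locate the first entry whose name starts with `syn0` and should report a missing entry with `IllegalArgumentException`. The class name `Syn0EntryCheck` and the helpers `findModelEntry` and `zipOf` are hypothetical and are not part of the PR; only `java.util.zip` from the JDK is used.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration, not the PR's Dl4jModelReader.
final class Syn0EntryCheck {

  static final String MODEL_FILE_NAME_PREFIX = "syn0";

  /** Returns the name of the first entry starting with "syn0", or throws if there is none. */
  static String findModelEntry(String zipPath, InputStream stream) {
    try (ZipInputStream zip = new ZipInputStream(new BufferedInputStream(stream))) {
      ZipEntry entry;
      while ((entry = zip.getNextEntry()) != null) {
        if (entry.getName().startsWith(MODEL_FILE_NAME_PREFIX)) {
          return entry.getName();
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    // The caller passed an archive that is not a usable model: an argument problem,
    // not an encoding problem.
    throw new IllegalArgumentException(
        "The ZIP file '" + zipPath + "' does not contain any " + MODEL_FILE_NAME_PREFIX + " file");
  }

  /** Builds a small in-memory zip with the given entry names (demo helper only). */
  static byte[] zipOf(String... entryNames) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
      for (String name : entryNames) {
        zip.putNextEntry(new ZipEntry(name));
        zip.write("1 2\n".getBytes(StandardCharsets.UTF_8));
        zip.closeEntry();
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  public static void main(String[] args) {
    byte[] good = zipOf("config.json", "syn0.txt");
    System.out.println(findModelEntry("good.zip", new ByteArrayInputStream(good)));

    byte[] bad = zipOf("config.json");
    try {
      findModelEntry("bad.zip", new ByteArrayInputStream(bad));
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

Keeping the file path in the message, as the PR does, lets the error name the offending archive even though the reader only ever sees an anonymous stream.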



##########
lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/word2vec/Word2VecSynonymFilter.java:
##########
@@ -0,0 +1,111 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.analysis.synonym.word2vec;
+
+import java.io.IOException;
+import java.util.LinkedList;
+import java.util.List;
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
+import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;
+import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
+import org.apache.lucene.search.BoostAttribute;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefBuilder;
+import org.apache.lucene.util.TermAndBoost;
+
+/**
+ * Applies single-token synonyms from a Word2Vec trained network to an incoming {@link TokenStream}.
+ *
+ * @lucene.experimental
+ */
+public final class Word2VecSynonymFilter extends TokenFilter {
+
+  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
+  private final PositionIncrementAttribute posIncrementAtt =
+      addAttribute(PositionIncrementAttribute.class);
+  private final PositionLengthAttribute posLenAtt = addAttribute(PositionLengthAttribute.class);
+  private final BoostAttribute boostAtt = addAttribute(BoostAttribute.class);
+  private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);
+
+  private final SynonymProvider synonymProvider;
+  private final int maxSynonymsPerTerm;
+  private final float minAcceptedSimilarity;
+  private final LinkedList<TermAndBoost> synonymBuffer = new LinkedList<>();
+  private State lastState;
+
+  /**
+   * Apply previously built synonymProvider to incoming tokens.
+   *
+   * @param input input tokenstream
+   * @param synonymProvider synonym provider
+   * @param maxSynonymsPerTerm maximum number of result returned by the synonym search
+   * @param minAcceptedSimilarity minimal value of cosine similarity between the searched vector and
+   *     the retrieved ones
+   */
+  public Word2VecSynonymFilter(
+      TokenStream input,
+      SynonymProvider synonymProvider,
+      int maxSynonymsPerTerm,
+      float minAcceptedSimilarity) {
+    super(input);
+    this.synonymProvider = synonymProvider;
+    this.maxSynonymsPerTerm = maxSynonymsPerTerm;
+    this.minAcceptedSimilarity = minAcceptedSimilarity;
+  }
+
+  @Override
+  public boolean incrementToken() throws IOException {
+
+    if (!synonymBuffer.isEmpty()) {
+      TermAndBoost synonym = synonymBuffer.pollFirst();
+      clearAttributes();
+      restoreState(this.lastState);
+      termAtt.setEmpty();

Review Comment:
   I tried your suggestion; it's much cleaner, but it doesn't work.
   When I ran the unit tests, I found that the `BytesTermAttribute` contains a null `BytesRef`.



##########
lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/word2vec/Word2VecModel.java:
##########
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.analysis.synonym.word2vec;
+
+import java.io.IOException;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.TermAndVector;
+import org.apache.lucene.util.hnsw.RandomAccessVectorValues;
+
+/**
+ * Word2VecModel is a class representing the parsed Word2Vec model containing the vectors for each
+ * word in dictionary
+ *
+ * @lucene.experimental
+ */
+public class Word2VecModel implements RandomAccessVectorValues<float[]> {
+
+  private final int dictionarySize;
+  private final int vectorDimension;
+  private final TermAndVector[] data;
+  private final Map<BytesRef, TermAndVector> word2Vec;
+  private int loadedCount = 0;
+
+  public Word2VecModel(int dictionarySize, int vectorDimension) {
+    this.dictionarySize = dictionarySize;
+    this.vectorDimension = vectorDimension;
+    this.data = new TermAndVector[dictionarySize];
+    this.word2Vec = new HashMap<>();
+  }
+
+  private Word2VecModel(
+      int dictionarySize,
+      int vectorDimension,
+      TermAndVector[] data,
+      Map<BytesRef, TermAndVector> word2Vec) {
+    this.dictionarySize = dictionarySize;
+    this.vectorDimension = vectorDimension;
+    this.data = data;
+    this.word2Vec = word2Vec;
+  }
+
+  public void addTermAndVector(TermAndVector modelEntry) {
+    modelEntry.normalizeVector();
+    this.data[loadedCount++] = modelEntry;
+    this.word2Vec.put(modelEntry.getTerm(), modelEntry);
+  }
+
+  @Override
+  public float[] vectorValue(int ord) throws IOException {
+    return data[ord].getVector();
+  }
+
+  public float[] vectorValue(BytesRef term) {
+    TermAndVector entry = word2Vec.get(term);
+    return (entry == null) ? null : entry.getVector();
+  }
+
+  public BytesRef binaryValue(int targetOrd) throws IOException {
+    return data[targetOrd].getTerm();
+  }
+
+  @Override
+  public int dimension() {
+    return vectorDimension;
+  }
+
+  @Override
+  public int size() {
+    return dictionarySize;
+  }
+
+  @Override
+  public RandomAccessVectorValues<float[]> copy() throws IOException {

Review Comment:
   @msokolov I tried to implement your suggestion, but it looks like the method `HnswGraphBuilder::build` rejects the same reference that was passed to `HnswGraphBuilder.create` [1].
   To be honest, I still don't understand why this check [2] is required.
   
   [1]
   ```
   Vectors to build must be independent of the source of vectors provided to HnswGraphBuilder()
   java.lang.IllegalArgumentException: Vectors to build must be independent of the source of vectors provided to HnswGraphBuilder()
        at __randomizedtesting.SeedInfo.seed([994075DD4398F0A4:E100BB05917EA0E6]:0)
        at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.util.hnsw.HnswGraphBuilder.build(HnswGraphBuilder.java:165)
        at org.apache.lucene.analysis.synonym.word2vec.Word2VecSynonymProvider.<init>(Word2VecSynonymProvider.java:64)
        at org.apache.lucene.analysis.synonym.word2vec.TestWord2VecSynonymProvider.<init>(TestWord2VecSynonymProvider.java:39)
   ```
   
   [2] https://github.com/apache/lucene/blob/776149f0f6964bbc72ad2d292d1bfe770f82ba45/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java#L155-L158
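
One plausible motivation for such an independence check, offered here as an assumption rather than something confirmed in this thread, is that a vector source may hand out the same scratch array on every `vectorValue` call. The sketch below uses a hypothetical `ReusingVectors` class (plain Java, not the Lucene interface) to show how comparing two vectors read through one instance silently compares a vector with itself, while a `copy()` with its own buffer behaves correctly.

```java
// Hypothetical illustration of the shared-buffer hazard; not Lucene code.
final class SharedBufferSketch {

  /** Vector source that reuses one scratch array for every read. */
  static final class ReusingVectors {
    private final float[][] stored;
    private final float[] scratch;

    ReusingVectors(float[][] stored) {
      this.stored = stored;
      this.scratch = new float[stored[0].length];
    }

    /** Copies the requested vector into the shared scratch array and returns that array. */
    float[] vectorValue(int ord) {
      System.arraycopy(stored[ord], 0, scratch, 0, scratch.length);
      return scratch;
    }

    /** Shares the stored data but owns a separate scratch array. */
    ReusingVectors copy() {
      return new ReusingVectors(stored);
    }
  }

  static float dot(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /** Reads one vector from each source, then scores them against each other. */
  static float score(ReusingVectors left, int leftOrd, ReusingVectors right, int rightOrd) {
    float[] a = left.vectorValue(leftOrd);
    float[] b = right.vectorValue(rightOrd); // overwrites `a` if right == left
    return dot(a, b);
  }

  public static void main(String[] args) {
    ReusingVectors vectors = new ReusingVectors(new float[][] {{1f, 0f}, {0f, 1f}});
    // Same instance on both sides: both reads alias the scratch array.
    System.out.println(score(vectors, 0, vectors, 1));
    // Independent copy: each side keeps its own buffer.
    System.out.println(score(vectors, 0, vectors.copy(), 1));
  }
}
```

A model that keeps every vector in its own in-memory array does not have this aliasing problem, which may be why returning the same reference looked tempting here.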



##########
lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/word2vec/Dl4jModelReader.java:
##########
@@ -0,0 +1,122 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.analysis.synonym.word2vec;
+
+import java.io.BufferedInputStream;
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.io.UnsupportedEncodingException;
+import java.nio.charset.StandardCharsets;
+import java.util.Base64;
+import java.util.Locale;
+import java.util.zip.ZipEntry;
+import java.util.zip.ZipInputStream;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.TermAndVector;
+
+/**
+ * Word2VecModelReader is a Word2VecModelReader that reads the file generated by the library
+ * Deeplearning4j
+ *
+ * <p>Dl4j Word2Vec documentation:
+ * https://deeplearning4j.konduit.ai/v/en-1.0.0-beta7/language-processing/word2vec Example to
+ * generate a model using dl4j:
+ * https://github.com/eclipse/deeplearning4j-examples/blob/master/dl4j-examples/src/main/java/org/deeplearning4j/examples/advanced/modelling/embeddingsfromcorpus/word2vec/Word2VecRawTextExample.java
+ *
+ * @lucene.experimental
+ */
+public class Dl4jModelReader implements Word2VecModelReader {
+
+  private static final String MODEL_FILE_NAME_PREFIX = "syn0";
+
+  private final String word2vecModelFilePath;

Review Comment:
   Everything comes from the `Word2VecSynonymFilterFactory`, which implements `ResourceLoaderAware`. This interface gives us an `org.apache.lucene.util.ResourceLoader`, through which we can obtain an anonymous `InputStream`.
   I decided to also pass the model file path to enrich the exception message and make the user's life easier.
   BTW, I don't have a strong opinion about this; I can easily remove that string.



##########
lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/word2vec/Dl4jModelReader.java:
##########
@@ -0,0 +1,122 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.analysis.synonym.word2vec;
+
+import java.io.BufferedInputStream;
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.InputStreamReader;
+import java.io.UnsupportedEncodingException;
+import java.nio.charset.StandardCharsets;
+import java.util.Base64;
+import java.util.Locale;
+import java.util.zip.ZipEntry;
+import java.util.zip.ZipInputStream;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.TermAndVector;
+
+/**
+ * Word2VecModelReader is a Word2VecModelReader that reads the file generated by the library
+ * Deeplearning4j
+ *
+ * <p>Dl4j Word2Vec documentation:
+ * https://deeplearning4j.konduit.ai/v/en-1.0.0-beta7/language-processing/word2vec Example to
+ * generate a model using dl4j:
+ * https://github.com/eclipse/deeplearning4j-examples/blob/master/dl4j-examples/src/main/java/org/deeplearning4j/examples/advanced/modelling/embeddingsfromcorpus/word2vec/Word2VecRawTextExample.java
+ *
+ * @lucene.experimental
+ */
+public class Dl4jModelReader implements Word2VecModelReader {
+
+  private static final String MODEL_FILE_NAME_PREFIX = "syn0";
+
+  private final String word2vecModelFilePath;
+  private final ZipInputStream word2VecModelZipFile;
+
+  public Dl4jModelReader(String word2vecModelFilePath, InputStream stream) {
+    this.word2vecModelFilePath = word2vecModelFilePath;
+    this.word2VecModelZipFile = new ZipInputStream(new BufferedInputStream(stream));
+  }
+
+  @Override
+  public Word2VecModel read() throws IOException {
+
+    ZipEntry entry;
+    while ((entry = word2VecModelZipFile.getNextEntry()) != null) {
+      String fileName = entry.getName();
+      if (fileName.startsWith(MODEL_FILE_NAME_PREFIX)) {
+        BufferedReader reader =
+            new BufferedReader(new InputStreamReader(word2VecModelZipFile, StandardCharsets.UTF_8));
+
+        String header = reader.readLine();
+        String[] headerValues = header.split(" ");
+        int dictionarySize = Integer.parseInt(headerValues[0]);
+        int vectorDimension = Integer.parseInt(headerValues[1]);
+
+        Word2VecModel model = new Word2VecModel(dictionarySize, vectorDimension);
+        reader
+            .lines()
+            .forEach(
+                line -> {
+                  String[] tokens = line.split(" ");
+                  BytesRef term = decodeTerm(tokens[0]);
+
+                  float[] vector = new float[tokens.length - 1];
+
+                  if (vectorDimension != vector.length) {
+                    throw new RuntimeException(
+                        String.format(
+                            Locale.ROOT,
+                            "Word2Vec model file corrupted. "
+                                + "Declared vectors of size %d but found vector of size %d for word %s (%s)",
+                            vectorDimension,
+                            vector.length,
+                            tokens[0],
+                            term.utf8ToString()));
+                  }
+
+                  for (int i = 1; i < tokens.length; i++) {
+                    vector[i - 1] = Float.parseFloat(tokens[i]);
+                  }
+                  model.addTermAndVector(new TermAndVector(term, vector));
+                });
+        return model;
+      }
+    }
+    throw new UnsupportedEncodingException(
+        "The ZIP file '"
+            + word2vecModelFilePath
+            + "' does not contain any "
+            + MODEL_FILE_NAME_PREFIX
+            + " file");
+  }
+
+  static BytesRef decodeTerm(String term) {
+    if (term.toLowerCase(Locale.ROOT).startsWith("b64:")) {

Review Comment:
   I like your suggestion to read the first term and assume the remaining terms are encoded in the same way.
   I did some checks and the `trim()` was indeed unnecessary. Thank you for noticing it.
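
The suggestion above could look roughly like the following sketch: the `b64:` prefix is checked once, on the first term, and every remaining term is then decoded the same way. The class `TermDecodingSketch` and the method `decodeTerms` are hypothetical names, terms are returned as `String` instead of `BytesRef` to keep the example JDK-only, and malformed input (for example a later term without the prefix) is deliberately not handled.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Base64;
import java.util.List;

// Hypothetical illustration, not the PR's decodeTerm implementation.
final class TermDecodingSketch {

  static final String B64_PREFIX = "b64:";

  /** Decodes all terms, deciding from the first term alone whether they are Base64-encoded. */
  static List<String> decodeTerms(List<String> rawTerms) {
    List<String> decoded = new ArrayList<>(rawTerms.size());
    if (rawTerms.isEmpty()) {
      return decoded;
    }
    // Case-insensitive prefix check, performed once instead of once per term.
    boolean base64 =
        rawTerms.get(0).regionMatches(true, 0, B64_PREFIX, 0, B64_PREFIX.length());
    for (String raw : rawTerms) {
      if (base64) {
        // Assumes every term carries the prefix when the first one does.
        byte[] bytes = Base64.getDecoder().decode(raw.substring(B64_PREFIX.length()));
        decoded.add(new String(bytes, StandardCharsets.UTF_8));
      } else {
        decoded.add(raw);
      }
    }
    return decoded;
  }

  public static void main(String[] args) {
    System.out.println(decodeTerms(Arrays.asList("B64:aGVsbG8=", "B64:d29ybGQ=")));
    System.out.println(decodeTerms(Arrays.asList("hello", "world")));
  }
}
```

Compared with lower-casing and testing every term, this does the prefix detection a single time per model file, at the cost of trusting the file to be consistently encoded.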



##########
lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/word2vec/Word2VecModel.java:
##########
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.analysis.synonym.word2vec;
+
+import java.io.IOException;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.TermAndVector;
+import org.apache.lucene.util.hnsw.RandomAccessVectorValues;
+
+/**
+ * Word2VecModel is a class representing the parsed Word2Vec model containing the vectors for each
+ * word in dictionary
+ *
+ * @lucene.experimental
+ */
+public class Word2VecModel implements RandomAccessVectorValues<float[]> {
+
+  private final int dictionarySize;
+  private final int vectorDimension;
+  private final TermAndVector[] data;
+  private final Map<BytesRef, TermAndVector> word2Vec;
+  private int loadedCount = 0;
+
+  public Word2VecModel(int dictionarySize, int vectorDimension) {
+    this.dictionarySize = dictionarySize;
+    this.vectorDimension = vectorDimension;
+    this.data = new TermAndVector[dictionarySize];
+    this.word2Vec = new HashMap<>();
+  }
+
+  private Word2VecModel(
+      int dictionarySize,
+      int vectorDimension,
+      TermAndVector[] data,
+      Map<BytesRef, TermAndVector> word2Vec) {
+    this.dictionarySize = dictionarySize;
+    this.vectorDimension = vectorDimension;
+    this.data = data;
+    this.word2Vec = word2Vec;
+  }
+
+  public void addTermAndVector(TermAndVector modelEntry) {
+    modelEntry.normalizeVector();
+    this.data[loadedCount++] = modelEntry;
+    this.word2Vec.put(modelEntry.getTerm(), modelEntry);
+  }
+
+  @Override
+  public float[] vectorValue(int ord) throws IOException {
+    return data[ord].getVector();
+  }
+
+  public float[] vectorValue(BytesRef term) {
+    TermAndVector entry = word2Vec.get(term);
+    return (entry == null) ? null : entry.getVector();
+  }
+
+  public BytesRef binaryValue(int targetOrd) throws IOException {
+    return data[targetOrd].getTerm();

Review Comment:
   As you can see, this method is not annotated with `@Override`, so it is not an implementation of the `RandomAccessVectorValues` interface. This means nobody should use this method in the HNSW graph. I'm going to rename it to avoid confusion.



##########
lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/word2vec/Word2VecModel.java:
##########
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.analysis.synonym.word2vec;
+
+import java.io.IOException;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.TermAndVector;
+import org.apache.lucene.util.hnsw.RandomAccessVectorValues;
+
+/**
+ * Word2VecModel is a class representing the parsed Word2Vec model containing the vectors for each
+ * word in dictionary
+ *
+ * @lucene.experimental
+ */
+public class Word2VecModel implements RandomAccessVectorValues<float[]> {
+
+  private final int dictionarySize;
+  private final int vectorDimension;
+  private final TermAndVector[] data;
+  private final Map<BytesRef, TermAndVector> word2Vec;

Review Comment:
   I had never seen `BytesRefHash` before. Thank you for your suggestion.



##########
lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/word2vec/Word2VecSynonymFilter.java:
##########
@@ -0,0 +1,111 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.analysis.synonym.word2vec;
+
+import java.io.IOException;
+import java.util.LinkedList;
+import java.util.List;
+import org.apache.lucene.analysis.TokenFilter;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
+import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
+import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
+import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;
+import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
+import org.apache.lucene.search.BoostAttribute;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefBuilder;
+import org.apache.lucene.util.TermAndBoost;
+
+/**
+ * Applies single-token synonyms from a Word2Vec trained network to an incoming {@link TokenStream}.

Review Comment:
   As you suggested, I'm replacing the general SynonymFilter interface with the corresponding "concrete" class.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
