[GitHub] [incubator-pinot] mcvsubbu commented on a change in pull request #5667: Support text index on dictionary encoded columns

GitBox Mon, 13 Jul 2020 10:55:12 -0700


mcvsubbu commented on a change in pull request #5667:
URL: https://github.com/apache/incubator-pinot/pull/5667#discussion_r453821384




##########
File path: 
pinot-core/src/main/java/org/apache/pinot/core/segment/index/loader/defaultcolumn/BaseDefaultColumnHandler.java
##########
@@ -325,77 +323,92 @@ protected void removeColumnV1Indices(String column)
   }
 
   /**
-   * Right now the text index is supported on RAW (non-dictionary encoded)
+   * Right now the text index is supported on RAW and dictionary encoded

Review comment:
       Are there any more constraints to relax? If not, we can modify this 
comment?

##########
File path: 
pinot-core/src/main/java/org/apache/pinot/core/segment/index/loader/invertedindex/TextIndexHandler.java
##########
@@ -111,35 +117,22 @@ public void createTextIndexesOnSegmentLoad()
   }
 
   /**
-   * Right now the text index is supported on RAW (non-dictionary encoded)
+   * Right now the text index is supported on RAW and dictionary encoded

Review comment:
       same here. capture the constraints correctly in comments

##########
File path: 
pinot-core/src/main/java/org/apache/pinot/core/segment/index/loader/invertedindex/TextIndexHandler.java
##########
@@ -151,30 +144,68 @@ private void createTextIndexForColumn(ColumnMetadata 
columnMetadata)
       return;
     }
     int numDocs = columnMetadata.getTotalDocs();
-    LOGGER.info("Creating new text index for column: {} in segment: {}", 
column, _segmentName);
+    boolean hasDictionary = columnMetadata.hasDictionary();
+    LOGGER.info("Creating new text index for column: {} in segment: {}, 
hasDictionary: {}", column, _segmentName, hasDictionary);
     File segmentDirectory = 
SegmentDirectoryPaths.segmentDirectoryFor(_indexDir, _segmentVersion);
     // The handlers are always invoked by the preprocessor. Before this 
ImmutableSegmentLoader would have already
     // up-converted the segment from v1/v2 -> v3 (if needed). So based on the 
segmentVersion, whatever segment
     // segmentDirectory is indicated to us by SegmentDirectoryPaths, we create 
lucene index there. There is no
     // further need to move around the lucene index directory since it is 
created with correct directory structure
     // based on segmentVersion.
-    try (LuceneTextIndexCreator textIndexCreator = new 
LuceneTextIndexCreator(column, segmentDirectory, true);
-        VarByteChunkSVForwardIndexReader forwardIndexReader = 
getForwardIndexReader(columnMetadata);
-        ChunkReaderContext readerContext = forwardIndexReader.createContext()) 
{
-      for (int docId = 0; docId < numDocs; docId++) {
-        textIndexCreator.addDoc(forwardIndexReader.getString(docId, 
readerContext), docId);
+    try (ForwardIndexReader forwardIndexReader = 
getForwardIndexReader(columnMetadata);
+        ForwardIndexReaderContext readerContext = 
forwardIndexReader.createContext();
+        LuceneTextIndexCreator textIndexCreator = new 
LuceneTextIndexCreator(column, segmentDirectory, true)) {
+      if (!hasDictionary) {
+        // text index on raw column, just read the raw forward index
+        VarByteChunkSVForwardIndexReader rawIndexReader = 
(VarByteChunkSVForwardIndexReader)forwardIndexReader;
+        ChunkReaderContext chunkReaderContext = 
(ChunkReaderContext)readerContext;
+        for (int docId = 0; docId < numDocs; docId++) {
+          textIndexCreator.addDoc(rawIndexReader.getString(docId, 
chunkReaderContext), docId);
+        }
+      } else {
+        // text index on dictionary encoded SV column
+        // read forward index to get dictId
+        // read the raw value from dictionary using dictId
+        try (BaseImmutableDictionary dictionary = 
getDictionaryReader(columnMetadata)) {
+          if (columnMetadata.isSingleValue()) {
+            for (int docId = 0; docId < numDocs; docId++) {
+              int dictId = forwardIndexReader.getDictId(docId, readerContext);
+              String value = dictionary.getStringValue(dictId);
+              textIndexCreator.addDoc(value, docId);
+            }
+          }
+        }
       }
       textIndexCreator.seal();
     }
+
     LOGGER.info("Created text index for column: {} in segment: {}", column, 
_segmentName);
     PropertiesConfiguration properties = 
SegmentMetadataImpl.getPropertiesConfiguration(_indexDir);
     properties.setProperty(getKeyFor(column, TEXT_INDEX_TYPE), 
TextIndexType.LUCENE.name());
     properties.save();
   }
 
-  private VarByteChunkSVForwardIndexReader 
getForwardIndexReader(ColumnMetadata columnMetadata)
+  private ForwardIndexReader<?> getForwardIndexReader(ColumnMetadata 
columnMetadata)
+      throws IOException {
+    if (!columnMetadata.hasDictionary()) {
+      // raw index
+      PinotDataBuffer buffer = 
_segmentWriter.getIndexFor(columnMetadata.getColumnName(), 
ColumnIndexType.FORWARD_INDEX);
+      return new VarByteChunkSVForwardIndexReader(buffer, DataType.STRING);
+    } else {
+      PinotDataBuffer buffer = 
_segmentWriter.getIndexFor(columnMetadata.getColumnName(), 
ColumnIndexType.FORWARD_INDEX);

Review comment:
       Should it be `ColumnIndexType.DICTIONARY`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org
For additional commands, e-mail: commits-h...@pinot.apache.org

[GitHub] [incubator-pinot] mcvsubbu commented on a change in pull request #5667: Support text index on dictionary encoded columns

Reply via email to