bziobrowski opened a new pull request, #16103:
URL: https://github.com/apache/pinot/pull/16103

   This PR adds a new Multi-Column Text Index.
   It is functionally equivalent to the existing Lucene-based Text Index, but 
stores data for all indexed columns together, in a single directory. The 
approach is especially beneficial when the number of text-indexed columns is 
large (tens or hundreds), because it consolidates many small files, reducing 
both disk space and memory usage (including resources such as file handles). 
   Apart from saving space on tokens shared across columns within Lucene, the 
new index uses a single document id mapping.
   
   Example configuration (within table config):
   ```json
   "tableIndexConfig": {
      "multiColumnTextIndexConfig": {
         "columns": ["hobbies", "skills", "titles"],
         "properties": {
            "caseSensitive": "false"
         },
         "perColumnProperties": {
            "titles": {
               "caseSensitive": "false"
            }
         }
      }
   }
   ```
   
   As shown in the example above, the index configuration allows both: 
   - setting shared index properties that apply to all columns, with 
`properties`. 
   Allowed keys: `enableQueryCacheForTextIndex`, `luceneUseCompoundFile`, 
`luceneMaxBufferSizeMB`, `reuseMutableIndex`, and every key allowed in 
`perColumnProperties`.
   - setting column-specific properties (overriding the shared ones), with 
`perColumnProperties`. 
   Allowed keys: `useANDForMultiTermTextIndexQueries`, 
`enablePrefixSuffixMatchingInPhraseQueries`, `stopWordInclude`, 
`stopWordExclude`, `caseSensitive`, `luceneAnalyzerClass`, 
`luceneAnalyzerClassArgs`, `luceneAnalyzerClassArgTypes`, 
`luceneQueryParserClass`.
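
To make the override rule concrete, here is a minimal Python sketch of how the settings for one column could be resolved from the two blocks. It is not Pinot's implementation: `effective_properties` is a hypothetical helper, the `luceneUseCompoundFile` value and the `"true"` override for `titles` are illustrative, and only the key names come from the lists above.

```python
# Per-column keys copied from the PR description; everything else is illustrative.
PER_COLUMN_KEYS = {
    "useANDForMultiTermTextIndexQueries",
    "enablePrefixSuffixMatchingInPhraseQueries",
    "stopWordInclude",
    "stopWordExclude",
    "caseSensitive",
    "luceneAnalyzerClass",
    "luceneAnalyzerClassArgs",
    "luceneAnalyzerClassArgTypes",
    "luceneQueryParserClass",
}


def effective_properties(index_config: dict, column: str) -> dict:
    """Return the per-column settings that would apply to `column` (hypothetical helper)."""
    if column not in index_config["columns"]:
        raise KeyError(f"{column!r} is not part of the multi-column text index")
    shared = index_config.get("properties", {})
    overrides = index_config.get("perColumnProperties", {}).get(column, {})
    # Inherit only the shared keys that are also meaningful per column.
    merged = {k: v for k, v in shared.items() if k in PER_COLUMN_KEYS}
    # Column-specific values take precedence over shared ones.
    merged.update(overrides)
    return merged


config = {
    "columns": ["hobbies", "skills", "titles"],
    "properties": {"caseSensitive": "false", "luceneUseCompoundFile": "true"},
    "perColumnProperties": {"titles": {"caseSensitive": "true"}},
}

print(effective_properties(config, "hobbies"))
print(effective_properties(config, "titles"))
```

In this sketch `hobbies` inherits the shared `caseSensitive` value, while `titles` uses its own; index-wide keys such as `luceneUseCompoundFile` are left out of the per-column view.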
    
   Note: this index doesn't handle `noRawDataForTextIndex` and 
`rawValueForTextIndex` properties. 
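
The key lists and the note above can be turned into a small pre-flight check. The sketch below is an assumption-laden illustration, not Pinot's validation logic: `check_config` is a hypothetical helper, and whether Pinot rejects or silently ignores the two unhandled properties is not stated in this description, so the sketch only reports them.

```python
# Key names are copied from the PR description; the checker itself is hypothetical.
SHARED_ONLY_KEYS = {
    "enableQueryCacheForTextIndex",
    "luceneUseCompoundFile",
    "luceneMaxBufferSizeMB",
    "reuseMutableIndex",
}
PER_COLUMN_KEYS = {
    "useANDForMultiTermTextIndexQueries",
    "enablePrefixSuffixMatchingInPhraseQueries",
    "stopWordInclude",
    "stopWordExclude",
    "caseSensitive",
    "luceneAnalyzerClass",
    "luceneAnalyzerClassArgs",
    "luceneAnalyzerClassArgTypes",
    "luceneQueryParserClass",
}
UNHANDLED_KEYS = {"noRawDataForTextIndex", "rawValueForTextIndex"}


def check_config(index_config: dict) -> list:
    """Return human-readable problems found in a multiColumnTextIndexConfig dict."""
    problems = []
    columns = set(index_config.get("columns", []))
    for key in index_config.get("properties", {}):
        if key in UNHANDLED_KEYS:
            problems.append(f"properties.{key}: not handled by this index")
        elif key not in SHARED_ONLY_KEYS | PER_COLUMN_KEYS:
            problems.append(f"properties.{key}: unknown key")
    for column, props in index_config.get("perColumnProperties", {}).items():
        if column not in columns:
            problems.append(f"perColumnProperties.{column}: column is not indexed")
        for key in props:
            if key in UNHANDLED_KEYS:
                problems.append(
                    f"perColumnProperties.{column}.{key}: not handled by this index"
                )
            elif key not in PER_COLUMN_KEYS:
                problems.append(
                    f"perColumnProperties.{column}.{key}: not allowed per column"
                )
    return problems


sample = {
    "columns": ["hobbies", "skills", "titles"],
    "properties": {"caseSensitive": "false", "noRawDataForTextIndex": "true"},
    "perColumnProperties": {
        "titles": {"luceneMaxBufferSizeMB": "64"},
        "address": {"caseSensitive": "true"},
    },
}

for problem in check_config(sample):
    print(problem)
```

The sample config is deliberately wrong in three ways: it sets an unhandled property, uses an index-wide key inside `perColumnProperties`, and configures a column that is not in `columns`.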
   
   Benchmarking with the `BenchmarkTextMatchQueriesSSQE` class shows 
performance close to that of the existing Lucene Text Index:
   ```
   (_query)                                                                                    Mode  Cnt  Score   Error  Units
   TEXT_MATCH(SC_STR_COL_0, 'pinot OR java')                                                   avgt    5  0.374 ± 0.065  ms/op
   TEXT_MATCH(MC_STR_COL_0, 'pinot OR java')                                                   avgt    5  0.358 ± 0.038  ms/op
   TEXT_MATCH(SC_STR_COL_0, 'pinot java') OR TEXT_MATCH(SC_STR_COL_1, 'distributed database')  avgt    5  0.724 ± 0.066  ms/op
   TEXT_MATCH(MC_STR_COL_0, 'pinot java') OR TEXT_MATCH(MC_STR_COL_1, 'distributed database')  avgt    5  0.742 ± 0.070  ms/op
   Q5                                                                                          avgt    5  0.862 ± 0.095  ms/op
   Q6                                                                                          avgt    5  0.879 ± 0.154  ms/op
   ```
   
   Where:
   - each query shown as a bare predicate is prefixed with `SELECT count(*) FROM MyTable WHERE`.
   - Q5 is
   ```sql
   SELECT count(*)
   FROM MyTable
   WHERE TEXT_MATCH(SC_STR_COL_0, 'pinot')
   OR    TEXT_MATCH(SC_STR_COL_1, 'java')
   OR    TEXT_MATCH(SC_STR_COL_2, 'database')
   OR    TEXT_MATCH(SC_STR_COL_3, 'distributed')
   OR    TEXT_MATCH(SC_STR_COL_4, 'multi-tenant')
   ```
   - Q6 is
   ```sql
   SELECT count(*)
   FROM MyTable
   WHERE TEXT_MATCH(MC_STR_COL_0, 'pinot')
   OR    TEXT_MATCH(MC_STR_COL_1, 'java')
   OR    TEXT_MATCH(MC_STR_COL_2, 'database')
   OR    TEXT_MATCH(MC_STR_COL_3, 'distributed')
   OR    TEXT_MATCH(MC_STR_COL_4, 'multi-tenant')
   ```
   
    
    
    

