bziobrowski opened a new pull request, #16103: URL: https://github.com/apache/pinot/pull/16103
This PR adds a new Multi-Column Text Index. It is functionally equivalent to the existing Lucene-based Text Index, but stores the data for all indexed columns together in a single directory. The approach is especially beneficial when the number of text-indexed columns is large (tens or hundreds), because it consolidates many small files, reducing both disk space and memory usage (including, e.g., file handles). Apart from saving space on tokens shared across columns within Lucene, the new index uses a single document id mapping.

Example configuration (within table config):

```json
"tableIndexConfig": {
  "multiColumnTextIndexConfig": {
    "columns": ["hobbies", "skills", "titles"],
    "properties": {
      "caseSensitive": "false"
    },
    "perColumnProperties": {
      "titles": {
        "caseSensitive": "false"
      }
    }
  },
```

As shown in the example above, the index configuration allows both:
- setting shared index properties that apply to all columns with `properties`. Allowed keys: `enableQueryCacheForTextIndex`, `luceneUseCompoundFile`, `luceneMaxBufferSizeMB`, `reuseMutableIndex`, and all keys allowed in `perColumnProperties`.
- setting column-specific properties (overriding the shared ones) with `perColumnProperties`. Allowed keys: `useANDForMultiTermTextIndexQueries`, `enablePrefixSuffixMatchingInPhraseQueries`, `stopWordInclude`, `stopWordExclude`, `caseSensitive`, `luceneAnalyzerClass`, `luceneAnalyzerClassArgs`, `luceneAnalyzerClassArgTypes`, `luceneQueryParserClass`.

Note: this index doesn't handle the `noRawDataForTextIndex` and `rawValueForTextIndex` properties.
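The override rule described above (shared `properties` apply to every column, `perColumnProperties` win for a given column) can be illustrated with a minimal Python sketch. This is not Pinot code: the function `effective_properties` and the key-set constants are hypothetical names introduced here, the validation behaviour is an assumption, and the `"true"` override value is chosen only to make the effect visible.

```python
# Hypothetical illustration of the shared vs. per-column override rule.
# Not Pinot's implementation; key lists are copied from the PR description.
SHARED_ONLY_KEYS = {
    "enableQueryCacheForTextIndex",
    "luceneUseCompoundFile",
    "luceneMaxBufferSizeMB",
    "reuseMutableIndex",
}
PER_COLUMN_KEYS = {
    "useANDForMultiTermTextIndexQueries",
    "enablePrefixSuffixMatchingInPhraseQueries",
    "stopWordInclude",
    "stopWordExclude",
    "caseSensitive",
    "luceneAnalyzerClass",
    "luceneAnalyzerClassArgs",
    "luceneAnalyzerClassArgTypes",
    "luceneQueryParserClass",
}


def effective_properties(config: dict, column: str) -> dict:
    """Return the properties that would apply to `column` under the override rule."""
    if column not in config["columns"]:
        raise KeyError(f"column {column!r} is not part of the multi-column index")

    shared = config.get("properties", {})
    per_column = config.get("perColumnProperties", {}).get(column, {})

    # Shared section accepts its own keys plus every per-column key.
    for key in shared:
        if key not in SHARED_ONLY_KEYS | PER_COLUMN_KEYS:
            raise ValueError(f"unsupported shared property: {key}")
    # Per-column section accepts only the per-column keys.
    for key in per_column:
        if key not in PER_COLUMN_KEYS:
            raise ValueError(f"unsupported per-column property: {key}")

    # Later entries win, so per-column values override shared ones.
    return {**shared, **per_column}


example_config = {
    "columns": ["hobbies", "skills", "titles"],
    "properties": {"caseSensitive": "false"},
    # Assumed value "true" (the PR example uses "false") to show an actual override.
    "perColumnProperties": {"titles": {"caseSensitive": "true"}},
}

print(effective_properties(example_config, "hobbies"))  # shared value applies
print(effective_properties(example_config, "titles"))   # per-column value applies
```

In this sketch, `hobbies` ends up with the shared `caseSensitive` value while `titles` gets its own; a key such as `noRawDataForTextIndex` would be rejected in either section.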
Benchmarking with the `BenchmarkTextMatchQueriesSSQE` class shows performance close to that of the existing Lucene Text Index:

```
(_query)                                                                                    Mode  Cnt  Score   Error  Units
TEXT_MATCH(SC_STR_COL_0, 'pinot OR java')                                                   avgt    5  0.374 ± 0.065  ms/op
TEXT_MATCH(MC_STR_COL_0, 'pinot OR java')                                                   avgt    5  0.358 ± 0.038  ms/op
TEXT_MATCH(SC_STR_COL_0, 'pinot java') OR TEXT_MATCH(SC_STR_COL_1, 'distributed database')  avgt    5  0.724 ± 0.066  ms/op
TEXT_MATCH(MC_STR_COL_0, 'pinot java') OR TEXT_MATCH(MC_STR_COL_1, 'distributed database')  avgt    5  0.742 ± 0.070  ms/op
Q5                                                                                          avgt    5  0.862 ± 0.095  ms/op
Q6                                                                                          avgt    5  0.879 ± 0.154  ms/op
```

Where:
- each query starts with `SELECT count(*) from MyTable WHERE`.
- Q5 is
```sql
SELECT count(*) FROM MyTable
WHERE TEXT_MATCH(SC_STR_COL_0, 'pinot')
   OR TEXT_MATCH(SC_STR_COL_1, 'java')
   OR TEXT_MATCH(SC_STR_COL_2, 'database')
   OR TEXT_MATCH(SC_STR_COL_3, 'distributed')
   OR TEXT_MATCH(SC_STR_COL_4, 'multi-tenant')
```
- Q6 is
```sql
SELECT count(*) FROM MyTable
WHERE TEXT_MATCH(MC_STR_COL_0, 'pinot')
   OR TEXT_MATCH(MC_STR_COL_1, 'java')
   OR TEXT_MATCH(MC_STR_COL_2, 'database')
   OR TEXT_MATCH(MC_STR_COL_3, 'distributed')
   OR TEXT_MATCH(MC_STR_COL_4, 'multi-tenant')
```
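To put "performance close to" into numbers, the relative difference between each pair of scores from the benchmark table can be computed as in the rough sketch below. It assumes the `SC_` columns use the existing single-column index and the `MC_` columns use the new multi-column index, which the column names suggest but the description does not state; the helper `relative_change_pct` is a name introduced here for illustration.

```python
# Mean scores in ms/op copied from the benchmark table:
# (assumed single-column index, assumed multi-column index)
score_pairs = {
    "'pinot OR java' on one column": (0.374, 0.358),
    "two TEXT_MATCH clauses OR-ed": (0.724, 0.742),
    "Q5 vs Q6 (five clauses OR-ed)": (0.862, 0.879),
}


def relative_change_pct(baseline: float, candidate: float) -> float:
    """Percentage change of candidate relative to baseline (negative means faster)."""
    return (candidate - baseline) / baseline * 100.0


for label, (single_col, multi_col) in score_pairs.items():
    print(f"{label}: {relative_change_pct(single_col, multi_col):+.1f}%")
```

The differences come out at roughly -4.3%, +2.5% and +2.0%, each smaller than the error margins reported for the corresponding measurements.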