RS146BIJAY opened a new issue, #13387:
URL: https://github.com/apache/lucene/issues/13387

   ### Description
   
   ## Issue
   
   Today, Lucene internally creates multiple DocumentsWriterPerThread (DWPT) 
instances per shard to facilitate concurrent indexing across different 
ingestion threads. Documents indexed by the same DWPT are grouped into the 
same segment after a flush. Because DWPT assignment is driven only by 
concurrency, it is not possible to predict or control how documents are 
distributed across segments. For instance, when indexing time series logs, 
it's possible for a single DWPT to index logs with both 5xx and 2xx status 
codes, leading to segments that contain a heterogeneous mix of documents.
   
   Typically, in scenarios like log analytics, users are more interested in a 
certain subset of the data, such as error (4xx) and/or fault (5xx) request 
logs. Because DWPTs are assigned without regard to document content, these 
relevant documents can end up dispersed across multiple segments. Furthermore, 
if these documents are sparse, they will be thinly spread out even within a 
segment, so search queries must iterate over many less relevant documents. 
While the [optimisation to use BKD tree to skip non competitive 
documents](https://github.com/apache/lucene-solr/pull/1351) by the collectors 
significantly improves query performance, the actual number of documents 
iterated still depends on how the data is arranged in the segment and how the 
underlying BKD tree gets constructed.
   
   Storing relevant log documents separately from relatively less relevant 
ones, such as 2xx logs, can prevent their scattering across multiple segments. 
This model can markedly improve query performance, because searches touch 
fewer segments and skip documents that are less relevant. Moreover, 
clustering related data allows for the [pre-computation of 
aggregations](https://github.com/opensearch-project/OpenSearch/issues/12498) 
for frequently executed queries (e.g., count, minimum, maximum) and storing 
them as separate metadata. The corresponding queries can then be served from 
the metadata itself, reducing both latency and compute cost.
   
   ## Proposal
   
   In this proposal, we suggest adding support for a DWPT selection mechanism 
based on specific criteria within the DocumentsWriter. Users can define these 
criteria through a grouping function supplied as a new 
[IndexWriterConfig](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/IndexWriterConfig.java)
 configuration. The grouping criteria can be based on the anticipated query 
pattern of the workload, so that frequently queried data is stored together. 
During indexing, this function would be evaluated for each document, ensuring 
that documents with differing outcomes are indexed using separate DWPTs. For 
instance, in the context of HTTP request logs, the grouping function could be 
tailored to assign DWPTs according to the status code in the log entry.
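
To make the idea concrete, here is a minimal sketch of what such a grouping function could look like for HTTP request logs. Everything in it is hypothetical: the class name, the `Map`-based document representation, and the `"status"` field are assumptions for illustration, not existing Lucene APIs, and the way the function would be registered on IndexWriterConfig is not shown because that setter does not exist yet.

```java
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of a grouping function; not a Lucene API. */
final class StatusCodeGrouping {

  /**
   * Maps a log document (field name -> value) to a group key such as "2xx" or "5xx".
   * Documents without a three-digit "status" field fall into the "unknown" group.
   */
  static final Function<Map<String, String>, String> BY_STATUS_CLASS = doc -> {
    String status = doc.get("status");
    if (status == null
        || status.length() != 3
        || !status.chars().allMatch(Character::isDigit)) {
      return "unknown";
    }
    // Keep only the leading digit, so 500, 502 and 503 share one group.
    return status.charAt(0) + "xx";
  };

  public static void main(String[] args) {
    System.out.println(BY_STATUS_CLASS.apply(Map.of("status", "503")));
    System.out.println(BY_STATUS_CLASS.apply(Map.of("status", "200")));
    System.out.println(BY_STATUS_CLASS.apply(Map.of("path", "/health")));
  }
}
```

Grouping by status class rather than by exact status code keeps the number of distinct outcomes small, which matters because each outcome would get its own set of DWPTs.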
   
   ## Associated OpenSearch RFC
   
   https://github.com/opensearch-project/OpenSearch/issues/13183
   
   ## Improvements with new DWPT distribution strategy
   
   We built a POC in Lucene and integrated it with OpenSearch. We validated 
DWPT distribution based on different criteria, such as status code and 
timestamp, against different types of workloads. We observed a 50% to 60% 
improvement in the performance of range, aggregation, and sort queries with 
the proposed DWPT selection approach.
   
   ## Implementation Details
   
   The user-defined grouping function will be passed to the DocumentsWriter as 
a new 
[IndexWriterConfig](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/IndexWriterConfig.java)
 configuration. During indexing of a document, the DocumentsWriter will 
evaluate this grouping function and pass the outcome to the 
DocumentsWriterFlushControl and DocumentsWriterPerThreadPool when [requesting a 
DWPT](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/DocumentsWriter.java#L415)
 for indexing the document. The DocumentsWriterPerThreadPool will now maintain 
a [distinct pool of 
DWPTs](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThreadPool.java#L47)
 for each possible outcome. The pool selected for indexing a document depends 
on the grouping function's outcome for that document. Should the relevant pool 
[be empty, a new DWPT will be created](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThreadPool.java#L126)
 and added to this pool. Returning to the HTTP request log example above, 
having distinct pools for 2xx and 5xx status code logs would ensure that 2xx 
logs are indexed using a separate set of DWPTs from the 5xx logs. Once a DWPT 
is designated for flushing, it is checked out of the thread pool and won't be 
reused for indexing.
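
The pool-per-outcome behaviour described above can be sketched as follows. This is a simplified, self-contained illustration, not the actual DocumentsWriterPerThreadPool code: `GroupedWriterPool`, its nested `Writer` class, and the `flushPending` flag are all hypothetical names, and real DWPT locking and flush control are far more involved.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: one free list of writers per grouping outcome. */
final class GroupedWriterPool {

  /** Stand-in for a DWPT; only remembers which group it belongs to. */
  static final class Writer {
    final String group;
    final int id;

    Writer(String group, int id) {
      this.group = group;
      this.id = id;
    }
  }

  private final Map<String, Deque<Writer>> freeByGroup = new HashMap<>();
  private int nextId;

  /** Returns a free writer for the group, creating one if that group's pool is empty. */
  synchronized Writer acquire(String group) {
    Deque<Writer> free = freeByGroup.computeIfAbsent(group, g -> new ArrayDeque<>());
    Writer writer = free.pollFirst();
    return writer != null ? writer : new Writer(group, nextId++);
  }

  /**
   * Hands a writer back after indexing one document. A writer that is pending
   * flush is not returned to the pool, so it is never reused for indexing.
   */
  synchronized void release(Writer writer, boolean flushPending) {
    if (!flushPending) {
      freeByGroup.computeIfAbsent(writer.group, g -> new ArrayDeque<>()).addFirst(writer);
    }
  }
}
```

Because a writer only ever returns to the free list of its own group, a document with outcome `2xx` can never land in a writer that has previously indexed `5xx` documents.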
   
   Further, to ensure that the grouping invariant is maintained even during 
segment merges, we propose a new merge policy that acts as a decorator over 
the existing TieredMergePolicy. During a segment merge, this policy would 
categorize segments according to their grouping function outcomes and only 
merge segments within the same category, thus preserving the grouping across 
merges.
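
The categorization step of such a merge policy might look like the sketch below. It covers only the partitioning of segments by group; in the proposal, the wrapped policy would then select merges inside each bucket. The `Segment` class and the assumption that each segment records its group key are hypothetical and not part of Lucene today.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Hypothetical sketch: bucket segments by group so merges never cross groups. */
final class GroupAwareMergeSketch {

  /** Stand-in for segment metadata; assumes the group key is stored per segment. */
  static final class Segment {
    final String name;
    final String group;

    Segment(String name, String group) {
      this.name = name;
      this.group = group;
    }
  }

  /** Returns segment names per group; each value is a separate merge candidate set. */
  static Map<String, List<String>> candidatesByGroup(List<Segment> segments) {
    return segments.stream()
        .collect(Collectors.groupingBy(
            s -> s.group,
            TreeMap::new, // sorted keys keep the output deterministic
            Collectors.mapping(s -> s.name, Collectors.toList())));
  }

  public static void main(String[] args) {
    List<Segment> segments = List.of(
        new Segment("_0", "2xx"),
        new Segment("_1", "5xx"),
        new Segment("_2", "2xx"));
    System.out.println(candidatesByGroup(segments));
  }
}
```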
   
   ### Guardrails
   
   To manage the system’s resources effectively, guardrails will be 
implemented to limit the number of groups that the grouping function can 
generate. Users will need to provide a predefined list of acceptable outcomes 
for the grouping function, along with the function itself. Documents whose 
grouping function outcome is not in this list will be indexed using a default 
pool of DWPTs. This limits the number of DWPTs created during indexing, 
preventing the formation of numerous small segments that could lead to 
frequent segment merges. Additionally, a cap on the DWPT count keeps JVM heap 
utilization and garbage collection in check.
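
A minimal sketch of this guardrail, assuming the grouping function returns a string key: any outcome outside the user-supplied allow-list is routed to a default group. `BoundedGrouping` and the `"_default"` key are hypothetical names chosen for illustration.

```java
import java.util.Set;
import java.util.function.Function;

/** Hypothetical sketch: cap the number of groups with an allow-list. */
final class BoundedGrouping<D> {

  /** Key of the fallback pool for outcomes that are not on the allow-list. */
  static final String DEFAULT_GROUP = "_default";

  private final Function<D, String> groupingFunction;
  private final Set<String> allowedOutcomes;

  BoundedGrouping(Function<D, String> groupingFunction, Set<String> allowedOutcomes) {
    this.groupingFunction = groupingFunction;
    this.allowedOutcomes = Set.copyOf(allowedOutcomes);
  }

  /** Returns the document's group, or DEFAULT_GROUP if the outcome is not allowed. */
  String groupFor(D doc) {
    String outcome = groupingFunction.apply(doc);
    // Check for null first: immutable sets reject null lookups.
    return outcome != null && allowedOutcomes.contains(outcome) ? outcome : DEFAULT_GROUP;
  }
}
```

With an allow-list of `4xx` and `5xx`, at most three pools exist (the two listed groups plus the default), no matter how many distinct values the function produces.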


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

