Hi All, I have a few questions about Lucene indexing and file handling. It would be great if someone can help with these. I had earlier asked these questions on gene...@lucene.apache.org but was asked to seek help here.
(1) During indexing, is there any knob to tell the writer to use off-heap memory for buffering? I didn't find anything in the docs, so the answer is probably no. Just confirming.

(2) I did some experiments with the buffering threshold using setRAMBufferSizeMB() on IndexWriterConfig. I tried 16MB (the default), 128MB, 256MB and 512MB. Each experiment ingested 5 million documents. It turns out that the buffering threshold also controls the number of files that are created in the index directory. In all cases I see only 1 segment (since there was just one segments_1 file), but there were multiple .cfs files: _0.cfs, _1.cfs, _2.cfs, _3.cfs. How can there be multiple cfs files when there is just one segment? My understanding from the documentation was that all files for a segment have the same name but different extensions. In this case, even though there is only 1 segment, there are still multiple cfs files. Does each flush result in a new file?

The reason for this experiment is to understand the number of open files both while building the index and while querying. I am not sure why I am seeing multiple cfs files when there is only 1 segment; I was hoping there would be only one file, _0.cfs. That is the case when the buffer threshold is 512MB, but there are 2 cfs files when the threshold is 256MB and 5 cfs files when it is 128MB. With the default 16MB threshold I didn't see any cfs file; there were only individual files (.fdx, .fdt, .tip etc). I thought that by default Lucene creates a compound file, at least after the writer closes. Is that not true? I can see that during querying only the cfs file is kept open. But I would like to understand a little more about the number of cfs files, so that we can set the buffering threshold to control the heap overhead while building the index.

(3) In my experiments, the writer commits and is closed after ingesting all 5 million documents, and after that there is no need for us to index more. So essentially it is an immutable index.
However, I want to understand the threshold for creating a new segment. Is it fairly high? Or, if the writer is reopened, will the next set of documents go into a new segment, and so on?

I would really appreciate some help with the above questions.

Thanks,
Siddharth
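
P.S. For anyone who wants to reproduce the file counts, the listing of the index directory can be grouped by file-name prefix with a small stdlib-only helper like the one below. This is only a sketch under assumptions: the class name SegmentFileCount and its methods are hypothetical, and it assumes that every per-segment file name starts with an underscore and that the prefix ends at the first "." or the next "_" (for example _0.cfs or _1_Lucene50_0.doc). Files such as segments_1 and write.lock are skipped. It does not use any Lucene API.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SegmentFileCount {

    // Returns the assumed segment prefix of a file name (e.g. "_0" for "_0.cfs"),
    // or null for files that do not start with an underscore (segments_1, write.lock).
    public static String segmentOf(String fileName) {
        if (!fileName.startsWith("_")) {
            return null;
        }
        int end = 1;
        // The prefix runs until the first '.' or the next '_' (assumption, see above).
        while (end < fileName.length()
                && fileName.charAt(end) != '.'
                && fileName.charAt(end) != '_') {
            end++;
        }
        return fileName.substring(0, end);
    }

    // Groups file names by their prefix; keys are sorted lexicographically.
    public static Map<String, List<String>> groupBySegment(List<String> fileNames) {
        Map<String, List<String>> bySegment = new TreeMap<>();
        for (String name : fileNames) {
            String segment = segmentOf(name);
            if (segment != null) {
                bySegment.computeIfAbsent(segment, k -> new ArrayList<>()).add(name);
            }
        }
        return bySegment;
    }

    public static void main(String[] args) throws IOException {
        // Index directory is the first argument; defaults to the current directory.
        Path indexDir = Paths.get(args.length > 0 ? args[0] : ".");
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(indexDir)) {
            for (Path p : stream) {
                names.add(p.getFileName().toString());
            }
        }
        groupBySegment(names).forEach((segment, files) ->
                System.out.println(segment + ": " + files.size() + " file(s) " + files));
    }
}
```

Compile it and pass the index directory as the first argument. Each output line then corresponds to one file-name prefix, which makes it easy to compare the runs with different buffer thresholds.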