Question about memory usage and file handling

siddharth teotia Mon, 11 Nov 2019 12:40:57 -0800

Hi All,

I have a few questions about Lucene indexing and file handling. It would be
great if someone can help with these. I had earlier asked these questions
on gene...@lucene.apache.org but was asked to seek help here.



(1) During indexing, is there any knob to tell the writer to use off-heap
memory for buffering? I didn't find anything in the docs, so the answer is
probably no. Just confirming.

(2) I did some experiments with the buffering threshold using
setRAMBufferSizeMB() on IndexWriterConfig. I tried 16MB (default), 128MB,
256MB and 512MB. The experiment ingested 5 million documents. It turns out
that the buffering threshold also controls the number of files that are
created in the index directory. In all the cases, I see only 1 segment
(since there was just one segments_1 file), but there were multiple .cfs
files -- _0.cfs, _1.cfs, _2.cfs, _3.cfs.

How can there be multiple cfs files when there is just one segment? My
understanding from the documentation was that all files for a segment
have the same name but different extensions. In this case, even though
there is only 1 segment, there are still multiple cfs files. Does each
flush result in a new file?
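
To compare the directory listing against that naming rule, the file names
can be grouped by their leading prefix. Below is a minimal sketch in plain
Java with no Lucene dependency. It assumes that the leading "_N" part of a
file name is what identifies the segment and that files without a leading
underscore (segments_1, write.lock) belong to no segment; that is an
assumption for illustration, not something confirmed in this thread. The
class and method names are hypothetical and are not part of Lucene.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper (not a Lucene class): groups index file names by prefix. */
final class SegmentFileGrouper {

    private SegmentFileGrouper() {}

    /**
     * Returns the assumed segment prefix of a file name, e.g. "_0" for
     * "_0.cfs" or "_0_Lucene50_0.doc". Returns null for names that do not
     * start with '_' (for example "segments_1" or "write.lock").
     */
    static String segmentPrefix(String fileName) {
        if (!fileName.startsWith("_")) {
            return null;
        }
        int end = fileName.length();
        // The prefix ends at the first '.' or '_' after the leading underscore.
        for (int i = 1; i < fileName.length(); i++) {
            char c = fileName.charAt(i);
            if (c == '.' || c == '_') {
                end = i;
                break;
            }
        }
        return fileName.substring(0, end);
    }

    /** Maps each prefix to the file names that share it, in sorted prefix order. */
    static Map<String, List<String>> groupBySegment(List<String> fileNames) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String name : fileNames) {
            String prefix = segmentPrefix(name);
            if (prefix != null) {
                groups.computeIfAbsent(prefix, k -> new ArrayList<>()).add(name);
            }
        }
        return groups;
    }
}
```

Under that assumption, the number of keys returned by groupBySegment() for
a directory listing is the number of distinct name prefixes, which can be
compared with the number of .cfs files seen in each run.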

The reason for this experiment is to understand the number of open files
both while building the index and while querying. I am not quite sure why
I am seeing multiple CFS files when there is only 1 segment. I was hoping
there would be only one _0.cfs file. This is true when the buffer
threshold is 512MB, but there are 2 cfs files when the threshold is set to
256MB, 5 cfs files when it is set to 128MB, and I didn't see any CFS file
for the default 16MB threshold; there were individual files instead (.fdx,
.fdt, .tip etc). I thought that by default Lucene creates a compound file,
at least after the writer closes. Is that not true?

I can see that during querying, only the cfs file is kept open. But I
would like to understand a little more about the number of cfs files, so
that we can set the buffering threshold to control the heap overhead while
building the index.
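
For repeating the comparison across buffer sizes, the per-extension file
counts in the index directory can be collected with a few lines of plain
Java. This is only a sketch of the bookkeeping, not of anything Lucene
does internally. The class name, the method names and the "(none)" label
for files without an extension are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper (not a Lucene class): counts index files per extension. */
final class IndexFileCounter {

    private IndexFileCounter() {}

    /** Lists the names of the regular files directly inside the given directory. */
    static List<String> listFileNames(Path dir) {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .filter(Files::isRegularFile)
                    .map(p -> p.getFileName().toString())
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            // Rethrow unchecked so callers do not need a throws clause.
            throw new UncheckedIOException(e);
        }
    }

    /** Counts file names per extension; names without a dot go under "(none)". */
    static Map<String, Integer> countByExtension(List<String> fileNames) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String name : fileNames) {
            int dot = name.lastIndexOf('.');
            String ext = dot >= 0 ? name.substring(dot) : "(none)";
            counts.merge(ext, 1, Integer::sum);
        }
        return counts;
    }
}
```

Calling countByExtension(listFileNames(dir)) on the index directory after
each run gives one small table per buffer size, so the number of .cfs
files at 128MB, 256MB and 512MB can be compared side by side.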

(3) In my experiments, the writer commits and is closed after ingesting all
5 million documents, and after that there is no need for us to index
more. So essentially it is an immutable index. However, I want to
understand the threshold for creating a new segment. Is that pretty high?
Or, if the writer is reopened, will the next set of documents go into
the next segment, and so on?

I would really appreciate some help with the above questions.

Thanks,
Siddharth
