Hello everybody, I read the "Lucene Index Format" paper, but there are still some points that are not clear to me. I understand the file concept behind Lucene and the compound file format. Adding documents and merging works like this with a SimpleFSDirectory implementation:
1. First, documents are buffered in RAM until MaxBufferedDocs is reached.
2. The buffered docs are flushed to the hard drive as a new segment.
3. Steps 1 and 2 are repeated until MergeFactor segments exist (a larger MergeFactor leads to more segments and fewer merge operations).
4. The segments are merged into one single segment.
5. Steps 1 - 4 are repeated until everything is indexed.

With standard settings this means that by adding 100 documents I get 9 different segments, and the last doc triggers a merge which leads to a single segment with 100 documents (the 10th segment is held in RAM before that). Are merges done in RAM or also on the hard drive?

My problems are in the details:

1) How exactly is merging done? What is the algorithm for it?

2) When documents are stored in segments, each document gets a unique number within its segment, starting at zero. Does this imply a renumbering when I merge several segments? For example: Segment1(0,1,2,3) and Segment2(0,1,2) --> Segment(0,1,2,3,4,5,6), where the docs from 4 on come from Segment2. If I change the order of adding the segments, the numbering changes accordingly: Segment2(0,1,2) and Segment1(0,1,2,3) --> Segment(0,1,2,3,4,5,6), where the docs from 3 on come from Segment1.

3) If I merge two segments, is the second segment simply appended "behind" the first one with the DocIDs adjusted, so that no ordering or sorting on the hard drive is necessary? Just "copy & paste" with the mentioned renumbering?

4) How does Lucene write the index to the hard drive? Are the blocks written sequentially? The API documentation says: "A Directory is a flat list of files. Files may be written once, when they are created. Once a file is created it may only be opened for read, or deleted. Random access is permitted both when reading and writing." Does this mean Lucene writes the segments sequentially and only creates holes through deleting/updating? So my index gets fragmented over time?
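Regarding questions 2 and 3, this little sketch (plain Java, only my assumption of how it works, not Lucene's actual merge code) shows the renumbering I have in mind: segments are concatenated in merge order, so a doc's new ID is its old in-segment ID plus the doc counts of all segments merged before its own.

```java
class MergeNumbering {

    // segmentSizes: doc counts of the segments, in merge order.
    // Returns the merged-segment ID of doc `localId` of segment `segment`.
    // Assumption: a merge just concatenates segments, no re-sorting of docs.
    static int newId(int[] segmentSizes, int segment, int localId) {
        int base = 0;
        for (int s = 0; s < segment; s++) {
            base += segmentSizes[s]; // all docs of earlier segments come first
        }
        return base + localId;       // plain "copy & paste" plus an offset
    }
}
```

With Segment1(4 docs) merged before Segment2(3 docs), Segment2's doc 0 would become 4; in the reverse merge order it would stay a lower number and Segment1's doc 0 would become 3, which matches my examples above.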
If it is getting fragmented, can I defragment it by running IndexWriter.optimize(), so that the blocks on the hard drive become sequential again? Or does this just renumber my DocIDs? Or am I totally wrong? :P

If I want to search the created index, Lucene first looks in the .tii file in RAM and then "skips" to the correct position on the hard drive. Are the exact hard drive positions of the term dictionary really stored in this file? That would mean my .tii file contains every 128th term of the whole index dictionary (unless I set IndexInterval to something other than 128). The rest is done with binary search, as far as I know.

5) What happens if I copy a directory with a built index onto another drive? Are the positions in the .tii file still correct? Or should I use Directory.copy(Directory src, Directory dest, boolean closeDirSrc) for copying an index? Does this readjust the positions in the .tii file? And what happens if my other hard drive has a totally different block size?

Thanks in advance and kind regards

Alex

--
View this message in context: http://lucene.472066.n3.nabble.com/Indexing-Index-format-details-tp963861p963861.html
Sent from the Lucene - General mailing list archive at Nabble.com.
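P.S. To make my mental model of the .tii lookup concrete, here is a simplified sketch (plain Java, my assumption, not Lucene code; I use an in-memory array position where Lucene would store a file offset into the .tis file rather than a physical disk position): binary-search the in-RAM index terms for the greatest indexed term <= the target, then linearly scan at most IndexInterval dictionary entries from that position.

```java
import java.util.Arrays;

class TermIndexSketch {
    static final int INTERVAL = 128; // default IndexInterval

    // allTerms stands in for the full on-disk term dictionary (.tis),
    // already sorted. Returns the dictionary position of `target`, or -1.
    static int findTerm(String[] allTerms, String target) {
        // Build the in-RAM index (.tii): every INTERVAL-th term plus the
        // position where its dictionary entry starts.
        int n = (allTerms.length + INTERVAL - 1) / INTERVAL;
        String[] indexTerms = new String[n];
        int[] dictPos = new int[n]; // stand-in for a .tis file offset
        for (int i = 0; i < n; i++) {
            indexTerms[i] = allTerms[i * INTERVAL];
            dictPos[i] = i * INTERVAL;
        }
        // Binary search in RAM for the greatest indexed term <= target.
        int found = Arrays.binarySearch(indexTerms, target);
        int start = found >= 0 ? dictPos[found]
                               : dictPos[Math.max(0, -found - 2)];
        // "Skip" to that position, then scan at most INTERVAL entries.
        int end = Math.min(start + INTERVAL, allTerms.length);
        for (int i = start; i < end; i++) {
            if (allTerms[i].equals(target)) return i;
        }
        return -1; // term not in the dictionary
    }
}
```

If the .tii entries really are offsets within the .tis file (and not absolute disk block positions), that would also answer my question 5: the offsets stay valid after copying the directory, regardless of the target drive's block size. But that is exactly the assumption I would like confirmed.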
