I implemented an index shrinker and it works.  I reduced my test index
from 6.6 GB to 3.6 GB by removing a single shingled field I did not
need anymore.  I'm actually using Lucene.Net for this project, so the
code is C# against the Lucene.Net 2.9.2 API, but the basic idea is:

Create an IndexReader wrapper that enumerates only the terms you want
to keep, and that strips the removed fields' terms out of documents
when they are returned.
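
For illustration only, here is a rough sketch of what such a wrapper can
look like against the Lucene.Net 2.9.x API (this is not my actual code, and
the exact shapes of GetFieldNames()/Terms()/Term() vary a little between
Lucene.Net releases).  It skips terms belonging to the dropped fields and
hides those fields from GetFieldNames(); a complete version would also
override Document() and the term-vector methods so the stored values of
the dropped fields are not copied:

using System.Collections.Generic;
using Lucene.Net.Index;

// Sketch: wraps another IndexReader and hides every term belonging to
// the fields in fieldsToRemove.  Signatures assumed from Lucene.Net 2.9.x.
public class FieldRemovingIndexReader : FilterIndexReader
{
    private readonly HashSet<string> fieldsToRemove;

    public FieldRemovingIndexReader(IndexReader reader, HashSet<string> fieldsToRemove)
        : base(reader)
    {
        this.fieldsToRemove = fieldsToRemove;
    }

    // Hide the dropped fields from anything asking for field names
    // (the merger consults field names when building the new FieldInfos).
    public override ICollection<string> GetFieldNames(IndexReader.FieldOption fieldOption)
    {
        List<string> kept = new List<string>();
        foreach (string name in base.GetFieldNames(fieldOption))
        {
            if (!fieldsToRemove.Contains(name))
                kept.Add(name);
        }
        return kept;
    }

    // Enumerate only the terms whose field is being kept.
    public override TermEnum Terms()
    {
        return new FieldSkippingTermEnum(base.Terms(), fieldsToRemove);
    }

    private class FieldSkippingTermEnum : FilterTermEnum
    {
        private readonly HashSet<string> fieldsToRemove;

        public FieldSkippingTermEnum(TermEnum termEnum, HashSet<string> fieldsToRemove)
            : base(termEnum)
        {
            this.fieldsToRemove = fieldsToRemove;
        }

        public override bool Next()
        {
            // Advance past any term that belongs to a removed field.
            while (base.Next())
            {
                if (!fieldsToRemove.Contains(base.Term().Field()))
                    return true;
            }
            return false;
        }
    }
}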

Use SegmentMerger to re-write each segment (with each segment reader
wrapped by the wrapper class), writing the new segment to a new
directory.  Collect the SegmentInfos and do a commit in order to create
a new segments file in the new index directory.

Done - you now have a shrunk index with the specified terms removed.
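
A simpler, single-threaded way to get the same end result is to let an
IndexWriter do the merge and the commit for you via AddIndexes; I used
SegmentMerger directly instead so each segment can be rewritten in
parallel (see below).  Rough sketch only, using the hypothetical
FieldRemovingIndexReader wrapper sketched above:

using System.Collections.Generic;
using Lucene.Net.Analysis;
using Lucene.Net.Index;
using Lucene.Net.Store;

// Sketch: copy an existing index into a new directory, dropping fields,
// by merging a filtering reader into a fresh IndexWriter.
public static class IndexShrinker
{
    public static void Shrink(Directory source, Directory dest, HashSet<string> fieldsToRemove)
    {
        IndexReader reader = IndexReader.Open(source, true); // read-only source
        // The analyzer does not matter here; nothing is re-analyzed during a merge.
        IndexWriter writer = new IndexWriter(dest, new KeywordAnalyzer(),
            true /* create new index */, IndexWriter.MaxFieldLength.UNLIMITED);
        try
        {
            // AddIndexes merges the wrapped reader into the new index, so only
            // the kept fields/terms get written out.
            writer.AddIndexes(new IndexReader[] { new FieldRemovingIndexReader(reader, fieldsToRemove) });
            writer.Commit();
        }
        finally
        {
            writer.Close();
            reader.Close();
        }
    }
}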

The implementation uses a separate thread for each segment, so it
re-writes them in parallel.  It took about 15 minutes to process a
770,000-document index on my MacBook.
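
The threading part is plain .NET; the shape is roughly the following,
where RewriteSegment() is a hypothetical stand-in for the
wrap-the-segment-and-run-SegmentMerger step described above (not my
actual code):

using System.Collections.Generic;
using System.Threading;
using Lucene.Net.Index;
using Lucene.Net.Store;

// Sketch: rewrite each source segment on its own thread.
public static class ParallelSegmentRewriter
{
    public static void RewriteAllSegments(IndexReader sourceIndex, Directory destDir,
                                          HashSet<string> fieldsToRemove)
    {
        // GetSequentialSubReaders() returns one reader per segment.
        IndexReader[] segments = sourceIndex.GetSequentialSubReaders();
        List<Thread> threads = new List<Thread>();

        foreach (IndexReader segment in segments)
        {
            IndexReader seg = segment; // avoid capturing the loop variable
            Thread t = new Thread(() => RewriteSegment(seg, destDir, fieldsToRemove));
            t.Start();
            threads.Add(t);
        }

        // Wait for every segment rewrite before committing the new segments file.
        foreach (Thread t in threads)
            t.Join();
    }

    // Hypothetical per-segment worker: wrap 'segment' in the filtering reader
    // and rewrite it into destDir (e.g. via SegmentMerger, as described above).
    private static void RewriteSegment(IndexReader segment, Directory destDir,
                                       HashSet<string> fieldsToRemove)
    {
        // ... SegmentMerger-based rewrite goes here ...
    }
}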


On Tue, Feb 14, 2012 at 10:12 PM, Li Li <fancye...@gmail.com> wrote:
> I have roughly read the code of the 4.0 trunk; maybe it's feasible.
>    SegmentMerger.add(IndexReader) adds the readers to be merged.
>    merge() will call
>      mergeTerms(segmentWriteState);
>      mergePerDoc(segmentWriteState);
>
>   mergeTerms() constructs fields from the IndexReaders:
>    for(int readerIndex=0;readerIndex<mergeState.readers.size();readerIndex++) {
>      final MergeState.IndexReaderAndLiveDocs r = mergeState.readers.get(readerIndex);
>      final Fields f = r.reader.fields();
>      final int maxDoc = r.reader.maxDoc();
>      if (f != null) {
>        slices.add(new ReaderUtil.Slice(docBase, maxDoc, readerIndex));
>        fields.add(f);
>      }
>      docBase += maxDoc;
>    }
>    So if you wrap your IndexReader and override its fields() method,
> maybe it will work for merging terms.
>
>    For DocValues, the wrapper can also override AtomicReader.docValues() and
> just return null for the fields you want to remove. Maybe it should
> traverse the CompositeReader's getSequentialSubReaders() and wrap each
> AtomicReader.
>
>    Other things like term vectors and norms are similar.
> On Wed, Feb 15, 2012 at 6:30 AM, Robert Stewart <bstewart...@gmail.com> wrote:
>
>> I was thinking that if I make a wrapper class that aggregates another
>> IndexReader and filters out the terms I don't want anymore, it might work.   And
>> then pass that wrapper into SegmentMerger.  I think if I filter out terms
>> in GetFieldNames(...) and Terms(...) it might work.
>>
>> Something like:
>>
>> HashSet<string> ignoredTerms = ...;
>>
>> FilteringIndexReader wrapper = new FilteringIndexReader(reader, ignoredTerms);
>>
>> SegmentMerger merger = new SegmentMerger(writer);
>>
>> merger.Add(wrapper);
>>
>> merger.Merge();
>>
>>
>>
>>
>>
>> On Feb 14, 2012, at 1:49 AM, Li Li wrote:
>>
>> > For method 2, delete is wrong; we can't delete terms.
>> >   You would also have to hack the tii and tis files.
>> >
>> > On Tue, Feb 14, 2012 at 2:46 PM, Li Li <fancye...@gmail.com> wrote:
>> >
>> >> Method 1: dumping data.
>> >> For stored fields, you can traverse the whole index and save them
>> >> somewhere else.
>> >> For indexed but not stored fields, it may be more difficult.
>> >>    If an indexed-but-not-stored field is not analyzed (fields such as
>> >> id), it's easy to get from FieldCache.StringIndex.
>> >>    But for analyzed fields, though they can theoretically be restored
>> >> from term vectors and term positions, they are hard to recover from the
>> >> index.
>> >>
>> >> Method 2: hacking the metadata.
>> >> 1. indexed fields
>> >>      delete by query, e.g. field:*
>> >> 2. stored fields
>> >>       Because all fields are stored sequentially, it's not easy to
>> >> delete some fields. This will not affect search speed, but if you want
>> >> to retrieve stored fields and the useless fields are very long, then it
>> >> will slow things down.
>> >>       It's also possible to hack with it, but that needs more effort to
>> >> understand the index file format and to traverse the fdt/fdx files.
>> >>
>> >> http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/fileformats.html
>> >>
>> >> This will give you some insight.
>> >>
>> >>
>> >> On Tue, Feb 14, 2012 at 6:29 AM, Robert Stewart <bstewart...@gmail.com> wrote:
>> >>
>> >>> Let's say I have a large index (100M docs, 1TB, split up between 10
>> >>> indexes), and a bunch of the "stored" and "indexed" fields are not used
>> >>> in search at all.  In order to save memory and disk, I'd like to rebuild
>> >>> that index *without* those fields, but I don't have the original documents
>> >>> to rebuild the entire index with (don't have the full text anymore, etc.).
>> >>> Is there some way to rebuild or optimize an existing index with only a
>> >>> sub-set of the existing indexed fields?  Or alternatively, is there a way
>> >>> to avoid loading some indexed fields at all (to avoid loading term infos
>> >>> and the terms index)?
>> >>>
>> >>> Thanks
>> >>> Bob
>> >>
>> >>
>> >>
>>
>>
