I implemented an index shrinker and it works. I reduced my test index from 6.6 GB to 3.6 GB by removing a single shingled field I did not need anymore. I'm actually using Lucene.Net for this project, so the code is C# against the Lucene.Net 2.9.2 API, but the basic idea is:

1. Create an IndexReader wrapper that only enumerates the terms you want to keep, and that removes those terms from documents when returning them.
2. Use the SegmentMerger to re-write each segment (each segment wrapped by the wrapper class), writing the new segment to a new directory.
3. Collect the SegmentInfos and do a commit in order to create a new segments file in the new index directory.

Done - you now have a shrunk index with the specified terms removed.

The implementation uses a separate thread per segment, so the segments are re-written in parallel. It took about 15 minutes to process a 770,000-document index on my MacBook.
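Roughly, the wrapper looks like the sketch below. It is shown against the equivalent Java Lucene 3.x API rather than my actual Lucene.Net 2.9.2 C# code, and the class name and the droppedFields set are made up for illustration - treat it as a starting point, not a drop-in implementation.

import java.io.IOException;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.index.FilterIndexReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermEnum;

// Wraps a segment reader and hides everything belonging to the dropped fields,
// so that a merge driven over this reader writes a new segment without those terms.
public class TermDroppingReader extends FilterIndexReader {

  private final Set<String> droppedFields;

  public TermDroppingReader(IndexReader in, Set<String> droppedFields) {
    super(in);
    this.droppedFields = droppedFields;
  }

  @Override
  public Collection<String> getFieldNames(FieldOption fieldOption) {
    // Hide the dropped fields so they never reach the new segment's FieldInfos.
    Collection<String> names = new HashSet<String>(in.getFieldNames(fieldOption));
    names.removeAll(droppedFields);
    return names;
  }

  @Override
  public TermEnum terms() throws IOException {
    // Enumerate only terms of the fields we keep; the merger then never asks
    // for postings of the skipped terms.
    final TermEnum wrapped = in.terms();
    return new FilterTermEnum(wrapped) {
      @Override
      public boolean next() throws IOException {
        while (wrapped.next()) {
          if (!droppedFields.contains(wrapped.term().field())) {
            return true;
          }
        }
        return false;
      }
    };
  }

  // If a dropped field is also stored, override document(int, FieldSelector)
  // as well and strip it from the returned Document.
}

Note that SegmentMerger is not a public class in the Java version (it is accessible in the Lucene.Net port, which is what I used). In plain Java the simplest way to drive the same rewrite is to open an IndexWriter on the new directory and call writer.addIndexes(new TermDroppingReader(reader, droppedFields)), which merges through the wrapper and writes the new segments file for you.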
On Tue, Feb 14, 2012 at 10:12 PM, Li Li <fancye...@gmail.com> wrote:

> I have roughly read the code of the 4.0 trunk; maybe it's feasible.
> SegmentMerger.add(IndexReader) adds the readers to be merged, and merge() calls:
>
>     mergeTerms(segmentWriteState);
>     mergePerDoc(segmentWriteState);
>
> mergeTerms() constructs the fields from the IndexReaders:
>
>     for (int readerIndex = 0; readerIndex < mergeState.readers.size(); readerIndex++) {
>       final MergeState.IndexReaderAndLiveDocs r = mergeState.readers.get(readerIndex);
>       final Fields f = r.reader.fields();
>       final int maxDoc = r.reader.maxDoc();
>       if (f != null) {
>         slices.add(new ReaderUtil.Slice(docBase, maxDoc, readerIndex));
>         fields.add(f);
>       }
>       docBase += maxDoc;
>     }
>
> So if you wrap your IndexReader and override its fields() method, it may work for merging terms.
>
> For DocValues, the wrapper can likewise override AtomicReader.docValues() and just return null for the fields you want to remove. It should probably also traverse the CompositeReader's getSequentialSubReaders() and wrap each AtomicReader.
>
> Other things, like term vectors and norms, are similar.
>
> On Wed, Feb 15, 2012 at 6:30 AM, Robert Stewart <bstewart...@gmail.com> wrote:
>
>> I was thinking that if I make a wrapper class that aggregates another IndexReader and filters out the terms I don't want anymore, it might work, and then pass that wrapper into SegmentMerger. I think if I filter out terms in GetFieldNames(...) and Terms(...) it might work.
>>
>> Something like:
>>
>>     HashSet<string> ignoredTerms = ...;
>>
>>     FilteringIndexReader wrapper = new FilteringIndexReader(reader, ignoredTerms);
>>
>>     SegmentMerger merger = new SegmentMerger(writer);
>>
>>     merger.Add(wrapper);
>>
>>     merger.Merge();
>>
>> On Feb 14, 2012, at 1:49 AM, Li Li wrote:
>>
>>> For method 2, the delete is wrong - we can't delete terms that way. You would also have to hack the tii and tis files.
>>>
>>> On Tue, Feb 14, 2012 at 2:46 PM, Li Li <fancye...@gmail.com> wrote:
>>>
>>>> Method 1, dumping data:
>>>> For stored fields, you can traverse the whole index and save them somewhere else.
>>>> For indexed but not stored fields, it is more difficult. If an indexed-but-not-stored field is not analyzed (fields such as id), it's easy to get from FieldCache.StringIndex. But for analyzed fields, even though they could theoretically be restored from term vectors and term positions, it's hard to recover them from the index.
>>>>
>>>> Method 2, hacking the metadata:
>>>> 1. Indexed fields: delete by query, e.g. field:*
>>>> 2. Stored fields: because all fields are stored sequentially, it's not easy to delete some of them. This will not affect search speed, but if you retrieve stored fields and the useless fields are very long, it will slow you down. It's also possible to hack this, but it takes more effort to understand the index file format and to traverse the fdt/fdx files.
>>>> http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/fileformats.html
>>>> This will give you some insight.
>>>>
>>>> On Tue, Feb 14, 2012 at 6:29 AM, Robert Stewart <bstewart...@gmail.com> wrote:
>>>>
>>>>> Let's say I have a large index (100M docs, 1 TB, split up between 10 indexes), and a bunch of the "stored" and "indexed" fields are not used in search at all. In order to save memory and disk, I'd like to rebuild that index *without* those fields, but I don't have the original documents to rebuild the entire index with (I don't have the full text anymore, etc.). Is there some way to rebuild or optimize an existing index with only a sub-set of the existing indexed fields? Or, alternatively, is there a way to avoid loading some indexed fields at all (to avoid loading the term infos and terms index)?
>>>>>
>>>>> Thanks
>>>>> Bob
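For completeness, the 4.0-trunk route suggested in the quoted mail above (wrapping each sub-reader and overriding fields()) would look roughly like the sketch below. This is an untested sketch against the Lucene 4.x FilterAtomicReader API; the class name and the dropped-field set are illustrative only, and docValues(), norms and term vectors for the dropped fields would need the same treatment.

import java.io.IOException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.FilterAtomicReader;
import org.apache.lucene.index.Terms;

// Hides the given fields from term enumeration on a Lucene 4.x AtomicReader, so
// that a merge over this reader (e.g. via IndexWriter.addIndexes) drops their postings.
public class FieldDroppingReader extends FilterAtomicReader {

  private final Set<String> dropped;

  public FieldDroppingReader(AtomicReader in, Set<String> dropped) {
    super(in);
    this.dropped = dropped;
  }

  @Override
  public Fields fields() throws IOException {
    final Fields fields = super.fields();
    if (fields == null) {
      return null;
    }
    return new FilterFields(fields) {
      @Override
      public Iterator<String> iterator() {
        // Copy out the surviving field names; fine for a one-off rewrite.
        Set<String> kept = new HashSet<String>();
        for (String name : fields) {
          if (!dropped.contains(name)) {
            kept.add(name);
          }
        }
        return kept.iterator();
      }

      @Override
      public Terms terms(String field) throws IOException {
        // Pretend the dropped fields have no terms at all.
        return dropped.contains(field) ? null : fields.terms(field);
      }
    };
  }
}

A CompositeReader would have to be unwrapped first so that each atomic sub-reader gets its own FieldDroppingReader, as the quoted mail suggests.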