You might be better off starting with the Lucene CheckIndex program. It walks all of the Lucene index data structures. I have done forensics by fiddling with the CheckIndex code.
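For reference, CheckIndex is a command-line tool shipped in the Lucene core jar; a typical invocation looks like the following (the jar filename and index path are placeholders that depend on your Lucene version and layout):

```shell
# Run Lucene's index checker against an index directory.
# lucene-core.jar and /path/to/index are placeholders.
java -ea:org.apache.lucene... -cp lucene-core.jar \
    org.apache.lucene.index.CheckIndex /path/to/index
```

It prints per-segment statistics (doc counts, term counts, deletions) as it walks the index, which is what makes it a useful starting point for forensics.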
On Thu, Aug 26, 2010 at 9:11 AM, Shawn Heisey <s...@elyograg.org> wrote:
> On 5/24/2010 6:30 AM, Sascha Szott wrote:
>>
>> Hi folks,
>>
>> is it possible to sort by field length without having to (redundantly)
>> save the length information in a separate index field? At first, I
>> thought to accomplish this using a function query, but I couldn't find
>> an appropriate one.
>>
>
> I have a slightly different need related to this, though it may turn out
> that what Sascha wants is similar. I would like to understand my data
> better so I can improve my schema. I need to do some data mining that
> is (to my knowledge) difficult or impossible with the source database.
> Performance is irrelevant, as long as it finishes eventually. Completing
> in less than an hour would be nice.
>
> I would do this on a test system with much lower performance and memory
> (4GB) than my production servers, as a single index instead of multiple
> shards. When it finishes building, the entire test index is likely to
> be about 75GB.
>
> What I'm after is an output that would look very much like faceting,
> but I want it to show document counts associated with field length (for
> a simple string) and number of terms (for a tokenized field) instead of
> field value. Can Solr do that, and if so, what do I need to have enabled
> in the schema to get it? Would branch_3x be enough, or would trunk be
> better?
>
> Thanks,
> Shawn
>

--
Lance Norskog
goks...@gmail.com
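The facet-like output Shawn describes (document counts keyed by field length rather than field value) can be sketched outside Solr in a few lines. This is a minimal illustration, not Solr functionality: the `docs` list and `length_facets` helper are hypothetical stand-ins for iterating stored field values, and tokenization is approximated by whitespace splitting.

```python
from collections import Counter

def length_facets(docs, field):
    """Facet-style counts: number of documents per field length.

    For a tokenized field, "length" here means the number of terms,
    approximated by whitespace splitting; for a simple string field
    you would use len(value) instead.
    """
    counts = Counter()
    for doc in docs:
        value = doc.get(field)
        if value is None:
            continue  # document lacks the field; skip it
        counts[len(value.split())] += 1
    return dict(counts)

# Hypothetical sample documents standing in for index contents.
docs = [
    {"title": "solr"},
    {"title": "lucene index"},
    {"title": "sort by field length"},
    {"title": "faceting output"},
]
print(length_facets(docs, "title"))  # → {1: 1, 2: 2, 4: 1}
```

Doing this at scale against a real 75GB index would mean walking the term vectors or stored fields with Lucene's IndexReader rather than loading documents into memory, but the counting logic is the same.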