Nutch will also handle this, but I'd probably stick with the DIH
(DataImportHandler), as Steve suggested. On Windows it's pretty easy to get a
recursive list of all the .txt files under the current directory by using

dir /b/s *.txt > files.txt
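
If you're not on Windows, or would rather script it, here is a rough Python
sketch of the same idea: walk the tree and write one full path per line. The
function names, the "files.txt" default and the case-insensitive ".txt" match
are my own assumptions for illustration, not anything Solr or the DIH requires.

```python
import os


def list_text_files(root, suffix=".txt"):
    """Return sorted absolute paths of every file under root ending in suffix."""
    matches = []
    # os.walk descends into every subdirectory, like the /s switch of dir.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Case-insensitive match, since Windows treats *.txt that way.
            if name.lower().endswith(suffix):
                matches.append(os.path.abspath(os.path.join(dirpath, name)))
    return sorted(matches)


def write_file_list(root, out_path="files.txt"):
    """Write one path per line to out_path and return how many were written."""
    paths = list_text_files(root)
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            out.write(path + "\n")
    return len(paths)
```

Calling write_file_list(".") from the top of the corpus should give you roughly
what the dir command above produces.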

Just my $0.02 ;-)

Adam

On Mar 4, 2011, at 5:52 PM, Steven A Rowe <sar...@syr.edu> wrote:

> Hi Colin,
> 
> Solr's DataImportHandler sounds like what you want:
> 
>    http://wiki.apache.org/solr/DataImportHandler
> 
> In particular, take a look at FileListEntityProcessor:
> 
>    http://wiki.apache.org/solr/DataImportHandler#FileListEntityProcessor
> 
> Steve
> 
>> -----Original Message-----
>> From: csm [mailto:cmcswig...@gmail.com]
>> Sent: Friday, March 04, 2011 5:50 PM
>> To: solr-user@lucene.apache.org
>> Subject: Help please - recursively indexing lots and lots of text files
>> 
>> Hi,
>> 
>> I'm new to Lucene/Solr and I'm trying to build an index of a large body of
>> plaintext files for some corpus research that I'm doing.  There are about
>> 37,000 files of typically 50-100 lines each, and they're scattered
>> throughout a huge nested directory structure.  I've worked through the
>> basic Solr tutorial and the text/html indexing tutorial at
>> http://www.slideshare.net/LucidImagination/indexing-text-and-html-files-with-solr-4063407
>> but after some looking around, I haven't been able to find any resources
>> for indexing a large number of text files that aren't all sitting in the
>> same directory.
>> 
>> Is this simply a case of having to write a shell script to crawl through
>> the whole directory tree and call cURL for every single file, or is there
>> a library or utility that can do this, or just an easier way?  Any help
>> would be greatly appreciated!  Alternatively, if this is a solved problem
>> and I just need to RTFM, it'd be great if someone could point me in the
>> right direction.
>> 
>> Thanks a lot,
>> Colin
>> 
