Nutch will also handle this, but I'd probably stick with the DIH (DataImportHandler), as Steve suggested. On Windows it's pretty easy to get a list of all the .txt files, including those in subdirectories, by running this from the top-level directory:

    dir /b /s *.txt > files.txt

Just my $0.02 ;-)

Adam

Sent from my iPhone

On Mar 4, 2011, at 5:52 PM, Steven A Rowe <sar...@syr.edu> wrote:

> Hi Colin,
>
> Solr's DataImportHandler sounds like what you want:
>
> http://wiki.apache.org/solr/DataImportHandler
>
> In particular, take a look at FileListEntityProcessor:
>
> http://wiki.apache.org/solr/DataImportHandler#FileListEntityProcessor
>
> Steve
>
>> -----Original Message-----
>> From: csm [mailto:cmcswig...@gmail.com]
>> Sent: Friday, March 04, 2011 5:50 PM
>> To: solr-user@lucene.apache.org
>> Subject: Help please - recursively indexing lots and lots of text files
>>
>> Hi,
>>
>> I'm new to Lucene/Solr and I'm trying to build an index of a large body
>> of plaintext files for some corpus research that I'm doing. There are
>> about 37,000 files of typically 50-100 lines each, and they're scattered
>> throughout a huge nested directory structure. I've worked through the
>> basic Solr tutorial and the text/html indexing tutorial at
>> http://www.slideshare.net/LucidImagination/indexing-text-and-html-files-with-solr-4063407
>> , but after some looking around, I haven't been able to find any
>> resources for indexing a large number of text files that aren't all
>> sitting in the same directory.
>>
>> Is this simply a case of having to write a shell script to crawl through
>> the whole directory tree and call cURL for every single file, or is
>> there a library or utility that can do this, or just an easier way? Any
>> help would be greatly appreciated! Alternatively, if this is a solved
>> problem and I just need to RTFM, it'd be great if someone could point me
>> in the right direction.
>>
>> Thanks a lot,
>> Colin
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Help-please-recursively-indexing-lots-and-lots-of-text-files-tp2635884p2635884.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
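
P.S. For anyone not on Windows, the same recursive file listing can be produced with a few lines of Python. This is only a rough sketch of what the dir command in my reply does: the function names (list_text_files, write_file_list) are made up for illustration, and it only builds the list of paths. It does not talk to Solr or the DIH at all.

```python
import os
import sys


def list_text_files(root, suffix=".txt"):
    """Return sorted absolute paths of files under root ending in suffix."""
    matches = []
    # os.walk visits root and every nested subdirectory.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Case-insensitive match, so NOTES.TXT is picked up too.
            if name.lower().endswith(suffix):
                matches.append(os.path.abspath(os.path.join(dirpath, name)))
    return sorted(matches)


def write_file_list(root, out_path="files.txt"):
    """Write one path per line to out_path and return how many were written."""
    paths = list_text_files(root)
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            out.write(path + "\n")
    return len(paths)


if __name__ == "__main__":
    # Hypothetical usage: python list_files.py /path/to/corpus
    start = sys.argv[1] if len(sys.argv) > 1 else "."
    count = write_file_list(start)
    print(f"wrote {count} paths to files.txt")
```

Like the dir version, this writes one absolute path per line to files.txt, which you could then feed to whatever does the actual indexing.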