As a supplement to what Chris said, if you can partition the file-system walk among a number of clients, you can also parallelize the indexing. If you're using SolrCloud 4.5+, there are also some nice optimizations in SolrCloud to keep inter-shard routing to a minimum.
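To make the partitioning idea concrete, here is a minimal sketch, assuming each client is started with its own index (0-based) and the total client count, and that all clients walk the same root path. The names PartitionedWalker, BatchSink, and ownedBy are hypothetical, and hashing the path string is just one possible way to split the work. The sink is a placeholder for whatever posts a batch to Solr (for example, a wrapper around SolrJ); no SolrJ calls are shown.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sink: a real client would turn each batch into Solr documents and post them. */
interface BatchSink {
    void send(List<Path> batch) throws IOException;
}

class PartitionedWalker {

    /** Assumption: a file belongs to the client whose index matches the path hash modulo the client count. */
    static boolean ownedBy(Path file, int clientIndex, int clientCount) {
        int nonNegativeHash = file.toString().hashCode() & Integer.MAX_VALUE;
        return nonNegativeHash % clientCount == clientIndex;
    }

    /** Walks root, hands this client's files to the sink in batches, and returns how many files it handled. */
    static long walk(Path root, final int clientIndex, final int clientCount,
                     final int batchSize, final BatchSink sink) throws IOException {
        final List<Path> batch = new ArrayList<Path>(batchSize);
        final long[] handled = {0};

        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                if (attrs.isRegularFile() && ownedBy(file, clientIndex, clientCount)) {
                    batch.add(file);
                    handled[0]++;
                    if (batch.size() >= batchSize) {
                        // Flush a copy so the sink never sees the list being cleared.
                        sink.send(new ArrayList<Path>(batch));
                        batch.clear();
                    }
                }
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFileFailed(Path file, IOException exc) {
                // Skip unreadable entries instead of aborting the whole walk.
                return FileVisitResult.CONTINUE;
            }
        });

        // Flush whatever is left after the walk finishes.
        if (!batch.isEmpty()) {
            sink.send(new ArrayList<Path>(batch));
        }
        return handled[0];
    }
}
```

With this split every client still traverses the whole tree, but only reads and posts its own share, and only one batch of paths is held in memory at a time. If the traversal itself is the bottleneck, giving each client a separate top-level directory may be a better partition.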
FWIW,
Erick

On Wed, Oct 23, 2013 at 12:59 PM, Chris Geeringh <geeri...@gmail.com> wrote:
> Prerna,
>
> The FileListEntityProcessor has a terribly inefficient recursive method,
> which will be using up all your heap building a list of files.
>
> I would suggest writing a client application and traversing your filesystem
> with the NIO API available in Java 7: Files.walkFileTree() and a FileVisitor.
>
> As you "walk", post documents up to the server with SolrJ.
>
> Cheers,
> Chris
>
>
> On 22 October 2013 18:58, keshari.prerna <keshari.pre...@gmail.com> wrote:
>
> > Hello,
> >
> > I am trying to index log files (all text data) stored in the file system.
> > Data can be as big as 1000 GB or more. I am working on Windows.
> >
> > A sample file can be found at
> > https://www.dropbox.com/s/mslwwnme6om38b5/batkid.glnxa64.66441
> >
> > I tried using FileListEntityProcessor with TikaEntityProcessor, which ended
> > in a Java heap exception that I couldn't get rid of no matter how much I
> > increased my RAM size.
> >
> > data-config.xml
> >
> > <dataConfig>
> >   <dataSource name="bin" type="FileDataSource" />
> >   <document>
> >     <entity name="f" dataSource="null" rootEntity="true"
> >             processor="FileListEntityProcessor"
> >             transformer="TemplateTransformer"
> >             baseDir="//mathworks/devel/bat/A/logs/66048/"
> >             fileName=".*\.*" onError="skip" recursive="true">
> >
> >       <field column="fileAbsolutePath" name="path" />
> >       <field column="fileSize" name="size"/>
> >       <field column="fileLastModified" name="lastmodified" />
> >
> >       <entity name="file" dataSource="bin"
> >               processor="TikaEntityProcessor" url="${f.fileAbsolutePath}" format="text"
> >               onError="skip" transformer="TemplateTransformer"
> >               rootEntity="true">
> >         <field column="text" name="text"/>
> >       </entity>
> >     </entity>
> >   </document>
> > </dataConfig>
> >
> > Then I used FileListEntityProcessor with LineEntityProcessor, which never
> > stopped indexing even after 40 hours or so.
> >
> > data-config.xml
> >
> > <dataConfig>
> >   <dataSource name="bin" type="FileDataSource" />
> >   <document>
> >     <entity name="f" dataSource="null" rootEntity="true"
> >             processor="FileListEntityProcessor"
> >             transformer="TemplateTransformer"
> >             baseDir="//mathworks/devel/bat/A/logs/"
> >             fileName=".*\.*" onError="skip" recursive="true">
> >
> >       <field column="fileAbsolutePath" name="path" />
> >       <field column="fileSize" name="size"/>
> >       <field column="fileLastModified" name="lastmodified" />
> >
> >       <entity name="file" dataSource="bin"
> >               processor="LineEntityProcessor" url="${f.fileAbsolutePath}" format="text"
> >               onError="skip"
> >               rootEntity="true">
> >         <field column="content" name="rawLine"/>
> >       </entity>
> >     </entity>
> >   </document>
> > </dataConfig>
> >
> > Is there any way I can use post.jar to index text files recursively? Or any
> > other way that works without a Java heap exception and doesn't take days to
> > index?
> >
> > I am completely stuck here. Any help would be greatly appreciated.
> >
> > Thanks,
> > Prerna
> >
> >
> >
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/Indexing-logs-files-of-thousands-of-GBs-tp4097073.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >