Prerna,

The FileListEntityProcessor has a terribly inefficient recursive method, which will be using up all your heap while it builds a list of files.
I would suggest writing a client application and traversing your filesystem with the NIO API available in Java 7: Files.walkFileTree() and a FileVisitor. As you "walk", post the files up to the server with SolrJ.

Cheers,
Chris

On 22 October 2013 18:58, keshari.prerna <keshari.pre...@gmail.com> wrote:

> Hello,
>
> I am trying to index log files (all text data) stored in the file system. Data
> can be as big as 1000 GB or more. I am working on Windows.
>
> A sample file can be found at
> https://www.dropbox.com/s/mslwwnme6om38b5/batkid.glnxa64.66441
>
> I tried using FileListEntityProcessor with TikaEntityProcessor, which ended
> up in a Java heap exception that I couldn't get rid of no matter how much I
> increased my RAM size.
>
> data-config.xml
>
> <dataConfig>
>   <dataSource name="bin" type="FileDataSource" />
>   <document>
>     <entity name="f" dataSource="null" rootEntity="true"
>         processor="FileListEntityProcessor"
>         transformer="TemplateTransformer"
>         baseDir="//mathworks/devel/bat/A/logs/66048/"
>         fileName=".*\.*" onError="skip" recursive="true">
>
>       <field column="fileAbsolutePath" name="path" />
>       <field column="fileSize" name="size"/>
>       <field column="fileLastModified" name="lastmodified" />
>
>       <entity name="file" dataSource="bin"
>           processor="TikaEntityProcessor" url="${f.fileAbsolutePath}" format="text"
>           onError="skip" transformer="TemplateTransformer"
>           rootEntity="true">
>         <field column="text" name="text"/>
>       </entity>
>     </entity>
>   </document>
> </dataConfig>
>
> Then I used FileListEntityProcessor with LineEntityProcessor, which never
> stopped indexing even after 40 hours or so.
>
> data-config.xml
>
> <dataConfig>
>   <dataSource name="bin" type="FileDataSource" />
>   <document>
>     <entity name="f" dataSource="null" rootEntity="true"
>         processor="FileListEntityProcessor"
>         transformer="TemplateTransformer"
>         baseDir="//mathworks/devel/bat/A/logs/"
>         fileName=".*\.*" onError="skip" recursive="true">
>
>       <field column="fileAbsolutePath" name="path" />
>       <field column="fileSize" name="size"/>
>       <field column="fileLastModified" name="lastmodified" />
>
>       <entity name="file" dataSource="bin"
>           processor="LineEntityProcessor" url="${f.fileAbsolutePath}" format="text"
>           onError="skip"
>           rootEntity="true">
>         <field column="content" name="rawLine"/>
>       </entity>
>     </entity>
>   </document>
> </dataConfig>
>
> Is there any way I can use post.jar to index text files recursively? Or any
> other way which works without a Java heap exception and doesn't take days to
> index?
>
> I am completely stuck here. Any help would be greatly appreciated.
>
> Thanks,
> Prerna
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Indexing-logs-files-of-thousands-of-GBs-tp4097073.html
> Sent from the Solr - User mailing list archive at Nabble.com.
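
A minimal sketch of the client-side walk suggested at the top of this reply, in case it helps. It only covers the traversal with Files.walkFileTree() and a FileVisitor. The class name LogTreeWalker and the handler callback are made up for illustration, and the lambda syntax assumes Java 8 or later (on Java 7 the handler would have to be an ordinary interface implementation). The point is that each file is handed to the handler as soon as it is visited, so no full file list is ever held on the heap. The SolrJ posting itself is left out on purpose and would go inside the handler.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.function.Consumer;

// Hypothetical helper (not part of Solr or SolrJ): walks a directory tree and
// passes each regular file to a handler one at a time.
class LogTreeWalker {

    // Returns the number of regular files handed to the handler.
    static long walk(Path root, final Consumer<Path> handler) throws IOException {
        final long[] count = {0};
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                if (attrs.isRegularFile()) {
                    // Assumption: this is where a SolrJ client would read the
                    // file and send it to the server as one or more documents.
                    handler.accept(file);
                    count[0]++;
                }
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFileFailed(Path file, IOException exc) {
                // Log and keep going, roughly in the spirit of onError="skip".
                System.err.println("Skipping " + file + ": " + exc);
                return FileVisitResult.CONTINUE;
            }
        });
        return count[0];
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java LogTreeWalker <root-directory>");
            return;
        }
        long n = walk(Paths.get(args[0]), file -> System.out.println("would index: " + file));
        System.out.println(n + " files visited");
    }
}
```

Run against the log root, this prints each path as it is reached. Replacing the println handler with code that posts to Solr, ideally adding documents in batches and committing occasionally rather than per file, is the part that still has to be written against whichever SolrJ version is in use.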