Prerna,

The FileListEntityProcessor has a terribly inefficient recursive method
that builds the full list of files in memory, which will be using up all
your heap.

I would suggest writing a client application that traverses your filesystem
with the NIO API available in Java 7: Files.walkFileTree() and a FileVisitor.

As you "walk", post each file up to the server with SolrJ.
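
A rough sketch of the walking side, using only the JDK. LogWalker and
FileSink are names made up for this example, and the SolrJ call itself is
left as a placeholder comment because it depends on your SolrJ version and
schema. Files are handed to the callback one at a time, so no list of files
is ever held in memory.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

public class LogWalker {

    /** Hypothetical callback: receives one file at a time as the walk proceeds. */
    public interface FileSink {
        void accept(Path file, BasicFileAttributes attrs) throws IOException;
    }

    /** Walks root recursively, streams each regular file to the sink, returns the file count. */
    public static long walk(Path root, final FileSink sink) throws IOException {
        final long[] count = {0};
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                    throws IOException {
                if (attrs.isRegularFile()) {
                    sink.accept(file, attrs);
                    count[0]++;
                }
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFileFailed(Path file, IOException exc) {
                // Skip unreadable entries and keep walking (similar in spirit to onError="skip").
                System.err.println("Skipping " + file + ": " + exc);
                return FileVisitResult.CONTINUE;
            }
        });
        return count[0];
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        long n = walk(root, new FileSink() {
            @Override
            public void accept(Path file, BasicFileAttributes attrs) {
                // Placeholder: build a document from these values and send it with SolrJ here.
                System.out.println(file.toAbsolutePath() + " " + attrs.size()
                        + " " + attrs.lastModifiedTime());
            }
        });
        System.out.println("Visited " + n + " files");
    }
}
```

Run it with the log directory as the first argument. Sending documents in
batches rather than one request per file would probably help throughput,
but that is a tuning choice for your setup.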

Cheers,
Chris


On 22 October 2013 18:58, keshari.prerna <keshari.pre...@gmail.com> wrote:

> Hello,
>
> I am trying to index log files (all text data) stored in the file system.
> The data can be as big as 1000 GB or more. I am working on Windows.
>
> A sample file can be found at
> https://www.dropbox.com/s/mslwwnme6om38b5/batkid.glnxa64.66441
>
> I tried using FileListEntityProcessor with TikaEntityProcessor, which ended
> up in a Java heap exception that I couldn't get rid of no matter how much I
> increased the RAM size.
>
> data-config.xml
>
> <dataConfig>
>     <dataSource name="bin" type="FileDataSource" />
>     <document>
>         <entity name="f" dataSource="null" rootEntity="true"
>             processor="FileListEntityProcessor"
>             transformer="TemplateTransformer"
>             baseDir="//mathworks/devel/bat/A/logs/66048/"
>             fileName=".*\.*" onError="skip" recursive="true">
>
>             <field column="fileAbsolutePath" name="path" />
>             <field column="fileSize" name="size"/>
>             <field column="fileLastModified" name="lastmodified" />
>
>             <entity name="file" dataSource="bin"
>                 processor="TikaEntityProcessor" url="${f.fileAbsolutePath}"
>                 format="text" onError="skip" transformer="TemplateTransformer"
>                 rootEntity="true">
>                 <field column="text" name="text"/>
>             </entity>
>         </entity>
>     </document>
> </dataConfig>
>
> Then I used FileListEntityProcessor with LineEntityProcessor, which never
> finished indexing even after 40 hours or so.
>
> data-config.xml
>
> <dataConfig>
>     <dataSource name="bin" type="FileDataSource" />
>     <document>
>         <entity name="f" dataSource="null" rootEntity="true"
>             processor="FileListEntityProcessor"
>             transformer="TemplateTransformer"
>             baseDir="//mathworks/devel/bat/A/logs/"
>             fileName=".*\.*" onError="skip" recursive="true">
>
>             <field column="fileAbsolutePath" name="path" />
>             <field column="fileSize" name="size"/>
>             <field column="fileLastModified" name="lastmodified" />
>
>             <entity name="file" dataSource="bin"
>                 processor="LineEntityProcessor" url="${f.fileAbsolutePath}"
>                 format="text" onError="skip"
>                 rootEntity="true">
>                 <field column="content" name="rawLine"/>
>             </entity>
>         </entity>
>     </document>
> </dataConfig>
>
> Is there any way I can use post.jar to index text files recursively? Or any
> other way that works without a Java heap exception and doesn't take days to
> index?
>
> I am completely stuck here. Any help would be greatly appreciated.
>
> Thanks,
> Prerna
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Indexing-logs-files-of-thousands-of-GBs-tp4097073.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
