There are a few tutorials out there.

1. http://wiki.apache.org/nutch/RunningNutchAndSolr (not the most practical)
2. http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ (similar to 1.)
3. Build the latest from the 1.3 branch,
http://svn.apache.org/repos/asf/nutch/branches/branch-1.3/, and follow
this one:

http://www.adamestrada.com/2010/04/24/web-crawling-with-nutch/

but add the -solr parameter at the end of the crawl command:

bin/nutch crawl urls -depth 5 -topN 100 -solr http://localhost:8983/solr

This will automatically add the data Nutch collected to Solr. For
larger files I would also increase the JVM heap by setting your
JAVA_OPTS env to something like JAVA_OPTS='-Xmx2048m'
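
Putting that together, a rough sketch of the whole run might look like
the script below. Treat it as an illustration only: the seed file name
urls/seed.txt and the example.com URL are placeholders made up for this
example, the Solr URL assumes a default local instance, and the script
assumes you run it from the top of your Nutch directory.

```shell
#!/bin/sh
# Sketch only: run from the top of a Nutch 1.3 build.
set -e

# Hypothetical seed list; replace the URL with whatever you want crawled.
mkdir -p urls
echo "http://example.com/" > urls/seed.txt

# Heap setting suggested in this message (note the leading dash).
JAVA_OPTS='-Xmx2048m'
export JAVA_OPTS

# Assumed Solr location; adjust host, port, and path to your instance.
SOLR_URL="http://localhost:8983/solr"

if [ -x bin/nutch ]; then
    # Crawl 5 links deep, at most 100 pages per round, then index into Solr.
    bin/nutch crawl urls -depth 5 -topN 100 -solr "$SOLR_URL"
else
    echo "bin/nutch not found; run this from the Nutch directory" >&2
fi
```

If the crawl finishes without errors, the documents should be
searchable in the Solr instance that -solr points at.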

Adam




On Tue, Jan 25, 2011 at 11:41 AM, pankaj bhatt <panbh...@gmail.com> wrote:
> Thanks Adam. It seems like Nutch would solve most of my concerns.
> It would be great if you could share some resources for Nutch with us.
>
> / Pankaj Bhatt.
>
> On Tue, Jan 25, 2011 at 7:21 PM, Estrada Groups <
> estrada.adam.gro...@gmail.com> wrote:
>
>> I would just use Nutch and specify the -solr param on the command line.
>> That will add the extracted content to your instance of Solr.
>>
>> Adam
>>
>> On Jan 25, 2011, at 5:29 AM, pankaj bhatt <panbh...@gmail.com> wrote:
>>
>> > Hi All,
>> >    I need to index the documents present in my file system at various
>> > locations (e.g. C:\docs, d:\docs).
>> >    Is there any way I can specify this in my DIH configuration?
>> >    Here is my configuration:
>> >
>> > <document>
>> >    <entity name="sd"
>> >        processor="FileListEntityProcessor"
>> >        fileName="docx$|doc$|pdf$|xls$|xlsx|html$|rtf$|txt$|zip$"
>> >        baseDir="G:\\Desktop\\"
>> >        recursive="false"
>> >        rootEntity="true"
>> >        transformer="DateFormatTransformer"
>> >        onerror="continue">
>> >        <entity name="tikatest"
>> >            processor="org.apache.solr.handler.dataimport.TikaEntityProcessor"
>> >            url="${sd.fileAbsolutePath}" format="text" dataSource="bin">
>> >          <field column="Author" name="author" meta="true"/>
>> >          <field column="Content-Type" name="title" meta="true"/>
>> >          <!-- field column="title" name="title" meta="true"/ -->
>> >          <field column="text" name="all_text"/>
>> >        </entity>
>> >
>> >        <!-- field column="fileLastModified" name="date"
>> >             dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" / -->
>> >        <field column="fileSize" name="size"/>
>> >        <field column="file" name="filename"/>
>> >    </entity>
>> >    <!--baseDir="../site"-->
>> > </document>
>> >
>> > / Pankaj Bhatt.
>>
>
