See http://crawler.archive.org/faq.html#new_writer. Other Heritrix
questions should probably go to the Heritrix list.

-Sean

Tony Wang wrote:
> Sean -
>
> I found Heritrix pretty easy to set up. I am testing it on my server here,
> http://66.197.161.133:8081, and am trying to create crawl jobs. As for the
> 'Heritrix writer', could you write the crawl results to XML, or do you
> think inserting into MySQL would be better? And where can I find
> documentation for creating a Heritrix writer? I really want to make it work
> with Solr.
>
> Thanks!
> Tony
>
> On Fri, Mar 6, 2009 at 8:08 AM, Sean Timm <tim...@aol.com> wrote:
>
>> We too use Heritrix. We tried Nutch first but Nutch was not finding all
>> of the documents that it was supposed to. When Nutch and Heritrix were
>> both set to crawl our own site to a depth of three, Nutch missed some
>> pages that were linked directly from the seed. We ended up with 10%-20%
>> fewer pages in the Nutch crawl.
>>
>> It is pretty easy to add custom writers to Heritrix. We write our crawls
>> to MySQL and then ingest into Solr from there. It would not be hard to
>> write a Heritrix writer that writes directly to Solr, however.
>>
>> -Sean
>>
>> Baalman, Laura A. (ARC-TI)[QSS GROUP INC] wrote:
>>
>>> We are using Heritrix, the Internet Archive’s open source crawler, which
>>> is very easy to extend. We have augmented it with a custom parser to crawl
>>> some specific data formats and coded our own processors (Heritrix’s
>>> terminology for extensions) to link together different data sources as well
>>> as to output XML files in the right format to feed to Solr. We have not yet
>>> created an automated path to feed the XML files into Solr, but we plan to.
>>>
>>> ~LB
>>>
>>>
>>>
>>> On 3/5/09 3:32 PM, "Tony Wang" <ivyt...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> I wonder if there's any open source crawler product that could be integrated
>>> with Solr. What crawler do you guys use? Or did you code one yourselves? I
>>> have been trying to find solutions for Nutch/Solr integration, but
>>> haven't had any luck yet.
>>>
>>> Could someone shed some light?
>>>
>>> thanks!
>>>
>>> Tony
>>>
>>> --
>>> Are you RCholic? www.RCholic.com
>>> 温 良 恭 俭 让 仁 义 礼 智 信
