For building a crawler, I’d start with Scrapy (https://scrapy.org/). It has a solid design and
should be easy to use for crawling web pages, files, or an API.
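Scrapy handles the crawl loop for you, but the core step it performs on each page — fetch, extract links, resolve them against the base URL, enqueue — can be sketched with just the Python standard library (URLs below are placeholders):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the anchor tags on one fetched page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative hrefs against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))

# One crawl step: parse a page's HTML and collect outbound links
# (in a real crawler the HTML would come from the HTTP response body).
extractor = LinkExtractor("https://example.com/docs/")
extractor.feed('<a href="page1.html">1</a> <a href="/root.html">2</a>')
print(extractor.links)
```

A real spider would de-duplicate these links, filter them against allow/deny rules, and feed them back into the fetch queue — exactly the plumbing Scrapy gives you for free.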

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Aug 28, 2020, at 4:16 AM, Joe Doupnik <j...@netlab1.net> wrote:
> 
>     Some time ago I faced a roughly similar challenge. After many trials and 
> tests I ended up creating my own programs to accomplish the tasks of fetching 
> files, selecting which are allowed to be indexed, and feeding them into Solr 
> (POST style). This work is open source, found at https://netlab1.net/, in the web 
> page section titled Presentations of long term utility, item Solr/Lucene 
> Search Service. It is a set of docs, three small PHP programs, and a bundle of 
> Solr schema and related files, all within one downloadable zip file.
>     On filtering found files, my solution uses a list of regular expressions 
> which are simple to state and to process. The docs discuss the rules. 
> Luckily, the code dealing with rules per se and doing the filtering is very 
> short and simple; see crawler.php for convertfilter() and filterbyname(). 
> Thus you may wish to consider them or equivalents for inclusion in your 
> system, whatever that may be.
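A minimal sketch of that allow/deny approach, in Python rather than the PHP of the original convertfilter()/filterbyname() (the function names below mirror them but are hypothetical stand-ins, not the actual code from the zip file):

```python
import re

def convert_filter(patterns):
    """Compile the list of regex rules once, up front."""
    return [re.compile(p) for p in patterns]

def filter_by_name(paths, deny_rules):
    """Keep only paths that match none of the compiled deny rules."""
    return [p for p in paths if not any(r.search(p) for r in deny_rules)]

# Example: exclude anything under /private/ and any .tmp file.
rules = convert_filter([r"/private/", r"\.tmp$"])
print(filter_by_name(["/docs/a.pdf", "/private/b.pdf", "/docs/c.tmp"], rules))
# -> ['/docs/a.pdf']
```

As the docs note, the appeal of this design is that the rule list stays simple to state while the filtering code stays short; only the files that survive the filter get POSTed to Solr.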
>     Thanks,
>     Joe D.
> 
> On 27/08/2020 20:32, Alexandre Rafalovitch wrote:
>> If you are indexing from Drupal into Solr, that's the question for
>> Drupal's solr module. If you are doing it some other way, which way
>> are you doing it? bin/post command?
>> 
>> Most likely this is not the Solr question, but whatever you have
>> feeding data into Solr.
>> 
>> Regards,
>>   Alex.
>> 
>> On Thu, 27 Aug 2020 at 15:21, Staley, Phil R - DCF
>> <phil.sta...@wisconsin.gov> wrote:
>>> Can you (and if so, how do you) exclude a specific folder/directory from 
>>> indexing in Solr version 7.x or 8.x? Also, our CMS is Drupal 8.
>>> 
>>> Thanks,
>>> 
>>> Phil Staley
>>> DCF Webmaster
>>> 608 422-6569
>>> phil.sta...@wisconsin.gov
>>> 
>>> 
> 
