For building a crawler, I'd start with Scrapy (https://scrapy.org/). It has a solid design and should be easy to use for crawling web pages, files, or an API.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Aug 28, 2020, at 4:16 AM, Joe Doupnik <j...@netlab1.net> wrote:
>
> Some time ago I faced a roughly similar challenge. After many trials and
> tests I ended up creating my own programs to accomplish the tasks of
> fetching files, selecting which are allowed to be indexed, and feeding
> them into Solr (POST style). This work is open source, found on
> https://netlab1.net/, web page section titled Presentations of long term
> utility, item Solr/Lucene Search Service. It is a set of docs, three
> small PHP programs, and a bundle with a Solr schema etc., all within one
> downloadable zip file.
> On filtering found files, my solution uses a list of regular expressions,
> which are simple to state and to process. The docs discuss the rules.
> Luckily, the code dealing with the rules per se and doing the filtering
> is very short and simple; see crawler.php for convertfilter() and
> filterbyname(). Thus you may wish to consider them, or equivalents, for
> inclusion in your system, whatever that may be.
> Thanks,
> Joe D.
>
> On 27/08/2020 20:32, Alexandre Rafalovitch wrote:
>> If you are indexing from Drupal into Solr, that's a question for
>> Drupal's Solr module. If you are doing it some other way, which way
>> are you doing it? The bin/post command?
>>
>> Most likely this is not a Solr question, but one for whatever you have
>> feeding data into Solr.
>>
>> Regards,
>>    Alex.
>>
>> On Thu, 27 Aug 2020 at 15:21, Staley, Phil R - DCF
>> <phil.sta...@wisconsin.gov> wrote:
>>> Can you, and if so how do you, exclude a specific folder/directory
>>> from indexing in Solr version 7.x or 8.x? Also, our CMS is Drupal 8.
>>>
>>> Thanks,
>>>
>>> Phil Staley
>>> DCF Webmaster
>>> 608 422-6569
>>> phil.sta...@wisconsin.gov
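The regex-list filtering Joe describes (convertfilter()/filterbyname() in crawler.php) can be sketched in Python as well. The rule format below — (allow|deny, pattern) pairs with first-match-wins and deny-by-default — is my assumption for illustration, not the actual syntax his docs define.

```python
import re

# Hypothetical rule list: ("allow" | "deny", regex). First matching rule
# wins; a path matching no rule is rejected. These patterns are examples,
# not the ones shipped with crawler.php.
RULES = [
    ("deny",  r"/private/"),             # skip anything under /private/
    ("deny",  r"\.(gif|png|jpe?g)$"),    # skip image files
    ("allow", r"\.(html?|pdf|docx?)$"),  # index common document types
]

def compile_rules(rules):
    """Pre-compile the patterns once, keeping each rule's verdict."""
    return [(verdict, re.compile(pattern, re.IGNORECASE))
            for verdict, pattern in rules]

def filter_by_name(path, compiled):
    """Return True if this path is allowed to be fed to Solr."""
    for verdict, regex in compiled:
        if regex.search(path):
            return verdict == "allow"
    return False  # no rule matched: reject by default

compiled = compile_rules(RULES)
print(filter_by_name("/docs/report.pdf", compiled))     # → True
print(filter_by_name("/private/report.pdf", compiled))  # → False
```

Keeping the verdict separate from the pattern, rather than encoding negation inside the regex, keeps each rule easy to read and the filtering loop trivially short, which seems to be the point of the original design.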