I think the OP is indexing flat files, not web pages (but otherwise, I
agree with you that Scrapy is great - I know some of the people behind
it too and they're a good bunch).
Charlie
On 02/06/2020 16:41, Walter Underwood wrote:
> On Jun 2, 2020, at 7:40 AM, Charlie Hull wrote:
>
> If it was me I'd probably build a standalone indexer script in Python that
> did the file handling, called out to a separate Tika service for extraction,
> posted to Solr.
I would do the same thing, and I would base that script on Scrapy.
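For illustration, here is a rough, untested Python sketch of the kind of standalone indexer Charlie describes: walk the files, call out to a separate Tika server for extraction, and post the results to Solr. The ports, the "emails" core name, and the field names are assumptions for the example, not details from this thread.

# Illustrative only: ports, core name, and field names are assumptions,
# not details from this thread.
from pathlib import Path
import requests

TIKA_URL = "http://localhost:9998/tika"                   # standalone Tika server
SOLR_UPDATE = "http://localhost:8983/solr/emails/update"  # hypothetical "emails" core

def extract_text(path: Path) -> str:
    # Send the raw file to the Tika server and get plain text back.
    with path.open("rb") as f:
        resp = requests.put(TIKA_URL, data=f, headers={"Accept": "text/plain"})
    resp.raise_for_status()
    return resp.text

def post_to_solr(docs: list) -> None:
    # Push a batch of documents to Solr's JSON update handler.
    resp = requests.post(SOLR_UPDATE, json=docs, params={"commit": "true"})
    resp.raise_for_status()

def index_folder(folder: str) -> None:
    docs = []
    for path in Path(folder).glob("*.eml"):
        docs.append({"id": str(path), "content": extract_text(path)})
        if len(docs) >= 100:          # post in modest batches
            post_to_solr(docs)
            docs = []
    if docs:
        post_to_solr(docs)

if __name__ == "__main__":
    index_folder("/path/to/eml/folder")

Scrapy would add scheduling, retries, and item pipelines on top of the same basic loop.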
Ah OK. I haven't used SimplePostTool myself and I note the docs say
"View this not as a best-practice code example, but as a standalone
example built with an explicit purpose of not having external jar
dependencies."
I'm wondering if it's some kind of synchronisation issue between new
files a
Hi Charlie,
The main code that is doing the indexing is from Solr's SimplePostTool, but we have made some modifications to it.
Walking through the folder is done by a PowerShell script, the extraction of the content from the .eml files is done by the Tika that comes with Solr, and the images in the .eml
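As a rough illustration of that flow, a minimal, untested Python sketch of posting a single .eml file to Solr's /update/extract handler (the Tika that ships with Solr) might look like the following; the core name, port, and the choice of the file name as the unique key are assumptions for the example, not details of the setup described above.

# Minimal sketch: core name, port, and the use of the file name as the
# unique key are assumptions, not the actual setup from this thread.
from pathlib import Path
import requests

SOLR_EXTRACT = "http://localhost:8983/solr/emails/update/extract"

def post_eml(path: Path) -> None:
    # Let Solr's ExtractingRequestHandler (embedded Tika) parse the .eml file.
    with path.open("rb") as f:
        resp = requests.post(
            SOLR_EXTRACT,
            params={"literal.id": path.name, "commit": "false"},
            files={"file": (path.name, f, "message/rfc822")},
        )
    resp.raise_for_status()

Committing once per batch rather than per file is usually kinder to Solr when there are millions of documents.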
Hi Edwin,
What code is actually doing the indexing? AFAIK Solr doesn't include any
code for actually walking a folder, extracting the content from .eml
files and pushing this data into its index, so I'm guessing you've built
something external?
Charlie
On 01/06/2020 02:13, Zheng Lin Edwin
Hi,
I am running this on Solr 7.6.0.
Currently I have a situation whereby there are more than 2 million EML files in a folder, and the folder is constantly being updated: existing EML files are updated with the latest information and new EML files are added.
When I do the indexing, it is supposed to index the new EML files, a