Re: Solr related questions

Shawn Heisey Mon, 16 Oct 2017 15:24:54 -0700

On 10/13/2017 5:50 AM, startrekfan wrote:
> Thank you for your answer.
>
> To 3.)
> The file is on server A, my program is on server B, and Solr is on server
> C. If I use a normal HTTP (REST) POST, my program has to fetch the file
> content from server A to server B and then post it from server B to server
> C, as there is no open connection between A and C. So the file has to be
> transmitted twice.
> Is there a way to tell Solr to read the file _directly_ from server A (e.g.
> via SMB)?


What exactly is in a "file" in this situation, and what does your
service do with that file in order to decide what information gets sent
to Solr?  That information is vital to figuring out whether you can do
what you want to do.

If your service does not have business-specific logic, and the files on
your server are more generic, Solr does have the ability to "directly"
index rich documents such as PDF and Word files.  Typically the file is
still sent to Solr even with that functionality.  I think there are ways
to have Solr fetch the file itself, but I have no idea what kind of
fetching is supported.
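
To illustrate what "still sent to Solr" means in practice, here is a
minimal Python sketch that builds the HTTP request a client would send
to the request handler Solr provides for this, which the example
configurations mount at /update/extract.  The host, the collection name
"mycollection", the id "doc1", and the helper name
build_extract_request are all assumptions made up for the example, and
the handler path may differ in your solrconfig.xml.

```python
import urllib.parse
import urllib.request


def build_extract_request(solr_base, collection, doc_id, file_bytes,
                          content_type="application/octet-stream"):
    """Build (but do not send) a POST to Solr's extracting handler.

    The file content travels in the request body, so the client still
    has to read the file and transmit it to Solr.
    """
    params = urllib.parse.urlencode({
        "literal.id": doc_id,  # value for the id field of the new document
        "commit": "true",      # immediate commit; convenient for a demo only
    })
    # Assumed handler path; check the requestHandler name in solrconfig.xml.
    url = f"{solr_base}/{collection}/update/extract?{params}"
    return urllib.request.Request(
        url,
        data=file_bytes,
        headers={"Content-Type": content_type},
        method="POST",
    )
```

To actually send it, you would pass the returned object to
urllib.request.urlopen() after reading the file bytes on the machine
where your program runs.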

There is one major issue with using that ability, which is called the
Extracting Request Handler.  It relies on another piece of Apache
software called Tika.  Because the exact structure of the document
formats that Tika supports can change subtly, and not all of those
formats are fully documented, Tika has a habit of exploding when it
encounters something that its authors have never seen before.  If Tika
is running inside Solr when it explodes, that explosion can take down
the entire Solr process.  For that reason, we do not actually recommend
running that functionality inside Solr, but rather in an external
program that extracts the information and sends it to Solr.
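
A rough sketch of such an external program, in Python, might look like
the following.  It assumes a standalone Tika server is listening on
localhost:9998 and that the Solr collection has a text field named
"content"; the URLs, the collection name, the field name, and the
function names are all hypothetical and would need adjusting.  It is
only meant to show the shape of the approach, not a finished indexer.

```python
import json
import urllib.request

# Assumed addresses; change them to match your installation.
TIKA_URL = "http://localhost:9998/tika"
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycollection/update?commit=true"


def extract_text(file_bytes, timeout=60):
    """Ask a separate Tika server process for the plain text of a file."""
    req = urllib.request.Request(
        TIKA_URL, data=file_bytes,
        headers={"Accept": "text/plain"}, method="PUT")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def build_solr_doc(doc_id, text):
    """Map extracted text onto Solr fields ('content' is a made-up field)."""
    return {"id": doc_id, "content": text.strip()}


def build_solr_request(docs):
    """Build a JSON update request carrying a list of documents."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        SOLR_UPDATE_URL, data=body,
        headers={"Content-Type": "application/json"}, method="POST")


def index_file(doc_id, file_bytes):
    """Extract outside Solr, then send only the extracted fields to Solr.

    Returns False when extraction fails, so a problem document is
    skipped and the failure stays outside the Solr process.
    """
    try:
        text = extract_text(file_bytes)
    except OSError as exc:  # urllib's URLError and timeouts are OSErrors
        print(f"extraction failed for {doc_id}: {exc}")
        return False
    request = build_solr_request([build_solr_doc(doc_id, text)])
    with urllib.request.urlopen(request) as resp:
        return resp.status == 200
```

The point of the split is that if Tika blows up on a strange document,
only the extraction step fails; Solr never sees the file and keeps
running.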

The Tika authors do take such explosions seriously, and they do try to
fix those problems when they are encountered.  It is impossible for the
Tika project to prevent such problems from occurring, because there will
always be documents produced that contain data formats that they've
never seen before.

Generally speaking, if you already have a well-tested way of extracting
information from files and sending it to Solr, the recommendation is
that you stick with that software, rather than try to get Solr to
directly index your files.

Thanks,
Shawn
