On 10/13/2017 5:50 AM, startrekfan wrote:
> Thank you for your answer.
>
> To 3.)
> The file is on server A, my program is on server B and solr is on server
> C. If I use a normal http(rest) post, my program has to fetch the file
> content from server A to Server B and then post it from server B to server
> C as there is no open connection between A and C. So the file has to be
> transmitted two times.
> Is there a way to tell solr to read the file _directly_ from Server A (e.g.
> via SMB)

What exactly is in a "file" in this situation, and what does your service do with that file in order to decide what information gets sent to Solr? The answers will be vital to figuring out whether you can do what you want to do.

If your service does not have business-specific logic, and the files on your server are more generic, Solr does have the ability to "directly" index rich-text documents like PDF, Word, etc. Typically the file is still sent to Solr even with that functionality. I think there are ways to have Solr fetch the file itself, but I have no idea what kind of fetching is supported.

There is one major issue with using that ability, which is called the Extracting Request Handler. It relies on another piece of Apache software called Tika. Because the exact structure of the document formats that Tika supports can change subtly, and not all of those formats are fully documented, Tika has a habit of exploding when it encounters something that its authors have never seen before. If Tika is running inside Solr when it explodes, that explosion can take down the entire Solr process. For that reason, we do not actually recommend running that functionality inside Solr, but rather in an external program that extracts the information and sends it to Solr.

The Tika authors do take such explosions seriously, and they do try to fix those problems when they are encountered. It is impossible for the Tika project to prevent such problems entirely, because there will always be documents that contain data formats they have never seen before.

Generally speaking, if you already have a well-tested way of extracting information from files and sending it to Solr, the recommendation is that you stick with that software, rather than try to get Solr to directly index your files.

Thanks,
Shawn
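
P.S. In case it helps, here is a rough sketch of the "external program" approach: extraction runs in your own process, one file at a time, and only the extracted text is posted to Solr's JSON update endpoint. This is an illustration under assumptions, not a tested indexer. The `extract_text` callable stands in for whatever extraction you already have (Tika or otherwise), and the field name `content_txt`, the collection name, and the base URL are all hypothetical.

```python
import json
import urllib.request


def build_solr_doc(doc_id, text, extra_fields=None):
    """Build one Solr document. The field names here are hypothetical."""
    doc = {"id": doc_id, "content_txt": text}
    doc.update(extra_fields or {})
    return doc


def extract_all(paths, extract_text):
    """Run extraction file by file, recording failures instead of aborting.

    extract_text is a caller-supplied function (an assumption of this
    sketch). A file it cannot parse only fails here, in the indexing
    program, and never inside the Solr process.
    """
    docs, failures = [], []
    for path in paths:
        try:
            docs.append(build_solr_doc(path, extract_text(path)))
        except Exception as exc:
            failures.append((path, str(exc)))
    return docs, failures


def build_update_request(solr_base_url, collection, docs):
    """Prepare a POST to <base>/<collection>/update with a JSON document list."""
    url = f"{solr_base_url.rstrip('/')}/{collection}/update?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To actually send the documents you would pass the prepared request to `urllib.request.urlopen`. Anything that lands in `failures` can be logged and retried later without Solr ever having seen the problem file.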