Re: Solr 7.7: Using Tika in Production

Erick Erickson Wed, 29 Jan 2020 05:51:10 -0800

I doubt that’d work. When Solr gets an update, it forwards the document to the 
leader of the shard it’s going to eventually reside on. Among other things, the 
Solr node hosting no replicas would need to go to ZK and pull down the config 
you've created for Tika to know what to do. There’s no technical reason this 
couldn’t be done, but I’m 99.9% certain nobody has, especially since running 
Tika inside Solr is intended for PoC purposes rather than production.


The article you linked to has some SolrJ code, which is usually a better idea. 
Alternatively, run Tika in server mode and send the extracted text to Solr.
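
For the server-mode route, here is a rough sketch of what the client side could
look like. This is only an illustration, not tested against your cluster. It
assumes Tika Server is listening on its default port 9998, that Solr is
reachable at localhost:8983, and that the collection is called "documents"
with a "content_txt" field. Those names and URLs are placeholders, and the
helper functions are made up for this example. Extraction happens entirely
outside Solr, so a bad PDF can only take down the Tika process.

```python
import json
import urllib.request

# All three values below are assumptions for illustration; adjust to your setup.
TIKA_URL = "http://localhost:9998/tika"   # Tika Server endpoint (9998 is its default port)
SOLR_URL = "http://localhost:8983/solr"   # Solr base URL
COLLECTION = "documents"                  # hypothetical collection name


def build_tika_request(data: bytes, tika_url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request that sends raw file bytes to Tika Server
    and asks for the extracted content as plain text."""
    return urllib.request.Request(
        tika_url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def build_solr_request(
    doc: dict, solr_url: str = SOLR_URL, collection: str = COLLECTION
) -> urllib.request.Request:
    """Build a POST request that sends one already-extracted document
    to Solr's JSON update handler."""
    body = json.dumps([doc]).encode("utf-8")
    return urllib.request.Request(
        f"{solr_url}/{collection}/update?commitWithin=10000",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def index_file(path: str, doc_id: str) -> None:
    """Extract text with Tika Server, then index the plain text in Solr."""
    with open(path, "rb") as fh:
        raw = fh.read()

    # Step 1: extraction happens in the Tika process, not in Solr.
    with urllib.request.urlopen(build_tika_request(raw), timeout=60) as resp:
        text = resp.read().decode("utf-8", errors="replace")

    # Step 2: Solr only ever sees plain text. The field name is an assumption.
    doc = {"id": doc_id, "content_txt": text}
    with urllib.request.urlopen(build_solr_request(doc), timeout=60) as resp:
        resp.read()
```

You would call something like index_file("report.pdf", "report-1") from
whatever job feeds your documents. SolrCloud then routes the plain document to
the right shard leader as usual, so the question of where extraction runs
inside the cluster goes away.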

Best,
Erick

> On Jan 28, 2020, at 5:02 PM, Dustin Lebsock <dustin.lebs...@bentley.com> 
> wrote:
> 
> Hi!
>  
> First off, thank you for the help!
>  
> I’m currently running SolrCloud based off the helm chart found here: 
> https://github.com/helm/charts/tree/master/incubator/solr
>  
> Everything works great but I’d like to now use Tika to start indexing PDFs 
> as well. In the documentation, it’s recommended to not use Solr Cell in a 
> production environment: 
> https://lucene.apache.org/solr/guide/7_7/uploading-data-with-solr-cell-using-apache-tika.html#solr-cell-performance-implications
>  
> So I have been trying to figure out a solution to have a Tika service to 
> extract the contents of the possible files and came up with an idea. I could 
> scale the amount of solr pods, have a dedicated service point to specific 
> solr-pods that do not contain any shards on them and that will only be used 
> for content extraction. That way if content-extraction goes wrong, it doesn’t 
> matter if the pod crashes. However, these nodes will still be connected to 
> ZooKeeper for the entire cluster, that way they may index the file to the 
> correct collection immediately after extraction. I’m not sure if this is how 
> SolrCloud works though.
>  
> If I send an extraction and Index request to a pod that doesn’t contain the 
> specified collection, is it extracted before being sent to the correct pod 
> for indexing? Or is it sent to a pod with the collection and then extracted? 
> If it’s the latter, do you have any advice?
>  
> Thanks for the help! 
>  
> Dustin Pilkington
> Associate Software Engineer
> dustin.pilking...@bentley.com
>  
>
