Hi!

First off, thank you for the help!

I'm currently running SolrCloud using the Helm chart found here: 
https://github.com/helm/charts/tree/master/incubator/solr

Everything works great, but I'd now like to use Tika to start indexing PDFs as 
well. The documentation recommends against using Solr Cell in a 
production environment: 
https://lucene.apache.org/solr/guide/7_7/uploading-data-with-solr-cell-using-apache-tika.html#solr-cell-performance-implications

So I have been trying to figure out how to run Tika as a separate service that 
extracts the contents of these files, and came up with an idea. I could 
scale up the number of Solr pods and have a dedicated service point to specific 
Solr pods that do not host any shards and are used only for 
content extraction. That way, if content extraction goes wrong, it doesn't 
matter if the pod crashes. These nodes would still be connected to 
ZooKeeper for the entire cluster, so they could index the file into the 
correct collection immediately after extraction. I'm not sure if this is how 
SolrCloud works, though.

If I send an extract-and-index request to a pod that doesn't host the 
specified collection, is the content extracted on that pod before being sent 
to the correct pod for indexing? Or is the request forwarded to a pod that 
hosts the collection and extracted there? If it's the latter, do you have any 
advice?
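
To make the question concrete, here is a minimal Python sketch of the kind of 
request I mean. It assumes the default Solr Cell handler path 
(/update/extract); the service name "solr-extract", the collection name 
"docs", and the helper names are hypothetical placeholders, not part of the 
Helm chart.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen


def build_extract_request(base_url, collection, doc_id, pdf_bytes):
    """Build a POST to Solr Cell's /update/extract handler for one PDF.

    base_url and collection are placeholders for whatever the dedicated
    extraction service and target collection are actually called.
    """
    # literal.id sets the document's unique key; commit=true makes it
    # searchable right away (fine for a test, costly under real load).
    params = urlencode({"literal.id": doc_id, "commit": "true"})
    url = f"{base_url.rstrip('/')}/solr/{quote(collection)}/update/extract?{params}"
    return Request(
        url,
        data=pdf_bytes,
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )


def send(request, timeout=30):
    """Send the request and return the HTTP status code."""
    with urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Only builds and prints the request; call send(req) to actually post it.
    req = build_extract_request(
        "http://solr-extract:8983", "docs", "doc1", b"%PDF-1.4 placeholder"
    )
    print(req.get_method(), req.full_url)
```

The base URL here would be the dedicated service that fronts only the 
shard-less pods, so the open question is which node ends up running Tika for 
a request like this.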

Thanks for the help!

Dustin Pilkington
Associate Software Engineer
dustin.pilking...@bentley.com
