Solr 7.7: Using Tika in Production

Dustin Lebsock Tue, 28 Jan 2020 14:03:20 -0800

Hi!

First off, thank you for the help!

I'm currently running SolrCloud based off the helm chart found here:
https://github.com/helm/charts/tree/master/incubator/solr

Everything works great, but I'd now like to use Tika to start indexing PDFs as
well. The documentation recommends against using Solr Cell in a
production environment:
https://lucene.apache.org/solr/guide/7_7/uploading-data-with-solr-cell-using-apache-tika.html#solr-cell-performance-implications

So I have been trying to figure out how to set up a Tika service to extract
the contents of incoming files, and came up with an idea. I could scale up the
number of Solr pods and point a dedicated service at specific Solr pods that
do not host any shards and are used only for content extraction. That way, if
content extraction goes wrong, it doesn't matter if the pod crashes. These
nodes would still be connected to ZooKeeper for the entire cluster, so that
they can index each file into the correct collection immediately after
extraction. I'm not sure if this is how SolrCloud works, though.

If I send an extract-and-index request to a pod that doesn't host the
specified collection, is the content extracted on that pod before being
forwarded to the correct pod for indexing? Or is the request forwarded to a
pod that hosts the collection and extracted there? If it's the latter, do you
have any advice?
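
For concreteness, the request I have in mind would look roughly like the
sketch below. The service name `solr-extract`, the collection name, and the
document ID are placeholders made up for illustration; `/update/extract` and
`literal.id` are the Solr Cell handler and parameter described in the guide
linked above. This is only a sketch of the idea, not a tested setup.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical Kubernetes service that fronts only the shard-free
# extraction pods (the name is an assumption, not part of the helm chart).
EXTRACT_SERVICE = "http://solr-extract:8983"


def build_extract_url(base_url, collection, doc_id, commit=True):
    """Build a Solr Cell (/update/extract) URL for a single document."""
    params = {"literal.id": doc_id, "commit": "true" if commit else "false"}
    return f"{base_url}/solr/{collection}/update/extract?{urlencode(params)}"


def send_pdf(path, collection, doc_id, base_url=EXTRACT_SERVICE):
    """POST the raw PDF bytes to the extraction service; return the HTTP status."""
    with open(path, "rb") as fh:
        body = fh.read()
    req = Request(
        build_extract_url(base_url, collection, doc_id),
        data=body,
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )
    with urlopen(req) as resp:
        return resp.status
```

A call such as `send_pdf("report.pdf", "docs", "doc1")` would then go to
whichever pod the dedicated service selects, which is exactly the case I am
asking about.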

Thanks for the help!

Dustin Pilkington
Associate Software Engineer
dustin.pilking...@bentley.com
