1. About the commit strategy: all the ExtractingRequestHandler (the request
handler that uses Tika to extract content from the input file) does is
extract the content of your file and add it to a SolrInputDocument. Your
commit strategy should not change because of this, compared to any other
documents you might be indexing. Committing on every new or updated
document is usually not recommended; see the sketch below.
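For example, here is a minimal sketch (the file names and ids are made up)
of indexing a batch of attachments with commit=false and issuing a single
explicit commit at the end:

<?php
// Hypothetical batch of attachments: Solr id => local file path.
$files = array('doc1' => 'paper1.pdf', 'doc2' => 'paper2.pdf');

foreach ($files as $id => $file) {
    // Index each file without committing.
    $ch = curl_init('http://localhost:8010/solr/update/extract'
        . '?literal.id=' . urlencode($id) . '&commit=false');
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_POSTFIELDS, array('myfile' => '@' . $file));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_exec($ch);
    curl_close($ch);
}

// One commit after the whole batch.
$ch = curl_init('http://localhost:8010/solr/update?commit=true');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_exec($ch);
curl_close($ch);
?>

You could also configure autoCommit in solrconfig.xml and skip the explicit
commit entirely.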

2. I'm not sure I understand the question. You can add any static fields
you want to the document by adding the "literal." prefix to the field names
when using ExtractingRequestHandler (as you are already doing with
"literal.id"). You can also leave fields empty as long as they are not
marked as "required" in the schema.xml file. See:
http://wiki.apache.org/solr/ExtractingRequestHandler#Literals
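As a sketch, with hypothetical field names (author, category) that would
have to exist in your schema.xml:

<?php
// Each literal.<field> parameter becomes a field on the extracted document.
$url = 'http://localhost:8010/solr/update/extract'
     . '?literal.id=doc2'
     . '&literal.author=' . urlencode('some author')
     . '&literal.category=attachment';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, array('myfile' => '@paper.pdf'));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$result = curl_exec($ch);
curl_close($ch);
?>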

3. Solr cores work almost like completely separate Solr instances, and you
could tell one core to replicate from another, but I don't think that would
help here. If you want to separate the indexing operations from the query
operations, using different machines is usually the better option: configure
the indexing box as the master and the query box as the slave. Here you have
some more information about it:
http://wiki.apache.org/solr/SolrReplication
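Roughly, the solrconfig.xml on each box would look like this (the hostname
and poll interval are placeholders; see the wiki page for the details):

<!-- on the indexing box (master) -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

<!-- on the query box (slave) -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8010/solr/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>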

Were these the answers you were looking for, or did I misunderstand your
questions?

Tomás

On Mon, Jun 6, 2011 at 2:54 AM, Naveen Gupta <nkgiit...@gmail.com> wrote:

> Hi
>
> Since it is PHP, we are using solphp to make curl-based calls.
>
> My concern here is that each user might have 20-40 attachments that need
> to be indexed each day, and there are many users; we are targeting around
> 500-1000 users daily.
>
> Right now, this is what we do:
>
> <?php
> $ch = curl_init('http://localhost:8010/solr/update/extract?literal.id=doc2&commit=true');
> curl_setopt($ch, CURLOPT_POST, 1);
> // the "@" prefix makes curl upload the contents of paper.pdf
> curl_setopt($ch, CURLOPT_POSTFIELDS, array('myfile' => '@paper.pdf'));
> $result = curl_exec($ch);
> curl_close($ch);
> ?>
>
> We are also planning to use other fields which need to be indexed and
> stored.
>
>
> There are a couple of questions here:
>
> 1. What would be the best commit strategy? If we take all the documents in
> an array, iterate over them one by one firing the curl request, and commit
> only for the last document, will that work, or do we need to commit for
> each document?
>
> 2. We have several fields already defined in the schema, and a few of
> them were required earlier, but for this purpose we don't want them to be
> required. How can we have both requirements in the same schema?
>
> 3. Since commits are frequent, how can we use Solr multicore to separate
> write and read operations?
>
> Thanks
> Naveen
>
