Hi Tomas,

1. Regarding SolrInputDocument,

We are not using the Java client; we are using the PHP Solr client, and I am
not sure how to wrap content in a SolrInputDocument from PHP. In that case we
would need the Tika-related jars to get at metadata such as the content, and
we certainly don't want to handle all of that in the PHP client.

Secondly, what I was asking about the commit strategy:

Suppose you have 100 docs. We iterate over the first 99 docs and fire curl
without commit in the URL, and for the 100th doc we add commit=true.

Will doing so also update the index for the first 99 docs?

while (i <= 99) {
    curl_command = url without commit;
}

when i == 100, the url includes commit=true

I wanted to achieve something similar to an optimize.
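
In code, what I have in mind is something like this (just a sketch; the file
list and ids are made up, and the URL is the same one from the earlier mail):

<?php
// Hypothetical list of attachments for one user
$files = array('paper1.pdf', 'paper2.pdf', 'paper3.pdf');
$base  = 'http://localhost:8010/solr/update/extract';
$last  = count($files) - 1;

foreach ($files as $i => $file) {
    $url = $base . '?literal.id=doc' . $i;
    if ($i == $last) {
        // ask for a commit only on the very last document
        $url .= '&commit=true';
    }
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_POSTFIELDS, array('myfile' => '@' . $file));
    $result = curl_exec($ch);
    curl_close($ch);
}
?>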

Why aren't these kinds of general-purpose use cases included in the examples
(especially for other languages; Java folks can easily do this using the API)?

I am basically a Java guy, so I can feel the problem.

Thanks
Naveen
2011/6/6 Tomás Fernández Löbbe <tomasflo...@gmail.com>

> 1. About the commit strategy, all the ExtractingRequestHandler (request
> handler that uses Tika to extract content from the input file) will do is
> extract the content of your file and add it to a SolrInputDocument. The
> commit strategy should not change because of this, compared to other
> documents you might be indexing. It is usually not recommended to commit on
> every new / updated document.
>
> 2. I don't know if I understand the question. You can add all the static
> fields you want to the document by adding the "literal." prefix to the name
> of the fields when using the ExtractingRequestHandler (as you are doing with
> "literal.id"). You can also leave fields empty if they are not marked as
> "required" in the schema.xml file. See:
> http://wiki.apache.org/solr/ExtractingRequestHandler#Literals
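>
> For example (just a sketch; the "author" and "doctype" field names are
> invented, they would need to exist in your schema or match a dynamic field):
>
> http://localhost:8010/solr/update/extract?literal.id=doc2&literal.author=naveen&literal.doctype=attachment&commit=true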
>
> 3. Solr cores can work almost like completely separate Solr instances. You
> could tell one core to replicate from another core, but I don't think that
> would be of any help here. If you want to separate the indexing operations
> from the query operations, you could probably use different machines; that's
> usually a better option. Configure the indexing box as master and the query
> box as slave. Here you have some more information about it:
> http://wiki.apache.org/solr/SolrReplication
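>
> On the PHP side that would simply mean pointing updates at the master box
> and queries at the slave box, something like (a sketch, the host names are
> made up):
>
> $master = 'http://index-box:8010/solr/update/extract?literal.id=doc2&commit=true';
> $slave  = 'http://query-box:8010/solr/select?q=some+query';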
>
> Were these the answers you were looking for, or did I misunderstand your
> questions?
>
> Tomás
>
> On Mon, Jun 6, 2011 at 2:54 AM, Naveen Gupta <nkgiit...@gmail.com> wrote:
>
> > Hi
> >
> > Since it is PHP, we are using solphp for making curl-based calls.
> >
> > My concern here is that for each user we might have 20-40 attachments
> > that need to be indexed each day, and there are many users; daily we are
> > targeting around 500-1000 users.
> >
> > Right now, this is what we do:
> >
> > <?php
> > $ch = curl_init('http://localhost:8010/solr/update/extract?literal.id=doc2&commit=true');
> >  curl_setopt ($ch, CURLOPT_POST, 1);
> >  curl_setopt ($ch, CURLOPT_POSTFIELDS, array('myfile'=>"@paper.pdf"));
> >  $result= curl_exec ($ch);
> > ?>
> >
> > We are also planning to use other fields which need to be indexed and
> > stored.
> >
> >
> > There are couple of questions here
> >
> > 1. What would be the best commit strategy? If we take all the documents
> > in an array, iterate over them one by one and fire curl, and only commit
> > for the last doc, will it work, or do we need to commit for each doc?
> >
> > 2. We have several fields which are already defined in the schema, and a
> > few of them were required earlier, but for this purpose we don't want them
> > to be required. How can we have both requirements together in the same
> > schema?
> >
> > 3. Since commits are frequent, how can we use Solr multicore to separate
> > write and read operations?
> >
> > Thanks
> > Naveen
> >
>
