On Mon, Jun 6, 2011 at 1:47 PM, Naveen Gupta <nkgiit...@gmail.com> wrote:
> Hi Tomas,
>
> 1. Regarding SolrInputDocument,
>
> We are not using the Java client, rather we are using PHP Solr,
> wrapping content in SolrInputDocument. I am not sure how to do this in
> the PHP client? In this case, we need Tika-related jars to obtain the
> metadata such as content .. we certainly don't want to handle all these
> things in the PHP client.

I don't understand, Tika IS integrated in Solr, it doesn't matter which
client or client language you are using. To add a static value, all you
have to do is add it as a request parameter with the prefix "literal",
something like "literal.somefield=thevalue". Content and other file
metadata such as author etc. (see
http://wiki.apache.org/solr/ExtractingRequestHandler#Metadata) will be
added to the document inside Solr and indexed. You don't need to handle
this in the client application.

> Secondly, what I was asking about was the commit strategy --
>
> suppose you have 100 docs
>
> iterate over 99 docs and fire curl without commit in the url
>
> and for the 100th doc, use commit ....
>
> so doing so, will it also update the indexes for the last 99 docs ....
>
> while (up to 99) {
>     curl_command = url without commit;
> }
>
> when i = 100, url would include commit

You can certainly do this. The 100 documents will be available for
search after the commit. None of the documents will be available for
search before the commit.

> i wanted to achieve something similar to an optimize kind of thing ....

The optimize command should be issued when not many queries or updates
are being sent to the index. It uses lots of resources and will slow
down queries.

> why are these kinds of general-purpose use cases not included in the
> example (especially for other languages ... Java guys can easily do
> this using the API)?

They are: you can use the auto-commit feature, configured in the
solrconfig.xml file. You can either tell Solr to commit on a time
interval or when a certain number of documents have been updated but not
committed. In the example file, autocommit is commented out, but you can
uncomment it.
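For reference, the relevant block in the example solrconfig.xml looks
roughly like this (quoting from memory, so the exact default values and
comments in your copy may differ):

  <updateHandler class="solr.DirectUpdateHandler2">
    <!-- Uncomment to commit automatically, either after maxDocs
         uncommitted updates or when the oldest uncommitted update is
         maxTime milliseconds old. -->
    <!--
    <autoCommit>
      <maxDocs>10000</maxDocs>
      <maxTime>1000</maxTime>
    </autoCommit>
    -->
  </updateHandler>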
> I am basically a Java guy, so I can feel the problem.
>
> Thanks
> Naveen
>
> 2011/6/6 Tomás Fernández Löbbe <tomasflo...@gmail.com>
>
> > 1. About the commit strategy: all the ExtractingRequestHandler (the
> > request handler that uses Tika to extract content from the input
> > file) will do is extract the content of your file and add it to a
> > SolrInputDocument. The commit strategy should not change because of
> > this, compared to other documents you might be indexing. It is
> > usually not recommended to commit on every new / updated document.
> >
> > 2. Don't know if I understand the question. You can add all the
> > static fields you want to the document by adding the "literal."
> > prefix to the names of the fields when using ExtractingRequestHandler
> > (as you are doing with "literal.id"). You can also leave fields empty
> > if they are not marked as "required" in the schema.xml file. See:
> > http://wiki.apache.org/solr/ExtractingRequestHandler#Literals
> >
> > 3. Solr cores can work almost as completely different Solr instances.
> > You could tell one core to replicate from another core, but I don't
> > think this would be of any help here. If you want to separate the
> > indexing operations from the query operations, you could probably use
> > different machines; that's usually a better option. Configure the
> > indexing box as master and the query box as slave. Here you have some
> > more information about it:
> > http://wiki.apache.org/solr/SolrReplication
> >
> > Were these the answers you were looking for, or did I misunderstand
> > your questions?
> >
> > Tomás
> >
> > On Mon, Jun 6, 2011 at 2:54 AM, Naveen Gupta <nkgiit...@gmail.com>
> > wrote:
> >
> > > Hi
> > >
> > > Since it is PHP, we are using solphp for making curl-based calls.
> > >
> > > My concern here is that for each user, we might have 20-40
> > > attachments that need to be indexed each day, and there are various
> > > users .. daily we are targeting around 500-1000 users ..
> > >
> > > Right now, this is what we do:
> > >
> > > <?php
> > > $ch = curl_init('http://localhost:8010/solr/update/extract?literal.id=doc2&commit=true');
> > > curl_setopt($ch, CURLOPT_POST, 1);
> > > curl_setopt($ch, CURLOPT_POSTFIELDS, array('myfile' => "@paper.pdf"));
> > > $result = curl_exec($ch);
> > > ?>
> > >
> > > We are also planning to use other fields which are to be indexed
> > > and stored ...
> > >
> > > There are a couple of questions here:
> > >
> > > 1. What would be the best strategy for commits? If we take all the
> > > documents in an array, iterate over them one by one firing the curl
> > > call, and commit only for the last doc, will it work, or do we need
> > > to commit for each doc?
> > >
> > > 2. We have several fields which are already defined in the schema,
> > > and a few of them were required earlier, but for this purpose we
> > > don't want them required. How do we have the two requirements
> > > together in the same schema?
> > >
> > > 3. Since commits are frequent, how do we use Solr multicore for
> > > write and read operations separately?
> > >
> > > Thanks
> > > Naveen
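To make the commit-on-last-document idea concrete, here is a rough,
untested PHP sketch based on the curl snippet above. The file list, the
"literal.author" field and the URL are placeholders, not something from
your actual setup; adapt them to your schema:

<?php
// Index a batch of files through the ExtractingRequestHandler and ask
// for a commit only on the last document of the batch. Placeholder
// data -- replace with your real files and field values.
$files = array('paper1.pdf', 'paper2.pdf', 'paper3.pdf');
$total = count($files);

foreach ($files as $i => $file) {
    // Only request a commit for the last document.
    $commit = ($i == $total - 1) ? 'true' : 'false';

    $url = 'http://localhost:8010/solr/update/extract'
         . '?literal.id=doc' . $i          // unique id per document
         . '&literal.author=someauthor'    // static field via "literal."
         . '&commit=' . $commit;

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_POST, 1);
    // Old-style curl file upload: the "@" prefix sends the file contents.
    curl_setopt($ch, CURLOPT_POSTFIELDS, array('myfile' => '@' . $file));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $result = curl_exec($ch);
    curl_close($ch);
}
?>

This way Solr performs a single commit for the whole batch, and all of
the documents become searchable at that point, which is much cheaper
than committing once per document.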