date:20081212

[Changing subject accordingly ] .

Thanks Noble.

I grabbed one of the nightlies at -
http://people.apache.org/builds/lucene/solr/nightly/ .

I could not find the DataImportHandler in the same.  May be I am
missing something about the sources of DataImportHandler.

Can somebody suggest on where to find the same.

On Thu, Dec 11, 2008 at 12:49 AM, Noble Paul നോബിള്‍ नोब्ळ्
 wrote:
> On Wed, Dec 10, 2008 at 11:00 PM, Rakesh Sinha  
> wrote:
>> Hi -
>>  I am a new user of Solr tool  and came across the introductory
>> tutorial here - http://lucene.apache.org/solr/tutorial.html  .
>> I am planning to use Solr in one of my projects . I see that the
>> tutorial mentions about a REST api / interface to add documents and to
>> query the same.
>>
>> I would like to create  the indices locally , where the web server (or
>> pool of servers ) will have access to the database directly , but use
>> the query REST api to query for the results.
>
> If your data resides in DB consider using DIH.
> http://wiki.apache.org/solr/DataImportHandler
>
>>
>>  I am curious how this could be possible without taking the http rest
>> api submission to add to indices. (For the sake of simplicity - we can
>> assume it would be just one node to store the index but multiple
>> readers / query machines that could potentially connect to the solr
>> web service and retrieve the query results. Also the index might be
>> locally present in the same machine as that of the Solr host or at
>> least accessible through NFS etc. )
> I guess you are thinking of using a master/slave setup.
> see this http://wiki.apache.org/solr/CollectionDistribution
> or http://wiki.apache.org/solr/SolrReplication
>
>
>>
>> Thanks for helping out to some starting pointers regarding the same.
>>
>
>
>
> --
> --Noble Paul
>

Re: Taxonomy Support on Solr

2008-12-12 Thread Walter Underwood

I designed and built the taxonomy and classification support
in the Ultraseek search engine.

There are many kinds of taxonomies, even different "shapes":
tree, DAG, facets, tree + links (e.g. ANSI/NISO Z39.19, LCSH,
Yahoo directory), and even mixtures of those.

It would be a serious limitation to support only one. It would
be a mess to try and support all.

Luckily, it works very well to keep the classification separate
and tag the documents with category membership. Building a long
string is not hard.

One feature that is very useful is to update the category tag
after the document has been indexed. We ran into that again
and again when implementing taxonomies at Verity.

For example, you can build a nearest neighbor classifier using
More Like This, but you need to index and commit the doc before
you run an MLT search and discover the category. Then you need
to delete it and reindex with the category tag.

wunder

On 12/12/08 5:16 AM, "Jana, Kumar Raja"  wrote:

> Thanks all. This workaround was very helpful for my case.
> 
> However, it would be wonderful if there was a way to make Solr have a
> copy of my classification so that I need not create a big string at the
> client side everytime I need to index a document. I am sure there are
> many others out there who do the following on the client side burdening
> their already overburdened client servers:
> 
> Here's something I would love to see on Solr if possible:
> 1. A way to send my classification to Solr.
> a. The classification has nodes which have their own fields (and
> default values)
> b. Images or static files of such sort associated with the nodes
> 2. Update the classification everytime I want to change it.
> 
> During indexing of documents:
> 1. Give the name of the node my document belongs to, say a special field
> named, . Solr would index the document just as any
> other document or do some optimized indexing to return results faster.
> 
> During Search:
> 1. Solr matches my query with and checks if any of the documents being
> returned are classified.
> 2. If a classified document is found, return all the documents which are
> classified with the child nodes (as well as with the document's node -
> this can be optional)
> 
> 
> I feel taxonomy support would be a welcome feature in Solr. What do the
> developers say?
> 
> -Kumar
> 
> -Original Message-
> From: Alexander Ramos Jardim [mailto:alexander.ramos.jar...@gmail.com]
> Sent: Friday, December 12, 2008 12:04 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Taxonomy Support on Solr
> 
> I use this workaround all the time.
> 
> When I need to put the hierarchy which a product belongs, I simply
> arranje all the nodes as: "a ^ b ^ c ^ d"
> 
> 2008/12/11 Otis Gospodnetic 
> 
>> This is what Hoss was hinting at yesterday (or was that on the Lucene
>> list?).  You can do that if you encode the hierarchy in a field
>> properly., e.g. "/A /B /1"  may be one doc's field. "/A /B /2" may be
>> another doc's field.  THen you just have to figure out how to query
>> that to get a sub-tree.
>> 
>> 
>> Otis
>> --
>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>> 
>> 
>> 
>> - Original Message 
>>> From: "Jana, Kumar Raja" 
>>> To: solr-user@lucene.apache.org
>>> Sent: Thursday, December 11, 2008 5:03:02 AM
>>> Subject: Taxonomy Support on Solr
>>> 
>>> Hi,
>>> 
>>> Any plans of supporting user-defined classifications on Solr? Is
>>> there any component which returns all the children of a node (till
>>> the leaf
>>> node) when I search for any node?
>>> 
>>> May be this would help:
>>> 
>>> Say I have a few SolrDocuments classified as:
>>> 
>>> A
>>> B--C
>>> 123  8--9
>>> 
>>> (I.e A has 2 child nodes B and C. B has 3 child nodes 1,2,3 and C
>>> has 2 child nodes 8,9) When my search criteria matches B, my results
> 
>>> should contain B as well as 1,2 and 3 too.
>>> Search for A would return all the nodes mentioned above.
>>> 
>>> -Kumar
>> 
>> 
> 
> 
> --
> Alexander Ramos Jardim

Re: not string or text fields and shards

2008-12-12 Thread Ian Connor

That problem related to the score being computed when it was not in the
sort:

https://issues.apache.org/jira/browse/SOLR-626

To see if this is also your problem, you can make the score in the sort to
see if that stops the error. If you have the same issue, you can take the
patch and compile your own until it makes it into a stable build.

However, if you just have a bad field - this may not fix your problem and
your only choice is to clean up the data (think rebuild index).

On Sun, Nov 23, 2008 at 8:56 PM, Yonik Seeley  wrote:

> On Thu, Nov 20, 2008 at 7:41 AM, Marc Sturlese 
> wrote:
> > I have started working with an index divided in 3 shards. When I did a
> > distributed search I got an error with the fields that were not string or
> > text. I read that the error was due to BinaryResponseWriter and not
> > string/text empty fields.
>
> I think it's more the case that if you have an invalid field value, it
> could blow up at different points in different code paths.  The root
> cause is still an invalid value in the field.
>
> -Yonik
>

-- 
Regards,

Ian Connor
1 Leighton St #605
Cambridge, MA 02141
Direct Line: +1 (978) 672
Call Center Phone: +1 (714) 239 3875 (24 hrs)
Mobile Phone: +1 (312) 218 3209
Fax: +1(770) 818 5697
Suisse Phone: +41 (0) 22 548 1664
Skype: ian.connor

Re: Solr 1.3 - DataInputHandler DIH integration

Ooops . Sorry - Never mind - they are present under contrib directory.

/opt/programs/solr $ find contrib -name *.java | grep Handler
contrib/dataimporthandler/src/main/java/org/apache/solr/handler/dataimport/DataImportHandlerException.java
contrib/dataimporthandler/src/main/java/org/apache/solr/handler/dataimport/AbstractDataImportHandlerTest.java
contrib/dataimporthandler/src/main/java/org/apache/solr/handler/dataimport/DataImportHandler.java
contrib/extraction/src/main/java/org/apache/solr/handler/SolrContentHandler.java
contrib/extraction/src/main/java/org/apache/solr/handler/ExtractingRequestHandler.java
contrib/extraction/src/main/java/org/apache/solr/handler/SolrContentHandlerFactory.java
contrib/extraction/src/test/java/org/apache/solr/handler/ExtractingRequestHandlerTest.java



On Fri, Dec 12, 2008 at 10:40 AM, Rakesh Sinha  wrote:
> [Changing subject accordingly ] .
>
> Thanks Noble.
>
> I grabbed one of the nightlies at -
> http://people.apache.org/builds/lucene/solr/nightly/ .
>
> I could not find the DataImportHandler in the same.  May be I am
> missing something about the sources of DataImportHandler.
>
> Can somebody suggest on where to find the same.
>
> On Thu, Dec 11, 2008 at 12:49 AM, Noble Paul നോബിള്‍ नोब्ळ्
>  wrote:
>> On Wed, Dec 10, 2008 at 11:00 PM, Rakesh Sinha  
>> wrote:
>>> Hi -
>>>  I am a new user of Solr tool  and came across the introductory
>>> tutorial here - http://lucene.apache.org/solr/tutorial.html  .
>>> I am planning to use Solr in one of my projects . I see that the
>>> tutorial mentions about a REST api / interface to add documents and to
>>> query the same.
>>>
>>> I would like to create  the indices locally , where the web server (or
>>> pool of servers ) will have access to the database directly , but use
>>> the query REST api to query for the results.
>>
>> If your data resides in DB consider using DIH.
>> http://wiki.apache.org/solr/DataImportHandler
>>
>>>
>>>  I am curious how this could be possible without taking the http rest
>>> api submission to add to indices. (For the sake of simplicity - we can
>>> assume it would be just one node to store the index but multiple
>>> readers / query machines that could potentially connect to the solr
>>> web service and retrieve the query results. Also the index might be
>>> locally present in the same machine as that of the Solr host or at
>>> least accessible through NFS etc. )
>> I guess you are thinking of using a master/slave setup.
>> see this http://wiki.apache.org/solr/CollectionDistribution
>> or http://wiki.apache.org/solr/SolrReplication
>>
>>
>>>
>>> Thanks for helping out to some starting pointers regarding the same.
>>>
>>
>>
>>
>> --
>> --Noble Paul
>>
>

Re: Query Performance while updating teh index


Hey Otis,

Do you think our problem is slow warm time, or too few items that are being
copied?

Oleg

-- 
View this message in context: 
http://www.nabble.com/Query-Performance-while-updating-the-index-tp20452835p20980523.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Query Performance while updating teh index


Here’s what we have on one of the data slaves for the autowarming.

 

--

Dec 12, 2008 8:46:02 AM org.apache.solr.search.SolrIndexSearcher warm

INFO: autowarming searc...@3f32ca2b main from searc...@443ad545 main

   
filterCache{lookups=351993,hits=347055,hitratio=0.98,inserts=8332,evictions=0,size=8245,warmupTime=215,cumulative_lookups=2837676,cumulative_hits=2766551,cumulative_hitratio=0.97,cumulative_inserts=72050,cumulative_evictions=0}

Dec 12, 2008 8:46:02 AM org.apache.solr.search.SolrIndexSearcher warm

INFO: autowarming result for searc...@3f32ca2b main

   
filterCache{lookups=0,hits=0,hitratio=0.00,inserts=1000,evictions=0,size=1000,warmupTime=317,cumulative_lookups=2837676,cumulative_hits=2766551,cumulative_hitratio=0.97,cumulative_inserts=72050,cumulative_evictions=0}

Dec 12, 2008 8:46:02 AM org.apache.solr.search.SolrIndexSearcher warm

INFO: autowarming searc...@3f32ca2b main from searc...@443ad545 main

   
queryResultCache{lookups=5309,hits=5223,hitratio=0.98,inserts=422,evictions=0,size=421,warmupTime=4628,cumulative_lookups=77802,cumulative_hits=77216,cumulative_hitratio=0.99,cumulative_inserts=424,cumulative_evictions=0}

--

Dec 12, 2008 8:46:07 AM org.apache.solr.search.SolrIndexSearcher warm

INFO: autowarming result for searc...@3f32ca2b main

   
queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=421,evictions=0,size=421,warmupTime=5536,cumulative_lookups=77804,cumulative_hits=77218,cumulative_hitratio=0.99,cumulative_inserts=424,cumulative_evictions=0}

Dec 12, 2008 8:46:07 AM org.apache.solr.search.SolrIndexSearcher warm

INFO: autowarming searc...@3f32ca2b main from searc...@443ad545 main

   
documentCache{lookups=87216,hits=86686,hitratio=0.99,inserts=570,evictions=0,size=570,warmupTime=0,cumulative_lookups=1270773,cumulative_hits=1268318,cumulative_hitratio=0.99,cumulative_inserts=2455,cumulative_evictions=0}

Dec 12, 2008 8:46:07 AM org.apache.solr.search.SolrIndexSearcher warm

INFO: autowarming result for searc...@3f32ca2b main

   
documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=1270773,cumulative_hits=1268318,cumulative_hitratio=0.99,cumulative_inserts=2455,cumulative_evictions=0}

--

 

This is our current values after I’ve messed with them a few times trying to
get better performance.

 








-- 
View this message in context: 
http://www.nabble.com/Query-Performance-while-updating-the-index-tp20452835p20980669.html
Sent from the Solr - User mailing list archive at Nabble.com.

Solr 1.3 DataImportHandler iBatis integration ..

Hi -
  I was planning to check more details about integrating ibatis query
resultsets with the query required for  tags .   Before I
start experimenting more along the lines - I am just curious if there
had been some effort done earlier on this end (specifically - how to
better integrate DataImportHandler with iBatis queries etc. )

Re: Query Performance while updating teh index

It looks like cache warming is taking about 12 seconds.  It sounds like you 
need to see if performance is bad during warming, or right after warming (and 
right after the new searcher gets exposed to queries).

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
> From: oleg_gnatovskiy 
> To: solr-user@lucene.apache.org
> Sent: Friday, December 12, 2008 1:07:49 PM
> Subject: Re: Query Performance while updating teh index
> 
> 
> Here’s what we have on one of the data slaves for the autowarming.
> 
> 
> 
> --
> 
> Dec 12, 2008 8:46:02 AM org.apache.solr.search.SolrIndexSearcher warm
> 
> INFO: autowarming searc...@3f32ca2b main from searc...@443ad545 main
> 
>   
> filterCache{lookups=351993,hits=347055,hitratio=0.98,inserts=8332,evictions=0,size=8245,warmupTime=215,cumulative_lookups=2837676,cumulative_hits=2766551,cumulative_hitratio=0.97,cumulative_inserts=72050,cumulative_evictions=0}
> 
> Dec 12, 2008 8:46:02 AM org.apache.solr.search.SolrIndexSearcher warm
> 
> INFO: autowarming result for searc...@3f32ca2b main
> 
>   
> filterCache{lookups=0,hits=0,hitratio=0.00,inserts=1000,evictions=0,size=1000,warmupTime=317,cumulative_lookups=2837676,cumulative_hits=2766551,cumulative_hitratio=0.97,cumulative_inserts=72050,cumulative_evictions=0}
> 
> Dec 12, 2008 8:46:02 AM org.apache.solr.search.SolrIndexSearcher warm
> 
> INFO: autowarming searc...@3f32ca2b main from searc...@443ad545 main
> 
>   
> queryResultCache{lookups=5309,hits=5223,hitratio=0.98,inserts=422,evictions=0,size=421,warmupTime=4628,cumulative_lookups=77802,cumulative_hits=77216,cumulative_hitratio=0.99,cumulative_inserts=424,cumulative_evictions=0}
> 
> --
> 
> Dec 12, 2008 8:46:07 AM org.apache.solr.search.SolrIndexSearcher warm
> 
> INFO: autowarming result for searc...@3f32ca2b main
> 
>   
> queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=421,evictions=0,size=421,warmupTime=5536,cumulative_lookups=77804,cumulative_hits=77218,cumulative_hitratio=0.99,cumulative_inserts=424,cumulative_evictions=0}
> 
> Dec 12, 2008 8:46:07 AM org.apache.solr.search.SolrIndexSearcher warm
> 
> INFO: autowarming searc...@3f32ca2b main from searc...@443ad545 main
> 
>   
> documentCache{lookups=87216,hits=86686,hitratio=0.99,inserts=570,evictions=0,size=570,warmupTime=0,cumulative_lookups=1270773,cumulative_hits=1268318,cumulative_hitratio=0.99,cumulative_inserts=2455,cumulative_evictions=0}
> 
> Dec 12, 2008 8:46:07 AM org.apache.solr.search.SolrIndexSearcher warm
> 
> INFO: autowarming result for searc...@3f32ca2b main
> 
>   
> documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=1270773,cumulative_hits=1268318,cumulative_hitratio=0.99,cumulative_inserts=2455,cumulative_evictions=0}
> 
> --
> 
> 
> 
> This is our current values after I’ve messed with them a few times trying to
> get better performance.
> 
> 
> 
> 
> 
>   class="solr.LRUCache"
> 
>   size="3"
> 
>   initialSize="15000"
> 
>   autowarmCount="1000"/>
> 
> 
> 
>   class="solr.LRUCache"
> 
>   size="6"
> 
>   initialSize="3"
> 
>   autowarmCount="5"/>
> 
> 
> 
>   class="solr.LRUCache"
> 
>   size="20"
> 
>   initialSize="125000"
> 
>   autowarmCount="0"/>
> 
> 
> -- 
> View this message in context: 
> http://www.nabble.com/Query-Performance-while-updating-the-index-tp20452835p20980669.html
> Sent from the Solr - User mailing list archive at Nabble.com.

Re: Unwanted clustering of search results after sorting by score

Max - field collapsing may be your friend - 
https://issues.apache.org/jira/browse/SOLR-236


This field collapsing keeps coming up...

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
> From: Max Scheffler 
> To: solr-user@lucene.apache.org
> Sent: Friday, December 12, 2008 10:23:23 AM
> Subject: Unwanted clustering of search results after sorting by score
> 
> Hallo,
> 
> We have a website on which you can search through a large amount of
> products from different shops.
> 
> The information describing the products are provided to us by the shops
> which sell these products.
> 
> If we sort a search result by score many products of the same shop are
> clustered together. The reason for this behavior is that a shops tend to
> use the same 'style' to describe their products. For example:
> 
> Shop 'foo' describes its products with 250 words and uses the searched
> word once. Shop 'bar' describes its products with only 25 words and also
> uses the searched word once. The score for shop 'foo' will be much worst
> than for shop 'bar'. In a search in which are many products of shop
> 'foo' and 'bar' the products of shop 'bar' are shown before the products
> of shop 'foo'.
> 
> We tried to avoid this behavior by not using the term frequency. But
> after this we got very strange products under the first results.
> 
> Has anybody an idea to avoid the clustering of products (documents)
> which are from the same shop?
> 
> Greetings
> Max

Re: Solr and Aperture Framework

Rogerio,

I think it might be better to specify which part of Aperture specifically - 
e.g. parsers or crawler or ...?


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
> From: Rogerio Pereira 
> To: solr-user@lucene.apache.org
> Sent: Friday, December 12, 2008 9:47:58 AM
> Subject: Solr and Aperture Framework
> 
> Hi!
> Is there someone in the list that worked on integrate Aperture Framework (
> http://aperture.sourceforge.net/) with Solr?
> 
> -- 
> Regards,
> 
> Rogério (_rogerio_)
> 
> [Blog: http://faces.eti.br]  [Sandbox: http://bmobile.dyndns.org]  [Twitter:
> http://twitter.com/ararog]
> 
> "Faça a diferença! Ajude o seu país a crescer, não retenha conhecimento,
> distribua e aprenda mais."
> (http://faces.eti.br/2006/10/30/conhecimento-e-amadurecimento)

RE: Query Performance while updating teh index

2008-12-12 Thread Feak, Todd

It's spending 4-5 seconds warming up your query cache. If 4-5 seconds is
too much, you could reduce the number of queries to auto-warm with on
that cache.

Notice that the 4-5 seconds is spent only putting about 420 queries into
the query cache. Your autowarm of 5 for the query cache seems a bit
high. If you need to reduce that autowarm time below 5 seconds, you may
have to set that value in the hundreds, as opposed to tens of thousands.

-Todd Feak

-Original Message-
From: oleg_gnatovskiy [mailto:oleg_gnatovs...@citysearch.com] 
Sent: Friday, December 12, 2008 10:08 AM
To: solr-user@lucene.apache.org
Subject: Re: Query Performance while updating teh index


Here's what we have on one of the data slaves for the autowarming.

 

--

Dec 12, 2008 8:46:02 AM org.apache.solr.search.SolrIndexSearcher warm

INFO: autowarming searc...@3f32ca2b main from searc...@443ad545 main

   
filterCache{lookups=351993,hits=347055,hitratio=0.98,inserts=8332,evicti
ons=0,size=8245,warmupTime=215,cumulative_lookups=2837676,cumulative_hit
s=2766551,cumulative_hitratio=0.97,cumulative_inserts=72050,cumulative_e
victions=0}

Dec 12, 2008 8:46:02 AM org.apache.solr.search.SolrIndexSearcher warm

INFO: autowarming result for searc...@3f32ca2b main

   
filterCache{lookups=0,hits=0,hitratio=0.00,inserts=1000,evictions=0,size
=1000,warmupTime=317,cumulative_lookups=2837676,cumulative_hits=2766551,
cumulative_hitratio=0.97,cumulative_inserts=72050,cumulative_evictions=0
}

Dec 12, 2008 8:46:02 AM org.apache.solr.search.SolrIndexSearcher warm

INFO: autowarming searc...@3f32ca2b main from searc...@443ad545 main

   
queryResultCache{lookups=5309,hits=5223,hitratio=0.98,inserts=422,evicti
ons=0,size=421,warmupTime=4628,cumulative_lookups=77802,cumulative_hits=
77216,cumulative_hitratio=0.99,cumulative_inserts=424,cumulative_evictio
ns=0}

--

Dec 12, 2008 8:46:07 AM org.apache.solr.search.SolrIndexSearcher warm

INFO: autowarming result for searc...@3f32ca2b main

   
queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=421,evictions=0,
size=421,warmupTime=5536,cumulative_lookups=77804,cumulative_hits=77218,
cumulative_hitratio=0.99,cumulative_inserts=424,cumulative_evictions=0}

Dec 12, 2008 8:46:07 AM org.apache.solr.search.SolrIndexSearcher warm

INFO: autowarming searc...@3f32ca2b main from searc...@443ad545 main

   
documentCache{lookups=87216,hits=86686,hitratio=0.99,inserts=570,evictio
ns=0,size=570,warmupTime=0,cumulative_lookups=1270773,cumulative_hits=12
68318,cumulative_hitratio=0.99,cumulative_inserts=2455,cumulative_evicti
ons=0}

Dec 12, 2008 8:46:07 AM org.apache.solr.search.SolrIndexSearcher warm

INFO: autowarming result for searc...@3f32ca2b main

   
documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=
0,warmupTime=0,cumulative_lookups=1270773,cumulative_hits=1268318,cumula
tive_hitratio=0.99,cumulative_inserts=2455,cumulative_evictions=0}

--

 

This is our current values after I've messed with them a few times
trying to
get better performance.

 








-- 
View this message in context:
http://www.nabble.com/Query-Performance-while-updating-the-index-tp20452
835p20980669.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr and Aperture Framework

2008-12-12 Thread Rogerio Pereira

Parsers to be more specific.

2008/12/12 Otis Gospodnetic 

> Rogerio,
>
> I think it might be better to specify which part of Aperture specifically -
> e.g. parsers or crawler or ...?
>
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
> - Original Message 
> > From: Rogerio Pereira 
> > To: solr-user@lucene.apache.org
> > Sent: Friday, December 12, 2008 9:47:58 AM
> > Subject: Solr and Aperture Framework
> >
> > Hi!
> > Is there someone in the list that worked on integrate Aperture Framework
> (
> > http://aperture.sourceforge.net/) with Solr?
> >
> > --
> > Regards,
> >
> > Rogério (_rogerio_)
> >
> > [Blog: http://faces.eti.br]  [Sandbox: http://bmobile.dyndns.org]
>  [Twitter:
> > http://twitter.com/ararog]
> >
> > "Faça a diferença! Ajude o seu país a crescer, não retenha conhecimento,
> > distribua e aprenda mais."
> > (http://faces.eti.br/2006/10/30/conhecimento-e-amadurecimento)
>
>


-- 
Regards,

Rogério (_rogerio_)

[Blog: http://faces.eti.br]  [Sandbox: http://bmobile.dyndns.org]  [Twitter:
http://twitter.com/ararog]

"Faça a diferença! Ajude o seu país a crescer, não retenha conhecimento,
distribua e aprenda mais."
(http://faces.eti.br/2006/10/30/conhecimento-e-amadurecimento)

Re: Solr and Aperture Framework

Rogerio,

You may want to look at http://wiki.apache.org/solr/ExtractingRequestHandler

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
> From: Rogerio Pereira 
> To: solr-user@lucene.apache.org
> Sent: Friday, December 12, 2008 1:47:35 PM
> Subject: Re: Solr and Aperture Framework
> 
> Parsers to be more specific.
> 
> 2008/12/12 Otis Gospodnetic 
> 
> > Rogerio,
> >
> > I think it might be better to specify which part of Aperture specifically -
> > e.g. parsers or crawler or ...?
> >
> >
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> >
> >
> > - Original Message 
> > > From: Rogerio Pereira 
> > > To: solr-user@lucene.apache.org
> > > Sent: Friday, December 12, 2008 9:47:58 AM
> > > Subject: Solr and Aperture Framework
> > >
> > > Hi!
> > > Is there someone in the list that worked on integrate Aperture Framework
> > (
> > > http://aperture.sourceforge.net/) with Solr?
> > >
> > > --
> > > Regards,
> > >
> > > Rogério (_rogerio_)
> > >
> > > [Blog: http://faces.eti.br]  [Sandbox: http://bmobile.dyndns.org]
> >  [Twitter:
> > > http://twitter.com/ararog]
> > >
> > > "Faça a diferença! Ajude o seu país a crescer, não retenha conhecimento,
> > > distribua e aprenda mais."
> > > (http://faces.eti.br/2006/10/30/conhecimento-e-amadurecimento)
> >
> >
> 
> 
> -- 
> Regards,
> 
> Rogério (_rogerio_)
> 
> [Blog: http://faces.eti.br]  [Sandbox: http://bmobile.dyndns.org]  [Twitter:
> http://twitter.com/ararog]
> 
> "Faça a diferença! Ajude o seu país a crescer, não retenha conhecimento,
> distribua e aprenda mais."
> (http://faces.eti.br/2006/10/30/conhecimento-e-amadurecimento)

Re: Solr and Aperture Framework

2008-12-12 Thread Grant Ingersoll


Yes, I have for my book and other things.


On Dec 12, 2008, at 9:47 AM, Rogerio Pereira wrote:


Hi!
Is there someone in the list that worked on integrate Aperture  
Framework (

http://aperture.sourceforge.net/) with Solr?

--
Regards,

Rogério (_rogerio_)

[Blog: http://faces.eti.br]  [Sandbox: http://bmobile.dyndns.org]   
[Twitter:

http://twitter.com/ararog]

"Faça a diferença! Ajude o seu país a crescer, não retenha  
conhecimento,

distribua e aprenda mais."
(http://faces.eti.br/2006/10/30/conhecimento-e-amadurecimento)


--
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ

Re: Query Performance while updating teh index

2008-12-12 Thread Yonik Seeley

Right, query cache typically has a lower hit ratio, and one check per
request - often not worth autowarming much.
The filter cache can be a different story with a higher hit ratio, and
higher number of checks per request.

-Yonik

On Fri, Dec 12, 2008 at 1:35 PM, Feak, Todd  wrote:
> It's spending 4-5 seconds warming up your query cache. If 4-5 seconds is
> too much, you could reduce the number of queries to auto-warm with on
> that cache.
>
> Notice that the 4-5 seconds is spent only putting about 420 queries into
> the query cache. Your autowarm of 5 for the query cache seems a bit
> high. If you need to reduce that autowarm time below 5 seconds, you may
> have to set that value in the hundreds, as opposed to tens of thousands.
>
> -Todd Feak
>
> -Original Message-
> From: oleg_gnatovskiy [mailto:oleg_gnatovs...@citysearch.com]
> Sent: Friday, December 12, 2008 10:08 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Query Performance while updating teh index
>
>
> Here's what we have on one of the data slaves for the autowarming.
>
>
>
> --
>
> Dec 12, 2008 8:46:02 AM org.apache.solr.search.SolrIndexSearcher warm
>
> INFO: autowarming searc...@3f32ca2b main from searc...@443ad545 main
>
>
> filterCache{lookups=351993,hits=347055,hitratio=0.98,inserts=8332,evicti
> ons=0,size=8245,warmupTime=215,cumulative_lookups=2837676,cumulative_hit
> s=2766551,cumulative_hitratio=0.97,cumulative_inserts=72050,cumulative_e
> victions=0}
>
> Dec 12, 2008 8:46:02 AM org.apache.solr.search.SolrIndexSearcher warm
>
> INFO: autowarming result for searc...@3f32ca2b main
>
>
> filterCache{lookups=0,hits=0,hitratio=0.00,inserts=1000,evictions=0,size
> =1000,warmupTime=317,cumulative_lookups=2837676,cumulative_hits=2766551,
> cumulative_hitratio=0.97,cumulative_inserts=72050,cumulative_evictions=0
> }
>
> Dec 12, 2008 8:46:02 AM org.apache.solr.search.SolrIndexSearcher warm
>
> INFO: autowarming searc...@3f32ca2b main from searc...@443ad545 main
>
>
> queryResultCache{lookups=5309,hits=5223,hitratio=0.98,inserts=422,evicti
> ons=0,size=421,warmupTime=4628,cumulative_lookups=77802,cumulative_hits=
> 77216,cumulative_hitratio=0.99,cumulative_inserts=424,cumulative_evictio
> ns=0}
>
> --
>
> Dec 12, 2008 8:46:07 AM org.apache.solr.search.SolrIndexSearcher warm
>
> INFO: autowarming result for searc...@3f32ca2b main
>
>
> queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=421,evictions=0,
> size=421,warmupTime=5536,cumulative_lookups=77804,cumulative_hits=77218,
> cumulative_hitratio=0.99,cumulative_inserts=424,cumulative_evictions=0}
>
> Dec 12, 2008 8:46:07 AM org.apache.solr.search.SolrIndexSearcher warm
>
> INFO: autowarming searc...@3f32ca2b main from searc...@443ad545 main
>
>
> documentCache{lookups=87216,hits=86686,hitratio=0.99,inserts=570,evictio
> ns=0,size=570,warmupTime=0,cumulative_lookups=1270773,cumulative_hits=12
> 68318,cumulative_hitratio=0.99,cumulative_inserts=2455,cumulative_evicti
> ons=0}
>
> Dec 12, 2008 8:46:07 AM org.apache.solr.search.SolrIndexSearcher warm
>
> INFO: autowarming result for searc...@3f32ca2b main
>
>
> documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=
> 0,warmupTime=0,cumulative_lookups=1270773,cumulative_hits=1268318,cumula
> tive_hitratio=0.99,cumulative_inserts=2455,cumulative_evictions=0}
>
> --
>
>
>
> This is our current values after I've messed with them a few times
> trying to
> get better performance.
>
>
>
>
>  class="solr.LRUCache"
>
>  size="3"
>
>  initialSize="15000"
>
>  autowarmCount="1000"/>
>
>
>  class="solr.LRUCache"
>
>  size="6"
>
>  initialSize="3"
>
>  autowarmCount="5"/>
>
>
>  class="solr.LRUCache"
>
>  size="20"
>
>  initialSize="125000"
>
>  autowarmCount="0"/>
>
>
> --
> View this message in context:
> http://www.nabble.com/Query-Performance-while-updating-the-index-tp20452
> 835p20980669.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>
>

Re: Solr and Aperture Framework

2008-12-12 Thread Ryan McKinley


If your up for a bit of integration you may also want to look at:
http://incubator.apache.org/droids/

droids + http://wiki.apache.org/solr/ExtractingRequestHandler + some  
polish


could be an alternative to aperture.

With Aperture, I feel like I spend most of my time getting stuff out  
of RDF.. with droids, you work with Tiki directly


ryan



On Dec 12, 2008, at 1:47 PM, Rogerio Pereira wrote:


Parsers to be more specific.

2008/12/12 Otis Gospodnetic 


Rogerio,

I think it might be better to specify which part of Aperture  
specifically -

e.g. parsers or crawler or ...?


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 

From: Rogerio Pereira 
To: solr-user@lucene.apache.org
Sent: Friday, December 12, 2008 9:47:58 AM
Subject: Solr and Aperture Framework

Hi!
Is there someone in the list that worked on integrate Aperture  
Framework

(

http://aperture.sourceforge.net/) with Solr?

--
Regards,

Rogério (_rogerio_)

[Blog: http://faces.eti.br]  [Sandbox: http://bmobile.dyndns.org]

[Twitter:

http://twitter.com/ararog]

"Faça a diferença! Ajude o seu país a crescer, não retenha  
conhecimento,

distribua e aprenda mais."
(http://faces.eti.br/2006/10/30/conhecimento-e-amadurecimento)






--
Regards,

Rogério (_rogerio_)

[Blog: http://faces.eti.br]  [Sandbox: http://bmobile.dyndns.org]   
[Twitter:

http://twitter.com/ararog]

"Faça a diferença! Ajude o seu país a crescer, não retenha  
conhecimento,

distribua e aprenda mais."
(http://faces.eti.br/2006/10/30/conhecimento-e-amadurecimento)

RE: Query Performance while updating teh index


The auto warm time is not an issue. We take the server off the load balancer
while it is autowarming. It seems that the slowness occurs after autowarm is
done.



Feak, Todd wrote:
> 
> It's spending 4-5 seconds warming up your query cache. If 4-5 seconds is
> too much, you could reduce the number of queries to auto-warm with on
> that cache.
> 
> Notice that the 4-5 seconds is spent only putting about 420 queries into
> the query cache. Your autowarm of 5 for the query cache seems a bit
> high. If you need to reduce that autowarm time below 5 seconds, you may
> have to set that value in the hundreds, as opposed to tens of thousands.
> 
> -Todd Feak
> 
> -Original Message-
> From: oleg_gnatovskiy [mailto:oleg_gnatovs...@citysearch.com] 
> Sent: Friday, December 12, 2008 10:08 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Query Performance while updating teh index
> 
> 
> Here's what we have on one of the data slaves for the autowarming.
> 
>  
> 
> --
> 
> Dec 12, 2008 8:46:02 AM org.apache.solr.search.SolrIndexSearcher warm
> 
> INFO: autowarming searc...@3f32ca2b main from searc...@443ad545 main
> 
>
> filterCache{lookups=351993,hits=347055,hitratio=0.98,inserts=8332,evicti
> ons=0,size=8245,warmupTime=215,cumulative_lookups=2837676,cumulative_hit
> s=2766551,cumulative_hitratio=0.97,cumulative_inserts=72050,cumulative_e
> victions=0}
> 
> Dec 12, 2008 8:46:02 AM org.apache.solr.search.SolrIndexSearcher warm
> 
> INFO: autowarming result for searc...@3f32ca2b main
> 
>
> filterCache{lookups=0,hits=0,hitratio=0.00,inserts=1000,evictions=0,size
> =1000,warmupTime=317,cumulative_lookups=2837676,cumulative_hits=2766551,
> cumulative_hitratio=0.97,cumulative_inserts=72050,cumulative_evictions=0
> }
> 
> Dec 12, 2008 8:46:02 AM org.apache.solr.search.SolrIndexSearcher warm
> 
> INFO: autowarming searc...@3f32ca2b main from searc...@443ad545 main
> 
>
> queryResultCache{lookups=5309,hits=5223,hitratio=0.98,inserts=422,evicti
> ons=0,size=421,warmupTime=4628,cumulative_lookups=77802,cumulative_hits=
> 77216,cumulative_hitratio=0.99,cumulative_inserts=424,cumulative_evictio
> ns=0}
> 
> --
> 
> Dec 12, 2008 8:46:07 AM org.apache.solr.search.SolrIndexSearcher warm
> 
> INFO: autowarming result for searc...@3f32ca2b main
> 
>
> queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=421,evictions=0,
> size=421,warmupTime=5536,cumulative_lookups=77804,cumulative_hits=77218,
> cumulative_hitratio=0.99,cumulative_inserts=424,cumulative_evictions=0}
> 
> Dec 12, 2008 8:46:07 AM org.apache.solr.search.SolrIndexSearcher warm
> 
> INFO: autowarming searc...@3f32ca2b main from searc...@443ad545 main
> 
>
> documentCache{lookups=87216,hits=86686,hitratio=0.99,inserts=570,evictio
> ns=0,size=570,warmupTime=0,cumulative_lookups=1270773,cumulative_hits=12
> 68318,cumulative_hitratio=0.99,cumulative_inserts=2455,cumulative_evicti
> ons=0}
> 
> Dec 12, 2008 8:46:07 AM org.apache.solr.search.SolrIndexSearcher warm
> 
> INFO: autowarming result for searc...@3f32ca2b main
> 
>
> documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=
> 0,warmupTime=0,cumulative_lookups=1270773,cumulative_hits=1268318,cumula
> tive_hitratio=0.99,cumulative_inserts=2455,cumulative_evictions=0}
> 
> --
> 
>  
> 
> This is our current values after I've messed with them a few times
> trying to
> get better performance.
> 
>  
> 
>  
>   class="solr.LRUCache"
> 
>   size="3"
> 
>   initialSize="15000"
> 
>   autowarmCount="1000"/>
> 
>  
>   class="solr.LRUCache"
> 
>   size="6"
> 
>   initialSize="3"
> 
>   autowarmCount="5"/>
> 
>  
>   class="solr.LRUCache"
> 
>   size="20"
> 
>   initialSize="125000"
> 
>   autowarmCount="0"/>
> 
> 
> -- 
> View this message in context:
> http://www.nabble.com/Query-Performance-while-updating-the-index-tp20452
> 835p20980669.html
> Sent from the Solr - User mailing list archive at Nabble.com.
> 
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Query-Performance-while-updating-the-index-tp20452835p20981862.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: Query Performance while updating teh index


I just verified this. The slowness occurs after auto warm is done.

Oleg

-- 
View this message in context: 
http://www.nabble.com/Query-Performance-while-updating-the-index-tp20452835p20982068.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: Query Performance while updating teh index

2008-12-12 Thread Feak, Todd

Sorry, my bad. Didn't read the entire thread.

Look at your filter cache first. You are autowarming 1000, and there is
exactly 1000 in there. Yet it looks like there may be tens of thousands
of filter queries in your system. I would try autowarming more. Try
10,000 or 20,000 and see if it helps.

Second look at your document cache. Document caches don't use autowarm.
But you can add queries to your firstSeacher and newSearcher entries in
your solrconfig to pre-populate the document cache during warming.

-Todd Feak


-Original Message-
From: oleg_gnatovskiy [mailto:oleg_gnatovs...@citysearch.com] 
Sent: Friday, December 12, 2008 11:19 AM
To: solr-user@lucene.apache.org
Subject: RE: Query Performance while updating teh index


The auto warm time is not an issue. We take the server off the load
balancer
while it is autowarming. It seems that the slowness occurs after
autowarm is
done.



Feak, Todd wrote:
> 
> It's spending 4-5 seconds warming up your query cache. If 4-5 seconds
is
> too much, you could reduce the number of queries to auto-warm with on
> that cache.
> 
> Notice that the 4-5 seconds is spent only putting about 420 queries
into
> the query cache. Your autowarm of 5 for the query cache seems a
bit
> high. If you need to reduce that autowarm time below 5 seconds, you
may
> have to set that value in the hundreds, as opposed to tens of
thousands.
> 
> -Todd Feak
> 
> -Original Message-
> From: oleg_gnatovskiy [mailto:oleg_gnatovs...@citysearch.com] 
> Sent: Friday, December 12, 2008 10:08 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Query Performance while updating teh index
> 
> 
> Here's what we have on one of the data slaves for the autowarming.
> 
>  
> 
> --
> 
> Dec 12, 2008 8:46:02 AM org.apache.solr.search.SolrIndexSearcher warm
> 
> INFO: autowarming searc...@3f32ca2b main from searc...@443ad545 main
> 
>
>
filterCache{lookups=351993,hits=347055,hitratio=0.98,inserts=8332,evicti
>
ons=0,size=8245,warmupTime=215,cumulative_lookups=2837676,cumulative_hit
>
s=2766551,cumulative_hitratio=0.97,cumulative_inserts=72050,cumulative_e
> victions=0}
> 
> Dec 12, 2008 8:46:02 AM org.apache.solr.search.SolrIndexSearcher warm
> 
> INFO: autowarming result for searc...@3f32ca2b main
> 
>
>
filterCache{lookups=0,hits=0,hitratio=0.00,inserts=1000,evictions=0,size
>
=1000,warmupTime=317,cumulative_lookups=2837676,cumulative_hits=2766551,
>
cumulative_hitratio=0.97,cumulative_inserts=72050,cumulative_evictions=0
> }
> 
> Dec 12, 2008 8:46:02 AM org.apache.solr.search.SolrIndexSearcher warm
> 
> INFO: autowarming searc...@3f32ca2b main from searc...@443ad545 main
> 
>
>
queryResultCache{lookups=5309,hits=5223,hitratio=0.98,inserts=422,evicti
>
ons=0,size=421,warmupTime=4628,cumulative_lookups=77802,cumulative_hits=
>
77216,cumulative_hitratio=0.99,cumulative_inserts=424,cumulative_evictio
> ns=0}
> 
> --
> 
> Dec 12, 2008 8:46:07 AM org.apache.solr.search.SolrIndexSearcher warm
> 
> INFO: autowarming result for searc...@3f32ca2b main
> 
>
>
queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=421,evictions=0,
>
size=421,warmupTime=5536,cumulative_lookups=77804,cumulative_hits=77218,
>
cumulative_hitratio=0.99,cumulative_inserts=424,cumulative_evictions=0}
> 
> Dec 12, 2008 8:46:07 AM org.apache.solr.search.SolrIndexSearcher warm
> 
> INFO: autowarming searc...@3f32ca2b main from searc...@443ad545 main
> 
>
>
documentCache{lookups=87216,hits=86686,hitratio=0.99,inserts=570,evictio
>
ns=0,size=570,warmupTime=0,cumulative_lookups=1270773,cumulative_hits=12
>
68318,cumulative_hitratio=0.99,cumulative_inserts=2455,cumulative_evicti
> ons=0}
> 
> Dec 12, 2008 8:46:07 AM org.apache.solr.search.SolrIndexSearcher warm
> 
> INFO: autowarming result for searc...@3f32ca2b main
> 
>
>
documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=
>
0,warmupTime=0,cumulative_lookups=1270773,cumulative_hits=1268318,cumula
> tive_hitratio=0.99,cumulative_inserts=2455,cumulative_evictions=0}
> 
> --
> 
>  
> 
> This is our current values after I've messed with them a few times
> trying to
> get better performance.
> 
>  
> 
>  
>   class="solr.LRUCache"
> 
>   size="3"
> 
>   initialSize="15000"
> 
>   autowarmCount="1000"/>
> 
>  
>   class="solr.LRUCache"
> 
>   size="6"
> 
>   initialSize="3"
> 
>   autowarmCount="5"/>
> 
>  
>   class="solr.LRUCache"
> 
>   size="20"
> 
>   initialSize="125000"
> 
>   autowarmCount="0"/>
> 
> 
> -- 
> View this message in context:
>
http://www.nabble.com/Query-Performance-while-updating-the-index-tp20452
> 835p20980669.html
> Sent from the Solr - User mailing list archive at Nabble.com.
> 
> 
> 
> 

-- 
View this message in context:
http://www.nabble.com/Query-Performance-while-updating-the-index-tp20452
835p20981862.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr and Aperture Framework

2008-12-12 Thread Rogerio Pereira

Thanks guys for all answers.I'll take a look on
ExtractingRequestHandler
and
keep tracking Otis progress on Tika integration.
2008/12/12 Ryan McKinley 

> If your up for a bit of integration you may also want to look at:
> http://incubator.apache.org/droids/
>
> droids + http://wiki.apache.org/solr/ExtractingRequestHandler + some
> polish
>
> could be an alternative to aperture.
>
> With Aperture, I feel like I spend most of my time getting stuff out of
> RDF.. with droids, you work with Tiki directly
>
> ryan
>
>
>
>
> On Dec 12, 2008, at 1:47 PM, Rogerio Pereira wrote:
>
>  Parsers to be more specific.
>>
>> 2008/12/12 Otis Gospodnetic 
>>
>>  Rogerio,
>>>
>>> I think it might be better to specify which part of Aperture specifically
>>> -
>>> e.g. parsers or crawler or ...?
>>>
>>>
>>> Otis
>>> --
>>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>>
>>>
>>>
>>> - Original Message 
>>>
 From: Rogerio Pereira 
 To: solr-user@lucene.apache.org
 Sent: Friday, December 12, 2008 9:47:58 AM
 Subject: Solr and Aperture Framework

 Hi!
 Is there someone in the list that worked on integrate Aperture Framework

>>> (
>>>
 http://aperture.sourceforge.net/) with Solr?

 --
 Regards,

 Rogério (_rogerio_)

 [Blog: http://faces.eti.br]  [Sandbox: http://bmobile.dyndns.org]

>>> [Twitter:
>>>
 http://twitter.com/ararog]

 "Faça a diferença! Ajude o seu país a crescer, não retenha conhecimento,
 distribua e aprenda mais."
 (http://faces.eti.br/2006/10/30/conhecimento-e-amadurecimento)

>>>
>>>
>>>
>>
>> --
>> Regards,
>>
>> Rogério (_rogerio_)
>>
>> [Blog: http://faces.eti.br]  [Sandbox: http://bmobile.dyndns.org]
>>  [Twitter:
>> http://twitter.com/ararog]
>>
>> "Faça a diferença! Ajude o seu país a crescer, não retenha conhecimento,
>> distribua e aprenda mais."
>> (http://faces.eti.br/2006/10/30/conhecimento-e-amadurecimento)
>>
>
>


-- 
Regards,

Rogério (_rogerio_)

[Blog: http://faces.eti.br]  [Sandbox: http://bmobile.dyndns.org]  [Twitter:
http://twitter.com/ararog]

"Faça a diferença! Ajude o seu país a crescer, não retenha conhecimento,
distribua e aprenda mais."
(http://faces.eti.br/2006/10/30/conhecimento-e-amadurecimento)

RE: Query Performance while updating teh index


Should this autowarm value be set based on the number of lookups? From the
info I provided that like 60k.  filterCache{lookups=58522

Will 25k be enough?

Also, does that mean that we have to at least increase the size and initial
size as big as we set the autowarm?


Feak, Todd wrote:
> 
> Sorry, my bad. Didn't read the entire thread.
> 
> Look at your filter cache first. You are autowarming 1000, and there is
> exactly 1000 in there. Yet it looks like there may be tens of thousands
> of filter queries in your system. I would try autowarming more. Try
> 10,000 or 20,000 and see if it helps.
> 
> Second look at your document cache. Document caches don't use autowarm.
> But you can add queries to your firstSeacher and newSearcher entries in
> your solrconfig to pre-populate the document cache during warming.
> 
> -Todd Feak
> 
> 
> -Original Message-
> From: oleg_gnatovskiy [mailto:oleg_gnatovs...@citysearch.com] 
> Sent: Friday, December 12, 2008 11:19 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Query Performance while updating teh index
> 
> 
> The auto warm time is not an issue. We take the server off the load
> balancer
> while it is autowarming. It seems that the slowness occurs after
> autowarm is
> done.
> 
> 
> 
> Feak, Todd wrote:
>> 
>> It's spending 4-5 seconds warming up your query cache. If 4-5 seconds
> is
>> too much, you could reduce the number of queries to auto-warm with on
>> that cache.
>> 
>> Notice that the 4-5 seconds is spent only putting about 420 queries
> into
>> the query cache. Your autowarm of 5 for the query cache seems a
> bit
>> high. If you need to reduce that autowarm time below 5 seconds, you
> may
>> have to set that value in the hundreds, as opposed to tens of
> thousands.
>> 
>> -Todd Feak
>> 
>> -Original Message-
>> From: oleg_gnatovskiy [mailto:oleg_gnatovs...@citysearch.com] 
>> Sent: Friday, December 12, 2008 10:08 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Query Performance while updating teh index
>> 
>> 
>> Here's what we have on one of the data slaves for the autowarming.
>> 
>>  
>> 
>> --
>> 
>> Dec 12, 2008 8:46:02 AM org.apache.solr.search.SolrIndexSearcher warm
>> 
>> INFO: autowarming searc...@3f32ca2b main from searc...@443ad545 main
>> 
>>
>>
> filterCache{lookups=351993,hits=347055,hitratio=0.98,inserts=8332,evicti
>>
> ons=0,size=8245,warmupTime=215,cumulative_lookups=2837676,cumulative_hit
>>
> s=2766551,cumulative_hitratio=0.97,cumulative_inserts=72050,cumulative_e
>> victions=0}
>> 
>> Dec 12, 2008 8:46:02 AM org.apache.solr.search.SolrIndexSearcher warm
>> 
>> INFO: autowarming result for searc...@3f32ca2b main
>> 
>>
>>
> filterCache{lookups=0,hits=0,hitratio=0.00,inserts=1000,evictions=0,size
>>
> =1000,warmupTime=317,cumulative_lookups=2837676,cumulative_hits=2766551,
>>
> cumulative_hitratio=0.97,cumulative_inserts=72050,cumulative_evictions=0
>> }
>> 
>> Dec 12, 2008 8:46:02 AM org.apache.solr.search.SolrIndexSearcher warm
>> 
>> INFO: autowarming searc...@3f32ca2b main from searc...@443ad545 main
>> 
>>
>>
> queryResultCache{lookups=5309,hits=5223,hitratio=0.98,inserts=422,evicti
>>
> ons=0,size=421,warmupTime=4628,cumulative_lookups=77802,cumulative_hits=
>>
> 77216,cumulative_hitratio=0.99,cumulative_inserts=424,cumulative_evictio
>> ns=0}
>> 
>> --
>> 
>> Dec 12, 2008 8:46:07 AM org.apache.solr.search.SolrIndexSearcher warm
>> 
>> INFO: autowarming result for searc...@3f32ca2b main
>> 
>>
>>
> queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=421,evictions=0,
>>
> size=421,warmupTime=5536,cumulative_lookups=77804,cumulative_hits=77218,
>>
> cumulative_hitratio=0.99,cumulative_inserts=424,cumulative_evictions=0}
>> 
>> Dec 12, 2008 8:46:07 AM org.apache.solr.search.SolrIndexSearcher warm
>> 
>> INFO: autowarming searc...@3f32ca2b main from searc...@443ad545 main
>> 
>>
>>
> documentCache{lookups=87216,hits=86686,hitratio=0.99,inserts=570,evictio
>>
> ns=0,size=570,warmupTime=0,cumulative_lookups=1270773,cumulative_hits=12
>>
> 68318,cumulative_hitratio=0.99,cumulative_inserts=2455,cumulative_evicti
>> ons=0}
>> 
>> Dec 12, 2008 8:46:07 AM org.apache.solr.search.SolrIndexSearcher warm
>> 
>> INFO: autowarming result for searc...@3f32ca2b main
>> 
>>
>>
> documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=
>>
> 0,warmupTime=0,cumulative_lookups=1270773,cumulative_hits=1268318,cumula
>> tive_hitratio=0.99,cumulative_inserts=2455,cumulative_evictions=0}
>> 
>> --
>> 
>>  
>> 
>> This is our current values after I've messed with them a few times
>> trying to
>> get better performance.
>> 
>>  
>> 
>> > 
>>   class="solr.LRUCache"
>> 
>>   size="3"
>> 
>>   initialSize="15000"
>> 
>>   autowarmCount="1000"/>
>> 
>> > 
>>   class="solr.LRUCache"
>> 
>>   size="6"
>> 
>>   initialSize="3"
>> 
>>   autowarmCount="5"/>
>> 
>> > 
>>   class="solr.LRUCache"
>> 
>>   size

Re: Taxonomy Support on Solr

On Fri, Dec 12, 2008 at 9:11 PM, Walter Underwood wrote:

>
> One feature that is very useful is to update the category tag
> after the document has been indexed. We ran into that again
> and again when implementing taxonomies at Verity.

Take a look at SOLR-828. There's no patch there but there have been
discussions and a design proposal. I think Noble is working on it. Thoughts
and suggestions welcome :)

-- 
Regards,
Shalin Shekhar Mangar.

Re: Solr 1.3 DataImportHandler iBatis integration ..

On Fri, Dec 12, 2008 at 11:50 PM, Rakesh Sinha wrote:

> Hi -
>  I was planning to check more details about integrating ibatis query
> resultsets with the query required for  tags .   Before I
> start experimenting more along the lines - I am just curious if there
> had been some effort done earlier on this end (specifically - how to
> better integrate DataImportHandler with iBatis queries etc. )
>

Why do you want to go through iBatis? Why not index directly from the
database?

-- 
Regards,
Shalin Shekhar Mangar.

Re: Solr 1.3 DataImportHandler iBatis integration ..

Trivial answer - I already have quite a bit of iBatis queries as part
of the project ( a large consumer facing website) that I want to
reuse.
Also  - the iBatis layer already has all the db authentication tokens
/ sqlmap wired on ( as part of sql-map-config.xml ).

When I create the dataConfig xml I seem to re-entering the db
authentication details and the query once again to use the same.
Hence an orthogonal integration might be really useful.

On Fri, Dec 12, 2008 at 3:11 PM, Shalin Shekhar Mangar
 wrote:
> On Fri, Dec 12, 2008 at 11:50 PM, Rakesh Sinha wrote:
>
>> Hi -
>>  I was planning to check more details about integrating ibatis query
>> resultsets with the query required for  tags .   Before I
>> start experimenting more along the lines - I am just curious if there
>> had been some effort done earlier on this end (specifically - how to
>> better integrate DataImportHandler with iBatis queries etc. )
>>
>
> Why do you want to go through iBatis? Why not index directly from the
> database?
>
> --
> Regards,
> Shalin Shekhar Mangar.
>

Re: Applying Field Collapsing Patch

2008-12-12 Thread John Martyniak


That worked perfectly!!!

Thank you.

I wonder why it didn't work in the same way off the downloaded build.

-John

On Dec 11, 2008, at 9:40 PM, Doug Steigerwald wrote:

Have you tried just checking out (or exporting) the source from SVN  
and applying the patch?  Works fine for me that way.


$ svn co http://svn.apache.org/repos/asf/lucene/solr/tags/release-1.3.0 
 solr-1.3.0
$ cd solr-1.3.0 ; patch -p0 < ~/Downloads/collapsing-patch-to-1.3.0- 
ivan_2.patch


Doug

On Dec 11, 2008, at 3:50 PM, John Martyniak wrote:

It was a completely clean install.  I downloaded it from one of  
mirrors right before applying the patch to it.


Very troubling.  Any other suggestions or ideas?

I am running it on Mac OS Maybe I will try looking for some answers  
around that.


-John

On Dec 11, 2008, at 3:05 PM, Stephen Weiss   
wrote:


Yes, only ivan patch 2 (and before, only ivan patch 1), my sense  
was these patches were meant to be used in isolation (there were  
no notes saying to apply any other patches first).


Are you using patches for any other purpose (non-SOLR-236)?  Maybe  
you need to apply this one first, then those patches.  For me  
using any patch makes me nervous (we have a pretty strict policy  
about using beta code anywhere), I'm only doing it this once  
because it's absolutely necessary to provide the functionality  
desired.


--
Steve

On Dec 11, 2008, at 2:53 PM, John Martyniak wrote:


thanks for the advice.

I just downloaded a completely clean version, haven't even tried  
to build it yet.


Applied the same, and I received exactly the same results.

Do you only apply the ivan patch 2?  What version of patch are  
you running?


-John

On Dec 11, 2008, at 2:10 PM, Stephen Weiss wrote:

Are you sure you have a clean copy of the source?  Every time  
I've applied his patch I grab a fresh copy of the tarball and  
run the exact same command, it always works for me.


Now, whether the collapsing actually works is a different  
matter...


--
Steve

On Dec 11, 2008, at 1:29 PM, John Martyniak wrote:


Hi,

I am trying to apply Ivan's field collapsing patch to solr 1.3  
(not a nightly), and it continously fails.  I am using the  
following command:

patch -p0 -i collapsing-patch-to-1.3.0-ivan_2.patch --dry-run

I am in the apache-solr directory, and have read write for all  
files directories and files.


I am get the following results:

patching file src/test/org/apache/solr/search/TestDocSet.java
Hunk #1 FAILED at 88.
1 out of 1 hunk FAILED -- saving rejects to file src/test/org/ 
apache/solr/search/TestDocSet.java.rej

patching file src/java/org/apache/solr/search/CollapseFilter.java
patching file src/java/org/apache/solr/search/DocSet.java
Hunk #1 FAILED at 195.
1 out of 1 hunk FAILED -- saving rejects to file src/java/org/ 
apache/solr/search/DocSet.java.rej

patching file src/java/org/apache/solr/search/NegatedDocSet.java
patching file src/java/org/apache/solr/search/ 
SolrIndexSearcher.java

Hunk #1 FAILED at 1357.
1 out of 1 hunk FAILED -- saving rejects to file src/java/org/ 
apache/solr/search/SolrIndexSearcher.java.rej
patching file src/java/org/apache/solr/common/params/ 
CollapseParams.java
patching file src/java/org/apache/solr/handler/component/ 
CollapseComponent.java



Also the '.rej' files are not created.

Does anybody have any ideas?

thanks in advance for the help.

-John

Solr - DataImportHandler - Large Dataset results ?

As per the example in the wiki - http://wiki.apache.org/solr/DataImportHandler  
- I am seeing the following fragment. 






  ..




My scaled-down application looks very similar along these lines but where my 
resultset is so big that it cannot fit within main memory by any chance. 

So I was planning to split this single query into multiple subqueries - with 
another conditional based on the id . ( id < 0 and id > 100 , say ) . 

I am curious if there is any way to specify another conditional clause , 
(, where the column is supposed to be 
an integer value) - and internally , the implementation could actually generate 
the subqueries - 

i) get the min , max of the numeric column , and send queries to the database 
based on the batch size 

ii) Add Documents for each batch and close the resultset . 

This might end up putting more load on the database (but at least the dataset 
would fit in the main memory ). 

Let me know if anyone else had run into similar issues and how this was 
encountered.

Using Regex fragmenter to extract paragraphs

2008-12-12 Thread Mark Ferguson

Hello,

I am trying to use the regex fragmenter and am having a hard time getting
the results I want. I am trying to get fragments that start on a word
character and end on punctuation, but for some reason the fragments being
returned to me seem to be very inflexible, despite that I've provided a
large slop. Here are the relevant parameters I'm using, maybe someone can
help point out where I've gone wrong:

500
regex
0.8
[\w].*{400,600}[.!?]
true
chinese

This should be matching between 400-600 characters, beginning with a word
character and ending with one of .!?. Here is an example of a typical
result:

. Check these pictures out. Nine panda cubs on display for the first time
Thursday in southwest China. They're less than a year old. They just
recently stopped nursing. There are only 1,600 of these guys left in the
mountain forests of central China, another 120 in Chinese breeding facilities and zoos. And they're about 20
that live outside China in zoos. They exist almost entirely on bamboo. They
can live to be 30 years old. And these little guys will eventually get much
bigger. They'll grow

As you can see, it is starting with a period and ending on a word character!
It's almost as if the fragments are just coming out as they will and the
regex isn't doing anything at all, but the results are different when I use
the gap fragmenter. In the above result I don't see any reason why it
shouldn't have stripped out the preceding period and the last two words,
there is plenty of room in the slop and in the regex pattern. Please help me
figure out what I'm doing wrong...

Thanks a lot,

Mark Ferguson

Re: Solr - DataImportHandler - Large Dataset results ?

DataImportHandler is designed to stream rows one by one to create Solr
documents. As long as your database driver supports streaming, you should be
fine. Which database are you using?

On Sat, Dec 13, 2008 at 2:20 AM, Kay Kay  wrote:

> As per the example in the wiki -
> http://wiki.apache.org/solr/DataImportHandler  - I am seeing the following
> fragment.
>
>  url="jdbc:hsqldb:/temp/example/ex" user="sa" />
>
>
>
>
>  ..
>
> 
> 
>
> My scaled-down application looks very similar along these lines but where
> my resultset is so big that it cannot fit within main memory by any chance.
>
> So I was planning to split this single query into multiple subqueries -
> with another conditional based on the id . ( id < 0 and id > 100 , say ) .
>
> I am curious if there is any way to specify another conditional clause ,
> (, where the column is supposed to
> be an integer value) - and internally , the implementation could actually
> generate the subqueries -
>
> i) get the min , max of the numeric column , and send queries to the
> database based on the batch size
>
> ii) Add Documents for each batch and close the resultset .
>
> This might end up putting more load on the database (but at least the
> dataset would fit in the main memory ).
>
> Let me know if anyone else had run into similar issues and how this was
> encountered.
>
>
>




-- 
Regards,
Shalin Shekhar Mangar.

Re: new faceting algorithm

2008-12-12 Thread wojtekpia

It looks like my filterCache was too big. I reduced my filterCache size from
700,000 to 20,000 (without changing the heap size) and all my performance
issues went away. I experimented with various GC settings, but none of them
made a significant difference.

I see a 16% increase in throughput by applying this patch.

Yonik Seeley wrote:
> 
> ... This can be a big chunk of memory
> per-request, and is most likely what changed your GC profile (i.e.
> changing the GC settings may help).
> 
> 

-- 
View this message in context: 
http://www.nabble.com/new-faceting-algorithm-tp20674902p20984502.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr - DataImportHandler - Large Dataset results ?

I am using MySQL. I believe (since MySQL 5) supports streaming. 

On more about streaming - can we assume that when the database driver supports 
streaming , the resultset iterator is a forward directional iterator. 

If , say the streaming size is 10K records and we are trying to retrieve a 
total of 100K records - what exactly happens when the threshold is reached , 
(say , the first 10K records were retrieved ). 

Are the previous set of records thrown away and replaced in memory by the new 
batch of records.  

--- On Fri, 12/12/08, Shalin Shekhar Mangar  wrote:
From: Shalin Shekhar Mangar 
Subject: Re: Solr - DataImportHandler - Large Dataset results ?
To: solr-user@lucene.apache.org
Date: Friday, December 12, 2008, 9:41 PM

DataImportHandler is designed to stream rows one by one to create Solr
documents. As long as your database driver supports streaming, you should be
fine. Which database are you using?

On Sat, Dec 13, 2008 at 2:20 AM, Kay Kay  wrote:

> As per the example in the wiki -
> http://wiki.apache.org/solr/DataImportHandler  - I am seeing the following
> fragment.
>
>  url="jdbc:hsqldb:/temp/example/ex" user="sa" />
>
>
>
>
>  ..
>
> 
> 
>
> My scaled-down application looks very similar along these lines but where
> my resultset is so big that it cannot fit within main memory by any
chance.
>
> So I was planning to split this single query into multiple subqueries -
> with another conditional based on the id . ( id < 0 and id > 100 ,
say ) .
>
> I am curious if there is any way to specify another conditional clause ,
> (,
where the column is supposed to
> be an integer value) - and internally , the implementation could actually
> generate the subqueries -
>
> i) get the min , max of the numeric column , and send queries to the
> database based on the batch size
>
> ii) Add Documents for each batch and close the resultset .
>
> This might end up putting more load on the database (but at least the
> dataset would fit in the main memory ).
>
> Let me know if anyone else had run into similar issues and how this was
> encountered.
>
>
>

-- 
Regards,
Shalin Shekhar Mangar.

Re: Solr - DataImportHandler - Large Dataset results ?

2008-12-12 Thread Bryan Talbot

It only supports streaming if properly enabled which is completely  
lame: http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-implementation-notes.html


 By default, ResultSets are completely retrieved and stored in  
memory. In most cases this is the most efficient way to operate, and  
due to the design of the MySQL network protocol is easier to  
implement. If you are working with ResultSets that have a large number  
of rows or large values, and can not allocate heap space in your JVM  
for the memory required, you can tell the driver to stream the results  
back one row at a time.


To enable this functionality, you need to create a Statement instance  
in the following manner:


stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
  java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);

The combination of a forward-only, read-only result set, with a fetch  
size of Integer.MIN_VALUE serves as a signal to the driver to stream  
result sets row-by-row. After this any result sets created with the  
statement will be retrieved row-by-row.




-Bryan




On Dec 12, 2008, at Dec 12, 2:15 PM, Kay Kay wrote:


I am using MySQL. I believe (since MySQL 5) supports streaming.

On more about streaming - can we assume that when the database  
driver supports streaming , the resultset iterator is a forward  
directional iterator.


If , say the streaming size is 10K records and we are trying to  
retrieve a total of 100K records - what exactly happens when the  
threshold is reached , (say , the first 10K records were retrieved ).


Are the previous set of records thrown away and replaced in memory  
by the new batch of records.




--- On Fri, 12/12/08, Shalin Shekhar Mangar   
wrote:

From: Shalin Shekhar Mangar 
Subject: Re: Solr - DataImportHandler - Large Dataset results ?
To: solr-user@lucene.apache.org
Date: Friday, December 12, 2008, 9:41 PM

DataImportHandler is designed to stream rows one by one to create Solr
documents. As long as your database driver supports streaming, you  
should be

fine. Which database are you using?

On Sat, Dec 13, 2008 at 2:20 AM, Kay Kay   
wrote:



As per the example in the wiki -
http://wiki.apache.org/solr/DataImportHandler  - I am seeing the  
following

fragment.


  
  
item">

  
  
..
  



My scaled-down application looks very similar along these lines but  
where

my resultset is so big that it cannot fit within main memory by any

chance.


So I was planning to split this single query into multiple  
subqueries -

with another conditional based on the id . ( id < 0 and id > 100 ,

say ) .


I am curious if there is any way to specify another conditional  
clause ,

(,

where the column is supposed to
be an integer value) - and internally , the implementation could  
actually

generate the subqueries -

i) get the min , max of the numeric column , and send queries to the
database based on the batch size

ii) Add Documents for each batch and close the resultset .

This might end up putting more load on the database (but at least the
dataset would fit in the main memory ).

Let me know if anyone else had run into similar issues and how this  
was

encountered.








--
Regards,
Shalin Shekhar Mangar.

Re: Solr - DataImportHandler - Large Dataset results ?

Thanks Bryan . 

That clarifies a lot. 

But even with streaming - retrieving one document at a time and adding to the 
IndexWriter seems to making it more serializable . 

So - may be the DataImportHandler could be optimized to retrieve a bunch of 
results from the query and add the Documents in a separate thread , from a 
Executor pool (and make this number configurable / may be retrieved from the 
System as the number of physical cores to exploit maximum parallelism ) since 
that seems like a bottleneck. 

Any comments on the same. 

--- On Fri, 12/12/08, Bryan Talbot  wrote:
From: Bryan Talbot 
Subject: Re: Solr - DataImportHandler - Large Dataset results ?
To: solr-user@lucene.apache.org
Date: Friday, December 12, 2008, 5:26 PM

It only supports streaming if properly enabled which is completely lame:
http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-implementation-notes.html

 By default, ResultSets are completely retrieved and stored in memory. In most
cases this is the most efficient way to operate, and due to the design of the
MySQL network protocol is easier to implement. If you are working with
ResultSets that have a large number of rows or large values, and can not
allocate heap space in your JVM for the memory required, you can tell the driver
to stream the results back one row at a time.

To enable this functionality, you need to create a Statement instance in the
following manner:

stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
  java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);

The combination of a forward-only, read-only result set, with a fetch size of
Integer.MIN_VALUE serves as a signal to the driver to stream result sets
row-by-row. After this any result sets created with the statement will be
retrieved row-by-row.

-Bryan

On Dec 12, 2008, at Dec 12, 2:15 PM, Kay Kay wrote:

> I am using MySQL. I believe (since MySQL 5) supports streaming.
> 
> On more about streaming - can we assume that when the database driver
supports streaming , the resultset iterator is a forward directional iterator.
> 
> If , say the streaming size is 10K records and we are trying to retrieve a
total of 100K records - what exactly happens when the threshold is reached ,
(say , the first 10K records were retrieved ).
> 
> Are the previous set of records thrown away and replaced in memory by the
new batch of records.
> 
> 
> 
> --- On Fri, 12/12/08, Shalin Shekhar Mangar 
wrote:
> From: Shalin Shekhar Mangar 
> Subject: Re: Solr - DataImportHandler - Large Dataset results ?
> To: solr-user@lucene.apache.org
> Date: Friday, December 12, 2008, 9:41 PM
> 
> DataImportHandler is designed to stream rows one by one to create Solr
> documents. As long as your database driver supports streaming, you should
be
> fine. Which database are you using?
> 
> On Sat, Dec 13, 2008 at 2:20 AM, Kay Kay 
wrote:
> 
>> As per the example in the wiki -
>> http://wiki.apache.org/solr/DataImportHandler  - I am seeing the
following
>> fragment.
>> 
>> > url="jdbc:hsqldb:/temp/example/ex" user="sa" />
>>   
>>   
>>   
>>   
>> ..
>>   
>> 
>> 
>> 
>> My scaled-down application looks very similar along these lines but
where
>> my resultset is so big that it cannot fit within main memory by any
> chance.
>> 
>> So I was planning to split this single query into multiple subqueries
-
>> with another conditional based on the id . ( id < 0 and id > 100
,
> say ) .
>> 
>> I am curious if there is any way to specify another conditional clause
,
>> (,
> where the column is supposed to
>> be an integer value) - and internally , the implementation could
actually
>> generate the subqueries -
>> 
>> i) get the min , max of the numeric column , and send queries to the
>> database based on the batch size
>> 
>> ii) Add Documents for each batch and close the resultset .
>> 
>> This might end up putting more load on the database (but at least the
>> dataset would fit in the main memory ).
>> 
>> Let me know if anyone else had run into similar issues and how this
was
>> encountered.
>> 
>> 
>> 
> 
> 
> 
> 
> --Regards,
> Shalin Shekhar Mangar.
> 
> 
>

Re: Using Regex fragmenter to extract paragraphs

2008-12-12 Thread Mark Ferguson

Someone helped me with the regex and pointed out a couple mistakes, most
notably the extra quantifier in .*{400,600}. My new regex is this:

\w.{400,600}[\.!?]

Unfortunately, my results still aren't any better. Some results start with a
word character, some don't, and none seem to end with punctuation. Any ideas
would else could be wrong?

Mark



On Fri, Dec 12, 2008 at 2:37 PM, Mark Ferguson wrote:

> Hello,
>
> I am trying to use the regex fragmenter and am having a hard time getting
> the results I want. I am trying to get fragments that start on a word
> character and end on punctuation, but for some reason the fragments being
> returned to me seem to be very inflexible, despite that I've provided a
> large slop. Here are the relevant parameters I'm using, maybe someone can
> help point out where I've gone wrong:
>
> 500
> regex
> 0.8
> [\w].*{400,600}[.!?]
> true
> chinese
>
> This should be matching between 400-600 characters, beginning with a word
> character and ending with one of .!?. Here is an example of a typical
> result:
>
> . Check these pictures out. Nine panda cubs on display for the first time
> Thursday in southwest China. They're less than a year old. They just
> recently stopped nursing. There are only 1,600 of these guys left in the
> mountain forests of central China, another 120 in  class='hl'>Chinese breeding facilities and zoos. And they're about 20
> that live outside China in zoos. They exist almost entirely on bamboo. They
> can live to be 30 years old. And these little guys will eventually get much
> bigger. They'll grow
>
> As you can see, it is starting with a period and ending on a word
> character! It's almost as if the fragments are just coming out as they will
> and the regex isn't doing anything at all, but the results are different
> when I use the gap fragmenter. In the above result I don't see any reason
> why it shouldn't have stripped out the preceding period and the last two
> words, there is plenty of room in the slop and in the regex pattern. Please
> help me figure out what I'm doing wrong...
>
> Thanks a lot,
>
> Mark Ferguson
>

negative boosts

2008-12-12 Thread Kevin Osborn

My index has a category field and I would like to apply a negative boost to 
certain categories. For example, if I search for "thinkpad", it should push 
results for the laptop bag and other accessory categories to the bottom.

So, I first tried altering the bq field with category:(batteries bags 
cables)^0.3.

In fact, this seems to positively boost products in those categories. This 
actually makes sense because the parsedQuery is something like:

+DisjunctionMaxQuery((category:thinkpad^1.75 | description:thinkpad^1.3 | 
manufacturer:thinkpad^1.6)~1.1) DisjunctionMaxQuery((category:thinkpad^1.75 | 
description:thinkpad^1.3 | manufacturer:thinkpad^1.6)~1.1) ((category:batteries 
category:bags category:cables)^0.3)

So, since only the negatively boosted categories are in the query, those 
categories make their way to the top when they are matched.

Is the only way to make this work to postively boost certain categories?

Re: ExtractingRequestHandler and XmlUpdateHandler

2008-12-12 Thread Jacob Singh

Hi Grant,

Thanks for the quick response.  My Colleague looked into the code a
bit, and I did as well, here is what I see (my Java sucks):

http://svn.apache.org/repos/asf/lucene/solr/trunk/contrib/extraction/src/main/java/org/apache/solr/handler/extraction/SolrContentHandler.java
//handle the literals from the params
Iterator paramNames = params.getParameterNamesIterator();
while (paramNames.hasNext()) {
  String name = paramNames.next();
  if (name.startsWith(LITERALS_PREFIX)) {
String fieldName = name.substring(LITERALS_PREFIX.length());
//no need to map names here, since they are literals from the user
SchemaField schFld = schema.getFieldOrNull(fieldName);
if (schFld != null) {
  String value = params.get(name);
  boost = getBoost(fieldName);
  //no need to transform here, b/c we can assume the user sent
it in correctly
  document.addField(fieldName, value, boost);
} else {
  handleUndeclaredField(fieldName);
}
  }
}


I don't know the solr source quite well enough to know if
document.addField() can take a struct in the form of some serialized
string, but how can I pass a multi-valued field via a
file-upload/multi-part POST?

One idea is that as one of the POST fields, I could add an XML payload
as could be parsed by the XML handler, and then we could instantiate
it, pass in the doc by reference, and get its multivalue fields all
populated nicely.  But this perhaps isn't a fantastic solution, I'm
really not much of a Java programmer at all, would love to hear your
expert opinion on how to solve this.

Best,
J

On Fri, Dec 12, 2008 at 6:40 PM, Grant Ingersoll  wrote:
> Hmmm, I think I see the disconnect, but I'm not sure.  Sending to the ERH
> (ExtractingReqHandler) is not an XML command at all, it's a file-upload/
> multi-part encoding.  I think you will need an API that does something like:
>
> (Just making this up, this is not real code)
> File file = new File(fileToIndex)
> resp = solr.addFile(file, params);
> 
>
> Where params contains the literals, captures, etc.  Then, in your API you
> need to do whatever PHP does to send that file as a multipart file (I think
> you can also POST it, too, but that has some downsides as described on the
> wiki)
>
> I'll try to whip up some SolrJ sample code, as I know others have asked for
> that.
>
> -Grant
>
> On Dec 12, 2008, at 5:34 AM, Jacob Singh wrote:
>
>> Hi Grant,
>>
>> Happy to.
>>
>> Currently we are sending over documents by building a big XML file of
>> all of the fields of that document. Something like this:
>>
>> $document = new Apache_Solr_Document();
>>   $document->id = apachesolr_document_id($node->nid);
>>   $document->title = $node->title;
>>   $document->body = strip_tags($text);
>>   $document->type  = $node->type;
>>   foreach ($categories as $cat) {
>>  $document->setMultiValue('category', $cat);
>>   }
>>
>> The PHP Client library then takes all of this, and builds it into an
>> XML payload which we POST over to Solr.
>>
>> When we implement rich file handling, I see these instructions:
>>
>> -
>> Literals
>>
>> To add in your own metadata, pass in the literal parameter along with the
>> file:
>>
>> curl
>> http://localhost:8983/solr/update/extract?ext.idx.attr=true\&ext.def.fl=text\&ext.map.div=foo_t\&ext.capture=div\&ext.boost.foo_t=3\&ext.literal.blah_i=1
>> -F "tutori...@tutorial.pdf"
>>
>> -
>>
>> So it seems we can:
>>
>> a). Refactor the class to not generate XML, but rather to build post
>> headers for each field.  We would like to avoid this.
>> b)  Instead, I was hoping we could send the XML payload with all the
>> literal fields defined (like id, type, etc), and the post fields
>> required for the file content and the field it belongs to in one
>> reqeust
>>
>> Since my understanding is that docs in Solr are immutable, there is no:
>> c). Send the file contents over, give it an ID, and then send over the
>> rest of the fields and merge into that ID.
>>
>> If the unfortunate answer is a, then how do we deal with multi-value
>> fields?  I don't know how to format them given the ext.literal format
>> above.
>>
>> Thanks for your help and awesome contributions!
>>
>> -Jacob
>>
>>
>>
>>
>> On Fri, Dec 12, 2008 at 4:52 AM, Grant Ingersoll 
>> wrote:
>>>
>>> On Dec 10, 2008, at 10:21 PM, Jacob Singh wrote:
>>>
 Hey folks,

 I'm looking at implementing ExtractingRequestHandler in the
 Apache_Solr_PHP
 library, and I'm wondering what we can do about adding meta-data.

 I saw the docs, which suggests you use different post headers to pass
 field
 values along with ext.literal.  Is there anyway to use the
 XmlUpdateHandler
 instead along with a document?  I'm not sure how this would work,
 perhaps it
 would require 2 trips, perhaps the XML would be in the post "content"
 and
 the file in something else?

Re: Solr - DataImportHandler - Large Dataset results ?

On Sat, Dec 13, 2008 at 4:51 AM, Kay Kay  wrote:

> Thanks Bryan .
>
> That clarifies a lot.
>
> But even with streaming - retrieving one document at a time and adding to
> the IndexWriter seems to making it more serializable .
>

We have experimented with making DataImportHandler multi-threaded in the
past. We found that the improvement was very small (5-10%) because, with
databases on the local network, the bottleneck is Lucene's ability to index
documents rather than DIH's ability to create documents. Since that made the
implementation much more complex, we did not go with it.

>
> So - may be the DataImportHandler could be optimized to retrieve a bunch of
> results from the query and add the Documents in a separate thread , from a
> Executor pool (and make this number configurable / may be retrieved from the
> System as the number of physical cores to exploit maximum parallelism )
> since that seems like a bottleneck.
>

For now, you can try creating multiple root entities with LIMIT clause to
fetch rows in batches.

For example:

...

and so on.

An alternate solution would be to use request parameters as variables in the
LIMIT clause and call DIH full import with different start and offset.

For example:

Then call:
http://host:port/solr/dataimport?command=full-import&startAt=0&count=5000
Wait for it to complete import (you'll have to monitor the output to figure
out when the import ends), and then call:
http://host:port
/solr/dataimport?command=full-import&startAt=5000&count=1
and so on. Note, "start" and "rows" are parameters used by DIH, so don't use
these parameter names.

I guess this will be more complex than using multiple root entities.

>
> Any comments on the same.
>
>
A workaround for the streaming bug with MySql JDBC driver is detailed here:
http://wiki.apache.org/solr/DataImportHandlerFaq

If you try any of these tricks, do let us know if it improves the
performance. If there is something which gives a lot of improvement, we can
figure out ways to implement them inside DataImportHandler itself.

-- 
Regards,
Shalin Shekhar Mangar.

Re: Solr 1.3 DataImportHandler iBatis integration ..

Ok makes sense. I don't think anybody has reported trying this. If you
decide to do it, it might be worth contributing back. I guess it may be more
difficult than just using plain sql queries.

On Sat, Dec 13, 2008 at 2:10 AM, Rakesh Sinha wrote:

> Trivial answer - I already have quite a bit of iBatis queries as part
> of the project ( a large consumer facing website) that I want to
> reuse.
> Also  - the iBatis layer already has all the db authentication tokens
> / sqlmap wired on ( as part of sql-map-config.xml ).
>
> When I create the dataConfig xml I seem to re-entering the db
> authentication details and the query once again to use the same.
> Hence an orthogonal integration might be really useful.
>
>
>
>
> On Fri, Dec 12, 2008 at 3:11 PM, Shalin Shekhar Mangar
>  wrote:
> > On Fri, Dec 12, 2008 at 11:50 PM, Rakesh Sinha  >wrote:
> >
> >> Hi -
> >>  I was planning to check more details about integrating ibatis query
> >> resultsets with the query required for  tags .   Before I
> >> start experimenting more along the lines - I am just curious if there
> >> had been some effort done earlier on this end (specifically - how to
> >> better integrate DataImportHandler with iBatis queries etc. )
> >>
> >
> > Why do you want to go through iBatis? Why not index directly from the
> > database?
> >
> > --
> > Regards,
> > Shalin Shekhar Mangar.
> >
>



-- 
Regards,
Shalin Shekhar Mangar.

Re: Solr - DataImportHandler - Large Dataset results ?


Thanks Shalin for the clarification.

The case about Lucene taking more time to index the Document when 
compared to DataImportHandler creating the input is definitely intuitive.


But just curious about the underlying architecture on which the test was 
being run. Was this performed on a multi-core machine . If so - how many 
cores were there ? What architecture would they be ?  It might be useful 
to know more about them to understand more about the results and see 
where they could be improved.


As about the query -

select * from table LIMIT 0, 5000

how database / vendor / driver neutral is this statement . I believe 
mysql supports this. But I am just curious how generic is this statement 
going to be .





Shalin Shekhar Mangar wrote:

On Sat, Dec 13, 2008 at 4:51 AM, Kay Kay  wrote:

  

Thanks Bryan .

That clarifies a lot.

But even with streaming - retrieving one document at a time and adding to
the IndexWriter seems to making it more serializable .




We have experimented with making DataImportHandler multi-threaded in the
past. We found that the improvement was very small (5-10%) because, with
databases on the local network, the bottleneck is Lucene's ability to index
documents rather than DIH's ability to create documents. Since that made the
implementation much more complex, we did not go with it.


  

So - may be the DataImportHandler could be optimized to retrieve a bunch of
results from the query and add the Documents in a separate thread , from a
Executor pool (and make this number configurable / may be retrieved from the
System as the number of physical cores to exploit maximum parallelism )
since that seems like a bottleneck.




For now, you can try creating multiple root entities with LIMIT clause to
fetch rows in batches.

For example:




...


and so on.

An alternate solution would be to use request parameters as variables in the
LIMIT clause and call DIH full import with different start and offset.

For example:


Then call:
http://host:port/solr/dataimport?command=full-import&startAt=0&count=5000
Wait for it to complete import (you'll have to monitor the output to figure
out when the import ends), and then call:
http://host:port
/solr/dataimport?command=full-import&startAt=5000&count=1
and so on. Note, "start" and "rows" are parameters used by DIH, so don't use
these parameter names.

I guess this will be more complex than using multiple root entities.


  

Any comments on the same.




A workaround for the streaming bug with MySql JDBC driver is detailed here:
http://wiki.apache.org/solr/DataImportHandlerFaq

If you try any of these tricks, do let us know if it improves the
performance. If there is something which gives a lot of improvement, we can
figure out ways to implement them inside DataImportHandler itself.

Stopping / Starting IndexReaders in Solr 1.3+

For a particular application of ours - we need to suspend the Solr 
server from doing any query operation ( IndexReader-s) for sometime, and 
then after sometime in the near future ( in minutes ) - reinitialize / 
warm IndexReaders once again and get moving.


It is a little bit different from  since this server is only 
supposed to read the data and not add create segments . But we want to 
suspend it as an initial test case for one of our load balancers. 

(Restarting Solr is an option though we want to get to that as a last 
resort ).

Re: Solr - DataImportHandler - Large Dataset results ?

On Sat, Dec 13, 2008 at 11:03 AM, Kay Kay  wrote:

> Thanks Shalin for the clarification.
>
> The case about Lucene taking more time to index the Document when compared
> to DataImportHandler creating the input is definitely intuitive.
>
> But just curious about the underlying architecture on which the test was
> being run. Was this performed on a multi-core machine . If so - how many
> cores were there ? What architecture would they be ?  It might be useful to
> know more about them to understand more about the results and see where they
> could be improved.
>

This was with 4 CPU 64-bit Xeon dual core boxes with 6GB dedicated to the
JVM. IIRC, dataset was 3 million documents joining 3 tables from MySQL
(index size on disk 1.3 gigs). Both Solr and MySql boxes were same
configuration and running on a gigabit network. This was done a long time
back so these may not be the exact values but should be pretty close.


>
> As about the query -
>
> select * from table LIMIT 0, 5000
>
> how database / vendor / driver neutral is this statement . I believe mysql
> supports this. But I am just curious how generic is this statement going to
> be .
>
>
This is for MySql. I believe we are discussing these workarounds only
because MySQL driver does not support batch streaming. It fetches rows
either one-by-one or all-at-once. You probably wouldn't need these tricks
for other databases.

-- 
Regards,
Shalin Shekhar Mangar.

Re: Solr - DataImportHandler - Large Dataset results ?


Shalin Shekhar Mangar wrote:

On Sat, Dec 13, 2008 at 11:03 AM, Kay Kay  wrote:

  

Thanks Shalin for the clarification.

The case about Lucene taking more time to index the Document when compared
to DataImportHandler creating the input is definitely intuitive.

But just curious about the underlying architecture on which the test was
being run. Was this performed on a multi-core machine . If so - how many
cores were there ? What architecture would they be ?  It might be useful to
know more about them to understand more about the results and see where they
could be improved.




This was with 4 CPU 64-bit Xeon dual core boxes with 6GB dedicated to the
JVM. IIRC, dataset was 3 million documents joining 3 tables from MySQL
(index size on disk 1.3 gigs). Both Solr and MySql boxes were same
configuration and running on a gigabit network. This was done a long time
back so these may not be the exact values but should be pretty close.

  

Thanks for the detailed configuration on which the tests were performed.
Our current architecture also looks more or less very similar to the same.
  

As about the query -

select * from table LIMIT 0, 5000

how database / vendor / driver neutral is this statement . I believe mysql
supports this. But I am just curious how generic is this statement going to
be .




This is for MySql. I believe we are discussing these workarounds only
because MySQL driver does not support batch streaming. It fetches rows
either one-by-one or all-at-once. You probably wouldn't need these tricks
for other databases.

  
True - Currently , playing around with mysql . But I was trying to 
understand more about how the Statement object is getting created (in 
the case of a platform / vendor specific query like this ). Are we going 
through JPA internally in Solr to create the Statements for the queries. 
Where can I look into this in Solr source code to understand more about 
this.

Re: Solr - DataImportHandler - Large Dataset results ?

On Sat, Dec 13, 2008 at 11:45 AM, Kay Kay  wrote:

> True - Currently , playing around with mysql . But I was trying to
> understand more about how the Statement object is getting created (in the
> case of a platform / vendor specific query like this ). Are we going through
> JPA internally in Solr to create the Statements for the queries. Where can I
> look into this in Solr source code to understand more about this.
>
>
We use Jdbc directly. Look at JdbcDataSource class inside
contrib/dataimporthandler/src/main/java.

Also look at this open issue --
https://issues.apache.org/jira/browse/SOLR-812

-- 
Regards,
Shalin Shekhar Mangar.

Re: multiword query using dismax


: 2<-25%

: my query is = monty+python+scandal

: i just issue monty+python i get bunch of documents but when i issue
: monty+python+scandal then i just get 1. isn't the case that i should
: get documents which match monty+python+scandal then followed by 
: documents that match monty+python or python+scandal or monty+scandal as per 
mm criteria?

as mentioned on the wiki...
http://wiki.apache.org/solr/DisMaxRequestHandler

If there are less then 3 optional clauses, they all must match; 
if there are 3 or more, then 75% must match, rounded up: "2<-25%"

...75% of 3 is 2.25, rounded up is 3, so all 3 words are required.

more details can be found in the docs...
http://lucene.apache.org/solr/api/org/apache/solr/util/doc-files/min-should-match.html

: some one explain from above example how many clauses we have ( i think
: it is 4 since 3 words across five fields and a phrase boost correct me
: if i am wrong) for minimal match criteria.

the mm only applies to the clauses from the words, you have 3 words, so 
you have 3 clauses.  if you used quotes, there are less...

monty python scandal   ... 3 clauses
"monty python" scandal ... 2 clauses


-Hoss

Re: minimum match issue with dismax


:   do any one know how to make sure minimum match in dismax is working? i 
: change the values and try doing solrCtl restart indexname but i don't 
: see it taking into effect. any body have an idea on this?

use debugQuery=true, and then look at the parsedquery ... it can 
be somewhat confusing if you aren't use to it, but for simple testing: 
don't use a pf, bf, or bq, set qf to a single field, and set tie=0

using the example configs a url like this...

http://localhost:8983/solr/select/?tie=0&pf=&bq=&bf=&q=first+second+third&qt=dismax&qf=text&mm=50%25&debugQuery=true

produces...

+((DisjunctionMaxQuery((text:first)) DisjunctionMaxQuery((text:second)) 
DisjunctionMaxQuery((text:third)))~1) ()

...that ~1 is the result of computing 50% of 3 rounded down.  if i change 
it to 70%...

http://localhost:8983/solr/select/?tie=0&pf=&bq=&bf=&q=first+second+third&qt=dismax&qf=text&mm=70%25&debugQuery=true

...i get...

+((DisjunctionMaxQuery((text:first)) DisjunctionMaxQuery((text:second)) 
DisjunctionMaxQuery((text:third)))~2) ()

...etc.  One thing to watch out for is that the "~X" syntax only shows you 
the minNrShouldMath value for boolean queries.  for phrase queries it 
shows you the slop value, and for the individual DisjunctionMaxQueries it 
shows you the tie breaker value (hence blanking out all those params keeps 
it simpler and easier to spot the mm value getting used)



-Hoss

Re: negative boosts


: My index has a category field and I would like to apply a negative boost 
: to certain categories. For example, if I search for "thinkpad", it 
: should push results for the laptop bag and other accessory categories to 
: the bottom.

: So, I first tried altering the bq field with category:(batteries bags 
cables)^0.3.
...
: Is the only way to make this work to postively boost certain categories?

Bingo.  you need to boost all products *not* in those categories...

bq = *:* -category:(batteries bags cables)


-Hoss

Re: Stopping / Starting IndexReaders in Solr 1.3+

2008-12-12 Thread Erik Hatcher

Maybe the PingRequestHandler can help?  It can check for the existence  
of a file (see solrconfig.xml for healthcheck) and return an error if  
it is  not there.  This wouldn't prevent Solr from responding to  
requests, but if a client used that information to determine whether  
to make requests or not it'd do the trick.


Erik


On Dec 13, 2008, at 12:54 AM, Kay Kay wrote:

For a particular application of ours - we need to suspend the Solr  
server from doing any query operation ( IndexReader-s) for sometime,  
and then after sometime in the near future ( in minutes ) -  
reinitialize / warm IndexReaders once again and get moving.


It is a little bit different from  since this server is  
only supposed to read the data and not add create segments . But we  
want to suspend it as an initial test case for one of our load  
balancers.
(Restarting Solr is an option though we want to get to that as a  
last resort ).

Re: Dismax Minimum Match/Stopwords Bug