How can Solr do parallel query warming with <listener> and <arr>?

2012-03-02 Thread Neil Hooey
I'm trying to get Solr to run warming queries in parallel with
listener events, but it always does them in sequence, pegging one CPU
while calculating facet counts.

Someone at Lucid Imagination suggested using multiple <listener event="firstSearcher"> tags, each with a single facet query in them,
but those are still done in parallel.

Is it possible to run warming queries in parallel, and if so, how?

I'm aware that you could run an external script that forks, but I'd
like to use Solr's native support for this if it exists.
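
(For what it's worth, the fork-and-query fallback can be as small as the sketch
below. It is only an illustration: the core URL, field names, and class name are
made up; it just fires the facet-warming requests concurrently over plain HTTP.)

import java.io.InputStream;
import java.net.URL;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Fires one facet-warming request per field, in parallel, against a local core.
public class ParallelWarmer {
    private static final String BASE =
        "http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=";

    public static void main(String[] args) throws Exception {
        List<String> fields = Arrays.asList("field1", "field2", "field3", "field4");
        ExecutorService pool = Executors.newFixedThreadPool(fields.size());
        for (final String field : fields) {
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        // Reading the response forces Solr to compute the facet counts.
                        InputStream in = new URL(BASE + field).openStream();
                        while (in.read() != -1) { /* drain */ }
                        in.close();
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
    }
}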

Examples that don't work:

<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">*:*</str><str name="facet.field">field1</str></lst>
    <lst><str name="q">*:*</str><str name="facet.field">field2</str></lst>
    <lst><str name="q">*:*</str><str name="facet.field">field3</str></lst>
    <lst><str name="q">*:*</str><str name="facet.field">field4</str></lst>
  </arr>
</listener>

<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">*:*</str><str name="facet.field">field1</str></lst>
  </arr>
</listener>
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">*:*</str><str name="facet.field">field2</str></lst>
  </arr>
</listener>
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">*:*</str><str name="facet.field">field3</str></lst>
  </arr>
</listener>
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">*:*</str><str name="facet.field">field4</str></lst>
  </arr>
</listener>

Re: How can Solr do parallel query warming with <listener> and <arr>?

2012-03-02 Thread Neil Hooey
> Someone at Lucid Imagination suggested using multiple <listener event="firstSearcher"> tags, each with a single facet query in them,
> but those are still done in parallel.

I meant to say: "but those are still done in sequence".


On Fri, Mar 2, 2012 at 3:37 PM, Neil Hooey  wrote:
> I'm trying to get Solr to run warming queries in parallel with
> listener events, but it always does them in sequence, pegging one CPU
> while calculating facet counts.
>
> Someone at Lucid Imagination suggested using multiple <listener event="firstSearcher"> tags, each with a single facet query in them,
> but those are still done in parallel.
>
> Is it possible to run warming queries in parallel, and if so, how?
>
> I'm aware that you could run an external script that forks, but I'd
> like to use Solr's native support for this if it exists.
>
> Examples that don't work:
>
> <listener event="firstSearcher" class="solr.QuerySenderListener">
>   <arr name="queries">
>     <lst><str name="q">*:*</str><str name="facet.field">field1</str></lst>
>     <lst><str name="q">*:*</str><str name="facet.field">field2</str></lst>
>     <lst><str name="q">*:*</str><str name="facet.field">field3</str></lst>
>     <lst><str name="q">*:*</str><str name="facet.field">field4</str></lst>
>   </arr>
> </listener>
>
> <listener event="firstSearcher" class="solr.QuerySenderListener">
>   <arr name="queries">
>     <lst><str name="q">*:*</str><str name="facet.field">field1</str></lst>
>   </arr>
> </listener>
> <listener event="firstSearcher" class="solr.QuerySenderListener">
>   <arr name="queries">
>     <lst><str name="q">*:*</str><str name="facet.field">field2</str></lst>
>   </arr>
> </listener>
> <listener event="firstSearcher" class="solr.QuerySenderListener">
>   <arr name="queries">
>     <lst><str name="q">*:*</str><str name="facet.field">field3</str></lst>
>   </arr>
> </listener>
> <listener event="firstSearcher" class="solr.QuerySenderListener">
>   <arr name="queries">
>     <lst><str name="q">*:*</str><str name="facet.field">field4</str></lst>
>   </arr>
> </listener>


Re: How can Solr do parallel query warming with <listener> and <arr>?

2012-03-03 Thread Neil Hooey
I need to have those queries trigger the generation of facet counts, which
can take up to 5 minutes for all of them combined.

If the facet counts aren't warmed, then the first query to ask for facet
counts on a particular field will take several minutes to return results.

On Sat, Mar 3, 2012 at 5:40 AM, Mikhail Khludnev 
wrote:
> Neil,
>
> Would you mind if I ask what exactly you want to warm with these
> queries?
>
> Regards
>
> On Sat, Mar 3, 2012 at 12:37 AM, Neil Hooey  wrote:
>
>> I'm trying to get Solr to run warming queries in parallel with
>> listener events, but it always does them in sequence, pegging one CPU
>> while calculating facet counts.
>>
>> Someone at Lucid Imagination suggested using multiple <listener event="firstSearcher"> tags, each with a single facet query in them,
>> but those are still done in parallel.
>>
>> Is it possible to run warming queries in parallel, and if so, how?
>>
>> I'm aware that you could run an external script that forks, but I'd
>> like to use Solr's native support for this if it exists.
>>
>> Examples that don't work:
>>
>> <listener event="firstSearcher" class="solr.QuerySenderListener">
>>   <arr name="queries">
>>     <lst><str name="q">*:*</str><str name="facet.field">field1</str></lst>
>>     <lst><str name="q">*:*</str><str name="facet.field">field2</str></lst>
>>     <lst><str name="q">*:*</str><str name="facet.field">field3</str></lst>
>>     <lst><str name="q">*:*</str><str name="facet.field">field4</str></lst>
>>   </arr>
>> </listener>
>>
>> <listener event="firstSearcher" class="solr.QuerySenderListener">
>>   <arr name="queries">
>>     <lst><str name="q">*:*</str><str name="facet.field">field1</str></lst>
>>   </arr>
>> </listener>
>> <listener event="firstSearcher" class="solr.QuerySenderListener">
>>   <arr name="queries">
>>     <lst><str name="q">*:*</str><str name="facet.field">field2</str></lst>
>>   </arr>
>> </listener>
>> <listener event="firstSearcher" class="solr.QuerySenderListener">
>>   <arr name="queries">
>>     <lst><str name="q">*:*</str><str name="facet.field">field3</str></lst>
>>   </arr>
>> </listener>
>> <listener event="firstSearcher" class="solr.QuerySenderListener">
>>   <arr name="queries">
>>     <lst><str name="q">*:*</str><str name="facet.field">field4</str></lst>
>>   </arr>
>> </listener>
>>
>>
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Lucid Certified
> Apache Lucene/Solr Developer
> Grid Dynamics
>
> <http://www.griddynamics.com>
>  


How does "start.jar" get build in the Solr trunk repository?

2012-05-07 Thread Neil Hooey
I'm trying to figure out how the "solr/example/start.jar" file gets built
in the Solr trunk repository, but I can't find anything about jetty being
built in any of the Ant build XML files.

I'm trying to duplicate the same behaviour in my Maven build of Solr with
my custom plugin.

Does anyone know the target that builds jetty-start?

- Neil


Re: How does "start.jar" get built in the Solr trunk repository?

2012-05-07 Thread Neil Hooey
I see that it's done in "solr/example/build.xml" with this XML:



Does anyone know how you could do that in Maven?
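
(If it helps, one common way to get a start.jar into a Maven build is the
dependency plugin's copy goal, roughly as sketched below. This is only a sketch:
the jetty-start coordinates and the ${jetty.version} property are placeholders
for whatever Jetty your build actually pulls in, and the output directory is
arbitrary.)

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-dependency-plugin</artifactId>
  <executions>
    <execution>
      <id>copy-jetty-start</id>
      <phase>package</phase>
      <goals>
        <goal>copy</goal>
      </goals>
      <configuration>
        <artifactItems>
          <artifactItem>
            <groupId>org.eclipse.jetty</groupId>
            <artifactId>jetty-start</artifactId>
            <version>${jetty.version}</version>
            <destFileName>start.jar</destFileName>
          </artifactItem>
        </artifactItems>
        <outputDirectory>${project.build.directory}/example</outputDirectory>
      </configuration>
    </execution>
  </executions>
</plugin>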

On Mon, May 7, 2012 at 4:39 PM, Neil Hooey  wrote:

> I'm trying to figure out how the "solr/example/start.jar" file gets built
> in the Solr trunk repository, but I can't find anything about jetty being
> built in any of the Ant build XML files.
>
>  I'm trying to duplicate the same behaviour in my Maven build of Solr with
> my custom plugin.
>
> Does anyone know the target that builds jetty-start?
>
> - Neil
>


Replacing payloads for per-document-per-keyword scores

2012-05-15 Thread Neil Hooey
Hello Hoss and the list,

We are currently using Lucene payloads to store per-document-per-keyword
scores for our dataset. Our dataset consists of photos with keywords
assigned (only once each) to them. The index is about 90 GB, running on
24-core machines with dedicated 10k SAS drives, and 16/32 GB allocated to
the JVM.

When searching the payloads field, our 98th-percentile query time is 2 seconds,
even at trivially low queries per second. I have asked several Lucene committers
about this, and the belief is that the generality of the payloads implementation
is the cause of the slowness.

Hoss guessed that we could override Term Frequency with PreAnalyzedField[1]
for the per-keyword scores, since keywords (tags) always have a Term
Frequency of 1 and the TF calculation is very fast. However it turns out
that you can't[2] specify TF in the PreAnalyzedField.

Is there any other way to override Term Frequency during index time? If
not, where in the code could this be implemented?

An obvious option is to repeat the keyword as many times as its payload
score, but that would drastically increase the amount of data per document
sent during index time.

I'd welcome any other per-document-per-keyword score solutions, or some way
to speed up searching a payload field.

Thanks,

- Neil

[1] https://issues.apache.org/jira/browse/SOLR-1535
[2]
https://issues.apache.org/jira/browse/SOLR-1535?focusedCommentId=13273501#comment-13273501


How do you index multiple documents in JSON?

2011-05-04 Thread Neil Hooey
How do you add multiple documents to Solr in JSON in a single request?

In XML, I can just send this:

<add>
  <doc><field name="id">1</field></doc>
  <doc><field name="id">2</field></doc>
</add>

There is an example on this page:
http://wiki.apache.org/solr/UpdateJSON

But it doesn't demonstrate how to send more than one document.

Thanks,

- Neil


Re: How do you index multiple documents in JSON?

2011-05-04 Thread Neil Hooey
I found out how to do it, but you have to use duplicate "add" keys in a
JSON object, which isn't easy to serialize from a hash/map in most
languages.
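
(For anyone finding this later, the repeated-key form looks roughly like this;
the field names are just for illustration:)

{
  "add": { "doc": { "id": "1" } },
  "add": { "doc": { "id": "2" } }
}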

I reported an issue here:
https://issues.apache.org/jira/browse/SOLR-2496

Please vote for it if you agree.


On Wed, May 4, 2011 at 3:00 PM, Neil Hooey  wrote:
> How do you add multiple documents to Solr in JSON in a single request?
>
> In XML, I can just send this:
> 
> <add>
>   <doc><field name="id">1</field></doc>
>   <doc><field name="id">2</field></doc>
> </add>
>
> There is an example on this page:
> http://wiki.apache.org/solr/UpdateJSON
>
> But it doesn't demonstrate how to send more than one document.
>
> Thanks,
>
> - Neil
>


Do boosts on values in multivalued fields still get consolidated?

2011-05-04 Thread Neil Hooey
Kapil Chhabra indicates on his blog that if you boost individual values of a
multivalued field at index time, the boosts are consolidated into a single
boost for the whole field, and the per-value boosts are lost.

Here's the link:
http://blog.kapilchhabra.com/2008/01/solr-index-time-boost-facts-2

This post is from 2008-01-20, but it still seems to be true in Solr 3.1.

Has this behaviour been fixed in future versions of Solr, or are there
plans to fix it?

In general, when a user searches for a document, I'd like to
arbitrarily weight each keyword for that document during index time.

For example if they searched for "q=keywords:monkey", and got these documents:
keywords: [ monkey, ape, chimp, garage ]
keywords: [ monkey, cloud, food, door ]

I'd like to have boosts recorded like this, at index time, based on
keyword co-relevance:
keywords: [ monkey:50, ape:50, chimp:50, garage:0.1 ]
keywords: [ monkey:1, cloud:1, food:1, door:1 ]

In the first document, the word "monkey" is clearly related to "ape" and
"chimp", but "garage" is not; in the second document, none of the keywords
are really related to each other at all.

I see a couple of potential solutions to this problem, in the absence
of boosts for multivalued fields:
1. Turn keyword lists into strings, and use payloads: "monkey|50,
ape|50, chimp|50, garage|0.1"
2. Use dynamic fields of the form: keyword_*: keyword_monkey,
keyword_ape, ... and boost those fields.

Are those solutions feasible, or are there better solutions to this problem?

- Neil


Re: Do boosts on values in multivalued fields still get consolidated?

2011-05-04 Thread Neil Hooey
If I have a document with:
{ id: 1, sentences: "hello world|5.0_goodbye|2.3_this is a sentence|2.8" }

How would I get those payloads to take effect on the tokens separated by "_"?

How do you write a query to use those payloads?
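
(A rough sketch of how that field could be analyzed, not a tested config: split
the value on "_" with a pattern tokenizer, then peel the "|score" suffix off each
token as a float payload. Getting the payloads to influence scoring still needs a
payload-aware query, e.g. a parser that builds PayloadTermQuery instead of
TermQuery; the stock query parsers ignore payloads.)

<fieldType name="payloaded_sentences" class="solr.TextField">
  <analyzer>
    <!-- one token per "_"-separated sentence -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="_"/>
    <!-- turn the trailing "|2.3" into a float payload on the token -->
    <filter class="solr.DelimitedPayloadTokenFilterFactory" delimiter="|" encoder="float"/>
  </analyzer>
</fieldType>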

On Wed, May 4, 2011 at 22:26, Otis Gospodnetic
wrote:

> Hi Neil,
>
> I think payloads is the way to go.  Index-time boosting is not per term.
>
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
>
>
> - Original Message 
> > From: Neil Hooey 
> > To: solr-user@lucene.apache.org
> > Sent: Wed, May 4, 2011 9:36:24 PM
> > Subject: Do boosts on values in multivalued fields still get consolidated?
> >
> > Kapil Chhabra indicates on his blog that if you boost individual values of a
> > multivalued field at index time, the boosts are consolidated into a single
> > boost for the whole field, and the per-value boosts are lost.
> >
> > Here's the link:
> > http://blog.kapilchhabra.com/2008/01/solr-index-time-boost-facts-2
> >
> > This post is from 2008-01-20, but it still seems to be true in Solr 3.1.
> >
> > Has this behaviour been fixed in future versions of Solr, or are there
> > plans to fix it?
> >
> > In general, when a user searches for a document, I'd like to
> > arbitrarily weight each keyword for that document during index time.
> >
> > For example if they searched for "q=keywords:monkey", and got these documents:
> > keywords: [ monkey, ape, chimp, garage ]
> > keywords: [ monkey, cloud, food, door ]
> >
> > I'd like to have boosts recorded like this, at index time, based on
> > keyword co-relevance:
> > keywords: [ monkey:50, ape:50, chimp:50, garage:0.1 ]
> > keywords: [ monkey:1, cloud:1, food:1, door:1 ]
> >
> > In the first document, the word "monkey" is clearly related to "ape" and
> > "chimp", but "garage" is not; in the second document, none of the keywords
> > are really related to each other at all.
> >
> > I see a couple of potential solutions to this problem, in the absence
> > of boosts for multivalued fields:
> > 1. Turn keyword lists into strings, and use payloads: "monkey|50,
> > ape|50, chimp|50, garage|0.1"
> > 2. Use dynamic fields of the form: keyword_*: keyword_monkey,
> > keyword_ape, ... and boost those fields.
> >
> > Are those solutions feasible, or are there better solutions to this problem?
> >
> > - Neil
> >
>


Improving PayloadTermQuery Performance

2011-05-23 Thread Neil Hooey
What are some ways to increase the performance of PayloadTermQuery searches?

I'm currently getting a maximum of 22 QPS after 90k unique queries against a
payload-enhanced keyword field on a dataset of 18 million documents, whereas a
simple term search on the equivalent multivalued string field gives a maximum
of 700 QPS.

Here are the performance numbers for queries 89,000 - 90,000:
Int   #Reqs   Secs    Reqs/s   Avg     Median   80th    95th    99th    Max
89    1000    45.52   22.0     0.045   0.013    0.067   0.198   0.360   1.144

In terms of implementation, I wrote a bunch of custom classes that end
up overriding QueryParserBase.newTermQuery() to return a
PayloadTermQuery instead of a TermQuery. This implementation seems to
work fine, but it's very slow.
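
(Roughly what such an override looks like, for anyone curious; this is a sketch
against Lucene 3.x classes rather than the actual code from this message, and the
class name and choice of MaxPayloadFunction are illustrative.)

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.payloads.MaxPayloadFunction;
import org.apache.lucene.search.payloads.PayloadTermQuery;
import org.apache.lucene.util.Version;

// Query parser that scores term matches with their payloads instead of plain TF.
public class PayloadQueryParser extends QueryParser {
    public PayloadQueryParser(String defaultField, Analyzer analyzer) {
        super(Version.LUCENE_36, defaultField, analyzer);
    }

    @Override
    protected Query newTermQuery(Term term) {
        // MaxPayloadFunction takes the largest payload seen for the term in a
        // document; the boolean keeps the underlying span score as a factor too.
        return new PayloadTermQuery(term, new MaxPayloadFunction(), true);
    }
}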

I'm using HTTPD::Bench::ApacheBench with anywhere between 1 and 40
concurrent requests, and it pegs one of four CPUs at 100% the whole
time, leaving the others idle.

Specifically, are there ways to:
1. Use more than one CPU for PayloadTermQuery processing?
2. Take advantage of caching when calculating payloads?
   (I've heard multivalued string fields take advantage of caching
where payloads do not)
3. Increase the query throughput for payloads in any other way?

Thanks,

- Neil