How can Solr do parallel query warming with <listener> and <arr>?
I'm trying to get Solr to run warming queries in parallel with listener events, but it always does them in sequence, pegging one CPU while calculating facet counts.

Someone at Lucid Imagination suggested using multiple <listener event="firstSearcher"> tags, each with a single facet query in them, but those are still done in parallel.

Is it possible to run warming queries in parallel, and if so, how?

I'm aware that you could run an external script that forks, but I'd like to use Solr's native support for this if it exists.

Examples that don't work:

One listener containing all four facet queries:

    <listener event="firstSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst><str name="q">*:*</str><str name="facet.field">field1</str></lst>
        <lst><str name="q">*:*</str><str name="facet.field">field2</str></lst>
        <lst><str name="q">*:*</str><str name="facet.field">field3</str></lst>
        <lst><str name="q">*:*</str><str name="facet.field">field4</str></lst>
      </arr>
    </listener>

One listener per facet query:

    <listener event="firstSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst><str name="q">*:*</str><str name="facet.field">field1</str></lst>
      </arr>
    </listener>

    ... and the same again for field2, field3, and field4.
Re: How can Solr do parallel query warming with <listener> and <arr>?
> Someone at Lucid Imagination suggested using multiple <listener
> event="firstSearcher"> tags, each with a single facet query in them,
> but those are still done in parallel.

I meant to say: "but those are still done in sequence".

On Fri, Mar 2, 2012 at 3:37 PM, Neil Hooey wrote:
> I'm trying to get Solr to run warming queries in parallel with
> listener events, but it always does them in sequence, pegging one CPU
> while calculating facet counts.
Re: How can Solr do parallel query warming with <listener> and <arr>?
I need to have those queries trigger the generation of facet counts, which can take up to 5 minutes for all of them combined. If the facet counts aren't warmed, the first query to ask for facet counts on a particular field will take several minutes to return results.

On Sat, Mar 3, 2012 at 5:40 AM, Mikhail Khludnev wrote:
> Neil,
>
> Would you mind if I ask what, in particular, you want to warm with
> these queries?
>
> Regards
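To make the external workaround mentioned in the original question concrete: instead of a forking shell script, a small SolrJ client can fire one facet query per field from its own thread, so each facet cache is populated on its own CPU. This is only a sketch, not something from the thread; it assumes Solr 3.x SolrJ, a hypothetical core URL, and the placeholder field names field1 through field4 from the examples above.

    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class ParallelFacetWarmer {
        public static void main(String[] args) throws Exception {
            // Hypothetical core URL and placeholder field names.
            final SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
            List<String> facetFields = Arrays.asList("field1", "field2", "field3", "field4");

            // One thread per facet field, so each warming query can use its own CPU.
            ExecutorService pool = Executors.newFixedThreadPool(facetFields.size());
            for (final String field : facetFields) {
                pool.submit(new Runnable() {
                    public void run() {
                        try {
                            SolrQuery q = new SolrQuery("*:*");
                            q.setFacet(true);
                            q.addFacetField(field);
                            q.setRows(0);     // only the facet counts matter here
                            server.query(q);  // forces the facet data structures to be built
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(30, TimeUnit.MINUTES);
        }
    }

Note that this warms the searcher that is already registered by sending it ordinary requests; unlike the solrconfig.xml listeners, it does not hook into the firstSearcher/newSearcher events.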
How does "start.jar" get build in the Solr trunk repository?
I'm trying to figure out how the "solr/example/start.jar" file gets built in the Solr trunk repository, but I can't find anything about jetty being built in any of the Ant build XML files.

I'm trying to duplicate the same behaviour in my Maven build of Solr with my custom plugin.

Does anyone know the target that builds jetty-start?

- Neil
Re: How does "start.jar" get build in the Solr trunk repository?
I see that it's done with some XML in "solr/example/build.xml".

Does anyone know how you could do that in Maven?

On Mon, May 7, 2012 at 4:39 PM, Neil Hooey wrote:
> I'm trying to figure out how the "solr/example/start.jar" file gets built
> in the Solr trunk repository, but I can't find anything about jetty being
> built in any of the Ant build XML files.
Replacing payloads for per-document-per-keyword scores
Hello Hoss and the list,

We are currently using Lucene payloads to store per-document-per-keyword scores for our dataset. Our dataset consists of photos with keywords assigned (only once each) to them. The index is about 90 GB, running on 24-core machines with dedicated 10k SAS drives and 16/32 GB allocated to the JVM. When searching the payloads field, our 98th-percentile query time is 2 seconds, even at trivially low queries per second.

I have asked several Lucene committers about this, and the belief is that the very general implementation of payloads is the cause of the slowness.

Hoss guessed that we could override Term Frequency with PreAnalyzedField[1] for the per-keyword scores, since keywords (tags) always have a Term Frequency of 1 and the TF calculation is very fast. However, it turns out that you can't[2] specify TF in a PreAnalyzedField.

Is there any other way to override Term Frequency during index time? If not, where in the code could this be implemented?

An obvious option is to repeat the keyword as many times as its payload score, but that would drastically increase the amount of data sent per document during indexing.

I'd welcome any other per-document-per-keyword score solutions, or some way to speed up searching a payload field.

Thanks,

- Neil

[1] https://issues.apache.org/jira/browse/SOLR-1535
[2] https://issues.apache.org/jira/browse/SOLR-1535?focusedCommentId=13273501#comment-13273501
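One index-time possibility, sketched here rather than taken from the thread: term frequency is just the number of occurrences of a term in a field, so a custom TokenFilter can re-emit each keyword on the server side as many times as its (integer) score, making TF carry the per-keyword weight without inflating the data sent per document. The class name and the "keyword|score" input convention are assumptions for illustration, and fractional scores such as 0.1 cannot be represented this way; the attribute API is Lucene 3.x.

    import java.io.IOException;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
    import org.apache.lucene.util.AttributeSource;

    // Hypothetical filter: turns each "keyword|score" token into `score` copies of
    // "keyword", so the indexed term frequency equals the integer score while the
    // client still sends each keyword only once.
    public final class TermFrequencyScoreFilter extends TokenFilter {

        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private final PositionIncrementAttribute posIncAtt =
                addAttribute(PositionIncrementAttribute.class);
        private final char delimiter;
        private int pendingCopies = 0;               // extra copies of the current term left to emit
        private AttributeSource.State currentState;  // snapshot of the current token

        public TermFrequencyScoreFilter(TokenStream input, char delimiter) {
            super(input);
            this.delimiter = delimiter;
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (pendingCopies > 0) {
                // Re-emit the same keyword: TF counts occurrences, so each copy raises
                // the term frequency by one. Position increment 0 stacks the copies at
                // the original position.
                restoreState(currentState);
                posIncAtt.setPositionIncrement(0);
                pendingCopies--;
                return true;
            }
            if (!input.incrementToken()) {
                return false;
            }
            String text = termAtt.toString();
            int sep = text.lastIndexOf(delimiter);
            if (sep >= 0) {
                // Strip the "|score" suffix and remember how many extra copies to emit.
                int score = Math.max(1, (int) Float.parseFloat(text.substring(sep + 1)));
                termAtt.setEmpty().append(text.substring(0, sep));
                pendingCopies = score - 1;
            }
            currentState = captureState();
            return true;
        }

        @Override
        public void reset() throws IOException {
            super.reset();
            pendingCopies = 0;
            currentState = null;
        }
    }

Ordinary TermQuery scoring would then see tf equal to the score; keep in mind that DefaultSimilarity applies a square root to tf, so a Similarity whose tf(freq) returns freq unchanged would be needed for a linear effect.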
How do you index multiple documents in JSON?
How do you add multiple documents to Solr in JSON in a single request?

In XML, I can just send this:

    <add>
      <doc><field name="id">1</field></doc>
      <doc><field name="id">2</field></doc>
    </add>

There is an example on this page:
http://wiki.apache.org/solr/UpdateJSON

But it doesn't demonstrate how to send more than one document.

Thanks,

- Neil
Re: How do you index multiple documents in JSON?
I found out how to do it, but you have to use duplicate "add" keys in a JSON object, which isn't easy to serialize from a hash or map in most languages.

I reported an issue here: https://issues.apache.org/jira/browse/SOLR-2496

Please vote for it if you agree.

On Wed, May 4, 2011 at 3:00 PM, Neil Hooey wrote:
> How do you add multiple documents to Solr in JSON in a single request?
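A side note not taken from the thread: if the indexing client can use SolrJ rather than hand-built JSON, several documents already go out in a single request, which sidesteps the duplicate-key problem entirely. A minimal sketch, assuming Solr 3.x SolrJ and a hypothetical local core URL:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class MultiDocumentAdd {
        public static void main(String[] args) throws Exception {
            // Hypothetical core URL; adjust to the real deployment.
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

            List<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
            for (int id = 1; id <= 2; id++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Integer.toString(id));
                docs.add(doc);
            }

            // Both documents travel in one update request, followed by a commit.
            server.add(docs);
            server.commit();
        }
    }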
Do boosts on values in multivalued fields still get consolidated?
Kapil Chhabra indicates on his blog that if you boost a value in a multivalued field at index time, the boosts are consolidated into a single boost for the whole field, and the individual per-value boosts are lost.

Here's the link:
http://blog.kapilchhabra.com/2008/01/solr-index-time-boost-facts-2

This post is from 2008-01-20, but it still seems to hold in Solr 3.1.

Has this behaviour been fixed in later versions of Solr, or are there plans to fix it?

In general, when a user searches for a document, I'd like to arbitrarily weight each keyword for that document at index time. For example, if they searched for "q=keywords:monkey" and got these documents:

    keywords: [ monkey, ape, chimp, garage ]
    keywords: [ monkey, cloud, food, door ]

I'd like to have boosts recorded like this at index time, based on keyword co-relevance:

    keywords: [ monkey:50, ape:50, chimp:50, garage:0.1 ]
    keywords: [ monkey:1, cloud:1, food:1, door:1 ]

In the first document, the word "monkey" is clearly related to "ape" and "chimp", but "garage" is not. In the second document, none of the keywords are really related to each other at all.

I see a couple of potential solutions to this problem, in the absence of boosts for multivalued fields:

1. Turn keyword lists into strings and use payloads: "monkey|50, ape|50, chimp|50, garage|0.1"
2. Use dynamic fields of the form keyword_*: keyword_monkey, keyword_ape, ... and boost those fields.

Are those solutions feasible, or are there better solutions to this problem?

- Neil
Re: Do boosts on values in multivalued fields still get consolidated?
If I have a document with:

    { id: 1, sentences: "hello world|5.0_goodbye|2.3_this is a sentence|2.8" }

How would I get those payloads to take effect on the tokens separated by "_"? How do you write a query that uses those payloads?

On Wed, May 4, 2011 at 22:26, Otis Gospodnetic wrote:
> Hi Neil,
>
> I think payloads is the way to go. Index-time boosting is not per term.
>
> Otis
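One possible end-to-end answer, sketched against raw Lucene 3.x APIs rather than taken from the thread: keyword|score pairs like those in the original question are indexed through DelimitedPayloadTokenFilter, and a PayloadTermQuery together with a payload-aware Similarity makes the stored score drive ranking. The field name "keywords", the Version constant, and the in-memory index are illustrative assumptions; underscore-separated phrase tokens like the "sentences" example above would additionally need a tokenizer that splits on '_' (for example a pattern-based tokenizer) instead of whitespace.

    import java.io.IOException;
    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;
    import org.apache.lucene.analysis.payloads.DelimitedPayloadTokenFilter;
    import org.apache.lucene.analysis.payloads.FloatEncoder;
    import org.apache.lucene.analysis.payloads.PayloadHelper;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.DefaultSimilarity;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.search.payloads.MaxPayloadFunction;
    import org.apache.lucene.search.payloads.PayloadTermQuery;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class PayloadScoringExample {

        // Whitespace tokens of the form keyword|score; the value after '|' is
        // stored as a float payload on the keyword token.
        static final Analyzer PAYLOAD_ANALYZER = new Analyzer() {
            @Override
            public TokenStream tokenStream(String fieldName, Reader reader) {
                TokenStream ts = new WhitespaceTokenizer(Version.LUCENE_36, reader);
                return new DelimitedPayloadTokenFilter(ts, '|', new FloatEncoder());
            }
        };

        // Turns the stored payload back into a score factor at search time.
        static final DefaultSimilarity PAYLOAD_SIMILARITY = new DefaultSimilarity() {
            @Override
            public float scorePayload(int docId, String fieldName, int start, int end,
                                      byte[] payload, int offset, int length) {
                return payload == null ? 1.0f : PayloadHelper.decodeFloat(payload, offset);
            }
        };

        public static void main(String[] args) throws IOException {
            RAMDirectory dir = new RAMDirectory();
            IndexWriter writer = new IndexWriter(dir,
                    new IndexWriterConfig(Version.LUCENE_36, PAYLOAD_ANALYZER));

            Document doc = new Document();
            // Per-keyword scores from the original question, flattened into one string.
            doc.add(new Field("keywords", "monkey|50 ape|50 chimp|50 garage|0.1",
                              Field.Store.NO, Field.Index.ANALYZED));
            writer.addDocument(doc);
            writer.close();

            IndexSearcher searcher = new IndexSearcher(IndexReader.open(dir));
            searcher.setSimilarity(PAYLOAD_SIMILARITY);

            // includeSpanScore=false: the payload alone drives the score.
            PayloadTermQuery query = new PayloadTermQuery(
                    new Term("keywords", "monkey"), new MaxPayloadFunction(), false);
            TopDocs hits = searcher.search(query, 10);
            System.out.println("hits=" + hits.totalHits
                    + " topScore=" + hits.scoreDocs[0].score);
            searcher.close();
        }
    }

On the Solr side, the analysis half of this corresponds to solr.DelimitedPayloadTokenFilterFactory in schema.xml; the query and Similarity halves still needed custom Java in the 3.x line.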
Improving PayloadTermQuery Performance
What are some ways to increase the performance of PayloadTermQuery?

I'm currently getting a maximum of 22 QPS after 90k unique queries against a payload-enhanced keyword field on a dataset of 18 million documents, whereas a simple term search on the equivalent multivalued string field gives a maximum of 700 QPS.

Here are the performance numbers for queries 89,000 - 90,000:

    Int   #Reqs   Secs    Reqs/s   Avg     Median   80th    95th    99th    Max
    89    1000    45.52   22.0     0.045   0.013    0.067   0.198   0.360   1.144

In terms of implementation, I wrote a bunch of custom classes that end up overriding QueryParserBase.newTermQuery() to return a PayloadTermQuery instead of a TermQuery. This implementation seems to work fine, but it's very slow.

I'm using HTTPD::Bench::ApacheBench with anywhere between 1 and 40 concurrent requests, and it pegs one of four CPUs at 100% the whole time, leaving the others idle.

Specifically, are there ways to:

1. Use more than one CPU for PayloadTermQuery processing?
2. Take advantage of caching when calculating payloads? (I've heard multivalued string fields take advantage of caching where payloads do not.)
3. Increase the query throughput for payloads in any other way?

Thanks,

- Neil
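For reference, the override described above might look roughly like the following. This is a sketch only: the class name is hypothetical, it assumes the Lucene 3.x classic QueryParser, and MaxPayloadFunction stands in for whatever PayloadFunction is actually in use.

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.payloads.MaxPayloadFunction;
    import org.apache.lucene.search.payloads.PayloadTermQuery;
    import org.apache.lucene.util.Version;

    // Every plain term produced by the parser becomes a PayloadTermQuery, so the
    // per-keyword payload scores participate in ranking.
    public class PayloadQueryParser extends QueryParser {

        public PayloadQueryParser(Version matchVersion, String field, Analyzer analyzer) {
            super(matchVersion, field, analyzer);
        }

        @Override
        protected Query newTermQuery(Term term) {
            return new PayloadTermQuery(term, new MaxPayloadFunction());
        }
    }

In Solr, a parser like this is typically wrapped in a custom QParserPlugin so that individual request handlers or requests can opt into payload scoring.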