Date faceting and memory leaks
I have been running load tests using JMeter against a Solr 1.4 index with ~4 million docs. I notice a steady JVM heap size increase as I iterate over 100 query terms a number of times against the index. The GC does not seem to reclaim the heap after the test run is completed, and Solr runs into OutOfMemory as I repeat the test or increase the number of threads/users. The date facet queries are specified as follows (as part of the "appends" section in the request handler):

{!ex=last_modified}last_modified:[NOW-30DAY TO *]
{!ex=last_modified}last_modified:[NOW-90DAY TO NOW-30DAY]
{!ex=last_modified}last_modified:[NOW-180DAY TO NOW-90DAY]
{!ex=last_modified}last_modified:[NOW-365DAY TO NOW-180DAY]
{!ex=last_modified}last_modified:[NOW-730DAY TO NOW-365DAY]
{!ex=last_modified}last_modified:[* TO NOW-730DAY]

The last_modified field is a TrieDateField with a precisionStep of 6. I have played with the filterCache settings, but they have no effect, as the date field cache seems to be managed by the Lucene FieldCache. Please help, as I could be struggling with this for days. Thanks in advance.
Re: Date faceting and memory leaks
No, I still have the OOM issue with repeated facet query requests on the date field. I forgot to mention that I am running a 64-bit IBM 1.5 JVM. I also tried the Sun 1.6 JVM with and without your GC arguments; the GC pattern is different, but the heap size does not drop as the test goes on. I tested with a single thread from JMeter just to make sure there is ample room for GC to clean house. JMeter fires requests one after another without pause, but I assume that should not affect GC. It is clear to me that the date facet queries have a major impact here, as I can run the load test with facets on other fields with no problem (the JVM heap size stabilizes at a certain level over time).
Re: Date faceting and memory leaks
Chris, Thanks for the detailed response. No, I am not using date faceting but facet queries for the facet display. Here is the full configuration of my "dismax" query handler:

dismax
explicit
0.01
title text^0.5 domain^0.1 nature^0.1 author
title text
recip(ms(NOW,last_modified),3.16e-11,1,1)
url,title,domain,nature,src,last_modified,text,sz
2<-1 5<-2 6<90%
100
*:*
on
title,text
0
3
text
400
regex
{!ex=src}src
{!ex=domain}domain
{!ex=nature}nature
{!ex=last_modified}last_modified:[NOW-30DAY TO *]
{!ex=last_modified}last_modified:[NOW-90DAY TO NOW-30DAY]
{!ex=last_modified}last_modified:[NOW-180DAY TO NOW-90DAY]
{!ex=last_modified}last_modified:[NOW-365DAY TO NOW-180DAY]
{!ex=last_modified}last_modified:[NOW-730DAY TO NOW-365DAY]
{!ex=last_modified}last_modified:[* TO NOW-730DAY]

Cache settings:

I am monitoring the Solr JVM heap memory usage via remote JConsole; the image below shows how the heap size keeps increasing as more facet query requests are sent to Solr via JMeter: http://n3.nabble.com/file/n825038/memory-1.jpg

The following is the request URL pattern:

select?rows=0&facet=true&facet.mincount=1&facet.method=enum&q=${query}&qt=dismax

where ${query} is selected randomly from a list of 100 query terms.

The date rounding suggestion is a very good one; I will rerun the test and report back on the cache settings. I remember my filterCache hit ratio is around 0.7. I do use the tagged results for multi-select display of facet values, but in this case there is no fq in the load test request URL. Thanks again; I will report back on the re-run with date rounding.
Re: Date faceting and memory leaks
Chris, I just completed the re-run, and your date rounding tip saved my day. I now realize that "NOW" as a timestamp is a very bad idea for query caching, as its value is never the same twice. NOW/DAY at least makes a set of facet queries reusable from the cache for a period of time. It turns out you can help with your insight given just the little fraction of information provided. Thanks again! -Yao
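P.S. for later readers: the rounded variant of the queries above would look roughly like this (a sketch assuming day granularity is acceptable for the UI; the key point is that every value is stable for a whole day, so the cached filters can be reused):

  {!ex=last_modified}last_modified:[NOW/DAY-30DAYS TO *]
  {!ex=last_modified}last_modified:[NOW/DAY-90DAYS TO NOW/DAY-30DAYS]
  {!ex=last_modified}last_modified:[NOW/DAY-180DAYS TO NOW/DAY-90DAYS]
  {!ex=last_modified}last_modified:[NOW/DAY-365DAYS TO NOW/DAY-180DAYS]
  {!ex=last_modified}last_modified:[NOW/DAY-730DAYS TO NOW/DAY-365DAYS]
  {!ex=last_modified}last_modified:[* TO NOW/DAY-730DAYS]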
RE: Date faceting and memory leaks
Just to close the loop: I was fooling around with all the cache settings trying to figure out my problem, so the oversized filterCache was set as part of those experiments. It did not cause any memory issue in this case. After the date rounding adjustment, I re-ran the queries with 15 threads and 6,000 requests and got a throughput of 1,500 requests/minute while using only a little more than 0.5 GB of heap memory. The Solr admin statistics page shows the filterCache has a hitratio of 0.99, with 103,800 lookups and 103,773 hits, so I take it as 99%. Have a nice day. -Yao

From: Chris Hostetter-3 [via Lucene] [mailto:ml-node+825052-1711725506-201...@n3.nabble.com]
Sent: Monday, May 17, 2010 9:04 PM
To: Ge, Yao (Y.)
Subject: Re: Date faceting and memory leaks

: Cache settings:

that's a monster filterCache ... I can easily imagine it causing an OOM if your heap is only 5G.

: The date rounding suggest is a very good one, I will need to rerun the test
: and report back on the cache setting. I remember my filterCache hit ratio is
: around 0.7. I did use the tagged results for multi-select display of facet

a "hit ratio" of "0.7", or a "0.7% hit rate"? ... with that many unique facet queries, I can't imagine you were getting a 70% hit rate. I'm betting that if you monitor the filterCache size and hit rate as you run your test, you'll see it just grow and grow until the OOM, and if you analyze the heap dumps you'll probably see the cache hanging on to a ton of DocSets that will never be used again.

: values but in this case there is no fq in the load test request URL.

I've never tested this, so I can't say for sure, but if it turns out that the filterCache is not your problem, then perhaps there is something wonky with the filter-query exclusion code in cases like this -- where you explicitly exclude a tagged fq but that fq doesn't exist. The way to rule it out would be to remove the exclusion from your configs and test it that way to see if the behavior is the same.

-Hoss
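For anyone tuning this later: the filterCache is configured in solrconfig.xml; a modest, illustrative setting (not the poster's actual values, which weren't preserved) looks like:

  <filterCache class="solr.FastLRUCache"
               size="512"
               initialSize="512"
               autowarmCount="128"/>

With rounded date facet queries, a small number of entries is reused constantly, which is exactly why the hit ratio jumped to 0.99 above.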
Solr read-only core
Is there a way to open a Solr index/core in read-only mode?
RE: Solr read-only core
My motivation is more from the performance perspective than the functional perspective. I was hoping that by opening the Solr index/core read-only, the underlying Lucene IndexReader could be opened in read-only mode for optimum query performance (removing the overhead of multi-thread synchronization).
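For reference, at the Lucene level (2.9.x, which Solr 1.4 ships with) a read-only reader is requested with a boolean flag; a minimal sketch, with a hypothetical index path:

  import java.io.File;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.store.FSDirectory;

  public class ReadOnlyReaderDemo {
      public static void main(String[] args) throws Exception {
          // Path is illustrative; point it at a real index directory.
          FSDirectory dir = FSDirectory.open(new File("/path/to/solr/data/index"));
          // 'true' = read-only: skips write-lock handling and avoids
          // per-call synchronization in isDeleted(), which helps concurrent queries.
          IndexReader reader = IndexReader.open(dir, true);
          try {
              System.out.println("numDocs: " + reader.numDocs());
          } finally {
              reader.close();
          }
      }
  }

Whether Solr exposes a switch for this is exactly the open question here.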
[SolrCloud] shard hash ranges changed after restoring backup
Hi all, My team at work maintains a SolrCloud 5.3.2 cluster with multiple collections configured with sharding and replication. We recently backed up our Solr indexes using the built-in backup functionality. After the cluster was restored from the backup, we noticed that atomic updates of documents occasionally fail with the error message 'missing required field [...]'. The exceptions are thrown on a host on which the document to be updated is not stored. From this we deduce that there is a problem with finding the right host by the hash of the uniqueKey. Indeed, our investigation so far shows that for at least one collection in the new cluster, the shards now have different hash ranges assigned. We checked the hash ranges by querying /admin/collections?action=CLUSTERSTATUS. Find below the shard hash ranges of one collection that we debugged.

Old cluster:
  shard1_0  8000 - aaa9
  shard1_1       - d554
  shard2_0  d555 - fffe
  shard2_1       - 2aa9
  shard3_0  2aaa - 5554
  shard3_1       - 7fff

New cluster:
  shard1  8000 - aaa9
  shard2       - d554
  shard3  d555 -
  shard4  0    - 2aa9
  shard5  2aaa - 5554
  shard6       - 7fff

Note that the shard names differ because the old cluster's shards were split. As you can see, the ranges of shard3 and shard4 differ from the old cluster. This change of hash ranges matches the symptoms we are currently experiencing. We found the JIRA ticket https://issues.apache.org/jira/browse/SOLR-5750, in which David Smiley comments: "shard hash ranges aren't restored; this error could be disastrous". It seems that this is what happened to us. We would like to hear some suggestions on how we could recover from this problem. Best, Gary
Re: [SolrCloud] shard hash ranges changed after restoring backup
Hi Erick, I should add that our Solr cluster is in production and new documents are constantly indexed. The new cluster has been up for three weeks now. The problem was discovered only now because in our use case atomic updates and real-time gets are mostly performed on new documents. With almost absolute certainty there are already documents in the index that were distributed to the shards according to the new hash ranges. If we just changed the hash ranges in ZooKeeper, the index would still be in an inconsistent state. Is there any way to recover from this without having to re-index all documents? Best, Gary

2016-06-15 19:23 GMT+02:00 Erick Erickson :
> Simplest, though a bit risky, is to manually edit the znode and
> correct the znode entry. There are various tools out there, including
> one that ships with Zookeeper (see the ZK documentation).
>
> Or you can use the zkcli scripts (the Zookeeper ones) to get the znode
> down to your local machine, edit it there and then push it back up to ZK.
>
> I'd do all this with my Solr nodes shut down, then ensure that my ZK
> ensemble was consistent after the update etc.
>
> Best,
> Erick
>
> On Wed, Jun 15, 2016 at 8:36 AM, Gary Yao wrote:
>> [...]
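For concreteness, a sketch of the fetch-edit-push cycle Erick describes, assuming Solr's bundled zkcli script and a ZooKeeper at localhost:2181 (paths are illustrative; stop the Solr nodes and back up the file first):

  # pull the cluster state down from ZooKeeper
  server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:2181 \
      -cmd getfile /clusterstate.json /tmp/clusterstate.json

  # ... hand-edit the "range" entries for the affected shards ...

  # push the corrected file back up
  server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:2181 \
      -cmd putfile /clusterstate.json /tmp/clusterstate.json

Note this only fixes the routing metadata; as Gary points out above, documents already indexed under the wrong ranges are a separate problem.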
SolrCloud result correctness compared with single core
Hi Guys, Since the main scoring mechanism is based on tf/idf, will the same query run against SolrCloud return different results than running it against a single core with the same data set, given that idf only counts document frequency inside one core?

E.g., assume I have 100GB of data:
A) Index the data using a single core
B) Index the data using SolrCloud with two cores (each holding a 50GB index)

If I then run the same query, like 'apple', against A and B, will I get different results?

Regards, Yandong
Re: SolrCloud result correctness compared with single core
Pretty helpful, thanks Erick!

2015-01-24 9:48 GMT+08:00 Erick Erickson :
> You might, but probably not enough to notice. At 50G, the tf/idf
> stats will _probably_ be close enough that you won't be able to tell.
>
> That said, distributed tf/idf has recently been implemented, but
> you need to ask for it; see SOLR-1632. This is Solr 5.0, though.
>
> I've rarely seen it matter except in fairly specialized situations.
> Consider a single core: deleted documents still count towards
> some of the tf/idf stats, so your scoring could theoretically
> change after, say, an optimize.
>
> The so-called "bottom line" is that yes, the scoring may change, but
> IMO not any more radically than was possible with single cores,
> and I wouldn't worry about it unless I had evidence that it was
> biting me.
>
> Best,
> Erick
>
> On Fri, Jan 23, 2015 at 2:52 PM, Yandong Yao wrote:
> > [...]
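For later readers: the opt-in distributed tf/idf from SOLR-1632 (Solr 5.0+) is enabled per core in solrconfig.xml; a minimal sketch:

  <!-- Use globally consistent doc-frequency stats across shards when scoring.
       Without this, each shard scores with its local df, as discussed above. -->
  <statsCache class="org.apache.solr.search.stats.ExactStatsCache"/>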
how to support "implicit trailing wildcards"
Hi everyone, How can I support an 'implicit trailing wildcard *' using Solr? E.g., as in Google: searching 'umoun' should match 'umount', and searching 'mounta' should match 'mountain'.

From my point of view, there are several ways, each with disadvantages:

1) Use EdgeNGramFilterFactory, so that 'umount' is indexed as 'u', 'um', 'umo', 'umou', 'umoun', 'umount'. The disadvantages are: a) the index size increases dramatically; b) it matches terms with no real relationship, e.g. 'mount' will also match 'mountain'. (An illustrative fieldType follows at the end of this message.)

2) Use two-pass searching: the first pass searches the term dictionary through TermsComponent using the given keyword, then the first matched term from the term dictionary is used to search again. E.g., when a user enters 'umoun', TermsComponent matches 'umount', and then 'umount' is used for the search. The disadvantages are: a) I need to parse the query string so as to recognize meta keywords such as 'AND', 'OR', '+', '-', '"' (this is made more complex because I am using the PHP client); b) the returned hit count is not for the original search string, which will influence other components such as an auto-suggest component based on user search history and hit counts.

3) Write a custom SearchComponent, though I have no idea where/how to start.

Is there any other way to do this in Solr? Any feedback/suggestions are welcome! Thanks very much in advance!
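For option 1, an illustrative fieldType (not from the original post) that applies edge n-grams at index time only, so query terms are matched as typed:

  <fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- 'umount' is indexed as u, um, umo, umou, umoun, umount -->
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="15" side="front"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

minGramSize/maxGramSize control the index-size blowup mentioned in disadvantage 1a; raising minGramSize to 2 or 3 trims the worst of it.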
Re: how to support "implicit trailing wildcards"
Hi Bastian, Sorry for not making it clear: I also want exact matches to score higher than wildcard matches. That means if I search 'mount', documents with 'mount' should score higher than documents with 'mountain', whereas 'mount*' seems to treat 'mount' and 'mountain' the same.

Besides, I also want the query to be processed by the analyzer, but per http://wiki.apache.org/lucene-java/LuceneFAQ#Are_Wildcard.2C_Prefix.2C_and_Fuzzy_queries_case_sensitive.3F, wildcard, prefix, and fuzzy queries are not passed through the analyzer. The rationale is that if I search 'mounted', I also want documents with 'mount' to match.

So it seems the built-in wildcard search cannot satisfy my requirements, if I understand correctly. Thanks very much!

2010/8/9 Bastian Spitzer
> Wildcard-Search is already built in, just use:
>
> ?q=umoun*
> ?q=mounta*
>
> -Original Message-
> From: yandong yao [mailto:yydz...@gmail.com]
> Sent: Monday, 9. August 2010 15:57
> To: solr-user@lucene.apache.org
> Subject: how to support "implicit trailing wildcards"
>
> [...]
Re: how to support "implicit trailing wildcards"
Hi Jan, It seems q=mount OR mount* produces a different sort order than q=mount for documents containing 'mount'. I changed it to q=mount^100 OR (mount?* -mount)^1.0, and it tests well. Thanks very much!

2010/8/10 Jan Høydahl / Cominvent
> Hi,
>
> You don't need to duplicate the content into two fields to achieve this.
> Try this:
>
> q=mount OR mount*
>
> The exact match will always get a higher score than the wildcard match
> because wildcard matches use "constant score".
>
> Making this work for multi-term queries is a bit trickier, but something
> along these lines:
>
> q=(mount OR mount*) AND (everest OR everest*)
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Training in Europe - www.solrtraining.com
>
> On 10. aug. 2010, at 09.38, Geert-Jan Brits wrote:
>
> > You could satisfy this by making 2 fields:
> > 1. exactmatch
> > 2. wildcardmatch
> >
> > Use copyField in your schema to copy 1 --> 2.
> >
> > q=exactmatch:mount+wildcardmatch:mount*&q.op=OR
> > This would score exact matches above (solely) wildcard matches.
> >
> > Geert-Jan
> >
> > 2010/8/10 yandong yao
> >> [...]
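For completeness, a sketch of Geert-Jan's two-field variant (the field names are his examples; the type name is a placeholder):

  <field name="exactmatch" type="text" indexed="true" stored="false"/>
  <field name="wildcardmatch" type="text" indexed="true" stored="false"/>
  <copyField source="exactmatch" dest="wildcardmatch"/>

with queries of the form q=exactmatch:mount+wildcardmatch:mount*&q.op=OR, so a document matching the exact term collects score contributions from both fields, while a wildcard-only match scores from just one.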
A question on WordDelimiterFilterFactory
Hi Guys, I encountered a problem when enabling WordDelimiterFilterFactory for both index and query (the relevant part of schema.xml is pasted at the bottom of this email).

1. Steps to reproduce:
1.1 The indexed sample document contains only one sentence: "This is a TechNote."
1.2 The query is: q=TechNote
1.3 Result: no matches are returned, although the above sentence clearly contains the word 'TechNote'.

2. Output when enabling debugQuery
Turning on debugQuery (http://localhost:7111/solr/test/select?indent=on&version=2.2&q=TechNote&fq=&start=0&rows=0&fl=*%2Cscore&qt=standard&wt=standard&debugQuery=on&explainOther=id%3A001&hl.fl=) gives the following information:

  TechNote
  TechNote
  PhraseQuery(all:"tech note")
  all:"tech note"
  id:001
  0.0 = fieldWeight(all:"tech note" in 0), product of:
    0.0 = tf(phraseFreq=0.0)
    0.61370564 = idf(all: tech=1 note=1)
    0.25 = fieldNorm(field=all, doc=0)

It seems the raw query string is converted to the phrase query "tech note", whose term frequency is 0, so there are no matches.

3. Result from the admin/analysis.jsp page
From analysis.jsp, the query 'TechNote' appears to match the input document (matching terms are the ones highlighted on the page):

Index Analyzer
  WhitespaceTokenizerFactory {}:
    pos 1: This [0,4]   pos 2: is [5,7]   pos 3: a [8,9]   pos 4: TechNote. [10,19]
  SynonymFilterFactory {synonyms=synonyms.txt, expand=true, ignoreCase=true}:
    pos 1: This [0,4]   pos 2: is [5,7]   pos 3: a [8,9]   pos 4: TechNote. [10,19]
  WordDelimiterFilterFactory {splitOnCaseChange=1, generateNumberParts=1, catenateWords=1, generateWordParts=1, catenateAll=0, catenateNumbers=1}:
    pos 1: This [0,4]   pos 2: is [5,7]   pos 3: a [8,9]   pos 4: Tech [10,14]   pos 5: Note [14,18], TechNote [10,18]
  LowerCaseFilterFactory {}:
    pos 1: this   pos 2: is   pos 3: a   pos 4: tech   pos 5: note, technote
  SnowballPorterFilterFactory {protected=protwords.txt, language=English}:
    pos 1: this   pos 2: is   pos 3: a   pos 4: tech   pos 5: note, technot

Query Analyzer
  WhitespaceTokenizerFactory {}:
    pos 1: TechNote [0,8]
  SynonymFilterFactory {synonyms=synonyms.txt, expand=true, ignoreCase=true}:
    pos 1: TechNote [0,8]
  WordDelimiterFilterFactory {splitOnCaseChange=1, generateNumberParts=1, catenateWords=0, generateWordParts=1, catenateAll=0, catenateNumbers=0}:
    pos 1: Tech [0,4]   pos 2: Note [4,8]
  LowerCaseFilterFactory {}:
    pos 1: tech   pos 2: note
  SnowballPorterFilterFactory {protected=protwords.txt, language=English}:
    pos 1: tech   pos 2: note

4. My questions are:
4.1: Why do debugQuery and analysis.jsp have different results?
4.2: From my understanding, during indexing the word 'TechNote' is converted to 1) 'technote' and 2) 'tech note' according to my config in schema.xml, and at query time 'TechNote' is converted to 'tech note', so it SHOULD match. Am I right?
4.3: Why is the frequency of the phrase 'tech note' 0 in the debugQuery output (0.0 = tf(phraseFreq=0.0))?

Any suggestions/comments are absolutely welcome!

5. fieldType definition in schema.xml

Thanks very much!
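A sketch of the fieldType in question, reconstructed from the filter names and parameters visible in the analysis.jsp output above (ordering and attributes inferred, not guaranteed verbatim):

  <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
      <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
      <!-- query side differs: catenateWords=0, catenateNumbers=0 -->
      <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    </analyzer>
  </fieldType>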
Re: A question on WordDelimiterFilterFactory
Hi Robert, I am using Solr 1.4, will try with 1.4.1 tomorrow. Thanks very much!

Regards,
Yandong Yao

2010/9/14 Robert Muir
> did you index with solr 1.4 (or are you using solr 1.4)?
>
> at a quick glance, it looks like it might be this:
> https://issues.apache.org/jira/browse/SOLR-1852 , which was fixed in 1.4.1
>
> On Tue, Sep 14, 2010 at 5:40 AM, yandong yao wrote:
> > [...]
Re: A question on WordDelimiterFilterFactory
After upgrading to 1.4.1, it is fixed. Thanks very much for your help!

Regards,
Yandong Yao

2010/9/14 yandong yao
> Hi Robert,
>
> I am using Solr 1.4, will try with 1.4.1 tomorrow.
>
> Thanks very much!
>
> Regards,
> Yandong Yao
>
> 2010/9/14 Robert Muir
>> did you index with solr 1.4 (or are you using solr 1.4)?
>>
>> at a quick glance, it looks like it might be this:
>> https://issues.apache.org/jira/browse/SOLR-1852 , which was fixed in 1.4.1
>>
>> [...]
Re: Need help for solr searching case insensitive item
Sounds like a WordDelimiterFilter config issue; please refer to http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory. Also, it will help if you can provide: 1) the tokenizer/filter config in your schema file, and 2) the analysis.jsp output from the admin page. (A minimal example follows below the quoted message.)

2010/10/26 wu liu
> Hi all,
>
> I just noticed a weird thing happening in my Solr search results:
> if I do a search for "ecommons", it does not find "eCommons"; instead,
> if I do a search for "eCommons", I only get the matches for "eCommons",
> but not "ecommons".
>
> I cannot figure out why.
>
> please help me
>
> Thanks very much in advance
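For reference, a minimal sketch of a case-insensitive analysis chain (illustrative, not the poster's actual schema); the essential point is that LowerCaseFilterFactory runs at both index and query time, so 'ecommons' and 'eCommons' normalize to the same term:

  <fieldType name="text_ci" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- applied at index AND query time because only one <analyzer> is declared -->
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

If WordDelimiterFilter is also in the chain, its splitOnCaseChange/catenate options must be consistent between index and query, which is why the analysis.jsp output is the first thing to check.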
How to run many MoreLikeThis request efficiently?
Hi Solr gurus, I have two sets of documents in one SolrCore; each set has about 1M documents, with different document types, say 'type1' and 'type2'. Many documents in the first set are very similar to 1 or 2 documents in the second set. What I want is: for each document in set 2, return the most similar document in set 1, using either 'MoreLikeThisHandler' or 'MoreLikeThisComponent'.

Currently I use the following approach to get the result, but it sends far too many requests to the Solr server serially. Is there any way to improve this besides multi-threading? Thanks very much!

  for each document in set 2 whose type is 'type2'
      run a MoreLikeThis request against the Solr server and get the most similar document
  end

Regards, Yandong
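For concreteness, a sketch of the per-document request with SolrJ (field names 'id', 'type' and the similarity field 'content' are assumptions for illustration, as is the /mlt handler being registered in solrconfig.xml):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.SolrServerException;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;
  import org.apache.solr.common.SolrDocumentList;

  public class MltLookup {
      private final HttpSolrServer server =
              new HttpSolrServer("http://localhost:8983/solr/collection1");

      public SolrDocumentList mostSimilar(String docId) throws SolrServerException {
          SolrQuery q = new SolrQuery();
          q.setRequestHandler("/mlt");            // MoreLikeThisHandler
          q.setQuery("id:" + docId);              // seed: one 'type2' document
          q.set("mlt.fl", "content");             // field(s) to compute similarity on (assumed)
          q.setFilterQueries("type:type1");       // restrict candidates to set 1
          q.setRows(1);                           // only the single best match
          QueryResponse rsp = server.query(q);
          return rsp.getResults();
      }
  }

This is one HTTP round-trip per type2 document, which is exactly the serial cost being asked about.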
Re: How to run many MoreLikeThis request efficiently?
Any comments on this? Thanks very much in advance!

2013/1/9 Yandong Yao
> [...]
Re: How to run many MoreLikeThis request efficiently?
Hi Otis, I really appreciate your help on this! I will go with multi-threading first, and then provide a custom component if performance is not good enough.

Regards, Yandong

2013/1/10 Otis Gospodnetic
> Patience, young Yandong :)
>
> Multi-threading *in your application* is the way to go. Alternatively, one
> could write a custom SearchComponent that is called once and inside of
> which the whole work is done after just one call to it. This component
> could then write the output somewhere, like in a new index, since making a
> blocking call to it may time out.
>
> Otis
> Solr & ElasticSearch Support
> http://sematext.com/
> On Jan 9, 2013 6:07 PM, "Yandong Yao" wrote:
> > [...]
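A minimal sketch of the multi-threaded client-side approach (the thread count and the MltLookup helper from the earlier sketch are illustrative assumptions):

  import java.util.List;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.TimeUnit;

  public class ParallelMlt {
      public static void run(final MltLookup lookup, List<String> type2Ids)
              throws InterruptedException {
          ExecutorService pool = Executors.newFixedThreadPool(8); // tune to server capacity
          for (final String docId : type2Ids) {
              pool.submit(new Runnable() {
                  public void run() {
                      try {
                          // one MLT round-trip per document, now issued in parallel
                          lookup.mostSimilar(docId);
                          // ... store/aggregate the result here ...
                      } catch (Exception e) {
                          e.printStackTrace();
                      }
                  }
              });
          }
          pool.shutdown();
          pool.awaitTermination(1, TimeUnit.HOURS);
      }
  }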
Re: Index optimize takes more than 40 minutes for 18M documents
Thanks Walter for the info; we will disable optimize then and do more testing.

Regards, Yandong

2013/2/22 Walter Underwood
> That seems fairly fast. We index about 3 million documents in about half
> that time. We are probably limited by the time it takes to get the data
> from MySQL.
>
> Don't optimize. Solr automatically merges index segments as needed.
> Optimize forces a full merge. You'll probably never notice the difference,
> either in disk space or speed.
>
> It might make sense to force merge (optimize) if you reindex everything
> once per day and have no updates in between. But even then it may be a
> waste of time.
>
> You need lots of free disk space for merging, whether a forced merge or
> automatic. Free space equal to the size of the index is usually enough, but
> the worst case can need double the size of the index.
>
> wunder
>
> On Feb 21, 2013, at 9:20 AM, Yandong Yao wrote:
>
> > Hi Guys,
> >
> > I am using Solr 4.1 and have indexed 18M documents using the solrj
> > ConcurrentUpdateSolrServer (each document contains 5 fields, and the
> > average length is less than 1k).
> >
> > 1) It takes 70 minutes to index those documents without optimize on my
> > Mac (10.8). How is that performance: slow, fast or average?
> >
> > 2) It takes about 40 minutes to optimize those documents. Below is the
> > top output; there are lots of FAULTS -- what does this mean?
> >
> > Processes: 118 total, 2 running, 8 stuck, 108 sleeping, 719 threads    00:56:52
> > Load Avg: 1.48, 1.56, 1.73  CPU usage: 6.63% user, 6.40% sys, 86.95% idle
> > SharedLibs: 31M resident, 0B data, 6712K linkedit.
> > MemRegions: 34734 total, 5801M resident, 39M private, 638M shared.
> > PhysMem: 982M wired, 3600M active, 3567M inactive, 8150M used, 38M free.
> > VM: 254G vsize, 1285M framework vsize, 1469887(368) pageins, 1095550(0) pageouts.
> > Networks: packets: 14842595/9661M in, 14777685/9395M out.
> > Disks: 820048/43G read, 523814/53G written.
> >
> > PID COMMAND %CPU TIME #TH #WQ #POR #MRE RPRVT RSHRD RSIZE VPRVT VSIZE PGRP PPID STATE UID FAULTS COW MSGSENT MSGRECV SYSBSD SYSMACH
> > 4585 java 11.7 02:52:01 32 1483 342 3866M+ 6724K 3856M+ 4246M 6908M 4580 4580 sleepin 501 1490340+ 402 3000781+ 231785+ 15044055+ 10033109+
> >
> > 3) If I don't run optimize, what is the impact? Bigger disk size or
> > slower query performance?
> >
> > Following is my index config in solrconfig.xml:
> >
> > 100 10 10 30 false
> >
> > Thanks very much in advance!
> >
> > Regards,
> > Yandong
How to use nested query in fq?
Hi Guys, I am using Solr 3.5 and would like to use an fq along the lines of getField(getDoc(uuid:workspace_${workspaceId}), "isPublic"):true, where:

- workspace_${workspaceId}: workspaceId is an indexed field.
- getDoc(uuid:workspace_${workspaceId}): returns the document whose uuid is "workspace_${workspaceId}".
- getField(getDoc(uuid:workspace_${workspaceId}), "isPublic"): returns the matched document's isPublic field.

The use case is that I have workspace objects, and a workspace contains many sub-objects, such as work files, comments, datasets and so on. A workspace has an 'isPublic' field: if this field is true, then all registered users can access this workspace and all its sub-objects; otherwise, only workspace members can. So I want to use an fq to determine whether the document in question belongs to a public workspace or not.

Is this possible? If not, how would I implement a feature like this -- a ValueSourcePlugin? Any guidance or example on this? Or is there a better solution? It is possible to add an 'isPublic' field to all sub-objects, but that makes index updates more complex, so I am trying to find a better solution.

Thanks very much in advance!

Regards, Yandong
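For later readers: Solr 4.0 and later ship a join query parser that expresses this directly; a hedged sketch assuming each sub-object stores its parent workspace's id in a field called workspaceUuid (a hypothetical name):

  fq={!join from=uuid to=workspaceUuid}isPublic:true

This runs isPublic:true against the workspace documents, collects their uuid values, and keeps only sub-objects whose workspaceUuid matches one of them. On Solr 3.5 itself, the practical options remain the denormalized isPublic field or a custom component, as discussed above.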
Re: Faster Solr Indexing
I have similar issues using DIH, and org.apache.solr.update.DirectUpdateHandler2.addDoc(AddUpdateCommand) consumes most of the time when indexing 10K rows (each row is about 70K):

- DIH nextRow takes about 10 seconds in total
- If the index uses a whitespace tokenizer and lower-case filter, the addDoc() method takes about 80 seconds
- If the index uses a whitespace tokenizer, lower-case filter and WDF, addDoc takes about 112 seconds
- If the index uses a whitespace tokenizer, lower-case filter, WDF and a Porter stemmer, addDoc takes about 145 seconds

We have more than a million rows in total, and I am wondering whether I am doing something wrong, or whether there is any way to improve the performance of addDoc(). Thanks very much in advance!

The configuration is:
1) JVM: -Xms256M -Xmx1048M -XX:MaxPermSize=512m
2) Solr version 3.5
3) solrconfig.xml (almost copied from Solr's example/solr directory; see the note after the quoted message): false 10 64 2147483647 1000 1 native

2012/3/11 Peyman Faratin
> Hi
>
> I am trying to index 12MM docs faster than is currently happening in Solr
> (using solrj). We have identified solr's add method as the bottleneck (and
> not commit - which is tuned ok through mergeFactor and maxRamBufferSize and
> jvm ram).
>
> Adding 1000 docs is taking approximately 25 seconds. We are making sure we
> add and commit in batches. And we've tried both CommonsHttpSolrServer and
> EmbeddedSolrServer (assuming removing http overhead would speed things up
> with embedding) but the difference is marginal.
>
> The docs being indexed are on average 20 fields long, mostly indexed but
> none stored. The major size contributors are two fields:
>
>    - content, and
>    - shingledContent (populated using copyField of content).
>
> The length of the content field is (likely) Gaussian distributed (a few
> large docs of 50-80K tokens, but the majority around 2K tokens). We use
> shingledContent to support phrase queries and content for unigram queries
> (following the advice of the Solr Enterprise Search Server book, p. 305,
> section "The Solution: Shingling").
>
> Clearly the size of the docs is a contributor to the slow adds (confirmed
> by removing these 2 fields, which halves the indexing time). We've tried
> compressed=true also but that is not working.
>
> Any guidance on how to support our application logic (without having to
> change the schema too much) and speed up the indexing (from the current 212
> days for 12MM docs) would be much appreciated.
>
> thank you
>
> Peyman
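For context, the bare values in item 3 above most plausibly line up with a stock 3.5 indexDefaults section along these lines (a guess: the element names are inferred from typical 3.5 configs, and the stray '1' is left unmapped):

  <indexDefaults>
    <useCompoundFile>false</useCompoundFile>
    <mergeFactor>10</mergeFactor>
    <ramBufferSizeMB>64</ramBufferSizeMB>
    <maxFieldLength>2147483647</maxFieldLength>
    <writeLockTimeout>1000</writeLockTimeout>
    <lockType>native</lockType>
  </indexDefaults>

If that reading is right, raising ramBufferSizeMB (e.g. to 256) is the usual first knob for slow addDoc() runs, since it reduces how often in-memory segments are flushed to disk.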
SolrCloud: how to index documents into a specific core and how to search against that core?
Hi Guys, I use the following commands to start Solr Cloud according to the Solr Cloud wiki:

  yydzero:example bjcoe$ java -Dbootstrap_confdir=./solr/conf -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar
  yydzero:example2 bjcoe$ java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar

Then I created several cores using the CoreAdmin API (http://localhost:8983/solr/admin/cores?action=CREATE&name=<core name>&collection=collection1), and clusterstate.json shows the following topology:

collection1:
  -- shard1:
     -- collection1
     -- CoreForCustomer1
     -- CoreForCustomer3
     -- CoreForCustomer5
  -- shard2:
     -- collection1
     -- CoreForCustomer2
     -- CoreForCustomer4

1) Index: I use the following command to index the mem.xml file in the exampledocs directory:

  yydzero:exampledocs bjcoe$ java -Durl=http://localhost:8983/solr/coreForCustomer3/update -jar post.jar mem.xml
  SimplePostTool: version 1.4
  SimplePostTool: POSTing files to http://localhost:8983/solr/coreForCustomer3/update..
  SimplePostTool: POSTing file mem.xml
  SimplePostTool: COMMITting Solr index changes.

And now the Solr admin UI shows that 'coreForCustomer1', 'coreForCustomer3' and 'coreForCustomer5' have 3 documents each (mem.xml has 3 documents) and the other 2 cores have 0 documents.

*Question 1:* Is this expected behavior? How do I index documents into a specific core?

*Question 2:* If SolrCloud doesn't support this yet, how could I extend it to support this feature (indexing documents into a particular core)? Where should I start -- the hashing algorithm?

*Question 3:* Why are the documents also indexed into 'coreForCustomer1' and 'coreForCustomer5'? The default replica count for documents is 1, right?

Then I tried to index some documents into 'coreForCustomer2':

  $ java -Durl=http://localhost:8983/solr/coreForCustomer2/update -jar post.jar ipod_video.xml

'coreForCustomer2' still has 0 documents, and the documents in ipod_video.xml are indexed to the cores for customers 1/3/5.

*Question 4:* Why does this happen?

2) Search: I use "http://localhost:8983/solr/coreForCustomer2/select?q=*%3A*&wt=xml" to search against 'coreForCustomer2', and it returns all documents in the whole collection even though this core has no documents at all.

Then I use "http://localhost:8983/solr/coreForCustomer2/select?q=*%3A*&wt=xml&shards=localhost:8983/solr/coreForCustomer2", and it returns 0 documents.

*Question 5:* So if I want to search against a particular core, I need to use the 'shards' parameter with the SolrCore name as the parameter value, right?

Thanks very much in advance!

Regards, Yandong
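A side note for anyone debugging the same thing: a single core can also be queried in isolation with distrib=false, which turns off the distributed search logic for that request (an illustrative request, not a recommended production pattern):

  http://localhost:8983/solr/coreForCustomer2/select?q=*:*&distrib=false

This explains the two observations above: without it (or the shards parameter), the request fans out across the whole collection; with it, only the named core's own index is searched.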
Re: SolrCloud: how to index documents into a specific core and how to search against that core?
Hi Darren, Thanks very much for your reply. The reason I want to control core indexing/searching is that I want to use one core to store one customer's data (all customers share the same config): e.g. customer 1 uses coreForCustomer1 and customer 2 uses coreForCustomer2.

Is there any better way than using a different core for each customer? Another way might be to use a different collection for each customer, though I am not sure how many collections Solr Cloud can support. Which way is better in terms of flexibility/scalability (supposing there are tens of thousands of customers)?

Regards, Yandong

2012/5/22 Darren Govoni
> Why do you want to control what gets indexed into a core and then
> know which core to search? That's the kind of "knowing" that SolrCloud
> solves. In SolrCloud, it handles the distribution of documents across
> shards and retrieves them regardless of which node is searched from.
> That is the point of "cloud": you don't know the details of where
> exactly documents are being managed (i.e. they are cloudy). It can
> change and re-balance from time to time. SolrCloud performs the
> distributed search for you; therefore, when you search a node/core
> with no documents, all the results from the "cloud" are retrieved
> regardless. This is considered "A Good Thing".
>
> It requires a change in thinking about indexing and searching.
>
> On Tue, 2012-05-22 at 08:43 +0800, Yandong Yao wrote:
> > [...]
Re: SolrCloud: how to index documents into a specific core and how to search against that core?
Hi Mark, Darren, Thanks very much for your help; I will try a collection for each customer then.

Regards, Yandong

2012/5/22 Mark Miller
> I think the key is this: you want to think of a SolrCore on a single-node
> Solr installation as a collection on a multi-node SolrCloud installation.
>
> So if you would use multiple SolrCores with a standard Solr setup, you should
> be using multiple collections in SolrCloud. If you were going to try to do
> everything in one SolrCore, that would be like putting everything in one
> collection in SolrCloud. I don't think it generally makes sense to try to
> work at the SolrCore level when working with SolrCloud. This will be made
> clearer once we add a simple collections API.
>
> So I think your choice should be similar to using a single node: do you
> want to put everything in one 'collection' and use a filter to separate
> customers (with all its caveats and limitations), or do you want to use a
> collection per customer? You can always start up more clusters if you reach
> any limits.
>
> On May 22, 2012, at 10:08 AM, Darren Govoni wrote:
>
> > I'm curious what the solrcloud experts say, but my suggestion is to try
> > not to over-engineer the search architecture on solrcloud. For example,
> > what is the benefit of managing which cores are indexed and searched?
> > Having to know those details, in my mind, works against the automation in
> > solrcloud, but maybe there's a good reason you want to do it this way.
> >
> > --- Original Message ---
> > On 5/22/2012 07:35 AM Yandong Yao wrote:
> > > [...]
Count is inconsistent between facet and stats
Hi Guys, Steps to reproduce:
1) Download apache-solr-4.0.0-ALPHA
2) cd example; java -jar start.jar
3) cd exampledocs; ./post.sh *.xml
4) Use the StatsComponent to get stats for the field 'popularity' faceted by 'cat'; the 'count' for 'electronics' is 3:

http://localhost:8983/solr/collection1/select?q=cat:electronics&wt=json&rows=0&stats=true&stats.field=popularity&stats.facet=cat

{
  "stats_fields": {
    "popularity": {
      "min": 0, "max": 10, "count": 14, "missing": 0,
      "sum": 75, "sumOfSquares": 503,
      "mean": 5.357142857142857, "stddev": 2.7902892835178013,
      "facets": {
        "cat": {
          "music":         { "min": 10, "max": 10, "count": 1, "missing": 0, "sum": 10, "sumOfSquares": 100, "mean": 10, "stddev": 0 },
          "monitor":       { "min": 6, "max": 6, "count": 2, "missing": 0, "sum": 12, "sumOfSquares": 72, "mean": 6, "stddev": 0 },
          "hard drive":    { "min": 6, "max": 6, "count": 2, "missing": 0, "sum": 12, "sumOfSquares": 72, "mean": 6, "stddev": 0 },
          "scanner":       { "min": 6, "max": 6, "count": 1, "missing": 0, "sum": 6, "sumOfSquares": 36, "mean": 6, "stddev": 0 },
          "memory":        { "min": 0, "max": 7, "count": 3, "missing": 0, "sum": 12, "sumOfSquares": 74, "mean": 4, "stddev": 3.605551275463989 },
          "graphics card": { "min": 7, "max": 7, "count": 2, "missing": 0, "sum": 14, "sumOfSquares": 98, "mean": 7, "stddev": 0 },
          "electronics":   { "min": 1, "max": 7, "count": 3, "missing": 0, "sum": 9, "sumOfSquares": 51, "mean": 3, "stddev": 3.4641016151377544 }
        }
      }
    }
  }
}

5) Facet on 'cat', and the count for 'electronics' is 14:

http://localhost:8983/solr/collection1/select?q=cat:electronics&wt=json&rows=0&facet=true&facet.field=cat

{
  "cat": [
    "electronics", 14,
    "memory", 3,
    "connector", 2,
    "graphics card", 2,
    "hard drive", 2,
    "monitor", 2,
    "camera", 1,
    "copier", 1,
    "multifunction printer", 1,
    "music", 1,
    "printer", 1,
    "scanner", 1,
    "currency", 0,
    "search", 0,
    "software", 0
  ]
}

So the StatsComponent reports a count of 3 for the 'electronics' cat, while the FacetComponent reports 14. Is this a bug?

Following is the field definition for 'cat'.

Thanks, Yandong
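For reference, in the stock 4.0.0-ALPHA example schema the 'cat' field is a multivalued string (worth double-checking against your local copy):

  <field name="cat" type="string" indexed="true" stored="true" multiValued="true"/>

The multiValued="true" part matters when comparing the two components' numbers, since a single document can contribute to several 'cat' buckets.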
mergeindex: what happens if there is deletion during index merging
Hi guys,

From http://wiki.apache.org/solr/MergingSolrIndexes, it says 'Using "srcCore", care is taken to ensure that the merged index is not corrupted even if writes are happening in parallel on the source index'.

What does this mean? If there are deletion requests during merging, will the deletions be processed correctly after merging finishes?

1) e.g.: I have an existing core 'core0', and I want to merge cores 'core1' and 'core2' into 'core0', so I will use http://localhost:8983/solr/admin/cores?action=mergeindexes&core=core0&srcCore=core1&srcCore=core2 . While the merge is happening, core0, core1 and core2 receive deletion requests to delete some old documents. Will the final core 'core0' contain all content from 'core1' and 'core2', with all documents matching the deletion criteria deleted?

2) And if core0, core1 and core2 are processing deletion requests when a core merge request comes in, what will happen? Will the merge request block until the deletions finish on all cores?

Thanks very much in advance!

Regards, Yandong
Re: mergeindex: what happens if there is deletion during index merging
Hi Shalin, Thanks very much for your detailed explanation! Regards, Yandong 2012/8/21 Shalin Shekhar Mangar > On Tue, Aug 21, 2012 at 8:47 AM, Yandong Yao wrote: > > > Hi guys, > > > > From http://wiki.apache.org/solr/MergingSolrIndexes, it said 'Using > > "srcCore", care is taken to ensure that the merged index is not corrupted > > even if writes are happening in parallel on the source index'. > > > > What does it means? If there are deletion request during merging, will > this > > deletion be processed correctly after merging finished? > > > > Solr keeps an instance of the IndexReader for each srcCore which is a > static snapshot of the index at the time of the merge request. This static > snapshot is merged to the target core. Therefore any insert/delete request > made to the srcCores after the merge request will not affect the merged > index. > > > > > > 1) > > eg: I have an existing core 'core0', and I want to merge core 'core1' > and > > 'core2' to core 'core0', so I will use > > > > > http://localhost:8983/solr/admin/cores?action=mergeindexes&core=core0&srcCore=core1&srcCore=core2 > > , > > > > During the merging happens, core0, core1, core2 have received deletion > > request to delete some old documents, will the final core 'core0' > contains > > all content from 'core1' and 'core2' and also all documents matches > > deletion criteria has been deleted? > > > > The final core0 will not have documents deleted by requests made on core0. > However, documents deleted on core1 and core2 will still be in core0 if the > merge started before those requests were made. > > > > > > 2) > > And if core0, core1, and core2 are processing deletion request, at the > same > > time core merge request comes in, what will happen then? Will merge > request > > block until deletion finished on all cores? > > > > I believe core0 will continue to process deletion requests concurrently > with the merge. As for core1 and core2, since a merge reserves their > IndexReader, the answer depends on when a commit happens on core1 and > core2. If, for example, 2 deletions were made on core1 and then a commit > was issued (or autoCommit happened) and then the merge was triggered then > the final core0 will not have those documents but it may still have docs > deleted after the commit. > > > > > > Thanks very much in advance! > > > > Regards, > > Yandong > > > > > > -- > Regards, > Shalin Shekhar Mangar. >
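Building on the explanation that the merge works off a static snapshot of each srcCore, one way to make the outcome deterministic is to pause updates and commit on the source cores immediately before the merge, so the snapshot already contains every deletion. A sketch using the core names from this thread:

  # flush pending deletes on the source cores first
  http://localhost:8983/solr/core1/update?commit=true
  http://localhost:8983/solr/core2/update?commit=true
  # then trigger the merge into core0
  http://localhost:8983/solr/admin/cores?action=mergeindexes&core=core0&srcCore=core1&srcCore=core2
  # finally commit on core0 to make the merged documents visible
  http://localhost:8983/solr/core0/update?commit=true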
Re: Limiting facets for huge data - setting indexed=false in schema.xml
Having a large number of fields is not the same as having a large number of facets. Facets are something you would display to users as an aid for query refinement or navigation. There is no way for a user to use 3700 facets at the same time. So it is more a question of how to determine which facets to fetch at search time, based on the user's actions or on certain predefined configurations.

I have written an application with some 30 facetable fields on millions of records, and I also ran into the cost of calculating all facets, since the server is limited in the cache space and CPU cycles available for facet calculations. I then realized: why display all these facets regardless of whether the user wants to see them or not? I changed the approach to fetch only a minimum set of facets by default and to open the rest of the facet fields on demand (using AJAX). I was able to dramatically improve the response time by spreading the facet loading over time. There are still issues with the total facet cache size when you have a large number of available facets, but you need to realistically evaluate what a large number of facets actually means to a user. I don't think a typical user interface showing more than 10 filters at once is any more effective than one that starts with a small number of filters and progressively shows more on demand (hierarchical facets?).

Rahul R wrote:
>
> Hello,
> We are trying to get Solr to work for a really huge parts database. Details
> of the database
> - 55 million parts
> - Totally 3700 properties (facets). But each record will not have value for
> all properties.
> - Most of these facets are defined as dynamic fields within the Solr Index
>
> We were getting really unacceptable timing while doing faceting/searches on
> an index created with this database. With only one user using the system,
> query times are in excess of 1 minute. With more users concurrently using
> the system, the response times are further high.
>
> We thought that by limiting the number of properties that are available for
> faceting, the performance can be improved. To test this, we enabled only 6
> properties for faceting by setting indexed=true (in schema.xml) for only
> these properties. All other properties which are defined as dynamic
> properties had indexed=false. The observations after this change :
>
> - Index size reduced by a meagre 5 % only
> - Performance did not improve. Infact during PSR run we observed that it
> degraded.
>
> My questions:
> - Will reducing the number of facets improve faceting and search
> performance ?
> - Is there a better way to reduce the number of facets ?
> - Will having a large number of properties defined as dynamic fields, reduce
> performance ?
>
> Thank you.
>
> Regards
> Rahul

-- View this message in context: http://www.nabble.com/Limiting-facets-for-huge-data---setting-indexed%3Dfalse-in-schema.xml-tp24751763p24761778.html Sent from the Solr - User mailing list archive at Nabble.com.
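A sketch of the on-demand pattern described above, with hypothetical field names (category, brand, price): the initial search asks for only the default facets, and an AJAX follow-up (rows=0) fetches a facet field only when the user expands it:

  # initial request: documents plus the default facets
  /select?q=laptop&facet=true&facet.mincount=1&facet.field=category&facet.field=brand
  # user expands "price": facet-only request, no documents returned
  /select?q=laptop&rows=0&facet=true&facet.mincount=1&facet.field=price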
Re: Item Facet
Are your product_name* fields numeric fields (integer or float)? Dals wrote: > > Hi... > > Is there any way to group values like shopping.yahoo.com or > shopper.cnet.com do? > > For instance, I have documents like: > > doc1 - product_name1 - value1 > doc2 - product_name1 - value2 > doc3 - product_name1 - value3 > doc4 - product_name2 - value4 > doc5 - product_name2 - value5 > doc6 - product_name2 - value6 > > I'd like to have a result grouping by product name with the value > range per product. Something like: > > product_name1 - (value1 to value3) > product_name2 - (value4 to value6) > > It is not like the current facet because the information is grouped by > item, not the entire result. > > Any idea? > > Thanks! > > David Lojudice Sobrinho > > -- View this message in context: http://www.nabble.com/Item-Facet-tp24853669p24865535.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Google Side-By-Side UI
Yes. I think it would be a very helpful tool for tuning search relevance - you could run a controlled experiment with your target audience to understand their responses to parameter changes. We plan to use this feature to benchmark Lucene/Solr against our in-house commercial search engine - it will be an interesting test.

Lance Norskog-2 wrote:
>
> http://googleenterprise.blogspot.com/2009/08/compare-enterprise-search-relevance.html
>
> This is really cool, and a version for Solr would help in doing
> relevance experiments. We don't need the "select A or B" feature, just
> seeing search result sets side-by-side would be great.
>
> --
> Lance Norskog
> goks...@gmail.com

-- View this message in context: http://www.nabble.com/Google-Side-By-Side-UI-tp25719087p25719806.html Sent from the Solr - User mailing list archive at Nabble.com.
DIH - Export to XML
Is there a way to have the Data Import Handler dump data to a Solr feed-format XML file? -- View this message in context: http://old.nabble.com/DIH---Export-to-XML-tp26138213p26138213.html Sent from the Solr - User mailing list archive at Nabble.com.
encountered the "Cannot allocate memory" when calling snapshooter program after optimize command
Hi,

I configured Solr to listen for the postOptimize event and call the snapshooter program after an optimize command. It works well when the Java heap size is set to less than 4G. But if I increase the Java heap size to 5G, the snapshooter program can't be successfully called after the optimize command; the error message is:

SEVERE: java.io.IOException: Cannot run program "/home/solr_1.3/solr/bin/snapshooter" (in directory "/home/solr_1.3/solr/bin"): java.io.IOException: error=12, Cannot allocate memory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
at java.lang.Runtime.exec(Runtime.java:593)

Here is my server platform:
OS: CentOS 5.2 x86_64
Memory: 8G
Solr: 1.3

Any suggestion is appreciated.

Thanks, Justin
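A plausible explanation, though it is an assumption this thread never confirmed: Runtime.exec() forks the JVM before exec'ing snapshooter, and on Linux the fork must momentarily reserve address space equal to the parent process, so a 5G heap on an 8G machine can fail with error=12 even though snapshooter itself is tiny. If that is the cause, allowing memory overcommit is one common workaround:

  # sketch, only applicable if the fork/overcommit theory above is the cause
  sysctl -w vm.overcommit_memory=1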
Query Boost Functions
I have a field named "last-modified" that I would like to use in the bf (Boost Functions) parameter: recip(rord(last-modified),1,1000,1000) in the DisMaxRequestHandler. However the Solr query parser complains about the syntax of the formula. I think it is related to the hyphen in the field name. I have tried adding single and double quotes around the field name, but that didn't help. Can a field name contain a hyphen in boost functions? How do I do it? If not, where do I find the field-name special-character restrictions? -Yao -- View this message in context: http://www.nabble.com/Query-Boost-Functions-tp23595860p23595860.html Sent from the Solr - User mailing list archive at Nabble.com.
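As far as I know, the function-query parser treats '-' as the minus operator, so a hyphenated field name cannot be referenced in a boost function; the usual workaround is to reindex under an underscore name. A sketch under that assumption (the field type is illustrative):

  <!-- schema.xml: rename the field -->
  <field name="last_modified" type="date" indexed="true" stored="true"/>
  <!-- the dismax bf now parses cleanly -->
  <str name="bf">recip(rord(last_modified),1,1000,1000)</str>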
Re: Solr Shard - Strange results
Maybe you want to try with docNumber field type as "string" and see it would make a difference. CB-PO wrote: > > I'm not quite sure what logs you are talking about, but in the > tomcat/logs/catalina.out logs, i found the following [note, i can't > copy/paste, so i am typing up a summary]: > > I execute command: > localhost:8080/bravo/select?q=fred&rows=102&start=0&shards=localhost:8080/alpha,localhost:8080/bravo > > In this example, alpha has 27 instances of "fred", while bravo has 0. > > Then in the catalina.out: > > -There is the request for the command i sent, shards parameters and all. > it has the proper queryString. > -Then I see the two requests sent to the shards, apha and bravo. These > two requests weave between each other until they are finished: > INFO: REQUEST URI =/alpha/select > INFO: REQUEST URI =/bravo/select > The parameters have changed to: > > wt=javabin&fsv=true&version=2.2&f1=docNumber,score&q=fred&rows=102&isShard=true&start=0 > > -Then 2 INFO's scroll across: > INFO: [] webapp=/bravo path=/select > params={wt=javabin&fsv=true&version=2.2&f1=docNumber,score&q=fred&rows=102&isShard=true&start=0} > hits=0 status=0 QTime=1 > INFO: [] webapp=/alpha path=/select > params={wt=javabin&fsv=true&version=2.2&f1=docNumber,score&q=fred&rows=102&isShard=true&start=0} > hits=27 status=0 QTime=1 > **Note, hits=27 > > -Then i see some octet-streams being transferred, with status 200, so > those are OK. > > -The i see something peculiar: > It calls alpha with the following parameters: > wt=javabin&version=2.2&ids=ABC-1353,ABC-408,ABC-1355,ABC-1824,ABC-1354,FRED-ID-27,55&q=fred&rows=102¶meter=isShard=true&start=0 > > Performing this query on my own (without the wt=javabin) gives me > numFound=2, the result-set I get back from the overarching query. > Changing it to rows=10, it gives me numFound=2, and 2 's. This is > not the strange functionality I was seeing with the overarching query and > the mis-matched "numfound" and 's. > > This does beg the question.. why did it add: > "ids=ABC-1353,ABC-408,ABC-1355,ABC-1824,ABC-1354,FRED-ID-27,55" to the > query? They are the format that would be under docNumber, if that helps.. > Any thoughts? I will do some research on those particular ID numbered > docs, in the mean time. > > Here's the configuration information. I only posted the difference from > the default files in the solr/example/solr/conf > > [solrconfig.xml] > > ${solr.data.dir:/data/indices/bravo/solr/data > >class="org.apache.solr.handler.dataimport.DataImportHandler"> > >name="config">/data/indices/bravo/solr/conf/data-config.xml > > > > > > [schema.xml] > > >stored="true" /> >/> >/> >/> >/> >/> >/> >/> >/> >/> > > docNumber > column2 > > > > [data-config.xml] > >url="jdbc:metamatrix:b...@mms://hostname:port" user="username" > password="password"/> > > > > > > > > > > > > > > > > > > > > > Yonik Seeley-2 wrote: >> >> On Fri, May 15, 2009 at 4:11 PM, CB-PO wrote: >>> Yeah, the first thing I thought of was that perhaps there was something >>> wrong >>> with the uniqueKey and they were clashing between the indexes, however >>> upon >>> visual inspection of the data the field we are using as the unique key >>> in >>> each of the indexes is grossly different between the two databases, so >>> there >>> is no chance of them clashing. >> >> Yes, but is the same fieldname and FieldType used for both indexes? 
>> (that's sort of a requirement) >> >> You might also try looking at the logs for the exact requests that >> were sent to each shard as part of the distributed request, and >> manually sending those requests and inspecting the results. That >> should tell you if the shard requests or responses are weird, or if >> it's the top-level combining logic that's causing this. >> >> -Yonik >> http://www.lucidimagination.com >> >> > > -- View this message in context: http://www.nabble.com/Solr-Shard---Strange-results-tp23561201p23601624.html Sent from the Solr - User mailing list archive at Nabble.com.
DataImportHandler Template Transformer
It took me a while to understand that to use the TemplateTransformer (http://lucene.apache.org/solr/api/org/apache/solr/handler/dataimport/TemplateTransformer.html), none of the variables used to build the template (e.g. ${e.firstName}, ${e.lastName}, etc.) can contain null values. I hope the documentation can do a better job of explaining this. Also, it would be nice to simply pad null values with a blank string. Should this be considered an enhancement? -- View this message in context: http://www.nabble.com/DataImportHandler-Template-Transformer-tp23609267p23609267.html Sent from the Solr - User mailing list archive at Nabble.com.
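Until such an enhancement exists, one workaround is to blank out nulls in a ScriptTransformer that runs before the TemplateTransformer. A minimal sketch, with hypothetical column names taken from the example above (ScriptTransformer needs a JDK 6 runtime):

  <dataConfig>
    <script><![CDATA[
      function padNulls(row) {
        // replace null columns with empty strings so TemplateTransformer can proceed
        if (row.get('firstName') == null) row.put('firstName', '');
        if (row.get('lastName') == null) row.put('lastName', '');
        return row;
      }
    ]]></script>
    <document>
      <entity name="e" query="..." transformer="script:padNulls,TemplateTransformer">
        <field column="fullName" template="${e.firstName} ${e.lastName}"/>
      </entity>
    </document>
  </dataConfig>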
spell checking
Can someone help by providing a tutorial-like introduction on how to get spell checking working in Solr? It appears many steps are required before the spell-checking functions can be used. It also appears that a dictionary (a list of correctly spelled words) is required to set up the spell checker. Can anyone validate my impression?

Thanks. -- View this message in context: http://www.nabble.com/spell-checking-tp23835427p23835427.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: spell checking
Yes, I did. I was not able to grasp the concept of how to make spell checking work. For example, the wiki page says a spell-check index needs to be built, but it does not say how to do it. Does Solr build the index out of thin air? Is the index built from the main index? Or is the index built from a dictionary or word list? Please help.

Grant Ingersoll-6 wrote:
>
> Have you gone through: http://wiki.apache.org/solr/SpellCheckComponent
>
> On Jun 2, 2009, at 8:50 AM, Yao Ge wrote:
>
>> Can someone help providing a tutorial like introduction on how to get
>> spell-checking work in Solr. It appears many steps are requires before the
>> spell-checkering functions can be used. It also appears that a dictionary (a
>> list of correctly spelled words) is required to setup the spell checker. Can
>> anyone validate my impression?
>>
>> Thanks.
>> --
>> View this message in context: http://www.nabble.com/spell-checking-tp23835427p23835427.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>
> --
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
> using Solr/Lucene:
> http://www.lucidimagination.com/search

-- View this message in context: http://www.nabble.com/spell-checking-tp23835427p23840843.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: spell checking
Sorry for not being able to get my point across. I know the syntax that triggers an index build for spell checking. I actually ran the command and saw some additional files created in the data\spellchecker1 directory. What I don't understand is what is in there, as I cannot get Solr to make spell suggestions using the query structure documented in the wiki.

Can anyone tell me what happens after the default spell-check index is built? In my case, I used copyField to copy a couple of text fields into a field called "spell". These fields are the original text; they are the ones with the typos that I need to run spell checking against. How can this original data be used as a base for spell checking? How does Solr know which words are correctly spelled? ... ...

Yao Ge wrote:
>
> Can someone help providing a tutorial like introduction on how to get
> spell-checking work in Solr. It appears many steps are requires before the
> spell-checkering functions can be used. It also appears that a dictionary
> (a list of correctly spelled words) is required to setup the spell
> checker. Can anyone validate my impression?
>
> Thanks.

-- View this message in context: http://www.nabble.com/spell-checking-tp23835427p23841373.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: spell checking
Excellent. Now everything makes sense to me. :-) The spell-checking suggestion is the closest variant of the user's input that actually exists in the main index. The so-called "correction" is relative to the indexed text, so there is no need for a brute-force list of all correctly spelled words. Maybe we should call this "alternative search terms" or "suggested search terms" instead of spell checking. "Spell checking" is misleading, as there is no right or wrong in spelling here; there are only popular (term frequency?) alternatives.

Thanks for the insight.

Otis Gospodnetic wrote:
>
> Hello,
>
> In short, the assumption behind this type of SC is that the text in the
> main index is (mostly) correctly spelled. When the SC finds query
> terms that are close in spelling to words indexed in SC, it offers
> spelling suggestions/correction using those presumably correctly spelled
> terms (there are other parameters that control the exact behaviour, but
> this is the idea).
>
> Solr (Lucene's spellchecker, which Solr uses under the hood, actually)
> turns the input text (values from those fields you copy to the spell field)
> into so called n-grams. You can see that if you open up the SC index with
> something like Luke. Please see
> http://wiki.apache.org/jakarta-lucene/SpellChecker .
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> - Original Message
>> From: Yao Ge
>> To: solr-user@lucene.apache.org
>> Sent: Tuesday, June 2, 2009 5:34:07 PM
>> Subject: Re: spell checking
>>
>> Sorry for not be able to get my point across.
>>
>> I know the syntax that leads to a index build for spell checking. I actually
>> run the command saw some additional file created in data\spellchecker1
>> directory. What I don't understand is what is in there as I can not trick
>> Solr to make spell suggestions based on the documented query structure in
>> wiki.
>>
>> Can anyone tell me what happened after when the default spell check is
>> built? In my case, I used copyField to copy a couple of text fields into a
>> field called "spell". These fields are the original text, they are the ones
>> with typos that I need to run spell check on. But how can these original
>> data be used as a base for spell checking? How does Solr know what are
>> correctly spelled words?
>>
>> multiValued="true"/>
>>
>> multiValued="true"/>
>> ...
>>
>> multiValued="true"/>
>> ...
>>
>> Yao Ge wrote:
>> >
>> > Can someone help providing a tutorial like introduction on how to get
>> > spell-checking work in Solr. It appears many steps are requires before the
>> > spell-checkering functions can be used. It also appears that a dictionary
>> > (a list of correctly spelled words) is required to setup the spell
>> > checker. Can anyone validate my impression?
>> >
>> > Thanks.
>> >
>>
>> --
>> View this message in context: http://www.nabble.com/spell-checking-tp23835427p23841373.html
>> Sent from the Solr - User mailing list archive at Nabble.com.

-- View this message in context: http://www.nabble.com/spell-checking-tp23835427p23844050.html Sent from the Solr - User mailing list archive at Nabble.com.
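For reference, an index-based spellchecker of the kind discussed in this thread is typically wired up as below (the field, type and directory names are the common examples, not the poster's actual config):

  <!-- schema.xml: aggregate the fields to check into one "spell" field -->
  <field name="spell" type="textSpell" indexed="true" stored="false" multiValued="true"/>
  <copyField source="title" dest="spell"/>
  <copyField source="text" dest="spell"/>

  <!-- solrconfig.xml: build the spellcheck index from the "spell" field -->
  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <str name="queryAnalyzerFieldType">textSpell</str>
    <lst name="spellchecker">
      <str name="name">default</str>
      <str name="field">spell</str>
      <str name="spellcheckIndexDir">./spellchecker1</str>
    </lst>
  </searchComponent>

A request with &spellcheck=true&spellcheck.build=true builds the n-gram index; after that, &spellcheck=true&spellcheck.q=whatevar returns the closest indexed variants.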
Faceting on text fields
I am indexing a database with over 1 million rows. Two of the fields contain unstructured text, but the size of each field is limited (256 characters).

I came up with an idea to visualize the text fields as a text cloud by turning the two text fields into facets. The font weight and size of each facet value (word) is derived from the facet counts. I used a simpler field type so that there is no stemming of these facet values:

The facet query is considerably slower compared to the other facets built from structured database fields (with highly repeated values). What I found interesting is that even after I constrained the search results to just a few hundred hits using other facets, these text facets are still very slow.

I understand that text fields are not good candidates for faceting, as they can contain a very large number of unique values. But why is it still slow after the matching documents are reduced to hundreds? Is it because the whole filter is cached (regardless of the matching docs) and I don't have enough filter cache size to fit the whole list?

The following is my filterCache setting:

Lastly, what I really want is to give the user a chance to visualize and filter on the top relevant words in the free-text fields. Is there an alternative to the facet-field approach? Term vectors? I can do client-side processing based on the top N (say 100) hits, but that is my last option. -- View this message in context: http://www.nabble.com/Faceting-on-text-fields-tp23872891p23872891.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Faceting on text fields
Yes. I am using 1.3. When is 1.4 due for release? Yonik Seeley-2 wrote: > > Are you using Solr 1.3? > You might want to try the latest 1.4 test build - faceting has changed a > lot. > > -Yonik > http://www.lucidimagination.com > > On Thu, Jun 4, 2009 at 12:01 PM, Yao Ge wrote: >> >> I am index a database with over 1 millions rows. Two of fields contain >> unstructured text but size of each fields is limited (256 characters). >> >> I come up with an idea to use visualize the text fields using text cloud >> by >> turning the two text fields in facets. The weight of font and size is of >> each facet value (words) derived from the facet counts. I used simpler >> field >> type so that the there is no stemming to these facet values: >> > positionIncrementGap="100" >>> >> >> >> > ignoreCase="true" expand="false"/> >> > words="stopwords.txt"/> >> > generateWordParts="0" generateNumberParts="0" catenateWords="1" >> catenateNumbers="1" catenateAll="0"/> >> >> >> >> >> >> The facet query is considerably slower comparing to other facets from >> structured database fields (with highly repeated values). What I found >> interesting is that even after I constrained search results to just a few >> hunderd hits using other facets, these text facets are still very slow. >> >> I understand that text fields are not good candidate for faceting as it >> can >> contain very large number of unique values. However why it is still slow >> after my matching documents is reduced to hundreds? Is it because the >> whole >> filter is cached (regardless the matching docs) and I don't have enough >> filter cache size to fit the whole list? >> >> The following is my filterCahce setting: >> > autowarmCount="128"/> >> >> Lastly, what I really want to is to give user a chance to visualize and >> filter on top relevant words in the free-text fields. Are there >> alternative >> to facet field approach? term vectors? I can do client side process based >> on >> top N (say 100) hits for this but it is my last option. >> -- >> View this message in context: >> http://www.nabble.com/Faceting-on-text-fields-tp23872891p23872891.html >> Sent from the Solr - User mailing list archive at Nabble.com. >> >> > > -- View this message in context: http://www.nabble.com/Faceting-on-text-fields-tp23872891p23876051.html Sent from the Solr - User mailing list archive at Nabble.com.
Query Filter fq with OR operator
If I want to use the OR operator with multiple query filters, I can do:

fq=popularity:[10 TO *] OR section:0

Is there a more efficient alternative to this? -- View this message in context: http://www.nabble.com/Query-Filter-fq-with-OR-operator-tp23895837p23895837.html Sent from the Solr - User mailing list archive at Nabble.com.
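A note on the standard fq semantics (general Solr behavior, not specific to this thread): separate fq parameters are intersected (ANDed), so an OR across filters has to live inside a single fq, while independently cached filters each get their own parameter:

  # one filterCache entry matching the union of the two clauses
  fq=popularity:[10 TO *] OR section:0
  # two independently cached filters, intersected (AND)
  fq=popularity:[10 TO *]&fq=section:0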
Re: Faceting on text fields
Michael,

Thanks for the update! I definitely need to get a 1.4 build and see if it makes a difference.

BTW, maybe instead of using faceting for text mining/clustering/visualization purposes, we could build a separate feature in SOLR for this. Many of the commercial search engines I have experience with (Google Search Appliance, Vivisimo etc.) provide dynamic term clustering based on the top N ranked documents (N is a configurable parameter). When a facet field is highly fragmented (say a text field), the existing set-intersection-based approach might no longer be optimal. Aggregating term vectors over the top N docs might be more attractive. Another feature I would really appreciate is search-time n-gram term clustering. Maybe that is better suited to the "spell checker", as it is just a different way to display alternative search terms.

-Yao

Michael Ludwig-4 wrote:
>
> Yao Ge schrieb:
>
>> The facet query is considerably slower comparing to other facets from
>> structured database fields (with highly repeated values). What I found
>> interesting is that even after I constrained search results to just a
>> few hunderd hits using other facets, these text facets are still very
>> slow.
>>
>> I understand that text fields are not good candidate for faceting as
>> it can contain very large number of unique values. However why it is
>> still slow after my matching documents is reduced to hundreds? Is it
>> because the whole filter is cached (regardless the matching docs) and
>> I don't have enough filter cache size to fit the whole list?
>
> Very interesting questions! I think an answer would both require and
> further an understanding of how filters work, which might even lead to
> a more general guideline on when and how to use filters and facets.
>
> Even though faceting appears to have changed in 1.4 vs 1.3, it would
> still be interesting to understand the 1.3 side of things.
>
>> Lastly, what I really want to is to give user a chance to visualize
>> and filter on top relevant words in the free-text fields. Are there
>> alternative to facet field approach? term vectors? I can do client
>> side process based on top N (say 100) hits for this but it is my last
>> option.
>
> Also a very interesting data mining question! I'm sorry I don't have any
> answers for you. Maybe someone else does.
>
> Best,
>
> Michael Ludwig

-- View this message in context: http://www.nabble.com/Faceting-on-text-fields-tp23872891p23950084.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Faceting on text fields
Thanks for insight Otis. I have no awareness of ClusteringComponent until now. It is time to move to Solr 1.4 -Yao Otis Gospodnetic wrote: > > > Yao, > > Solr can already cluster top N hits using Carrot2: > http://wiki.apache.org/solr/ClusteringComponent > > I've also done ugly "manual counting" of terms in top N hits. For > example, look at the right side of this: > http://www.simpy.com/user/otis/tag/%22machine+learning%22 > > Something like http://www.sematext.com/product-key-phrase-extractor.html > could also be used. > > Otis > -- > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch > > > > - Original Message >> From: Yao Ge >> To: solr-user@lucene.apache.org >> Sent: Tuesday, June 9, 2009 3:46:13 PM >> Subject: Re: Faceting on text fields >> >> >> Michael, >> >> Thanks for the update! I definitely need to get a 1.4 build see if it >> makes >> a difference. >> >> BTW, maybe instead of using faceting for text >> mining/clustering/visualization purpose, we can build a separate feature >> in >> SOLR for this. Many of commercial search engines I have experiences with >> (Google Search Appliance, Vivisimo etc) provide dynamic term clustering >> based on top N ranked documents (N is a parameter can be configured). >> When >> facet field is highly fragmented (say a text field), the existing set >> intersection based approach might no longer be optimum. Aggregating term >> vectors over top N docs might be more attractive. Another features I can >> really appreciate is to provide search time n-gram term clustering. Maybe >> this might be better suited for "spell checker" as it just a different >> way >> to display the alternative search terms. >> >> -Yao >> >> >> Michael Ludwig-4 wrote: >> > >> > Yao Ge schrieb: >> > >> >> The facet query is considerably slower comparing to other facets from >> >> structured database fields (with highly repeated values). What I found >> >> interesting is that even after I constrained search results to just a >> >> few hunderd hits using other facets, these text facets are still very >> >> slow. >> >> >> >> I understand that text fields are not good candidate for faceting as >> >> it can contain very large number of unique values. However why it is >> >> still slow after my matching documents is reduced to hundreds? Is it >> >> because the whole filter is cached (regardless the matching docs) and >> >> I don't have enough filter cache size to fit the whole list? >> > >> > Very interesting questions! I think an answer would both require and >> > further an understanding of how filters work, which might even lead to >> > a more general guideline on when and how to use filters and facets. >> > >> > Even though faceting appears to have changed in 1.4 vs 1.3, it would >> > still be interesting to understand the 1.3 side of things. >> > >> >> Lastly, what I really want to is to give user a chance to visualize >> >> and filter on top relevant words in the free-text fields. Are there >> >> alternative to facet field approach? term vectors? I can do client >> >> side process based on top N (say 100) hits for this but it is my last >> >> option. >> > >> > Also a very interesting data mining question! I'm sorry I don't have >> any >> > answers for you. Maybe someone else does. >> > >> > Best, >> > >> > Michael Ludwig >> > >> > >> >> -- >> View this message in context: >> http://www.nabble.com/Faceting-on-text-fields-tp23872891p23950084.html >> Sent from the Solr - User mailing list archive at Nabble.com. 
> > > -- View this message in context: http://www.nabble.com/Faceting-on-text-fields-tp23872891p23965401.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Faceting on text fields
FYI. I did a direct integration of Carrot2 with SolrJ, using a separate Ajax call from the UI that clusters the terms in the two text fields over the top 100 hits. It got performance comparable to the other facets in terms of response time.

In terms of algorithms, they list two, "Lingo" and "STC", which I don't recognize. But I think at least one of them might use SVD (http://en.wikipedia.org/wiki/Singular_value_decomposition).

-Yao

Otis Gospodnetic wrote:
>
> I'd call it related (their application in search encourages exploration),
> but also distinct enough to never mix them up. I think your assessment
> below is correct, although I'm not familiar with the details of Carrot2
> any more (was once), so I can't tell you exactly which algo is used under
> the hood.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> - Original Message
>> From: Michael Ludwig
>> To: solr-user@lucene.apache.org
>> Sent: Wednesday, June 10, 2009 9:41:54 AM
>> Subject: Re: Faceting on text fields
>>
>> Otis Gospodnetic schrieb:
>> >
>> > Solr can already cluster top N hits using Carrot2:
>> > http://wiki.apache.org/solr/ClusteringComponent
>>
>> Would it be fair to say that clustering as detailed on the page you're
>> referring to is a kind of dynamic faceting? The faceting not being done
>> based on distinct values of certain fields, but on the presence (and
>> frequency) of terms in one field?
>>
>> The main difference seems to be that with faceting, grouping criteria
>> (facets) are known beforehand, while with clustering, grouping criteria
>> (the significant terms which create clusters - the cluster keys) have
>> yet to be determined. Is that a correct assessment?
>>
>> Michael Ludwig

-- View this message in context: http://www.nabble.com/Faceting-on-text-fields-tp23872891p23980124.html Sent from the Solr - User mailing list archive at Nabble.com.
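For anyone who wants to replicate this, the sketch below shows the shape of such an integration with the Carrot2 Java API; the query string and the title/description field names are hypothetical, and this is a reconstruction of the approach rather than the actual code used:

  import java.util.ArrayList;
  import java.util.List;

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.common.SolrDocument;
  import org.carrot2.clustering.lingo.LingoClusteringAlgorithm;
  import org.carrot2.core.Cluster;
  import org.carrot2.core.Controller;
  import org.carrot2.core.ControllerFactory;
  import org.carrot2.core.Document;
  import org.carrot2.core.ProcessingResult;

  public class ClusterTopHits {
    public static void main(String[] args) throws Exception {
      CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
      // fetch the top 100 hits with the two text fields to cluster on
      SolrQuery q = new SolrQuery("engine").setRows(100).setFields("title", "description");
      List<Document> docs = new ArrayList<Document>();
      for (SolrDocument d : solr.query(q).getResults()) {
        docs.add(new Document((String) d.getFieldValue("title"),
                              (String) d.getFieldValue("description")));
      }
      // cluster the raw text of the top hits with the Lingo algorithm
      Controller controller = ControllerFactory.createSimple();
      ProcessingResult result = controller.process(docs, "engine", LingoClusteringAlgorithm.class);
      for (Cluster c : result.getClusters()) {
        System.out.println(c.getLabel() + " (" + c.getAllDocuments().size() + ")");
      }
    }
  }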
Re: Faceting on text fields
BTW, Carrot2 has a very impressive Clustering Workbench (based on eclipse) that has built-in integration with Solr. If you have a Solr service running, it is a just a matter of point the workbench to it. The clustering results and visualization are amazing. (http://project.carrot2.org/download.html). Yao Ge wrote: > > FYI. I did a direct integration with Carrot2 with Solrj with a separate > Ajax call from UI for top 100 hits to clusters terms in the two text > fields. It gots comparable performance to other facets in terms of > response time. > > In terms of algorithms, their listed two "Lingo" and "STC" which I don't > reconize. But I think at least one of them might have used SVD > (http://en.wikipedia.org/wiki/Singular_value_decomposition). > > -Yao > > > Otis Gospodnetic wrote: >> >> >> I'd call it related (their application in search encourages exploration), >> but also distinct enough to never mix them up. I think your assessment >> below is correct, although I'm not familiar with the details of Carrot2 >> any more (was once), so I can't tell you exactly which algo is used under >> the hood. >> >> Otis >> -- >> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch >> >> >> >> - Original Message >>> From: Michael Ludwig >>> To: solr-user@lucene.apache.org >>> Sent: Wednesday, June 10, 2009 9:41:54 AM >>> Subject: Re: Faceting on text fields >>> >>> Otis Gospodnetic schrieb: >>> > >>> > Solr can already cluster top N hits using Carrot2: >>> > http://wiki.apache.org/solr/ClusteringComponent >>> >>> Would it be fair to say that clustering as detailed on the page you're >>> referring to is a kind of dynamic faceting? The faceting not being done >>> based on distinct values of certain fields, but on the presence (and >>> frequency) of terms in one field? >>> >>> The main difference seems to be that with faceting, grouping criteria >>> (facets) are known beforehand, while with clustering, grouping criteria >>> (the significant terms which create clusters - the cluster keys) have >>> yet to be determined. Is that a correct assessment? >>> >>> Michael Ludwig >> >> >> > > -- View this message in context: http://www.nabble.com/Faceting-on-text-fields-tp23872891p23980959.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Query Filter fq with OR operator
I would like to submit a JIRA issue for this. Can anyone point me to where to go?

-Yao

Otis Gospodnetic wrote:
>
> Brian,
>
> Opening a JIRA issue if it doesn't already exist is the best way. If you
> can provide a patch, even better!
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> - Original Message
>> From: brian519
>> To: solr-user@lucene.apache.org
>> Sent: Tuesday, June 16, 2009 1:32:41 PM
>> Subject: Re: Query Filter fq with OR operator
>>
>> This feature is very important to me .. should I post something on the dev
>> forum? Not sure what the proper protocol is for adding a feature to the
>> roadmap
>>
>> Thanks,
>> Brian.
>> --
>> View this message in context: http://www.nabble.com/Query-Filter-fq-with-OR-operator-tp23895837p24059181.html
>> Sent from the Solr - User mailing list archive at Nabble.com.

-- View this message in context: http://www.nabble.com/Query-Filter-fq-with-OR-operator-tp23895837p24222170.html Sent from the Solr - User mailing list archive at Nabble.com.
Faceting with MoreLikeThis
Does Solr support faceting on MoreLikeThis search results? -- View this message in context: http://www.nabble.com/Faceting-with-MoreLikeThis-tp24356166p24356166.html Sent from the Solr - User mailing list archive at Nabble.com.
Filtering MoreLikeThis results
I could not find any support in http://wiki.apache.org/solr/MoreLikeThis for restricting MLT results to certain subsets. I passed along an fq parameter and it was ignored. Since we cannot incorporate the filters into the query itself (which is used to retrieve the target document for the similarity comparison), it appears there is no way to filter MLT results. BTW, I am using Solr 1.3. Please let me know if there is a way (other than hacking the source code) to do this. Thanks! -- View this message in context: http://www.nabble.com/Filtering-MoreLikeThis-results-tp24360355p24360355.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Filtering MoreLikeThis results
I am not sure about the parameters for MLT the requestHandler plugin. Can one of you share the solrconfig.xml entry for MLT? Thanks in advance. -Yao Bill Au wrote: > > I have been using the StandardRequestHandler (ie /solr/select). fq does > work with the MoreLikeThisHandler. I will switch to use that. Thanks. > > Bill > > On Tue, Jul 7, 2009 at 11:02 AM, Marc Sturlese > wrote: > >> >> At least in trunk, if you request for: >> http://localhost:8084/solr/core_A/mlt?q=id:7468365&fq=price[100<http://localhost:8084/solr/core_A/mlt?q=id:7468365&fq=price%5B100>TO >> 200] >> It will filter the MoreLikeThis results >> >> >> Bill Au wrote: >> > >> > I think fq only works on the main response, not the mlt matches. I >> found >> > a >> > couple of releated jira: >> > >> > http://issues.apache.org/jira/browse/SOLR-295 >> > http://issues.apache.org/jira/browse/SOLR-281 >> > >> > If I am reading them correctly, I should be able to use DIsMax and >> > MoreLikeThis together. I will give that a try and report back. >> > >> > Bill >> > >> > >> > On Tue, Jul 7, 2009 at 4:45 AM, Marc Sturlese >> > wrote: >> > >> >> >> >> Using MoreLikeThisHandler you can use fq to filter your results. As >> far >> >> as >> >> I >> >> know bq are not allowed. >> >> >> >> >> >> Bill Au wrote: >> >> > >> >> > I have been trying to restrict MoreLikeThis results without any luck >> >> also. >> >> > In additional to restricting the results, I am also looking to >> >> influence >> >> > the >> >> > scores similar to the way boost query (bq) works in the >> >> > DisMaxRequestHandler. >> >> > >> >> > I think Solr's MoreLikeThis depends on Lucene's contrib queries >> >> > MoreLikeThis, or at least it used to. Has anyone looked into >> enhancing >> >> > Solrs' MoreLikeThis to support bq and restricting mlt results? >> >> > >> >> > Bill >> >> > >> >> > On Mon, Jul 6, 2009 at 2:16 PM, Yao Ge wrote: >> >> > >> >> >> >> >> >> I could not find any support from >> >> >> http://wiki.apache.org/solr/MoreLikeThison >> >> >> how to restrict MLT results to certain subsets. I passed along a fq >> >> >> parameter and it is ignored. Since we can not incorporate the >> filters >> >> in >> >> >> the >> >> >> query itself which is used to retrieve the target for similarity >> >> >> comparison, >> >> >> it appears there is no way to filter MLT results. BTW. I am using >> Solr >> >> >> 1.3. >> >> >> Please let me know if there is way (other than hacking the source >> >> code) >> >> >> to >> >> >> do this. Thanks! >> >> >> -- >> >> >> View this message in context: >> >> >> >> >> >> http://www.nabble.com/Filtering-MoreLikeThis-results-tp24360355p24360355.html >> >> >> Sent from the Solr - User mailing list archive at Nabble.com. >> >> >> >> >> >> >> >> > >> >> > >> >> >> >> -- >> >> View this message in context: >> >> >> http://www.nabble.com/Filtering-MoreLikeThis-results-tp24360355p24369257.html >> >> Sent from the Solr - User mailing list archive at Nabble.com. >> >> >> >> >> > >> > >> >> -- >> View this message in context: >> http://www.nabble.com/Filtering-MoreLikeThis-results-tp24360355p24374996.html >> Sent from the Solr - User mailing list archive at Nabble.com. >> >> > > -- View this message in context: http://www.nabble.com/Filtering-MoreLikeThis-results-tp24360355p24377360.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Filtering MoreLikeThis results
The answer to my owner question: ... ... would work. -Yao Yao Ge wrote: > > I am not sure about the parameters for MLT the requestHandler plugin. Can > one of you share the solrconfig.xml entry for MLT? Thanks in advance. > -Yao > > > Bill Au wrote: >> >> I have been using the StandardRequestHandler (ie /solr/select). fq does >> work with the MoreLikeThisHandler. I will switch to use that. Thanks. >> >> Bill >> >> On Tue, Jul 7, 2009 at 11:02 AM, Marc Sturlese >> wrote: >> >>> >>> At least in trunk, if you request for: >>> http://localhost:8084/solr/core_A/mlt?q=id:7468365&fq=price[100<http://localhost:8084/solr/core_A/mlt?q=id:7468365&fq=price%5B100>TO >>> 200] >>> It will filter the MoreLikeThis results >>> >>> >>> Bill Au wrote: >>> > >>> > I think fq only works on the main response, not the mlt matches. I >>> found >>> > a >>> > couple of releated jira: >>> > >>> > http://issues.apache.org/jira/browse/SOLR-295 >>> > http://issues.apache.org/jira/browse/SOLR-281 >>> > >>> > If I am reading them correctly, I should be able to use DIsMax and >>> > MoreLikeThis together. I will give that a try and report back. >>> > >>> > Bill >>> > >>> > >>> > On Tue, Jul 7, 2009 at 4:45 AM, Marc Sturlese >>> > wrote: >>> > >>> >> >>> >> Using MoreLikeThisHandler you can use fq to filter your results. As >>> far >>> >> as >>> >> I >>> >> know bq are not allowed. >>> >> >>> >> >>> >> Bill Au wrote: >>> >> > >>> >> > I have been trying to restrict MoreLikeThis results without any >>> luck >>> >> also. >>> >> > In additional to restricting the results, I am also looking to >>> >> influence >>> >> > the >>> >> > scores similar to the way boost query (bq) works in the >>> >> > DisMaxRequestHandler. >>> >> > >>> >> > I think Solr's MoreLikeThis depends on Lucene's contrib queries >>> >> > MoreLikeThis, or at least it used to. Has anyone looked into >>> enhancing >>> >> > Solrs' MoreLikeThis to support bq and restricting mlt results? >>> >> > >>> >> > Bill >>> >> > >>> >> > On Mon, Jul 6, 2009 at 2:16 PM, Yao Ge wrote: >>> >> > >>> >> >> >>> >> >> I could not find any support from >>> >> >> http://wiki.apache.org/solr/MoreLikeThison >>> >> >> how to restrict MLT results to certain subsets. I passed along a >>> fq >>> >> >> parameter and it is ignored. Since we can not incorporate the >>> filters >>> >> in >>> >> >> the >>> >> >> query itself which is used to retrieve the target for similarity >>> >> >> comparison, >>> >> >> it appears there is no way to filter MLT results. BTW. I am using >>> Solr >>> >> >> 1.3. >>> >> >> Please let me know if there is way (other than hacking the source >>> >> code) >>> >> >> to >>> >> >> do this. Thanks! >>> >> >> -- >>> >> >> View this message in context: >>> >> >> >>> >> >>> http://www.nabble.com/Filtering-MoreLikeThis-results-tp24360355p24360355.html >>> >> >> Sent from the Solr - User mailing list archive at Nabble.com. >>> >> >> >>> >> >> >>> >> > >>> >> > >>> >> >>> >> -- >>> >> View this message in context: >>> >> >>> http://www.nabble.com/Filtering-MoreLikeThis-results-tp24360355p24369257.html >>> >> Sent from the Solr - User mailing list archive at Nabble.com. >>> >> >>> >> >>> > >>> > >>> >>> -- >>> View this message in context: >>> http://www.nabble.com/Filtering-MoreLikeThis-results-tp24360355p24374996.html >>> Sent from the Solr - User mailing list archive at Nabble.com. >>> >>> >> >> > > -- View this message in context: http://www.nabble.com/Filtering-MoreLikeThis-results-tp24360355p24380408.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Faceting with MoreLikeThis
Faceting on MLT requires the use of the MoreLikeThisHandler. The standard request handler, while providing support for MLT via a search component, does not return facets on MLT results. To enable the MLT handler, add an entry like the one below to your solrconfig.xml:

<requestHandler name="/mlt" class="solr.MoreLikeThisHandler"/>

The query parameter syntax for faceting remains the same as with the standard request handler.

-Yao

Yao Ge wrote:
>
> Does Solr support faceting on MoreLikeThis search results?

-- View this message in context: http://www.nabble.com/Faceting-with-MoreLikeThis-tp24356166p24380459.html Sent from the Solr - User mailing list archive at Nabble.com.
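Putting the pieces of this thread together, a faceted MLT request against that handler might look like the sketch below (the content_mlt and cat field names are borrowed from other threads here, so treat them as placeholders):

  http://localhost:8983/solr/mlt?q=id:10&mlt.fl=content_mlt&facet=true&facet.mincount=1&facet.field=cat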
Re: A big question about Solr and SolrJ range query ?
use Solr's Filter Query parameter "fq": fq=x:[10 TO 100]&fq=y:[20 TO 300]&fl=title -Yao huenzhao wrote: > > Hi all: > > Suppose that my index have 3 fields: title, x and y. > > I know one range(10 < x < 100) can query liks this: > > http://localhost:8983/solr/select?q=x:[10 TO 100]&fl=title > > If I want to two range(10 < x <100 AND 20 < y < 300) query like > > SQL(select title where x>10 and x < 100 and y > 20 and y < 300) > > by using Solr range query or SolrJ, but not know how to implement. Anybody > know ? Thanks > > Email: enzhao...@gmail.com > > -- View this message in context: http://www.nabble.com/A-big-question-about-Solr-and-SolrJ-range-query---tp24384416p24384540.html Sent from the Solr - User mailing list archive at Nabble.com.
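The question also asked about SolrJ; here is a minimal equivalent sketch (the server URL and class names are the 1.x-era defaults; use {10 TO 100} instead of [10 TO 100] if the bounds must be strictly exclusive):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;

  public class RangeQueryExample {
    public static void main(String[] args) throws Exception {
      CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
      SolrQuery q = new SolrQuery("*:*");
      // each filter query is intersected: 10 <= x <= 100 AND 20 <= y <= 300
      q.addFilterQuery("x:[10 TO 100]", "y:[20 TO 300]");
      q.setFields("title");
      QueryResponse rsp = server.query(q);
      System.out.println("hits: " + rsp.getResults().getNumFound());
    }
  }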
Re: about defaultSearchField
Try with fl=* or fl=*,score added to your request string. -Yao Yang Lin-2 wrote: > > Hi, > I have some problems. > For my solr progame, I want to type only the Query String and get all > field > result that includ the Query String. But now I can't get any result > without > specified field. For example, query with "tina" get nothing, but > "Sentence:tina" could. > > I hava adjusted the *schema.xml* like this: > > >>> stored="true" multiValued="true"/> >>> stored="true" multiValued="true"/> >>> stored="true" multiValued="true"/> >>> multiValued="true"/> >> >>> multiValued="true"/> >> >> >> Sentence >> >> >> allText >> >> >> >> >> >> >> >> > > > I think the problem is in , but I don't know how to > fix > it. Could anyone help me? > > Thanks > Yang > > -- View this message in context: http://www.nabble.com/about-defaultSearchField-tp24382105p24384615.html Sent from the Solr - User mailing list archive at Nabble.com.
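If fl=* does not help, the schema fragments in the question hint that the unfielded search may not be pointed at a populated field. A sketch of the likely intended setup, reusing the Sentence/allText names from the post (this is an assumption about the poster's schema, not a confirmed fix):

  <!-- copy the per-field content into one catch-all field -->
  <copyField source="Sentence" dest="allText"/>
  <!-- make unfielded queries like "tina" search that field -->
  <defaultSearchField>allText</defaultSearchField>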
Re: Solr's MLT query call doesn't work
A couple of things, your mlt.fl value, must be part of fl. In this case, content_mlt is not included in fl. I think the fl parameter value need to be comma separated. try fl=title,author,content_mlt,score -Yao SergeyG wrote: > > Hi, > > Recently, while implementing the MoreLikeThis search, I've run into the > situation when Solr's mlt query calls don't work. > > More specifically, the following query: > > http://localhost:8080/solr/select?q=id:10&mlt=true&mlt.fl=content_mlt&mlt.maxqt= > 5&mlt.interestingTerms=details&fl=title+author+score > > brings back just the doc with id=10 and nothing else. While using the > GetMethod approach (putting /mlt explicitely into the url), I got back > some results. > > I've been trying to solve this problem for more than a week with no luck. > If anybody has any hint, please help. > > Below, I put logs & outputs from 3 runs: a) Solr; b) GetMethod (/mlt); c) > GetMethod (/select). > > Thanks a lot. > > Regards, > Sergey Goldberg > > > Here're the logs: > > a) Solr (http://localhost:8080/solr/select) > 08.07.2009 15:50:33 org.apache.solr.core.SolrCore execute > INFO: [] webapp=/solr path=/select > params={fl=title+author+score&mlt.fl=content_mlt&q=id:10&mlt= > true&mlt.interestingTerms=details&mlt.maxqt=5&wt=javabin&version=2.2} > hits=1 status=0 QTime=172 > > INFO MLTSearchRequestProcessor:49 - SolrServer url: > http://localhost:8080/solr > INFO MLTSearchRequestProcessor:67 - solrQuery> > q=id%3A10&mlt=true&mlt.fl=content_mlt&mlt.maxqt= > 5&mlt.interestingTerms=details&fl=title+author+score > INFO MLTSearchRequestProcessor:73 - Number of docs found = 1 > INFO MLTSearchRequestProcessor:77 - title = SG_Book; score = 2.098612 > > > b) GetMethod (http://localhost:8080/solr/mlt) > 08.07.2009 16:55:44 org.apache.solr.core.SolrCore execute > INFO: [] webapp=/solr path=/mlt > params={fl=title+author+score&mlt.fl=content_mlt&q=id:10&mlt.max > qt=5&mlt.interestingTerms=details} status=0 QTime=15 > > INFO MLT2SearchRequestProcessor:76 - encoding="UTF-8"?> > > 0 name="QTime">0 maxScore="2.098612">2.098612S.G. name="title">SG_Book umFound="4" start="0" maxScore="0.28923997"> name="score">0.28923997O. > HenryS.G.Four Million, > The0.08667877 name="author">Katherine MosbyThe Season > of Lillian Dawes name="score">0.07947738Jerome K. > JeromeThree Men in a > Boat name="score">0.047219563Charles > OliverS.G.ABC's of > Science name="content_mlt:ye">1.0 name="content_mlt:tobin">1.0 name="content_mlt:a">1.0 name="content_mlt:i">1.0 name="content_mlt:his">1.0 > > > > c) GetMethod (http://localhost:8080/solr/select) > 08.07.2009 17:06:45 org.apache.solr.core.SolrCore execute > INFO: [] webapp=/solr path=/select > params={fl=title+author+score&mlt.fl=content_mlt&q=id:10&mlt. > maxqt=5&mlt.interestingTerms=details} hits=1 status=0 QTime=16 > > INFO MLT2SearchRequestProcessor:80 - encoding="UTF-8"?> > > 0 name="QTime">16title author > scorecontent_mlt name="q">id:105 name="mlt.interestingTerms">details name="response" numFound="1" start="0" maxScore="2.098612"> name="score">2.098612S.G. 
name="title">SG_Book name="rawquerystring">id:10id:10 name="parsedq > uery">id:10id:10 name="explain"> > 2.098612 = (MATCH) weight(id:10 in 3), product of: > 0.9994 = queryWeight(id:10), product of: > 2.0986123 = idf(docFreq=1, numDocs=5) > 0.47650534 = queryNorm > 2.0986123 = (MATCH) fieldWeight(id:10 in 3), product of: > 1.0 = tf(termFreq(id:10)=1) > 2.0986123 = idf(docFreq=1, numDocs=5) > 1.0 = fieldNorm(field=id, doc=3) > OldLuceneQParser name="timing">16.0 name="time">0.0 name="org.apache.solr.handler.component.QueryComponent"> name="time">0.0 name="org.apache.solr.handler.component.FacetComponent"> name="time">0.00.0 name="org.apache.solr.handler.component.HighlightComponent"> name="time">0.0 name="org.apache.solr.handler.component.DebugComponent"> name="time">0.0 name="time">16.0 name="org.apache.solr.handler.component.QueryComponent"> name="time">0.0 name="org.apache.solr.handler.component.FacetComponent"> name="time">0.0 name="org.apache.solr.handler.component.MoreLikeThisComponent"> name="time">0.0 name="org.apache.solr.handler.component.HighlightComponent"> name="time">0.0 name="org.apache.solr.handler.component.DebugComponent"> name="time">16.0 > > > > And here're the relevant entries from solrconfig.xml: > > default="true"> > > > explicit > id,title,author,score > on > > > > > > 1 > 10 > > > -- View this message in context: http://www.nabble.com/Solr%27s-MLT-query-call-doesn%27t-work-tp24391843p24391918.html Sent from the Solr - User mailing list archive at Nabble.com.
DIH delta import - last modified date
I am struggling with the concept of delta import in DIH. According to the documentation, the delta import automatically records the last index timestamp and makes it available for use in the delta query. However, in many cases, when the last_modified timestamp in the database lags behind the current time, the last index timestamp is not a good basis for the delta query. Can I pick a different mechanism to generate "last_index_time", using a timestamp computed from the database (such as from a column of the database)? -- View this message in context: http://old.nabble.com/DIH-delta-import---last-modified-date-tp27231449p27231449.html Sent from the Solr - User mailing list archive at Nabble.com.
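For reference, the stock mechanism stores last_index_time in dataimport.properties and exposes it as ${dataimporter.last_index_time}; a typical delta setup looks like the sketch below (the table and column names are hypothetical). I am not aware of a built-in way to substitute a database-computed timestamp, but a lag can be applied directly in the SQL of deltaQuery:

  <entity name="item" pk="id"
          query="select * from item"
          deltaQuery="select id from item where last_modified > '${dataimporter.last_index_time}'"
          deltaImportQuery="select * from item where id='${dataimporter.delta.id}'"/>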
hl.maxAlternateFieldLength defaults in solrconfig.xml
It appears the hl.maxAlternateFieldLength parameter's default setting in solrconfig.xml does not take effect. I can only get it to work by explicitly sending the parameter in the client request. It is not a big deal, but it appears to be a bug. -- View this message in context: http://old.nabble.com/hl.maxAlternateFieldLength-defaults-in-solrconfig.xml-tp27542463p27542463.html Sent from the Solr - User mailing list archive at Nabble.com.
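For comparison, this is the standard way to set such a default in a handler's defaults section, which is what one would expect to be honored (a sketch of the usual pattern, not the poster's actual config):

  <requestHandler name="standard" class="solr.SearchHandler" default="true">
    <lst name="defaults">
      <str name="hl">true</str>
      <str name="hl.maxAlternateFieldLength">200</str>
    </lst>
  </requestHandler>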
RE: Solr Search problem; cannot search the existing word in the index content
Modify the maxFieldLength setting in solrconfig.xml and try again; by default Solr will only index the first 10,000 tokens of each field.

Best Regards, Yandong

-Original Message-
From: Mint o_O! [mailto:mint@gmail.com]
Sent: June 3, 2010 13:58
To: solr-user@lucene.apache.org
Subject: Re: Solr Search problem; cannot search the existing word in the index content

Thanks for you advice. I did as you said and i still cannot search my content.

One thing i notice here i can search for only the words within first 100 rows or maybe bigger than this not sure but not all. So is it the limitation of the index it self?

When I create another sample content with only small amount of data. It's working great!!!

My content is around 1.2M. I stored it as the text field as in the schema.xml sample file. Anyone has the same issue with me?

thanks,
Mint

On Tue, May 18, 2010 at 1:58 PM, Lance Norskog wrote:
> backslash*rhode
> \*rhode may work.
>
> On Mon, May 17, 2010 at 7:23 AM, Erick Erickson wrote:
> > A couple of things:
> > 1> try searching with &debugQuery=on attached to your URL, that'll
> > give you some clues.
> > 2> It's really worthwhile exploring the admin pages for a while, it'll also
> > give you a world of information. It takes a while to understand what the
> > various pages are telling you, but you'll come to rely on them.
> > 3> Are you really searching with leading and trailing wildcards or is that
> > just the mail changing bolding? Because this is tricky, very tricky. Search
> > the mail archives for "leading wildcard" to see lots of discussion of this
> > topic.
> >
> > You might back off a bit and try building up to wildcards if that's what
> > you're doing
> >
> > HTH
> > Erick
> >
> > On Mon, May 17, 2010 at 1:11 AM, Mint o_O! wrote:
> >
> >> Hi,
> >>
> >> I'm working on the index/search project recently and i found solr which is
> >> very fascinating to me.
> >>
> >> I followed the test successful from the tutorial page. Starting up jetty
> >> and
> >> run adding new xml (user:~/solr/example/exampledocs$ *java -jar post.jar
> >> *.xml*) so far so good at this stage.
> >>
> >> Now i have create my own testing westpac.xml file with real data I intend
> >> to
> >> implement, putting in exampledocs and again ran the command
> >> (user:~/solr/example/exampledocs$ *java -jar post.jar westpac.xml*).
> >> Everything went on very well however when i searched for "*rhode*" which is
> >> in the content. And Index returned nothing.
> >>
> >> Could anyone guide me what I did wrong why i couldn't search for that word
> >> even though that word is in my index content.
> >>
> >> thanks,
> >>
> >> Mint
> >>
>
> --
> Lance Norskog
> goks...@gmail.com
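Assuming maxFieldLength is indeed the culprit (the symptom of only early content being searchable matches its default token cap), raising it in solrconfig.xml and reindexing should make the whole 1.2M document searchable:

  <!-- default is 10000 tokens per field; raise it (effectively unlimited here) -->
  <maxFieldLength>2147483647</maxFieldLength>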
RE: Permissions and user to acess administrative interface
I can only speak from my experience with Tomcat. First make sure the authentication realm you need is enabled by checking server.xml. I added a few roles in tomcat-users.xml and added individual user ids/passwords to those roles. For example, you can separate Search, Update and Admin roles. Then modify web.xml to map different modules to different roles.

-Yao

-Original Message-
From: Em [mailto:mailformailingli...@yahoo.de]
Sent: Monday, February 13, 2012 11:05 AM
To: solr-user@lucene.apache.org
Subject: Re: Permissions and user to acess administrative interface

Hi Anderson,

you will need to rearrange the JSPs a little bit to do what you want. If you do so, you can create rules via .htaccess.

Otherwise I would suggest you to look for a commercial distribution of Solr which might fit your needs.

Regards,
Em

Am 13.02.2012 16:48, schrieb Anderson vasconcelos:
> Hi All
>
> Is there some way to add users and permissions on SOLR administration page?
> I need to restrict the access of users in the administration page. I Just
> wanna expose the query section for determinate user. Addition, i wanna to
> restrict the access of the cores per user. Somethings like that:
>
> Core 1 - Users : John, Paul, Carter
> Full Interface: John, Paul
> Only search interface: Carter
> Core 2 -Users: John , Mary
> Full Interface: John
> Only search interface: Mary
>
> Is that possible?
>
> Thanks
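A sketch of the web.xml mapping described above (the role and path names are examples, and container-managed BASIC auth is assumed to be configured in server.xml/tomcat-users.xml):

  <security-constraint>
    <web-resource-collection>
      <web-resource-name>Solr admin</web-resource-name>
      <url-pattern>/admin/*</url-pattern>
    </web-resource-collection>
    <auth-constraint>
      <role-name>admin</role-name>
    </auth-constraint>
  </security-constraint>
  <login-config>
    <auth-method>BASIC</auth-method>
    <realm-name>Solr</realm-name>
  </login-config>
  <security-role>
    <role-name>admin</role-name>
  </security-role>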
RE: Item Facet
If you can reindex, simply rebuild the index with a new field produced by combining the existing fields.

-Yao

-----Original Message-----
From: David Lojudice Sobrinho [mailto:dalss...@gmail.com]
Sent: Thursday, August 06, 2009 4:17 PM
To: solr-user@lucene.apache.org
Subject: Item Facet

Hi...

Is there any way to group values the way shopping.yahoo.com or shopper.cnet.com do? For instance, I have documents like:

doc1 - product_name1 - value1
doc2 - product_name1 - value2
doc3 - product_name1 - value3
doc4 - product_name2 - value4
doc5 - product_name2 - value5
doc6 - product_name2 - value6

I'd like to have the results grouped by product name, with the value range per product. Something like:

product_name1 - (value1 to value3)
product_name2 - (value4 to value6)

It is not like the current facets, because the information is grouped per item, not over the entire result. Any idea?

Thanks!

David Lojudice Sobrinho
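One way to read the reindexing suggestion above (field names are hypothetical): collapse each product into a single document at index time, precomputing the range to display:

    <add>
      <doc>
        <field name="product_name">product_name1</field>
        <field name="value_min">value1</field>
        <field name="value_max">value3</field>
      </doc>
      <doc>
        <field name="product_name">product_name2</field>
        <field name="value_min">value4</field>
        <field name="value_max">value6</field>
      </doc>
    </add>

The trade-off is that per-offer detail is lost, so this only fits when the grouped view is the primary display.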
RE: Date faceting and memory leaks
I do not have any GC-specific settings on the command line. I had tried to force a GC collection via JConsole at the end of the run, but it didn't seem to do anything to the heap size.

-Yao

-----Original Message-----
From: Antonio Lobato [mailto:alob...@symplicity.com]
Sent: Monday, May 17, 2010 2:44 PM
To: solr-user@lucene.apache.org
Subject: Re: Date faceting and memory leaks

What garbage collection settings are you running at the command line when starting Solr?

On May 17, 2010, at 2:41 PM, Yao wrote:

>
> I have been running load testing using JMeter on a Solr 1.4 index with ~4
> million docs. I notice a steady JVM heap size increase as I iterator 100
> query terms a number of times against the index. The GC does not seems to
> claim the heap after the test run is completed. It will run into OutOfMemory
> as I repeat the test or increase the number of threads/users.
>
> The date facet queries are specified as following (as part of "append"
> section in request handler):
>
> {!ex=last_modified}last_modified:[NOW-30DAY TO *]
> {!ex=last_modified}last_modified:[NOW-90DAY TO NOW-30DAY]
> {!ex=last_modified}last_modified:[NOW-180DAY TO NOW-90DAY]
> {!ex=last_modified}last_modified:[NOW-365DAY TO NOW-180DAY]
> {!ex=last_modified}last_modified:[NOW-730DAY TO NOW-365DAY]
> {!ex=last_modified}last_modified:[* TO NOW-730DAY]
>
> The last_modified field is a TrieDateField with a precisionStep of 6.
>
> I have played for filterCache setting but does not have any effects as the
> date field cache seems be managed by Lucene FieldCahce.
>
> Please help as I can be struggling with this for days. Thanks in advance.
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Date-faceting-and-memory-leaks-tp824372p824372.html
> Sent from the Solr - User mailing list archive at Nabble.com.

---
Antonio Lobato
Symplicity Corporation
www.symplicity.com
(703) 351-0200 x 8101
alob...@symplicity.com
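For anyone reproducing this, a typical HotSpot command line for watching GC behavior during such a test looks something like the following; the heap sizes and collector choice are illustrative assumptions, not a recommendation, and IBM JVM options differ:

    java -Xms2g -Xmx2g \
         -XX:+UseConcMarkSweepGC \
         -verbose:gc -XX:+PrintGCDetails -Xloggc:gc.log \
         -jar start.jar

The gc.log shows whether full collections are actually running and how much heap they reclaim, which separates "GC never runs" from "GC runs but the FieldCache entries are still referenced".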
Facet.query
When multiple facet queries are specified, are they combined with OR or AND?

-Yao
RE: Facet.query
Never mind. I should have read the example (http://wiki.apache.org/solr/SimpleFacetParameters#head-1da3ab3995bc4abcdce8e0f04be7355ba19e9b2c) first.

From: Ge, Yao (Y.)
Sent: Thursday, April 19, 2007 10:41 PM
To: 'solr-user@lucene.apache.org'
Subject: Facet.query

When multiple facet queries are specified, are they combined with OR or AND?

-Yao
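The answer, for anyone finding this later: multiple facet.query parameters are not combined at all; each one is evaluated independently against the result set and returns its own count. For example (field and ranges invented for illustration):

    select?q=ipod&rows=0
      &facet=true
      &facet.query=price:[0 TO 100]
      &facet.query=price:[100 TO *]

returns two separate counts, one per query.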
solr java client code and XML schema
It looks like there is no published Java client for Solr - what a surprise. I would assume one would be very useful for integrating Solr into existing apps. Has anyone parsed the standard XML response into Java objects? I would like to create a strongly typed object hierarchy instead of a bunch of Collections and Maps. The current XML schema is very light, and I was having a hard time writing Digester rules. I am in the middle of writing an XSLT to transform the response into "easier" XML tags for Digester (such as <status>0</status>... instead of <int name="status">0</int>...). Is there a good reason for Solr not to have a richer XML schema?

-Yao
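Absent a published client, one workable route is plain JAXP with XPath, which copes with Solr's name-attribute convention better than Digester does. A minimal sketch; the URL and the title field are assumptions for illustration:

    // Pull typed values out of Solr's generic XML response with XPath.
    import java.net.URL;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    public class SolrXmlClient {
        public static void main(String[] args) throws Exception {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new URL("http://localhost:8983/solr/select?q=fuel").openStream());
            XPath xp = XPathFactory.newInstance().newXPath();
            // Solr's elements are typed (<int>, <str>) and distinguished by
            // their name attribute, hence the predicates below.
            String status = xp.evaluate(
                    "/response/lst[@name='responseHeader']/int[@name='status']", doc);
            NodeList titles = (NodeList) xp.evaluate(
                    "/response/result/doc/str[@name='title']", doc, XPathConstants.NODESET);
            System.out.println("status=" + status + ", titles=" + titles.getLength());
            for (int i = 0; i < titles.getLength(); i++) {
                System.out.println(titles.item(i).getTextContent());
            }
        }
    }

From here it is a short step to mapping each doc element onto a hand-written typed bean.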
RE: Faceted count syntax (exclude zeros)...
There is a bug related to "facet.mincount" in the incubating version:
http://www.mail-archive.com/solr-user@lucene.apache.org/msg03269.html

-Yao

-----Original Message-----
From: escher2k [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, May 01, 2007 2:00 AM
To: solr-user@lucene.apache.org
Subject: Faceted count syntax (exclude zeros)...

I am trying to execute a faceted count on a field called "load_id" and want to exclude 0s. The URL below doesn't seem to be excluding zeros.

http://localhost:12002/solr/select/?qt=dismax&q=Y&qf=show_all_flag&fl=load_id&facet=true&facet.limit=-1&facet.field=load_id&facet.mincount=1&rows=0

Result (facet counts from the relevant part of the XML): 0, 0, 80, 81, 77, 62, 31061

Thanks.
--
View this message in context: http://www.nabble.com/Faceted-count-syntax-%28exclude-zeros%29...-tf3673535.html#a10264961
Sent from the Solr - User mailing list archive at Nabble.com.
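If you are stuck on the affected version, the older parameter that facet.mincount superseded may still work there (worth verifying against your exact build):

    &facet.zeros=false

It was the original switch for suppressing zero-count facet values.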
Look ahead queries
I am planning to develop look-ahead queries with Solr, so that as the user types query terms, a list of related terms is shown in a popup window (similar to Google Suggest). It will be a series of small AJAX calls to Solr with wildcards. So if the user types "fuel", a look-ahead query will be sent to Solr in the form "fuel *". The user will end up seeing relevant terms like "fuel consumption", "fuel leaks", "fuel tank", etc. In this case, I will likely limit the queries to certain fields only, and some post-processing is required to get the final list of suggestions. Let me know if someone has already done this, or if there are better ways to accomplish it. I figure Solr's caching will make this type of application more efficient than a straight Lucene integration. Thanks.

-Yao
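A lighter-weight alternative to wildcard queries worth considering here: facet.prefix walks the term dictionary directly instead of scoring documents. A sketch, where the field name is an assumption and q=*:* simply makes the facet run over the whole index:

    select?q=*:*&rows=0
      &facet=true
      &facet.field=text
      &facet.prefix=fuel
      &facet.limit=10

This returns single indexed terms, so multi-word suggestions like "fuel tank" would require the field to be indexed with shingles (token pairs).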