Re: Solr search engine configuration
Thanks! That gives me some more insight. I altered the search query to "dieren zaak" to see how queries consisting of more than one word are handled. I see that words are tokenized into grams of 3 or more characters, I think because of my NGramFilterFactory with a minGramSize of 3.

(title_search_global:(dieren zaak) OR description_search_global:(dieren zaak))

(title_search_global:(dieren zaak) OR description_search_global:(dieren zaak))

(+(((title_search_global:die title_search_global:ier title_search_global:ere title_search_global:ren title_search_global:dier title_search_global:iere title_search_global:eren title_search_global:diere title_search_global:ieren title_search_global:dieren) (title_search_global:zaa title_search_global:aak title_search_global:zaak)) (((description_search_global:dier description_search_global:diere description_search_global:dieren)/no_coord) description_search_global:zaak)))/no_coord

+(((title_search_global:die title_search_global:ier title_search_global:ere title_search_global:ren title_search_global:dier title_search_global:iere title_search_global:eren title_search_global:diere title_search_global:ieren title_search_global:dieren) (title_search_global:zaa title_search_global:aak title_search_global:zaak)) ((description_search_global:dier description_search_global:diere description_search_global:dieren) description_search_global:zaak))

ExtendedDismaxQParser

(lang:"nl" OR lang:"all")

lang:nl lang:all

I tried the query with and without the &defType=edismax parameter, but I'm getting the EXACT same results. Does that point to a configuration error?

I'm not sure how to progress from here. Can you see whether your presumption that I'm mixing two different parsers is correct? My schema.xml is here:
http://www.telefonievergelijken.nl/schema.xml

Related: do you know of any sample schema.xml config that would be usable for a search engine? It seems like something so obvious that it should be floating around out there. I feel that would go a long way.

Not sure if it matters, but my requirements are:

Exact match "dieren zaak": boost result with 1000
Exact match "dierenzaak": boost result with 900
Exact match "dieren" or "zaak": boost result with 600
Partial match "huisdierenzaak" or "huisdieren zaak": boost result with 500
Stem match "dier": boost result with 100
Stem partial match "huisdier": boost result with 70
Other partial matches "die": boost result with 10
Some performance questions....
Hi,

I have some questions regarding performance.

Let's say I have a dual-CPU machine with a total of 8 cores and 24 GB RAM for Solr and some other stuff.

Would it be more beneficial to run only one instance of Solr with the collection stored on 4 HDs in RAID 0? Or to have several virtual machines, each running off its own HD, i.e. 4 VMs running Solr?

Any thoughts?

Thank you!

RRK
Re: What are decent disk I/O for Solr and Zookeeper?
Hi Shawn,

I agree that available memory matters more than disk I/O for Solr performance. However, in a context of heavy indexing and heavy searching, disk I/O should still be critical, even with a lot of RAM.

My concern is also about write I/O for the ZooKeeper transaction log. My understanding is that it is critical not so much for SolrCloud performance as for SolrCloud stability.

Sometimes, even when best practices are followed and every possible configuration setting is tuned, SolrCloud is unstable or slow because of a lack of hardware resources. Monitoring CPU, CPU load, iowait, JVM GC, … should highlight this lack of resources. If the hardware is undersized, we need metrics in order to explain and demonstrate this to the customer (especially if the infrastructure provider does not want to admit that there are issues with the hardware or the virtualization). That was the meaning of my question about “decent disk I/O”.

Regards,

Dominique

On Fri, Mar 9, 2018 at 00:40, Shawn Heisey wrote:

> On 3/8/2018 2:55 PM, Dominique Bejean wrote:
> > Disk I/O are critical for high performance Solrcloud.
>
> This statement has truth to it, but if your system is correctly sized,
> disk performance will not have much of an impact on Solr performance.
> If upgrading to faster disks does improve long-term query performance,
> the system probably doesn't have enough memory installed. There can be
> other causes, but that is the most common.
>
> When there is enough memory available to allow the operating system to
> effectively cache the index data, Solr will not need to access the disk
> much at all for queries -- all that data will be already in memory.
> Indexing will still be dependent on disk performance even when there is
> plenty of memory, because that will require writing new data to the disk.
>
> https://wiki.apache.org/solr/SolrPerformanceProblems
>
> This is my hammer. To me, your question looks like a nail. :)
>
> Thanks,
> Shawn
CLUSTERSTATUS API and Error loading specified collection / config in Solr 5.3.2.
Hi,

I am working on an application that runs on a highly distributed SolrCloud environment. The application supports multi-tenancy, and we have around 250-300 collections on Solr. Each client has their own collection, with a new shard being created as clientid-timestamp, where the timestamp is the time at which new data comes in for the client (typically every 4-8 hrs). The reason for this convention is to make sure that when the indexes are built (on demand), the timestamp closely matches the time of the last indexing run (the earlier shard is de-provisioned as soon as the new one is created). Whenever indexing is triggered, it first makes a DB entry and then creates a catalog with the timestamp in Solr.

The Solr cloud has 10 nodes distributed geographically among 10 datacenters. The replication factor is 2. The Solr version is 5.3.2.

Coming to my problem: I had to write a utility to check that the DB insert timestamp closely matches the Solr index timestamp. If the difference between the DB timestamp and the Solr index timestamp is <= 2 hrs, we have a fresh index. The new index contains revised product prices, offers, etc., which must be updated as and when they arrive. Hence this utility tracks whether the required updates have been made successfully.

I used the *CLUSTERSTATUS* API for this task. It has served the purpose well so far, but quite recently our Solr cloud started reporting strange errors, and as a result the *CLUSTERSTATUS* API keeps returning an error. The error reports a missing config and sometimes a missing collection, like this:

org.apache.solr.common.SolrException: Could not find collection : 1785-1520548816454
org.apache.solr.common.SolrException: Could not find collection : 1785-1520548816454
    at org.apache.solr.common.cloud.ClusterState.getCollection(ClusterState.java:165)
    at org.apache.solr.handler.admin.ClusterStatus.getClusterStatus(ClusterStatus.java:110)
    at org.apache.solr.handler.admin.CollectionsHandler$CollectionOperation$19.call(CollectionsHandler.java:614)
    at org.apache.solr.handler.admin.CollectionsHandler.handleRequestBody(CollectionsHandler.java:166)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
    at org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:678)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:444)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:215)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:179)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)

At other times it complains about a missing config for the same or a different clientid-timestamp, like this:

1532-1518669619526_shard1_replica3: org.apache.solr.common.cloud.ZooKeeperException:org.apache.solr.common.cloud.ZooKeeperException: Specified config does not exist in ZooKeeper:1532-1518669619526

I would really appreciate it if someone could tell me:

1. What is going on in our Solr cloud.
2. Whether CLUSTERSTATUS is the right pick for building such a utility. Do we have any other option?

Thanks for any pointers and suggestions. I appreciate your attention in looking this through.

Atita
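The freshness check described above could be sketched roughly as follows. This is only an illustration under stated assumptions: collection names follow the clientid-timestamp pattern seen in the error (for example 1785-1520548816454, with the timestamp in epoch milliseconds), and the list of names has already been read from the cluster.collections keys of a CLUSTERSTATUS response. The function names and the sample DB timestamp are hypothetical and not part of Solr.

```python
from typing import Iterable, Optional, Tuple

TWO_HOURS_MS = 2 * 60 * 60 * 1000


def parse_collection_name(name: str) -> Optional[Tuple[int, int]]:
    """Split 'clientid-timestamp' into (client_id, timestamp_ms); None if it does not match."""
    client, sep, stamp = name.partition("-")
    if not sep or not client.isdigit() or not stamp.isdigit():
        return None
    return int(client), int(stamp)


def latest_index_ts(client_id: int, collection_names: Iterable[str]) -> Optional[int]:
    """Newest timestamp among this client's collections, or None if there is none."""
    stamps = [
        parsed[1]
        for parsed in map(parse_collection_name, collection_names)
        if parsed is not None and parsed[0] == client_id
    ]
    return max(stamps, default=None)


def is_fresh(db_ts_ms: int, index_ts_ms: Optional[int],
             max_lag_ms: int = TWO_HOURS_MS) -> bool:
    """True when the index timestamp is within max_lag_ms of the DB timestamp."""
    if index_ts_ms is None:
        return False  # no collection found for the client: treat as stale
    return abs(index_ts_ms - db_ts_ms) <= max_lag_ms


# Hypothetical input: names as they might appear under cluster.collections.
names = ["1785-1520548816454", "1532-1518669619526", "not-a-tenant"]
db_ts = 1520548816454 - 30 * 60 * 1000  # assumed: DB insert 30 minutes before the index

print(is_fresh(db_ts, latest_index_ts(1785, names)))
```

Keeping the comparison separate from the HTTP call means that a failing CLUSTERSTATUS request can be handled (retried or reported) without being confused with a stale index.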
Re: Solr search engine configuration
bq: I tried the query with and without the &defType=edismax parameter but I'm getting the EXACT same results. Does that mean some configuration error?

Well, not an error at all. This line:

ExtendedDismaxQParser

means you're using edismax. If that happens both with and without &defType, then your request handler in solrconfig.xml has this defined as a default. Look for an entry like:

<str name="defType">edismax</str>

So any search you send to Solr like
http://blah blah/solr/collection/select?
will use edismax if no defType overrides it on the URL.

---

Let's talk about what "exact match" means ;)

Exact match "dieren zaak". Does "exact match" here mean it would or would not be an exact match on "dieren zaak somethingelse"?

If you do NOT consider the above an "exact match", the usual trick is to use a copyField directive to a field that uses KeywordTokenizerFactory (probably) followed by LowerCaseFilterFactory etc. KeywordTokenizerFactory takes the entire input field as a _single_ token, which you can then transform in various ways: folding accents, lowercasing and the like, if desired.

If you DO consider the above an "exact match", take a look at the pf, pf2 and pf3 parameters in edismax. They're all about forming phrases, bigrams and trigrams respectively for this form of "exact match".

Exact match "dierenzaak". This one is tricky. There's little OOB that understands that "dieren zaak" is equivalent to "dierenzaak". I know that in German there's prior art on "decompounding" filters; I don't know about Dutch. Further, given my total lack of understanding of the rules of either language, I don't know if it does "compounding" too, i.e. understanding that "dieren zaak" is equivalent to "dierenzaak". Can't help much there.

For a start I'd get rid of the gramming until I'd explored other alternatives. Gramming is generally a good thing for pre-and-post wildcards, i.e. matching *some*. Since you're concerned with relevance, I suspect that gramming will make your task harder.
And if you haven't discovered the admin UI/analysis page, I recommend you spend some time with it (hint: un-check the "verbose" checkbox). As you play with various combinations of tokenizers and filters, it'll give you a much better understanding of their effects.

If only human language followed strict rules ;)

Professor: "In English, two negatives are allowed and mean a positive, but two positives don't mean a negative."
Bored voice from the back: "Yeah, right".

Erick

On Sun, Mar 11, 2018 at 5:19 AM, PeterKerk wrote:

> Thanks! That provides me with some more insight, I altered the search query
> to "dieren zaak" to see how queries consisting of more than 1 word are
> handled.
> I see that words are tokenized into groups of 3, I think because of my
> NGramFilterFactory with minGramSize of 3.
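The pf, pf2 and pf3 parameters mentioned above can be tried with a request along the following lines. This is a sketch, not a recommended configuration: the host, core and field names are taken from the thread, while the boost values and the slop are hypothetical placeholders. It also assumes that the phrase boosts are built from the unfielded terms in q, which is why the sketch sends plain terms and lets qf select the field; that assumption is worth verifying in the debug=true output.

```python
from urllib.parse import urlencode, parse_qs

# Field and core names come from the thread; boost values and slop are
# hypothetical placeholders to tune against real data.
params = {
    "defType": "edismax",
    "q": "dieren zaak",               # plain terms, no field prefix
    "qf": "title_search_global",      # field the individual terms are matched in
    "pf": "title_search_global^10",   # boost docs where all query terms form one phrase
    "pf2": "title_search_global^5",   # boost on word pairs (bigrams) from the query
    "pf3": "title_search_global^7",   # boost on word triples; inert for a 2-word query
    "ps": "0",                        # phrase slop for pf: 0 means adjacent, in order
    "fq": '(lang:"nl" OR lang:"all")',
    "fl": "id,title,score",
    "debug": "true",                  # shows how the phrase clauses were parsed
}

url = "http://localhost:8983/solr/search-global/select?" + urlencode(params)
print(url)
```

With debug=true, the parsed query should show additional phrase clauses next to the per-term clauses; if it does not, the boosts are not being applied and the request needs another look.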
Re: Some performance questions....
To rephrase your question:

"Does Solr do well with scale-up or scale-out?"

Are there any performance benchmarks for this out there that support the claim?

On 11 Mar 2018 23:05, "BlackIce" wrote:

> Hi,
>
> I have some questions regarding performance.
>
> Lets says I have a dual CPU with a total of 8 cores and 24 GB RAM for my
> Solr and some other stuff.
>
> Would it be more beneficial to only run 1 instance of Solr with the
> collection stored on 4 HD's in RAID 0?? Or Have several Virtual
> Machines each running of its own HD, ie: Have 4 VM's running Solr?
>
> Any Thoughts?
>
> Thank you!
>
> RRK
Re: Some performance questions....
Thanks for the pointers.

I haven't given much thought to Solr, aside from schema.xml and solrconfig.xml, and I'm just starting to dive into somewhat deeper stuff!

Greetz

RRK

On Sun, Mar 11, 2018 at 8:58 PM, Deepak Goel wrote:

> To rephrase your Question
>
> "Does Solr do well with Scale-up or Scale-out?"
>
> Are there any Performance Benchmarks for the same out there supporting the
> claim?
Re: Some performance questions....
A second question: wouldn't 4 Solr instances, each with its own HD, be fault tolerant, compared with one Solr instance with 4 HDs in RAID 0? On top of this comes the storage capacity: I need the capacity of those 4 drives... The more I read, the more questions I have.

On Sun, Mar 11, 2018 at 9:43 PM, BlackIce wrote:

> Thnx for the pointers.
>
> I haven't given much thought to Solr, asides shemal.xml and solrconfig.xml
> and I'm just diving into a bit more deeper stuff!
>
> Greetz
>
> RRK
Re: Solr search engine configuration
Sorry for this lengthy post, but I wanted to be complete.

The only occurrence of edismax in solrconfig.xml is this one:

edismax explicit 10 double_score false *:*

I don't have a requestHandler named "/select".

Also, removing the gramming definitely helped! :-)

I tried to simplify my setup first and then expand, so what I have now is this. In my database I have these 4 values for "title" that populate "title_search_global":

"Hi there dier something else"
"Hi there dieren zaak something else"
"Hi there dierenzaak something else"
"Hi there dierzaak something else"

ps. "dier" is the singular of the plural "dieren".

Using this query:
http://localhost:8983/solr/search-global/select?q=title_search_global%3A(dieren+zaak)&fq=(lang%3A%22nl%22+OR+lang%3A%22all%22)&fl=id%2Ctitle&wt=xml&indent=true&defType=edismax&qf=title_search_global&stopwords=true&lowercaseOperators=true&debug=true

These results are found:

"Hi there dier something else"
"Hi there dieren zaak something else"

And these are NOT:

"Hi there dierenzaak something else"
"Hi there dierzaak something else"

I'd expect it to be fairly easy (although I don't know how) to also include the result "dierenzaak", by compounding the 2 query values. And yes, you are correct: in Dutch "dieren zaak" would mean the same as "dierenzaak". I'm not sure what logic would also include "dierzaak".

Regarding your question: yes, I do consider "dieren zaak somethingelse" an exact match of "dieren zaak".

So I also checked the usage of the pf parameters with edismax, based on these links:
https://lucene.apache.org/solr/guide/6_6/the-extended-dismax-query-parser.html
http://blog.thedigitalgroup.com/vijaym/understanding-phrasequery-and-slop-in-solr/

And also for dismax:
https://lucene.apache.org/solr/guide/6_6/the-dismax-query-parser.html#TheDisMaxQueryParser-Theqs_QueryPhraseSlop_Parameter

But I can't find any examples of how to actually use these parameters.
The search results, including debug info, are here:

0 7 title_search_global:(dieren zaak) edismax true true title_search_global id,title (lang:"nl" OR lang:"all") xml true true

dieren zaak 115_3699638
dier 115_3699637

title_search_global:(dieren zaak)
title_search_global:(dieren zaak)
(+(title_search_global:dier title_search_global:zaak))/no_coord
+(title_search_global:dier title_search_global:zaak)

5.489122 = (MATCH) sum of:
  2.4387078 = (MATCH) weight(title_search_global:dier in 51) [DefaultSimilarity], result of:
    2.4387078 = score(doc=51,freq=1.0 = termFreq=1.0 ), product of:
      0.66654336 = queryWeight, product of:
        5.8539815 = idf(docFreq=3, maxDocs=513)
        0.113861546 = queryNorm
      3.6587384 = fieldWeight in 51, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        5.8539815 = idf(docFreq=3, maxDocs=513)
        0.625 = fieldNorm(doc=51)
  3.050414 = (MATCH) weight(title_search_global:zaak in 51) [DefaultSimilarity], result of:
    3.050414 = score(doc=51,freq=1.0 = termFreq=1.0 ), product of:
      0.7454662 = queryWeight, product of:
        6.5471287 = idf(docFreq=1, maxDocs=513)
        0.113861546 = queryNorm
      4.091955 = fieldWeight in 51, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        6.5471287 = idf(docFreq=1, maxDocs=513)
        0.625 = fieldNorm(doc=51)

1.9509662 = (MATCH) product of:
  3.9019325 = (MATCH) sum of:
    3.9019325 = (MATCH) weight(title_search_global:dier in 50) [DefaultSimilarity], result of:
      3.9019325 = score(doc=50,freq=1.0 = termFreq=1.0 ), product of:
        0.66654336 = queryWeight, product of:
          5.8539815 = idf(docFreq=3, maxDocs=513)
          0.113861546 = queryNorm
        5.8539815 = fieldWeight in 50, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          5.8539815 = idf(docFreq=3, maxDocs=513)
          1.0 = fieldNorm(doc=50)
  0.5 = coord(1/2)

0.9754831 = (MATCH) product of:
  1.9509662 = (MATCH) sum of:
    1.9509662 = (MATCH) weight(title_search_global:dier in 132) [DefaultSimilarity], result of:
      1.9509662 = score(doc=132,freq=1.0 = termFreq=1.0 ), product of:
        0.66654336 = queryWeight, product of:
          5.8539815 = idf(docFreq=3, maxDocs=513)
          0.113861546 = queryNorm
        2.9269907 = fieldWeight in 132, product of:
          1.0 = t
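The numbers in the explain output above can be reproduced by hand, which helps when reasoning about why one document outranks another. The sketch below assumes the classic Lucene TF-IDF formulas behind DefaultSimilarity, namely idf = 1 + ln(maxDocs / (docFreq + 1)) and tf = sqrt(freq); queryNorm, fieldNorm and coord are copied from the debug output rather than derived. The function names are illustrative only.

```python
import math


def idf(doc_freq: int, max_docs: int) -> float:
    """Classic Lucene TF-IDF inverse document frequency."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def term_score(freq: float, doc_freq: int, max_docs: int,
               query_norm: float, field_norm: float) -> float:
    """score = queryWeight * fieldWeight, as laid out in the explain tree."""
    term_idf = idf(doc_freq, max_docs)
    query_weight = term_idf * query_norm
    field_weight = math.sqrt(freq) * term_idf * field_norm
    return query_weight * field_weight


QUERY_NORM = 0.113861546  # copied from the debug output
MAX_DOCS = 513

# Doc 51 matches both terms, so its score is the plain sum of the two weights.
dier_51 = term_score(1.0, 3, MAX_DOCS, QUERY_NORM, 0.625)
zaak_51 = term_score(1.0, 1, MAX_DOCS, QUERY_NORM, 0.625)
doc_51 = dier_51 + zaak_51

# Doc 50 matches only "dier", so the sum is scaled by coord(1/2).
doc_50 = term_score(1.0, 3, MAX_DOCS, QUERY_NORM, 1.0) * 0.5

print(round(dier_51, 4), round(zaak_51, 4), round(doc_51, 4), round(doc_50, 4))
```

Two things stand out from the arithmetic: "zaak" contributes more than "dier" because it occurs in fewer documents (docFreq=1 versus 3), and a document matching only one of the two terms loses half its score through the coord factor.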
Re: Some performance questions....
On 3/11/2018 11:35 AM, BlackIce wrote:

> I have some questions regarding performance.
>
> Let's say I have a dual CPU with a total of 8 cores and 24 GB RAM for my
> Solr and some other stuff.
>
> Would it be more beneficial to only run 1 instance of Solr with the
> collection stored on 4 HD's in RAID 0?? Or Have several Virtual
> Machines each running off its own HD, ie: Have 4 VM's running Solr?

Performance is always going to be better on bare metal than on virtual machines. Virtualization in modern times is really good, so the difference *might* be minimal, but there is ALWAYS overhead.

I used to create virtual machines in my hardware for Solr. Initially with VMware ESXi, then later natively in Linux with KVM. At that time, I was running one index core per VM. Just for some testing, I took a similar machine and set up one Solr instance handling all the same cores on bare metal. I do not remember HOW much faster it was, but it was definitely faster. One big thing I like about bare metal is that there's only one "machine", IP address, and Solr instance to administer.

Unless you're willing to completely rebuild the whole thing in the event of drive failure, don't use RAID0. If one drive dies (and every hard drive IS eventually going to die if it's used long enough), then *all* of the data on the whole RAID volume is gone.

You could do RAID5, which has decent redundancy and good space efficiency, but if you're not familiar with the RAID5 write penalty, do some research on it, and you'll probably come out of it not wanting to EVER use it. If you like, I can explain exactly why you should avoid any RAID level that incorporates 5 or 6.

Overall, the best level is RAID10 ... but it has a glaring disadvantage from a cost perspective -- you lose half of your raw capacity. Since drives are relatively cheap, I always build my servers with RAID10, using a 1MB stripe size and a battery-backed caching controller. For the typical hardware I'm using, that means that I'm going to end up with 6 to 12TB of usable space instead of 10 to 20TB (raid5), but the volume is FAST.

Thanks,
Shawn
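The capacity trade-off between the RAID levels discussed above comes down to simple arithmetic, sketched here. The function is illustrative only, assumes identical drives and ignores filesystem overhead; the 6-drive layout and the drive sizes are assumptions chosen because they happen to line up with the 6 to 12TB versus 10 to 20TB figures, not a description of any actual server.

```python
def usable_capacity(level: str, drives: int, drive_tb: float) -> float:
    """Usable capacity in TB for identical drives; ignores filesystem overhead."""
    if level == "raid0":
        return drives * drive_tb            # striping only, no redundancy
    if level == "raid5":
        if drives < 3:
            raise ValueError("RAID5 needs at least 3 drives")
        return (drives - 1) * drive_tb      # one drive's worth of parity
    if level == "raid10":
        if drives < 4 or drives % 2:
            raise ValueError("RAID10 needs an even number of drives, at least 4")
        return drives / 2 * drive_tb        # every drive is mirrored
    raise ValueError(f"unsupported RAID level: {level}")


# Hypothetical 6-bay server with 2 TB or 4 TB drives.
for size_tb in (2, 4):
    print(size_tb,
          usable_capacity("raid10", 6, size_tb),
          usable_capacity("raid5", 6, size_tb))
```

For the 4-drive setup in the original question, the same arithmetic gives the full capacity of all 4 drives under RAID0 (with no redundancy) and the capacity of 2 drives under RAID10.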
Re: Some performance questions....
On 12 Mar 2018 05:51, "Shawn Heisey" wrote:

> Performance is always going to be better on bare metal than on virtual
> machines. Virtualization in modern times is really good, so the
> difference *might* be minimal, but there is ALWAYS overhead.

I doubt this. It would be great if someone could substantiate this with hard facts.

Deepak