Re: Determining the Number of Solr Shards
On Thu, 2015-01-08 at 22:55 +0100, Nishanth S wrote:
> Thanks guys for your inputs I would be looking at around 100 Tb of total
> index size with 5100 million documents [...]

That is a large corpus when coupled with your high indexing & QPS requirements. Are the queries complex too? Will you be doing non-trivial faceting?

Your requirements are so high that any guesswork at this point is likely to be wrong by an order of magnitude. What is very certain is that you will need serious hardware. Your starting point should not be to try and estimate the number of shards. Start by building a test setup.

- Toke Eskildsen
Tokenizer or Filter ?
Hello, I have a question: should I use a tokenizer or a filter?

I need to separate two channels. I wrote about this here earlier, but realized it probably isn't possible with Solr's basic tools, so I'm trying to write my own tool for this task.

I have this input: HelloHelloHow are you ?Fine and you're?
d1 - direction1
d2 - direction2

I want to output only d1, and then search for some words within that result. For example, the output should be:

Output: [Hello, How are you?]

I wrote my idea in Java, but I don't know where to incorporate it - into a Filter or a Tokenizer - and I would like some advice on how to start. I probably have to extend some Lucene class and plug a slightly modified version of this in there, don't I?

Here is my code:

package test1;

import java.util.Arrays;

public class Test1 {

    public static void main(String[] args) {
        String dialogue = "HelloHelloHow are you ?Fine and you're? ";

        // split the dialogue into direction segments
        String[] input = dialogue.split("(?<=)\\d*(?=)");

        // count the segments for direction d1
        int countD1 = 0;
        for (String input1 : input) {
            if (input1.startsWith("")) {
                countD1++;
            }
        }

        // collect the d1 segments
        String[] d1 = new String[countD1];
        int array = 0;
        for (String input1 : input) {
            if (input1.startsWith("")) {
                d1[array] = input1;
                array++;
            }
        }

        String d1Out = Arrays.toString(d1);
        System.out.println(d1Out); // Return d1Out
    }
}

Thanks for your advice.

--
View this message in context: http://lucene.472066.n3.nabble.com/Tokenizer-or-Filter-tp4178346.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Tokenizer or Filter ?
Can't you use solr.PatternTokenizerFactory for this task? On Friday, January 9, 2015 1:48 PM, tomas.kalas wrote: Hello, i have a question what i have to use tokenizer or filter ? I need separate 2 chanels. I wrote this here earlier, but realize it with solr basic tools it is not probably possible. And i',m trying to write own tool for this task. I have this input HelloHelloHow are you ?Fine and you're? d1 - direction1 d2 - direction2 and i want to output only d1 and between this result search some words, for example output should be: Output: [Hello,How are you?] I wrote my idea in java, but i dont know where to incorporate it. If to Filter or Tokenizer and some advices how to start? I probably must extends some lucene library and include it easily modificated there isn't it ? Here is my code: package test1; import java.util.Arrays; public class Test1 { public static void main(String[] args) { String dialogue = "HelloHelloHow are you ?Fine and you're? "; String[] input = dialogue.split("(?<=)\\d*(?=)"); int countD1 = 0; for (String input1 : input) { if (input1.startsWith("")) { countD1++; } } String [] d1 = new String[countD1]; int array = 0; for (String input1 : input) { if (input1.startsWith("")) { d1[array] = input1; array++; } } String d1Out = Arrays.toString(d1); System.out.println(d1Out); //Return s1Out } } Thanks for you advices. -- View this message in context: http://lucene.472066.n3.nabble.com/Tokenizer-or-Filter-tp4178346.html Sent from the Solr - User mailing list archive at Nabble.com.
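For reference, a minimal sketch of what a PatternTokenizerFactory-based field type could look like. The regex shown is only a placeholder (the real channel markers were lost in the archived message), and the field type name is an assumption, so this is illustrative rather than a tested configuration:

<fieldType name="text_d1" class="solr.TextField">
  <analyzer>
    <!-- placeholder pattern: capture group 1 would hold the d1 channel text -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="YOUR-CHANNEL-REGEX" group="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>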
Re: How to return child documents with parent
Thanks. That solved my problem.

Y

--
View this message in context: http://lucene.472066.n3.nabble.com/How-to-return-child-documents-with-parent-tp4178081p4178378.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Tokenizer or Filter ?
I used the same regex and unfortunately it doesn't work. Or should I somehow change the regex? Thanks.

--
View this message in context: http://lucene.472066.n3.nabble.com/Tokenizer-or-Filter-tp4178346p4178389.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Tokenizer or Filter ?
Consider an update processor - it can take any input, break it up any way you want, and then output multiple field values. You can even use the stateless script update processor to write the logic in JavaScript.

-- Jack Krupansky

On Fri, Jan 9, 2015 at 6:47 AM, tomas.kalas wrote:
> Hello, i have a question what i have to use tokenizer or filter ?
> I need separate 2 chanels. I wrote this here earlier, but realize it with
> solr basic tools it is not probably possible. And i'm trying to write own
> tool for this task.
> I have this input HelloHelloHow are you ?Fine and you're?
> d1 - direction1
> d2 - direction2
> and i want to output only d1 and between this result search some words, for
> example output should be:
> Output: [Hello,How are you?]
>
> I wrote my idea in java, but i dont know where to incorporate it. If to
> Filter or Tokenizer and some advices how to start? I probably must extends
> some lucene library and include it easily modificated there isn't it ?
>
> Here is my code:
>
> package test1;
> import java.util.Arrays;
>
> public class Test1 {
>
>     public static void main(String[] args) {
>         String dialogue = "HelloHelloHow are you ?Fine and you're? ";
>         String[] input = dialogue.split("(?<=)\\d*(?=)");
>         int countD1 = 0;
>         for (String input1 : input) {
>             if (input1.startsWith("")) {
>                 countD1++;
>             }
>         }
>         String [] d1 = new String[countD1];
>         int array = 0;
>         for (String input1 : input) {
>             if (input1.startsWith("")) {
>                 d1[array] = input1;
>                 array++;
>             }
>         }
>         String d1Out = Arrays.toString(d1);
>         System.out.println(d1Out);
>         //Return s1Out
>     }
> }
>
> Thanks for you advices.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Tokenizer-or-Filter-tp4178346.html
> Sent from the Solr - User mailing list archive at Nabble.com.
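A minimal sketch of the stateless script update processor approach Jack describes, as it might appear in solrconfig.xml (the chain name and the script file name are assumptions for illustration, not taken from the thread):

<updateRequestProcessorChain name="split-channels">
  <processor class="solr.StatelessScriptUpdateProcessorFactory">
    <!-- channel-split.js would hold the JavaScript that splits the incoming
         text into the d1/d2 field values before indexing -->
    <str name="script">channel-split.js</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

The chain would then be referenced from the update handler, or selected per request with the update.chain parameter.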
Request two databases at the same time ?
Dear All,

I use Apache Solr 3.6 on Ubuntu (newbie user).

I have a big database named BigDB1 with 90M documents; each document contains several fields (docid, title, author, date, etc.).

Today I received, from another source, abstracts for some of these documents (this source also has the same docid field). I don't want to modify BigDB1 to update documents with the abstracts, because BigDB1 is itself updated twice a week.

Do you think it's possible to create a new database named AbsDB1 and query both databases at the same time? For example, if I query:

title:airplane AND abstract:plastic

I would like to obtain documents from both BigDB1 and AbsDB1.

Many thanks for your help, information and anything else that can help me.

Regards,
Bruno

---
This email contains no viruses or malware because avast! Antivirus protection is active.
http://www.avast.com
Best way to implement Spotlight of certain results
I have a requirement to spotlight certain results if the query text exactly matches the title or a see reference (indexed by me as alttitle_t). What that means is that these matching results are shown above the top-10/20 list with different CSS and fields. It's like "I'm Feeling Lucky" on Google :)

I have considered three ways of implementing this:

1. Assume that edismax qf/pf will boost these results to be first when there is an exact match on these important fields. The downside is that my relevancy is constrained and I must keep title and alttitle_t as the top search fields in my configuration (see the snippet below); I may have to overweight them to achieve the "always first" criterion. Another, less major, downside is that I must always return the spotlight summary field (for display) and the image to display on each search. These could be fetched from a database by id, but it is convenient to get them from Solr.

2. Issue two searches for every user search, and use a second set of parameters (change the search type and fields to search only by exact-matching a specific string field, spottitle_s). The search for the spotlight can then have its own configuration. The downside here is that I am using Django and pysolr for the front-end, and pysolr is both synchronous and tied to the requestHandler named "select" by convention. Of course, running in parallel is not a fix-all - running a search takes some time, even if run in parallel.

3. Automate the population of elevate.xml so that all 959 of these queries are in it. This is probably best, but forces me to restart/reload when there are changes to this component. The elevation can be done through a query.

What I'd love to do is to configure the "select" requestHandler to run both searches and return me both sets of results. Is there any way to do that - apply the same q= parameter to two configured ways of running a search? Something like sub-queries?

I suspect that approach 1 will get me through my demo and a brief evaluation period, but that either approach 2 or 3 will be the winner.

Here's a snippet from my current qf/pf configuration:

title^100
alttitle_t^100
...
text

title^1000
alttitle_t^1000
...
text^10

Thanks,

Dan Davis
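For option 3 above, a minimal sketch of what an auto-generated elevate.xml (used by the QueryElevationComponent) could look like. The query texts and document ids here are purely hypothetical placeholders:

<elevate>
  <query text="heart attack">
    <!-- id of the document whose title/see reference exactly matches the query -->
    <doc id="SPOT-0001"/>
  </query>
  <query text="influenza">
    <doc id="SPOT-0002"/>
  </query>
</elevate>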
filter on solr pivot data
Hello,

I need to know how I can filter on Solr pivot data.

For example, we have dealers, each of which might have many cars on its lot, and each car has photos. I need to find the dealers whose cars have no photos.

So I have:

dealer1 -> has 20 cars -> all of them have photos
dealer2 -> has 20 cars -> some of them have photos
dealer3 -> has 20 cars -> none of them have photos

In the results I want to see only the dealers which have no photos, i.e. dealer3. I managed to do a pivot and get a breakdown by vin and "photo exists"; now I want to apply a filter and get only those dealers for which all vins have photo exists as 0.

Here is the pivot output:

lst name="facet_pivot">
vin
1N4AA5AP0EC908535
1
mappings_|photo_exist|
1
1
vin
1N4AA5AP1EC470625
1
mappings_|photo_exist|
1
1

Is this possible?

--
View this message in context: http://lucene.472066.n3.nabble.com/filter-on-solr-pivot-data-tp4178451.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: GC tuning question - can improving GC pauses cause indexing to slow down?
For throughput with G1, get rid of the pause time goal (-XX:MaxGCPauseMillis), so the GC can pause as long as it wants. Beyond that, use a non-concurrent collector and make sure that everything is OK with pauses that last a few seconds.

This is a pretty detailed paper about balancing throughput and pause:

https://engineering.linkedin.com/garbage-collection/garbage-collection-optimization-high-throughput-and-low-latency-java-applications

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/

On Jan 8, 2015, at 11:38 PM, Shawn Heisey wrote:

> On 1/8/2015 11:05 PM, Boogie Shafer wrote:
>> In the abstract, it sounds like you are seeing the difference between tuning
>> for latency vs tuning for throughput
>>
>> My hunch would be you are seeing more (albeit individually quicker) GC
>> events with your new settings during the rebuild
>>
>> I imagine that in most cases a solr rebuild is relatively rare compared to
>> the amount of times where a lower latency request is desired. If the rebuild
>> times are problematic for you, use tunings specific to that workload during
>> the times you need it and then switch back to your low latency settings
>> after. If you are doing that you can probably run with a bigger heap
>> temporarily during the rebuild as you aren't likely to be fielding queries
>> and don't benefit from having a larger OS cache available
>
> Full rebuilds are indeed relatively rare. Avoiding long pauses and
> keeping query latency low are usually a lot more important than how
> quickly the index rebuilds. Quick rebuilds are nice, but not strictly
> necessary.
>
> We do incremental updates that start at the top of every minute, unless
> an update is already running. Exactly how long those updates take is of
> little importance, unless that time is easier to measure in minutes
> rather than seconds.
>
> If I ever find myself in a situation where completing a rebuild as fast
> as possible becomes extremely important, does anyone have suggestions
> for GC tuning options that will optimize for throughput?
>
> Thanks,
> Shawn
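As a point of reference for Shawn's question, a throughput-oriented starting point (an assumption based on standard HotSpot options, not something tested in this thread) would be the parallel collector with no pause goal, e.g. -XX:+UseParallelGC -XX:+UseParallelOldGC, optionally with -XX:ParallelGCThreads sized to the machine's cores, in place of the CMS/G1 latency-oriented flags.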
Re: Request two databases at the same time ?
bq: I don't want to modify my BigDB1 to update documents with abstract because BigDB1 is always updated twice by week.

Why not? Solr/Lucene handle updating docs: if a doc in the index has the same uniqueKey, the old doc is deleted and the new one takes its place. So why not just put the new abstracts into BigDB1? If you re-index the docs later (your twice/week comment), then they'll be overwritten. This will be much simpler than trying to maintain two.

But if you cannot update BigDB1, just fire off two queries and combine them. Or specify the shards parameter on the URL, pointing to both collections. Do note, though, that the relevance calculations may not be absolutely comparable, so mixing the results may show some surprises...

Best,
Erick

On Fri, Jan 9, 2015 at 9:12 AM, Bruno Mannina wrote:
> Dear All,
>
> I use Apache-SOLR3.6, on Ubuntu (newbie user).
>
> I have a big database named BigDB1 with 90M documents,
> each document contains several fields (docid, title, author, date, etc...)
>
> I received today from another source, abstract of some documents (there are
> also the same docid field in this source).
> I don't want to modify my BigDB1 to update documents with abstract because
> BigDB1 is always updated twice by week.
>
> Do you think it's possible to create a new database named AbsDB1 and request
> the both database at the same time ?
> if I do for example:
> title:airplane AND abstract:plastic
>
> I would like to obtain documents from BigDB1 and AbsDB1.
>
> Many thanks for your help, information and others things that can help me.
>
> Regards,
> Bruno
>
> ---
> This email contains no viruses or malware because avast! Antivirus protection is active.
> http://www.avast.com
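For reference, a query using the shards parameter against both indexes might look roughly like this (host, port and core names are assumptions for illustration, and on Solr 3.6 both cores need compatible schemas for distributed search to work):

http://localhost:8983/solr/BigDB1/select?q=title:airplane+AND+abstract:plastic&shards=localhost:8983/solr/BigDB1,localhost:8983/solr/AbsDB1

Note that this searches each core independently and merges the result lists; it does not join or merge fields of the same docid across the two cores.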
Re: Best way to implement Spotlight of certain results
Hmm, I wonder if the RerankingQueryParser might help here? See: https://cwiki.apache.org/confluence/display/solr/Query+Re-Ranking Best, Erick On Fri, Jan 9, 2015 at 10:35 AM, Dan Davis wrote: > I have a requirement to spotlight certain results if the query text exactly > matches the title or see reference (indexed by me as alttitle_t). > What that means is that these matching results are shown above the > top-10/20 list with different CSS and fields. Its like feeling lucky on > google :) > > I have considered three ways of implementing this: > >1. Assume that edismax qf/pf will boost these results to be first when >there is an exact match on these important fields. The downside then is >that my relevancy is constrained and I must maintain my configuration with >title and alttitle_t as top search fields (see XML snippet below).I may >have to overweight them to achieve the "always first" criteria. Another >less major downside is that I must always return the spotlight summary >field (for display) and the image to display on each search. These could >be got from a database by the id, however, it is convenient to get them >from Solr. >2. Issue two searches for every user search, and use a second set of >parameters (change the search type and fields to search only by exact >matching a specific string field spottitle_s). The search for the >spotlight can then have its own configuration. The downside here is that >I am using Django and pysolr for the front-end, and pysolr is both >synchronous and tied to the requestHandler named "select". Convention. >Of course, running in parallel is not a fix-all - running a search takes >some time, even if run in parallel. >3. Automate the population of elevate.xml so that all these 959 queries >are here. This is probably best, but forces me to restart/reload when >there are changes to this components. The elevation can be done through a >query. > > What I'd love to do is to configure the "select" requestHandler to run both > searches and return me both sets of results. Is there anyway to do that - > apply the same q= parameter to two configured way to run a search? > Something like sub queries? > > I suspect that approach 1 will get me through my demo and a brief > evaluation period, but that either approach 2 or 3 will be the winner. > > Here's a snippet from my current qf/pf configuration: > > title^100 > alttitle_t^100 > ... > text > > > title^1000 > alttitle_t^1000 > ... > text^10 > > > Thanks, > > Dan Davis
Re: filter on solr pivot data
Why not just add an fq clause like &fq=-mappings_iphoto_exist:[* TO *]? Note the "-" sign.

On Fri, Jan 9, 2015 at 11:14 AM, Darniz wrote:
> Hello
>
> i need to know how can i filter on solr pivot data.
>
> For exampel we have a dealer which might have many cars in his lot and car
> has photos, i need to find out a dealer which has cars which has no photos
>
> so i have
>
> dealer1 -> has 20 cars -> all of them has photos
> dealer2 -> has 20 cars -> some of them have photos
> dealer3 -> has 20 cars -> none of them have photos
>
> in the results i want to see only dealers which has no photos ie dealer3, i
> managed to do pivot and get a breakdown by vin and photo exists now i want
> to apply filter and get only those dealer who has all vin which have photo
> exists as 0
>
> lst name="facet_pivot">
> vin
> 1N4AA5AP0EC908535
> 1
> mappings_|photo_exist|
> 1
> 1
> vin
> 1N4AA5AP1EC470625
> 1
> mappings_|photo_exist|
> 1
> 1
>
> is it possible
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/filter-on-solr-pivot-data-tp4178451.html
> Sent from the Solr - User mailing list archive at Nabble.com.
Re: Best way to implement Spotlight of certain results
Maybe I understand you badly but I thing that you could use grouping to achieve such effect. If you could prepare two group queries one with exact match and other, let's say, default than you will be able to extract matches from grouping results. i.e (using default solr example collection) http://localhost:8983/solr/collection1/select?q=*:*&group=true&group.query=manu%3A%22Ap+Computer+Inc.%22&group.query=name:Apple%2060%20GB%20iPod%20with%20Video%20Playback%20Black&group.limit=10 this query will return two groups one with exact match second with the rest standard results. Regars, Michal 2015-01-09 20:44 GMT+01:00 Erick Erickson : > Hmm, I wonder if the RerankingQueryParser might help here? > See: https://cwiki.apache.org/confluence/display/solr/Query+Re-Ranking > > Best, > Erick > > On Fri, Jan 9, 2015 at 10:35 AM, Dan Davis wrote: > > I have a requirement to spotlight certain results if the query text > exactly > > matches the title or see reference (indexed by me as alttitle_t). > > What that means is that these matching results are shown above the > > top-10/20 list with different CSS and fields. Its like feeling lucky on > > google :) > > > > I have considered three ways of implementing this: > > > >1. Assume that edismax qf/pf will boost these results to be first when > >there is an exact match on these important fields. The downside > then is > >that my relevancy is constrained and I must maintain my configuration > with > >title and alttitle_t as top search fields (see XML snippet below). > I may > >have to overweight them to achieve the "always first" criteria. > Another > >less major downside is that I must always return the spotlight summary > >field (for display) and the image to display on each search. These > could > >be got from a database by the id, however, it is convenient to get > them > >from Solr. > >2. Issue two searches for every user search, and use a second set of > >parameters (change the search type and fields to search only by exact > >matching a specific string field spottitle_s). The search for the > >spotlight can then have its own configuration. The downside here is > that > >I am using Django and pysolr for the front-end, and pysolr is both > >synchronous and tied to the requestHandler named "select". > Convention. > >Of course, running in parallel is not a fix-all - running a search > takes > >some time, even if run in parallel. > >3. Automate the population of elevate.xml so that all these 959 > queries > >are here. This is probably best, but forces me to restart/reload > when > >there are changes to this components. The elevation can be done > through a > >query. > > > > What I'd love to do is to configure the "select" requestHandler to run > both > > searches and return me both sets of results. Is there anyway to do > that - > > apply the same q= parameter to two configured way to run a search? > > Something like sub queries? > > > > I suspect that approach 1 will get me through my demo and a brief > > evaluation period, but that either approach 2 or 3 will be the winner. > > > > Here's a snippet from my current qf/pf configuration: > > > > title^100 > > alttitle_t^100 > > ... > > text > > > > > > title^1000 > > alttitle_t^1000 > > ... > > text^10 > > > > > > Thanks, > > > > Dan Davis > -- Michał Bieńkowski
RE: can't make sense of spellchecker results when using techproducts example
Chris,

- DirectSpellChecker has a setting for "minPrefix" which the techproducts example sets to 1 (also the default). So it will never try to correct the first character. I think this is both a performance optimization and is based on the assumption that we rarely misspell the first character. This is why it will not correct "hell" to "dell". I think it will allow you to set this to 0, if you want your sample query to work.

- The "maxCollationTries" feature re-writes "q" / "spellcheck.q", and then, using all the other parameters, queries internally to see if there are any hits. This doesn't play very well when "q.op=OR" / "mm=1". So when you see a collation like "here ultrasharp" / "heat ..." etc, you see it is indeed getting some hits. So it considers it a valid query re-write, despite the absurdity. We could improve this example config by adding "spellcheck.collateParam.q.op=AND" to the defaults. (When using dismax, you would add "spellcheck.collateParam.mm=100%".) Also, while the "collateParam" functionality is in the old Solr wiki, it doesn't seem to be in the reference manual, so we probably should add it, as this would be pretty important for a lot of users.

- Unless using the legacy IndexBasedSpellChecker / FileBasedSpellchecker, you need not use "spellcheck.build". It's a no-op for both Direct and WordBreak, as these do not use sidecar indexes.

So without changing the config, these queries illustrate the spellchecker pretty well, including the word-break functionality:

http://localhost:8983/solr/techproducts/spell?spellcheck.q=dzll+ultra%20sharp&df=text&spellcheck=true&spellcheck.collateParam.q.op=AND
http://localhost:8983/solr/techproducts/spell?spellcheck.q=dellultrasharp&df=text&spellcheck=true&spellcheck.collateParam.q.op=AND

Spellcheck has a lot of gotchas, and I wish we could dream up a way to make it easy for people. I remember it being a struggle for me when I was a new user, and I know we get lots of questions on the user-list about it.

My apologies to you for not answering this sooner.

James Dyer
Ingram Content Group

-----Original Message-----
From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
Sent: Wednesday, December 17, 2014 6:49 PM
To: solr-user@lucene.apache.org
Subject: can't make sense of spellchecker results when using techproducts example

Ok, so i've been working on updating the ref guide to account for the new way to run the "examples" in 5.0. The spell checking page...

https://cwiki.apache.org/confluence/display/solr/Spell+Checking

...has some examples that loosely correlate to the "techproducts" example, but even if you ignore the specifics of those examples, i need help understanding the basic behavior of the spellchecker as configured in the techproducts example.

Assuming you run this...

bin/solr -e techproducts

with that example running & those docs indexed, this URL gives me results i can't explain...

http://localhost:8983/solr/techproducts/spell?spellcheck.q=hell+ultrashar&df=text&spellcheck=true&spellcheck.build=true

(see below)

1) "dell" is not listed as a possible suggestion for "hell" (even if the dictionary thinks "hold" is a better suggestion, why isn't "dell" even included in the list of possibilities?)

2) in the "collation" section, i can't make any sense of what these results mean -- how is "hello ultrasharp" a suggested collationQuery when *none* of the example docs contain both "hello" and "ultrasharp"?

http://localhost:8983/solr/techproducts/select?df=text&q=%2Bhello+%2Bultrasharp

So WTF is up with these spell check results?
[The spellcheck response XML was mangled in the archive. Recoverable content: suggestions for "hell" were hello, here, heat, hold, html and héllo; the suggestion for "ultrashar" was ultrasharp; correctlySpelled was false; the collations were "hello ultrasharp" (2 hits), "here ultrasharp" (3 hits), "heat ultrasharp" (2 hits), "hold ultrasharp" (2 hits) and "html ultrasharp" (2 hits).]

-Hoss
http://www.lucidworks.com/
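Following up on James' suggestion above, a minimal sketch of how the /spell handler defaults could be extended with the collation q.op override (a sketch only - the other defaults shown are illustrative, not a copy of the shipped techproducts config):

<requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="df">text</str>
    <str name="spellcheck">on</str>
    <str name="spellcheck.dictionary">default</str>
    <!-- make collation test queries use AND so nonsense collations stop getting hits -->
    <str name="spellcheck.collateParam.q.op">AND</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>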
Re: How does text-rev work?
Anybody? Otherwise, I guess it is a JIRA to delete the unused field?

Regards,
   Alex.

Sign up for my Solr resources newsletter at http://www.solr-start.com/

On 28 December 2014 at 13:16, Alexandre Rafalovitch wrote:
> I am looking at the collection1/techproducts schema and I can't figure
> out how the reversed wildcard example is supposed to work.
>
> We define text_general_rev type and text_rev field, but we don't seem
> to be populating it at any point. And running the example does not
> seem to show any tokens in the field even when the non-inverted text
> field does have some.
>
> Apparently, there is some magic in the QueryParser to do something
> about this at query time, but I see no explanation of what is supposed
> to happen at the index/schema time.
>
> Anybody has the skinny on this one?
>
> Regards,
>    Alex.
>
> Sign up for my Solr resources newsletter at http://www.solr-start.com/
Re: Request two databases at the same time ?
Dear Erick, thank you for your answer. My answers are below.

On 09/01/2015 20:43, Erick Erickson wrote:
> bq: I don't want to modify my BigDB1 to update documents with abstract
> because BigDB1 is always updated twice by week.
>
> Why not? Solr/Lucene handle updating docs, if a doc in the index has the
> same uniqueKey, the old doc is deleted and the new one takes its place. So
> why not just put the new abstracts into BigDB1? If you re-index the docs
> later (your twice/week comment), then they'll be overwritten. This will be
> much simpler than trying to maintain two.

I understand this process; I use it for other collections, and twice a week for BigDB1. But, for example, Doc1 is updated with an abstract on Monday. On Tuesday I must update it with new data, and the abstract will then be lost. I can't check/get the abstract before re-inserting it in the new doc, because I receive several thousand docs every week (new and amended); I think that would take a long time.

> But if you cannot update BigDB1 just fire off two queries and combine them.
> Or specify the shards parameter on the URL pointing to both collections.
> Do note, though, that the relevance calculations may not be absolutely
> comparable, so mixing the results may show some surprises...

Shards... I will take a look at this, I don't know this param. Concerning relevance, I don't really use it, so it won't be a problem I think.

Sincerely,

> Best,
> Erick
>
> On Fri, Jan 9, 2015 at 9:12 AM, Bruno Mannina wrote:
>> Dear All,
>>
>> I use Apache-SOLR3.6, on Ubuntu (newbie user).
>>
>> I have a big database named BigDB1 with 90M documents,
>> each document contains several fields (docid, title, author, date, etc...)
>>
>> I received today from another source, abstract of some documents (there are
>> also the same docid field in this source).
>> I don't want to modify my BigDB1 to update documents with abstract because
>> BigDB1 is always updated twice by week.
>>
>> Do you think it's possible to create a new database named AbsDB1 and request
>> the both database at the same time ?
>> if I do for example:
>> title:airplane AND abstract:plastic
>>
>> I would like to obtain documents from BigDB1 and AbsDB1.
>>
>> Many thanks for your help, information and others things that can help me.
>>
>> Regards,
>> Bruno
>>
>> ---
>> This email contains no viruses or malware because avast! Antivirus protection is active.
>> http://www.avast.com

---
This email contains no viruses or malware because avast! Antivirus protection is active.
http://www.avast.com
Re: How does text-rev work?
Or a Jira to document it.

The basic idea is that if a normal leading wildcard is too slow, the user can index a copy of their text fields using the text_rev type, which indexes terms with their characters reversed and with a special marker. Then the query parser detects a leading wildcard and that the field type uses the reversed wildcard filter, and it generates a wildcard query using the reversed query token and wildcard pattern, so that the leading wildcard becomes a trailing wildcard or prefix query.

-- Jack Krupansky

On Fri, Jan 9, 2015 at 3:15 PM, Alexandre Rafalovitch wrote:
> Anybody? Otherwise, I guess it is a JIRA to delete the unused field?
>
> Regards,
>    Alex.
>
> Sign up for my Solr resources newsletter at http://www.solr-start.com/
>
> On 28 December 2014 at 13:16, Alexandre Rafalovitch wrote:
>> I am looking at the collection1/techproducts schema and I can't figure
>> out how the reversed wildcard example is supposed to work.
>>
>> We define text_general_rev type and text_rev field, but we don't seem
>> to be populating it at any point. And running the example does not
>> seem to show any tokens in the field even when the non-inverted text
>> field does have some.
>>
>> Apparently, there is some magic in the QueryParser to do something
>> about this at query time, but I see no explanation of what is supposed
>> to happen at the index/schema time.
>>
>> Anybody has the skinny on this one?
>>
>> Regards,
>>    Alex.
>>
>> Sign up for my Solr resources newsletter at http://www.solr-start.com/
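A minimal sketch of how the pieces Jack describes would be wired together in a schema (the analyzer details and the copyField line are illustrative assumptions, not a quote of the shipped example schema):

<fieldType name="text_general_rev" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- indexes each term both normally and reversed with a marker character -->
    <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
            maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="text_rev" type="text_general_rev" indexed="true" stored="false" multiValued="true"/>

<!-- without a copyField like this, nothing ever populates text_rev -->
<copyField source="text" dest="text_rev"/>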
Unexplained leader initiated recovery after updates
Hi all,

I am experiencing a problem where Solr nodes go into recovery following an update cycle.

Examination of the logs indicates that the recovery is initiated by the shard leader while processing regular update events, because the replica is unreachable. For example, the following is recorded in the leader's log file:

...
2014-12-15 05:14:03.285 [qtp2092193830-400307] INFO org.apache.solr.cloud.ZkController Put replica core=listings coreNodeName=solr12:8983_solr_listings on solr12:8983_solr into leader-initiated recovery.
2014-12-15 05:14:03.285 [qtp2092193830-400307] WARN org.apache.solr.cloud.ZkController Leader is publishing core=listings coreNodeName=solr12:8983_solr_listings state=down on behalf of un-reachable replica http://solr12:8983/solr/listings/; forcePublishState? false
2014-12-15 05:14:03.287 [zkCallback-2-thread-20] INFO org.apache.solr.cloud.DistributedQueue LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type NodeChildrenChanged
...

However, when I check, I cannot detect any connectivity problems between the leader and the replica.

About 40% of the time, the nodes recover without any intervention in 4 or 5 minutes. The remaining 60% of the time, the recovering node reports a java.lang.OutOfMemoryError and Solr needs to be restarted.

For background, here are some details about our configuration:

* Solr 4.10.2 (problem also observed with Solr 4.6.1)
* 12 shards with 2 nodes per shard
* a single updater running in a separate subnet is posting updates using the SolrJ CloudSolrServer client; updates are triggered hourly
* system is under continuous query load
* autoCommit is set to 821 seconds
* autoSoftCommit is set to 303 seconds

I cannot correlate these recovery events to an increase in update or query load. The query traffic does not appear to be affected by any transient connectivity issues. The only clear pattern is that these recovery events happen after an updater run, while the cluster is busy processing the updates.

Can you suggest where to look to figure out why these recovery events are occurring?

Thanks,

Lindsay Martin
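For context, the commit settings Lindsay lists would typically look something like this in solrconfig.xml (the openSearcher=false setting is an assumption shown for illustration; the thread does not say how it is actually configured):

<autoCommit>
  <!-- 821 seconds -->
  <maxTime>821000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
  <!-- 303 seconds -->
  <maxTime>303000</maxTime>
</autoSoftCommit>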
Re: Unexplained leader initiated recovery after updates
On 1/9/2015 4:54 PM, Lindsay Martin wrote:
> I am experiencing a problem where Solr nodes go into recovery following an
> update cycle.
> For background, here are some details about our configuration:
> * Solr 4.10.2 (problem also observed with Solr 4.6.1)
> * 12 shards with 2 nodes per shard
> * a single updater running in a separate subnet is posting updates using the
> SolrJ CloudSolrServer client. Updates are triggered hourly.
> * system is under continuous query load
> * autoCommit is set to 821 seconds
> * autoSoftCommit is set to 303 seconds

I would suspect some kind of performance problem that likely results in the zkClientTimeout expiring. I have a standard set of questions for performance problems.

Questions about zookeeper:

How many ZK nodes? Is zookeeper on separate hardware? If it's on the same hardware as Solr, is its database on the same disk spindles as the Solr index, or separate spindles? Is zookeeper standalone or embedded in Solr? If it's standalone, do you happen to know the java max heap for the zookeeper processes?

Questions about Solr and the hardware:

How many total Solr servers? How much RAM is installed on each one? What is the max size of the Java heap? Are you running more than one Solr (JVM/container) instance per machine? If you add up all the "index" directories on a server, how much disk space does it take? Is the amount of disk space used similar on all of the servers?

Thanks,
Shawn
Re: Garbage Collection tuning - G1 is now a good option
On 1/1/2015 12:10 PM, Shawn Heisey wrote:
> I've been working with Oracle employees to find better GC tuning
> options. The results are good enough to share with the community:
>
> https://wiki.apache.org/solr/ShawnHeisey#GC_Tuning
>
> With the latest Java 7 or Java 8 version, and a couple of tuning
> options, G1GC has grown up enough to be a viable choice. Two of the
> settings on that list were critical for making the performance
> acceptable with my testing: ParallelRefProcEnabled and G1HeapRegionSize.
>
> I've included some notes on the wiki about how you can size the G1 heap
> regions appropriately for your own index.

A note was just recently added to the Lucene wiki, saying that the G1 collector should never be used with Lucene, because there are bugs that might cause index corruption. Solr uses Lucene, so this would apply.

https://wiki.apache.org/lucene-java/JavaBugs#Oracle_Java_.2F_Sun_Java_.2F_OpenJDK_Bugs

I have never had any problems with it, but perhaps I spoke too soon when recommending G1. Does anyone else have anything to add?

Thanks,
Shawn
Re: Garbage Collection tuning - G1 is now a good option
It looks like 32 bit is affected. > On 2013-08-14 08:27, Dawid Weiss wrote: >> >> Hi everyone, >> >> I am a committer to the Lucene/Solr project. We've recently hit what >> we believe is a JIT/GC bug -- it manifests itself only when G1GC is >> used, on a 32-bit VM: On Fri, Jan 9, 2015 at 7:10 PM, Shawn Heisey wrote: > On 1/1/2015 12:10 PM, Shawn Heisey wrote: > > I've been working with Oracle employees to find better GC tuning > > options. The results are good enough to share with the community: > > > > https://wiki.apache.org/solr/ShawnHeisey#GC_Tuning > > > > With the latest Java 7 or Java 8 version, and a couple of tuning > > options, G1GC has grown up enough to be a viable choice. Two of the > > settings on that list were critical for making the performance > > acceptable with my testing: ParallelRefProcEnabled and G1HeapRegionSize. > > > > I've included some notes on the wiki about how you can size the G1 heap > > regions appropriately for your own index. > > A note was just recently added to the Lucene wiki, saying that the G1 > collector should never be used with Lucene, because there are bugs that > might cause index corruption. Solr uses Lucene, so this would apply. > > > https://wiki.apache.org/lucene-java/JavaBugs#Oracle_Java_.2F_Sun_Java_.2F_OpenJDK_Bugs > > I have never had any problems with it, but perhaps I spoke too soon when > recommending G1. Does anyone else have anything to add? > > Thanks, > Shawn > > -- Bill Bell billnb...@gmail.com cell 720-256-8076
Re: How does text-rev work?
So, Query Parser does some sort of magic and looks for the field with the same name and _rev suffix? But what populates that field? In the example schema, it seems to be standalone and empty. Is there a copyField missing? Regards, Alex. Sign up for my Solr resources newsletter at http://www.solr-start.com/ On 9 January 2015 at 17:07, Jack Krupansky wrote: > Or a Jira to document it. > > The basic idea is that if a normal leading wildcard is too slow, the user > can index a copy of their text fields using the text_rev type, which > indexes terms with their characters reversed and with a special marker. > Then the query parser detects a leading wildcard and that the field type > uses the reversed wildcard filter, and then it generates a wildcard query > that using the reversed query token and wildcard pattern so that the > leading wildcard becomes a trailing wildcard or prefix query > > > -- Jack Krupansky > > On Fri, Jan 9, 2015 at 3:15 PM, Alexandre Rafalovitch > wrote: > >> Anybody? Otherwise, I guess it is a JIRA to delete the unused field? >> >> Regards, >>Alex. >> >> Sign up for my Solr resources newsletter at http://www.solr-start.com/ >> >> >> On 28 December 2014 at 13:16, Alexandre Rafalovitch >> wrote: >> > I am looking at the collection1/techproducts schema and I can't figure >> > out how the reversed wildcard example is supposed to work. >> > >> > We define text_general_rev type and text_rev field, but we don't seem >> > to be populating it at any point. And running the example does not >> > seem to show any tokens in the field even when the non-inverted text >> > field does have some. >> > >> > Apparently, there is some magic in the QueryParser to do something >> > about this at query time, but I see no explanation of what is supposed >> > to happen at the index/schema time. >> > >> > Anybody has the skinny on this one? >> > >> > Regards, >> >Alex. >> > >> > Sign up for my Solr resources newsletter at http://www.solr-start.com/ >>