Re: Filter nested index - remove empty parents
This seems to be a good approach. I will try! Thank you!

Dragos

From: Erick Erickson
To: solr-user ; Dragos Bogdan
Sent: Thursday, November 10, 2016 6:02 PM
Subject: Re: Filter nested index - remove empty parents

It looks like you're trying to just index tables from some DB and then
search them in Solr as you would the DB. Solr join queries aren't like DB
joins; in particular, you can't return _fields_ from the "from" table.

The usual recommendation, if at all possible, is to flatten your data. This
runs counter to the RDBMS reflex to normalize, normalize, normalize.
However, Solr specializes in searching and handles lots and lots of data,
so de-normalizing is often a viable solution.

Best,
Erick

On Thu, Nov 10, 2016 at 6:17 AM, Dragos Bogdan wrote:
> Hello,
>
> I am new to Solr and at first glance, I can say this is a very good
> service. Very helpful and fast.
>
> I am trying to filter docs based on some criteria, but I have a few
> issues obtaining the final results.
> The main objective is to have one query that is able to offer a list of
> Persons with specific Profiles that have specific Experiences.
>
> I think I managed to obtain such a list, but the issue is that I still
> have Persons with no Profiles, or Profiles with no Experiences, in the
> results. I would need a clean list with optimal execution time.
>
> What I have - types of docs:
>
> Parents - Persons:
> { "FIRSTNAME": "Ruth",
>   "CONTENT_TYPE_ID": "parentDocument",
>   "id": "-3631097568311640064"}
>
> Children - Profiles:
> { "PROFILEID": "548",
>   "CONTENT_TYPE_ID": "firstChildDocument",
>   "id": "-3631097568311640064",
>   "PROFILECOMPETENCYID": "553"}
>
> Children of Profiles are Experiences:
> { "EXPERIENCEID": "8158200356237475840",
>   "CONTENT_TYPE_ID": "secondChildDocument",
>   "id": "-3631097568311640064",
>   "PROFILE_PROFILEID": "548"}
>
> Variant 1:
>
> q=id:"-3631097568311640064" AND +{!parent
>   which=CONTENT_TYPE_ID:parentDocument v=CONTENT_TYPE_ID:firstChildDocument}&
> fl=*,experiences:[subquery]&
> experiences.q=(CONTENT_TYPE_ID:secondChildDocument AND
>   EXPERIENCEID:"-3884425047351230464")&
> experiences.fq={!terms f=PROFILE_PROFILEID v=$row.PROFILEID}&
> expand.field=_root_&expand=true&expand.q=CONTENT_TYPE_ID:firstChildDocument
>
> This approach groups and filters Profiles for every Person and creates a
> subquery of desired Experiences for each Profile.
> The issue is that I have "empty" Profiles with no Experiences in the
> results, and consequently Persons without any Experiences.
>
> Example result attached: Example1.json
>
> Variant 2:
>
> q=CONTENT_TYPE_ID:"parentDocument" AND id:"-3631097568311640064"&
> fl=*,profiles:[subquery]&
> profiles.q=*:*&
> profiles.fq=(CONTENT_TYPE_ID:"firstChildDocument" AND {!terms f=id
>   v=$row.id})&
> profiles.fl=*,experiences:[subquery]&
> profiles.experiences.q=*:*&
> profiles.experiences.fq=((CONTENT_TYPE_ID:"secondChildDocument" AND
>   EXPERIENCEID:"-3884425047351230464") AND {!terms f=PROFILE_PROFILEID
>   v=$row.PROFILEID})
>
> This approach simply creates subqueries with the desired Experiences, but
> I have two issues:
> - The subqueries are executed for documents where they are not needed.
>   For example, it tries to find Experiences for Persons, but Experiences
>   exist only for Profiles.
> - As before, the results contain Persons with no Experiences or Profiles
>   with no Experiences. The "empty" Persons and "empty" Profiles should be
>   removed. (Somehow filter out all results that have numFound: 0?)
>
> Example result attached: Example2.json
>
> Questions:
>
> 1. Is there any solution to fix the issues with either of the above
>    queries so we get the desired results? Is there any optimization that
>    can be done to get the best timings?
>
> Or
>
> 2. Is there any other approach to obtain the desired results? Other
>    types of joins?
>
> kind regards,
> Dragos
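For the record, the flattening Erick describes would collapse each
Person/Profile/Experience combination into one plain document, e.g. (field
values reused from the examples above; the combined id scheme is made up):

{ "id": "-3631097568311640064_548_8158200356237475840",
  "FIRSTNAME": "Ruth",
  "PROFILEID": "548",
  "PROFILECOMPETENCYID": "553",
  "EXPERIENCEID": "8158200356237475840" }

A single query such as

q=FIRSTNAME:Ruth&fq=PROFILEID:548&fq=EXPERIENCEID:8158200356237475840

then only ever returns rows where a Person, a Profile and an Experience all
exist together, so the "empty" parents disappear by construction.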
Re: Is there a way to tell if a multivalued field actually contains multiple values?
I suppose it's needless to remind that norm(field) tracks the number of
tokens in a doc's field (though not precisely, given the default lossy
encoding), although not the actual text values.

On Fri, Nov 11, 2016 at 5:08 AM, Alexandre Rafalovitch wrote:
> Hello,
>
> Say I indexed a large dataset against a schemaless configuration. Now
> I have a bunch of multivalued fields. Is there any way to say which of
> these (text) fields have (for the given data) only single values? I know
> I am supposed to look at the original data, and all that, but this is
> more for debugging/troubleshooting.
>
> Turning on termOffsets/termPositions would make it easy, but that's a
> bit messy for troubleshooting purposes.
>
> I was thinking that one giveaway is the positionIncrementGap causing
> the second value's tokens to start at a number above a hundred. But I am
> not sure how to craft a query against a field to see if such a token is
> generically present.
>
> Any ideas?
>
> Regards,
>    Alex.
>
> Solr Example reading group is starting November 2016, join us at
> http://j.mp/SolrERG
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/

--
Sincerely yours
Mikhail Khludnev
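As a rough way to exploit that, norm(field) is also available as a Solr
function query, so a query along these lines (a sketch; myfield is a
placeholder, and how the norm maps to token count depends on the similarity
in use) lets you eyeball the outliers:

q=myfield:[* TO *]&fl=id,n:norm(myfield)&sort=norm(myfield) asc&rows=20

With the classic TF-IDF similarity the norm shrinks as the token count
grows, so the ascending sort shows the docs with the most tokens first.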
Re: Is there a way to tell if a multivalued field actually contains multiple values?
I think you can use the term stats that Lucene tracks for each field:
compare Terms.getSumTotalTermFreq and Terms.getDocCount. If they are equal,
it means every document that had this field had only one token.

Mike McCandless

http://blog.mikemccandless.com

On Fri, Nov 11, 2016 at 5:50 AM, Mikhail Khludnev wrote:
> I suppose it's needless to remind that norm(field) tracks the number of
> tokens in a doc's field (though not precisely, given the default lossy
> encoding), although not the actual text values.
>
> On Fri, Nov 11, 2016 at 5:08 AM, Alexandre Rafalovitch
> wrote:
>
>> Hello,
>>
>> Say I indexed a large dataset against a schemaless configuration. Now
>> I have a bunch of multivalued fields. Is there any way to say which of
>> these (text) fields have (for the given data) only single values? I know
>> I am supposed to look at the original data, and all that, but this is
>> more for debugging/troubleshooting.
>>
>> Turning on termOffsets/termPositions would make it easy, but that's a
>> bit messy for troubleshooting purposes.
>>
>> I was thinking that one giveaway is the positionIncrementGap causing
>> the second value's tokens to start at a number above a hundred. But I am
>> not sure how to craft a query against a field to see if such a token is
>> generically present.
>>
>> Any ideas?
>>
>> Regards,
>>    Alex.
>>
>> Solr Example reading group is starting November 2016, join us at
>> http://j.mp/SolrERG
>> Newsletter and resources for Solr beginners and intermediates:
>> http://www.solr-start.com/
>
> --
> Sincerely yours
> Mikhail Khludnev
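For anyone who wants to run that check outside Solr, here is a minimal
sketch against the Lucene 6.x API (point it at the core's data/index
directory; in later Lucene versions MultiFields.getTerms became
MultiTerms.getTerms, and sumTotalTermFreq can be -1 if term frequencies
are omitted for the field):

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.store.FSDirectory;

public class SingleTokenCheck {
  public static void main(String[] args) throws Exception {
    String indexDir = args[0];  // e.g. /var/solr/data/core1/data/index
    String field = args[1];     // the field to inspect
    try (IndexReader reader =
             DirectoryReader.open(FSDirectory.open(Paths.get(indexDir)))) {
      Terms terms = MultiFields.getTerms(reader, field);
      if (terms == null) {
        System.out.println("no terms indexed for field " + field);
        return;
      }
      long sumTotalTermFreq = terms.getSumTotalTermFreq(); // tokens, all docs
      int docCount = terms.getDocCount();                  // docs with field
      System.out.println("sumTotalTermFreq=" + sumTotalTermFreq
          + " docCount=" + docCount);
      System.out.println(sumTotalTermFreq == docCount
          ? "every doc with this field has exactly one token"
          : "at least one doc has more than one token");
    }
  }
}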
Re: Solr shards: very sensitive to swap space usage!?
On Thu, 2016-11-10 at 16:42 -0700, Shawn Heisey wrote:
> If the machine that Solr is installed on is using swap, that means
> you're having serious problems, and your performance will be
> TERRIBLE.

Agreed so far.

> This kind of problem cannot be caused by Solr if it is properly
> configured for the machine it's running on.

That is practically a tautology. Most of the Solr setups I have worked with
have behaved as one would hope with regards to swap, but on two occasions I
have experienced heavy swapping with multiple gigabytes free for disk
cache. In both cases, the cache-to-index size ratio was fairly low (let's
say < 10%).

My guess (I don't know the intrinsics of memory mapping vs. swapping) is
that the aggressive IO for the memory mapping caused the kernel to start
swapping parts of the JVM heap to get better caching of storage data. Yes,
with terrible performance as a result.

No matter the cause, the swapping problems were "solved" by effectively
disabling the swap (swappiness 0). We did try very conservative swapping
first (swappiness 5 or something like that), but that did not work.
Although disabling swap meant less free memory for disk caching, as nothing
was swapped out any longer, it solved our performance problems.

Disabling swapping is easy to try, so I suggest doing just that.

- Toke Eskildsen, State and University Library, Denmark
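For reference, swappiness can be changed on the fly and persisted across
reboots like this (standard Linux sysctl usage; file locations may vary by
distro):

# apply immediately
sudo sysctl vm.swappiness=0

# persist across reboots
echo 'vm.swappiness = 0' | sudo tee -a /etc/sysctl.conf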
Re: Solr shards: very sensitive to swap space usage!?
On 11/11/2016 6:46 AM, Toke Eskildsen wrote:
> but on two occasions I have
> experienced heavy swapping with multiple gigabytes free for disk
> cache. In both cases, the cache-to-index size ratio was fairly low
> (let's say < 10%). My guess (I don't know the intrinsics of memory
> mapping vs. swapping) is that the aggressive IO for the memory mapping
> caused the kernel to start swapping parts of the JVM heap to get better
> caching of storage data. Yes, with terrible performance as a result.

That's really weird, and sounds like a broken operating system. I've had
other issues with swap, but in those cases free memory was actually near
zero, and it sounds like your situation was not the same. So the OP here
might be having similar problems even if nothing's misconfigured. If so,
your solution will probably help them.

> No matter the cause, the swapping problems were "solved" by
> effectively disabling the swap (swappiness 0).

Solr certainly doesn't need (or even want) swap if the machine is sized
right. I've read some things saying that Linux doesn't behave correctly if
you completely get rid of all swap, but setting swappiness to zero sounds
like a good option. The OS would still utilize swap if it actually ran out
of physical memory, so you don't lose the safety valve that swap normally
provides.

Thanks,
Shawn
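If you want to confirm whether a box is actively swapping before changing
anything, watching vmstat for a while is enough:

vmstat 5
# sustained non-zero values in the si (swap-in) and so (swap-out)
# columns mean the kernel is actively paging to and from swap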
Re: Wildcard searches with space in TextField/StrField
You have to query text and string fields differently; that's just the way
it works. The problem is getting the query string through the parser as a
_single_ token or as multiple tokens.

Let's say you have a string field with the "a b" example. You have a single
token, "a b", that starts at offset 0. But with a text field, you have two
tokens:
a at position 0
b at position 1

When the query parser sees "a b" (without quotes) it splits it into two
tokens, and only the text field has both tokens, so the string field won't
match. OTOH, when the query parser sees "a\ b" it passes this through as a
single token, which only matches the string field, as there's no _single_
token "a b" in the text field.

But a more interesting question is why you want to search this way. String
fields are intended for keywords, machine-generated IDs and the like.
They're pretty useless for searching anything except
1> exact tokens
2> prefixes

While if you have "my dog has fleas" in a string field, you _can_ search
"*dog*" and get a hit, but the performance is poor on a large corpus.
Performance for "my*" will be pretty good though.

All in all this sounds like an XY problem; what's the use-case you're
trying to solve?

Best,
Erick

On Thu, Nov 10, 2016 at 10:11 PM, Sandeep Khanzode wrote:
> Hi Erick, Reth,
>
> The 'a\ b*' as well as the q.op=AND approach worked (successfully) only
> for StrField for me.
>
> Any attempt at creating 'a\ b*' for a TextField does not match any
> documents. The parsedQuery in debug mode does show 'field:a b*'. I am
> sure there are documents that should match.
> Another (maybe unrelated) observation: if I have 'field:a\ b', then the
> parsedQuery is field:a field:b. Which does not match as expected (matches
> individually).
>
> Can you please provide an example that I can use in the Solr Query
> dashboard? That would be helpful.
>
> I have also seen that wildcard queries work irrespective of field type,
> i.e. StrField as well as TextField. That makes sense because a
> WhitespaceTokenizer only creates word boundaries when we do not use an
> EdgeNGramFilter. If I am not wrong, that is.
>
> SRK
>
> On Friday, November 11, 2016 5:00 AM, Erick Erickson wrote:
>
> You can escape the space with a backslash as 'a\ b*'
>
> Best,
> Erick
>
> On Thu, Nov 10, 2016 at 2:37 PM, Reth RM wrote:
>> I don't think you can do wildcard on StrField. For a text field, if your
>> query is "category:(test m*)" the parsed query will be "category:test OR
>> category:m*". You can add q.op=AND to make an AND between those terms.
>>
>> For phrase-type wildcard query support, as per the docs, it is
>> ComplexPhraseQueryParser that supports it. (I haven't tested it myself.)
>>
>> https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-ComplexPhraseQueryParser
>>
>> On Thu, Nov 10, 2016 at 11:40 AM, Sandeep Khanzode <
>> sandeep_khanz...@yahoo.com.invalid> wrote:
>>
>>> Hi,
>>> How does a search like abc* work in a StrField? Since the entire thing
>>> is stored as a single token, is it a type of trie structure that allows
>>> such wildcard matching?
>>> How can searches with a space, like 'a b*', be executed for text fields
>>> (tokenized on whitespace)? If we specify this type of query, it is
>>> broken down into two queries with field:a and field:b*. I would like
>>> them to be contiguous, sort of like a phrase search with a wildcard.
>>> SRK
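For what it's worth, a concrete sketch of the ComplexPhraseQueryParser
route Reth linked (untested here; "field" stands in for the actual
tokenized text field):

q={!complexphrase inOrder=true}field:"a b*"

This parses "a b*" as a phrase whose last term is a wildcard, so it matches
documents where a token starting with b immediately follows the token a,
i.e. the contiguous match being asked for on a TextField.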
Re: Keeping faster and slower Solr slaves aligned with the same index version
Csongor:

If session locking is new to you, here is a comprehensive explanation of
the "Active - Active multi-region" scenario you're encountering and how
Netflix resolves the matter. Although I remain puzzled by a 15-minute
network transfer of non-optimized segments (or whether you are replicating
after an optimize rather than a commit, so that all files are being
shipped).

http://techblog.netflix.com/2013/12/active-active-for-multi-regional.html

regards,
will

On 11/7/2016 11:13 AM, Erick Erickson wrote:
> Not that I know of. Can you session-lock users to a particular region?
>
> Best,
> Erick
>
> On Sun, Nov 6, 2016 at 7:49 PM, Csongor Gyuricza
> wrote:
>> We have the following high-level Solr setup:
>>
>> region a) 1 Solr master + 3 slaves
>> region b) 1 Solr repeater (pointing to the master in region a) + 3 slaves
>>
>> In region (a), replication takes about 2 min from the master to the 3
>> slaves. Due to our network topology, replication from the master to the
>> repeater takes about 15 min, after which it takes another 2 min for
>> replication to occur between the repeater and the slaves in region (b).
>> So the slaves in region (b) are always 15 min behind the slaves in
>> region (a), which is a problem because all slaves are behind a
>> latency-based Route 53 record. Clients are noticing the difference
>> because they are getting inconsistent data during those 15 min.
>>
>> I would like to solve this inconsistency. Is there a way to make the
>> faster slaves in region (a) wait for all slaves in region (b) to
>> complete replication and then have all 6 slaves switch to the new index
>> simultaneously? If not, what is the alternative solution to this
>> problem?
>>
>> - Csongor
>>
>> Note: We are on Solr 3.5 (old, yes I know...)
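One option to experiment with (a sketch against the stock
ReplicationHandler HTTP API, which exists back to the 3.x line; host and
core names are placeholders): turn off automatic polling on the slaves and
trigger the fetch yourself once every tier has the new generation
available:

# one-time: stop each slave from polling on its own schedule
curl "http://slaveN:8983/solr/core/replication?command=disablepoll"

# after confirming the repeater is current, fire the fetch on all six
# slaves at (roughly) the same moment, e.g. from a central script
curl "http://slaveN:8983/solr/core/replication?command=fetchindex"

The cutover still won't be perfectly simultaneous, since each slave copies
files and opens its new searcher independently, but it shrinks the
15-minute window to roughly the duration of the slowest single fetch.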
Re: 5.5.3: fieldValueCache auto-warming error
On 10/11/16 17:10, Erick Erickson wrote:
> Just facet on the text field yourself ;)

Wish I could. This is on premises over at a client, access is difficult,
and their response time is pretty bad on public holidays and weekends. So
I'm basically twiddling my thumbs while waiting to get more log files :-)

I haven't been able to reproduce the problem locally, but there could be
any number of contributing factors that I'm missing.

> Kidding aside, this should be in the clear from the logs, my guess is
> that the first time you see an OOM error in the logs the query will be
> in the file also.

We generally prefer "fail hard fast", so I think we are running with the
OOM killer script in most environments. I don't think they've gone OOM in
this case, though something else could have gone wrong undetected. I hope
I'll know more after the weekend.
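For anyone following along: the "OOM killer script" refers to the
oom_solr.sh that the bin/solr start script wires in through a JVM flag
along these lines (port and paths are illustrative):

-XX:OnOutOfMemoryError="/opt/solr/bin/oom_solr.sh 8983 /var/solr/logs"

When the JVM throws OutOfMemoryError it runs the script, which kills the
Solr process outright so it can be restarted cleanly rather than limping
along in an undefined state.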