Re: Filter nested index - remove empty parents
This seems to be a good approach. I will try! Thank you!

Dragos

From: Erick Erickson
To: solr-user ; Dragos Bogdan
Sent: Thursday, November 10, 2016 6:02 PM
Subject: Re: Filter nested index - remove empty parents

It looks like you're trying to just index tables from some DB and then
search them in Solr as you would the DB. Solr join queries aren't like DB
joins; in particular, you can't return _fields_ from the "from" table.

The usual recommendation, if at all possible, is to flatten your data. This
runs counter to the RDBMS reflex to normalize, normalize, normalize.
However, Solr specializes in searching and handles lots and lots of data,
so de-normalizing is often a viable solution.

Best,
Erick

On Thu, Nov 10, 2016 at 6:17 AM, Dragos Bogdan wrote:
> Hello,
>
> I am new to Solr and at first glance, I can say this is a very good
> service. Very helpful and fast.
>
> I am trying to filter docs based on some criteria, but I have a few
> issues obtaining the final results.
> The main objective is to have one query that is able to offer a list of
> Persons with specific Profiles that have specific Experiences.
>
> I think I managed to obtain such a list, but the issue is that I still
> have Persons with no Profiles, or Profiles with no Experiences, in the
> results. I would need a clean list with optimal execution time.
>
> What I have - types of docs:
>
> Parents - Persons:
> { "FIRSTNAME": "Ruth",
>   "CONTENT_TYPE_ID": "parentDocument",
>   "id": "-3631097568311640064"}
>
> Children - Profiles:
> { "PROFILEID": "548",
>   "CONTENT_TYPE_ID": "firstChildDocument",
>   "id": "-3631097568311640064",
>   "PROFILECOMPETENCYID": "553"}
>
> Children of Profiles are Experiences:
> { "EXPERIENCEID": "8158200356237475840",
>   "CONTENT_TYPE_ID": "secondChildDocument",
>   "id": "-3631097568311640064",
>   "PROFILE_PROFILEID": "548"}
>
> Variant 1:
>
> q=id:"-3631097568311640064" AND +{!parent
>   which=CONTENT_TYPE_ID:parentDocument v=CONTENT_TYPE_ID:firstChildDocument}&
> fl=*,experiences:[subquery]&
> experiences.q=(CONTENT_TYPE_ID:secondChildDocument AND
>   EXPERIENCEID:"-3884425047351230464")&
> experiences.fq={!terms f=PROFILE_PROFILEID v=$row.PROFILEID}&
> expand.field=_root_&expand=true&expand.q=CONTENT_TYPE_ID:firstChildDocument
>
> This approach groups and filters Profiles for every Person and creates a
> subquery of desired Experiences for each Profile.
> The issue is that I have "empty" Profiles with no Experiences in the
> results, and consequently Persons without any Experiences.
>
> Example result attached: Example1.json
>
> Variant 2:
>
> q=CONTENT_TYPE_ID:"parentDocument" AND id:"-3631097568311640064"&
> fl=*,profiles:[subquery]&
> profiles.q=*:*&
> profiles.fq=(CONTENT_TYPE_ID:"firstChildDocument" AND {!terms f=id
>   v=$row.id})&
> profiles.fl=*,experiences:[subquery]&
> profiles.experiences.q=*:*&
> profiles.experiences.fq=((CONTENT_TYPE_ID:"secondChildDocument" AND
>   EXPERIENCEID:"-3884425047351230464") AND {!terms f=PROFILE_PROFILEID
>   v=$row.PROFILEID})
>
> This approach simply creates subqueries with the desired Experiences, but
> I have two issues:
> - The subqueries are executed for documents where they are not needed.
>   For example, it tries to find Experiences for Persons, but Experiences
>   exist only for Profiles.
> - As before, the results contain Persons with no Experiences or Profiles
>   with no Experiences. The "empty" Persons and "empty" Profiles should be
>   removed. (Somehow filter out all results that have numFound: 0?)
>
> Example result attached: Example2.json
>
> Questions:
>
> 1. Is there any solution to fix the issues with either of the above
>    queries so we get the desired results? Is there any optimization that
>    can be done to get the best timings?
>
> Or
>
> 2. Is there any other approach to obtain the desired results? Other
>    types of joins?
>
> kind regards,
> Dragos
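For the record, the flattening Erick describes would collapse each
Person/Profile/Experience combination into one plain document, e.g. (field
values reused from the examples above; the combined id scheme is made up):

{ "id": "-3631097568311640064_548_8158200356237475840",
  "FIRSTNAME": "Ruth",
  "PROFILEID": "548",
  "PROFILECOMPETENCYID": "553",
  "EXPERIENCEID": "8158200356237475840" }

A single query such as

q=FIRSTNAME:Ruth&fq=PROFILEID:548&fq=EXPERIENCEID:8158200356237475840

then only ever returns rows where a Person, a Profile and an Experience all
exist together, so the "empty" parents disappear by construction.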
Re: Is there a way to tell if a multivalued field actually contains multiple values?
I suppose it's needless to remind that norm(field) tracks the number of
tokens in a doc's field (though not precisely, given the default lossy
encoding), although not the actual text values.

On Fri, Nov 11, 2016 at 5:08 AM, Alexandre Rafalovitch wrote:
> Hello,
>
> Say I indexed a large dataset against a schemaless configuration. Now
> I have a bunch of multivalued fields. Is there any way to say which of
> these (text) fields have (for the given data) only single values? I know
> I am supposed to look at the original data, and all that, but this is
> more for debugging/troubleshooting.
>
> Turning on termOffsets/termPositions would make it easy, but that's a
> bit messy for troubleshooting purposes.
>
> I was thinking that one giveaway is the positionIncrementGap causing
> the second value's tokens to start at a number above a hundred. But I am
> not sure how to craft a query against a field to see if such a token is
> generically present.
>
> Any ideas?
>
> Regards,
>    Alex.
>
> Solr Example reading group is starting November 2016, join us at
> http://j.mp/SolrERG
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/

--
Sincerely yours
Mikhail Khludnev
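As a rough way to exploit that, norm(field) is also available as a Solr
function query, so a query along these lines (a sketch; myfield is a
placeholder, and how the norm maps to token count depends on the similarity
in use) lets you eyeball the outliers:

q=myfield:[* TO *]&fl=id,n:norm(myfield)&sort=norm(myfield) asc&rows=20

With the classic TF-IDF similarity the norm shrinks as the token count
grows, so the ascending sort shows the docs with the most tokens first.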
Re: Is there a way to tell if a multivalued field actually contains multiple values?
I think you can use the term stats that Lucene tracks for each field:
compare Terms.getSumTotalTermFreq and Terms.getDocCount. If they are equal,
it means every document that had this field had only one token.

Mike McCandless

http://blog.mikemccandless.com

On Fri, Nov 11, 2016 at 5:50 AM, Mikhail Khludnev wrote:
> I suppose it's needless to remind that norm(field) tracks the number of
> tokens in a doc's field (though not precisely, given the default lossy
> encoding), although not the actual text values.
>
> On Fri, Nov 11, 2016 at 5:08 AM, Alexandre Rafalovitch
> wrote:
>
>> Hello,
>>
>> Say I indexed a large dataset against a schemaless configuration. Now
>> I have a bunch of multivalued fields. Is there any way to say which of
>> these (text) fields have (for the given data) only single values? I know
>> I am supposed to look at the original data, and all that, but this is
>> more for debugging/troubleshooting.
>>
>> Turning on termOffsets/termPositions would make it easy, but that's a
>> bit messy for troubleshooting purposes.
>>
>> I was thinking that one giveaway is the positionIncrementGap causing
>> the second value's tokens to start at a number above a hundred. But I am
>> not sure how to craft a query against a field to see if such a token is
>> generically present.
>>
>> Any ideas?
>>
>> Regards,
>>    Alex.
>>
>> Solr Example reading group is starting November 2016, join us at
>> http://j.mp/SolrERG
>> Newsletter and resources for Solr beginners and intermediates:
>> http://www.solr-start.com/
>
> --
> Sincerely yours
> Mikhail Khludnev
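For anyone who wants to run that check outside Solr, here is a minimal
sketch against the Lucene 6.x API (point it at the core's data/index
directory; in later Lucene versions MultiFields.getTerms became
MultiTerms.getTerms, and sumTotalTermFreq can be -1 if term frequencies
are omitted for the field):

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.store.FSDirectory;

public class SingleTokenCheck {
  public static void main(String[] args) throws Exception {
    String indexDir = args[0];  // e.g. /var/solr/data/core1/data/index
    String field = args[1];     // the field to inspect
    try (IndexReader reader =
             DirectoryReader.open(FSDirectory.open(Paths.get(indexDir)))) {
      Terms terms = MultiFields.getTerms(reader, field);
      if (terms == null) {
        System.out.println("no terms indexed for field " + field);
        return;
      }
      long sumTotalTermFreq = terms.getSumTotalTermFreq(); // tokens, all docs
      int docCount = terms.getDocCount();                  // docs with field
      System.out.println("sumTotalTermFreq=" + sumTotalTermFreq
          + " docCount=" + docCount);
      System.out.println(sumTotalTermFreq == docCount
          ? "every doc with this field has exactly one token"
          : "at least one doc has more than one token");
    }
  }
}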
Re: Solr shards: very sensitive to swap space usage!?
On Thu, 2016-11-10 at 16:42 -0700, Shawn Heisey wrote:
> If the machine that Solr is installed on is using swap, that means
> you're having serious problems, and your performance will be
> TERRIBLE.

Agreed so far.

> This kind of problem cannot be caused by Solr if it is properly
> configured for the machine it's running on.

That is practically a tautology. Most of the Solr setups I have worked with
have behaved as one would hope with regards to swap, but on two occasions I
have experienced heavy swapping with multiple gigabytes free for disk
cache. In both cases, the cache-to-index size ratio was fairly low (let's
say < 10%).

My guess (I don't know the intrinsics of memory mapping vs. swapping) is
that the aggressive IO for the memory mapping caused the kernel to start
swapping parts of the JVM heap to get better caching of storage data. Yes,
with terrible performance as a result.

No matter the cause, the swapping problems were "solved" by effectively
disabling the swap (swappiness 0). We did try very conservative swapping
first (swappiness 5 or something like that), but that did not work.
Although disabling swap meant less free memory for disk caching, as nothing
was swapped out any longer, it solved our performance problems.

Disabling swapping is easy to try, so I suggest doing just that.

- Toke Eskildsen, State and University Library, Denmark
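For reference, swappiness can be changed on the fly and persisted across
reboots like this (standard Linux sysctl usage; file locations may vary by
distro):

# apply immediately
sudo sysctl vm.swappiness=0

# persist across reboots
echo 'vm.swappiness = 0' | sudo tee -a /etc/sysctl.conf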
Re: Solr shards: very sensitive to swap space usage!?
On 11/11/2016 6:46 AM, Toke Eskildsen wrote:
> but on two occasions I have
> experienced heavy swapping with multiple gigabytes free for disk
> cache. In both cases, the cache-to-index size ratio was fairly low
> (let's say < 10%). My guess (I don't know the intrinsics of memory
> mapping vs. swapping) is that the aggressive IO for the memory mapping
> caused the kernel to start swapping parts of the JVM heap to get better
> caching of storage data. Yes, with terrible performance as a result.

That's really weird, and sounds like a broken operating system. I've had
other issues with swap, but in those cases free memory was actually near
zero, and it sounds like your situation was not the same. So the OP here
might be having similar problems even if nothing's misconfigured. If so,
your solution will probably help them.

> No matter the cause, the swapping problems were "solved" by
> effectively disabling the swap (swappiness 0).

Solr certainly doesn't need (or even want) swap if the machine is sized
right. I've read some things saying that Linux doesn't behave correctly if
you completely get rid of all swap, but setting swappiness to zero sounds
like a good option. The OS would still utilize swap if it actually ran out
of physical memory, so you don't lose the safety valve that swap normally
provides.

Thanks,
Shawn
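If you want to confirm whether a box is actively swapping before changing
anything, watching vmstat for a while is enough:

vmstat 5
# sustained non-zero values in the si (swap-in) and so (swap-out)
# columns mean the kernel is actively paging to and from swap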
Re: Wildcard searches with space in TextField/StrField
You have to query text and string fields differently; that's just the way
it works. The problem is getting the query string through the parser as a
_single_ token or as multiple tokens.

Let's say you have a string field with the "a b" example. You have a single
token, "a b", that starts at offset 0. But with a text field, you have two
tokens:
a at position 0
b at position 1

When the query parser sees "a b" (without quotes) it splits it into two
tokens, and only the text field has both tokens, so the string field won't
match. OTOH, when the query parser sees "a\ b" it passes this through as a
single token, which only matches the string field, as there's no _single_
token "a b" in the text field.

But a more interesting question is why you want to search this way. String
fields are intended for keywords, machine-generated IDs and the like.
They're pretty useless for searching anything except
1> exact tokens
2> prefixes

While if you have "my dog has fleas" in a string field, you _can_ search
"*dog*" and get a hit, but the performance is poor on a large corpus.
Performance for "my*" will be pretty good though.

All in all this sounds like an XY problem; what's the use-case you're
trying to solve?

Best,
Erick

On Thu, Nov 10, 2016 at 10:11 PM, Sandeep Khanzode wrote:
> Hi Erick, Reth,
>
> The 'a\ b*' as well as the q.op=AND approach worked (successfully) only
> for StrField for me.
>
> Any attempt at creating 'a\ b*' for a TextField does not match any
> documents. The parsedQuery in debug mode does show 'field:a b*'. I am
> sure there are documents that should match.
> Another (maybe unrelated) observation: if I have 'field:a\ b', then the
> parsedQuery is field:a field:b. Which does not match as expected (matches
> individually).
>
> Can you please provide an example that I can use in the Solr Query
> dashboard? That would be helpful.
>
> I have also seen that wildcard queries work irrespective of field type,
> i.e. StrField as well as TextField. That makes sense because a
> WhitespaceTokenizer only creates word boundaries when we do not use an
> EdgeNGramFilter. If I am not wrong, that is.
>
> SRK
>
> On Friday, November 11, 2016 5:00 AM, Erick Erickson wrote:
>
> You can escape the space with a backslash as 'a\ b*'
>
> Best,
> Erick
>
> On Thu, Nov 10, 2016 at 2:37 PM, Reth RM wrote:
>> I don't think you can do wildcard on StrField. For a text field, if your
>> query is "category:(test m*)" the parsed query will be "category:test OR
>> category:m*". You can add q.op=AND to make an AND between those terms.
>>
>> For phrase-type wildcard query support, as per the docs, it is
>> ComplexPhraseQueryParser that supports it. (I haven't tested it myself.)
>>
>> https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-ComplexPhraseQueryParser
>>
>> On Thu, Nov 10, 2016 at 11:40 AM, Sandeep Khanzode <
>> sandeep_khanz...@yahoo.com.invalid> wrote:
>>
>>> Hi,
>>> How does a search like abc* work in a StrField? Since the entire thing
>>> is stored as a single token, is it a type of trie structure that allows
>>> such wildcard matching?
>>> How can searches with a space, like 'a b*', be executed for text fields
>>> (tokenized on whitespace)? If we specify this type of query, it is
>>> broken down into two queries with field:a and field:b*. I would like
>>> them to be contiguous, sort of like a phrase search with a wildcard.
>>> SRK
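For what it's worth, a concrete sketch of the ComplexPhraseQueryParser
route Reth linked (untested here; "field" stands in for the actual
tokenized text field):

q={!complexphrase inOrder=true}field:"a b*"

This parses "a b*" as a phrase whose last term is a wildcard, so it matches
documents where a token starting with b immediately follows the token a,
i.e. the contiguous match being asked for on a TextField.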
Re: Keeping faster and slower Solr slaves aligned with the same index version
Csongor:

If session locking is new to you, here is a comprehensive explanation of
the "Active - Active multi-region" scenario you're encountering and how
Netflix resolves the matter. Although I remain puzzled by a 15-minute
network transfer of non-optimized segments (or whether you are replicating
after an optimize rather than a commit, so that all files are being
shipped).

http://techblog.netflix.com/2013/12/active-active-for-multi-regional.html

regards,
will

On 11/7/2016 11:13 AM, Erick Erickson wrote:
> Not that I know of. Can you session-lock users to a particular region?
>
> Best,
> Erick
>
> On Sun, Nov 6, 2016 at 7:49 PM, Csongor Gyuricza
> wrote:
>> We have the following high-level Solr setup:
>>
>> region a) 1 Solr master + 3 slaves
>> region b) 1 Solr repeater (pointing to the master in region a) + 3 slaves
>>
>> In region (a), replication takes about 2 min from the master to the 3
>> slaves. Due to our network topology, replication from the master to the
>> repeater takes about 15 min, after which it takes another 2 min for
>> replication to occur between the repeater and the slaves in region (b).
>> So the slaves in region (b) are always 15 min behind the slaves in
>> region (a), which is a problem because all slaves are behind a
>> latency-based Route 53 record. Clients are noticing the difference
>> because they are getting inconsistent data during those 15 min.
>>
>> I would like to solve this inconsistency. Is there a way to make the
>> faster slaves in region (a) wait for all slaves in region (b) to
>> complete replication and then have all 6 slaves switch to the new index
>> simultaneously? If not, what is the alternative solution to this
>> problem?
>>
>> - Csongor
>>
>> Note: We are on Solr 3.5 (old, yes I know...)
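One option to experiment with (a sketch against the stock
ReplicationHandler HTTP API, which exists back to the 3.x line; host and
core names are placeholders): turn off automatic polling on the slaves and
trigger the fetch yourself once every tier has the new generation
available:

# one-time: stop each slave from polling on its own schedule
curl "http://slaveN:8983/solr/core/replication?command=disablepoll"

# after confirming the repeater is current, fire the fetch on all six
# slaves at (roughly) the same moment, e.g. from a central script
curl "http://slaveN:8983/solr/core/replication?command=fetchindex"

The cutover still won't be perfectly simultaneous, since each slave copies
files and opens its new searcher independently, but it shrinks the
15-minute window to roughly the duration of the slowest single fetch.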
Re: 5.5.3: fieldValueCache auto-warming error
On 10/11/16 17:10, Erick Erickson wrote:
> Just facet on the text field yourself ;)

Wish I could. This is on premises over at a client, access is difficult,
and their response time is pretty bad on public holidays and weekends. So
I'm basically twiddling my thumbs while waiting to get more log files :-)

I haven't been able to reproduce the problem locally, but there could be
any number of contributing factors that I'm missing.

> Kidding aside, this should be in the clear from the logs, my guess is
> that the first time you see an OOM error in the logs the query will be
> in the file also.

We generally prefer "fail hard fast", so I think we are running with the
OOM killer script in most environments. I don't think they've gone OOM in
this case, though something else could have gone wrong undetected. I hope
I'll know more after the weekend.
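For anyone following along: the "OOM killer script" refers to the
oom_solr.sh that the bin/solr start script wires in through a JVM flag
along these lines (port and paths are illustrative):

-XX:OnOutOfMemoryError="/opt/solr/bin/oom_solr.sh 8983 /var/solr/logs"

When the JVM throws OutOfMemoryError it runs the script, which kills the
Solr process outright so it can be restarted cleanly rather than limping
along in an undefined state.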