Re: Multiple collections vs multiple shards for multitenancy

2017-05-07 Thread Chris Troullis
Thanks for the great advice Erick. I will experiment with your suggestions
and see how it goes!

Chris

On Sun, May 7, 2017 at 12:34 AM, Erick Erickson 
wrote:

> Well, you've been doing your homework ;).
>
> bq: I am a little confused on this statement you made:
>
> > Plus you can't commit
> > individually, a commit on one will _still_ commit on all so you're
> > right back where you started.
>
> Never mind. Autocommit kicks off on a per-replica basis. IOW, when a
> new doc is indexed to a shard (really, any replica) the timer is
> started. So if replica 1_1 gets a doc and replica 2_1 doesn't, there
> is no commit on replica 2_1. My comment was mainly directed at the
> idea that you might issue commits from the client, which are
> distributed to all replicas. However, even in that case a replica
> that has received no updates won't do anything.
>
> About the hybrid approach. I've seen situations where essentially you
> partition clients along "size" lines. So something like "put clients
> on a shared single-shard collection as long as the aggregate number of
> records is < X". The theory is that the update frequency is roughly
> the same if you have 10 clients with 100K docs each .vs. one client
> with 1M docs. So the pain of opening a new searcher is roughly the
> same. "X" here is experimentally determined.
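Erick's size-based partitioning can be sketched as a simple greedy grouping (a toy illustration, not Solr code; the tenant names, the 1M-doc threshold for "X", and the greedy strategy are all assumptions to be tuned experimentally):

```python
def assign_tenants(tenants, x=1_000_000):
    """Group small tenants into shared single-shard collections and give
    large tenants dedicated collections, keeping each shared collection's
    aggregate document count under the experimentally determined X."""
    shared, dedicated = [], []
    current, current_docs = [], 0
    for name, doc_count in sorted(tenants.items(), key=lambda kv: kv[1]):
        if doc_count >= x:
            dedicated.append(name)          # big tenant: own collection
        elif current_docs + doc_count > x:
            shared.append(current)          # current shared group is full
            current, current_docs = [name], doc_count
        else:
            current.append(name)
            current_docs += doc_count
    if current:
        shared.append(current)
    return shared, dedicated

# Hypothetical tenant sizes: two small, one medium, one large.
print(assign_tenants({"a": 100_000, "b": 200_000,
                      "c": 2_000_000, "d": 900_000}))
# → ([['a', 'b'], ['d']], ['c'])
```

The idea is that a shared collection holding ten 100K-doc tenants sees roughly the same update frequency, and hence the same searcher-reopen pain, as one 1M-doc tenant.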
>
> Do note that moving from master/slave to SolrCloud will reduce
> latency. In M/S, the time it takes for a doc to become searchable is
> autocommit + polling interval + autowarm time. Going to SolrCloud
> removes the "polling interval" from the equation. Not sure how much
> that helps.
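Concretely, the worst-case document-visibility latency can be sketched like this (the interval values are made-up examples, not measurements):

```python
def worst_case_visibility(autocommit_s, autowarm_s, polling_s=0.0):
    """Worst-case seconds from indexing a doc to it being searchable:
    commit interval + slave polling interval (master/slave only)
    + time to autowarm the newly opened searcher."""
    return autocommit_s + polling_s + autowarm_s

# Hypothetical numbers: 60s autocommit, 10s autowarm, 30s slave polling.
master_slave = worst_case_visibility(60, 10, polling_s=30)
solr_cloud = worst_case_visibility(60, 10)   # polling term drops out
print(master_slave, solr_cloud)  # → 100 70
```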
>
> There should be an autowarm statistic in the Solr logs BTW. Or some
> messages about "opening searcher" (some hex stuff) and another message
> about when it's registered as active, along with timestamps. That'll
> tell you how long it takes to autowarm.
>
> OK. "straw man" strategy for your case. Create a collection per
> tenant. What you want to balance is where the collections are hosted.
> Host a number of small tenants on the same Solr instance and fewer
> larger tenants on other hardware. FWIW, I expect at least 25M docs per
> Solr JVM (very hardware dependent of course), although testing is
> important.
>
> Under the covers, each Solr instance establishes "watchers" on the
> collections it hosts. So if a particular Solr hosts replicas for, say,
> 10 collections, it establishes 10 watchers on the state.json zNode in
> Zookeeper. 300 collections isn't all that much in recent Solr
> installations. All that filtered through how beefy your hardware is of
> course.
>
> Startup is an interesting case, but I've put 1,600 replicas on 4 Solr
> instances on a Mac Pro (400 each). You can configure the number of
> startup threads if starting up is too painful.
>
> So a cluster with 300 collections isn't really straining things. Some
> of the literature is talking about thousands of collections.
>
> Good luck!
> Erick
>
> On Sat, May 6, 2017 at 4:26 PM, Chris Troullis 
> wrote:
> > Hi Erick,
> >
> > Thanks for the reply, I really appreciate it.
> >
> > To answer your questions, we have a little over 300 tenants, and a couple
> > of different collections, the largest of which has ~11 million documents
> > (so not terribly large). We are currently running standard Solr with simple
> > master/slave replication, so all of the documents are in a single solr
> > core. We are planning to move to Solr cloud for various reasons, and as
> > discussed previously, I am trying to find the best way to distribute the
> > documents to serve a more NRT focused search case.
> >
> > I totally get your point on pushing back on NRT requirements, and I have
> > done so for as long as I can. Currently our auto softcommit is set to 1
> > minute and we are able to achieve great query times with autowarming.
> > Unfortunately, due to the nature of our application, our customers expect
> > any changes they make to be visible almost immediately in search, and we
> > have recently been getting a lot of complaints in this area, leading to an
> > initiative to drive down the time it takes for documents to become visible
> > in search. Which leaves me where I am now, trying to find the right balance
> > between document visibility and reasonable, stable, query times.
> >
> > Regarding autowarming, our autowarming times aren't too crazy. We are
> > warming a max of 100 entries from the filter cache and it takes around 5-10
> > seconds to complete on average. I suspect our biggest slowdown during
> > autowarming is the static warming query that we have that runs 10+ facets
> > over the entire index. Our searches are very facet intensive, we use the
> > JSON facet API to do some decently complex faceting (block joins, etc), and
> > for whatever reason, even though we use doc values for all of our facet
> > fields, simply warming the filter cache doesn't seem to prevent a giant
> > drop-off in performance whenever a new searcher is opened. The only way I
> 

Re: Automatic conversion to Range Query

2017-05-07 Thread Erik Hatcher
Fair enough indeed.   And as you've experienced, that other functionality 
includes syntax that needs escaping.   If you're using SolrJ then there's a 
utility method to escape characters.  
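That utility is ClientUtils.escapeQueryChars. For non-SolrJ clients, here is a minimal Python sketch of the same idea (the character set is based on the Lucene query syntax; treat it as an approximation of, not a substitute for, the SolrJ method):

```python
# Characters that are special to the Lucene/Solr query parsers.
SPECIAL_CHARS = set('\\+-!():^[]"{}~*?|&;/')

def escape_query_chars(s):
    """Backslash-escape query-parser special characters and whitespace,
    mirroring what SolrJ's ClientUtils.escapeQueryChars does."""
    out = []
    for ch in s:
        if ch in SPECIAL_CHARS or ch.isspace():
            out.append('\\')
        out.append(ch)
    return ''.join(out)

print(escape_query_chars('[64GB/3GB]'))  # → \[64GB\/3GB\]
```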

Erik

> On May 6, 2017, at 20:53, Aman Deep Singh  wrote:
> 
> Hi Erik,
> We can't use dismax as we are using the other functionality of edismax
> parser
> 
> On 07-May-2017 12:13 AM, "Erik Hatcher"  wrote:
> 
> What about dismax instead of edismax?  It might do the right thing here
> without escaping.
> 
>>> On May 6, 2017, at 12:57, Shawn Heisey  wrote:
>>> 
>>> On 5/6/2017 7:09 AM, Aman Deep Singh wrote:
>>> After escaping the square bracket the query is working fine. Is there
>>> any way in the parser to avoid the automatic conversion when, as in
>>> my case, no proper range query (with the keyword TO) has been passed?
>> 
>> If you use characters special to the query parser but don't want them
>> acted on by the query parser, then they need to be escaped.  That's just
>> how things work, and it's not going to change.
>> 
>> Thanks,
>> Shawn
>> 


Re: Step By Step guide to create Solr Cloud in Solr 6.x

2017-05-07 Thread Amrit Sarkar
Following up on Erick's response,

This particular article will help with setting up Solr Cloud 6.3.0
with Zookeeper 3.4.6


Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2


Re: Slow indexing speed when collection size is large

2017-05-07 Thread Shawn Heisey
On 5/6/2017 6:49 PM, Zheng Lin Edwin Yeo wrote:
> For my rich document handling, I'm using the Extracting Request Handler, and 
> it requires OCR.
>
> However, currently, for the slow indexing speed which I'm experiencing, the 
> indexing is done directly from the Sybase database. I will fetch about 1000 
> records at a time from Sybase, and store them into a CacheRowSet for it to be 
> indexed. The query to the Sybase database is quite fast, and most of the time 
> is spent on processes in the CacheRowSet.

> A) 384 GB

> A) 22 GB

> A) 5 TB

> A) A virtual machine with Sybase database is running on the server

The discussion about the drawbacks of the Extracting Request Handler has
already taken place.  Tika should be running on separate hardware, not
embedded in Solr.  Having high-impact Tika processing run on the Solr
server is going to slow everything down.

Are the two types of indexing (ERH with OCR, and indexing from a DB)
happening on the same Solr server?

As soon as you mention virtual machines, my mental picture of the setup
becomes much less clear.  You'll need to fully describe the OS and
hardware setup, at both the hypervisor and virtual machine level.  Then
I will know what questions to ask for more detailed information.

Is Solr in a virtual machine?
Is the 384GB at the hypervisor level, or the virtual machine level?
Is the 22GB heap the total heap memory, or is that per Solr instance?

If the 5TB is Solr index data, then there's no way you're going to get
fast performance.  Putting enough memory in one machine to effectively
cache that much data is impractically expensive, and most server
hardware doesn't have enough memory slots even if you do have the
money.  384GB wouldn't be enough for 5TB of index, and that's not even
taking into account the memory needed by your software, including Solr
and Sybase.
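A rough sketch of the arithmetic behind that claim (the "other" figure is the Sybase VM from this thread; how much coverage counts as "enough" is workload-dependent):

```python
def cache_coverage(index_gb, total_ram_gb, heap_gb, other_gb=0.0):
    """Fraction of the on-disk index that fits in the OS page cache:
    RAM left after JVM heaps and other processes, divided by index size."""
    available = max(total_ram_gb - heap_gb - other_gb, 0.0)
    return available / index_gb

# Numbers from this thread: 384 GB RAM, two 22 GB Solr heaps,
# a 64 GB Sybase VM, and 5 TB (~5120 GB) of index.
coverage = cache_coverage(5120, 384, heap_gb=2 * 22, other_gb=64)
print(round(coverage, 3))  # → 0.054, i.e. only ~5% of the index is cacheable
```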

Thanks,
Shawn



Re: Automatic conversion to Range Query

2017-05-07 Thread Rick Leir
Hi Aman,
Is the user actually entering that query? It seems unlikely. Perhaps you have a 
form selector for various Apple products. Could you not have an enumerated type 
for the products, and simplify everything? I must be missing something here. 
Cheers -- Rick

On May 6, 2017 8:38:14 AM EDT, Shawn Heisey  wrote:
>On 5/5/2017 12:42 PM, Aman Deep Singh wrote:
>> Hi Erick, I don't want to do the range query; that is why I'm using
>> the pattern replace filter to replace all non-alphanumerics with
>> spaces, so that this type of situation doesn't arise, since the end
>> user can query anything. Also, in the query I haven't mentioned any
>> range-related keyword (TO). If my query is like [64GB/3GB] it works
>> fine and doesn't convert to a range query.
>
>I hope I'm headed in the right direction here.
>
>Square brackets are special characters to the query parser -- they are
>typically used to specify a range query.  It's a little odd that Solr
>would add the "TO" for you like it seems to be doing, but not REALLY
>surprising.  This would be happening *before* the parts of the query
>make it to your analysis chain where you have the pattern replace
>filter.
>
>If you want to NOT have special characters perform their special
>function, but actually become part of the query, you'll need to escape
>them with a backslash.  Escaping all the special characters in your
>query yields this query:
>
>xiomi Mi 5 \-white \[64GB\/ 3GB\]
>
>It's difficult to decide whether the dash character before "white" was
>intended as a "NOT" operator or to be part of the query.  You might not
>want to escape that one.
>
>Thanks,
>Shawn

-- 
Sorry for being brief. Alternate email is rickleir at yahoo dot com 

Re: Automatic conversion to Range Query

2017-05-07 Thread Aman Deep Singh
Yes Rick,
Users are actually typing this type of query; this was a random user query
picked from the access logs.


On 07-May-2017 7:29 PM, "Rick Leir"  wrote:

Hi Aman,
Is the user actually entering that query? It seems unlikely. Perhaps you
have a form selector for various Apple products. Could you not have an
enumerated type for the products, and simplify everything? I must be
missing something here. Cheers -- Rick

On May 6, 2017 8:38:14 AM EDT, Shawn Heisey  wrote:
>On 5/5/2017 12:42 PM, Aman Deep Singh wrote:
>> Hi Erick, I don't want to do the range query; that is why I'm using
>> the pattern replace filter to replace all non-alphanumerics with
>> spaces, so that this type of situation doesn't arise, since the end
>> user can query anything. Also, in the query I haven't mentioned any
>> range-related keyword (TO). If my query is like [64GB/3GB] it works
>> fine and doesn't convert to a range query.
>
>I hope I'm headed in the right direction here.
>
>Square brackets are special characters to the query parser -- they are
>typically used to specify a range query.  It's a little odd that Solr
>would add the "TO" for you like it seems to be doing, but not REALLY
>surprising.  This would be happening *before* the parts of the query
>make it to your analysis chain where you have the pattern replace
>filter.
>
>If you want to NOT have special characters perform their special
>function, but actually become part of the query, you'll need to escape
>them with a backslash.  Escaping all the special characters in your
>query yields this query:
>
>xiomi Mi 5 \-white \[64GB\/ 3GB\]
>
>It's difficult to decide whether the dash character before "white" was
>intended as a "NOT" operator or to be part of the query.  You might not
>want to escape that one.
>
>Thanks,
>Shawn

--
Sorry for being brief. Alternate email is rickleir at yahoo dot com


Re: JSON facet performance for aggregations

2017-05-07 Thread Yonik Seeley
OK, so I think I know what's going on.

The current code is more optimized for finding the top K buckets from
a total of N.
When one asks to return the top 10 buckets when there are potentially
millions of buckets, it makes sense to defer calculating other metrics
for those buckets until we know which ones they are.  After we
identify the top 10 buckets, we calculate the domain for that bucket
and use that to calculate the remaining metrics.

The current method is obviously much slower when one is requesting
*all* buckets.  We might as well just calculate all metrics in the
first pass rather than trying to defer them.

This inefficiency is compounded by the fact that the fields are not
indexed.  In the second phase, finding the domain for a bucket is a
field query.  For an indexed field, this would involve a single term
lookup.  For a non-indexed docValues field, this involves a full
column scan.

If you ever want to do quick lookups on studentId, it would make sense
for it to be indexed (and why is it a double, anyway?)

I'll open up a JIRA issue for the first problem (don't defer metrics
if we're going to return all buckets anyway).
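The two strategies can be contrasted with a toy sketch (pure Python, not Solr's actual implementation; the per-bucket re-scan in the second phase stands in for the per-bucket field query, which becomes a full column scan on a non-indexed docValues field):

```python
from collections import defaultdict

docs = [{"studentId": i % 5, "grades": float(i)} for i in range(100)]

def facet_single_pass(docs):
    """Compute count and sum(grades) for every bucket in one pass:
    the sensible approach when limit:-1 returns all buckets anyway."""
    buckets = defaultdict(lambda: {"count": 0, "sum": 0.0})
    for d in docs:
        b = buckets[d["studentId"]]
        b["count"] += 1
        b["sum"] += d["grades"]
    return dict(buckets)

def facet_two_phase(docs, k):
    """Phase 1: find the top-k buckets by count.  Phase 2: re-scan to
    compute the deferred metric for just those buckets.  A win when k
    is tiny relative to the total bucket count; wasted work when k
    covers every bucket."""
    counts = defaultdict(int)
    for d in docs:
        counts[d["studentId"]] += 1
    top = sorted(counts, key=counts.__getitem__, reverse=True)[:k]
    return {sid: {"count": counts[sid],
                  "sum": sum(d["grades"] for d in docs
                             if d["studentId"] == sid)}  # per-bucket re-scan
            for sid in top}

# With k covering all buckets, both strategies agree on the result;
# the two-phase version just did the extra scans for nothing.
print(facet_single_pass(docs) == facet_two_phase(docs, k=5))  # → True
```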

-Yonik


On Sun, Apr 30, 2017 at 8:58 AM, Mikhail Ibraheem
 wrote:
> Hi Yonik,
> We are using Solr 6.5
> Both studentId and grades are double:
>stored="true" docValues="true" multiValued="false" required="false"/>
>
> We have 1.5 million records.
>
> Thanks
> Mikhail
>
> -Original Message-
> From: Yonik Seeley [mailto:ysee...@gmail.com]
> Sent: Sunday, April 30, 2017 1:04 PM
> To: solr-user@lucene.apache.org
> Subject: Re: JSON facet performance for aggregations
>
> It is odd there would be quite such a big performance delta.
> What version of solr are you using?
> What is the fieldType of "grades"?
> -Yonik
>
>
> On Sun, Apr 30, 2017 at 5:15 AM, Mikhail Ibraheem 
>  wrote:
>> 1-
>> studentId has docValue = true . it is of type double which is
>> > stored="true" docValues="true" multiValued="false" required="false"/>
>>
>>
>> 2- If we just facet without aggregation it finishes in good time 60ms:
>>
>> json.facet={
>>studentId:{
>>   type:terms,
>>   limit:-1,
>>   field:" studentId "
>>
>>}
>> }
>>
>>
>> Thanks
>>
>>
>> -Original Message-
>> From: Vijay Tiwary [mailto:vijaykr.tiw...@gmail.com]
>> Sent: Sunday, April 30, 2017 10:44 AM
>> To: solr-user@lucene.apache.org
>> Subject: RE: JSON facet performance for aggregations
>>
>> Please enable doc values and try.
>> There is a bug in the source code which causes json facets on string fields to 
>> run very slowly. On numeric fields they run fine with doc values enabled.
>>
>> On Apr 30, 2017 1:41 PM, "Mikhail Ibraheem"
>> 
>> wrote:
>>
>>> Hi Vijay,
>>> It is already numeric field.
>>> It is huge difference between json and flat here. Do you know the
>>> reason for this? Is there a way to improve it ?
>>>
>>> -Original Message-
>>> From: Vijay Tiwary [mailto:vijaykr.tiw...@gmail.com]
>>> Sent: Sunday, April 30, 2017 9:58 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: JSON facet performance for aggregations
>>>
>>> Json facets on string fields run a lot slower than on numeric fields.
>>> Try and see if you can represent studentId as a numeric field.
>>>
>>> On Apr 30, 2017 1:19 PM, "Mikhail Ibraheem"
>>> 
>>> wrote:
>>>
>>> > Hi,
>>> >
>>> > I am trying to do aggregation with JSON faceting but performance is
>>> > very bad for one of the requests:
>>> >
>>> > json.facet={
>>> >
>>> >studentId:{
>>> >
>>> >   type:terms,
>>> >
>>> >   limit:-1,
>>> >
>>> >   field:"studentId",
>>> >
>>> >   facet:{
>>> >
>>> >   x:"sum(grades)"
>>> >
>>> >   }
>>> >
>>> >}
>>> >
>>> > }
>>> >
>>> >
>>> >
>>> > This request finishes in 250 seconds, and we can't paginate for
>>> > this service for functional reasons, so we have to use limit:-1, and
>>> > the cardinality of studentId is 7500.
>>> >
>>> >
>>> >
>>> > If I try the same with flat facet it finishes in 3 seconds :
>>> > stats=true&facet=true&stats.field={!tag=piv1
>>> > sum=true}grades&facet.pivot={!stats=piv1}studentId
>>> >
>>> >
>>> >
>>> > We are hoping to use one approach json or flat for all our services.
>>> > JSON facet performance is better for many case.
>>> >
>>> >
>>> >
>>> > Please advise on why the performance for this is so bad and if we
>>> > can improve it. Also what is the default algorithm used for json facet.
>>> >
>>> >
>>> >
>>> > Thanks
>>> >
>>> > Mikhail
>>> >
>>>


Fw: How to secure solr-6.2.0 in standalone mode?

2017-05-07 Thread FOTACHE CHRISTIAN
Hi
I'm using solr-6.2.0 in standalone mode and I need to set up security with 
kerberos (???) for standalone.
I have previously set up basic authentication for solr-6.1.0 but it seems that 
solr-6.2.0 has a pretty different approach when it comes to security... I can't 
make it happen. Please help.
Thank you,

Christian Fotache Tel: 0728.297.207



Re: Slow indexing speed when collection size is large

2017-05-07 Thread Zheng Lin Edwin Yeo
Hi Shawn,

Are the two types of indexing (ERH with OCR, and indexing from a DB)
happening on the same Solr server?
A) Yes, they are happening on the same Solr server, but currently, only the
indexing from a DB is running.

Is Solr in a virtual machine?
A) No

Is the 384GB at the hypervisor level, or the virtual machine level?
A) The hypervisor level. The virtual machine for the Sybase is allocated
64GB of memory.

Is the 22GB heap the total heap memory, or is that per Solr instance?
A) Per Solr instance.

It's only the Sybase database that is running on a virtual machine under
Hyper-V. Solr is running on the main server.
The main server is running on Windows 2012, while the virtual machine is
running on SUSE Linux 9. Both Solr instances are running on SSD drive,
while the virtual machine is running on normal hard disk.

What is the best suggestion for the 5TB of indexes? The searching speed is
quite fast currently, even during indexing. It is the indexing speed that
is slow.

Regards,
Edwin



On 7 May 2017 at 21:14, Shawn Heisey  wrote:

> On 5/6/2017 6:49 PM, Zheng Lin Edwin Yeo wrote:
> > For my rich document handling, I'm using the Extracting Request Handler,
> > and it requires OCR.
> >
> > However, currently, for the slow indexing speed which I'm experiencing,
> > the indexing is done directly from the Sybase database. I will fetch about
> > 1000 records at a time from Sybase, and store them into a CacheRowSet for
> > it to be indexed. The query to the Sybase database is quite fast, and most
> > of the time is spent on processes in the CacheRowSet.
> 
> > A) 384 GB
> 
> > A) 22 GB
> 
> > A) 5 TB
> 
> > A) A virtual machine with Sybase database is running on the server
>
> The discussion about the drawbacks of the Extracting Request Handler has
> already taken place.  Tika should be running on separate hardware, not
> embedded in Solr.  Having high-impact Tika processing run on the Solr
> server is going to slow everything down.
>
> Are the two types of indexing (ERH with OCR, and indexing from a DB)
> happening on the same Solr server?
>
> As soon as you mention virtual machines, my mental picture of the setup
> becomes much less clear.  You'll need to fully describe the OS and
> hardware setup, at both the hypervisor and virtual machine level.  Then
> I will know what questions to ask for more detailed information.
>
> Is Solr in a virtual machine?
> Is the 384GB at the hypervisor level, or the virtual machine level?
> Is the 22GB heap the total heap memory, or is that per Solr instance?
>
> If the 5TB is Solr index data, then there's no way you're going to get
> fast performance.  Putting enough memory in one machine to effectively
> cache that much data is impractically expensive, and most server
> hardware doesn't have enough memory slots even if you do have the
> money.  384GB wouldn't be enough for 5TB of index, and that's not even
> taking into account the memory needed by your software, including Solr
> and Sybase.
>
> Thanks,
> Shawn
>
>


Re: Search inside grouping list

2017-05-07 Thread donjose
Could anyone please reply to this query?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Search-inside-grouping-list-tp4333488p4333870.html
Sent from the Solr - User mailing list archive at Nabble.com.