Big SolrCloud cluster with a lot of collections
Hi All,

I am testing a SolrCloud with many collections. The version is 5.2.1 and I installed 3 machines, each with 4 CPU cores and 8 GB RAM. Then I created collections with 3 shards and a replication factor of 2, which gives me 2 Solr cores per collection on each machine. I reached almost 900 collections, at which point the cluster got stuck and I could not revive it. As I understand it, Solr has issues with many collections (thousands). If I use many more machines, will that let me create tens of thousands of collections, or is the limit a couple of thousand?

I want to build a cluster that will handle 10 billion documents per day (currently I have 1 billion) and keep the data for 90 days. I want to support 2000 customers, so I would like to split them into collections and also split by day (180,000 collections). If I create big collections I will have performance issues with queries, and most of the queries are for a specific customer (though I also have cross-customer queries).

How can I build an appropriate design?

Thanks a lot,
Yuri
Re: Solr relevant results
If I understood your question correctly, that's what I am suggesting to try. Notice that, as I mentioned earlier, this ignores all the complexity of similarity, ranking, etc. that Solr offers. But it does not seem you need that in your particular case, as you are just searching for the presence/absence of terms.

Regards,
Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/

On 15 August 2015 at 00:08, Brian Narsi wrote:
> I see, so basically I add another field to the schema, "CustomScore", and assign a score to it based on values in other fields. And then just order by it.
>
> Is that right?
>
> On Fri, Aug 14, 2015 at 10:58 PM, Alexandre Rafalovitch wrote:
>> Clarification: In the client that is doing the _indexing_/sending data to Solr. Not the one doing the querying.
>>
>> And a custom URP if you can't change the client and need to inject that extra code on the Solr side.
>>
>> Sorry for the extra emails.
>>
>> Regards,
>> Alex.
>>
>> On 14 August 2015 at 23:57, Alexandre Rafalovitch wrote:
>>> My suggestion was to do the mapping in the client, before you hit Solr. Or in a custom UpdateRequestProcessor. Because only your client app knows the order you want those things in. It certainly is not any kind of alphabetical order.
>>>
>>> Then, you just sort by that field and Solr does not need to care about the complicated rules. Faster that way too, as the mapping only happens once, when the document is indexed.
>>>
>>> On 14 August 2015 at 23:52, Brian Narsi wrote:
>>>> The search term is searched in Description.
>>>>
>>>> The search string is relevant in the sense that the Description of returned records must contain it. But when several records' Descriptions contain the search string, they must be ordered according to the values in Code and Prefer.
>>>>
>>>> I understand what you are saying about mapping Code to numbers. But can you help with some examples of actual Solr queries on how to do this?
>>>>
>>>> Thanks
>>>>
>>>> On Fri, Aug 14, 2015 at 2:46 PM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
>>>>> What's the search string? Or is the search string irrelevant and that's just your compulsory ordering?
>>>>>
>>>>> Assuming anything that matches has to be returned and has to fit into that order, I would frankly just map your special codes all together to some sort of 'sort order' number.
>>>>>
>>>>> So: Code=C => 4000, Code=B => 3000, Prefer=true => 100, Prefer=false => 0. Then sum it up. Or some such.
>>>>>
>>>>> Remember that fuzzy search will match even things with low probability, so a fixed sort will bring low-probability matches to the top. So either use hard non-fuzzy searches, or look at different solutions, such as buckets and top-n items within those.
>>>>>
>>>>> On 14 August 2015 at 15:10, Brian Narsi wrote:
>>>>>> In my documents there are several fields, but for example say there are three:
>>>>>>
>>>>>> Description - text - variable text
>>>>>> Code - string - always a single character
>>>>>> Prefer - boolean
>>>>>>
>>>>>> User searches on Description.
>>>>>>
>>>>>> When returning results I have to order them as follows:
>>>>>>
>>>>>> Code = C
>>>>>> Code = B
>>>>>> Code = S
>>>>>> Code = N
>>>>>> Prefer = true and Code is NULL
>>>>>> Prefer = false and Code is NULL
>>>>>> Prefer is NULL and Code is NULL
>>>>>>
>>>>>> How can this be achieved?
>>>>>>
>>>>>> Thanks in advance!
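A minimal sketch of the index-time mapping Alex describes, using SolrJ. The "SortOrder" field, the collection name, and the exact weights are illustrative assumptions, not something from the thread; the schema would need a matching int field:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexWithSortOrder {

        // Collapse the Code/Prefer business rules into one sortable number.
        // Higher number = earlier in the results.
        static int sortOrder(String code, Boolean prefer) {
            if ("C".equals(code)) return 4000;
            if ("B".equals(code)) return 3000;
            if ("S".equals(code)) return 2000;
            if ("N".equals(code)) return 1000;
            if (Boolean.TRUE.equals(prefer)) return 100;   // Prefer=true, Code NULL
            if (Boolean.FALSE.equals(prefer)) return 0;    // Prefer=false, Code NULL
            return -100;                                   // both NULL
        }

        public static void main(String[] args) throws Exception {
            SolrClient solr = new HttpSolrClient("http://localhost:8983/solr/products");
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "1");
            doc.addField("Description", "some variable text");
            doc.addField("Code", "C");
            doc.addField("SortOrder", sortOrder("C", null));
            solr.add(doc);
            solr.commit();
            solr.close();
        }
    }

Queries then just add sort=SortOrder desc (optionally with score desc as a tie-breaker), and Solr never has to evaluate the business rules at query time.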
Re: Big SolrCloud cluster with a lot of collections
1. Keep the number of collections down to the low hundreds at most. Preferably no more than a few dozen, or a hundred.
2. 8 GB is too small to be useful. 16 GB minimum.
3. If you need large numbers of machines, organize them as separate clusters.
4. Figure 100 to 200 million documents per Solr server. E.g., 1 billion documents would be 5 to 10 servers. Depending on document size and query latency requirements, the practical limit could be higher or lower.

-- Jack Krupansky

On Sat, Aug 15, 2015 at 6:22 AM, yura last wrote:
> Hi All, I am testing a SolrCloud with many collections. The version is 5.2.1 and I installed 3 machines, each with 4 CPU cores and 8 GB RAM. Then I created collections with 3 shards and a replication factor of 2, which gives me 2 Solr cores per collection on each machine. I reached almost 900 collections, at which point the cluster got stuck and I could not revive it. As I understand it, Solr has issues with many collections (thousands). If I use many more machines, will that let me create tens of thousands of collections, or is the limit a couple of thousand?
>
> I want to build a cluster that will handle 10 billion documents per day (currently I have 1 billion) and keep the data for 90 days. I want to support 2000 customers, so I would like to split them into collections and also split by day (180,000 collections). If I create big collections I will have performance issues with queries, and most of the queries are for a specific customer (though I also have cross-customer queries).
>
> How can I build an appropriate design?
>
> Thanks a lot,
> Yuri
Re: Big SolrCloud cluster with a lot of collections
yura last wrote:
> Hi All, I am testing a SolrCloud with many collections. The version is 5.2.1 and I installed 3 machines, each with 4 CPU cores and 8 GB RAM. Then I created collections with 3 shards and a replication factor of 2, which gives me 2 Solr cores per collection on each machine. I reached almost 900 collections, at which point the cluster got stuck and I could not revive it.

That mirrors what others are reporting.

> As I understand it, Solr has issues with many collections (thousands). If I use many more machines, will that let me create tens of thousands of collections, or is the limit a couple of thousand?

(Caveat: I have no real-world experience with high collection counts in Solr.)

Adding more machines will not really help you, as the problem with thousands of collections is not hardware power per se, but rather the coordination of them. You mention 180K collections below, and with the current Solr architecture I do not see that happening.

> I want to build a cluster that will handle 10 billion documents per day (currently I have 1 billion) and keep the data for 90 days.

Are those real requirements, or something somebody hopes will come true some years down the road? Technology has a habit of catching up, and while a 900-billion-document setup is a challenge today, it will probably be a lot easier in 5 years.

While we are discussing this, it would help if you could also approximate the index size in bytes: how large do you expect the sum of shards for 1 billion of your documents to be? Likewise, which kinds of queries do you expect? Grouping? Faceting? All these things multiply.

Anyway, your requirements are in a league where there is not much collective experience. You will definitely have to build a serious prototype or three to get a proper idea of how much power you need: the standard advice for scaling Solr does not make economical sense beyond a point. But you seem to have started that process already with your current tests.

> I want to support 2000 customers, so I would like to split them into collections and also split by day (180,000 collections).

As 180,000 collections currently seems infeasible for a single SolrCloud, you should consider alternatives:

1) If your collections are independent, build fully independent clusters of machines.
2) Don't use collections to divide data between your customers. Use a field with a customer ID or something like that.

> If I create big collections I will have performance issues with queries, and most of the queries are for a specific customer.

Why would many smaller collections have better performance than fewer larger collections?

> (though I also have cross-customer queries)

If you make independent setups, that could be solved by querying them independently and doing the merging yourself.

- Toke Eskildsen
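To give Toke's option 2 some shape (an editorial sketch, not from the thread; names are made up): SolrCloud's default compositeId routing co-locates each customer's documents on one shard when you prefix the document ID with the customer ID, and per-customer queries can then be restricted to that shard with the _route_ parameter:

    Indexing:  id = "cust42!doc1001"   (customer ID before the "!")
    Querying:  /solr/events/select?q=...&fq=customer_id:cust42&_route_=cust42!

Cross-customer queries simply omit _route_ and fan out to all shards. The per-day split could then become a date field plus nightly delete-by-query, or one collection per day behind an alias, rather than 90 collections per customer.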
phonetic filter factory question
The JavaDoc says that the PhoneticFilterFactory will "inject" tokens with an offset of 0 into the stream. I'm assuming this means an offset of 0 from the token being analyzed; is that right? I am trying to collapse some of my schema: I currently have a text field that I use for general-purpose text, and another field with the PhoneticFilterFactory applied for finding things that are phonetically similar. If the filter does inject at the current position, then I could likely collapse these into a single field.

As always, thanks in advance!

-Jamie
Re: phonetic filter factory question
From the "teaching to fish" category of advice (since I don't know the actual answer): did you try the "Analysis" screen in the Admin UI? If you check the "Verbose output" box, you will see all the offsets and can easily confirm the detailed behavior for yourself.

Regards,
Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/

On 15 August 2015 at 12:22, Jamie Johnson wrote:
> The JavaDoc says that the PhoneticFilterFactory will "inject" tokens with an offset of 0 into the stream. I'm assuming this means an offset of 0 from the token being analyzed; is that right? I am trying to collapse some of my schema: I currently have a text field that I use for general-purpose text, and another field with the PhoneticFilterFactory applied for finding things that are phonetically similar. If the filter does inject at the current position, then I could likely collapse these into a single field.
>
> As always, thanks in advance!
>
> -Jamie
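For reference, a single combined field would look something like this in schema.xml (a sketch; the field type name and the choice of DoubleMetaphone encoder are assumptions):

    <fieldType name="text_phonetic" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- inject="true" emits the phonetic code as an extra token at the
             same position as the original token, so exact matches still work -->
        <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="true"/>
      </analyzer>
    </fieldType>

In the Analysis screen this should show up as the phonetic token carrying a position increment of 0, i.e. sitting at the same position as the token it was derived from, which is the behavior being asked about.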
Cache for percentiles facets
Hi, I have tried various options to speed up percentile calculation for facets, but the internal Solr cache only speeds up my queries from 22 to 19 sec.

I'm using the new JSON facets: http://yonik.com/json-facet-api/

Any tips for caching stats?

-Håvard
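For context, the kind of request under discussion looks roughly like this (field and bucket names are made up; percentile() is the aggregation function from the JSON Facet API page linked above):

    curl http://localhost:8983/solr/collection/query -d 'q=*:*&
    json.facet={
      by_category : {
        type  : terms,
        field : category,
        facet : { latency : "percentile(latency_ms, 50, 95, 99)" }
      }
    }'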
Index very large number of documents from large number of clients
I am using SolrCloud.

My initial requirements are:

1) There are about 6000 clients
2) The number of documents from each client is about 500,000 (average document size is about 400 bytes)
3) I have to wipe the index/collection every night and create it anew

Any thoughts/ideas/suggestions on:

1) How to index such a large number of documents, i.e. do I use an HTTP client to send documents, is the data import handler right, or should I try uploading CSV files?

2) How many collections should I use?

3) How many shards / replicas per collection should I use?

4) Do I need multiple Solr servers?

Thanks
Re: Cache for percentiles facets
You have to provide a lot more info about your problem, including what you've tried, what your data looks like, etc. You might review: http://wiki.apache.org/solr/UsingMailingLists

Best,
Erick

On Sat, Aug 15, 2015 at 10:27 AM, Håvard Wahl Kongsgård wrote:
> Hi, I have tried various options to speed up percentile calculation for facets, but the internal Solr cache only speeds up my queries from 22 to 19 sec.
>
> I'm using the new JSON facets: http://yonik.com/json-facet-api/
>
> Any tips for caching stats?
>
> -Håvard
Re: Index very large number of documents from large number of clients
This is beyond my direct area of expertise, but one way to look at this would be:

1) Create the new collections offline, down to each of the 6000 clients having its own private collection (embedded SolrJ/server), or some sort of mini-hubs, e.g. a server per N clients.
2) Bring those collections into the central server.
3) Update the alias that used to point to the previous collection set so that it points to the new one: https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-CreateormodifyanAliasforaCollection (see the example after this message)
4) Delete the old collection set, as nothing points at it anymore.

Now, I don't know how that would play with shards/replicas.

Regards,
Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/

On 15 August 2015 at 16:03, Troy Edwards wrote:
> I am using SolrCloud.
>
> My initial requirements are:
>
> 1) There are about 6000 clients
> 2) The number of documents from each client is about 500,000 (average document size is about 400 bytes)
> 3) I have to wipe the index/collection every night and create it anew
>
> Any thoughts/ideas/suggestions on:
>
> 1) How to index such a large number of documents, i.e. do I use an HTTP client to send documents, is the data import handler right, or should I try uploading CSV files?
>
> 2) How many collections should I use?
>
> 3) How many shards / replicas per collection should I use?
>
> 4) Do I need multiple Solr servers?
>
> Thanks
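Concretely, step 3 is a single Collections API call, so the switch is atomic from the query clients' point of view (alias and collection names below are made up):

    # point the "live" alias at the freshly built collection:
    http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=live&collections=docs_20150816

    # queries keep hitting /solr/live/select?q=... throughout

    # then drop yesterday's collection:
    http://localhost:8983/solr/admin/collections?action=DELETE&name=docs_20150815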
Admin Login
I'm somewhat puzzled that there is no built-in security. I can't imagine anybody is running a public-facing Solr server with the admin page wide open?

I've searched and haven't found any solutions that work out of the box.

I've tried the solutions here, to no avail: https://wiki.apache.org/solr/SolrSecurity

and here: http://wiki.eclipse.org/Jetty/Tutorial/Realms

The Solr security docs say to use the application server, and if I could run it on my Tomcat server I would already be done. But I'm told I can't do that?

What solutions are people using?

Scott

--
Leave no stone unturned.
Euripides
Re: Admin Login
No one runs a public-facing Solr server. Just like no one runs a public-facing MySQL server.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

On Aug 15, 2015, at 4:15 PM, Scott Derrick wrote:
> I'm somewhat puzzled that there is no built-in security. I can't imagine anybody is running a public-facing Solr server with the admin page wide open?
>
> I've searched and haven't found any solutions that work out of the box.
>
> I've tried the solutions here, to no avail: https://wiki.apache.org/solr/SolrSecurity
>
> and here: http://wiki.eclipse.org/Jetty/Tutorial/Realms
>
> The Solr security docs say to use the application server, and if I could run it on my Tomcat server I would already be done. But I'm told I can't do that?
>
> What solutions are people using?
>
> Scott
>
> --
> Leave no stone unturned.
> Euripides
Re: Admin Login
Walter,

actually that explains it perfectly! I will move it behind my Apache server...

thanks,

Scott

On 8/15/2015 6:15 PM, Walter Underwood wrote:
> No one runs a public-facing Solr server. Just like no one runs a public-facing MySQL server.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
> On Aug 15, 2015, at 4:15 PM, Scott Derrick wrote:
>> I'm somewhat puzzled that there is no built-in security. I can't imagine anybody is running a public-facing Solr server with the admin page wide open?
>>
>> What solutions are people using?
>>
>> Scott
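A minimal sketch of the "behind Apache" setup Scott mentions (paths, port, and realm name are assumptions; requires mod_proxy and mod_proxy_http, and Solr should listen on localhost only):

    <Location "/solr">
        # HTTP basic auth in front of everything, including the admin UI
        AuthType Basic
        AuthName "Solr"
        AuthUserFile /etc/apache2/solr.htpasswd
        Require valid-user

        # forward authenticated requests to the local Solr instance
        ProxyPass http://127.0.0.1:8983/solr
        ProxyPassReverse http://127.0.0.1:8983/solr
    </Location>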
Re: Index very large number of documents from large number of clients
On 8/15/2015 2:03 PM, Troy Edwards wrote:
> I am using SolrCloud.
>
> My initial requirements are:
>
> 1) There are about 6000 clients
> 2) The number of documents from each client is about 500,000 (average document size is about 400 bytes)
> 3) I have to wipe the index/collection every night and create it anew
>
> Any thoughts/ideas/suggestions on:
>
> 1) How to index such a large number of documents, i.e. do I use an HTTP client to send documents, is the data import handler right, or should I try uploading CSV files?

This is general info only.

6000 clients, each with half a million docs? That's 3 billion docs. There are some users who have more, but this is squarely in the realm of a HUGE install.

> 2) How many collections should I use?
>
> 3) How many shards / replicas per collection should I use?

Any answer we came up with for those two questions would involve quite a few assumptions, any one of which could be wrong. The only way to really find out what you need is to set up a prototype system and test it with real data, real indexing requests, and real queries. Record the results of the tests, change the configuration, rebuild the index(es), and run the tests again.

The number one rule when it comes to Solr performance: install enough memory so that all the index data on the server will fit in the available OS disk cache RAM. You're going to have a lot of index data.

https://lucidworks.com/blog/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

https://wiki.apache.org/solr/SolrPerformanceProblems

When the number of collections reaches the low hundreds, SolrCloud stability begins to suffer because of how much interaction with ZooKeeper is required for even small cluster changes. When there are thousands of collections, any little problem turns into a nightmare. Adding more machines doesn't help this particular problem. Some ideas are being discussed to make this better, but users won't see the results of that effort until version 5.4 or 5.5, possibly later.

> 4) Do I need multiple Solr servers?

You would need multiple servers for any hope of redundancy, but the answer to the question I think you're trying to ask here is yes. Definitely. Possibly a LOT of them.

Thanks,
Shawn
Re: Admin Login
Scott:

You better not even let them access Solr directly: http://server:port/solr/admin/collections?action=DELETE&name=collection. Try it sometime on a collection that's not important ;)

But as Walter said, that'd be similar to allowing end users unrestricted access to a SQL database; that Solr URL is akin to "drop database". Or, if you've locked down the admin stuff: http://solr:port/solr/collection/update?commit=true&stream.body=<delete><query>*:*</query></delete>

Best,
Erick

On Sat, Aug 15, 2015 at 6:57 PM, Scott Derrick wrote:
> Walter,
>
> actually that explains it perfectly! I will move it behind my Apache server...
>
> thanks,
>
> Scott
>
> On 8/15/2015 6:15 PM, Walter Underwood wrote:
>> No one runs a public-facing Solr server. Just like no one runs a public-facing MySQL server.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/ (my blog)
>>
>> On Aug 15, 2015, at 4:15 PM, Scott Derrick wrote:
>>> I'm somewhat puzzled that there is no built-in security. I can't imagine anybody is running a public-facing Solr server with the admin page wide open?
>>>
>>> I've searched and haven't found any solutions that work out of the box.
>>>
>>> I've tried the solutions here, to no avail: https://wiki.apache.org/solr/SolrSecurity
>>>
>>> and here: http://wiki.eclipse.org/Jetty/Tutorial/Realms
>>>
>>> The Solr security docs say to use the application server, and if I could run it on my Tomcat server I would already be done. But I'm told I can't do that?
>>>
>>> What solutions are people using?
>>>
>>> Scott
>>>
>>> --
>>> Leave no stone unturned.
>>> Euripides
Re: Index very large number of documents from large number of clients
Piling on here. At the scale you're talking about, I suspect you'll not only have a bunch of servers, you'll really have a bunch of completely separate "Solr Clouds", complete with their own ZooKeepers etc. Partly for administration's sake, partly for stability, etc. Not sure that'll be true, mind you, but a "divide and conquer" approach seems in order.

And to be clear, the multiple clusters are NOT because of 3 billion docs; I've certainly seen that number of docs fit on 10 shards when the records are as small as yours are. OTOH, I've seen it take 30 or 60 shards, but that's usually for complex documents. As Shawn says, prototyping is the only way to be sure. It's because if you choose to have 6,000 _collections_, you'll need some kind of divisions.

Now, if you can create a smaller number of collections and have, say, a collection ID in each doc, you can simply add an fq=collectionID:... clause to each query, and that'll show you only the docs belonging to that collection. This could be significantly simpler than maintaining 6,000 collections.

Best,
Erick

On Sat, Aug 15, 2015 at 8:40 PM, Shawn Heisey wrote:
> On 8/15/2015 2:03 PM, Troy Edwards wrote:
>> 1) There are about 6000 clients
>> 2) The number of documents from each client is about 500,000 (average document size is about 400 bytes)
>> 3) I have to wipe the index/collection every night and create it anew
>
> This is general info only.
>
> 6000 clients, each with half a million docs? That's 3 billion docs. There are some users who have more, but this is squarely in the realm of a HUGE install.
>
> The only way to really find out what you need is to set up a prototype system and test it with real data, real indexing requests, and real queries.
>
> The number one rule when it comes to Solr performance: install enough memory so that all the index data on the server will fit in the available OS disk cache RAM. You're going to have a lot of index data.
>
> When the number of collections reaches the low hundreds, SolrCloud stability begins to suffer because of how much interaction with ZooKeeper is required for even small cluster changes. When there are thousands of collections, any little problem turns into a nightmare. Adding more machines doesn't help this particular problem.
>
> You would need multiple servers for any hope of redundancy. Definitely. Possibly a LOT of them.
>
> Thanks,
> Shawn
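A sketch of the fq approach Erick describes, in SolrJ (collection name, ZooKeeper addresses, and field names are assumptions):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class ClientScopedQuery {
        public static void main(String[] args) throws Exception {
            CloudSolrClient solr = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
            solr.setDefaultCollection("docs");

            SolrQuery q = new SolrQuery("description:widget");
            // Restrict results to one client's documents; filter queries are
            // cached separately, so repeated per-client queries stay cheap.
            q.addFilterQuery("client_id:12345");

            QueryResponse rsp = solr.query(q);
            System.out.println("hits: " + rsp.getResults().getNumFound());
            solr.close();
        }
    }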
Re: Index very large number of documents from large number of clients
Troy Edwards wrote:
> 1) There are about 6000 clients
> 2) The number of documents from each client is about 500,000 (average document size is about 400 bytes)

So roughly 3 billion documents / 1 TB index size. That means at least 2 shards, due to the 2-billion-documents-per-index limit in Lucene. If you want more advice than that, you will have to describe how the setup is to be used:

- How many requests per second?
- What does a typical query look like?
- How low do the response times need to be?

> 3) I have to wipe the index/collection every night and create it anew

Let's say you have 4 hours to do that. That's about 200K documents/second you need to index. That is a high number, and with such tiny documents I suspect that logistics might take up the largest part of it. This might call for multiple independent setups.

> 1) How to index such a large number of documents, i.e. do I use an HTTP client to send documents, is the data import handler right, or should I try uploading CSV files?

As the overhead of constructing and parsing XML documents is not trivial, CSV seems reasonable. Probably also DIH.

> 2) How many collections should I use?

Not 6000 in a single SolrCloud.

> 3) How many shards / replicas per collection should I use?
> 4) Do I need multiple Solr servers?

Not enough data about index usage to say. Between 1 and 50, not kidding.

- Toke Eskildsen
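To give the CSV option some shape (file, collection, and batch names are assumptions), SolrJ can stream a CSV file to the update handler; at the rates Toke computes you would run many such loaders in parallel, ideally one per shard leader:

    import java.io.File;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

    public class CsvBulkLoad {
        public static void main(String[] args) throws Exception {
            SolrClient solr = new HttpSolrClient("http://localhost:8983/solr/docs");

            // Stream one CSV batch; the text/csv content type selects CSV parsing.
            ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update");
            req.addFile(new File("clients-batch-001.csv"), "text/csv");
            req.process(solr);

            solr.commit();  // in a real load, commit once at the end, not per batch
            solr.close();
        }
    }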