ApacheCon North America 2018 schedule is now live.
Dear Apache Enthusiast,

We are pleased to announce our schedule for ApacheCon North America 2018. ApacheCon will be held September 23-27 at the Montreal Marriott Chateau Champlain in Montreal, Canada.

Registration is open! The early bird rate of $575 lasts until July 21, at which time it goes up to $800. And the room block at the Marriott ($225 CAD per night, including wifi) closes on August 24th.

We will be featuring more than 100 sessions on Apache projects. The schedule is now online at https://apachecon.com/acna18/

The schedule includes full tracks of content from CloudStack[1], Tomcat[2], and our GeoSpatial community[3].

We will have four keynote speakers: two Apache members and two from the wider community. On Tuesday, Apache member and former board member Cliff Schmidt will be speaking about how Amplio uses technology to educate and improve the quality of life of people living in very difficult parts of the world[4]. And Apache Fineract VP Myrle Krantz will speak about how open source banking is helping the global fight against poverty[5]. Then, on Wednesday, we'll hear from Bridget Kromhout, Principal Cloud Developer Advocate at Microsoft, about the really hard problem in software - the people[6]. And Euan McLeod, VP VIPER at Comcast, will show us the many ways that Apache software delivers your favorite shows to your living room[7].

ApacheCon will also feature old favorites like the Lightning Talks, the Hackathon (running the duration of the event), PGP key signing, and lots of hallway-track time to get to know your project community better.

Follow us on Twitter, @ApacheCon, and join the disc...@apachecon.com mailing list (send email to discuss-subscr...@apachecon.com) to stay up to date with developments. And if your company wants to sponsor this event, get in touch at h...@apachecon.com for opportunities that are still available.

See you in Montreal!

Rich Bowen
VP Conferences, The Apache Software Foundation
h...@apachecon.com
@ApacheCon

[1] http://cloudstackcollab.org/
[2] http://tomcat.apache.org/conference.html
[3] http://apachecon.dukecon.org/acna/2018/#/schedule?search=geospatial
[4] http://apachecon.dukecon.org/acna/2018/#/scheduledEvent/df977fd305a31b903
[5] http://apachecon.dukecon.org/acna/2018/#/scheduledEvent/22c6c30412a3828d6
[6] http://apachecon.dukecon.org/acna/2018/#/scheduledEvent/fbbb2384fa91ebc6b
[7] http://apachecon.dukecon.org/acna/2018/#/scheduledEvent/88d50c3613852c2de
Re: Load Balancing between Two Cloud Clusters
Thank you, Erick. This is exactly the information I needed but hadn't correctly parsed as a new SolrCloud user. You've just made setting up our new configuration much easier!!

Monica Skidmore
Senior Software Engineer

On 4/30/18, 7:29 PM, "Erick Erickson" wrote:

    "We need a way to determine that a node is still 'alive' and should be in the load balancer, and we need a way to know that a new node is now available and fully ready with its replicas to add to the load balancer."

    Why? If a Solr node is running but the replicas aren't up yet, it'll pass the request along to a node that _does_ have live replicas; you don't have to do anything. As far as the node being alive, there are lots of ways -- any API endpoint has to have a Solr node to field it, so perhaps just use the Collections API LIST command?

    "How does ZooKeeper make this determination? Does it do something different if multiple collections are on a single cluster? And, even with just one cluster, what is best practice for keeping a current list of active nodes in the cluster, especially for extremely high query rates?"

    This is a common misconception. ZooKeeper isn't interested in Solr at all. ZooKeeper will ping the nodes it knows about and, perhaps, remove a node from the live_nodes list, but that's all. It isn't involved in Solr's operation in terms of routing queries, updates or anything like that.

    _Solr_ keeps track of all this by _watching_ various znodes. Say Solr hosts some replica in a collection. When it comes up, it sets a "watch" on the /collections/my_collection/state.json znode. It also publishes its own state. So say it hosts three replicas for the collection: as each one is loaded and ready for action, Solr posts an update to the relevant state.json file.

    ZooKeeper is then responsible for telling any other node that has set a watch that the znode has changed. ZK doesn't know or care whether those are Solr nodes or not.

    So when a request comes in to a Solr node, it knows which other Solr nodes host which particular replicas and does all the sub-requests itself; ZK isn't involved at all at that level.

    So imagine node1 hosts S1R1 and S2R1, and node2 hosts S1R2 and S2R2 (for collection A). When node1 comes up, it updates the state in ZK to say S1R1 and S2R1 are "active". Now say node2 is coming up but hasn't loaded its cores yet. If it receives a request, it can forward it on to node1.

    Now node2 loads both its cores. It updates the ZK node for the collection, and since node1 is watching, it fetches the updated state.json. From this point forward, both nodes have complete information about all the replicas in the collection and don't need to reference ZK any more at all.

    In fact, ZK can completely go away and _queries_ can continue to work off their cached state.json. Updates will fail, since ZK quorums are required for updates to indexes to prevent "split brain" problems.

    Best,
    Erick

    On Mon, Apr 30, 2018 at 11:03 AM, Monica Skidmore wrote:
    > Thank you, Erick. That confirms our understanding for a single cluster, or once we select a node from one of the two clusters to query.
    >
    > As we try to set up an external load balancer to go between two clusters, though, we still have some questions. We need a way to determine that a node is still 'alive' and should be in the load balancer, and we need a way to know that a new node is now available and fully ready with its replicas to add to the load balancer.
    >
    > How does ZooKeeper make this determination? Does it do something different if multiple collections are on a single cluster? And, even with just one cluster, what is best practice for keeping a current list of active nodes in the cluster, especially for extremely high query rates?
    >
    > Again, if there's some good documentation on this, I'd love a pointer...
    >
    > Monica Skidmore
    > Senior Software Engineer
    >
    > On 4/30/18, 1:09 PM, "Erick Erickson" wrote:
    >
    >     Multiple clusters with the same dataset aren't load-balanced by Solr; you'll have to accomplish that from "outside", e.g. something that sends queries to each cluster.
    >
    >     _Within_ a cluster (collection), as long as a request gets to any Solr node, sub-requests are distributed with an internal software LB. As far as a single collection goes, you're fine just sending any query to any node. Even if you send a query to a node that hosts no replicas for a collection, Solr will "do the right thing" and forward it appropriately.
    >
    >     HTH,
    >     Erick
    >
    >     On Mon, Apr 30, 2018 at 9:46 AM, Monica Skidmore <monica.skidm...@careerbuilder.com> wrote:
    >
    >     > We are mig
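A note on wiring this up: the liveness probe Erick suggests can be an ordinary Collections API call made by the load balancer's health check. A minimal sketch, assuming the default port and a placeholder host name of solr-node:

curl "http://solr-node:8983/solr/admin/collections?action=LIST&wt=json"

# CLUSTERSTATUS additionally reports per-replica state and the live_nodes
# list, useful if "up" should mean "replicas loaded" rather than just
# "HTTP answering"
curl "http://solr-node:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json"

Any node that is up will answer either call; a non-200 response or a timeout is the signal to pull the node from the LB.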
Error when indexing against a specific dynamic field type
Hello,

We are migrating from Solr 4.7 to 7.3. When I index a data item that matches a custom dynamic field from our 4.7 schema:

<dynamicField name="*_tsing" type="alphaOnlySort" indexed="true" stored="true" multiValued="false"/>

I get the following exception:

Exception writing document id FULL_36265 to the index; possible analysis error: Document contains at least one immense term in field="gridFacts_tsing" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[108, 111, 114, 101, 109, 32, 105, 112, 115, 117, 109, 32, 100, 111, 108, 111, 114, 32, 115, 105, 116, 32, 97, 109, 101, 116, 44, 32, 99, 111]...', original message: bytes can be at most 32766 in length; got 68144.

Any ideas are greatly appreciated. Thank you.
RE: 7.3 appears to leak
Mạnh, Shalin,

I tried to reproduce it locally but failed; it is not just a stream of queries and frequent updates/commits. We will temporarily abuse a production machine to run 7.3 and a control machine on 7.2 to rule some things out.

We have plenty of custom plugins, so when I can reproduce it again, we can rule stuff out and hopefully get back to you guys!

Thanks,
Markus

-----Original message-----
> From: Đạt Cao Mạnh
> Sent: Monday 30th April 2018 4:07
> To: solr-user@lucene.apache.org
> Subject: Re: 7.3 appears to leak
>
> Hi Markus,
>
> I tried indexing documents and querying them with queries and filter queries, but could not find any leak problems. Can you give us more information about the leak?
>
> Thanks!
>
> On Fri, Apr 27, 2018 at 5:11 PM Shalin Shekhar Mangar <shalinman...@gmail.com> wrote:
>
> > Hi Markus,
> >
> > Can you give an idea of what your filter queries look like? Any custom plugins or things we should be aware of? Simply indexing artificial docs, querying and committing doesn't seem to reproduce the issue for me.
> >
> > On Thu, Apr 26, 2018 at 10:13 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> >
> > > Hello,
> > >
> > > We just finished upgrading our three separate clusters from 7.2.1 to 7.3, which went fine, except for our main text search collection -- it appears to leak memory on commit!
> > >
> > > After the initial upgrade we saw the cluster slowly starting to run out of memory within about an hour and a half. We increased the heap in case 7.3 just requires more of it, but the heap consumption graph is still growing on each commit. Heap space cannot be reclaimed by forcing the garbage collector to run; everything just piles up in the OldGen. Running with this slightly larger heap, the first nodes will run out of memory about two and a half hours after cluster restart.
> > >
> > > The heap-eating cluster is a 2-shard/3-replica system on separate nodes. Each replica is about 50 GB in size with about 8.5 million documents. On 7.2.1 it ran fine with just a 2 GB heap. With 7.3 and a 2.5 GB heap, it will take just a little longer to run out of memory.
> > >
> > > I inspected reports shown by the sampler of VisualVM and spotted one peculiarity: the number of instances of SortedIntDocSet kept growing on each commit, by about the same amount as the number of cached filter queries. But this doesn't happen on the logs cluster; SortedIntDocSet instances are neatly collected there. The number of instances also accounts for the number of commits since start-up times the cache sizes.
> > >
> > > Our other two clusters don't have this problem. One of them receives very few commits per day, but the other receives data all the time -- it logs user interactions, so a large amount of data is coming in constantly. I cannot reproduce it locally by indexing data and committing all the time; the peak usage in OldGen stays about the same. But I can reproduce it locally when I introduce queries and filter queries while indexing pieces of data and committing it.
> > >
> > > So, what is the problem? I dug in the CHANGES.txt of both Lucene and Solr, but nothing really caught my attention. Does anyone here have an idea where to look?
> > >
> > > Many thanks,
> > > Markus
> >
> > --
> > Regards,
> > Shalin Shekhar Mangar.
SolrCloud Heterogenous Hardware setup
Hi,

We are building a SolrCloud setup which will index time-series data. Being time-series data with write-once semantics, we are planning to have multiple collections, i.e. one collection per month. As per our use case, end users should be able to query across the last 12 months' worth of data, which means 12 collections (one collection per month). To achieve this, we are planning to leverage Solr collection aliasing, such that the search_alias collection will point to the 12 collections and indexing will always happen to the latest collection.

As it's write-once data, the question I have is whether it is possible to have two different hardware profiles within the SolrCloud cluster, such that all the older collections (being read-only) are stored on the lower hardware spec, while the latest collection (being write-heavy) is stored only on the higher-spec machines.

- Is it possible to configure a collection such that the collection data is only stored on a few nodes in the SolrCloud setup?
- If this is possible, at the end of each month, what is the approach to be taken to "move" the latest collection from higher-spec hardware machines to the lower-spec ones?

TIA.
Re: SolrCloud Heterogenous Hardware setup
"Is it possible to configure a collection such that the collection data is only stored on few nodes in the SolrCloud setup?" Yes. There are "node placement rules", but also you can create a collection with a createNodeSet that specifies the nodes that the replicas are placed on. " If this is possible, at the end of each month, what is the approach to be taken to “move” the latest collection from higher-spec hardware machines to the lower-spec ones?" There are a bunch of ways, in order of how long they've been around (check your version). All of these are COLLECTIONS API calls. - ADDREPLICA/DELETEREPLCIA - MOVEREPLICA - REPLACENODE The other thing you may wan to look at is that David Smiley has been working on timeseries support in Solr, but that's quite recent so may not be available in whatever version you're using. Nor do I know enough details a about it to know how (or if) it it supported the heterogeneous setup you're talking about. Check CHANGES.txt. Best, Erick On Tue, May 1, 2018 at 7:59 AM, Greenhorn Techie wrote: > Hi, > > We are building a SolrCloud setup, which will index time-series data. Being > time-series data with write-once semantics, we are planning to have > multiple collections i.e. one collection per month. As per our use case, > end users should be able to query across last 12 months worth of data, > which means 12 collections (with one collection per month). To achieve > this, we are planning to leverage Solr collection aliasing such that the > search_alias collection will point to the 12 collections and indexing will > always happen to the latest collection. > > As its write-once kind of data, the question I have is whether it is > possible to have two different hardware profiles within the SolrCloud > cluster such that all the older collections (being read-only) will be > stored on the lower hardware spec, while the latest collection (being write > heavy) will be stored only on the higher hardware profile machines. > >- Is it possible to configure a collection such that the collection data >is only stored on few nodes in the SolrCloud setup? >- If this is possible, at the end of each month, what is the approach to >be taken to “move” the latest collection from higher-spec hardware machines >to the lower-spec ones? > > TIA.
Re: Error when indexing against a specific dynamic field type
You're sending it a huge term. My guess is you're sending something like base64-encoded data, or perhaps just a single unbroken string, in your field.

Examine your document; it should jump out at you.

Best,
Erick

On Tue, May 1, 2018 at 7:40 AM, THADC wrote:
> Hello,
>
> We are migrating from Solr 4.7 to 7.3. When I index a data item that matches a custom dynamic field from our 4.7 schema:
>
> <dynamicField name="*_tsing" type="alphaOnlySort" indexed="true" stored="true" multiValued="false"/>
>
> I get the following exception:
>
> Exception writing document id FULL_36265 to the index; possible analysis error: Document contains at least one immense term in field="gridFacts_tsing" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[108, 111, 114, 101, 109, 32, 105, 112, 115, 117, 109, 32, 100, 111, 108, 111, 114, 32, 115, 105, 116, 32, 97, 109, 101, 116, 44, 32, 99, 111]...', original message: bytes can be at most 32766 in length; got 68144.
>
> Any ideas are greatly appreciated. Thank you.
User queries end up in filterCache if facetting is enabled
Hello,

We noticed the number of entries in the filterCache to be higher than we expected. Using showItems="1024", something unexpected was listed as entries of the filterCache: the complete Query.toString() of our user queries -- massive entries, a lot of them.

We also spotted entries for the fields we facet on, even though we don't use them as filters, but that is caused by facet.method=enum and should be expected, right?

Now, the user query entries are not expected. In the simplest setup, searching for something and only enabling the facet engine with facet=true causes the query to appear in the cache as an entry. The following queries:

http://localhost:8983/solr/search/select?q=content_nl:nog&facet=true
http://localhost:8983/solr/search/select?q=*:*&facet=true

become listed as:

CACHE.searcher.filterCache.item_*:*: org.apache.solr.search.BitDocSet@70051ee0
CACHE.searcher.filterCache.item_content_nl:nog: org.apache.solr.search.BitDocSet@13150cf6

This is on 7.3, but 7.2.1 does this as well.

So, should I expect this? Can I disable this? Bug?

Thanks,
Markus
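For reference, the cache being inspected here is defined in solrconfig.xml; a typical definition with the showItems attribute mentioned above looks roughly like this (class and sizes are the stock example values, not a recommendation):

<filterCache class="solr.FastLRUCache"
             size="512"
             initialSize="512"
             autowarmCount="0"
             showItems="1024"/>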
Re: Load Balancing between Two Cloud Clusters
Glad to help. Yeah, I thought you might have been making it harder than it needed to be ;).

In SolrCloud you're constantly running up against "it's just magic until it's not"; knowing when the magic applies and when it doesn't can be tricky, very tricky.

Basically, when using LBs, people just throw nodes at the LB when they come up. If the Solr endpoints aren't available, then they're skipped, etc.

I'll also add that SolrJ -- the CloudSolrClient specifically -- does all this on the client side. It's ZK-aware, so it knows the topology of the active Solr nodes and "does the right thing" via internal LBs. (A minimal sketch of that client follows this thread.)

Best,
Erick

On Tue, May 1, 2018 at 6:41 AM, Monica Skidmore wrote:
> Thank you, Erick. This is exactly the information I needed but hadn't correctly parsed as a new SolrCloud user. You've just made setting up our new configuration much easier!!
>
> Monica Skidmore
> Senior Software Engineer
>
> On 4/30/18, 7:29 PM, "Erick Erickson" wrote:
>
>     "We need a way to determine that a node is still 'alive' and should be in the load balancer, and we need a way to know that a new node is now available and fully ready with its replicas to add to the load balancer."
>
>     Why? If a Solr node is running but the replicas aren't up yet, it'll pass the request along to a node that _does_ have live replicas; you don't have to do anything. As far as the node being alive, there are lots of ways -- any API endpoint has to have a Solr node to field it, so perhaps just use the Collections API LIST command?
>
>     "How does ZooKeeper make this determination? Does it do something different if multiple collections are on a single cluster? And, even with just one cluster, what is best practice for keeping a current list of active nodes in the cluster, especially for extremely high query rates?"
>
>     This is a common misconception. ZooKeeper isn't interested in Solr at all. ZooKeeper will ping the nodes it knows about and, perhaps, remove a node from the live_nodes list, but that's all. It isn't involved in Solr's operation in terms of routing queries, updates or anything like that.
>
>     _Solr_ keeps track of all this by _watching_ various znodes. Say Solr hosts some replica in a collection. When it comes up, it sets a "watch" on the /collections/my_collection/state.json znode. It also publishes its own state. So say it hosts three replicas for the collection: as each one is loaded and ready for action, Solr posts an update to the relevant state.json file.
>
>     ZooKeeper is then responsible for telling any other node that has set a watch that the znode has changed. ZK doesn't know or care whether those are Solr nodes or not.
>
>     So when a request comes in to a Solr node, it knows which other Solr nodes host which particular replicas and does all the sub-requests itself; ZK isn't involved at all at that level.
>
>     So imagine node1 hosts S1R1 and S2R1, and node2 hosts S1R2 and S2R2 (for collection A). When node1 comes up, it updates the state in ZK to say S1R1 and S2R1 are "active". Now say node2 is coming up but hasn't loaded its cores yet. If it receives a request, it can forward it on to node1.
>
>     Now node2 loads both its cores. It updates the ZK node for the collection, and since node1 is watching, it fetches the updated state.json. From this point forward, both nodes have complete information about all the replicas in the collection and don't need to reference ZK any more at all.
>
>     In fact, ZK can completely go away and _queries_ can continue to work off their cached state.json. Updates will fail, since ZK quorums are required for updates to indexes to prevent "split brain" problems.
>
>     Best,
>     Erick
>
>     On Mon, Apr 30, 2018 at 11:03 AM, Monica Skidmore wrote:
>     > Thank you, Erick. That confirms our understanding for a single cluster, or once we select a node from one of the two clusters to query.
>     >
>     > As we try to set up an external load balancer to go between two clusters, though, we still have some questions. We need a way to determine that a node is still 'alive' and should be in the load balancer, and we need a way to know that a new node is now available and fully ready with its replicas to add to the load balancer.
>     >
>     > How does ZooKeeper make this determination? Does it do something different if multiple collections are on a single cluster? And, even with just one cluster, what is best practice for keeping a current list of active nodes in the cluster, especially for extremely high query rates?
>     >
>     > Again, if there's some good documentation on this, I'd love a pointer...
>     >
>     > Monica Skidmore
>     > Senior Software Engineer
>     >
>     > On 4/30/18, 1:09 PM, "Erick Erickson" wrote:
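A minimal SolrJ sketch of the ZK-aware client Erick mentions. The ZooKeeper hosts and the collection name are placeholders, and the builder shown is the SolrJ 7.x one:

import java.util.Arrays;
import java.util.Optional;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CloudQueryExample {
    public static void main(String[] args) throws Exception {
        // The client reads cluster state from ZooKeeper and routes requests
        // to live replicas itself, so no external load balancer is needed.
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                Arrays.asList("zk1:2181", "zk2:2181", "zk3:2181"),
                Optional.empty())   // ZK chroot, if you use one
                .build()) {
            client.setDefaultCollection("mycollection");
            QueryResponse rsp = client.query(new SolrQuery("*:*"));
            System.out.println("hits: " + rsp.getResults().getNumFound());
        }
    }
}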
Re: Error when indexing against a specific dynamic field type
The input in the error message starts "lorem ipsum", so it contains spaces, but the alphaOnlySort field type (in Solr's example schemas, anyway) uses KeywordTokenizer, which tokenizes the entire input as a single token.

As Erick implied, you probably should not be doing that with this kind of data - perhaps the analyzer used by this dynamic field should change? Alternatively, you could:

a) truncate long values so that a prefix makes it through the indexing process, e.g. by adding TruncateTokenFilterFactory[1] to alphaOnlySort's analyzer, or by adding TruncateFieldUpdateProcessorFactory[2] to your update request processor chain; or
b) entirely eliminate overly long values, e.g. using LengthFilterFactory[3].

[1] https://lucene.apache.org/core/7_3_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/TruncateTokenFilterFactory.html
[2] https://lucene.apache.org/solr/7_3_0/solr-core/org/apache/solr/update/processor/TruncateFieldUpdateProcessorFactory.html
[3] https://lucene.apache.org/solr/guide/7_3/filter-descriptions.html#length-filter

--
Steve
www.lucidworks.com

> On May 1, 2018, at 11:28 AM, Erick Erickson wrote:
>
> You're sending it a huge term. My guess is you're sending something like base64-encoded data, or perhaps just a single unbroken string, in your field.
>
> Examine your document; it should jump out at you.
>
> Best,
> Erick
>
> On Tue, May 1, 2018 at 7:40 AM, THADC wrote:
>> Hello,
>>
>> We are migrating from Solr 4.7 to 7.3. When I index a data item that matches a custom dynamic field from our 4.7 schema:
>>
>> <dynamicField name="*_tsing" type="alphaOnlySort" indexed="true" stored="true" multiValued="false"/>
>>
>> I get the following exception:
>>
>> Exception writing document id FULL_36265 to the index; possible analysis error: Document contains at least one immense term in field="gridFacts_tsing" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[108, 111, 114, 101, 109, 32, 105, 112, 115, 117, 109, 32, 100, 111, 108, 111, 114, 32, 115, 105, 116, 32, 97, 109, 101, 116, 44, 32, 99, 111]...', original message: bytes can be at most 32766 in length; got 68144.
>>
>> Any ideas are greatly appreciated. Thank you.
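As a sketch of option (a), here is the example alphaOnlySort type with a truncation filter appended. The prefixLength value is arbitrary, and the pattern-replace filter from the stock example schema is omitted for brevity:

<fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <!-- assumption: keep only the first 1024 characters of the single
         keyword token, so immense terms can no longer be produced -->
    <filter class="solr.TruncateTokenFilterFactory" prefixLength="1024"/>
  </analyzer>
</fieldType>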
Re: Error when indexing against a specific dynamic field type
On 5/1/2018 8:40 AM, THADC wrote:
> I get the following exception:
>
> Exception writing document id FULL_36265 to the index; possible analysis error: Document contains at least one immense term in field="gridFacts_tsing" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[108, 111, 114, 101, 109, 32, 105, 112, 115, 117, 109, 32, 100, 111, 108, 111, 114, 32, 115, 105, 116, 32, 97, 109, 101, 116, 44, 32, 99, 111]...', original message: bytes can be at most 32766 in length; got 68144.
>
> Any ideas are greatly appreciated. Thank you.

The error is not ambiguous; it tells you precisely what the problem is. A single term in a Lucene index cannot be longer than about 32K, and that field has a term that's more than twice that size.

I'm guessing that the fieldType named alphaOnlySort is one of two things: either the StrField class, or the TextField class with the keyword tokenizer factory.

To fix this problem you will need to either reduce the size of the input on the field, or use an analysis chain that splits the input into smaller tokens. The numbers in the error message are the UTF-8 byte values of the term's prefix (they decode to "lorem ipsum dolor sit amet, co"), so the input is ordinary prose that probably should be tokenized, not treated as a single term.

Thanks,
Shawn
Re: SolrCloud Heterogenous Hardware setup
Thanks Erick. This information is very helpful. Will explore further on the node placement rules within the Collections API.

Many Thanks

On 1 May 2018 at 16:26:34, Erick Erickson (erickerick...@gmail.com) wrote:

"Is it possible to configure a collection such that the collection data is only stored on few nodes in the SolrCloud setup?"

Yes. There are "node placement rules", but also you can create a collection with a createNodeSet that specifies the nodes that the replicas are placed on.

"If this is possible, at the end of each month, what is the approach to be taken to 'move' the latest collection from higher-spec hardware machines to the lower-spec ones?"

There are a bunch of ways, in order of how long they've been around (check your version). All of these are COLLECTIONS API calls.
- ADDREPLICA/DELETEREPLICA
- MOVEREPLICA
- REPLACENODE

The other thing you may want to look at is that David Smiley has been working on time-series support in Solr, but that's quite recent, so it may not be available in whatever version you're using. Nor do I know enough details about it to know how (or if) it supports the heterogeneous setup you're talking about. Check CHANGES.txt.

Best,
Erick

On Tue, May 1, 2018 at 7:59 AM, Greenhorn Techie wrote:
> Hi,
>
> We are building a SolrCloud setup which will index time-series data. Being time-series data with write-once semantics, we are planning to have multiple collections, i.e. one collection per month. As per our use case, end users should be able to query across the last 12 months' worth of data, which means 12 collections (one collection per month). To achieve this, we are planning to leverage Solr collection aliasing, such that the search_alias collection will point to the 12 collections and indexing will always happen to the latest collection.
>
> As it's write-once data, the question I have is whether it is possible to have two different hardware profiles within the SolrCloud cluster, such that all the older collections (being read-only) are stored on the lower hardware spec, while the latest collection (being write-heavy) is stored only on the higher-spec machines.
>
> - Is it possible to configure a collection such that the collection data is only stored on a few nodes in the SolrCloud setup?
> - If this is possible, at the end of each month, what is the approach to be taken to "move" the latest collection from higher-spec hardware machines to the lower-spec ones?
>
> TIA.
Re: Error when indexing against a specific dynamic field type
Erick, thanks for the response. I have a number of documents in our database where Solr is throwing the same exception against *_tsing types.

However, when I index the same documents with our Solr 4.7, they are successfully indexed. So I assume something is different between 4.7 and 7.3. I was assuming I could adjust the dynamic field somehow so that it indexes these documents without errors when using 7.3.

I can't remove the offending documents; it's my customer's data.

Is there some adjustment I can make to the dynamic field?

Thanks again.
Re: SolrCloud Heterogenous Hardware setup
I had a similar problem some time back. Although it might not be the best way, I used cron to move data from high-end-spec machines to lower-end-spec ones. It worked beautifully.

Deepak
"The greatness of a nation can be judged by the way its animals are treated. Please stop cruelty to Animals, become a Vegan"
+91 73500 12833
deic...@gmail.com
Facebook: https://www.facebook.com/deicool
LinkedIn: www.linkedin.com/in/deicool
"Plant a Tree, Go Green"
Make In India : http://www.makeinindia.com/home

On Tue, May 1, 2018 at 10:02 PM, Greenhorn Techie wrote:
> Thanks Erick. This information is very helpful. Will explore further on the node placement rules within the Collections API.
>
> Many Thanks
>
> On 1 May 2018 at 16:26:34, Erick Erickson (erickerick...@gmail.com) wrote:
>
> "Is it possible to configure a collection such that the collection data is only stored on few nodes in the SolrCloud setup?"
>
> Yes. There are "node placement rules", but also you can create a collection with a createNodeSet that specifies the nodes that the replicas are placed on.
>
> "If this is possible, at the end of each month, what is the approach to be taken to 'move' the latest collection from higher-spec hardware machines to the lower-spec ones?"
>
> There are a bunch of ways, in order of how long they've been around (check your version). All of these are COLLECTIONS API calls.
> - ADDREPLICA/DELETEREPLICA
> - MOVEREPLICA
> - REPLACENODE
>
> The other thing you may want to look at is that David Smiley has been working on time-series support in Solr, but that's quite recent, so it may not be available in whatever version you're using. Nor do I know enough details about it to know how (or if) it supports the heterogeneous setup you're talking about. Check CHANGES.txt.
>
> Best,
> Erick
>
> On Tue, May 1, 2018 at 7:59 AM, Greenhorn Techie wrote:
> > Hi,
> >
> > We are building a SolrCloud setup which will index time-series data. Being time-series data with write-once semantics, we are planning to have multiple collections, i.e. one collection per month. As per our use case, end users should be able to query across the last 12 months' worth of data, which means 12 collections (one collection per month). To achieve this, we are planning to leverage Solr collection aliasing, such that the search_alias collection will point to the 12 collections and indexing will always happen to the latest collection.
> >
> > As it's write-once data, the question I have is whether it is possible to have two different hardware profiles within the SolrCloud cluster, such that all the older collections (being read-only) are stored on the lower hardware spec, while the latest collection (being write-heavy) is stored only on the higher-spec machines.
> >
> > - Is it possible to configure a collection such that the collection data is only stored on a few nodes in the SolrCloud setup?
> > - If this is possible, at the end of each month, what is the approach to be taken to "move" the latest collection from higher-spec hardware machines to the lower-spec ones?
> >
> > TIA.
Re: Error when indexing against a specific dynamic field type
Steve's comment is much more germane. KeywordTokenizer, used in alphaOnlySort last I knew, is not appropriate at all. Do you really want single tokens that consist of the entire document for sorting purposes? Wouldn't the first 1K be enough?

It looks like this limit was put in in 4.0, so I'm guessing your analysis chain is different now between the two versions. It doesn't really matter, though; this is not going to be changed. You'll have to do something about your long fields or your analysis chain. And/or revisit what you hope to accomplish by using that field type on such a field -- I'm almost certain your use case is flawed.

Best,
Erick

On Tue, May 1, 2018 at 10:35 AM, THADC wrote:
> Erick, thanks for the response. I have a number of documents in our database where Solr is throwing the same exception against *_tsing types.
>
> However, when I index the same documents with our Solr 4.7, they are successfully indexed. So I assume something is different between 4.7 and 7.3. I was assuming I could adjust the dynamic field somehow so that it indexes these documents without errors when using 7.3.
>
> I can't remove the offending documents; it's my customer's data.
>
> Is there some adjustment I can make to the dynamic field?
>
> Thanks again.
Query Regarding Solr Garbage Collection
Hi,

Following the https://wiki.apache.org/solr/SolrPerformanceFactors article, I understand that garbage collection might be triggered by a significant increase in JVM heap usage if a commit is not performed. Given this background, I am curious to understand the reasons/factors that contribute to increased heap usage of the Solr JVM, which would thus force a garbage collection cycle. In particular, what factors contribute to heap usage growth during indexing, and what factors contribute during search/query time?

Thanks
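One practical way to study this is to enable GC logging and watch how the heap grows between commits. A sketch of the relevant bin/solr.in.sh settings (variable names as in the 7.x start scripts; values are illustrative, and the logging flags shown are the Java 8 ones):

# fixed heap and the G1 collector
SOLR_HEAP="8g"
GC_TUNE="-XX:+UseG1GC -XX:MaxGCPauseMillis=250"

# GC logging: heap that only grows between commits and drops after each
# commit points at indexing-time structures, while growth tied to query
# traffic points at caches and per-request allocations
GC_LOG_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"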
Median Date
All,

We have a dateImported field in our schema. I'd like to generate a statistic showing the median dateImported (actually, we want the median age of the documents, based on the dateImported value). I have other stats that calculate the median value of numbers (like price). This was achieved with something like:

rows=0&stats=true&stats.field={!tag=piv1 percentiles='50'}price&facet=true&facet.pivot={!stats=piv1}status

I have not found a way to calculate the median dateImported. The mean works, but we need the median. Any help would be appreciated.

Cheers,
Jim
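One workaround, an assumption rather than a documented feature for date fields: index a numeric shadow of the date as epoch milliseconds (a hypothetical plong field named dateImportedMillis, populated at index time by the client or an update processor), then reuse exactly the percentile syntax that already works for price:

rows=0&stats=true&stats.field={!tag=piv1 percentiles='50'}dateImportedMillis&facet=true&facet.pivot={!stats=piv1}status

The 50th-percentile epoch value converts straight back to the median dateImported, and the median age is then just the current time minus that value.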
Solr Heap usage
Hi,

Wondering what considerations one should be aware of to arrive at an optimal heap size for the Solr JVM? Though I did discuss this on IRC, I am still unclear on how Solr uses the JVM heap space. Are there any pointers to understand this aspect better?

Given that Solr requires an optimally configured heap, so that the remaining memory can be used for the OS disk cache, I wonder how best to configure the Solr heap. Also, on IRC it was discussed that having a 31GB heap is better than having 32GB due to Java's internal usage of heap. Can anyone guide further on heap configuration please?

Thanks
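On the 31GB vs. 32GB point: compressed ordinary object pointers (oops) only work for heaps below roughly 32GB, so a 32GB heap can actually hold fewer objects than a ~31GB one. This can be checked on your own JVM with standard HotSpot flags:

java -Xmx31g -XX:+PrintFlagsFinal -version | grep UseCompressedOops   # prints true
java -Xmx32g -XX:+PrintFlagsFinal -version | grep UseCompressedOops   # prints false

The heap itself is then set in bin/solr.in.sh, e.g. SOLR_HEAP="31g".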
Re: 7.3 appears to leak
Thanks Markus. So I will go ahead with the 7.3.1 release.

On Tue, May 1, 2018 at 9:41 PM Markus Jelsma wrote:
> Mạnh, Shalin,
>
> I tried to reproduce it locally but failed; it is not just a stream of queries and frequent updates/commits. We will temporarily abuse a production machine to run 7.3 and a control machine on 7.2 to rule some things out.
>
> We have plenty of custom plugins, so when I can reproduce it again, we can rule stuff out and hopefully get back to you guys!
>
> Thanks,
> Markus
>
> -----Original message-----
> > From: Đạt Cao Mạnh
> > Sent: Monday 30th April 2018 4:07
> > To: solr-user@lucene.apache.org
> > Subject: Re: 7.3 appears to leak
> >
> > Hi Markus,
> >
> > I tried indexing documents and querying them with queries and filter queries, but could not find any leak problems. Can you give us more information about the leak?
> >
> > Thanks!
> >
> > On Fri, Apr 27, 2018 at 5:11 PM Shalin Shekhar Mangar <shalinman...@gmail.com> wrote:
> >
> > > Hi Markus,
> > >
> > > Can you give an idea of what your filter queries look like? Any custom plugins or things we should be aware of? Simply indexing artificial docs, querying and committing doesn't seem to reproduce the issue for me.
> > >
> > > On Thu, Apr 26, 2018 at 10:13 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> > >
> > > > Hello,
> > > >
> > > > We just finished upgrading our three separate clusters from 7.2.1 to 7.3, which went fine, except for our main text search collection -- it appears to leak memory on commit!
> > > >
> > > > After the initial upgrade we saw the cluster slowly starting to run out of memory within about an hour and a half. We increased the heap in case 7.3 just requires more of it, but the heap consumption graph is still growing on each commit. Heap space cannot be reclaimed by forcing the garbage collector to run; everything just piles up in the OldGen. Running with this slightly larger heap, the first nodes will run out of memory about two and a half hours after cluster restart.
> > > >
> > > > The heap-eating cluster is a 2-shard/3-replica system on separate nodes. Each replica is about 50 GB in size with about 8.5 million documents. On 7.2.1 it ran fine with just a 2 GB heap. With 7.3 and a 2.5 GB heap, it will take just a little longer to run out of memory.
> > > >
> > > > I inspected reports shown by the sampler of VisualVM and spotted one peculiarity: the number of instances of SortedIntDocSet kept growing on each commit, by about the same amount as the number of cached filter queries. But this doesn't happen on the logs cluster; SortedIntDocSet instances are neatly collected there. The number of instances also accounts for the number of commits since start-up times the cache sizes.
> > > >
> > > > Our other two clusters don't have this problem. One of them receives very few commits per day, but the other receives data all the time -- it logs user interactions, so a large amount of data is coming in constantly. I cannot reproduce it locally by indexing data and committing all the time; the peak usage in OldGen stays about the same. But I can reproduce it locally when I introduce queries and filter queries while indexing pieces of data and committing it.
> > > >
> > > > So, what is the problem? I dug in the CHANGES.txt of both Lucene and Solr, but nothing really caught my attention. Does anyone here have an idea where to look?
> > > >
> > > > Many thanks,
> > > > Markus
> > >
> > > --
> > > Regards,
> > > Shalin Shekhar Mangar.
Re: Learning to Rank (LTR) with grouping
* "Top K shouldn't start from the "start" parameter, if it does, it is a bug. "*** 1. I clearly see that LTR do re-rank based on the start parameter. 2. When reRankDocs=24, pageSize=24, I still get the second page of results re-ranked by ltr plugin when I query with start=24. Alessandro Benedetti wrote > Are you using SolrCloud or any distributed search ? > > If you are using just a single Solr instance, LTR should have no problem > with pagination. > The re-rank involves the top K and then you paginate. > So if a document from the original score page 1 ends up in page 3, you > will > see it at page three. > have you verified that : "Say, if an item (Y) from second page is moved to > first page after > re-ranking, while an item (X) from first page is moved away from the first > page. ?" > Top K shouldn't start from the "start" parameter, if it does, it is a bug. > > The situation change a little with distributed search where you can > experiment this behaviour : > > *Pagination* > Let’s explore the scenario on a single Solr node and on a sharded > architecture. > > SINGLE SOLR NODE > > reRankDocs=15 > rows=10 > This means each page is composed by 10 results. > What happens when we hit the page 2 ? > The first 5 documents in the search results will have been rescored and > affected by the reranking. > The latter 5 documents will preserve the original score and original > ranking. > > e.g. > Doc 11 – score= 1.2 > Doc 12 – score= 1.1 > Doc 13 – score= 1.0 > Doc 14 – score= 0.9 > Doc 15 – score= 0.8 > Doc 16 – score= 5.7 > Doc 17 – score= 5.6 > Doc 18 – score= 5.5 > Doc 19 – score= 4.6 > Doc 20 – score= 2.4 > This means that score(15) could be < score(16), but document 15 and 16 are > still in the expected order. > The reason is that the top 15 documents are rescored and reranked and the > rest is left unchanged. > > *SHARDED ARCHITECTURE* > > reRankDocs=15 > rows=10 > Shards number=2 > When looking for the page 2, Solr will trigger queries to she shards to > collect 2 pages per shard : > Shard1 : 10 ReRanked docs (page1) + 5 ReRanked docs + 5 OriginalScored > docs > (page2) > Shard2 : 10 ReRanked docs (page1) + 5 ReRanked docs + 5 OriginalScored > docs > (page2) > > The the results will be merged, and possibly, original scored search > results > can top up reranked docs. > A possible solution could be to normalise the scores to prevent any > possibility that a reranked result is surpassed by original scored ones. > > Note: The problem is going to happen after you reach rows * page > > reRankDocs. In situations when reRankDocs is quite high , the problem will > occur only in deep paging. > > > > - > --- > Alessandro Benedetti > Search Consultant, R&D Software Engineer, Director > Sease Ltd. - www.sease.io > -- > Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html - --Ilay -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html