Re: NRT for new items in index
On 2019/08/03 18:00:28, Furkan KAMACI wrote:
> Hi,
>
> First of all, could you check here:
> https://lucidworks.com/post/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
> to better understand hard commits, soft commits and transaction logs
> in order to achieve NRT search.
>
> Kind Regards,
> Furkan KAMACI
>
> On Wed, Jul 31, 2019 at 3:47 PM profiuser wrote:
>
> > Hi,
> >
> > we have about 400 000 000 items in a Solr collection.
> > We have set the autocommit property for this collection to 15 minutes.
> > It is a big collection and we use some caches etc. Therefore we have
> > a large autocommit value.
> >
> > The disadvantage is that we don't have NRT searches.
> >
> > We would like to have NRT at least for searching newly added items.
> >
> > We read about the new "Category Routed Aliases" functionality in
> > Solr version 8.1.
> >
> > We got the idea that we could add a routing field to our collection
> > schema. At indexing time we check whether the item is new and set the
> > routing field to "new"; if the item is older than some time period,
> > we set the value to "old".
> > We will have one category routed alias, routedCollection, and there
> > will be 2 collections, old and new.
> >
> > If we index a new item, the router chooses the new collection and the
> > item is inserted into it. After some period we reindex the item,
> > decide that it is old, and set the routing field to "old". The router
> > decides to update (insert) the item into the old collection. We
> > expected that Solr would automatically check uniqueness across all
> > routed collections, and that if Solr found the item in another
> > collection, it would be deleted automatically. But it is not!
> >
> > Is this expected behaviour?
> >
> > Could this functionality be used for our issue? Or could someone
> > suggest another solution which ensures that all new items are ready
> > for NRT searches?
> >
> > Thanks for your help
> >
> > --
> > Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Hi,

we know this page, and we understand how commits and transaction logs
work, but as I said, we have a very big index ;-) Therefore we cannot
commit too often.
We must cache data for fast search, and if we commit too often, the
caches get thrown away.

Right now we have only one server, and we are preparing a new solution
with SolrCloud, where we would have several servers. We have limited
resources and cannot afford, for example, 20 Solr servers, which I
believe is a standard solution for big indexes.

Therefore we are looking for a compromise between price and performance.
That is why we are thinking about having more collections. One collection
would be a daily feed (small index), which we could commit every few
seconds. These collections would be combined under the main collection
alias.

Do you have another idea?

Best
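A minimal sketch of the alias idea above, assuming a standard (non-routed) alias created through the Collections API `CREATEALIAS` action. The base URL and the names `items`, `items_main` and `items_daily` are hypothetical and do not come from the thread. The function only builds the request URL and sends nothing.

```python
from urllib.parse import urlencode


def create_alias_url(base_url, alias, collections):
    """Build a Collections API CREATEALIAS request URL.

    Queries sent to `alias` would then search all listed collections.
    """
    params = {
        "action": "CREATEALIAS",
        "name": alias,
        # The Collections API expects a comma-separated collection list.
        "collections": ",".join(collections),
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


# Hypothetical names: one large main collection plus a small daily feed.
url = create_alias_url(
    "http://localhost:8983/solr",
    "items",
    ["items_main", "items_daily"],
)
print(url)
```

With this layout only the small daily collection would need frequent commits. The indexer would still have to decide which collection each document is written to.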
Re: NRT for new items in index
On 2019/08/06 06:43:20, Jörn Franke wrote:
> Do you have some more information on the index and its size?
>
> Do you have to store everything in the index? Can you store some data
> (blobs etc.) outside?
>
> I think you are generally right with your solution, but also be aware
> that it is sometimes cheaper to have several servers than to keep an
> engineer busy for some months finding a solution. I am not saying this
> is the case with your solution, and I am also not a fan of throwing
> hardware at a problem, but an engineer (even if it affects
> him/herself) should always make that decision. That does not
> necessarily mean that the engineer loses a job - one can implement
> other valuable features for a customer.
>
> > On 06.08.2019 at 08:21, Updates Profimedia wrote:
> >
> > Hi,
> >
> > we know this page, and we understand how commits and transaction
> > logs work, but as I said, we have a very big index ;-) Therefore we
> > cannot commit too often.
> > We must cache data for fast search, and if we commit too often, the
> > caches get thrown away.
> >
> > Right now we have only one server, and we are preparing a new
> > solution with SolrCloud, where we would have several servers. We
> > have limited resources and cannot afford, for example, 20 Solr
> > servers, which I believe is a standard solution for big indexes.
> >
> > Therefore we are looking for a compromise between price and
> > performance. That is why we are thinking about having more
> > collections. One collection would be a daily feed (small index),
> > which we could commit every few seconds. These collections would be
> > combined under the main collection alias.
> >
> > Do you have another idea?
> >
> > Best

We have an index of almost 500 GB. We store only the data we need; the
rest we keep in other storage (database, filesystem).
We have a big daily feed, about 150 000 new items per day, and a similar
number of updates/deletes. This is very live data.

And back to the first question: does anyone have experience with the new
"Category Routed Aliases" functionality? The problem is that when we
change the category of an existing item, it is reindexed into the other
collection while the original item remains in the original collection.

Does this mean that the authors of this functionality did not expect
anyone to change the category of an existing item?
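To illustrate the stale-copy problem described above, here is a rough sketch of a client-side workaround in which the indexing code itself removes the item from every collection it no longer belongs to. This is an assumption about how an indexer could be written, not a feature of Category Routed Aliases. The collection names `new` and `old`, the base URL, and the uniqueKey field `id` are hypothetical, and the sketch addresses the collections directly instead of going through the routed alias. It only builds the requests for Solr's JSON update handler and sends nothing.

```python
import json


def move_item_requests(base_url, doc, target, all_collections,
                       commit_within_ms=10000):
    """Return (url, json_body) pairs that move `doc` into `target`.

    The first pair adds the document to the target collection. The
    remaining pairs delete the same id from every other collection, so
    that no stale copy is left behind.
    """
    requests = []
    add_url = f"{base_url}/{target}/update?commitWithin={commit_within_ms}"
    # A JSON array of documents is an "add" for the update handler.
    requests.append((add_url, json.dumps([doc])))
    for name in all_collections:
        if name == target:
            continue
        del_url = f"{base_url}/{name}/update?commitWithin={commit_within_ms}"
        # Delete-by-id command; assumes the uniqueKey field is "id".
        requests.append((del_url, json.dumps({"delete": {"id": doc["id"]}})))
    return requests


# Hypothetical example: an item that has aged from "new" to "old".
for url, body in move_item_requests(
    "http://localhost:8983/solr",
    {"id": "42", "category": "old"},
    target="old",
    all_collections=["new", "old"],
):
    print(url, body)
```

Each body would be POSTed with `Content-Type: application/json`. Sending the delete to all other collections avoids having to track where an item currently lives, at the cost of some no-op deletes.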
Re: NRT for new items in index
Thank you for the interesting reply. You confirmed our assumptions.
Using two or more collections, as Jörn Franke said, is more complicated
to develop.
For now we will only try to split the image index across more shards and
servers, and try to reduce commit times too. I think that NRT times of
about one minute are acceptable.

Thank you

On 2019/08/06 19:59:49, Shawn Heisey wrote:
> On 7/31/2019 6:47 AM, profiuser wrote:
> > we have about 400 000 000 items in a Solr collection.
> > We have set the autocommit property for this collection to 15 minutes.
> > It is a big collection and we use some caches etc. Therefore we have
> > a large autocommit value.
>
> I would set autoCommit to 60 seconds (a value of 60000) with
> openSearcher set to false. This will not affect change visibility in
> any way, but it will keep your transaction logs from becoming huge.
> Commits that do NOT open a new searcher are very fast.
>
> Then I would use autoSoftCommit as a failsafe on change visibility.
> Start with a value between two and five minutes.
>
> > The disadvantage is that we don't have NRT searches.
> >
> > We would like to have NRT at least for searching newly added items.
> >
> > We read about the new "Category Routed Aliases" functionality in
> > Solr version 8.1.
> >
> > We got the idea that we could add a routing field to our collection
> > schema. At indexing time we check whether the item is new and set the
> > routing field to "new"; if the item is older than some time period,
> > we set the value to "old".
> > We will have one category routed alias, routedCollection, and there
> > will be 2 collections, old and new.
> >
> > If we index a new item, the router chooses the new collection and the
> > item is inserted into it. After some period we reindex the item,
> > decide that it is old, and set the routing field to "old". The router
> > decides to update (insert) the item into the old collection. We
> > expected that Solr would automatically check uniqueness across all
> > routed collections, and that if Solr found the item in another
> > collection, it would be deleted automatically. But it is not!
> >
> > Is this expected behaviour?
>
> I know very little about the new routed collection capability, but in
> general, I would not expect Solr to check more than one collection for
> an existing ID value when it is indexing. I don't think there's
> anything happening at that level that even knows about other
> collections. If you want to split your index into hot and cold pieces,
> you're probably going to need to have your indexing software be aware
> of that and either figure out where to send deletes, or just send
> deletes to all parts of the index.
>
> What kind of lag time do you think about when you imagine near real
> time indexing? Note that extremely short NRT times may not be
> achievable, especially with the large index you're using. A good
> starting point in my opinion is 30000, which is 30 seconds.
>
> What I would do is use the autoCommit and autoSoftCommit settings that
> I mentioned above, and include a "commitWithin" parameter on all
> indexing requests. The commitWithin would be for NRT.
>
> Thanks,
> Shawn
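The settings Shawn describes could be sketched as follows. This assumes the commit intervals are changed through Solr's Config API (`set-property` commands POSTed to `/solr/<collection>/config`) rather than by editing solrconfig.xml, which is a choice made for this example and not something stated in the thread. The collection name `items` and the base URL are hypothetical, and 120000 ms is simply the lower end of the suggested two-to-five-minute range. Nothing is sent; the code only builds the payloads and the update URL.

```python
import json
from urllib.parse import urlencode


def commit_settings_payloads(hard_commit_ms=60000, soft_commit_ms=120000):
    """Build one Config API set-property command per commit setting.

    All values are in milliseconds, the unit used by maxTime.
    """
    properties = [
        # Hard commit every 60 s, without opening a new searcher.
        ("updateHandler.autoCommit.maxTime", hard_commit_ms),
        ("updateHandler.autoCommit.openSearcher", False),
        # Soft commit as a failsafe for change visibility.
        ("updateHandler.autoSoftCommit.maxTime", soft_commit_ms),
    ]
    return [{"set-property": {name: value}} for name, value in properties]


def update_url(base_url, collection, commit_within_ms=30000):
    """Build an update URL carrying the commitWithin request parameter."""
    query = urlencode({"commitWithin": commit_within_ms})
    return f"{base_url}/{collection}/update?{query}"


for payload in commit_settings_payloads():
    print(json.dumps(payload))
print(update_url("http://localhost:8983/solr", "items"))
```

The periodic commits only bound transaction log growth and worst-case visibility. The per-request `commitWithin` is what would actually drive the near-real-time behaviour for newly indexed items.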