Sharding and Replication
Hi,

I had questions about the implementation of the Sharding and Replication features of Solr/SolrCloud.

1. I noticed that when sharding is enabled for a collection, individual requests are sent to each node serving as a shard.

2. Replication follows the same strategy of sending individual documents to the nodes serving as replicas.

I am working with a system that requires a massive number of writes, and I have noticed that for the above reason the cloud eventually starts to fail (even though I am using an ensemble). I do understand the reason behind individual updates, but why not batch them up, or give an option to batch N updates, in either of the above cases? I did come across a presentation that talked about batching 10 updates for replication at least, but I do not think this is the case.

- Asif
Re: Sharding and Replication
Erick,

Thanks for your reply. You are right about 10 updates being batched up - it was hard to figure out due to the large number of updates/logging that happens in our system.

We are batching 1000 updates every time. Here is my observation from the leader and replica:

1. The leader logs clearly indicate that 1000 updates arrived - [ (1000 adds)],commit=]
2. On the replica, for each 1000-document add on the leader, I see a lot of requests, with no indication of how many updates are in each request.

Digging a little into the Solr code, I found the variable I am interested in - maxBufferedAddsPerServer, which is set to 10:

http://svn.apache.org/viewvc/lucene/dev/trunk/solr/core/src/java/org/apache/solr/update/SolrCmdDistributor.java?view=markup

This means that for a batch update of 1000 documents, we will see 100 requests to the replica - which translates into 100 writes per collection per second in our system.

Should this variable be made configurable via solrconfig.xml (or any other appropriate place)?

A little background about the system we are trying to build: a real-time analytics solution using Solr Cloud + atomic updates. We have a very high volume of writes - as high as 1000 updates a second (possibly more in the long run).

- Asif

On Sat, Jun 22, 2013 at 4:21 AM, Erick Erickson wrote:
> Updates are batched, but it's on a per-request basis. So, if
> you're sending one document at a time you won't get any
> batching. If you send 10 docs at a time and they happen to
> go to 10 different shards, you'll get 10 different update
> requests.
>
> If you're sending 1,000 docs per update you should be seeing
> some batching going on.
>
> bq: but why not batch them up or give an option to batch N
> updates in either of the above cases
>
> I suspect what you're seeing is that you're not sending very
> many docs per update request and so are being misled.
>
> But that's a guess since you haven't provided much in the
> way of data on _how_ you're updating.
>
> bq: the cloud eventually starts to fail
> How? Details matter.
>
> Best
> Erick
>
> [snip]
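The arithmetic behind this exchange can be sketched as a toy model: the leader forwards an incoming batch to each replica in sub-batches of maxBufferedAddsPerServer documents (the constant of 10 is from the SolrCmdDistributor source linked above; the function name here is ours, not Solr's):

```python
def distribute(batch, max_buffered_adds_per_server=10):
    """Toy model of the leader splitting one incoming update batch into
    the per-replica requests that SolrCmdDistributor sends out."""
    return [batch[i:i + max_buffered_adds_per_server]
            for i in range(0, len(batch), max_buffered_adds_per_server)]

requests = distribute(list(range(1000)))              # one 1000-doc add
assert len(requests) == 100                           # 100 requests per replica
assert len(distribute(list(range(1000)), 1000)) == 1  # the patched value below
```

So a single 1000-document add fans out into 100 replica-side requests at the default setting, which matches the request flood observed on the replicas.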
Re: Sharding and Replication
Erick,

It's a completely practical problem - we are exploring Solr to build a real-time analytics/data solution for a system handling about 1000 writes per second. We have various metrics that are stored as different collections on the cloud, which means a very high volume of writes. The cloud also needs to support about 300-400 qps of reads.

We initially tested with a single Solr node on a 16-core / 24 GB box for a single metric. We saw that writes were not an issue at all - Solr was handling them extremely well. We were also able to achieve about 200 qps from a single node.

When we set up the cloud (an ensemble on 6 boxes), we saw very high CPU usage on the replicas - up to 10 cores were being used for writes on the replicas. Hence my concern with respect to batched updates for the replicas.

BTW, I altered maxBufferedAddsPerServer to 1000, and now CPU usage is very similar to the single-node installation.

- Asif

On Sat, Jun 22, 2013 at 9:53 PM, Erick Erickson wrote:
> Yeah, there's been talk of making this configurable, but there are
> more pressing priorities so far.
>
> So just to be clear, is this theoretical or practical? I know of several
> very high-performance situations where 1,000 updates/sec (and I'm assuming
> that it's 1,000 docs/sec, not 1,000 batches of 1,000 docs) hasn't caused
> problems. So unless you're actually seeing performance problems,
> as opposed to fearing that there _might_ be, I'd just go on to the next
> urgent problem.
>
> Best
> Erick
>
> [snip]
indexing documents in Apache Solr using php-curl library
I am indexing a file using the PHP curl library. I am stuck with this code:

echo "Stored in: " . "upload/" . $_FILES["file"]["name"];
$result = move_uploaded_file($_FILES["file"]["tmp_name"], "upload/" . $_FILES["file"]["name"]);
if ($result == 1) echo "Upload done.";

$url = "http://localhost:8983/solr/update/";
$filename = "upload/" . $_FILES["file"]["name"];
$post_string = file_get_contents($filename);
echo $url;
echo $post_string;

$header = array("Content-type: text/xml; charset=utf-8");
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_string);
curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);
curl_setopt($ch, CURLINFO_HEADER_OUT, 1);
$data = curl_exec($ch);
if (curl_errno($ch)) {
    print "curl_error: " . curl_error($ch);
} else {
    curl_close($ch);
    print "curl exited okay\n";
    echo "Data returned...\n\n";
    echo $data;
    echo "\n";
}

Nothing is showing as a result. Moreover, there is nothing shown in the event log of Apache Solr. Please help me with the code.

--
View this message in context: http://lucene.472066.n3.nabble.com/indexing-documents-in-Apache-Solr-using-php-curl-library-tp3992452.html
Sent from the Solr - User mailing list archive at Nabble.com.
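One likely reason "nothing is showing" is that the file being POSTed is not a Solr `<add>` XML document, and no `<commit/>` is ever sent - Solr does not make new documents searchable until a commit. A minimal sketch of the payloads involved (in Python for brevity; the field names are hypothetical, and the /solr/update endpoint and commit requirement are the standard Solr XML update protocol):

```python
import xml.etree.ElementTree as ET

def build_add_xml(docs):
    """Build a Solr <add> XML payload from a list of field dicts."""
    add = ET.Element("add")
    for doc in docs:
        d = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(d, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")

# 1) POST this to http://localhost:8983/solr/update with Content-type text/xml
payload = build_add_xml([{"id": "1", "title": "hello"}])

# 2) Then POST a commit, or the documents stay invisible to searches:
COMMIT_XML = "<commit/>"
```

Also note that curl_exec() returns the raw Solr response; checking it for a non-zero status code would surface XML parse errors that otherwise look like silent failures.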
Hiring solr experts
Hi all,

Does anyone here have experience hiring Solr experts? Are there any specific channels with which you've had good success?

Thanks,
Asif

--
Asif Rahman
Lead Engineer - NewsCred
a...@newscred.com
http://platform.newscred.com
Re: Hiring solr experts
We're actually looking to bring someone on full-time. On Sat, Aug 7, 2010 at 3:13 PM, Erick Erickson wrote: > Well, what do you want them to do? Come work full-time for > your company or consult/contract? > > If the latter, have you seen this? > http://wiki.apache.org/solr/Support > > On Sat, Aug 7, 2010 at 4:59 AM, Asif Rahman wrote: > > > Hi all, > > > > Does anyone here have any experience hiring solr experts? Are there any > > specific channels that you had good success with? > > > > Thanks, > > > > Asif > > > > -- > > Asif Rahman > > Lead Engineer - NewsCred > > a...@newscred.com > > http://platform.newscred.com > > > -- Asif Rahman Lead Engineer - NewsCred a...@newscred.com http://platform.newscred.com
Facet filtering
Is there any way to assign metadata to terms in a field and then filter on that metadata when using that field as a facet?

For example, I have a collection of news articles in my index. Each article has a field that contains tags based on the topics discussed in the article. An article might have the tags "Barack Obama" and "Chicago". I want to assign metadata describing what type of entity each tag is. For these tags the metadata would be "person" for "Barack Obama" and "place" for "Chicago". Then I want to issue a facet query that returns only "person" facets.

I can think of two possible solutions for this, both with shortcomings:

1) I could create a field for each metadata category. So the schema would have the fields "tag_person" and "tag_place". The problem with this method is that I am limited to filtering by a single criterion for each of my queries.

2) I could leave the Solr schema unmodified and post-process the query. This solution is less elegant than one that could be completely contained within Solr. I also imagine that it would be less performant.

Any thoughts?

Thanks in advance,
Asif

--
Asif Rahman
Lead Engineer - NewsCred
a...@newscred.com
http://platform.newscred.com
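A third option worth considering: encode the entity type into the tag value itself (e.g. "person/Barack Obama") and use Solr's facet.prefix parameter to restrict facet results to one type. This keeps a single field and pushes the filtering into Solr. A sketch of the idea (the tag encoding scheme is an assumption, not an existing convention):

```python
def encode_tag(entity_type, value):
    """Prefix a tag value with its entity type, e.g. 'person/Barack Obama'."""
    return f"{entity_type}/{value}"

def filter_facets_by_type(facet_counts, entity_type):
    """Mimic facet.prefix: keep only the facet values of one entity type,
    stripping the type prefix back off for display."""
    prefix = entity_type + "/"
    return {tag[len(prefix):]: n
            for tag, n in facet_counts.items()
            if tag.startswith(prefix)}

counts = {"person/Barack Obama": 12, "place/Chicago": 7, "person/Joe Biden": 3}
people = filter_facets_by_type(counts, "person")
```

With real Solr you would index the encoded values and query with facet.field=tag&facet.prefix=person/ - one query per desired type, but no schema change per category.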
Re: Facets with an IDF concept
Hi Wojtek,

Sorry for the late, late reply. I haven't implemented this yet, but it is on the (long) list of my todos. Have you made any progress?

Asif

On Thu, Aug 13, 2009 at 5:42 PM, wojtekpia wrote:
>
> Hi Asif,
>
> Did you end up implementing this as a custom sort order for facets? I'm
> facing a similar problem, but not related to time. Given 2 terms:
> A: appears twice in half the search results
> B: appears once in every search result
> I think term A is more "interesting". Using facets sorted by frequency, term
> B is more important (since it shows up first). To me, terms that appear in
> all documents aren't really that interesting. I'm thinking of using a
> combination of document count (in the result set, not globally) and term
> frequency (in the result set, not globally) to come up with a facet sort
> order.
>
> Wojtek
> --
> View this message in context:
> http://www.nabble.com/Facets-with-an-IDF-concept-tp24071160p24959192.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

--
Asif Rahman
Lead Engineer - NewsCred
a...@newscred.com
http://platform.newscred.com
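Wojtek's combination of in-result-set document count and term frequency can be sketched as an idf-style discount: score a facet term by its total occurrences in the result set, discounted for appearing in (nearly) every result. A toy scoring function (the exact weighting is our assumption, not anything implemented in Solr):

```python
import math

def interestingness(df_in_results, tf_in_results, n_results):
    """Score a facet term within a result set: total occurrences,
    discounted (idf-style) when the term appears in nearly every result.
    df_in_results: result docs containing the term
    tf_in_results: total occurrences of the term in the result set"""
    idf = math.log((n_results + 1) / (df_in_results + 0.5))
    return tf_in_results * idf

# Wojtek's example over 100 results:
score_a = interestingness(50, 100, 100)   # A: twice in half the results
score_b = interestingness(100, 100, 100)  # B: once in every result
# score_a > score_b, although a plain count-sorted facet would rank B first.
```

Term B, appearing in every result, gets an idf near zero, so term A wins even though both have the same total frequency.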
Incorrect sort with function query in query parameters
Hi all,

I'm having an issue with the order of my results when attempting to sort by a function in my query. Looking at the debug output of the query, the score returned in the result section for any given document does not match the score in the debug output. It turns out that if I optimize the index, the results are sorted correctly. The scores in the debug output are the correct scores.

This behavior only occurs using a recent nightly build of Solr. It works correctly in Solr 1.3. An example query is:

http://localhost:8080/solr/core-01/select?qt=standard&fl=*,score&rows=10&q=*:*%20_val_:"recip(rord(article_published_at),1,1000,1000)"^1&debugQuery=on

I've attached the result to this email. Can anybody shed any light on this problem?

Thanks,
Asif

http://www.nabble.com/file/p22735009/result.xml result.xml
--
View this message in context: http://www.nabble.com/Incorrect-sort-with-with-function-query-in-query-parameters-tp22735009p22735009.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Incorrect sort with function query in query parameters
Hi Otis,

Any documents marked deleted in this index are just the result of updates to those documents. There are no purely deleted documents. Furthermore, the field that I am ordering by in my function query remains untouched over the updates.

I've read in other posts that the logic used by the debug component to calculate the score is different from what the query component uses. The score shown in the debug output is correct. It seems like the two components are getting two different values for the rord function. I'm particularly concerned by the fact that this only happens in the nightly build.

Any ideas on how to correct this? Unfortunately, it's not feasible for me to only perform searches on optimized indices because we are doing constant updates.

Thanks,
Asif

Otis Gospodnetic wrote:
>
> Asif,
>
> Could it have something to do with the deleted documents in your
> unoptimized index? These documents are only marked as deleted. When you
> run optimize you really remove them completely. It could be that they are
> getting counted by something and that messes up the scoring/order.
>

--
View this message in context: http://www.nabble.com/Incorrect-sort-with-with-function-query-in-query-parameters-tp22735009p22741058.html
Sent from the Solr - User mailing list archive at Nabble.com.
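Otis's hypothesis can be illustrated with a toy model (this is one plausible mechanism, not the actual FieldCache code): rord() is derived from a value's ordinal among the terms the index still knows about, and terms referenced only by deleted-but-unpurged documents survive until an optimize merges them away, so the same live document can get different rord values before and after optimizing:

```python
def rord_toy(live_values, all_terms_in_index):
    """Toy model of rord(): reverse ordinal of each value among all terms
    still present in the index - including terms kept alive only by
    deleted, not-yet-purged documents (they vanish after an optimize)."""
    terms = sorted(set(all_terms_in_index))
    n = len(terms)
    return {v: n - terms.index(v) for v in live_values}

live = [20090101, 20090102, 20090103]
optimized = rord_toy(live, live)
# Unoptimized: a purged-in-name-only term (20090104) is still in the index,
# shifting the reverse ordinals of every live value:
unoptimized = rord_toy(live, live + [20090104])
```

If the query component and the debug component see different views of the term dictionary (e.g. per-segment vs. top-level), they would compute different rord values, which matches the symptom of scores that disagree until the index is optimized.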
Re: Incorrect sort with function query in query parameters
I have been intending to although I have been dragging my feet on it. I've never opened a bug before so I'm not sure of the protocol. If you don't mind, it would be great if you could send me a pm and point me in the right direction. Thanks, Asif On Mon, May 18, 2009 at 7:30 PM, Ensdorf Ken wrote: > > > > A Unit test would be ideal, but even if you can just provide a list of > > steps (ie: using this solrconfig+schema, index these docs, then update > > this one doc, then execute this search) it can help people track things > > down. > > > > Please open a bug and attach as much detail as you can there. > > > > > > -Hoss > > Was a bug ever opened on this? I am seeing similar behavior (though in my > case it's the debug scores that look wrong). > > -Ken > >
Facets with an IDF concept
Hi all,

We have an index of news articles that are tagged with news topics. Currently, we use Solr facets to see which topics are popular for a given query or time period. I'd like to apply the concept of IDF to the facet counts so as to penalize topics that occur broadly throughout our index. I've begun to write a custom facet component that applies the IDF to the facet counts, but I also wanted to check whether anyone has experience using facets in this way.

Thanks,
Asif
Re: Facets with an IDF concept
Hi again, I guess nobody has used facets in the way I described below before. Do any of the experts have any ideas as to how to do this efficiently and correctly? Any thoughts would be greatly appreciated. Thanks, Asif On Wed, Jun 17, 2009 at 12:42 PM, Asif Rahman wrote: > Hi all, > > We have an index of news articles that are tagged with news topics. > Currently, we use solr facets to see which topics are popular for a given > query or time period. I'd like to apply the concept of IDF to the facet > counts so as to penalize the topics that occur broadly through our index. > I've begun to write custom facet component that applies the IDF to the facet > counts, but I also wanted to check if anyone has experience using facets in > this way. > > Thanks, > > Asif > -- Asif Rahman Lead Engineer - NewsCred a...@newscred.com http://platform.newscred.com
Re: Facets with an IDF concept
Hi Kent,

Your problem is a close cousin of the problem that we're tackling. We have experienced the same problem as you when calculating facets on MoreLikeThis queries, since those queries tend to match a lot of documents. We used one of the solutions that you mentioned, a rank cutoff, to solve it. We first run the MoreLikeThis query, then use the top N documents' unique ids as a filter query for a second query. The performance is still acceptable; however, our index is smaller than yours by an order of magnitude.

Regards,
Asif

On Tue, Jun 23, 2009 at 10:34 AM, Kent Fitch wrote:
> Hi Asif,
>
> I was holding back because we have a similar problem, but we're not
> sure how best to approach it, or even whether approaching it at all is
> the right thing to do.
>
> Background:
> - large index (~35m documents)
> - about 120k of these include full text book contents plus metadata,
> the rest are just metadata
> - we plan to increase the number of full text books to around 1m, and the
> number of records will greatly increase
>
> We've found that because of the sheer volume of content in full text,
> we get lots of results in full text of very low relevance. The Lucene
> relevance ranking works wonderfully to "hide" these way down the list,
> and when these are the only results at all, the user may be delighted
> to find obscure hits.
>
> But when you search for, say: soldier of fortune: one of the 55k+
> results is Huck Finn, with 4 "soldier(s)" and 6 "fortunes", but it
> probably isn't relevant. The searcher will find it in the result
> sets, but should the author, subject, dates, formats etc (our facets)
> of Huck Finn be contributing to the facets shown to the user as
> equally as, say, the top 500 results? Maybe, but perhaps they are
> "diluting" the value of facets contributed by the more relevant
> results.
>
> So, we are considering restricting the contents of the result bit set
> used for faceting to exclude results with a very very low score (with
> our own QueryComponent).
But there are problems: > > - what's a low score? How will a low score threshold vary across > queries? (Or should we use a rank cutoff instead, which is much more > expensive to compute, or some combo that works with results that only > have very low relevance results?) > > - should we do this for all facets, or just some (where the less > relevant results seem particularly annoying, as they can "mask" facets > from the most relevant results - the authors, years and subjects we > have full text for are not representative of the whole corpus) > > - if a searcher pages through to the 1000th result page, down to these > less relevant results, should we somehow include these results in the > facets we show? > > sorry, only more questions! > > Regards, > > Kent Fitch > > On Tue, Jun 23, 2009 at 5:58 PM, Asif Rahman wrote: > > Hi again, > > > > I guess nobody has used facets in the way I described below before. Do > any > > of the experts have any ideas as to how to do this efficiently and > > correctly? Any thoughts would be greatly appreciated. > > > > Thanks, > > > > Asif > > > > On Wed, Jun 17, 2009 at 12:42 PM, Asif Rahman wrote: > > > >> Hi all, > >> > >> We have an index of news articles that are tagged with news topics. > >> Currently, we use solr facets to see which topics are popular for a > given > >> query or time period. I'd like to apply the concept of IDF to the facet > >> counts so as to penalize the topics that occur broadly through our > index. > >> I've begun to write custom facet component that applies the IDF to the > facet > >> counts, but I also wanted to check if anyone has experience using facets > in > >> this way. > >> > >> Thanks, > >> > >> Asif > >> > > > > > > > > -- > > Asif Rahman > > Lead Engineer - NewsCred > > a...@newscred.com > > http://platform.newscred.com > > > -- Asif Rahman Lead Engineer - NewsCred a...@newscred.com http://platform.newscred.com
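The two cutoff strategies under discussion (a score threshold vs. a rank cutoff) can both be sketched as a post-filter on the result set applied before counting facets - a toy model of the idea, not Solr's actual QueryComponent:

```python
def facet_counts(scored_docs, facet_field, top_n=None, min_score=None):
    """Count facet values over a result set, optionally restricted by a
    rank cutoff (top_n) or a score threshold (min_score), so that
    barely-relevant matches stop diluting the facets."""
    docs = sorted(scored_docs, key=lambda d: d["score"], reverse=True)
    if min_score is not None:
        docs = [d for d in docs if d["score"] >= min_score]
    if top_n is not None:
        docs = docs[:top_n]
    counts = {}
    for d in docs:
        for value in d.get(facet_field, []):
            counts[value] = counts.get(value, 0) + 1
    return counts

results = [
    {"score": 9.1, "author": ["Doyle"]},
    {"score": 8.7, "author": ["Doyle"]},
    {"score": 0.02, "author": ["Twain"]},  # e.g. Huck Finn, barely relevant
]
all_facets = facet_counts(results, "author")           # Twain included
cut_facets = facet_counts(results, "author", top_n=2)  # Twain dropped
```

The rank cutoff is stable across queries but requires sorting the full result set; the score threshold is cheap but, as Kent notes, hard to calibrate because absolute scores vary from query to query.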
Re: Facets with an IDF concept
Hi Grant, I'll give a real life example of the problem that we are trying to solve. We index a large number of current news articles on a continuing basis. We tag these articles with news topics (e.g. Barack Obama, Iran, etc.). We then use these tags to facet our queries. For example, we might issue a query for all articles in the last 24 hours. The facets would then tell us which news topics have been written about the most in that period. The problem is that "Barack Obama", for example, is always written about in high frequency, as opposed to "Iran" which is currently very hot in the news, but which has not always been the case. In this case, we'd like to see "Iran" show up higher than "Barack Obama" in the facet results. To me, this seems identical to the tf-idf scoring expression that is used in normal search. The facet count is analogous to the tf and I can access the facet term idf's through the Similarity API. Is my reasoning sound? Can you provide any guidance as to the best way to implement this? Thanks for your help, Asif On Tue, Jun 23, 2009 at 1:19 PM, Grant Ingersoll wrote: > > On Jun 23, 2009, at 3:58 AM, Asif Rahman wrote: > > Hi again, >> >> I guess nobody has used facets in the way I described below before. Do >> any >> of the experts have any ideas as to how to do this efficiently and >> correctly? Any thoughts would be greatly appreciated. >> >> Thanks, >> >> Asif >> >> On Wed, Jun 17, 2009 at 12:42 PM, Asif Rahman wrote: >> >> Hi all, >>> >>> We have an index of news articles that are tagged with news topics. >>> Currently, we use solr facets to see which topics are popular for a given >>> query or time period. I'd like to apply the concept of IDF to the facet >>> counts so as to penalize the topics that occur broadly through our index. >>> I've begun to write custom facet component that applies the IDF to the >>> facet >>> counts, but I also wanted to check if anyone has experience using facets >>> in >>> this way. 
>>> >> > > I'm not sure I'm following. Would you be faceting on one field, but using > the DF from some other field? Faceting is already a count of all the > documents that contain the term on a given field for that search. If I'm > understanding, you would still do the typical faceting, but then rerank by > the global DF values, right? > > Backing up, what is the problem you are seeing that you are trying to > solve? > > I think you could do this, but you'd have to hook it in yourself. By > penalize, do you mean remove, or just have them in the sort? Generally > speaking, looking up the DF value can be expensive, especially if you do a > lot of skipping around. I don't know how pluggable the sort capabilities > are for faceting, but that might be the place to start if you are just > looking at the sorting options. > > > > -- > Grant Ingersoll > http://www.lucidimagination.com/ > > Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using > Solr/Lucene: > http://www.lucidimagination.com/search > > -- Asif Rahman Lead Engineer - NewsCred a...@newscred.com http://platform.newscred.com
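The tf-idf analogy laid out above can be sketched directly: the facet count over the result set plays the role of tf, and the tag's global document frequency supplies the idf discount. A toy re-ranking outside Solr (a real implementation would read the df values through the Similarity/term APIs, as discussed; the numbers here are invented):

```python
import math

def rerank_facets_by_idf(facet_counts, global_df, n_docs):
    """Rerank facet values tf-idf style: result-set facet count ('tf')
    times log(N / global df), so globally ubiquitous tags are penalized."""
    scored = {tag: count * math.log(n_docs / global_df[tag])
              for tag, count in facet_counts.items()}
    return sorted(scored, key=scored.get, reverse=True)

# Toy numbers: in the last 24 hours Obama was tagged more often than Iran,
# but over the whole index Obama is ubiquitous, so Iran ranks first.
counts = {"Barack Obama": 120, "Iran": 100}
df = {"Barack Obama": 500_000, "Iran": 50_000}
ranking = rerank_facets_by_idf(counts, df, n_docs=1_000_000)
```

With these numbers Obama scores 120 * ln(2) ≈ 83 while Iran scores 100 * ln(20) ≈ 300, so the "currently hot" topic surfaces above the "always popular" one, which is exactly the behavior described in the example.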
Re: Create incremental snapshot
Tushar:

Is it necessary to do the optimize on each iteration? When you run an optimize, the entire index is rewritten, so none of the new index files are shared with earlier snapshots, and each snapshot consumes the full size of the index on your disk.

Asif

On Thu, Jul 9, 2009 at 3:26 AM, tushar kapoor < tushar_kapoor...@rediffmail.com> wrote:
>
> What I gather from this discussion is:
>
> 1. Snapshots are always hard links and not actual files, so they cannot
> possibly consume the same amount of space.
> 2. Snapshots contain hard links to existing docs + delta docs.
>
> We are facing a situation wherein the snapshot occupies the same space as
> the actual indexes, thus violating the first point.
> We have a batch processing scheme for refreshing indexes. The steps we
> follow are:
>
> 1. Delete 200 documents in one go.
> 2. Do an optimize.
> 3. Create the 200 documents deleted earlier.
> 4. Do a commit.
>
> This process continues for around 160,000 documents, i.e. 800 times, and by
> the end of it we have 800 snapshots.
>
> The size of the actual indexes is 200 MB, and remarkably all 800 snapshots
> are around 200 MB each. In effect this process consumes around 160 GB of
> space on our disks. This is causing a lot of pain right now.
>
> My concerns are: Is our understanding of the snapshooter correct? Should
> this massive space consumption be happening at all? Are we missing
> something critical?
>
> Regards,
> Tushar.
>
> Shalin Shekhar Mangar wrote:
> >
> > On Sat, Apr 18, 2009 at 1:06 PM, Koushik Mitra
> > wrote:
> >
> >> Ok
> >>
> >> If these are hard links, then where does the index data get stored? Those
> >> must be getting stored somewhere in the file system.
> >>
> >
> > Yes, of course they are stored on disk. The hard links are created from the
> > actual files inside the index directory. When those older files are deleted
> > by Solr, they are still left on the disk if at least one hard link to that
> > file exists. If you are looking for how to clean old snapshots, you could
> > use the snapcleaner script.
> >
> > Is that what you wanted to do?
> >
> > --
> > Regards,
> > Shalin Shekhar Mangar.
> >
>
> --
> View this message in context: http://www.nabble.com/Create-incremental-snapshot-tp23109877p24405434.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

--
Asif Rahman
Lead Engineer - NewsCred
a...@newscred.com
http://platform.newscred.com
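The hard-link mechanics behind this thread can be demonstrated in a few lines (Python, on a POSIX filesystem; the file names are just placeholders for index segment files): a hard link shares the inode and costs no extra space, but once the index is rewritten by an optimize, the new files are new inodes and share nothing with earlier snapshots.

```python
import os
import tempfile

tmp = tempfile.mkdtemp()
seg = os.path.join(tmp, "segment_1")
with open(seg, "w") as f:
    f.write("index data v1")

# Snapshot = directory of hard links: zero extra data on disk.
snap = os.path.join(tmp, "snapshot.1")
os.mkdir(snap)
os.link(seg, os.path.join(snap, "segment_1"))
assert os.stat(seg).st_nlink == 2  # same inode, two names

# "Optimize": Solr writes a brand-new file and deletes the old one.
os.remove(seg)
seg2 = os.path.join(tmp, "segment_2")
with open(seg2, "w") as f:
    f.write("index data v2, fully rewritten")

# The old snapshot still pins the old bytes, and a new snapshot of
# segment_2 would share nothing with it - so after each optimize,
# every snapshot costs a full copy of the index.
assert os.stat(os.path.join(snap, "segment_1")).st_nlink == 1
```

This is exactly Tushar's situation: optimizing on every iteration rewrites all 200 MB, so each of the 800 snapshots pins its own full 200 MB of unshared data.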
Solr 1.5 in production
What is the prevailing opinion on using solr 1.5 in a production environment? I know that many people were using 1.4 in production for a while before it became an official release. Specifically I'm interested in using some of the new spatial features. Thanks, Asif -- Asif Rahman Lead Engineer - NewsCred a...@newscred.com http://platform.newscred.com
Re: Solr 1.5 in production
One piece of functionality that I need is the ability to index a spatial shape. I've begun implementing this for Solr 1.4 using just the spatial capabilities in Lucene, with a custom update processor and query parser. At this point I'm only supporting rectangles, and the shapes are being indexed as sets of spatial tiles. In Solr 1.5, I believe the correct implementation would be as a field type using the new subfield capabilities. Do you have any thoughts about the approach I'm taking?

Asif

On Fri, Feb 19, 2010 at 5:10 PM, Grant Ingersoll wrote:
>
> On Feb 19, 2010, at 4:54 PM, Asif Rahman wrote:
>
> > What is the prevailing opinion on using Solr 1.5 in a production
> > environment? I know that many people were using 1.4 in production for a
> > while before it became an official release.
> >
> > Specifically I'm interested in using some of the new spatial features.
>
> These aren't fully baked yet (still need some spatial filtering
> capabilities which I'm getting close to done with, or close enough to submit
> a patch anyway), but feedback would be welcome. The main risk, I suppose,
> is that any new APIs could change. Other than that, the usual advice
> applies: test it out in your environment and see if it meets your needs.
> On the spatial stuff, we'd definitely appreciate feedback on performance,
> functionality, APIs, etc.
>
> -Grant

--
Asif Rahman
Lead Engineer - NewsCred
a...@newscred.com
http://platform.newscred.com
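The "rectangle as a set of spatial tiles" idea can be sketched as follows (a simplified single-resolution grid; Lucene's cartesian-tier code uses multiple resolutions, and the tile naming here is ours, not Lucene's):

```python
import math

def rect_to_tiles(min_lon, min_lat, max_lon, max_lat, tile_deg=1.0):
    """Cover a lon/lat rectangle with grid-tile ids. Each tile id would be
    indexed as a term on the document; a query rectangle matches a document
    when their tile sets share at least one term."""
    tiles = set()
    lon = math.floor(min_lon / tile_deg)
    while lon * tile_deg < max_lon:
        lat = math.floor(min_lat / tile_deg)
        while lat * tile_deg < max_lat:
            tiles.add(f"{lon}_{lat}")
            lat += 1
        lon += 1
    return tiles

# A rough box around the New York area at 1-degree resolution: 4 tiles.
tiles = rect_to_tiles(-74.5, 40.2, -73.5, 41.2)
```

Reducing shape intersection to ordinary term intersection is what makes this approach fast in an inverted index; the trade-off is tile-boundary imprecision, which the multi-resolution tier scheme mitigates.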
Re: Solr 1.5 in production
We're modeling hyperlocal news articles. Each article is indexed with a shape that corresponds to the region of the map that is covered by the source of the article. We considered modeling the locality of the articles as points, but that approach would have limited our search options to bounding box filters and slow distance calculations. We also felt that the shape approach more closely resembled the true nature of the data. For now, we're only using rectangles, but this approach will give us the ability to index amorphous shapes as well. Are there any other techniques for indexing shape data? On Mon, Feb 22, 2010 at 4:29 PM, Grant Ingersoll wrote: > > On Feb 20, 2010, at 8:53 AM, Asif Rahman wrote: > > > One piece of functionality that I need is the ability to index a spatial > > shape. I've begun implementing this for solr 1.4 using just the spatial > > capabilities in lucene with a custom update processor and query parser. > At > > this point I'm only supporting rectangles and the shapes are being > indexed > > as sets of spatial tiles. In solr 1.5, I believe the correct > implementation > > would be as a field type with the new subfielding capabilities. Do you > have > > any thoughts about the approach I'm taking? > > Yeah, I think in 1.5 you would use the subfield approach. How do you plan > on using the shape in search? > > > > > Asif > > > > On Fri, Feb 19, 2010 at 5:10 PM, Grant Ingersoll >wrote: > > > >> > >> On Feb 19, 2010, at 4:54 PM, Asif Rahman wrote: > >> > >>> What is the prevailing opinion on using solr 1.5 in a production > >>> environment? I know that many people were using 1.4 in production for > a > >>> while before it became an official release. > >>> > >>> Specifically I'm interested in using some of the new spatial features. > >> > >> These aren't fully baked yet (still need some spatial filtering > >> capabilities which I'm getting close to done with, or close enough to > submit > >> a patch anyway), but feedback would be welcome. 
The main risk, I > suppose, > >> is that any new APIs could change. Other than that, the usually advice > >> applies: Test it out in your environment and see if it meets your > needs. > >> On the spatial stuff, we'd definitely appreciate feedback on > performance, > >> functionality, APIs, etc. > >> > >> -Grant > > > > > > > > > > -- > > Asif Rahman > > Lead Engineer - NewsCred > > a...@newscred.com > > http://platform.newscred.com > > -- > Grant Ingersoll > http://www.lucidimagination.com/ > > Search the Lucene ecosystem using Solr/Lucene: > http://www.lucidimagination.com/search > > -- Asif Rahman Lead Engineer - NewsCred a...@newscred.com http://platform.newscred.com
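The "sets of spatial tiles" scheme mentioned above can be sketched as follows. This is an illustration of the general idea only, not Solr's actual spatial implementation; the function names and the string cell-ID format are made up.

```python
import math

def rect_to_tiles(min_lat, min_lon, max_lat, max_lon, cell_deg=1.0):
    """Return the set of grid-cell IDs covered by a lat/lon rectangle.

    Each ID names one cell_deg x cell_deg cell; a document's rectangle
    is indexed as these IDs, so a box query reduces to a terms lookup.
    """
    tiles = set()
    lat = math.floor(min_lat / cell_deg)
    while lat * cell_deg <= max_lat:
        lon = math.floor(min_lon / cell_deg)
        while lon * cell_deg <= max_lon:
            tiles.add("%d_%d" % (lat, lon))
            lon += 1
        lat += 1
    return tiles

def rects_may_overlap(doc_tiles, query_tiles):
    # Tile-set intersection is a coarse filter; exact geometry
    # still needs a verification step on the candidate documents.
    return bool(doc_tiles & query_tiles)
```

Indexing the same shape at several resolutions side by side lets large query boxes match a few coarse tiles instead of thousands of fine ones.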
Stemming Filters in wiki
I see that the entries for PorterStemFilterFactory, EnglishPorterFilterFactory, and SnowballPorterFilterFactory have been removed from the Analyzers, Tokenizers, and Token Filters wiki page. Is there a reason for this? Thanks, Asif -- Asif Rahman Lead Engineer - NewsCred a...@newscred.com http://platform.newscred.com
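For reference, all three factories are configured as filters in an analyzer chain along these lines (the fieldType name and tokenizer choice here are arbitrary; the filter class names are the real ones, and SnowballPorterFilterFactory takes an optional `language` attribute):

```xml
<fieldType name="text_stemmed" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- one of the three stemmers discussed: -->
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>
```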
Index-time vs. search-time boosting performance
Hi, What are the performance ramifications for using a function-based boost at search time (through bf in dismax parser) versus an index-time boost? Currently I'm using boost functions on a 15GB index of ~14mm documents. Our queries generally match many thousands of documents. I'm wondering if I would see a performance improvement by switching over to index-time boosting. Thanks, Asif -- Asif Rahman Lead Engineer - NewsCred a...@newscred.com http://platform.newscred.com
Re: Index-time vs. search-time boosting performance
Perhaps I should have been more specific in my initial post. I'm doing date-based boosting on the documents in my index, so as to assign a higher score to more recent documents. Currently I'm using a boost function to achieve this. I'm wondering if there would be a performance improvement if instead of using the boost function at search time, I indexed the documents with a date-based boost. On Fri, Jun 4, 2010 at 7:30 PM, Erick Erickson wrote: > Index time boosting is different than search time boosting, so > asking about performance is irrelevant. > > Paraphrasing Hossman from years ago on the Lucene list (from > memory). > > ...index time boosting is a way of saying this documents' > title is more important than other documents' titles. Search > time boosting is a way of saying "I care about documents > whose titles contain this term more than other documents > whose titles may match other parts of this query" > > HTH > Erick > > On Fri, Jun 4, 2010 at 5:10 PM, Asif Rahman wrote: > > > Hi, > > > > What are the performance ramifications for using a function-based boost > at > > search time (through bf in dismax parser) versus an index-time boost? > > Currently I'm using boost functions on a 15GB index of ~14mm documents. > > Our > > queries generally match many thousands of documents. I'm wondering if I > > would see a performance improvement by switching over to index-time > > boosting. > > > > Thanks, > > > > Asif > > > > -- > > Asif Rahman > > Lead Engineer - NewsCred > > a...@newscred.com > > http://platform.newscred.com > > > -- Asif Rahman Lead Engineer - NewsCred a...@newscred.com http://platform.newscred.com
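The common query-time recipe for this (it appears in the Solr relevancy FAQ) is a reciprocal decay such as `recip(ms(NOW,created_at),3.16e-11,1,1)`, where 3.16e-11 is roughly 1 over a year in milliseconds and `created_at` stands in for whatever date field is used. Evaluating the same function in Python shows the curve it produces:

```python
def recip(x, m, a, b):
    """Solr's recip(x, m, a, b) = a / (m*x + b)."""
    return a / (m * x + b)

MS_PER_YEAR = 365.25 * 24 * 3600 * 1000  # ~3.156e10 ms

def recency_boost(age_ms):
    # 3.16e-11 ~= 1 / MS_PER_YEAR, so a brand-new document scores ~1.0
    # and a one-year-old document scores ~0.5
    return recip(age_ms, 3.16e-11, 1, 1)
```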
Re: Index-time vs. search-time boosting performance
It seems like it would be far more efficient to calculate the boost factor once and store it rather than calculating it for each request in real-time. Some of our queries match tens of thousands if not hundreds of thousands of documents in a 15GB index. However, I'm not well-versed in lucene internals so I may be misunderstanding what is going on here. On Fri, Jun 4, 2010 at 8:31 PM, Jay Hill wrote: > I've done a lot of recency boosting to documents, and I'm wondering why you > would want to do that at index time. If you are continuously indexing new > documents, what was "recent" when it was indexed becomes, over time "less > recent". Are you unsatisfied with your current performance with the boost > function? Query-time recency boosting is a fairly common thing to do, and, > if done correctly, shouldn't be a performance concern. > > -Jay > http://lucidimagination.com > > > On Fri, Jun 4, 2010 at 4:50 PM, Asif Rahman wrote: > > > Perhaps I should have been more specific in my initial post. I'm doing > > date-based boosting on the documents in my index, so as to assign a > higher > > score to more recent documents. Currently I'm using a boost function to > > achieve this. I'm wondering if there would be a performance improvement > if > > instead of using the boost function at search time, I indexed the > documents > > with a date-based boost. > > > > On Fri, Jun 4, 2010 at 7:30 PM, Erick Erickson > >wrote: > > > > > Index time boosting is different than search time boosting, so > > > asking about performance is irrelevant. > > > > > > Paraphrasing Hossman from years ago on the Lucene list (from > > > memory). > > > > > > ...index time boosting is a way of saying this documents' > > > title is more important than other documents' titles. 
Search > > > time boosting is a way of saying "I care about documents > > > whose titles contain this term more than other documents > > > whose titles may match other parts of this query" > > > > > > HTH > > > Erick > > > > > > On Fri, Jun 4, 2010 at 5:10 PM, Asif Rahman wrote: > > > > > > > Hi, > > > > > > > > What are the performance ramifications for using a function-based > boost > > > at > > > > search time (through bf in dismax parser) versus an index-time boost? > > > > Currently I'm using boost functions on a 15GB index of ~14mm > documents. > > > > Our > > > > queries generally match many thousands of documents. I'm wondering > if > > I > > > > would see a performance improvement by switching over to index-time > > > > boosting. > > > > > > > > Thanks, > > > > > > > > Asif > > > > > > > > -- > > > > Asif Rahman > > > > Lead Engineer - NewsCred > > > > a...@newscred.com > > > > http://platform.newscred.com > > > > > > > > > > > > > > > -- > > Asif Rahman > > Lead Engineer - NewsCred > > a...@newscred.com > > http://platform.newscred.com > > > -- Asif Rahman Lead Engineer - NewsCred a...@newscred.com http://platform.newscred.com
Re: Index-time vs. search-time boosting performance
I know how to index a document with a boost but am still not sure whether I'll see a search performance improvement with it. The initial decision to use a boost function at search-time was made to preserve the flexibility to tweak the function without having to do a full reindex. I no longer need that flexibility so was wondering if I would get better performance by implementing the boost at index-time. On Fri, Jun 4, 2010 at 11:48 PM, Jonathan Rochkind wrote: > The SolrRelevancyFAQ does suggest that both index-time and search-time > boosting can be used to boost the score of newer documents, but doesn't > suggest what reasons/contexts one might choose one vs the other. It only > provides an example of search-time boost though, so it doesn't answer the > question of how to do an index time boost, if that was a question. > > > http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents > > Sorry, this doesn't answer your question, but does contribute the fact that > some author of the FAQ at some point considered index-time boost not > necessarily unreasonable. > > From: Asif Rahman [a...@newscred.com] > Sent: Friday, June 04, 2010 11:31 PM > To: solr-user@lucene.apache.org > Subject: Re: Index-time vs. search-time boosting performance > > It seems like it would be far more efficient to calculate the boost factor > once and store it rather than calculating it for each request in real-time. > Some of our queries match tens of thousands if not hundreds of thousands of > documents in a 15GB index. However, I'm not well-versed in lucene > internals > so I may be misunderstanding what is going on here. > > > On Fri, Jun 4, 2010 at 8:31 PM, Jay Hill wrote: > > > I've done a lot of recency boosting to documents, and I'm wondering why > you > > would want to do that at index time. If you are continuously indexing new > > documents, what was "recent" when it was indexed becomes, over time "less > > recent". 
Are you unsatisfied with your current performance with the boost > > function? Query-time recency boosting is a fairly common thing to do, > and, > > if done correctly, shouldn't be a performance concern. > > > > -Jay > > http://lucidimagination.com > > > > > > On Fri, Jun 4, 2010 at 4:50 PM, Asif Rahman wrote: > > > > > Perhaps I should have been more specific in my initial post. I'm doing > > > date-based boosting on the documents in my index, so as to assign a > > higher > > > score to more recent documents. Currently I'm using a boost function > to > > > achieve this. I'm wondering if there would be a performance > improvement > > if > > > instead of using the boost function at search time, I indexed the > > documents > > > with a date-based boost. > > > > > > On Fri, Jun 4, 2010 at 7:30 PM, Erick Erickson < > erickerick...@gmail.com > > > >wrote: > > > > > > > Index time boosting is different than search time boosting, so > > > > asking about performance is irrelevant. > > > > > > > > Paraphrasing Hossman from years ago on the Lucene list (from > > > > memory). > > > > > > > > ...index time boosting is a way of saying this documents' > > > > title is more important than other documents' titles. Search > > > > time boosting is a way of saying "I care about documents > > > > whose titles contain this term more than other documents > > > > whose titles may match other parts of this query" > > > > > > > > HTH > > > > Erick > > > > > > > > On Fri, Jun 4, 2010 at 5:10 PM, Asif Rahman > wrote: > > > > > > > > > Hi, > > > > > > > > > > What are the performance ramifications for using a function-based > > boost > > > > at > > > > > search time (through bf in dismax parser) versus an index-time > boost? > > > > > Currently I'm using boost functions on a 15GB index of ~14mm > > documents. > > > > > Our > > > > > queries generally match many thousands of documents. 
I'm wondering > > if > > > I > > > > > would see a performance improvement by switching over to index-time > > > > > boosting. > > > > > > > > > > Thanks, > > > > > > > > > > Asif > > > > > > > > > > -- > > > > > Asif Rahman > > > > > Lead Engineer - NewsCred > > > > > a...@newscred.com > > > > > http://platform.newscred.com > > > > > > > > > > > > > > > > > > > > > -- > > > Asif Rahman > > > Lead Engineer - NewsCred > > > a...@newscred.com > > > http://platform.newscred.com > > > > > > > > > -- > Asif Rahman > Lead Engineer - NewsCred > a...@newscred.com > http://platform.newscred.com > -- Asif Rahman Lead Engineer - NewsCred a...@newscred.com http://platform.newscred.com
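For completeness, since the FAQ linked above only shows the query-time side: in the XML update format an index-time boost is an attribute on the `<doc>` element (or on an individual `<field>`). The field names below are made up:

```xml
<add>
  <doc boost="2.0">
    <field name="id">article-123</field>
    <field name="title" boost="1.5">Some headline</field>
  </doc>
</add>
```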
Re: Index-time vs. search-time boosting performance
Thanks everyone for your help so far. I'm still trying to get to the bottom of whether switching over to index-time boosts will give me a performance improvement, and, if so, whether it will be noticeable. This is all under the assumption that I can achieve the scoring functionality that I need with either index-time or search-time boosting (given the loss of precision). I can always dust off the old profiler to see what's going on with the search-time boosts, but testing the index-time boosts will require a full reindex, which could take days with our dataset. On Sat, Jun 5, 2010 at 9:17 AM, Robert Muir wrote: > On Fri, Jun 4, 2010 at 7:50 PM, Asif Rahman wrote: > > > Perhaps I should have been more specific in my initial post. I'm doing > > date-based boosting on the documents in my index, so as to assign a > higher > > score to more recent documents. Currently I'm using a boost function to > > achieve this. I'm wondering if there would be a performance improvement > if > > instead of using the boost function at search time, I indexed the > documents > > with a date-based boost. > > > > > Asif, without knowing more details, before you look at performance you > might > want to consider the relevance impacts of switching to index-time boosting > for your use case too. > > You can read more about the differences here: > http://lucene.apache.org/java/3_0_1/scoring.html > > But I think the most important for this date-influenced use case is: > > "Indexing time boosts are preprocessed for storage efficiency and written > to > the directory (when writing the document) in a single byte (!)" > > If you do this as an index-time boost, your boosts will lose lots of > precision for this reason. > > -- > Robert Muir > rcm...@gmail.com > -- Asif Rahman Lead Engineer - NewsCred a...@newscred.com http://platform.newscred.com
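Robert's single-byte point can be made concrete. Lucene folds the index-time boost into the field norm and stores it as a one-byte tiny float; the sketch below is a simplified stand-in for that kind of encoding, not Lucene's exact SmallFloat code, but it shows the resolution problem: with only a few mantissa bits, boosts that differ by an hour's worth of a year-long decay land on the same byte.

```python
import math

def float_to_byte(f, mantissa_bits=5, exp_bits=3):
    """Quantize a positive float into exp_bits + mantissa_bits bits.

    Simplified illustration of a one-byte float, not Lucene's exact
    norm encoding.
    """
    if f <= 0:
        return 0
    m, e = math.frexp(f)              # f = m * 2**e, with 0.5 <= m < 1
    bias = 1 << (exp_bits - 1)
    e = max(-bias, min(bias - 1, e))  # clamp exponent to exp_bits
    mant = int((m - 0.5) * (1 << (mantissa_bits + 1)))
    mant = min(mant, (1 << mantissa_bits) - 1)
    return ((e + bias) << mantissa_bits) | mant

# One hour of a linear one-year decay: the two boosts collapse.
hour_step = 1 / (365 * 24)
collapses = float_to_byte(0.9) == float_to_byte(0.9 - hour_step)  # True
```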
Re: Index-time vs. search-time boosting performance
I still need a relatively precise boost. No less precise than hourly. I think that would make for a pretty messy field query. On Mon, Jun 7, 2010 at 2:15 AM, Lance Norskog wrote: > If you are unhappy with the performance overhead of a function boost, > you can push it into a field query by boosting date ranges. > > You would group in date ranges: documents in September would be > boosted 1.0, October 2.0, November 3.0 etc. > > > On 6/5/10, Asif Rahman wrote: > > Thanks everyone for your help so far. I'm still trying to get to the > bottom > > of whether switching over to index-time boosts will give me a performance > > improvement, and if so if it will be noticeable. This is all under the > > assumption that I can achieve the scoring functionality that I need with > > either index-time or search-time boosting (given the loss of precision. > I > > can always dust off the old profiler to see what's going on with the > > search-time boosts, but testing the index-time boosts will require a full > > reindex, which could take days with our dataset. > > > > On Sat, Jun 5, 2010 at 9:17 AM, Robert Muir wrote: > > > >> On Fri, Jun 4, 2010 at 7:50 PM, Asif Rahman wrote: > >> > >> > Perhaps I should have been more specific in my initial post. I'm > doing > >> > date-based boosting on the documents in my index, so as to assign a > >> higher > >> > score to more recent documents. Currently I'm using a boost function > to > >> > achieve this. I'm wondering if there would be a performance > improvement > >> if > >> > instead of using the boost function at search time, I indexed the > >> documents > >> > with a date-based boost. > >> > > >> > > >> Asif, without knowing more details, before you look at performance you > >> might > >> want to consider the relevance impacts of switching to index-time > boosting > >> for your use case too. 
> >> > >> You can read more about the differences here: > >> http://lucene.apache.org/java/3_0_1/scoring.html > >> > >> But I think the most important for this date-influenced use case is: > >> > >> "Indexing time boosts are preprocessed for storage efficiency and > written > >> to > >> the directory (when writing the document) in a single byte (!)" > >> > >> If you do this as an index-time boost, your boosts will lose lots of > >> precision for this reason. > >> > >> -- > >> Robert Muir > >> rcm...@gmail.com > >> > > > > > > > > -- > > Asif Rahman > > Lead Engineer - NewsCred > > a...@newscred.com > > http://platform.newscred.com > > > > > -- > Lance Norskog > goks...@gmail.com > -- Asif Rahman Lead Engineer - NewsCred a...@newscred.com http://platform.newscred.com
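Concretely, Lance's suggestion looks like a disjunction of boosted date-range clauses (`published_at` is a stand-in field name):

```
published_at:[NOW-1MONTH TO NOW]^3.0
OR published_at:[NOW-2MONTHS TO NOW-1MONTH]^2.0
OR published_at:[NOW-3MONTHS TO NOW-2MONTHS]^1.0
```

At hourly resolution over even a single month that becomes roughly 720 clauses, which is presumably the messiness referred to above.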
Re: how to test solr's performance?
I'm currently doing a trial with NewRelic's RPM Gold product. I'm having a great experience so far. Among other things, it gives stats on response time and throughput, broken down by request URL. It also gives stats on heap size and GC. Using this tool, we were able to determine which calls were using the most resources and implemented application level caching to ease the load on our Solr server. It also helped us tweak some issues we were having with garbage collection. On Thu, Jun 10, 2010 at 4:53 AM, Marc Sturlese wrote: > > I normally use jmeter, jconsole and iostat. Recently > http://www.newrelic.com/solr.html has been released > -- > View this message in context: > http://lucene.472066.n3.nabble.com/how-to-test-solr-s-performance-tp881928p885025.html > Sent from the Solr - User mailing list archive at Nabble.com. > -- Asif Rahman Lead Engineer - NewsCred a...@newscred.com http://platform.newscred.com
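Whatever the tool, the raw data reduces to the same two summaries: throughput and latency percentiles. A minimal helper for per-request response times pulled from Solr logs or a load-test run (function names are made up):

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile (e.g. pct=95) of response times in ms."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # nearest-rank definition: ceil(pct/100 * n), 1-indexed
    k = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[k - 1]

def summarize(times_ms, window_secs):
    """Reduce raw per-request timings to the numbers that matter."""
    return {
        "throughput_rps": len(times_ms) / window_secs,
        "p50_ms": percentile(times_ms, 50),
        "p95_ms": percentile(times_ms, 95),
    }
```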