Solr: separating index and storage
Consider the following use case.

Certain words are extracted from a document and indexed. The exact sentence containing the word cannot be stored alongside the extracted word because of the volume at which the documents grow. How can the index and, let's call it, the doc server be separated?

An option is to store the sentences in MongoDB or an RDBMS. But there seems to be a schema-level design issue: assuming 'word' to be a multivalued field, how do we associate with it a reference to the corresponding entry in the doc server? We may create (word_1, ref_1) tuples. Is there any other built-in feature?

Any related project which separates the index and doc servers?

Thanks,
Sourajit
Re: Solr: separating index and storage
Absolutely. Solr will return the reference along with the docs/results; those references may be used to look up the actual content. Such use cases aren't hard to solve.

If the use case demands returning the actual content alongside the results, it becomes non-trivial, especially under high load.

To avoid this and get a quick implementation, I can judiciously create stored fields and see how they perform. I will need to figure out what happens if the volume growth of stored fields is high, how much disk I/O is involved, and what happens to the stored fields if we shard the index.

Best,
Sourajit

On Tue, Jun 4, 2013 at 5:31 PM, Erick Erickson wrote:
> You have to index something with your Solr documents that
> has meaning in _your_ system so you can find the
> original record. You don't search this field, you just
> return it with the search results and then use it to get
> the original document.
>
> If you're storing the original in a DB, this can be the PK.
> If on a file system, the path. Etc.
>
> Essentially, since the association is specific to your environment
> you need to handle it explicitly...
>
> Best
> Erick
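Erick's suggestion can be sketched in a few lines. This is a hypothetical illustration, not code from the thread: a plain dict stands in for the external doc store (MongoDB or an RDBMS), a list of dicts stands in for the Solr index, and the field names (`word`, `doc_ref`) are made up for the example.

```python
# Sketch of the reference-based lookup Erick describes: Solr stores only
# a reference; the full sentence lives in an external doc store.
# All names here are illustrative assumptions, not from the thread.

# External "doc server": maps a reference to the original sentence.
doc_store = {
    "ref_1": "The quick brown fox jumps over the lazy dog.",
    "ref_2": "A lazy afternoon by the river.",
}

# What gets indexed: the extracted word plus a stored (but never
# searched) reference field pointing back at the doc store.
index = [
    {"id": "1", "word": "fox", "doc_ref": "ref_1"},
    {"id": "2", "word": "lazy", "doc_ref": "ref_1"},
    {"id": "3", "word": "lazy", "doc_ref": "ref_2"},
]

def search(word):
    """Find matching index entries, then resolve each doc_ref externally."""
    hits = [d for d in index if d["word"] == word]
    return [(d["doc_ref"], doc_store[d["doc_ref"]]) for d in hits]

print(search("lazy"))
```

The same shape works with real components: the second step becomes a keyed lookup (primary key in a DB, `_id` in MongoDB, or a file path) using the reference field returned in the Solr results.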
Re: Solr: separating index and storage
Each day the index grows by ~250 MB; however, I anticipate this growth will slow down because there will be repetitions (just a guess). It's not the order of growth but a limitation of our infrastructure, basically a budgetary constraint :-)

Apparently there is no problem other than disk space, so we will go ahead with the idea of stored fields.

On Thu, Jun 6, 2013 at 5:03 PM, Erick Erickson wrote:
> By and large, stored fields are pretty irrelevant for resource
> consumption _except_ for disk space consumed. Sharded systems work
> fine; the stored data is kept in the index files (*.fdt and *.fdx)
> in each segment on each shard.
>
> But you haven't told us anything about your data. How much are
> you talking about here? 100s of GB? Terabytes? Other than disk
> space, you may well be anticipating problems that don't exist...
>
> Now, when _returning_ documents the fields must be read, so
> there is some resource consumption there, which you can
> mitigate with lazy field loading. But this is usually just a few docs,
> so it often isn't a problem.
>
> Best
> Erick
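The lazy field loading Erick mentions is controlled by a real solrconfig.xml setting. A minimal fragment (the surrounding configuration is elided):

```xml
<!-- solrconfig.xml (fragment): with lazy loading enabled, only the
     stored fields actually requested via the fl parameter are read
     eagerly; other stored fields are loaded on demand. -->
<query>
  <enableLazyFieldLoading>true</enableLazyFieldLoading>
</query>
```

This mitigates the read cost of large stored fields when most queries only return a subset of them.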
Re: index merge question
I have noticed that when I write a doc with an id that already exists, it creates a new revision with only the fields from the second write. I understand there is a REST API in the latest Solr version which updates only selected fields. In my opinion, a merge should create a doc which is a union of the fields, assuming the fields conform to the schema of the output index.

~ Sourajit

On Sun, Jun 9, 2013 at 12:06 AM, Mark Miller wrote:
>
> On Jun 8, 2013, at 12:52 PM, Jamie Johnson wrote:
>
> > When merging through the core admin
> > (http://wiki.apache.org/solr/MergingSolrIndexes), what is the policy
> > for conflicts during the merge? For instance, if I am merging core 1
> > and core 2 into core 0 (first example), what happens if core 1 and
> > core 2 both have a document with the same key, say core 1 has a newer
> > version than core 2? Does the merge fail? Does the newer document remain?
>
> You end up with both documents, both with that ID - not generally a
> situation you want to end up in. You need to ensure unique ids in the
> input data or replace the index rather than merging into it.
>
> > Also, with the srcCore method, if a document with key 1 is written
> > while an index also with key 1 is being merged, what happens?
>
> It depends on the order, I think - if the doc is written after the merge
> and it's an update, it will update the doc that was just merged in. If
> the merge comes second, you have the doc twice and it's a problem.
>
> - Mark
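Mark's advice to ensure unique ids before merging can be sketched as a pre-merge dedup step. This is a hypothetical illustration: plain dicts stand in for the docs of the two cores, and the `version` field is an assumed per-doc timestamp maintained by the application, not a Solr internal.

```python
# Pre-merge dedup, per Mark's advice: index merging does not resolve
# duplicate keys, so collapse duplicates (keeping the newest) before
# merging. Docs are plain dicts; "version" is an assumed app-level field.

def dedup_for_merge(*cores):
    """Union docs from several cores, keeping the highest-version doc per id."""
    newest = {}
    for core in cores:
        for doc in core:
            key = doc["id"]
            if key not in newest or doc["version"] > newest[key]["version"]:
                newest[key] = doc
    return list(newest.values())

core1 = [{"id": "a", "version": 2, "title": "newer"}]
core2 = [{"id": "a", "version": 1, "title": "older"},
         {"id": "b", "version": 1, "title": "unique"}]

merged = dedup_for_merge(core1, core2)
# Only the newer doc "a" and the unique doc "b" survive.
```

Feeding the deduplicated set through a normal (re)index, rather than a raw index merge, sidesteps the duplicate-ID state Mark warns about.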
Re: [blogpost] Memory is overrated, use SSDs
@Erick, your revelation on SSDs is very valuable.

Do you have any idea on the following? Do more processors with fewer cores, or fewer processors with more cores, i.e. which of 4P2C or 2P4C gives the best cost per query?

~ Sourajit

On Fri, Jun 7, 2013 at 4:45 PM, Erick Erickson wrote:
> Thanks for this, hard data is always welcome!
>
> Another blog post for my reference list!
>
> Erick
>
> On Fri, Jun 7, 2013 at 2:59 AM, Toke Eskildsen wrote:
> > On Fri, 2013-06-07 at 07:15 +0200, Andy wrote:
> >> One question I have is: did you precondition the SSD
> >> (http://www.sandforce.com/userfiles/file/downloads/FMS2009_F2A_Smith.pdf)?
> >> SSD performance tends to take a very deep dive once all blocks are
> >> written at least once and the garbage collector kicks in.
> >
> > Not explicitly so. The machine is our test server with the SSDs in
> > RAID 0 with - to my knowledge - no TRIM support. They are 2½ years
> > old, have had a fair amount of data written, and have been 3/4 full
> > most of the time. At one point we experimented with 10M+ relatively
> > small files and a couple of 40GB databases, so the drives are
> > definitely not in pristine condition.
> >
> > Anyway, as Solr searching is heavy on tiny random reads, I suspect
> > that search performance will be largely unaffected by SSD
> > fragmentation. It would be interesting to examine, but for now I
> > cannot prioritize another large performance test.
> >
> > Thank you for your input. I will update the blog post accordingly,
> > Toke Eskildsen, State and University Library, Denmark
Re: [blogpost] Memory is overrated, use SSDs
Hopefully I will be able to post results on 2P4C performance shortly.

~ Sourajit

On Mon, Jun 10, 2013 at 2:20 AM, Toke Eskildsen wrote:
> Sourajit Basak [sourajit.ba...@gmail.com]:
> > Does more processors with less cores or less processors with more cores
> > i.e. which of 4P2C or 2P4C has best cost per query ?
>
> I have not tested that, so everything I say is (somewhat qualified)
> guesswork.
>
> Assuming a NUMA architecture, my guess is that 2P4C would be superior
> to 4P2C. Solr utilizes both disk caching and explicit caching on the
> JVM heap, making memory access quite heavy; the fewer processors, the
> higher the chance that the memory will be controlled by the processor
> running the given search thread. I am by no means a NUMA expert, but it
> seems that requests for memory controlled by another processor take
> about twice as long as local memory.
>
> Our machine is a NUMA dual processor and if I can find the time, I
> would love to perform some tests on how that part influences query
> time. It would be interesting to lock usage to 2 cores on each
> processor vs. 4 cores on the same processor. The tricky part is to
> ensure that RAM is fully controlled by the single processor in the
> second test, including the disk cache.
>
> Regards,
> Toke Eskildsen
edismax: date range facet with queries containing OR clause
When we have a user query like keyword1 OR keyword2, we can find the count of each keyword using the following params:

q=keyword1 OR keyword2
facet.query=keyword1
facet.query=keyword2
facet=true

How do we do a date range facet that will return results for each keyword, faceted by date range?
Re: edismax: date range facet with queries containing OR clause
That's exactly how we are doing it now. However, we need to offer the search over slow networks, hence I was wondering if there's a way to reduce server round-trips.

On Sun, Jun 23, 2013 at 7:14 PM, Jack Krupansky wrote:
> Just do separate faceted query requests:
>
> q=keyword1
> facet.range=date_field_name
> ...
> facet=true
>
> q=keyword2
> facet.range=date_field_name
> ...
> facet=true
>
> Where the "..." means fill in the additional facet.range.xxx parameters
> (start, end, gap, etc.)
>
> -- Jack Krupansky
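One way to cut the round-trips, sketched below, is to combine each keyword with an explicit date range inside a single request's facet.query parameters: facet.query accepts an arbitrary Solr query, and the {!key=...} local param labels each count in the response. The field names (`pub_date`) and the date buckets are assumptions made up for the example, and this approach is a sketch rather than something verified against the Solr version used in the thread.

```python
# Sketch: one request, one facet.query per (keyword, date bucket) pair,
# instead of one request per keyword. facet.query takes an arbitrary
# query, so a keyword can be ANDed with a date range; {!key=...} names
# each bucket. Field names and ranges are illustrative assumptions.
from urllib.parse import urlencode

keywords = ["keyword1", "keyword2"]
buckets = [("2013-01-01T00:00:00Z", "2013-02-01T00:00:00Z"),
           ("2013-02-01T00:00:00Z", "2013-03-01T00:00:00Z")]

params = [("q", " OR ".join(keywords)), ("facet", "true"), ("rows", "0")]
for kw in keywords:
    for start, end in buckets:
        label = f"{kw}_{start[:7]}"  # e.g. keyword1_2013-01
        fq = f"{{!key={label}}}{kw} AND pub_date:[{start} TO {end}]"
        params.append(("facet.query", fq))

query_string = urlencode(params)  # append to /select? on the Solr URL
```

The trade-off is that the server evaluates keywords × buckets facet queries per request, so the bucket list should stay small.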
Re: edismax: date range facet with queries containing OR clause
Is there a way to write this query using pivots? Will try out and post here. I'd appreciate it if someone points to a way.
Re: edismax: date range facet with queries containing OR clause
We are using edismax; the keywords can be in any of the 'qf' fields specified. Assume 'qf' to be a single fieldA; then the following doesn't seem to make sense:

q=keyword1 OR keyword2
facet=true
facet.pivot=fieldA,date_field

The purpose is to display the count of the matches of keyword1 and keyword2 in fieldA over a range of dates. It seems this isn't possible.

On Sun, Jun 23, 2013 at 8:09 PM, Jack Krupansky wrote:
> If your keywords are the value in some other field, then, yes, you can
> use facet pivots:
>
> facet.pivot=keyword_field,date_field
>
> (See the example in the book! Or on the wiki.)
>
> -- Jack Krupansky