Solr: separating index and storage

2013-06-03 Thread Sourajit Basak
Consider the following use case.

Certain words are extracted from a document and indexed. The exact sentence
containing each word cannot be stored alongside the extracted word because
of the volume at which the documents grow. How can the index and (let's
call it) the doc server be separated?

An option is to store the sentences in MongoDB or an RDBMS, but there seems
to be a schema-level design issue. Assuming 'word' is a multivalued field,
how do we associate with it a reference to the corresponding entry in the
doc server?

One could create (word_1, ref_1) tuples. Is there any other built-in feature?

Is there any related project which separates the index and doc servers?

Thanks,
Sourajit


Re: Solr: separating index and storage

2013-06-06 Thread Sourajit Basak
Absolutely. Solr will return the reference along with the docs/results;
those references may be used to look up the actual content. Such use cases
aren't hard to solve.
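A minimal sketch of this two-hop pattern, with a dict standing in for the external doc server (MongoDB or an RDBMS) and a hard-coded list standing in for a Solr response; the field name `doc_ref` is illustrative, not a Solr API:

```python
# External "doc server": reference -> full sentence. A dict stands in for
# MongoDB or an RDBMS here; in practice this would be a real lookup.
doc_store = {
    "ref_1": "The quick brown fox jumps over the lazy dog.",
    "ref_2": "A lazy afternoon by the river.",
}

# Simulated Solr results for q=lazy, with the stored reference field returned
solr_results = [
    {"word": "lazy", "doc_ref": "ref_1"},
    {"word": "lazy", "doc_ref": "ref_2"},
]

def resolve(results, store):
    """Second hop: swap each stored reference for the actual sentence."""
    return [dict(hit, sentence=store[hit["doc_ref"]]) for hit in results]

hits = resolve(solr_results, doc_store)
```

The index stays small because only the reference is stored; the cost is the extra lookup per result page.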

If the use case demands returning the actual content alongside the results,
it becomes non-trivial, especially under high load.

To avoid this and get a quick implementation, I can judiciously create
stored fields and see how they perform. I will need to figure out what
happens if the stored fields grow quickly, how much disk I/O they incur,
and what happens to the stored fields if we shard the index.

Best,
Sourajit




On Tue, Jun 4, 2013 at 5:31 PM, Erick Erickson wrote:

> You have to index something with your Solr documents that
> has meaning in _your_ system so you can find the
> original record. You don't search this field, you just
> return it with the search results and then use it to get
> the original document.
>
> If you're storing the original in a DB, this can be the PK.
> If it's on a file system, the path, etc.
>
> Essentially, since the association is specific to your environment
> you need to handle it explicitly...
>
> Best
> Erick


Re: Solr: separating index and storage

2013-06-06 Thread Sourajit Basak
Each day the index grows by ~250 MB; however, I am anticipating that this
growth will slow down because there will be repetitions (just a guess). It's
not the order of growth that concerns us but the limitations of our
infrastructure. Basically a budgetary constraint :-)

Apparently there is no problem other than disk space, so we will go
ahead with the idea of stored fields.
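A back-of-envelope disk budget from the figure above (~250 MB/day), assuming the pessimistic case where growth never slows:

```python
# Worst-case yearly disk consumption at the observed ~250 MB/day growth rate
daily_mb = 250
yearly_gb = daily_mb * 365 / 1024.0
print(round(yearly_gb, 1))  # about 89.1 GB/year if growth never slows
```

So even with no slowdown from repetitions, a year of stored fields is on the order of 90 GB, which puts the "disk space only" trade-off in concrete terms.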




On Thu, Jun 6, 2013 at 5:03 PM, Erick Erickson wrote:

> By and large, stored fields are pretty irrelevant for resource
> consumption _except_ for
> disk space consumed. Sharded systems work fine; the
> stored data is kept in the index files (*.fdt and *.fdx) in
> each segment on each shard.
>
> But you haven't told us anything about your data. How much are
> you talking about here? 100s of GB? Terabytes? Other than disk
> space, you may well be anticipating problems that don't exist...
>
> Now, when _returning_ documents the fields must be read, so
> there is some resource consumption there which you can
> mitigate with lazy field loading. But this is usually just a few docs
> so often isn't a problem.
>
> Best
> Erick


Re: index merge question

2013-06-08 Thread Sourajit Basak
I have noticed that when I write a doc with an id that already exists, it
creates a new revision with only the fields from the second write. I
believe there is a REST API in the latest Solr version (atomic updates)
which updates only selected fields.

In my opinion, a merge should create a doc which is the union of the
fields, assuming the fields conform to the schema of the output
index.
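The union semantics argued for above could be sketched like this. Note this is purely illustrative of the proposed behaviour; it is not what Solr's core merge actually does (as the reply below explains, a core merge keeps both documents under the same id):

```python
def union_merge(old_doc, new_doc):
    # Hypothetical union-of-fields merge: keep every field from both
    # revisions; on conflict the newer write wins. Solr does NOT do this
    # during an index merge -- this sketches the semantics proposed above.
    merged = dict(old_doc)
    merged.update(new_doc)
    return merged

old = {"id": "42", "title": "first write", "author": "basak"}
new = {"id": "42", "title": "second write"}
merged = union_merge(old, new)
```

Here `merged` retains `author` from the first write while taking the updated `title` from the second.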

~ Sourajit


On Sun, Jun 9, 2013 at 12:06 AM, Mark Miller  wrote:

>
> On Jun 8, 2013, at 12:52 PM, Jamie Johnson  wrote:
>
> > When merging through the core admin (
> > http://wiki.apache.org/solr/MergingSolrIndexes) what is the policy for
> > conflicts during the merge? So for instance, if I am merging core 1 and
> > core 2 into core 0 (first example), what happens if core 1 and core 2 both
> > have a document with the same key, say core 1 has a newer version than
> > core 2? Does the merge fail, or does the newer document remain?
>
> You end up with both documents, both with that ID - not generally a
> situation you want to end up in. You need to ensure unique ids in the
> input data or replace the index rather than merging into it.
>
> >
> > Also, if using the srcCore method, what happens if a document with key 1
> > is written while an index also containing key 1 is being merged?
>
> It depends on the order I think - if the doc is written after the merge
> and it's an update, it will update the doc that was just merged in. If the
> merge comes second, you have the doc twice and it's a problem.
>
> - Mark


Re: [blogpost] Memory is overrated, use SSDs

2013-06-09 Thread Sourajit Basak
@Erick,
Your revelation on SSDs is very valuable.
Do you have any idea on the following ?

Which gives the best cost per query: more processors with fewer cores, or
fewer processors with more cores, i.e. 4P2C or 2P4C?

~ Sourajit


On Fri, Jun 7, 2013 at 4:45 PM, Erick Erickson wrote:

> Thanks for this, hard data is always welcome!
>
> Another blog post for my reference list!
>
> Erick
>
> On Fri, Jun 7, 2013 at 2:59 AM, Toke Eskildsen 
> wrote:
> > On Fri, 2013-06-07 at 07:15 +0200, Andy wrote:
> >> One question I have is: did you precondition the SSD
> >> (http://www.sandforce.com/userfiles/file/downloads/FMS2009_F2A_Smith.pdf)?
> >> SSD performance tends to take a very deep dive once all blocks are
> >> written at least once and the garbage collector kicks in.
> >
> > Not explicitly so. The machine is our test server with the SSDs in RAID
> > 0 with - to my knowledge - no TRIM support. They are 2½ years old, have
> > had a fair amount of data written, and have been 3/4 full most of the time.
> > At one point in time we experimented with 10M+ relatively small files
> > and a couple of 40GB databases, so the drives are definitely not in
> > pristine condition.
> >
> > Anyway, as Solr search is heavy on tiny random reads, I suspect that
> > search performance will be largely unaffected by SSD fragmentation. It
> > would be interesting to examine, but for now I cannot prioritize another
> > large performance test.
> >
> >
> > Thank you for your input. I will update the blog post accordingly,
> > Toke Eskildsen, State and University Library, Denmark
> >
>


Re: [blogpost] Memory is overrated, use SSDs

2013-06-09 Thread Sourajit Basak
Hopefully I will be able to post results shortly on 2P4C performance.

~ Sourajit


On Mon, Jun 10, 2013 at 2:20 AM, Toke Eskildsen wrote:

> Sourajit Basak [sourajit.ba...@gmail.com]:
> > Does more processors with less cores or less processors with more cores
> > i.e. which of 4P2C or 2P4C has best cost per query ?
>
> I have not tested that, so everything I say is (somewhat qualified)
> guesswork.
>
> Assuming a NUMA architecture, my guess is that 2P4C would be superior to
> 4P2C. Solr utilizes both disk caching and explicit caching on the JVM heap
> making memory access quite heavy; the fewer processors, the higher the
> chance that the memory will be controlled by the processor running the
> given search thread. I am by no means a NUMA expert, but it seems that
> requests for memory controlled by another processor take about twice as
> long as local memory.
>
> Our machine is a NUMA dual processor and if I can find the time, I would
> love to perform some tests on how that part influences query time. It would
> be interesting to lock usage to 2 cores on each processor vs. 4 cores on
> the same processor. The tricky part is to ensure that RAM is fully
> controlled by the single processor in the second test, including the disk
> cache.
>
> Regards,
> Toke Eskildsen


edismax: date range facet with queries containing OR clause

2013-06-23 Thread Sourajit Basak
When we have a user query like keyword1 OR keyword2, we can find the count
of each keyword using the following params.

q=keyword1 OR keyword2
facet.query=keyword1
facet.query=keyword2
facet=true

How do we do a date range facet that will return, for each keyword, counts
faceted by date range?
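For reference, the parameter list above can be built like this. Because `facet.query` repeats, the parameters must be encoded as a sequence of pairs rather than a dict (the Solr URL itself is omitted; this only shows the query string):

```python
from urllib.parse import urlencode

# facet.query appears twice, so use a list of (key, value) pairs, not a dict
params = [
    ("q", "keyword1 OR keyword2"),
    ("facet", "true"),
    ("facet.query", "keyword1"),
    ("facet.query", "keyword2"),
]
query_string = urlencode(params)
print(query_string)
```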


Re: edismax: date range facet with queries containing OR clause

2013-06-23 Thread Sourajit Basak
That's exactly how we are doing it now. However, we need to offer the
search over slow networks, hence I was wondering if there's a way to reduce
server round-trips.
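One possible way to cut round-trips (a sketch only, not verified against a live Solr): enumerate the date buckets yourself and send a single request with one labelled `facet.query` per keyword-and-bucket combination. The field name `date_field` and the `{!key=...}` labels are assumptions to check against your schema and Solr version:

```python
from urllib.parse import urlencode

# Emulate a per-keyword date-range facet in ONE request by expanding
# keyword x date-bucket combinations into labelled facet.query clauses.
keywords = ["keyword1", "keyword2"]
buckets = [
    ("2013-01", "[2013-01-01T00:00:00Z TO 2013-02-01T00:00:00Z]"),
    ("2013-02", "[2013-02-01T00:00:00Z TO 2013-03-01T00:00:00Z]"),
]

params = [("q", " OR ".join(keywords)), ("facet", "true")]
for kw in keywords:
    for label, rng in buckets:
        clause = "{!key=%s_%s}%s AND date_field:%s" % (kw, label, kw, rng)
        params.append(("facet.query", clause))

facet_queries = [v for k, v in params if k == "facet.query"]
query_string = urlencode(params)
```

The trade-off is that the number of `facet.query` clauses grows as keywords × buckets, so this only helps when both lists are small.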


On Sun, Jun 23, 2013 at 7:14 PM, Jack Krupansky wrote:

> Just do separate faceted query requests:
>
> q= keyword1
> facet.range=date_field_name
> ...
> facet=true
>
> q= keyword2
> facet.range=date_field_name
> ...
> facet=true
>
> Where the "..." means fill in the additional facet.range.xxx parameters
> (start, end, gap, etc.)
>
> -- Jack Krupansky


Re: edismax: date range facet with queries containing OR clause

2013-06-23 Thread Sourajit Basak
Is there a way to write this query using pivots? I will try it out and post
here. I'd appreciate it if someone could point to a way.






Re: edismax: date range facet with queries containing OR clause

2013-06-23 Thread Sourajit Basak
We are using edismax; the keywords can be in any of the 'qf' fields
specified. Assume 'qf' is a single field, fieldA; then the following doesn't
seem to make sense.

q=keyword1 OR keyword2
facet=true
facet.pivot=fieldA,date_field

The purpose is to display the count of the matches of keyword1 & keyword2
in fieldA over a range of dates.

Seems like this isn't possible.


On Sun, Jun 23, 2013 at 8:09 PM, Jack Krupansky wrote:

> If your keywords are the value in some other field, then, yes, you can use
> facet pivots:
>
> facet.pivot=keyword_field,date_field
>
> (See the example in the book! Or on the wiki.)
>
>
> -- Jack Krupansky
>