Hi Peter,

Thanks for your response. I will dig into the sharding stuff ASAP :-)

 This may have changed recently, but the NRT stuff - e.g. per-segment
 commits etc. is for the latest Solr 4 trunk only.

Do I need to turn something 'on'?
Or do you know whether the NRT patches are documented somewhere?

 Be careful about merging, as all involved cores will pause
 for the merging period.

Really all involved cores? Not only the target core?

 The trickiest bit about the above is defining when data is
 deemed to be 'old'

In my case this should be simple: just tweets that are older than one day?

 and then moving that data in an efficient manner to a read-only shard.

How do you do this? Query the old core and index the queried data into the new one? Or re-index from the DB?
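
Something like this is what I have in mind - just a rough SolrJ sketch,
where the core URLs and the 'created_at' field are my own assumptions:

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class MoveOldTweets {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer live =
                new CommonsHttpSolrServer("http://localhost:8983/solr/live");
        CommonsHttpSolrServer archive =
                new CommonsHttpSolrServer("http://localhost:8983/solr/archive");

        // everything older than one day counts as 'old'
        String oldDocs = "created_at:[* TO NOW/DAY-1DAY]";

        SolrQuery query = new SolrQuery(oldDocs);
        query.setRows(500);
        int start = 0;
        while (true) {
            query.setStart(start);
            QueryResponse rsp = live.query(query);
            if (rsp.getResults().isEmpty())
                break;

            // copy the stored fields over; only works if everything needed is stored
            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            for (SolrDocument doc : rsp.getResults()) {
                SolrInputDocument in = new SolrInputDocument();
                for (String field : doc.getFieldNames())
                    in.addField(field, doc.getFieldValue(field));
                batch.add(in);
            }
            archive.add(batch);
            start += rsp.getResults().size();
        }
        archive.commit();

        // finally drop the old docs from the small 'live' core
        live.deleteByQuery(oldDocs);
        live.commit();
    }
}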

Regards,
Peter (K).


Hi Peter,

First off, many thanks for putting together the NRT Wiki page!

This may have changed recently, but the NRT stuff - e.g. per-segment
commits etc. is for the latest Solr 4 trunk only.
If your setup uses the 3.x Solr code branch, then there's a bit of work
to do to move to the new version.
Some of this is due to the new 3.x Lucene, which has a lot of cool new
stuff in it, but also deprecates a lot of old stuff,
so existing SolrJ clients and custom server-side code/configuration
will need to take this into account.
We've not had the time to do this, so that's about as far as I can go
on that one for now.

We have had some very good success with distributed/shard searching -
i.e. 'new' data arrives in a relatively small index, and so can remain
fast, whilst distributed shards hold 'older' data and so can keep
their caches warm (i.e. very few/no commits). This works particularly
well for summary data (facets, filter queries etc. that sit in caches).
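
A minimal sketch of such a sharded query via SolrJ (host names, core names
and the facet field are only placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ShardedSearch {
    public static void main(String[] args) throws Exception {
        // send the request to the small 'new' core; Solr fans it out to the read-only shards
        CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://host1:8983/solr/new");

        SolrQuery query = new SolrQuery("*:*");
        query.set("shards", "host1:8983/solr/new,host2:8983/solr/old1,host3:8983/solr/old2");
        query.setFacet(true);
        query.addFacetField("user");   // summary data served mostly from the warm caches

        QueryResponse rsp = server.query(query);
        System.out.println("hits: " + rsp.getResults().getNumFound());
    }
}
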
Be careful about merging, as all involved cores will pause for the
merging period. It really needs to be done out-of-hours, or better still,
offline (i.e. replicate the cores, then merge, then bring them live).
The trickiest bit about the above is defining when data is deemed to
be 'old' and then moving that data in an efficient manner to a
read-only shard. Using SolrJ can help in this regard as it can offload
some of the administration from the server(s).
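
For the offline merge step, here's a rough sketch that just fires the core
admin 'mergeindexes' command over HTTP (core names and paths are made up;
the same thing can be driven via SolrJ's CoreAdminRequest):

import java.io.InputStream;
import java.net.URL;
import java.net.URLEncoder;

public class MergeOffline {
    public static void main(String[] args) throws Exception {
        // index dir of a replicated copy of the small 'live' core (not the live core itself)
        String sourceIndexDir = "/data/solr/live-copy/data/index";

        // 'mergeindexes' pulls that index into the target 'archive' core
        String url = "http://localhost:8983/solr/admin/cores"
                + "?action=mergeindexes&core=archive"
                + "&indexDir=" + URLEncoder.encode(sourceIndexDir, "UTF-8");

        InputStream in = new URL(url).openStream();  // fire the admin command
        in.close();
        // afterwards: commit on the 'archive' core, then bring it live (e.g. core SWAP)
    }
}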

Thanks,
Peter


On Mon, Nov 15, 2010 at 8:06 PM, Peter Karich <peat...@yahoo.de> wrote:
Hi,

I want to provide my indexed docs (tweets) relatively fast: 1 to 10 sec,
or even 30 sec, would be OK.

At the moment I am using the read-only core scenario described here (point
5)* with a commit frequency of 180 seconds, which was fine until a few days
ago. (I am using Solr 1.4.1.)
Now a commit takes too long (40-80 s) and is too CPU-heavy because the
index has grown too large (>7 GB).

I thought about some possible solutions:
1. using the Solr NRT patches**
2. using shards (+ multicore), where I feed into a relatively small core and
merge the cores later (every hour or so) to reduce the number of cores
3. it would also be nice if someone could explain what benefits there are
(if any) when using Solr 4.0 ...

The problem with 1. is that I haven't found a guide on how to apply all the
patches. Or is NRT not possible with Solr at the moment? Does anybody have a
link for me?

Then I looked into solution 2. It seems to me that the CPU and
administration overhead of sharding can be quite high. Any hints (I am using
SolrJ)? E.g. I would need to include the date facet patch.
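
For reference, this is roughly the kind of sharded date facet query I would
fire via SolrJ (hosts, cores and the 'created_at' field are just my setup):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TweetDateFacet {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr/tweets");

        SolrQuery query = new SolrQuery("*:*");
        // sharded request; date facets over shards are what need the patch on 1.4.1
        query.set("shards", "localhost:8983/solr/tweets,localhost:8983/solr/tweets-old");
        query.set("facet", true);
        query.set("facet.date", "created_at");
        query.set("facet.date.start", "NOW/DAY-7DAYS");
        query.set("facet.date.end", "NOW/DAY+1DAY");
        query.set("facet.date.gap", "+1DAY");

        QueryResponse rsp = server.query(query);
        // the per-day counts come back in the 'facet_dates' section of the response
        System.out.println(rsp.getResponse().get("facet_counts"));
    }
}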

Or how would you solve this?

Regards,
Peter.

*
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201009.mbox/%3caanlktincgekjlbxe_bsaahlct_hlr_kwuxm5zxovt...@mail.gmail.com%3e

**
https://issues.apache.org/jira/browse/SOLR-1606


--
http://jetwick.com twitter search prototype
