Naveen:
See below:
*NRT with Apache Solr 3.3 and RankingAlgorithm does need a commit for a
document to become searchable*. Any document that you add through update
becomes immediately searchable. So no need to commit from within your
update client code. Since there is no commit, the cache does not have to be
cleared or the old searchers closed or new searchers opened, and warmed
(error that you are facing).
The link which you mentioned is clearly what we wanted. But the real
issue is that you wrote "RA does need a commit for a document to become
searchable" (please take a look at the bold sentence).
Yes, as said earlier you do not need a commit. A document becomes
searchable as soon as you add it. Below is an example of adding a
document with curl (this is from the wiki at
http://solr-ra.tgels.com/wiki/en/Near_Real_Time_Search_ver_3.x):
curl "http://localhost:8983/solr/update/csv?stream.file=/tmp/x1.csv&encapsulator=%1f"
There is no commit included. The contents of the document become
immediately searchable.
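The same request can also be built from client code. Below is a minimal Python sketch, assuming the host, port, and handler path from the curl example above; the helper name build_update_url is hypothetical and is not part of Solr or RankingAlgorithm. It only constructs the URL, to show that no commit parameter is included.

```python
from urllib.parse import urlencode

# Assumed from the curl example above; adjust for your own installation.
SOLR_CSV_UPDATE = "http://localhost:8983/solr/update/csv"


def build_update_url(csv_path, encapsulator="\x1f"):
    """Build a CSV update URL that streams a server-side file.

    No commit parameter is added to the request.
    """
    params = {"stream.file": csv_path, "encapsulator": encapsulator}
    # safe="/" keeps the file path readable; the 0x1f character is
    # percent-encoded by urlencode.
    return SOLR_CSV_UPDATE + "?" + urlencode(params, safe="/")


if __name__ == "__main__":
    print(build_update_url("/tmp/x1.csv"))
```

If your existing client appends commit=true to its update URLs, dropping that parameter is the only change this sketch is meant to illustrate.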
In future, under heavier load, can it work with Master/Slave (Replication)
and so on to scale and perform better? If yes, we would like to go for NRT;
the performance described in the article is acceptable. We were
expecting the same real time performance for a single user.
There are no changes to the Master/Slave (replication) process. So
whatever you have set up currently will work as before, and if you enable
replication later, it should work just as it does without NRT.
What about multiple users? Should we wait for 1-2 secs before sending the
curl request to make SOLR perform better, or will it handle multiple
requests internally (multithreading, etc.)?
Again for updating documents, you do not have to change your current
process or code. Everything remains the same, except that if you were
including commit, you do not include commit in your update statements.
There is no change to the existing update process, so internally it will
not queue or multi-thread updates. It behaves as in existing Solr;
there are no changes to the existing setup.
Regarding better performance, in the wiki paper every update through curl
adds (streams) 500 documents. So you could take this approach. (This was
a number that I chose arbitrarily to test the performance, but it seems to
work well.)
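The approach of sending 500 documents per update can be sketched as below. This is a minimal Python sketch, assuming the documents are already available as CSV rows in memory; the function name batches is illustrative, and the default of 500 is only the value used in the wiki test, not a tuned recommendation. How each batch is written out and posted is left to your existing update code.

```python
from itertools import islice


def batches(rows, size=500):
    """Yield successive lists of at most `size` rows."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    rows = ["id-%d,name-%d" % (i, i) for i in range(1200)]
    for n, chunk in enumerate(batches(rows), start=1):
        # Each chunk would be sent as one update request, with no commit.
        print("batch", n, "has", len(chunk), "rows")
```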
What would be the doc size (10,000 docs) to allow the JVM to perform better?
Have you done any kind of benchmarking in terms of multi threaded and multi
user for NRT, and also JVM tuning in terms of SOLR server performance? Any
kind of performance analysis would help us decide quickly to switch over to
NRT.
The performance discussed in the wiki paper uses the MBArtists index.
The MBArtists index is the index used as one of the examples in the
book, Solr 1.4 Enterprise Search Server. You can download and build this
index if you have the book or can also download the contents from
musicbrainz.org. Each doc may be about 100 bytes and has about 7 fields.
Performance with Wikipedia's XML dump: with the skipdoc field commented
out (include redirects) in the dataconfig.xml [ dataimport handler ], the
update performance is about 15000 docs / sec (100 million docs); with
skipdoc enabled (does not skip redirects), the performance is about
1350 docs / sec [ time is spent mostly converting/validating XML rather
than on the actual update ] (about 11 million docs). Documents in Wikipedia
can be quite big, with an average size of about 2500-5000 bytes or more.
I would suggest that you download and give NRT with Apache Solr 3.3 and
RankingAlgorithm a try and get a feel of it as this would be the best
way to see how your config works with it.
Questions regarding switching over to NRT:
1. Should we upgrade to SOLR 4.x?
2. Any benchmarking (10,000 docs/sec)? The question here is more
specifically about the details of an individual doc (fields, number of
fields, field size, parameters affecting performance with or w/o faceting).
Please see the MBArtists index as discussed above.
3. What about multiple users ?
A user in real time might have a large doc size of 0.1 million. How to
break it up and analyze which approach is better (though it is our task to
do)? Still, any kind of breakdown will help us. Imagine a user inbox.
You may be able to stream the documents in a set as in the example in the
wiki. The example streams 500 documents at a time. The wiki paper has an
example of a document that was used. You could copy/paste that to try it
out.
4. JVM tuning and performance result based on Multithreaded environment.
5. Machine Details (RAM, CPU, and settings from SOLR perspective).
Default Solr settings with the shipped jetty container. The startup
script used is available when you download Solr 3.3 with
RankingAlgorithm. It has mx set to 2 GB and uses the default collector
with parallel collection enabled for the young generation. The system
is an x86_64 Linux (2.6 kernel), 2 core (2.5 GHz) and uses internal disks
for indexing.
My suggestion would be to download a version of Solr 3.3 with
RankingAlgorithm and give it a try to see if any changes are needed from
your existing setup.
Regards,
- Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org
Hoping that you are getting my point. We want to benchmark the performance.
If you can involve me in your group, that would be great.
Thanks
Naveen
2011/8/15 Nagendra Nagarajayya<nnagaraja...@transaxtions.com>
Bill:
I did look at Mark's performance tests. Looks very interesting.
Here is the Apache Solr 3.3 with RankingAlgorithm NRT performance:
http://solr-ra.tgels.com/wiki/en/Near_Real_Time_Search_ver_3.x
Regards
- Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org
On 8/14/2011 7:47 PM, Bill Bell wrote:
I understand.
Have you looked at Mark's patch? From his performance tests, it looks
pretty good.
When would RA work better?
Bill
On 8/14/11 8:40 PM, "Nagendra Nagarajayya" <nnagarajayya@transaxtions.com>
wrote:
Bill:
The technical details of the NRT implementation in Apache Solr with
RankingAlgorithm (SOLR-RA) are available here:
http://solr-ra.tgels.com/papers/NRT_Solr_RankingAlgorithm.pdf
(Some changes for Solr 3.x, but for the most part it is as above)
Support for 4.0 trunk should happen sometime soon.
Regards
- Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org
On 8/14/2011 7:11 PM, Bill Bell wrote:
OK,
I'll ask the elephant in the room...
What is the difference between the new UpdateHandler from Mark and the
SOLR-RA?
The UpdateHandler works with 4.0; does SOLR-RA work with 4.0 trunk?
Pros/Cons?
On 8/14/11 8:10 PM, "Nagendra Nagarajayya" <nnagarajayya@transaxtions.com>
wrote:
Naveen:
NRT with Apache Solr 3.3 and RankingAlgorithm does need a commit for a
document to become searchable. Any document that you add through update
becomes immediately searchable. So no need to commit from within your
update client code. Since there is no commit, the cache does not have
to be cleared or the old searchers closed or new searchers opened, and
warmed (error that you are facing).
Regards
- Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org
On 8/14/2011 10:37 AM, Naveen Gupta wrote:
Hi Mark/Erick/Nagendra,
I was not very confident about NRT when we started the project almost a
year ago. I will definitely try NRT and see the performance.
The current setup was working fine while we were using commitWithin 10
millisecs in the XMLDocument which we were posting to SOLR.
But because of that, we were getting very poor performance (almost 3 mins
for 15,000 docs) per user. There are many parallel users committing to our
SOLR.
So we removed the commitWithin, and performance was much better.
But now we are getting this maxWarmingSearcher error, because we are
committing separately as a curl request once the entire doc is submitted
for indexing.
The question here is: what is the difference between commitWithin and
commit (apart from the fact that commit takes memory, processing and
additional hardware usage)?
We want it to be visible as soon as possible because we are applying many
business rules on top of the results (older indexes as well as new ones)
and applying different filters.
Up to 5 mins is fine for us, but beyond that we need to think about other
optimizations.
We will definitely try NRT. But please tell me what other options we can
apply in order to optimize.
Thanks
Naveen
On Sun, Aug 14, 2011 at 9:42 PM, Erick Erickson <erickerickson@gmail.com>
wrote:
Ah, thanks, Mark... I must have been looking at the wrong JIRAs.
Erick
On Sun, Aug 14, 2011 at 10:02 AM, Mark Miller<markrmil...@gmail.com>
wrote:
On Aug 14, 2011, at 9:03 AM, Erick Erickson wrote:
You either have to go to near real time (NRT), which is under
development, but not committed to trunk yet
NRT support is committed to trunk.
- Mark Miller
lucidimagination.com