Re: Solr - Index Concurrency - Is it possible to have multiple threads write to same index?

Lance Norskog Sat, 25 Aug 2012 18:27:03 -0700

A few other things:
Support: many of the Solr committers do not like the Embedded server.
It does not get much attention, so if you find problems with it you
may have to fix them and get someone to review and commit the fixes.
I'm not saying they sabotage it; there just is not much interest in
making it first-class.


Replication: you can replicate from the Embedded server with the old
rsync-based replicator. The Java Replication tool requires servlets.
If you are Unix-savvy, the rsync tool is fine.

Indexing speed:
1) You can use shards to split the index into pieces. This divides the
indexing work among the shards.
2) Do not store the giant data in the index. A lot of sites instead
archive the data file and index a link to it. Giant stored fields cause
indexing speed to drop dramatically because stored data is not saved
just once: it is copied repeatedly during merging as new documents are
added. Index data is also copied around, but this tends to increase
sub-linearly since documents share terms.
3) Do not store positions and offsets unless you need them. They record
the position of each word, which is what makes phrase queries possible,
so dropping them means giving up phrase queries. They take a lot of
memory, and have to be copied around during merging.
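
As a rough illustration of point 1, here is a minimal Java sketch of
routing documents to shards by hashing the document id. The class name
ShardRouter and its methods are hypothetical and are not part of Solr;
it only shows how the work might be split. Each resulting bucket would
then be sent to its own core or server with whatever indexing client
you already use.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a Solr class): assigns document ids to shards.
public class ShardRouter {
    private final int shardCount;

    public ShardRouter(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        this.shardCount = shardCount;
    }

    /** Maps a document id to a shard number in [0, shardCount). */
    public int shardFor(String docId) {
        // floorMod keeps the result non-negative even if hashCode() is negative.
        return Math.floorMod(docId.hashCode(), shardCount);
    }

    /** Groups ids by shard so each group can be indexed separately. */
    public List<List<String>> partition(List<String> docIds) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < shardCount; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String id : docIds) {
            buckets.get(shardFor(id)).add(id);
        }
        return buckets;
    }
}
```

The same id always lands on the same shard, so a later update or delete
for that document goes to the index that already holds it. Changing the
number of shards changes the mapping, which in this simple scheme means
reindexing.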

On Thu, Aug 23, 2012 at 1:31 AM, Mikhail Khludnev
<mkhlud...@griddynamics.com> wrote:
> I know the following drawbacks of EmbServer:
>
>    - org.apache.solr.client.solrj.request.UpdateRequest.getContentStreams(),
>    which is called when handling an update request, produces a lot of garbage
>    in memory and bloats it with expensive XML.
>    - org.apache.solr.response.BinaryResponseWriter.getParsedResponse(SolrQueryRequest,
>    SolrQueryResponse) does something similar on the response side: it just
>    bloats your heap.
>
> To me, your task is covered by Multiple Cores. Anyway, if you are OK with
> EmbeddedServer, let it be. Just be aware of the stream updates feature:
> http://wiki.apache.org/solr/ContentStream
>
> My average indexing speed estimate is for fairly small docs, less than 1K
> (which are always used for micro-benchmarking).
>
> Heavy analysis is the key argument for invoking updates in multiple threads.
> What's your CPU stat during indexing?
>
>
>
>
> On Thu, Aug 23, 2012 at 7:52 AM, ksu wildcats <ksu.wildc...@gmail.com> wrote:
>
>> Thanks for the reply Mikhail.
>>
>> For our needs the speed is more important than flexibility and we have huge
>> text files (ex: blogs / articles ~2 MB size) that need to be read from our
>> filesystem and then stored in the index.
>>
>> We have our app creating a separate core per client (dynamically), and there
>> is one instance of EmbeddedSolrServer for each core that is used for adding
>> documents to the index.
>> Each document has about 10 fields and one of the fields has ~2MB data stored
>> (stored = true, analyzed=true).
>> Also we have logic built into our webapp to dynamically create the solr
>> config files
>> (solrConfig & schema per core - filters/analyzers/handler values can be
>> different for each core)
>> for each core before creating an instance of EmbeddedSolrServer for that
>> core.
>> Another reason to go with EmbeddedSolrServer is to reduce overhead of
>> transporting large data (~2 MB) over http/xml.
>>
>> We use this setup for building our master index which then gets replicated
>> to slave servers
>> using replication scripts provided by solr.
>> We also have solr admin ui integrated into our webapp (using admin jsp &
>> handlers from solradmin ui)
>>
>> We have been using this MultiCore setup for more than a year now and so far
>> we haven't run into any issues with EmbeddedSolrServer integrated into our
>> webapp.
>> However, I am now trying to figure out the impact if we allow multiple
>> threads to send requests to EmbeddedSolrServer (same core) for adding docs
>> to the index simultaneously.
>>
>> Our understanding was that EmbeddedSolrServer would give us better
>> performance over http solr for our needs.
>> It's quite possible that we might be wrong and http solr would have given us
>> similar/better performance.
>>
>> Also, based on documentation from SolrWiki, I am assuming that the
>> EmbeddedSolrServer API is the same as the one used by Http Solr.
>>
>> That said, can you please tell me if there is any specific downside to using
>> EmbeddedSolrServer that could cause issues for us down the line?
>>
>> I am also interested in your comment below about indexing 1 million docs in
>> a few mins. Ideally we would like to get to that speed.
>> I am assuming this depends on the size of the doc and the type of
>> analyzer/tokenizer/filters being used. Correct?
>> Can you please share (or point me to documentation on) how to get this
>> speed for 1 mil docs?
>> >>  - one million is a fairly small amount, on average it should be indexed
>> >> in a few mins. I doubt that you really need to distribute indexing
>>
>> Thanks
>> -K
>>
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Solr-Index-Concurrency-Is-it-possible-to-have-multiple-threads-write-to-same-index-tp4002544p4002776.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Tech Lead
> Grid Dynamics
>
> <http://www.griddynamics.com>
>  <mkhlud...@griddynamics.com>



-- 
Lance Norskog
goks...@gmail.com
