The difference with Solr Cell is that I'm sending every single document
to Solr Cell on its own, rather than collecting documents until I have
several of them in memory.
I'm mainly using the code from here:
http://wiki.apache.org/solr/ExtractingRequestHandler#SolrJ
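
A minimal sketch of the one-document-at-a-time idea with a small, fixed number of worker threads. It uses only the Java standard library. The class name OneAtATimeIndexer, the method indexAll, and the sender callback are hypothetical names for illustration; the callback stands in for whatever actually posts a single file to Solr Cell and is not taken from the wiki code.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

class OneAtATimeIndexer {

    /**
     * Hands each file to `sender` on its own, using at most `threads`
     * workers. Returns how many files were sent without an exception.
     * `sender` is a placeholder (assumption) for the code that streams
     * one file to Solr Cell.
     */
    static int indexAll(List<Path> files, int threads, Consumer<Path> sender) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<?>> pending = new ArrayList<>();
        int sent = 0;
        try {
            for (Path file : files) {
                // Only the path is queued here; the file content is
                // touched when a worker actually runs the sender.
                pending.add(pool.submit(() -> sender.accept(file)));
            }
            for (Future<?> result : pending) {
                try {
                    result.get(); // wait for this file to finish
                    sent++;
                } catch (ExecutionException e) {
                    // One bad file should not stop the whole run.
                    System.err.println("failed: " + e.getCause());
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt flag
        } finally {
            pool.shutdown();
        }
        return sent;
    }
}
```

In a real client the sender would presumably build one extract request per file (for example with SolrJ's ContentStreamUpdateRequest, as on the wiki page above) and post it. At most `threads` files are in flight at once, so the thread count is the main knob for trading throughput against memory use.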


Erick Erickson <erickerick...@gmail.com> wrote on 25.09.2012 15:47:34:

> From: Erick Erickson <erickerick...@gmail.com>
> To: solr-user@lucene.apache.org
> Date: 25.09.2012 15:48
> Subject: Re: Re: Solr Cell Questions
> 
> bq: how many documents per minute, second, or whatever can I put into Solr
> 
> Too many variables to say. I've seen several thousand truly simple
> docs/sec. But since you're doing the Tika processing that's probably
> going to be your limiting factor. And it'll be many fewer...
> 
> I don't understand your OOM issue when running Tika on the client. Or,
> rather, why you think using SolrCell makes this different. SolrCell also
> uses Tika. So my suspicion is that your client-side process simply isn't
> allocating much memory to the JVM. Did you try bumping the memory
> on your client?
> 
> Best
> Erick
> 
> On Tue, Sep 25, 2012 at 5:23 AM, <johannes.schwendin...@blum.com> wrote:
> > Thank you, Erick, for your response.
> >
> > I've already tried what you suggested and got some out-of-memory
> > exceptions. Because of this I like the solution with Solr Cell, where I
> > can send the file directly to Solr via a stream and don't have to
> > collect the files in memory.
> >
> > And another question that came to my mind: how many documents per
> > minute, second, or whatever can I put into Solr? Say XML format, and
> > from 100 KB to 100 MB.
> > Is there a number, or is it too dependent on hardware and settings?
> >
> >
> > Best
> > Johannes
> >
> > Erick Erickson <erickerick...@gmail.com> wrote on 25.09.2012 00:22:26:
> >
> >> From: Erick Erickson <erickerick...@gmail.com>
> >> To: solr-user@lucene.apache.org
> >> Date: 25.09.2012 00:23
> >> Subject: Re: Solr Cell Questions
> >>
> >> If you're concerned about throughput, consider moving all the
> >> SolrCell (Tika) processing off the server. SolrCell is way cool
> >> for showing what can be done, but its downside is you're
> >> moving all the processing of the structured documents to the
> >> same machine doing the indexing. Pretty soon, especially
> >> with files of significant size, you're spending all your CPU cycles
> >> parsing the files...
> >>
> >> As it happens, there's a blog post about this:
> >> http://searchhub.org/dev/2012/02/14/indexing-with-solrj/
> >>
> >> By moving the indexing to N clients, you can increase
> >> throughput until you make Solr work hard to do the indexing....
> >>
> >> Best
> >> Erick
> >>
> >> On Mon, Sep 24, 2012 at 10:04 AM, <johannes.schwendin...@blum.com> wrote:
> >> > Hi,
> >> >
> >> > I'm currently experimenting with Solr Cell to index files into Solr.
> >> > During this, some questions came up.
> >> >
> >> > 1. Is it possible (and wise) to connect to Solr Cell with multiple
> >> > threads at the same time to index several documents concurrently?
> >> > This question came up because my program takes about 6 hours to
> >> > index around 35,000 docs. (No production environment, only the
> >> > example Solr and a little desktop machine, but I think it's very
> >> > slow, and I know Solr isn't the bottleneck (yet).)
> >> >
> >> > 2. If 1 is possible, how many threads should do this, and how much
> >> > memory does Solr need? I've tried it, but I ran into an
> >> > out-of-memory exception.
> >> >
> >> > Thanks in advance
> >> >
> >> > Best Regards
> >> > Johannes
