The difference with Solr Cell is that I'm sending every single document to Solr Cell as I go, rather than collecting several of them in memory first. I'm mainly using the code from here: http://wiki.apache.org/solr/ExtractingRequestHandler#SolrJ
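To illustrate the "one document per request" approach, here is a minimal sketch. It is not the SolrJ code from the wiki page above; it uses the JDK's own HTTP client (Java 11+) so that it is self-contained. The class name, the base URL and the use of the file name as `literal.id` are assumptions made for illustration, and the sketch assumes the extract handler is mapped at `/update/extract` as in the example configuration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

// Hypothetical helper: posts one file per request to the extract handler.
public class StreamToSolrCell {

    // Builds the extract URL for one document; literal.id supplies the unique key.
    static URI extractUri(String solrBaseUrl, String docId) {
        String id = URLEncoder.encode(docId, StandardCharsets.UTF_8);
        return URI.create(solrBaseUrl + "/update/extract?literal.id=" + id);
    }

    // Builds a POST whose body is read from the file while the request is sent,
    // so the client does not need to hold a batch of documents in memory.
    static HttpRequest buildRequest(String solrBaseUrl, String docId, Path file)
            throws IOException {
        return HttpRequest.newBuilder(extractUri(solrBaseUrl, docId))
                // Generic content type; assumption: Tika detects the real format.
                .header("Content-Type", "application/octet-stream")
                .POST(HttpRequest.BodyPublishers.ofFile(file))
                .build();
    }

    // Sends the request and returns the HTTP status code; the response body is discarded.
    static int send(HttpClient client, HttpRequest request)
            throws IOException, InterruptedException {
        return client.send(request, HttpResponse.BodyHandlers.discarding()).statusCode();
    }

    public static void main(String[] args) throws Exception {
        Path file = Path.of(args[0]);
        HttpClient client = HttpClient.newHttpClient();
        // Assumed base URL of a local example Solr instance.
        HttpRequest request = buildRequest("http://localhost:8983/solr",
                file.getFileName().toString(), file);
        System.out.println("HTTP status: " + send(client, request));
    }
}
```

The sketch does not commit; the documents only become searchable after a commit is issued separately or an auto-commit is configured.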

Erick Erickson <erickerick...@gmail.com> wrote on 25.09.2012 15:47:34:

> From: Erick Erickson <erickerick...@gmail.com>
> To: solr-user@lucene.apache.org
> Date: 25.09.2012 15:48
> Subject: Re: Re: Solr Cell Questions
>
> bq: how many documents per minute, second, what ever can i put into solr
>
> Too many variables to say. I've seen several thousand truly simple
> docs/sec. But since you're doing the Tika processing that's probably
> going to be your limiting factor. And it'll be many fewer...
>
> I don't understand your OOM issue when running Tika on the client. Or,
> rather, why you think using SolrCell makes this different. SolrCell also
> uses Tika. So my suspicion is that your client-side process simply isn't
> allocating much memory to the JVM. Did you try bumping the memory
> on your client?
>
> Best
> Erick
>
> On Tue, Sep 25, 2012 at 5:23 AM, <johannes.schwendin...@blum.com> wrote:
> > Thank you Erick for your response,
> >
> > I've already tried what you've suggested and got some out of memory
> > exceptions. Because of this I like the solution with Solr Cell, where I can
> > send the file directly to Solr via a stream and don't collect them in my
> > memory.
> >
> > And another question that came to my mind: how many documents per minute,
> > second, whatever can I put into Solr? Say XML format and from 100 kB to
> > 100 MB.
> > Is there a number, or is it too dependent on hardware and settings?
> >
> > Best
> > Johannes
> >
> > Erick Erickson <erickerick...@gmail.com> wrote on 25.09.2012 00:22:26:
> >
> >> From: Erick Erickson <erickerick...@gmail.com>
> >> To: solr-user@lucene.apache.org
> >> Date: 25.09.2012 00:23
> >> Subject: Re: Solr Cell Questions
> >>
> >> If you're concerned about throughput, consider moving all the
> >> SolrCell (Tika) processing off the server. SolrCell is way cool
> >> for showing what can be done, but its downside is you're
> >> moving all the processing of the structured documents to the
> >> same machine doing the indexing. Pretty soon, especially
> >> with significant size files, you're spending all your CPU cycles
> >> parsing the files...
> >>
> >> Happens there's a blog about this:
> >> http://searchhub.org/dev/2012/02/14/indexing-with-solrj/
> >>
> >> By moving the indexing to N clients, you can increase
> >> throughput until you make Solr work hard to do the indexing....
> >>
> >> Best
> >> Erick
> >>
> >> On Mon, Sep 24, 2012 at 10:04 AM, <johannes.schwendin...@blum.com> wrote:
> >> > Hi,
> >> >
> >> > I'm currently experimenting with Solr Cell to index files to Solr.
> >> > During this some questions came up.
> >> >
> >> > 1. Is it possible (and wise) to connect to Solr Cell with multiple
> >> > threads at the same time to index several documents at the same time?
> >> > This question came up because my program takes about 6 hours to index
> >> > around 35000 docs. (No production environment, only the example Solr and a
> >> > little desktop machine, but I think it's very slow, and I know Solr
> >> > isn't the bottleneck (yet).)
> >> >
> >> > 2. If 1 is possible, how many threads should do this and how much
> >> > memory does Solr need? I've tried it but I run into an out of memory
> >> > exception.
> >> >
> >> > Thanks in advance
> >> >
> >> > Best Regards
> >> > Johannes
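
Regarding question 1 in the original post (several threads indexing at once without running out of memory on the client), one possible shape is a fixed thread pool with a small bounded queue. This is only a sketch using the JDK's `java.util.concurrent` classes; `BoundedIndexer`, `indexAll` and the `indexOne` callback are hypothetical names, and the callback stands in for whatever actually sends one document to Solr. Suitable thread counts and queue sizes depend on the hardware and have to be found by experiment.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical helper: indexes items in parallel while keeping few of them queued.
public class BoundedIndexer {

    // Runs indexOne for every item on `threads` workers and returns how many succeeded.
    static <T> int indexAll(Iterable<T> items, int threads, int queueSize,
                            Consumer<T> indexOne) {
        // When the bounded queue is full, CallerRunsPolicy makes the submitting
        // thread run the task itself, which slows submission down instead of
        // letting pending work pile up in memory.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize),
                new ThreadPoolExecutor.CallerRunsPolicy());
        AtomicInteger succeeded = new AtomicInteger();
        for (T item : items) {
            pool.execute(() -> {
                try {
                    indexOne.accept(item);
                    succeeded.incrementAndGet();
                } catch (RuntimeException e) {
                    // A failed document is reported and skipped; the run continues.
                    System.err.println("Failed to index " + item + ": " + e);
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.HOURS)) {
                throw new IllegalStateException("Indexing did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for indexing", e);
        }
        return succeeded.get();
    }
}
```

A call such as `BoundedIndexer.indexAll(paths, 4, 8, p -> sendToSolr(p))` would then keep at most four documents in flight and eight waiting, where `sendToSolr` is a placeholder for the actual upload.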