Multiple Cores with Solr Cell for indexing documents
Hello everyone,

I've been trying for several hours now to set up Solr with multiple cores, with Solr Cell working on each core. The only items being indexed are PDF, DOC, and TXT files (the list may expand, but for now assume the only things in the index are documents).

I never had any problems with Solr Cell when I was using a single core; in fact, I just ran the default installation in example/ and worked from that. Trying to migrate to multi-core, however, has been a never-ending list of problems.

Any time I try to add a document to the index (using the same curl command as I did with the single core, with the core name added to the request URL, e.g. host/solr/corename/update/extract...), I get HTTP 500 errors caused by classes not being found and/or lazy-loading errors. I've copied the exact example/lib directory into the cores, and that doesn't work either.

Frankly, the only libraries I want are those relevant to indexing files; the less bloat, the better. However, I cannot figure out where to put which files, or why the example installation works perfectly single-core but not multi-core.

Here is an example of the errors I'm receiving:

command prompt> curl "host/solr/core0/update/extract?literal.id=2-3-1&commit=true" -F "myfile=@test2.txt"

Error 500
HTTP ERROR 500: org/apache/tika/exception/TikaException

java.lang.NoClassDefFoundError: org/apache/tika/exception/TikaException
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:247)
    at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:359)
    at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:413)
    at org.apache.solr.core.SolrCore.createRequestHandler(SolrCore.java:449)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.getWrappedHandler(RequestHandlers.java:240)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:231)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
    at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
    at org.mortbay.jetty.Server.handle(Server.java:285)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
    at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
    at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
Caused by: java.lang.ClassNotFoundException: org.apache.tika.exception.TikaException
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
    ... 27 more

RequestURI=/solr/core0/update/extract

Any assistance you could provide, or installation guides/tutorials/etc. you could link me to, would be greatly appreciated. Thank you all for your time!

~Brandon Waterloo
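For reference, the layout being migrated to follows the stock multi-core example. A minimal sketch, assuming a standard Solr 1.4-style setup with solr home at example/solr/ as described in this thread (directory names are illustrative):

    example/
      lib/                      <-- Jetty/servlet jars for the example app
      solr/                     <-- solr home
        solr.xml                <-- declares the cores
        lib/                    <-- shared jars for all cores
        core0/
          conf/solrconfig.xml
          conf/schema.xml
        core1/
          conf/...

As the replies below conclude, with a shared lib configured it is the lib/ directory next to solr.xml (example/solr/lib) that would hold the Tika and extraction-contrib jars, not example/lib.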
RE: Multiple Cores with Solr Cell for indexing documents
Well, there lies the problem: it's not JUST the Tika jar. If it's not one thing, it's another, and I'm not even sure which directory Solr actually looks in. In my solr.xml file I have every core use a shared library folder; since each core will be holding very homologous data, there's no need for different library modules per core. The relevant line in my solr.xml is the one with sharedLib="lib", and that file is housed in .../example/solr/. So, does Solr look in .../example/lib or .../example/solr/lib?

~Brandon Waterloo

From: Markus Jelsma [markus.jel...@openindex.io]
Sent: Thursday, March 24, 2011 11:29 AM
To: solr-user@lucene.apache.org
Cc: Brandon Waterloo
Subject: Re: Multiple Cores with Solr Cell for indexing documents

Sounds like the Tika jar is not on the class path. Add it to a directory where Solr's looking for libs.

On Thursday 24 March 2011 16:24:17 Brandon Waterloo wrote:
> [original post and stack trace quoted in full; see above]
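The XML element carrying that sharedLib="lib" attribute was eaten by the archive above; a minimal solr.xml sketch under the assumption that it follows the stock 1.4-style syntax (values illustrative):

    <!-- .../example/solr/solr.xml (sketch) -->
    <solr persistent="false" sharedLib="lib">
      <!-- sharedLib resolves against this file's directory,
           i.e. .../example/solr/lib, not .../example/lib -->
      <cores adminPath="/admin/cores">
        <core name="core0" instanceDir="core0" />
        <core name="core1" instanceDir="core1" />
      </cores>
    </solr>

Since the path resolves relative to the solr home directory containing solr.xml, sharedLib="lib" here would mean .../example/solr/lib, which matches the answer Markus gives further down the thread.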
RE: Multiple Cores with Solr Cell for indexing documents
I did finally manage to deploy Solr with multiple cores, but we've been running into so many problems with permissions, index location, and other things that I (quite fortunately) convinced my boss that multiple cores are not the way to go here. I had in place a single-core system that filtered results by their ID numbers and showed only the subset of results you wanted to see. The disadvantage is that a single core takes longer to search over the entire index; the advantage is that it's better in every other way. So the plan now is to move back to single-core searching and then test it with a huge number of documents to see whether performance is seriously impacted.

So for now, I guess we can consider this thread resolved. Thanks for all your help guys!

~Brandon Waterloo

From: Markus Jelsma [markus.jel...@openindex.io]
Sent: Friday, March 25, 2011 1:23 PM
To: solr-user@lucene.apache.org
Cc: Upayavira
Subject: Re: Multiple Cores with Solr Cell for indexing documents

You can only set properties for a lib dir that must be used in solrconfig.xml. You can use sharedLib in solr.xml though.

> There's options in solr.xml that point to lib dirs. Make sure you get
> them right.
>
> Upayavira
>
> On Thu, 24 Mar 2011 23:28 +0100, "Markus Jelsma" wrote:
> > I believe it's example/solr/lib where it looks for shared libs in
> > multicore. But each core can have its own lib dir, usually in core/lib.
> > This is referenced in solrconfig.xml; see the example config for the
> > lib directive.
> >
> > [earlier messages quoted in full; see above]
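The single-core-plus-filtering approach described above amounts to a filter query over the ID convention. A hedged sketch, assuming id is a string field, the standard /solr/select handler, and the collection-prefix style of IDs that appears later in this thread (the query term is made up):

    # Search only the "32-130" subset within one shared index.
    # fq restricts the result set (and is cached) without affecting scoring.
    curl "http://host/solr/select?q=apartheid&fq=id:32-130-*&wt=xml"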
Problems indexing very large set of documents
Hey everybody,

I've been running into some issues indexing a very large set of documents. There are about 4000 PDF files, ranging in size from 160MB down to 10KB, so obviously this is a big task for Solr. I have a PHP script that iterates over the directory and uses PHP cURL to send each file to Solr for indexing. For now, commit is set to false to speed up the indexing, and I'm assuming that Solr should be auto-committing as necessary. I'm using the default solrconfig.xml file included in apache-solr-1.4.1\example\solr\conf. Once all the documents have been processed, the PHP script sends Solr a commit.

The main problem is that after a few thousand documents (around 2000 last time I tried), nearly every document begins causing Java exceptions in Solr:

Apr 4, 2011 1:18:01 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@11d329d
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
    at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
    at org.mortbay.jetty.Server.handle(Server.java:285)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
    at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
    at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@11d329d
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:125)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
    ... 23 more
Caused by: java.io.IOException: expected='endobj' firstReadAttempt='' secondReadAttempt='' org.pdfbox.io.PushBackInputStream@b19bfc
    at org.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:502)
    at org.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:176)
    at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:707)
    at org.pdfbox.pdmodel.PDDocument.load(PDDocument.java:691)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:40)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
    ... 25 more

As far as I know there's nothing special about these documents, so I'm wondering if Solr is not properly auto-committing. What would be appropriate settings in solrconfig.xml for this particular application? I'd like it to auto-commit as soon as it needs to, but no more often than that, for the sake of efficiency. It takes long enough to index 4000 documents and there's no reason to make it take longer.

Thanks for your help!

~Brandon Waterloo
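As noted further down the thread, Solr does no auto-committing by default; the relevant block in solrconfig.xml ships commented out. A minimal sketch of enabling it, assuming the 1.4-style update handler (the thresholds are illustrative and would need tuning for this workload):

    <!-- solrconfig.xml (sketch; thresholds illustrative) -->
    <updateHandler class="solr.DirectUpdateHandler2">
      <autoCommit>
        <maxDocs>1000</maxDocs>   <!-- commit after this many pending docs -->
        <maxTime>60000</maxTime>  <!-- ...or after this many ms, whichever comes first -->
      </autoCommit>
    </updateHandler>

Either threshold triggers a commit on its own, so pending documents never sit unbounded between the per-file posts and the final explicit commit.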
RE: Problems indexing very large set of documents
Looks like I'm using Tika 0.4:

apache-solr-1.4.1/contrib/extraction/lib/tika-core-0.4.jar
.../tika-parsers-0.4.jar

~Brandon Waterloo

From: Anuj Kumar [anujs...@gmail.com]
Sent: Monday, April 04, 2011 2:12 PM
To: solr-user@lucene.apache.org
Cc: Brandon Waterloo
Subject: Re: Problems indexing very large set of documents

This is related to Apache Tika. Which version are you using? Please see this thread for more details:
http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html

Hope it helps.

Regards,
Anuj

On Mon, Apr 4, 2011 at 11:30 PM, Brandon Waterloo <brandon.water...@matrix.msu.edu> wrote:
> [original post and stack trace quoted in full; see above]
RE: Problems indexing very large set of documents
It wasn't just a single file, it was dozens of files all having problems toward the end, just before I killed the process:

IPADDR - - [04/04/2011:17:17:03 +] "POST /solr/update/extract?literal.id=32-130-AFB-84&commit=false HTTP/1.1" 500 4558
IPADDR - - [04/04/2011:17:17:05 +] "POST /solr/update/extract?literal.id=32-130-AFC-84&commit=false HTTP/1.1" 500 4558
IPADDR - - [04/04/2011:17:17:09 +] "POST /solr/update/extract?literal.id=32-130-AFD-84&commit=false HTTP/1.1" 500 4557
IPADDR - - [04/04/2011:17:17:14 +] "POST /solr/update/extract?literal.id=32-130-AFE-84&commit=false HTTP/1.1" 500 4558
IPADDR - - [04/04/2011:17:17:21 +] "POST /solr/update/extract?literal.id=32-130-AFF-84&commit=false HTTP/1.1" 500 4558
IPADDR - - [04/04/2011:17:17:21 +] "POST /solr/update/extract?literal.id=32-130-B00-84&commit=false HTTP/1.1" 500 4557

That is by no means all of the errors, just a sample of a few; you can see they all threw HTTP 500. What is strange is that nearly every file succeeded before about the 2200-files mark, and nearly every file after that failed.

~Brandon Waterloo

From: Anuj Kumar [anujs...@gmail.com]
Sent: Monday, April 04, 2011 2:48 PM
To: solr-user@lucene.apache.org
Cc: Brandon Waterloo
Subject: Re: Problems indexing very large set of documents

In the log messages, are you able to locate the file at which it fails? It looks like Tika is unable to parse one of your PDF files for the details. We need to hunt that one out.

Regards,
Anuj

On Mon, Apr 4, 2011 at 11:57 PM, Brandon Waterloo <brandon.water...@matrix.msu.edu> wrote:
> [Tika version message and earlier quotes trimmed; see above]
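One quick way to pull the failing document IDs out of a request log shaped like the sample above; a shell sketch (the log filename is hypothetical, and the field positions assume the common-log-style lines shown):

    # List the literal.id of every extract request that returned HTTP 500.
    grep '/update/extract' jetty-request.log \
      | awk '$(NF-1) == 500 {print $7}' \
      | sed -n 's/.*literal\.id=\([^&]*\).*/\1/p' \
      | sort -u

With the failing IDs in hand, the corresponding files can be replayed individually to separate bad PDFs from a cumulative resource problem.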
RE: Problems indexing very large set of documents
I had some time to do some research into the problems. From what I can tell, it appears Solr is tripping up over the filename. These are strictly examples, but Solr handles this filename fine:

32-130-A0-84-african_activist_archive-a0a6s3-b_12419.pdf

However, it fails with either a parsing error or an EOF exception on this filename:

32-130-A08-84-al.sff.document.nusa197102.pdf

The only significant difference is that the second filename contains multiple periods. As there are about 1700 files whose filenames follow the second format, it is simply not possible to change their filenames; in addition, they are being used by other applications. Is there something I can change in the Solr configs to fix this issue, or am I simply SOL until the Solr dev team can work on this (assuming I put in a ticket)?

Thanks again everyone,

~Brandon Waterloo

From: Chris Hostetter [hossman_luc...@fucit.org]
Sent: Tuesday, April 05, 2011 3:03 PM
To: solr-user@lucene.apache.org
Subject: RE: Problems indexing very large set of documents

: It wasn't just a single file, it was dozens of files all having problems
: toward the end just before I killed the process.
...
: That is by no means all the errors, that is just a sample of a few.
: You can see they all threw HTTP 500 errors. What is strange is, nearly
: every file succeeded before about the 2200-files-mark, and nearly every
: file after that failed.

...the root question is: do those files *only* fail if you have already indexed ~2200 files, or do they fail if you start up your server and index them first? There may be a resource issue (if it only happens after indexing 2200), or it may just be a problem with a large number of your PDFs that your iteration code happens to reach at that point. If it's the former, then there may be something buggy about how Solr is using Tika to cause the problem; if it's the latter, then it's a straight Tika parsing issue.

: > now, commit is set to false to speed up the indexing, and I'm assuming that
: > Solr should be auto-committing as necessary. I'm using the default
: > solrconfig.xml file included in apache-solr-1.4.1\example\solr\conf.

Solr does no autocommitting by default; you need to check your solrconfig.xml.

-Hoss
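Hoss's fresh-server test translates to something like the following; a hedged sketch, assuming the stock example Jetty start script (host, port, and filenames are illustrative):

    # 1. Restart Solr so nothing has been indexed this session.
    cd apache-solr-1.4.1/example && java -jar start.jar &
    # (wait for it to finish loading before posting)

    # 2. Post one previously-failing file first, before anything else.
    curl "http://localhost:8983/solr/update/extract?literal.id=test-1&commit=true" \
         -F "myfile=@32-130-A08-84-al.sff.document.nusa197102.pdf"

    # If this fails on an empty index, it's a straight Tika parsing
    # problem; if it only fails after ~2200 docs, suspect a resource leak.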
RE: Problems indexing very large set of documents
A second test has revealed that it is something to do with the contents, and not the literal filenames, of the second set of files. I renamed one of the second-format files and tested it, and Solr still failed. However, the problem still only applies to files of the second naming format.

From: Brandon Waterloo [brandon.water...@matrix.msu.edu]
Sent: Friday, April 08, 2011 10:40 AM
To: solr-user@lucene.apache.org
Subject: RE: Problems indexing very large set of documents

[filename-research message and Hoss's reply quoted in full; see above]
RE: Problems indexing very large set of documents
I think I've finally found the problem. The files that work are PDF version 1.6; the files that do NOT work are PDF version 1.4. I'll look into updating all the old documents to PDF 1.6.

Thanks everyone!

~Brandon Waterloo

From: Ezequiel Calderara [ezech...@gmail.com]
Sent: Friday, April 08, 2011 11:35 AM
To: solr-user@lucene.apache.org
Cc: Brandon Waterloo
Subject: Re: Problems indexing very large set of documents

Maybe those files are created with a different Adobe Format version...

See this:
http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html

On Fri, Apr 8, 2011 at 12:14 PM, Brandon Waterloo <brandon.water...@matrix.msu.edu> wrote:
> [earlier messages quoted in full; see above]

--
__
Ezequiel.
http://www.ironicnet.com
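The PDF version can be read straight from each file's header, which makes it easy to sort a collection like this by version. A shell sketch, relying on the fact that PDF files begin with a %PDF-1.x marker:

    # Print each file's PDF header version, e.g. "%PDF-1.4".
    for f in *.pdf; do
      printf '%s\t' "$f"
      head -c 8 "$f"
      echo
    done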
RE: Problems indexing very large set of documents
I found a simpler command-line method to update the PDF files. On some documents it works perfectly: the result is a pixel-for-pixel match, and none of the OCR text (which is what all these PDFs are — newspaper articles that have been passed through OCR) is lost. However, on other documents the result is considerably blurrier and some of the OCR text is lost. We've decided to skip any documents that Tika cannot index for now. As Lance stated, it's not specifically the version that causes the problem but rather quirks of different PDF writers; a few tests have confirmed this, so we can't use the version to determine which files should be skipped.

I'm examining the XML responses from the queries, and I cannot figure out how to tell from the XML response whether or not a document was successfully indexed. The status value seems to be 0 regardless of whether indexing was successful or not. So my question is: how can I tell from the response whether or not indexing was actually successful?

~Brandon Waterloo

From: Lance Norskog [goks...@gmail.com]
Sent: Sunday, April 10, 2011 5:22 PM
To: solr-user@lucene.apache.org
Subject: Re: Problems indexing very large set of documents

There is a library called iText. It parses and writes PDFs very very well, and a simple program will let you do a batch conversion. PDFs are made by a wide range of programs, not just Adobe code. Many of these do weird things and make small mistakes that Tika does not know how to handle. In other words, there is "dirty PDF" just like "dirty HTML". A percentage of PDFs will fail, and that's life. One site that gets press releases from zillions of sites (and thus a wide range of PDF generators) has a 15% failure rate with Tika.

Lance

On Fri, Apr 8, 2011 at 9:44 AM, Brandon Waterloo wrote:
> [PDF-version message and earlier quotes trimmed; see above]
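On the final question of detecting failures from the response: in this thread the extraction failures surfaced as HTTP 500s rather than as a non-zero status field in a normal response body, so checking the transport-level status code is one workable approach. A hedged sketch in shell, matching the curl-based workflow used throughout this thread (URL and filename are illustrative):

    # -w '%{http_code}' prints the HTTP status; 200 means the extract
    # request succeeded, 4xx/5xx (like the 500s above) mean it failed.
    status=$(curl -s -o /dev/null -w '%{http_code}' \
      "http://localhost:8983/solr/update/extract?literal.id=test-1&commit=false" \
      -F "myfile=@somefile.pdf")

    if [ "$status" -ne 200 ]; then
      echo "indexing failed for somefile.pdf (HTTP $status)" >&2
    fi

The same check works from the PHP script mentioned earlier by reading the response code after the request instead of parsing the XML body.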