Autofill 'id' field with the URL of files posted to Solr?

2010-04-18 Thread pk

Hi,
I need to submit thousands of online PDF/HTML files to Solr. I can submit
one file using SolrJ (StreamingUpdateSolrServer and
..solr.common.util.ContentStreamBase.URLStream), setting the literal.id
parameter to the URL. I can't do the same with a batch of multiple files,
because each file's 'id' has to be unique (set to its URL).
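For reference, the single-file case I have working looks roughly like this (a
minimal sketch only; SolrJ method names such as addContentStream/setParam may
vary slightly between versions, and the document URL is just a placeholder):

import java.net.URL;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.util.ContentStreamBase;

public class SingleUrlPost {
  public static void main(String[] args) throws Exception {
    // Queue size and thread count here are just illustrative values.
    SolrServer server = new StreamingUpdateSolrServer("http://localhost:8080/solr", 20, 2);

    String url = "http://example.com/docs/sample.pdf"; // hypothetical document URL

    ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
    req.addContentStream(new ContentStreamBase.URLStream(new URL(url)));
    // The unique id has to be supplied explicitly, per request:
    req.setParam("literal.id", url);
    req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

    server.request(req);
  }
}

This works fine for one file; the problem is doing the equivalent for
thousands of URLs without issuing one request per file.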

I couldn't get this to work. Is there a way to have the 'id' field set
automatically to the URL of each file posted to Solr (something like
'stream_name')? Can this be configured in solrconfig.xml or schema.xml, or
done in some other way?

If the URL can be put into some other field (such as a 'url' field itself),
that would also serve my purpose.

Thanks for your help.


Solr throws TikaException while parsing sample PDF

2010-04-18 Thread pk

Hi,
while posting a sample PDF (one that ships with the Solr distribution) to Solr,
I'm getting a TikaException.
I'm using Solr 1.4 and SolrJ (StreamingUpdateSolrServer) to post the PDF.
Other sample PDFs are parsed and indexed successfully. I'm getting the same
error with some other PDFs as well (Adobe Reader opens them fine, so I don't
think they are malformed or corrupt). Here is the trace:


found uploaded file : C:\solr_1.4.0\docs\Installing Solr in Tomcat.pdf :: size=286242
Apr 18, 2010 10:31:34 PM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {} 0 640
Apr 18, 2010 10:31:34 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unable to extract PDF content
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:873)
        at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
        at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
        at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
        at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
        at java.lang.Thread.run(Thread.java:595)
Caused by: org.apache.tika.exception.TikaException: Unable to extract PDF content
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:58)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:51)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
        ... 20 more
Caused by: java.util.zip.ZipException: incorrect header check
        at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:140)
        at org.pdfbox.filter.FlateFilter.decode(FlateFilter.java:97)
        at org.pdfbox.cos.COSStream.doDecode(COSStream.java:290)
        at org.pdfbox.cos.COSStream.doDecode(COSStream.java:235)
        at org.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:170)
        at org.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:101)
        at org.pdfbox.cos.COSStream.getStreamTokens(COSStream.java:132)
        at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:202)
        at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
        at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
        at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
        at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
        at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:53)
        ... 24 more

Apr 18, 2010 10:31:34 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update/extract params={wt=javabin&waitFlush=true&literal.indexDate=2010-04-18+&commit=true&waitSearcher=true&version=1&literal.id=C%253A%255Csolr_1.4.0%255Cdocs%255CInstalling%2BSolr%2Bin%2BTomcat.pdf} status=500 QTime=640
Exception in handling an uploaded file: C:\solr_1.4.0\docs\Installing Solr in Tomcat.pdf :
Internal Server Error

Internal Server Error

request: http://localhost:8080/solr/update/extract?literal.id=

Re: Autofill 'id' field with the URL of files posted to Solr?

2010-04-18 Thread pk

Lance,
I can submit PDFs and have their content extracted using Solr and SolrJ, as I
indicated earlier.
I've made 'id' a mandatory field, so I have to supply its value with each
request (request.addParams("literal.id", url)).

If I put multiple files/streams into one request, I can't set 'id' this way,
because the params are common to all the files/streams in that request, which
is not what I want.

If I could somehow map the stream_name/URL of each file to the 'id' field,
that is all I need.
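To make the idea concrete, something along these lines is what I'm hoping for
(purely hypothetical: I haven't verified that the stream_name metadata is
actually populated for URL streams, this only illustrates the mapping I'm
after):

import java.net.URL;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.util.ContentStreamBase;

public class BatchPostSketch {
  // Post several URLs in ONE request and ask the extracting handler to map
  // Solr Cell's stream_name metadata onto 'id', instead of a shared literal.id.
  // NOTE: hypothetical sketch - whether stream_name is filled in for URL-based
  // streams is exactly what I don't know.
  static void postBatch(SolrServer server, String... urls) throws Exception {
    ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
    for (String u : urls) {
      req.addContentStream(new ContentStreamBase.URLStream(new URL(u)));
    }
    req.setParam("fmap.stream_name", "id"); // fmap.<metadataField>=<solrField>
    server.request(req);
  }
}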
Thanks.



Re: Autofill 'id' field with the URL of files posted to Solr?

2010-04-21 Thread pk

Can somebody suggest something along these lines, or is it simply not possible
to autofill 'id' through configuration alone?


RE: Problem with pdf, upgrading Cell

2010-04-30 Thread pk

Mark,
did you manage to get it working?

I tried the latest Tika (0.7) from the command line and it successfully parsed
the previously problematic PDF. I then replaced the Tika-related jars in the
Solr 1.4 contrib/extraction/lib folder with the new ones. Now it doesn't throw
any exception, but there is no content extraction, only metadata. It now fails
to extract content even from PDFs it handled fine earlier (with v0.4). Strange.
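For anyone who wants to reproduce the standalone check, it was roughly the
following (a sketch against the Tika 0.7-style API; the file path is just a
placeholder and parse() signatures differ a bit between Tika versions):

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class TikaCheck {
  public static void main(String[] args) throws Exception {
    File pdf = new File("Installing Solr in Tomcat.pdf"); // placeholder path
    InputStream in = new FileInputStream(pdf);
    try {
      BodyContentHandler text = new BodyContentHandler(); // collects extracted body text
      Metadata meta = new Metadata();
      // Tika 0.7-era signature; older versions take (stream, handler, metadata).
      new AutoDetectParser().parse(in, text, meta, new ParseContext());
      for (String name : meta.names()) {
        System.out.println(name + " = " + meta.get(name));
      }
      System.out.println("extracted characters: " + text.toString().length());
    } finally {
      in.close();
    }
  }
}

Outside Solr, the same PDF returned content this way, which is why getting
only metadata from Solr Cell after the jar swap looks so odd to me.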



Re: Any way to get top 'n' queries searched from Solr?

2010-04-30 Thread pk

Peter,
It seems that your solution (SOLR-1872) requires authentication too (with
usage tracked via your UUID), but my users will be the general public using
browsers, and I can't impose any such auth restrictions on them. Also, you
didn't mention whether you are already persisting the audit data, or whether
I would need to extend it to fit my problem.

My requirement is simple: to know the top n query strings along with their
frequencies.
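If nothing built-in exists, even something as low-tech as counting q=
parameters out of the request log would do. A rough sketch of what I mean
(the log path and line format below are assumptions based on my own Solr 1.4
INFO log lines):

import java.io.BufferedReader;
import java.io.FileReader;
import java.net.URLDecoder;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TopQueries {
  public static void main(String[] args) throws Exception {
    // Assumed line format (Solr 1.4 request logging):
    //   INFO: [] webapp=/solr path=/select params={q=foo&rows=10} hits=3 status=0 QTime=5
    Pattern qParam = Pattern.compile("[{&]q=([^&}]+)");
    Map<String, Integer> counts = new HashMap<String, Integer>();

    BufferedReader in = new BufferedReader(new FileReader("solr.log")); // placeholder path
    String line;
    while ((line = in.readLine()) != null) {
      if (line.indexOf("path=/select") < 0) continue; // only search requests
      Matcher m = qParam.matcher(line);
      if (m.find()) {
        String query = URLDecoder.decode(m.group(1), "UTF-8");
        Integer c = counts.get(query);
        counts.put(query, c == null ? 1 : c + 1);
      }
    }
    in.close();

    // Sort by frequency (highest first) and print the top 10.
    List<Map.Entry<String, Integer>> top =
        new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
    Collections.sort(top, new Comparator<Map.Entry<String, Integer>>() {
      public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
        return b.getValue() - a.getValue();
      }
    });
    for (int i = 0; i < Math.min(10, top.size()); i++) {
      System.out.println(top.get(i).getValue() + "\t" + top.get(i).getKey());
    }
  }
}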
Thanks though.