Response writer configs
Hi all

I'm starting to play with Solr. This might be a silly question and not particularly important, but I'm curious. I set up the example site using the tutorial and it works very well. Looking around the config files, I noticed that in my solrconfig.xml the queryResponseWriter section is commented out, but the writers all still work: wt=php etc. returns the PHP format. How are they working if they're not defined? Are they defined elsewhere?

Thanks
Ross
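The commented-out section in the example solrconfig.xml documents the defaults: Solr registers the standard response writers implicitly even when no queryResponseWriter elements are declared, so wt=php works out of the box. A sketch of what the equivalent explicit declarations would look like (class names as in Solr 1.x; verify against your version's example config):

```
<!-- Explicit declarations roughly equivalent to the implicit defaults;
     Solr registers these automatically if they are omitted. -->
<queryResponseWriter name="xml" class="org.apache.solr.request.XMLResponseWriter" default="true"/>
<queryResponseWriter name="json" class="org.apache.solr.request.JSONResponseWriter"/>
<queryResponseWriter name="python" class="org.apache.solr.request.PythonResponseWriter"/>
<queryResponseWriter name="php" class="org.apache.solr.request.PHPResponseWriter"/>
<queryResponseWriter name="phps" class="org.apache.solr.request.PHPSerializedResponseWriter"/>
<queryResponseWriter name="ruby" class="org.apache.solr.request.RubyResponseWriter"/>
```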
Solr Cell - PDFs plus literal metadata - GET or POST ?
Hi all

I'm experimenting with Solr. I've successfully indexed some PDFs and all looks good, but now I want to index some PDFs with metadata pulled from another source. I see this example in the docs:

curl "http://localhost:8983/solr/update/extract?literal.id=doc4&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3&literal.blah_s=Bah" -F "tutori...@tutorial.pdf"

I can write code to generate a script with those commands, substituting my own literal.whatever. My metadata could be up to a couple of KB in size. Is there a way of making the literal a POST variable rather than a GET? Will Solr Cell accept it as a POST? Something doesn't feel right about generating a huge long URL. I think Tomcat can handle up to 8 KB by default, so I guess that's okay, although I'm not sure how long a Linux command line can reasonably be.

I know curl may not be the right thing to use for production, but this is initially to get some data indexed for test and demo.

Thanks
Ross
Re: Solr Cell - PDFs plus literal metadata - GET or POST ?
On Tue, Jan 5, 2010 at 2:25 PM, Giovanni Fernandez-Kincade wrote:
> Really? Doesn't it have to be delimited differently, if both the file
> contents and the document metadata will be part of the POST data? How does
> Solr Cell tell the difference between the literals and the start of the file?
> I've tried this before and haven't had any luck with it.

Thanks Shalin. And Giovanni, yes, it definitely works. Each -F option becomes its own part of the multipart POST body, so the literals stay separate from the file. This will set literal.mydata to the contents of mydata.txt:

curl "http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true" -F "myfi...@tutorial.html" -F "literal.mydata=<mydata.txt"

> -----Original Message-----
> From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com]
> Sent: Monday, January 04, 2010 4:28 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Cell - PDFs plus literal metadata - GET or POST ?
>
> On Wed, Dec 30, 2009 at 7:49 AM, Ross wrote:
>
>> Hi all
>>
>> I'm experimenting with Solr. I've successfully indexed some PDFs and
>> all looks good but now I want to index some PDFs with metadata pulled
>> from another source. I see this example in the docs.
>>
>> curl "http://localhost:8983/solr/update/extract?literal.id=doc4&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3&literal.blah_s=Bah"
>> -F "tutori...@tutorial.pdf"
>>
>> I can write code to generate a script with those commands substituting
>> my own literal.whatever. My metadata could be up to a couple of KB in
>> size. Is there a way of making the literal a POST variable rather than
>> a GET?
>
> With Curl? Yes, see the man page.
>
>> Will Solr Cell accept it as a POST?
>
> Yes, it will.
>
> --
> Regards,
> Shalin Shekhar Mangar.
>
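Since Ross mentioned generating a script, here is a minimal sketch of such a generator: only the small fixed parameters stay on the URL, while the large literal travels as a POST form part that curl reads from a file with the `<file` form syntax. The file names and the literal.blah_s field name are hypothetical placeholders, not from the thread.

```shell
#!/bin/sh
# Sketch: emit one curl command per document, passing the (possibly
# multi-KB) metadata as POST form fields via -F instead of packing it
# into the URL. meta.txt and literal.blah_s are placeholder names.
emit_extract_command() {
  id="$1"        # unique document id
  pdf="$2"       # path to the PDF to index
  metafile="$3"  # file holding the large literal value
  printf 'curl "http://localhost:8983/solr/update/extract?literal.id=%s&commit=true" \\\n' "$id"
  printf '  -F "literal.blah_s=<%s" \\\n' "$metafile"
  printf '  -F "file=@%s"\n' "$pdf"
}

emit_extract_command doc4 tutorial.pdf meta.txt
```

Running the generated commands then requires a Solr instance listening on the given URL.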
Solr Cell. Seems to be only indexing the first N bytes of a text file.
Hi all

I'm trying to index some text files using Solr Cell. I'm using the schema from Avi Rappoport's tutorial about indexing html and text files, although I also had the same problem with the example/solr setup.

My problem is that words past or "below" a certain point in a file are not being indexed. I must be hitting some limit, but I haven't been able to figure out what. I'm hosting with Tomcat and using cURL to post files to /update/extract as per Avi's tutorial and other docs. I don't think it's an http limit during the POST, because the whole file is being successfully stored in Solr. I know that because if I retrieve the file body with a query that does work, the word that doesn't work appears lower down in the returned contents. I'm storing the contents now for testing; once I have this working, the file contents will probably be indexed only.

On a test file that I've been editing, moving my unique word around, it seems to stop working if that word is beyond the 100 KB point in the file. I think another file earlier gave a different result.

Hopefully I'm missing something obvious.

Thanks for any help.

Ross
Re: Solr Cell. Seems to be only indexing the first N bytes of a text file.
Thanks Erick. That was it. All looking good now.

Cheers
Ross

On Sat, Mar 20, 2010 at 9:29 PM, Erick Erickson wrote:
> Does your solrconfig file have a line like...
> <maxFieldLength>10000</maxFieldLength>
> ?
>
> Try upping the 10000...
>
> HTH
> Erick
>
> On Sat, Mar 20, 2010 at 8:40 PM, Ross wrote:
>
>> Hi all
>>
>> I'm trying to index some text files using Solr Cell. [...]
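For reference, in the Solr 1.4-era example solrconfig.xml the cap Erick refers to lives under <indexDefaults>, and the shipped default of 10000 tokens lines up with the ~100 KB cutoff Ross observed. A sketch of the change (the exact surrounding layout may differ between releases):

```
<!-- solrconfig.xml: raise the per-field token cap so large files are
     indexed past the first ~10,000 tokens. 2147483647 (Integer.MAX_VALUE)
     makes it effectively unlimited. -->
<indexDefaults>
  <maxFieldLength>2147483647</maxFieldLength>
</indexDefaults>
```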
Solr crashing while extracting from very simple text file
Hi all

I'm trying to import some text files, mostly following Avi Rappoport's tutorial. Some of my files cause Solr to crash while indexing. I've narrowed it down to a very simple example: I have a file named test.txt with one line, and that line is the word XXBLE and nothing else.

This is the command I'm using:

curl "http://localhost:8080/solr-example/update/extract?literal.id=1&commit=true" -F "myfi...@test.txt"

The result is pasted below. Other files work just fine. The problem seems to be related to the letters B and E. If I change them to something else or make them lower case then it works. In my real files, the XX is something else, but the result is the same. It's a common word in the files. I guess for this "quick and dirty" job I could do a bulk replace in the files to make it lower case.

Is there any workaround for this?

Thanks
Ross

Apache Tomcat/6.0.20 - Error report

HTTP Status 500 - org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.txt.txtpar...@19ccba

org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.txt.txtpar...@19ccba
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:849)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
    at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:454)
    at java.lang.Thread.run(Thread.java:636)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.txt.txtpar...@19ccba
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:121)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
    ... 18 more
Caused by: java.lang.NullPointerException
    at java.io.Reader.<init>(Reader.java:78)
    at java.io.BufferedReader.<init>(BufferedReader.java:93)
    at java.io.BufferedReader.<init>(BufferedReader.java:108)
    at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:59)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
    ... 20 more
Re: Solr crashing while extracting from very simple text file
Thanks Georg

I don't think it's that, because it crashes on a one-word test file I create using the nano editor. I don't think nano is adding anything extra.

My real files are created by a Windows utility called pdftotext. I solved the problem by getting pdftotext to generate html files rather than plain text. It just adds an html header and wraps everything in a tag, and that seems to keep Solr happy.

Ross

On Mon, Mar 22, 2010 at 9:08 AM, György Frivolt wrote:
> Hi,
>
> I had a problem with indexing documents some months ago as well. I found
> that there were XML control characters in the documents and these were not
> handled by Solr. Maybe it is the case for you as well.
>
> Regards,
>
> Georg
>
>
> On Sun, Mar 21, 2010 at 5:58 PM, Ross wrote:
>
>> Hi all
>>
>> I'm trying to import some text files. I'm mostly following Avi
>> Rappoport's tutorial. Some of my files cause Solr to crash while
>> indexing. I've narrowed it down to a very simple example.
>>
>> I have a file named test.txt with one line. That line is the word
>> XXBLE and nothing else
>>
>> This is the command I'm using.
>>
>> curl "http://localhost:8080/solr-example/update/extract?literal.id=1&commit=true"
>> -F "myfi...@test.txt"
>>
>> The result is pasted below. Other files work just fine. The problem
>> seems to be related to the letters B and E. If I change them to
>> something else or make them lower case then it works. In my real
>> files, the XX is something else but the result is the same. It's a
>> common word in the files. I guess for this "quick and dirty" job I'm
>> doing I could do a bulk replace in the files to make it lower case.
>>
>> Is there any workaround for this?
>>
>> Thanks
>> Ross
Re: Solr crashing while extracting from very simple text file
I thought you might ask that :-) It's because the PDF files are scanned from paper documents and OCR'd to produce text. They still contain the image, so they are huge: the smaller files are about 40 MB and cause a Java out-of-heap-memory error, and the larger files are getting close to 500 MB. I didn't have anything to do with the scanning.

I'm guessing, but it seems that something in the Tomcat / Solr / Tika implementation tries to load it all into memory at once. pdftotext (part of http://www.foolabs.com/xpdf/download.html ) seems to do it nicely and processes small chunks at a time.

Ross

On Mon, Mar 22, 2010 at 9:43 AM, Erik Hatcher wrote:
> Why not feed the original PDF files in instead? Just curious if pdftotext
> is doing a better job than Tika's PDFBox stuff.
>
> Erik
>
> On Mar 22, 2010, at 9:30 AM, Ross wrote:
>
>> Thanks Georg
>>
>> I don't think it's that because it crashes on a one word test file I
>> create using the nano editor. I don't think nano is adding anything
>> extra.
>>
>> My real files are created by a Windows utility called pdftotext. I
>> solved the problem by getting pdftotext to generate html files rather
>> than plain text. It just adds an html header and wraps everything in a
>> tag. That seems to keep Solr happy.
>>
>> Ross
>>
>> On Mon, Mar 22, 2010 at 9:08 AM, György Frivolt wrote:
>>>
>>> Hi,
>>>
>>> I had problem with indexing documents some months ago as well. I found
>>> that there were XML control characters in the documents and these were
>>> not handled by Solr. Maybe it is the case for you as well.
>>>
>>> Regards,
>>>
>>> Georg
>>>
>>>
>>> On Sun, Mar 21, 2010 at 5:58 PM, Ross wrote:
>>>
>>>> Hi all
>>>>
>>>> I'm trying to import some text files. I'm mostly following Avi
>>>> Rappoport's tutorial. Some of my files cause Solr to crash while
>>>> indexing. I've narrowed it down to a very simple example.
>>>>
>>>> I have a file named test.txt with one line.
That line is the word
>>>> XXBLE and nothing else
>>>>
>>>> This is the command I'm using.
>>>>
>>>> curl "http://localhost:8080/solr-example/update/extract?literal.id=1&commit=true"
>>>> -F "myfi...@test.txt"
>>>>
>>>> The result is pasted below. Other files work just fine. The problem
>>>> seems to be related to the letters B and E. If I change them to
>>>> something else or make them lower case then it works. In my real
>>>> files, the XX is something else but the result is the same. It's a
>>>> common word in the files. I guess for this "quick and dirty" job I'm
>>>> doing I could do a bulk replace in the files to make it lower case.
>>>>
>>>> Is there any workaround for this?
>>>>
>>>> Thanks
>>>> Ross
Re: Solr crashing while extracting from very simple text file
Does anyone have any thoughts or suggestions on this? I guess it's really a Tika problem. Should I try to report it to the Tika project? I wonder if someone could try it to see if it's a general problem or just me.

I can reproduce it by firing up the nano editor and creating a file with XXBLE on one line and nothing else. Try indexing that and Solr / Tika crashes. I can avoid it by editing the file slightly, but I haven't really been able to discover a consistent pattern. It works if I change the word to lower case. Also, a three-line file like this works:

a
a
XXBLE

but not:

x
x
XXBLE

It's a bit unfortunate because a similar word (a person's name, ??BLE) with the same problem appears frequently in upper case near the top of my files.

Cheers
Ross

On Sun, Mar 21, 2010 at 12:58 PM, Ross wrote:
> Hi all
>
> I'm trying to import some text files. I'm mostly following Avi
> Rappoport's tutorial. Some of my files cause Solr to crash while
> indexing. I've narrowed it down to a very simple example.
>
> I have a file named test.txt with one line. That line is the word
> XXBLE and nothing else
>
> This is the command I'm using.
>
> curl "http://localhost:8080/solr-example/update/extract?literal.id=1&commit=true"
> -F "myfi...@test.txt"
>
> The result is pasted below. Other files work just fine. The problem
> seems to be related to the letters B and E. If I change them to
> something else or make them lower case then it works. In my real
> files, the XX is something else but the result is the same. It's a
> common word in the files. I guess for this "quick and dirty" job I'm
> doing I could do a bulk replace in the files to make it lower case.
>
> Is there any workaround for this?
>
> Thanks
> Ross
Re: Solr crashing while extracting from very simple text file
Hi Chris, thanks for looking at this.

I'm using Solr 1.4.0, including the Tika that's in the tgz file, which means Tika 0.4. I've now discovered that only two letters are required: a single line with XE will crash it.

This fails:

r...@gamma:/home/ross# hexdump -C test.txt
00000000  58 45 0a                                          |XE.|
00000003
r...@gamma:/home/ross#

This works:

r...@gamma:/home/ross# hexdump -C test.txt
00000000  58 46 0a                                          |XF.|
00000003
r...@gamma:/home/ross#

XA, XB, XC, XD, XF all work okay. There's just something special about XE. The command I use is:

curl "http://localhost:8080/solr-example/update/extract?literal.id=doc1&fmap.content=body&commit=true" -F "myfi...@test.txt"

I filed a bug at https://issues.apache.org/jira/browse/TIKA-397 but I guess 0.4 is an old version, so I wouldn't expect it to get much attention. It looks like I should upgrade Tika to 0.6. I don't really know how to do that, or whether Solr 1.4 works with Tika 0.6; the Tika pages talk about using Maven to build it. Sorry, I'm no Linux expert.

Ross

On Thu, Apr 1, 2010 at 1:07 PM, Chris Hostetter wrote:
>
> : Yes, please report this to the Tika project.
>
> except that when i run "tika-app-0.6.jar" on a text file like the one Ross
> describes, i don't get the error he describes, which means it may be
> something off in how Solr is using Tika.
>
> Ross: I can't reproduce this error on the trunk using the example solr
> configs and the text file below. can you verify exactly which version of
> Solr you are using (and which version of tika you are using inside solr)
> and the exact byte contents of your simplest problematic text file?
>
> hoss...@brunner:~/tmp$ cat tmp.txt
> x
> x
> XXBLE
> hoss...@brunner:~/tmp$ hexdump -C tmp.txt
> 00000000  78 0a 78 0a 58 58 42 4c 45 0a                    |x.x.XXBLE.|
> 0000000a
> hoss...@brunner:~/tmp$ curl "http://localhost:8983/solr/update/extract?literal.id=1&commit=true" -F "myfi...@tmp.txt"
> <response>
> <lst name="responseHeader"><int name="status">0</int><int name="QTime">66</int></lst>
> </response>
>
> -Hoss
>
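The byte-level difference Ross describes can be recreated locally without a Solr instance; a small sketch that builds both test files and dumps their bytes (the curl step against a running Solr is omitted here):

```shell
#!/bin/sh
# Recreate the two test files from the thread and show the single-byte
# difference between the one that crashed Tika 0.4 (XE) and the one
# that indexed fine (XF).
printf 'XE\n' > test-fails.txt   # bytes 58 45 0a
printf 'XF\n' > test-works.txt   # bytes 58 46 0a

od -An -tx1 test-fails.txt
od -An -tx1 test-works.txt
```

From there, posting each file to /update/extract as in the thread distinguishes the failing input from the working one.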
RE: Indexing data from multiple datasources
This thread got me thinking a bit... Does SOLR support the concept of "partial updates" to documents? By this I mean updating a subset of fields in a document that already exists in the index, without having to resubmit the entire document.

An example would be storing/indexing user tags associated with documents. These tags will not be available when the document is initially presented to SOLR, and may or may not come along at a later time. When that time comes, can we just submit the tag data (and document identifier, I'd imagine), or do we have to import the entire document?

new to SOLR...

> Date: Thu, 9 Jun 2011 14:00:43 -0400
> Subject: Re: Indexing data from multiple datasources
> From: erickerick...@gmail.com
> To: solr-user@lucene.apache.org
>
> How are you using it? Streaming the files to Solr via HTTP? You can use Tika
> on the client to extract the various bits from the structured documents, and
> use SolrJ to assemble various bits of that data Tika exposes into a Solr
> document that you then send to Solr. At the point you're transferring data
> from the Tika parse to the Solr document, you could add any data from your
> database that you wanted.
>
> The result is that you'd be indexing the complete Solr document only once.
>
> You're right that updating a document in Solr overwrites the previous
> version and any data in the previous version is lost.
>
> Best
> Erick
>
> On Thu, Jun 9, 2011 at 1:20 PM, Greg Georges wrote:
> > Hello Erick,
> >
> > Thanks for the response. No, I am using the extract handler to extract the
> > data from my text files. In your second approach, you say I could use a DIH
> > to update the index which would have been created by the extract handler in
> > the first phase. I thought that, let's say I get info from the DB and update
> > the index with the document ID, will I overwrite the data and lose the
> > initial data from the extract handler phase?
Thanks
> >
> > Greg
> >
> > -----Original Message-----
> > From: Erick Erickson [mailto:erickerick...@gmail.com]
> > Sent: 9 juin 2011 12:15
> > To: solr-user@lucene.apache.org
> > Subject: Re: Indexing data from multiple datasources
> >
> > Hmmm, when you say you use Tika, are you using some custom Java code?
> > Because if you are, the best thing to do is query your database at that
> > point and add whatever information you need to the document.
> >
> > If you're using DIH to do the crawl, consider implementing a
> > Transformer to do the database querying and modify the document as
> > necessary. This is pretty simple to do, we can chat a bit more depending
> > on whether either approach makes sense.
> >
> > Best
> > Erick
> >
> > On Thu, Jun 9, 2011 at 10:43 AM, Greg Georges wrote:
> >> Hello all,
> >>
> >> I have checked the forums to see if it is possible to create an index
> >> from multiple datasources. I have found references to SOLR-1358, but I
> >> don't think this fits my scenario. In all, we have an application where we
> >> upload files. On the file upload, I use the Tika extract handler to save
> >> metadata from the file (_attr, literal values, etc..). We also have a
> >> database which has information on the uploaded files, like the category,
> >> type, etc.. I would like to update the index to include this information
> >> from the db in the index for each document. If I run a dataimporthandler
> >> after the extract phase I am afraid that updating the doc in the index
> >> by its id will just cause me to overwrite the old information with the
> >> info from the DB (what I understand is that Solr updates its index by ID
> >> by deleting first then recreating the info).
> >>
> >> Anyone have any pointers? Is there a clean way to do this, or must I find
> >> a way to pass the db metadata to the extract handler and save it as
> >> literal fields?
> >>
> >> Thanks in advance
> >>
> >> Greg
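Erick's one-pass approach can also be sketched at the curl level: fetch the DB fields before posting, then send them alongside the file as literal.* parameters so the full document is built in a single /update/extract request and nothing is overwritten later. The field names (category_s, doctype_s) and values here are hypothetical placeholders, not from the thread.

```shell
#!/bin/sh
# One-pass indexing sketch: combine file extraction and DB metadata in a
# single /update/extract request. Field names and the example values are
# hypothetical; in practice they would come from a database lookup.
build_extract_url() {
  id="$1"; category="$2"; doctype="$3"
  printf 'http://localhost:8983/solr/update/extract?literal.id=%s' "$id"
  printf '&literal.category_s=%s' "$category"
  printf '&literal.doctype_s=%s' "$doctype"
  printf '&commit=true\n'
}

# e.g. values fetched from the database for document 42:
build_extract_url 42 invoices pdf
# then: curl "$(build_extract_url 42 invoices pdf)" -F "file=@doc42.pdf"
```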
Unsubscribing
I've tried multiple times to unsubscribe from this list using the proper method (mailto:solr-user-unsubscr...@lucene.apache.org), but it's not working! Can anyone help with that?
RE: Unsubscribing
Nothing in the Junk folder, but that reminded me that our company is using a 3rd-party spam filter (i.e., Lanlogic)... which sure enough had snagged the confirmation emails. Since the list emails were going through, I never thought to check the filtering systems. Thanks for jogging my memory. :-)

Ross

From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
Sent: Tue 2/3/2009 2:18 PM
To: solr-user@lucene.apache.org
Subject: Re: Unsubscribing

: Subject: Unsubscribing
:
: I've tried multiple times to unsubscribe from this list using the proper
: method (mailto:solr-user-unsubscr...@lucene.apache.org), but it's not
: working! Can anyone help with that?

Did you get a confirmation email from the mailing list software asking you to verify that you really wanted to unsubscribe? (is it in a Junk Mail or Spam folder that you didn't think to check?) did you reply to it according to the instructions?

see also...
http://www.nabble.com/Re%3A-PLEASE-REMOVE-ME-FROM-THIS-EMAIL-LIST!-p10879673.html

-Hoss
Mac OSX - error reading /usr/local/lib/libsvnjavahl-1.0.0.0.dylib
Hi all,

I am trying to run Solr on OSX after a successful installation and tests on Linux. While trying to run with JDK 1.5.0, I am getting the following exception:

HTTP ERROR: 500
Unable to compile class for JSP

Generated servlet error:
error: error reading /usr/local/lib/libsvnjavahl-1.0.0.0.dylib; java.util.zip.ZipException: error in opening zip file

Note: /tmp/Jetty__8983__solr_23999/org/apache/jsp/admin/index_jsp.java uses unchecked or unsafe operations.

I have checked /usr/local/lib and can see that 'libsvnjavahl-1.0.0.0.dylib' is in fact present there. I assume it is probably quite easy to fix this, and that plenty of people are running Solr on OSX. I would appreciate any advice on how to fix this up.

Thanks for your time,

Ross McDonald
Re: Mac OSX - error reading /usr/local/lib/libsvnjavahl-1.0.0.0.dylib
Thanks for the quick response guys. I am using JDK 1.5, and am ensuring use of this JDK by typing the full path. As regards doing an 'unzip -l' on the file, it indeed generates an error:

/usr/local/lib rossputin$ unzip -l libsvnjavahl-1.0.0.0.dylib
Archive: libsvnjavahl-1.0.0.0.dylib
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
note: libsvnjavahl-1.0.0.0.dylib may be a plain executable, not an archive
unzip: cannot find zipfile directory in one of libsvnjavahl-1.0.0.0.dylib or
libsvnjavahl-1.0.0.0.dylib.zip, and cannot find libsvnjavahl-1.0.0.0.dylib.ZIP, period.

oh dear.. maybe this is corrupt? A 'jar tvf' generates no output, once again indicating that the file is not a valid jar.

regards,

Ross.

On 15 Aug 2006, at 00:09, Chris Hostetter wrote:

: The older version of Jetty that we are using requires the JDK version,
: not the JRE version of 'java' so it can compile JSPs via javac. Maybe
: that's the problem? Try typing the full path to the java
: executable to verify.

Perhaps ... but this seems like an awfully strange exception to get if the problem were that it can't find the compiler...

: > error: error reading /usr/local/lib/libsvnjavahl-1.0.0.0.dylib;
: > java.util.zip.ZipException: error in opening zip file

...he said that /usr/local/lib/libsvnjavahl-1.0.0.0.dylib did in fact exist. (I would expect a class not found or an attempt at opening a file that doesn't exist in the other case)

I don't really know much about Java on Macs, or what a "dylib" file is ... but the exception seems to indicate that it's expecting to find a Zip file (possibly a jar?)

Ross: does "unzip -l" or "jar tf" on /usr/local/lib/libsvnjavahl-1.0.0.0.dylib work for you?

-Hoss
Re: Mac OSX - error reading /usr/local/lib/libsvnjavahl-1.0.0.0.dylib
Hi all, yes I suspect that may be the problem. I will go through some web guides on Java and Mac and see if anything comes up.. thanks, Ross. On 15 Aug 2006, at 13:44, Erik Hatcher wrote: Just as a data point, my team (3 of us) develop on OS X using Solr with no problems. Two of us are on MacBook Pros and one poor soul is on a PowerBook. I know that doesn't help, and I do recall stumbling into this particular issue or one very much like it a long while ago (not Solr related) on my previous PowerBook system, but I don't, unfortunately, recall how I resolved it. I remember having some growing pains when switching from JDK 1.4 to 1.5 and how to ensure the environment is configured appropriately - might that be it? Erik On Aug 15, 2006, at 8:24 AM, Mike Baranczak wrote: A .dylib file isn't a zip or a jar at all, it's a native OS X shared library. I have absolutely NO idea why Solr is trying to open it. Are you deploying just the stock version, or did you add some of your own code to it? -MB On Aug 15, 2006, at 2:06 AM, Ross McDonald wrote: Thanks for the quick response guys. I am using JDK 1.5, and am ensuring use of this JDK by typing the full path. As regards doing an 'unzip -l' on the file, it does indeed generate an error.. /usr/local/lib rossputin$ unzip -l libsvnjavahl-1.0.0.0.dylib Archive: libsvnjavahl-1.0.0.0.dylib End-of-central-directory signature not found. Either this file is not a zipfile, or it constitutes one disk of a multi-part archive. In the latter case the central directory and zipfile comment will be found on the last disk(s) of this archive. note: libsvnjavahl-1.0.0.0.dylib may be a plain executable, not an archive unzip: cannot find zipfile directory in one of libsvnjavahl-1.0.0.0.dylib or libsvnjavahl-1.0.0.0.dylib.zip, and cannot find libsvnjavahl-1.0.0.0.dylib.ZIP, period. Oh dear.. maybe this is corrupt? A 'jar tvf' generates no output, once again indicating that the file is not a valid jar. Regards, Ross.
Re: Mac OSX - error reading /usr/local/lib/libsvnjavahl-1.0.0.0.dylib
Fantastic, thanks for your efforts Chris, that was in fact the problem. I removed all trace of those files, and it works just fine!!! Now I just need to figure out why my system was still displaying these symptoms despite updates being done... Once again, thanks for your help, Ross. On 15 Aug 2006, at 19:06, Chris Hostetter wrote: Wild speculation: but perhaps this is just a classpath issue. Perhaps the on-the-fly compiler used by Jetty (and Tomcat, in the case of the URL I sent earlier) tries to build a complete list of resources available from every item in the classpath before compiling JSPs, and in your case these dylib files are mistakenly in your classpath. In any event, I'm guessing that if you tried to run Jetty without Solr and load up a simple helloworld.jsp you'd get the same problem. You may want to try installing Tomcat to see if you get the same problem with it (trying a helloworld.jsp before trying Solr). You know what ... I think this is definitely it. This Tomcat bug claims it's a bug in the Apple J2SE 5.0 which has been fixed (look at comment #10) ... http://issues.apache.org/bugzilla/show_bug.cgi?id=34856 -Hoss
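Chris's classpath theory is easy to test directly: walk every entry on java.class.path and try to open each non-directory entry as a zip. Any entry that fails the way libsvnjavahl-1.0.0.0.dylib did is a candidate culprit. A rough diagnostic sketch, standalone and written against JDK 1.5 to match the setup in this thread (not part of Solr itself):

```java
import java.io.File;
import java.io.IOException;
import java.util.zip.ZipFile;

public class ClasspathCheck {
    public static void main(String[] args) {
        // Every non-directory classpath entry should be readable as a zip/jar;
        // a native .dylib on the classpath throws a ZipException here, just
        // like the one Jetty's JSP compiler reported.
        String cp = System.getProperty("java.class.path");
        for (String entry : cp.split(File.pathSeparator)) {
            File f = new File(entry);
            if (!f.isFile()) {
                continue; // directories and missing entries are fine to skip
            }
            ZipFile zf = null;
            try {
                zf = new ZipFile(f);
                System.out.println("OK:  " + entry);
            } catch (IOException e) {
                System.out.println("BAD: " + entry + " (" + e + ")");
            } finally {
                if (zf != null) {
                    try { zf.close(); } catch (IOException ignored) {}
                }
            }
        }
    }
}
```

Run it with the same classpath the servlet container uses; any "BAD" line is a file the JSP compiler would trip over.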