Response writer configs

2009-12-01 Thread Ross
Hi all

I'm starting to play with Solr. This might be a silly question and not
particularly important but I'm curious.

I set up the example site using the tutorial. It works very well. I was
looking around the config files and noticed that in my solrconfig.xml
the queryResponseWriter section is commented out, but the writers all
still work. wt=php etc. returns the PHP format. How is it working if
they're not defined? Are they defined elsewhere?

Thanks
Ross
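
(If memory serves, Solr registers a standard set of response writers internally when none are declared, which is why the commented-out section still works; the commented block in solrconfig.xml simply mirrors those defaults. Declared explicitly, it looks roughly like this — class names as in the Solr 1.4 example config, so verify against your version:)

```xml
<!-- Explicit declarations equivalent to Solr's built-in defaults (sketch) -->
<queryResponseWriter name="xml"  class="org.apache.solr.request.XMLResponseWriter" default="true"/>
<queryResponseWriter name="json" class="org.apache.solr.request.JSONResponseWriter"/>
<queryResponseWriter name="php"  class="org.apache.solr.request.PHPResponseWriter"/>
<queryResponseWriter name="phps" class="org.apache.solr.request.PHPSerializedResponseWriter"/>
```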


Solr Cell - PDFs plus literal metadata - GET or POST ?

2009-12-29 Thread Ross
Hi all

I'm experimenting with Solr. I've successfully indexed some PDFs and
all looks good but now I want to index some PDFs with metadata pulled
from another source. I see this example in the docs.

curl "http://localhost:8983/solr/update/extract?literal.id=doc4&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3&literal.blah_s=Bah" \
  -F "tutori...@tutorial.pdf"

I can write code to generate a script with those commands substituting
my own literal.whatever.  My metadata could be up to a couple of KB in
size. Is there a way of making the literal a POST variable rather than
a GET?  Will Solr Cell accept it as a POST? Something doesn't feel
right about generating a huge long URL. I think Tomcat can handle up
to 8 KB by default so I guess that's okay although I'm not sure how
long a Linux command line can reasonably be.

I know Curl may not be the right thing to use for production use but
this is initially to get some data indexed for test and demo.

Thanks
Ross
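
For what it's worth, each literal can travel as its own multipart form part instead of a URL parameter, which keeps the metadata out of the URL entirely. A sketch (the literal names are from the example above; the "file" part name and curl target are illustrative, so adapt to your setup):

```shell
# Pass Solr Cell literals as multipart form fields so large metadata
# goes in the POST body rather than the query string.
args=(
  -F "literal.id=doc4"
  -F "literal.blah_s=Bah"
  -F "commit=true"
  -F "file=@tutorial.pdf"
)
# Against a live Solr you would run:
#   curl "http://localhost:8983/solr/update/extract" "${args[@]}"
printf '%s\n' "${args[@]}"
```

curl can also read a field's value from a file: `-F "literal.mydata=<mydata.txt"` inserts the contents of mydata.txt as the value (note `<` rather than `@`, which would attach the file as an upload instead).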


Re: Solr Cell - PDFs plus literal metadata - GET or POST ?

2010-01-06 Thread Ross
On Tue, Jan 5, 2010 at 2:25 PM, Giovanni Fernandez-Kincade
 wrote:
> Really? Doesn't it have to be delimited differently, if both the file 
> contents and the document metadata will be part of the POST data? How does 
> Solr Cell tell the difference between the literals and the start of the file? 
> I've tried this before and haven't had any luck with it.

Thanks Shalin.

And Giovanni, yes it definitely works.

This will set literal.mydata to the contents of mydata.txt

curl "http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true" \
  -F "myfi...@tutorial.html" -F "literal.mydata=<mydata.txt"

> -Original Message-
> From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com]
> Sent: Monday, January 04, 2010 4:28 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Cell - PDFs plus literal metadata - GET or POST ?
>
> On Wed, Dec 30, 2009 at 7:49 AM, Ross  wrote:
>
>> Hi all
>>
>> I'm experimenting with Solr. I've successfully indexed some PDFs and
>> all looks good but now I want to index some PDFs with metadata pulled
>> from another source. I see this example in the docs.
>>
>> curl "
>> http://localhost:8983/solr/update/extract?literal.id=doc4&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&boost.foo_t=3&literal.blah_s=Bah
>> "
>>  -F "tutori...@tutorial.pdf"
>>
>> I can write code to generate a script with those commands substituting
>> my own literal.whatever.  My metadata could be up to a couple of KB in
>> size. Is there a way of making the literal a POST variable rather than
>> a GET?
>
>
> With Curl? Yes, see the man page.
>
>
>>  Will Solr Cell accept it as a POST?
>
>
> Yes, it will.
>
> --
> Regards,
> Shalin Shekhar Mangar.
>


Solr Cell. Seems to be only indexing the first N bytes of a text file.

2010-03-20 Thread Ross
Hi all

I'm trying to index some text files using Solr Cell. I'm using the
schema from Avi Rappoport's tutorial about indexing html and text
files although I also had the same problem with the example/solr
setup.

My problem is that words past or "below" a certain point in a file are
not being indexed. I must be hitting some limit but I haven't been
able to figure out what. I'm hosting with Tomcat and using cURL to
post files to /update/extract as per Avi's tutorial and other docs. I
don't think it's an HTTP limit during the POST, because the whole file
is being successfully stored in Solr. I know that because if I retrieve
the file body using a query term that does work, the word that doesn't
match appears lower down in the returned contents. I'm storing the
contents now for testing; once I have this working, the file contents
will probably be indexed only.

On a test file that I've been editing, moving my unique word around,
indexing seems to stop working if that word is beyond the 100 KB point
in the file. I think an earlier file gave a slightly different result.

Hopefully I'm missing something obvious.

Thanks for any help.

Ross


Re: Solr Cell. Seems to be only indexing the first N bytes of a text file.

2010-03-20 Thread Ross
Thanks Erick.

That was it. All looking good now.

Cheers
Ross


On Sat, Mar 20, 2010 at 9:29 PM, Erick Erickson  wrote:
> Does your solrconfig file have a line like
> <maxFieldLength>10000</maxFieldLength> ?
>
> Try upping the 10000...
>
> HTH
> Erick
>
>
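
The setting Erick is pointing at lives in solrconfig.xml; in the Solr 1.4 example config it looks roughly like this (from memory, so verify locally):

```xml
<indexDefaults>
  <!-- Maximum number of tokens indexed per field; tokens past this
       count are silently dropped, which matches the ~100 KB cutoff
       described above. Raise it (and reindex) to index whole files. -->
  <maxFieldLength>10000</maxFieldLength>
</indexDefaults>
```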


Solr crashing while extracting from very simple text file

2010-03-21 Thread Ross
Hi all

I'm trying to import some text files. I'm mostly following Avi
Rappoport's tutorial.  Some of my files cause Solr to crash while
indexing. I've narrowed it down to a very simple example.

I have a file named test.txt with one line. That line is the word
XXBLE and nothing else.

This is the command I'm using.

curl "http://localhost:8080/solr-example/update/extract?literal.id=1&commit=true" \
  -F "myfi...@test.txt"

The result is pasted below. Other files work just fine. The problem
seems to be related to the letters B and E. If I change them to
something else or make them lower case then it works. In my real
files, the XX is something else but the result is the same. It's a
common word in the files. I guess for this "quick and dirty" job I'm
doing I could do a bulk replace in the files to make it lower case.

Is there any workaround for this?

Thanks
Ross

Apache Tomcat/6.0.20 - Error report
HTTP Status 500 - org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.txt.txtpar...@19ccba

org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.txt.txtpar...@19ccba
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:849)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
        at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:454)
        at java.lang.Thread.run(Thread.java:636)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.txt.txtpar...@19ccba
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:121)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
        ... 18 more
Caused by: java.lang.NullPointerException
        at java.io.Reader.<init>(Reader.java:78)
        at java.io.BufferedReader.<init>(BufferedReader.java:93)
        at java.io.BufferedReader.<init>(BufferedReader.java:108)
        at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:59)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
        ... 20 more

Re: Solr crashing while extracting from very simple text file

2010-03-22 Thread Ross
Thanks Georg

I don't think it's that, because it crashes on a one-word test file I
created using the nano editor. I don't think nano is adding anything
extra.

My real files are created by a Windows utility called pdftotext. I
solved the problem by getting pdftotext to generate html files rather
than plain text. It just adds an html header and wraps everything in a
 tag. That seems to keep Solr happy.

Ross
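
The wrapping step can be scripted; a minimal sketch (no HTML escaping of &, <, > is done here, so it is only safe for plain OCR text, and the "myfile" form name is illustrative):

```shell
# Wrap a plain-text file in a minimal HTML document so Tika routes it
# to its HTML parser instead of the TXTParser that fails on this input.
printf 'XXBLE\n' > test.txt
printf '<html><body><pre>%s</pre></body></html>\n' "$(cat test.txt)" > test.html
cat test.html
# then index the wrapper instead of the raw text:
#   curl "http://localhost:8080/solr-example/update/extract?literal.id=1&commit=true" -F "myfile=@test.html"
```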

On Mon, Mar 22, 2010 at 9:08 AM, György Frivolt
 wrote:
> Hi,
>
>    I had problem with indexing documents some months ago as well. I found
> that there were XML control characters in the documents and these were not
> handled by Solr. Maybe it is the case for you as well.
>
> Regards,
>
>    Georg
>
>

Re: Solr crashing while extracting from very simple text file

2010-03-22 Thread Ross
I thought you might ask that :-)

It's because the pdf files are scanned from paper documents and OCR'd
to produce text. They still contain the image so are huge. The smaller
files are about 40 MB and cause a Java out of heap memory error. The
larger files are getting close to 500 MB. I didn't have anything to do
with the scanning. I'm guessing but it seems that something in the
Tomcat / Solr / Tika implementation tries to load it all into memory
at once.

pdftotext (part of http://www.foolabs.com/xpdf/download.html ) seems
to do it nicely and processes small chunks at a time.

Ross
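
The conversion pass Ross describes can be scripted roughly like this (a dry run that only prints the commands; -htmlmeta is the xpdf pdftotext option for minimal-HTML output, so check your version's man page, and the file names are made up):

```shell
# Print the pdftotext conversion command for each scanned PDF (dry run).
# Drop the echo to actually run the conversions.
cmds=""
for pdf in scan-0001.pdf scan-0002.pdf; do
  cmd="pdftotext -htmlmeta $pdf ${pdf%.pdf}.html"
  echo "$cmd"
  cmds="$cmds$cmd"$'\n'
done
```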


On Mon, Mar 22, 2010 at 9:43 AM, Erik Hatcher  wrote:
> Why not feed the original PDF files in instead?  Just curious if pdftotext
> is doing a better job than Tika's PDFBox stuff.
>
>        Erik
>
> On Mar 22, 2010, at 9:30 AM, Ross wrote:
>
>> Thanks Georg
>>
>> I don't think it's that because it crashes on a one word test file I
>> create using the nano editor. I don't think nano is adding anything
>> extra.
>>
>> My real files are created by a Windows utility called pdftotext. I
>> solved the problem by getting pdftotext to generate html files rather
>> than plain text. It just adds an html header and wraps everything in a
>>  tag. That seems to keep Solr happy.
>>
>> Ross
>>
>> On Mon, Mar 22, 2010 at 9:08 AM, György Frivolt
>>  wrote:
>>>
>>> Hi,
>>>
>>>   I had problem with indexing documents some months ago as well. I found
>>> that there were XML control characters in the documents and these were
>>> not
>>> handled by Solr. Maybe it is the case for you as well.
>>>
>>> Regards,
>>>
>>>   Georg
>>>
>>>
>>> On Sun, Mar 21, 2010 at 5:58 PM, Ross  wrote:
>>>
>>>> Hi all
>>>>
>>>> I'm trying to import some text files. I'm mostly following Avi
>>>> Rappoport's tutorial.  Some of my files cause Solr to crash while
>>>> indexing. I've narrowed it down to a very simple example.
>>>>
>>>> I have a file named test.txt with one line. That line is the word
>>>> XXBLE and nothing else
>>>>
>>>> This is the command I'm using.
>>>>
>>>> curl "
>>>>
>>>> http://localhost:8080/solr-example/update/extract?literal.id=1&commit=true
>>>> "
>>>> -F "myfi...@test.txt"
>>>>
>>>> The result is pasted below. Other files work just fine. The problem
>>>> seems to be related to the letters B and E. If I change them to
>>>> something else or make them lower case then it works. In my real
>>>> files, the XX is something else but the result is the same. It's a
>>>> common word in the files. I guess for this "quick and dirty" job I'm
>>>> doing I could do a bulk replace in the files to make it lower case.
>>>>
>>>> Is there any workaround for this?
>>>>
>>>> Thanks
>>>> Ross
>>>>
>>>> Apache Tomcat/6.0.20 - Error
>>>> report<!--H1
>>>>
>>>>
>>>> {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:22px;}
>>>> H2
>>>>
>>>> {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:16px;}
>>>> H3
>>>>
>>>> {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:14px;}
>>>> BODY
>>>>
>>>> {font-family:Tahoma,Arial,sans-serif;color:black;background-color:white;}
>>>> B
>>>>
>>>> {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;}
>>>> P
>>>>
>>>> {font-family:Tahoma,Arial,sans-serif;background:white;color:black;font-size:12px;}A
>>>> {color : black;}A.name {color : black;}HR {color :
>>>> #525D76;}--> HTTP Status 500 -
>>>> org.apache.tika.exception.TikaException: Unexpected RuntimeException
>>>> from org.apache.tika.parser.txt.txtpar...@19ccba
>>>>
>>>> org.apache.solr.common.SolrException:
>>>> org.apache.tika.exception.TikaException: Unexpected RuntimeException
>>>> from org.apache.tika.parser.txt.txtpar...@19ccba
>>>>       at
>>>>
>>>> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
>>>>       at
&

Re: Solr crashing while extracting from very simple text file

2010-03-31 Thread Ross
Does anyone have any thoughts or suggestions on this?  I guess it's
really a Tika problem. Should I try to report it to the Tika project?

I wonder if someone could try it to see if it's a general problem or
just me. I can reproduce it by firing up the nano editor, creating a
file with XXBLE on one line and nothing else. Try indexing that and
Solr / Tika crashes. I can avoid it by editing the file slightly but I
haven't really been able to discover a consistent pattern. It works if
I change the word to lower case. Also a three line file like this
works

a
a
XXBLE

but not

x
x
XXBLE

It's a bit unfortunate because a similar word (a person's name ??BLE )
with the same problem appears frequently in upper case near the top of
my files.

Cheers
Ross



Re: Solr crashing while extracting from very simple text file

2010-04-01 Thread Ross
Hi Chris, thanks for looking at this.

I'm using Solr 1.4.0 including the Tika that's in the tgz file which
means Tika 0.4.

I've now discovered that only two letters are required. A single line
with XE will crash it.

This fails:

r...@gamma:/home/ross# hexdump -C test.txt
00000000  58 45 0a                                          |XE.|
00000003
r...@gamma:/home/ross#

This works

r...@gamma:/home/ross# hexdump -C test.txt
00000000  58 46 0a                                          |XF.|
00000003
r...@gamma:/home/ross#

XA, XB, XC, XD, XF all work okay. There's just something special about XE.

The command I use is:

curl "http://localhost:8080/solr-example/update/extract?literal.id=doc1&fmap.content=body&commit=true" \
  -F "myfi...@test.txt"

I filed a bug at https://issues.apache.org/jira/browse/TIKA-397 but I
guess 0.4 is an old version so I wouldn't expect it to get much
attention.

It looks like I should upgrade Tika to 0.6. I don't really know how to
do that or if Solr 1.4 works with Tika 0.6. The Tika pages talk about
using Maven to build it. Sorry, I'm no Linux expert.

Ross


On Thu, Apr 1, 2010 at 1:07 PM, Chris Hostetter
 wrote:
>
> : Yes, please report this to the Tika project.
>
> except that when i run "tika-app-0.6.jar" on a text file like the one Ross
> describes, i don't get the error he describes, which means it may be
> something off in how Solr is using Tika.
>
> Ross: I can't reproduce this error on the trunk using the example solr
> configs and the text file below.  can you verify exactly which version of
> SOlr you are using (and which version of tika you are using inside solr)
> and the exact byte contents of your simplest problematic text file?
>
> hoss...@brunner:~/tmp$ cat tmp.txt
> x
> x
> XXBLE
> hoss...@brunner:~/tmp$ hexdump -C tmp.txt
> 00000000  78 0a 78 0a 58 58 42 4c  45 0a                    |x.x.XXBLE.|
> 0000000a
> hoss...@brunner:~/tmp$ curl "http://localhost:8983/solr/update/extract?literal.id=1&commit=true" -F "myfi...@tmp.txt"
> <?xml version="1.0" encoding="UTF-8"?>
> <response>
> <lst name="responseHeader"><int name="status">0</int><int name="QTime">66</int></lst>
> </response>
>
>
> -Hoss
>
>


RE: Indexing data from multiple datasources

2011-06-09 Thread David Ross

This thread got me thinking a bit...
Does SOLR support the concept of "partial updates" to documents? By this I
mean updating a subset of fields in a document that already exists in the
index, without having to resubmit the entire document.
An example would be storing/indexing user tags associated with documents. These 
tags will not be available when the document is initially presented to SOLR, 
and may or may not come along at a later time. When that time comes, can we 
just submit the tag data (and document identifier I'd imagine), or do we have 
to import the entire document?
new to SOLR...

> Date: Thu, 9 Jun 2011 14:00:43 -0400
> Subject: Re: Indexing data from multiple datasources
> From: erickerick...@gmail.com
> To: solr-user@lucene.apache.org
> 
> How are you using it? Streaming the files to Solr via HTTP? You can use Tika
> on the client to extract the various bits from the structured documents, and
> use SolrJ to assemble various bits of that data Tika exposes into a
> Solr document
> that you then send to Solr. At the point you're transferring data from the
> Tika parse to the Solr document, you could add any data from your database 
> that
> you wanted.
> 
> The result is that you'd be indexing the complete Solr document only once.
> 
> You're right that updating a document in Solr overwrites the previous
> version and any
> data in the previous version is lost
> 
> Best
> Erick
> 
> On Thu, Jun 9, 2011 at 1:20 PM, Greg Georges  wrote:
> > Hello Erick,
> >
> > Thanks for the response. No, I am using the extract handler to extract the 
> > data from my text files. In your second approach, you say I could use a DIH 
> > to update the index which would have been created by the extract handler in 
> > the first phase. I thought that lets say I get info from the DB and update 
> > the index with the document ID, will I overwrite the data and lose the 
> > initial data from the extract handler phase? Thanks
> >
> > Greg
> >
> > -Original Message-
> > From: Erick Erickson [mailto:erickerick...@gmail.com]
> > Sent: 9 juin 2011 12:15
> > To: solr-user@lucene.apache.org
> > Subject: Re: Indexing data from multiple datasources
> >
> > Hmmm, when you say you use Tika, are you using some custom Java code? 
> > Because
> > if you are, the best thing to do is query your database at that point
> > and add whatever information
> > you need to the document.
> >
> > If you're using DIH to do the crawl, consider implementing a
> > Transformer to do the database
> > querying and modify the document as necessary This is pretty
> > simple to do, we can
> > chat a bit more depending on whether either approach makes sense.
> >
> > Best
> > Erick
> >
> >
> >
> > On Thu, Jun 9, 2011 at 10:43 AM, Greg Georges  
> > wrote:
> >> Hello all,
> >>
> >> I have checked the forums to see if it is possible to create and index 
> >> from multiple datasources. I have found references to SOLR-1358, but I 
> >> don't think this fits my scenario. In all, we have an application where we 
> >> upload files. On the file upload, I use the Tika extract handler to save 
> >> metadata from the file (_attr, literal values, etc..). We also have a 
> >> database which has information on the uploaded files, like the category, 
> >> type, etc.. I would like to update the index to include this information 
> >> from the db in the index for each document. If I run a dataimporthandler 
> >> after the extract phase I am afraid that by updating the doc in the index 
> >> by its id will just cause that I overwrite the old information with the 
> >> info from the DB (what I understand is that Solr updates its index by ID 
> >> by deleting first then recreating the info).
> >>
> >> Anyone have any pointers, is there a clean way to do this, or must I find 
> >> a way to pass the db metadata to the extract handler and save it as 
> >> literal fields?
> >>
> >> Thanks in advance
> >>
> >> Greg
> >>
> >
  

Unsubscribing

2009-02-02 Thread Ross MacKinnon
I've tried multiple times to unsubscribe from this list using the proper method 
(mailto:solr-user-unsubscr...@lucene.apache.org), but it's not working!  Can 
anyone help with that?
 


RE: Unsubscribing

2009-02-03 Thread Ross MacKinnon
Nothing in the Junk folder, but that reminded me that our company is using a 
3rd party spam filter (i.e., Lanlogic)... which sure enough had snagged the 
confirmation emails.  Since the list emails were going through I never thought 
to check the filtering systems.  Thanks for jogging my memory. :-)
 
Ross



From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
Sent: Tue 2/3/2009 2:18 PM
To: solr-user@lucene.apache.org
Subject: Re: Unsubscribing




: Subject: Unsubscribing
:
: I've tried multiple times to unsubscribe from this list using the proper
: method (mailto:solr-user-unsubscr...@lucene.apache.org), but it's not
: working!  Can anyone help with that?

Did you get a confirmation email from the mailing list software asking you
to verify that you really wanted to unsubscribe?  (is it in a Junk Mail or
Spam folder that you didn't think to check?) did you reply to it
according to the instructions?

see also...

http://www.nabble.com/Re%3A-PLEASE-REMOVE-ME-FROM-THIS-EMAIL-LIST!-p10879673.html



-Hoss




Mac OSX - error reading /usr/local/lib/libsvnjavahl-1.0.0.0.dylib

2006-08-14 Thread Ross McDonald

Hi all,

I am trying to run Solr on OSX after a successful installation and
tests on Linux. While trying to run with JDK 1.5.0, I am getting the
following exception:


HTTP ERROR: 500

Unable to compile class for JSP

Generated servlet error:
error: error reading /usr/local/lib/libsvnjavahl-1.0.0.0.dylib; java.util.zip.ZipException: error in opening zip file
Note: /tmp/Jetty__8983__solr_23999/org/apache/jsp/admin/index_jsp.java uses unchecked or unsafe operations.


I have checked /usr/local/lib and can see that  
'libsvnjavahl-1.0.0.0.dylib' is in fact present there,


I assume this is probably quite easy to fix, and that plenty of  
people are running Solr on OS X. I would appreciate any advice on how  
to fix this up,


thanks for your time,

Ross McDonald




Re: Mac OSX - error reading /usr/local/lib/libsvnjavahl-1.0.0.0.dylib

2006-08-14 Thread Ross McDonald

Thanks for the quick response guys,

I am using jdk 1.5, and am ensuring use of this jdk by typing the  
full path.


As regards doing an 'unzip -l' on the file it indeed generates an  
error..



/usr/local/lib rossputin$ unzip -l libsvnjavahl-1.0.0.0.dylib
Archive:  libsvnjavahl-1.0.0.0.dylib
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
note:  libsvnjavahl-1.0.0.0.dylib may be a plain executable, not an archive
unzip:  cannot find zipfile directory in one of libsvnjavahl-1.0.0.0.dylib or
        libsvnjavahl-1.0.0.0.dylib.zip, and cannot find
        libsvnjavahl-1.0.0.0.dylib.ZIP, period.


oh dear.. maybe this is corrupt?

A 'jar tvf' generates no output, once again indicating that the file  
is not a valid jar,


regards,

Ross.
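The same diagnosis can be made directly from the file's magic bytes: every zip (and therefore every jar) begins with the bytes "PK", while a Mach-O dylib starts with a different signature entirely. A minimal sketch using throwaway files under /tmp (contents are illustrative, not the real dylib):

```shell
# A zip/jar local file header always begins with "PK" (0x50 0x4b);
# a Mach-O library starts with its own magic number instead.
printf 'PK\003\004rest-of-archive' > /tmp/fake.jar     # zip-style signature
printf '\316\372\355\376not-a-zip' > /tmp/fake.dylib   # Mach-O-style bytes

head -c 2 /tmp/fake.jar                   # prints: PK
head -c 4 /tmp/fake.dylib | od -An -tx1   # shows: ce fa ed fe
```

Anything whose first two bytes are not "PK" will make `unzip` and `jar` fail in exactly the way shown above.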



On 15 Aug 2006, at 00:09, Chris Hostetter wrote:



: The older version of Jetty that we are using requires the JDK  
version,
: not the JRE version of 'java' so it can compile JSPs via javac.   
Maybe

: that's be the problem?  Try typing the full path to the java
: executable to verify.

Perhaps ... but this seems like an awfully strange exception to get  
if the

problem were that it can't find the compiler...

: > error: error reading /usr/local/lib/libsvnjavahl-1.0.0.0.dylib;
: > java.util.zip.ZipException: error in opening zip file

...he said that /usr/local/lib/libsvnjavahl-1.0.0.0.dylib did in fact
exist.  (I would expect a class not found or an attempt at opening  
a file

that doesn't exist in the other case)

I don't really know much about Java on Macs, or what a "dylib" file  
is ...
but the exception seems to indicate that it's expecting to find a  
Zip file

(possible a jar?)

Ross: does "unzip -l" or "jar tf" on
/usr/local/lib/libsvnjavahl-1.0.0.0.dylib work for you?




-Hoss
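Hoss's suggestion of typing the full path can also be turned into a quick PATH check; a small sketch (nothing Solr-specific, just confirming that a JDK rather than a bare JRE is resolved):

```shell
# Show which 'java' binary the shell resolves, and whether 'javac' is
# also present -- javac ships with a JDK, not with a bare JRE.
if command -v java >/dev/null 2>&1; then
    command -v java          # full path of the resolved binary
else
    echo "no java on PATH"
fi
if command -v javac >/dev/null 2>&1; then
    echo "JDK detected (javac present)"
else
    echo "javac missing: JRE only, so Jetty cannot compile JSPs"
fi
```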







Re: Mac OSX - error reading /usr/local/lib/libsvnjavahl-1.0.0.0.dylib

2006-08-15 Thread Ross McDonald

Hi all,

yes I suspect that may be the problem, I will go through some web  
guides on Java and Mac, see if anything comes up..


thanks,

Ross.

On 15 Aug 2006, at 13:44, Erik Hatcher wrote:

Just as a data point, my team (3 of us) develop on OS X using Solr  
with no problems.  Two of us are on MacBook Pro's and one poor soul  
is on a PowerBook.  I know that doesn't help, and I do recall  
stumbling into this particular issue or one very much like it a  
long while ago (not Solr related) on my previous PowerBook system  
but I don't, unfortunately, recall how I resolved it.  I remember  
having some growing pains when switching from JDK 1.4 to 1.5 and  
how to ensure the environment is configured appropriately - might  
that be it?


Erik


On Aug 15, 2006, at 8:24 AM, Mike Baranczak wrote:

A .dylib file isn't a zip or a jar at all, it's a native OS X  
shared library. I have absolutely NO idea why Solr is trying to  
open it. Are you deploying just the stock version, or did you add  
some of your own code to it?


-MB



On Aug 15, 2006, at 2:06 AM, Ross McDonald wrote:


Thanks for the quick response guys,

I am using jdk 1.5, and am ensuring use of this jdk by typing the  
full path.


As regards doing an 'unzip -l' on the file it indeed generates an  
error..



/usr/local/lib rossputin$ unzip -l libsvnjavahl-1.0.0.0.dylib
Archive:  libsvnjavahl-1.0.0.0.dylib
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
note:  libsvnjavahl-1.0.0.0.dylib may be a plain executable, not an archive
unzip:  cannot find zipfile directory in one of libsvnjavahl-1.0.0.0.dylib or
        libsvnjavahl-1.0.0.0.dylib.zip, and cannot find
        libsvnjavahl-1.0.0.0.dylib.ZIP, period.


oh dear.. maybe this is corrupt?

A 'jar tvf' generates no output, once again indicating that the  
file is not a valid jar,


regards,

Ross.













Re: Mac OSX - error reading /usr/local/lib/libsvnjavahl-1.0.0.0.dylib

2006-08-16 Thread Ross McDonald
Fantastic, thanks for your efforts Chris, that was in fact the  
problem, I removed all trace of those files, and it works just fine!!!


Now I just need to figure out why my system was still displaying  
these symptoms despite updates being done...


once again thanks for your help,

Ross.

On 15 Aug 2006, at 19:06, Chris Hostetter wrote:



Wild speculation: but perhaps this is just a classpath issue.   
Perhaps the

on the fly compiler used by Jetty (and Tomcat in the case of the URL i
sent earlier) tries to build a complete list of resources available  
from
every item in the classpath before compiling JSPs, and in your case  
these

dylib files are mistakenly in your classpath.

In any event, i'm guessing that if you tried to run Jetty without  
Solr and
load up a simple helloworld.jsp you'd get the same problem.  You  
may want
to try installing Tomcat to see if you get same problem with it.  
(trying a

helloworld.jsp before trying Solr)


you know what ... i think this is definitely it.  This Tomcat bug  
claims
it's a bug in the Apple J2SE 5.0 which has been fixed (look at  
comment

#10) ...

  http://issues.apache.org/bugzilla/show_bug.cgi?id=34856



-Hoss
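If the dylibs really did end up on the classpath, each entry can be tested for a valid zip signature before the on-the-fly JSP compiler trips over it. A sketch with throwaway demo files under /tmp (a real run would substitute the container's actual classpath string):

```shell
# Flag classpath entries that are not zip archives. The two demo files
# stand in for a real jar and a stray native library on the classpath.
printf 'PK\003\004' > /tmp/real.jar
printf 'not-an-archive' > /tmp/libdemo.dylib

CP="/tmp/real.jar:/tmp/libdemo.dylib"
echo "$CP" | tr ':' '\n' | while read -r entry; do
    sig=$(head -c 2 "$entry" 2>/dev/null)
    [ "$sig" = "PK" ] || echo "not a zip archive: $entry"
done
# prints: not a zip archive: /tmp/libdemo.dylib
```

Any entry flagged this way is a candidate for the kind of ZipException Ross was seeing.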






