Re: upgrading to Tika 0.9 on Solr 1.4.1

2011-06-21 Thread Andreas Kemkes
We are successfully extracting PDF content with Solr 3.1 and Tika 0.9. Replace fontbox-1.3.1.jar jempbox-1.3.1.jar pdfbox-1.3.1.jar tika-core-0.8.jar tika-parsers-0.8.jar with fontbox-1.4.0.jar jempbox-1.4.0.jar pdfbox-1.4.0.jar tika-core-0.9.jar tika-parsers-0.9.jar I'm not entirely certa

Re: upgrading to Tika 0.9 on Solr 1.4.1

2011-06-20 Thread Andreas Kemkes
I've unsuccessfully attempted to go down this road - there are API changes, some of which I was able to solve by taking code snippets from Solr 3.1. Some extraction-related tests wouldn't pass (look for 'Solr 1.4.1 and Tika 0.9 - some tests not passing' in the archive). Ultimately, I de

Re: TikaEntityProcessor

2011-04-20 Thread Andreas Kemkes
I went unsuccessfully down this path - too many incompatibilities among versions - some code changes and recompiling required. See also thread "Solr 1.4.1 and Tika 0.9 - some tests not passing" for remaining issues. You'll have better luck with the newer Solr 3.1 release, which already uses T

Re: Solr 1.4.1 and Tika 0.9 - some tests not passing

2011-04-01 Thread Andreas Kemkes
Thank you. That is valuable guidance. In light of the recent release of Solr 3.1, I decided to first try that distribution, as it already uses Tika 0.8, which is much closer to my target. Out of the box (i.e., w/o replacing the Tika and PDFBox libraries) the tests pass, yet I see the error bel

Re: Solr 1.4.1 and Tika 0.9 - some tests not passing

2011-03-28 Thread Andreas Kemkes
t;/a>. (Tika 0.9) This has a <a href="http://www.apache.org">link</a>. (Tika 0.4) ____ From: Andreas Kemkes To: solr-user@lucene.apache.org Sent: Tue, March 22, 2011 10:30:57 AM Subject: Solr 1.4.1 and Tika 0.9 - some tests not passing Due

Solr 1.4.1 and Tika 0.9 - some tests not passing

2011-03-22 Thread Andreas Kemkes
Due to some PDF indexing issues with the Solr 1.4.1 distribution, we would like to upgrade it to Tika 0.9, as the issues are not occurring in Tika 0.9. With the changes we made to Solr 1.4.1, we can successfully index the previously failing PDF documents. Unfortunately we cannot get the HTML-r

Re: Omit hour-min-sec in search?

2011-03-06 Thread Andreas Kemkes
How about [YYYY-MM-DDThh:mm:ssZ/DAY TO YYYY-MM-DDThh:mm:ssZ+1DAY/DAY]? See DateField.html in your Solr API documentation for more. Andreas From: Jan Høydahl To: solr-user@lucene.apache.org Sent: Sun, March 6, 2011 1:40:59 PM Subject: Re: Omit hour-min-sec in
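
The range suggested above can be assembled on the client before it is sent to Solr. This is a minimal sketch in Python, not code from the thread: the helper name `day_range_filter` and the field name `last_modified` are assumptions chosen for illustration. It only formats the string; the `/DAY` rounding is left to Solr's date math.

```python
from datetime import datetime, timezone


def day_range_filter(field: str, moment: datetime) -> str:
    """Build a range clause covering the UTC calendar day that contains `moment`.

    `moment` is expected to be timezone-aware; it is converted to UTC first.
    """
    stamp = moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    # /DAY rounds down to midnight; +1DAY/DAY lands on the following midnight.
    return f"{field}:[{stamp}/DAY TO {stamp}+1DAY/DAY]"


if __name__ == "__main__":
    # Hypothetical field name and timestamp, used only for demonstration.
    when = datetime(2011, 3, 6, 13, 40, 59, tzinfo=timezone.utc)
    print(day_range_filter("last_modified", when))
```

Square brackets make both ends inclusive, so a document stamped exactly at the following midnight would also match.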

Re: More Date Math: NOW/WEEK

2011-03-05 Thread Andreas Kemkes
Thank you for the clarification. Personally, I believe it is correct for a week to start in a different month/year and it is certainly what I would expect. As you pointed out, these time units don't form a strictly ordered set (...>year>month>day>..., week>day...). Complications arise from th

Re: Limiting on dates in Solr

2011-03-03 Thread Andreas Kemkes
2011-03-03T59:59:99.999Z - shouldn't that be 2011-03-03T23:59:59.999Z? From: Steve Lewis To: solr-user@lucene.apache.org Sent: Thu, March 3, 2011 11:21:53 AM Subject: Limiting on dates in Solr I am treating Solr as a NoSQL db that has great search capabilities.
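
One way to avoid this kind of slip is to generate the end-of-day timestamp instead of typing it. This is a rough Python sketch; the function names `end_of_day_utc` and `is_valid_timestamp` are hypothetical and not part of Solr.

```python
from datetime import date, datetime, time, timezone


def end_of_day_utc(day: date) -> str:
    """Return the last millisecond of `day` as an ISO 8601 UTC timestamp."""
    last = datetime.combine(day, time(23, 59, 59, 999000), tzinfo=timezone.utc)
    millis = last.microsecond // 1000
    return last.strftime("%Y-%m-%dT%H:%M:%S.") + f"{millis:03d}Z"


def is_valid_timestamp(text: str) -> bool:
    """Check that `text` parses as YYYY-MM-DDThh:mm:ss.fffZ with in-range fields."""
    try:
        datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%fZ")
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    print(end_of_day_utc(date(2011, 3, 3)))
    print(is_valid_timestamp("2011-03-03T59:59:99.999Z"))
```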

Re: Tika metadata extracted per supported document format?

2011-02-28 Thread Andreas Kemkes
Chris: Yes, I only see the output below. I'm familiar with the information in http://wiki.apache.org/solr/ExtractingRequestHandler, except for the tika.config part, which I haven't touched. Even when running documents through Tika directly, the output of metadata is highly dependent on what

Re: Tika metadata extracted per supported document format?

2011-02-25 Thread Andreas Kemkes
And get a print out of the met keys that Tika supports. Some parsers add their own that aren't part of this met listing, but this is a relatively comprehensive list. Cheers, Chris On Feb 25, 2011, at 12:10 PM, Andreas Kemkes wrote: > Hello, > > I've asked this on the Ti

Re: upgrading to Tika 0.9 on Solr 1.4.1

2011-02-25 Thread Andreas Kemkes
According to the Tika release notes, it's fixed in 0.9. Haven't tried it myself. A critical backwards incompatible bug in PDF parsing that was introduced in Tika 0.8 has been fixed. (TIKA-548) Andreas From: Darx Oman To: solr-user@lucene.apache.org Sent: F

Re: Tika metadata extracted per supported document format?

2011-02-25 Thread Andreas Kemkes
but this is a relatively comprehensive list. Cheers, Chris On Feb 25, 2011, at 12:10 PM, Andreas Kemkes wrote: > Hello, > > I've asked this on the Tika mailing list w/o an answer, so apologies for > cross-posting. > > I'm trying to find information that tells me s

Tika metadata extracted per supported document format?

2011-02-25 Thread Andreas Kemkes
Hello, I've asked this on the Tika mailing list w/o an answer, so apologies for cross-posting. I'm trying to find information that tells me specifically what metadata is provided for the different supported document formats. Unfortunately all I was able to find so far is "The Metadata produce

More Date Math: NOW/WEEK

2011-02-23 Thread Andreas Kemkes
Date Math is great. NOW/MONTH, NOW/DAY are all working and very useful, so naively I tried NOW/WEEK, which failed. Digging into the source code of DateMathParser.java, I found the following comment: // NOTE: consciously choosing not to support WEEK at this time, // becau
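
Until WEEK is supported, a possible workaround is to compute the start of the week on the client and send an absolute timestamp. The Python sketch below assumes that weeks start on Monday and that UTC is the relevant time zone; neither assumption comes from the thread, and `week_start_utc` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone


def week_start_utc(moment: datetime) -> str:
    """Return midnight (UTC) of the Monday on or before `moment`.

    Assumes a timezone-aware `moment` and a Monday-based week.
    """
    utc = moment.astimezone(timezone.utc)
    monday = utc - timedelta(days=utc.weekday())  # weekday(): Monday == 0
    monday = monday.replace(hour=0, minute=0, second=0, microsecond=0)
    return monday.strftime("%Y-%m-%dT%H:%M:%SZ")


if __name__ == "__main__":
    # A Thursday early in March: the week began in February.
    print(week_start_utc(datetime(2011, 3, 3, 11, 21, 53, tzinfo=timezone.utc)))
```

The example date shows a week that begins in the previous month, which is the kind of boundary case that makes WEEK awkward to combine with MONTH and YEAR rounding.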

Re: Date Math

2011-02-23 Thread Andreas Kemkes
Thank you, that clarifies it. Good catch on "-DAY". I had noticed it after submitting but as "-1DAY" causes the same ParseException, I didn't amend the question. Andreas From: Chris Hostetter To: solr-user@lucene.apache.org Sent: Tue, February 22, 2011 6:18

Re: Index Design Question

2011-02-18 Thread Andreas Kemkes
Thank you. These are good general suggestions. Regarding the optimization for indexing vs. querying: are there any specific recommendations for each of those cases available somewhere? A link, for example, would be fabulous. I'm also still curious about solutions that go further. For example,

Index Design Question

2011-02-17 Thread Andreas Kemkes
We are indexing documents with several associated fields for search and display, some of which may change with a much higher frequency than the document content. As per my understanding, we have to resubmit the entire gamut of fields with every update. If the reindexing of the documents beco

Date Math

2011-02-17 Thread Andreas Kemkes
The SolrQuerySyntax Wiki page refers to DateMathParser for examples. When I tried "-1DAY", I got: org.apache.lucene.queryParser.ParseException: Cannot parse 'last_modified:-DAY': Encountered " "-" "- "" at line 1, column 14. Was expecting one of: "(" ... "*" ... ... ...

Controlling Tika's metadata

2011-01-28 Thread Andreas Kemkes
Just getting my feet wet with the text extraction using both schema and solrconfig settings from the example directory in the 1.4 distribution, so I might be missing something obvious. Trying to provide my own title (and discarding the one received through Tika's metadata) wasn't straightforward. I h