We are successfully extracting PDF content with Solr 3.1 and Tika 0.9.
Replace
  fontbox-1.3.1.jar jempbox-1.3.1.jar pdfbox-1.3.1.jar
  tika-core-0.8.jar tika-parsers-0.8.jar
with
  fontbox-1.4.0.jar jempbox-1.4.0.jar pdfbox-1.4.0.jar
  tika-core-0.9.jar tika-parsers-0.9.jar
I'm not entirely certa
I've unsuccessfully attempted to go down this road - there are API changes,
some of which I was able to solve by taking code snippets from Solr 3.1. Some
extraction-related tests wouldn't pass (look for 'Solr 1.4.1 and Tika 0.9 -
some tests not passing' in the archive). Ultimately, I de
I went unsuccessfully down this path - too many incompatibilities among
versions - some code changes and recompiling required. See also thread
"Solr 1.4.1 and Tika 0.9 - some tests not passing" for remaining issues.
You'll have better luck with the newer Solr 3.1 release, which already uses T
Thank you. That is valuable guidance.
In light of the recent release of Solr 3.1, I decided to first try that
distribution, as it already uses Tika 0.8, which is much closer to my target.
Out of the box (i.e., w/o replacing the Tika and PDFBox libraries) the tests
pass, yet I see the error bel
t;/a>. (Tika 0.9)
This has a <a href="http://www.apache.org">link</a>. (Tika 0.4)
____
From: Andreas Kemkes
To: solr-user@lucene.apache.org
Sent: Tue, March 22, 2011 10:30:57 AM
Subject: Solr 1.4.1 and Tika 0.9 - some tests not passing
Due to some PDF indexing issues with the Solr 1.4.1 distribution, we would like
to upgrade it to Tika 0.9, as the issues are not occurring in Tika 0.9.
With the changes we made to Solr 1.4.1, we can successfully index the
previously failing PDF documents.
Unfortunately we cannot get the HTML-r
How about [YYYY-MM-DDThh:mm:ssZ/DAY TO YYYY-MM-DDThh:mm:ssZ+1DAY/DAY]? See
DateField.html in your Solr API documentation for more.
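A minimal sketch of building such a one-day range filter on the client side. The helper name day_range_filter and the field name last_modified are assumptions for illustration only; the rounding syntax (/DAY, +1DAY) is the date math from the suggestion above.

```python
from datetime import date


def day_range_filter(field: str, day: date) -> str:
    """Build a range query covering one calendar day (UTC).

    Both endpoints are rounded down with /DAY, so any time of day
    could be used in the timestamp; midnight is used for clarity.
    """
    stamp = f"{day.isoformat()}T00:00:00Z"
    return f"{field}:[{stamp}/DAY TO {stamp}+1DAY/DAY]"


if __name__ == "__main__":
    # Hypothetical field name; adjust to your schema.
    print(day_range_filter("last_modified", date(2011, 3, 3)))
```

Note that square brackets make both ends inclusive, so midnight of the following day is matched as well, and the "+" must be URL-encoded when the query is sent as a request parameter.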
Andreas
From: Jan Høydahl
To: solr-user@lucene.apache.org
Sent: Sun, March 6, 2011 1:40:59 PM
Subject: Re: Omit hour-min-sec in
Thank you for the clarification.
Personally, I believe it is correct for a week to start in a different
month/year and it is certainly what I would expect. As you pointed out, these
time units don't form a strictly ordered set (...>year>month>day>...,
week>day...).
Complications arise from th
2011-03-03T59:59:99.999Z - shouldn't that be 2011-03-03T23:59:59.999Z?
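A quick way to check such a timestamp before using it in a query, sketched with Python's standard library. The function name is_valid_solr_timestamp is made up for this example, and the format string assumes the millisecond form shown above.

```python
from datetime import datetime


def is_valid_solr_timestamp(value: str) -> bool:
    """Return True if value parses as YYYY-MM-DDThh:mm:ss.fffZ."""
    try:
        datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    except ValueError:
        # Out-of-range hours, minutes or seconds end up here.
        return False
    return True


if __name__ == "__main__":
    for candidate in ("2011-03-03T59:59:99.999Z", "2011-03-03T23:59:59.999Z"):
        print(candidate, is_valid_solr_timestamp(candidate))
```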
From: Steve Lewis
To: solr-user@lucene.apache.org
Sent: Thu, March 3, 2011 11:21:53 AM
Subject: Limiting on dates in Solr
I am treating Solr as a NoSQL db that has great search capabilities.
Chris:
Yes, I only see the output below.
I'm familiar with the information in
http://wiki.apache.org/solr/ExtractingRequestHandler, except for
the tika.config part, which I haven't touched.
Even when running documents through Tika directly, the output of metadata is
highly dependent on what
And get a print out of the met keys that Tika supports. Some parsers add their
own that aren't part of this met listing, but this is a relatively
comprehensive list.
Cheers,
Chris
On Feb 25, 2011, at 12:10 PM, Andreas Kemkes wrote:
> Hello,
>
> I've asked this on the Ti
According to the Tika release notes, it's fixed in 0.9. Haven't tried it
myself.
A critical backwards incompatible bug in PDF parsing that was introduced in
Tika 0.8 has been fixed. (TIKA-548)
Andreas
From: Darx Oman
To: solr-user@lucene.apache.org
Sent: F
Hello,
I've asked this on the Tika mailing list w/o an answer, so apologies for
cross-posting.
I'm trying to find information that tells me specifically what metadata is
provided for the different supported document formats. Unfortunately all I was
able to find so far is "The Metadata produce
Date Math is great.
NOW/MONTH, NOW/DAY are all working and very useful, so naively I tried
NOW/WEEK, which failed.
Digging into the source code of DateMathParser.java, I found the following
comment:
// NOTE: consciously choosing not to support WEEK at this time,
// becau
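Since WEEK rounding isn't available, one possible workaround is to compute the start of the week on the client and pass it as an explicit date. This is only a sketch: the function name week_start is hypothetical, and it assumes weeks begin on Monday (ISO convention) and that timestamps are UTC.

```python
from datetime import datetime, timedelta, timezone


def week_start(now: datetime) -> str:
    """Return midnight (UTC) of the Monday on or before `now`.

    `now` must be timezone-aware. Monday as first day of the week
    is an assumption; shift the offset for other conventions.
    """
    today = now.astimezone(timezone.utc).date()
    monday = today - timedelta(days=today.weekday())
    return f"{monday.isoformat()}T00:00:00Z"


if __name__ == "__main__":
    sample = datetime(2011, 3, 3, 11, 21, 53, tzinfo=timezone.utc)
    print(week_start(sample))
```

For the sample date (a Thursday in March) the result falls in February, which shows that a week can begin in a different month than the date it contains.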
Thank you, that clarifies it. Good catch on "-DAY". I had noticed it after
submitting but as "-1DAY" causes the same ParseException, I didn't amend the
question.
Andreas
From: Chris Hostetter
To: solr-user@lucene.apache.org
Sent: Tue, February 22, 2011 6:18
Thank you. These are good general suggestions.
Regarding the optimization for indexing vs. querying: are there any specific
recommendations for each of those cases available somewhere? A link, for
example, would be fabulous.
I'm also still curious about solutions that go further.
For example,
We are indexing documents with several associated fields for search and
display, some of which may change with a much higher frequency than the
document content.
As per my understanding, we have to resubmit the entire gamut of fields with
every update.
If the reindexing of the documents beco
The SolrQuerySyntax Wiki page refers to DateMathParser for examples.
When I tried "-1DAY", I got:
org.apache.lucene.queryParser.ParseException: Cannot parse
'last_modified:-DAY':
Encountered " "-" "- "" at line 1, column 14.
Was expecting one of: "(" ... "*" ... ... ...
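One reading of this exception, offered as an assumption rather than something stated in this thread: the query parser treats a leading "-" as an operator, so a bare offset is not accepted as a field value, and date math is normally anchored to NOW or to a concrete date inside a range. A sketch of building such a range (the helper name modified_since is hypothetical):

```python
def modified_since(field: str, offset: str) -> str:
    """Build a range from NOW plus a date math offset up to NOW.

    `offset` is a date math fragment such as "-1DAY" or "-7DAYS".
    """
    return f"{field}:[NOW{offset} TO NOW]"


if __name__ == "__main__":
    print(modified_since("last_modified", "-1DAY"))
```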
Just getting my feet wet with the text extraction using both schema and
solrconfig settings from the example directory in the 1.4 distribution, so I
might miss something obvious.
Trying to provide my own title (and discarding the one received through Tika's
metadata) wasn't straightforward. I h