>1) the toughest pdfs to identify are those that are partly
searchable (text) and partly not (image-based text). However, I've
found that such documents tend to exist in clusters.
Agreed. We should do something better in Tika to identify image-only pages on
a page-by-page basis, and
To be Waldorf to Erick's Statler (if I may), lots of things can go wrong during
content extraction.[1] I had two big concerns when I heard of your task:
1) image-only PDFs, which can parse without a problem but might yield 0
content.
2) emails (see, e.g. SOLR-12048)
It sounds like yo
+1 to Charlie's guidance.
And...
>60,000 documents, mostly pdfs and emails.
> However, there's a premium on precision (and recall) in searches.
Please, oh, please, no matter what you're using for content/text extraction
and/or OCR, run tika-eval[1] on the output to ensure that you are gett
There's also, of course, tika-server. 😊
No matter the method, it is always best to isolate Tika to its own JVM, VM or machine.
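For anyone going the tika-server route, the usual pattern looks roughly like
this (the version number is a placeholder; 9998 is tika-server's default port):

    java -jar tika-server-1.15.jar
    # then, from the indexing client:
    curl -T somedoc.pdf http://localhost:9998/tika --header "Accept: text/plain"
    curl -T somedoc.pdf http://localhost:9998/meta --header "Accept: application/json"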
-Original Message-
From: Charlie Hull [mailto:char...@flax.co.uk]
Sent: Monday, April 9, 2018 4:15 PM
To: solr-user@lucene.apache.org
Subject: Re: How to use Tika (Sol
+1
https://lucidworks.com/2012/02/14/indexing-with-solrj/
We should add a chatbot to the list that includes Charlie's advice and the link
to Erick's blog post whenever Tika is used. 😊
-Original Message-
From: Charlie Hull [mailto:char...@flax.co.uk]
Sent: Monday, April 9, 2018 12:44 P
For a simple illustration of Charlie's point and a side bonus on the 78 reasons
to use the ICUFoldingFilter if you happen to be processing Arabic script
languages, see slides 31-33:
https://github.com/tballison/share/blob/master/slides/TextProcessingAndAdvancedSearch_tallison_MITRE_201510_final_
Nice. Thank you!
-Original Message-
From: Emir Arnautović [mailto:emir.arnauto...@sematext.com]
Sent: Thursday, February 15, 2018 2:19 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr search word NOT followed by another word
Hi,
I did not provide the right query. If you query as {!c
I just updated the SpanQueryParser (LUCENE-5205) and its Solr plugin
(SOLR-5410) for master and 7.2.1.
What version of Solr are you using and which version of the plugin?
These should be available on maven central shortly: version 7.2-0.1
<dependency>
  <groupId>org.tallison.solr</groupId>
  <artifactId>solr-5410</artifactId>
  <version>7.2-0.1</version>
</dependency>
Or you
I've been away from the ComplexPhraseQueryParser for a while, and I was wrong
when I said in my earlier email that no currently included Solr parser
generates a SpanNotQuery.
You're right, Emir, that the ComplexPhraseQueryParser does generate a
SpanNotQuery, and, yes, I just tried this with 7.2.1, and it re
41 AM, Allison, Timothy B.
wrote:
> That requires a SpanNotQuery. AFAIK, there is no way to do this with
> the current parsers included in Solr.
>
> My SpanQueryParser does cover this, and I'm hoping to port it to 7.x
> today or tomorrow.
>
> Syntax would be "
That requires a SpanNotQuery. AFAIK, there is no way to do this with the
current parsers included in Solr.
My SpanQueryParser does cover this, and I'm hoping to port it to 7.x today or
tomorrow.
Syntax would be "Leonardo [da vinci]"!~0,1
https://issues.apache.org/jira/browse/LUCENE-5205
http
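At the raw Lucene level, the equivalent query is roughly the following (a
sketch; the field name "text" is illustrative):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.*;

    SpanQuery leonardo = new SpanTermQuery(new Term("text", "leonardo"));
    SpanQuery daVinci = new SpanNearQuery(new SpanQuery[]{
        new SpanTermQuery(new Term("text", "da")),
        new SpanTermQuery(new Term("text", "vinci"))
    }, 0, true);
    // match "leonardo" unless "da vinci" starts within 0 tokens before
    // or ends within 1 token after the match
    SpanQuery query = new SpanNotQuery(leonardo, daVinci, 0, 1);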
I can't recommend Doug Turnbull and John Berryman's "Relevant Search" enough on
how to layer fields...among many other great insights:
https://www.manning.com/books/relevant-search
-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Thursday, November 30, 2017 9:20 AM
To:
What do you suggest to use for stemming instead of "Porter"? I guess it
wasn't chosen intentionally.
In the best we trust
Georgy Nevsky
-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Thursday, November 30, 2017 8:25 AM
To: solr-user@lucene.apa
The initial question wasn't about a phrasal search, but I largely agree that
different query parsers handle the analysis chain differently for multiterms.
Yes, Porter is crazily aggressive. USE WITH CAUTION!
As has been pointed out, use the Solr admin window and the "debug" in the query
option to see
1']"
);
Notice how cr\u00E6zy* is used as a query term which mimics the behaviour I
originally reported, namely that CPQP does not analyse it because of the
wildcard and thus does not hit the charfilter from the query side.
2017-10-06 20:54 GMT+02:00 Allison, Timothy B. :
> Th
That could be it. I'm not able to reproduce this with trunk. More next week.
In trunk, if I add this to schema15.xml:
This test passes.
@Test
public void testCharFilter() {
    assertU(adoc("iso-latin1", "cr\u00E6zy tr\u00E6n", "id", "1"));
    assertU(commit());
ses, but the regular multiterms
should be ok.
Still no answer for you...
2017-10-05 14:34 GMT+02:00 Allison, Timothy B. :
> There's every chance that I'm missing something at the Solr level, but
> it _looks_ at the Lucene level, like ComplexPhraseQueryParser is still
> not ap
e certain of it :-)
Do you remember any reason that multi term analysis is not happening in
ComplexPhraseQueryParser?
I'm on 6.6.1, so latest on the 6.x branch.
2017-10-05 14:34 GMT+02:00 Allison, Timothy B. :
> There's every chance that I'm missing something at the Solr level
lob/master/lucene-5205/src/test/java/org/apache/lucene/queryparser/spans/TestAdvancedAnalyzers.java#L117
-----Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Thursday, October 5, 2017 8:02 AM
To: solr-user@lucene.apache.org
Subject: RE: Complexphrase treats wildca
What version of Solr are you using?
I thought this had been fixed fairly recently, but I can't quickly find the
JIRA. Let me take a look.
Best,
Tim
This was one of my initial reasons for my SpanQueryParser LUCENE-5205[1] and
[2], which handles analysis of multiterms even in phra
https://wiki.apache.org/solr/DataImportHandlerFaq#I.27m_using_DataImportHandler_with_a_MySQL_database._My_table_is_huge_and_DataImportHandler_is_going_out_of_memory._Why_does_DataImportHandler_bring_everything_to_memory.3F
-Original Message-
From: Deeksha Sharma [mailto:dsha...@flexera.co
bq: How do I get a list of all valid field names based on the file type
bq: You don't. At least I've never found any. Plus various document formats
will allow custom meta-data fields so there's no definitive list.
It would be trivial to add field counts per MIME type to tika-eval. If you're
interes
Solrians,
We have a request to drop phonetic strings from xlsx as the default in Tika.
I'm not familiar enough with Japanese to know if users would generally expect
to be able to search on these as well as the original. The current practice is
to include them.
Any recommendations? Thank you.
+1
I was hoping to use this as a case for arguing for turning off an overly
aggressive stemmer, but I checked on your 10 docs and query, and David is
right, of course -- if you change the default operator to AND, you only get the
one document back that you had intended to.
I can still use this
>4. Write an external program that fetches the file, fetches the metadata,
>combines them, and send them to Solr.
I've done this with some custom crawls. Thanks to Erick Erickson, this is a
snap:
https://lucidworks.com/2012/02/14/indexing-with-solrj/
With the caveat that Tika should really be i
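A minimal sketch of that SolrJ + Tika pattern (class and field names here are
illustrative, not from the blog post; assumes solr-solrj and the Tika parsers
are on the classpath):

    import java.io.InputStream;
    import java.nio.file.*;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class TikaSolrJIndexer {
        public static void main(String[] args) throws Exception {
            try (SolrClient solr =
                    new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
                AutoDetectParser parser = new AutoDetectParser();
                for (String arg : args) {
                    Path p = Paths.get(arg);
                    BodyContentHandler handler = new BodyContentHandler(-1); // -1: no write limit
                    Metadata metadata = new Metadata();
                    try (InputStream is = Files.newInputStream(p)) {
                        parser.parse(is, handler, metadata); // text + metadata in one pass
                    }
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", p.toAbsolutePath().toString());
                    doc.addField("content", handler.toString());
                    solr.add(doc);
                }
                solr.commit();
            }
        }
    }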
Solr index changes to
http://localhost:80/solr/v20170703xxx/update...
Time spent: 0:00:00.350
On Mon, Jun 5, 2017 at 7:41 PM, Allison, Timothy B.
wrote:
> https://issues.apache.org/jira/browse/SOLR-10335 is tracking the
> upgrade in Solr to Tika 1.15. Please chime in on that issue.
>
>
>http - however, the big advantage of doing your indexing on different machine
>is that the heavy lifting that tika does in extracting text from documents,
>finding metadata etc is not happening on the server. If the indexer crashes,
>it doesn’t affect Solr either.
+1
for what can go wrong:
> So, if you are trying to make sure your index breaks words properly on
> eastern languages, just use ICU Tokenizer.
I defer to the expertise on this list, but last I checked ICUTokenizer uses
dictionary lookup to tokenize CJK. This may work well for some tasks, but I
haven't evaluated whe
Yeah, Chris knows a thing or two about Tika. :)
-Original Message-
From: ZiYuan [mailto:ziyu...@gmail.com]
Sent: Tuesday, June 20, 2017 8:00 AM
To: solr-user@lucene.apache.org
Subject: Re: Indexing PDF files with Solr 6.6 while allowing highlighting
matched text with context
No intenti
> There is no standard across different types of docs as to what meta-data
> field is
>> included. PDF might have a "last_edited" field. Word might have a
>> "last_modified" field where the two mean the same thing.
On Tika, we _try_ to normalize fields according to various standards, the most
AM
To: solr-user@lucene.apache.org
Subject: RE: Solr 6.4. Can't index MS Visio vsdx files
Great, Tim.
What do I need to do to integrate it on my current installation?
On May 31, 2017 16:24, "Allison, Timothy B." wrote:
Apache Tika 1.15 is now available.
-Original Message
Apache Tika version 1.15 now handles XLSB files. The behavior described below
is the expected behavior if a file type is identified but there is no parser to
handle that file type.
A little late to the game, I admit... :)
Cheers,
Tim
From: Roland Everaert
Subject: Re: XLSB files
Apache Tika 1.15 is now available.
-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Tuesday, May 9, 2017 7:45 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr 6.4. Can't index MS Visio vsdx files
Probably better to ask on the Tika list. We'll
or download
somewhere
G.
On Wed, Apr 12, 2017 at 6:57 PM, Allison, Timothy B.
wrote:
> The release candidate for POI was just cut...unfortunately, I think
> after Nick Burch fixed the 'PolylineTo' issue...thank you, btw, for opening
> that!
>
> That'll be done
If you don't care about sentence boundaries, but just want a window around
target terms and you want concordance functionality (sort before, after, etc),
you might check out LUCENE-5317, which is available as a standalone jar on my
github site [1] and is available through maven central.
Using a
test
it :)
SAX sounds interesting, and from info that I found on Google it could solve my
issues.
On Tue, Apr 11, 2017 at 10:48 PM, Allison, Timothy B.
wrote:
> It depends. We've been trying to make parsers more, erm, flexible,
> but there are some problems from which we c
sponses.
Are there any possibilities to ignore parsing errors and continue indexing?
Because now Solr/Tika stops parsing the whole document if it finds any exception.
On Apr 11, 2017 19:51, "Allison, Timothy B." wrote:
> You might want to drop a note to the dev or user's list on Apache PO
You might want to drop a note to the dev or user's list on Apache POI.
I'm not extremely familiar with the vsd(x) portion of our code base.
The first item ("PolylineTo") may be caused by a mismatch between your doc and
the OOXML spec.
The second item appears to be an unsupported feature.
The thir
Please open an issue on Tika's JIRA and share the triggering file if possible.
If we can touch the file, we may be able to recommend alternate ways to
configure Tika's encoding detectors. We just added configurability to the
encoding detectors and that will be available with Tika 1.15. [1]
We
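From memory, the Tika 1.15 configuration looks roughly like this in
tika-config.xml (element names as I recall them; check the Tika wiki for the
exact schema):

    <properties>
      <encodingDetectors>
        <encodingDetector class="org.apache.tika.parser.txt.UniversalEncodingDetector"/>
        <encodingDetector class="org.apache.tika.parser.txt.Icu4jEncodingDetector"/>
      </encodingDetectors>
    </properties>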
> Also we will try to decouple Tika from Solr.
+1
-Original Message-
From: tstusr [mailto:ulfrhe...@gmail.com]
Sent: Friday, March 31, 2017 4:31 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr performance issue on indexing
Hi, thanks for the feedback.
Yes, it is about OOM, indeed e
> Note that the OCRing is a separate task from Solr indexing, and is best done
> on separate machines.
+1
-Original Message-
From: Rick Leir [mailto:rl...@leirtech.com]
Sent: Thursday, March 30, 2017 7:37 AM
To: solr-user@lucene.apache.org
Subject: Re: Indexing speed reduced significant
]
Sent: Monday, March 27, 2017 11:48 AM
To: solr-user@lucene.apache.org
Subject: Re: Index scanned documents
I tried this solution from Tim Allison, and it works.
http://stackoverflow.com/questions/32354209/apache-tika-extract-scanned-pdf-files
Regards,
Edwin
On 27 March 2017 at 20:07, A
Please also see:
https://wiki.apache.org/tika/TikaOCR
and
https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29#OCR
If you have any other questions about Apache Tika and OCR, please feel free to
ask on our users list as well: u...@tika.apache.org
Cheers,
Tim
-Origin
All,
I finally got around to documenting Apache Tika's MockParser[1]. As of Tika
1.15 (unreleased), add tika-core-tests.jar to your class path, and you can
simulate:
1. Regular catchable exceptions
2. OOMs
3. Permanent hangs
This will allow you to determine if your ingest framework is robust
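One pattern for surviving the "permanent hang" case without a separate JVM (a
sketch, not from the original post; OOMs and truly unkillable threads still
argue for a child process):

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.concurrent.*;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class BoundedParse {
        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newSingleThreadExecutor();
            // run the parse in its own thread so it can be bounded with a timeout
            Future<String> future = pool.submit(() -> {
                try (InputStream is = Files.newInputStream(Paths.get(args[0]))) {
                    BodyContentHandler handler = new BodyContentHandler(-1);
                    new AutoDetectParser().parse(is, handler, new Metadata());
                    return handler.toString();
                }
            });
            try {
                System.out.println(future.get(60, TimeUnit.SECONDS));
            } catch (TimeoutException e) {
                future.cancel(true); // best effort; a hung parser thread may never die
            }
            pool.shutdownNow();
        }
    }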
>It is *strongly* recommended to *not* use the Tika that's embedded within
>Solr, but instead to do the processing outside of Solr in a program of your
>own and index the results.
+1
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201601.mbox/%3CBY2PR09MB11210EDFCFA297528940B07C
ml-schemas-1.3.jar instead of poi-ooxml-schemas-3.15.jar 2.
curvesapi-1.03.jar
So now I'm waiting for this to be implemented in an official version of
Solr/Tika.
Regards,
Gytis
On Mon, Feb 6, 2017 at 4:16 PM, Allison, Timothy B.
wrote:
> Argh. Looks like we need to add curvesapi
Argh. Looks like we need to add curvesapi (BSD 3-clause) to Solr.
For now, add this jar:
https://mvnrepository.com/artifact/com.github.virtuald/curvesapi/1.03
See also [1]
[1]
http://apache-poi.1045710.n5.nabble.com/support-for-reading-Microsoft-Visio-2013-vsdx-format-td5721500.html
-Ori
d to
go. [3]"
as Tika is failing, could it help or not?
Gytis
On Fri, Feb 3, 2017 at 10:31 PM, Allison, Timothy B.
wrote:
> This is a Tika/POI problem. Please download tika-app 1.14 [1] or a
> nightly version of Tika [2] and run
>
> java -jar tika-app.jar
>
> If th
This is a Tika/POI problem. Please download tika-app 1.14 [1] or a nightly
version of Tika [2] and run
java -jar tika-app.jar
If the problem is fixed, we'll try to upgrade dependencies in Solr. If it
isn't fixed, please open a bug on Tika's Jira.
If this is a missing bean issue (sorry, I c
This came up back in September [1] and [2]. Same trigger...crazy number of
divs.
I think we could modify the AutoDetectParser to enable configuration of maximum
zip-bomb depth via tika-config.
If there's any interest in this, re-open TIKA-2091, and I'll take a look.
Best,
Tim
> I don't see any weird character when I manual copy it to any text editor.
That's a good diagnostic step, but there's a chance that Adobe (or your viewer)
got it right, and Tika or PDFBox isn't getting it right.
If you run tika-app on the file [0], do you get the same problem? See our stub
on
You've gotten far better answers on this already, but you can use the
SpanNotQuery in the SpanQueryParser I maintain and have published to maven
central [1][2][3].
This does not carry out any nlp, but this would allow literal "headache (no
not)"!~5,0 -> "headache" but not if "no" or "not" shows
All,
I recently blogged about some of the work we're doing with a large scale
regression corpus to make Tika, POI and PDFBox more robust and to identify
regressions before release. If you'd like to chip in with recommendations,
requests or Hadoop/Spark clusters (why not shoot for the stars), p
This doesn't answer your question, but Erick Erickson's blog on this topic is
invaluable:
https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
-Original Message-
From: Vasu Y [mailto:vya...@gmail.com]
Sent: Monday, October 3, 2016
>> 11133_f6ef-eutelsat.htm
>>
>> I'll try to create a ticket for this on Jira if I find its location
>> but feel free to open it yourself if you prefer, just let me know.
>>
>> On 22-09-2016 12:33, Allison, Timothy B. wrote:
>>>>
>>>
va API and examples for SolrJ and Tika to
>>> achieve that...
>>>
>>> Just wanted to confirm. I'll try to get a sample HTML triggering
>>> this problem and attach it to Jira.
>>>
>>> Thanks,
>>> Rodrigo.
>>>
>>> On 22-09-
> I'll try to get a sample HTML triggering this problem and attach it to Jira.
Great! Tika 1.14 is around the corner...if this is an easy fix ... :)
Thank you.
Yes, looks like Nick (gagravarr) has answered on SO -- can't do it in Tika
currently.
-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Thursday, September 22, 2016 10:42 AM
To: solr-user@lucene.apache.org
Cc: 'u...@tika.apache.org'
Subject: RE
I don't think that's configurable at the moment.
Tika-colleagues, any recommendations?
If you're able to share the file on Tika's jira, we'd be happy to take a look.
You shouldn't be getting the zip bomb unless there is a mismatch between
opening and closing tags (which could point to a bug
ICU normalization (ICUFoldingFilterFactory) will at least handle "ß" -> "ss"
(IIRC) and some other language-general variants that might get you close.
There are, of course, language specific analyzers
(https://wiki.apache.org/solr/LanguageAnalysis#German) , but I don't think
they'll get you Fo
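A minimal fieldType sketch with that filter (assumes the analysis-extras
contrib jars are on the classpath; names are illustrative):

    <fieldType name="text_folded" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.ICUFoldingFilterFactory"/>
      </analyzer>
    </fieldType>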
+1 to langdetect
In Tika 2.0, we're going to remove our own language detection code and allow
users to select Optimaize (fork of langdetect), MIT Lincoln Lab’s Text.jl
library or Yalder (https://github.com/kkrugler/yalder). The first two are now
available in Tika 1.13.
-Original Message--
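For the curious, using the Optimaize detector through Tika's detection API
looks roughly like this (a sketch against the Tika 1.13 tika-langdetect
module; method names from memory):

    import org.apache.tika.langdetect.OptimaizeLangDetector;
    import org.apache.tika.language.detect.LanguageDetector;
    import org.apache.tika.language.detect.LanguageResult;

    LanguageDetector detector = new OptimaizeLangDetector().loadModels();
    LanguageResult result = detector.detect("il s'agit d'un texte en français");
    System.out.println(result.getLanguage());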
Not that I need any other book beyond this one... but I didn't realize that the
50% discount code applies to all books in the order. :)
Congratulations, Doug and John!
-Original Message-
From: Doug Turnbull [mailto:dturnb...@opensourceconnections.com]
Sent: Tuesday, June 21, 2016 2:12 P
>Awesome, 0 pre and 1 post works!
Great!
> What if I wanted to match thirty, but exclude if six or seven are included
> anywhere in the document?
Any time you need "anywhere in the document", use a "regular" query (not
SpanQuery). As you wrote initially, you can construct a BooleanQuery that
>Perhaps I'm misunderstanding the pre/post parameters?
Pre/post parameters: " 'six' or 'seven' should not appear $pre tokens before
'thirty' or $post tokens after 'thirty'
Maybe something like this:
spanNear([
spanNear([field:one, field:thousand, field:one, field:hundred], 0, true),
spanNot(
> dtSearch allows a user to have NOTs embedded in proximity searches.
And, if you're heading down the path of building your own queryparser to handle
dtSearch's syntax, please read and heed Charlie Hull's post:
http://www.flax.co.uk/blog/2016/05/13/old-new-query-parser/
See also:
http://www.fl
From: Brandon Miller [mailto:computerengineer.bran...@gmail.com]
Sent: Monday, June 20, 2016 4:12 PM
To: Allison, Timothy B. ; solr-user@lucene.apache.org
Subject: Re: SpanQuery - How to wrap a NOT subquery
Thank you, Timothy.
I have support for and am using SpanNotQuery elsewhere. Maybe there is
I was just looking at SolrCellBuilder, and it looks like there's an assumption
that documents will not have attachments/embedded objects. Unless I
misunderstand the code, users will not be able to search documents inside zips,
or attachments in msg/doc/pdf/etc (cf. SOLR-7189).
Are embedded do
>Two things: Here's a sample bit of SolrJ code, pulling out the DB stuff should
>be straightforward:
http://searchhub.org/2012/02/14/indexing-with-solrj/
+1
> We tend to prefer running Tika externally as it's entirely possible
> that Tika will crash or hang with certain files - and that will
uence/display/solr/The+Standard+Query+Parser#TheStandardQueryParser-DifferencesbetweenLuceneQueryParserandtheSolrStandardQueryParser
Regards,
Alex.
Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/
On 3 June 2016 at 23:23, Allison, Timothy B. wrote:
> All,
> This is a toy example, b
All,
This is a toy example, but is there a way to search for, say, stores with
sales of > $x in the last 2 months with Solr?
$x and the time frame are selected by the user at query time.
If the queries could be constrained (this is still tbd), I could see updating
"stats" fields within eac
ext/css" charset="utf-8"
>> media="screen" href="/wiki/modernized/css/screen.css"/>
>> <link rel="stylesheet" type="text/css" charset="utf-8"
>> media="print" href="
Of course, for greater control over indexing (and for more robust handling of
exceedingly rare (but real) infinite loops/OOM caused by Tika), consider SolrJ:
http://searchhub.org/2012/02/14/indexing-with-solrj/
-Original Message-
From: Simon Blandford [mailto:simon.blandf...@bkconnect.ne
I'm only minimally familiar with Solr Cell, but...
1) It looks like you aren't setting extractFormat=text. According to [0]...the
default is xhtml which will include a bunch of the metadata.
2) is there an attr_* dynamic field in your index with type="ignored"? This
would strip out the attr_ f
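For reference, the stock example schema's "ignored" type looks like this (a
sketch; check the schema that ships with your version), and pointing attr_* at
it drops those fields:

    <fieldType name="ignored" class="solr.StrField" indexed="false" stored="false" multiValued="true"/>
    <dynamicField name="attr_*" type="ignored" multiValued="true"/>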
>...and I've just blogged about some of the issues one can run into with this
>sort of project, hope this is useful!
http://www.flax.co.uk/blog/2016/05/13/old-new-query-parser/
+1. Completely non-trivial task to roll your own.
I'd add that incorporating multiterm analysis (analysis/normalization
Depending on your needs, you might want to take a look at my SpanQueryParser
(LUCENE-5205/SOLR-5410). It does not offer dtSearch syntax, but if the
SurroundQueryParser was close enough, this parser may be of use. If you need
modifications to it, let me know. I'm in the process of adding
Span
If I understand the question correctly...
I'm assuming you are indexing rich documents (PDF/DOC/MSG, etc) with DIH's Tika
handler. Some of those documents have attachments.
If that's the case, all of the content of embedded docs _should_[0] be
extracted, but then all of that content across the
Yes, integrating Tika is non-trivial. I think Uwe adds the dependencies by
hand, carefully checking the dependency tree in Maven and making sure there
weren't any conflicts.
-Original Message-
From: Shawn Heisey [mailto:apa...@elyograg.org]
Sent: Wednesday, May 4, 20
Yes. Solr 6.0.0 is shipping with Tika 1.7. Grobid came in with Tika 1.11.
-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Wednesday, May 4, 2016 10:29 AM
To: solr-user@lucene.apache.org
Subject: RE: Integrating grobid with Tika in solr
I think Solr is using
I think Solr is using a version of Tika that predates the addition of the
Grobid parser. You'll have to add that manually somehow until Solr upgrades to
Tika 1.13 (soon to be released...I think). SOLR-8981.
-Original Message-
From: Betsey Benagh [mailto:betsey.ben...@stresearch.com]
> I can tell you that Tika is quite the resource hog. It is likely chewing up
> CPU and memory
> resources at an incredible rate, slowing down your Solr server. You
> would probably see better performance than ERH if you incorporate Tika
> and SolrJ into a client indexing program that runs o
> If you're going to use Tika for production indexing, you should write
> a Java program using SolrJ and Tika so that you are in complete
> control, and so Solr isn't unstable.
+1
https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201601.mbox/%3cby2pr09mb11210edfcfa297528940b07c7...@by
55 AM, Allison, Timothy B.
wrote:
> Should have looked at how we handle PSTs before my earlier responses...sorry.
>
> What you're seeing is Tika's default treatment of embedded documents,
> it concatenates them all into one string. It'll do the same thing for
> zip fi
ut
>> > what's going on with the offending document(s). Or record the name
>> > somewhere and skip it next time 'round. Or
>> >
>> > How much you have to build in here really depends on your use case.
>> > For "small enough"
Should have looked at how we handle PSTs before my earlier responses...sorry.
What you're seeing is Tika's default treatment of embedded documents, it
concatenates them all into one string. It'll do the same thing for zip files
and other container files. The default Tika format is xhtml, and we i
control the document corpus,
> you have to build something far more tolerant as per Tim's comments.
>
> FWIW,
> Erick
>
> On Wed, Feb 10, 2016 at 4:27 AM, Allison, Timothy B.
>
> wrote:
> > I completely agree on the impulse, and for the vast majority of the
>
Yes, this looks like a Tika feature. If you run the tika-app.jar [1] on your file
and you get the same output, then that's Tika's doing.
Drop a note on the u...@tika.apache.org list if Tika isn't meeting your needs.
-Original Message-
From: Sreenivasa Kallu [mailto:sreenivasaka...@gmail.co
Ha. Spoke too soon about this thread not getting swamped.
Will add the dropwizard-tika-server to our wiki page. Thank you for the link!
As a side note, I'll submit a pull request to update the AbstractTikaResource
to avoid a potential NPE if the mime type can't be parsed...we just fixed this
just catch any exceptions
in my code and "do the right thing". I'm not sure I see any real benefit in yet
another JVM.
FWIW,
Erick
On Tue, Feb 9, 2016 at 6:22 PM, Allison, Timothy B. wrote:
> I have one answer here [0], but I'd be interested to hear what Solr
> user
I have one answer here [0], but I'd be interested to hear what Solr
users/devs/integrators have experienced on this topic.
[0]
http://mail-archives.apache.org/mod_mbox/tika-user/201602.mbox/%3CCY1PR09MB0795EAED947B53965BC86874C7D70%40CY1PR09MB0795.namprd09.prod.outlook.com%3E
-Original Me
you'll have to grab that and add it to your class path. :)
See also, very recently:
https://mail-archives.apache.org/mod_mbox/tika-user/201602.mbox/%3C027601d15ea8%2443ffcf90%24cbff6eb0%24%40thetaphi.de%3E
-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
S
ork. I'm trying to
> use Tika from my own crawler application that uses SojrJ to send the
> raw text to Solr for indexing.
>
> What is it that I am missing?!
>
> Steve
>
> On Tue, Feb 2, 2016 at 3:03 PM, Allison, Timothy B.
>
> wrote:
>
>> Mig
Might not have the parsers on your path within your Solr framework?
Which tika jars are on your path?
If you want the functionality of all of Tika, use the standalone tika-app.jar,
but do not use the app in the same JVM as Solr...without a custom class loader.
The Solr team carefully prunes
Three basic options:
1) one generic field that handles non-whitespace languages and normalization
robustly (downside: no language specific stopwords, stemming, etc)
2) one field per language (hope lang id works and that you don't have many
multilingual docs)
3) one Solr core per language (ditto)
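For option 2, Solr ships language-ID update processors that can route content
into per-language fields; a rough solrconfig.xml sketch (field names
illustrative):

    <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
      <str name="langid.fl">content</str>
      <str name="langid.langField">language_s</str>
      <bool name="langid.map">true</bool>
    </processor>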
Don't know what the answer from the Solr side is, but from the Tika side, I
recently failed to get TIKA-1830 into Tika 1.12...so there may be a need to
wait for Tika 1.13.
No matter the answer on when there'll be an upgrade within Solr, I strongly
encourage carving Tika into a separate JVM/serv
Might want to look into:
https://github.com/flaxsearch/luwak
or
https://github.com/OpenSextant/SolrTextTagger
-Original Message-
From: Will Moy [mailto:w...@fullfact.org]
Sent: Tuesday, January 05, 2016 11:02 AM
To: solr-user@lucene.apache.org
Subject: Many patterns against many sen
I concur with Erick and Upayavira that it is best to keep Tika in a separate
JVM...well, ideally a separate box or rack or even data center [0][1]. :)
But seriously, if you're using DIH/SolrCell, you have to configure Tika to
parse documents recursively. This was made possible in SOLR-7189...se
The other thing to check is the ComplexPhraseQueryParser, see:
https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-ComplexPhraseQueryParser
It uses the Span queries to build up the query...
Best,
Erick
On Fri, Dec 18, 2015 at 11:23 AM, Allison, Timothy B.
wrote:
> Hi Jo
Hi Johannes,
I suspect that Scott's answer would be more efficient than the following, and
I may be misunderstanding the problem!
This type of search is supported at the Lucene level by a SpanNearQuery with
inOrder set to false.
So, how do you get a SpanQuery in Solr? You might want to l
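At the Lucene level that looks roughly like this (field and terms
illustrative):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    // "foo" within 3 positions of "bar", in either order
    SpanQuery q = new SpanNearQuery(new SpanQuery[]{
        new SpanTermQuery(new Term("text", "foo")),
        new SpanTermQuery(new Term("text", "bar"))
    }, 3, false);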
Generally, I'd recommend opening an issue on PDFBox's Jira with the file that
you shared. Tika uses PDFBox...if a fix can be made there, it will propagate
back through Tika to Solr.
That said, PDFBox 2.0-RC2 extracts no text and warns: WARNING: No Unicode
mapping for CID+71 (71) in font 505Edd
Agree with all below, and don't hesitate to open a ticket on Tika's Jira and/or
POI's bugzilla...especially if you can share the triggering document.
-Original Message-
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
Sent: Thursday, November 05, 2015 6:05 PM
To: solr-user
Subjec