Re: Odp.: solr issue with pdf forms

Erick Erickson Thu, 30 Apr 2015 08:29:32 -0700

Jack:

I keep forgetting those things exist, thanks for the reminder!


On Thu, Apr 30, 2015 at 8:23 AM, Jack Krupansky
<jack.krupan...@gmail.com> wrote:
> Or use a Solr update processor to scrub the source values. The regex
> pattern replacement processor could do the trick:
> http://lucene.apache.org/solr/5_1_0/solr-core/org/apache/solr/update/processor/RegexReplaceProcessorFactory.html
>
> -- Jack Krupansky
>
> On Thu, Apr 30, 2015 at 11:17 AM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> OK, given all that, Tika _is_ sending the weird characters to Solr. You
>> can get them out of the index by using something like
>> PatternReplaceFilterFactory or PatternReplaceCharFilterFactory in
>> your analysis chain. However, analysis only affects the indexed tokens,
>> not the stored value, so you'll still be stuck with the odd
>> characters showing up in your browser.
>>
>> Don't know what options are configurable for Tika to not return these
>> so I'm not much help there.
>>
>> You could, however, use Tika with SolrJ to do this on the client and
>> scrub the data there. Here's an example.
>>
>> https://lucidworks.com/blog/indexing-with-solrj/
>>
>> Best,
>> Erick
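
A minimal sketch of the client-side scrubbing idea, assuming the odd characters are C0 control characters such as the backspace (shown as ^H in the pdfbox dump quoted further down). The function name scrub and the choice of a space as the replacement are illustrative assumptions, not something taken from Tika or SolrJ:

```python
import re

# C0 control characters except tab, newline and carriage return, plus DEL.
# Assumption: the noise Tika returns falls into this range (e.g. U+0008).
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def scrub(text):
    """Replace each control character with a space, then collapse runs of spaces."""
    spaced = _CONTROL_CHARS.sub(" ", text)
    return re.sub(r" {2,}", " ", spaced)


if __name__ == "__main__":
    # "\x08" stands in for the ^H seen in the extracted form text.
    print(scrub("Bitte\x08legen\x08Sie\x08dem\x08Antrag"))
```

Scrubbing before the document is sent to Solr cleans both the indexed terms and the stored value, which an analysis-chain filter alone would not do.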
>>
>> On Thu, Apr 30, 2015 at 12:03 AM,  <steve.sch...@t-systems.com> wrote:
>> > Hey, thanks a lot for the hint with pdfbox-app.jar.
>> > For testing purposes I have now extracted the text of an affected pdf form and a usual pdf
>> file.
>> > The result is the following:
>> >
>> > Usual pdf file:
>> > Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy
>> eirmod tempor invidunt ut
>> > labore et d
>> >
>> > pdf form:
>> > Bitte^Hlegen^HSie^Hdem^HAntrag Kopien aller Einkommensnachweise bei.^HDaz
>> >
>> > Best
>> > Steve
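
To pin down exactly which characters sit between the words, something like the following could be run over the extracted text. This is only a sketch: it assumes the ^H above is a literal backspace (U+0008), and find_control_chars is a hypothetical helper name.

```python
import unicodedata


def find_control_chars(text):
    """Return (index, code point) pairs for control characters in text."""
    hits = []
    for i, ch in enumerate(text):
        # Unicode category "Cc" covers control characters such as U+0008.
        # Tab, newline and carriage return are also "Cc" but are ordinary
        # whitespace in extracted text, so they are skipped.
        if unicodedata.category(ch) == "Cc" and ch not in "\t\n\r":
            hits.append((i, f"U+{ord(ch):04X}"))
    return hits


if __name__ == "__main__":
    # Assumption: "\x08" reproduces the ^H from the pdf form output.
    sample = "Bitte\x08legen\x08Sie\x08dem\x08Antrag Kopien"
    for index, code_point in find_control_chars(sample):
        print(index, code_point)
```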
>> >
>> > -----Original Message-----
>> > From: Allison, Timothy B. [mailto:talli...@mitre.org]
>> > Sent: Wednesday, April 29, 2015 14:16
>> > To: solr-user@lucene.apache.org
>> > Cc: u...@tika.apache.org
>> > Subject: RE: Odp.: solr issue with pdf forms
>> >
>> > I completely agree with Erick about the utility of the TermsComponent to
>> see what is actually being indexed.  If you find problems there and if you
>> haven't done so already, you might also investigate further down the
>> stack.  It might make sense to run the tika-app.jar (whichever version you
>> are using in DIH or other mechanism?) or even the pdfbox-app.jar
>> (ExtractText option) on your files outside of Solr to see what text/noise
>> you're getting for the files that are causing problems.
>> >
>> >
>> >
>> > -----Original Message-----
>> > From: Erick Erickson [mailto:erickerick...@gmail.com]
>> > Sent: Tuesday, April 28, 2015 9:07 PM
>> > To: solr-user@lucene.apache.org
>> > Subject: Re: Odp.: solr issue with pdf forms
>> >
>> > There better be.
>> >
>> > 1> go to the admin UI
>> > 2> select a core
>> > 3> select "schema browser"
>> > 4> select a field from the drop-down
>> >
>> > Until you do step 4 the window will be pretty blank.
>> >
>> > Here's the info for TermsComponent, what have you tried?
>> >
>> > https://cwiki.apache.org/confluence/display/solr/The+Terms+Component
>> >
>> > Best,
>> > Erick
>> >
>> > On Tue, Apr 28, 2015 at 1:04 PM,  <steve.sch...@t-systems.com> wrote:
>> >> Thanks a lot for being patient with me. Unfortunately there is no
>> >> button "load term info". :-( Could you maybe help me use the
>> TermsComponent instead? I read that it is configured by default.
>> >>
>> >> Thanks a lot
>> >> Best
>> >> Steve
>> >>
>> >> -----Original Message-----
>> >> From: Erick Erickson [mailto:erickerick...@gmail.com]
>> >> Sent: Monday, April 27, 2015 17:23
>> >> To: solr-user@lucene.apache.org
>> >> Subject: Re: Odp.: solr issue with pdf forms
>> >>
>> >> We're still not quite there. There should be a "load term info" button
>> on that page. Clicking that button will show you the terms in your index
>> (as opposed to the raw stored input which is what you get when you look at
>> results in the browser). My bet is that you'll see perfectly normal tokens
>> in the index that will NOT have the wonky characters you see in the display.
>> >>
>> >> If that's the case, then you have a browser issue, Solr is working
>> perfectly fine. On the other hand, if the individual terms are weird, then
>> you have something more fundamental going on.
>> >>
>> >> Which is why I mentioned the TermsComponent. That will return indexed
>> tokens, and allows you a bit more flexibility than the admin page in terms
>> of what tokens you see, but it's essentially the same information.
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Sun, Apr 26, 2015 at 11:18 PM,  <steve.sch...@t-systems.com> wrote:
>> >>> Erick,
>> >>>
>> >>> thanks a lot for helping me here. In my case it is the "content" field
>> which is not displayed correctly. So I went to the schema browser like you
>> pointed out. Here is the information I found:
>> >>> Field: content
>> >>> Field Type: text
>> >>> Properties:  Indexed, Tokenized, Stored, TermVector Stored
>> >>> Schema:  Indexed, Tokenized, Stored, TermVector Stored
>> >>> Index:  Indexed, Tokenized, Stored, TermVector Stored
>> >>> Copied Into:  spell teaser
>> >>> Position Increment Gap:  100
>> >>>
>> >>> Index Analyzer:  org.apache.solr.analysis.TokenizerChain Details
>> >>> Tokenizer Class:  org.apache.solr.analysis.WhitespaceTokenizerFactory
>> >>> Filters:
>> >>>   org.apache.solr.analysis.WordDelimiterFilterFactory
>> >>>     args:{preserveOriginal: 1 splitOnCaseChange: 0 generateNumberParts: 1
>> >>>     catenateWords: 1 luceneMatchVersion: LUCENE_36 generateWordParts: 1
>> >>>     catenateAll: 0 catenateNumbers: 1 }
>> >>>   org.apache.solr.analysis.LowerCaseFilterFactory
>> >>>     args:{luceneMatchVersion: LUCENE_36 }
>> >>>   org.apache.solr.analysis.SynonymFilterFactory
>> >>>     args:{synonyms: german/synonyms.txt expand: true ignoreCase: true
>> >>>     luceneMatchVersion: LUCENE_36 }
>> >>>   org.apache.solr.analysis.DictionaryCompoundWordTokenFilterFactory
>> >>>     args:{maxSubwordSize: 15 onlyLongestMatch: false minSubwordSize: 4
>> >>>     minWordSize: 5 dictionary: german/german-common-nouns.txt
>> >>>     luceneMatchVersion: LUCENE_36 }
>> >>>   org.apache.solr.analysis.StopFilterFactory
>> >>>     args:{words: german/stopwords.txt ignoreCase: true
>> >>>     enablePositionIncrements: true luceneMatchVersion: LUCENE_36 }
>> >>>   org.apache.solr.analysis.GermanNormalizationFilterFactory
>> >>>     args:{luceneMatchVersion: LUCENE_36 }
>> >>>   org.apache.solr.analysis.SnowballPorterFilterFactory
>> >>>     args:{protected: german/protwords.txt language: German2
>> >>>     luceneMatchVersion: LUCENE_36 }
>> >>>   org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory
>> >>>     args:{luceneMatchVersion: LUCENE_36 }
>> >>>
>> >>> Query Analyzer:  org.apache.solr.analysis.TokenizerChain Details
>> >>> Tokenizer Class:  org.apache.solr.analysis.WhitespaceTokenizerFactory
>> >>> Filters:
>> >>>   org.apache.solr.analysis.WordDelimiterFilterFactory
>> >>>     args:{preserveOriginal: 1 splitOnCaseChange: 0 generateNumberParts: 1
>> >>>     catenateWords: 0 luceneMatchVersion: LUCENE_36 generateWordParts: 1
>> >>>     catenateAll: 0 catenateNumbers: 0 }
>> >>>   org.apache.solr.analysis.LowerCaseFilterFactory
>> >>>     args:{luceneMatchVersion: LUCENE_36 }
>> >>>   org.apache.solr.analysis.StopFilterFactory
>> >>>     args:{words: german/stopwords.txt ignoreCase: true
>> >>>     enablePositionIncrements: true luceneMatchVersion: LUCENE_36 }
>> >>>   org.apache.solr.analysis.GermanNormalizationFilterFactory
>> >>>     args:{luceneMatchVersion: LUCENE_36 }
>> >>>   org.apache.solr.analysis.SnowballPorterFilterFactory
>> >>>     args:{protected: german/protwords.txt language: German2
>> >>>     luceneMatchVersion: LUCENE_36 }
>> >>>   org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory
>> >>>     args:{luceneMatchVersion: LUCENE_36 }
>> >>>
>> >>> Distinct:  160403
>> >>>
>> >>> Does this somehow help to figure out the issue?
>> >>> Thanks
>> >>> Best
>> >>> Steve
>> >>>
>> >>>
>> >>> -----Original Message-----
>> >>> From: Erick Erickson [mailto:erickerick...@gmail.com]
>> >>> Sent: Friday, April 24, 2015 20:15
>> >>> To: solr-user@lucene.apache.org
>> >>> Subject: Re: Odp.: solr issue with pdf forms
>> >>>
>> >>> Steve:
>> >>>
>> >>> Right, it's not exactly obvious. Bring up the admin UI, something like
>> http://localhost:8983/solr. From there you have to select a core in the
>> 'core selector' drop-down on the left side. If you're using SolrCloud, this
>> will have a rather strange name, but it should be easy to identify what
>> collection it belongs to.
>> >>>
>> >>> At that point you'll see a bunch of new options, among them "schema
>> browser". From there, select your field from the drop-down that will
>> appear, then a button should pop up "load term info".
>> >>>
>> >>> NOTE: you can get the same information from the TermsComponent, see:
>> >>> https://cwiki.apache.org/confluence/display/solr/The+Terms+Component.
>> >>> This is a little more flexible because you can, among other things,
>> specify the place to start. In your case you might specify
>> terms.prefix=mein which will show you the terms that are actually being
>> _searched_ as opposed to being stored. This latter is what you see in the
>> browser when you search for docs and is sometimes misleading as you're
>> (probably) seeing.
>> >>>
>> >>> Best,
>> >>> Erick
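
As a rough illustration of the TermsComponent request described above, the URL could be assembled like this. The host and the core name "collection1" are placeholders, and the sketch assumes a /terms request handler is configured in solrconfig.xml; terms.fl, terms.prefix and terms.limit are the component's documented parameters.

```python
from urllib.parse import urlencode


def terms_url(base, core, field, prefix, limit=20):
    """Build a TermsComponent URL listing indexed terms that start with prefix."""
    params = {
        "terms.fl": field,       # field whose indexed terms are listed
        "terms.prefix": prefix,  # only return terms starting with this string
        "terms.limit": limit,    # maximum number of terms to return
        "wt": "json",
    }
    return f"{base}/{core}/terms?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical host and core name; substitute your own.
    print(terms_url("http://localhost:8983/solr", "collection1", "content", "mein"))
```

Opening the resulting URL in a browser returns the indexed terms, which can then be compared with the stored text shown in search results.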
>> >>>
>> >>> On Fri, Apr 24, 2015 at 1:58 AM,  <steve.sch...@t-systems.com> wrote:
>> >>>> Hey Erick,
>> >>>>
>> >>>> thanks a lot for your answer. I went to the admin schema browser,
>> >>>> but what should I see there? Sorry, I'm not familiar with the admin
>> >>>> schema browser. :-(
>> >>>>
>> >>>> Best
>> >>>> Steve
>> >>>>
>> >>>>
>> >>>> -----Original Message-----
>> >>>> From: Erick Erickson [mailto:erickerick...@gmail.com]
>> >>>> Sent: Thursday, April 23, 2015 18:00
>> >>>> To: solr-user@lucene.apache.org
>> >>>> Subject: Re: Odp.: solr issue with pdf forms
>> >>>>
>> >>>> When you say "they're not indexed correctly", what's your evidence?
>> >>>> You cannot rely
>> >>>> on the display in the browser, that's the raw input just as it was
>> sent to Solr, _not_ the actual tokens in the index. What do you see when
>> you go to the admin schema browser page and load the actual tokens?
>> >>>>
>> >>>> Or use the TermsComponent
>> >>>> (https://cwiki.apache.org/confluence/display/solr/The+Terms+Component)
>> >>>> to see the actual terms in the index as opposed to the stored data
>> >>>> you see in the browser when you look at search results.
>> >>>>
>> >>>> If the actual terms don't seem right _in the index_ we need to see
>> your analysis chain, i.e. your fieldType definition.
>> >>>>
>> >>>> I'm 90% sure you're seeing the stored data and your terms are
>> indexed just fine, but I've certainly been wrong before, more times than I
>> want to remember.....
>> >>>>
>> >>>> Best,
>> >>>> Erick
>> >>>>
>> >>>> On Thu, Apr 23, 2015 at 1:18 AM,  <steve.sch...@t-systems.com> wrote:
>> >>>>> Hey Erick,
>> >>>>>
>> >>>>> thanks for your answer. They are not indexed correctly. Also
>> through the solr admin interface I see these typical question marks within
>> a rhombus where a blank space should be.
>> >>>>> I now figured out the following (not sure if it is relevant at all):
>> >>>>> - PDF documents created with "Acrobat PDFMaker 10.0 for Word" are
>> >>>>> indexed correctly, no issues
>> >>>>> - PDF documents (with editable form fields) created with "Adobe
>> >>>>> InDesign CS5 (7.0.1)"  are indexed with the blank space issue
>> >>>>>
>> >>>>> Best
>> >>>>> Steve
>> >>>>>
>> >>>>> -----Original Message-----
>> >>>>> From: Erick Erickson [mailto:erickerick...@gmail.com]
>> >>>>> Sent: Wednesday, April 22, 2015 17:11
>> >>>>> To: solr-user@lucene.apache.org
>> >>>>> Subject: Re: Odp.: solr issue with pdf forms
>> >>>>>
>> >>>>> Are they not _indexed_ correctly or not being displayed correctly?
>> >>>>> Take a look at admin UI>>schema browser>> your field and press the
>> "load terms" button. That'll show you what is _in_ the index as opposed to
>> what the raw data looked like.
>> >>>>>
>> >>>>> When you return the field in a Solr search, you get a verbatim,
>> un-analyzed copy of your original input. My guess is that your browser
>> isn't using a compatible character encoding for display.
>> >>>>>
>> >>>>> Best,
>> >>>>> Erick
>> >>>>>
>> >>>>> On Wed, Apr 22, 2015 at 7:08 AM,  <steve.sch...@t-systems.com>
>> wrote:
>> >>>>>> Thanks for your answer. Maybe my English is not good enough, what
>> are you trying to say? Sorry I didn't get the point.
>> >>>>>> :-(
>> >>>>>>
>> >>>>>>
>> >>>>>> -----Original Message-----
>> >>>>>> From: LAFK [mailto:tomasz.bo...@gmail.com]
>> >>>>>> Sent: Wednesday, April 22, 2015 14:01
>> >>>>>> To: solr-user@lucene.apache.org; solr-user@lucene.apache.org
>> >>>>>> Subject: Odp.: solr issue with pdf forms
>> >>>>>>
>> >>>>>> Off the top of my head, I'd look into how writable PDFs are created and encoded.
>> >>>>>>
>> >>>>>> @LAFK_PL
>> >>>>>>   Original Message
>> >>>>>> From: steve.sch...@t-systems.com
>> >>>>>> Sent: Wednesday, April 22, 2015 12:41
>> >>>>>> To: solr-user@lucene.apache.org
>> >>>>>> Reply-To: solr-user@lucene.apache.org
>> >>>>>> Subject: solr issue with pdf forms
>> >>>>>>
>> >>>>>> Hi guys,
>> >>>>>>
>> >>>>>> hopefully you can help me with my issue. We are using a solr setup
>> and have the following issue:
>> >>>>>> - usual pdf files are indexed just fine
>> >>>>>> - pdf files with writable form-fields look like this:
>> >>>>>> Ich bestätige mit meiner Unterschrift, dass alle Angaben korrekt
>> >>>>>> und v ollständig sind
>> >>>>>>
>> >>>>>> Somehow the blank space character is not indexed correctly.
>> >>>>>>
>> >>>>>> Is this a known issue? Does anybody have an idea?
>> >>>>>>
>> >>>>>> Thanks a lot
>> >>>>>> Best
>> >>>>>> Steve
>>
