Re: Odp.: solr issue with pdf forms

Jack Krupansky Thu, 30 Apr 2015 08:24:53 -0700

Or use a Solr update processor to scrub the source values. The regex
pattern replacement processor could do the trick:
http://lucene.apache.org/solr/5_1_0/solr-core/org/apache/solr/update/processor/RegexReplaceProcessorFactory.html
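
As a rough illustration of the kind of scrubbing such a pattern would do,
here is a sketch in Python rather than Solr configuration. It assumes the
stray characters are ASCII control characters, such as the backspace that
shows up as ^H in the pdfbox output quoted further down; the names
CONTROL_CHARS and scrub are made up for this example and are not part of
Solr or Tika.

```python
import re

# Assumption: the unwanted characters are ASCII control characters
# (e.g. backspace, U+0008). Tab, line feed and carriage return are kept.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def scrub(text: str) -> str:
    """Replace every matched control character with a single space."""
    return CONTROL_CHARS.sub(" ", text)


if __name__ == "__main__":
    # Hypothetical sample modelled on the form text in this thread.
    sample = "Bitte\x08legen\x08Sie\x08dem\x08Antrag"
    print(scrub(sample))
```

The same character class could presumably be used as the pattern of the
processor, with a single space as the replacement, but check the exact
parameter names against the Javadoc linked above.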


-- Jack Krupansky

On Thu, Apr 30, 2015 at 11:17 AM, Erick Erickson <erickerick...@gmail.com>
wrote:

> OK, given all that, Tika _is_ sending the weird characters to Solr. You
> can get them out of the index by using something like
> PatternReplaceTokenFilterFactory or PatternReplaceCharFilterFactory in
> your analysis chain. However, you'll still be stuck with the odd
> characters showing up in your browser.
>
> Don't know what options are configurable for Tika to not return these
> so I'm not much help there.
>
> You could, however, use Tika with SolrJ to do this on the client and
> scrub the data there. Here's an example.
>
> https://lucidworks.com/blog/indexing-with-solrj/
>
> Best,
> Erick
>
> On Thu, Apr 30, 2015 at 12:03 AM,  <steve.sch...@t-systems.com> wrote:
> > Hey, thanks a lot for the hint with pdfbox-app.jar.
> > For testing purposes I have now extracted an affected pdf form and a usual pdf file.
> > The result is the following:
> >
> > Usual pdf file:
> > Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy
> eirmod tempor invidunt ut
> > labore et d
> >
> > pdf form:
> > Bitte^Hlegen^HSie^Hdem^HAntrag Kopien aller Einkommensnachweise bei.^HDaz
> >
> > Best
> > Steve
> >
> > -----Original Message-----
> > From: Allison, Timothy B. [mailto:talli...@mitre.org]
> > Sent: Wednesday, April 29, 2015 14:16
> > To: solr-user@lucene.apache.org
> > Cc: u...@tika.apache.org
> > Subject: RE: Odp.: solr issue with pdf forms
> >
> > I completely agree with Erick about the utility of the TermsComponent to
> see what is actually being indexed.  If you find problems there and if you
> haven't done so already, you might also investigate further down the
> stack.  It might make sense to run the tika-app.jar (whichever version you
> are using in DIH or other mechanism?) or even the pdfbox-app.jar
> (ExtractText option) on your files outside of Solr to see what text/noise
> you're getting for the files that are causing problems.
> >
> >
> >
> > -----Original Message-----
> > From: Erick Erickson [mailto:erickerick...@gmail.com]
> > Sent: Tuesday, April 28, 2015 9:07 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Odp.: solr issue with pdf forms
> >
> > There better be.
> >
> > 1> go to the admin UI
> > 2> select a core
> > 3> select "schema browser"
> > 4> select a field from the drop-down
> >
> > Until you do step 4 the window will be pretty blank.
> >
> > Here's the info for TermsComponent, what have you tried?
> >
> > https://cwiki.apache.org/confluence/display/solr/The+Terms+Component
> >
> > Best,
> > Erick
> >
> > On Tue, Apr 28, 2015 at 1:04 PM,  <steve.sch...@t-systems.com> wrote:
> >> Thanks a lot for being patient with me. Unfortunately there is no
> >> button "load term info". :-( Could you maybe help me use the
> >> TermsComponent instead? I read that it is configured by default.
> >>
> >> Thanks a lot
> >> Best
> >> Steve
> >>
> >> -----Original Message-----
> >> From: Erick Erickson [mailto:erickerick...@gmail.com]
> >> Sent: Monday, April 27, 2015 17:23
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: Odp.: solr issue with pdf forms
> >>
> >> We're still not quite there. There should be a "load term info" button
> on that page. Clicking that button will show you the terms in your index
> (as opposed to the raw stored input which is what you get when you look at
> results in the browser). My bet is that you'll see perfectly normal tokens
> in the index that will NOT have the wonky characters you see in the display.
> >>
> >> If that's the case, then you have a browser issue, Solr is working
> perfectly fine. On the other hand, if the individual terms are weird, then
> you have something more fundamental going on.
> >>
> >> Which is why I mentioned the TermsComponent. That will return indexed
> tokens, and allows you a bit more flexibility than the admin page in terms
> of what tokens you see, but it's essentially the same information.
> >>
> >> Best,
> >> Erick
> >>
> >> On Sun, Apr 26, 2015 at 11:18 PM,  <steve.sch...@t-systems.com> wrote:
> >>> Erick,
> >>>
> >>> thanks a lot for helping me here. In my case it is the "content" field
> >>> which is not displayed correctly. So I went to the schema browser like you
> >>> pointed out. Here is the information I found:
> >>> Field: content
> >>> Field Type: text
> >>> Properties:  Indexed, Tokenized, Stored, TermVector Stored
> >>> Schema:  Indexed, Tokenized, Stored, TermVector Stored
> >>> Index:  Indexed, Tokenized, Stored, TermVector Stored
> >>> Copied Into: spell teaser
> >>> Position Increment Gap: 100
> >>> Index Analyzer: org.apache.solr.analysis.TokenizerChain Details
> >>> Tokenizer Class: org.apache.solr.analysis.WhitespaceTokenizerFactory
> >>> Filters:
> >>> org.apache.solr.analysis.WordDelimiterFilterFactory
> >>> args:{preserveOriginal: 1 splitOnCaseChange: 0 generateNumberParts: 1
> >>> catenateWords: 1 luceneMatchVersion: LUCENE_36 generateWordParts: 1
> >>> catenateAll: 0 catenateNumbers: 1 }
> >>> org.apache.solr.analysis.LowerCaseFilterFactory
> >>> args:{luceneMatchVersion: LUCENE_36 }
> >>> org.apache.solr.analysis.SynonymFilterFactory
> >>> args:{synonyms: german/synonyms.txt expand: true ignoreCase: true
> >>> luceneMatchVersion: LUCENE_36 }
> >>> org.apache.solr.analysis.DictionaryCompoundWordTokenFilterFactory
> >>> args:{maxSubwordSize: 15 onlyLongestMatch: false minSubwordSize: 4
> >>> minWordSize: 5 dictionary: german/german-common-nouns.txt
> >>> luceneMatchVersion: LUCENE_36 }
> >>> org.apache.solr.analysis.StopFilterFactory
> >>> args:{words: german/stopwords.txt ignoreCase: true
> >>> enablePositionIncrements: true luceneMatchVersion: LUCENE_36 }
> >>> org.apache.solr.analysis.GermanNormalizationFilterFactory
> >>> args:{luceneMatchVersion: LUCENE_36 }
> >>> org.apache.solr.analysis.SnowballPorterFilterFactory
> >>> args:{protected: german/protwords.txt language: German2
> >>> luceneMatchVersion: LUCENE_36 }
> >>> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory
> >>> args:{luceneMatchVersion: LUCENE_36 }
> >>> Query Analyzer: org.apache.solr.analysis.TokenizerChain Details
> >>> Tokenizer Class: org.apache.solr.analysis.WhitespaceTokenizerFactory
> >>> Filters:
> >>> org.apache.solr.analysis.WordDelimiterFilterFactory
> >>> args:{preserveOriginal: 1 splitOnCaseChange: 0 generateNumberParts: 1
> >>> catenateWords: 0 luceneMatchVersion: LUCENE_36 generateWordParts: 1
> >>> catenateAll: 0 catenateNumbers: 0 }
> >>> org.apache.solr.analysis.LowerCaseFilterFactory
> >>> args:{luceneMatchVersion: LUCENE_36 }
> >>> org.apache.solr.analysis.StopFilterFactory
> >>> args:{words: german/stopwords.txt ignoreCase: true
> >>> enablePositionIncrements: true luceneMatchVersion: LUCENE_36 }
> >>> org.apache.solr.analysis.GermanNormalizationFilterFactory
> >>> args:{luceneMatchVersion: LUCENE_36 }
> >>> org.apache.solr.analysis.SnowballPorterFilterFactory
> >>> args:{protected: german/protwords.txt language: German2
> >>> luceneMatchVersion: LUCENE_36 }
> >>> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory
> >>> args:{luceneMatchVersion: LUCENE_36 }
> >>> Distinct:  160403
> >>>
> >>> Does this somehow help to figure out the issue?
> >>> Thanks
> >>> Best
> >>> Steve
> >>>
> >>>
> >>> -----Original Message-----
> >>> From: Erick Erickson [mailto:erickerick...@gmail.com]
> >>> Sent: Friday, April 24, 2015 20:15
> >>> To: solr-user@lucene.apache.org
> >>> Subject: Re: Odp.: solr issue with pdf forms
> >>>
> >>> Steve:
> >>>
> >>> Right, it's not exactly obvious. Bring up the admin UI, something like
> http://localhost:8983/solr. From there you have to select a core in the
> 'core selector' drop-down on the left side. If you're using SolrCloud, this
> will have a rather strange name, but it should be easy to identify what
> collection it belongs to.
> >>>
> >>> At that point you'll see a bunch of new options, among them "schema
> browser". From there, select your field from the drop-down that will
> appear, then a button should pop up "load term info".
> >>>
> >>> NOTE: you can get the same information from the TermsComponent, see:
> >>> https://cwiki.apache.org/confluence/display/solr/The+Terms+Component.
> >>> This is a little more flexible because you can, among other things,
> specify the place to start. In your case you might specify
> terms.prefix=mein which will show you the terms that are actually being
> _searched_ as opposed to being stored. This latter is what you see in the
> browser when you search for docs and is sometimes misleading as you're
> (probably) seeing.
> >>>
> >>> Best,
> >>> Erick
> >>>
> >>> On Fri, Apr 24, 2015 at 1:58 AM,  <steve.sch...@t-systems.com> wrote:
> >>>> Hey Erick,
> >>>>
> >>>> thanks a lot for your answer. I went to the admin schema browser,
> >>>> but what should I see there? Sorry, I'm not familiar with the admin
> >>>> schema browser. :-(
> >>>>
> >>>> Best
> >>>> Steve
> >>>>
> >>>>
> >>>> -----Original Message-----
> >>>> From: Erick Erickson [mailto:erickerick...@gmail.com]
> >>>> Sent: Thursday, April 23, 2015 18:00
> >>>> To: solr-user@lucene.apache.org
> >>>> Subject: Re: Odp.: solr issue with pdf forms
> >>>>
> >>>> When you say "they're not indexed correctly", what's your evidence?
> >>>> You cannot rely
> >>>> on the display in the browser, that's the raw input just as it was
> >>>> sent to Solr, _not_ the actual tokens in the index. What do you see when
> >>>> you go to the admin schema browser page and load the actual tokens?
> >>>>
> >>>> Or use the TermsComponent
> >>>> (https://cwiki.apache.org/confluence/display/solr/The+Terms+Component)
> >>>> to see the actual terms in the index as opposed to the stored data
> >>>> you see in the browser when you look at search results.
> >>>>
> >>>> If the actual terms don't seem right _in the index_ we need to see
> your analysis chain, i.e. your fieldType definition.
> >>>>
> >>>> I'm 90% sure you're seeing the stored data and your terms are
> >>>> indexed just fine, but I've certainly been wrong before, more times than I
> >>>> want to remember...
> >>>>
> >>>> Best,
> >>>> Erick
> >>>>
> >>>> On Thu, Apr 23, 2015 at 1:18 AM,  <steve.sch...@t-systems.com> wrote:
> >>>>> Hey Erick,
> >>>>>
> >>>>> thanks for your answer. They are not indexed correctly. Through the
> >>>>> Solr admin interface I also see the typical question marks inside
> >>>>> a diamond where a blank space should be.
> >>>>> I now figured out the following (not sure if it is relevant at all):
> >>>>> - PDF documents created with "Acrobat PDFMaker 10.0 for Word" are
> >>>>> indexed correctly, no issues
> >>>>> - PDF documents (with editable form fields) created with "Adobe
> >>>>> InDesign CS5 (7.0.1)"  are indexed with the blank space issue
> >>>>>
> >>>>> Best
> >>>>> Steve
> >>>>>
> >>>>> -----Original Message-----
> >>>>> From: Erick Erickson [mailto:erickerick...@gmail.com]
> >>>>> Sent: Wednesday, April 22, 2015 17:11
> >>>>> To: solr-user@lucene.apache.org
> >>>>> Subject: Re: Odp.: solr issue with pdf forms
> >>>>>
> >>>>> Are they not _indexed_ correctly or not being displayed correctly?
> >>>>> Take a look at admin UI>>schema browser>> your field and press the
> "load terms" button. That'll show you what is _in_ the index as opposed to
> what the raw data looked like.
> >>>>>
> >>>>> When you return the field in a Solr search, you get a verbatim,
> >>>>> un-analyzed copy of your original input. My guess is that your browser
> >>>>> isn't using a compatible character encoding for display.
> >>>>>
> >>>>> Best,
> >>>>> Erick
> >>>>>
> >>>>> On Wed, Apr 22, 2015 at 7:08 AM,  <steve.sch...@t-systems.com>
> wrote:
> >>>>>> Thanks for your answer. Maybe my English is not good enough, what
> are you trying to say? Sorry I didn't get the point.
> >>>>>> :-(
> >>>>>>
> >>>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: LAFK [mailto:tomasz.bo...@gmail.com]
> >>>>>> Sent: Wednesday, April 22, 2015 14:01
> >>>>>> To: solr-user@lucene.apache.org; solr-user@lucene.apache.org
> >>>>>> Subject: Odp.: solr issue with pdf forms
> >>>>>>
> >>>>>> Off the top of my head, I'd look into how writable PDFs are created and encoded.
> >>>>>>
> >>>>>> @LAFK_PL
> >>>>>>   Original message
> >>>>>> From: steve.sch...@t-systems.com
> >>>>>> Sent: Wednesday, April 22, 2015 12:41
> >>>>>> To: solr-user@lucene.apache.org
> >>>>>> Reply-To: solr-user@lucene.apache.org
> >>>>>> Subject: solr issue with pdf forms
> >>>>>>
> >>>>>> Hi guys,
> >>>>>>
> >>>>>> hopefully you can help me with my issue. We are using a solr setup
> and have the following issue:
> >>>>>> - usual pdf files are indexed just fine
> >>>>>> - pdf files with writable form-fields look like this:
> >>>>>> Ich bestätige mit meiner Unterschrift, dass alle Angaben korrekt
> >>>>>> und v ollständig sind
> >>>>>>
> >>>>>> Somehow the blank space character is not indexed correctly.
> >>>>>>
> >>>>>> Is this a known issue? Does anybody have an idea?
> >>>>>>
> >>>>>> Thanks a lot
> >>>>>> Best
> >>>>>> Steve
>
