Steve,

Another possibility is to use the Linux pdftotext command-line utility, usually part of the poppler-utils package, or a software daemon linked with the libraries it uses. pdfbox should have the same basic capabilities, but may run a little slower.
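A minimal sketch of the pdftotext route, in Python. The function names are made up for illustration. It assumes pdftotext is on the PATH and that the ^H in Steve's pdfbox output further down is the backspace control character (U+0008). Whether pdftotext emits the same character for these forms is untested here.

```python
import re
import subprocess

# Control characters other than tab, newline and carriage return.
# Assumption: the "^H" seen in the extracted form text is U+0008 (backspace).
# Note this also replaces the form feed (U+000C) that pdftotext puts between pages.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def build_pdftotext_command(pdf_path):
    """Return the argv list for pdftotext; "-" sends the text to stdout."""
    return ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]


def scrub_control_chars(text):
    """Replace each stray control character with a single space."""
    return _CONTROL_CHARS.sub(" ", text)


def extract_text(pdf_path):
    """Run pdftotext on one file and return the scrubbed text."""
    result = subprocess.run(
        build_pdftotext_command(pdf_path),
        capture_output=True,
        check=True,  # raise if pdftotext exits with an error
    )
    return scrub_control_chars(result.stdout.decode("utf-8", errors="replace"))
```

Calling extract_text("form.pdf") on one of the affected forms and on a regular PDF would show whether the odd characters come out of poppler as well, before anything reaches Solr.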
If you have very many "filled pdf" forms, then your relevancy may be affected once you address the problem, because the form labels will be repeated in many documents:

Ich bestätige mit meiner Unterschrift, dass alle Angaben korrekt und v ollständig sind

In that case, you could try to get hold of the FDF files. On the other hand, you may be indexing all of a company's empty forms, which is fine.

Take a look at this workflow for filling out a PDF form with pdftk:
http://www.myown1.com/linux/pdf_formfill.shtml

Your users may be using a different workflow, but they should have the FDF file somewhere, and depending on your application, that may be a better target for indexing due to relevancy. It might not be good to turn "structured data" in the form of FDF files into "unstructured data" by indexing filled-out forms.

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Thursday, April 30, 2015 11:28 AM
To: solr-user@lucene.apache.org
Subject: Re: Odp.: solr issue with pdf forms

Jack:

I keep forgetting those things exist, thanks for the reminder!

On Thu, Apr 30, 2015 at 8:23 AM, Jack Krupansky <jack.krupan...@gmail.com> wrote:
> Or use a Solr update processor to scrub the source values. The regex
> pattern replacement processor could do the trick:
> http://lucene.apache.org/solr/5_1_0/solr-core/org/apache/solr/update/processor/RegexReplaceProcessorFactory.html
>
> -- Jack Krupansky
>
> On Thu, Apr 30, 2015 at 11:17 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>
>> OK, given all that, Tika _is_ sending the weird characters to Solr.
>> You can get them out of the index by using something like
>> PatternReplaceFilterFactory or PatternReplaceCharFilterFactory
>> in your analysis chain. However, you'll still be stuck with the odd
>> characters showing up in your browser.
>>
>> Don't know what options are configurable for Tika to not return these,
>> so I'm not much help there.
>>
>> You could, however, use Tika with SolrJ to do this on the client and
>> scrub the data there. Here's an example.
>>
>> https://lucidworks.com/blog/indexing-with-solrj/
>>
>> Best,
>> Erick
>>
>> On Thu, Apr 30, 2015 at 12:03 AM, <steve.sch...@t-systems.com> wrote:
>> > Hey, thanks a lot for the hint with pdfbox-app.jar.
>> > For testing purposes I have now extracted an affected pdf form and a usual pdf file.
>> > The result is the following:
>> >
>> > Usual pdf file:
>> > Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut
>> > labore et d
>> >
>> > pdf form:
>> > Bitte^Hlegen^HSie^Hdem^HAntrag Kopien aller Einkommensnachweise
>> > bei.^HDaz
>> >
>> > Best
>> > Steve
>> >
>> > -----Original Message-----
>> > From: Allison, Timothy B. [mailto:talli...@mitre.org]
>> > Sent: Wednesday, April 29, 2015 14:16
>> > To: solr-user@lucene.apache.org
>> > Cc: u...@tika.apache.org
>> > Subject: RE: Odp.: solr issue with pdf forms
>> >
>> > I completely agree with Erick about the utility of the TermsComponent to
>> > see what is actually being indexed. If you find problems there and
>> > if you haven't done so already, you might also investigate further
>> > down the stack. It might make sense to run the tika-app.jar
>> > (whichever version you are using in DIH or other mechanism?) or even
>> > the pdfbox-app.jar (ExtractText option) on your files outside of Solr
>> > to see what text/noise you're getting for the files that are causing
>> > problems.
>> >
>> > -----Original Message-----
>> > From: Erick Erickson [mailto:erickerick...@gmail.com]
>> > Sent: Tuesday, April 28, 2015 9:07 PM
>> > To: solr-user@lucene.apache.org
>> > Subject: Re: Odp.: solr issue with pdf forms
>> >
>> > There better be.
>> >
>> > 1> go to the admin UI
>> > 2> select a core
>> > 3> select "schema browser"
>> > 4> select a field from the drop-down
>> >
>> > Until you do step 4 the window will be pretty blank.
>> >
>> > Here's the info for TermsComponent, what have you tried?
>> >
>> > https://cwiki.apache.org/confluence/display/solr/The+Terms+Component
>> >
>> > Best,
>> > Erick
>> >
>> > On Tue, Apr 28, 2015 at 1:04 PM, <steve.sch...@t-systems.com> wrote:
>> >> Thanks a lot for being patient with me. Unfortunately there is no
>> >> button "load term info". :-( Can you maybe help me use the
>> >> TermsComponent instead? I read it is configured by default.
>> >>
>> >> Thanks a lot
>> >> Best
>> >> Steve
>> >>
>> >> -----Original Message-----
>> >> From: Erick Erickson [mailto:erickerick...@gmail.com]
>> >> Sent: Monday, April 27, 2015 17:23
>> >> To: solr-user@lucene.apache.org
>> >> Subject: Re: Odp.: solr issue with pdf forms
>> >>
>> >> We're still not quite there. There should be a "load term info" button
>> >> on that page. Clicking that button will show you the terms in your
>> >> index (as opposed to the raw stored input which is what you get when
>> >> you look at results in the browser). My bet is that you'll see
>> >> perfectly normal tokens in the index that will NOT have the wonky
>> >> characters you see in the display.
>> >>
>> >> If that's the case, then you have a browser issue, Solr is working
>> >> perfectly fine. On the other hand, if the individual terms are weird,
>> >> then you have something more fundamental going on.
>> >>
>> >> Which is why I mentioned the TermsComponent. That will return indexed
>> >> tokens, and allows you a bit more flexibility than the admin page in
>> >> terms of what tokens you see, but it's essentially the same information.
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Sun, Apr 26, 2015 at 11:18 PM, <steve.sch...@t-systems.com> wrote:
>> >>> Erick,
>> >>>
>> >>> thanks a lot for helping me here. In my case it is the "content" field
>> >>> which is not displayed correctly. So I went to the schema browser
>> >>> like you pointed out.
>> >>> Here is the information I found:
>> >>>
>> >>> Field: content
>> >>> Field Type: text
>> >>> Properties: Indexed, Tokenized, Stored, TermVector Stored
>> >>> Schema: Indexed, Tokenized, Stored, TermVector Stored
>> >>> Index: Indexed, Tokenized, Stored, TermVector Stored
>> >>> Copied Into: spell teaser
>> >>> Position Increment Gap: 100
>> >>>
>> >>> Index Analyzer: org.apache.solr.analysis.TokenizerChain Details
>> >>> Tokenizer Class: org.apache.solr.analysis.WhitespaceTokenizerFactory
>> >>> Filters:
>> >>> org.apache.solr.analysis.WordDelimiterFilterFactory
>> >>>   args:{preserveOriginal: 1 splitOnCaseChange: 0 generateNumberParts: 1 catenateWords: 1 luceneMatchVersion: LUCENE_36 generateWordParts: 1 catenateAll: 0 catenateNumbers: 1 }
>> >>> org.apache.solr.analysis.LowerCaseFilterFactory
>> >>>   args:{luceneMatchVersion: LUCENE_36 }
>> >>> org.apache.solr.analysis.SynonymFilterFactory
>> >>>   args:{synonyms: german/synonyms.txt expand: true ignoreCase: true luceneMatchVersion: LUCENE_36 }
>> >>> org.apache.solr.analysis.DictionaryCompoundWordTokenFilterFactory
>> >>>   args:{maxSubwordSize: 15 onlyLongestMatch: false minSubwordSize: 4 minWordSize: 5 dictionary: german/german-common-nouns.txt luceneMatchVersion: LUCENE_36 }
>> >>> org.apache.solr.analysis.StopFilterFactory
>> >>>   args:{words: german/stopwords.txt ignoreCase: true enablePositionIncrements: true luceneMatchVersion: LUCENE_36 }
>> >>> org.apache.solr.analysis.GermanNormalizationFilterFactory
>> >>>   args:{luceneMatchVersion: LUCENE_36 }
>> >>> org.apache.solr.analysis.SnowballPorterFilterFactory
>> >>>   args:{protected: german/protwords.txt language: German2 luceneMatchVersion: LUCENE_36 }
>> >>> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory
>> >>>   args:{luceneMatchVersion: LUCENE_36 }
>> >>>
>> >>> Query Analyzer: org.apache.solr.analysis.TokenizerChain Details
>> >>> Tokenizer Class: org.apache.solr.analysis.WhitespaceTokenizerFactory
>> >>> Filters:
>> >>> org.apache.solr.analysis.WordDelimiterFilterFactory
>> >>>   args:{preserveOriginal: 1 splitOnCaseChange: 0 generateNumberParts: 1 catenateWords: 0 luceneMatchVersion: LUCENE_36 generateWordParts: 1 catenateAll: 0 catenateNumbers: 0 }
>> >>> org.apache.solr.analysis.LowerCaseFilterFactory
>> >>>   args:{luceneMatchVersion: LUCENE_36 }
>> >>> org.apache.solr.analysis.StopFilterFactory
>> >>>   args:{words: german/stopwords.txt ignoreCase: true enablePositionIncrements: true luceneMatchVersion: LUCENE_36 }
>> >>> org.apache.solr.analysis.GermanNormalizationFilterFactory
>> >>>   args:{luceneMatchVersion: LUCENE_36 }
>> >>> org.apache.solr.analysis.SnowballPorterFilterFactory
>> >>>   args:{protected: german/protwords.txt language: German2 luceneMatchVersion: LUCENE_36 }
>> >>> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory
>> >>>   args:{luceneMatchVersion: LUCENE_36 }
>> >>>
>> >>> Distinct: 160403
>> >>>
>> >>> Does this somehow help to figure out the issue?
>> >>> Thanks
>> >>> Best
>> >>> Steve
>> >>>
>> >>> -----Original Message-----
>> >>> From: Erick Erickson [mailto:erickerick...@gmail.com]
>> >>> Sent: Friday, April 24, 2015 20:15
>> >>> To: solr-user@lucene.apache.org
>> >>> Subject: Re: Odp.: solr issue with pdf forms
>> >>>
>> >>> Steve:
>> >>>
>> >>> Right, it's not exactly obvious. Bring up the admin UI, something like
>> >>> http://localhost:8983/solr. From there you have to select a core in
>> >>> the 'core selector' drop-down on the left side. If you're using
>> >>> SolrCloud, this will have a rather strange name, but it should be
>> >>> easy to identify what collection it belongs to.
>> >>>
>> >>> At that point you'll see a bunch of new options, among them "schema
>> >>> browser". From there, select your field from the drop-down that will
>> >>> appear, then a "load term info" button should pop up.
>> >>>
>> >>> NOTE: you can get the same information from the TermsComponent, see:
>> >>> https://cwiki.apache.org/confluence/display/solr/The+Terms+Component
>> >>> This is a little more flexible because you can, among other things,
>> >>> specify the place to start. In your case you might specify
>> >>> terms.prefix=mein which will show you the terms that are actually
>> >>> being _searched_ as opposed to being stored. The latter is what you
>> >>> see in the browser when you search for docs, and it is sometimes
>> >>> misleading, as you're (probably) seeing.
>> >>>
>> >>> Best,
>> >>> Erick
>> >>>
>> >>> On Fri, Apr 24, 2015 at 1:58 AM, <steve.sch...@t-systems.com> wrote:
>> >>>> Hey Erick,
>> >>>>
>> >>>> thanks a lot for your answer. I went to the admin schema
>> >>>> browser, but what should I see there? Sorry, I'm not familiar with
>> >>>> the admin schema browser. :-(
>> >>>>
>> >>>> Best
>> >>>> Steve
>> >>>>
>> >>>> -----Original Message-----
>> >>>> From: Erick Erickson [mailto:erickerick...@gmail.com]
>> >>>> Sent: Thursday, April 23, 2015 18:00
>> >>>> To: solr-user@lucene.apache.org
>> >>>> Subject: Re: Odp.: solr issue with pdf forms
>> >>>>
>> >>>> When you say "they're not indexed correctly", what's your evidence?
>> >>>> You cannot rely on the display in the browser, that's the raw input
>> >>>> just as it was sent to Solr, _not_ the actual tokens in the index.
>> >>>> What do you see when you go to the admin schema browser page and
>> >>>> load the actual tokens?
>> >>>>
>> >>>> Or use the TermsComponent
>> >>>> (https://cwiki.apache.org/confluence/display/solr/The+Terms+Component)
>> >>>> to see the actual terms in the index as opposed to the stored
>> >>>> data you see in the browser when you look at search results.
>> >>>>
>> >>>> If the actual terms don't seem right _in the index_ we need to see
>> >>>> your analysis chain, i.e. your fieldType definition.
>> >>>>
>> >>>> I'm 90% sure you're seeing the stored data and your terms are
>> >>>> indexed just fine, but I've certainly been wrong before, more times
>> >>>> than I want to remember...
>> >>>>
>> >>>> Best,
>> >>>> Erick
>> >>>>
>> >>>> On Thu, Apr 23, 2015 at 1:18 AM, <steve.sch...@t-systems.com> wrote:
>> >>>>> Hey Erick,
>> >>>>>
>> >>>>> thanks for your answer. They are not indexed correctly. Also
>> >>>>> through the Solr admin interface I see these typical question marks
>> >>>>> within a rhombus where a blank space should be.
>> >>>>> I have now figured out the following (not sure if it is relevant at all):
>> >>>>> - PDF documents created with "Acrobat PDFMaker 10.0 for Word"
>> >>>>>   are indexed correctly, no issues
>> >>>>> - PDF documents (with editable form fields) created with "Adobe
>> >>>>>   InDesign CS5 (7.0.1)" are indexed with the blank space issue
>> >>>>>
>> >>>>> Best
>> >>>>> Steve
>> >>>>>
>> >>>>> -----Original Message-----
>> >>>>> From: Erick Erickson [mailto:erickerick...@gmail.com]
>> >>>>> Sent: Wednesday, April 22, 2015 17:11
>> >>>>> To: solr-user@lucene.apache.org
>> >>>>> Subject: Re: Odp.: solr issue with pdf forms
>> >>>>>
>> >>>>> Are they not _indexed_ correctly or not being displayed correctly?
>> >>>>> Take a look at admin UI > schema browser > your field and press the
>> >>>>> "load terms" button. That'll show you what is _in_ the index as
>> >>>>> opposed to what the raw data looked like.
>> >>>>>
>> >>>>> When you return the field in a Solr search, you get a verbatim,
>> >>>>> un-analyzed copy of your original input. My guess is that your
>> >>>>> browser isn't using a compatible character encoding for display.
>> >>>>>
>> >>>>> Best,
>> >>>>> Erick
>> >>>>>
>> >>>>> On Wed, Apr 22, 2015 at 7:08 AM, <steve.sch...@t-systems.com> wrote:
>> >>>>>> Thanks for your answer. Maybe my English is not good enough, what
>> >>>>>> are you trying to say? Sorry I didn't get the point.
>> >>>>>> :-(
>> >>>>>>
>> >>>>>> -----Original Message-----
>> >>>>>> From: LAFK [mailto:tomasz.bo...@gmail.com]
>> >>>>>> Sent: Wednesday, April 22, 2015 14:01
>> >>>>>> To: solr-user@lucene.apache.org; solr-user@lucene.apache.org
>> >>>>>> Subject: Odp.: solr issue with pdf forms
>> >>>>>>
>> >>>>>> Off the top of my head, I'd look into how writable PDFs are created and encoded.
>> >>>>>>
>> >>>>>> @LAFK_PL
>> >>>>>> Original Message
>> >>>>>> From: steve.sch...@t-systems.com
>> >>>>>> Sent: Wednesday, April 22, 2015 12:41
>> >>>>>> To: solr-user@lucene.apache.org
>> >>>>>> Reply-To: solr-user@lucene.apache.org
>> >>>>>> Subject: solr issue with pdf forms
>> >>>>>>
>> >>>>>> Hi guys,
>> >>>>>>
>> >>>>>> hopefully you can help me with my issue. We are using a Solr setup
>> >>>>>> and have the following issue:
>> >>>>>> - usual pdf files are indexed just fine
>> >>>>>> - pdf files with writable form-fields look like this:
>> >>>>>>   Ich bestätige mit meiner Unterschrift, dass alle Angaben
>> >>>>>>   korrekt und v ollständig sind
>> >>>>>>
>> >>>>>> Somehow the blank space character is not indexed correctly.
>> >>>>>>
>> >>>>>> Is this a known issue? Does anybody have an idea?
>> >>>>>>
>> >>>>>> Thanks a lot
>> >>>>>> Best
>> >>>>>> Steve