Or use a Solr update processor to scrub the source values. The regex pattern replacement processor could do the trick: http://lucene.apache.org/solr/5_1_0/solr-core/org/apache/solr/update/processor/RegexReplaceProcessorFactory.html
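To make the scrubbing idea concrete, here is a minimal sketch of the kind of pattern such a processor (or a char filter in the analysis chain) might use. It assumes the ^H in Steve's extracted text is caret notation for the backspace character U+0008; the names CONTROL_CHARS and scrub are made up for illustration, and the solrconfig.xml wiring is deliberately left out.

```python
import re

# Assumption: the "^H" seen in the extracted text is a literal backspace (0x08).
# The class below matches C0 control characters and DEL, but leaves tab,
# line feed and carriage return alone.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def scrub(text: str) -> str:
    """Replace each stray control character with a single space."""
    return CONTROL_CHARS.sub(" ", text)


sample = "Bitte\x08legen\x08Sie\x08dem\x08Antrag"
print(scrub(sample))  # Bitte legen Sie dem Antrag
```

Replacing with a space rather than an empty string keeps the words apart, which seems to be what the form intended. Only the pattern would carry over to Solr; the replacement logic here is plain Python.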
-- Jack Krupansky

On Thu, Apr 30, 2015 at 11:17 AM, Erick Erickson <erickerick...@gmail.com> wrote:

> OK, given all that, Tika _is_ sending the weird characters to Solr. You
> can get them out of the index by using something like
> PatternReplaceFilterFactory or PatternReplaceCharFilterFactory in
> your analysis chain. However, you'll still be stuck with the odd
> characters showing up in your browser.
>
> Don't know what options are configurable for Tika to not return these,
> so I'm not much help there.
>
> You could, however, use Tika with SolrJ to do this on the client and
> scrub the data there. Here's an example.
>
> https://lucidworks.com/blog/indexing-with-solrj/
>
> Best,
> Erick
>
> On Thu, Apr 30, 2015 at 12:03 AM, <steve.sch...@t-systems.com> wrote:
> > Hey, thanks a lot for the hint with pdfbox-app.jar.
> > For testing purposes I have now extracted an affected pdf form and a
> > usual pdf file.
> > The result is the following:
> >
> > Usual pdf file:
> > Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut
> > labore et d
> >
> > pdf form:
> > Bitte^Hlegen^HSie^Hdem^HAntrag Kopien aller Einkommensnachweise bei.^HDaz
> >
> > Best
> > Steve
> >
> > -----Original Message-----
> > From: Allison, Timothy B. [mailto:talli...@mitre.org]
> > Sent: Wednesday, April 29, 2015 14:16
> > To: solr-user@lucene.apache.org
> > Cc: u...@tika.apache.org
> > Subject: RE: Odp.: solr issue with pdf forms
> >
> > I completely agree with Erick about the utility of the TermsComponent to
> > see what is actually being indexed. If you find problems there and if you
> > haven't done so already, you might also investigate further down the
> > stack. It might make sense to run the tika-app.jar (whichever version you
> > are using in DIH or other mechanism?) or even the pdfbox-app.jar
> > (ExtractText option) on your files outside of Solr to see what text/noise
> > you're getting for the files that are causing problems.
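Once the text has been extracted outside of Solr as suggested above, one way to look for "noise" is to count the control characters in it. The following is only a sketch: control_char_report is a hypothetical helper, the two strings are taken from the samples quoted in this thread, and the ^H is assumed to stand for a literal backspace (U+0008).

```python
import unicodedata
from collections import Counter


def control_char_report(text: str) -> Counter:
    """Count characters in Unicode category Cc, ignoring tab, CR and LF."""
    return Counter(
        f"U+{ord(ch):04X}"
        for ch in text
        if unicodedata.category(ch) == "Cc" and ch not in "\t\r\n"
    )


# Sample strings modelled on the two extraction results in this thread.
usual = "Lorem ipsum dolor sit amet"
form = "Bitte\x08legen\x08Sie\x08dem\x08Antrag Kopien"

print(control_char_report(usual))  # Counter()
print(control_char_report(form))   # Counter({'U+0008': 4})
```

An empty report for the usual file and a non-empty one for the form would point at the extraction step rather than at Solr's analysis chain.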
> >
> > -----Original Message-----
> > From: Erick Erickson [mailto:erickerick...@gmail.com]
> > Sent: Tuesday, April 28, 2015 9:07 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Odp.: solr issue with pdf forms
> >
> > There better be.
> >
> > 1> go to the admin UI
> > 2> select a core
> > 3> select "schema browser"
> > 4> select a field from the drop-down
> >
> > Until you do step 4 the window will be pretty blank.
> >
> > Here's the info for TermsComponent, what have you tried?
> >
> > https://cwiki.apache.org/confluence/display/solr/The+Terms+Component
> >
> > Best,
> > Erick
> >
> > On Tue, Apr 28, 2015 at 1:04 PM, <steve.sch...@t-systems.com> wrote:
> >> Thanks a lot for being patient with me. Unfortunately there is no
> >> button "load term info". :-( Could you maybe help me use the
> >> TermsComponent instead? I read that it is configured by default.
> >>
> >> Thanks a lot
> >> Best
> >> Steve
> >>
> >> -----Original Message-----
> >> From: Erick Erickson [mailto:erickerick...@gmail.com]
> >> Sent: Monday, April 27, 2015 17:23
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: Odp.: solr issue with pdf forms
> >>
> >> We're still not quite there. There should be a "load term info" button
> >> on that page. Clicking that button will show you the terms in your index
> >> (as opposed to the raw stored input, which is what you get when you look
> >> at results in the browser). My bet is that you'll see perfectly normal
> >> tokens in the index that will NOT have the wonky characters you see in
> >> the display.
> >>
> >> If that's the case, then you have a browser issue; Solr is working
> >> perfectly fine. On the other hand, if the individual terms are weird,
> >> then you have something more fundamental going on.
> >>
> >> Which is why I mentioned the TermsComponent. That will return indexed
> >> tokens, and allows you a bit more flexibility than the admin page in
> >> terms of what tokens you see, but it's essentially the same information.
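For anyone who wants to try the TermsComponent route, a request can be put together roughly like this. It is a sketch under assumptions: the core name "mycore" is a placeholder, a /terms request handler is assumed to be registered in solrconfig.xml (as in the stock example configuration), and terms_url is a made-up helper name. The parameters terms.fl, terms.prefix and terms.limit are the ones described on the TermsComponent page linked above.

```python
from urllib.parse import urlencode


def terms_url(base: str, field: str, prefix: str = "", limit: int = 20) -> str:
    """Build a TermsComponent request URL for one field.

    base is the core URL, e.g. http://localhost:8983/solr/<core>;
    the /terms handler is assumed to exist for that core.
    """
    params = {"terms.fl": field, "terms.limit": limit, "wt": "json"}
    if prefix:
        # Only list indexed terms that start with this prefix.
        params["terms.prefix"] = prefix
    return f"{base}/terms?{urlencode(params)}"


print(terms_url("http://localhost:8983/solr/mycore", "content", prefix="mein"))
```

Opening the resulting URL in a browser should list indexed terms from the content field, which is the information the "load term info" button would otherwise show.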
> >>
> >> Best,
> >> Erick
> >>
> >> On Sun, Apr 26, 2015 at 11:18 PM, <steve.sch...@t-systems.com> wrote:
> >>> Erick,
> >>>
> >>> thanks a lot for helping me here. In my case it is the "content" field
> >>> which is not displayed correctly. So I went to the schema browser like
> >>> you pointed out. Here is the information I found:
> >>> Field: content
> >>> Field Type: text
> >>> Properties: Indexed, Tokenized, Stored, TermVector Stored
> >>> Schema: Indexed, Tokenized, Stored, TermVector Stored
> >>> Index: Indexed, Tokenized, Stored, TermVector Stored
> >>> Copied Into: spell teaser
> >>> Position Increment Gap: 100
> >>> Index Analyzer: org.apache.solr.analysis.TokenizerChain Details
> >>> Tokenizer Class: org.apache.solr.analysis.WhitespaceTokenizerFactory
> >>> Filters:
> >>> org.apache.solr.analysis.WordDelimiterFilterFactory
> >>>   args:{preserveOriginal: 1 splitOnCaseChange: 0 generateNumberParts: 1
> >>>   catenateWords: 1 luceneMatchVersion: LUCENE_36 generateWordParts: 1
> >>>   catenateAll: 0 catenateNumbers: 1 }
> >>> org.apache.solr.analysis.LowerCaseFilterFactory
> >>>   args:{luceneMatchVersion: LUCENE_36 }
> >>> org.apache.solr.analysis.SynonymFilterFactory
> >>>   args:{synonyms: german/synonyms.txt expand: true ignoreCase: true
> >>>   luceneMatchVersion: LUCENE_36 }
> >>> org.apache.solr.analysis.DictionaryCompoundWordTokenFilterFactory
> >>>   args:{maxSubwordSize: 15 onlyLongestMatch: false minSubwordSize: 4
> >>>   minWordSize: 5 dictionary: german/german-common-nouns.txt
> >>>   luceneMatchVersion: LUCENE_36 }
> >>> org.apache.solr.analysis.StopFilterFactory
> >>>   args:{words: german/stopwords.txt ignoreCase: true
> >>>   enablePositionIncrements: true luceneMatchVersion: LUCENE_36 }
> >>> org.apache.solr.analysis.GermanNormalizationFilterFactory
> >>>   args:{luceneMatchVersion: LUCENE_36 }
> >>> org.apache.solr.analysis.SnowballPorterFilterFactory
> >>>   args:{protected: german/protwords.txt language: German2
> >>>   luceneMatchVersion: LUCENE_36 }
> >>> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory
> >>>   args:{luceneMatchVersion: LUCENE_36 }
> >>> Query Analyzer: org.apache.solr.analysis.TokenizerChain Details
> >>> Tokenizer Class: org.apache.solr.analysis.WhitespaceTokenizerFactory
> >>> Filters:
> >>> org.apache.solr.analysis.WordDelimiterFilterFactory
> >>>   args:{preserveOriginal: 1 splitOnCaseChange: 0 generateNumberParts: 1
> >>>   catenateWords: 0 luceneMatchVersion: LUCENE_36 generateWordParts: 1
> >>>   catenateAll: 0 catenateNumbers: 0 }
> >>> org.apache.solr.analysis.LowerCaseFilterFactory
> >>>   args:{luceneMatchVersion: LUCENE_36 }
> >>> org.apache.solr.analysis.StopFilterFactory
> >>>   args:{words: german/stopwords.txt ignoreCase: true
> >>>   enablePositionIncrements: true luceneMatchVersion: LUCENE_36 }
> >>> org.apache.solr.analysis.GermanNormalizationFilterFactory
> >>>   args:{luceneMatchVersion: LUCENE_36 }
> >>> org.apache.solr.analysis.SnowballPorterFilterFactory
> >>>   args:{protected: german/protwords.txt language: German2
> >>>   luceneMatchVersion: LUCENE_36 }
> >>> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory
> >>>   args:{luceneMatchVersion: LUCENE_36 }
> >>> Distinct: 160403
> >>>
> >>> Does this somehow help to figure out the issue?
> >>> Thanks
> >>> Best
> >>> Steve
> >>>
> >>>
> >>> -----Original Message-----
> >>> From: Erick Erickson [mailto:erickerick...@gmail.com]
> >>> Sent: Friday, April 24, 2015 20:15
> >>> To: solr-user@lucene.apache.org
> >>> Subject: Re: Odp.: solr issue with pdf forms
> >>>
> >>> Steve:
> >>>
> >>> Right, it's not exactly obvious. Bring up the admin UI, something like
> >>> http://localhost:8983/solr. From there you have to select a core in the
> >>> 'core selector' drop-down on the left side. If you're using SolrCloud,
> >>> this will have a rather strange name, but it should be easy to identify
> >>> what collection it belongs to.
> >>>
> >>> At that point you'll see a bunch of new options, among them "schema browser".
> >>> From there, select your field from the drop-down that will appear, then
> >>> a button should pop up: "load term info".
> >>>
> >>> NOTE: you can get the same information from the TermsComponent, see:
> >>> https://cwiki.apache.org/confluence/display/solr/The+Terms+Component.
> >>> This is a little more flexible because you can, among other things,
> >>> specify the place to start. In your case you might specify
> >>> terms.prefix=mein which will show you the terms that are actually being
> >>> _searched_ as opposed to being stored. The latter is what you see in the
> >>> browser when you search for docs, and it is sometimes misleading, as
> >>> you're (probably) seeing.
> >>>
> >>> Best,
> >>> Erick
> >>>
> >>> On Fri, Apr 24, 2015 at 1:58 AM, <steve.sch...@t-systems.com> wrote:
> >>>> Hey Erick,
> >>>>
> >>>> thanks a lot for your answer. I went to the admin schema browser,
> >>>> but what should I see there? Sorry, I'm not familiar with the admin
> >>>> schema browser. :-(
> >>>>
> >>>> Best
> >>>> Steve
> >>>>
> >>>>
> >>>> -----Original Message-----
> >>>> From: Erick Erickson [mailto:erickerick...@gmail.com]
> >>>> Sent: Thursday, April 23, 2015 18:00
> >>>> To: solr-user@lucene.apache.org
> >>>> Subject: Re: Odp.: solr issue with pdf forms
> >>>>
> >>>> When you say "they're not indexed correctly", what's your evidence?
> >>>> You cannot rely on the display in the browser; that's the raw input
> >>>> just as it was sent to Solr, _not_ the actual tokens in the index.
> >>>> What do you see when you go to the admin schema browser page and load
> >>>> the actual tokens?
> >>>>
> >>>> Or use the TermsComponent
> >>>> (https://cwiki.apache.org/confluence/display/solr/The+Terms+Component)
> >>>> to see the actual terms in the index as opposed to the stored data
> >>>> you see in the browser when you look at search results.
> >>>>
> >>>> If the actual terms don't seem right _in the index_ we need to see
> >>>> your analysis chain, i.e. your fieldType definition.
> >>>>
> >>>> I'm 90% sure you're seeing the stored data and your terms are
> >>>> indexed just fine, but I've certainly been wrong before, more times
> >>>> than I want to remember.....
> >>>>
> >>>> Best,
> >>>> Erick
> >>>>
> >>>> On Thu, Apr 23, 2015 at 1:18 AM, <steve.sch...@t-systems.com> wrote:
> >>>>> Hey Erick,
> >>>>>
> >>>>> thanks for your answer. They are not indexed correctly. Also, through
> >>>>> the Solr admin interface I see these typical question marks within
> >>>>> a rhombus where a blank space should be.
> >>>>> I have now figured out the following (not sure if it is relevant at all):
> >>>>> - PDF documents created with "Acrobat PDFMaker 10.0 for Word" are
> >>>>> indexed correctly, no issues
> >>>>> - PDF documents (with editable form fields) created with "Adobe
> >>>>> InDesign CS5 (7.0.1)" are indexed with the blank space issue
> >>>>>
> >>>>> Best
> >>>>> Steve
> >>>>>
> >>>>> -----Original Message-----
> >>>>> From: Erick Erickson [mailto:erickerick...@gmail.com]
> >>>>> Sent: Wednesday, April 22, 2015 17:11
> >>>>> To: solr-user@lucene.apache.org
> >>>>> Subject: Re: Odp.: solr issue with pdf forms
> >>>>>
> >>>>> Are they not _indexed_ correctly or not being displayed correctly?
> >>>>> Take a look at admin UI>>schema browser>> your field and press the
> >>>>> "load terms" button. That'll show you what is _in_ the index as
> >>>>> opposed to what the raw data looked like.
> >>>>>
> >>>>> When you return the field in a Solr search, you get a verbatim,
> >>>>> un-analyzed copy of your original input. My guess is that your
> >>>>> browser isn't using a compatible character encoding for display.
> >>>>>
> >>>>> Best,
> >>>>> Erick
> >>>>>
> >>>>> On Wed, Apr 22, 2015 at 7:08 AM, <steve.sch...@t-systems.com> wrote:
> >>>>>> Thanks for your answer. Maybe my English is not good enough; what
> >>>>>> are you trying to say? Sorry, I didn't get the point.
> >>>>>> :-(
> >>>>>>
> >>>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: LAFK [mailto:tomasz.bo...@gmail.com]
> >>>>>> Sent: Wednesday, April 22, 2015 14:01
> >>>>>> To: solr-user@lucene.apache.org; solr-user@lucene.apache.org
> >>>>>> Subject: Odp.: solr issue with pdf forms
> >>>>>>
> >>>>>> Off the top of my head, I'd look into how writable PDFs are created
> >>>>>> and encoded.
> >>>>>>
> >>>>>> @LAFK_PL
> >>>>>> Original message
> >>>>>> From: steve.sch...@t-systems.com
> >>>>>> Sent: Wednesday, April 22, 2015 12:41
> >>>>>> To: solr-user@lucene.apache.org
> >>>>>> Reply-To: solr-user@lucene.apache.org
> >>>>>> Subject: solr issue with pdf forms
> >>>>>>
> >>>>>> Hi guys,
> >>>>>>
> >>>>>> hopefully you can help me with my issue. We are using a Solr setup
> >>>>>> and have the following issue:
> >>>>>> - usual pdf files are indexed just fine
> >>>>>> - pdf files with writable form-fields look like this:
> >>>>>> Ich bestätige mit meiner Unterschrift, dass alle Angaben korrekt
> >>>>>> und v ollständig sind
> >>>>>>
> >>>>>> Somehow the blank space character is not indexed correctly.
> >>>>>>
> >>>>>> Is this a known issue? Does anybody have an idea?
> >>>>>>
> >>>>>> Thanks a lot
> >>>>>> Best
> >>>>>> Steve