Jack: I keep forgetting those things exist, thanks for the reminder!
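
For anyone following along, the scrubbing idea boils down to a regex replacement over the source value before it is indexed. Below is a minimal Python sketch of that idea, not the Solr update processor itself. It assumes the "^H" in the pdfbox output is the backspace control character (U+0008) and that turning it into a space is the wanted result; the name scrub and the exact character class are illustrative choices, not anything taken from Solr or Tika.

```python
import re

# Assumption: the "^H" seen in the pdfbox output is the backspace control
# character (U+0008). This class matches ASCII control characters except
# tab, line feed and carriage return.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def scrub(text: str) -> str:
    """Replace each control character with a space, then collapse space runs."""
    spaced = CONTROL_CHARS.sub(" ", text)
    return re.sub(r" {2,}", " ", spaced)


# Sample modelled on the pdf form output quoted further down the thread.
sample = "Bitte\x08legen\x08Sie\x08dem\x08Antrag Kopien"
print(scrub(sample))
```

A similar pattern could presumably be used with the regex replace processor or in client-side SolrJ code, but the details there would need checking against the actual setup.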
On Thu, Apr 30, 2015 at 8:23 AM, Jack Krupansky <jack.krupan...@gmail.com> wrote:
> Or use a Solr update processor to scrub the source values. The regex
> pattern replacement processor could do the trick:
> http://lucene.apache.org/solr/5_1_0/solr-core/org/apache/solr/update/processor/RegexReplaceProcessorFactory.html
>
> -- Jack Krupansky
>
> On Thu, Apr 30, 2015 at 11:17 AM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> OK, given all that, Tika _is_ sending the weird characters to Solr. You
>> can get them out of the index by using something like
>> PatternReplaceTokenFilterFactory or PatternReplaceCharFilterFactory in
>> your analysis chain. However, you'll still be stuck with the odd
>> characters showing up in your browser.
>>
>> Don't know what options are configurable for Tika to not return these
>> so I'm not much help there.
>>
>> You could, however, use Tika with SolrJ to do this on the client and
>> scrub the data there. Here's an example.
>>
>> https://lucidworks.com/blog/indexing-with-solrj/
>>
>> Best,
>> Erick
>>
>> On Thu, Apr 30, 2015 at 12:03 AM, <steve.sch...@t-systems.com> wrote:
>> > Hey, thanks a lot for the hint with pdfbox-app.jar.
>> > For testing purposes I now extracted an affected pdf form and a usual
>> > pdf file.
>> > The result is the following:
>> >
>> > Usual pdf file:
>> > Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut
>> > labore et d
>> >
>> > pdf form:
>> > Bitte^Hlegen^HSie^Hdem^HAntrag Kopien aller Einkommensnachweise bei.^HDaz
>> >
>> > Best
>> > Steve
>> >
>> > -----Original Message-----
>> > From: Allison, Timothy B. [mailto:talli...@mitre.org]
>> > Sent: Wednesday, April 29, 2015 14:16
>> > To: solr-user@lucene.apache.org
>> > Cc: u...@tika.apache.org
>> > Subject: RE: Odp.: solr issue with pdf forms
>> >
>> > I completely agree with Erick about the utility of the TermsComponent to
>> > see what is actually being indexed.
>> > If you find problems there and if you haven't done so already, you might
>> > also investigate further down the stack. It might make sense to run the
>> > tika-app.jar (whichever version you are using in DIH or other mechanism?)
>> > or even the pdfbox-app.jar (ExtractText option) on your files outside of
>> > Solr to see what text/noise you're getting for the files that are causing
>> > problems.
>> >
>> > -----Original Message-----
>> > From: Erick Erickson [mailto:erickerick...@gmail.com]
>> > Sent: Tuesday, April 28, 2015 9:07 PM
>> > To: solr-user@lucene.apache.org
>> > Subject: Re: Odp.: solr issue with pdf forms
>> >
>> > There better be.
>> >
>> > 1> go to the admin UI
>> > 2> select a core
>> > 3> select "schema browser"
>> > 4> select a field from the drop-down
>> >
>> > Until you do step 4 the window will be pretty blank.
>> >
>> > Here's the info for TermsComponent, what have you tried?
>> >
>> > https://cwiki.apache.org/confluence/display/solr/The+Terms+Component
>> >
>> > Best,
>> > Erick
>> >
>> > On Tue, Apr 28, 2015 at 1:04 PM, <steve.sch...@t-systems.com> wrote:
>> >> Thanks a lot for being patient with me. Unfortunately there is no
>> >> button "load term info". :-( Could you maybe help me use the
>> >> TermsComponent instead? I read it is configured by default.
>> >>
>> >> Thanks a lot
>> >> Best
>> >> Steve
>> >>
>> >> -----Original Message-----
>> >> From: Erick Erickson [mailto:erickerick...@gmail.com]
>> >> Sent: Monday, April 27, 2015 17:23
>> >> To: solr-user@lucene.apache.org
>> >> Subject: Re: Odp.: solr issue with pdf forms
>> >>
>> >> We're still not quite there. There should be a "load term info" button
>> >> on that page. Clicking that button will show you the terms in your index
>> >> (as opposed to the raw stored input which is what you get when you look at
>> >> results in the browser). My bet is that you'll see perfectly normal tokens
>> >> in the index that will NOT have the wonky characters you see in the display.
>> >>
>> >> If that's the case, then you have a browser issue, Solr is working
>> >> perfectly fine. On the other hand, if the individual terms are weird, then
>> >> you have something more fundamental going on.
>> >>
>> >> Which is why I mentioned the TermsComponent. That will return indexed
>> >> tokens, and allows you a bit more flexibility than the admin page in terms
>> >> of what tokens you see, but it's essentially the same information.
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Sun, Apr 26, 2015 at 11:18 PM, <steve.sch...@t-systems.com> wrote:
>> >>> Erick,
>> >>>
>> >>> thanks a lot for helping me here. In my case it is the "content" field
>> >>> which is not displayed correctly. So I went to the schema browser like you
>> >>> pointed out. Here is the information I found:
>> >>> Field: content
>> >>> Field Type: text
>> >>> Properties: Indexed, Tokenized, Stored, TermVector Stored
>> >>> Schema: Indexed, Tokenized, Stored, TermVector Stored
>> >>> Index: Indexed, Tokenized, Stored, TermVector Stored
>> >>> Copied Into: spell teaser
>> >>> Position Increment Gap: 100
>> >>> Index Analyzer: org.apache.solr.analysis.TokenizerChain Details
>> >>> Tokenizer Class: org.apache.solr.analysis.WhitespaceTokenizerFactory
>> >>> Filters:
>> >>> org.apache.solr.analysis.WordDelimiterFilterFactory
>> >>> args:{preserveOriginal: 1 splitOnCaseChange: 0 generateNumberParts: 1
>> >>> catenateWords: 1 luceneMatchVersion: LUCENE_36 generateWordParts: 1
>> >>> catenateAll: 0 catenateNumbers: 1 }
>> >>> org.apache.solr.analysis.LowerCaseFilterFactory
>> >>> args:{luceneMatchVersion: LUCENE_36 }
>> >>> org.apache.solr.analysis.SynonymFilterFactory args:{synonyms:
>> >>> german/synonyms.txt expand: true ignoreCase: true luceneMatchVersion:
>> >>> LUCENE_36 }
>> >>> org.apache.solr.analysis.DictionaryCompoundWordTokenFilterFactory
>> >>> args:{maxSubwordSize: 15 onlyLongestMatch: false minSubwordSize: 4
>> >>> minWordSize: 5 dictionary: german/german-common-nouns.txt
>> >>> luceneMatchVersion: LUCENE_36 }
>> >>> org.apache.solr.analysis.StopFilterFactory args:{words:
>> >>> german/stopwords.txt ignoreCase: true enablePositionIncrements: true
>> >>> luceneMatchVersion: LUCENE_36 }
>> >>> org.apache.solr.analysis.GermanNormalizationFilterFactory
>> >>> args:{luceneMatchVersion: LUCENE_36 }
>> >>> org.apache.solr.analysis.SnowballPorterFilterFactory args:{protected:
>> >>> german/protwords.txt language: German2 luceneMatchVersion: LUCENE_36 }
>> >>> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory
>> >>> args:{luceneMatchVersion: LUCENE_36 }
>> >>> Query Analyzer: org.apache.solr.analysis.TokenizerChain Details
>> >>> Tokenizer Class: org.apache.solr.analysis.WhitespaceTokenizerFactory
>> >>> Filters:
>> >>> org.apache.solr.analysis.WordDelimiterFilterFactory
>> >>> args:{preserveOriginal: 1 splitOnCaseChange: 0 generateNumberParts: 1
>> >>> catenateWords: 0 luceneMatchVersion: LUCENE_36 generateWordParts: 1
>> >>> catenateAll: 0 catenateNumbers: 0 }
>> >>> org.apache.solr.analysis.LowerCaseFilterFactory
>> >>> args:{luceneMatchVersion: LUCENE_36 }
>> >>> org.apache.solr.analysis.StopFilterFactory args:{words:
>> >>> german/stopwords.txt ignoreCase: true enablePositionIncrements: true
>> >>> luceneMatchVersion: LUCENE_36 }
>> >>> org.apache.solr.analysis.GermanNormalizationFilterFactory
>> >>> args:{luceneMatchVersion: LUCENE_36 }
>> >>> org.apache.solr.analysis.SnowballPorterFilterFactory args:{protected:
>> >>> german/protwords.txt language: German2 luceneMatchVersion: LUCENE_36 }
>> >>> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory
>> >>> args:{luceneMatchVersion: LUCENE_36 }
>> >>> Distinct: 160403
>> >>>
>> >>> Does this somehow help to figure out the issue?
>> >>> Thanks
>> >>> Best
>> >>> Steve
>> >>>
>> >>>
>> >>> -----Original Message-----
>> >>> From: Erick Erickson [mailto:erickerick...@gmail.com]
>> >>> Sent: Friday, April 24, 2015 20:15
>> >>> To: solr-user@lucene.apache.org
>> >>> Subject: Re: Odp.: solr issue with pdf forms
>> >>>
>> >>> Steve:
>> >>>
>> >>> Right, it's not exactly obvious. Bring up the admin UI, something like
>> >>> http://localhost:8983/solr. From there you have to select a core in the
>> >>> 'core selector' drop-down on the left side. If you're using SolrCloud, this
>> >>> will have a rather strange name, but it should be easy to identify what
>> >>> collection it belongs to.
>> >>>
>> >>> At that point you'll see a bunch of new options, among them "schema
>> >>> browser". From there, select your field from the drop-down that will
>> >>> appear, then a button should pop up "load term info".
>> >>>
>> >>> NOTE: you can get the same information from the TermsComponent, see:
>> >>> https://cwiki.apache.org/confluence/display/solr/The+Terms+Component.
>> >>> This is a little more flexible because you can, among other things,
>> >>> specify the place to start. In your case you might specify
>> >>> terms.prefix=mein which will show you the terms that are actually being
>> >>> _searched_ as opposed to being stored. The latter is what you see in the
>> >>> browser when you search for docs and is sometimes misleading, as you're
>> >>> (probably) seeing.
>> >>>
>> >>> Best,
>> >>> Erick
>> >>>
>> >>> On Fri, Apr 24, 2015 at 1:58 AM, <steve.sch...@t-systems.com> wrote:
>> >>>> Hey Erick,
>> >>>>
>> >>>> thanks a lot for your answer. I went to the admin schema browser,
>> >>>> but what should I see there? Sorry, I'm not familiar with the admin
>> >>>> schema browser. :-(
>> >>>>
>> >>>> Best
>> >>>> Steve
>> >>>>
>> >>>>
>> >>>> -----Original Message-----
>> >>>> From: Erick Erickson [mailto:erickerick...@gmail.com]
>> >>>> Sent: Thursday, April 23, 2015 18:00
>> >>>> To: solr-user@lucene.apache.org
>> >>>> Subject: Re: Odp.: solr issue with pdf forms
>> >>>>
>> >>>> When you say "they're not indexed correctly", what's your evidence?
>> >>>> You cannot rely on the display in the browser, that's the raw input
>> >>>> just as it was sent to Solr, _not_ the actual tokens in the index. What
>> >>>> do you see when you go to the admin schema browser page and load the
>> >>>> actual tokens?
>> >>>>
>> >>>> Or use the TermsComponent
>> >>>> (https://cwiki.apache.org/confluence/display/solr/The+Terms+Component)
>> >>>> to see the actual terms in the index as opposed to the stored data
>> >>>> you see in the browser when you look at search results.
>> >>>>
>> >>>> If the actual terms don't seem right _in the index_ we need to see
>> >>>> your analysis chain, i.e. your fieldType definition.
>> >>>>
>> >>>> I'm 90% sure you're seeing the stored data and your terms are
>> >>>> indexed just fine, but I've certainly been wrong before, more times than I
>> >>>> want to remember.....
>> >>>>
>> >>>> Best,
>> >>>> Erick
>> >>>>
>> >>>> On Thu, Apr 23, 2015 at 1:18 AM, <steve.sch...@t-systems.com> wrote:
>> >>>>> Hey Erick,
>> >>>>>
>> >>>>> thanks for your answer. They are not indexed correctly. Also,
>> >>>>> through the Solr admin interface I see these typical question marks
>> >>>>> within a rhombus where a blank space should be.
>> >>>>> I now figured out the following (not sure if it is relevant at all):
>> >>>>> - PDF documents created with "Acrobat PDFMaker 10.0 for Word" are
>> >>>>> indexed correctly, no issues
>> >>>>> - PDF documents (with editable form fields) created with "Adobe
>> >>>>> InDesign CS5 (7.0.1)" are indexed with the blank space issue
>> >>>>>
>> >>>>> Best
>> >>>>> Steve
>> >>>>>
>> >>>>> -----Original Message-----
>> >>>>> From: Erick Erickson [mailto:erickerick...@gmail.com]
>> >>>>> Sent: Wednesday, April 22, 2015 17:11
>> >>>>> To: solr-user@lucene.apache.org
>> >>>>> Subject: Re: Odp.: solr issue with pdf forms
>> >>>>>
>> >>>>> Are they not _indexed_ correctly or not being displayed correctly?
>> >>>>> Take a look at admin UI>>schema browser>> your field and press the
>> >>>>> "load terms" button.
>> >>>>> That'll show you what is _in_ the index as opposed to
>> >>>>> what the raw data looked like.
>> >>>>>
>> >>>>> When you return the field in a Solr search, you get a verbatim,
>> >>>>> un-analyzed copy of your original input. My guess is that your browser
>> >>>>> isn't using a compatible character encoding for display.
>> >>>>>
>> >>>>> Best,
>> >>>>> Erick
>> >>>>>
>> >>>>> On Wed, Apr 22, 2015 at 7:08 AM, <steve.sch...@t-systems.com> wrote:
>> >>>>>> Thanks for your answer. Maybe my English is not good enough, what
>> >>>>>> are you trying to say? Sorry, I didn't get the point.
>> >>>>>> :-(
>> >>>>>>
>> >>>>>>
>> >>>>>> -----Original Message-----
>> >>>>>> From: LAFK [mailto:tomasz.bo...@gmail.com]
>> >>>>>> Sent: Wednesday, April 22, 2015 14:01
>> >>>>>> To: solr-user@lucene.apache.org; solr-user@lucene.apache.org
>> >>>>>> Subject: Odp.: solr issue with pdf forms
>> >>>>>>
>> >>>>>> Off the top of my head, I'd look into how writable PDFs are created
>> >>>>>> and encoded.
>> >>>>>>
>> >>>>>> @LAFK_PL
>> >>>>>> Original Message
>> >>>>>> From: steve.sch...@t-systems.com
>> >>>>>> Sent: Wednesday, April 22, 2015 12:41
>> >>>>>> To: solr-user@lucene.apache.org
>> >>>>>> Reply-To: solr-user@lucene.apache.org
>> >>>>>> Subject: solr issue with pdf forms
>> >>>>>>
>> >>>>>> Hi guys,
>> >>>>>>
>> >>>>>> hopefully you can help me with my issue. We are using a solr setup
>> >>>>>> and have the following issue:
>> >>>>>> - usual pdf files are indexed just fine
>> >>>>>> - pdf files with writable form-fields look like this:
>> >>>>>> Ich bestätige mit meiner Unterschrift, dass alle Angaben korrekt
>> >>>>>> und v ollständig sind
>> >>>>>>
>> >>>>>> Somehow the blank space character is not indexed correctly.
>> >>>>>>
>> >>>>>> Is this a known issue? Does anybody have an idea?
>> >>>>>>
>> >>>>>> Thanks a lot
>> >>>>>> Best
>> >>>>>> Steve
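
Since the TermsComponent comes up several times in the thread above, here is a rough Python sketch of how such a request URL might be put together. The parameters terms.fl, terms.prefix and terms.limit are standard TermsComponent parameters; the "/terms" handler path, the core name "mycore" and the helper name terms_url are assumptions for illustration and depend on the local solrconfig.xml.

```python
from urllib.parse import urlencode


def terms_url(base: str, field: str, prefix: str, limit: int = 20) -> str:
    """Build a TermsComponent request URL for one field and term prefix.

    Assumes a request handler is registered at "/terms" for the core.
    """
    params = {
        "terms.fl": field,       # field whose indexed terms are listed
        "terms.prefix": prefix,  # only return terms starting with this
        "terms.limit": limit,    # maximum number of terms to return
        "wt": "json",
    }
    return f"{base}/terms?{urlencode(params)}"


# "mycore" is a hypothetical core name; "content" and "mein" come from the thread.
print(terms_url("http://localhost:8983/solr/mycore", "content", "mein"))
```

Opening the resulting URL in a browser or with curl should list indexed terms rather than stored values, which is the distinction the thread keeps coming back to.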