Sorry, but there really isn't... :-/

I have never used the Terms Component before. So I first checked whether it is configured, and it really is. Then I tried to get an idea of how it works and tried the examples described in the documentation. After that I tried to figure out how to get the output for the "miscoded" PDF content. My first step was to find the fields I need:
http://IP:8080/solr/core_de/terms?terms.fl=content&terms.fl=fileReferenceDocumentId&terms.fl=fileName

This gives me a top 10 list of the indexed documents and shows the fields content, fileReferenceDocumentId and fileName, if I understand the documentation correctly. Next I tried to limit the output to the specific file that has the encoding issues:

http://IP:8080/solr/core_de/terms?terms.fl=content&terms.fl=fileReferenceDocumentId&terms.fl=fileName&terms.prefix=CODING-ISSUE.pdf

But then the content of the content field is no longer shown. :-( The result looks like this:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
  </lst>
  <lst name="terms">
    <lst name="content"/>
    <lst name="fileReferenceDocumentId"/>
    <lst name="fileName">
      <int name=" CODING-ISSUE.pdf ">3</int>
    </lst>
  </lst>
</response>

Any help would be appreciated.

Thanks a lot
Best
Steve

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Wednesday, 29 April 2015 03:07
To: solr-user@lucene.apache.org
Subject: Re: Odp.: solr issue with pdf forms

There better be.
1> go to the admin UI
2> select a core
3> select "schema browser"
4> select a field from the drop-down

Until you do step 4 the window will be pretty blank.

Here's the info for TermsComponent, what have you tried?
https://cwiki.apache.org/confluence/display/solr/The+Terms+Component

Best,
Erick

On Tue, Apr 28, 2015 at 1:04 PM, <steve.sch...@t-systems.com> wrote:
> Thanks a lot for being patient with me. Unfortunately there is no
> button "load term info". :-( Could you maybe help me use the TermsComponent
> instead? I read that it is configured by default.
>
> Thanks a lot
> Best
> Steve
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Monday, 27 April 2015 17:23
> To: solr-user@lucene.apache.org
> Subject: Re: Odp.: solr issue with pdf forms
>
> We're still not quite there.
> There should be a "load term info" button on
> that page. Clicking that button will show you the terms in your index (as
> opposed to the raw stored input which is what you get when you look at
> results in the browser). My bet is that you'll see perfectly normal tokens in
> the index that will NOT have the wonky characters you see in the display.
>
> If that's the case, then you have a browser issue, Solr is working perfectly
> fine. On the other hand, if the individual terms are weird, then you have
> something more fundamental going on.
>
> Which is why I mentioned the TermsComponent. That will return indexed tokens,
> and allows you a bit more flexibility than the admin page in terms of what
> tokens you see, but it's essentially the same information.
>
> Best,
> Erick
>
> On Sun, Apr 26, 2015 at 11:18 PM, <steve.sch...@t-systems.com> wrote:
>> Erick,
>>
>> thanks a lot for helping me here. In my case it is the "content" field which
>> is not displayed correctly. So I went to the schema browser like you pointed
>> out.
>> Here is the information I found:
>> Field: content
>> Field Type: text
>> Properties: Indexed, Tokenized, Stored, TermVector Stored
>> Schema: Indexed, Tokenized, Stored, TermVector Stored
>> Index: Indexed, Tokenized, Stored, TermVector Stored
>> Copied Into: spell teaser
>> Position Increment Gap: 100
>> Index Analyzer: org.apache.solr.analysis.TokenizerChain Details
>> Tokenizer Class: org.apache.solr.analysis.WhitespaceTokenizerFactory
>> Filters:
>> org.apache.solr.analysis.WordDelimiterFilterFactory
>>   args:{preserveOriginal: 1 splitOnCaseChange: 0 generateNumberParts: 1 catenateWords: 1 luceneMatchVersion: LUCENE_36 generateWordParts: 1 catenateAll: 0 catenateNumbers: 1 }
>> org.apache.solr.analysis.LowerCaseFilterFactory
>>   args:{luceneMatchVersion: LUCENE_36 }
>> org.apache.solr.analysis.SynonymFilterFactory
>>   args:{synonyms: german/synonyms.txt expand: true ignoreCase: true luceneMatchVersion: LUCENE_36 }
>> org.apache.solr.analysis.DictionaryCompoundWordTokenFilterFactory
>>   args:{maxSubwordSize: 15 onlyLongestMatch: false minSubwordSize: 4 minWordSize: 5 dictionary: german/german-common-nouns.txt luceneMatchVersion: LUCENE_36 }
>> org.apache.solr.analysis.StopFilterFactory
>>   args:{words: german/stopwords.txt ignoreCase: true enablePositionIncrements: true luceneMatchVersion: LUCENE_36 }
>> org.apache.solr.analysis.GermanNormalizationFilterFactory
>>   args:{luceneMatchVersion: LUCENE_36 }
>> org.apache.solr.analysis.SnowballPorterFilterFactory
>>   args:{protected: german/protwords.txt language: German2 luceneMatchVersion: LUCENE_36 }
>> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory
>>   args:{luceneMatchVersion: LUCENE_36 }
>> Query Analyzer: org.apache.solr.analysis.TokenizerChain Details
>> Tokenizer Class: org.apache.solr.analysis.WhitespaceTokenizerFactory
>> Filters:
>> org.apache.solr.analysis.WordDelimiterFilterFactory
>>   args:{preserveOriginal: 1 splitOnCaseChange: 0 generateNumberParts: 1 catenateWords: 0
>>   luceneMatchVersion: LUCENE_36 generateWordParts: 1 catenateAll: 0 catenateNumbers: 0 }
>> org.apache.solr.analysis.LowerCaseFilterFactory
>>   args:{luceneMatchVersion: LUCENE_36 }
>> org.apache.solr.analysis.StopFilterFactory
>>   args:{words: german/stopwords.txt ignoreCase: true enablePositionIncrements: true luceneMatchVersion: LUCENE_36 }
>> org.apache.solr.analysis.GermanNormalizationFilterFactory
>>   args:{luceneMatchVersion: LUCENE_36 }
>> org.apache.solr.analysis.SnowballPorterFilterFactory
>>   args:{protected: german/protwords.txt language: German2 luceneMatchVersion: LUCENE_36 }
>> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory
>>   args:{luceneMatchVersion: LUCENE_36 }
>> Distinct: 160403
>>
>> Does this somehow help to figure out the issue?
>> Thanks
>> Best
>> Steve
>>
>>
>> -----Original Message-----
>> From: Erick Erickson [mailto:erickerick...@gmail.com]
>> Sent: Friday, 24 April 2015 20:15
>> To: solr-user@lucene.apache.org
>> Subject: Re: Odp.: solr issue with pdf forms
>>
>> Steve:
>>
>> Right, it's not exactly obvious. Bring up the admin UI, something like
>> http://localhost:8983/solr. From there you have to select a core in the
>> 'core selector' drop-down on the left side. If you're using SolrCloud, this
>> will have a rather strange name, but it should be easy to identify what
>> collection it belongs to.
>>
>> At that point you'll see a bunch of new options, among them "schema
>> browser". From there, select your field from the drop-down that will appear,
>> then a button should pop up "load term info".
>>
>> NOTE: you can get the same information from the TermsComponent, see:
>> https://cwiki.apache.org/confluence/display/solr/The+Terms+Component.
>> This is a little more flexible because you can, among other things, specify
>> the place to start. In your case you might specify terms.prefix=mein which
>> will show you the terms that are actually being _searched_ as opposed to
>> being stored.
>> This latter is what you see in the browser when you search for
>> docs and is sometimes misleading, as you're (probably) seeing.
>>
>> Best,
>> Erick
>>
>> On Fri, Apr 24, 2015 at 1:58 AM, <steve.sch...@t-systems.com> wrote:
>>> Hey Erick,
>>>
>>> thanks a lot for your answer. I went to the admin schema browser,
>>> but what should I see there? Sorry, I'm not familiar with the admin
>>> schema browser. :-(
>>>
>>> Best
>>> Steve
>>>
>>>
>>> -----Original Message-----
>>> From: Erick Erickson [mailto:erickerick...@gmail.com]
>>> Sent: Thursday, 23 April 2015 18:00
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Odp.: solr issue with pdf forms
>>>
>>> When you say "they're not indexed correctly", what's your evidence?
>>> You cannot rely on the display in the browser, that's the raw input just
>>> as it was sent to Solr, _not_ the actual tokens in the index. What do you
>>> see when you go to the admin schema browser page and load the actual tokens?
>>>
>>> Or use the TermsComponent
>>> (https://cwiki.apache.org/confluence/display/solr/The+Terms+Component)
>>> to see the actual terms in the index as opposed to the stored data
>>> you see in the browser when you look at search results.
>>>
>>> If the actual terms don't seem right _in the index_ we need to see your
>>> analysis chain, i.e. your fieldType definition.
>>>
>>> I'm 90% sure you're seeing the stored data and your terms are indexed just
>>> fine, but I've certainly been wrong before, more times than I want to
>>> remember.....
>>>
>>> Best,
>>> Erick
>>>
>>> On Thu, Apr 23, 2015 at 1:18 AM, <steve.sch...@t-systems.com> wrote:
>>>> Hey Erick,
>>>>
>>>> thanks for your answer. They are not indexed correctly. In the Solr admin
>>>> interface, too, I see these typical question marks within a rhombus
>>>> where a blank space should be.
>>>> I have now figured out the following (not sure if it is relevant at all):
>>>> - PDF documents created with "Acrobat PDFMaker 10.0 for Word" are
>>>>   indexed correctly, no issues
>>>> - PDF documents (with editable form fields) created with "Adobe
>>>>   InDesign CS5 (7.0.1)" are indexed with the blank space issue
>>>>
>>>> Best
>>>> Steve
>>>>
>>>> -----Original Message-----
>>>> From: Erick Erickson [mailto:erickerick...@gmail.com]
>>>> Sent: Wednesday, 22 April 2015 17:11
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Re: Odp.: solr issue with pdf forms
>>>>
>>>> Are they not _indexed_ correctly or not being displayed correctly?
>>>> Take a look at admin UI>>schema browser>> your field and press the "load
>>>> terms" button. That'll show you what is _in_ the index as opposed to what
>>>> the raw data looked like.
>>>>
>>>> When you return the field in a Solr search, you get a verbatim,
>>>> un-analyzed copy of your original input. My guess is that your browser
>>>> isn't using a compatible character encoding for display.
>>>>
>>>> Best,
>>>> Erick
>>>>
>>>> On Wed, Apr 22, 2015 at 7:08 AM, <steve.sch...@t-systems.com> wrote:
>>>>> Thanks for your answer. Maybe my English is not good enough; what are you
>>>>> trying to say? Sorry, I didn't get the point.
>>>>> :-(
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: LAFK [mailto:tomasz.bo...@gmail.com]
>>>>> Sent: Wednesday, 22 April 2015 14:01
>>>>> To: solr-user@lucene.apache.org; solr-user@lucene.apache.org
>>>>> Subject: Odp.: solr issue with pdf forms
>>>>>
>>>>> Off the top of my head, I'd look into how writable PDFs are created and
>>>>> encoded.
>>>>>
>>>>> @LAFK_PL
>>>>> Original Message
>>>>> From: steve.sch...@t-systems.com
>>>>> Sent: Wednesday, 22 April 2015 12:41
>>>>> To: solr-user@lucene.apache.org
>>>>> Reply-To: solr-user@lucene.apache.org
>>>>> Subject: solr issue with pdf forms
>>>>>
>>>>> Hi guys,
>>>>>
>>>>> hopefully you can help me with my issue.
>>>>> We are using a Solr setup and
>>>>> have the following issue:
>>>>> - usual PDF files are indexed just fine
>>>>> - PDF files with writable form fields look like this:
>>>>>   Ich bestätige mit meiner Unterschrift, dass alle Angaben korrekt
>>>>>   und v ollständig sind
>>>>>
>>>>> Somehow the blank space character is not indexed correctly.
>>>>>
>>>>> Is this a known issue? Does anybody have an idea?
>>>>>
>>>>> Thanks a lot
>>>>> Best
>>>>> Steve
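
For readers following the Terms Component requests discussed in this thread, here is a minimal Python sketch. It is not taken from any of the messages. It builds a /terms URL and parses an XML response shaped like the one posted at the top of the thread. The helper names build_terms_url and parse_terms_response are made up for illustration, and the host, port and core name are placeholders. One point to keep in mind, as far as the Terms Component documentation describes it: terms.prefix restricts the terms returned for each listed field to those starting with the prefix. It does not select a document, which would explain why the content field comes back empty for a prefix such as CODING-ISSUE.pdf.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


def build_terms_url(base_url, core, fields, prefix=None, limit=10):
    """Build a Terms Component request URL (hypothetical helper; sends nothing)."""
    # One terms.fl parameter per field, as in the URLs from the thread.
    params = [("terms.fl", field) for field in fields]
    if prefix is not None:
        # terms.prefix filters the indexed *terms* of every listed field;
        # it is not a document filter.
        params.append(("terms.prefix", prefix))
    params.append(("terms.limit", str(limit)))
    return f"{base_url}/{core}/terms?{urlencode(params)}"


def parse_terms_response(xml_text):
    """Return {field: {term: count}} from an XML terms response (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    result = {}
    for lst in root.findall("lst"):
        if lst.get("name") != "terms":
            continue  # skip responseHeader
        for field in lst.findall("lst"):
            result[field.get("name")] = {
                term.get("name"): int(term.text)
                for term in field.findall("int")
            }
    return result


if __name__ == "__main__":
    # Placeholder host and core; terms.prefix=mein is the value suggested in the thread.
    print(build_terms_url("http://localhost:8983/solr", "core_de",
                          ["content"], prefix="mein"))
```

To check whether the indexed tokens of the content field are intact, a lowercase prefix of an expected word (such as mein) is a more useful probe than a file name, since the analysis chain quoted above lowercases the tokens.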