AW: Odp.: solr issue with pdf forms

Steve.Scholl Sun, 26 Apr 2015 23:20:13 -0700

Erick,

thanks a lot for helping me here. In my case it is the "content" field which is 
displayed incorrectly. So I went to the schema browser like you pointed out. 
Here is the information I found:
Field: content
Field Type: text
Properties:  Indexed, Tokenized, Stored, TermVector Stored
Schema:  Indexed, Tokenized, Stored, TermVector Stored
Index:  Indexed, Tokenized, Stored, TermVector Stored
Copied Into: spell teaser 
Position Increment Gap:  100
Index Analyzer: org.apache.solr.analysis.TokenizerChain Details
Tokenizer Class:  org.apache.solr.analysis.WhitespaceTokenizerFactory
Filters:  
org.apache.solr.analysis.WordDelimiterFilterFactory args:{preserveOriginal: 1 
splitOnCaseChange: 0 generateNumberParts: 1 catenateWords: 1 
luceneMatchVersion: LUCENE_36 generateWordParts: 1 catenateAll: 0 
catenateNumbers: 1 }
org.apache.solr.analysis.LowerCaseFilterFactory args:{luceneMatchVersion: 
LUCENE_36 }
org.apache.solr.analysis.SynonymFilterFactory args:{synonyms: 
german/synonyms.txt expand: true ignoreCase: true luceneMatchVersion: LUCENE_36 
}
org.apache.solr.analysis.DictionaryCompoundWordTokenFilterFactory 
args:{maxSubwordSize: 15 onlyLongestMatch: false minSubwordSize: 4 minWordSize: 
5 dictionary: german/german-common-nouns.txt luceneMatchVersion: LUCENE_36 }
org.apache.solr.analysis.StopFilterFactory args:{words: german/stopwords.txt 
ignoreCase: true enablePositionIncrements: true luceneMatchVersion: LUCENE_36 }
org.apache.solr.analysis.GermanNormalizationFilterFactory 
args:{luceneMatchVersion: LUCENE_36 }
org.apache.solr.analysis.SnowballPorterFilterFactory args:{protected: 
german/protwords.txt language: German2 luceneMatchVersion: LUCENE_36 }
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory 
args:{luceneMatchVersion: LUCENE_36 }
Query Analyzer: org.apache.solr.analysis.TokenizerChain Details
Tokenizer Class:  org.apache.solr.analysis.WhitespaceTokenizerFactory
Filters:  
org.apache.solr.analysis.WordDelimiterFilterFactory args:{preserveOriginal: 1 
splitOnCaseChange: 0 generateNumberParts: 1 catenateWords: 0 
luceneMatchVersion: LUCENE_36 generateWordParts: 1 catenateAll: 0 
catenateNumbers: 0 }
org.apache.solr.analysis.LowerCaseFilterFactory args:{luceneMatchVersion: 
LUCENE_36 }
org.apache.solr.analysis.StopFilterFactory args:{words: german/stopwords.txt 
ignoreCase: true enablePositionIncrements: true luceneMatchVersion: LUCENE_36 }
org.apache.solr.analysis.GermanNormalizationFilterFactory 
args:{luceneMatchVersion: LUCENE_36 }
org.apache.solr.analysis.SnowballPorterFilterFactory args:{protected: 
german/protwords.txt language: German2 luceneMatchVersion: LUCENE_36 }
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory 
args:{luceneMatchVersion: LUCENE_36 }
Distinct:  160403
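
The schema browser output above only describes the analysis chain, not the characters that actually reach the field. One possible complementary check is to look at the code points in the returned "content" value. The sketch below is only an illustration: the function name suspicious_chars and the sample string are made up, and the no-break space in the sample is an assumption, not something confirmed for these PDFs.

```python
import unicodedata

# Characters treated as ordinary whitespace and therefore not reported.
ORDINARY = {" ", "\n", "\r", "\t"}

def suspicious_chars(text):
    """Return (index, code point, Unicode name) for unusual spaces,
    control/format characters and the replacement character U+FFFD."""
    found = []
    for index, ch in enumerate(text):
        if ch in ORDINARY:
            continue
        category = unicodedata.category(ch)
        # Zs = space separators, Cc/Cf = control and format characters,
        # Co/Cn = private use and unassigned code points.
        if ch == "\ufffd" or category in ("Zs", "Cc", "Cf", "Co", "Cn"):
            name = unicodedata.name(ch, "<unnamed>")
            found.append((index, f"U+{ord(ch):04X}", name))
    return found

# Hypothetical sample: a no-break space where a normal blank is expected.
sample = "korrekt und v\u00a0ollst\u00e4ndig sind"
for hit in suspicious_chars(sample):
    print(hit)
```

If the stored value contains something other than a plain U+0020 at the broken positions, that would point at the PDF text extraction rather than at the analysis chain.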


Does this somehow help to figure out the issue?
Thanks
Best
Steve


-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Friday, 24 April 2015 20:15
To: solr-user@lucene.apache.org
Subject: Re: Odp.: solr issue with pdf forms

Steve:

Right, it's not exactly obvious. Bring up the admin UI, something like 
http://localhost:8983/solr. From there you have to select a core in the 'core 
selector' drop-down on the left side. If you're using SolrCloud, this will have 
a rather strange name, but it should be easy to identify what collection it 
belongs to.

At that point you'll see a bunch of new options, among them "schema browser". 
From there, select your field from the drop-down that will appear; a button 
labeled "load term info" should then show up.

NOTE: you can get the same information from the TermsComponent, see:
https://cwiki.apache.org/confluence/display/solr/The+Terms+Component.
This is a little more flexible because you can, among other things, specify the 
place to start. In your case you might specify terms.prefix=mein, which will 
show you the terms that are actually being _searched_ as opposed to being 
stored. The stored value is what you see in the browser when you search for 
docs, and it is sometimes misleading, as you're (probably) seeing.
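
As a rough sketch of such a request, the snippet below only builds the URL. It assumes a /terms request handler is configured for the core (as in the Solr example configuration) and uses a made-up core name, collection1; the helper name terms_url is hypothetical too.

```python
from urllib.parse import urlencode

def terms_url(base_url, core, field, prefix, limit=20):
    """Build a TermsComponent request URL for one field and term prefix."""
    params = {
        "terms.fl": field,       # field whose indexed terms are listed
        "terms.prefix": prefix,  # only terms starting with this prefix
        "terms.limit": limit,    # maximum number of terms to return
        "wt": "json",            # response format
    }
    return f"{base_url.rstrip('/')}/{core}/terms?{urlencode(params)}"

# "collection1" is an assumed core name; replace it with your own.
print(terms_url("http://localhost:8983/solr", "collection1", "content", "mein"))
```

Open the printed URL in a browser or fetch it with any HTTP client. Since the indexed terms have already passed through the field's analysis chain, the prefix should be given in the form the terms take in the index (for example lowercase).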

Best,
Erick

On Fri, Apr 24, 2015 at 1:58 AM,  <steve.sch...@t-systems.com> wrote:
> Hey Erick,
>
> thanks a lot for your answer. I went to the admin schema browser, but 
> what should I see there? Sorry, I'm not familiar with the admin schema 
> browser. :-(
>
> Best
> Steve
>
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Thursday, 23 April 2015 18:00
> To: solr-user@lucene.apache.org
> Subject: Re: Odp.: solr issue with pdf forms
>
> When you say "they're not indexed correctly", what's your evidence?
> You cannot rely on the display in the browser; that's the raw input just 
> as it was sent to Solr, _not_ the actual tokens in the index. What do you 
> see when you go to the admin schema browser page and load the actual tokens?
>
> Or use the TermsComponent
> (https://cwiki.apache.org/confluence/display/solr/The+Terms+Component)
> to see the actual terms in the index as opposed to the stored data you see in 
> the browser when you look at search results.
>
> If the actual terms don't seem right _in the index_ we need to see your 
> analysis chain, i.e. your fieldType definition.
>
> I'm 90% sure you're seeing the stored data and your terms are indexed just 
> fine, but I've certainly been wrong before, more times than I want to 
> remember.....
>
> Best,
> Erick
>
> On Thu, Apr 23, 2015 at 1:18 AM,  <steve.sch...@t-systems.com> wrote:
>> Hey Erick,
>>
>> thanks for your answer. They are not indexed correctly. Also, through the 
>> Solr admin interface I see these typical question marks inside a diamond 
>> where a blank space should be.
>> I now figured out the following (not sure if it is relevant at all):
>> - PDF documents created with "Acrobat PDFMaker 10.0 for Word" are 
>> indexed correctly, no issues
>> - PDF documents (with editable form fields) created with "Adobe 
>> InDesign CS5 (7.0.1)"  are indexed with the blank space issue
>>
>> Best
>> Steve
>>
>> -----Original Message-----
>> From: Erick Erickson [mailto:erickerick...@gmail.com]
>> Sent: Wednesday, 22 April 2015 17:11
>> To: solr-user@lucene.apache.org
>> Subject: Re: Odp.: solr issue with pdf forms
>>
>> Are they not _indexed_ correctly or not being displayed correctly?
>> Take a look at admin UI>>schema browser>> your field and press the "load 
>> terms" button. That'll show you what is _in_ the index as opposed to what 
>> the raw data looked like.
>>
>> When you return the field in a Solr search, you get a verbatim, un-analyzed 
>> copy of your original input. My guess is that your browser isn't using a 
>> compatible character encoding for display.
>>
>> Best,
>> Erick
>>
>> On Wed, Apr 22, 2015 at 7:08 AM,  <steve.sch...@t-systems.com> wrote:
>>> Thanks for your answer. Maybe my English is not good enough, what are you 
>>> trying to say? Sorry I didn't get the point.
>>> :-(
>>>
>>>
>>> -----Original Message-----
>>> From: LAFK [mailto:tomasz.bo...@gmail.com]
>>> Sent: Wednesday, 22 April 2015 14:01
>>> To: solr-user@lucene.apache.org; solr-user@lucene.apache.org
>>> Subject: Odp.: solr issue with pdf forms
>>>
>>> Off the top of my head, I'd look into how writable PDFs are created and encoded.
>>>
>>> @LAFK_PL
>>>   Original message
>>> From: steve.sch...@t-systems.com
>>> Sent: Wednesday, 22 April 2015 12:41
>>> To: solr-user@lucene.apache.org
>>> Reply-To: solr-user@lucene.apache.org
>>> Subject: solr issue with pdf forms
>>>
>>> Hi guys,
>>>
>>> hopefully you can help me with my issue. We are using a solr setup and have 
>>> the following issue:
>>> - usual pdf files are indexed just fine
>>> - pdf files with writable form-fields look like this:
>>> Ich bestätige mit meiner Unterschrift, dass alle Angaben korrekt und 
>>> v ollständig sind
>>>
>>> Somehow the blank space character is not indexed correctly.
>>>
>>> Is this a known issue? Does anybody have an idea?
>>>
>>> Thanks a lot
>>> Best
>>> Steve
