AW: Odp.: solr issue with pdf forms

Steve.Scholl Wed, 29 Apr 2015 03:51:58 -0700

Sorry, but there really isn't... :-/

I had never used the TermsComponent before, so I first checked whether it is 
configured, and it is.
Then I tried to get an idea of how it works and ran the examples described in 
the documentation.
After that I tried to figure out how to get the output for the "miscoded" pdf 
content.
My first step was to find the fields I need:


http://IP:8080/solr/core_de/terms?terms.fl=content&terms.fl=fileReferenceDocumentId&terms.fl=fileName

This gives me a top 10 list of the indexed documents and shows the fields 
content, fileReferenceDocumentId and fileName if I understand the documentation 
correctly.
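
For reference, a request like the one above can also be assembled in a script. This is only a sketch: build_terms_url is a hypothetical helper, not part of Solr. The host "IP" and the core name "core_de" are the placeholders from this message, terms.fl is used exactly as in the request above, and terms.prefix and terms.limit are assumed to behave as described on the TermsComponent wiki page linked further down in this thread.

```python
from urllib.parse import urlencode


def build_terms_url(base, fields, prefix=None, limit=None):
    """Build a TermsComponent request URL (hypothetical helper).

    base   -- URL of the /terms handler, e.g. http://IP:8080/solr/core_de/terms
    fields -- field names, each sent as its own terms.fl parameter
    prefix -- optional value for terms.prefix
    limit  -- optional value for terms.limit
    """
    # A list of pairs lets the same parameter name repeat, as in the request above.
    params = [("terms.fl", field) for field in fields]
    if prefix is not None:
        params.append(("terms.prefix", prefix))
    if limit is not None:
        params.append(("terms.limit", str(limit)))
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    base = "http://IP:8080/solr/core_de/terms"
    # Same three fields as in the request above.
    print(build_terms_url(base, ["content", "fileReferenceDocumentId", "fileName"]))
    # Only the content field, with more terms per field than the default.
    print(build_terms_url(base, ["content"], limit=50))
```

The first call reproduces the URL shown above; the second is one way to look at a single field with a larger list.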
Then I tried to limit the output to the specific file that has the encoding 
issues:

http://IP:8080/solr/core_de/terms?terms.fl=content&terms.fl=fileReferenceDocumentId&terms.fl=fileName&terms.prefix=CODING-ISSUE.pdf

But then the content of the content field is no longer shown. :-(
The result looks like this:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
</lst>
<lst name="terms">
<lst name="content"/>
<lst name="fileReferenceDocumentId"/>
<lst name="fileName">
<int name=" CODING-ISSUE.pdf ">3</int>
</lst>
</lst>
</response>
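
A possible reading of this response, offered with some caution: as the TermsComponent documentation describes it, terms.prefix restricts every listed field to indexed terms that begin with the given prefix; it does not select a document. Under that reading, the empty content list would only mean that no term in the content field starts with CODING-ISSUE.pdf. The sketch below parses the response shown above into a dictionary so that the empty fields are easy to spot. terms_by_field is a hypothetical helper, and the XML is copied from this message, including the spaces around the term name.

```python
import xml.etree.ElementTree as ET

# The response from this message, copied as-is.
RESPONSE = """<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
</lst>
<lst name="terms">
<lst name="content"/>
<lst name="fileReferenceDocumentId"/>
<lst name="fileName">
<int name=" CODING-ISSUE.pdf ">3</int>
</lst>
</lst>
</response>"""


def terms_by_field(xml_text):
    """Return {field name: {term: count}} from a /terms XML response.

    Hypothetical helper; it assumes the layout shown in this message:
    a <lst name="terms"> element holding one <lst> per requested field.
    """
    root = ET.fromstring(xml_text)
    result = {}
    for block in root.findall("lst"):
        if block.get("name") != "terms":
            continue  # skip responseHeader
        for field in block.findall("lst"):
            result[field.get("name")] = {
                term.get("name"): int(term.text)
                for term in field.findall("int")
            }
    return result


if __name__ == "__main__":
    for field, terms in terms_by_field(RESPONSE).items():
        print(field, terms if terms else "(no terms match the prefix)")
```

If that reading is right, a prefix that actually occurs in the text, such as the terms.prefix=mein suggested earlier in this thread, applied to the content field alone should list indexed tokens again.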

Any help would be appreciated 
Thanks a lot
Best
Steve


-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Wednesday, 29 April 2015 03:07
To: solr-user@lucene.apache.org
Subject: Re: Odp.: solr issue with pdf forms

There better be.

1> go to the admin UI
2> select a core
3> select "schema browser"
4> select a field from the drop-down

Until you do step 4 the window will be pretty blank.

Here's the info for TermsComponent, what have you tried?

https://cwiki.apache.org/confluence/display/solr/The+Terms+Component

Best,
Erick

On Tue, Apr 28, 2015 at 1:04 PM,  <steve.sch...@t-systems.com> wrote:
> Thanks a lot for being patient with me. Unfortunately there is no 
> button "load term info". :-( Could you maybe help me use the TermsComponent 
> instead? I read that it is configured by default.
>
> Thanks a lot
> Best
> Steve
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Monday, 27 April 2015 17:23
> To: solr-user@lucene.apache.org
> Subject: Re: Odp.: solr issue with pdf forms
>
> We're still not quite there. There should be a "load term info" button on 
> that page. Clicking that button will show you the terms in your index (as 
> opposed to the raw stored input which is what you get when you look at 
> results in the browser). My bet is that you'll see perfectly normal tokens in 
> the index that will NOT have the wonky characters you see in the display.
>
> If that's the case, then you have a browser issue, Solr is working perfectly 
> fine. On the other hand, if the individual terms are weird, then you have 
> something more fundamental going on.
>
> Which is why I mentioned the TermsComponent. That will return indexed tokens, 
> and allows you a bit more flexibility than the admin page in terms of what 
> tokens you see, but it's essentially the same information.
>
> Best,
> Erick
>
> On Sun, Apr 26, 2015 at 11:18 PM,  <steve.sch...@t-systems.com> wrote:
>> Erick,
>>
>> thanks a lot for helping me here. In my case it is the "content" field which 
>> is not displayed correctly. So I went to the schema browser like you pointed 
>> out. Here is the information I found:
>> Field: content
>> Field Type: text
>> Properties:  Indexed, Tokenized, Stored, TermVector Stored
>> Schema:  Indexed, Tokenized, Stored, TermVector Stored
>> Index:  Indexed, Tokenized, Stored, TermVector Stored
>> Copied Into: spell teaser
>> Position Increment Gap:  100
>> Index Analyzer: org.apache.solr.analysis.TokenizerChain Details
>> Tokenizer Class: org.apache.solr.analysis.WhitespaceTokenizerFactory
>> Filters:
>> org.apache.solr.analysis.WordDelimiterFilterFactory
>>   args:{preserveOriginal: 1 splitOnCaseChange: 0 generateNumberParts: 1
>>   catenateWords: 1 luceneMatchVersion: LUCENE_36 generateWordParts: 1
>>   catenateAll: 0 catenateNumbers: 1 }
>> org.apache.solr.analysis.LowerCaseFilterFactory
>>   args:{luceneMatchVersion: LUCENE_36 }
>> org.apache.solr.analysis.SynonymFilterFactory
>>   args:{synonyms: german/synonyms.txt expand: true ignoreCase: true
>>   luceneMatchVersion: LUCENE_36 }
>> org.apache.solr.analysis.DictionaryCompoundWordTokenFilterFactory
>>   args:{maxSubwordSize: 15 onlyLongestMatch: false minSubwordSize: 4
>>   minWordSize: 5 dictionary: german/german-common-nouns.txt
>>   luceneMatchVersion: LUCENE_36 }
>> org.apache.solr.analysis.StopFilterFactory
>>   args:{words: german/stopwords.txt ignoreCase: true
>>   enablePositionIncrements: true luceneMatchVersion: LUCENE_36 }
>> org.apache.solr.analysis.GermanNormalizationFilterFactory
>>   args:{luceneMatchVersion: LUCENE_36 }
>> org.apache.solr.analysis.SnowballPorterFilterFactory
>>   args:{protected: german/protwords.txt language: German2
>>   luceneMatchVersion: LUCENE_36 }
>> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory
>>   args:{luceneMatchVersion: LUCENE_36 }
>> Query Analyzer: org.apache.solr.analysis.TokenizerChain Details
>> Tokenizer Class: org.apache.solr.analysis.WhitespaceTokenizerFactory
>> Filters:
>> org.apache.solr.analysis.WordDelimiterFilterFactory
>>   args:{preserveOriginal: 1 splitOnCaseChange: 0 generateNumberParts: 1
>>   catenateWords: 0 luceneMatchVersion: LUCENE_36 generateWordParts: 1
>>   catenateAll: 0 catenateNumbers: 0 }
>> org.apache.solr.analysis.LowerCaseFilterFactory
>>   args:{luceneMatchVersion: LUCENE_36 }
>> org.apache.solr.analysis.StopFilterFactory
>>   args:{words: german/stopwords.txt ignoreCase: true
>>   enablePositionIncrements: true luceneMatchVersion: LUCENE_36 }
>> org.apache.solr.analysis.GermanNormalizationFilterFactory
>>   args:{luceneMatchVersion: LUCENE_36 }
>> org.apache.solr.analysis.SnowballPorterFilterFactory
>>   args:{protected: german/protwords.txt language: German2
>>   luceneMatchVersion: LUCENE_36 }
>> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory
>>   args:{luceneMatchVersion: LUCENE_36 }
>> Distinct:  160403
>>
>> Does this somehow help to figure out the issue?
>> Thanks
>> Best
>> Steve
>>
>>
>> -----Original Message-----
>> From: Erick Erickson [mailto:erickerick...@gmail.com]
>> Sent: Friday, 24 April 2015 20:15
>> To: solr-user@lucene.apache.org
>> Subject: Re: Odp.: solr issue with pdf forms
>>
>> Steve:
>>
>> Right, it's not exactly obvious. Bring up the admin UI, something like 
>> http://localhost:8983/solr. From there you have to select a core in the 
>> 'core selector' drop-down on the left side. If you're using SolrCloud, this 
>> will have a rather strange name, but it should be easy to identify what 
>> collection it belongs to.
>>
>> At that point you'll see a bunch of new options, among them "schema 
>> browser". From there, select your field from the drop-down that will appear, 
>> then a button should pop up "load term info".
>>
>> NOTE: you can get the same information from the TermsComponent, see:
>> https://cwiki.apache.org/confluence/display/solr/The+Terms+Component.
>> This is a little more flexible because you can, among other things, specify 
>> the place to start. In your case you might specify terms.prefix=mein which 
>> will show you the terms that are actually being _searched_ as opposed to 
>> being stored. This latter is what you see in the browser when you search for 
>> docs and is sometimes misleading as you're (probably) seeing.
>>
>> Best,
>> Erick
>>
>> On Fri, Apr 24, 2015 at 1:58 AM,  <steve.sch...@t-systems.com> wrote:
>>> Hey Erick,
>>>
>>> thanks a lot for your answer. I went to the admin schema browser, 
>>> but what should I see there? Sorry, I'm not familiar with the admin 
>>> schema browser. :-(
>>>
>>> Best
>>> Steve
>>>
>>>
>>> -----Original Message-----
>>> From: Erick Erickson [mailto:erickerick...@gmail.com]
>>> Sent: Thursday, 23 April 2015 18:00
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Odp.: solr issue with pdf forms
>>>
>>> When you say "they're not indexed correctly", what's your evidence?
>>> You cannot rely on the display in the browser; that's the raw input just as 
>>> it was sent to Solr, _not_ the actual tokens in the index. What do you see 
>>> when you go to the admin schema browser page and load the actual tokens?
>>>
>>> Or use the TermsComponent
>>> (https://cwiki.apache.org/confluence/display/solr/The+Terms+Component)
>>> to see the actual terms in the index as opposed to the stored data 
>>> you see in the browser when you look at search results.
>>>
>>> If the actual terms don't seem right _in the index_ we need to see your 
>>> analysis chain, i.e. your fieldType definition.
>>>
>>> I'm 90% sure you're seeing the stored data and your terms are indexed just 
>>> fine, but I've certainly been wrong before, more times than I want to 
>>> remember.....
>>>
>>> Best,
>>> Erick
>>>
>>> On Thu, Apr 23, 2015 at 1:18 AM,  <steve.sch...@t-systems.com> wrote:
>>>> Hey Erick,
>>>>
>>>> thanks for your answer. They are not indexed correctly. Through the 
>>>> solr admin interface I also see these typical question marks within a 
>>>> rhombus where a blank space should be.
>>>> I now figured out the following (not sure if it is relevant at all):
>>>> - PDF documents created with "Acrobat PDFMaker 10.0 for Word" are 
>>>> indexed correctly, no issues
>>>> - PDF documents (with editable form fields) created with "Adobe 
>>>> InDesign CS5 (7.0.1)"  are indexed with the blank space issue
>>>>
>>>> Best
>>>> Steve
>>>>
>>>> -----Original Message-----
>>>> From: Erick Erickson [mailto:erickerick...@gmail.com]
>>>> Sent: Wednesday, 22 April 2015 17:11
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Re: Odp.: solr issue with pdf forms
>>>>
>>>> Are they not _indexed_ correctly or not being displayed correctly?
>>>> Take a look at admin UI>>schema browser>> your field and press the "load 
>>>> terms" button. That'll show you what is _in_ the index as opposed to what 
>>>> the raw data looked like.
>>>>
>>>> When you return the field in a Solr search, you get a verbatim, 
>>>> un-analyzed copy of your original input. My guess is that your browser 
>>>> isn't using a compatible character encoding for display.
>>>>
>>>> Best,
>>>> Erick
>>>>
>>>> On Wed, Apr 22, 2015 at 7:08 AM,  <steve.sch...@t-systems.com> wrote:
>>>>> Thanks for your answer. Maybe my English is not good enough, what are you 
>>>>> trying to say? Sorry I didn't get the point.
>>>>> :-(
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: LAFK [mailto:tomasz.bo...@gmail.com]
>>>>> Sent: Wednesday, 22 April 2015 14:01
>>>>> To: solr-user@lucene.apache.org; solr-user@lucene.apache.org
>>>>> Subject: Odp.: solr issue with pdf forms
>>>>>
>>>>> Off the top of my head, I'd look into how writable PDFs are created and encoded.
>>>>>
>>>>> @LAFK_PL
>>>>>   Original Message
>>>>> From: steve.sch...@t-systems.com
>>>>> Sent: Wednesday, 22 April 2015 12:41
>>>>> To: solr-user@lucene.apache.org
>>>>> Reply-To: solr-user@lucene.apache.org
>>>>> Subject: solr issue with pdf forms
>>>>>
>>>>> Hi guys,
>>>>>
>>>>> hopefully you can help me with my issue. We are using a solr setup and 
>>>>> have the following issue:
>>>>> - usual pdf files are indexed just fine
>>>>> - pdf files with writable form-fields look like this:
>>>>> Ich bestätige mit meiner Unterschrift, dass alle Angaben korrekt 
>>>>> und v ollständig sind
>>>>>
>>>>> Somehow the blank space character is not indexed correctly.
>>>>>
>>>>> Is this a known issue? Does anybody have an idea?
>>>>>
>>>>> Thanks a lot
>>>>> Best
>>>>> Steve
