Re: PDF extraction leads to reversed words

Abdelhamid ABID Tue, 09 Mar 2010 07:22:27 -0800

I don't know pdftotext. Can it be plugged into Solr, or do we need to
hard-code the extraction step so that it runs before Solr takes over?
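
If it does come down to running the extraction ahead of Solr, that step might look roughly like the sketch below. It assumes the pdftotext command-line tool is installed and on the PATH, and that UTF-8 output is wanted. The class and method names (PdfPreExtract, pdftotextCommand, extractText) are made up for illustration, and the Solr side is left out entirely.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Sketch only: extracts text with the external pdftotext tool before indexing.
// All names here are hypothetical; nothing in this class is part of Solr or Tika.
final class PdfPreExtract {

    // Builds the pdftotext command line: pdftotext -enc UTF-8 <pdf> <txt>.
    static List<String> pdftotextCommand(Path pdf, Path txt) {
        return List.of("pdftotext", "-enc", "UTF-8", pdf.toString(), txt.toString());
    }

    // Runs pdftotext into a temporary file and returns the extracted text.
    // Assumption: pdftotext is installed and reachable through the PATH.
    static String extractText(Path pdf) throws IOException, InterruptedException {
        Path txt = Files.createTempFile("extracted", ".txt");
        try {
            Process process = new ProcessBuilder(pdftotextCommand(pdf, txt))
                    .redirectErrorStream(true)
                    // Discard console output so the child never blocks on a full pipe.
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            int exit = process.waitFor();
            if (exit != 0) {
                throw new IOException("pdftotext exited with code " + exit);
            }
            return Files.readString(txt, StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(txt);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: PdfPreExtract <file.pdf>");
            return;
        }
        // The returned text would then be sent to Solr as an ordinary field value.
        System.out.println(extractText(Path.of(args[0])));
    }
}
```

With this approach the PDF never reaches solr/update/extract; only the already extracted text would be indexed.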


On 3/9/10, Dominique Bejean <dominique.bej...@eolya.fr> wrote:
>
> Hi,
>
> The problem comes from PDFBox (
> http://brutus.apache.org/jira/browse/PDFBOX-377) and is fixed now. However,
> Tika doesn't yet use this version of PDFBox.
> So for PDF text extraction, I don't use Tika but pdftotext.
>
> Dominique
>
>
> On 09/03/10 06:00, Robert Muir wrote:
>
>> It is an optional dependency of PDFBox. If ICU is available, then it
>> is capable of processing Arabic PDF files.
>>
>> The problem is that Arabic "text" in PDF files is really glyphs
>> (encoded in visual order) and needs to be 'unshaped' with some stuff
>> that isn't in the JDK.
>>
>> If the size of the default ICU jar file is the issue here, we can
>> consider an alternative: The default ICU jar is very large as it
>> includes everything, yet it can be customized to only include what is
>> needed: http://apps.icu-project.org/datacustom/
>>
>> We did this in lucene for the collation contrib, to shrink the jar
>> about 2MB: http://issues.apache.org/jira/browse/LUCENE-1867
>>
>> For this use-case, it could be even smaller, as most of the huge size
>> of ICU comes from large CJK collation tables (needed for collation,
>> but not for this Arabic PDF extraction).
>>
>> In reality I don't really like doing this as it might confuse users
>> (e.g. people that want collation, too), and ICU is useful for other
>> things, but if that's what we have to do, we should do it so that
>> Arabic PDF files will work.
>>
>> On Mon, Mar 8, 2010 at 11:53 PM, Lance Norskog<goks...@gmail.com>  wrote:
>>
>>
>>> Is this a mistake in the Tika library collection in the Solr trunk?
>>>
>>> On Mon, Mar 8, 2010 at 5:15 PM, Robert Muir<rcm...@gmail.com>  wrote:
>>>
>>>
>>>> I think the problem is that Solr does not include the ICU4J jar, so it
>>>> won't work with Arabic PDF files.
>>>>
>>>> Try putting ICU4J 3.8 (http://site.icu-project.org/download) in your
>>>> classpath.
>>>>
>>>> On Mon, Mar 8, 2010 at 6:30 PM, Abdelhamid  ABID<aeh.a...@gmail.com>
>>>>  wrote:
>>>>
>>>>
>>>>> Hi,
>>>>> When I post Arabic PDF files to Solr using a web form (to
>>>>> solr/update/extract), the text gets extracted, but each word is
>>>>> displayed in reverse direction (instead of right to left).
>>>>> When I search these texts with keywords that are themselves
>>>>> reversed, I get results, but they are reversed too.
>>>>> This problem doesn't occur when posting MS Word documents.
>>>>> I think the problem comes from Tika!
>>>>>
>>>>> Any clue?
>>>>>
>>>>> --
>>>>> elsadek
>>>>> Software Engineer- J2EE / WEB / ESB MULE
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Robert Muir
>>>> rcm...@gmail.com
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Lance Norskog
>>> goks...@gmail.com
>>>
>>>
>>>
>>
>>
>>
>>
>


-- 
Abdelhamid ABID
Software Engineer- J2EE / WEB / ESB MULE
