Re: PDF extraction leads to reversed words

Robert Muir Tue, 09 Mar 2010 07:21:52 -0800

This depends on which version of Solr you are using. The trunk version
includes a version of Tika that supports this. See SOLR-1813.


On Tue, Mar 9, 2010 at 3:59 AM, Dominique Bejean
<dominique.bej...@eolya.fr> wrote:
> Hi,
>
> The problem comes from PDFBox
> (http://brutus.apache.org/jira/browse/PDFBOX-377) and is now fixed. However,
> Tika doesn't yet use this version of PDFBox.
> So for PDF text extraction, I don't use Tika; I use pdftotext instead.
>
> Dominique
>
>
> On 09/03/10 06:00, Robert Muir wrote:
>>
>> ICU4J is an optional dependency of PDFBox. If ICU is available, then
>> PDFBox is capable of processing Arabic PDF files.
>>
>> The problem is that Arabic "text" in PDF files is really glyphs
>> (encoded in visual order) and needs to be 'unshaped' with some stuff
>> that isn't in the JDK.
>>
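To illustrate what "visual order" means above, here is a minimal Java sketch. It is only an illustration, not what PDFBox or ICU actually do: it assumes a single word made purely of right-to-left letters, and it ignores glyph shaping (presentation forms) entirely. The class name `VisualOrderDemo` and the sample word are made up for this example.

```java
public class VisualOrderDemo {
    // Reverse a string. This is adequate only for a single run of
    // right-to-left letters; mixed text needs the full bidi algorithm.
    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    public static void main(String[] args) {
        // A four-letter Arabic word in logical (reading) order:
        // SEEN, LAM, ALEF, MEEM
        String logical = "\u0633\u0644\u0627\u0645";
        // The same letters in visual order, as glyphs laid out left to right
        String visual = "\u0645\u0627\u0644\u0633";

        // A query typed in logical order does not match the visual-order text...
        System.out.println(visual.equals(logical));          // false
        // ...but reversing the visual-order text recovers the logical order.
        System.out.println(reverse(visual).equals(logical)); // true
    }
}
```

This also shows why searches only match when the keywords are reversed: the indexed tokens are stored in visual order.
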
>> If the size of the default ICU jar file is the issue here, we can
>> consider an alternative: The default ICU jar is very large as it
>> includes everything, yet it can be customized to only include what is
>> needed: http://apps.icu-project.org/datacustom/
>>
>> We did this in lucene for the collation contrib, to shrink the jar
>> about 2MB: http://issues.apache.org/jira/browse/LUCENE-1867
>>
>> For this use-case, it could be even smaller, as most of the huge size
>> of ICU comes from large CJK collation tables (needed for collation,
>> but not for this Arabic PDF extraction).
>>
>> In reality I don't really like doing this as it might confuse users
>> (e.g. people that want collation, too), and ICU is useful for other
>> things, but if that's what we have to do, we should do it so that
>> Arabic PDF files will work.
>>
>> On Mon, Mar 8, 2010 at 11:53 PM, Lance Norskog <goks...@gmail.com> wrote:
>>
>>>
>>> Is this a mistake in the Tika library collection in the Solr trunk?
>>>
>>> On Mon, Mar 8, 2010 at 5:15 PM, Robert Muir <rcm...@gmail.com> wrote:
>>>
>>>>
>>>> I think the problem is that Solr does not include the ICU4J jar, so it
>>>> won't work with Arabic PDF files.
>>>>
>>>> Try putting ICU4J 3.8 (http://site.icu-project.org/download) in your
>>>> classpath.
>>>>
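One way to confirm that ICU4J is actually visible on the classpath is a small probe like the sketch below. The class `Icu4jCheck` and the method `isOnClasspath` are hypothetical names for this example. The probe class `com.ibm.icu.text.Bidi` ships in ICU4J, but whether PDFBox looks for that specific class is an assumption.

```java
public class Icu4jCheck {
    // Returns true if the named class can be loaded by the current class loader.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: this ICU4J class is a reasonable stand-in for "the jar is present".
        String probe = "com.ibm.icu.text.Bidi";
        System.out.println("ICU4J available: " + isOnClasspath(probe));
    }
}
```

Run it with the same classpath the servlet container uses for Solr; otherwise the result says nothing about what Solr itself can see.
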
>>>> On Mon, Mar 8, 2010 at 6:30 PM, Abdelhamid ABID <aeh.a...@gmail.com> wrote:
>>>>
>>>>>
>>>>> Hi,
>>>>> When I post Arabic PDF files to Solr using a web form (to
>>>>> solr/update/extract), the text is extracted with each word displayed
>>>>> in reverse direction (instead of right to left).
>>>>> When I search against these texts, always with reversed keywords, I
>>>>> get results, but they are reversed.
>>>>> This problem doesn't occur when posting MS Word documents.
>>>>> I think the problem comes from Tika!
>>>>>
>>>>> Any clue?
>>>>>
>>>>> --
>>>>> elsadek
>>>>> Software Engineer- J2EE / WEB / ESB MULE
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Robert Muir
>>>> rcm...@gmail.com
>>>>
>>>>
>>>
>>>
>>> --
>>> Lance Norskog
>>> goks...@gmail.com
>>>
>>>
>>
>>
>>
>



-- 
Robert Muir
rcm...@gmail.com
