I tried a couple of times to get this patch, but the downloads failed with a file size mismatch or a similar error. Is there another link?

On 3/9/10, Dominique Bejean <dominique.bej...@eolya.fr> wrote:
>
> Hi,
>
> The problem comes from PDFBox
> (http://brutus.apache.org/jira/browse/PDFBOX-377) and is fixed now. However,
> Tika doesn't yet use this version of PDFBox.
> So for PDF text extraction, I don't use Tika but pdftotext.
>
> Dominique
>
>
> On 09/03/10 06:00, Robert Muir wrote:
>
>> It is an optional dependency of PDFBox. If ICU is available, then it
>> is capable of processing Arabic PDF files.
>>
>> The problem is that Arabic "text" in PDF files is really glyphs
>> (encoded in visual order) and needs to be 'unshaped' with some stuff
>> that isn't in the JDK.
>>
>> If the size of the default ICU jar file is the issue here, we can
>> consider an alternative: the default ICU jar is very large as it
>> includes everything, yet it can be customized to only include what is
>> needed: http://apps.icu-project.org/datacustom/
>>
>> We did this in Lucene for the collation contrib, to shrink the jar
>> by about 2MB: http://issues.apache.org/jira/browse/LUCENE-1867
>>
>> For this use case, it could be even smaller, as most of the huge size
>> of ICU comes from large CJK collation tables (needed for collation,
>> but not for this Arabic PDF extraction).
>>
>> In reality I don't really like doing this as it might confuse users
>> (e.g. people that want collation, too), and ICU is useful for other
>> things, but if that's what we have to do, we should do it so that
>> Arabic PDF files will work.
>>
>> On Mon, Mar 8, 2010 at 11:53 PM, Lance Norskog <goks...@gmail.com> wrote:
>>
>>> Is this a mistake in the Tika library collection in the Solr trunk?
>>>
>>> On Mon, Mar 8, 2010 at 5:15 PM, Robert Muir <rcm...@gmail.com> wrote:
>>>
>>>> I think the problem is that Solr does not include the ICU4J jar, so it
>>>> won't work with Arabic PDF files.
>>>>
>>>> Try putting ICU4J 3.8 (http://site.icu-project.org/download) in your
>>>> classpath.
>>>>
>>>> On Mon, Mar 8, 2010 at 6:30 PM, Abdelhamid ABID <aeh.a...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>> When I post Arabic PDF files to Solr using a web form (to
>>>>> solr/update/extract), the extracted text comes back with each word
>>>>> displayed in the reverse direction (instead of right to left).
>>>>> When I search against this text, always with reversed keywords, I
>>>>> get results, but reversed.
>>>>> This problem doesn't occur when posting MS Word documents.
>>>>> I think the problem comes from Tika!
>>>>>
>>>>> Any clue?
>>>>>
>>>>> --
>>>>> elsadek
>>>>> Software Engineer- J2EE / WEB / ESB MULE
>>>>
>>>> --
>>>> Robert Muir
>>>> rcm...@gmail.com
>>>
>>> --
>>> Lance Norskog
>>> goks...@gmail.com

--
Abdelhamid ABID
Software Engineer- J2EE / WEB / ESB MULE
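
P.S. Regarding the suggestion quoted above to put ICU4J on the classpath: a quick way to check whether the jar is actually visible to the JVM might look like the sketch below. The class name IcuCheck and the helper isClassAvailable are made up for illustration, and probing for com.ibm.icu.text.Bidi (a class shipped in the ICU4J jar) is my assumption about a reasonable class to test for, not something confirmed in this thread.

```java
public class IcuCheck {

    // Hypothetical helper: returns true if the named class can be found
    // on the classpath of the class loader that loaded IcuCheck.
    static boolean isClassAvailable(String className) {
        try {
            // 'false' means: locate the class without running its static initializers.
            Class.forName(className, false, IcuCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: this ICU4J class is a good enough probe for the whole jar.
        String probe = "com.ibm.icu.text.Bidi";
        if (isClassAvailable(probe)) {
            System.out.println("ICU4J appears to be on the classpath.");
        } else {
            System.out.println("ICU4J not found; try adding the ICU4J jar to the classpath.");
        }
    }
}
```

Note that this only tells you about the classpath of the JVM you run it in. Inside Solr, the jar has to be visible to the web application's class loader, so the check would need to run in that context to be meaningful.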