Hello Daniel,
The text ("2007") we need to extract is written in the CLRDingbats font. Can you
please give us any pointers so that we don't get a garbage value while extracting
it from the PDF (attached for your reference)?
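For reference, here is a minimal diagnostic sketch that prints the code points of the garbage returned for "2007" (À¾´»). It is plain Java, not PDFBox code, and the class and method names are made up for illustration. All four characters fall in the U+0080 to U+00FF range, one per expected digit. That would be consistent with the font's raw character codes being passed through without a Unicode mapping, but that reading is only an assumption and is not confirmed anywhere in this thread.

```java
// Hypothetical helper for illustration only; not part of PDFBox.
class GarbageDump {

    // Formats every char of s as U+XXXX, separated by single spaces.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The garbage value from the extraction, written with escapes so the
        // source file encoding does not matter.
        String garbage = "\u00C0\u00BE\u00B4\u00BB";
        System.out.println(codePoints(garbage)); // prints "U+00C0 U+00BE U+00B4 U+00BB"
        System.out.println(codePoints("2007"));  // prints "U+0032 U+0030 U+0030 U+0037"
    }
}
```

Note that the two "0" digits come back as two different characters (U+00BE and U+00B4), so a fixed offset applied to the output would not recover the expected text.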
Thanks!
From: Daniel Manzke [mailto:[email protected]]
Sent: Monday, December 29, 2008 5:36 PM
To: Duseja, Sushil; [email protected]
Cc: Rally, Menka
Subject: Re: Garbage Output
Sorry, this would be a job for one of the PDFBox developers. So far I have just
been doing some support for the list and don't have much know-how about this.
So I can only have a look in the evening, and maybe I will find a solution. ;)
Daniel
2008/12/29 Duseja, Sushil <[email protected]>
If possible, can you please let us know your contact number to discuss this
issue?
Thanks!
From: Daniel Manzke [mailto:[email protected]]
Sent: Monday, December 29, 2008 5:12 PM
To: Duseja, Sushil; [email protected]
Cc: Rally, Menka
Subject: Re: Garbage Output
Hi,
I've just added this line:
// after stripper.extractRegions();
stripper.getText(document);
After doing this I got some text for the regions, but it seems that this text
belongs to page 1. Have you found an example of how to use the Stripper?
Maybe someone else could help you, since I don't have any knowledge of the
Stripper.
If I have some time in the evening I will give it another test.
Bye,
Daniel
2008/12/29 Duseja, Sushil <[email protected]>
Hello Daniel,
I tried using the compiled version sent across by you with no luck.
I tried running a Java program (for text extraction) with PDFBox 0.7.3 and 0.8
in the classpath separately. With 0.8, I am not able to fetch anything. With
0.7.3, however, I could extract all values apart from "Year of Form", whose
value comes out as garbage (À¾´»), which is why you recommended using 0.8.
Note: the Java programs and my PDF are attached for your reference. The names
of the Java files are self-explanatory and indicate which version each one
uses. The contents of these Java files are exactly the same.
Please advise.
Thanks!
From: Daniel Manzke [mailto:[email protected]]
Sent: Monday, December 29, 2008 2:45 PM
To: Duseja, Sushil
Cc: [email protected]; Rally, Menka
Subject: Re: Garbage Output
Just check out the latest source code and run Maven.
I will send you a compiled version.
Bye
2008/12/29 Duseja, Sushil <[email protected]>
Thanks Daniel.
Do you mean that I need to fetch the latest source code from the trunk in the
Subversion repository? If not, how can I get the source code for 0.8?
I would really appreciate it if you could build me a compiled version. I hope I
am not bothering you.
Thanking you in anticipation.
From: Daniel Manzke [mailto:[email protected]]
Sent: Monday, December 29, 2008 1:41 PM
To: Duseja, Sushil
Cc: [email protected]; Rally, Menka
Subject: Re: Garbage Output
PDFBox is still under incubation and there is no 0.8 distribution. What you
could do is download the source code and build it on your own. Then you could
have a look at the code and debug where the garbage is produced. Or ask me
and I will build you a compiled version.
Daniel
2008/12/29 Duseja, Sushil <[email protected]>
Thanks again for responding.
Can you please point me to the URL/location from which 0.8 version can be
downloaded?
I referred to http://sourceforge.net/project/showfiles.php?group_id=78314
but it shows that the latest version is 0.7.3.
Thanks for your time.
From: Daniel Manzke [mailto:[email protected]]
Sent: Monday, December 29, 2008 1:29 PM
To: Duseja, Sushil
Cc: [email protected]; Rally, Menka
Subject: Re: Garbage Output
Try checking out the latest development build, since 0.7.3 is outdated (it is
from 2006). A lot of issues have been fixed in 0.8.
Bye,
daniel
2008/12/29 Duseja, Sushil <[email protected]>
Hello Daniel,
Thanks for the response.
I am using version 0.7.3.
Thanks!
-----Original Message-----
From: Daniel Manzke [mailto:[email protected]]
Sent: Friday, December 26, 2008 9:11 PM
To: [email protected]
Subject: Re: Garbage Output
Hi,
Standard question. ;) Which version are you using?
Daniel
2008/12/26 Duseja, Sushil <[email protected]>
> Hello,
>
>
>
> While extracting text from a pdf file (attached for your kind reference)
> using PDFBox, I get garbage output (*À¾´»*) for a special text value "*2007*"
> (please see page 2); I can fetch other values correctly though.
>
> Is this an *encoding issue*? If so, can anyone please let me know how to
> fix it? If possible, please point me to some working examples.
>
>
>
> Thanks in advance.
>
--
Kind regards
Daniel Manzke