[Tutor] Package which can extract data from pdf

2019-08-14 Thread Nupur Jha
Hi All,

I have many PDF invoices in different formats. I want to extract the line
items from these PDF files using Python.

I would appreciate your guidance on how I can achieve this.

-- 

*Thanks & Regards, Nupur Jha*
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Package which can extract data from pdf

2019-08-14 Thread Mats Wichmann
On 8/14/19 10:10 AM, Nupur Jha wrote:
> Hi All,
> 
> I have many PDF invoices in different formats. I want to extract the line
> items from these PDF files using Python.
> 
> I would appreciate your guidance on how I can achieve this.
> 

There are many packages that attempt to extract text from PDF files.  They
have varying degrees of success on different documents: you need
to be aware that PDF wasn't intended to be used that way; it was designed
to *display* consistently.  Sometimes the PDF is full of rendering
instructions that are hard for a reader program to interpret, and the text
needs to be pieced together in possibly unexpected ways.  My experience is
that if you can select the interesting text in a PDF reader, paste it into
an editor, and it doesn't come out looking particularly mangled, then
reading it programmatically has a pretty good chance of working. If not,
you may be in trouble. That said...

PyPDF2, textract, and tika all have their supporters. You can search for
all of these on PyPI, which will give you links to the projects' home pages.

(if it matters, tika is an interface to a bunch of Java code, so you're
not using Python to read it, but you are using Python to control the
process)

There's a product called pdftables which specifically tries to be good
at spreadsheet-like data, which your invoices *might* be.  That is not a
free product, however. For that one there's a Python interface that
sends your data off to a web service and you get answers back.

There are probably dozens more... this seems to be an area with a lot of
reinvention going on.
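
As a rough illustration of the first option, here is a minimal sketch of
pulling the text out of each page with PyPDF2. It assumes PyPDF2 2.0 or
later (older releases used different class and method names), and the
function name extract_text is only a placeholder chosen for this example.
Whether the result is usable depends on the PDF, as described above.

```python
from pathlib import Path


def extract_text(pdf_path):
    """Return the text of every page in pdf_path, one list entry per page."""
    path = Path(pdf_path)
    if not path.is_file():
        raise FileNotFoundError(f"No such PDF file: {path}")

    # Imported lazily so the missing-file check above still works on a
    # machine without PyPDF2. Assumption: PyPDF2 2.0 or later is installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(str(path))
    # Scanned (image-only) pages typically yield no text at all,
    # so fall back to an empty string for those.
    return [page.extract_text() or "" for page in reader.pages]
```

Calling extract_text on one of the invoices and printing the returned
strings is a quick programmatic version of the copy-and-paste check: if
the output looks mangled, a different tool is probably needed.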



Re: [Tutor] cgi module help (original poster)

2019-08-14 Thread rmlibre




I went looking through the cgi module code and found the limitation is
in the way urllib's parse_qsl function parses data. It simply does not
handle nested dictionaries. This led to a simple solution/hack:
JSON-encoding the nested values:

"""Client Code"""
import json
from requests import sessions

inner_metadata = json.dumps({"date": "2019-08", "id": ""})
metadata = {"metadata": inner_metadata}
session = sessions.Session()
session.post(, data=metadata)


This is received on the server side and can be parsed with:

"""Server Code"""
import cgi
import json

form = cgi.FieldStorage()
metadata_json = form.getvalue("metadata", None)
metadata = json.loads(metadata_json)
print(metadata)
> {"date": "2019-08", "id": ""}

print(type(metadata))
> dict
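
The same round trip can be sketched with urllib.parse alone, which may be
useful because the cgi module was removed in Python 3.13. This is only an
illustration, not the code from the original exchange: the "example-id"
value is a made-up placeholder, and the way the original client encoded its
nested dictionary may differ from urlencode's behaviour shown here.

```python
import json
from urllib.parse import parse_qsl, urlencode

# Placeholder data; "example-id" is invented for this illustration.
nested = {"metadata": {"date": "2019-08", "id": "example-id"}}

# Form encoding has no notion of nesting: urlencode() calls str() on the
# inner dict, so the receiver gets a plain string, not a dictionary.
flat = dict(parse_qsl(urlencode(nested)))
print(type(flat["metadata"]))

# Workaround: JSON-encode the nested value before sending ...
encoded = urlencode({"metadata": json.dumps(nested["metadata"])})

# ... and JSON-decode it after parsing the form data on the other side.
decoded = dict(parse_qsl(encoded))
restored = json.loads(decoded["metadata"])
print(restored == nested["metadata"])
```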


Re: [Tutor] Package which can extract data from pdf

2019-08-14 Thread William Ray Wing via Tutor

> On Aug 14, 2019, at 2:16 PM, Mats Wichmann wrote:
> 
>> On 8/14/19 10:10 AM, Nupur Jha wrote:
>> Hi All,
>> 
>> I have many PDF invoices in different formats. I want to extract the line
>> items from these PDF files using Python.
>> 

Treat this as a two-part problem: part one is extracting the text; part two is
parsing that text for the information you want. Unless you have a specific need
to extract the text with Python, I’d recommend solving part one with an
image-to-text (OCR) reader. These have gotten really quite good recently (AI no
doubt). Parsing the text with Python’s string handling routines should then be
pretty straightforward.

Bill
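
For part two, a minimal sketch of what the parsing step might look like,
assuming the text has already been extracted. The invoice layout, the
sample text, and the names LINE_ITEM and parse_line_items are all
hypothetical; real invoices in different formats would most likely need one
pattern per format.

```python
import re
from decimal import Decimal

# Hypothetical layout: description, quantity, unit price, line total,
# separated by whitespace, with prices written with two decimal places.
LINE_ITEM = re.compile(
    r"^(?P<description>.+?)\s+(?P<qty>\d+)\s+"
    r"(?P<unit_price>\d+\.\d{2})\s+(?P<total>\d+\.\d{2})$"
)


def parse_line_items(text):
    """Return one dict per line of text that matches the line-item layout."""
    items = []
    for line in text.splitlines():
        match = LINE_ITEM.match(line.strip())
        if match is None:
            continue  # headers, totals, and other non-item lines
        items.append({
            "description": match.group("description"),
            "qty": int(match.group("qty")),
            "unit_price": Decimal(match.group("unit_price")),
            "total": Decimal(match.group("total")),
        })
    return items


# Made-up sample standing in for text extracted from an invoice.
SAMPLE_TEXT = """\
Invoice 0001
Description Qty Unit Price Total
Widget, blue 2 4.50 9.00
Shipping 1 12.00 12.00
Total due 21.00
"""

for item in parse_line_items(SAMPLE_TEXT):
    print(item)
```

Decimal is used rather than float so that the amounts keep their exact
two-decimal values.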

