[Tutor] Package which can extract data from pdf
Hi All,

I have many PDF invoices in different formats. I want to extract the line items from these PDF files using Python.

I would appreciate guidance on how I can achieve this.

--
*Thanks & Regards,
Nupur Jha*
___
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Package which can extract data from pdf
On 8/14/19 10:10 AM, Nupur Jha wrote:
> Hi All,
>
> I have many pdf invoices with different formats. I want to extract the line
> items from these pdf files using python coding.
>
> I would request you all to guide me how can i achieve this.
>

There are many packages that attempt to extract text from pdf. They have varying degrees of success on various different documents: you need to be aware that PDF wasn't intended to be used that way, it was written to *display* consistently. Sometimes the pdf is full of instructions for rendering that are hard for a reader to figure out, and need to be pieced together in possibly unexpected ways.

My experience is that if you can select the interesting text in a pdf reader, and paste it into an editor, and it doesn't come out looking particularly mangled, then reading it programmatically has a pretty good chance of working. If not, you may be in trouble. That said...

pypdf2, textract, and tika all have their supporters. You can search for all of these on pypi, which will give you links to the projects' home pages.

(if it matters, tika is an interface to a bunch of Java code, so you're not using Python to read it, but you are using Python to control the process)

There's a product called pdftables which specifically tries to be good at spreadsheet-like data, which your invoices *might* be. That is not a free product, however. For that one there's a Python interface that sends your data off to a web service and you get answers back.

There are probably dozens more... this seems to be an area with a lot of reinvention going on.
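The "paste it into an editor and see if it looks mangled" check above can be roughly automated. What follows is only a sketch under assumptions: it assumes PyPDF2 version 2.0 or later is installed (for `PdfReader` and `extract_text`), and `extract_pages`, `looks_usable`, and the 0.9 threshold are hypothetical names and values picked for illustration, not part of any library.

```python
import string


def extract_pages(path):
    """Return the extracted text of each page as a list of strings.

    Assumes PyPDF2 >= 2.0 is installed; the import is done here so that
    looks_usable() below can be used without the package.
    """
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    # extract_text() can come back empty for image-only pages.
    return [page.extract_text() or "" for page in reader.pages]


def looks_usable(text, min_ratio=0.9):
    """Hypothetical sanity check: is the text mostly ordinary characters?

    Counts letters, digits, whitespace, and ASCII punctuation, and compares
    their share of the text against min_ratio (an arbitrary threshold).
    """
    if not text.strip():
        return False
    ordinary = sum(
        ch.isalnum() or ch.isspace() or ch in string.punctuation
        for ch in text
    )
    return ordinary / len(text) >= min_ratio
```

A page for which `looks_usable` returns False (for example, a scanned invoice that yields no text at all) is a hint that this particular document needs a different tool.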
Re: [Tutor] cgi module help (original poster)
On 2019-08-13 15:49, tutor-requ...@python.org wrote:
> Today's Topics:
>
>    6. Re: cgi module help (Peter Otten)

I went looking through the cgi module code and found that the limitation is in the way urllib's parse_qsl function parses data. It simply does not handle nested dictionaries correctly. This led to a simple solution/hack: JSON-encode the nested values.

    """Client Code"""
    import json
    from requests import sessions

    # Serialize the nested dict to a JSON string so it travels as one form field.
    inner_metadata = json.dumps({"date": "2019-08", "id": ""})
    metadata = {"metadata": inner_metadata}

    session = sessions.Session()
    # url is a placeholder for the server endpoint, which is missing from this message.
    session.post(url, data=metadata)

This is received on the server side and can be parsed with:

    """Server Code"""
    import cgi
    import json

    form = cgi.FieldStorage()
    metadata_json = form.getvalue("metadata", None)
    # Decode the JSON string back into a dict.
    metadata = json.loads(metadata_json)

    print(metadata)        # {'date': '2019-08', 'id': ''}
    print(type(metadata))  # <class 'dict'>
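To see why the JSON step helps, here is a small standard-library-only illustration. It is a sketch of how `urllib.parse.urlencode` and `parse_qsl` treat a nested dict, not a reproduction of what requests or the cgi module do internally; the variable names are made up for the example.

```python
import json
from urllib.parse import parse_qsl, urlencode

nested = {"metadata": {"date": "2019-08", "id": ""}}

# Naive round trip: urlencode() calls str() on the inner dict, so the
# receiver gets a Python repr string rather than structured data.
naive = dict(parse_qsl(urlencode(nested)))

# JSON round trip: the inner dict is serialized explicitly before encoding
# and decoded explicitly after parsing.
encoded = urlencode({"metadata": json.dumps(nested["metadata"])})
decoded = dict(parse_qsl(encoded))
restored = json.loads(decoded["metadata"])

print(type(naive["metadata"]))  # a str, not a dict
print(restored == nested["metadata"])
```

The form encoding only carries flat key/value strings, so any nesting has to be serialized by the sender and deserialized by the receiver; JSON is one convenient choice for that.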
Re: [Tutor] Package which can extract data from pdf
> On Aug 14, 2019, at 2:16 PM, Mats Wichmann wrote:
>
>> On 8/14/19 10:10 AM, Nupur Jha wrote:
>> Hi All,
>>
>> I have many pdf invoices with different formats. I want to extract the line
>> items from these pdf files using python coding.
>>
>> I would request you all to guide me how can i achieve this.

Treat this as a two-part problem: part one is extracting the text; part two is parsing that text for your desired information.

Unless you have a specific need to extract the text with Python, I'd recommend solving part one with an image-to-text (OCR) reader. These have gotten really quite good recently (AI, no doubt). Then parsing the text with Python's string handling routines should be pretty straightforward.

Bill
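As a sketch of "part two", here is one way to pull line items out of already-extracted text with a regular expression. Everything about the layout is an assumption: the sample text is made up, and the pattern expects each line item to look like "description quantity unit-price total". Real invoices in different formats would each need their own pattern. `LINE_ITEM` and `parse_line_items` are hypothetical names.

```python
import re
from decimal import Decimal

# Assumed layout: free-text description, integer quantity, then two
# amounts with exactly two decimal places (unit price and line total).
LINE_ITEM = re.compile(
    r"^(?P<description>.+?)\s+"
    r"(?P<qty>\d+)\s+"
    r"(?P<unit_price>\d+\.\d{2})\s+"
    r"(?P<total>\d+\.\d{2})$"
)


def parse_line_items(text):
    """Return a list of dicts, one per line that matches LINE_ITEM."""
    items = []
    for line in text.splitlines():
        match = LINE_ITEM.match(line.strip())
        if match:
            items.append({
                "description": match["description"],
                "qty": int(match["qty"]),
                "unit_price": Decimal(match["unit_price"]),
                "total": Decimal(match["total"]),
            })
    return items


# Made-up sample text standing in for the output of the extraction step.
sample = """Invoice 1001
Description Qty Price Total
Widget, blue 2 3.50 7.00
Gadget 10 1.25 12.50
Total due 19.50
"""

items = parse_line_items(sample)
for item in items:
    print(item["description"], item["qty"], item["total"])
```

Header and summary lines are skipped simply because they do not match the pattern. `Decimal` is used instead of `float` so that currency amounts are not subject to binary rounding.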