Re: [Tutor] Handling missing fields in a csv file

Dave Angel Tue, 29 Sep 2009 14:06:20 -0700

Eduardo Vieira wrote:

Hello, I have a csv file,

a broken csv file

 using the ";" as a delimiter. This file
contains addresses. My problem is that some fields are missing in some
rows and I would like to normalize the rows for a smoother import into
Excel, for example.
Here is an example. This is the header:
Company;Telephone;Address;Prov;PCode
While most of them have this header, some data would be like this:
Abc blaba;403-403-4545;MB ---> missing address, city, and postal code
Acme;123-403-4545;Winnipeg;MB;
I think a good solution would be to add delimiter to represent empty fields:
Abc blaba;403-403-4545;;;MB; -->missing address and postal code
Acme;123-403-4545;;Winnipeg;MB;


Fortunately the source has province names abbreviated (2 letters). I
could also take into account a postal code, maybe:
Given I have 2 simple functions:
isProvince()
isPostalCode():
How I would write to the proper fields once that was returned true?
Province has to go to row[3], and PCode to row[4] right?


Eduardo

On any problem of this type, the toughest part is coming up with the spec. And sometimes you don't really know it till you've run all possible data through a program that analyzes it. If the raw data is available to you, I'd suggest you start there. And if it was converted to this file, and you no longer have the raw data, then at least analyze the program that did the faulty conversion. And if that's not possible, at least plan for your conversion program to do enough error analysis to detect when the data does not meet the assumptions.
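One way to do that kind of analysis is to tally how many fields each line has before writing any conversion code. What follows is only a sketch: the name survey_field_counts and the sample lines (taken from the question) are illustrative, and a real run would pass in the open file instead of a list.

```python
from collections import Counter


def survey_field_counts(lines, delimiter=";"):
    """Count how many lines have each number of fields."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore completely empty lines
        counts[len(line.split(delimiter))] += 1
    return counts


# Sample lines copied from the question; a real run would iterate over the file.
sample = [
    "Company;Telephone;Address;Prov;PCode",
    "Abc blaba;403-403-4545;MB",
    "Acme;123-403-4545;Winnipeg;MB;",
]
print(survey_field_counts(sample))
```

Any count outside the range two to five, or a surprisingly common short count, is a hint that the guessed spec needs another look.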



Let me make a guess about the data, and then the program will write itself.

(Guessing) You have a file consisting of text lines. Each line has between two and five fields, separated by semicolons, with no semicolon appearing inside any of the fields. None of the fields is "quoted", so parsing is simply a matter of splitting by the semicolons.
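Under that guess, str.split(";") is enough. A small illustration (the sample line is made up in the style of the question's data) shows that empty fields survive the split as empty strings, including one after a trailing semicolon:

```python
# Hypothetical line with an empty Address and a trailing semicolon.
line = "Acme;123-403-4545;;MB;\n"

# Strip the newline first, then split on the delimiter.
fields = line.rstrip("\n").split(";")
print(fields)       # ['Acme', '123-403-4545', '', 'MB', '']
print(len(fields))  # 5
```

If quoted fields ever do turn up, the csv module with delimiter=";" would be the safer choice.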

Each field may be blank. Consecutive semicolons indicate a blank field between them. The exhaustive list of fields and missing fields is below.


Company  Telephone  Address  Prov   PCode   (nothing missing)
Company  Telephone  Address  Prov
Company  Telephone  Address
Company  Telephone
Company  Telephone  Prov     PCode
Company  Telephone  PCode
Company  Telephone  Prov
Company  Telephone  Address  PCode

You have a finite list of valid Prov, so isProvince() is reliable, and you have a reliable algorithm for PCode, so isPostalCode() is reliable. In other words, no Address will pass isPostalCode(), no PCode will pass isProvince(), and so on.
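The two functions were not posted, so the following is only a guess at what they might look like. It assumes the data is Canadian (the question mentions Winnipeg and MB), that provinces use the standard two-letter abbreviations, and that postal codes have the usual letter-digit-letter digit-letter-digit shape. The names is_province and is_postal_code are stand-ins for isProvince() and isPostalCode().

```python
import re

# Assumption: two-letter abbreviations of Canadian provinces and territories.
PROVINCES = {
    "AB", "BC", "MB", "NB", "NL", "NS", "NT",
    "NU", "ON", "PE", "QC", "SK", "YT",
}

# Assumption: letter digit letter, optional space, digit letter digit.
POSTAL_CODE_RE = re.compile(r"[A-Za-z]\d[A-Za-z] ?\d[A-Za-z]\d")


def is_province(field):
    """True if the field is one of the known two-letter abbreviations."""
    return field.strip().upper() in PROVINCES


def is_postal_code(field):
    """True if the whole field has the postal-code shape."""
    return POSTAL_CODE_RE.fullmatch(field.strip()) is not None
```

With tests like these, no postal code can pass is_province() and no province can pass is_postal_code(). Whether an Address could ever be mistaken for either still has to be checked against the real data.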

So, your algorithm:

1. Read the file, one line at a time, and split the line into two to five fields, in a list.
2. If the length of the list is less than 2, report an error and quit. If the length is 2, append three blank fields.
3. If item2 is a province, insert a blank field just before it.
4. If item3 is a postal code, insert a blank field just before it.
5. If the (new) length of the list is 5, output the list and proceed to the next line. Otherwise report an error and quit.
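A minimal sketch of these steps in Python follows. It takes item2 and item3 to mean list indexes 2 and 3, and it makes two additions that are assumptions rather than part of the steps above: a postal code found directly after the telephone gets two blank fields in front of it, and rows that are still short are padded at the end so that combinations such as Company, Telephone, Address also come out with five fields. The function name normalize_row and the demo predicates are hypothetical; the real isProvince() and isPostalCode() would be passed in instead.

```python
def normalize_row(fields, is_province, is_postal_code):
    """Return a five-field row: Company, Telephone, Address, Prov, PCode."""
    row = list(fields)
    if len(row) < 2:
        raise ValueError(f"too few fields: {fields!r}")
    if len(row) == 2:
        row.extend(["", "", ""])
    if is_province(row[2]):
        row.insert(2, "")        # Address is missing
    elif is_postal_code(row[2]):
        row[2:2] = ["", ""]      # assumption: Address and Prov both missing
    if len(row) > 3 and is_postal_code(row[3]):
        row.insert(3, "")        # Prov is missing
    # Assumption: trailing fields may be missing, so pad up to five.
    row.extend([""] * (5 - len(row)))
    if len(row) != 5:
        raise ValueError(f"cannot normalize: {fields!r}")
    return row


# Hypothetical stand-ins for isProvince() and isPostalCode().
def demo_is_province(field):
    return field in {"MB", "ON"}


def demo_is_postal_code(field):
    return field in {"R3C 4A5"}


for line in ["Abc blaba;403-403-4545;MB", "Acme;123-403-4545;Winnipeg;MB;"]:
    row = normalize_row(line.split(";"), demo_is_province, demo_is_postal_code)
    print(";".join(row))
```

The first sample line comes out as Abc blaba;403-403-4545;;MB; and the second is already five fields long, so it is written back unchanged. Raising ValueError stands in for "report an error and quit"; the caller decides whether to stop or to log the line and continue.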


DaveA

_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
