Eduardo Vieira wrote:
Hello, I have a csv file,
a broken csv file
 using the ";" as a delimiter. This file
contains addresses. My problem is that some fields are missing in some
rows and I would like to normalize the rows for a smoother import into
Excel, for example.
Here is an example. This is the header:
Company;Telephone;Address;Prov;PCode
While most of them have this header, some data would be like this:
Abc blaba;403-403-4545;MB ---> missing address, city, and postal code
Acme;123-403-4545;Winnipeg;MB;
I think a good solution would be to add delimiter to represent empty fields:
Abc blaba;403-403-4545;;;MB; -->missing address and postal code
Acme;123-403-4545;;Winnipeg;MB;

Fortunately the source has province names abbreviated (2 letters). I
could also take into account a postal code, maybe:
Given I have 2 simple functions:
isProvince()
isPostalCode():
How I would write to the proper fields once that was returned true?
Province has to go to row[3], and PCode to row[4] right?


Eduardo

On any problem of this type, the toughest part is coming up with the spec. And sometimes you don't really know it till you've run all possible data through a program that analyzes it. If the raw data is available to you, I'd suggest you start there. And if it was converted to this file, and you no longer have the raw data, then at least analyze the program that did the faulty conversion. And if that's not possible, at least plan for your conversion program to do enough error analysis to detect when the data does not meet the assumptions.


Let me make a guess about the data, and then the program will write itself.

(Guessing) You have a file consisting of text lines. Each line has between two and five fields, separated by semicolon, with no semicolon appearing inside any of the fields. None of the fields is "quoted" so parsing is simply a matter of splitting by the semicolons.

Each field may be blank. Multiple semicolons indicates a blank field between them. The exhaustive list of fields and missing fields is below.

Company  Telephone     Address   Prov   PCode  (nothing missing)


Company  Telephone     Address   Prov


Company  Telephone     Address

Company  Telephone
Company  Telephone     Prov  PCode


Company  Telephone     PCode

Company Telephone Prov
Company  Telephone     Address   PCode


You have a finite list of valid Prov, so isProvince() is reliable, and you have a reliable algorithm for PCode, so isPostalCode() is reliable. In other words, no Address will pass isPostalCode(), no PCode will pass isProvince(), and so on.

So, your algorithm: Read the file, one line at a time, and split the line into two to five fields, in a list. If the length of the list is less than 2, report error and quit. If the length is 2, append three blank fields. If item2 is a province, insert a blank field just before it. if item3 is a postalcode, insert a blank field just before it If the (new) length of the list is 5, output the list and proceed to the next line. Otherwise report an error and quit.

DaveA

_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor

Reply via email to