Greetings Albert-Jan,

I have a suggestion and a comment in this matter.

Yes, a justified fear of making mistakes. That is, a mistake has already occurred and I don't want it to happen again.

Suggestion: Choose a single in-memory representation of your data and make all of your input and output functions perform conversion to appropriate data types.

See below for a bit more explanation.

I made a comparison with data from two sources: a csv file (reference file) and an sqlite database (test data). The csv module will always return str, unless one converts it. The test data were written to sqlite with pandas.to_sql, which (it seems) tries to be helpful by making INTs of everything that looks like ints. I chose sqlite because the real data will be in SQL server, and I hope this would mimic the behavior wrt None, NULL, nan, "", etc.
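A small sketch of the mismatch described above, using only the standard library. The table name "t" and the column names are made up for illustration; the point is only that the same stored number comes back as str from csv and as int from an INTEGER column in sqlite.

```python
import csv
import io
import sqlite3

# The csv module hands back every field as a string.
csv_rows = list(csv.reader(io.StringIO("id,score\n1,42\n")))
csv_value = csv_rows[1][1]

# sqlite3 returns an int for a value stored in an INTEGER column.
# (Table "t" and its columns are hypothetical.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, score INTEGER)")
conn.execute("INSERT INTO t VALUES (1, 42)")
db_value = conn.execute("SELECT score FROM t").fetchone()[0]
conn.close()

print(type(csv_value).__name__, type(db_value).__name__)
print(csv_value == db_value)  # a str never equals an int in Python
```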

Comment and explanation: I have been following this thread and I will tell you how I would look at this problem (instead of trying to compare different data types).

  * It sounds as though you will have several different types of
    backing stores for your data.  You mentioned 1) csv, 2) sqlite,
    3) SQL server.  Each of these is a different serialization tool.

  * You also mention comparisons.  It seems as though you are
    comparing data acquired (read into memory) from backing store 1)
    to data retrieved from 2).

If you are reading data into memory, then you are probably planning to compute, process, transmit or display the data. In each case, I'd imagine you are operating on the data (numerically, if they are numbers).

I would write a function (or class or module) that can read the data from any of the backing stores you want to use (csv, sqlite, SQL server, punch cards or even pigeon feathers). Each piece of code that reads data from a particular serialization (e.g. sqlite) would be responsible for converting to the in-memory form.

Thus, it would not matter where you store the data...once it's in memory, the form or representation you have chosen will be identical.
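Here is a minimal sketch of that idea, assuming the in-memory form is a list of (id, score) tuples holding int or None. The function names, the "scores" table and the column names are all hypothetical, and treating both "" and NULL as None is just one possible choice; you would pick whatever rule suits your data.

```python
import csv
import io
import sqlite3


def to_int_or_none(value):
    """Convert a raw stored value to int; treat None and '' as missing."""
    if value is None or value == "":
        return None
    return int(value)


def read_csv_records(fileobj):
    """Read (id, score) records from a CSV file object, converting each field."""
    reader = csv.DictReader(fileobj)
    return [(to_int_or_none(r["id"]), to_int_or_none(r["score"])) for r in reader]


def read_sqlite_records(conn):
    """Read (id, score) records from sqlite, converting each field the same way."""
    rows = conn.execute("SELECT id, score FROM scores ORDER BY id")
    return [(to_int_or_none(i), to_int_or_none(s)) for i, s in rows]


# Same logical data in two backing stores: an empty CSV field and a NULL.
csv_text = "id,score\n1,42\n2,\n"
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (id INTEGER, score INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?)", [(1, 42), (2, None)])

from_csv = read_csv_records(io.StringIO(csv_text))
from_db = read_sqlite_records(conn)
conn.close()

# Once both are in the chosen in-memory form, a plain comparison works.
print(from_csv == from_db)
```

Adding a reader for SQL server would then mean writing one more function that returns the same list of tuples; the comparison code would not need to change.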

There is the benefit, then, of your code being agnostic (or extensible) to the serialization tool.

By the way, did you know that pandas.DataFrame.to_csv() [0] also exists?

-Martin

 [0] http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html

--
Martin A. Brown
http://linux-ip.net/
