Re: [Rd] Design for classes with database connection

Paul Gilbert Thu, 19 Sep 2013 15:29:58 -0700

Simon

Your idea to use SQLite and the nature of some of the sorting and extracting you are suggesting makes me wonder why you are thinking of R data structures as the home for the data storage. I would be inclined to put the data in an SQL database as the prime repository, then extract parts you want with SQL queries and bring them into R for analysis and graphics. If the full data set is large, and the parts you want to analyze in R at any one time are relatively small, then this will be much faster. After all, SQL is primarily for databases, whereas R's strength is more in statistics and graphics.
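
A minimal sketch of that approach, assuming the DBI and RSQLite packages are installed. The table name `ticks`, its columns and the sample values are invented for illustration and are not taken from any existing package:

```r
library(DBI)

# In-memory SQLite database standing in for the prime repository;
# a file path would be used for real data.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical tick table. Timestamps are stored as ISO 8601 text,
# so comparing and sorting them as strings follows time order.
ticks <- data.frame(
  secID     = c("AA", "AA", "BB", "AA"),
  tradetime = c("2012-03-08 09:30:01", "2012-03-08 09:30:05",
                "2012-03-08 09:30:02", "2012-03-09 09:30:00"),
  price     = c(10.1, 10.2, 55.0, 10.4),
  stringsAsFactors = FALSE
)
dbWriteTable(con, "ticks", ticks)

# Let the database do the selection and sorting; only the (small)
# result set is brought into R.
aa <- dbGetQuery(con,
  "SELECT tradetime, price FROM ticks
   WHERE secID = ? AND tradetime < ?
   ORDER BY tradetime",
  params = list("AA", "2012-03-09"))

dbDisconnect(con)
```

The resulting data.frame `aa` can then be turned into whatever time series class is wanted on the R side.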

In the project http://tsdbi.r-forge.r-project.org/ I have code that does some of the things you probably want. There the focus is on a single identifier for a series, and various observation frequencies are supported. Tick data is supported (as time stamped data) but not extensively tested as I do not work with tick data much. There is a function TSquery, currently in TSdbi on CRAN but very shortly being split with the SQL specific parts of the interface into a package TSsql. It is very much like the queries you seem to have in mind, but I have not used it with tick data. It is used to generate a time series by formulating a query to a database with several possible sorting fields, very much like you describe, and then order the data according to the time index.

If your data set is large, then you need to think carefully about which fields you index. You certainly do not want to be building the indexes on the fly, as you would need to do if you dump all the data out of R into an SQL db just to do a sort. If the data set is small then indexing does not matter too much. Also, for a small data set there will be much less advantage in keeping the data in an SQL db rather than in R. You do need to be a bit more specific about what "huge" means. (Tick data for 5 days or 20 years? 100 IDs or 10 million?) Large for an R structure is not necessarily large for an SQL db. With more specifics I might be able to give more suggestions.
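
As a rough illustration of indexing up front rather than on the fly, the sketch below (again assuming DBI and RSQLite, with a made-up `ticks` table and index name) creates the index once when the table is set up and then asks SQLite how it would execute a typical query:

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbExecute(con,
  "CREATE TABLE ticks (secID TEXT, tradetime TEXT, price REAL)")

# Build the index once, on the fields used for selection and ordering.
dbExecute(con,
  "CREATE INDEX idx_ticks_sec_time ON ticks (secID, tradetime)")

# EXPLAIN QUERY PLAN reports the access path SQLite would choose;
# the 'detail' column names the index if it is used.
plan <- dbGetQuery(con,
  "EXPLAIN QUERY PLAN
   SELECT price FROM ticks WHERE secID = 'AA' ORDER BY tradetime")
uses_index <- any(grepl("idx_ticks_sec_time", plan$detail))

dbDisconnect(con)
```

Which fields deserve an index depends on the queries actually run, so checking the plan for a few representative queries is a cheap sanity test.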


(R-SIG-DB may be a better forum for this discussion.)

HTH,
Paul

On 13-09-18 01:06 PM, Simon Zehnder wrote:

Dear R-Devels,

I am currently designing a package intended to simplify the handling
of market microstructure data (tick data, order data, etc.). As these
data are usually pretty huge and need to be reordered quite often
(e.g. if data for several securities are batched together or if only
a certain time range should be considered), the package needs to
handle this.

Before I start, I would like to mention some facts which made me
decide to write my own package instead of using e.g. the packages
bigmemory, highfrequency, zoo or xts: AFAIK bigmemory does not
provide the opportunity to handle data with different types
(timestamp, string and numerics) and their appropriate sorting; for
this task databases offer better tools. Package highfrequency is
designed to work specifically with a certain data structure, and the
data in market microstructure has much greater versatility. Packages
zoo and xts offer a lot of versatility but do not offer the data
sorting ability needed for such big data.

I would like to get some feedback in regard to my decision and in
regard to the short design overview following.

My design idea is now:

1. Base the package on S4 classes, with one class that handles
data-reading from external sources, structuring and reordering.
Structuring is done in regard to specific data variables, i.e.
security ID, company ID, timestamp, price, volume (not all have to be
provided, but some surely exist on market microstructure data). The
less important variables are considered as a slot @other and are only
ordered in regard to the other variables. Something like this:

.mmstruct <- setClass('mmstruct',
    representation(name      = "character",
                   index     = "array",
                   N         = "integer",
                   K         = "integer",
                   compiD    = "array",
                   secID     = "array",
                   tradetime = "POSIXlt",
                   flag      = "array",
                   price     = "array",
                   vol       = "array",
                   other     = "data.frame"))

2. To enable a lightweight ordering function, the class should
basically create an SQLite database on construction and delete it if
'rm()' is called. Throughout its life an object holds the database
path and can execute queries on the database tables. This way I can
use the table sorting of SQLite (e.g. by constructing an index for
each important variable). I assume this is faster and more efficient
than programming something on my own - why reinvent the wheel? For
this I would use VIRTUAL classes like:

.mmstructBASE <- setClass('mmstructBASE',
    representation(dbName  = "character",
                   dbTable = "character"))

.mmstructDB <- setClass('mmstructDB',
    representation(conn = "SQLiteConnection"),
    contains = c("mmstructBASE"))

.mmstruct <- setClass('mmstruct',
    representation(name      = "character",
                   index     = "array",
                   N         = "integer",
                   K         = "integer",
                   compiD    = "array",
                   secID     = "array",
                   tradetime = "POSIXlt",
                   price     = "array",
                   vol       = "array",
                   other     = "data.frame"),
    contains = c("mmstructDB"))

The slots in the mmstruct class then hold a view (e.g. only the
head()) of the data or can be used to hold retrieved data from the
underlying database.
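
One point worth keeping in mind for the clean-up step: rm() only removes the binding and does not run any hook by itself. A possible way to tie the database file to the life of an object is a finalizer registered with reg.finalizer() on an environment, which runs when the environment is garbage collected. The sketch below assumes DBI and RSQLite; `new_db_handle()`, `cleanup_db()` and the table `t` are hypothetical names used only for illustration:

```r
# Finalizer: close the connection if it is still open, then delete the file.
cleanup_db <- function(e) {
  if (DBI::dbIsValid(e$conn)) DBI::dbDisconnect(e$conn)
  unlink(e$path)
}

# Hypothetical constructor returning an environment that owns a
# temporary SQLite database file.
new_db_handle <- function() {
  handle <- new.env()
  handle$path <- tempfile(fileext = ".sqlite")
  handle$conn <- DBI::dbConnect(RSQLite::SQLite(), handle$path)
  DBI::dbExecute(handle$conn, "CREATE TABLE t (x INTEGER)")
  # onexit = TRUE also runs the finalizer when the R session ends.
  reg.finalizer(handle, cleanup_db, onexit = TRUE)
  handle
}

h <- new_db_handle()
db_path <- h$path
file_existed <- file.exists(db_path)

rm(h)            # removes the binding only; the file is still on disk
invisible(gc())  # collecting the handle lets the finalizer delete the file
```

An S4 object could keep such an environment in one of its slots, so that the file goes away once the last object referring to it has been collected.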

3. The workflow would then be something like:

   a) The user reads in the data from an external source and gets a
      data.frame from it.
   b) This data.frame can then be used to construct an mmstruct
      object by formatting the variables and reading them into the
      SQLite database constructed.
   c) Given the data structure in the database, the user can sort the
      data by secID, timestamp etc. and can use several algorithms
      for cleaning the data (package-specific, not in the database).
   d) Example: The user makes a query to get only price from entries
      compID = "AA" with tradetime < "2012-03-09" or with tradetime
      only on the first trading day in a month. This can then be
      converted e.g. to a 'ts' object in R by coercing.
   e) In addition, the user can perform several estimations of market
      microstructure models by calling package-specific functions.


Is there a big fault in my design, something I haven't considered? I
am sure there are researchers and developers on this list with much
more experience. I would like to hear your opinions and ideas. I can
learn from them and maybe get to a design which I can then implement
for the research on such data and models.


Best

Simon




______________________________________________ R-devel@r-project.org
mailing list https://stat.ethz.ch/mailman/listinfo/r-devel

