RFC, data manipulation program

Kevin Ilchmann Jørgensen Fri, 02 Dec 2011 12:10:34 -0800

Hi All,

We are working on a data manipulation application. The idea is that a user
should be able to set up a pipeline with one or more input files (e.g. a
database, CSV, spreadsheet, JSON/XML, etc.), one or more transformations of
the data (e.g. merging, renaming, aggregating, search/replace, etc.), and
finally output to one or more files (e.g. database, CSV, spreadsheet,
JSON/XML, etc.) - basically a program that can do most of the stuff that
Talend Open Studio can do, but less bloated. A quick, simple use case: a
user has a spreadsheet with names and addresses. He wants to separate the
addresses into different entities (ZIP, street name, country) and output
these into different tables of a MySQL database.


We are attempting to build this in Clojure (the backend, that is - at some
point in the future we will build a nice GUI for it). We have a basic
program and data structure with a few examples on GitHub:
https://github.com/kij/grotesql

We are writing to this list hoping to get some constructive criticism and
comments on stuff like:

* Does our data representation make sense?
* Does our partial-function-pipeline scheme make sense?
* Have we overlooked something obvious that means our current approach is
terrible and useless?
* Comments, ideas, recommendations, thoughts, etc.

I believe our code is fairly straightforward when looked at along with
the examples in the doc/ folder, but here are a few words about the
structure of the program:

* Data is represented as a list of maps, e.g.: '( { :name "Peter", :age 30 }
{ :name "Jones", :age 50 } ...)
* Data manipulation is done by using small, concise functions (simple stuff
like: join the values of columns X and Y, rename column X, search/replace
in column X) - these functions are then used as 'building blocks' to achieve
more complex transformations. All functions take one or more parameters,
with the last parameter being the input data. Functions return the
manipulated data as output.
* A program is built by creating a list (pipeline) with the following
structure: the first entry is an input node (a function that fetches data
from an external source and returns the data), the last entry is an output
node (a function that takes the data as its only input and has no return
value), and everything in between is a data manipulation node (a function
that takes a single input - the data - and spits out the manipulated data).
* The list is built by adding partially applied functions - so for example,
if we have a data manipulating function: rename-column [ oldname newname
data ] (...), the function-specific parameters are filled in before adding
it to the list: (partial rename-column :someoldname :somenewname) - which
leaves a function that takes the data as its single input and gives the
manipulated data as its output - the requirement for a data manipulation
function in the list.
* When one has the pipeline, say '(input manip-a manip-b manip-c output),
ready, it will be run like: (output (manip-c (manip-b (manip-a (input))))).
As desired, this will pass the data from input, through each of the
manipulating functions, and finally to the output function.
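
To make the scheme above concrete, here is a minimal sketch of how such a
pipeline could be put together and run. It is not taken from the
repository: only the name and argument order of rename-column come from
the description above. The input and output nodes, the sink atom and
run-pipeline are hypothetical stand-ins, and the rename-column body is
just one plausible implementation.

```clojure
(require '[clojure.set :as set])

;; Manipulation function: parameters first, data last (as described above).
;; Assumed implementation; the real one in the repository may differ.
(defn rename-column
  "Renames the key oldname to newname in every row of data."
  [oldname newname data]
  (map #(set/rename-keys % {oldname newname}) data))

;; Hypothetical input node: returns hard-coded rows instead of reading
;; from an external source.
(defn input []
  '({:name "Peter", :age 30} {:name "Jones", :age 50}))

;; Hypothetical output node: stores the rows in an atom instead of writing
;; to a file or database, and returns nil (no meaningful return value).
(def sink (atom nil))

(defn output [data]
  (reset! sink (doall data))
  nil)

;; Hypothetical runner: calls the input node with no arguments, then feeds
;; the result through every remaining node in order.
(defn run-pipeline [[input-node & nodes]]
  (reduce (fn [data node] (node data)) (input-node) nodes))

;; The pipeline is a list of functions. Note (list ...) rather than a
;; quoted list, so the entries are functions and not symbols.
(def pipeline
  (list input
        (partial rename-column :name :surname)
        output))

(run-pipeline pipeline)
;; @sink now holds the rows with :name renamed to :surname
```

Under the same assumptions, ((apply comp (reverse pipeline))) should give
the same result, since comp applies the rightmost function first.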

We have only just started development, so what's in the repository is
mostly proof-of-concept stuff, but it's working and should be enough to
give an idea. We are both new to functional programming and to Clojure,
so any thoughts about the questions above or any pointers in general would
be highly appreciated.

Thanks,
Kasper and Kevin

