Hi All,

We are working on a data manipulation application. The idea is that a user should be able to set up a pipeline with one or more inputs (e.g. a database, CSV, spreadsheet, JSON/XML, etc.), one or more transformations of the data (e.g. merging, renaming, aggregating, search/replace, etc.), and finally one or more outputs (e.g. a database, CSV, spreadsheet, JSON/XML, etc.) - basically a program that can do most of the stuff that Talend Open Studio can do, but less bloated.

A quick, simple use case: a user has a spreadsheet with names and addresses. He wants to split the addresses into separate entities (ZIP, street name, country) and output these into different tables of a MySQL database.
We are attempting to build this in Clojure (the backend, that is - at some point in the future we will build a nice GUI for it). We have a basic program and data structure with a few examples on GitHub: https://github.com/kij/grotesql

We are writing to this list hoping to get some constructive criticism and comments on stuff like:

* Does our data representation make sense?
* Does our partial-function-pipeline scheme make sense?
* Have we overlooked something obvious that means our current approach is terrible and useless?
* Comments, ideas, recommendations, thoughts, etc.

I believe our code is fairly straightforward when looked at along with the examples in the doc/ folder, but just a few words about the structure of the program:

* Data is represented as a list of maps, e.g.: '( { :name "Peter", :age 30 } { :name "Jones", :age 50 } ...)
* Data manipulation is done with small, concise functions (simple stuff like: join the values of columns X and Y, rename column X, search/replace in column X) - these functions are then used as 'building blocks' to achieve more complex transformations. All functions take one or more parameters, with the last parameter being the input data. Functions return the manipulated data as output.
* A program is built by creating a list (pipeline) with the following structure: the first entry is an input node (a function that fetches data from an external source and returns it), the last entry is an output node (a function that takes the data as its only input and has no return value), and everything in between are data manipulation nodes (functions that take a single input - the data - and return the manipulated data).
* The list is built by adding partially applied (curried) functions - so for example, if we have a data manipulation function rename-column [ oldname newname data ] (...), the function-specific parameters are filled in before adding it to the list: (partial rename-column :someoldname :somenewname) - which leaves a function that takes the data as its single input and gives the manipulated data as its output, which is exactly the requirement for a data manipulation function in the list.
* When one has the pipeline ready, say '(input manip-a manip-b manip-c output), it will be run like: (output (manip-c (manip-b (manip-a (input))))). As desired, this passes the data from the input, through each of the manipulating functions, and finally to the output function.

We have only just started development, so what's in the repository is mostly proof-of-concept stuff, but it's working and should be enough to give an idea.

We are both new to functional programming and to Clojure, so any thoughts about the questions above, or any pointers in general, would be highly appreciated.

Thanks,
Kasper and Kevin
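
P.S. To make the scheme described above a bit more concrete, here is a minimal sketch of the idea. It is written for this mail only: rename-column follows the signature mentioned above, but its body here is just one possible implementation, and sample-input, search-replace-column, print-output and run-pipeline are hypothetical names that are not necessarily what is in the repository.

```clojure
(require '[clojure.set :as set]
         '[clojure.string :as str])

;; Hypothetical input node: takes no arguments and returns a list of maps.
;; A real input node would read from a file or a database instead.
(defn sample-input []
  '({:name "Peter" :age 30} {:name "Jones" :age 50}))

;; Manipulation functions: the data is always the LAST parameter,
;; so the other parameters can be filled in with `partial`.
(defn rename-column [oldname newname data]
  (map #(set/rename-keys % {oldname newname}) data))

(defn search-replace-column [column match replacement data]
  ;; Assumes every row has a string value under `column`.
  (map #(update-in % [column] str/replace match replacement) data))

;; Hypothetical output node: consumes the data, returns nothing useful.
(defn print-output [data]
  (doseq [row data]
    (println row)))

;; Run a pipeline: call the first entry with no arguments, then thread
;; the result through every remaining entry, in order.
(defn run-pipeline [[input-node & steps]]
  (reduce (fn [data step] (step data))
          (input-node)
          steps))

(def pipeline
  [sample-input
   (partial rename-column :name :first-name)
   (partial search-replace-column :first-name "Jones" "Smith")
   print-output])

(run-pipeline pipeline)
```

Running the last line is the same as evaluating (print-output (search-replace-column ... (rename-column ... (sample-input)))), i.e. the nesting described above. As far as we can tell, ((apply comp (reverse pipeline))) would do the same thing, since comp applies the rightmost function first.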
