Hi everybody,

In my application, processing the whole dataset (which we call a CycleWorkflow) may take several weeks, and it is mandatory for us to split the CycleWorkflow into multiple DayWorkflows. The current system uses a traditional RDBMS approach and relies on SQL OFFSET/LIMIT to split the dataset (potentially 80 billion rows in one table) into smaller datasets.

We are currently (re)designing the application using Hadoop/Cascading/HDFS, and we are trying to find a feature that splits an input MapFile (with 80 billion keys) according to the number of records. For example:

CycleDataset = 10000 <key,value> records
Split1 = 0 <key,value> to 1000 <key,value> = Dataset for DayWorkflow1
Split2 = 1000 <key,value> to 2000 <key,value> = Dataset for DayWorkflow2
Split3 = 2000 <key,value> to 3000 <key,value> = Dataset for DayWorkflow3
etc.

We expect to use MapFile. Is it a good choice, or is there another existing file format (HFile?) more suitable for this usage?

The progress monitoring of DayWorkflowX (a bar graph in a GUI) is currently based on the number of records already processed by that workflow.
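To make the intended split concrete, here is a minimal sketch in plain Java of the record-count arithmetic we have in mind. It does not use any Hadoop API; the class name RecordRangeSplitter and its methods are hypothetical names chosen for illustration, and the ranges are assumed to be half-open, i.e. [start, end).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: it only computes record ranges, it reads no file.
public class RecordRangeSplitter {

    /** Half-open record range [start, end) assigned to one DayWorkflow. */
    public static final class Range {
        public final long start;
        public final long end;

        Range(long start, long end) {
            this.start = start;
            this.end = end;
        }

        public long size() {
            return end - start;
        }

        @Override
        public String toString() {
            return "[" + start + ", " + end + ")";
        }
    }

    /** Cuts totalRecords into consecutive ranges of at most recordsPerDay records. */
    public static List<Range> split(long totalRecords, long recordsPerDay) {
        if (totalRecords < 0 || recordsPerDay <= 0) {
            throw new IllegalArgumentException("invalid record counts");
        }
        List<Range> ranges = new ArrayList<Range>();
        for (long start = 0; start < totalRecords; start += recordsPerDay) {
            // The last range may be shorter than recordsPerDay.
            long end = Math.min(start + recordsPerDay, totalRecords);
            ranges.add(new Range(start, end));
        }
        return ranges;
    }

    /** Fraction (0.0 to 1.0) of a range already processed, for the progress bar. */
    public static double progress(Range range, long processedRecords) {
        if (range.size() == 0) {
            return 1.0;
        }
        long clamped = Math.max(0, Math.min(processedRecords, range.size()));
        return (double) clamped / (double) range.size();
    }

    public static void main(String[] args) {
        // Same numbers as the example: 10000 records, 1000 per DayWorkflow.
        List<Range> ranges = split(10000L, 1000L);
        for (int i = 0; i < ranges.size(); i++) {
            System.out.println("DayWorkflow" + (i + 1) + " = " + ranges.get(i));
        }
    }
}
```

This arithmetic only works if the total record count is known up front and if a reader can be positioned at the N-th record cheaply, which is exactly the part we have not found in MapFile.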
In the MapFile class I cannot find an API to count records efficiently. Am I missing something? (Probably yes.)

Which solution would you suggest for splitting the input dataset according to the number of <key,value> records?

Thank you for your response.

Regards,
Alain

(We are using Hadoop 0.20.xx.)
