Re: Storing large files for later processing through hadoop

2015-01-03 Thread Wilm Schumacher
On 03.01.2015 at 07:07, Srinivasa T N wrote:
> Hi Wilm,
> The reason is that, for auditing purposes, I want to store the
> original files as well.

Well, then I would use an HDFS cluster for storing them, as it seems to be exactly what you need. If you collocate the HDFS DataNodes and YARN's ResourceManager…

Re: Storing large files for later processing through hadoop

2015-01-02 Thread Jacob Rhoden
If it's for auditing, I'd recommend pushing the files out somewhere reasonably external. Amazon S3 works well for this type of thing, and you don't have to worry too much about backups and the like.

Sent from iPhone

> On 3 Jan 2015, at 5:07 pm, Srinivasa T N wrote…
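[Editor's note: a rough illustration of this suggestion, not from the thread. It sketches pushing an audit copy to S3 with boto3; the bucket name, key layout and file id are made up, and AWS credentials are assumed to be configured already.]

    # Minimal sketch: archive the untouched XML in S3 for auditing,
    # independently of the Hadoop/Cassandra processing pipeline.
    import boto3

    s3 = boto3.client("s3")

    def archive_original(local_path, file_id):
        # Bucket "audit-raw-xml" and the key layout are placeholders.
        s3.upload_file(local_path, "audit-raw-xml", "originals/%s.xml" % file_id)

    archive_original("/data/incoming/invoice-batch.xml", "invoice-batch-2015-01-02")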

Re: Storing large files for later processing through hadoop

2015-01-02 Thread Srinivasa T N
Hi Wilm,
The reason is that, for auditing purposes, I want to store the original files as well.

Regards,
Seenu.

On Fri, Jan 2, 2015 at 11:09 PM, Wilm Schumacher wrote:
> Hi,
>
> perhaps I totally misunderstood your problem, but why "bother" with
> cassandra for storing in the first place?

Re: Storing large files for later processing through hadoop

2015-01-02 Thread Wilm Schumacher
Hi,

Perhaps I totally misunderstood your problem, but why "bother" with Cassandra for storing in the first place? If your Hadoop MR job is only run once for each file (as you wrote above), why not copy the data directly to HDFS, run your MR job there, and use Cassandra as the sink? As HDFS and YARN are more…
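[Editor's note: a minimal sketch of this "HDFS in, Cassandra as sink" flow, not from the thread. The hadoop-streaming jar location, paths, and the mapper/reducer script names are all assumptions.]

    # Sketch of Wilm's suggestion: put the raw XML into HDFS, then launch a
    # streaming job over it. Assumes hdfs/hadoop are on PATH.
    import subprocess

    RAW = "/data/incoming/big-file.xml"
    HDFS_IN = "/user/seenu/raw/big-file.xml"
    HDFS_OUT = "/user/seenu/parsed/big-file"

    # 1) Copy the file straight into HDFS -- no Cassandra involved for the raw blob.
    subprocess.check_call(["hdfs", "dfs", "-put", "-f", RAW, HDFS_IN])

    # 2) Run the streaming job; the reducer would write the extracted values
    #    into Cassandra (the "sink"), e.g. via the Python driver.
    subprocess.check_call([
        "hadoop", "jar", "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar",
        "-input", HDFS_IN,
        "-output", HDFS_OUT,
        "-mapper", "parse_xml_mapper.py",
        "-reducer", "write_to_cassandra_reducer.py",
        "-file", "parse_xml_mapper.py",
        "-file", "write_to_cassandra_reducer.py",
    ])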

Re: Storing large files for later processing through hadoop

2015-01-02 Thread mck
> Since the Hadoop MR streaming job requires the file to be processed to be
> present in HDFS, I was thinking whether it can get it directly from mongodb
> instead of me manually fetching it and placing it in a directory before
> submitting the hadoop job?

Hadoop M/R can get data directly from…

Re: Storing large files for later processing through hadoop

2015-01-02 Thread Srinivasa T N
I agree that Cassandra is a columnar store. Storing the raw XML file, parsing it with Hadoop, and then storing the extracted values happens only once. The extracted data, on which further operations will be done, suits the time-series style of storage that Cassandra provides, and th…
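[Editor's note: for illustration only, one possible time-series layout for the extracted values using the Python cassandra-driver. The keyspace, table and column names are invented for the sketch, not taken from the thread.]

    # Hypothetical time-series table for the values extracted from each XML file.
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("audit")  # keyspace name assumed

    session.execute("""
        CREATE TABLE IF NOT EXISTS events_by_source (
            source_id text,        -- which original XML file the value came from
            event_time timestamp,  -- extracted timestamp, used for clustering
            metric text,
            value double,
            PRIMARY KEY ((source_id), event_time, metric)
        ) WITH CLUSTERING ORDER BY (event_time DESC, metric ASC)
    """)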

Re: Storing large files for later processing through hadoop

2015-01-02 Thread Eric Stevens
> Can this split and combine be done automatically by Cassandra when
> inserting/fetching the file, without the application being bothered about it?

There are client libraries which offer recipes for this, but in general, no. You're trying to do something with Cassandra that it's not designed to do. You…
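[Editor's note: a bare-bones sketch of such a chunking recipe on the write side, not from the thread. The file_chunks (file_id, chunk_no, data) table and the 1 MB chunk size are assumptions.]

    # Split a large file into fixed-size pieces and store one row per piece.
    from cassandra.cluster import Cluster

    CHUNK_SIZE = 1024 * 1024  # stay well under the practical blob limit

    session = Cluster(["127.0.0.1"]).connect("audit")  # keyspace name assumed
    insert = session.prepare(
        "INSERT INTO file_chunks (file_id, chunk_no, data) VALUES (?, ?, ?)")

    def store_file(path, file_id):
        with open(path, "rb") as f:
            chunk_no = 0
            while True:
                data = f.read(CHUNK_SIZE)
                if not data:
                    break
                session.execute(insert, (file_id, chunk_no, data))
                chunk_no += 1
        return chunk_no  # number of chunks written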

Re: Storing large files for later processing through hadoop

2015-01-02 Thread Srinivasa T N
On Fri, Jan 2, 2015 at 5:54 PM, mck wrote:
> You could manually chunk them down to 64Mb pieces.

Can this split and combine be done automatically by Cassandra when inserting/fetching the file, without the application being bothered about it?

> > 2) Can I replace HDFS with Cassandra so that I…

Re: Storing large files for later processing through hadoop

2015-01-02 Thread mck
> 1) The FAQ … informs that I can have only files of around 64 MB …

See http://wiki.apache.org/cassandra/CassandraLimitations

"A single column value may not be larger than 2GB; in practice, "single digits of MB" is a more reasonable limit, since there is no streaming or random access of blob values."
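[Editor's note: to complete the picture, a matching read-side sketch that reassembles the original file from the same assumed file_chunks table used in the chunking sketch above; again an illustration, not from the thread.]

    # Fetch the pieces for one file_id and concatenate them back together.
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("audit")  # keyspace name assumed

    def fetch_file(file_id, out_path):
        rows = session.execute(
            "SELECT chunk_no, data FROM file_chunks WHERE file_id = %s", (file_id,))
        with open(out_path, "wb") as out:
            # chunk_no would normally be a clustering column, so rows arrive in
            # order; sorting here is just defensive.
            for row in sorted(rows, key=lambda r: r.chunk_no):
                out.write(row.data)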

Storing large files for later processing through hadoop

2015-01-02 Thread Srinivasa T N
Hi All,

The problem I am trying to address is: store the raw files (the files are in XML format and around 700 MB in size) in Cassandra, later fetch them and process them on a Hadoop cluster, and populate the processed data back into Cassandra. Regarding this, I wanted a few clarifications:

1) The FAQ ( ht…