Looks like HBase MOB should be mentioned, since the feature was definitely introduced with photo files/objects in mind.
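For example, a table with a MOB-enabled column family could be created like this - just a minimal sketch, assuming the HBase 2.0 Java admin API from HBASE-11339; the 'photos' table, the 'p' family and the 100 KB threshold are placeholders only:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CreateMobPhotoTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            // photos of 1-10 MB are far above the threshold, so every photo
            // cell is written to a separate MOB file and kept out of the
            // normal region compactions
            HColumnDescriptor family = new HColumnDescriptor("p");
            family.setMobEnabled(true);
            family.setMobThreshold(102400L); // 100 KB
            HTableDescriptor table = new HTableDescriptor(TableName.valueOf("photos"));
            table.addFamily(family);
            admin.createTable(table);
        }
    }
}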
Regards,
Kai

From: Grant Overby [mailto:[email protected]]
Sent: Thursday, September 07, 2017 3:05 AM
To: Ralph Soika <[email protected]>
Cc: [email protected]
Subject: Re: Is Hadoop basically not suitable for a photo archive?

I'm late to the party, and this isn't a Hadoop solution, but apparently Cassandra is pretty good at this:
https://medium.com/walmartlabs/building-object-store-storing-images-in-cassandra-walmart-scale-a6b9c02af593

On Wed, Sep 6, 2017 at 2:48 PM, Ralph Soika <[email protected]> wrote:

Hi,

I want to thank you all for your answers and your good ideas for solving the Hadoop "small-file problem". I would like to briefly summarize your answers and the suggested solutions. First of all, let me describe my general use case once again:

* An external enterprise application needs to store small photo files at irregular intervals in a clustered big-data storage.
* Users need to read the files through the web interface of the enterprise application, also at irregular intervals.
* The solution needs to guarantee the data integrity of all files over a long period of time.
* A REST API is preferred for writing and reading the files.

1) Multiple small files in one sequence file:

Packing multiple small files into one sequence file is a possible solution, even if it is hard to implement. Because the files arrive from the enterprise application at irregular intervals (as in my case), the application needs to know the current size of the sequence file and compute the correct offset. The offset and the size of a photo file are needed to access it later, e.g. via WebHDFS:

http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=OPEN&offset=<LONG>&length=<LONG>

If multiple threads try to append data in parallel, the problem becomes considerably more complex. But yes - this could be a possible solution (see the first sketch below this summary).

2) Multiple small files in a Hadoop Archive (HAR):

Another solution is to pack the small files into a Hadoop Archive (HAR) file. But this solution is even more difficult to implement in my case. As I explained, the enterprise application writes data at irregular intervals. This means that the archiving job needs to be decoupled from the enterprise application. For example, a scheduler could archive and delete files older than one day on a daily basis. This would reduce the number of small files significantly. The problem here is that the enterprise application needs to be aware of the new location of a single photo file: to access a 'packed' photo, its offset and size within the HAR file need to be transferred back to the enterprise application. As a result, the complexity of the overall system increases unreasonably. To decouple things, the scheduler could create a kind of index file for each newly created HAR file. The enterprise application could use this index to look up the file path, offset and size. But since a single photo can now be stored either still as a small file or already as part of a HAR file, the access method becomes quite tricky to implement (see the second sketch below this summary). OK, the solution is possible, but the sequence file solution seems to be much easier.

3) The object store "OpenStack Swift":

It seems that the object store "OpenStack Swift" solves the small-file problem much better. It is certainly worth following this approach. However, since I am basically convinced of Hadoop, I will not make a fundamental change in architecture for now.

4) Intel-bigdata/SSM:

The "Transparent Small Files Support" from the Intel-Bigdata SSM project is an interesting approach, and I believe it would solve my problem completely. But I fear it is too early to start with it.

5) HDFS-7240 / Ozone:

HDFS-7240, or Ozone, looks very promising. To me, Ozone looks like the missing piece in the Hadoop project. Although it is not yet ready for production use, I will follow this project.

6) MapR-FS:

MapR-FS could be an alternative, but I do not consider it here.

My conclusion: For my use case the best short-term solution seems to be to start with the sequence file approach. In the intermediate term I will see whether I can adapt my solution to the Hadoop Ozone project. In the long term I will probably support both approaches. Since my solution is part of the open source project Imixs-Workflow, I will also publish it on GitHub.

So once again - thanks a lot for your help.

Ralph
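A first minimal sketch for approach 1 - appending a photo to a shared sequence file and recording the offset/length a reader needs later. This assumes Hadoop 2.6.1 or newer, where SequenceFile.Writer.appendIfExists is available; the file name, key scheme and photo source are placeholders only:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PhotoAppender {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path seqFile = new Path("/photos/archive-2017.seq"); // placeholder path
        byte[] photo = java.nio.file.Files.readAllBytes(
                java.nio.file.Paths.get("photo-4711.jpg"));  // placeholder photo

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(seqFile),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class),
                SequenceFile.Writer.appendIfExists(true))) {

            long offset = writer.getLength(); // file position before this record
            writer.append(new Text("photo-4711.jpg"), new BytesWritable(photo));
            writer.hflush();                  // make the record visible to readers
            long length = writer.getLength() - offset;

            // the enterprise application would persist (offset, length) and later
            // read the record via the WebHDFS call shown above:
            //   ...?op=OPEN&offset=<offset>&length=<length>
            System.out.println("appended at offset=" + offset + ", length=" + length);
        }
    }
}

Note that the recorded byte range also contains the record header and the key in front of the raw image bytes, so a WebHDFS client would have to skip those; and since a sequence file has a single writer, parallel appends still need to be serialized, e.g. through one writer thread or queue.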
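A second minimal sketch for approach 2 - reading a packed photo back through the har:// filesystem, assuming a scheduled job has created the archive with the standard tool (e.g. hadoop archive -archiveName photos-2017-09-06.har -p /photos /archives); all names are placeholders:

import java.io.FileOutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HarPhotoReader {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // the har:// scheme layers over the default filesystem; the archive's
        // _index files map the logical name to its position in the part files
        Path photo = new Path("har:///archives/photos-2017-09-06.har/photo-4711.jpg");
        FileSystem harFs = photo.getFileSystem(conf);
        try (FSDataInputStream in = harFs.open(photo);
             FileOutputStream out = new FileOutputStream("photo-4711.jpg")) {
            IOUtils.copyBytes(in, out, conf); // stream the photo to a local file
        }
    }
}

For REST access, the index file mentioned in point 2 would have to map the original file name to the offset and length inside the archive's part file, which could then be fetched with the same WebHDFS OPEN call as in approach 1.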
On 04.09.2017 19:03, Ralph Soika wrote:

Hi,

I know that the small-file problem has been raised frequently, not only on this mailing list. I have also read some books about Hadoop and have started to work with it. But I still did not really understand whether Hadoop is the right choice for my goals. To simplify my problem domain, I would like to use the use case of a photo archive:

- An external application produces about 10 million photos in one year. The files contain important, business-critical data.
- A single photo file has a size between 1 and 10 MB.
- The photos need to be stored over several years (10-30 years).
- The data store should support replication over several servers.
- A checksum concept is needed to guarantee the data integrity of all files over a long period of time.
- A REST API is preferred for writing and reading the files.

So far, Hadoop seems to be absolutely the perfect solution. But my last requirement seems to throw Hadoop out of the race:

- The photos need to be readable with very short latency from an external enterprise application.

With Hadoop HDFS and the web proxy everything seems perfect. But most Hadoop experts appear to advise against this usage when the data files (1-10 MB) are well below the Hadoop block size of 64 or 128 MB. I think I understood the concepts of HAR and sequence files. But if I pack, for example, my files together into a large file of many gigabytes, it is impossible to access one single photo from the Hadoop repository in a reasonable time. In my eyes it makes no sense to pack thousands of files into a large file just so that Hadoop jobs can handle it better. For simply accessing a single file from a web interface - as in my case - it all seems counterproductive.

So my question is: Is Hadoop only feasible for archiving large web-server log files, and not designed to handle big archives of small files that also contain business-critical data?

Thanks for your advice in advance.

Ralph

--
Imixs...extends the way people work together
We are an open source company, read more at: www.imixs.org

________________________________
Imixs Software Solutions GmbH
Agnes-Pockels-Bogen 1, 80992 München
Web: www.imixs.com
Office: +49 (0)89-452136 16
Mobil: +49-177-4128245
Registergericht: Amtsgericht Muenchen, HRB 136045
Geschaeftsfuehrer: Gaby Heinle u. Ralph Soika
