Hi
I want to thank you all for your answers and your good ideas on how to
solve the Hadoop "small file problem".
Now I would like to briefly summarize your answers and suggested
solutions. First of all, let me describe my general use case once again:
* An external enterprise application needs to store small photo files
at irregular intervals into a clustered big data storage.
* Users need to read the files through the web interface of the
enterprise application, also at irregular intervals.
* The solution needs to guarantee the data integrity of all files over
a long period of time.
* A REST API is preferred for writing and reading the files.
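On the integrity requirement: HDFS maintains block-level checksums internally, but for long-term archiving an additional application-level checksum per photo can be verified end to end. A minimal sketch of that idea (the helper names are hypothetical, not part of any Hadoop API):

```python
# Sketch of an application-level integrity check: compute a SHA-256 per
# photo at write time, store it with the file's metadata, and verify it
# on every read. This is independent of HDFS's internal block checksums.
import hashlib

def checksum(data: bytes) -> str:
    """Hex digest recorded when the photo is first stored."""
    return hashlib.sha256(data).hexdigest()

def verify(data: bytes, expected: str) -> bool:
    """True if the bytes read back still match the recorded checksum."""
    return checksum(data) == expected
```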
_1) Multiple small files in one sequence file:_
Packing multiple small files into one sequence file is a possible
solution, even though it is hard to implement. Since the enterprise
application appends files at irregular intervals (as in my case), it
needs to track the current size of the sequence file and compute the
correct offset for each appended file. The offset and the size of a
photo file are needed to access that single photo file later.
E.g. via WebHDFS:
http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=OPEN&offset=<LONG>&length=<LONG>
If multiple threads try to append data in parallel, the problem becomes
even more complex. But yes - this could be a possible solution.
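The bookkeeping described above can be sketched roughly as follows. This is not the actual Hadoop SequenceFile format, just an illustration of the append/index logic the enterprise application would have to maintain (all names are hypothetical):

```python
# Append small files into one container file and record (offset, length)
# per entry, so a single photo can be read back later by a ranged read
# (analogous to WebHDFS ?op=OPEN&offset=...&length=...).

def append_photo(container_path, index, name, data):
    """Append raw photo bytes; record (offset, length) for later reads."""
    with open(container_path, "ab") as f:
        offset = f.tell()  # current file size == offset of the new entry
        f.write(data)
    index[name] = (offset, len(data))
    return offset, len(data)

def read_photo(container_path, index, name):
    """Read one photo back via its recorded offset and length."""
    offset, length = index[name]
    with open(container_path, "rb") as f:
        f.seek(offset)
        return f.read(length)
```

Note that this only works safely with a single writer; concurrent appenders would have to serialize access to the container file, which is exactly the complexity mentioned above.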
_2) Multiple small files in a Hadoop Archive (HAR):_
Another solution is to pack the small files into a Hadoop Archive (HAR)
file. But this solution is even more difficult to implement in my case.
As I explained, the enterprise application writes data at irregular
intervals. This means that the archive job needs to be decoupled from
the enterprise application. For example, a scheduler could archive and
delete files older than one day on a daily basis. This would reduce the
number of small files significantly. The problem here is that the
enterprise application needs to be aware of the new location of a
single photo file. To access a 'packed' photo file, the offset and size
within the HAR file need to be transferred back to the enterprise
application. As a result, the complexity of the overall system
increases unreasonably. To decouple things, the scheduler could create
a kind of index file for each newly created HAR file. The index could
be used by the enterprise application to look up the file path, offset
and size. But as a single photo file can now be stored either still as
a small file or already as part of a HAR file, the access method
becomes tricky to implement.
OK, the solution is possible, but the sequence file solution seems to be
much easier.
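The index-file idea could look roughly like this: for each newly created HAR file, the scheduler writes a small JSON index mapping the original file path to its new location, and the lookup falls back to the original path for files that have not been archived yet. The file layouts here are my own assumptions, not the real HAR internals:

```python
# Hypothetical index written by the archiving scheduler, mapping an
# original small-file path to its new location inside a HAR file.
import json

def write_index(index_path, entries):
    # entries example:
    # {"photos/img1.jpg": {"har": "/archive/day1.har",
    #                      "offset": 128, "length": 2048}}
    with open(index_path, "w") as f:
        json.dump(entries, f)

def locate(index_path, original_path):
    """Return (har_path, offset, length) if the file was archived,
    else None (meaning: still stored as an ordinary small file)."""
    with open(index_path) as f:
        entries = json.load(f)
    e = entries.get(original_path)
    if e is None:
        return None
    return e["har"], e["offset"], e["length"]
```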
_3) The object store "OpenStack Swift":_
It seems that the object store "OpenStack Swift" solves the small-file
problem much better. It is certainly worth following this approach.
However, since I am basically convinced of Hadoop, I will not make a
fundamental change in architecture for now.
_4) Intel-bigdata/SSM:_
The "Transparent Small Files Support" from the Intel-Bigdata project is
an interesting approach, and I believe it would solve my problems
completely. But I fear it is too early to start here.
_5) HDFS-7240 or Ozone:_
HDFS-7240, or Ozone, looks very promising. It seems to me that Ozone is
the missing piece in the Hadoop project. Although it is not yet ready
for use in production, I will follow this project.
_6) mapR-fs:_
mapR-fs could be an alternative, but I do not consider it here.
_My conclusion:_
For my use case, the best short-term solution seems to be to start with
the sequence file approach. In the medium term I will see whether I can
adapt my solution to the Hadoop Ozone project. In the long term I will
probably support both approaches.
Since my solution is part of the open source project Imixs-Workflow, I
will also publish it on GitHub.
So once again - thanks a lot for your help.
Ralph
On 04.09.2017 19:03, Ralph Soika wrote:
Hi,
I know that the issue around the small-file problem has been raised
frequently, not only on this mailing list.
I have also already read some books about Hadoop and have started to
work with it. But I still do not really understand whether Hadoop is
the right choice for my goals.
To simplify my problem domain I would like to use the use case of a
photo archive:
- An external application produces about 10 million photos in one
year. The files contain important business critical data.
- A single photo file has a size between 1 and 10 MB.
- The photos need to be stored over several years (10-30 years).
- The data store should support replication over several servers.
- A checksum-concept is needed to guarantee the data integrity of all
files over a long period of time.
- To write and read the files a Rest API is preferred.
So far Hadoop seems to be absolutely the perfect solution. But my last
requirement seems to throw Hadoop out of the race:
- The photos need to be readable with very short latency from an
external enterprise application.
With Hadoop HDFS and the Web Proxy everything seems perfect. But it
seems that most Hadoop experts advise against this usage when the size
of the data files (1-10 MB) is well below the Hadoop block size of 64
or 128 MB.
I think I have understood the concepts of HAR and sequence files.
But if I pack, for example, my files together into a large file of many
gigabytes, it is impossible to access a single photo in the Hadoop
repository in a reasonable time. In my eyes it makes no sense to pack
thousands of files into one large file just so that Hadoop jobs can
handle it better. For simply accessing a single file from a web
interface - as in my case - this all seems counterproductive.
So my question is: is Hadoop only feasible for archiving large
web-server log files, and not designed to handle big archives of small
files that also contain business-critical data?
Thanks for your advice in advance.
Ralph
--
*Imixs*...extends the way people work together
We are an open source company, read more at: www.imixs.org
<http://www.imixs.org>
------------------------------------------------------------------------
Imixs Software Solutions GmbH
Agnes-Pockels-Bogen 1, 80992 München
*Web:* www.imixs.com <http://www.imixs.com>
*Office:* +49 (0)89-452136 16 *Mobil:* +49-177-4128245
Registergericht: Amtsgericht Muenchen, HRB 136045
Geschaeftsfuehrer: Gaby Heinle u. Ralph Soika