Looks like HBase MOB should be mentioned, since the feature was definitely introduced with photo files/objects in mind.
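For example, a table with a MOB-enabled column family could be created like this - just a minimal sketch, assuming the HBase 2.0 Java admin API from HBASE-11339; the 'photos' table, the 'p' family and the 100 KB threshold are placeholders only:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CreateMobPhotoTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            // photos of 1-10 MB are far above the threshold, so every photo
            // cell is written to a separate MOB file and kept out of the
            // normal region compactions
            HColumnDescriptor family = new HColumnDescriptor("p");
            family.setMobEnabled(true);
            family.setMobThreshold(102400L); // 100 KB
            HTableDescriptor table = new HTableDescriptor(TableName.valueOf("photos"));
            table.addFamily(family);
            admin.createTable(table);
        }
    }
}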
Regards,
Kai

From: Grant Overby [mailto:[email protected]]
Sent: Thursday, September 07, 2017 3:05 AM
To: Ralph Soika <[email protected]>
Cc: [email protected]
Subject: Re: Is Hadoop basically not suitable for a photo archive?

I'm late to the party, and this isn't a Hadoop solution, but apparently Cassandra is pretty good at this:
https://medium.com/walmartlabs/building-object-store-storing-images-in-cassandra-walmart-scale-a6b9c02af593

On Wed, Sep 6, 2017 at 2:48 PM, Ralph Soika <[email protected]> wrote:

Hi,

I want to thank you all for your answers and your good ideas for solving the Hadoop "small-file problem". I would like to briefly summarize your answers and the suggested solutions. First of all, let me describe my general use case once again:

* An external enterprise application needs to store small photo files at irregular intervals in a clustered big-data storage.
* Users need to read the files through the web interface of the enterprise application, also at irregular intervals.
* The solution needs to guarantee the data integrity of all files over a long period of time.
* A REST API is preferred for writing and reading the files.

1) Multiple small files in one sequence file:

Packing multiple small files into one sequence file is a possible solution, even if it is hard to implement. Because the files arrive from the enterprise application at irregular intervals (as in my case), the application needs to know the current size of the sequence file and compute the correct offset. The offset and the size of a photo file are needed to access it later, e.g. via WebHDFS:

http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=OPEN&offset=<LONG>&length=<LONG>

If multiple threads try to append data in parallel, the problem becomes considerably more complex. But yes - this could be a possible solution (see the first sketch below this summary).

2) Multiple small files in a Hadoop Archive (HAR):

Another solution is to pack the small files into a Hadoop Archive (HAR) file. But this solution is even more difficult to implement in my case. As I explained, the enterprise application writes data at irregular intervals. This means that the archiving job needs to be decoupled from the enterprise application. For example, a scheduler could archive and delete files older than one day on a daily basis. This would reduce the number of small files significantly. The problem here is that the enterprise application needs to be aware of the new location of a single photo file: to access a 'packed' photo, its offset and size within the HAR file need to be transferred back to the enterprise application. As a result, the complexity of the overall system increases unreasonably. To decouple things, the scheduler could create a kind of index file for each newly created HAR file. The enterprise application could use this index to look up the file path, offset and size. But since a single photo can now be stored either still as a small file or already as part of a HAR file, the access method becomes quite tricky to implement (see the second sketch below this summary). OK, the solution is possible, but the sequence file solution seems to be much easier.

3) The object store "OpenStack Swift":

It seems that the object store "OpenStack Swift" solves the small-file problem much better. It is certainly worth following this approach. However, since I am basically convinced of Hadoop, I will not make a fundamental change in architecture for now.

4) Intel-bigdata/SSM:

The "Transparent Small Files Support" from the Intel-Bigdata SSM project is an interesting approach, and I believe it would solve my problem completely. But I fear it is too early to start with it.

5) HDFS-7240 / Ozone:

HDFS-7240, or Ozone, looks very promising. To me, Ozone looks like the missing piece in the Hadoop project. Although it is not yet ready for production use, I will follow this project.

6) MapR-FS:

MapR-FS could be an alternative, but I do not consider it here.

My conclusion: For my use case the best short-term solution seems to be to start with the sequence file approach. In the intermediate term I will see whether I can adapt my solution to the Hadoop Ozone project. In the long term I will probably support both approaches. Since my solution is part of the open source project Imixs-Workflow, I will also publish it on GitHub.

So once again - thanks a lot for your help.

Ralph
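A first minimal sketch for approach 1 - appending a photo to a shared sequence file and recording the offset/length a reader needs later. This assumes Hadoop 2.6.1 or newer, where SequenceFile.Writer.appendIfExists is available; the file name, key scheme and photo source are placeholders only:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PhotoAppender {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path seqFile = new Path("/photos/archive-2017.seq"); // placeholder path
        byte[] photo = java.nio.file.Files.readAllBytes(
                java.nio.file.Paths.get("photo-4711.jpg"));  // placeholder photo

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(seqFile),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class),
                SequenceFile.Writer.appendIfExists(true))) {

            long offset = writer.getLength(); // file position before this record
            writer.append(new Text("photo-4711.jpg"), new BytesWritable(photo));
            writer.hflush();                  // make the record visible to readers
            long length = writer.getLength() - offset;

            // the enterprise application would persist (offset, length) and later
            // read the record via the WebHDFS call shown above:
            //   ...?op=OPEN&offset=<offset>&length=<length>
            System.out.println("appended at offset=" + offset + ", length=" + length);
        }
    }
}

Note that the recorded byte range also contains the record header and the key in front of the raw image bytes, so a WebHDFS client would have to skip those; and since a sequence file has a single writer, parallel appends still need to be serialized, e.g. through one writer thread or queue.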
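A second minimal sketch for approach 2 - reading a packed photo back through the har:// filesystem, assuming a scheduled job has created the archive with the standard tool (e.g. hadoop archive -archiveName photos-2017-09-06.har -p /photos /archives); all names are placeholders:

import java.io.FileOutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HarPhotoReader {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // the har:// scheme layers over the default filesystem; the archive's
        // _index files map the logical name to its position in the part files
        Path photo = new Path("har:///archives/photos-2017-09-06.har/photo-4711.jpg");
        FileSystem harFs = photo.getFileSystem(conf);
        try (FSDataInputStream in = harFs.open(photo);
             FileOutputStream out = new FileOutputStream("photo-4711.jpg")) {
            IOUtils.copyBytes(in, out, conf); // stream the photo to a local file
        }
    }
}

For REST access, the index file mentioned in point 2 would have to map the original file name to the offset and length inside the archive's part file, which could then be fetched with the same WebHDFS OPEN call as in approach 1.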
On 04.09.2017 19:03, Ralph Soika wrote:

Hi,

I know that the small-file problem has been raised frequently, not only on this mailing list. I have also read some books about Hadoop and have started to work with it. But I still did not really understand whether Hadoop is the right choice for my goals. To simplify my problem domain, I would like to use the use case of a photo archive:

- An external application produces about 10 million photos in one year. The files contain important, business-critical data.
- A single photo file has a size between 1 and 10 MB.
- The photos need to be stored over several years (10-30 years).
- The data store should support replication over several servers.
- A checksum concept is needed to guarantee the data integrity of all files over a long period of time.
- A REST API is preferred for writing and reading the files.

So far, Hadoop seems to be absolutely the perfect solution. But my last requirement seems to throw Hadoop out of the race:

- The photos need to be readable with very short latency from an external enterprise application.

With Hadoop HDFS and the web proxy everything seems perfect. But most Hadoop experts appear to advise against this usage when the data files (1-10 MB) are well below the Hadoop block size of 64 or 128 MB. I think I understood the concepts of HAR and sequence files. But if I pack, for example, my files together into a large file of many gigabytes, it is impossible to access one single photo from the Hadoop repository in a reasonable time. In my eyes it makes no sense to pack thousands of files into a large file just so that Hadoop jobs can handle it better. For simply accessing a single file from a web interface - as in my case - it all seems counterproductive.

So my question is: Is Hadoop only feasible for archiving large web-server log files, and not designed to handle big archives of small files that also contain business-critical data?

Thanks for your advice in advance.

Ralph

--
Imixs...extends the way people work together
We are an open source company, read more at: www.imixs.org

________________________________
Imixs Software Solutions GmbH
Agnes-Pockels-Bogen 1, 80992 München
Web: www.imixs.com
Office: +49 (0)89-452136 16
Mobil: +49-177-4128245
Registergericht: Amtsgericht Muenchen, HRB 136045
Geschaeftsfuehrer: Gaby Heinle u. Ralph Soika
