A nice discussion about support for small files in Hadoop.

Not sure if this really helps, but I'd like to mention that at Intel we have 
actually spent some time on this interesting problem domain before, and again 
recently. We planned to develop a small-files compaction optimization in the 
Smart Storage Management project (derived from 
https://issues.apache.org/jira/browse/HDFS-7343) that supports 
writing a small file, reading a small file, reading a batch of small files, and 
compacting small files together in the background. This support is transparent 
to applications, but users need to use an HDFS-compatible client. If you're 
interested, please refer to the following links. We have a rough design and 
plans; one important target is to support Deep Learning use cases that train on 
lots of small samples stored in HDFS as files. We will implement it, but your 
feedback would be very welcome.

https://github.com/Intel-bigdata/SSM
https://github.com/Intel-bigdata/SSM/blob/trunk/docs/small-file-solution.md

Regards,
Kai

From: Hayati Gonultas [mailto:[email protected]]
Sent: Tuesday, September 05, 2017 6:06 AM
To: Alexey Eremihin <[email protected]>; Uwe Geercken 
<[email protected]>
Cc: Ralph Soika <[email protected]>; [email protected]
Subject: Re: Re: Is Hadoop basically not suitable for a photo archive?

I would recommend an object store such as openstack swift as another option.

On Mon, Sep 4, 2017 at 1:09 PM Uwe Geercken <[email protected]> wrote:
just my two cents:

Maybe you can use Hadoop for storage, packing multiple files together to use 
HDFS in a smarter way, and at the same time keep a limited, time-based window 
of recent data/photos in parallel in a different solution. I assume you won't 
need high-performance access to the whole time span.

Yes, it would be a duplication, but maybe - without knowing all the details - 
that would be acceptable and an easy way to go.

Cheers,

Uwe

Sent: Monday, 04 September 2017 at 21:32
From: "Alexey Eremihin" <[email protected]>
To: "Ralph Soika" <[email protected]>
Cc: "[email protected]" <[email protected]>
Subject: Re: Is Hadoop basically not suitable for a photo archive?
Hi Ralph,
In general Hadoop is able to store such data, and HAR archives can even be used 
in conjunction with WebHDFS (by passing offset and length attributes). What are 
your reading requirements? The filesystem metadata is not distributed, so 
reading the data is limited by the performance of the HDFS NameNode. If you 
would like to download files at a high request rate, that would not work well.
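For reference, the WebHDFS OPEN operation accepts `offset` and `length` query 
parameters, so once the byte range of a photo inside an archive is known from 
the index, a single photo can be fetched without downloading the whole file. A 
minimal sketch of building such a request URL (hostname, port, and path are 
placeholders, not real endpoints):

```python
from urllib.parse import urlencode

def webhdfs_open_url(host, port, path, offset, length):
    """Build a WebHDFS OPEN URL reading `length` bytes starting at `offset`.

    With HAR or packed files, `offset`/`length` come from the archive index,
    so one small photo is read without scanning the whole archive.
    """
    query = urlencode({"op": "OPEN", "offset": offset, "length": length})
    return f"http://{host}:{port}/webhdfs/v1{path}?{query}"

# Example: read a 2 MiB photo stored at byte offset 1 MiB in a packed file.
url = webhdfs_open_url("namenode.example.com", 9870,
                       "/archive/photos.har/part-0",
                       offset=1048576, length=2097152)
# The URL can then be fetched with any HTTP client (curl, requests, ...).
```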

On Monday, September 4, 2017, Ralph Soika <[email protected]> wrote:

Hi,

I know that the small-file problem has been discussed frequently, not only on 
this mailing list. I have also already read some books about Hadoop, and I have 
started working with it. But I still do not really understand whether Hadoop is 
the right choice for my goals.

To simplify my problem domain I would like to use the use case of a photo 
archive:

- An external application produces about 10 million photos in one year. The 
files contain important business critical data.
- A single photo file has a size between 1 and 10 MB.
- The photos need to be stored over several years (10-30 years).
- The data store should support replication over several servers.
- A checksum-concept is needed to guarantee the data integrity of all files 
over a long period of time.
- To write and read the files, a REST API is preferred.
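On the checksum requirement: HDFS verifies block-level checksums internally, 
but an application-level digest stored alongside each photo makes integrity 
auditable end to end over decades, independent of the storage layer. A minimal 
sketch of that idea (the function name and sample bytes are illustrative):

```python
import hashlib

def photo_digest(data: bytes) -> str:
    """Return a hex SHA-256 digest to store alongside each photo.

    Re-computing and comparing this digest after retrieval detects silent
    corruption end to end, in addition to HDFS's internal block checksums.
    """
    return hashlib.sha256(data).hexdigest()

photo = b"\xff\xd8\xff\xe0" + b"pixel data" * 100  # stand-in for a JPEG
stored = photo_digest(photo)  # persisted next to the photo at write time

# Later, after reading the photo back from the archive:
assert photo_digest(photo) == stored  # integrity verified
```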

So far Hadoop seems to be absolutely the perfect solution. But my last 
requirement seems to throw Hadoop out of the race.

- The photos need to be readable with very short latency from an external 
enterprise application

With Hadoop HDFS and the Web Proxy everything seems perfect. But it seems that 
most Hadoop experts advise against this usage when the size of the data files 
(1-10 MB) is well below the Hadoop block size of 64 or 128 MB.

I think I understood the concepts of HAR and sequence files.
But if I pack, for example, my files together into a large file of many 
gigabytes, it seems impossible to access a single photo from the Hadoop 
repository in a reasonable time. It makes no sense in my eyes to pack thousands 
of files into a large file just so that Hadoop jobs can handle it better. For 
simply accessing a single file from a web interface - as in my case - it all 
seems counterproductive.
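For context, the archive formats mentioned above keep a per-file index, so 
reading one photo out of a packed file is a seek plus a bounded read rather 
than a scan of the whole archive. A local-filesystem sketch of that 
pack-with-index pattern (illustrative only, not the actual HAR on-disk format):

```python
import io

def pack(files):
    """Pack {name: bytes} into one blob plus an index of (offset, length).

    This mirrors the idea behind HAR/SequenceFile archives: the index makes
    a single-file read a seek + bounded read, not a scan of the archive.
    """
    blob = io.BytesIO()
    index = {}
    for name, data in files.items():
        index[name] = (blob.tell(), len(data))
        blob.write(data)
    return blob.getvalue(), index

def read_one(blob, index, name):
    """Fetch exactly one packed file using its recorded byte range."""
    offset, length = index[name]
    return blob[offset:offset + length]

photos = {"a.jpg": b"\xff\xd8" + b"A" * 10, "b.jpg": b"\xff\xd8" + b"B" * 20}
blob, index = pack(photos)
assert read_one(blob, index, "b.jpg") == photos["b.jpg"]
```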

So my question is: is Hadoop only feasible for archiving large web-server log 
files, and not designed to handle big archives of small files containing 
business-critical data?


Thanks for your advice in advance.

Ralph
--

