Kai, this is great. It is well down the path to solving the small/object-as-file problem. Good show!
*Daemeon C.M. Reiydelle*
*San Francisco 1.415.501.0198*
*London 44 020 8144 9872*

On Mon, Sep 4, 2017 at 8:56 PM, Zheng, Kai <[email protected]> wrote:

> A nice discussion about support of small files in Hadoop.
>
> Not sure if this really helps, but I'd like to mention that at Intel we
> have actually spent some time on this interesting problem domain before,
> and again recently. We planned to develop a small-files compaction
> optimization in the Smart Storage Management project (derived from
> https://issues.apache.org/jira/browse/HDFS-7343) that can support
> writing a small file, reading a small file, reading a batch of small
> files, and compacting small files together in the background. This
> support is transparent to applications, but users need to use an
> HDFS-compatible client. If you're interested, please refer to the
> following links. We have a rough design and plans; one important target
> is to support Deep Learning use cases that want to train on lots of
> small samples stored in HDFS as files. We will implement it, but your
> feedback would be very welcome.
>
> https://github.com/Intel-bigdata/SSM
> https://github.com/Intel-bigdata/SSM/blob/trunk/docs/small-file-solution.md
>
> Regards,
> Kai
>
> *From:* Hayati Gonultas [mailto:[email protected]]
> *Sent:* Tuesday, September 05, 2017 6:06 AM
> *To:* Alexey Eremihin <[email protected]>; Uwe Geercken <[email protected]>
> *Cc:* Ralph Soika <[email protected]>; [email protected]
> *Subject:* Re: Re: Is Hadoop basically not suitable for a photo archive?
>
> I would recommend an object store such as OpenStack Swift as another
> option.
>
> On Mon, Sep 4, 2017 at 1:09 PM Uwe Geercken <[email protected]> wrote:
>
> Just my two cents:
>
> Maybe you can use Hadoop for storage and pack multiple files together to
> use HDFS in a smarter way, and at the same time store a limited,
> time-based amount of data/photos in parallel in a different solution.
> I assume you won't need high-performance access to the whole time span.
>
> Yes, it would be a duplication, but maybe - without knowing all the
> details - that would be acceptable and an easy way to go.
>
> Cheers,
>
> Uwe
>
> *Sent:* Monday, 04 September 2017 at 21:32
> *From:* "Alexey Eremihin" <[email protected]>
> *To:* "Ralph Soika" <[email protected]>
> *Cc:* "[email protected]" <[email protected]>
> *Subject:* Re: Is Hadoop basically not suitable for a photo archive?
>
> Hi Ralph,
>
> In general, Hadoop is able to store such data, and even HAR archives can
> be used in conjunction with WebHDFS (by passing offset and length
> attributes). What are your reading requirements? FS metadata is not
> distributed, and reading the data is limited by the performance of the
> HDFS NameNode server. So if you would like to download files at a high
> request rate, that would not work well.
>
> On Monday, September 4, 2017, Ralph Soika <[email protected]> wrote:
>
> Hi,
>
> I know that the issue around the small-file problem is asked frequently,
> not only on this mailing list. I have also already read some books about
> Hadoop, and I have started to work with it. But I still do not really
> understand whether Hadoop is the right choice for my goals.
>
> To simplify my problem domain, I would like to use the use case of a
> photo archive:
>
> - An external application produces about 10 million photos in one year.
>   The files contain important, business-critical data.
> - A single photo file has a size between 1 and 10 MB.
> - The photos need to be stored for several years (10-30 years).
> - The data store should support replication over several servers.
> - A checksum concept is needed to guarantee the data integrity of all
>   files over a long period of time.
> - To write and read the files, a REST API is preferred.
>
> So far Hadoop seems to be absolutely the perfect solution. But my last
> requirement seems to throw Hadoop out of the race.
> - The photos need to be readable with very short latency from an
>   external enterprise application.
>
> With Hadoop HDFS and the Web Proxy, everything seems perfect. But it
> seems that most of the Hadoop experts advise against this usage if the
> size of my data files (1-10 MB) is well below the Hadoop block size of
> 64 or 128 MB.
>
> I think I understood the concepts of HAR and sequence files. But if I
> pack, for example, my files together into a large file of many
> gigabytes, it is impossible to access one single photo from the Hadoop
> repository in a reasonable time. In my eyes it makes no sense to pack
> thousands of files into one large file just so that Hadoop jobs can
> handle it better. For simply accessing a single file from a web
> interface - as in my case - it all seems counterproductive.
>
> So my question is: Is Hadoop only feasible for archiving large
> web-server log files, and not designed to handle big archives of small
> files that also contain business-critical data?
>
> Thanks for your advice in advance.
>
> Ralph
>
> --
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
> --
> Hayati Gonultas
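The packing idea the thread keeps circling (HAR, sequence files, SSM compaction) can be sketched in a few lines: concatenate the small photos into one large file and keep a side index of each photo's (offset, length), so a single photo stays individually addressable without listing millions of HDFS objects. This is a minimal sketch, assuming local files stand in for HDFS; `pack` and `fetch` are hypothetical helper names, not Hadoop APIs.

```python
import json
import os

def pack(photo_paths, archive_path, index_path):
    """Concatenate small files into one archive file and record each
    photo's (offset, length) in a JSON index for later random access."""
    index = {}
    offset = 0
    with open(archive_path, "wb") as out:
        for p in photo_paths:
            with open(p, "rb") as f:
                data = f.read()
            index[os.path.basename(p)] = (offset, len(data))
            out.write(data)
            offset += len(data)
    with open(index_path, "w") as f:
        json.dump(index, f)

def fetch(archive_path, index_path, name):
    """Random-access read of one packed photo by name: look up its
    byte range in the index, then seek and read only that range."""
    with open(index_path) as f:
        offset, length = json.load(f)[name]
    with open(archive_path, "rb") as f:
        f.seek(offset)
        return f.read(length)
```

On HDFS the archive would be one large, block-friendly file, and the seek-and-read step maps directly onto a ranged read, so fetching one photo never touches the rest of the archive.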
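Alexey's WebHDFS suggestion rests on the OPEN operation accepting `offset` and `length` query parameters, which is what makes a packed archive usable over REST: the client asks for just the byte range of one photo. A minimal sketch of building and issuing such a request; the host, port, and archive path are hypothetical placeholders, not a real cluster.

```python
import urllib.request

def webhdfs_open_url(host, port, path, offset, length):
    """Build a WebHDFS OPEN URL that reads `length` bytes starting at
    `offset` within the file at `path`."""
    return (
        f"http://{host}:{port}/webhdfs/v1{path}"
        f"?op=OPEN&offset={offset}&length={length}"
    )

def read_photo(host, port, archive_path, offset, length):
    """Fetch one photo's bytes out of a packed archive via WebHDFS.
    urllib follows the NameNode's redirect to a DataNode automatically."""
    url = webhdfs_open_url(host, port, archive_path, offset, length)
    with urllib.request.urlopen(url) as resp:
        return resp.read()
```

Combined with an offset index, this gives the REST read path Ralph asked for without unpacking the archive; throughput is still bounded by the NameNode for metadata lookups, as Alexey notes.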
