Hi
I want to thank you all for your answers and your good ideas on how to
solve the Hadoop "small file problem".
Now I would like to briefly summarize your answers and suggested
solutions. First of all, let me describe my general use case once again:
* An external enterprise application needs to store small photo files
at irregular intervals into a clustered big data storage.
* Users need to read the files through the web interface of the
enterprise application, also at irregular intervals.
* The solution needs to guarantee the data integrity of all files over
a long period of time.
* A REST API is preferred for writing and reading the files.
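On the integrity requirement: HDFS maintains block-level checksums internally, but for long-term archiving an additional application-level checksum per photo can be verified end to end. A minimal sketch of that idea (the helper names are hypothetical, not part of any Hadoop API):

```python
# Sketch of an application-level integrity check: compute a SHA-256 per
# photo at write time, store it with the file's metadata, and verify it
# on every read. This is independent of HDFS's internal block checksums.
import hashlib

def checksum(data: bytes) -> str:
    """Hex digest recorded when the photo is first stored."""
    return hashlib.sha256(data).hexdigest()

def verify(data: bytes, expected: str) -> bool:
    """True if the bytes read back still match the recorded checksum."""
    return checksum(data) == expected
```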
_1) Multiple small files in one sequence file:_
Packing multiple small files into one sequence file is a possible
solution, even though it is hard to implement. Since the enterprise
application appends files at irregular intervals (as in my case), it
needs to track the current size of the sequence file and compute the
correct offset for each appended file. The offset and the size of a
photo file are needed to access that single photo file later.
E.g. via WebHDFS:
http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=OPEN&offset=<LONG>&length=<LONG>
If multiple threads try to append data in parallel, the problem becomes
even more complex. But yes - this could be a possible solution.
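The bookkeeping described above can be sketched roughly as follows. This is not the actual Hadoop SequenceFile format, just an illustration of the append/index logic the enterprise application would have to maintain (all names are hypothetical):

```python
# Append small files into one container file and record (offset, length)
# per entry, so a single photo can be read back later by a ranged read
# (analogous to WebHDFS ?op=OPEN&offset=...&length=...).

def append_photo(container_path, index, name, data):
    """Append raw photo bytes; record (offset, length) for later reads."""
    with open(container_path, "ab") as f:
        offset = f.tell()  # current file size == offset of the new entry
        f.write(data)
    index[name] = (offset, len(data))
    return offset, len(data)

def read_photo(container_path, index, name):
    """Read one photo back via its recorded offset and length."""
    offset, length = index[name]
    with open(container_path, "rb") as f:
        f.seek(offset)
        return f.read(length)
```

Note that this only works safely with a single writer; concurrent appenders would have to serialize access to the container file, which is exactly the complexity mentioned above.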
_2) Multiple small files in a Hadoop Archive (HAR):_
Another solution is to pack the small files into a Hadoop Archive (HAR)
file. But this solution is even more difficult to implement in my case.
As I explained, the enterprise application writes data at irregular
intervals. This means that the archive job needs to be decoupled from
the enterprise application. For example, a scheduler could archive and
delete files older than one day on a daily basis. This would reduce the
number of small files significantly. The problem here is that the
enterprise application needs to be aware of the new location of a
single photo file. To access a 'packed' photo file, the offset and size
within the HAR file need to be transferred back to the enterprise
application. As a result, the complexity of the overall system
increases unreasonably. To decouple things, the scheduler could create
a kind of index file for each newly created HAR file. The index could
be used by the enterprise application to look up the file path, offset
and size. But as a single photo file can now be stored either still as
a small file or already as part of a HAR file, the access method
becomes tricky to implement.
OK, the solution is possible, but the sequence file solution seems to be
much easier.
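The index-file idea could look roughly like this: for each newly created HAR file, the scheduler writes a small JSON index mapping the original file path to its new location, and the lookup falls back to the original path for files that have not been archived yet. The file layouts here are my own assumptions, not the real HAR internals:

```python
# Hypothetical index written by the archiving scheduler, mapping an
# original small-file path to its new location inside a HAR file.
import json

def write_index(index_path, entries):
    # entries example:
    # {"photos/img1.jpg": {"har": "/archive/day1.har",
    #                      "offset": 128, "length": 2048}}
    with open(index_path, "w") as f:
        json.dump(entries, f)

def locate(index_path, original_path):
    """Return (har_path, offset, length) if the file was archived,
    else None (meaning: still stored as an ordinary small file)."""
    with open(index_path) as f:
        entries = json.load(f)
    e = entries.get(original_path)
    if e is None:
        return None
    return e["har"], e["offset"], e["length"]
```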
_3) The object store "OpenStack Swift":_
It seems that the object store "OpenStack Swift" solves the small-file
problem much better. It is certainly worth following this approach.
However, since I am basically convinced of Hadoop, I will not make a
fundamental change in architecture for now.
_4) Intel-bigdata/SSM:_
The "Transparent Small Files Support" from the Intel-Bigdata project is
an interesting approach, and I believe it would solve my problems
completely. But I fear it is too early to start here.
_5) HDFS-7240 or Ozone:_
HDFS-7240, or Ozone, looks very promising. It seems to me that Ozone is
the missing piece in the Hadoop project. Although it is not yet ready
for use in production, I will follow this project.
_6) mapR-fs:_
mapR-fs could be an alternative, but I do not consider it here.
_My conclusion:_
For my use case, the best short-term solution seems to be to start with
the sequence file approach. In the medium term I will see whether I can
adapt my solution to the Hadoop Ozone project. In the long term I will
probably support both approaches.
Since my solution is part of the open source project Imixs-Workflow, I
will also publish it on GitHub.
So once again - thanks a lot for your help.
Ralph
On 04.09.2017 19:03, Ralph Soika wrote:
Hi,
I know that the issue around the small-file problem has been raised
frequently, not only on this mailing list.
I have also already read some books about Hadoop and have started to
work with it. But I still do not really understand whether Hadoop is
the right choice for my goals.
To simplify my problem domain I would like to use the use case of a
photo archive:
- An external application produces about 10 million photos in one
year. The files contain important business critical data.
- A single photo file has a size between 1 and 10 MB.
- The photos need to be stored over several years (10-30 years).
- The data store should support replication over several servers.
- A checksum-concept is needed to guarantee the data integrity of all
files over a long period of time.
- To write and read the files a Rest API is preferred.
So far Hadoop seems to be absolutely the perfect solution. But my last
requirement seems to throw Hadoop out of the race:
- The photos need to be readable with very short latency from an
external enterprise application.
With Hadoop HDFS and the Web Proxy everything seems perfect. But it
seems that most Hadoop experts advise against this usage when the size
of the data files (1-10 MB) is well below the Hadoop block size of 64
or 128 MB.
I think I have understood the concepts of HAR and sequence files.
But if I pack, for example, my files together into a large file of many
gigabytes, it is impossible to access a single photo in the Hadoop
repository in a reasonable time. In my eyes it makes no sense to pack
thousands of files into one large file just so that Hadoop jobs can
handle it better. For simply accessing a single file from a web
interface - as in my case - this all seems counterproductive.
So my question is: is Hadoop only feasible for archiving large
web-server log files, and not designed to handle big archives of small
files that also contain business-critical data?
Thanks for your advice in advance.
Ralph
--
*Imixs*...extends the way people work together
We are an open source company, read more at: www.imixs.org
<http://www.imixs.org>
------------------------------------------------------------------------
Imixs Software Solutions GmbH
Agnes-Pockels-Bogen 1, 80992 München
*Web:* www.imixs.com <http://www.imixs.com>
*Office:* +49 (0)89-452136 16 *Mobil:* +49-177-4128245
Registergericht: Amtsgericht Muenchen, HRB 136045
Geschaeftsfuehrer: Gaby Heinle u. Ralph Soika