Jens Scheidtmann 2013-03-30, 19:15
Bertrand Dechoux 2013-03-31, 09:38
Thanks for both your responses. I was indeed talking about developing a
codec utility as the Hadoop application itself.
In particular, thanks to Bertrand for the lengthy response. I'm actually
learning Hadoop at the moment, so I've been trying to find a suitable (very
modestly sized) application for a student project (1-2 weeks max).
I had previously written a codec utility in Perl that uses a combination of
dictionary (LZW) and arithmetic coding techniques. Compression rates aren't
that bad, but it's very slow.
In any case, I just thought that it might be interesting to Hadoop-ify the
program since compression/decompression is compute intensive and could
probably benefit from parallelization.
I'm thinking now that it might not be such a good fit after all.
Also, if anyone reading this has any novel ideas for demonstrating Hadoop's
capabilities within a short development window, I'd love to hear about
them.
At the moment, I'm leaning towards a distributed grep, most likely with
some kind of agrep-like functionality. Not really a searingly inventive
idea, but if anyone can suggest some way I could make it more exciting, I'd
love to hear about that too.
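
To make the agrep-like idea a little more concrete, here is a minimal sketch of the map side, assuming it would run as a Hadoop Streaming mapper that reads lines on stdin and writes matching lines to stdout. The names `fuzzy_find` and `map_lines` are hypothetical, and the matching uses a plain dynamic-programming approximate substring search rather than whatever agrep does internally.

```python
import sys


def fuzzy_find(pattern, line, max_errors):
    """Return True if some substring of `line` is within `max_errors`
    edits (insert, delete, substitute) of `pattern`."""
    m = len(pattern)
    if m <= max_errors:
        return True  # the empty substring is already close enough
    # prev[i] = fewest edits needed to match pattern[:i] against a
    # substring of the text ending at the previous character.
    prev = list(range(m + 1))
    for ch in line:
        cur = [0] * (m + 1)  # cur[0] == 0: a match may start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            cur[i] = min(prev[i - 1] + cost,  # match or substitute
                         prev[i] + 1,         # extra character in the line
                         cur[i - 1] + 1)      # pattern character missing
        if cur[m] <= max_errors:
            return True
        prev = cur
    return False


def map_lines(lines, pattern, max_errors=1):
    """Yield every input line that approximately contains `pattern`."""
    for line in lines:
        text = line.rstrip("\n")
        if fuzzy_find(pattern, text, max_errors):
            yield text


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): fuzzy_mapper.py PATTERN [MAX_ERRORS]
    errors = int(sys.argv[2]) if len(sys.argv) > 2 else 1
    for match in map_lines(sys.stdin, sys.argv[1], errors):
        print(match)
```

With `max_errors=0` this behaves like an exact substring grep; raising the limit is where the approximate matching comes in, and it is also what makes each line more expensive to process, which is the part that might benefit from being spread over many mappers.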
On 31 March 2013 10:38, Bertrand Dechoux <[EMAIL PROTECTED]> wrote:
> Your question could be interpreted in another way: should I use Hadoop in
> order to perform massive compression/decompression using my own
> (possibly proprietary) utility?
> So yes, Hadoop can be used to parallelize the work. But the real answer
> will depend on your context, like always.
> How many files need to be processed? What is the average size? Is your
> utility parallelizable? How will the data be used afterwards?
> The number of files and their size matter because Hadoop is designed
> to handle a relatively small number of relatively large files: a few
> million gigabyte-sized files rather than billions of megabyte-sized
> files. Many small files can hurt performance. But a huge file is not
> necessarily better: if your utility is not parallelizable then,
> regardless of Hadoop, uncompressing a 2GB file requires a single
> process to read the whole file, and the uncompressed version then
> needs to be stored somewhere.
> So the final question is: for what purpose? If it is for massive
> decompression, keeping the compressed version inside Hadoop seems a sane
> strategy. So it might be better to rely on a standard compression utility
> and uncompress only before processing inside Hadoop itself. If it is for
> compression, well, it might not be that massive because you might not
> receive that many files at the same time.
> The common strategy in Hadoop is not to compress a whole file but instead
> to compress the parts (blocks) of the file. This way the size of each
> compression task is bounded and the work can be parallelized even
> with a non-parallelizable compression utility. The drawback is that the
> "list of compressed blocks" is not a standard compressed file, so
> interoperability with other parts of your system is not guaranteed without
> extra work.
> On Sat, Mar 30, 2013 at 8:15 PM, Jens Scheidtmann <
> [EMAIL PROTECTED]> wrote:
>> Dear Robert,
>> SequenceFiles support record compression, block compression, or no
>> compression. You can configure which codec (gzip, bzip2, etc.) is used.
>> Have a look at
>> Best regards,
> Bertrand Dechoux
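
The block-wise strategy described in the quoted message (compress the parts of a file independently instead of the whole file) can be illustrated outside Hadoop with a small sketch. This is not how Hadoop or SequenceFile implements it; it assumes zlib as a stand-in codec, a hypothetical 64 KiB block size, and a local thread pool in place of a cluster. All function names are made up for the example.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 64 * 1024  # hypothetical block size, chosen only for the demo


def split_blocks(data, block_size=BLOCK_SIZE):
    """Cut the input into fixed-size pieces; the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def compress_blocks(data, block_size=BLOCK_SIZE, workers=4):
    """Compress every block independently, several at a time.

    Each call to zlib.compress sees only one block, so the work per task
    is bounded even though the codec itself is not parallel.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order, so block order is kept.
        return list(pool.map(zlib.compress, split_blocks(data, block_size)))


def decompress_block(blocks, index):
    """Recover a single block without touching any of the others."""
    return zlib.decompress(blocks[index])


def decompress_all(blocks):
    """Rebuild the original data by decompressing the blocks in order."""
    return b"".join(zlib.decompress(block) for block in blocks)
```

The result is a list of separately compressed pieces, not one standard compressed stream, which is the interoperability cost mentioned above: any other tool that reads the data has to know about the block boundaries.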