Re: Using Hadoop for codec functionality
Thanks for both your responses. I was indeed talking about developing a
codec utility as the Hadoop application itself.

In particular, thanks to Bertrand for the lengthy response. I'm actually
learning Hadoop at the moment, so I've been trying to find a suitable (very
modestly sized) application for a student project (1-2 weeks max).
I had previously written a codec utility in Perl that uses a combination of
dictionary (LZW) and arithmetic coding techniques. Compression rates aren't
that bad, but it's very slow.

In any case, I just thought that it might be interesting to Hadoop-ify the
program since compression/decompression is compute intensive and could
probably benefit from parallelization.
I'm thinking now that it might not be such a good fit after all.

Also, if anyone reading this has any novel ideas for demonstrating Hadoop's
capabilities inside a short development window, I'd love to hear about
them.
At the moment, I'm leaning towards a distributed grep, most likely with
some kind of agrep-like functionality. Not really a searingly inventive
idea, but if anyone can suggest some way I could make it more exciting, I'd
love to hear about that too.
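
As a rough illustration of the core of such a distributed grep, the mapper
could look roughly like the sketch below (the class name and the
"grep.pattern" configuration key are illustrative only, and an agrep-style
fuzzy matcher would replace the plain regex):

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits each regex match found in an input line with a count of 1;
// a summing reducer (e.g. Hadoop's LongSumReducer) then yields match counts.
public class GrepMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    private Pattern pattern;
    private final LongWritable one = new LongWritable(1);

    @Override
    protected void setup(Context context) {
        // Compile the pattern once per mapper, not once per record.
        pattern = Pattern.compile(context.getConfiguration().get("grep.pattern", ".*"));
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        Matcher matcher = pattern.matcher(line.toString());
        while (matcher.find()) {
            context.write(new Text(matcher.group()), one);
        }
    }
}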

-Rob
On 31 March 2013 10:38, Bertrand Dechoux <[EMAIL PROTECTED]> wrote:

> Your question could be interpreted in another way: should I use Hadoop in
> order to perform massive compression/decompression using my own
> (possibly proprietary) utility?
>
> So yes, Hadoop can be used to parallelize the work. But the real answer
> will depend on your context, as always.
> How many files need to be processed? What is their average size? Is your
> utility parallelizable? How will the data be used after
> compression/decompression?
>
> The number of files and their size matter because Hadoop is designed to
> deal with a relatively low number of relatively big files: a few million
> gigabyte-sized files rather than billions of megabyte-sized files. Many
> small files can become a performance issue. But a huge file is not
> necessarily better either: if your utility is not parallelizable then,
> regardless of Hadoop, uncompressing a 2GB file requires a single process
> to read the whole file, and the uncompressed version then needs to be
> stored somewhere.
>
> So the final question is: for what purpose? If it is for massive
> decompression, keeping the compressed version inside Hadoop seems a sane
> strategy, so it might be better to rely on a standard compression utility
> and uncompress only just before processing inside Hadoop itself. If it is
> for compression, well, it might not be that massive, because you might not
> receive that many files at the same time.
>
> The common strategy in Hadoop is not to compress a whole file but to
> compress its parts (blocks) instead. This way the size of each unit of
> compression work is bounded and the work can be parallelized even with a
> non-parallelizable compression utility. The drawback is that the
> "list of compressed blocks" is not a standard compressed file, so
> interoperability with other parts of your system is not guaranteed
> without extra work. (A configuration sketch of this block-compression
> approach follows after this message.)
>
> Bertrand
>
>
> On Sat, Mar 30, 2013 at 8:15 PM, Jens Scheidtmann <
> [EMAIL PROTECTED]> wrote:
>
>> Dear Robert,
>>
>> SequenceFiles support either record, block, or no compression. You can
>> configure which codec (gzip, bzip2, etc.) is used. Have a look at
>> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/compress/CompressionCodec.html
>>
>> Best regards,
>>
>> Jens
>>
>>
>>
>
>
> --
> Bertrand Dechoux
>
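
As a rough sketch of the block-compression strategy Bertrand describes,
combined with the SequenceFile codec configuration Jens points to, a job
could request block-compressed SequenceFile output along these lines
(assuming a Hadoop 2.x-style job setup; the class name, codec choice, and
output path handling are illustrative only):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class CompressedOutputJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "block-compressed output");
        job.setJarByClass(CompressedOutputJob.class);

        // Write the output as a SequenceFile whose blocks are compressed,
        // instead of compressing each whole output file.
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        SequenceFileOutputFormat.setOutputCompressionType(job,
                SequenceFile.CompressionType.BLOCK);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));
        // Mapper/reducer classes would be set here as in any other job.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With BLOCK compression, batches of records are compressed together and each
batch can be handled independently, which is what keeps the unit of
compression work bounded even when the codec itself is not parallelizable.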