Hadoop, mail # user - Re: Should splittable Gzip be a "core" hadoop feature?


Re: Should splittable Gzip be a "core" hadoop feature?
Edward Capriolo 2012-02-29, 15:52
Mike,

Snappy is cool and all, but I was not overly impressed with it.

GZ zips much better than Snappy. Last time I checked, for our log files
gzip took them down from 100MB -> 40MB, while snappy compressed them
from 100MB -> 55MB. That was only with sequence files. But still, that is
pretty significant if you are considering long term storage. Also,
since the delta in file size was large, I could not actually
make the argument that using sequence+snappy was faster than sequence+gz.
Sure, the MB/s rate was probably faster, but since I had more MB I was
not able to prove snappy a win. I use it for intermediate compression
only.
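
To put rough numbers on that trade-off, here is a back-of-the-envelope
sketch. It assumes a deliberately simple model in which job time is
proportional to compressed bytes read divided by an effective MB/s rate;
the function names are made up for illustration and nothing here was
measured on a real cluster.

```python
def processing_time_s(compressed_mb: float, rate_mb_per_s: float) -> float:
    """Time to push `compressed_mb` through a pipeline at `rate_mb_per_s`."""
    return compressed_mb / rate_mb_per_s


def break_even_speedup(gzip_mb: float, snappy_mb: float) -> float:
    """How much faster (in MB/s) snappy must be just to tie with gzip."""
    return snappy_mb / gzip_mb


# Sizes quoted in the message: 100MB of logs -> 40MB (gzip), 55MB (snappy).
GZIP_MB = 40.0
SNAPPY_MB = 55.0

if __name__ == "__main__":
    ratio = break_even_speedup(GZIP_MB, SNAPPY_MB)
    print(f"snappy must sustain {ratio:.3f}x the gzip MB/s rate to break even")
```

Under this model snappy has 37.5% more bytes to move, so a modest MB/s
advantage is not enough to show up as a win.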

Actually the raw formats (gz vs sequence gz) are significantly smaller
and faster than their sequence file counterparts.

Believe it or not, I commonly use mapred.compress.output without
sequence files. As long as I have a large number of reducers I do not
have to worry about files being splittable, because N mappers process N
files. Generally I am happy with, say, N mappers, because the input
formats tend to create more mappers than I want, which means more
overhead and more shuffle.
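
A simplified sketch of that mapper arithmetic: an unsplittable file gets
exactly one map task, while a splittable one gets roughly one task per
split. The file sizes, the 64MB split size and the function name are
assumptions for illustration, and Hadoop's real split calculation has
extra rules that this ignores.

```python
import math
from typing import Sequence


def estimate_map_tasks(file_sizes_mb: Sequence[float],
                       split_mb: float,
                       splittable: bool) -> int:
    """Rough count of map tasks for a set of input files."""
    if not splittable:
        # One mapper per file, no matter how large the file is.
        return len(file_sizes_mb)
    # One mapper per split; every non-empty file needs at least one.
    return sum(max(1, math.ceil(size / split_mb)) for size in file_sizes_mb)


if __name__ == "__main__":
    # Hypothetical job: 20 reducers, each writing one 400MB gzip file.
    outputs = [400.0] * 20
    print(estimate_map_tasks(outputs, 64.0, splittable=False))
    print(estimate_map_tasks(outputs, 64.0, splittable=True))
```

With enough reducers the one-mapper-per-file count is already a
reasonable degree of parallelism, which is the point made above.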

But being able to generate split info for them and process them
would be good as well. I remember that was a hot thing to do with lzo
back in the day. The pain of making one pass over the gz files to generate the
split info is a drawback, but it is nice to know it is there if you
want it.
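
For context on why split info matters for gzip at all: a plain gzip
stream generally cannot be decoded starting from an arbitrary byte
offset. A minimal sketch using Python's standard zlib module follows; the
sample data and the helper name can_decode_from are made up for
illustration, and this is not how any Hadoop codec is implemented.

```python
import gzip
import zlib


def can_decode_from(blob: bytes, offset: int) -> bool:
    """Try to gunzip `blob` starting at `offset`; report whether it worked."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
    try:
        decoder.decompress(blob[offset:])
        return True
    except zlib.error:
        # Mid-stream bytes do not begin with a valid gzip header.
        return False


# Synthetic "log file" compressed as a single gzip stream.
sample = b"".join(b"log line %d\n" % i for i in range(10000))
blob = gzip.compress(sample)

if __name__ == "__main__":
    print(can_decode_from(blob, 0))
    print(can_decode_from(blob, len(blob) // 2))
```

A reader that wants to start in the middle therefore needs either
precomputed split info or to decompress from the start of the file.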

Edward
On Wed, Feb 29, 2012 at 7:10 AM, Michel Segel <[EMAIL PROTECTED]> wrote:
> Let's play devil's advocate for a second?
>
> Why? Snappy exists.
> The only advantage is that you don't have to convert from gzip to snappy and can process gzip files natively.
>
> Next question is how large are the gzip files in the first place?
>
> I don't disagree, I just want to have a solid argument in favor of it...
>
>
>
>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
>
> On Feb 28, 2012, at 9:50 AM, Niels Basjes <[EMAIL PROTECTED]> wrote:
>
>> Hi,
>>
>> Some time ago I had an idea and implemented it.
>>
>> Normally you can only run a single gzipped input file through a single
>> mapper and thus only on a single CPU core.
>> What I created makes it possible to process a Gzipped file in such a way
>> that it can run on several mappers in parallel.
>>
>> I've put the javadoc I created on my homepage so you can read more about
>> the details.
>> http://howto.basjes.nl/hadoop/javadoc-for-skipseeksplittablegzipcodec
>>
>> Now the question that was raised by one of the people reviewing this code
>> was: Should this implementation be part of the core Hadoop feature set?
>> The main reason that was given is that this needs a bit more understanding
>> on what is happening and as such cannot be enabled by default.
>>
>> I would like to hear from the Hadoop Core/Map reduce users what you think.
>>
>> Should this be
>> - a part of the default Hadoop feature set so that anyone can simply enable
>> it by setting the right configuration?
>> - a separate library?
>> - a nice idea I had fun building but that no one needs?
>> - ... ?
>>
>> --
>> Best regards / Met vriendelijke groeten,
>>
>> Niels Basjes