Hadoop >> mail # user >> Re: Should splittable Gzip be a "core" hadoop feature?


Michel Segel 2012-02-29, 12:10
Edward Capriolo 2012-02-29, 15:52
Niels Basjes 2012-02-29, 16:00
Robert Evans 2012-02-29, 16:31
Edward Capriolo 2012-02-29, 17:06
Robert Evans 2012-02-29, 18:13
Niels Basjes 2012-02-29, 21:17
Michel Segel 2012-03-01, 12:34

Re: Should splittable Gzip be a "core" hadoop feature?
Hi,

On Wed, Feb 29, 2012 at 13:10, Michel Segel <[EMAIL PROTECTED]> wrote:

> Let's play devil's advocate for a second?
>

I always like that :)

> Why?

Because then data files from other systems (like the Apache HTTP webserver)
can be processed more efficiently, without a separate preprocessing step.
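
As a rough illustration of why a plain gzip file cannot simply be cut into
independent splits, here is a minimal Python sketch. It is an assumption made
for illustration, not code from the codec discussed in this thread, and the
helper name try_decompress_from is hypothetical. It shows that a gzip stream
decodes from its first byte but not from an arbitrary offset in the middle,
which is where a second split would otherwise begin.

```python
import gzip
import zlib


def try_decompress_from(data: bytes, offset: int):
    """Return the decompressed bytes if a complete gzip stream starts at
    `offset`, otherwise None. (Hypothetical helper for illustration.)"""
    # wbits=31 tells zlib to expect a gzip header and trailer.
    decoder = zlib.decompressobj(wbits=31)
    try:
        out = decoder.decompress(data[offset:])
    except zlib.error:
        return None
    # Only accept the result if the stream ended cleanly (trailer verified).
    return out if decoder.eof else None


# Made-up sample data standing in for a web server log.
lines = b"".join(b"GET /product/%d HTTP/1.1\n" % i for i in range(1000))
compressed = gzip.compress(lines)

from_start = try_decompress_from(compressed, 0)
from_middle = try_decompress_from(compressed, len(compressed) // 2)

print("decoded from start: ", from_start is not None)
print("decoded from middle:", from_middle is not None)
```

A reader positioned in the middle of the file has no gzip header to start
from, so without extra work only one task can consume the whole file.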

> Snappy exists.

Compared to gzip: Snappy is faster, compresses a bit less and is
unfortunately not splittable.

> The only advantage is that you don't have to convert from gzip to snappy
> and can process gzip files natively.

Yes, that and the fact that the files are smaller.
Note that I've described some of these considerations in the javadoc.

> Next question is how large are the gzip files in the first place?

I work for the biggest webshop in the Netherlands and I'm facing a set of
logfiles that are very often larger than 1 GB each... and are gzipped.
The first thing we do with them is parse and dissect each line in the very
first mapper. Then we store the result in (snappy compressed) avro files.
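
To make the "parse and dissect each line" step concrete, here is a minimal
Python sketch. It is only an illustration under assumptions: the real job is
a Hadoop mapper, the actual log format is not stated in this message (the
Apache Common Log Format is assumed here), and the names CLF, dissect and
read_records are hypothetical. Writing the records to snappy compressed avro
files is not shown.

```python
import gzip
import io
import re

# Assumed format: Apache Common Log Format
#   host ident user [time] "request" status size
CLF = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def dissect(line: str):
    """Split one log line into named fields; return None if it does not match."""
    match = CLF.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A size of "-" means no body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


def read_records(gz_stream):
    """Yield dissected records from a gzipped log, read sequentially."""
    with gzip.open(gz_stream, mode="rt", encoding="utf-8", errors="replace") as lines:
        for line in lines:
            record = dissect(line)
            if record is not None:
                yield record


# Usage with an in-memory gzip file holding one valid and one invalid line.
sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326\n')
buf = io.BytesIO(gzip.compress((sample + "not a log line\n").encode("utf-8")))
records = list(read_records(buf))
```

The lines are consumed strictly in order from the start of the file, which is
why a single large gzipped log ends up in a single mapper unless the codec
supports splitting.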

> I don't disagree, I just want to have a solid argument in favor of it...

:)

--
Best regards / Met vriendelijke groeten,

Niels Basjes