Re: Making Gzip splittable
Niels Basjes 2012-02-22, 22:37
On Wed, Feb 22, 2012 at 19:14, Tim Broberg <[EMAIL PROTECTED]> wrote:
> There are three options here:
> 1 - Add your codec as an alternative to the default gzip codec.
> 2 - Modify the gzip codec to incorporate your feature so that it is
> pseudo-splittable by default (skippable?)
> 3 - Do nothing
> The code uses the normal splittability interface and doesn't invent some
> new solution. It seems perfectly well explained.
The choice was made to implement it as a separate 'Codec' that reuses all
decompression functionality from the existing GzipCodec without making any
changes to the original. This way there is no duplicate code and there is
no risk the existing GzipCodec is affected by the new functionality.
This was actually one of the first review comments I got on one of the
first versions (which did have a few minor changes in the GzipCodec).
So that is why option '1' was chosen instead of '2'.
> There is a lot of explanation in there on how to switch over from one codec
> to the other.
Enabling the codec takes only one setting.
There is, however, quite a bit of documentation on how to use it.
> Does it all get simpler if skippability is implemented by default but the
> option is not enabled?
There are two answers to this:
1) No, it won't get simpler.
2) This feature cannot be enabled or disabled per codec instance. The reason
is that the framework creates splits whenever the applicable codec implements
the SplittableCompressionCodec interface. This check is done purely with an
"instanceof" test. After that the FileInputFormat creates the splits
without consulting the codec class at all. So a codec is either splittable
or it is not, and the splits are defined independently of the codec.
So there is (unfortunately) currently no way to create a codec whose
splittability can be switched on or off with a config setting.
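A minimal sketch of that mechanism, to make the point concrete. The types below (Codec, SplittableCodec, the two codec classes and computeSplits) are hypothetical stand-ins invented for illustration; they are not the real Hadoop classes, and the real FileInputFormat logic is more involved.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitDecisionSketch {
    // Hypothetical stand-ins for the framework's codec types (not the Hadoop API).
    interface Codec {}
    interface SplittableCodec extends Codec {}

    static class PlainGzipLikeCodec implements Codec {}
    static class SplittableGzipLikeCodec implements SplittableCodec {}

    // The decision depends only on the codec's type, never on a config setting.
    static boolean isSplittable(Codec codec) {
        return codec instanceof SplittableCodec;
    }

    // Splits are derived from the file length alone; the codec is not asked again.
    // Each split is returned as {start, length}.
    static List<long[]> computeSplits(Codec codec, long fileLength, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        if (!isSplittable(codec)) {
            splits.add(new long[] {0L, fileLength}); // one split covering the whole file
            return splits;
        }
        for (long start = 0; start < fileLength; start += splitSize) {
            long length = Math.min(splitSize, fileLength - start);
            splits.add(new long[] {start, length});
        }
        return splits;
    }

    public static void main(String[] args) {
        System.out.println(computeSplits(new PlainGzipLikeCodec(), 1000L, 300L).size());
        System.out.println(computeSplits(new SplittableGzipLikeCodec(), 1000L, 300L).size());
    }
}
```

Because the type test is the only hook, a codec object cannot answer "splittable" for one job and "not splittable" for another; only choosing a different codec class changes the outcome.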
> Does this make things any less potentially confusing?
I don't think this would make it less confusing.
> - Tim.
> From: [EMAIL PROTECTED] [[EMAIL PROTECTED]] On Behalf Of Niels Basjes [
> [EMAIL PROTECTED]]
> Sent: Sunday, February 19, 2012 4:23 PM
> To: common-dev
> Subject: Making Gzip splittable
> As some of you know I've created a patch that effectively makes Gzip
> splittable.
> What this does is for a split somewhere in the middle of the file it will
> read from the start of the file up until the point where the split starts.
> This is a useful waste of resources because it creates room to run a heavy
> lifting mapper in parallel.
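The read-and-discard behaviour quoted above can be sketched with plain java.util.zip, outside Hadoop. This is only an illustration under stated assumptions: the class and method names (GzipSkipSketch, readRange) are hypothetical, and the range is expressed as offsets in the uncompressed data purely to keep the example small. The thread does not spell out how the patch maps split boundaries.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipSkipSketch {
    // Helper: gzip a string in memory so the example needs no files.
    static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Decompress from the very first byte, discard everything before `start`
    // (the "useful waste"), then return up to `length` uncompressed bytes.
    // Assumption: start/length are uncompressed offsets (a simplification).
    static String readRange(byte[] gzipped, long start, int length) {
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            byte[] scratch = new byte[4096];
            long toDiscard = start;
            while (toDiscard > 0) {
                int n = in.read(scratch, 0, (int) Math.min(scratch.length, toDiscard));
                if (n < 0) {
                    return ""; // range starts beyond the end of the data
                }
                toDiscard -= n;
            }
            ByteArrayOutputStream result = new ByteArrayOutputStream();
            int remaining = length;
            while (remaining > 0) {
                int n = in.read(scratch, 0, Math.min(scratch.length, remaining));
                if (n < 0) {
                    break; // end of data reached before the range was full
                }
                result.write(scratch, 0, n);
                remaining -= n;
            }
            return new String(result.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = gzip("aaaa\nbbbb\ncccc\ndddd\n");
        System.out.print(readRange(data, 10, 5)); // the third line of the input
    }
}
```

The cost of reaching a range grows with its starting offset, which is why the trade-off only pays off when the work done per record outweighs the repeated decompression.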
> Due to this balance between the waste being useful and the waste being
> wasteful I've included extensive documentation in the patch on how it works
> and how to use it.
> I've seen that there are quite a few real life situations where I expect my
> solution can be useful.
> What I created is as far as I can tell the only way you can split a gzipped
> file without prior knowledge about the actual file.
> If you do have prior information then other directions with a similar goal
> are possible:
> - Analyzing the file beforehand:
> - Create a specially crafted gzipped file:
> Over the last year I've had review comments from Chris Douglas (until he
> stopped being involved in Hadoop) and later from Luke Lu.
> Now the last feedback I got from Luke is this:
> > Niels, I'm ambivalent about this patch. It has clean code and
> > documentation, OTOH, it has really confusing usage/semantics and
> > dubious general utility that the community might not want to maintain
> > as part of an official release. After having to explain many finer
> > points of Hadoop to new users/developers these days, I think the
> > downside of this patch might outweigh its benefits. I'm -0 on it.
> > i.e., you need somebody else to +1 on this.
> So after consulting Eli I'm asking this group.
Best regards / Met vriendelijke groeten,