Hadoop, mail # dev - Making Gzip splittable


Making Gzip splittable
Niels Basjes 2012-02-20, 00:23
Hi,

As some of you know, I've created a patch that effectively makes Gzip
splittable.

https://issues.apache.org/jira/browse/HADOOP-7076

For a split that starts somewhere in the middle of the file, it reads
(and so decompresses) from the start of the file up to the point where the
split starts, and throws that part away.
This is a useful waste of resources because it creates room to run a
heavy-lifting mapper in parallel.
Because of this balance between the waste being useful and the waste being
wasteful, I've included extensive documentation in the patch on how it works
and how to use it.
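
To illustrate the idea outside of Hadoop, here is a minimal Python sketch.
It is not the code from the patch: the function name read_split is made up
for this example, and it addresses the split by offsets in the uncompressed
data purely to keep the example short, which may differ from how the patch
defines split boundaries.

```python
import gzip
import io


def read_split(gz_bytes: bytes, start: int, end: int,
               chunk: int = 64 * 1024) -> bytes:
    """Return uncompressed bytes [start, end) of a gzipped payload.

    Hypothetical helper for illustration only. Gzip cannot seek, so
    everything before `start` is decompressed and thrown away.
    """
    with gzip.GzipFile(fileobj=io.BytesIO(gz_bytes)) as stream:
        # The "wasteful" part: decompress and discard up to the split start.
        to_skip = start
        while to_skip > 0:
            discarded = stream.read(min(chunk, to_skip))
            if not discarded:  # the file ended before the split starts
                return b""
            to_skip -= len(discarded)

        # The useful part: only these bytes are handed to the mapper.
        parts = []
        wanted = end - start
        while wanted > 0:
            data = stream.read(min(chunk, wanted))
            if not data:  # the file ended inside the split
                break
            parts.append(data)
            wanted -= len(data)
        return b"".join(parts)
```

Each split can be read independently this way, so several mappers can work
on the same gzipped file at once, at the price of decompressing the leading
part of the file more than once.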

I've seen quite a few real-life situations where I expect my solution to be
useful.
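
Whether a given situation benefits comes down to the balance between useful
and wasteful waste mentioned above. A rough back-of-the-envelope model in
Python follows. It is only a sketch with assumptions of my own: equally
sized splits, decompression cost proportional to the amount of data read,
all splits running at the same time, and a hypothetical function name
split_costs. It is not taken from the patch or its documentation.

```python
def split_costs(num_splits: int, decompress_secs: float, map_secs: float):
    """Return (wall_clock_secs, total_cpu_secs) for a simplified model.

    decompress_secs: time to decompress the whole file once.
    map_secs: time for one mapper to process the whole file.
    Split i (1-based) must decompress the first i/num_splits of the file,
    but only runs the mapper on its own 1/num_splits share.
    """
    per_split = []
    for i in range(1, num_splits + 1):
        decompress = decompress_secs * i / num_splits
        mapping = map_secs / num_splits
        per_split.append(decompress + mapping)
    # Wall clock is set by the slowest (last) split; CPU is the sum of all.
    return max(per_split), sum(per_split)


# Heavy mapper: decompressing takes 10s, mapping takes 100s.
print(split_costs(1, 10, 100))  # (110.0, 110.0)
print(split_costs(4, 10, 100))  # (35.0, 125.0)
```

In this model four splits cut the wall clock time from 110 to 35 seconds
while total CPU time grows from 110 to 125 seconds. With a light mapper the
extra decompression dominates and the waste stops being useful.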

What I created is, as far as I can tell, the only way to split a gzipped
file without prior knowledge about the actual file.
If you do have prior information, then other approaches with a similar goal
are possible:
- Analyzing the file beforehand: HADOOP-6153
  <https://issues.apache.org/jira/browse/HADOOP-6153>
- Creating a specially crafted gzipped file: HADOOP-7909
  <https://issues.apache.org/jira/browse/HADOOP-7909>

Over the last year I've had review comments from Chris Douglas (until he
stopped being involved in Hadoop) and later from Luke Lu.

Now the last feedback I got from Luke is this:

> Niels, I'm ambivalent about this patch. It has clean code and
> documentation, OTOH, it has really confusing usage/semantics and
> dubious general utility that the community might not want to maintain
> as part of an official release. After having to explain many finer
> points of Hadoop to new users/developers these days, I think the
> downside of this patch might out weight its benefits. I'm -0 on it.
> i.e., you need somebody else to +1 on this.

So, after consulting Eli, I'm asking this group.

My views on this feature:
- I think this feature should go in because I think others can benefit from
it.
- I also think that it should remain disabled by default. It can then be
used by those who have read the documentation.
- The implementation does not contain any decompression code at all. It
only does the splitting smartness. (It could even be refactored to make any
codec splittable.) It has been tested with both the Java and the native
decompressors.

What do you think?

Is this a feature that should go in the official release or not?

--
Best regards

Niels Basjes