Re: Should splittable Gzip be a "core" hadoop feature?
Robert Evans 2012-02-29, 18:13
If many people are going to use it then by all means put it in. If there is only one person, or a very small handful of people, who are going to use it then I personally would prefer to see it as a separate project. However, Edward, you have convinced me that I am trying to make a logical judgment based only on a gut feeling and the response rate to an email chain. Thanks for that. What I really want to know is how well this new CompressionCodec performs compared to the regular gzip codec under various conditions, and what impact it has on network traffic and datanode load. My gut feeling is that the speedup is going to be relatively small except when there is a lot of computation happening in the mapper, and that the added load and network traffic outweigh the speedup in most cases. But as with all performance questions on a complex system, gut feelings are almost worthless and hard numbers are what is needed to make a judgment call. Niels, I assume you have tested this on your cluster(s). Can you share some of the numbers with us?
On 2/29/12 11:06 AM, "Edward Capriolo" <[EMAIL PROTECTED]> wrote:
Too bad we cannot up the replication on the first few blocks of the
file or put it in the distributed cache.
The contrib statement is arguable. I could make a case that the
majority of stuff should not be in hadoop-core. NLineInputFormat, for
example: nice to have, but it took a long time to get ported to the new
map reduce API. DBInputFormat and DataDrivenDBInputFormat: sexy for sure,
but they do not need to be part of core. I could see hadoop as just coming
with TextInputFormat and SequenceFileInputFormat, with everything else
after market from github.
On Wed, Feb 29, 2012 at 11:31 AM, Robert Evans <[EMAIL PROTECTED]> wrote:
> I can see a use for it, but I have two concerns about it. My biggest concern is maintainability. We have had lots of things get thrown into contrib in the past, very few people use them, and inevitably they start to suffer from bit rot. I am not saying that it will happen with this, but if you have to ask if people will use it and there has been no overwhelming yes, it makes me nervous about it. My second concern is with knowing when to use this. Anything that adds this in would have to come with plenty of documentation about how it works, how it is different from the normal gzip format, explanations about what type of a load it might put on data nodes that hold the start of the file, etc.
> For both of these reasons I would prefer to see this as a github project for a while first, and once it shows that it has a significant following, or a community around it, then we can pull it in. But if others disagree I am not going to block it. I am a -0 on pulling this in now.
> On 2/29/12 10:00 AM, "Niels Basjes" <[EMAIL PROTECTED]> wrote:
> On Wed, Feb 29, 2012 at 16:52, Edward Capriolo <[EMAIL PROTECTED]> wrote:
>> But being able to generate split info for them and processing them
>> would be good as well. I remember that was a hot thing to do with lzo
>> back in the day. The pain of going over the gz files once to generate
>> the split info is a detraction, but it is nice to know it is there if
>> you want it.
> Note that the solution I created (HADOOP-7076) does not require any
> preprocessing. It can split ANY gzipped file as-is.
> The downside is that this effectively costs some additional performance
> because the task has to decompress the first part of the file that is to be
> discarded.
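The cost described above can be illustrated with a small, self-contained Java sketch. It is not the HADOOP-7076 code: the class and method names are hypothetical, the "file" is an in-memory byte array, and the range is expressed in uncompressed byte offsets purely to keep the example short. It only shows why a task that wants a later part of a plain gzip stream still has to decompress everything in front of it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration class, not part of Hadoop.
public class DiscardPrefixSketch {

    /** Gzip a byte array in memory (a stand-in for a .gz file on HDFS). */
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            out.write(raw);
        }
        return bos.toByteArray();
    }

    /**
     * Return uncompressed bytes [start, end) of a gzip stream.
     * Gzip offers no random access, so every byte before 'start' is
     * decompressed and thrown away; that is the extra work per task.
     */
    static byte[] readRange(byte[] gz, long start, long end) throws IOException {
        ByteArrayOutputStream kept = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        long pos = 0; // current position in the uncompressed stream
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(gz))) {
            int n;
            while (pos < end && (n = in.read(buf)) != -1) {
                long from = Math.max(start, pos);
                long to = Math.min(end, pos + n);
                if (from < to) {
                    // Keep only the part of this buffer inside [start, end).
                    kept.write(buf, (int) (from - pos), (int) (to - from));
                }
                pos += n;
            }
        }
        return kept.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = "line-0\nline-1\nline-2\nline-3\n".getBytes(StandardCharsets.UTF_8);
        byte[] gz = gzip(raw);
        // A second "task" wants the last two lines and must first
        // decompress and discard the 14 bytes that precede them.
        System.out.print(new String(readRange(gz, 14, 28), StandardCharsets.UTF_8));
    }
}
```

In this sketch the wasted work grows with the start offset of the range, which is why the later splits of a large file are the expensive ones.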
> The other two ways of splitting gzipped files either require
> - creating some kind of "compression index" before actually using the file
> - creating a file in a format that is generated in such a way that it is
> really a set of concatenated gzipped files. (HADOOP-7909)
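For comparison, the "set of concatenated gzipped files" idea can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the HADOOP-7909 implementation: the class, its methods, and the in-memory list of member offsets are hypothetical, and how a real reader would locate member boundaries is not specified in this thread. The sketch relies on two properties of the gzip format: a concatenation of gzip members is itself a valid gzip stream, and each member can be decompressed on its own once its start offset is known.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration class, not part of Hadoop.
public class ConcatenatedGzipSketch {

    /** Compressed bytes plus the offset at which each gzip member starts. */
    static final class Indexed {
        final byte[] data;
        final List<Integer> memberOffsets;

        Indexed(byte[] data, List<Integer> memberOffsets) {
            this.data = data;
            this.memberOffsets = memberOffsets;
        }
    }

    /** Write every chunk as its own gzip member, back to back. */
    static Indexed writeMembers(List<String> chunks) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        List<Integer> offsets = new ArrayList<>();
        for (String chunk : chunks) {
            offsets.add(bos.size()); // compressed offset of this member
            // Closing a ByteArrayOutputStream is a no-op, so bos stays usable.
            try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
                out.write(chunk.getBytes(StandardCharsets.UTF_8));
            }
        }
        return new Indexed(bos.toByteArray(), offsets);
    }

    /** Decompress 'length' compressed bytes starting at 'offset'. */
    static String gunzip(byte[] data, int offset, int length) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = new GZIPInputStream(
                new ByteArrayInputStream(data, offset, length))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Read one member directly, without touching the members before it. */
    static String readMember(Indexed idx, int i) throws IOException {
        int start = idx.memberOffsets.get(i);
        int end = (i + 1 < idx.memberOffsets.size())
                ? idx.memberOffsets.get(i + 1)
                : idx.data.length;
        return gunzip(idx.data, start, end - start);
    }

    /** Read the whole thing as one ordinary gzip stream. */
    static String readAll(byte[] data) throws IOException {
        return gunzip(data, 0, data.length);
    }

    public static void main(String[] args) throws IOException {
        Indexed idx = writeMembers(Arrays.asList("line-0\n", "line-1\n", "line-2\n"));
        System.out.print(readMember(idx, 1)); // only the second member
        System.out.print(readAll(idx.data));  // all members, in order
    }
}
```

The trade-off against splitting an arbitrary gzip file as-is is visible here: a task reading one member does no discarded decompression, but the file has to be written in this form (or indexed) beforehand.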
> Best regards / Met vriendelijke groeten,
> Niels Basjes