Re: HIVE-74 and CombineFileInputFormat on pre-0.20 hadoop
Roberto Congiu 2009-09-30, 07:07
that's what I thought. Right now unfortunately we can't migrate to 0.20.
I realize we lose data locality but as you said, it would still be
considerably better than now.
I had a look at the shim code, shouldn't be difficult since it would
be basically mimicking CombineFileInputFormat.
Once I add the appropriate logic to the shim, I have to set
org.apache.hadoop.hive.ql.io.CombineHiveInputFormat to have hive
actually use it, right ?
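
For reference, a minimal sketch of what that setting might look like in a Hive session. The property name hive.input.format is an assumption here; the thread only names the class, not the property.

```text
-- Assumption: hive.input.format is the property that selects the input format.
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
```
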
2009/9/29 Namit Jain <[EMAIL PROTECTED]>:
> Hi Roberto,
> Talked with Raghu and Dhruba – it is possible to do so using
> But the performance will not be very good since MultiFileInputFormat does
> not provide any locality. However, it will still be much better than the problem
> you are running into right now.
> Can you move to hadoop-0.20 ? That might be simpler.
> If not, you can definitely implement the shim using MultiFileInputFormat for
> (which should work even with 0.17). Do you need some help in understanding
> the current shim code ?
> On 9/29/09 10:53 AM, "Namit Jain" <[EMAIL PROTECTED]> wrote:
> Just checked – CombineFileInputFormat and a lot of other related stuff went
> into hadoop 0.20.
> So, it would be very difficult to add this for 0.19.
> From: Namit Jain [mailto:[EMAIL PROTECTED]]
> Sent: Monday, September 28, 2009 10:30 PM
> To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Subject: Re: HIVE-74 and CombineFileInputFormat on pre-0.20 hadoop
> I am not sure whether CombineFileInputFormat (in hadoop) is available in
> 0.19 -
> If it is, we can add it, otherwise it will be very difficult.
> On 9/28/09 7:06 PM, "Raghu Murthy" <[EMAIL PROTECTED]> wrote:
> Can we add MultiFileInputFormat as the CombineFileInputFormatShim for
> On 9/28/09 6:57 PM, "Roberto Congiu" <[EMAIL PROTECTED]> wrote:
>> Hi guys,
>> I've been working on integrating hive with a legacy file format we use
>> here. I wrote the appropriate InputFormat and SerDe and everything
>> works, but it's painfully slow.
>> The reason is that the files I am reading are many and hive uses one
>> mapper for every file.
>> I saw the HIVE-74 patches but those use CombineFileInputFormat which
>> is available on hadoop 0.20...but we use 0.19. Is there any reason the
>> same goal could not be achieved using the deprecated (but present before
>> 0.20) MultiFileInputFormat ?
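
To illustrate the idea discussed in this thread (many small files packed into a few splits instead of one mapper per file), here is a rough sketch in plain Java. It is not the Hive shim and not Hadoop's MultiFileInputFormat or CombineFileInputFormat code; the class SplitPacker and the method pack are hypothetical names, and the sketch ignores data locality entirely, which is the trade-off mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop or Hive: groups file lengths into
// at most maxSplits groups so that one mapper can process several files.
class SplitPacker {

    // Greedily packs file lengths, in their given order, into at most
    // maxSplits groups of roughly equal total size. Locality is ignored.
    static List<List<Long>> pack(long[] fileLengths, int maxSplits) {
        if (maxSplits < 1) {
            throw new IllegalArgumentException("maxSplits must be >= 1");
        }
        long total = 0;
        for (long len : fileLengths) {
            total += len;
        }
        // Target number of bytes per split, rounded up.
        long target = (total + maxSplits - 1) / maxSplits;

        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long len : fileLengths) {
            // Close the current group when adding this file would exceed the
            // target, but always keep one slot free for the final group.
            boolean wouldOverflow = currentBytes + len > target;
            boolean canOpenAnother = splits.size() < maxSplits - 1;
            if (!current.isEmpty() && wouldOverflow && canOpenAnother) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(len);
            currentBytes += len;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        // 1000 one-byte files would mean 1000 mappers with one mapper per file.
        long[] lengths = new long[1000];
        java.util.Arrays.fill(lengths, 1L);
        List<List<Long>> splits = pack(lengths, 4);
        System.out.println("files: " + lengths.length + ", splits: " + splits.size());
    }
}
```

A real shim would return Hadoop InputSplit objects and would have to respect the old mapred API available in 0.19; the grouping step above is only the part that reduces the mapper count.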