Re: HIVE-74 and CombineFileInputFormat on pre-0.20 hadoop
Roberto Congiu 2009-09-30, 07:07
Hi Namit,
that's what I thought. Unfortunately we can't migrate to 0.20 right now. I realize we lose data locality, but as you said, it would still be considerably better than what we have now. I had a look at the shim code; it shouldn't be difficult, since it would basically be mimicking CombineFileInputFormat.

Once I add the appropriate logic to the shim, I have to set hive.input.format to org.apache.hadoop.hive.ql.io.CombineHiveInputFormat to have hive actually use it, right?

Roberto

2009/9/29 Namit Jain <[EMAIL PROTECTED]>:
> Hi Roberto,
>
> Talked with Raghu and Dhruba - it is possible to do so using
> MultiFileInputFormat,
> but the performance will not be very good since MultiFileInputFormat does
> not provide any locality. However, it will still be much better than the
> problem you are running into right now.
>
> Can you move to hadoop-0.20? That might be simpler.
>
> If not, you can definitely implement the shim using MultiFileInputFormat
> for 0.19 (which should work even with 0.17). Do you need some help in
> understanding the current shim code?
>
> Thanks,
> -namit
>
> On 9/29/09 10:53 AM, "Namit Jain" <[EMAIL PROTECTED]> wrote:
>
> Just checked - CombineFileInputFormat and a lot of other related stuff went
> to hadoop 0.20
> So, it would be very difficult to add this for 0.19
>
> From: Namit Jain [mailto:[EMAIL PROTECTED]]
> Sent: Monday, September 28, 2009 10:30 PM
> To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Subject: Re: HIVE-74 and CombineFileInputFormat on pre-0.20 hadoop
>
> I am not sure whether CombineFileInputFormat (in hadoop) is available in
> 0.19 -
> If it is, we can add it, otherwise it will be very difficult.
>
> On 9/28/09 7:06 PM, "Raghu Murthy" <[EMAIL PROTECTED]> wrote:
> Can we add MultiFileInputFormat as the CombineFileInputFormatShim for
> hadoop-0.19?
>
> On 9/28/09 6:57 PM, "Roberto Congiu" <[EMAIL PROTECTED]> wrote:
>
>> Hi guys,
>> I've been working on integrating hive with a legacy file format we use
>> here. I wrote the appropriate InputFormat and SerDe and everything
>> works, but it's painfully slow.
>> The reason is that the files I am reading are many and hive uses one
>> mapper for every file.
>> I saw the HIVE-74 patches but those use CombineFileInputFormat which
>> is available on hadoop 0.20...but we use 0.19. Is there any reason the
>> same goal could not be achieved using the deprecated (but present <
>> 0.20) MultiFileInputFormat?
>>
>> Thanks,
>> Roberto
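
For readers following the thread: the core idea being discussed (pack many small files into a few splits so that one mapper reads several files, accepting that locality is ignored) can be sketched in plain Java. This is only an illustration of the grouping step, not the Hive shim or any Hadoop API; the class name SplitPacker and the method pack are hypothetical, and a real shim would wrap the resulting groups in whatever split type the Hadoop version provides.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop or Hive: groups files into splits by size only.
class SplitPacker {
    /**
     * Walks the files in listing order and packs them into at most numSplits
     * groups of roughly equal total length. Locality is deliberately ignored.
     * Returns, for each group, the indices of the files assigned to it.
     */
    static List<List<Integer>> pack(long[] fileLengths, int numSplits) {
        List<List<Integer>> groups = new ArrayList<>();
        if (fileLengths.length == 0 || numSplits <= 0) {
            return groups;
        }
        long total = 0;
        for (long len : fileLengths) {
            total += len;
        }
        // Ceiling division gives the target size per group (at least 1 byte).
        long target = Math.max(1, (total + numSplits - 1) / numSplits);

        List<Integer> current = new ArrayList<>();
        long currentSize = 0;
        for (int i = 0; i < fileLengths.length; i++) {
            current.add(i);
            currentSize += fileLengths[i];
            if (currentSize >= target) {
                // This group is full; start a new one.
                groups.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
        }
        if (!current.isEmpty()) {
            if (groups.size() >= numSplits) {
                // Only trailing zero-length files can end up here; fold them into the last group.
                groups.get(groups.size() - 1).addAll(current);
            } else {
                groups.add(current);
            }
        }
        return groups;
    }
}
```

With six files of equal size and numSplits set to 3, this yields three groups of two files each, so three mappers would run instead of six. The trade-off is the one Namit describes: grouping is by size alone, so a mapper may read blocks that live on other nodes.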