-
HIVE-74 and CombineFileInputFormat on pre-0.20 hadoop
Roberto Congiu, 2009-09-29, 01:57

Hi guys,

I've been working on integrating Hive with a legacy file format we use here. I wrote the appropriate InputFormat and SerDe and everything works, but it's painfully slow. The reason is that I am reading many files and Hive uses one mapper for every file.

I saw the HIVE-74 patches, but those use CombineFileInputFormat, which is available in Hadoop 0.20, and we use 0.19. Is there any reason the same goal could not be achieved using the deprecated (but present before 0.20) MultiFileInputFormat?

Thanks,
Roberto
-
Re: HIVE-74 and CombineFileInputFormat on pre-0.20 hadoop
Raghu Murthy, 2009-09-29, 02:06

Can we add MultiFileInputFormat as the CombineFileInputFormatShim for hadoop-0.19?
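
To make the shim idea concrete, here is a minimal sketch of picking a per-version implementation from the Hadoop version string. It is an illustration only: `ShimSelector`, `CombineShim`, and the registry contents are hypothetical names, not Hive's actual shim classes, and the real shim layer is not shown in this thread.

```java
import java.util.Map;
import java.util.TreeMap;

final class ShimSelector {
    /** Hypothetical stand-in for a per-version shim; not Hive's real interface. */
    interface CombineShim {
        String underlyingInputFormat();
    }

    // Registry keyed by "major.minor" Hadoop version (illustrative entries only).
    private static final Map<String, CombineShim> SHIMS = new TreeMap<>();
    static {
        SHIMS.put("0.19", () -> "MultiFileInputFormat");
        SHIMS.put("0.20", () -> "CombineFileInputFormat");
    }

    /** Reduces a full version such as "0.19.2" to its "major.minor" prefix. */
    static String majorMinor(String version) {
        String[] parts = version.split("\\.");
        if (parts.length < 2) {
            throw new IllegalArgumentException("Unexpected version: " + version);
        }
        return parts[0] + "." + parts[1];
    }

    /** Looks up the shim registered for the given Hadoop version. */
    static CombineShim forVersion(String hadoopVersion) {
        CombineShim shim = SHIMS.get(majorMinor(hadoopVersion));
        if (shim == null) {
            throw new IllegalArgumentException("No shim for Hadoop " + hadoopVersion);
        }
        return shim;
    }

    public static void main(String[] args) {
        System.out.println(forVersion("0.19.2").underlyingInputFormat());
        System.out.println(forVersion("0.20.1").underlyingInputFormat());
    }
}
```

The point of the indirection is that the calling code asks for "the combine shim" and never names a Hadoop class that may be missing from the running version.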
-
Re: HIVE-74 and CombineFileInputFormat on pre-0.20 hadoop
Namit Jain, 2009-09-29, 05:30

I am not sure whether CombineFileInputFormat (in Hadoop) is available in 0.19. If it is, we can add it; otherwise it will be very difficult.
-
RE: HIVE-74 and CombineFileInputFormat on pre-0.20 hadoop
Namit Jain, 2009-09-29, 17:53

Just checked: CombineFileInputFormat and a lot of other related code went into Hadoop 0.20, so it would be very difficult to add this for 0.19.
-
Re: HIVE-74 and CombineFileInputFormat on pre-0.20 hadoop
Namit Jain, 2009-09-30, 05:35

Hi Roberto,

Talked with Raghu and Dhruba. It is possible to do so using MultiFileInputFormat, but the performance will not be very good since MultiFileInputFormat does not provide any locality. However, it will still be much better than the problem you are running into right now.

Can you move to hadoop-0.20? That might be simpler.

If not, you can definitely implement the shim using MultiFileInputFormat for 0.19 (which should work even with 0.17). Do you need some help in understanding the current shim code?

Thanks,
-namit
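
As a rough illustration of what packing without locality means, the sketch below groups files into a fixed number of splits purely by size, in listing order, without looking at which nodes hold the blocks. It is a simplified stand-in written for this thread, not Hadoop's MultiFileInputFormat code; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitPacker {
    /**
     * Packs files (given by their lengths, in listing order) into at most
     * numSplits groups of roughly equal total size. Each group holds the
     * indices of the files one mapper would read. Block locations are
     * ignored entirely, which is why locality is lost.
     */
    static List<List<Integer>> pack(long[] fileLengths, int numSplits) {
        if (numSplits <= 0) {
            throw new IllegalArgumentException("numSplits must be positive");
        }
        long total = 0;
        for (long len : fileLengths) {
            total += len;
        }
        // Target bytes per split, rounded up so the groups cover all the data.
        long target = Math.max(1, (total + numSplits - 1) / numSplits);

        List<List<Integer>> splits = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        long currentSize = 0;
        for (int i = 0; i < fileLengths.length; i++) {
            current.add(i);
            currentSize += fileLengths[i];
            if (currentSize >= target) {
                // This group is full; start a new one.
                splits.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
        }
        if (!current.isEmpty()) {
            if (splits.size() < numSplits) {
                splits.add(current);
            } else {
                // Only trailing empty files can end up here; fold them into the last group.
                splits.get(splits.size() - 1).addAll(current);
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        // Eight 5-byte files into 2 splits: prints [[0, 1, 2, 3], [4, 5, 6, 7]]
        System.out.println(pack(new long[] {5, 5, 5, 5, 5, 5, 5, 5}, 2));
    }
}
```

With one mapper per file, the eight files above would need eight mappers; packed this way they need two, at the cost of mappers possibly reading blocks from remote nodes.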
-
Re: HIVE-74 and CombineFileInputFormat on pre-0.20 hadoop
Roberto Congiu, 2009-09-30, 07:07

Hi Namit,

That's what I thought. Unfortunately, we can't migrate to 0.20 right now. I realize we lose data locality but, as you said, it would still be considerably better than what we have now.

I had a look at the shim code. It shouldn't be difficult, since it would basically be mimicking CombineFileInputFormat. Once I add the appropriate logic to the shim, I have to set hive.input.format to org.apache.hadoop.hive.ql.io.CombineHiveInputFormat to have Hive actually use it, right?

Roberto
-
Re: HIVE-74 and CombineFileInputFormat on pre-0.20 hadoop
Namit Jain, 2009-09-30, 12:34

That's right.
-
RE: HIVE-74 and CombineFileInputFormat on pre-0.20 hadoop
Namit Jain, 2010-02-01, 22:31

I will take a look. It would be great if you could file a JIRA and add a patch for that.

From: Roberto Congiu [mailto:[EMAIL PROTECTED]]
Sent: Monday, February 01, 2010 11:02 AM
To: Namit Jain
Cc: [EMAIL PROTECTED]
Subject: Re: HIVE-74 and CombineFileInputFormat on pre-0.20 hadoop

Reviving this old thread... I just found the time to work on this.

I have a patch for using MultiFileInputFormat in Hadoop 0.19 as CombineHiveInputFormat. Setting

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

(or the equivalent setting in hive-site.xml) will have Hive use MultiFileInputFormat, packing many small files into mapred.multifileinputformat.splits splits (if set), or guessing the number of splits by dividing the total input size by the DFS block size.

Patch attached. I checked that it passes all unit tests according to http://wiki.apache.org/hadoop/Hive/HowToContribute#Setting_up_Eclipse_Development_Environment_.28Optional.29
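
The split-count rule described in the patch note above can be sketched as follows. This is not the patch itself, which is not reproduced in the thread. It assumes that the division rounds up and that at least one split is always produced, and it uses `java.util.Properties` as a stand-in for the job configuration; the class and method names are hypothetical.

```java
import java.util.Properties;

final class SplitCountEstimator {
    // Property name taken from the message above.
    static final String SPLITS_KEY = "mapred.multifileinputformat.splits";

    /**
     * Returns the configured split count if the property is set; otherwise
     * estimates it as total input size divided by the DFS block size.
     * Rounding up and the minimum of 1 are assumptions of this sketch.
     */
    static int numSplits(Properties conf, long totalInputBytes, long dfsBlockBytes) {
        String configured = conf.getProperty(SPLITS_KEY);
        if (configured != null) {
            return Integer.parseInt(configured.trim());
        }
        if (dfsBlockBytes <= 0) {
            throw new IllegalArgumentException("DFS block size must be positive");
        }
        // Ceiling division, clamped to the int range.
        long estimate = (totalInputBytes + dfsBlockBytes - 1) / dfsBlockBytes;
        return (int) Math.max(1, Math.min(Integer.MAX_VALUE, estimate));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        Properties conf = new Properties();
        // 650 MB of small files with a 64 MB block size, no explicit setting.
        System.out.println(numSplits(conf, 650 * mb, 64 * mb));
        // An explicit setting takes precedence over the estimate.
        conf.setProperty(SPLITS_KEY, "4");
        System.out.println(numSplits(conf, 650 * mb, 64 * mb));
    }
}
```

Under these assumptions, thousands of small files totalling 650 MB would be read by 11 mappers rather than one mapper per file.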