|
Sandeep Reddy P
2012-09-07, 14:18
Abhishek
2012-09-07, 14:31
Connell, Chuck
2012-09-07, 14:39
Sandeep Reddy P
2012-09-07, 14:41
Mohammad Tariq
2012-09-07, 14:48
Connell, Chuck
2012-09-07, 14:57
Mohammad Tariq
2012-09-07, 15:02
Sandeep Reddy P
2012-09-07, 15:07
praveenesh kumar
2012-09-08, 11:54
Connell, Chuck
2012-09-08, 12:18
Bejoy KS
2012-09-08, 12:33
praveenesh kumar
2012-09-08, 14:35
|
-
How to load csv data into HIVE
Sandeep Reddy P
2012-09-07, 14:18
Hi,
Here is the sample data:

"174969274","14-mar-2006","3522876","","14-mar-2006","500000308","65","1"|
"174969275","19-jul-2006","3523154","","19-jul-2006","500000308","65","1"|
"174969276","31-dec-2005","3530333","","31-dec-2005","500000308","65","1"|
"174969277","14-apr-2005","3531470","","14-apr-2005","500000308","65","1"|

How do I load this kind of data into Hive? I'm using a shell script to strip the double quotes and the '|', but it takes a very long time on each CSV file, and the files are 12 GB each. What is the best way to do this?

--
Thanks,
sandeep
-
Re: How to load csv data into HIVE
Abhishek
2012-09-07, 14:31
So you are trying to get rid of the double quotes and the pipe symbol?

Regards,
Abhi
-
RE: How to load csv data into HIVE
Connell, Chuck
2012-09-07, 14:39
How about a Python script that changes it into plain tab-separated text? It would look like this:

174969274<tab>14-mar-2006<tab>3522876<tab><tab>14-mar-2006<tab>500000308<tab>65<tab>1<newline>
etc.

Tab-separated text with newlines is easy to read and works perfectly on import.

Chuck Connell
Nuance R&D Data Team
Burlington, MA
781-565-4611
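A minimal sketch of the conversion described above, assuming every record sits on one line, ends with '|', and contains no tabs or embedded newlines inside its fields (as in the sample data). The function names are made up for illustration, and no claim is made about how fast this runs compared with the shell script.

```python
import csv


def strip_terminators(lines):
    """Yield each non-empty line without its newline and trailing '|'."""
    for line in lines:
        line = line.rstrip("\r\n")
        if line.endswith("|"):
            line = line[:-1]
        if line:
            yield line


def csv_to_tsv(src, dst):
    """Read quoted CSV records from src and write tab-separated lines to dst.

    src is any iterable of text lines (for example an open file) and
    dst is any object with a write() method.
    """
    for row in csv.reader(strip_terminators(src)):
        # csv.reader removes the double quotes; strip() drops stray padding.
        dst.write("\t".join(field.strip() for field in row) + "\n")
```

Called with two open file objects, for example `csv_to_tsv(open("in.csv"), open("out.tsv", "w"))` (hypothetical file names), it processes one line at a time, so memory use stays flat even on a 12 GB input.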
-
Re: How to load csv data into HIVE
Sandeep Reddy P
2012-09-07, 14:41
Hi,
I wrote a shell script to clean the CSV data, but when I run it on a 12 GB CSV file it takes a long time. Would a Python script be faster?

--
Thanks,
sandeep
-
Re: How to load csv data into HIVE
Mohammad Tariq
2012-09-07, 14:48
Hello Sandeep,
I would suggest writing a MapReduce job instead of the usual sequential program to transform your files. It would be much faster. Then use Hive to load the data.

Regards,
Mohammad Tariq
-
RE: How to load csv data into HIVE
Connell, Chuck
2012-09-07, 14:57
I cannot promise which is faster. A lot depends on how clever your scripts are.
-
Re: How to load csv data into HIVE
Mohammad Tariq
2012-09-07, 15:02
I said this assuming that a Hadoop cluster is available, since Sandeep is planning to use Hive. If that is the case, then MapReduce would be faster for such large files.

Regards,
Mohammad Tariq
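As a rough back-of-the-envelope illustration of the point above: a plain-text file is divided into input splits, and each split can be handled by its own map task. The sketch below assumes a 64 MB split size (the default HDFS block size in Hadoop 1.x), an uncompressed and therefore splittable input, and that "12GB" means 12 GiB. The function name is hypothetical, and the real degree of parallelism depends on how many map slots the cluster has.

```python
def estimated_map_tasks(file_size_bytes, split_size_bytes):
    """Number of input splits, using integer ceiling division."""
    return (file_size_bytes + split_size_bytes - 1) // split_size_bytes


MB = 1024 ** 2
GB = 1024 ** 3

# Assumed values: one 12 GiB file and a 64 MiB split size.
print(estimated_map_tasks(12 * GB, 64 * MB))  # prints 192
```

Under those assumptions one file becomes 192 independent chunks of work, whereas a single sequential script reads all 12 GB in one pass.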
-
Re: How to load csv data into HIVE
Sandeep Reddy P
2012-09-07, 15:07
Hi,
Thank you all for your help. I'll try both ways and get back to you.

--
Thanks,
sandeep
-
Re: How to load csv data into HIVE
praveenesh kumar
2012-09-08, 11:54
You can use Hadoop streaming; that would be much faster. Just run your cleaning shell script logic in the map phase and it will be done in just a few minutes. That will also keep the data in HDFS.

Regards,
Praveenesh
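A minimal sketch of what the cleaning logic might look like as a streaming mapper written in Python rather than shell. Hadoop streaming passes input lines to the mapper on stdin and collects whatever it writes to stdout. The sketch assumes one record per line, every field double-quoted, and no field containing the sequence `","` or a tab. All function names are hypothetical.

```python
import sys


def clean_record(line):
    """Convert one quoted, '|'-terminated CSV record into a TSV line."""
    line = line.strip()
    if line.endswith("|"):
        line = line[:-1]
    # Drop the outermost quotes, then split on the quote-comma-quote separator.
    if len(line) >= 2 and line[0] == '"' and line[-1] == '"':
        line = line[1:-1]
    return "\t".join(field.strip() for field in line.split('","'))


def run_mapper(lines, out):
    """Write a cleaned TSV line for every non-blank input line."""
    for line in lines:
        if line.strip():
            out.write(clean_record(line) + "\n")


def main():
    # Hadoop streaming supplies the records on stdin and reads stdout.
    run_mapper(sys.stdin, sys.stdout)
```

Saved as a script that ends with a call to `main()`, this could be passed to the streaming jar through its `-mapper` and `-file` options, with `-input` and `-output` pointing at HDFS directories and `-numReduceTasks 0` to make the job map-only. The location of the streaming jar varies between installations, so it is not shown here.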
-
RE: How to load csv data into HIVE
Connell, Chuck
2012-09-08, 12:18
I would like to hear more about this "Hadoop streaming to Hive" idea. I have used streaming jobs as mappers, with a Python script as map.py. Are you saying that such a streaming mapper can load its output into Hive? Can you send some example code? Hive wants to load "files", not individual lines/records. How would you do this?

Thanks very much,
Chuck
-
Re: How to load csv data into HIVE
Bejoy KS
2012-09-08, 12:33
Hi Chuck,
I believe Praveenesh was adding his thoughts to the discussion on preprocessing the data using MapReduce itself. If you go with Hadoop streaming, you can use the Python script in the mapper, and it will do the preprocessing in parallel on a large volume of data. The preprocessed data can then be loaded into a Hive table.

Regards,
Bejoy KS
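One way the last step might look, sketched under assumptions: the streaming job leaves tab-separated files in an HDFS output directory, and an external Hive table is declared on top of that directory, so Hive reads the files in place. The helper below only builds the HiveQL text. The table name, column names, and path in the usage note are invented for illustration (the thread never names them), every column is typed STRING for simplicity, and the inputs are not escaped or validated.

```python
def build_external_table_ddl(table, columns, location):
    """Return a CREATE EXTERNAL TABLE statement for tab-separated text files."""
    cols = ",\n".join("  {} STRING".format(name) for name in columns)
    return (
        "CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        "{cols}\n"
        ")\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'\n"
        "STORED AS TEXTFILE\n"
        "LOCATION '{location}';"
    ).format(table=table, cols=cols, location=location)
```

For example, `build_external_table_ddl("cleaned_records", ["id", "start_date"], "/user/hypothetical/cleaned")` returns a statement that can be pasted into the Hive shell. If the files should be moved into a managed table instead, Hive's `LOAD DATA INPATH ... INTO TABLE ...` statement is the usual alternative.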
-
Re: How to load csv data into HIVE
praveenesh kumar
2012-09-08, 14:35
Yup, Bejoy is correct :-) Just use Hadoop streaming for what it does best: cleaning, transformations, and validations, in just a few simple steps.

Regards,
Praveenesh
|