|
Anze
2010-11-03, 07:22
Alejandro Abdelnur
2010-11-03, 08:40
Anze
2010-11-03, 08:56
Sonal Goyal
2010-11-03, 09:27
Anze
2010-11-03, 11:32
Ankur C. Goel
2010-11-03, 14:18
Jai Krishna
2010-11-04, 11:38
Anze
2010-11-04, 12:30
arvind@...)
2010-11-03, 17:08
Anze
2010-11-03, 19:08
arvind@...)
2010-11-03, 19:17
Anze
2010-11-04, 09:26
arvind@...)
2010-11-04, 16:25
arvind@...)
2010-11-04, 16:29
Anze
2010-11-04, 20:02
Dmitriy Ryaboy
2010-11-04, 22:22
Aaron Kimball
2010-11-04, 23:16
Anze
2010-11-05, 13:42
Aaron Kimball
2010-11-05, 17:24
|
-
MySQL / JDBC / SQL DB Loader? Anze 2010-11-03, 07:22
Hi!

Part of the data I have resides in MySQL. Is there a loader that would allow loading directly from it?

I can't find anything on the net, but it seems to me this must be quite a common problem. I checked piggybank, but there is only DBStorage (and no DBLoader).

Is there a DBLoader out there too?

Thanks,

Anze
-
Re: MySQL / JDBC / SQL DB Loader? Alejandro Abdelnur 2010-11-03, 08:40
Not a 100% Pig solution, but you could use Sqoop to get the data in as a pre-processing step. And if you want to handle it all as a single job, you could use Oozie to create a workflow that runs Sqoop and then your Pig processing.

Alejandro
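The Sqoop-then-Pig sequence suggested above could look roughly like the following plain shell script. This is only a sketch of the two steps, not an Oozie workflow: the host, database, user, table name, HDFS path and the Pig script name `process.pig` are all placeholders, and the helper functions `run` and `import_then_process` are made up for illustration. By default it only prints the commands so they can be reviewed first.

```bash
#!/bin/bash
# Sketch only: host, database, user, table, paths and script name are placeholders.

# Print each command instead of running it unless RUN=1 is set,
# so the pipeline can be reviewed before it touches the database.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

# Import one table into HDFS with Sqoop, then start the Pig job on it.
# The && makes sure Pig only starts if the import step succeeded.
import_then_process() {
  table="$1"
  target_dir="/pig/mysql/${table}"
  run sqoop import \
      --connect "jdbc:mysql://mysql_host/database" \
      --username username -P \
      --table "$table" \
      --target-dir "$target_dir" \
      --fields-terminated-by '\t' &&
  run pig -param "INPUT=${target_dir}" process.pig
}

import_then_process table1
```

Running it with `RUN=1` in the environment would execute the two commands instead of printing them; `-P` makes Sqoop prompt for the password rather than taking it on the command line.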
-
Re: MySQL / JDBC / SQL DB Loader? Anze 2010-11-03, 08:56
Alejandro, thanks for answering!

I was hoping it could be done directly from Pig, but... :)

I'll take a look at Sqoop then, and if that doesn't help, I'll just write a simple batch script to export the data to TXT/CSV. Thanks for the pointer!

Anze
-
Re: MySQL / JDBC / SQL DB Loader? Sonal Goyal 2010-11-03, 09:27
Anze,

You can check hiho as well:

http://code.google.com/p/hiho/wiki/DatabaseImportFAQ

Let me know if you need any help.

Thanks and Regards,
Sonal

Sonal Goyal | Founder and CEO | Nube Technologies LLP
http://www.nubetech.co | http://in.linkedin.com/in/sonalgoyal
-
Re: MySQL / JDBC / SQL DB Loader? Anze 2010-11-03, 11:32
Sonal,

Thanks for answering!

Hiho sounds nice, but from what I gathered, it is more of a low-level interface for efficient loading from and storing to SQL DBs? (In other words, there is no loader or storage for Pig yet.)

I wrote a batch script to export the DB to local files and then copy them to HDFS, so there is no gain for me in using another type of export (unless it can be used directly from Pig and/or keeps the schema intact), but it's nice to know it exists.

It just seems weird that there is no DB loader for Pig yet. I tried writing one, but it would take more time than I have at the moment... I have a problem to solve ASAP. :)

Thanks,

Anze
-
Re: MySQL / JDBC / SQL DB Loader? Ankur C. Goel 2010-11-03, 14:18
Hitting the database from multiple mappers is not such a great idea IF there are hundreds or thousands of mappers processing hundreds of GBs of data. This could easily saturate the I/O bandwidth of the database server, creating a bottleneck in the processing. Exporting and dumping to HDFS is a better option.

-@nkur
-
Re: MySQL / JDBC / SQL DB Loader? Jai Krishna 2010-11-04, 11:38
Ankur,

In this case, there is no data on the grid a priori; the data has to come into the grid from a DB. So what would the C/M mappers run on? Is there a way to run, say, 5 mappers without having 5 blocks of data on HDFS?

Just trying to wrap my head around this; please excuse me if I'm missing something obvious.

Thanks
Jai
-
Re: MySQL / JDBC / SQL DB Loader? Anze 2010-11-04, 12:30
> In this case, there is no data on the grid a priori; the data has to come
> into the grid from a DB. So what would the C/M mappers run on? Is there a
> way to run say 5 mappers without having 5 blocks of data on HDFS?

No, but the data doesn't all have to come from the DB. The first copy does, but the other 4 can be replicated within the cluster. That is exactly what export + dump does, except that it is cumbersome to use, so it would be better if it could be done automatically (within a Pig loader).

That's how I see it at least... :)

Anze
-
Re: MySQL / JDBC / SQL DB Loader? arvind@...) 2010-11-03, 17:08
Anze,

Did you get a chance to try out Sqoop? If not, I would encourage you to do so. Here is a link to the user guide <http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html>.

Sqoop allows you to easily move data from relational databases and other enterprise systems to HDFS and back.

Arvind
-
Re: MySQL / JDBC / SQL DB Loader? Anze 2010-11-03, 19:08
I tried to run it, got a NullPointerException, searched the net, found that Sqoop requires the JDK (instead of the JRE) and gave up. I am working on a production cluster, so I'd rather not upgrade to the JDK if not necessary. :)

But I was able to export MySQL with a simple bash script:
**********
#!/bin/bash

# Tables to export and the local staging directory for the dump files.
# Replace the <...> placeholders below before running.
MYSQL_TABLES=( table1 table2 table3 )
WHERE=/home/hadoop/pig

for i in "${MYSQL_TABLES[@]}"
do
  # -B gives tab-separated batch output, -N omits the column names.
  mysql -BAN -h <mysql_host> -u <username> --password=<pass> <database> \
    -e "select * from $i;" --skip-column-names > "$WHERE/$i.csv"
  # Copy the dump to HDFS, then remove the local copy.
  hadoop fs -copyFromLocal "$WHERE/$i.csv" /pig/mysql/
  rm "$WHERE/$i.csv"
done
**********

Of course, in my case the tables were small enough that I could do it. And of course I lost the schema in the process.

Hope it helps someone else too...

Anze
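One way to soften the "lost the schema" problem of a plain dump is to generate a Pig schema string from the MySQL column metadata. The sketch below assumes the input is a list of `column<TAB>type` lines (for example from a query against `information_schema.columns`); the function name `mysql_to_pig_schema` is made up for illustration, and the type mapping is a deliberately small assumption, not a complete conversion table. Anything it does not recognise falls back to `chararray`.

```bash
#!/bin/bash
# Sketch only: the type mapping is a simplified assumption, not a complete
# MySQL-to-Pig conversion table.

# Read "column<TAB>mysql_type" lines on stdin and print a Pig schema string
# such as "id:int, name:chararray".
mysql_to_pig_schema() {
  awk -F '\t' '
    {
      type = tolower($2)
      if      (type ~ /^bigint/)                            pig = "long"
      else if (type ~ /^(tinyint|smallint|mediumint|int)/)  pig = "int"
      else if (type ~ /^float/)                             pig = "float"
      else if (type ~ /^double/)                            pig = "double"
      else                                                  pig = "chararray"
      # Separate fields with ", " from the second column onwards.
      printf "%s%s:%s", (NR > 1 ? ", " : ""), $1, pig
    }
    END { if (NR > 0) print "" }
  '
}

# Example with inline sample data instead of a live database.
printf 'id\tint\nname\tvarchar\nscore\tdouble\n' | mysql_to_pig_schema
# prints: id:int, name:chararray, score:double
```

The resulting string could then be pasted into the `AS (...)` clause of a Pig `LOAD ... USING PigStorage('\t')` statement for the exported file.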
-
Re: MySQL / JDBC / SQL DB Loader? arvind@...) 2010-11-03, 19:17
Sorry that you ran into a problem. Typically, something like a missing required option causes this, and if you were to send a mail to [EMAIL PROTECTED], you would get prompt assistance.

Regardless, if you still have any use cases like this, I will be glad to help you out in using Sqoop for that purpose.

Arvind
-
Re: MySQL / JDBC / SQL DB Loader? Anze 2010-11-04, 09:26
So Sqoop doesn't require the JDK? It seemed weird to me too. Also, if it did require it, then the JDK would probably have to be among the dependencies of the package Sqoop is in.

I started working on a DBLoader, but the learning curve seems quite steep and I don't have enough time for it right now. Also, as Ankur said, it might not be a good idea to hit MySQL from the cluster.

The ideal solution IMHO would be loading the data from MySQL to HDFS from a single machine (but within LoadFunc, of course) and working with the data from there (with the schema automatically converted from MySQL). But I don't know enough about Pig to do that kind of thing... yet. :)

Anze
-
Re: MySQL / JDBC / SQL DB Loader? arvind@...) 2010-11-04, 16:25
Sqoop is Java based and you should have JDK 1.6 or higher available on your system. We will add this as a dependency for the package.

Regarding accessing MySQL from a cluster: it should not be a problem if you control the number of tasks that do that. Sqoop allows you to explicitly specify the number of mappers, where each mapper holds a connection to the database and effectively parallelizes the loading process. Apart from speed, Sqoop offers many other advantages too, such as incremental loads, exporting data from HDFS back to the database, automatic creation of Hive tables, or populating HBase.

Arvind
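Capping the number of mappers, as described above, might look like the following sketch. The connection details and the split column `id` are placeholders, and `bounded_import_cmd` is a hypothetical helper that only builds and prints the command; `--num-mappers` and `--split-by` are the Sqoop import options for the degree of parallelism and the column used to divide the work.

```bash
#!/bin/bash
# Sketch only: connection details and the split column "id" are placeholders.

# Build a Sqoop import command that uses at most $2 parallel mappers.
# Each mapper holds one database connection, so this caps the load on MySQL.
bounded_import_cmd() {
  table="$1"
  mappers="$2"
  # Refuse anything that is not a plain non-negative integer.
  case "$mappers" in
    ''|*[!0-9]*) echo "mapper count must be a number: $mappers" >&2; return 1 ;;
  esac
  printf '%s\n' "sqoop import --connect jdbc:mysql://mysql_host/database --username username -P --table ${table} --split-by id --num-mappers ${mappers} --target-dir /pig/mysql/${table}"
}

# Print the command for review before running it by hand.
bounded_import_cmd table1 4
```

A small mapper count keeps the database load close to that of a single-machine export while still letting the import write straight to HDFS.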
-
Re: MySQL / JDBC / SQL DB Loader? arvind@...) 2010-11-04, 16:29
Anze - I just checked that our Sqoop packages do declare the JDK dependency. Which package did you see as not having this dependency?

Arvind
-
Re: MySQL / JDBC / SQL DB Loader? Anze 2010-11-04, 20:02
Hi Arvind!

Should we take this discussion off the list? It is not really Pig-related anymore... Not sure what the custom is around here. :)

> > process. Apart from just the speed, Sqoop offers many other advantages
> > too such as incremental loads, exporting data from HDFS back to the
> > database, automatic creation of Hive tables or populating hbase etc.

Only Pig is missing then... >:-D
Sorry, couldn't hold that back... ;)

I would love to use Sqoop for another task (periodically importing MySQL tables to HBase) if the schema gets more or less preserved; however, I don't dare upgrade JRE to JDK at the moment for fear of breaking things.

> Anze - I just checked that our Sqoop packages do declare the JDK
> dependency. Which package did you see as not having this dependency?

We are using:
-----
deb http://archive.cloudera.com/debian lenny-cdh3b1 contrib
-----
But there is no sqoop package per se, I guess it is part of the hadoop package:
-----
$ aptitude show hadoop-0.20 | grep Depends
Depends: adduser, sun-java6-jre, sun-java6-bin
-----
$ aptitude search sun-java6 | grep "jdk\|jre"
p   sun-java6-jdk - Sun Java(TM) Development Kit (JDK) 6
i A sun-java6-jre - Sun Java(TM) Runtime Environment (JRE) 6
-----

This is where aaron@cloudera advises that the JDK is needed (instead of the JRE) to run Sqoop successfully:
http://getsatisfaction.com/cloudera/topics/error_sqoop_sqoop_got_exception_running_sqoop_java_lang_nullpointerexception_java_lang_nullpointerexception-j7ziz

As I said, I am interested in Sqoop (and alternatives) as we will be facing the problem in the near future, so I appreciate your involvement in this thread!

Anze
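For anyone unsure whether a machine has only the JRE or the full JDK (the distinction this part of the thread turns on), one quick check is whether a Java compiler is visible on the PATH. The following is a minimal sketch, not something posted in the thread: the function name `java_kind` is made up for illustration, and a JDK installed outside the PATH would not be detected by it.

```shell
#!/bin/bash
# Report which Java tooling is visible on PATH.
# Prints "jdk" if a compiler (javac) is found, "jre-only" if only the
# java launcher is found, and "none" if neither is found.
# java_kind is a hypothetical helper name, not part of any package.
java_kind() {
  if command -v javac >/dev/null 2>&1; then
    echo "jdk"
  elif command -v java >/dev/null 2>&1; then
    echo "jre-only"
  else
    echo "none"
  fi
}

java_kind
```

This only inspects the PATH; the package manager view (the `aptitude` queries above) remains the authoritative answer for what is installed.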
Anze 2010-11-04, 20:02
-
Re: MySQL / JDBC / SQL DB Loader? Dmitriy Ryaboy 2010-11-04, 22:22
I imagine writing a Pig 0.7+ loader for the Sqoop files would be pretty easy, since iirc Sqoop does generate an input format for you.

Good project for someone looking to get started in contributing to Pig ... :)

-D
Dmitriy Ryaboy 2010-11-04, 22:22
-
Re: MySQL / JDBC / SQL DB Loader? Aaron Kimball 2010-11-04, 23:16
By default, Sqoop should load files into HDFS as delimited text. The existing PigStorage should be able to work with the data loaded here. Similarly, if you use PigStorage to write data back to HDFS in delimited text form, Sqoop can export those files to your RDBMS.

Also, Sqoop only needs to be installed on the client machine; it doesn't require modifying your Hadoop deployment on your servers anywhere. If you're writing any Java MapReduce programs, or Java UDFs for Pig, it's likely you've already got the JDK on this machine.

- Aaron
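As a rough illustration of the delimited-text route described here, the sketch below prints (it does not run) one Sqoop import command per table, using a single mapper so that only one connection is opened to MySQL. The host, database, user name and HDFS path are placeholders in the spirit of the bash script earlier in the thread, `build_import_cmd` is a made-up helper name, and the flags are written as I recall them from the Sqoop user guide, so check them against your Sqoop version before use.

```shell
#!/bin/bash
# Dry-run sketch: print (do not execute) one Sqoop import command per table.
# All values below are placeholders, not real hosts or credentials.
MYSQL_TABLES="table1 table2 table3"
JDBC_URL="jdbc:mysql://mysql_host/database"
DB_USER="username"
HDFS_BASE="/pig/mysql"

# build_import_cmd is a hypothetical helper; it only formats a command line.
# -P asks Sqoop to prompt for the password instead of passing it as an argument.
# --num-mappers 1 keeps the import to a single database connection.
build_import_cmd() {
  printf 'sqoop import --connect %s --username %s -P --table %s --target-dir %s/%s --fields-terminated-by "\\t" --num-mappers 1\n' \
    "$JDBC_URL" "$DB_USER" "$1" "$HDFS_BASE" "$1"
}

for t in $MYSQL_TABLES; do
  build_import_cmd "$t"
done
```

PigStorage splits on tabs by default, so files imported with the tab delimiter above should be readable with a plain `LOAD '/pig/mysql/table1' USING PigStorage()`. If you keep Sqoop's own default field delimiter instead (a comma, as far as I know), pass that delimiter to PigStorage explicitly.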
Aaron Kimball 2010-11-04, 23:16
-
Re: MySQL / JDBC / SQL DB Loader? Anze 2010-11-05, 13:42
> > I imagine writing a Pig 0.7+ loader for the Sqoop files would be pretty
> > easy, since iirc Sqoop does generate an input format for you.

Yes, but if I remember correctly (I looked at Sqoop quite some time ago) Sqoop generates classes based on the SQL the user provides. Unless you suggest using the input format classes only as a starting point? That would probably work...

> > Good project for someone looking to get started in contributing to Pig

It is tempting. :)

> Also, Sqoop only needs to be installed on the client machine; it doesn't
> require modifying your Hadoop deployment on your servers anywhere. If
> you're writing any Java MapReduce programs, or Java UDFs for Pig, it's
> likely you've already got the JDK on this machine already.

I am running Pig remotely, not from a local machine. But this will be changed soon, so it will not be a problem for me anymore.

Thanks,

Anze
Anze 2010-11-05, 13:42
-
Re: MySQL / JDBC / SQL DB Loader? Aaron Kimball 2010-11-05, 17:24
Sqoop does generate a class on the client machine which is then shipped to the cluster during the processing phase. See http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html#_basic_usage for some more details about this process. Instances of this class may be marshaled into SequenceFiles if you'd like to keep your data in binary form.

If you're storing your data as text (the default), the generated class is discarded after the import. Then you can use the regular text-based loader in Pig, or TextInputFormat in MapReduce, etc.

If you want to store your data in a binary encoding (SequenceFiles) and still use it in Pig, you'd need to write your own loader. This should be relatively straightforward; you'd just need to read the records out of SequenceFiles into instances of the generated class (which could be specified as a parameter to the loader). Generated classes in Sqoop fulfill the interface FieldMappable (https://github.com/cloudera/sqoop/blob/master/src/java/com/cloudera/sqoop/lib/FieldMappable.java) which allows you to iterate over the fields in the record. I'm not a Pig expert, but I doubt this would be too hard to convert to a map-based type used more broadly in Pig.

Good luck
- Aaron
Aaron Kimball 2010-11-05, 17:24
|