Hadoop >> mail # user >> Import data from mysql


Re: Import data from mysql
Hey Brian,

One final point about Sqoop: it's a part of Cloudera's Distribution for
Hadoop, so it's Apache 2.0 licensed and tightly integrated with the other
platform components. This means, for example, that we have added a Sqoop
action to Oozie, which makes integrating data import and export into your
workflows trivial; see
http://archive.cloudera.com/cdh/3/oozie-2.2.1+82/WorkflowActionExtensionsSpec.html#AE.2_Sqoop_Action
for more details.
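As an illustration, a Sqoop step inside an Oozie workflow has roughly the shape below. This is a sketch from memory of the Sqoop action extension, not taken from the spec linked above; the schema version, element names, and all connection details (host, database, table, paths) are assumptions that should be checked against that document:

```xml
<!-- Hypothetical workflow fragment: runs a Sqoop import as one action. -->
<action name="sqoop-import">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <command>import --connect jdbc:mysql://db.example.com/mydb --table orders --target-dir /data/orders</command>
    </sqoop>
    <ok to="next-step"/>
    <error to="fail"/>
</action>
```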

For further discussion of Sqoop, I'd recommend using the Sqoop user list at
https://groups.google.com/a/cloudera.org/group/sqoop-user. For questions
about CDH in general, see
https://groups.google.com/a/cloudera.org/group/cdh-user.

Regards,
Jeff

On Sun, Jan 9, 2011 at 1:37 PM, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:

> Hi Brian,
>
> Sqoop supports incremental imports that can be run against a live database
> system on a daily basis to pull in new data. Unless your data is very large
> and cannot be split into comparable slices for parallel import, I do not
> see any performance concerns.
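A daily incremental import along these lines might look like the following sketch, assuming Sqoop 1's command-line flags; the host, database, table, column, and last-value are all hypothetical placeholders:

```shell
# Import only rows whose id exceeds the last value seen, splitting the
# work across four parallel map tasks (hypothetical connection details).
sqoop import \
  --connect jdbc:mysql://db.example.com/mydb \
  --username dbuser -P \
  --table orders \
  --incremental append \
  --check-column id \
  --last-value 12345 \
  --split-by id \
  --num-mappers 4 \
  --target-dir /data/orders
```

On success, Sqoop prints the new high-water mark for `--check-column`, which you would feed back in as `--last-value` on the next run.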
>
> Regarding the database library you have pointed out, it is fundamentally
> very close to what Sqoop does. However, Sqoop goes well beyond these
> libraries to ensure that you can address your use case out of the box
> without having to modify anything. If, on the other hand, you are more
> inclined to code your own solution, then the other tools or these
> low-level APIs may come in handy.
>
> Arvind
>
> On Sun, Jan 9, 2011 at 5:21 AM, Brian McSweeney
> <[EMAIL PROTECTED]> wrote:
>
> > Thanks Konstantin,
> >
> > I had seen sqoop. I wonder is it normally used as a once-off process or
> > can it also be effectively used on a live database system on a daily
> > basis to batch export. Are there performance issues with this approach?
> > Or how would it compare to some of the other classes that I have seen,
> > such as those in the database library:
> > http://hadoop.apache.org/mapreduce/docs/current/api/
> >
> > I have also seen a few alternatives out there such as cascading and
> > cascading-dbmigrate
> >
> > http://architects.dzone.com/articles/tools-moving-sql-database
> >
> > But from the hadoop api above it also seems that some of this
> functionality
> > is perhaps now in the main api. I suppose any experience people have is
> > welcome. I would want to run a batch job to export every day, perform my
> > map
> > reduce, and then import the results back into mysql afterwards.
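The return leg of this round trip (pushing MapReduce output back into mysql) can be sketched with Sqoop 1's export command; the table name, paths, and delimiter here are hypothetical and assume the job wrote delimited text files:

```shell
# Push job output (e.g. tab-separated files under /output/results)
# back into an existing MySQL table (hypothetical names).
sqoop export \
  --connect jdbc:mysql://db.example.com/mydb \
  --username dbuser -P \
  --table results \
  --export-dir /output/results \
  --input-fields-terminated-by '\t'
```

The target table must already exist; Sqoop export inserts rows, it does not create schema.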
> >
> > cheers,
> > Brian
> >
> > On Sun, Jan 9, 2011 at 3:18 AM, Konstantin Boudnik <[EMAIL PROTECTED]>
> > wrote:
> >
> > > There's a supported tool with all bells and whistles:
> > >  http://www.cloudera.com/downloads/sqoop/
> > >
> > > --
> > >   Take care,
> > > Konstantin (Cos) Boudnik
> > >
> > > On Sat, Jan 8, 2011 at 18:57, Sonal Goyal <[EMAIL PROTECTED]>
> > > wrote:
> > > > Hi Brian,
> > > >
> > > > You can check HIHO at https://github.com/sonalgoyal/hiho which can
> > > > help you load data from any JDBC database to the Hadoop file system.
> > > > If your table has a date or id field, or any indicator for
> > > > modified/newly added rows, you can import only the altered rows
> > > > every day. Please let me know if you need help.
> > > >
> > > > Thanks and Regards,
> > > > Sonal
> > > > <https://github.com/sonalgoyal/hiho> Connect Hadoop with databases,
> > > > Salesforce, FTP servers and others <https://github.com/sonalgoyal/hiho>
> > > > Nube Technologies <http://www.nubetech.co>
> > > >
> > > > <http://in.linkedin.com/in/sonalgoyal>
> > > >
> > > > On Sun, Jan 9, 2011 at 5:03 AM, Brian McSweeney
> > > > <[EMAIL PROTECTED]> wrote:
> > > >
> > > >> Hi folks,
> > > >>
> > > >> I'm a TOTAL newbie on hadoop. I have an existing webapp that has a
> > > >> growing number of rows in a mysql database that I have to compare
> > > >> against one another once a day from a batch job. This is an
> > > >> exponential problem