Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Plain View
Sqoop, mail # user - Sqoop export - incremental extracts

Sadananda Hegde 2012-10-26, 00:38
Jarek Jarcec Cecho 2012-10-26, 01:04
Copy link to this message
Re: Sqoop export - incremental extracts
Sadananda Hegde 2012-10-26, 19:01
Thanks Jarek.
Thanks Jarek.

Here is the use case.

1.      My Hive table contains detailed transaction level data and
continuously getting updated throughout the day (say every 15 minutes)

2.      I have to send summary data from Hive/HDFS to other systems like
EDW say twice a day.

This need to be automated and scheduled in production. I need to implement
incremental logic so that I can export only the changes every time. I was
reading about incremental options in Sqoop Import. It has kind of features
I am looking for; but I need them on Sqoop Export. Since export does not
provide that feature, I may have to track it myself.  Some how I need to
keep track of when was the last time export ran successfully and what data
has been added to Hive since then. Then I can do something like:

1.      Execute Hive Query to extract the data I need to send (summary and
only changes):

    Select fld1, fld,2, sum(fld3), …

    From tableA

    Where <HDFS_File_create_timestamp>   > <last_extract_timestamp>

    Group by fld, fld2, …

2.      Use SQOOP Export to export the result file to EDW

I am not sure where / how to get HDFS_File_create_timestamp and
last_extract_timestamp  values so it can be used dynamically inside Hive

Any ideas??? Are there any other options?
Thanks for your help.


On Thu, Oct 25, 2012 at 8:04 PM, Jarek Jarcec Cecho <[EMAIL PROTECTED]>wrote:

> Hi Sadu,
> unfortunately Sqoop export is taking entire input directory (--export-dir)
> and simply exporting it's content to the external database/warehouse
> system. I'm afraid that there isn't more sophisticated way of doing
> "incremental" exports then using different hdfs directories for each
> "incremental" part.
> If you could describe your use case, there might be other ways how to
> achieve similar results.
> Jarcec
> On Thu, Oct 25, 2012 at 07:38:21PM -0500, Sadananda Hegde wrote:
> > Hello,
> >
> > I am exploring sqoop to send data from hadoop to EDW. I don't want to
> send
> > the same data again and again. I need to identify the changes in HDFS and
> > send only the data that has changed since my previous export. What is the
> > best way to implement such incremental export logic?  I see that sqoop
> > import has incremental logic option; but can't see it in export.
> >
> > Any recomendations / suggestions would greatly be appreciated.
> >
> > Thanks,
> > Sadu
Jarek Jarcec Cecho 2012-10-26, 21:22