Sqoop user mailing list >> Sqoop export - incremental extracts


Sadananda Hegde 2012-10-26, 00:38
Jarek Jarcec Cecho 2012-10-26, 01:04
Re: Sqoop export - incremental extracts
Thanks Jarek.

Here is the use case.

1.      My Hive table contains detailed transaction-level data and is
continuously getting updated throughout the day (say every 15 minutes).

2.      I have to send summary data from Hive/HDFS to other systems, like an
EDW, say twice a day.

This needs to be automated and scheduled in production. I need to implement
incremental logic so that I export only the changes each time. I was reading
about the incremental options in Sqoop import; they are the kind of features
I am looking for, but I need them on Sqoop export. Since export does not
provide that feature, I may have to track it myself. Somehow I need to keep
track of when the export last ran successfully and what data has been added
to Hive since then. Then I can do something like:

1.      Execute Hive Query to extract the data I need to send (summary and
only changes):

    Select fld1, fld2, sum(fld3), …

    From tableA

    Where <HDFS_File_create_timestamp> > <last_extract_timestamp>

    Group by fld1, fld2, …

2.      Use Sqoop export to send the result file to the EDW (rough sketch below)
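
Roughly, those two steps might look like this (a sketch only; since there is
no per-row HDFS file create timestamp visible to Hive, I am assuming a load_ts
audit column that I would add when loading the table, and the paths, table
names, and connection details are just placeholders):

    LAST_TS="2012-10-25 12:00:00"                 # last successful export time
    STAGE_DIR=/user/etl/export_staging/summary    # placeholder HDFS staging dir

    # Step 1: write only the new, summarized rows into the staging directory
    # (Hive writes ^A-delimited text files, hence the field terminator below)
    hive --hiveconf last_ts="${LAST_TS}" -e "
      -- load_ts is an assumed audit column populated when the data is loaded
      INSERT OVERWRITE DIRECTORY '${STAGE_DIR}'
      SELECT fld1, fld2, SUM(fld3)
      FROM   tableA
      WHERE  load_ts > '\${hiveconf:last_ts}'
      GROUP BY fld1, fld2;"

    # Step 2: export the staging directory to the EDW table (placeholder
    # connection string, credentials, and target table)
    sqoop export \
      --connect "jdbc:<edw-vendor>://edw-host/edw_db" \
      --username etl_user --password '<password>' \
      --table SUMMARY_TABLE \
      --export-dir "${STAGE_DIR}" \
      --input-fields-terminated-by '\001'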

I am not sure where / how to get the HDFS_File_create_timestamp and
last_extract_timestamp values so they can be used dynamically inside the
Hive query.
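
The best I have come up with so far is to keep a small "last export" marker
file on HDFS and wrap everything in the scheduled script, along these lines
(the path and the timestamp format are made up):

    #!/bin/bash
    set -e                                   # abort on the first failure

    MARKER=/user/etl/export_staging/last_export_ts   # placeholder path

    # Previous high-water mark; fall back to the epoch on the very first run
    if hadoop fs -test -e "${MARKER}"; then
      LAST_TS=$(hadoop fs -cat "${MARKER}")
    else
      LAST_TS="1970-01-01 00:00:00"
    fi

    # Capture the new mark *before* extracting, so rows that land while the
    # export is running are picked up by the next run instead of being skipped
    NEW_TS=$(date +"%Y-%m-%d %H:%M:%S")

    # ... Hive query + Sqoop export from the sketch above, filtered on LAST_TS ...

    # Because of set -e we only get here if Hive and Sqoop both succeeded,
    # so it is safe to advance the marker now
    hadoop fs -rm "${MARKER}" >/dev/null 2>&1 || true
    echo "${NEW_TS}" | hadoop fs -put - "${MARKER}"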

Any ideas??? Are there any other options?
Thanks for your help.

Sadu

On Thu, Oct 25, 2012 at 8:04 PM, Jarek Jarcec Cecho <[EMAIL PROTECTED]> wrote:

> Hi Sadu,
> unfortunately Sqoop export takes the entire input directory (--export-dir)
> and simply exports its content to the external database/warehouse system.
> I'm afraid that there isn't a more sophisticated way of doing "incremental"
> exports than using different HDFS directories for each "incremental" part.
>
> If you could describe your use case, there might be other ways to achieve
> similar results.
>
> Jarcec
>
> On Thu, Oct 25, 2012 at 07:38:21PM -0500, Sadananda Hegde wrote:
> > Hello,
> >
> > I am exploring Sqoop to send data from Hadoop to an EDW. I don't want to
> > send the same data again and again. I need to identify the changes in HDFS
> > and send only the data that has changed since my previous export. What is
> > the best way to implement such incremental export logic? I see that Sqoop
> > import has an incremental option, but I can't see it in export.
> >
> > Any recommendations / suggestions would be greatly appreciated.
> >
> > Thanks,
> > Sadu
>
Jarek Jarcec Cecho 2012-10-26, 21:22