Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Sqoop >> mail # user >> Sqoop export - incremental extracts


Copy link to this message
-
Re: Sqoop export - incremental extracts
Hi Sadu,
thank you very much for your detailed used case. I'm afraid that Sqoop do not support your use case out of the box at the moment and you will most likely need some workaround similar to the one you've suggested.

I do not know how you're going to automatize your workflow so I can't much comment where/how to store the time stamp values. In case of a bash scripting you might save last values in same working directory or in case of an Oozie execution you might create support java actions that will get and save values from some external database system.

Jarcec

On Fri, Oct 26, 2012 at 02:01:24PM -0500, Sadananda Hegde wrote:
> Thanks Jarek.
>
>
> Thanks Jarek.
>
>
>
> Here is the use case.
>
>
>
> 1.      My Hive table contains detailed transaction level data and
> continuously getting updated throughout the day (say every 15 minutes)
>
> 2.      I have to send summary data from Hive/HDFS to other systems like
> EDW say twice a day.
>
>
>
> This need to be automated and scheduled in production. I need to implement
> incremental logic so that I can export only the changes every time. I was
> reading about incremental options in Sqoop Import. It has kind of features
> I am looking for; but I need them on Sqoop Export. Since export does not
> provide that feature, I may have to track it myself.  Some how I need to
> keep track of when was the last time export ran successfully and what data
> has been added to Hive since then. Then I can do something like:
>
>
>
> 1.      Execute Hive Query to extract the data I need to send (summary and
> only changes):
>
>     Select fld1, fld,2, sum(fld3), …
>
>     From tableA
>
>     Where <HDFS_File_create_timestamp>   > <last_extract_timestamp>
>
>     Group by fld, fld2, …
>
>
>
> 2.      Use SQOOP Export to export the result file to EDW
>
>
>
> I am not sure where / how to get HDFS_File_create_timestamp and
> last_extract_timestamp  values so it can be used dynamically inside Hive
> query.
>
>
>
> Any ideas??? Are there any other options?
>
>
> Thanks for your help.
>
>
>
> Sadu
>
>
>
>
>
> On Thu, Oct 25, 2012 at 8:04 PM, Jarek Jarcec Cecho <[EMAIL PROTECTED]>wrote:
>
> > Hi Sadu,
> > unfortunately Sqoop export is taking entire input directory (--export-dir)
> > and simply exporting it's content to the external database/warehouse
> > system. I'm afraid that there isn't more sophisticated way of doing
> > "incremental" exports then using different hdfs directories for each
> > "incremental" part.
> >
> > If you could describe your use case, there might be other ways how to
> > achieve similar results.
> >
> > Jarcec
> >
> > On Thu, Oct 25, 2012 at 07:38:21PM -0500, Sadananda Hegde wrote:
> > > Hello,
> > >
> > > I am exploring sqoop to send data from hadoop to EDW. I don't want to
> > send
> > > the same data again and again. I need to identify the changes in HDFS and
> > > send only the data that has changed since my previous export. What is the
> > > best way to implement such incremental export logic?  I see that sqoop
> > > import has incremental logic option; but can't see it in export.
> > >
> > > Any recomendations / suggestions would greatly be appreciated.
> > >
> > > Thanks,
> > > Sadu
> >