Sadananda Hegde 2012-10-26, 00:38
Jarek Jarcec Cecho 2012-10-26, 01:04
Re: Sqoop export - incremental extracts
Sadananda Hegde 2012-10-26, 19:01
Here is the use case.
1. My Hive table contains detailed transaction-level data and is
continuously updated throughout the day (say every 15 minutes).
2. I have to send summary data from Hive/HDFS to other systems like the
EDW, say twice a day.
This needs to be automated and scheduled in production. I need to implement
incremental logic so that I can export only the changes each time. I was
reading about the incremental options in Sqoop Import. It has the kind of
features I am looking for, but I need them in Sqoop Export. Since export does
not provide that feature, I may have to track it myself. Somehow I need to
keep track of when the export last ran successfully and what data
has been added to Hive since then. Then I can do something like:
1. Execute Hive Query to extract the data I need to send (summary and
Select fld1, fld2, sum(fld3), …
Where <HDFS_File_create_timestamp> > <last_extract_timestamp>
Group by fld1, fld2, …
2. Use SQOOP Export to export the result file to EDW
I am not sure where / how to get HDFS_File_create_timestamp and
last_extract_timestamp values so it can be used dynamically inside Hive
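One possible way to wire the two steps together is a small driver script that keeps the last successful extract timestamp in a state file and passes it into the Hive query. The following is only a minimal sketch under several assumptions that are not in the thread: the table name `transactions`, a `load_ts` column recording when each row was loaded (standing in for `<HDFS_File_create_timestamp>`), the state file name, and the `/tmp/edw_export` base directory are all hypothetical. It writes each run to its own HDFS directory, which is then used as `--export-dir`, and it only advances the stored timestamp after both commands succeed.

```python
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical location of the "last successful export" marker.
STATE_FILE = Path("last_export.json")


def read_watermark(state_file=STATE_FILE, default="1970-01-01 00:00:00"):
    """Return the timestamp of the last successful export, or a default."""
    if state_file.exists():
        return json.loads(state_file.read_text())["last_extract_timestamp"]
    return default


def write_watermark(ts, state_file=STATE_FILE):
    """Persist the upper bound of the export that just succeeded."""
    state_file.write_text(json.dumps({"last_extract_timestamp": ts}))


def build_hive_query(last_ts, upper_ts, out_dir):
    # Assumption: `transactions` has a `load_ts` column; adjust to your schema.
    return (
        f"INSERT OVERWRITE DIRECTORY '{out_dir}' "
        "SELECT fld1, fld2, SUM(fld3) FROM transactions "
        f"WHERE load_ts > '{last_ts}' AND load_ts <= '{upper_ts}' "
        "GROUP BY fld1, fld2"
    )


def build_sqoop_command(out_dir, jdbc_url, table):
    # Hive writes Ctrl-A delimited text by default, hence the \001 delimiter.
    return [
        "sqoop", "export",
        "--connect", jdbc_url,
        "--table", table,
        "--export-dir", out_dir,
        "--input-fields-terminated-by", "\\001",
    ]


def run_export(jdbc_url, table, state_file=STATE_FILE,
               base_dir="/tmp/edw_export", runner=subprocess.run, now=None):
    """Extract rows newer than the watermark, export them, then advance it."""
    last_ts = read_watermark(state_file)
    upper_ts = (now or datetime.now(timezone.utc)).strftime("%Y-%m-%d %H:%M:%S")
    # One directory per run, so each export only sees its own increment.
    out_dir = f"{base_dir}/{upper_ts.replace(' ', '_').replace(':', '')}"
    runner(["hive", "-e", build_hive_query(last_ts, upper_ts, out_dir)], check=True)
    runner(build_sqoop_command(out_dir, jdbc_url, table), check=True)
    # Reached only if both commands succeeded (check=True raises otherwise).
    write_watermark(upper_ts, state_file)
    return out_dir
```

A scheduler (cron, Oozie, or similar) could call `run_export(...)` twice a day. Note that with this approach each run sends sums over the new rows only, so the receiving side would need to add them to what it already holds.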
Any ideas??? Are there any other options?
Thanks for your help.
On Thu, Oct 25, 2012 at 8:04 PM, Jarek Jarcec Cecho <[EMAIL PROTECTED]> wrote:
> Hi Sadu,
> unfortunately Sqoop export takes the entire input directory (--export-dir)
> and simply exports its content to the external database/warehouse
> system. I'm afraid that there isn't a more sophisticated way of doing
> "incremental" exports than using different HDFS directories for each
> "incremental" part.
> If you could describe your use case, there might be other ways to
> achieve similar results.
> On Thu, Oct 25, 2012 at 07:38:21PM -0500, Sadananda Hegde wrote:
> > Hello,
> > I am exploring Sqoop to send data from Hadoop to EDW. I don't want to send
> > the same data again and again. I need to identify the changes in HDFS and
> > send only the data that has changed since my previous export. What is the
> > best way to implement such incremental export logic? I see that Sqoop
> > import has an incremental option, but I can't see it in export.
> > Any recommendations / suggestions would be greatly appreciated.
> > Thanks,
> > Sadu
Jarek Jarcec Cecho 2012-10-26, 21:22