You can achieve similar functionality with Sqoop in several ways, depending on the connector you use:
1) You can always remove previously imported data manually (or by script) if you can easily identify it before running Sqoop. For example, you might create a script that removes the previously imported data (if present) and then executes Sqoop.
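A minimal sketch of this pattern, assuming a MySQL target, a hypothetical table daily_stats keyed by a load_date column, and credentials in $DB_PASS; the host, table, and predicate are placeholders you would adapt to your schema:

```shell
#!/bin/bash
set -e
LOAD_DATE="2012-09-09"

# Remove any rows left behind by a previous (possibly partial) run of
# this day's export, so the job is safe to retry from the top.
mysql -h dbhost -u etl -p"$DB_PASS" warehouse \
  -e "DELETE FROM daily_stats WHERE load_date = '$LOAD_DATE'"

# Re-run the export; if it fails part-way, simply re-run this script.
sqoop export \
  --connect jdbc:mysql://dbhost/warehouse \
  --username etl --password "$DB_PASS" \
  --table daily_stats \
  --export-dir /user/etl/daily_stats/"$LOAD_DATE"
```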
2) You can take advantage of a staging table using the parameters --staging-table and --clear-staging-table. This way, Sqoop will first import your data in parallel into the staging table and promote it to the destination table only if all parallel execution tasks succeed. Please note that the staging option is not available in all connectors (direct connectors typically do not support it).
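A sketch of the staging-table approach; daily_stats_stage is a hypothetical staging table with the same schema as the target (connection details as above are placeholders). Rows land in the staging table first and are promoted to daily_stats only if every map task succeeds:

```shell
sqoop export \
  --connect jdbc:mysql://dbhost/warehouse \
  --username etl --password "$DB_PASS" \
  --table daily_stats \
  --staging-table daily_stats_stage \
  --clear-staging-table \
  --export-dir /user/etl/daily_stats/2012-09-09
```

With --clear-staging-table, Sqoop empties the staging table before the export begins, so leftovers from an earlier failed run cannot be promoted by mistake.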
3) Lastly, you can use "upsert" functionality. Some connectors (MySQL, Oracle) support --update-mode allowinsert, which will either insert a new row or update an existing one if it is already present in the table. Please note that this solution has the worst performance of the three.
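A sketch of the upsert approach, assuming the hypothetical target table has a unique key column named id; rows whose key already exists are updated, and new keys are inserted:

```shell
sqoop export \
  --connect jdbc:mysql://dbhost/warehouse \
  --username etl --password "$DB_PASS" \
  --table daily_stats \
  --update-key id \
  --update-mode allowinsert \
  --export-dir /user/etl/daily_stats/2012-09-09
```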
On Sun, Sep 09, 2012 at 12:42:45PM +0530, Adarsh Sharma wrote:
> I am using Sqoop-1.4.2 from the past few days in a hadoop cluster of 10
> As per the documentation of sqoop 9.4 Export & Transactions , the export
> operation is not atomic in the database because it creates separate
> transactions to insert records.
> For example, if a map task failed to export its transaction while others
> succeeded, it would lead to partial & incomplete results in database tables.
> I created a bash script to load data from a CSV (daily CSVs) of 500
> thousand records into the db, in which I delete that day's records
> before loading the CSV into the db, so that if there is an issue while
> loading a day's CSV, we get correct results by re-running the job.
> Can we achieve the same functionality in Sqoop, so that if a Sqoop job
> fails some map tasks, we achieve correct & complete (no duplicate)
> records in the db?