Re: SQOOP INCREMENTAL PULL ISSUE (PLEASE SUGGEST.)
Hello Jarcec,

I think I have found the issue, and I hope this is the cause: I am losing
data when doing the incremental pull.

I have cross-checked it and found the following:

sqoop import -libjars \
 --driver com.sybase.jdbc3.jdbc.SybDriver \
 --query "select * from EMP where \$CONDITIONS and SAL > 201401200 and SAL <= 201401204" \
 --check-column Unique_value \
 --incremental append \
 --last-value 201401200 \
 --split-by DEPT \
 --fields-terminated-by ',' \
 --target-dir ${TARGET_DIR}/${INC} \
 --username ${SYBASE_USERNAME} \
 --password ${SYBASE_PASSWORD}
With this I have imported the newly inserted RDBMS data into HDFS.
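One quick check (a sketch; the paths and shell variables follow the command above, and the count query is an assumption about what Sqoop effectively ran) is to compare the rows Sqoop actually landed in HDFS with the rows the source query should have matched:

```shell
# Rows Sqoop wrote for this increment: one line per record, since
# --fields-terminated-by ',' produces delimited text files.
hdfs dfs -cat ${TARGET_DIR}/${INC}/part-* | wc -l

# Rows the import should have matched in Sybase (note that with
# --incremental append, Sqoop also adds "Unique_value > <last-value>"
# to the WHERE clause). Run via any SQL client:
#
#   select count(*) from EMP
#   where SAL > 201401200 and SAL <= 201401204
#     and Unique_value > 201401200;
```

If these two numbers already disagree, the loss happens at import time rather than in the Hive processing.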

but when I do

select count(*), Unique_value from EMP group by Unique_value

(both in the RDBMS and in Hive)

I see a huge data loss:

1) in RDBMS

  Count(*)    Unique_value
  1000        201401201
  5000        201401202
  10000       201401203
2) in HIVE

  Count(*)    Unique_value
  189         201401201
  421         201401202
  50          201401203
If I do

select Unique_value from EMP;

Result :
201401201
201401201
201401201
201401201
201401201
.
.
201401202
.
.
and so on...
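To see exactly which values are short, the same per-value counts can be dumped from both sides and compared (a sketch; it assumes the Hive external table is also named EMP, and that the RDBMS side is exported with a matching query):

```shell
# Per-value counts from Hive, sorted so the two files line up for diff.
hive -e "select Unique_value, count(*) from EMP group by Unique_value order by Unique_value" \
  > hive_counts.txt

# Produce rdbms_counts.txt with the same query in the RDBMS, then:
diff rdbms_counts.txt hive_counts.txt
```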
Please help and suggest why this is so.
Many thanks in advance,

Yogesh kumar

On Sun, Jan 12, 2014 at 11:08 PM, Jarek Jarcec Cecho <[EMAIL PROTECTED]>wrote:

> Hi Yogesh,
> I would start by verifying the imported data. If there are duplicates, then
> that suggests some misconfiguration of Sqoop; otherwise you might have
> some inconsistency further down the pipeline.
>
> Jarcec
>
> On Sat, Jan 11, 2014 at 11:01:22PM +0530, yogesh kumar wrote:
> > Hello All,
> >
> > I am working on a use case where I have to run a daily process that will:
> >
> > 1) Pull each day's newly inserted RDBMS data into HDFS
> > 2) Use an external table in Hive (pointing to the HDFS
> > directory where the data is pulled by Sqoop)
> > 3) Perform some Hive queries (joins) and create a final internal table in
> > Hive (say, Hive_Table_Final).
> >
> >
> > What I am doing:
> >
> > I am migrating a process from the RDBMS to Hadoop (the same process is
> > currently executed as an RDBMS procedure, with results stored in a final
> > table, say Rdbms_Table_Final).
> >
> > The issue I am facing:
> >
> > Every time I do an incremental import, after processing I find that the
> > values in the final Hive table are multiplied by the number of incremental
> > imports I have done. For example, if I perform an incremental import once
> > a day for 4 days, the data in the final Hive table (Hive_Table_Final) is
> > multiplied by 4 with respect to the final table in the RDBMS
> > (Rdbms_Table_Final).
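The multiply-by-N symptom described above typically appears when each run re-processes all accumulated increments and appends into the final table. A minimal sketch of one common remedy, assuming the existing join/aggregation query stays unchanged (the table name follows the message; INSERT OVERWRITE is standard HiveQL):

```shell
# Rebuild Hive_Table_Final on every run instead of appending to it, so
# re-processing all accumulated increments cannot stack the results.
# (Replace the SELECT with the existing join/aggregation, unchanged.)
hive -e "INSERT OVERWRITE TABLE Hive_Table_Final SELECT ..."
```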
> >
> >
> > Like:
> >
> > 1) The first time, I pulled the data from the RDBMS based on the months
> > (like from 2013-12-01 to 2013-01-01) and processed it, and got perfect
> > results: the data in the final Hive table (Hive_Table_Final) matched the
> > processed RDBMS data in (Rdbms_Table_Final).
> >
> > 2) I did an incremental import to bring the new data from the RDBMS into
> > HDFS using this command:
> >
> >
> >  sqoop import -libjars \
> >  --driver com.sybase.jdbc3.jdbc.SybDriver \
> >  --query "select * from EMP where \$CONDITIONS and SAL > 50000 and SAL <= 80000" \
> > --check-column Unique_value \
> >  --incremental append \
> >  --last-value 201401200 \
> >  --split-by DEPT \
> >  --fields-terminated-by ',' \
> >  --target-dir ${TARGET_DIR}/${INC} \
> >  --username ${SYBASE_USERNAME} \
> >  --password ${SYBASE_PASSWORD}
> >
> > "Note -- the field Unique_value is unique for every row; it is
> > like a primary key."
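For context on what the command above effectively runs (a simplification, reconstructed from Sqoop's documented behavior rather than from this thread): \$CONDITIONS is replaced per mapper by a range on the --split-by column, and --incremental append adds a predicate on the check column:

```shell
# Roughly the query each parallel mapper executes:
#
#   select * from EMP
#   where DEPT >= <lo> and DEPT < <hi>       -- this mapper's split range
#     and SAL > 50000 and SAL <= 80000       -- from the --query text
#     and Unique_value > 201401200           -- from --last-value
#
# Two ways rows can silently go missing with this shape:
#   * depending on the Sqoop version, rows whose DEPT is NULL may fall
#     into no split range, and
#   * new rows whose SAL is outside (50000, 80000] are never selected,
#     even though their Unique_value is above --last-value.
```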
> >
> >
> >
> > As I have now pulled into HDFS just the new records that were in the
> > RDBMS tables,
> >
> > I am hitting a major data mismatch issue after the
> > processing (Hive_Table_Final).
> >
> > My Major issue is with sqoop incremental import, as many times I do