Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Sqoop >> mail # user >> SQOOP INCREMENTAL PULL ISSUE (PLEASE SUGGEST.)


Copy link to this message
-
Re: SQOOP INCREMENTAL PULL ISSUE (PLEASE SUGGEST.)
here was the code I have used ...

sqoop import -libjars
 --driver com.sybase.jdbc3.jdbc.SybDriver \
 --query "select * from
 from EMP where \$CONDITIONS and SAL > 201401200 and SAL <= 201401204 \

--check-column Unique_value \
 --incremental append \
 --last-value 201401200 \
 --split-by DEPT \
 --fields-terminated-by ',' \
 --target-dir ${TARGET_DIR}/${INC} \
 --username ${SYBASE_USERNAME} \
 --password ${SYBASE_PASSWORD} \
now I have imported newly inserted data into RDBMS to HDFS

but when I do

select count(*) , unique_value from EMP group by unique_value (both in
RDBMS and in HIVE)

I can find huge data loss.

1) in RDBMS

  Count(*)    Unique_value
  1000          201401201
   5000         201401202
  10000         201401203
2) in HIVE

  Count(*)    Unique_value
  189          201401201
   421         201401202
   50           201401203
If I do

select Unique value from emp ;

Result :
201401201
201401201
201401201
201401201
201401201
.
.
201401202
.
.
and so on...
On Tue, Jan 14, 2014 at 1:29 AM, yogesh kumar <[EMAIL PROTECTED]> wrote:

> Hello Abe,
>
> Thanks a zillion for your response,
>
> Yes the unique_vale is SAL  over here ..
>
> I have an option to test with different directory as its having small
> amount of data (only 1 month data)
>
> Yes the interesting fact is I did sqoop pull on the basis on month, into
> which I am have not done incremental import, and the data matches very
> well, but for daily pull I have to do incremental import every day..
>
> and then i find some data loss...
>
> one of the sqoop guy has suggested me to use  " \$CONDITIONS " with in
> this query as  ......  from EMP where \$CONDITIONS and SAL > 201401200
> and SAL <= 201401204 \
>
> like
>
> EMP where SAL > 201401200 and SAL <= 201401204 and   \$CONDITIONS
>
>
>
> Plz do suggest me pls help me out as I have to output to my client..
>
>
>
>
>
>
>
>
> On Mon, Jan 13, 2014 at 11:13 PM, Abraham Elmahrek <[EMAIL PROTECTED]>wrote:
>
>> Yogesh,
>>
>> Is unique_value in this case SAL? I'm a bit confused about your query.
>>
>> Do you have the option of running this query on a separate database
>> somewhere to find the issue? I think it would be interesting to see the
>> initial state and then the state after running an incremental import. That
>> would tell us how many results are being imported after sqoop has ran and
>> we can validate each step. Also, please use the --verbose flag to get the
>> most out of the logs.
>>
>> -Abe
>>
>>
>> On Mon, Jan 13, 2014 at 5:15 AM, Sharath Punreddy <[EMAIL PROTECTED]>wrote:
>>
>>> Yogesh,
>>>
>>> Please try to put $CONDITIONS after your where clause.
>>>
>>> Checkout the examples in the below blog.
>>>
>>>
>>> http://jugnu-life.blogspot.com/2012/03/sqoop-free-form-query-example.html?m=1
>>>  On Jan 13, 2014 7:04 AM, "yogesh kumar" <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hello Jarcec,
>>>>
>>>> I got the issue hope this is the cause..  I got data loss by doing
>>>> incremental pull
>>>>
>>>> I have crossed checked it and found that
>>>>
>>>> sqoop import -libjars
>>>>  --driver com.sybase.jdbc3.jdbc.SybDriver \
>>>>  --query "select * from
>>>>  from EMP where \$CONDITIONS and SAL > 201401200 and SAL <= 201401204 \
>>>> --check-column Unique_value \
>>>>  --incremental append \
>>>>  --last-value 201401200 \
>>>>  --split-by DEPT \
>>>>  --fields-terminated-by ',' \
>>>>  --target-dir ${TARGET_DIR}/${INC} \
>>>>  --username ${SYBASE_USERNAME} \
>>>>  --password ${SYBASE_PASSWORD} \
>>>>
>>>>
>>>> now I have imported newly inserted data into RDBMS to HDFS
>>>>
>>>> but when I do
>>>>
>>>> select count(*) , unique_value from EMP group by unique_value (both in
>>>> RDBMS and in HIVE)
>>>>
>>>> I can find huge data loss.
>>>>
>>>> 1) in RDBMS
>>>>
>>>>   Count(*)    Unique_value
>>>>   1000          201401201
>>>>    5000         201401202
>>>>   10000         201401203
>>>>
>>>>
>>>> 2) in HIVE
>>>>
>>>>   Count(*)    Unique_value
>>>>   189          201401201
>>>>    421         201401202