Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Plain View
Sqoop >> mail # user >> SQOOP INCREMENTAL PULL ISSUE (PLEASE SUGGEST.)


+
yogesh kumar 2014-01-11, 17:31
+
Jarek Jarcec Cecho 2014-01-12, 17:38
+
yogesh kumar 2014-01-13, 13:01
+
Sharath Punreddy 2014-01-13, 13:15
+
Abraham Elmahrek 2014-01-13, 17:43
+
yogesh kumar 2014-01-13, 19:59
Copy link to this message
-
Re: SQOOP INCREMENTAL PULL ISSUE (PLEASE SUGGEST.)
here was the code I have used ...

sqoop import -libjars
 --driver com.sybase.jdbc3.jdbc.SybDriver \
 --query "select * from
 from EMP where \$CONDITIONS and SAL > 201401200 and SAL <= 201401204 \

--check-column Unique_value \
 --incremental append \
 --last-value 201401200 \
 --split-by DEPT \
 --fields-terminated-by ',' \
 --target-dir ${TARGET_DIR}/${INC} \
 --username ${SYBASE_USERNAME} \
 --password ${SYBASE_PASSWORD} \
now I have imported newly inserted data into RDBMS to HDFS

but when I do

select count(*) , unique_value from EMP group by unique_value (both in
RDBMS and in HIVE)

I can find huge data loss.

1) in RDBMS

  Count(*)    Unique_value
  1000          201401201
   5000         201401202
  10000         201401203
2) in HIVE

  Count(*)    Unique_value
  189          201401201
   421         201401202
   50           201401203
If I do

select Unique value from emp ;

Result :
201401201
201401201
201401201
201401201
201401201
.
.
201401202
.
.
and so on...
On Tue, Jan 14, 2014 at 1:29 AM, yogesh kumar <[EMAIL PROTECTED]> wrote:

> Hello Abe,
>
> Thanks a zillion for your response,
>
> Yes the unique_vale is SAL  over here ..
>
> I have an option to test with different directory as its having small
> amount of data (only 1 month data)
>
> Yes the interesting fact is I did sqoop pull on the basis on month, into
> which I am have not done incremental import, and the data matches very
> well, but for daily pull I have to do incremental import every day..
>
> and then i find some data loss...
>
> one of the sqoop guy has suggested me to use  " \$CONDITIONS " with in
> this query as  ......  from EMP where \$CONDITIONS and SAL > 201401200
> and SAL <= 201401204 \
>
> like
>
> EMP where SAL > 201401200 and SAL <= 201401204 and   \$CONDITIONS
>
>
>
> Plz do suggest me pls help me out as I have to output to my client..
>
>
>
>
>
>
>
>
> On Mon, Jan 13, 2014 at 11:13 PM, Abraham Elmahrek <[EMAIL PROTECTED]>wrote:
>
>> Yogesh,
>>
>> Is unique_value in this case SAL? I'm a bit confused about your query.
>>
>> Do you have the option of running this query on a separate database
>> somewhere to find the issue? I think it would be interesting to see the
>> initial state and then the state after running an incremental import. That
>> would tell us how many results are being imported after sqoop has ran and
>> we can validate each step. Also, please use the --verbose flag to get the
>> most out of the logs.
>>
>> -Abe
>>
>>
>> On Mon, Jan 13, 2014 at 5:15 AM, Sharath Punreddy <[EMAIL PROTECTED]>wrote:
>>
>>> Yogesh,
>>>
>>> Please try to put $CONDITIONS after your where clause.
>>>
>>> Checkout the examples in the below blog.
>>>
>>>
>>> http://jugnu-life.blogspot.com/2012/03/sqoop-free-form-query-example.html?m=1
>>>  On Jan 13, 2014 7:04 AM, "yogesh kumar" <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hello Jarcec,
>>>>
>>>> I got the issue hope this is the cause..  I got data loss by doing
>>>> incremental pull
>>>>
>>>> I have crossed checked it and found that
>>>>
>>>> sqoop import -libjars
>>>>  --driver com.sybase.jdbc3.jdbc.SybDriver \
>>>>  --query "select * from
>>>>  from EMP where \$CONDITIONS and SAL > 201401200 and SAL <= 201401204 \
>>>> --check-column Unique_value \
>>>>  --incremental append \
>>>>  --last-value 201401200 \
>>>>  --split-by DEPT \
>>>>  --fields-terminated-by ',' \
>>>>  --target-dir ${TARGET_DIR}/${INC} \
>>>>  --username ${SYBASE_USERNAME} \
>>>>  --password ${SYBASE_PASSWORD} \
>>>>
>>>>
>>>> now I have imported newly inserted data into RDBMS to HDFS
>>>>
>>>> but when I do
>>>>
>>>> select count(*) , unique_value from EMP group by unique_value (both in
>>>> RDBMS and in HIVE)
>>>>
>>>> I can find huge data loss.
>>>>
>>>> 1) in RDBMS
>>>>
>>>>   Count(*)    Unique_value
>>>>   1000          201401201
>>>>    5000         201401202
>>>>   10000         201401203
>>>>
>>>>
>>>> 2) in HIVE
>>>>
>>>>   Count(*)    Unique_value
>>>>   189          201401201
>>>>    421         201401202
+
Abraham Elmahrek 2014-01-16, 19:13
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB