Re: Sqoop export .lzo to mysql duplicates
Hi Bhargav,
I believe that you might be hitting a known Sqoop bug, SQOOP-721 [1]. I was able to replicate the behaviour in my testing environment today and my intention is to continue debugging tomorrow.

As a workaround, you can decompress the files manually prior to the Sqoop export for now.
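
A rough sketch of that workaround, assuming the data just needs to be staged uncompressed somewhere in HDFS; the paths, JDBC URL, and table name below are placeholders for your own values:

  hadoop fs -get /path/to/export_dir/large_file.lzo .
  lzop -d large_file.lzo                        # produces large_file
  hadoop fs -put large_file /path/to/export_dir_uncompressed/
  sqoop export \
    --connect jdbc:mysql://dbhost/mydb \
    --username myuser -P \
    --table my_table \
    --export-dir /path/to/export_dir_uncompressed \
    --num-mappers 1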

Jarcec

Links:
1: https://issues.apache.org/jira/browse/SQOOP-721

On Nov 22, 2012, at 9:07 PM, Bhargav Nallapu <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> I'm running into a strange issue.
>
> Context:
>
> Hive writes its output to an external table with LZO compression in place, so my HDFS folder contains large_file.lzo.
>
> When I use Sqoop to export this file to the MySQL table, the number of rows is doubled.
>
> I then decompress it with:
> lzop -d large_file.lzo
>
> The doubling doesn't happen if I export the same file uncompressed: with "large_file" the rows are as expected.
>
> Both small_file and small_file.lzo are also loaded with the correct number of rows.
>
> Sqoop: v 1.30
> Num of mappers: 1
>
> Observation: any compressed file (gzipped or lzo) larger than about 60 MB (it might be 64 MB) ends up with double the row count when exported to the DB, probably exact duplicates.
>
> Can anyone please help?
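
For what it's worth, you can confirm the doubling by comparing the line count of the source file against the exported row count; again, the paths, database, and table names here are placeholders:

  hadoop fs -cat /path/to/export_dir/large_file.lzo | lzop -dc | wc -l
  mysql -h dbhost -u myuser -p -e "SELECT COUNT(*) FROM mydb.my_table;"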