Sqoop >> mail # user >> Getting bogus rows from sqoop import...?


Felix GV 2013-03-21, 03:27
Re: Getting bogus rows from sqoop import...?
Hi Felix,
We've seen similar behaviour in the past when the data itself contains characters that are special to Hive, such as new-line characters. Would you mind trying your import with --hive-drop-import-delims to see if it helps?

Jarcec
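The delimiter problem described above can be reproduced without Sqoop. A minimal sketch with hypothetical data (not the actual Profile table): Hive's default text storage uses the newline character as the row terminator, so a single column value containing an embedded newline is read back as two rows.

```shell
# One logical value, "John\nDoe", written the way a plain text import would
# serialize it. Hive's default row format splits on '\n', so this single
# value would come back as two rows.
printf 'John\nDoe\n' > /tmp/delim_demo.txt

wc -l < /tmp/delim_demo.txt    # two physical lines -> Hive sees two rows

# --hive-drop-import-delims strips \n, \r, and \01 from string fields on
# import; the effect is roughly this:
tr -d '\n' < /tmp/delim_demo.txt
```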

On Wed, Mar 20, 2013 at 11:27:58PM -0400, Felix GV wrote:
> Hello,
>
> I'm trying to import a full table from MySQL to Hadoop/Hive. It works with
> certain parameters, but when I try to do an ETL that's somewhat more
> complex, I start getting bogus rows in my resulting table.
>
> This works:
>
> sqoop import \
>         --connect
> 'jdbc:mysql://backup.general.db/general?tinyInt1isBit=false&zeroDateTimeBehavior=convertToNull'
> \
>         --username xxxxx \
>         --password xxxxx \
>         --hive-import \
>         --hive-overwrite \
>         -m 23 \
>         --direct \
>         --hive-table profile_felix_test17 \
>         --split-by id \
>         --table Profile
>
> But if I use a --query instead of a --table, then I start getting bogus
> records (and by that, I mean rows that have a nonsensically high primary
> key that doesn't exist in my source database and null for the rest of the
> cells).
>
> The output I get with the above command is not exactly the way I want it.
> Using --query, I can get the data in the format I want (by transforming
> some stuff inside MySQL), but then I also get the bogus rows, which pretty
> much makes the Hive table unusable.
>
> I tried various combinations of parameters and it's hard to pinpoint
> exactly what causes the problem, so it could be more intricate than my
> simplistic description above. That being said, removing --table and adding
> the following params definitely breaks it:
>
>         --target-dir /tests/sqoop/general/profile_felix_test \
>         --query "select * from Profile WHERE \$CONDITIONS"
>
> (Ultimately, I want to use a query that's more complex than this, but even
> a simple query like this breaks...)
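Pieced together from the fragments above, the failing invocation would look roughly like this. This is a sketch reassembled from the original mail, not a command quoted from it; the connection string, credentials, and table names are the poster's own placeholders:

```shell
# Sketch of the failing variant: the working command with --table removed
# and --target-dir/--query added, all other flags assumed unchanged.
sqoop import \
        --connect 'jdbc:mysql://backup.general.db/general?tinyInt1isBit=false&zeroDateTimeBehavior=convertToNull' \
        --username xxxxx \
        --password xxxxx \
        --hive-import \
        --hive-overwrite \
        -m 23 \
        --direct \
        --hive-table profile_felix_test17 \
        --split-by id \
        --target-dir /tests/sqoop/general/profile_felix_test \
        --query "select * from Profile WHERE \$CONDITIONS"
```

Note that Sqoop requires --split-by (and the literal \$CONDITIONS token) whenever a free-form --query is run with more than one mapper, so keeping --split-by id from the working command is assumed here.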
>
> Any ideas why this would happen and how to solve it?
>
> Is this the kind of problem that Sqoop2's cleaner architecture intends to
> solve?
>
> I use CDH 4.2, BTW.
>
> Thanks :) !
>
> --
> Felix
Further replies in this thread:
Felix GV 2013-03-21, 04:47
Felix GV 2013-03-21, 20:46
Felix GV 2013-03-22, 00:32
Venkat Ranganathan 2013-03-22, 00:54