

Felix GV 2013-03-21, 03:27
Jarek Jarcec Cecho 2013-03-21, 04:42
Felix GV 2013-03-21, 04:47
Re: Getting bogus rows from sqoop import...?
I seem to be getting a proper output with the above parameters BTW.

I'll try to re-integrate the rest of my more complex ETL query in that
sqoop job...

Thanks :) !

--
Felix
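For reference, the adjusted job the thread converges on would look roughly like this: a sketch assembled from the parameters quoted below, with --direct removed and --hive-drop-import-delims added per Jarcec's suggestion. It is untested, and the connect string, credentials, and table names are the thread's own placeholders:

```shell
sqoop import \
  --connect 'jdbc:mysql://backup.general.db/general?tinyInt1isBit=false&zeroDateTimeBehavior=convertToNull' \
  --username xxxxx \
  --password xxxxx \
  --hive-import \
  --hive-overwrite \
  --hive-drop-import-delims \
  -m 23 \
  --split-by id \
  --hive-table profile_felix_test17 \
  --target-dir /tests/sqoop/general/profile_felix_test \
  --query "select * from Profile WHERE \$CONDITIONS"
```

Note that with --query, Sqoop requires the literal \$CONDITIONS token so it can substitute a per-mapper split predicate on the --split-by column.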
On Thu, Mar 21, 2013 at 12:47 AM, Felix GV <[EMAIL PROTECTED]> wrote:

> Thanks for your response Jarek :)
>
> I've started a new import run with --hive-drop-import-delims added and
> --direct removed (since the two are mutually exclusive), we'll see how it
> goes.
>
> Going to sleep now. I'll report back tomorrow :)
>
> --
> Felix
>
>
> On Thu, Mar 21, 2013 at 12:42 AM, Jarek Jarcec Cecho <[EMAIL PROTECTED]> wrote:
>
>> Hi Felix,
>> we've seen similar behaviour in the past when the data itself contains
>> Hive special characters like new line characters. Would you mind trying
>> your import with --hive-drop-import-delims to see if it helps?
>>
>> Jarcec
>>
>> On Wed, Mar 20, 2013 at 11:27:58PM -0400, Felix GV wrote:
>> > Hello,
>> >
>> > I'm trying to import a full table from MySQL to Hadoop/Hive. It works with
>> > certain parameters, but when I try to do an ETL that's somewhat more
>> > complex, I start getting bogus rows in my resulting table.
>> >
>> > This works:
>> >
>> > sqoop import \
>> >         --connect 'jdbc:mysql://backup.general.db/general?tinyInt1isBit=false&zeroDateTimeBehavior=convertToNull' \
>> >         --username xxxxx \
>> >         --password xxxxx \
>> >         --hive-import \
>> >         --hive-overwrite \
>> >         -m 23 \
>> >         --direct \
>> >         --hive-table profile_felix_test17 \
>> >         --split-by id \
>> >         --table Profile
>> >
>> > But if I use a --query instead of a --table, then I start getting bogus
>> > records (and by that, I mean rows that have a nonsensically high primary
>> > key that doesn't exist in my source database and null for the rest of the
>> > cells).
>> >
>> > The output I get with the above query is not exactly the way I want it.
>> > Using --query, I can get the data in the format I want (by transforming
>> > some stuff inside MySQL), but then I also get the bogus rows, which pretty
>> > much makes the Hive table unusable.
>> >
>> > I tried various combinations of parameters and it's hard to pinpoint
>> > exactly what causes the problem, so it could be more intricate than my
>> > simplistic description above. That being said, removing --table and adding
>> > the following params definitely breaks it:
>> >
>> >         --target-dir /tests/sqoop/general/profile_felix_test \
>> >         --query "select * from Profile WHERE \$CONDITIONS"
>> >
>> > (Ultimately, I want to use a query that's more complex than this, but
>> > even a simple query like this breaks...)
>> >
>> > Any ideas why this would happen and how to solve it?
>> >
>> > Is this the kind of problem that Sqoop2's cleaner architecture intends
>> > to solve?
>> >
>> > I use CDH 4.2, BTW.
>> >
>> > Thanks :) !
>> >
>> > --
>> > Felix
>>
>
>
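Jarcec's diagnosis can be illustrated directly. A Hive text-format table uses '\n' as the row terminator and Ctrl-A (\001) as the field separator, so a free-text column that itself contains a newline splits one imported record into two rows: the second row begins with the tail of the text field (a nonsense "primary key") and is short on columns, which Hive pads with NULLs. A minimal sketch with made-up data, not taken from the thread:

```shell
# One record whose last column contains an embedded newline,
# serialized the way a text-format import would write it.
tmp="$(mktemp)"
printf '42\001felix\001line one\nline two\n' > "$tmp"

# Hive's text SerDe treats every '\n' as a row boundary, so the single
# record reads back as two rows; the second has the bogus "key"
# "line two" and only one column (Hive pads the rest with NULLs).
awk -F'\001' '{print NR ": " NF " column(s): " $0}' "$tmp"

# What --hive-drop-import-delims does, conceptually: strip \n, \r and
# \001 from the field values before writing, so the record stays whole.
# (Here we strip from the whole single-record file for simplicity.)
tr -d '\n\r' < "$tmp" > "$tmp.clean"; printf '\n' >> "$tmp.clean"
awk -F'\001' '{print NR ": " NF " column(s)"}' "$tmp.clean"

rm -f "$tmp" "$tmp.clean"
```

The same mechanism explains why the problem appears to come and go with different parameter combinations: only the rows whose text columns happen to contain delimiter characters are affected.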
Felix GV 2013-03-22, 00:32
Venkat Ranganathan 2013-03-22, 00:54