Re: Getting bogus rows from sqoop import...?
The --direct option is incompatible with many of Sqoop's parsing
options most of the time (unless a connector specifically adapts
them).

Some of the connectors disallow combining --direct with --query (for
example Postgres, Netezza, etc.); maybe the other connectors should
check for this as well. But I am not sure where --query and
--hive-drop-import-delims intersect: whether the source is a table or
a query should not affect the output processing. Do you mean --direct
and --hive-drop-import-delims?

Thanks

Venkat
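
[For reference, a minimal sketch of the non-direct --query form that
this thread converges on. The --target-dir value and the query itself
are placeholders, not from the thread; note that a free-form --query
must include the literal $CONDITIONS token (single-quoted so the shell
does not expand it) and that Sqoop requires --target-dir with --query:

    sqoop import \
            --connect 'jdbc:mysql://backup.general.db/general?tinyInt1isBit=false&zeroDateTimeBehavior=convertToNull' \
            --username xxxxx \
            --password xxxxx \
            --hive-import \
            --hive-overwrite \
            --hive-drop-import-delims \
            --hive-table profile_felix_test17 \
            -m 23 \
            --split-by id \
            --target-dir /tmp/sqoop/profile_felix_test17 \
            --query 'SELECT * FROM Profile WHERE $CONDITIONS'

Dropping --direct here follows Felix's findings quoted below.]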

On Thu, Mar 21, 2013 at 5:32 PM, Felix GV <[EMAIL PROTECTED]> wrote:
> So... the verdict appears to be the following:
>
> - Using --direct with --table produces correct results.
> - Using --query instead of --table absolutely requires using
>   --hive-drop-import-delims.
> - --hive-drop-import-delims is not compatible with --direct.
> - Sqoop allows you to use --query and --direct together, but produces
>   bogus results.
>
> Considering the nature of --hive-drop-import-delims, I would say it's quite
> likely that the above observations only hold true if the data in the
> imported table actually contains the conflicting delimiting characters.
>
> In any case, don't take my word for it: test on your own dataset if you're
> having any problem. Hopefully the above observations can provide a good
> starting point though.
>
> Maybe Sqoop should spit out an error or a warning if --query is used without
> --hive-drop-import-delims ...?
>
> --
> Felix
>
>
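
[In the spirit of Felix's "test on your own dataset" advice above, one
quick sanity check (connection details and table names are the same
placeholders used elsewhere in the thread) is to compare row counts on
both sides; parsing-induced bogus rows show up as Hive reporting more
rows than MySQL:

    # Row count at the source (placeholder connection details):
    mysql -h backup.general.db -u xxxxx -p -e 'SELECT COUNT(*) FROM general.Profile'
    # Row count after the import:
    hive -e 'SELECT COUNT(*) FROM profile_felix_test17'
]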
> On Thu, Mar 21, 2013 at 4:46 PM, Felix GV <[EMAIL PROTECTED]> wrote:
>>
>> I seem to be getting a proper output with the above parameters BTW.
>>
>> I'll try to re-integrate the rest of my more complex ETL query in that
>> sqoop job...
>>
>> Thanks :) !
>>
>> --
>> Felix
>>
>>
>> On Thu, Mar 21, 2013 at 12:47 AM, Felix GV <[EMAIL PROTECTED]> wrote:
>>>
>>> Thanks for your response Jarek :)
>>>
>>> I've started a new import run with --hive-drop-import-delims added and
>>> --direct removed (since the two are mutually exclusive); we'll see how
>>> it goes.
>>>
>>> Going to sleep now. I'll report back tomorrow :)
>>>
>>> --
>>> Felix
>>>
>>>
>>> On Thu, Mar 21, 2013 at 12:42 AM, Jarek Jarcec Cecho <[EMAIL PROTECTED]>
>>> wrote:
>>>>
>>>> Hi Felix,
>>>> we've seen similar behaviour in the past when the data itself contains
>>>> Hive special characters like new line characters. Would you mind trying your
>>>> import with --hive-drop-import-delims to see if it helps?
>>>>
>>>> Jarcec
>>>>
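
[A sketch of the failure mode Jarcec describes, assuming the \01 field
and \n record delimiters that Sqoop's text-file Hive imports use by
default: a column value containing a raw newline splits one logical
record across two physical lines, and the trailing fragment is parsed
as an extra, mostly-NULL row.

    # Two logical records, but the second value embeds a newline,
    # so the output file has three physical lines:
    printf '1\001alice\n2\001bob says:\nhello\n' > part-m-00000
    wc -l part-m-00000    # prints 3; "hello" will surface as a bogus row
]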
>>>> On Wed, Mar 20, 2013 at 11:27:58PM -0400, Felix GV wrote:
>>>> > Hello,
>>>> >
>>>> > I'm trying to import a full table from MySQL to Hadoop/Hive. It works
>>>> > with
>>>> > certain parameters, but when I try to do an ETL that's somewhat more
>>>> > complex, I start getting bogus rows in my resulting table.
>>>> >
>>>> > This works:
>>>> >
>>>> > sqoop import \
>>>> >         --connect 'jdbc:mysql://backup.general.db/general?tinyInt1isBit=false&zeroDateTimeBehavior=convertToNull' \
>>>> >         --username xxxxx \
>>>> >         --password xxxxx \
>>>> >         --hive-import \
>>>> >         --hive-overwrite \
>>>> >         -m 23 \
>>>> >         --direct \
>>>> >         --hive-table profile_felix_test17 \
>>>> >         --split-by id \
>>>> >         --table Profile
>>>> >
>>>> > But if I use a --query instead of a --table, then I start getting
>>>> > bogus records (and by that, I mean rows that have a nonsensically
>>>> > high primary key that doesn't exist in my source database and null
>>>> > for the rest of the cells).
>>>> >
>>>> > The output I get with the above query is not exactly the way I want
>>>> > it. Using --query, I can get the data in the format I want (by
>>>> > transforming some stuff inside MySQL), but then I also get the bogus
>>>> > rows, which pretty much makes the Hive table unusable.
>>>> >
>>>> > I tried various combinations of parameters and it's hard to
>>>> > pinpoint exactly what causes the problem, so it could be more
>>>> > intricate than my