Sqoop, mail # user - Strange distribution of keys among mappers


Re: Strange distribution of keys among mappers
Abraham Elmahrek 2013-06-20, 00:30
David,

The number of map tasks is really just a hint. The splitters will try to
produce the requested number of splits, but an extra one may be created. For
instance, BigDecimalSplitter will create 4 splits for certain ranges even
when 3 MR tasks are specified.

-Abe
On Wed, Jun 19, 2013 at 5:03 PM, David Kincaid <[EMAIL PROTECTED]> wrote:

> We don't have that set on our cluster and aren't specifying it in our job.
> When I look at the different Sqoop jobs, I see 3 for some and 4 for
> others.
>
>
> On Wed, Jun 19, 2013 at 6:50 PM, Abraham Elmahrek <[EMAIL PROTECTED]> wrote:
>
>> David,
>>
>> Well I think sqoop is looking at "mapred.map.tasks". Do you have that set
>> in mapred-site.xml? I thought that defaults to 2.
>>
>> -Abe
>>
>>
>> On Wed, Jun 19, 2013 at 4:31 PM, Abraham Elmahrek <[EMAIL PROTECTED]> wrote:
>>
>>> David,
>>>
>>> I've created https://issues.apache.org/jira/browse/SQOOP-1093 to track
>>> the documentation issue. Thanks for bringing this to the community's
>>> attention!
>>>
>>> -Abe
>>>
>>>
>>> On Wed, Jun 19, 2013 at 4:21 PM, Abraham Elmahrek <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hey David,
>>>>
>>>> With Oracle, the BigDecimalSplitter will be used by default for all
>>>> number types.
>>>>
>>>> -Abe
>>>>
>>>>
>>>> On Wed, Jun 19, 2013 at 4:05 PM, David Kincaid <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Abe, the database is Oracle.
>>>>>
>>>>>
>>>>> On Wed, Jun 19, 2013 at 5:48 PM, Abraham Elmahrek <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> David,
>>>>>>
>>>>>> What database are you importing from? The description I gave was for
>>>>>> datatypes that map to the BigDecimalSplitter. The user guide might be
>>>>>> referring to the IntegerSplitter, which will add the remainder to the
>>>>>> last value.
>>>>>>
>>>>>> -Abe
>>>>>>
>>>>>>
>>>>>> On Wed, Jun 19, 2013 at 1:23 PM, David Kincaid <
>>>>>> [EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>>> Thanks. We didn't specify the number of mappers, so it's giving us
>>>>>>> 4. I understand your explanation, but it seems to conflict with the Sqoop
>>>>>>> user guide (
>>>>>>> http://sqoop.apache.org/docs/1.4.3/SqoopUserGuide.html#_controlling_parallelism
>>>>>>> ):
>>>>>>>
>>>>>>> "When performing parallel imports, Sqoop needs a criterion by which
>>>>>>> it can split the workload. Sqoop uses a *splitting column* to split
>>>>>>> the workload. By default, Sqoop will identify the primary key column (if
>>>>>>> present) in a table and use it as the splitting column. The low and high
>>>>>>> values for the splitting column are retrieved from the database, and the
>>>>>>> map tasks operate on evenly-sized components of the total range. For
>>>>>>> example, if you had a table with a primary key column of id whose
>>>>>>> minimum value was 0 and maximum value was 1000, and Sqoop was directed to
>>>>>>> use 4 tasks, Sqoop would run four processes which each execute SQL
>>>>>>> statements of the form SELECT * FROM sometable WHERE id >= lo AND
>>>>>>> id < hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and
>>>>>>> (750, 1001) in the different tasks."
>>>>>>>
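For reference, the user guide example quoted above can be reproduced with a minimal Python sketch. This is only an illustration of the arithmetic in that paragraph, not Sqoop's actual IntegerSplitter code; the function name even_ranges is hypothetical, and the sketch assumes the final range takes any remainder and uses an exclusive upper bound of max + 1.

```python
def even_ranges(lo, hi, num_tasks):
    """Split [lo, hi] into num_tasks half-open (lo, hi) ranges.

    Hypothetical helper: mirrors the user guide example only.
    """
    # Integer range length per task; never smaller than 1.
    size = max(1, (hi - lo) // num_tasks)
    # Lower bound of each task's range.
    bounds = [lo + i * size for i in range(num_tasks)]
    # Exclusive upper bound of the last range, so that hi itself is covered
    # and any remainder lands in the final range.
    bounds.append(hi + 1)
    return list(zip(bounds[:-1], bounds[1:]))


for range_lo, range_hi in even_ranges(0, 1000, 4):
    print(f"SELECT * FROM sometable WHERE id >= {range_lo} AND id < {range_hi}")
```

With min 0, max 1000, and 4 tasks this yields (0, 250), (250, 500), (500, 750), and (750, 1001), matching the ranges in the quoted text.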
>>>>>>>
>>>>>>> On Wed, Jun 19, 2013 at 3:14 PM, Abraham Elmahrek <[EMAIL PROTECTED]> wrote:
>>>>>>>
>>>>>>>> Hey David,
>>>>>>>>
>>>>>>>> Here's the algorithm:
>>>>>>>> Split lengths are defined by (max - min)/(# mappers) and whatever
>>>>>>>> is left is tacked on at the end. So in this case, (288272192-2110)/3
>>>>>>>> = 96090027.33... So I'm assuming the .33... is rounded down and split
>>>>>>>> lengths will be of length 96090027. Sqoop will then create splits
>>>>>>>> with the following points: (min) + (range length)*(n). We can see
>>>>>>>> that 2110 + 96090027*0 = 2110, 2110 + 96090027*1 = 96092137, 2110
>>>>>>>> + 96090027*2 = 192182164, and 2110 + 96090027*3 = 288272191 will
>>>>>>>> be generated based off of this algorithm. The last point to be added will
>>>>>>>> be 288272192 because the max value is not part of the generated
>>>>>>>> split points. Then Sqoop will distribute the work accordingly, based on these
>>>>>>>> points as you've pointed out above.
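
The algorithm described in the oldest message above can be sketched in a few lines of Python. This is a rough illustration of the steps as written in the thread, not Sqoop's actual BigDecimalSplitter implementation. The function name split_points is hypothetical, the fractional part of the split length is simply dropped, and the maximum is assumed to be 288272192 (the last point mentioned in that message).

```python
def split_points(min_val, max_val, num_mappers):
    """Return split points as described in the thread.

    Hypothetical helper: the mapper count is treated as a hint only.
    """
    # Split length is (max - min) / (# mappers), fractional part dropped.
    split_size = max(1, (max_val - min_val) // num_mappers)

    # Generate min + split_size * n for as long as we stay within the range.
    points = []
    cur = min_val
    while cur <= max_val:
        points.append(cur)
        cur += split_size

    # If the generated points did not end exactly on max, tack max on.
    if points[-1] != max_val or len(points) == 1:
        points.append(max_val)
    return points


points = split_points(2110, 288272192, 3)
print(points)           # [2110, 96092137, 192182164, 288272191, 288272192]
print(len(points) - 1)  # 4 splits, although 3 mappers were requested
```

Under these assumptions the range does not divide evenly by 3, so the extra point at the maximum produces a fourth, very small split. When the range divides evenly (for example 0 to 1000 with 4 mappers), the last generated point already equals the maximum and no extra split appears.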