Sqoop, mail # user - Strange distribution of keys among mappers


David Kincaid 2013-06-19, 15:33
Abraham Elmahrek 2013-06-19, 20:14
David Kincaid 2013-06-19, 20:23
Re: Strange distribution of keys among mappers
Abraham Elmahrek 2013-06-19, 22:48
David,

What database are you importing from? The description I gave was for
datatypes that map to the BigDecimalSplitter. The user guide might be
referring to the IntegerSplitter, which adds the remainder to the last
split.

-Abe
On Wed, Jun 19, 2013 at 1:23 PM, David Kincaid <[EMAIL PROTECTED]> wrote:

> Thanks. We didn't specify the number of mappers, so it's giving us 4. I
> understand your explanation, but it seems to conflict with the Sqoop user
> guide (
> http://sqoop.apache.org/docs/1.4.3/SqoopUserGuide.html#_controlling_parallelism
> ):
>
> "When performing parallel imports, Sqoop needs a criterion by which it
> can split the workload. Sqoop uses a *splitting column* to split the
> workload. By default, Sqoop will identify the primary key column (if
> present) in a table and use it as the splitting column. The low and high
> values for the splitting column are retrieved from the database, and the
> map tasks operate on evenly-sized components of the total range. For
> example, if you had a table with a primary key column of id whose minimum
> value was 0 and maximum value was 1000, and Sqoop was directed to use 4
> tasks, Sqoop would run four processes which each execute SQL statements of
> the form SELECT * FROM sometable WHERE id >= lo AND id < hi, with (lo, hi) set
> to (0, 250), (250, 500), (500, 750), and (750, 1001) in the different
> tasks."
>
>
> On Wed, Jun 19, 2013 at 3:14 PM, Abraham Elmahrek <[EMAIL PROTECTED]> wrote:
>
>> Hey David,
>>
>> Here's the algorithm:
>> Split lengths are defined by (max - min)/(# mappers), and whatever is left
>> is tacked on at the end. In this case, (288272191 - 2110)/3 = 96090027
>> with no remainder, so split lengths will be 96090027. Sqoop will then
>> create splits at the following points: (min) + (split length)*(n). That
>> gives:
>>
>>   2110 + 96090027*0 = 2110
>>   2110 + 96090027*1 = 96092137
>>   2110 + 96090027*2 = 192182164
>>   2110 + 96090027*3 = 288272191
>>
>> The last point to be added will be 288272192, because each range excludes
>> its upper bound and the max value would otherwise not be covered. Sqoop
>> will then distribute the keys based on these points, as you've pointed
>> out above.
>>
>> Just to be sure, did you configure sqoop to use 3 mappers?
>>
>> Hope this helps,
>> -Abe
>>
>>
>> On Wed, Jun 19, 2013 at 8:33 AM, David Kincaid <[EMAIL PROTECTED]> wrote:
>>
>>> We're seeing a strange thing happen with a sqoop import job with the way
>>> the key range is getting distributed among the 4 mappers that are running.
>>> The minimum key value is 2110 and the maximum value is 288272191. We are
>>> getting one mapper that is only getting one record to import. Here is the
>>> distribution among the mappers:
>>>
>>> [2110, 96092137)
>>> [96092137, 192182164)
>>> [192182164, 288272191)
>>> [288272191, 288272192)
>>>
>>> You can see that the fourth mapper is given a range with only one value
>>> in it. Could someone help me understand what is going on?
>>>
>>> Thanks,
>>>
>>> Dave
>>>
>>
>>
>
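
The arithmetic discussed in this thread can be reproduced with a short sketch. This is not Sqoop's source code: the function names `thread_split_ranges` and `guide_split_ranges` are hypothetical, and the logic only mirrors the two descriptions given above (Abe's explanation of the BigDecimal case, and the user guide's 0 to 1000 example). Ranges are half-open, [lo, hi), as in David's listing.

```python
def thread_split_ranges(lo, hi, divisor):
    """Ranges as described in Abe's explanation (illustrative only).

    Points are generated at lo + step * n for n = 0..divisor, then
    hi + 1 is appended as the exclusive upper bound of the last range.
    """
    step = (hi - lo) // divisor  # integer split length, remainder dropped
    points = [lo + step * n for n in range(divisor + 1)]
    points.append(hi + 1)  # so that hi itself falls inside the last range
    return list(zip(points, points[1:]))


def guide_split_ranges(lo, hi, num_tasks):
    """Ranges as in the user guide example (illustrative only).

    One start point per task; the last range absorbs the remainder
    and ends at hi + 1.
    """
    step = (hi - lo) // num_tasks
    points = [lo + step * n for n in range(num_tasks)]
    points.append(hi + 1)
    return list(zip(points, points[1:]))


if __name__ == "__main__":
    # Values from David's job; the divisor 3 is the one used in Abe's arithmetic.
    for r in thread_split_ranges(2110, 288272191, 3):
        print(r)
    # Values from the user guide example with 4 tasks.
    for r in guide_split_ranges(0, 1000, 4):
        print(r)
```

With David's values the division is exact, so the fourth generated point equals the maximum key and the final range holds only that one value. The user guide example avoids this because its last range starts at 750 and absorbs everything up to the maximum.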
David Kincaid 2013-06-19, 23:05
Abraham Elmahrek 2013-06-19, 23:21
Abraham Elmahrek 2013-06-19, 23:31
Abraham Elmahrek 2013-06-19, 23:50
David Kincaid 2013-06-20, 00:03
Abraham Elmahrek 2013-06-20, 00:30
David Kincaid 2013-06-20, 00:33
David Kincaid 2013-06-19, 23:25