Sqoop >> mail # user >> Strange distribution of keys among mappers


Re: Strange distribution of keys among mappers
David,

It's really just a hint. The splitters will try to produce the number of
splits that is defined, but an extra one may be created. For instance,
BigDecimalSplitter will create 4 splits for certain ranges even with 3 MR
tasks specified.

-Abe
On Wed, Jun 19, 2013 at 5:03 PM, David Kincaid <[EMAIL PROTECTED]> wrote:

> We don't have that set on our cluster and aren't specifying it in our job.
> When I look at the different sqoop jobs I see both 3 for some and 4 for
> others on the jobs.
>
>
> On Wed, Jun 19, 2013 at 6:50 PM, Abraham Elmahrek <[EMAIL PROTECTED]> wrote:
>
>> David,
>>
>> Well, I think Sqoop is looking at "mapred.map.tasks". Do you have that set
>> in mapred-site.xml? I thought that defaults to 2.
>>
>> -Abe
>>
>>
>> On Wed, Jun 19, 2013 at 4:31 PM, Abraham Elmahrek <[EMAIL PROTECTED]> wrote:
>>
>>> David,
>>>
>>> I've created https://issues.apache.org/jira/browse/SQOOP-1093 to track
>>> the documentation issue. Thanks for bringing this to the community's
>>> attention!
>>>
>>> -Abe
>>>
>>>
>>> On Wed, Jun 19, 2013 at 4:21 PM, Abraham Elmahrek <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hey David,
>>>>
>>>> With Oracle, the BigDecimalSplitter will be used by default for all
>>>> number types.
>>>>
>>>> -Abe
>>>>
>>>>
>>>> On Wed, Jun 19, 2013 at 4:05 PM, David Kincaid <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Abe, the database is Oracle.
>>>>>
>>>>>
>>>>> On Wed, Jun 19, 2013 at 5:48 PM, Abraham Elmahrek <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> David,
>>>>>>
>>>>>> What database are you importing from? The description I gave was for
>>>>>> datatypes that map to the BigDecimalSplitter. The user guide might be
>>>>>> referring to the IntegerSplitter, which will add the remainder to the
>>>>>> last value.
>>>>>>
>>>>>> -Abe
>>>>>>
>>>>>>
>>>>>> On Wed, Jun 19, 2013 at 1:23 PM, David Kincaid <
>>>>>> [EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>>> Thanks. We didn't specify the number of mappers, so it's giving us
>>>>>>> 4. I understand your explanation, but it seems to conflict with the Sqoop
>>>>>>> user guide (
>>>>>>> http://sqoop.apache.org/docs/1.4.3/SqoopUserGuide.html#_controlling_parallelism
>>>>>>> ):
>>>>>>>
>>>>>>> "When performing parallel imports, Sqoop needs a criterion by which
>>>>>>> it can split the workload. Sqoop uses a *splitting column* to split
>>>>>>> the workload. By default, Sqoop will identify the primary key column (if
>>>>>>> present) in a table and use it as the splitting column. The low and high
>>>>>>> values for the splitting column are retrieved from the database, and the
>>>>>>> map tasks operate on evenly-sized components of the total range. For
>>>>>>> example, if you had a table with a primary key column of id whose
>>>>>>> minimum value was 0 and maximum value was 1000, and Sqoop was directed to
>>>>>>> use 4 tasks, Sqoop would run four processes which each execute SQL
>>>>>>> statements of the form SELECT * FROM sometable WHERE id >= lo AND
>>>>>>> id < hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and
>>>>>>> (750, 1001) in the different tasks."
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Jun 19, 2013 at 3:14 PM, Abraham Elmahrek <[EMAIL PROTECTED]> wrote:
>>>>>>>
>>>>>>>> Hey David,
>>>>>>>>
>>>>>>>> Here's the algorithm:
>>>>>>>> Split lengths are defined by (max - min)/(# mappers), and whatever
>>>>>>>> is left is tacked on at the end. So in this case, (288272192 - 2110)/3
>>>>>>>> = 96090027.33... I'm assuming the .33... is rounded down, so split
>>>>>>>> lengths will be 96090027. Sqoop will then create split points at
>>>>>>>> (min) + (split length)*(n). We can see that 2110 + 96090027*0 = 2110,
>>>>>>>> 2110 + 96090027*1 = 96092137, 2110 + 96090027*2 = 192182164, and
>>>>>>>> 2110 + 96090027*3 = 288272191 will be generated by this algorithm.
>>>>>>>> The last point to be added will be 288272192, because the max value
>>>>>>>> is not among the generated split points. Sqoop will then distribute
>>>>>>>> the keys based on these points, as you've pointed out above.
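
For readers following along, the split-point calculation described in the oldest message can be sketched in a few lines of Python. This is only an illustration of the description in this thread, not Sqoop's actual BigDecimalSplitter code: the function names are made up, rounding down is the assumption stated in the message, and the column maximum is assumed to be 288272192 (the final split point the message names).

```python
def split_points(min_val, max_val, num_mappers):
    """Return split points as described in the thread (hypothetical helper)."""
    # Split length is (max - min) / (# mappers), rounded down (assumption).
    step = max(1, (max_val - min_val) // num_mappers)
    points = []
    cur = min_val
    # Generate min + step * n for as long as the point does not pass max.
    while cur <= max_val:
        points.append(cur)
        cur += step
    # If max itself was not generated, tack it on at the end.
    if points[-1] != max_val:
        points.append(max_val)
    return points


def split_ranges(points):
    """Pair consecutive split points into (lo, hi) ranges, one per split."""
    return list(zip(points[:-1], points[1:]))


points = split_points(2110, 288272192, 3)
print(points)
# [2110, 96092137, 192182164, 288272191, 288272192]
print(len(split_ranges(points)))
# 4
```

Under these assumptions, asking for 3 mappers yields 5 split points and therefore 4 splits, the last of which covers only the single leftover value. When the range divides evenly, for example split_points(0, 1000, 4), no extra point is added and the number of splits matches the number of mappers.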