Sqoop >> mail # user >> Strange distribution of keys among mappers


David Kincaid 2013-06-19, 23:05
Re: Strange distribution of keys among mappers
Hey David,

With Oracle, the BigDecimalSplitter will be used by default for all number
types.

-Abe
On Wed, Jun 19, 2013 at 4:05 PM, David Kincaid <[EMAIL PROTECTED]> wrote:

> Abe, the database is Oracle.
>
>
> On Wed, Jun 19, 2013 at 5:48 PM, Abraham Elmahrek <[EMAIL PROTECTED]> wrote:
>
>> David,
>>
>> What database are you importing from? The description I gave was for
>> datatypes that map to the BigDecimalSplitter. The user guide might be
>> referring to the IntegerSplitter, which will add the remainder to the last
>> value.
>>
>> -Abe
>>
>>
>> On Wed, Jun 19, 2013 at 1:23 PM, David Kincaid <[EMAIL PROTECTED]> wrote:
>>
>>> Thanks. We didn't specify the number of mappers, so it's giving us 4. I
>>> understand your explanation, but it seems to conflict with the Sqoop user
>>> guide (
>>> http://sqoop.apache.org/docs/1.4.3/SqoopUserGuide.html#_controlling_parallelism
>>> ):
>>>
>>> "When performing parallel imports, Sqoop needs a criterion by which it
>>> can split the workload. Sqoop uses a *splitting column* to split the
>>> workload. By default, Sqoop will identify the primary key column (if
>>> present) in a table and use it as the splitting column. The low and high
>>> values for the splitting column are retrieved from the database, and the
>>> map tasks operate on evenly-sized components of the total range. For
>>> example, if you had a table with a primary key column of id whose
>>> minimum value was 0 and maximum value was 1000, and Sqoop was directed to
>>> use 4 tasks, Sqoop would run four processes which each execute SQL
>>> statements of the form SELECT * FROM sometable WHERE id >= lo AND id <
>>> hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and (750,
>>> 1001) in the different tasks."
>>>
>>>
>>> On Wed, Jun 19, 2013 at 3:14 PM, Abraham Elmahrek <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hey David,
>>>>
>>>> Here's the algorithm:
>>>> Split lengths are defined by (max - min)/(# mappers) and whatever is
>>>> left is tacked on at the end. So in this case, (288272191-2110)/3 =
>>>> 96090027 exactly, so split lengths will be of length 96090027 (I'm
>>>> assuming any fractional part would be rounded down). Sqoop will then
>>>> create splits with the following points: (min) + (split length)*(n).
>>>> We can see that 2110 + 96090027*0 = 2110, 2110 + 96090027*1 = 96092137,
>>>> 2110 + 96090027*2 = 192182164, and 2110 + 96090027*3 = 288272191 will
>>>> be generated based off of this algorithm. The last point to be added
>>>> will be 288272192 because the max value is not part of the generated
>>>> split points. Then Sqoop will distribute the work based on these
>>>> points, as you've pointed out above.
>>>>
>>>> Just to be sure, did you configure Sqoop to use 3 mappers?
>>>>
>>>> Hope this helps,
>>>> -Abe
>>>>
>>>>
>>>> On Wed, Jun 19, 2013 at 8:33 AM, David Kincaid <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> We're seeing something strange in a Sqoop import job: the key range is
>>>>> being distributed unevenly among the 4 mappers that are running. The
>>>>> minimum key value is 2110 and the maximum value is 288272191, and one
>>>>> mapper is getting only one record to import. Here is the distribution
>>>>> among the mappers:
>>>>>
>>>>> [2110, 96092137)
>>>>> [96092137, 192182164)
>>>>> [192182164, 288272191)
>>>>> [288272191, 288272192)
>>>>>
>>>>> You can see that the fourth mapper is given a range with only one
>>>>> value in it. Could someone help me understand what is going on?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Dave
>>>>>
>>>>
>>>>
>>>
>>
>
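
For anyone reading along, the split-point arithmetic described earlier in the thread can be reproduced with a few lines of Python. This is only a sketch of the description given in the thread, not Sqoop's actual BigDecimalSplitter code. The function names split_points and to_ranges are made up for illustration, the floor division is the "rounded down" assumption from the thread, and the divisor of 3 is the one used in that explanation (the job itself ran with 4 mappers).

```python
def split_points(min_val, max_val, num_splits):
    """Split points as described in the thread (hypothetical helper,
    not Sqoop's API): min + step * n, then max + 1 appended at the end."""
    # Assumption from the thread: any fractional part is rounded down.
    step = (max_val - min_val) // num_splits
    points = [min_val + step * n for n in range(num_splits + 1)]
    # The final point is max + 1 so the max key falls inside a
    # half-open range [lo, hi).
    points.append(max_val + 1)
    return points


def to_ranges(points):
    """Pair consecutive split points into half-open (lo, hi) ranges."""
    return list(zip(points, points[1:]))


if __name__ == "__main__":
    for lo, hi in to_ranges(split_points(2110, 288272191, 3)):
        print(f"[{lo}, {hi})")
    # prints:
    # [2110, 96092137)
    # [96092137, 192182164)
    # [192182164, 288272191)
    # [288272191, 288272192)
```

With a divisor of 3 this yields four ranges, the last of which holds only the maximum key, which matches the distribution reported in the original message.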