Sqoop >> mail # user >> Sqoop split-by column limiting map tasks


Erik Knoll 2012-08-30, 16:13
Re: Sqoop split-by column limiting map tasks
This is just a tweak for your scenario.
Add this option to your Sqoop command:
--boundary-query 'select min(mapid), max(mapid) + 1 from <table_name>'
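
Why the `+ 1` helps can be seen in a small model of integer range splitting. This is only a sketch under the assumption that the splitter steps from min to max with a step of at least 1 and turns each adjacent pair of boundaries into one map task; the function names are hypothetical and this is not Sqoop's actual code, which may differ in detail.

```python
def split_points(lo, hi, num_splits):
    """Boundary values for an integer range, using a step of at least 1.

    Assumed model, not Sqoop's implementation.
    """
    step = max((hi - lo) // num_splits, 1)
    points = list(range(lo, hi + 1, step))
    # Make sure the upper bound itself is the final boundary.
    if points[-1] != hi or len(points) == 1:
        points.append(hi)
    return points


def count_tasks(lo, hi, num_splits):
    # Each adjacent pair of boundaries becomes one map task.
    return len(split_points(lo, hi, num_splits)) - 1


print(count_tasks(0, 6, 7))  # mapid 0..6 as-is -> 6
print(count_tasks(1, 7, 7))  # mapid shifted by one, same width -> 6
print(count_tasks(0, 7, 7))  # upper bound widened to max(mapid) + 1 -> 7
```

Under this model, seven distinct values give only six intervals, because the last interval includes both of its endpoints. Raising the upper bound by one adds a seventh interval, so each mapid value lands in its own task.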

Let me know if that doesn't work.

Thanks,
Abhijeet
On 30 Aug 2012 21:43, "Erik Knoll" <[EMAIL PROTECTED]> wrote:

> I'm using Sqoop 1.4.1 to import a table from MySQL to HDFS. The table
> contains log entries by users who are identified by an integer user ID
> but does not have a primary key. Because of the way user IDs were
> assigned, lower-value IDs have more records in the table than larger
> IDs, making parallel imports extremely unbalanced (I'm only running 7
> map tasks).
>
> In order to balance the parallel import, I created an additional column
> which I set to be 'mapid = UserID mod 7' producing values 0 - 6 which
> do have a uniform distribution of records. When I run the Sqoop import
> with '--split-by mapid -m 7' the job seems to be limited to 6 map
> tasks. This same behavior is exhibited even if I add 1 to my 'mapid'
> column so I'm thinking Sqoop is limiting the map tasks to the
> difference between the minimum and maximum values of the split-by
> column without adding 1 to the range.
>
> I know that I can create a different 'mapid' column or create a
> primary key, but is there something I can do with Sqoop to correct for
> this?
>
> Thank you,
> Erik
>