Sqoop >> mail # user >> Sqoop import big MySql table in HBase

Re: Sqoop import big MySql table in HBase
Hi Alberto,
Given that you have 910 million records and that your job reached 75% in about 8 minutes and then slowed down significantly, I have a feeling that your splits were not evenly sized. Based on your command line, it seems that you're dividing the data by some date field. Is this date field uniformly distributed? E.g., is there roughly the same number of rows for each date, or do you have more rows for more recent days?

Because Sqoop has no idea how the data are actually distributed in your database, it assumes a uniform distribution. Let me explain why that matters with the following example. Consider a table with one row on 2012-01-01, a second row on 2012-02-01, and 1M rows on 2012-03-01. Assume that we use three mappers (--num-mappers 3). In this case, Sqoop will create three splits: 2012-01-01 up to 2012-01-31, 2012-02-01 up to 2012-02-28, and 2012-03-01 up to 2012-03-31. Because the first two mappers have just one row each to move, they will finish almost instantly and bring the job to 66% done (2 out of 3 mappers are done). The last mapper, however, will keep running for some time, as it needs to move 1M rows. To an external observer it would appear that Sqoop has stopped, but what really happened is that the data were not distributed uniformly across the mappers.
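
To make the effect concrete, here is a minimal Python sketch of the same example. It is only an illustration and not Sqoop's actual splitting code: I'm assuming that the range between the smallest and largest value of the split column is cut into equal-width pieces, one per mapper. The function names (equal_width_splits, rows_per_split) and the rows dictionary are hypothetical and exist only for this example. With this simplification the boundaries come out as 20-day ranges rather than calendar months, but the row count per mapper is the same as in the example above.

```python
from datetime import date, timedelta


def equal_width_splits(lo, hi, num_splits):
    """Cut the date range [lo, hi] into num_splits ranges of equal width.

    Assumption: this mimics a splitter that only looks at the minimum and
    maximum of the split column and ignores how rows are distributed.
    """
    total_days = (hi - lo).days
    bounds = [lo + timedelta(days=total_days * i // num_splits)
              for i in range(num_splits)]
    bounds.append(hi)
    return list(zip(bounds[:-1], bounds[1:]))


def rows_per_split(rows_by_date, num_splits):
    """Count how many rows each mapper would have to move."""
    lo, hi = min(rows_by_date), max(rows_by_date)
    splits = equal_width_splits(lo, hi, num_splits)
    counts = []
    for i, (start, end) in enumerate(splits):
        is_last = i == len(splits) - 1
        # Every split is [start, end); the last one also includes its end.
        counts.append(sum(n for d, n in rows_by_date.items()
                          if start <= d < end or (is_last and d == end)))
    return counts


# The table from the example: two single rows and 1M rows on one day.
rows = {
    date(2012, 1, 1): 1,
    date(2012, 2, 1): 1,
    date(2012, 3, 1): 1_000_000,
}

print(rows_per_split(rows, 3))  # prints [1, 1, 1000000]
```

Two of the three mappers get a single row each and the third gets everything else, so the job jumps to 66% and then seems to hang while the last mapper does nearly all of the work.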


On Wed, Sep 05, 2012 at 09:37:49AM +0200, Alberto Cordioli wrote:
> Hi all,
> I am using Sqoop to import a big MySql table (around 910 milions of
> records) in Hbase.
> The command line that I'm using is something like:
> sqoop import --connect
> jdbc:mysql://<server>/<db>?zeroDateTimeBehavior=round --username <usr>
> -P --query <query>' --split-by <date-field> --hbase-table
> "<hbase_table>" --column-family "<fam>" --hbase-row-key "hash"
> The strange thing is that it takes a lot to complete the last part of
> the map. This is part of the log:
> [...]
> 12/09/04 17:16:45 INFO mapred.JobClient: Running job: job_201209031227_0007
> 12/09/04 17:16:46 INFO mapred.JobClient:  map 0% reduce 0%
> 12/09/04 17:24:20 INFO mapred.JobClient:  map 25% reduce 0%
> 12/09/04 17:24:21 INFO mapred.JobClient:  map 50% reduce 0%
> 12/09/04 17:24:23 INFO mapred.JobClient:  map 75% reduce 0%
> As you can see it does not take much time to from start to 75%, but
> the last part hasn't been finished (although it is working by a day
> continuously).
> Is there something wrong? I've tried to take a look to the logs but it
> seems to be ok.
> Thanks,
> Alberto
> --
> Alberto Cordioli