I've never used a text column for data splitting; however, it seems that Sqoop supports it (I found its splitter in the code). I'm still not sure it's wise, as string operations tend to be much slower on databases and you might end up with performance issues. Unfortunately, Sqoop currently does not support any direct way to affect split creation.
I thought about your problem and came up with two ideas that might help in your use case:
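To illustrate why split creation matters here, below is a rough Python sketch (not Sqoop's actual code) of how an equal-width splitter divides a numeric range between mappers. The function name and the day numbers are made up for illustration; the point is that the boundaries depend only on min/max, not on how the rows are actually distributed:

```python
def make_splits(lo, hi, num_mappers):
    """Return (start, end) boundary pairs dividing [lo, hi] into
    num_mappers equal-width ranges, the way a min/max-based splitter does."""
    width = (hi - lo) / num_mappers
    bounds = [lo + round(i * width) for i in range(num_mappers)] + [hi]
    return list(zip(bounds[:-1], bounds[1:]))

# Days since 2012-01-01 for the example discussed below in the thread:
# one row on day 0, one on day 31, 1M rows on day 60.
# Three mappers get three equal ~20-day splits regardless of the skew.
print(make_splits(0, 60, 3))  # [(0, 20), (20, 40), (40, 60)]
```

A zero date like '0000-00-00' stretches the min far below the real data, so almost all "normal" rows collapse into the last split.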
1) Would it be acceptable in your use case to change the zero-date policy from zeroDateTimeBehavior=round to zeroDateTimeBehavior=convertToNull? When the split column contains NULLs, Sqoop will create X+1 splits, where the extra split covers all NULL values. It probably won't be perfect, but it might help distribute your load more evenly.
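Concretely, only the JDBC URL from your original command changes (placeholders <server>, <db>, <usr>, etc. kept as in your command line):

```shell
# Same import as before, with zero dates mapped to NULL instead of rounded,
# so they land in the extra NULL split rather than skewing the date range.
sqoop import \
  --connect 'jdbc:mysql://<server>/<db>?zeroDateTimeBehavior=convertToNull' \
  --username <usr> -P \
  --query '<query>' --split-by <date-field> \
  --hbase-table '<hbase_table>' --column-family '<fam>' --hbase-row-key 'hash'
```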
2) What about splitting the entire job into two parts: first import all the zero dates, then import the rest of the values in a separate job. That way you should be able to get a decent distribution across the "normal" date range. Importing all the zero dates might be challenging if you have a lot of them, as there will be only one distinct value available (and thus just one split), so you might need to use the text column for split creation in that case anyway.
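A sketch of the two-job approach; the column list and <...> placeholders are assumptions, and the date constants come from your description ('0000-00-00' limit values, real data starting 2012-04-01). Note that with --query, Sqoop requires the literal $CONDITIONS token in the WHERE clause so it can inject each mapper's split predicate:

```shell
# Job 1: zero dates only -- a single mapper, since there is just one value
sqoop import --connect 'jdbc:mysql://<server>/<db>' --username <usr> -P \
  --query 'SELECT <columns> FROM <table> WHERE <date-field> = "0000-00-00" AND $CONDITIONS' \
  --num-mappers 1 \
  --hbase-table '<hbase_table>' --column-family '<fam>' --hbase-row-key 'hash'

# Job 2: the real date range, split evenly across the mappers
sqoop import --connect 'jdbc:mysql://<server>/<db>' --username <usr> -P \
  --query 'SELECT <columns> FROM <table> WHERE <date-field> >= "2012-04-01" AND $CONDITIONS' \
  --split-by <date-field> \
  --hbase-table '<hbase_table>' --column-family '<fam>' --hbase-row-key 'hash'
```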
On Wed, Sep 05, 2012 at 10:16:17AM +0200, Alberto Cordioli wrote:
> Thanks Jarcec,
> you've probably identified the problem immediately. In fact, I checked
> the date field, and I think the problem is that my data contains some
> "limit" values like '0000-00-00' (damn whoever inserted these).
> The other data are equally distributed in 2 months (from 2012-04-01 to
> 2012-06-01): as you said with a parallelism of 3, 2 mappers will take
> basically no data while the other will do the "true" job, right?
> So, now my question becomes: the other field that I could use to split
> the job is a hash (string). How does Sqoop divide this type of field?
> Lexicographic order?
> On 5 September 2012 09:57, Jarek Jarcec Cecho <[EMAIL PROTECTED]> wrote:
> > Hi Alberto,
> > taking into account that you have 910 million records and your job was able to get to 75% in a matter of 8 minutes and then slowed down significantly, I have a feeling that your splits were not equally divided. Based on your command line, it seems that you're dividing the data by some date field. Is this date field uniformly distributed? E.g. is there roughly the same number of rows for each date, or do you have more rows on more recent days?
> > Because Sqoop has no idea how exactly the data are distributed in your database, it assumes a uniform distribution. Let me explain why this matters with the following example. Consider a table with one row on 2012-01-01, a second row on 2012-02-01, and 1M rows on 2012-03-01, imported with three mappers (--num-mappers 3). In this case, Sqoop will create three splits: 2012-01-01 up to 2012-01-31, 2012-02-01 up to 2012-02-28, and 2012-03-01 up to 2012-03-31. Because the first two mappers have just one row each to move, they will finish almost instantly and bring the job to 66% done (2 out of 3 mappers finished), while the last mapper will keep running for some time, as it needs to move 1M rows. To an external observer it would appear that Sqoop has stopped, but what really happened is that the data were not uniformly distributed across the mappers.
> > Jarcec
> > On Wed, Sep 05, 2012 at 09:37:49AM +0200, Alberto Cordioli wrote:
> >> Hi all,
> >> I am using Sqoop to import a big MySQL table (around 910 million
> >> records) into HBase.
> >> The command line that I'm using is something like:
> >> sqoop import --connect
> >> jdbc:mysql://<server>/<db>?zeroDateTimeBehavior=round --username <usr>
> >> -P --query <query>' --split-by <date-field> --hbase-table
> >> "<hbase_table>" --column-family "<fam>" --hbase-row-key "hash"
> >> The strange thing is that it takes a long time to complete the last
> >> part of the map phase. This is part of the log: