Sqoop user mailing list: Sqoop import big MySql table in HBase


Alberto Cordioli 2012-09-05, 07:37
Jarek Jarcec Cecho 2012-09-05, 07:57
Alberto Cordioli 2012-09-05, 08:16
Jarek Jarcec Cecho 2012-09-05, 11:43
Re: Sqoop import big MySql table in HBase
Hi Alberto,
Sqoop calculates split points by converting the minimum and maximum string
values returned by the DBMS for the column into decimal values, and then
running the decimal splitter on those. The algorithm used to convert (more
precisely, map) strings to decimals is fairly involved. The following should
help you understand it better (taken from the Javadoc for the split method of
the text splitter):

/**
   * This method needs to determine the splits between two user-provided
   * strings.  In the case where the user's strings are 'A' and 'Z', this is
   * not hard; we could create two splits from ['A', 'M') and ['M', 'Z'], 26
   * splits for strings beginning with each letter, etc.
   *
   * If a user has provided us with the strings "Ham" and "Haze", however, we
   * need to create splits that differ in the third letter.
   *
   * The algorithm used is as follows:
   * Since there are 2**16 unicode characters, we interpret characters as
   * digits in base 65536. Given a string 's' containing characters s_0, s_1
   * .. s_n, we interpret the string as the number: 0.s_0 s_1 s_2.. s_n in
   * base 65536. Having mapped the low and high strings into floating-point
   * values, we then use the BigDecimalSplitter to establish the even split
   * points, then map the resulting floating point values back into strings.
   */
  public List<InputSplit> split(Configuration conf, ResultSet results,
      String colName) throws SQLException {
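
For reference, here is a minimal sketch in plain Java of the string-to-decimal
mapping that Javadoc describes. This is not Sqoop's actual TextSplitter code
(the class and constant names are mine, and the real splitter also strips the
common prefix of the low and high strings and converts the computed split
points back into strings), but it shows the base-65536 idea:

import java.math.BigDecimal;
import java.math.RoundingMode;

public class StringSplitSketch {
  // Each character position is one "digit" in base 65536.
  private static final BigDecimal ONE_PLACE = new BigDecimal(65536);
  // Cap the precision at a few characters, as the real splitter does.
  private static final int MAX_CHARS = 8;

  // Interpret 's' as the fraction 0.s_0 s_1 .. s_n in base 65536.
  static BigDecimal stringToBigDecimal(String s) {
    BigDecimal result = BigDecimal.ZERO;
    BigDecimal curPlace = ONE_PLACE; // the first character is worth 1/65536
    int len = Math.min(s.length(), MAX_CHARS);
    for (int i = 0; i < len; i++) {
      result = result.add(
          new BigDecimal(s.charAt(i)).divide(curPlace, 24, RoundingMode.HALF_UP));
      curPlace = curPlace.multiply(ONE_PLACE); // next char: 1/65536^(i+1)
    }
    return result;
  }

  public static void main(String[] args) {
    // "Ham" and "Haze" agree in the first two digits and differ in the third,
    // so evenly spaced decimals between them differ from the third letter on.
    System.out.println(stringToBigDecimal("Ham"));
    System.out.println(stringToBigDecimal("Haze"));
  }
}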

After the splits are calculated, each mapper fires a SELECT query whose WHERE
clause bounds the results by its split points, and retrieves its portion of
the data.
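
For illustration only (the table, column, and boundary values here are
invented, not output from a real job), with three mappers the bounded queries
look roughly like this; note the last range is closed so the maximum value is
included:

public class SplitQueryIllustration {
  public static void main(String[] args) {
    String base = "SELECT id, event_date, payload FROM mytable WHERE ";
    String[] bounds = {
      "event_date >= '2012-04-01' AND event_date < '2012-04-21'",  // mapper 1
      "event_date >= '2012-04-21' AND event_date < '2012-05-11'",  // mapper 2
      "event_date >= '2012-05-11' AND event_date <= '2012-06-01'", // mapper 3 (closed upper bound)
    };
    for (String b : bounds) {
      System.out.println(base + b); // print each mapper's bounded query
    }
  }
}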

From a user perspective, you can use a string column for splitting except in
the following scenario: a char split-by column is not recommended when the
DBMS sorts in a case-insensitive manner. The current algorithm used to
calculate splits has some known flaws there (it can lead to a partial import
or duplicate records), and Sqoop displays a warning before executing the job.
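
To see why case-insensitive sorting is a problem, a tiny illustration (my own
example, not Sqoop code): the split calculation works in Unicode code-point
order, while a case-insensitive DBMS orders the same values differently, so
the computed ranges need not line up with what the WHERE clauses select.

public class CollationMismatch {
  public static void main(String[] args) {
    // Code-point order, which the split calculation effectively assumes:
    System.out.println("Z".compareTo("a"));           // negative: "Z" sorts before "a"
    // Case-insensitive order, which such a DBMS uses:
    System.out.println("Z".compareToIgnoreCase("a")); // positive: "Z" sorts after "a"
  }
}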

Let me know if you need more details.

Thanks,
Abhijeet

On Wed, Sep 5, 2012 at 5:13 PM, Jarek Jarcec Cecho <[EMAIL PROTECTED]> wrote:

> Hi Alberto,
> I've never used a text column for data splitting, but it seems that Sqoop
> supports it (I found its splitter in the code). I'm still not sure it's
> wise, though, as string operations tend to be much slower on databases and
> you might end up with performance issues. Unfortunately, Sqoop currently
> does not support any direct way to affect split creation.
>
> I tried to think about your problem and came up with two ideas that might
> help in your use case:
>
> 1) Would it be acceptable in your use case to change the zero-date policy
> from zeroDateTimeBehavior=round to zeroDateTimeBehavior=convertToNull? In
> case the "split" column contains NULLs, Sqoop will create X+1 splits, where
> the +1 covers all the NULL values. It probably won't be the best, but it
> might help distribute your load more evenly.
>
> 2) What about splitting the entire job into two parts: first import all the
> zero dates, and then, in a separate job, the rest of the values? By doing
> so you might be able to get a decent distribution across the "normal" dates
> part. Importing all the zero dates might be challenging if you have a lot
> of them, as there will be only one value available (and thus just one
> split), and therefore you might need to use the text column for split
> creation in this case anyway.
>
> Jarcec
>
> On Wed, Sep 05, 2012 at 10:16:17AM +0200, Alberto Cordioli wrote:
> > Thanks Jarcec,
> > probably you've identified the problem immediately. In fact, I checked
> > the date field, and I think the problem is that my data contains some
> > "limit" values like '0000-00-00' (damn whoever inserted those).
> > The other data are evenly distributed over 2 months (from 2012-04-01 to
> > 2012-06-01): as you said, with a parallelism of 3, 2 mappers will get
> > basically no data while the other one does the "true" job, right?
> >
> > So now my question becomes: the other field that I could use to split
> > the job is a hash (string). How does Sqoop divide this type of field?
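
As a concrete sketch of Jarek's two-job suggestion quoted above (the connect
string, table, column, and HBase names are invented for the example; --where,
--split-by, --hbase-table, --column-family, and --num-mappers are standard
Sqoop options):

# Job 1: only the zero dates; a single mapper, since there is just one value.
sqoop import --connect 'jdbc:mysql://dbhost/mydb?zeroDateTimeBehavior=round' \
  --table mytable --hbase-table mytable --column-family d \
  --where "event_date = '0000-00-00'" --num-mappers 1

# Job 2: everything else, which splits evenly across the real date range.
sqoop import --connect 'jdbc:mysql://dbhost/mydb' \
  --table mytable --hbase-table mytable --column-family d \
  --where "event_date <> '0000-00-00'" --split-by event_date --num-mappers 3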
Alberto Cordioli 2012-09-05, 14:29
abhijeet gaikwad 2012-09-05, 14:50
Alberto Cordioli 2012-09-07, 07:20