Sqoop, mail # user - Sqoop import big MySql table in HBase


Re: Sqoop import big MySql table in HBase
abhijeet gaikwad 2012-09-05, 14:50
I forgot to mention that the calculations are done on at most the first 8
characters of the string. So computation-wise you are safe - but that may
generate inaccurate splits in some scenarios.
Anyway, I feel Jarcec's solutions are a better workaround :)

Thanks,
Abhijeet
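To make the first-8-characters point concrete, here is a rough sketch (not Sqoop's actual code; the class and method names are invented for illustration) of how a string can be read as a fraction in base 65536, truncated to its 8 leading characters, so that a numeric splitter can work on it:

```java
import java.math.BigDecimal;

// Illustrative sketch of interpreting a string as the number
// 0.s_0 s_1 .. s_n in base 65536, truncated to 8 characters,
// in the spirit of Sqoop's text splitter (names are made up).
public class StringToDecimalSketch {
    static final BigDecimal ONE_PLACE = new BigDecimal(65536);
    static final int MAX_CHARS = 8;  // only the leading chars are considered

    static BigDecimal stringToBigDecimal(String s) {
        BigDecimal result = BigDecimal.ZERO;
        // 1/65536: the value of the first "digit" position after the point
        BigDecimal curPlace = BigDecimal.ONE.divide(ONE_PLACE);
        int len = Math.min(s.length(), MAX_CHARS);
        for (int i = 0; i < len; i++) {
            // treat character i as a digit in base 65536, one place further right
            result = result.add(new BigDecimal((int) s.charAt(i)).multiply(curPlace));
            curPlace = curPlace.divide(ONE_PLACE);
        }
        return result;
    }

    public static void main(String[] args) {
        // "Ham" and "Haze" map to nearby but distinct decimals, so a
        // numeric splitter can place split points between them
        System.out.println(stringToBigDecimal("Ham"));
        System.out.println(stringToBigDecimal("Haze"));
    }
}
```

Because 65536 is a power of two, every division here terminates, so no rounding context is needed.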
On 5 Sep 2012 20:00, "Alberto Cordioli" <[EMAIL PROTECTED]> wrote:

> Mmh, I see. Hence I should avoid splitting by string fields, since my hash
> field is 72 chars long and would require a lot of computation (if I
> understood correctly).
> I think one of the solutions proposed by Jarcec could be ok.
> I also think I'll divide my big table into smaller chunks, since the
> problem is the query that determines the split points. What do you
> think?
>
> Cheers,
> Alberto
>
>
> On 5 September 2012 15:21, abhijeet gaikwad <[EMAIL PROTECTED]>
> wrote:
> > Hi Alberto,
> > Sqoop calculates split points by converting the min. and max. string values
> > returned by the DBMS for the column into corresponding decimal values, and
> > then uses the decimal splitter. There is a fairly involved algorithm (more
> > of a mapping) for converting string to decimal. This should help you
> > understand better (taken from the Javadoc for the split method of the text
> > splitter):
> >
> > /**
> >  * This method needs to determine the splits between two user-provided
> >  * strings.  In the case where the user's strings are 'A' and 'Z', this is
> >  * not hard; we could create two splits from ['A', 'M') and ['M', 'Z'], 26
> >  * splits for strings beginning with each letter, etc.
> >  *
> >  * If a user has provided us with the strings "Ham" and "Haze", however, we
> >  * need to create splits that differ in the third letter.
> >  *
> >  * The algorithm used is as follows:
> >  * Since there are 2**16 unicode characters, we interpret characters as
> >  * digits in base 65536. Given a string 's' containing characters s_0, s_1
> >  * .. s_n, we interpret the string as the number: 0.s_0 s_1 s_2 .. s_n in
> >  * base 65536. Having mapped the low and high strings into floating-point
> >  * values, we then use the BigDecimalSplitter to establish the even split
> >  * points, then map the resulting floating point values back into strings.
> >  */
> >   public List<InputSplit> split(Configuration conf, ResultSet results,
> >       String colName) throws SQLException {
> >
> > After the splits are calculated, WHERE clauses are used in the SELECT
> > queries fired by each mapper (i.e. each result set is bounded by split
> > points) to retrieve the data.
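The per-mapper bounded queries described above can be sketched as follows (a rough illustration, not Sqoop's actual query builder; the table and column names are invented):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of turning computed split points into one bounded
// SELECT per mapper (illustrative names, not Sqoop internals).
public class SplitQuerySketch {
    static List<String> boundedQueries(String table, String col, String[] splits) {
        List<String> queries = new ArrayList<>();
        for (int i = 0; i < splits.length - 1; i++) {
            // the last interval is closed on both ends so the maximum
            // value of the split column is not dropped
            String op = (i == splits.length - 2) ? "<=" : "<";
            queries.add("SELECT * FROM " + table + " WHERE " + col + " >= '"
                + splits[i] + "' AND " + col + " " + op + " '" + splits[i + 1] + "'");
        }
        return queries;
    }

    public static void main(String[] args) {
        // split points 'A', 'M', 'Z' yield two mappers' queries
        for (String q : boundedQueries("my_table", "name", new String[] {"A", "M", "Z"})) {
            System.out.println(q);
        }
    }
}
```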
> >
> > From a user perspective, you can use a string column for splitting, except
> > in the following scenario: a char split-by column is not recommended when
> > the DBMS sorts in a case-insensitive manner. The current algorithm used to
> > calculate splits has some flaws in that case. This is known, and Sqoop
> > displays a warning before executing the job.
> >
> > Let me know if you need more details.
> >
> > Thanks,
> > Abhijeet
> >
> >
> > On Wed, Sep 5, 2012 at 5:13 PM, Jarek Jarcec Cecho <[EMAIL PROTECTED]>
> > wrote:
> >>
> >> Hi Alberto,
> >> I've never used a text column for data splitting, however it seems that
> >> Sqoop supports it (I found its splitter in the code). I'm still not sure
> >> it's wise, as string operations tend to be much slower on databases and
> >> you might end up with performance issues. Unfortunately, Sqoop currently
> >> does not offer any direct way to affect split creation.
> >>
> >> I tried to think about your problem and came up with two ideas that might
> >> help in your use case:
> >>
> >> 1) Would it be acceptable in your use case to change the zero-date policy
> >> from zeroDateTimeBehavior=round to zeroDateTimeBehavior=convertToNull? In
> >> case the "split" column contains NULLs, Sqoop will create X+1 splits,
> >> where the +1 covers all NULL values. It probably won't be the best, but it
> >> might help distribute your load more evenly.
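For reference, this policy is controlled by a property on the MySQL JDBC connect string passed to Sqoop's --connect option; the host and database names below are placeholders:

```
jdbc:mysql://dbhost/mydb?zeroDateTimeBehavior=convertToNull
```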
> >>
> >> 2) What about splitting the entire job into two parts - firstly export all