Re: Sqoop import big MySql table in HBase
I forgot to mention that the calculations are done for at most the first 8
characters of the string. So computation-wise you are safe, but this may
generate inaccurate splits in some scenarios.
Anyway, I feel Jarcec's solutions are the better workaround :)

Thanks,
Abhijeet
On 5 Sep 2012 20:00, "Alberto Cordioli" <[EMAIL PROTECTED]> wrote:

> Mmh, I see. Hence I should avoid splitting by string fields, since my hash
> field is 72 characters long and would require a lot of computation (if I
> understood correctly).
> I think one of the solutions proposed by Jarcec could be ok.
> I also think I'll divide my big table into smaller chunks, since the
> problem is the query that determines the split points. What do you
> think?
>
> Cheers,
> Alberto
>
>
> On 5 September 2012 15:21, abhijeet gaikwad <[EMAIL PROTECTED]>
> wrote:
> > Hi Alberto,
> > Sqoop calculates split points by converting the min. and max. string values
> > returned by the DBMS for the column into decimal values, and then uses the
> > decimal splitter. A fairly involved algorithm is used for converting (more
> > of mapping) a string to a decimal. This should help you understand better
> > (taken from the javadocs for the split method of the text splitter):
> >
> > /**
> >  * This method needs to determine the splits between two user-provided
> >  * strings. In the case where the user's strings are 'A' and 'Z', this is
> >  * not hard; we could create two splits from ['A', 'M') and ['M', 'Z'], 26
> >  * splits for strings beginning with each letter, etc.
> >  *
> >  * If a user has provided us with the strings "Ham" and "Haze", however, we
> >  * need to create splits that differ in the third letter.
> >  *
> >  * The algorithm used is as follows:
> >  * Since there are 2**16 unicode characters, we interpret characters as
> >  * digits in base 65536. Given a string 's' containing characters s_0, s_1
> >  * .. s_n, we interpret the string as the number: 0.s_0 s_1 s_2 .. s_n in
> >  * base 65536. Having mapped the low and high strings into floating-point
> >  * values, we then use the BigDecimalSplitter to establish the even split
> >  * points, then map the resulting floating point values back into strings.
> >  */
> >   public List<InputSplit> split(Configuration conf, ResultSet results,
> >       String colName) throws SQLException {
> >
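To make that concrete, here is a simplified, runnable sketch of the mapping the javadoc describes (modeled on Sqoop's TextSplitter, but not its actual code; the class name and the hash values in main are made up):

import java.math.BigDecimal;

public class StringSplitSketch {
    private static final BigDecimal ONE_PLACE = new BigDecimal(65536);
    private static final int MAX_CHARS = 8; // only the first 8 chars are used

    // Interpret str as the number 0.s_0 s_1 .. s_n in base 65536.
    static BigDecimal stringToBigDecimal(String str) {
        BigDecimal result = BigDecimal.ZERO;
        BigDecimal curPlace = ONE_PLACE; // 65536^(i+1) while processing char i
        int len = Math.min(str.length(), MAX_CHARS);
        for (int i = 0; i < len; i++) {
            // Add s_i * 65536^-(i+1). Dividing by a power of 65536 = 2^16 is
            // exact in decimal, so plain divide() never throws here.
            result = result.add(new BigDecimal((int) str.charAt(i)).divide(curPlace));
            curPlace = curPlace.multiply(ONE_PLACE);
        }
        return result;
    }

    public static void main(String[] args) {
        // "Ham" and "Haze" agree in the first two characters, so their images
        // differ only from the third base-65536 digit onward.
        System.out.println(stringToBigDecimal("Ham"));
        System.out.println(stringToBigDecimal("Haze"));
        // The 8-character cap mentioned earlier: two (hypothetical) hashes
        // that share their first 8 characters map to exactly the same number.
        System.out.println(stringToBigDecimal("3f2b9c01deadbeef")
                .equals(stringToBigDecimal("3f2b9c01cafebabe"))); // true
    }
}

Once the low and high strings are mapped this way, evenly spaced decimal split points between them are computed and converted back into strings to form the split boundaries.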
> > After the splits are calculated, each mapper fires a SELECT query whose
> > WHERE clause bounds the result by that mapper's split points in order to
> > retrieve the data.
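Schematically, the bounding works like the sketch below (hypothetical helper, table, and column names; not Sqoop's actual generated SQL):

// Each mapper scans one [lo, hi) range of the split-by column; only the
// final range closes with an inclusive upper bound so the max row survives.
static String boundedQuery(String lo, String hi, boolean lastSplit) {
    String upper = lastSplit ? "hash <= '" + hi + "'" : "hash < '" + hi + "'";
    return "SELECT * FROM mytable WHERE hash >= '" + lo + "' AND " + upper;
}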
> >
> > From a user perspective, you can use a string column for splitting except
> > in the following scenario: a char split-by column is not recommended when
> > the DBMS sorts in a case-insensitive manner. The current algorithm used to
> > calculate splits has some flaws there. This is known, and Sqoop displays a
> > warning before executing the job.
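One way to see the case-insensitivity problem (a hypothetical illustration, not taken from Sqoop): the min/max bounds the DBMS reports can be ordered opposite to the raw character codes that the base-65536 mapping relies on.

public class CaseOrderingFlaw {
    public static void main(String[] args) {
        // Under a case-insensitive collation the DBMS could report:
        String min = "apple"; // case-insensitively smallest
        String max = "Zebra"; // case-insensitively largest
        // By raw code point, 'Z' (90) sorts before 'a' (97), so to the
        // splitter the range looks inverted and the generated boundaries
        // can miss rows.
        System.out.println(min.compareTo(max)); // positive: min > max here
    }
}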
> >
> > Let me know if you need more details.
> >
> > Thanks,
> > Abhijeet
> >
> >
> > On Wed, Sep 5, 2012 at 5:13 PM, Jarek Jarcec Cecho <[EMAIL PROTECTED]>
> > wrote:
> >>
> >> Hi Alberto,
> >> I've never used a text column for data splitting, but it seems that
> >> Sqoop supports it (I found its splitter in the code). However, I'm
> >> still not sure it's wise, as string operations tend to be much slower
> >> on databases and you might end up with performance issues. Unfortunately,
> >> Sqoop currently does not support any direct way to affect split creation.
> >>
> >> I tried to think about your problem and came up with two ideas for how
> >> to help in your use case:
> >>
> >> 1) Would it be acceptable in your use case to change the zero-date policy
> >> from zeroDateTimeBehavior=round to zeroDateTimeBehavior=convertToNull? In
> >> case the "split" column contains nulls, Sqoop will create X+1 splits,
> >> where the +1 covers all NULL values. It probably won't be the best, but
> >> it might help to distribute your load more evenly.
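For reference, zeroDateTimeBehavior is a MySQL Connector/J parameter, so it goes in the JDBC URL rather than in a Sqoop option; a hypothetical invocation (host, database, table, and column names made up):

sqoop import \
  --connect "jdbc:mysql://dbhost/mydb?zeroDateTimeBehavior=convertToNull" \
  --table mytable \
  --split-by created_at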
> >>
> >> 2) What about splitting the entire job into two parts - firstly export all