Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Sqoop >> mail # dev >> Review Request: Fix for SQOOP-1029


Copy link to this message
-
Re: Review Request: Fix for SQOOP-1013


> On May 31, 2013, 8:30 p.m., Abraham Elmahrek wrote:
> > Seems good except for a few comments and nit picks.
> >
> > The datetime split logic is when certain intervals are "hotter" than others. IE: 1000000 rows out of 20000000 exist between the date range of december 1st to december 31st, but a user is importing the entire years data, with 12 node. Basically, 1 machine will extract 1000000 rows, while the others will extract 1000000/11 rows. In the future we could probably add some upfront analysis of the data to improve distribution.
>
> Abraham Elmahrek wrote:
>     Sorry... my wording above wasn't very concise... The datetime split logic doesn't seem to handle "hot" partitions is what I meant.

Good point Abraham - even though this skew can happen to all datatypes that are used for splitting, with dates, there is more chance of such skew.  Typically the solution for this as also to efficiently manage the data life-cycle is dependent on the database facilities, so we may have to do database specific functionality like partitioning etc.   A generic solution can be attempted, but support for grouping and ordering based on expressions varies between different databases.   But something to keep in our mind
> On May 31, 2013, 8:30 p.m., Abraham Elmahrek wrote:
> > connector/connector-generic-jdbc/src/main/java/org/apache/sqoop/connector/jdbc/GenericJdbcImportPartitioner.java, lines 164-180
> > <https://reviews.apache.org/r/11537/diff/2/?file=299088#file299088line164>
> >
> >     Nit: This seems like it can be merged above.

I am assuming you are talking about the end condition.   Please see above
> On May 31, 2013, 8:30 p.m., Abraham Elmahrek wrote:
> > connector/connector-generic-jdbc/src/main/java/org/apache/sqoop/connector/jdbc/GenericJdbcImportPartitioner.java, lines 139-163
> > <https://reviews.apache.org/r/11537/diff/2/?file=299088#file299088line139>
> >
> >     Nit: This loop can probably handle all partitions by using constructDateConditions(objLB, objUB, false) to constructDateConditions(objLB, objUB, upperBound == maxDateValue). Also, upperBound += (i < remainder) ? 1 : 0;

That is true, but we have the pattern to add the last partition separately for all partitioning types.   It is better to be consistent, I thought.   Do you agree?
> On May 31, 2013, 8:30 p.m., Abraham Elmahrek wrote:
> > connector/connector-generic-jdbc/src/main/java/org/apache/sqoop/connector/jdbc/GenericJdbcImportPartitioner.java, line 190
> > <https://reviews.apache.org/r/11537/diff/2/?file=299088#file299088line190>
> >
> >     There doesn't seem to be any thing stopping partitionMaxValue from being null. So, it makes sense to add some null checking here as well.

min and max being null is handled for all datatypes.   NULL in the presence of all values will be treated as having a lower value and so it is impossible to get MAX to null while MIN is not null.

In fact, when one row has a non-null value for a column with all the other rows in the table having the split by column as NULL will return the single non-null value for both min and max.

So, the whole null handling thing has to be cleaned up.   I will file a follow on JIRA for that
> On May 31, 2013, 8:30 p.m., Abraham Elmahrek wrote:
> > connector/connector-generic-jdbc/src/main/java/org/apache/sqoop/connector/jdbc/GenericJdbcImportPartitioner.java, line 232
> > <https://reviews.apache.org/r/11537/diff/2/?file=299088#file299088line232>
> >
> >     There doesn't seem to be any thing stopping partitionMaxValue from being null. So, it makes sense to add some null checking here as well.

Please see above
- Venkat
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/11537/#review21237
-----------------------------------------------------------
On May 30, 2013, 6:25 p.m., Venkat Ranganathan wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit: