Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Drill, mail # dev - Review Request: Patch for an HBase Table Scanner version of a Drill Storage Engine, implemented against the reference implementation.


Copy link to this message
-
Re: Review Request: Patch for an HBase Table Scanner version of a Drill Storage Engine, implemented against the reference implementation.
David Alves 2013-03-24, 22:54


> On March 24, 2013, 9:26 p.m., Ted Dunning wrote:
> > Overall, this looks like a nice contribution.
> >
> > My review was fairly superficial.  Many of the comments regard esthetics rather than content.  A few regard basic readability.
> >
> > That said, it would really help to have some more high level comments that explain what connects to what.  This is key for Drill since we need to make it easy for new contributors to jump in.

Thank you for reviewing. I'll address some of your remarks before proceeding to implement the changes you suggest.
> On March 24, 2013, 9:26 p.m., Ted Dunning wrote:
> > sandbox/prototype/contrib/storage-hbase/src/main/java/org/apache/drill/hbase/HBaseStorageEngine.java, line 42
> > <https://reviews.apache.org/r/10099/diff/1/?file=274063#file274063line42>
> >
> >     This comment says nothing.  It seems like there is something important to say here.  The text from the JIRA might make a good start.

fair enough, will add more high level comments throughout.
> On March 24, 2013, 9:26 p.m., Ted Dunning wrote:
> > sandbox/prototype/contrib/storage-hbase/src/main/java/org/apache/drill/hbase/HBaseStorageEngine.java, line 59
> > <https://reviews.apache.org/r/10099/diff/1/?file=274063#file274063line59>
> >
> >     This is hard to read (looks automatically generated).
> >    
> >     It should be easy to simply this automatically.  It shouldn't be any different in terms of speed.

This is peculiar because it's an equals contract based on a method and not on a field.
Assuming a plan associates a SE with a name it makes sense (IMO) to have the SE implement the equals contract based on that name.
The unnecessary clutter comes mainly from allowing getName() to be null which I'm not sure makes sense, If we remove that this will be much easier to read.
> On March 24, 2013, 9:26 p.m., Ted Dunning wrote:
> > sandbox/prototype/contrib/storage-hbase/src/main/java/org/apache/drill/hbase/HBaseStorageEngine.java, line 102
> > <https://reviews.apache.org/r/10099/diff/1/?file=274063#file274063line102>
> >
> >     Monster long line.  Can this be broken?

definitely, missed this one
> On March 24, 2013, 9:26 p.m., Ted Dunning wrote:
> > sandbox/prototype/contrib/storage-hbase/src/main/java/org/apache/drill/hbase/region/HBaseRegionRecordReader.java, lines 24-34
> > <https://reviews.apache.org/r/10099/diff/1/?file=274065#file274065line24>
> >
> >     Surely there is more to say here?!
> >    
> >     Why is collocation required?  Is local join really supported?

This is a skeleton for a WIP that I'm doing with coprocessors/endpoints etc that will work at a region scanner level. Maybe it makes sense to remove this altogether from this patch.
Colocation is not mandatory but is relevant in some cases like joins.
When doing a dist join a common approach in some cases is to perform a semi-join first.
In particular my focus in on self-joins of key col_name, but this is also valid for inter-table joins
A possible sequence of steps could be the following.
- each drill daemon (co-located with the hbase region server) will perform a filtered/projected region scan locally (node-local) collecting the unique entries (alternatively I'm trying to do this with hbase's bloom filters for key based joins)
- unique entries from the previous operation are sent/streamed to the other hbase nodes that hold other regions of that particular table which they use to perform a local semi-join. Results of the semi-join are streamed to other nodes based on assigned regions (this is why colocation is important, if the daemons are not colocated semi-join input and output data is routed through a 3rd party node)
- each node performs the portion of the join that is local, there are two cases here:
   - the schema is something like key,col_name,col_value with non unique keys, joining on keys, in which case the join is already done and corresponds to a row with multiple columns
   - the join is between the key an a col_name or col_value which means that there might exist local col_names or col_values that can need joined locally.
- each node performs the remaining portion of the join with the results of the semi join.

I hope this sheds some light into your questions.
If this skeleton is to stay I can move these notes into the comments.

this is a thin wrapper for a table scanner. Will add this to the docs.

duly noted

just so that I dont have to repeat:

  public BooleanValue getAsBooleanValue(){
    throw new DrillRuntimeException(String.format("A %s value is not a BooleanValue.", this.getClass().getCanonicalName()));
  }

  public BytesValue getAsBytesValue(){
    throw new DrillRuntimeException(String.format("A %s value is not a BytesValue.", this.getClass().getCanonicalName()));
  }

etc...

in any case just noticed there are some methods that are stubs.

strangely my IDE reported this as perfectly aligned
- David
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10099/#review18330
On March 24, 2013, 8:45 p.m., David Alves wrote: