Drill >> mail # dev >> resync: todo list

resync: todo list
Hi drillers,

I'm back online.
I'm going to continue toward my goal: executing a query on a single drillbit first.

I've pulled from the current github/execwork. It seems to me that the following
work remains to be done; correct me if anything is wrong:

* test cases for SQL queries
These would include join, projection, selection, and grouping.

* nextBatch() for PhysicalOperator
which does the iteration over records.

* encoded ValueVector types
dictionary encoding/bit vector encoding/RLE, for strings especially, to
reduce memory usage.

* POP implementations for Join/Projection/Selection etc.
Most importantly, each needs the nextBatch() method. Also, how would these POPs
cooperate with different ValueVector types, especially encoded types?
Anyway, I could start with simple cases first.

* Foreman.convert()
We have the Optimizer interface to do this. The Optimizer should generate a
physical plan with ExchangeOps, which are the boundaries of fragments. What
are the rules for generating Exchange nodes? How will clustering/schema
information affect this? Anyway, I could start with a simple case here too:
no exchange at all.
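
To make the nextBatch() and encoded-ValueVector questions above a bit more concrete, here is a minimal sketch of a pull-based operator chain over a dictionary-encoded string column. Everything in it (BatchSketch, DictVector, Operator, Scan, Filter, drain) is a hypothetical name made up for illustration; none of it is the actual execwork API. The point it tries to show is one possible way a selection operator could cooperate with an encoded type: evaluate the predicate once per distinct dictionary entry, then filter rows by their integer codes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Illustration only: all names here are hypothetical, not Drill classes.
class BatchSketch {

    /** Dictionary-encoded string column: each row holds an int code into the dictionary. */
    static final class DictVector {
        final List<String> dictionary;
        final int[] codes;

        DictVector(List<String> dictionary, int[] codes) {
            this.dictionary = dictionary;
            this.codes = codes;
        }

        /** Builds the dictionary in first-seen order and assigns one code per row. */
        static DictVector encode(List<String> values) {
            Map<String, Integer> index = new LinkedHashMap<>();
            int[] codes = new int[values.size()];
            for (int i = 0; i < values.size(); i++) {
                Integer code = index.get(values.get(i));
                if (code == null) {
                    code = index.size();
                    index.put(values.get(i), code);
                }
                codes[i] = code;
            }
            return new DictVector(new ArrayList<>(index.keySet()), codes);
        }

        String get(int row) {
            return dictionary.get(codes[row]);
        }
    }

    /** Assumed contract: returns the next batch, or null once the input is exhausted. */
    interface Operator {
        DictVector nextBatch();
    }

    /** Leaf operator that hands out an in-memory list in fixed-size encoded batches. */
    static final class Scan implements Operator {
        private final List<String> rows;
        private final int batchSize;
        private int pos = 0;

        Scan(List<String> rows, int batchSize) {
            this.rows = rows;
            this.batchSize = batchSize;
        }

        public DictVector nextBatch() {
            if (pos >= rows.size()) {
                return null;
            }
            int end = Math.min(pos + batchSize, rows.size());
            DictVector batch = DictVector.encode(rows.subList(pos, end));
            pos = end;
            return batch;
        }
    }

    /** Selection operator: tests each distinct value once, then filters rows by code. */
    static final class Filter implements Operator {
        private final Operator child;
        private final Predicate<String> predicate;

        Filter(Operator child, Predicate<String> predicate) {
            this.child = child;
            this.predicate = predicate;
        }

        public DictVector nextBatch() {
            DictVector in;
            while ((in = child.nextBatch()) != null) {
                boolean[] keep = new boolean[in.dictionary.size()];
                for (int c = 0; c < keep.length; c++) {
                    keep[c] = predicate.test(in.dictionary.get(c));
                }
                List<String> out = new ArrayList<>();
                for (int r = 0; r < in.codes.length; r++) {
                    if (keep[in.codes[r]]) {
                        out.add(in.get(r));
                    }
                }
                if (!out.isEmpty()) {
                    return DictVector.encode(out); // skip batches with no surviving rows
                }
            }
            return null;
        }
    }

    /** Pulls batches from the root operator until it reports exhaustion. */
    static List<String> drain(Operator root) {
        List<String> result = new ArrayList<>();
        DictVector batch;
        while ((batch = root.nextBatch()) != null) {
            for (int r = 0; r < batch.codes.length; r++) {
                result.add(batch.get(r));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("drill", "hadoop", "drill", "hbase", "drill");
        Operator plan = new Filter(new Scan(rows, 2), "drill"::equals);
        System.out.println(drain(plan)); // prints [drill, drill, drill]
    }
}
```

This corresponds to the "no exchange at all" case: a single fragment whose root is drained on one node. A real implementation would presumably avoid re-encoding the output and carry more than one column per batch; the sketch ignores both.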

And further todo:

* performance test suites
I think we need some bigger data sets, ideally as JSON files and in HTable.
Should I include test data files in the source repository, or should I generate a
(predictable) data set each time at test setup? Which approach do you
Currently, I'm willing to contribute to any of the above. If anything is wrong
or anything is already done, please let me know. From tomorrow on, I could
lay out these issues on JIRA and start working.
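
For the "generate a predictable data set at test setup" option in the performance test item, a minimal sketch of what that could look like is below. It assumes a fixed seed is enough to make runs repeatable; the class name TestDataSketch and the record fields (id, region, amount) are invented for illustration and are not taken from any existing Drill test data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustration only: hypothetical generator, not part of the Drill code base.
class TestDataSketch {

    private static final String[] REGIONS = {"north", "south", "east", "west"};

    /** Returns `rows` JSON records; the same seed always yields the same records. */
    static List<String> generate(long seed, int rows) {
        Random rnd = new Random(seed); // fixed seed makes the sequence reproducible
        List<String> out = new ArrayList<>(rows);
        for (int i = 0; i < rows; i++) {
            String region = REGIONS[rnd.nextInt(REGIONS.length)];
            int amount = rnd.nextInt(1000);
            out.add("{\"id\": " + i
                    + ", \"region\": \"" + region
                    + "\", \"amount\": " + amount + "}");
        }
        return out;
    }

    public static void main(String[] args) {
        // Print a few records; a test setup could write them to a file instead.
        for (String line : generate(42L, 5)) {
            System.out.println(line);
        }
    }
}
```

Generating at setup keeps large files out of the repository, at the cost of some setup time per run; checking in a small fixed file alongside it would still be possible for quick unit tests.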