Drill >> mail # user >> meeting notes 10/22/13

Re: meeting notes 10/22/13
Thanks Jason! And thanks for everyone's time!

* Push from leaves
Thanks for Jacques' suggestion. Indeed, in our current implementation we need
to take special care of RecordBatches with multiple inputs, and execution
needs more memory, for Join specifically. Your suggestion about prefetch
reminds me of two improvements we could make:

1. I can add a bounded queue at each edge in the DAG to serve as a data
buffer. When the queue is full, it stalls the operator that is putting data
into it. This should solve the huge memory problem.
2. With 1, I think the execution could be more Storm-like: each RecordBatch
is driven by its input, via a thread pool. This way we can better
parallelize CPU as well as IO.
3. The downside of 1 & 2 is more execution overhead on CPU. However, we
will modify our push implementation and see what we can get. Thanks!
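
To make point 1 concrete, here is a minimal sketch of a bounded edge buffer
with back-pressure. It is only an illustration in Python, not our Drill code:
the names `producer`, `consumer`, `run_pipeline` and `SENTINEL` are
hypothetical, and a plain list of numbers stands in for a RecordBatch.

```python
import queue
import threading

SENTINEL = object()  # hypothetical end-of-stream marker for one edge


def producer(edge, batches):
    """Upstream operator: put() blocks (stalls) while the edge buffer is full."""
    for batch in batches:
        edge.put(batch)
    edge.put(SENTINEL)


def consumer(edge, out):
    """Downstream operator: drains the edge until the end-of-stream marker."""
    while True:
        batch = edge.get()
        if batch is SENTINEL:
            break
        out.append(sum(batch))  # stand-in for real per-batch work


def run_pipeline(batches, capacity=2):
    """Connect one producer and one consumer through a bounded edge."""
    edge = queue.Queue(maxsize=capacity)  # at most `capacity` batches buffered
    out = []
    t_prod = threading.Thread(target=producer, args=(edge, batches))
    t_cons = threading.Thread(target=consumer, args=(edge, out))
    t_prod.start()
    t_cons.start()
    t_prod.join()
    t_cons.join()
    return out


results = run_pipeline([[1, 2], [3, 4], [5, 6], [7, 8]], capacity=2)
```

The memory held on an edge is bounded by `capacity` batches no matter how
fast the upstream operator is, which is the property point 1 is after. The
cost is the extra threads and queue synchronization mentioned in point 3.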

* for stream processing
One of our implementation goals is to use the same set of RecordBatch
implementations for both standard pull exec and push exec, and possibly
stream processing. As for resources, it could also be memory-consuming if we
want to join a (relatively) static dimension table with a streaming fact
table, because the dimension table must fit entirely in memory.
This is still far off the radar for now; just thinking: we could dynamically
add/remove arbitrary queries in the graph; data streams in and the graph is
updated; whenever we need a result, we just access the corresponding node in
the graph; the graph context resets periodically at each time window
boundary, while the topology remains.

* for approximation
As Jason said, our current work on sampling makes many assumptions about our
data distribution. I can hardly imagine how this part of the code could be
useful for others, except the CountDistinct. However, I will try to sort out
something to share if I can get something general.
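
For readers who have not seen the idea behind an approximate CountDistinct,
here is a minimal textbook-style HyperLogLog sketch in Python. It is not our
implementation: the class name, the SHA-1 based 64-bit hash, and the default
of 2^10 registers are all assumptions made for illustration.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog counter (hypothetical, not the Drill fork's code)."""

    def __init__(self, p=10):
        self.p = p                      # number of index bits; assumes p >= 7
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")        # 64-bit hash value
        idx = h >> (64 - self.p)                     # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        # 1-based position of the leftmost 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)        # bias constant for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # small-range correction (linear counting)
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog()
for i in range(10000):
    for _ in range(3):          # duplicates must not change the estimate
        hll.add(f"user-{i}")
estimate = hll.count()
```

With 1024 registers the expected relative error is roughly 1.04 / sqrt(1024),
about 3%, using a fixed 1024 small registers regardless of input size. That
is the "space saving, acceptable, not precise" trade-off from the notes.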

On Wed, Oct 23, 2013 at 12:49 AM, Jason Altekruse wrote:

> Hello All,
> Here are the notes from today's hangout. Michael, can you copy them into the
> google doc?
> participants: Jacques, Michael Hausenblas, Lisen Mu, Yash Sharma, Jinfeng,
> Jason Altekruse, Harri, Steven Phillips, Timothy Chen, Julien Hyde
> New employee at MapR: Jinfeng
>     - couple more in the next month
> Jacques:
>     - merged limit
>     - clarify VVs
>         - never access internal state of VV when it is invalid
>     - release notes
> Steven:
>     - ordered partitioner
>         - abstract out distributed cache interface
>     - continue to work on spooling to disk
> Jason:
>     - semi-blocking
>         - look at sort and ordered hash partitioner
> Yash
>     - name of functions
>         - separate class for operators and functions for more clarity
>             - different operators have their own class files
> Lisen
>     - fork of Drill
>         - data pushed from leaves rather than pulled from root
>         - we have been thinking about this same problem
>             - don't want to wait for IO all the time
>             - pre-fetch rather than push
>             - in a join you might get pushed a huge amount of data when you aren't ready for it
>             - stream processing
>                 - alternative concept around foreman
>                 - not quite right for streams
>                 - resource allocation
>                     - not as much for resource requirements
>         - HyperLogLog
>             - space saving
>             - acceptable - not precise
>         - data assembly - business logic
>             - approximations will be important to drill
>             - no serious thinking about sampling
>             - certain types of scanners should support sampling
>                 - hard with some without reading all data anyway
>                 - HBase might be easier to do a scan
>             - doing it with their own business logic and statistics
>                 - hard to generalize
> Hari
>     - not much for updates
>     - pick up with amazon ec2 docs
>         - had problem where we need 8 gigs