I did have one open question in the notes: when you said that you were
getting out-of-memory errors right away on the smaller EC2 instances, was
that with or without the direct memory size changed in the POM?
On Tue, Oct 22, 2013 at 11:12 PM, Eriksson Magnus <
[EMAIL PROTECTED]> wrote:
> Sent to the wrong email.
> Best regards
> Magnus in Sweden
> For every hard problem there is at least one solution that is simple, easy
> to understand and wrong...
> Lisen Mu wrote:
> Thanks Jason! And thanks for everyone's time!
> * Push from leaves
> Thanks for Jacques' suggestion. Indeed, in our current implementation, we need
> to take special care of RecordBatches with multiple inputs, and we need more
> memory when executing, for Join specifically. Your suggestion about prefetch
> reminds me of 2 improvements on this:
> 1. I can add a finite queue at each edge in the DAG, serving as a data buffer.
> When the queue is full, it would stall the operator that is putting data
> into it. This should solve the huge memory problem.
> 2. With 1, I think the execution could be more Storm-like: each RecordBatch
> is driven by its input, via a thread pool. This way we can better parallelize
> CPU as well as IO.
> 3. The downside of 1 & 2 is more execution overhead on the CPU. However, we
> will modify our push implementation and see what we can get. Thanks!
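A minimal sketch of idea 1 above, assuming a single producer operator and a single consumer operator joined by one edge. The class name `BoundedEdgeDemo`, the `runPipeline` helper, the `END` sentinel, and the use of `int[]` as a stand-in for a batch are all hypothetical and are not taken from Drill or from the fork under discussion. It only illustrates how a finite queue stalls the upstream operator once the buffer is full, which bounds the memory held on that edge.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedEdgeDemo {
    // Hypothetical end-of-stream marker, compared by identity.
    private static final int[] END = new int[0];

    /** Returns {batchesConsumed, maxQueueDepthObserved}. */
    public static int[] runPipeline(int capacity, int batches) {
        // The finite queue sitting on one edge of the DAG.
        BlockingQueue<int[]> edge = new ArrayBlockingQueue<>(capacity);
        AtomicInteger maxDepth = new AtomicInteger();

        // Upstream operator: put() blocks while the edge is full,
        // so the producer stalls instead of buffering without limit.
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < batches; i++) {
                    edge.put(new int[] {i});
                    maxDepth.accumulateAndGet(edge.size(), Math::max);
                }
                edge.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        // Downstream operator: drains the edge until end of stream.
        int consumed = 0;
        try {
            while (true) {
                int[] batch = edge.take();
                if (batch == END) {
                    break;
                }
                consumed++;
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("pipeline interrupted", e);
        }
        return new int[] {consumed, maxDepth.get()};
    }

    public static void main(String[] args) {
        int[] result = runPipeline(4, 1000);
        System.out.println("consumed=" + result[0] + " maxDepth=" + result[1]);
    }
}
```

In a Storm-like variant (idea 2), each operator would presumably run as a task on a shared thread pool rather than on a dedicated thread; a blocked `put()` then ties up a pool thread, which may be part of the CPU overhead mentioned in point 3.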
> * for stream processing
> One of our implementation goals is to use the same set of RecordBatch
> implementations for both standard pull exec and push exec, and possibly
> stream processing. For resources, it could also be memory consuming if we
> want to join a (relatively) static dimension table with a streaming fact
> table, because the dimension table should fit entirely into memory.
> This is still far off the radar for now; just thinking: we can dynamically
> add/remove arbitrary queries in the graph; data streams in and the graph is
> updated; whenever we need a result, we just access the corresponding node in
> the graph; the graph context resets periodically at the time window boundary,
> while the topology remains.
> * for approximation
> As Jason said, our current work on sampling makes many assumptions about our
> data distribution. I can hardly imagine how this part of the code could be
> useful for others, except the CountDistinct. However, I will try to sort out
> something to share if I can get something general.
> On Wed, Oct 23, 2013 at 12:49 AM, Jason Altekruse
> <[EMAIL PROTECTED]> wrote:
> > Hello All,
> > Here are the notes from today's hangout. Michael, can you copy them into
> > a Google doc?
> > participants: Jacques, Michael Hausenblas, Lisen Mu, Yash Sharma,
> > Jason Altekruse, Harri, Steven Phillips, Timothy Chen, Julien Hyde
> > New employee at MapR: Jinfeng
> > - couple more in the next month
> > Jacques:
> > - merged limit
> > - clarify VVs
> > - never access internal state of VV when it is invalid
> > - release notes
> > Steven:
> > - ordered partitioner
> > - abstract out distributed cache interface
> > - continue to work on spooling to disk
> > Jason:
> > - semi-blocking
> > - look at sort and ordered hash partitioner
> > Yash
> > - name of functions
> > - separate class for operators and functions for more clarity
> > - different operators have their own class files
> > Lisen
> > - fork of Drill
> > - data pushed from leaves rather than pulled from root
> > - we have been thinking about this same problem
> > - don't want to wait for IO all the time
> > - pre-fetch rather than push
> > - in a join you might get pushed a huge amount of data when you
> > aren't ready for it
> > - stream processing
> > - alternative concept around foreman
> > - not quite right for streams
> > - resource allocation
> > - not as much for resource requirements