Drill, mail # dev - In-place processing and performance.


Re: In-place processing and performance.
Azuryy Yu 2012-09-18, 02:51
Thanks!

I generally agree, but the cache and data manipulation should be separated. Every
query goes to the cache first; on a miss, it calls the read-data interface,
which should not be part of the cache module.

That way anybody can replace the cache policy and the read/write implementations
by setting drill.cache.policy.class, drill.read.class, and drill.write.class
in the configuration file.
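
To make the separation concrete, here is a minimal sketch, assuming two hypothetical interfaces (CachePolicy and DataReader) whose implementations are chosen by class name from the configuration. Only the property keys come from this message; every type and method name below is illustrative and not an existing Drill API. drill.write.class is left out for brevity and would be loaded the same way.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical interfaces: names and signatures are illustrative, not Drill APIs.
interface CachePolicy {
    byte[] get(String path);            // returns null on a cache miss
    void put(String path, byte[] data);
}

interface DataReader {
    byte[] read(String path);
}

// Trivial in-memory policy, used as the default in this sketch.
class MapCachePolicy implements CachePolicy {
    private final Map<String, byte[]> entries = new HashMap<>();
    public byte[] get(String path) { return entries.get(path); }
    public void put(String path, byte[] data) { entries.put(path, data); }
}

// Stand-in reader that counts how often the underlying source is read.
class CountingReader implements DataReader {
    int reads = 0;
    public byte[] read(String path) {
        reads++;
        return ("data:" + path).getBytes(StandardCharsets.UTF_8);
    }
}

class CachedSource {
    final CachePolicy cache;
    final DataReader reader;

    CachedSource(CachePolicy cache, DataReader reader) {
        this.cache = cache;
        this.reader = reader;
    }

    // Instantiate both parts by class name, so either can be swapped in config.
    static CachedSource fromConfig(Properties conf) throws ReflectiveOperationException {
        CachePolicy cache = (CachePolicy) newInstance(
            conf.getProperty("drill.cache.policy.class", MapCachePolicy.class.getName()));
        DataReader reader = (DataReader) newInstance(
            conf.getProperty("drill.read.class", CountingReader.class.getName()));
        return new CachedSource(cache, reader);
    }

    private static Object newInstance(String className) throws ReflectiveOperationException {
        return Class.forName(className).getDeclaredConstructor().newInstance();
    }

    // Every query consults the cache first; only a miss reaches the reader.
    byte[] query(String path) {
        byte[] hit = cache.get(path);
        if (hit != null) {
            return hit;
        }
        byte[] data = reader.read(path);
        cache.put(path, data);
        return data;
    }
}
```

The cache module never calls the reader itself; CachedSource sits above both, so either side can be replaced without touching the other.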
On Tue, Sep 18, 2012 at 10:23 AM, moon soo Lee <[EMAIL PROTECTED]> wrote:

> Here's my quick proposal for a common caching framework for Drill.
>
> 0. Why
>
>    - With in-place processing, the data is not guaranteed to be in the
>    most efficient format for processing (i.e. columnar).
>    - A non-columnar format can have a huge performance impact (an order of
>    magnitude).
>
>
> 1. Goal.
>
>    - Increase performance without painful ETL
>    - Performance includes not only overall throughput but also how
>    interactive it is.
>    - Provide an interface that is easy to implement from the datasource's
>    point of view.
>
>
> 2. What does it look like?
>
>    - Drill provides a common caching policy, which is responsible for:
>
>       - constructing the columnar format
>       - reading the columnar format
>       - the caching algorithm
>
>
>    - Each datasource optionally implements some methods to support caching.
>    They could be:
>
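>    // Assumes: import java.io.InputStream, java.io.OutputStream, java.io.IOException;
>    // the String path type is an assumption, and Location is a type still to be defined.
>    interface CachingSupport {
>
>    // Open a stream for writing columnar-format data to the cache medium.
>    OutputStream getOutputStream(String path) throws IOException;
>
>    // Remove the cached data for this path.
>    void remove(String path) throws IOException;
>
>    // Open a stream for reading cached data.
>    InputStream getInputStream(String path) throws IOException;
>
>    // Get location information for the data (in the DFS).
>    Location getLocation(String path);
>
>    }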
>
>    - The datasource implementation does not care about the columnar format,
>    the cache replacement policy, and so on; it only cares about basic IO. So
>    people who implement a datasource do not need to understand the columnar
>    details.
>
>
> 3. How does it work?
>
>    - Drill constructs the columnar-format cache using the methods the
>    datasource provides.
>    - A datasource can skip the caching implementation. In that case,
>    Drill works in pass-through mode.
>    - The cache policy class can be replaced. So if a more efficient data
>    format or algorithm comes along, it can be applied without changing every
>    datasource implementation.
>    - Cache construction does not block data reads, so the performance impact
>    of cache construction is minimized.
>    - Drill performs its queries through the cache. There could also be some
>    queries for cache management (like purge).
>
>
>
> Is it worth it, or does it just add complexity?
>
> For me it is worth it: +1.
>
> And I'm fully ready to do this job. :-)
>
>
> Thanks.
>
> ----
>
> Leemoonsoo
> [EMAIL PROTECTED]
>
>
> On Tue, Sep 18, 2012 at 1:59 AM, Tomer Shiran <[EMAIL PROTECTED]>
> wrote:
>
> > The plan was to have the scan operator do that kind of caching, but I agree
> > it could make sense to have some common caching framework in case other
> > scan operators want to cache as well.
> >
> > On Sun, Sep 16, 2012 at 5:29 PM, moon soo Lee <[EMAIL PROTECTED]> wrote:
> >
> > > Drill wants in-place processing ([1], page 12). Yes, ETL is painful.
> > > In my understanding, in-place processing means the data is not always
> > > columnar.
> > >
> > > [2], Figure 10, shows the performance difference between columnar and
> > > record-oriented (MR).
> > > If Dremel works with record-oriented data, I would guess it will be an
> > > order of magnitude slower.
> > >
> > > If that's true, will this still be interactive?
> > >
> > > And can anyone give more detail about "Adaptively convert storage layout
> > > into more efficient forms" ([1], page 12)?
> > > Is it a kind of transparent columnar-format caching?
> > >
> > > And if non-columnar data is expected in many cases,
> > > then how about Drill having a common cache for the storage interface
> > > instead of each scanner implementing its own caching policy?
> > >
> > > Thanks.
> > >
> > > [1] Apache Drill, Architecture outlines.
> > > http://www.slideshare.net/jasonfrantz/drill-architecture-20120913
> > > [2] Dremel: Interactive Analysis of Web-Scale Datasets