Re: Tez branch and tez based patches
I have finally gotten access to the wiki and added the design doc:
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Tez

I've also added links to it from the jira and in general overhauled the
design. Please let me know if you feel there's still stuff missing from the
document.

>> Possibly we should be thinking on how to build hive in such a way
>> that many different frameworks could plug in.

I believe that the proposed design and refactoring puts you on that path.
I'm not introducing layer upon layer of abstraction without a specific use
case in mind, but at a high level you would go through similar steps:

Exec layer:
- Define your own Task classes
- If you can reuse the operator pipeline, define your own replacement for
ExecMapper/ExecReducer (glue code to drive records through the pipeline)
- Operators: You might have to add specific operators for your framework

Planning:
- Define your own work classes (or reuse existing ones). These abstractly
encapsulate all input/meta info necessary to execute.
- Define your own *Compiler to translate either the logical plan or
physical plan to a graph of Tasks. This might include specific additional
optimizations.
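
As a rough illustration of the steps above, here is a minimal Java sketch of how a framework might plug in. Every name in it (Work, Task, TaskCompiler, ToyTask, ToyCompiler) is hypothetical and made up for this example; none of it is Hive's actual API, and the "graph of Tasks" is simplified to a plain list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: none of these types are Hive's real classes.
final class PluggableEngineSketch {

    // Planning: a "work" object abstractly holds the input/meta info needed to execute.
    static final class Work {
        final String name;
        final List<String> operators;

        Work(String name, List<String> operators) {
            this.name = name;
            this.operators = operators;
        }
    }

    // Exec layer: each framework defines its own Task classes.
    interface Task {
        String execute();
    }

    // Planning: each framework defines its own compiler from a plan to Tasks.
    interface TaskCompiler {
        List<Task> compile(List<Work> plan);
    }

    // Toy framework: the "glue code" just reports which operators it would drive.
    static final class ToyTask implements Task {
        private final Work work;

        ToyTask(Work work) {
            this.work = work;
        }

        @Override
        public String execute() {
            return "toy:" + work.name + work.operators;
        }
    }

    // Toy compiler: one Task per Work, in plan order (a chain, not a general graph).
    static final class ToyCompiler implements TaskCompiler {
        @Override
        public List<Task> compile(List<Work> plan) {
            List<Task> tasks = new ArrayList<>();
            for (Work work : plan) {
                tasks.add(new ToyTask(work));
            }
            return tasks;
        }
    }

    // The driver only sees the interfaces, so another framework can be swapped in.
    static List<String> run(TaskCompiler compiler, List<Work> plan) {
        List<String> results = new ArrayList<>();
        for (Task task : compiler.compile(plan)) {
            results.add(task.execute());
        }
        return results;
    }

    public static void main(String[] args) {
        List<Work> plan = Arrays.asList(
            new Work("scan", Arrays.asList("TableScan", "Filter")),
            new Work("agg", Arrays.asList("GroupBy")));
        for (String line : run(new ToyCompiler(), plan)) {
            System.out.println(line);
        }
    }
}
```

The point of the split is that the driver depends only on the two interfaces, so a second framework would supply its own Task and TaskCompiler implementations without touching the driver.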

Devil's in the details no doubt.

Thanks,
Gunther.
On Sat, Jul 20, 2013 at 8:10 AM, Edward Capriolo <[EMAIL PROTECTED]> wrote:

> I agree we are getting into a grey area with the term disruptive. For
> reference (I have not been doing this all the time, my bad) we are
> supposed to +1 and wait a day.
>
> >> I am not familiar with these other engines, but the short answer is that
> >> Tez is built to work on YARN, which works well for Hive since it is tied
> >> to Hadoop
>
> I understand what you are saying here; YARN support is a plus. However, the
> rest of the answer is something relevant to the discussion.
>
> There are already frameworks like spark that are semi popular.
>
> http://www.slideshare.net/jetlore/spark-and-shark-lightningfast-analytics-over-hadoop-and-hive-data
> There are also other frameworks like s4 http://incubator.apache.org/s4/, or
> storm.
>
> A big part of making a design decision is doing a competitive analysis.
> Usually asking yourself "What else for this is already out there?" or "Can
> this be done other ways?"
> I do want to be convinced we do not lock into tez too early with tunnel
> vision. Possibly we should be thinking on how to build hive in such a way
> that many different frameworks could plug in. In other words convincing
> that tez is the best choice, since many people are claiming an mrr type
> solution.
>
> I will watch the video you posted and study the material myself as well.
>
>
> On Wed, Jul 17, 2013 at 8:43 PM, Ashutosh Chauhan <[EMAIL PROTECTED]> wrote:
>
> > On Wed, Jul 17, 2013 at 1:41 PM, Edward Capriolo <[EMAIL PROTECTED]> wrote:
> >
> > >
> > > "In my opinion we should limit the amount of tez related optimizations to
> > > and trunk" Refactoring that cleans up code is good, but as you have pointed
> > > out there wont be a tez release until sometime this fall, and this branch
> > > will be open for an extended period of time. Thus code cleanups and other
> > > tez related refactoring does not need to be disruptive to trunk.
> >
> >
> > I agree Tez specific changes need not to go in trunk. But general
> > refactoring and code cleanup needs to happen on trunk as and when someone
> > is willing to work on those. We have to continually improve our code
> > quality. Code maintainability and readability is a priority. Without that
> > code quality suffers and discourages new contributors to contribute because
> > code is unnecessarily complicated. SemanticAnalyzer is 11K line class. We
> > need to simplify it. Patch like HIVE-4811 is a welcome change which tackled
> > it. Exec package is all convoluted which mixes up runtime operators and
> > drivers for runtime. Thats a welcome patch because it makes it much more
> > easy to read and reason about that piece of code. HIVE-4825 is another
> > example which improves modularity of code. For contributors who are