
Re: Tez branch and tez based patches
I have finally gotten access to wiki and added the design doc:

I've also added links to it from the jira and in general overhauled the
design. Please let me know if you feel there's still stuff missing from the

>> Possibly we should be thinking on how to build hive in such a way
>> that many different frameworks could plug in.

I believe that the proposed design and refactoring put you on that path.
I'm not introducing layer upon layer of abstraction without a specific use
case in mind, but at a high level you would go through similar steps:

Exec layer:
- Define your own Task classes
- If you can reuse the operator pipeline, define your own replacement for
ExecMapper/ExecReducer (glue code to drive records through the pipeline)
- Operators: you might have to add specific operators for your framework

- Define your own work classes (or reuse existing ones). These abstractly
encapsulate all input/meta info necessary to execute.
- Define your own *Compiler to translate either the logical plan or
physical plan to a graph of Tasks. This might include specific additional

Devil's in the details no doubt.
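
To make the steps above a bit more concrete, here is a minimal Java sketch of
how the pieces could fit together. Every name in it (Work, Task, PlanCompiler,
EchoTask, EchoCompiler) is hypothetical and chosen for illustration only; none
of this is Hive's actual API, and a real compiler would produce a graph of
tasks rather than the flat list used here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PluggableEngineSketch {

    // Hypothetical "work" class: carries the input/meta info a task needs.
    static final class Work {
        final String name;
        final List<String> operators;

        Work(String name, List<String> operators) {
            this.name = name;
            this.operators = operators;
        }
    }

    // Hypothetical task abstraction: one unit the framework executes.
    interface Task {
        String execute();
    }

    // Hypothetical compiler: translates a plan into tasks.
    // Assumption: a flat, ordered list stands in for the task graph.
    interface PlanCompiler {
        List<Task> compile(List<Work> plan);
    }

    // Toy task for an imaginary engine: it only reports what it would run.
    static final class EchoTask implements Task {
        private final String engine;
        private final Work work;

        EchoTask(String engine, Work work) {
            this.engine = engine;
            this.work = work;
        }

        @Override
        public String execute() {
            return engine + ":" + work.name + "[" + String.join("->", work.operators) + "]";
        }
    }

    // Toy compiler for the same imaginary engine: one task per work unit.
    static final class EchoCompiler implements PlanCompiler {
        private final String engine;

        EchoCompiler(String engine) {
            this.engine = engine;
        }

        @Override
        public List<Task> compile(List<Work> plan) {
            List<Task> tasks = new ArrayList<>();
            for (Work w : plan) {
                tasks.add(new EchoTask(engine, w));
            }
            return tasks;
        }
    }

    // Driver: the caller only sees the interfaces, so engines can be swapped.
    static List<String> run(PlanCompiler compiler, List<Work> plan) {
        List<String> results = new ArrayList<>();
        for (Task t : compiler.compile(plan)) {
            results.add(t.execute());
        }
        return results;
    }

    public static void main(String[] args) {
        List<Work> plan = Arrays.asList(
                new Work("scan", Arrays.asList("TS", "FIL")),
                new Work("agg", Arrays.asList("GBY", "FS")));
        for (String line : run(new EchoCompiler("engineA"), plan)) {
            System.out.println(line);
        }
    }
}
```

The point of the sketch is only that the driver depends on the Task and
PlanCompiler interfaces, so a different framework would supply its own
implementations of both without touching the caller.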

On Sat, Jul 20, 2013 at 8:10 AM, Edward Capriolo <[EMAIL PROTECTED]> wrote:

> I agree we are getting into a grey area with the term "disruptive". For
> reference (I have not been doing this all the time, my bad), we are
> supposed to +1 and wait a day.
> >> I am not familiar with these other engines, but the short answer is that
> >> Tez is built to work on YARN, which works well for Hive since it is tied
> >> to Hadoop
> I understand what you are saying here: YARN support is a plus. However, the
> rest of the answer is something relevant to the discussion.
> There are already frameworks like Spark that are semi-popular:
> http://www.slideshare.net/jetlore/spark-and-shark-lightningfast-analytics-over-hadoop-and-hive-data
> There are also other frameworks like S4 (http://incubator.apache.org/s4/) or
> Storm.
> A big part of making a design decision is doing a competitive analysis.
> Usually asking yourself "What else for this is already out there?" or "Can
> this be done other ways?"
> I do want to be convinced that we are not locking into Tez too early with
> tunnel vision. Possibly we should be thinking on how to build hive in such a
> way that many different frameworks could plug in. In other words, I want to
> be convinced that Tez is the best choice, since many people are claiming an
> MRR-type solution.
> I will watch the video you posted and study the material myself as well.
> On Wed, Jul 17, 2013 at 8:43 PM, Ashutosh Chauhan <[EMAIL PROTECTED]> wrote:
> > On Wed, Jul 17, 2013 at 1:41 PM, Edward Capriolo <[EMAIL PROTECTED]> wrote:
> >
> > >
> > > "In my opinion we should limit the amount of tez related optimizations to
> > > and trunk" Refactoring that cleans up code is good, but as you have pointed
> > > out there won't be a Tez release until sometime this fall, and this branch
> > > will be open for an extended period of time. Thus code cleanups and other
> > > Tez-related refactoring do not need to be disruptive to trunk.
> >
> >
> > I agree Tez-specific changes need not go into trunk. But general
> > refactoring and code cleanup need to happen on trunk as and when someone
> > is willing to work on them. We have to continually improve our code
> > quality. Code maintainability and readability are a priority. Without them,
> > code quality suffers and new contributors are discouraged from contributing
> > because the code is unnecessarily complicated. SemanticAnalyzer is an
> > 11K-line class. We need to simplify it. A patch like HIVE-4811 is a welcome
> > change that tackled it. The exec package is convoluted: it mixes up runtime
> > operators and drivers for the runtime. That's a welcome patch because it
> > makes that piece of code much easier to read and reason about. HIVE-4825 is
> > another example that improves the modularity of the code. For contributors who are