Hive, mail # dev - non map-reduce for simple queries


Re: non map-reduce for simple queries
Namit Jain 2012-07-29, 14:45
This can be a follow-up to HIVE-2925.
Navis, if you want, I can work on it.
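
A rough sketch of how the fetch timeout suggested below might behave, in Python for illustration only (this is not Hive code). The thread does not say what should happen when the timeout expires; falling back to the regular map-reduce path is an assumption, and run_with_fetch_timeout, fetch and fallback are hypothetical names.

```python
import concurrent.futures


def run_with_fetch_timeout(fetch, fallback, timeout_sec):
    """Run fetch(); if it does not finish within timeout_sec, use fallback().

    Assumption: on timeout the query is re-run through the fallback path
    (e.g. a map-reduce job). The thread does not specify this behavior.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fetch)
    try:
        # Block for at most timeout_sec waiting for the direct fetch.
        return future.result(timeout=timeout_sec)
    except concurrent.futures.TimeoutError:
        # The fetch is too slow; give up on it and take the other path.
        return fallback()
    finally:
        # Do not wait for a still-running fetch before returning.
        pool.shutdown(wait=False)
```

Note that the abandoned fetch keeps running in the background in this sketch; a real implementation would also need a way to cancel it.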
On 7/29/12 7:58 PM, "Namit Jain" <[EMAIL PROTECTED]> wrote:

>I like Navis's idea. The timeout can be configurable.
>
>
>On 7/29/12 6:47 AM, "Navis류승우" <[EMAIL PROTECTED]> wrote:
>
>>I was thinking of a timeout for fetching, 2000 msec for example. How about
>>that?
>>
>>On Sunday, July 29, 2012, Edward Capriolo <[EMAIL PROTECTED]> wrote:
>>> If the where condition is too complex, selecting specific columns seems
>>> simple enough and useful.
>>>
>>> On Saturday, July 28, 2012, Namit Jain <[EMAIL PROTECTED]> wrote:
>>>> Currently, Hive does not launch map-reduce jobs for the following queries:
>>>>
>>>> select * from <T> where <condition on partition columns> (limit <n>)?
>>>>
>>>> This behavior is not configurable, and cannot be altered.
>>>>
>>>> HIVE-2925 wants to extend this behavior. The goal is not to spawn
>>>> map-reduce jobs for the following queries:
>>>>
>>>> Select <expr> from <T> where <any condition> (limit <n>)?
>>>>
>>>> It is currently controlled by one parameter,
>>>> hive.aggressive.fetch.task.conversion, which decides whether to spawn
>>>> map-reduce jobs for queries of the above type. Note that this can be
>>>> beneficial for certain types of queries, since it avoids the expensive
>>>> step of spawning map-reduce. However, it can be pretty expensive for
>>>> other types of queries: selecting a very large number of rows, a query
>>>> with a very selective filter (which is satisfied by a very small number
>>>> of rows, and therefore involves scanning a very large table), etc. The
>>>> user does not have any control over this. Note that it cannot be done by
>>>> hooks, since the pre-semantic hooks do not have enough information (type
>>>> of the query, inputs, etc.) and it is too late to do anything in the
>>>> post-semantic hook (the query plan has already been altered).
>>>>
>>>> I would like to propose the following configuration parameters to control
>>>> this behavior.
>>>> hive.fetch.task.conversion: true, false, auto
>>>>
>>>> If the value is true, then all queries with only selects and filters will
>>>> be converted.
>>>> If the value is false, then no query will be converted.
>>>> If the value is auto (which should be the default behavior), there should
>>>> be additional parameters to control the semantics.
>>>>
>>>> hive.fetch.task.auto.limit.threshold          ---> integer value X1
>>>> hive.fetch.task.auto.inputsize.threshold      ---> integer value X2
>>>>
>>>> If either the query has a limit lower than X1, or the input size is
>>>> smaller than X2, the queries containing only filters and selects will be
>>>> converted to not use map-reduce jobs.
>>>>
>>>>
>>>> Comments…
>>>>
>>>> -namit
>>>>
>>>>
>>>>
>>>
>
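
For reference, the conversion rule proposed in the original message can be sketched roughly as follows. This is an illustration in Python, not Hive code: the function and argument names are hypothetical, and the unit of the input size is left open because the proposal does not specify one. It only covers queries that already consist of selects and filters.

```python
from typing import Optional


def should_convert_to_fetch(mode: str,
                            limit: Optional[int],
                            input_size: int,
                            limit_threshold: int,
                            input_size_threshold: int) -> bool:
    """Decide whether a select/filter-only query should skip map-reduce.

    mode                 -- proposed hive.fetch.task.conversion value:
                            "true", "false" or "auto"
    limit                -- the query's LIMIT, or None if it has none
    input_size           -- size of the query input (unit is an assumption)
    limit_threshold      -- X1, hive.fetch.task.auto.limit.threshold
    input_size_threshold -- X2, hive.fetch.task.auto.inputsize.threshold
    """
    if mode == "true":
        return True
    if mode == "false":
        return False
    if mode != "auto":
        raise ValueError("mode must be 'true', 'false' or 'auto'")
    # auto: convert if the limit is below X1 or the input is smaller than X2.
    has_small_limit = limit is not None and limit < limit_threshold
    has_small_input = input_size < input_size_threshold
    return has_small_limit or has_small_input
```

For example, with X1 = 100 and X2 = 1000, an "auto" query with "limit 10" is converted regardless of input size, while a query with no limit over a large input still goes through map-reduce.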