Hive >> mail # dev >> non map-reduce for simple queries


Re: non map-reduce for simple queries
This can be a follow-up to HIVE-2925.
Navis, if you want, I can work on it.
On 7/29/12 7:58 PM, "Namit Jain" <[EMAIL PROTECTED]> wrote:

>I like Navis's idea. The timeout can be configurable.
>
>
>On 7/29/12 6:47 AM, "Navis류승우" <[EMAIL PROTECTED]> wrote:
>
>>I was thinking of timeout for fetching, 2000msec for example. How about
>>that?
>>
>>On Sunday, July 29, 2012, Edward Capriolo <[EMAIL PROTECTED]> wrote:
>>> If the where condition is too complex, selecting specific columns seems
>>> simple enough and useful.
>>>
>>> On Saturday, July 28, 2012, Namit Jain <[EMAIL PROTECTED]> wrote:
>>>> Currently, Hive does not launch map-reduce jobs for the following queries:
>>>>
>>>> select * from <T> where <condition on partition columns> (limit <n>)?
>>>>
>>>> This behavior is not configurable, and cannot be altered.
>>>>
>>>> HIVE-2925 wants to extend this behavior. The goal is not to spawn
>>>> map-reduce jobs for the following queries:
>>>>
>>>> Select <expr> from <T> where <any condition> (limit <n>)?
>>>>
>>>> It is currently controlled by one parameter,
>>>> hive.aggressive.fetch.task.conversion, which decides whether or not to
>>>> spawn map-reduce jobs for queries of the above type. Note that this can
>>>> be beneficial for certain types of queries, since it avoids the
>>>> expensive step of spawning map-reduce. However, it can be pretty
>>>> expensive for other types of queries: those selecting a very large
>>>> number of rows, those with a very selective filter (which is satisfied
>>>> by a very small number of rows, and therefore involves scanning a very
>>>> large table), etc. The user does not have any control over this. Note
>>>> that it cannot be done by hooks, since the pre-semantic hooks do not
>>>> have enough information (type of the query, inputs, etc.), and it is
>>>> too late to do anything in the post-semantic hook (the query plan has
>>>> already been altered).
>>>>
>>>> I would like to propose the following configuration parameters to control
>>>> this behavior.
>>>> hive.fetch.task.conversion: true, false, auto
>>>>
>>>> If the value is true, then all queries with only selects and filters will
>>>> be converted.
>>>> If the value is false, then no query will be converted.
>>>> If the value is auto (which should be the default behavior), there should
>>>> be additional parameters to control the semantics.
>>>>
>>>> hive.fetch.task.auto.limit.threshold          ---> integer value X1
>>>> hive.fetch.task.auto.inputsize.threshold      ---> integer value X2
>>>>
>>>> If either the query has a limit lower than X1, or the input size is
>>>> smaller than X2, the queries containing only filters and selects will be
>>>> converted to not use map-reduce jobs.
>>>>
>>>>
>>>> Comments…
>>>>
>>>> -namit
>>>>
>>>>
>>>>
>>>
>
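
For readers following the thread, the decision rule proposed in the original message (true / false / auto, with the X1 and X2 thresholds) can be sketched roughly as below. This is only an illustration of the proposal as worded, not Hive code: the function name, the FetchConversion enum, and the argument names are all made up here, and the unit of the input size (bytes is assumed) is not specified in the thread.

```python
from enum import Enum
from typing import Optional


class FetchConversion(Enum):
    """Hypothetical stand-in for the proposed hive.fetch.task.conversion values."""
    TRUE = "true"
    FALSE = "false"
    AUTO = "auto"


def should_skip_mapreduce(
    mode: FetchConversion,
    only_selects_and_filters: bool,
    query_limit: Optional[int],   # None when the query has no LIMIT clause
    input_size: int,              # assumed to be in the same unit as X2
    limit_threshold: int,         # X1: hive.fetch.task.auto.limit.threshold
    inputsize_threshold: int,     # X2: hive.fetch.task.auto.inputsize.threshold
) -> bool:
    """Return True if the query would be converted to run without map-reduce."""
    # The proposal only covers queries made of selects and filters.
    if not only_selects_and_filters:
        return False
    if mode is FetchConversion.FALSE:
        return False
    if mode is FetchConversion.TRUE:
        return True
    # AUTO: convert if the limit is below X1 OR the input is smaller than X2.
    has_small_limit = query_limit is not None and query_limit < limit_threshold
    has_small_input = input_size < inputsize_threshold
    return has_small_limit or has_small_input
```

In auto mode a query with no limit over a large input would still go through map-reduce, which is the expensive case the original message is worried about.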
