Sivaramakrishnan Narayana... 2012-11-19, 09:47
Sivaramakrishnan Narayana... 2012-11-19, 09:40
-Re: Top-K optimization
Namit Jain 2012-11-19, 09:46
Take a look at https://issues.apache.org/jira/browse/HIVE-3562.
It is in my todo list, but I have not been able to review this.
I think, this addresses a very similar problem. If yes, can you also
above patch ?
On 11/19/12 3:10 PM, "Sivaramakrishnan Narayanan" <[EMAIL PROTECTED]>
>I'm a developer at Qubole (http://www.qubole.com) looking at Hadoop and
>Hive. In my past life, I was on the optimizer team of Greenplum Parallel
>Database. I'm a newbie to the Hive mailing list, so apologies for any
>missteps. I've done some searching in the Hive mailing list and JIRA and
>have not found any discussions around this topic - please feel free to
>redirect me to any old discussions I might've missed.
>A class of queries we're interested in optimizing are top-k queries i.e.
>queries of the form:
>(1) SELECT x, y from T order by z limit 10
>You can imagine similar query with aggregates:
>(2) SELECT x, y, count(*) as c from T group by x, y order by c desc limit
>I'll continue my discussion with example (1) for simplicity. The way such
>a query is executed, every mapper sorts all rows from T and writes it to
>local files. Reducers (in this example, singular) read these files and
>merge them. These rows are fed to the limit operator which stops after 10
>The change I'm proposing is a combination of Hive and Hadoop changes
>which will greatly improve the performance of such queries:
> - New parameter map.sort.limitrecords which determines how many records
>each mapper in a job will send to every reducer
> - When writing out local files after sorting, map-task stops after
>map.sort.limitrecords records for each reducer
> - Effectively, each mapper sends out its top-K records
> - Determining when the Top-K optimization is applicable and setting K in
> - Passing the K value along to MapredWork
> - ExecDriver sets map.sort.limitrecords before executing the job
>corresponding to the MapredWork
>This change will reduce the amount of I/O that happens on the map-side
>(writing only 10 rows per reducer as opposed to entire table) and can
>have a big effect on performance. Furthermore, it is possible to make the
>sort on the mapper side a top-k sort which can further improve
>performance - but the deep pocket is really the I/O savings. In my
>experiments, I see a 5x performance improvement for such queries.
>Please let me know if this is of general interest - I'll be happy to
>contribute this back to the community. I'll also be mailing the Hadoop
>mailing list about this.