Sivaramakrishnan Narayana... 2012-11-19, 09:40
I'm a developer at Qubole (http://www.qubole.com) looking at Hadoop and Hive. In my past life, I was on the optimizer team of Greenplum Parallel Database. I'm a newbie to the Hive mailing list, so apologies for any missteps. I've done some searching in the Hive mailing list and JIRA and have not found any discussions around this topic - please feel free to redirect me to any old discussions I might've missed.
A class of queries we're interested in optimizing are top-k queries i.e. queries of the form:
(1) SELECT x, y from T order by z limit 10
You can imagine similar query with aggregates:
(2) SELECT x, y, count(*) as c from T group by x, y order by c desc limit 10
I'll continue my discussion with example (1) for simplicity. The way such a query is executed, every mapper sorts all rows from T and writes it to local files. Reducers (in this example, singular) read these files and merge them. These rows are fed to the limit operator which stops after 10 rows.
The change I'm proposing is a combination of Hive and Hadoop changes which will greatly improve the performance of such queries:
- New parameter map.sort.limitrecords which determines how many records each mapper in a job will send to every reducer
- When writing out local files after sorting, map-task stops after map.sort.limitrecords records for each reducer
- Effectively, each mapper sends out its top-K records
- Determining when the Top-K optimization is applicable and setting K in ReduceSinkDesc
- Passing the K value along to MapredWork
- ExecDriver sets map.sort.limitrecords before executing the job corresponding to the MapredWork
This change will reduce the amount of I/O that happens on the map-side (writing only 10 rows per reducer as opposed to entire table) and can have a big effect on performance. Furthermore, it is possible to make the sort on the mapper side a top-k sort which can further improve performance - but the deep pocket is really the I/O savings. In my experiments, I see a 5x performance improvement for such queries.
Please let me know if this is of general interest - I'll be happy to contribute this back to the community. I'll also be mailing the Hadoop mailing list about this.