Hive >> mail # dev >> Top-K optimization


Re: Top-K optimization
Hi Siva,
Take a look at https://issues.apache.org/jira/browse/HIVE-3562.

It is on my todo list, but I have not been able to review it yet.

I think this addresses a very similar problem. If so, can you also
review the above patch?
Thanks,
-namit
On 11/19/12 3:10 PM, "Sivaramakrishnan Narayanan" <[EMAIL PROTECTED]>
wrote:

>Hi All,
>
>I'm a developer at Qubole (http://www.qubole.com) looking at Hadoop and
>Hive. In my past life, I was on the optimizer team of Greenplum Parallel
>Database. I'm a newbie to the Hive mailing list, so apologies for any
>missteps. I've done some searching in the Hive mailing list and JIRA and
>have not found any discussions around this topic - please feel free to
>redirect me to any old discussions I might've missed.
>
>One class of queries we're interested in optimizing is top-k queries,
>i.e. queries of the form:
>
>(1) SELECT x, y from T order by z limit 10
>
>You can imagine a similar query with aggregates:
>
>(2) SELECT x, y, count(*) as c from T group by x, y order by c desc limit
>10
>
>I'll continue my discussion with example (1) for simplicity. The way such
>a query is executed today, every mapper sorts all rows from T and writes
>them to local files. The reducers (in this example, a single one) read
>these files and merge them. The merged rows are fed to the limit
>operator, which stops after 10 rows.
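The execution path described above can be modeled in a few lines. The row data and three-mapper split below are invented purely for illustration:

```python
import heapq

# Toy model of the current execution path for
#   SELECT x, y FROM T ORDER BY z LIMIT 10
# Each "mapper" fully sorts its rows; the single "reducer" merges the
# sorted runs, and the limit operator stops after K rows.
K = 10
mapper_inputs = [
    [(z, f"row-{m}-{i}") for i, z in enumerate(range(m, 100, 3))]
    for m in range(3)  # three hypothetical mappers
]

# Map side today: every mapper sorts *all* of its rows and writes them out.
sorted_runs = [sorted(rows) for rows in mapper_inputs]

# Reduce side: merge the sorted runs, then apply the limit.
top_k = []
for row in heapq.merge(*sorted_runs):
    top_k.append(row)
    if len(top_k) == K:
        break
```

The cost the mail points at is that sorted_runs holds (and, in real MapReduce, spills to disk) every input row, even though only K of them survive the limit.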
>
>The change I'm proposing is a combination of Hive and Hadoop changes
>which will greatly improve the performance of such queries:
>
>Hadoop change:
> - A new parameter, map.sort.limitrecords, which determines how many
>records each mapper in a job will send to every reducer
> - When writing out local files after sorting, the map task stops after
>map.sort.limitrecords records for each reducer
> - Effectively, each mapper sends out its top-K records
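The bullets above boil down to truncating each mapper's sorted output to K records per reducer. A minimal sketch, where only the map.sort.limitrecords name comes from the proposal and the modulo partitioner and data are invented:

```python
import heapq

MAP_SORT_LIMIT_RECORDS = 10  # the proposed per-reducer cap, K

def truncated_map_output(rows, num_reducers, k):
    """Partition rows across reducers, then keep only each partition's
    top-k rows instead of the full sorted partition.
    Uses a toy key % num_reducers partitioner for illustration."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in rows:
        partitions[key % num_reducers].append((key, value))
    # heapq.nsmallest is a top-k sort: it sorts and truncates in one pass.
    return [heapq.nsmallest(k, part) for part in partitions]
```

The truncation is safe for the final answer: any record a mapper drops for a given reducer is beaten by k records from that same mapper headed to the same reducer, so each reducer still sees its true top-k.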
>
>Hive change:
> - Determining when the Top-K optimization is applicable and setting K in
>ReduceSinkDesc
> - Passing the K value along to MapredWork
> - ExecDriver sets map.sort.limitrecords before executing the job
>corresponding to the MapredWork
>
>This change will reduce the amount of I/O that happens on the map side
>(writing only 10 rows per reducer as opposed to the entire table) and can
>have a big effect on performance. Furthermore, it is possible to make the
>map-side sort a top-k sort, which can further improve performance - but
>the deep pocket is really the I/O savings. In my experiments, I see a 5x
>performance improvement for such queries.
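A back-of-envelope on where that saving comes from, using made-up cluster numbers (100 mappers over a billion-row table, one reducer; only K = 10 comes from the example query):

```python
# All figures here are hypothetical; only k = 10 comes from the example.
num_mappers = 100
table_rows = 1_000_000_000
k = 10

rows_written_before = table_rows // num_mappers  # full sorted run per mapper
rows_written_after = k                           # only the top-K per reducer
reduction = rows_written_before // rows_written_after
```

The raw reduction in map-side writes is far larger than the reported 5x end-to-end speedup, which is expected: map output is only one component of total job cost.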
>
>Please let me know if this is of general interest - I'll be happy to
>contribute this back to the community. I'll also be mailing the Hadoop
>mailing list about this.
>
>Thanks
>Siva