Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive >> mail # user >> Tuning Triangle Joins on Hive


Copy link to this message
-
Re: Tuning Triangle Joins on Hive
On 7/31/14, 12:28 PM, Firas Abuzaid wrote:

This is an interesting use-case for a JOIN clause.

The main overhead is that the JOIN criteria is materializing JOINs onto
HDFS.

Neither hive indexes, nor ORC indexes will help you here, unfortunately.

With the latest hive-13, I tried rewriting this to get a performance
boost with a CTE (which will be streamed through in tez‚ but MR will
still write it as SequenceFiles to HDFS).

Roughly, you need to remove the implicit filter of r1.src < r2.src &&
r2.src < r3.src into an independent sub query.

explain
with r2_tmp as (
select r2.src as r2_src, r3.dst as r3_dst from r2 join r3 on (r2.dst =
r3.src) where r2.src < r3.src
)
select count(1) from r1 JOIN r2_tmp on (r1.dst = r2_src and r1.src = r3_dst)
where (r1.src < r2_src)

Vectorization+ORC will massively reduce the GC overhead for this as well
- but for all these you need a modern hive version.

Data organization can give a bigger performance boost as well, because
of how ORC will do run-length packing of the same valued columns. I'd
just sort/bucket on src for all of them.

You can see a similar organization for easier big table JOINs here -
https://github.com/t3rmin4t0r/nyc-taxi-bigdata/blob/master/ddl/orc.sql#L26

HTH.

Cheers,
Gopal

 
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB