Pig, mail # user - How to improve the performs of PIG Join

byambajargal 2011-04-17, 13:03
Thejas M Nair 2011-04-18, 23:33
Re: How to improve the performs of PIG Join
Thejas M Nair 2011-04-19, 15:41
Here is the (theoretical) rule of thumb for replicated join:
For replicated join to perform significantly better than default join, the size of the replicated input should be smaller than the block size (or smaller than pig.maxCombinedSplitSize, if the property pig.splitCombination=true and pig.maxCombinedSplitSize is larger than the block size).

This is because the number of map tasks started is equal to the number of blocks (or size/pig.maxCombinedSplitSize) in the left-side input of the replicated join. Each of these map tasks will read the entire replicated input. If the replicated input is a few times larger than the block size, using replicated join will not save on IO/(de)serialization costs.
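
The rule of thumb above can be sketched as a small estimate. This is only a back-of-the-envelope helper, not anything Pig provides: the function names are hypothetical, and the 64 MB split size and the binary units are assumptions chosen for illustration.

```python
MB = 1024 ** 2
GB = 1024 ** 3

def estimate_map_tasks(left_bytes, split_bytes):
    """Roughly one map task per block (or combined split) of the left input."""
    return -(-left_bytes // split_bytes)  # integer ceiling division

def replicated_bytes_read(left_bytes, replicated_bytes, split_bytes):
    """Every map task reads the whole replicated input once."""
    return estimate_map_tasks(left_bytes, split_bytes) * replicated_bytes

def passes_rule_of_thumb(replicated_bytes, split_bytes):
    """True when the replicated input is smaller than one block/split."""
    return replicated_bytes < split_bytes

# Assumed figures: a 17 GB left input, a 691 MB replicated input, 64 MB splits.
left, replicated, split = 17 * GB, 691 * MB, 64 * MB
print(estimate_map_tasks(left, split))
print(replicated_bytes_read(left, replicated, split) / GB)
print(passes_rule_of_thumb(replicated, split))
```

With these assumed sizes the replicated input is many times larger than one split, so the check fails and the total replicated data read far exceeds the left input itself.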


On 4/18/11 4:33 PM, "Thejas M Nair" <[EMAIL PROTECTED]> wrote:

For default join (hash join) -
- Increasing the parallelism of the default join should speed it up.
- Put the table which has the largest number of tuples per key as the last
table in the join. (Yes, this happens to be the opposite of the
recommendation for replicated join!) See -
- http://pig.apache.org/docs/r0.8.0/cookbook.html#Project+Early+and+Often

For replicated join -
- I believe the reason why replicated join is performing worse than default
join is the large number of maps and the large size of the replicated
file. Each map task ends up reading and deserializing the replicated file
(obs_relation.txt), and usually that takes the bulk of the runtime. In this
case (691MB x 266 maps =~) 183GB of replicated input data will be read and
deserialized by all the map tasks combined. This is actually very large
compared to the size of the larger input (17GB).
To reduce the number of maps, you can use the feature introduced in
https://issues.apache.org/jira/browse/PIG-1518 : ensure that you have the
properties pig.splitCombination=true and pig.maxCombinedSplitSize=X, where
X = size_of_obr_pm_annotation.txt/number-of-map-slots . This will ensure
that all cluster slots are used and you don't have too many map tasks.
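
As a worked version of the arithmetic above, here is a minimal sketch. The 691MB, 266 maps and 17GB figures come from this thread; the count of 66 map slots (11 nodes x 6 cores, one slot per core) is purely an assumption, as are the helper names and the use of binary units.

```python
def replicated_read_mb(replicated_mb, map_tasks):
    """Total MB of replicated input read when each map task reads it once."""
    return replicated_mb * map_tasks

def max_combined_split_size(left_input_bytes, map_slots):
    """X = size of the larger input / number of map slots, rounded up
    so that the input divides into at most `map_slots` splits."""
    return -(-left_input_bytes // map_slots)

# Figures quoted in the thread.
total_mb = replicated_read_mb(691, 266)
print(total_mb)  # MB read across all maps, roughly the 183GB mentioned above

# Hypothetical cluster capacity: 66 map slots (an assumption, not from the thread).
left_input_bytes = 17 * 1024 ** 3
x = max_combined_split_size(left_input_bytes, 66)
print(x)  # candidate value for pig.maxCombinedSplitSize, in bytes

# With about one map per slot, far less replicated data is read in total.
print(replicated_read_mb(691, 66))
```

The actual number of maps may still differ somewhat from the slot count, since how splits get combined depends on more than their size.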


On 4/17/11 6:03 AM, "byambajargal" <[EMAIL PROTECTED]> wrote:

> Hello ...
> I have a cluster with 11 nodes, each of which has 16 GB RAM, a 6-core CPU
> and a 1 TB HDD, and I use the Cloudera distribution CDH3B4 with Pig. I have
> two Pig Join queries, which are a Parallel and a Replicated version of the
> Pig Join. Theoretically, Replicated Join could be faster than Parallel join,
> but in my case Parallel is faster.
> I am wondering why the replicated join is so slow. I want to improve
> the performance of both queries. Could you check the details of the queries?
> Thanks,
> Byambajargal
> ANNO = load '/datastorm/task3/obr_pm_annotation.txt'
>     using PigStorage(',') AS (element_id:long, concept_id:long);
> REL = load '/datastorm/task3/obs_relation.txt'
>     using PigStorage(',') AS (id:long, concept_id:long, parent_concept_id:long);
> ISA_ANNO = join ANNO by concept_id, REL by concept_id PARALLEL 10;
> ISA_ANNO_T = GROUP ISA_ANNO ALL;
> ISA_ANNO_C = foreach ISA_ANNO_T generate COUNT($1);
> dump ISA_ANNO_C;
>
> HadoopVersion   PigVersion     UserId   StartedAt             FinishedAt            Features
> 0.20.2-CDH3B4   0.8.0-CDH3B4   haisen   2011-04-15 10:31:36   2011-04-15 10:43:22   HASH_JOIN,GROUP_BY
>
> Success!
>
> Job Stats (time in seconds):
> JobId                   Maps  Reduces  MaxMapTime  MinMapTIme  AvgMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  Alias  Feature  Outputs
> job_201103122121_0084   277   10       15          5           11          417            351            379            ANNO,ISA_ANNO,
> job_201103122121_0085   631   1        10          5           7           242            242            242            ISA_ANNO_C,ISA_ANNO_T  hdfs://haisen11:54310/tmp/temp281466632/tmp-171526868,