Hive >> mail # user >> Loading data into partition taking seven times total of (map+reduce) on highly skewed data


Re: Loading data into partition taking seven times total of (map+reduce) on highly skewed data
Another detail: ~400 mappers, 64 reducers
2013/9/20 Stephen Boesch <[EMAIL PROTECTED]>

>
> We have a small (3GB /280M rows) table with 435 partitions that is highly
> skewed:  one partition has nearly 200M, two others have nearly 40M apiece,
> then the remaining 432 have all together less than 1% of total table size.
>
> So .. the skew is something to be addressed.  However, even given that,
> why would the following occur?
>
>
> Table Structure:
>
> # Partition Information
> # col_name            data_type    comment
> derived_create_dt     string       None
>
> # Detailed Table Information
> ..
> Protect Mode:         None
> Retention:            0
> ..
> Table Type:           MANAGED_TABLE
> Table Parameters:
>     SORTBUCKETCOLSPREFIX    TRUE
>     transient_lastDdlTime   1379678551
>
> # Storage Information
> SerDe Library:        org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe
> InputFormat:          org.apache.hadoop.hive.ql.io.RCFileInputFormat
> OutputFormat:         org.apache.hadoop.hive.ql.io.RCFileOutputFormat
> Compressed:           No
> Num Buckets:          64
> Bucket Columns:       [station_id]
> Sort Columns:         [Order(col:station_id, order:1)]
> Storage Desc Params:
>     serialization.format    1
>
> HIGHLY SKEWED data:  although
> This particular load:
>     300M rows
>     4GB
>     435 partitions
>     Over 99% of data in just 3 out of the 435 partitions:
>         2013-09-18   26733990
>         2013-09-19  191634067
>         2013-09-20   63790065
>
>
>
> Map takes 10 min
> Reduce takes 13 min
> Loading into partitions takes 3 hours 27 minutes
>
>
>
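
To put numbers on the skew and timings quoted above, here is a small Python sketch that only redoes the arithmetic from the message. The function names are made up for illustration. The file-count figure is an upper bound that assumes every partition ends up with one file per bucket, which the message does not confirm.

```python
# Row counts for the three large partitions, copied from the quoted message.
top_partitions = {
    "2013-09-18": 26_733_990,
    "2013-09-19": 191_634_067,
    "2013-09-20": 63_790_065,
}


def shares(counts):
    """Return each partition's fraction of the rows in `counts`."""
    total = sum(counts.values())
    return {name: rows / total for name, rows in counts.items()}


def load_to_compute_ratio(map_min, reduce_min, load_min):
    """Ratio of partition-load time to combined map + reduce time."""
    return load_min / (map_min + reduce_min)


top_total = sum(top_partitions.values())

# Timings from the message: map 10 min, reduce 13 min, load 3 h 27 min.
ratio = load_to_compute_ratio(10, 13, 3 * 60 + 27)

# ASSUMPTION: one output file per bucket in every partition (upper bound only).
max_bucket_files = 435 * 64

if __name__ == "__main__":
    print("rows in the three large partitions:", top_total)
    for name, frac in shares(top_partitions).items():
        print(f"  {name}: {frac:.1%} of those rows")
    print(f"load time / (map + reduce) time: {ratio:.1f}x")
    print("upper bound on bucket files:", max_bucket_files)
```

With the timings as quoted in the body (23 minutes of map plus reduce against 207 minutes of loading), the ratio works out to about nine.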