Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive >> mail # user >> Loading data into partition taking seven times total of (map+reduce) on highly skewed data


Copy link to this message
-
Loading data into partition taking seven times total of (map+reduce) on highly skewed data
We have a small (3GB /280M rows) table with 435 partitions that is highly
skewed:  one partition has nearly 200M, two others have nearly 40M apiece,
then the remaining 432 have all together less than 1% of total table size.

So .. the skew is something to be addressed.  However - even give that -
why would the following occur?
Table Structure:

     # Partition Information
# col_name             data_type           comment
 derived_create_dt   string               None

# Detailed Table Information
 ..
Protect Mode:       None
Retention:           0
 ..
Table Type:         MANAGED_TABLE
Table Parameters:
 SORTBUCKETCOLSPREFIX TRUE
transient_lastDdlTime 1379678551

# Storage Information
SerDe Library:       org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe
 InputFormat:         org.apache.hadoop.hive.ql.io.RCFileInputFormat
OutputFormat:       org.apache.hadoop.hive.ql.io.RCFileOutputFormat
 Compressed:         No
Num Buckets:         64
 Bucket Columns:     [station_id]
Sort Columns:       [Order(col:station_id, order:1)]
 Storage Desc Params:
serialization.format 1

HIGHLY SKEWED data:  although
This particular load:
    300M rows
     4GB
    435 partitions
       Over 99% of data in just 3 out of the 435 partitons
        2013-09-18 26733990
      2013-09-19 191634067
      2013-09-20 63790065

Map takes 10 min
Reduce 13 mins
Loading into partitions takes 3 hours 27 minutes
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB