Hive >> mail # user >> why need to copy when run a sql with a single map


Daniel,Wu 2011-08-10, 12:07
Re: why need to copy when run a sql with a single map
Hi
  Hive queries are compiled into Hadoop MapReduce jobs. In a MapReduce job, two phases sit between the map and reduce tasks: the copy phase and the sort phase, together known as the shuffle and sort phase. So the "copy" shown for your Hive job should be the copy phase of MapReduce. It copies the map output from the map task nodes to the corresponding reduce task nodes; it is not a copy of the input file on HDFS.
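
To make the phases concrete, here is a minimal Python sketch of how a GROUP BY count flows through map, copy, sort and reduce. It is only an illustration, not Hadoop or Hive code: all function names are hypothetical, the rows are made-up sample data, and the partitioning rule (hash of the key modulo the number of reducers) is an assumption that mirrors the usual default.

```python
from collections import defaultdict
from itertools import groupby


def map_phase(rows):
    """Map task: emit a (retailer_key, 1) pair for every input row."""
    return [(row["retailer_key"], 1) for row in rows]


def copy_phase(map_outputs, num_reducers):
    """Copy (shuffle) phase: move each map output pair to the reducer
    that owns its key. This happens even with one map and one reducer."""
    partitions = defaultdict(list)
    for output in map_outputs:
        for key, value in output:
            # Assumed partitioning rule: hash(key) modulo reducer count.
            partitions[hash(key) % num_reducers].append((key, value))
    return partitions


def sort_phase(pairs):
    """Sort phase: order the fetched pairs by key so equal keys are adjacent."""
    return sorted(pairs, key=lambda kv: kv[0])


def reduce_phase(sorted_pairs):
    """Reduce phase: sum the counts for each key."""
    return {
        key: sum(value for _, value in group)
        for key, group in groupby(sorted_pairs, key=lambda kv: kv[0])
    }


def run_group_by_count(rows, num_reducers=1):
    """Simulate: select retailer_key, count(*) ... group by retailer_key."""
    map_outputs = [map_phase(rows)]  # a single map task
    partitions = copy_phase(map_outputs, num_reducers)
    result = {}
    for pairs in partitions.values():
        result.update(reduce_phase(sort_phase(pairs)))
    return result


if __name__ == "__main__":
    sample_rows = [{"retailer_key": 7}, {"retailer_key": 3}, {"retailer_key": 7}]
    print(run_group_by_count(sample_rows))
```

Even with a single map task, the reducer still has to fetch the map output before it can sort and reduce it, which is why a copy phase shows up in the job report.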

Regards
Bejoy K S

-----Original Message-----
From: "Daniel,Wu" <[EMAIL PROTECTED]>
Date: Wed, 10 Aug 2011 20:07:48
To: hive<[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
Subject: why need to copy when run a sql with a single map

I ran a single query:

select retailer_key,count(*) from records group by retailer_key;

It uses a single map task, as shown below. Since the file is already on HDFS, I would expect Hadoop/Hive not to need to copy anything.
Kind    % Complete  Num Tasks  Pending  Running  Complete  Killed  Failed/Killed Task Attempts
map     100.00%     1          0        0        1         0       0 / 0
reduce  100.00%     1          0        0        1         0       0 / 0

But the final chart in the job report shows that "copy" takes about 33% of the total time, and the rest is "sort" and "reduce". So why does it need to copy here, or does "copy" mean something else?
 oracle@oracle-MS-7623:~/test$ hadoop fs -lsr /

drwxr-xr-x   - oracle supergroup          0 2011-08-10 19:46 /user
drwxr-xr-x   - oracle supergroup          0 2011-08-10 19:46 /user/hive
drwxr-xr-x   - oracle supergroup          0 2011-08-10 19:59 /user/hive/warehouse
drwxr-xr-x   - oracle supergroup          0 2011-08-10 19:59 /user/hive/warehouse/records
-rw-r--r--   1 oracle supergroup   41600256 2011-08-10 19:59 /user/hive/warehouse/records/test.txt

Kai Ju Liu 2011-08-10, 19:02