Re: Why my tests shows Yarn is worse than MRv1 for terasort?
Unfortunately yes.  I agree that it should be documented.  The only way to
compare fairly is to use the same terasort jar against both.  When I've
done comparisons, I've used the MR2 examples jar against both MR1 and MR2.
This works on CDH because it has a version of MR1 that's compatible with
MR2, but may require some recompilation if using the Apache releases.
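
For concreteness, running the identical examples jar against each cluster
looks roughly like the sketch below; the jar name, row count, and HDFS paths
are placeholders rather than values taken from this thread:

  # Generate 10 billion 100-byte rows (~1 TB), then sort them, using the MR2
  # examples jar; point --config (or HADOOP_CONF_DIR) at the MR1 cluster and
  # then at the MR2 cluster to run the same jar against both.
  hadoop jar hadoop-mapreduce-examples.jar teragen 10000000000 /user/test/tera-in
  hadoop jar hadoop-mapreduce-examples.jar terasort /user/test/tera-in /user/test/tera-out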
On Wed, Oct 23, 2013 at 5:01 PM, Jian Fang <[EMAIL PROTECTED]>wrote:

> Really? Does that mean I cannot compare terasort in MR2 with MR1
> directly?  If yes, this really should be documented.
>
> Thanks.
>
>
> On Wed, Oct 23, 2013 at 4:54 PM, Sandy Ryza <[EMAIL PROTECTED]>wrote:
>
>> I should have brought this up earlier - the Terasort benchmark
>> requirements changed recently to make data less compressible.  The MR2
>> version of Terasort has this change, but the MR1 version does not. So
>> Snappy should be working fine, but the data that MR2 is using is less
>> compressible.
>>
>>
>> On Wed, Oct 23, 2013 at 4:48 PM, Jian Fang <[EMAIL PROTECTED]
>> > wrote:
>>
>>> Ok, I think I know where the problem is now. I used snappy to compress
>>> map output.
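
For reference, the exact compression settings for these runs aren't shown in
the thread, but Snappy map-output compression is normally enabled with
properties along these lines (the property names differ between the two
versions; the command below is only a sketch with placeholder paths):

  # MR1 property names (assumed, not taken from this job's configuration):
  #   mapred.compress.map.output=true
  #   mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
  # MR2 equivalents, passed as generic options to the examples jar:
  hadoop jar hadoop-mapreduce-examples.jar terasort \
    -Dmapreduce.map.output.compress=true \
    -Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
    /user/test/tera-in /user/test/tera-out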
>>>
>>> Calculate the compression ratio = "Map output materialized bytes"/"Map
>>> output bytes"
>>>
>>> MR2
>>> 440519309909/1020000000000 = 0.431881676
>>>
>>> MR1
>>> 240948272514/1000000000000 = 0.240948273
>>>
>>>
>>> It seems something is wrong with Snappy in my Hadoop 2 and, as a result,
>>> MR2 needs to fetch much more data during the shuffle phase. Any way to check
>>> Snappy in MR2?
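
Assuming the Hadoop 2 build ships the native-library checker, one quick way
to verify is:

  # Reports whether the native hadoop, zlib, snappy, lz4 and bzip2 libraries
  # can actually be loaded on this node.
  hadoop checknative -a

If snappy is reported as false there, the native library is not being picked
up on that node.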
>>>
>>>
>>>
>>>
>>> On Wed, Oct 23, 2013 at 4:26 PM, Jian Fang <
>>> [EMAIL PROTECTED]> wrote:
>>>
>>>> Looked at the results for MR1 and MR2. Reduce input groups for MR1 is
>>>> 4,294,967,296, but 10,000,000,000 for MR2.
>>>> That is to say the number of unique keys fed into the reducers is
>>>> 10000000000 in MR2, which is really weird. Any problem in the terasort code?
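
Worth noting about those two figures: 4,294,967,296 is exactly 2^32, so the
MR1 run saw 2^32 distinct keys, while the MR2 run's 10,000,000,000 keys were
essentially all unique. That would be consistent with the point above that
the newer data generator produces more random (and therefore less
compressible) data, rather than with a bug in the terasort code.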
>>>>
>>>>
>>>> On Wed, Oct 23, 2013 at 2:16 PM, Jian Fang <
>>>> [EMAIL PROTECTED]> wrote:
>>>>
>>>>> Reducing the number of map containers to 8 slightly improved the total
>>>>> time to 84 minutes. Here is the output. Also, from the log, there is no
>>>>> clear reason why the containers are killed other than messages such as
>>>>> "Container killed by the
>>>>> ApplicationMaster../container_1382237301855_0001_01_000001/syslog:Container
>>>>> killed on request. Exit code is 143"
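
(A side note on that message: exit code 143 is 128 + 15, i.e. the container
received a SIGTERM, and "Container killed by the ApplicationMaster" together
with "Container killed on request" is the normal signature of the AM
deliberately shutting a container down, e.g. a killed speculative or surplus
attempt, rather than a crash.)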
>>>>>
>>>>>
>>>>> 2013-10-23 21:09:34,325 INFO org.apache.hadoop.mapreduce.Job (main):
>>>>> Counters: 46
>>>>>         File System Counters
>>>>>                 FILE: Number of bytes read=455484809066
>>>>>                 FILE: Number of bytes written=896642344343
>>>>>
>>>>>                 FILE: Number of read operations=0
>>>>>                 FILE: Number of large read operations=0
>>>>>                 FILE: Number of write operations=0
>>>>>                 HDFS: Number of bytes read=1000000841624
>>>>>
>>>>>                 HDFS: Number of bytes written=1000000000000
>>>>>                 HDFS: Number of read operations=25531
>>>>>
>>>>>                 HDFS: Number of large read operations=0
>>>>>                 HDFS: Number of write operations=150
>>>>>
>>>>>         Job Counters
>>>>>                 Killed map tasks=1
>>>>>                 Killed reduce tasks=11
>>>>>                 Launched map tasks=7449
>>>>>                 Launched reduce tasks=86
>>>>>                 Data-local map tasks=7434
>>>>>                 Rack-local map tasks=15
>>>>>                 Total time spent by all maps in occupied slots (ms)=1030941232
>>>>>                 Total time spent by all reduces in occupied slots (ms)=1574732272
>>>>>
>>>>>         Map-Reduce Framework
>>>>>                 Map input records=10000000000
>>>>>                 Map output records=10000000000
>>>>>                 Map output bytes=1020000000000
>>>>>                 Map output materialized bytes=440519309909
>>>>>                 Input split bytes=841624