Hadoop >> mail # user >> TeraSort question.
Re: TeraSort question.
Raj,

Have a look at the graph shown here:
http://cs.smith.edu/dftwiki/index.php/Hadoop_Tutorial_1.1_--_Generating_Task_Timelines

It should make clear that the number of tasks varies greatly over the
lifetime of a job.
Depending on the nodes available, this may leave some nodes idle.
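
As a rough illustration of that point, here is a minimal sketch that counts how many tasks are running at each moment and which nodes have nothing to do. The task intervals, node names, and function names are all invented for the example; real values would have to come from the job history of an actual run, and nothing here reflects Raj's cluster.

```python
# Hypothetical task attempts: (node, start_second, end_second).
# These numbers are made up for illustration only.
TASKS = [
    ("node1", 0, 40),
    ("node2", 0, 35),
    ("node3", 0, 30),
    ("node1", 40, 90),
    ("node2", 35, 100),
]
# Hypothetical cluster: node4 and node5 never receive a task.
NODES = ["node1", "node2", "node3", "node4", "node5"]


def running_tasks(tasks, t):
    """Return the tasks whose [start, end) interval contains time t."""
    return [task for task in tasks if task[1] <= t < task[2]]


def timeline(tasks, nodes, step=10):
    """For each sampled time, report the running task count and idle nodes."""
    end = max(task[2] for task in tasks)
    rows = []
    for t in range(0, end, step):
        active = running_tasks(tasks, t)
        busy = {task[0] for task in active}
        idle = [n for n in nodes if n not in busy]
        rows.append((t, len(active), idle))
    return rows


if __name__ == "__main__":
    for t, count, idle in timeline(TASKS, NODES):
        print(f"t={t:>3}s  running={count}  idle={','.join(idle) or '-'}")
```

With these made-up intervals the task count drops as the job progresses, and a node that is picked for sampling may well be one that is idle at that moment.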

Niels

2011/1/11 Raj V <[EMAIL PROTECTED]>:
> Ted
>
>
> Thanks. I have all the graphs I need, including the map reduce timeline and system activity for all the nodes while the sort was running. I will publish them once I have them in some presentable format.
>
> For legal reasons, I really don't want to send the complete job history files.
>
> My question is still this: when running terasort, would the CPU, disk and network utilization of all the nodes be more or less similar, or completely different?
>
> Sometime during the day, I will post the system data from 5 nodes and that would probably explain my question better.
>
> Raj
> From: Ted Dunning <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]; Raj V <[EMAIL PROTECTED]>
> Cc:
> Sent: Tuesday, January 11, 2011 8:22:17 AM
> Subject: Re: TeraSort question.
>
> Raj,
>
> Do you have the job history files?  That would be very useful.  I would be
> happy to create some swimlane and related graphs for you if you can send me
> the history files.
>
> On Mon, Jan 10, 2011 at 9:06 PM, Raj V <[EMAIL PROTECTED]> wrote:
>
>> All,
>>
>> I have been running terasort on a 480-node Hadoop cluster. I have also
>> collected CPU, memory, disk and network statistics during this run. The
>> system stats are quite interesting. I can post them when I have put them
>> together in some presentable format (if there is interest). However, while
>> looking at the data, I noticed something interesting.
>>
>> I thought, intuitively, that all the systems in the cluster would have
>> more or less similar behaviour (a shift in time was possible), but that
>> the overall graph would look the same.
>>
>> Just to confirm it, I took 5 random nodes and looked at the CPU, disk,
>> network etc. activity while the sort was running. Strangely enough, it was
>> not so. Two of the 5 systems were seriously busy: big I/O with lots of disk
>> and network activity. On the other three systems, the CPU was more or less
>> 100% idle, with slight network and I/O activity.
>>
>> Is that normal and/or expected? Shouldn't all the nodes be utilized in more
>> or less the same manner over the length of the run?
>>
>> I generated the data for the sort using teragen (128MB block size,
>> replication = 3).
>>
>> I would also be interested in other people's sort timings. Is there some
>> place where people can post sort numbers (not just the record)?
>>
>> I will post the actual graphs of the 5 nodes tomorrow, if there is
>> interest. (There are some logistical issues with posting them tonight.)
>>
>> I am using CDH3B3, even though I think this is not specific to CDH3B3.
>>
>> Sorry for the cross post.
>>
>> Raj

--
Met vriendelijke groeten,

Niels Basjes