Hadoop, mail # user - TeraSort question.


Re: TeraSort question.
bharath vissapragada 2011-01-11, 05:46
Ravi,

Please post the figures and graphs. Figures for large clusters (> 200 nodes)
are certainly interesting.

Thanks

On Tue, Jan 11, 2011 at 10:36 AM, Raj V <[EMAIL PROTECTED]> wrote:
> All,
>
> I have been running TeraSort on a 480-node Hadoop cluster, and I collected CPU, memory, disk, and network statistics during the run. The system stats are quite interesting. I can post them once I have put them together in a presentable format (if there is interest). However, while looking at the data, I noticed something unexpected.
>
> I thought, intuitively, that all the systems in the cluster would behave more or less similarly: the graphs might be shifted in time, but their overall shape would look the same.
>
> Just to confirm this, I took 5 random nodes and looked at their CPU, disk, network, etc. activity while the sort was running. Strangely enough, it was not so. Two of the 5 systems were seriously busy, with heavy I/O and lots of disk and network activity. On the other three systems, the CPU was more or less 100% idle, with only slight network and I/O activity.
>
> Is that normal and/or expected? Shouldn't all the nodes be utilized in more or less the same manner over the length of the run?
>
> I generated the data for the sort using TeraGen (128 MB block size, replication = 3).
>
> I would also be interested in other people's sort timings. Is there some place where people can post sort numbers (not just the record)?
>
> I will post the actual graphs of the 5 nodes tomorrow, if there is interest. (There are some logistical issues with posting them tonight.)
>
> I am using CDH3B3, even though I think this is not specific to CDH3B3.
>
> Sorry for the cross post.
>
> Raj