Hadoop, mail # user - Spilled Records


Re: Spilled Records
Jie Li 2012-02-28, 17:31
Hi Dan,

You might want to post your Pig script to the Pig user mailing list.
Previously I did some experiments on Pig and Hive and I'll also be
interested in looking into your script.

Yeah, Starfish currently only supports Hadoop job-level tuning, and supporting
workflows like Pig and Hive is our top priority. We'll let you know once
we're ready.

Thanks,
Jie

On Tue, Feb 28, 2012 at 11:57 AM, Daniel Baptista <
[EMAIL PROTECTED]> wrote:

> Hi Jie,
>
> To be honest I don't think I understand enough of what our job is doing to
> be able to explain it.
>
> Thanks for the response though, I had figured that I was grasping at
> straws.
>
> I have looked at Starfish; however, all our jobs are submitted via Apache
> Pig so I don't know if it would be much good.
>
> Thanks again, Dan.
>
> -----Original Message-----
> From: Jie Li [mailto:[EMAIL PROTECTED]]
> Sent: 28 February 2012 16:35
> To: [EMAIL PROTECTED]
> Subject: Re: Spilled Records
>
> Hello Dan,
>
> The fact that the spilled records are double the output records means
> the map task produces more than one spill file, and these spill files are
> read, merged and written to a single file, so each record is spilled
> twice.
>
> I can't infer anything from the numbers of the two tasks. Could you provide
> more info, such as what the application is doing?
>
> If you like, you can also try our tool Starfish to see what's going on
> behind the scenes.
>
> Thanks,
> Jie
> ------------------
> Starfish is an intelligent performance tuning tool for Hadoop.
> Homepage: www.cs.duke.edu/starfish/
> Mailing list: http://groups.google.com/group/hadoop-starfish
>
>
> On Tue, Feb 28, 2012 at 8:25 AM, Daniel Baptista <
> [EMAIL PROTECTED]> wrote:
>
> > Hi All,
> >
> > I am trying to improve the performance of my hadoop cluster and would
> > like to get some feedback on a couple of numbers that I am seeing.
> >
> > Below is the output from a single task (1 of 16) that took 3 mins 40
> > Seconds
> >
> > FileSystemCounters
> > FILE_BYTES_READ 214,653,748
> > HDFS_BYTES_READ 67,108,864
> > FILE_BYTES_WRITTEN 429,278,388
> >
> > Map-Reduce Framework
> > Combine output records 0
> > Map input records 2,221,478
> > Spilled Records 4,442,956
> > Map output bytes 210,196,148
> > Combine input records 0
> > Map output records 2,221,478
> >
> > And another task in the same job (16 of 16) that took 7 minutes and 19
> > seconds
> >
> > FileSystemCounters
> > FILE_BYTES_READ 199,003,192
> > HDFS_BYTES_READ 58,434,476
> > FILE_BYTES_WRITTEN 397,975,310
> >
> > Map-Reduce Framework
> > Combine output records 0
> > Map input records 2,086,789
> > Spilled Records 4,173,578
> > Map output bytes 194,813,958
> > Combine input records 0
> > Map output records 2,086,789
> >
> > Can anybody determine anything from these figures?
> >
> > The first task is twice as quick as the second yet the input and output
> > are comparable (certainly not double). In all of the tasks (in this and
> > other jobs) the spilled records are always double the output records,
> > this can't be 'normal'?
> >
> > Am I clutching at straws? (It feels like I am.)
> >
> > Thanks in advance, Dan.
> >
> >
>
>
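
Jie's explanation (each record is written once in an initial spill file and once more when the spill files are merged) can be checked against the counters Dan posted. The sketch below is only arithmetic on those numbers. The dictionary keys and function names are made up for illustration, and it assumes the Spilled Records counter counts every record written to local disk, in both the spill pass and the merge pass.

```python
# Counters copied from the two map tasks quoted in this thread.
# The key names are hypothetical labels, not Hadoop API identifiers.
tasks = {
    "task 1 of 16": {
        "map_output_records": 2_221_478,
        "spilled_records": 4_442_956,
        "map_output_bytes": 210_196_148,
        "file_bytes_written": 429_278_388,
        "seconds": 3 * 60 + 40,
    },
    "task 16 of 16": {
        "map_output_records": 2_086_789,
        "spilled_records": 4_173_578,
        "map_output_bytes": 194_813_958,
        "file_bytes_written": 397_975_310,
        "seconds": 7 * 60 + 19,
    },
}


def spill_passes(counters):
    """Average number of times each map output record was written to local disk."""
    return counters["spilled_records"] / counters["map_output_records"]


def write_amplification(counters):
    """Local bytes written per byte of map output."""
    return counters["file_bytes_written"] / counters["map_output_bytes"]


def records_per_second(counters):
    """Map output records produced per second of task wall-clock time."""
    return counters["map_output_records"] / counters["seconds"]


for name, counters in tasks.items():
    print(
        f"{name}: spill passes = {spill_passes(counters):.2f}, "
        f"write amplification = {write_amplification(counters):.2f}, "
        f"records/s = {records_per_second(counters):.0f}"
    )
```

Both tasks come out at exactly two spill passes and roughly twice the map output bytes written locally, which fits the spill-then-merge explanation. The records-per-second figure differs by about a factor of two between the tasks even though their spill behaviour is the same, so these counters alone do not explain why the second task is slower.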