Re: Spilled Records
Hi Dan,

You might want to post your Pig script to the Pig user mailing list.
I've run some experiments on Pig and Hive myself, so I'd also be
interested in looking into your script.

Yes, Starfish currently supports only Hadoop job-level tuning; support for
workflows like Pig and Hive is our top priority. We'll let you know once
it's ready.

Thanks,
Jie

On Tue, Feb 28, 2012 at 11:57 AM, Daniel Baptista <
[EMAIL PROTECTED]> wrote:

> Hi Jie,
>
> To be honest I don't think I understand enough of what our job is doing to
> be able to explain it.
>
> Thanks for the response though, I had figured that I was grasping at
> straws.
>
> I have looked at Starfish; however, all our jobs are submitted via Apache
> Pig, so I don't know whether it would be much use.
>
> Thanks again, Dan.
>
> -----Original Message-----
> From: Jie Li [mailto:[EMAIL PROTECTED]]
> Sent: 28 February 2012 16:35
> To: [EMAIL PROTECTED]
> Subject: Re: Spilled Records
>
> Hello Dan,
>
> The fact that the spilled records are double the output records means the
> map task produced more than one spill file; those spill files were then
> read, merged, and written out as a single file, so each record was spilled
> twice. Your first task shows exactly that: 4,442,956 spilled records =
> 2 x 2,221,478 map output records.
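>
> Whenever a map's output doesn't fit in its sort buffer, you'll see this
> doubling on every task. As a rough sketch (assuming MRv1 parameter names
> here; newer releases rename these, e.g. mapreduce.task.io.sort.mb),
> raising the buffer in mapred-site.xml can keep each map to a single spill:
>
>   <property>
>     <!-- per-map in-memory sort buffer in MB (default 100) -->
>     <name>io.sort.mb</name>
>     <value>320</value>
>   </property>
>   <property>
>     <!-- fraction of the buffer that fills before a spill starts (default 0.80) -->
>     <name>io.sort.spill.percent</name>
>     <value>0.80</value>
>   </property>
>
> With ~210 MB of map output and 320 MB x 0.80 = 256 MB of usable buffer,
> the output should go to disk in a single spill, and spilled records would
> then equal map output records. Note that the task JVM heap
> (mapred.child.java.opts) has to be raised to match.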
>
> I can't infer anything from the numbers of the two tasks. Could you provide
> more info, such as what the application is doing?
>
> If you like, you can also try our tool Starfish to see what's going on
> behind the scenes.
>
> Thanks,
> Jie
> ------------------
> Starfish is an intelligent performance tuning tool for Hadoop.
> Homepage: www.cs.duke.edu/starfish/
> Mailing list: http://groups.google.com/group/hadoop-starfish
>
>
> On Tue, Feb 28, 2012 at 8:25 AM, Daniel Baptista <
> [EMAIL PROTECTED]> wrote:
>
> > Hi All,
> >
> > I am trying to improve the performance of my Hadoop cluster and would
> > like to get some feedback on a couple of numbers that I am seeing.
> >
> > Below is the output from a single task (1 of 16) that took 3 minutes
> > 40 seconds:
> >
> > FileSystemCounters
> > FILE_BYTES_READ 214,653,748
> > HDFS_BYTES_READ 67,108,864
> > FILE_BYTES_WRITTEN 429,278,388
> >
> > Map-Reduce Framework
> > Combine output records 0
> > Map input records 2,221,478
> > Spilled Records 4,442,956
> > Map output bytes 210,196,148
> > Combine input records 0
> > Map output records 2,221,478
> >
> > And another task in the same job (16 of 16) that took 7 minutes and 19
> > seconds:
> >
> > FileSystemCounters
> > FILE_BYTES_READ 199,003,192
> > HDFS_BYTES_READ 58,434,476
> > FILE_BYTES_WRITTEN 397,975,310
> >
> > Map-Reduce Framework
> > Combine output records 0
> > Map input records 2,086,789
> > Spilled Records 4,173,578
> > Map output bytes 194,813,958
> > Combine input records 0
> > Map output records 2,086,789
> >
> > Can anybody determine anything from these figures?
> >
> > The first task is twice as quick as the second, yet the input and output
> > are comparable (certainly not double). And in all of the tasks (in this
> > and other jobs) the spilled records are always double the output records;
> > surely this can't be 'normal'?
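> >
> > A quick back-of-the-envelope on the timings (3 min 40 s = 220 s,
> > 7 min 19 s = 439 s), using nothing beyond the counters above:
> >
> >   task  1 of 16: 2,221,478 records / 220 s ~ 10,100 records/s
> >   task 16 of 16: 2,086,789 records / 439 s ~  4,750 records/s
> >
> > So the slow task is processing records at about half the rate, not
> > handling twice the data.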
> >
> > Am I clutching at straws (it feels like I am)?
> >
> > Thanks in advance, Dan.
> >