Thanks a lot for the good advice.
I'll see if I can get LZO set up. Currently I'm using EMR, which uses Pig 0.6.
I'll look into Whirr to start the Hadoop cluster on EC2.
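
On the question quoted further down (whether the codec entry in mapred-site.xml is enough), here is a rough way to check a config file. It is only a sketch and assumes the Hadoop 0.20-era property names: as far as I know, `mapred.compress.map.output` is the switch that enables map output compression (default false), and `mapred.map.output.compression.codec` only selects the codec. The helper names and the SAMPLE string are made up for illustration, and a per-job setting could still override whatever the file says.

```python
import xml.etree.ElementTree as ET


def read_hadoop_props(xml_text):
    """Return {name: value} for every <property> in a Hadoop *-site.xml document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def map_output_compression(props):
    """Return (enabled, codec) using the assumed Hadoop 0.20 property names."""
    # Assumption: compression is off unless this flag is explicitly "true".
    enabled = props.get("mapred.compress.map.output", "false").lower() == "true"
    codec = props.get("mapred.map.output.compression.codec")
    return enabled, codec


# Hypothetical file content: only the codec is set, as in the quoted message.
SAMPLE = """<configuration>
  <property><name>mapred.map.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value></property>
</configuration>"""

enabled, codec = map_output_compression(read_hadoop_props(SAMPLE))
print("map output compression enabled:", enabled)
print("codec:", codec)
```

If `enabled` comes back False, the codec line alone is probably not doing anything. The job.xml shown for a running job is a better source of truth than the file on disk.
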
There is one place in my job where I can use a replicated join; I'm sure that
will cut down some time.
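
For anyone following along, the idea behind a replicated join can be sketched in a few lines of Python. This is not Pig's implementation, just an illustration under the assumption that one input is small enough to hold in memory: the small relation is loaded into a hash table once, the big relation is streamed past it, and no shuffle or sort of the big side is needed. The function name and the toy data are made up.

```python
from collections import defaultdict


def replicated_join(big_rows, small_rows, big_key, small_key):
    """Join a streamed big relation against a small relation held in memory."""
    # Build the in-memory lookup table from the small relation only.
    table = defaultdict(list)
    for row in small_rows:
        table[row[small_key]].append(row)
    # Stream the big relation; each row is matched without any sorting.
    for row in big_rows:
        for match in table.get(row[big_key], ()):
            yield row + match


# Toy data: events is the "big" side, users is the "small" side.
events = [("u1", "click"), ("u2", "view"), ("u3", "click")]
users = [("u1", "US"), ("u2", "DE")]

joined = list(replicated_join(events, users, 0, 0))
for row in joined:
    print(row)
```

Rows from the big side with no match (u3 here) are dropped, as in an inner join. The time saved comes from skipping the reduce phase for that join, so it only helps when the small side really does fit in memory.
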
What I find interesting is that, without doing any optimization on the
configuration or code side, I get a 2x to 4x speedup just by using the "*Cluster
Compute Quadruple Extra Large Instance*" (cc1.4xlarge) as opposed to the regular
"Large instance" (m1.large) on the $$. They do claim cc1.4xlarge's IO is
"very high". Since I suspect most of my job's time was spent
reading/writing disk, this speedup makes sense.
On Wed, Jun 15, 2011 at 6:46 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:
> you need to add this to your pig.properties:
> Make sure that you are running Hadoop 0.20.2 or higher, Pig 0.8.1 or
> higher, and that all the LZO stuff is set up -- it's a bit involved.
> Use replicated joins where possible.
> If you are doing a large number of small jobs, scheduling and
> provisioning are likely to dominate -- tune your job scheduler to
> schedule more tasks per heartbeat and make sure your jar is as small
> as you can get it (there's a lot of unjarring going on in Hadoop).
> On Wed, Jun 15, 2011 at 11:14 AM, Dexin Wang <[EMAIL PROTECTED]> wrote:
> > Tomas,
> > What worked well for me is still to be figured out. Right now it works, but
> > it's too slow. I think one of the main problems is that my job has many
> > JOIN/GROUP BY steps, so there are lots of intermediate steps writing to
> > disk, which is slow.
> > On that note, does anyone know how to tell whether LZO is turned on for
> > intermediate jobs? Reference to this
> > and this
> > I see I have this in my mapred-site.xml file:
> > <property><name>mapred.map.output.compression.codec</name>
> > <value>com.hadoop.compression.lzo.LzoCodec</value></property>
> > Is that all I need to have map compression turned on? Thanks.
> > Dexin
> > On Tue, Jun 14, 2011 at 3:36 PM, Tomas Svarovsky
> > <[EMAIL PROTECTED]> wrote:
> >> Hi Dexin,
> >> Since I am a Pig and MapReduce newbie, your post is very
> >> intriguing to me. I am coming from a Talend background and trying to
> >> assess whether map/reduce would bring any speedup and faster
> >> turnaround to my projects. My worry is that my data are too small, so
> >> that the MapReduce overhead will be prohibitive in certain cases.
> >> When using Talend, if the transformation was reasonable, it could
> >> process tens of thousands of rows per second. Processing 1 million rows
> >> could be finished in well under 1 minute, so I think that your dataset is
> >> fairly small. Nevertheless, my data are growing, so soon it will be time
> >> for Pig.
> >> Could you provide some info on what worked well for you to run your job
> >> on EC2?
> >> Thanks in advance,
> >> Tomas
> >> On Tue, Jun 14, 2011 at 9:16 PM, Daniel Dai <[EMAIL PROTECTED]>
> >> wrote:
> >> > If the job finishes in 3 minutes in local mode, I would think it is
> >> > small.
> >> >
> >> > On 06/14/2011 11:07 AM, Dexin Wang wrote:
> >> >>
> >> >> Good to know. Trying a single-node Hadoop cluster now. The main input
> >> >> is about 1+ million lines of events. After some aggregation, it joins
> >> >> another input source which also has about 1+ million rows. Is this
> >> >> considered a small query? Thanks.
> >> >>
> >> >> On Tue, Jun 14, 2011 at 11:01 AM, Daniel Dai <[EMAIL PROTECTED]
> >> >> <mailto:[EMAIL PROTECTED]>> wrote:
> >> >>
> >> >> Local mode and mapreduce mode make a huge difference. For a small
> >> >> query, the mapreduce overhead will dominate. For a fair