Pig, mail # user - running pig on amazon ec2


Re: running pig on amazon ec2
Dexin Wang 2011-06-14, 18:07
Good to know. I'm trying a single-node Hadoop cluster now. The main input is
a bit over 1 million lines of events. After some aggregation, it is joined with
another input source that also has a bit over 1 million rows. Is this considered
a small query? Thanks.
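
For anyone who wants to sanity-check this kind of query on a sample outside Pig, here is a minimal Python sketch of the hourly aggregation step with median and variance. It is not the script discussed in this thread: the field names "ts" (epoch seconds) and "value" are hypothetical, and the join with the second input is left out.

```python
import gzip
import json
import statistics
from collections import defaultdict
from datetime import datetime, timezone


def read_gzipped_lines(path):
    """Yield text lines from a gzipped file that holds one event per line."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from handle


def hourly_stats(lines):
    """Group events by UTC hour and compute count, median, and variance.

    Assumes each line is a JSON object with the hypothetical fields
    "ts" (epoch seconds) and "value" (a number).
    """
    buckets = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        stamp = datetime.fromtimestamp(event["ts"], tz=timezone.utc)
        buckets[stamp.strftime("%Y-%m-%d %H:00")].append(event["value"])
    return {
        hour: {
            "count": len(values),
            "median": statistics.median(values),
            # Population variance; statistics.variance gives the sample version.
            "variance": statistics.pvariance(values),
        }
        for hour, values in buckets.items()
    }
```

Calling hourly_stats(read_gzipped_lines(path)) on one input file gives a per-hour summary that can be compared against the output of the corresponding Pig step.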

On Tue, Jun 14, 2011 at 11:01 AM, Daniel Dai <[EMAIL PROTECTED]> wrote:

> Whether you run in local mode or mapreduce mode makes a huge difference. For
> a small query, the mapreduce overhead will dominate. For a fair comparison, can
> you set up a single-node Hadoop cluster on your laptop and run Pig on it?
>
> Daniel
>
>
> On 06/14/2011 10:54 AM, Dexin Wang wrote:
>
> Thanks for your feedback. My comments below.
>
> On Tue, Jun 14, 2011 at 10:41 AM, Daniel Dai <[EMAIL PROTECTED]> wrote:
>
>> Curious, couple of questions:
>> 1. Are you running in local mode or mapreduce mode?
>>
> Local mode (-x local) when I ran it on my laptop, and mapreduce mode when I
> ran it on ec2 cluster.
>
>> 2. If mapreduce mode, did you look into the Hadoop logs to see how much
>> each mapreduce job slows down?
>>
> I'm looking into that.
>
>
>> 3. What kind of query is it?
>>
> The input is gzipped JSON files with one event per line. I do some hourly
> aggregation on the raw events, then a bunch of grouping, joining, and
> metrics computation (like median and variance) on some fields.
>
>> Daniel
>>
> Someone mentioned that it's EC2's I/O performance. But I'm sure there are
> plenty of people using EC2/EMR to run big MR jobs, so is it more likely that
> I have some configuration issues? My jobs can be optimized a bit, but the fact
> that they run faster on my laptop tells me this is a separate issue.
>
> Thanks!
>
>
>
>> On 06/13/2011 11:54 AM, Dexin Wang wrote:
>>
>>> Hi,
>>>
>>> This is probably not directly a Pig question.
>>>
>>> Is anyone running Pig on Amazon EC2 instances? Something's not making sense
>>> to me. I ran a Pig script that has about 10 mapred jobs in it on a 16-node
>>> cluster using m1.small. It took *13 minutes*. The job reads input from S3
>>> and writes output to S3. But from the logs, the reading and writing part
>>> to/from S3 is pretty fast. And all the intermediate steps should happen
>>> on HDFS.
>>>
>>> Running the same job on my MBP laptop took only *3 minutes*.
>>>
>>> Amazon is using Pig 0.6 while I'm using Pig 0.8 on my laptop. I'll try Pig
>>> 0.6 on my laptop. Some Hadoop config is probably also not ideal. I tried
>>> m1.large instead of m1.small, but it doesn't seem to make a huge difference.
>>> Is there anything you would suggest looking at to explain the slowness on
>>> EC2?
>>>
>>> Dexin
>>>
>>
>>
>
>
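
As a rough back-of-the-envelope check on the numbers in the original message (13 minutes on EC2, 3 minutes on the laptop, about 10 mapred jobs), the small Python sketch below works out the average wall-clock time per job. Treating the whole gap as per-job overhead is an assumption made for illustration only; the Hadoop job logs would show where the time actually goes.

```python
def per_job_seconds(total_minutes, num_jobs):
    """Average wall-clock seconds per job, given the total run time in minutes."""
    return total_minutes * 60 / num_jobs


NUM_JOBS = 10  # "about 10 mapred jobs" in the original message

ec2_per_job = per_job_seconds(13, NUM_JOBS)     # mapreduce mode on EC2
laptop_per_job = per_job_seconds(3, NUM_JOBS)   # local mode on the laptop

# Assumption: the difference is treated as extra cost per job on the cluster.
gap_per_job = ec2_per_job - laptop_per_job

print(f"EC2:    {ec2_per_job:.0f} s per job")
print(f"Laptop: {laptop_per_job:.0f} s per job")
print(f"Gap:    {gap_per_job:.0f} s per job")
```

A gap of about a minute per job is in the range where job startup and scheduling costs could plausibly account for much of the difference on a small input, which fits the suggestion to compare against a single-node cluster in mapreduce mode rather than local mode.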