Re: Comparison of Apache Pig vs. Hadoop Streaming M/R
Russell Jurney 2012-03-05, 20:58
Streaming is good for simulation: long-running, map-only processes where Pig doesn't really help and it is simple to fire off a streaming process. You do have to set some options so that the tasks can take a long time to return, or return counters.
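
A minimal sketch of such a map-only streaming task, in Python. The message does not say which options are meant; one plausible reading is raising the task timeout (`mapred.task.timeout` in Hadoop 1.x) and setting the number of reducers to zero. The `reporter:counter:` and `reporter:status:` lines written to stderr are the Hadoop Streaming convention for updating counters and task status. The `simulate` function, the counter names, and the one-seed-per-line input format are hypothetical placeholders, not anything from this thread.

```python
import sys


def report_counter(group, name, amount, err):
    # Hadoop Streaming reads counter updates from the task's stderr.
    err.write("reporter:counter:%s,%s,%d\n" % (group, name, amount))
    err.flush()


def report_status(message, err):
    # Status lines also signal that the task is still making progress.
    err.write("reporter:status:%s\n" % message)
    err.flush()


def simulate(seed, steps):
    # Hypothetical stand-in for a long-running simulation:
    # a simple linear congruential walk starting from the seed.
    state = seed
    for _ in range(steps):
        state = (1103515245 * state + 12345) % (2 ** 31)
    return state


def run_mapper(lines, out=sys.stdout, err=sys.stderr, steps=1000):
    # Assumed input format: one integer seed per line.
    for line in lines:
        line = line.strip()
        if not line:
            continue
        seed = int(line)
        result = simulate(seed, steps)
        out.write("%d\t%d\n" % (seed, result))
        report_counter("Simulation", "runs_completed", 1, err)
        report_status("finished seed %d" % seed, err)
```

Used as a streaming mapper, the script would end with a call to `run_mapper(sys.stdin)`, and the job would be launched with zero reducers so that the mapper output is written directly.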
Russell Jurney http://datasyndrome.com
On Mar 5, 2012, at 12:38 PM, Eli Finkelshteyn <[EMAIL PROTECTED]> wrote:
> I'm really interested in this as well. I have trouble seeing a really good use case for streaming map-reduce. Is there something I can do in streaming that I can't do in Pig? If I want to reuse existing Python functions from my code base, I can do that in Pig as easily as in streaming, and from what I've experienced so far, Python streaming runs slower than or at the same speed as Pig. So why would I want to write a lot of harder-to-read mappers and reducers when I can write shorter, clearer code in Pig that performs equally fast? Maybe it's obvious, but currently I just can't think of the right use case.
> On 3/2/12 9:21 AM, Subir S wrote:
>> On Fri, Mar 2, 2012 at 12:38 PM, Harsh J <[EMAIL PROTECTED]> wrote:
>>> On Fri, Mar 2, 2012 at 10:18 AM, Subir S <[EMAIL PROTECTED]> wrote:
>>>> Hello Folks,
>>>> Are there any pointers to such comparisons between Apache Pig and Hadoop
>>>> Streaming Map Reduce jobs?
>>> I do not see why you seek to compare these two. Pig offers a language
>>> that lets you write data-flow operations and runs these statements as
>>> a series of MR jobs for you automatically (making it a great tool to
>>> use to get data processing done really quickly, without bothering with
>>> code), while streaming is something you use to write non-Java, simple
>>> MR jobs. Both have their own purposes.
>> Basically we are comparing these two to see the benefits and how much they
>> help in improving the productive coding time, without jeopardizing the
>> performance of MR jobs.
>>>> Also, there was a claim in our company that Pig performs better than Map
>>>> Reduce jobs. Is this true? Are there any such benchmarks available?
>>> Pig _runs_ MR jobs. It does do job design (and some data)
>>> optimizations based on your queries, which is what may give it an edge
>>> over designing elaborate flows of plain MR jobs with tools like
>>> Oozie/JobControl (Which takes more time to do). But regardless, Pig
>>> only makes it easy doing the same thing with Pig Latin statements for
>> I knew that Pig runs MR jobs, as Hive does. But Hive jobs become
>> pretty slow with a lot of joins, which we can run faster by writing raw
>> MR jobs. So with that context I was trying to see how Pig runs MR jobs, for
>> example what kind of projects should consider Pig: say, when we have a
>> lot of joins, which take time to write as plain MR jobs. Thoughts?
>> Thank you Harsh for your comments. They are helpful!
>>> Harsh J
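
To make the point about joins concrete, here is a rough sketch of what a hand-written reduce-side inner join can look like as a Python streaming reducer. It assumes the mappers have already emitted lines of the form `key<TAB>tag<TAB>value`, with the tag `L` or `R` marking which input a record came from, and that Hadoop delivers the lines sorted by key; the tags and field layout are illustrative assumptions, not something described in the thread. In Pig Latin the same operation is a single `JOIN ... BY ...` statement.

```python
import sys
from itertools import groupby


def parse(line):
    # Assumed record layout: key <TAB> tag <TAB> value
    key, tag, value = line.rstrip("\n").split("\t", 2)
    return key, tag, value


def join_reducer(lines):
    # Streaming reducers see input sorted by key, so consecutive
    # lines with the same key can be grouped together.
    records = (parse(line) for line in lines if line.strip())
    for key, group in groupby(records, key=lambda rec: rec[0]):
        left, right = [], []
        for _, tag, value in group:
            if tag == "L":
                left.append(value)
            else:
                right.append(value)
        # Inner join: emit every left/right pairing for this key.
        for left_value in left:
            for right_value in right:
                yield key, left_value, right_value


def run_reducer(lines, out=sys.stdout):
    for row in join_reducer(lines):
        out.write("\t".join(row) + "\n")
```

This buffers all values for one key in memory, which is fine for a sketch but not for heavily skewed keys. As a streaming reducer the script would end with `run_reducer(sys.stdin)`.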