Pig, mail # user - Follow Up Questions: PigMix, DataGenerator etc...

RE: Follow Up Questions: PigMix, DataGenerator etc...
Santhosh Srinivasan 2009-10-31, 18:57
> Misc question: Do you anticipate that Pig will be compatible with
Hadoop 0.20 ?

The Hadoop 0.20-compatible version, Pig 0.5.0, will be released
shortly. The release has received the required votes.

> Finally, am I correct to assume that Pig is not Turing complete? I am
not clear on this. SQL is not Turing complete, whereas Java is. So does
that make Hive or Pig, for example, Turing complete or not?

Short answer: Hive and Pig are not Turing complete. Turing completeness
is a property of a particular language, not of the language used to
implement it. Since Hive is SQL (like), it's not Turing complete. Until
Pig supports loops and conditional statements, Pig will not be Turing
complete.


-----Original Message-----
From: Rob Stewart [mailto:[EMAIL PROTECTED]]
Sent: Saturday, October 31, 2009 11:22 AM
Subject: Re: Follow Up Questions: PigMix, DataGenerator etc...


Thanks for getting in touch. I appreciate your time, given that it's
clear you're busy popping up in Pig discussion videos on Vimeo and
YouTube just now. See my responses below.

I intend to get a good feel for the data generation, and to see, first
of all, how easily the various interfaces (Pig, JAQL, etc.) can plug
into the same file structures, and secondly, how easily and fairly I
would be able to port my queries from one interface to the next.

Misc question: Do you anticipate that Pig will be compatible with Hadoop
0.20 ?

Finally, am I correct to assume that Pig is not Turing complete? I am
not clear on this. SQL is not Turing complete, whereas Java is. So does
that make Hive or Pig, for example, Turing complete or not?

Again, see my responses below, and thanks again.
Rob Stewart

2009/10/30 Alan Gates <[EMAIL PROTECTED]>

> On Oct 30, 2009, at 5:05 AM, Rob Stewart wrote:
>  Hi there.
>> As some of you may have read on this mailing list previously, I'm
>> studying various interfaces with Hadoop, one of those being Pig.
>> I have three further questions. I am now beginning to think about the
>> design of my testing (query design, indicators of performance
>> etc...).
>> 1:
>> I have had a good look at the PigMix benchmark Wiki page, and find it
>> interesting that a few Pig queries now execute more quickly than the
>> equivalent Java MapReduce implementation (
>> http://wiki.apache.org/pig/PigMix ). The following data processing
>> functions in Pig outperform the Java equivalent:
>> distinct aggregation
>> anti join
>> group application
>> order by 1 field
>> order by multiple fields
>> distinct + join
>> multi-store
>> A few questions: Am I able to obtain the actual queries used for the
>> PigMix benchmarking? And how about obtaining their Java Hadoop
>> equivalent?
> https://issues.apache.org/jira/browse/PIG-200 has the code.  The
> original perf.patch has both the MR Java code and the Pig Latin
> scripts.  The data generator is also in this patch.
I will check that code out. Thanks.
>  And,
>> how, technically, is this high performance achieved? I have read the
>> paper "A benchmark for Hive, Pig and Hadoop" (Yuntao Jia, Zheng
>> Shao), and they were using a snapshot of Pig trunk from June 2009,
>> showing Pig executing an aggregation query and a join query more
>> slowly than Java Hadoop or Hive, but the aspect that interests me is
>> the number of Map/Reduce tasks created by Pig and Hive. In what way
>> does this number affect execution time? I
>> have a feeling that Pig produces more Map/Reduce tasks than other
>> interfaces, which may be beneficial where there is extremely skewed
>> data. Am I wrong in thinking this, or is there another benefit to
>> more Map/Reduce tasks? And how does Pig go about splitting a job into
>> this number of tasks?
> Map and reduce parallelism are controlled differently in Hadoop.  Map
> parallelism is controlled by the InputSplit.  IS determines how many
I have a query about this procedure. I assume it warrants a simple
answer, but I just need clarity on this. I am wondering how, for
example, both the MR applications and the Pig programs will react if
the number of Map or Reduce tasks is not specified. If, let's say, I
were a programmer writing some Pig scripts without knowing the skew of
the data, my first execution of the Pig script would be done without
any specification of #Mappers or #Reducers. Is it not a more natural
comparison of Pig vs MR apps if both Pig and the MR app have to decide
these details for themselves? So my question is: why is it a
fundamental requirement that the Pig script and the associated MR app
be given figures for the initial number of Map/Reduce tasks?
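
To make the point about InputSplits concrete, here is a minimal Python
sketch of how the number of map tasks follows from the input when
nothing is specified. It is an illustration only, under stated
assumptions: one split per HDFS block, a 64 MB block size, and splits
that never span files. The function name estimate_map_tasks is
hypothetical and is not part of Hadoop or Pig. Reduce parallelism has
no such data-driven default in this sketch; it has to be set
explicitly, for example with Pig's PARALLEL clause.

```python
def estimate_map_tasks(file_sizes_bytes, split_size_bytes=64 * 1024 * 1024):
    """Rough number of map tasks for a set of input files.

    Assumptions (not taken from the thread): one map task per input
    split, one split per block of split_size_bytes, and no split
    spanning two files. Empty files contribute no splits.
    """
    total = 0
    for size in file_sizes_bytes:
        # Integer ceiling division: a partial last block still needs a task.
        total += (size + split_size_bytes - 1) // split_size_bytes
    return total


MB = 1024 * 1024
# A 200 MB file needs 4 splits, a 10 MB file needs 1, an empty file none.
print(estimate_map_tasks([200 * MB, 10 * MB, 0]))
```

Under these assumptions, many small input files yield many short map
tasks, while one large file yields roughly size / block-size tasks.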
particular tests.

Sounds very elegant, a really neat solution to skewed data. Is there
some documentation of this process? I'd like to include that
methodology in my report, and then display results such as "skew of
data vs. execution time", with trend lines for Pig, Hive and MR apps.
It would be nice to show that, as the skew of the data increases, Pig
overtakes the equivalent MR app in execution performance.
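
As background for why skew matters to execution time, the following is
a small Python sketch of how records spread over reducers when keys are
routed by a simple hash. It is a generic illustration, not Pig's
skew-handling method: integer keys and "key modulo number of reducers"
stand in for a real hash partitioner, and the name reducer_loads is
hypothetical.

```python
def reducer_loads(keys, num_reducers):
    """Count records per reducer when key k goes to reducer k % num_reducers.

    The modulo on integer keys is an assumed stand-in for a hash
    partitioner; every record with the same key lands on the same reducer.
    """
    loads = [0] * num_reducers
    for key in keys:
        loads[key % num_reducers] += 1
    return loads


# 100 records: key 0 appears 90 times, keys 1..10 appear once each.
skewed_keys = [0] * 90 + list(range(1, 11))
print(reducer_loads(skewed_keys, 4))
```

With this input one reducer receives 92 of the 100 records, so the job
finishes only when that reducer does, however many reducers are added.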

https://issues.apache.org/jira/browse/PIG-979 ).

it incorrectly.
the event of a DataNode failure, i.e.
OK, so by this, do you mean that you use the web interface to view the
MapReduce tracker, task trackers, and the HDFS name node? I've had a
look at the Chukwa project, and I may be mistaken, but to me it looks
like a bit of a beast to configure, and it becomes more useful as you
increase the number of nodes in the cluster. The cluster I have
available to me is 10 nodes. I will have a good look at the Hadoop logs
generated by each of the nodes to see if that would suffice.