Pig user mailing list: accessing remote cluster with Pig


RE: accessing remote cluster with Pig
http://blog.rapleaf.com/dev/2010/01/05/the-wrath-of-drwho-or-unpredictable-hadoop-memory-usage/

Check the load on your client. Sometimes, if the client cannot determine your user name (via whoami), the namenode receives DrWho, which is the default user name Hadoop falls back to when no user name can be determined (read: null).

I have seen this behaviour on boxes with low memory, especially on VMs.
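(A quick way to check this on the client; a minimal sketch, assuming an older pre-security Hadoop client that shells out to whoami for user resolution, and assuming the default /user/<name> home-directory layout:)

-----
# If this fails or prints nothing (e.g. fork() failing under memory
# pressure), old Hadoop clients fall back to the default user "DrWho".
whoami

# Confirm which home directory the cluster resolves for you:
hadoop fs -ls /user/$(whoami)
-----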

Santhosh

-----Original Message-----
From: Kaluskar, Sanjay [mailto:[EMAIL PROTECTED]]
Sent: Thursday, October 21, 2010 3:17 AM
To: [EMAIL PROTECTED]
Subject: RE: accessing remote cluster with Pig

I am trying to do the same (submitting a Pig script to a remote cluster from a Windows machine), and the job gets submitted after setting the following in pig.properties:

fs.default.name=hdfs://<node>:54310
mapred.job.tracker=<node>:54510

However, my script fails because it looks for inputs under /user/DrWho.
Is it possible to specify the Hadoop cluster user in pig.properties? How does one control it? Where is DrWho coming from?
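(One possible answer on pre-security Hadoop, 0.20 and earlier, is the hadoop.job.ugi property: Pig passes Hadoop properties from pig.properties through to the cluster, so the client can assert a user and group explicitly. A sketch, assuming your cluster still honours this pre-security mechanism; <user> and <group> are placeholders:)

-----
# in pig.properties: assert the remote user/group explicitly,
# instead of relying on the client-side whoami lookup
hadoop.job.ugi=<user>,<group>
-----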

Thanks,
-sanjay

-----Original Message-----
From: Gerrit Jansen van Vuuren [mailto:[EMAIL PROTECTED]]
Sent: Sunday, October 17, 2010 6:47 PM
To: [EMAIL PROTECTED]
Subject: RE: accessing remote cluster with Pig

Glad it worked for you  :)

I use the standard Apache Pig distributions.
There are several places where these environment variables can be set, and I have no idea which one Cloudera uses, but here is a list:

/etc/profile.d/<any file>  (we have hadoop.sh, pig.sh and java.sh here that set the home variables; managed by puppet)
/etc/bash.bashrc  (not a good idea to set it here)
$HOME/.bashrc  (quick for users that don't have root permission, but not for production)
$PIG_HOME/conf/pig-env.sh  (standard in all hadoop-related projects, gets sourced by $PIG_HOME/bin/pig)

To see what variables your Pig is picking up, you can manually insert the line echo "home:$PIG_HOME conf:$PIG_CONF_DIR" into the $PIG_HOME/bin/pig file just before it calls java.
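(Concretely, the edit might look like this; the exec line is shown only as context and varies by Pig version:)

-----
# in $PIG_HOME/bin/pig, just before the java invocation:
echo "home:$PIG_HOME conf:$PIG_CONF_DIR" >&2

# existing invocation (varies by version), e.g.:
# exec "$JAVA" $JAVA_HEAP_MAX $PIG_OPTS -classpath "$CLASSPATH" $CLASS "$@"
-----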

Cheers,
 Gerrit

-----Original Message-----
From: Anze [mailto:[EMAIL PROTECTED]]
Sent: Sunday, October 17, 2010 7:49 AM
To: [EMAIL PROTECTED]
Subject: Re: accessing remote cluster with Pig
Gerrit, thank you for your answer! It has pointed me in the right direction.
It looks like Pig (at least mine) ignores PIG_HOME. But with your help I was able to debug a bit further:
-----
$ find / -name 'pig.properties'
/etc/pig/conf.dist/pig.properties
/etc/pig/conf/pig.properties
/usr/lib/pig/example-confs/conf.default/pig.properties
/usr/lib/pig/conf/pig.properties
-----

I have changed /usr/lib/pig/conf/pig.properties and bingo - this is what my Pig uses.

So while Cloudera's packaging creates /etc/pig/conf/pig.properties (the "Debian way"), it is not used at all. And it probably ignores the environment vars too.
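(A quick way to see which file wins; a sketch assuming Cloudera's Debian packaging, where /usr/lib/pig/conf may be a symlink managed by alternatives:)

-----
# see whether /usr/lib/pig/conf is a real directory or a symlink,
# and which pig.properties it ultimately resolves to
ls -l /usr/lib/pig/conf
readlink -f /usr/lib/pig/conf/pig.properties
-----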

Thanks again! :)

Anze

On Sunday 17 October 2010, Gerrit Jansen van Vuuren wrote:
> Hi,
>
> Pig configuration is in the file: $PIG_HOME/conf/pig.properties
>
> The two parameters that tell Pig where to find the namenode and job
> tracker are:
>
> E.g. (assuming you're using the default ports)
>
> ----[ $PIG_HOME/conf/pig.properties ]---------------
>
> fs.default.name=hdfs://<namenode url>:8020/
> mapred.job.tracker=<jobtracker url>:8021
>
> --------------
>
> With these properties set you don't need to specify pig -x mapreduce;
> just pig is enough.
>
>
> Cheers,
>  Gerrit
>
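(For what it's worth, a quick way to confirm these properties took effect is to watch the HExecutionEngine lines Pig logs at startup; a sketch, with the exact wording varying by Pig version:)

-----
$ pig
... HExecutionEngine - Connecting to hadoop file system at: hdfs://<namenode url>:8020/
... HExecutionEngine - Connecting to map-reduce job tracker at: <jobtracker url>:8021
-----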
> -----Original Message-----
> From: Anze [mailto:[EMAIL PROTECTED]]
> Sent: Saturday, October 16, 2010 9:53 PM
> To: [EMAIL PROTECTED]
> Subject: accessing remote cluster with Pig
>
> Hi again! :)
>
> I am trying to run Pig on a local machine, but I want it to connect
> to a remote cluster. I can't make it use my settings - whatever I do,
> I get this:
> -----
> $ pig -x mapreduce
> 10/10/16 22:17:43 INFO pig.Main: Logging error messages to:
> /home/pigtest/conf/pig_1287260263699.log
> 2010-10-16 22:17:43,896 [main] INFO
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> Connecting to hadoop file system at: file:///
> -----
>
> Whatever I change, it keeps connecting to the local file system. I am
> using the Cloudera Pig package (0.7.0+16-1~lenny-