The difference between the two commands is where the shell script comes from. With -mapper ~/mapper.sh, each task looks in your home directory for the script. On a small cluster with your home directory mounted on every node, that is not a big deal. On a large cluster, NFS-mounting the directory on all of the boxes can cause a lot of issues, so you should use the distributed cache to send the script over instead. You are already sending it through the distributed cache by using the -file option, so the mapper can be referenced as ./mapper.sh.
I am not completely sure why it would be timing out. Are all of the mappers timing out, or is it just a single one? One thing you can do is run your streaming job with cat instead of mapper.sh as the mapper (echo would ignore its input, cat passes it through unchanged), then use that job's output as input to mapper.sh running on your local box.
./hadoop jar ../contrib/streaming/hadoop-0.20.2-streaming.jar -file ~/mapper.sh -mapper cat -input ../foo.txt -output output
./hadoop fs -cat output/part-00000 | ~/mapper.sh
# or pick a different part file that corresponds to the mapper task that is timing out.
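To see which site a mapper is stuck on, one option is to have the script write each URL to stderr before fetching it, since a streaming task's stderr ends up in the task logs. Below is a minimal sketch of that idea, not your actual mapper.sh. The function names (fetch, run_mapper), the FETCH_CMD override, and printing only the response length are all assumptions made for illustration. The reporter:status: prefix is the streaming convention for status updates on stderr, assuming your Hadoop version supports it. Note also that wget's --timeout applies to each DNS lookup, connect, and idle read separately, and that wget retries by default, so a single URL can take much longer than 500 seconds. The --tries=1 below is one possible way to limit that.

```shell
#!/bin/sh
# Sketch of a streaming mapper that logs progress to stderr.
# All names here are hypothetical; adapt them to the real mapper.sh.

# Fetch one site. --tries=1 disables wget's default retries (an assumption
# about what is wanted here); --timeout=500 is kept from the original.
fetch() {
  wget -O - --tries=1 --timeout=500 "http://$1" 2>&1
}

# Read one site per line from stdin, log it to stderr, then fetch it.
# FETCH_CMD may name a replacement for fetch (useful for local testing).
run_mapper() {
  local line result
  while IFS= read -r line; do
    # Shown as the task status and visible in the task's stderr log.
    echo "reporter:status:fetching $line" >&2
    result="$("${FETCH_CMD:-fetch}" "$line")"
    # Placeholder output: site name and response length, tab-separated.
    printf '%s\t%s\n' "$line" "${#result}"
    echo "finished $line" >&2
  done
}
```

To use this as the mapper, end the script with a call to run_mapper. The last "fetching" line without a matching "finished" line in a task's stderr log then points at the site that stalled.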
On 10/7/11 1:43 AM, "Aishwarya Venkataraman" <[EMAIL PROTECTED]> wrote:
My mapper job fails. I am basically trying to run a crawler on hadoop and
hadoop kills the crawler (mapper) if it has not heard from it for a certain
timeout period. But I already have a timeout set in my mapper (500 seconds)
which is less than hadoop's timeout (900 seconds). The mapper just stalls
for some reason. My mapper code is as follows:
while read line;do
result="`wget -O - --timeout=500 http://$line 2>&1`"
Any idea why my mapper is getting stalled?
I don't see the difference between the command you have given and the one I
ran. I am not running in local mode. Is there some way by which I can get
intermediate mapper outputs? I would like to see for which site the mapper
is getting stalled.
On Thu, Oct 6, 2011 at 1:41 PM, Robert Evans <[EMAIL PROTECTED]> wrote:
> Are you running in local mode? If not you probably want to run
> hadoop jar ../contrib/streaming/hadoop-0.20.2-streaming.jar -file
> ~/mapper.sh -mapper ./mapper.sh -input ../foo.txt -output output
> You may also want to run hadoop fs -ls output/* to see what files were
> produced. If your mappers failed for some reason then there will be no
> files in the output directory. And you may want to look at the stderr logs
> for your processes through the web UI.
> --Bobby Evans
> On 10/6/11 3:30 PM, "Aishwarya Venkataraman" <[EMAIL PROTECTED]> wrote:
> I ran the following (I am using IdentityReducer):
> ./hadoop jar ../contrib/streaming/hadoop-0.20.2-streaming.jar -file
> ~/mapper.sh -mapper ~/mapper.sh -input ../foo.txt -output output
> When I do
> ./hadoop dfs -cat output/* I do not see any output on screen. Is this how I
> view the output of the mapper?
> On Thu, Oct 6, 2011 at 12:37 PM, Robert Evans <[EMAIL PROTECTED]> wrote:
> > A streaming job's stderr is logged for the task, but its stdout is what is
> > sent to the reducer. The simplest way to get it is to turn off the
> > reducers, and then look at the output in HDFS.
> > --Bobby Evans
> > On 10/6/11 1:16 PM, "Aishwarya Venkataraman" <[EMAIL PROTECTED]> wrote:
> > Hello,
> > I want to view the mapper output for a given hadoop streaming job (that
> > runs a shell script). However I am not able to find this in any log.
> > Where should I look for this?
> > Thanks,
> > Aishwarya
> Aishwarya Venkataraman
> [EMAIL PROTECTED]
> Graduate Student | Department of Computer Science
> University of California, San Diego