Re: no output written to HDFS
Hemanth Yamijala 2012-08-31, 04:46
Hi,
Do both input files contain data that needs to be processed by the mapper in the same fashion? In which case, you could just put the input files under a directory in HDFS and provide that as input. The -input option does accept a directory as argument.

Otherwise, can you please explain a little more what you're trying to do with the two inputs.

Thanks
Hemanth

On Fri, Aug 31, 2012 at 3:00 AM, Periya.Data <[EMAIL PROTECTED]> wrote:

> This is interesting. I changed my command to:
>
> -mapper "cat $1 | $GHU_HOME/test2.py $2" \
>
> is producing output to HDFS. But, the output is not what I expected and is
> not the same as when I do "cat | map " on Linux. It is producing
> part-00000, part-00001 and part-00002. I expected only one output file with
> just 2 records.
>
> I think I have to understand what exactly "-file" does and what exactly
> "-input" does. I am experimenting what happens if I give my input files on
> the command line (like: test2.py arg1 arg2) as against specifying the input
> files via "-file" and "-input" options...
>
> The problem is I have 2 input files...and have no idea how to pass them.
> SHould I keep one in HDFS and stream in the other?
>
> More digging,
> PD/
>
> On Thu, Aug 30, 2012 at 11:52 AM, Periya.Data <[EMAIL PROTECTED]> wrote:
>
>> Hi Bertrand,
>> No, I do not observe the same when I run using cat | map. I can see
>> the output in STDOUT when I run my program.
>>
>> I do not have any reducer. In my command, I provide
>> "-D mapred.reduce.tasks=0". So, I expect the output of the mapper to be
>> written directly to HDFS.
>>
>> Your suspicion maybe right..about the output. In my counters, the "map
>> input records" = 40 and "map.output records" = 0. I am trying to see if I
>> am messing up in my command...(see below)
>>
>> Initially, I had my mapper - "test2.py" to take in 2 arguments. Now, I am
>> streaming one file in and test2.py takes in only one argument. How should I
>> frame my command below? I think that is where I am messing up..
>>
>> run.sh: (run as: cat <arg2> | ./run.sh <arg1> )
>> -----------
>>
>> hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.*-cdh*.jar \
>> -D mapred.reduce.tasks=0 \
>> -verbose \
>> -input "$HDFS_INPUT" \
>> -input "$HDFS_INPUT_2" \
>> -output "$HDFS_OUTPUT" \
>> -file "$GHU_HOME/test2.py" \
>> -mapper "python $GHU_HOME/test2.py $1" \
>> -file "$GHU_HOME/$1"
>>
>> If I modify my mapper to take in 2 arguments, then, I would run it as:
>>
>> run.sh: (run as: ./run.sh <arg1> <arg2>)
>> -----------
>>
>> hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.*-cdh*.jar \
>> -D mapred.reduce.tasks=0 \
>> -verbose \
>> -input "$HDFS_INPUT" \
>> -input "$HDFS_INPUT_2" \
>> -output "$HDFS_OUTPUT" \
>> -file "$GHU_HOME/test2.py" \
>> -mapper "python $GHU_HOME/test2.py $1 $2" \
>> -file "$GHU_HOME/$1" \
>> -file "GHU_HOME/$2"
>>
>> Please let me know if I am making a mistake here.
>>
>> Thanks.
>> PD
>>
>> On Wed, Aug 29, 2012 at 10:45 PM, Bertrand Dechoux <[EMAIL PROTECTED]> wrote:
>>
>>> Do you observe the same thing when running without Hadoop? (cat, map, sort
>>> and then reduce)
>>>
>>> Could you provide the counters of your job? You should be able to get them
>>> using the job tracker interface.
>>>
>>> The most probable answer without more information would be that your
>>> reducer do not output any <key,value>s.
>>>
>>> Regards
>>>
>>> Bertrand
>>>
>>> On Thu, Aug 30, 2012 at 5:52 AM, Periya.Data <[EMAIL PROTECTED]> wrote:
>>>
>>> > Hi All,
>>> > My Hadoop streaming job (in Python) runs to "completion" (both map and
>>> > reduce says 100% complete). But, when I look at the output directory in
>>> > HDFS, the part files are empty. I do not know what might be causing this
>>> > behavior. I understand that the percentages represent the records that