Re: no output written to HDFS
Hi,

Do both input files contain data that needs to be processed by the
mapper in the same fashion? If so, you could just put the input files
under a directory in HDFS and provide that directory as input. The
-input option does accept a directory as an argument.
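
For example, something along these lines (an untested sketch; the HDFS
paths are just placeholders):

    hadoop fs -mkdir /user/pd/input
    hadoop fs -put file1.txt file2.txt /user/pd/input/

    hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.*-cdh*.jar \
        -D mapred.reduce.tasks=0 \
        -input /user/pd/input \
        -output /user/pd/output \
        -file "$GHU_HOME/test2.py" \
        -mapper "python test2.py"

Streaming would then feed every file under /user/pd/input to the mapper
on stdin, so test2.py reads its records from stdin instead of taking
file names as arguments.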

Otherwise, could you please explain a little more what you're trying to
do with the two inputs?

Thanks
Hemanth

On Fri, Aug 31, 2012 at 3:00 AM, Periya.Data <[EMAIL PROTECTED]> wrote:
> This is interesting. I changed my command to:
>
> -mapper "cat $1 |  $GHU_HOME/test2.py $2" \
>
> and that is producing output to HDFS. But the output is not what I expected
> and is not the same as when I do "cat | map" on Linux. It is producing
> part-00000, part-00001 and part-00002. I expected only one output file with
> just 2 records.
>
> I think I have to understand what exactly "-file" does and what exactly
> "-input" does. I am experimenting what happens if I give my input files on
> the command line (like: test2.py arg1 arg2) as against specifying the input
> files via "-file" and "-input" options...
>
> The problem is that I have 2 input files... and I have no idea how to pass
> them. Should I keep one in HDFS and stream in the other?
>
> More digging,
> PD/
>
>
>
> On Thu, Aug 30, 2012 at 11:52 AM, Periya.Data <[EMAIL PROTECTED]> wrote:
>
>> Hi Bertrand,
>>     No, I do not observe the same when I run using cat | map. I can see
>> the output in STDOUT when I run my program.
>>
>> I do not have any reducer. In my command, I provide
>> "-D mapred.reduce.tasks=0". So, I expect the output of the mapper to be
>> written directly to HDFS.
>>
>> Your suspicion may be right about the output. In my counters, the "map
>> input records" = 40 and "map output records" = 0. I am trying to see if I
>> am messing up in my command... (see below)
>>
>> Initially, my mapper "test2.py" took in 2 arguments. Now, I am
>> streaming one file in and test2.py takes in only one argument. How should I
>> frame my command below? I think that is where I am messing up...
>>
>>
>> run.sh:        (run as:   cat <arg2> | ./run.sh <arg1> )
>> -----------
>>
>> hadoop jar
>> /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.*-cdh*.jar \
>>         -D mapred.reduce.tasks=0 \
>>         -verbose \
>>         -input "$HDFS_INPUT" \
>>         -input "$HDFS_INPUT_2" \
>>         -output "$HDFS_OUTPUT" \
>>         -file   "$GHU_HOME/test2.py" \
>>         -mapper "python $GHU_HOME/test2.py $1" \
>>         -file   "$GHU_HOME/$1"
>>
>>
>>
>> If I modify my mapper to take in 2 arguments, then, I would run it as:
>>
>> run.sh:        (run as:   ./run.sh <arg1>  <arg2>)
>> -----------
>>
>> hadoop jar
>> /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.*-cdh*.jar \
>>         -D mapred.reduce.tasks=0 \
>>         -verbose \
>>         -input "$HDFS_INPUT" \
>>         -input "$HDFS_INPUT_2" \
>>         -output "$HDFS_OUTPUT" \
>>         -file   "$GHU_HOME/test2.py" \
>>         -mapper "python $GHU_HOME/test2.py $1 $2" \
>>         -file   "$GHU_HOME/$1" \
>>         -file   "$GHU_HOME/$2"
>>
>>
>> Please let me know if I am making a mistake here.
>>
>>
>> Thanks.
>> PD
>>
>>
>>
>>
>>
>>
>> On Wed, Aug 29, 2012 at 10:45 PM, Bertrand Dechoux <[EMAIL PROTECTED]> wrote:
>>
>>> Do you observe the same thing when running without Hadoop? (cat, map, sort
>>> and then reduce)
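>>>
>>> Something like this, for instance (a rough sketch; adjust the names to
>>> your actual scripts and arguments):
>>>
>>>     cat input.txt | python test2.py arg1 | sort | python reducer.py
>>>
>>> or, if you run with no reducer, just:
>>>
>>>     cat input.txt | python test2.py arg1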
>>>
>>> Could you provide the counters of your job? You should be able to get them
>>> using the job tracker interface.
>>>
>>> The most probable answer, without more information, would be that your
>>> reducer does not output any <key,value> pairs.
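>>>
>>> (As a quick sanity check, something like the following, untested, should
>>> always produce non-empty part files, because /bin/cat simply copies its
>>> input through as the mapper:
>>>
>>>     hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.*-cdh*.jar \
>>>         -D mapred.reduce.tasks=0 \
>>>         -input <your HDFS input> \
>>>         -output <a new HDFS output dir> \
>>>         -mapper /bin/cat
>>>
>>> If that produces data but your job does not, the problem is in what your
>>> script writes to stdout.)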
>>>
>>> Regards
>>>
>>> Bertrand
>>>
>>>
>>>
>>> On Thu, Aug 30, 2012 at 5:52 AM, Periya.Data <[EMAIL PROTECTED]>
>>> wrote:
>>>
>>> > Hi All,
>>> >    My Hadoop streaming job (in Python) runs to "completion" (both map
>>> > and reduce say 100% complete). But when I look at the output directory in
>>> > HDFS, the part files are empty. I do not know what might be causing this
>>> > behavior. I understand that the percentages represent the records that