MapReduce, mail # user - Re: debugging hadoop streaming programs (first code)


Mahesh Balija 2012-11-20, 12:42
Re: debugging hadoop streaming programs (first code)
jamal sasha 2012-11-20, 13:33
Hi,
   If I just use pipes, the code runs just fine; the issue is when I
deploy it on the cluster.
:(
Any suggestions on how to debug it?
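A common cause of streaming jobs that pass the local pipe test but die on the cluster is a single malformed input record crashing the mapper process. One standard defensive pattern, sketched below under the assumption of the comma-separated record format quoted further down (the function and field names are illustrative, not from the original attachment), is to catch per-record errors and log them to stderr, which Hadoop streaming collects into the task's stderr log:

```python
def safe_map(lines, emit, log):
    """Per-record processing that cannot be killed by one bad line.

    `emit` receives output records; `log` receives diagnostics, which
    Hadoop streaming stores in the task's stderr log for inspection.
    """
    good, bad = 0, 0
    for line in lines:
        try:
            fields = [f.strip() for f in line.strip().split(",")]
            if len(fields) != 7:
                raise ValueError("expected 7 fields, got %d" % len(fields))
            # Emit id2 and the timestamp as a tab-separated pair.
            emit("%s\t%s" % (fields[1], fields[3]))
            good += 1
        except ValueError as exc:
            log("skipping bad record %r: %s" % (line, exc))
            bad += 1
    return good, bad
```

In mapper.py this would be driven by something like `safe_map(sys.stdin, emit=lambda s: sys.stdout.write(s + "\n"), log=lambda s: sys.stderr.write(s + "\n"))`; the logged messages then appear in the failed task attempt's stderr log in the JobTracker web UI, which is usually far more informative than the terminal's "job unsuccessful" summary.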
On Tue, Nov 20, 2012 at 7:42 AM, Mahesh Balija <[EMAIL PROTECTED]> wrote:

> Hi Jamal,
>
>           If your MapReduce program is written in Java, you can debug it
> by running your MR job in LocalJobRunner mode via Eclipse.
>           You can also put debug statements (even System.out.println
> calls) in your code to check where your job fails.
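The local-runner route mentioned above is purely a matter of configuration; on the Hadoop 1.x releases current at the time of this thread, local mode was selected with roughly the following properties (a sketch, using the property names of that era, not a complete configuration):

```xml
<!-- mapred-site.xml: run jobs in-process with LocalJobRunner -->
<property>
  <name>mapred.job.tracker</name>
  <value>local</value>
</property>
<!-- core-site.xml: read input from the local filesystem -->
<property>
  <name>fs.default.name</name>
  <value>file:///</value>
</property>
```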
>
>           I am not sure about Python, but one suggestion: run your
> Python code (map unit & reduce unit) locally on your input data and
> see whether your logic has any issues.
>
> Best,
> Mahesh Balija,
> Calsoft Labs.
>
>
> On Tue, Nov 20, 2012 at 6:50 AM, jamal sasha <[EMAIL PROTECTED]> wrote:
>
>>
>> Hi,
>>   This is my first attempt to learn the MapReduce abstraction.
>>
>> My problem is as follows
>> I have a text file as follows:
>> id1, id2, date, time, mrps, code, code2
>>
>> 3710100022400,1350219887, 2011-09-10, 12:39:38.000, 99.00, 1, 0
>> 3710100022400, 5045462785, 2011-09-06, 13:23:00.000, 70.63, 1, 0
>>
>>
>> Now what I want to do is count the number of transactions happening in every half hour between 7 am and 11 am.
>>
>> So here are the intervals.
>>
>>
>> 7-7:30 ->0
>>
>> 7:30-8 -> 1
>>
>> 8-8:30->2
>>
>> ....
>>
>> 10:30-11->7
>>
>> So ultimately what I am doing is creating a 2d dictionary
>>
>> d[id2][interval] = count_transactions.
>>
>>
>> My mappers and reducers are attached (sample input also).
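The attachments are not preserved in this archive; the following is a hypothetical sketch of a mapper and reducer implementing the scheme described above (the field positions, the `id2,interval` key format, and all names are assumptions, not the original code):

```python
from collections import defaultdict

def interval_index(time_str):
    """Map an HH:MM:SS time string to a half-hour slot 0-7 within 07:00-11:00.

    Returns None for times outside that window.
    """
    hour, minute = int(time_str[:2]), int(time_str[3:5])
    if not 7 <= hour < 11:
        return None
    return (hour - 7) * 2 + (1 if minute >= 30 else 0)

def mapper(lines):
    """Emit one 'id2,interval<TAB>1' record per in-window transaction."""
    for line in lines:
        fields = [f.strip() for f in line.strip().split(",")]
        idx = interval_index(fields[3])
        if idx is not None:
            yield "%s,%d\t1" % (fields[1], idx)

def reducer(lines):
    """Sum counts per (id2, interval) key -- the 2d dictionary as output."""
    counts = defaultdict(int)
    for line in lines:
        key, value = line.strip().split("\t")
        counts[key] += int(value)
    for key in sorted(counts):
        yield "%s\t%d" % (key, counts[key])
```

Wired to stdin/stdout in mapper.py and reducer.py, this reproduces the `cat input.txt | python mapper.py | sort | python reducer.py` pipeline described below.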
>>
>> The code runs just fine if I run it via
>>
>> cat input.txt | python mapper.py | sort | python reducer.py
>>
>>
>> It gives me the output, but when I run it on the cluster it throws an error which is not helpful (basically the terminal says job unsuccessful, reason NA).
>>
>> Any suggestions on what I am doing wrong?
>>
>>
>> Jamal
>>
>