Re: modifying existing wordcount example
Thanks for the help.
I just implemented it as suggested: I process the new file and then
join it with the previous results. But can I modify the original document
so that it holds the updated counts plus the new word counts?
My inputs are step1_word_count_output.txt + new_raw_input.
The output I want is saved back to step1_word_count_output.txt.
That is to say, I just want to have one output file.
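
A minimal sketch of the "one output file" idea, assuming both files are small enough to handle on the local filesystem with plain java.nio.file (no Hadoop classes are used). The class and method names are hypothetical. It merges the previous counts with the new raw input, writes the result to a temporary sibling file, and only then replaces step1_word_count_output.txt, so the file being read is never the file being written.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: local-filesystem illustration only, not Hadoop API code.
final class UpdateCountsFileSketch {

    static void update(Path countsFile, Path newRawInput) throws IOException {
        Map<String, Long> totals = new TreeMap<>();

        // Previous results: one "word<whitespace>count" pair per line.
        for (String line : Files.readAllLines(countsFile, StandardCharsets.UTF_8)) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2) {
                totals.merge(parts[0], Long.parseLong(parts[1]), Long::sum);
            }
        }

        // New raw input: every whitespace-separated token counts once.
        for (String line : Files.readAllLines(newRawInput, StandardCharsets.UTF_8)) {
            for (String token : line.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    totals.merge(token, 1L, Long::sum);
                }
            }
        }

        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Long> e : totals.entrySet()) {
            lines.add(e.getKey() + "\t" + e.getValue());
        }

        // Write to a temporary sibling first, then swap it over the original.
        Path tmp = countsFile.resolveSibling(countsFile.getFileName() + ".tmp");
        Files.write(tmp, lines, StandardCharsets.UTF_8);
        Files.move(tmp, countsFile, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: UpdateCountsFileSketch <counts file> <new raw input>");
            return;
        }
        update(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

The same write-then-swap pattern applies on HDFS: a MapReduce job refuses to write into an output directory that already exists, so the merged counts go to a new directory, which can then be renamed over the old one. Running the job with a single reducer yields a single part file.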

On Wed, Jan 16, 2013 at 7:30 PM, <[EMAIL PROTECTED]> wrote:

> Hi Jamal
>
> You can use the Distributed Cache only if the file to be distributed is small.
> MapReduce is meant to deal with larger datasets, so you should expect the
> output file to get larger.
>
> The simple, straightforward way: process the second data set, then merge
> the first output with the second output. You can use
> KeyValueTextInputFormat to load the outputs into the second MR job.
>
> Alternatively, you can use MultipleInputs here: in the mapper, emit the new
> input file as 'word 1' and the previous output file as 'word $count', and do
> the aggregation in the reducer.
> Regards
> Bejoy KS
>
> Sent from remote device, Please excuse typos
> ------------------------------
> *From: * jamal sasha <[EMAIL PROTECTED]>
> *Date: *Wed, 16 Jan 2013 18:54:04 -0800
> *To: *[EMAIL PROTECTED]<[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
> *ReplyTo: * [EMAIL PROTECTED]
> *Subject: *Re: modifying existing wordcount example
>
> Hi,
>  Thanks for giving your thoughts.
> I was reading about some libraries in Hadoop, and I feel like the distributed
> cache might help me.
> But I picked up Hadoop very recently (and Java along with it), and I am
> not able to think of how to actually code it :(
>
>
> On Wed, Jan 16, 2013 at 6:13 PM, Chris Embree <[EMAIL PROTECTED]> wrote:
>
>> Can you instead copy input1 and input2 together?
>>
>> Or process both files on the second pass?
>>
>> Otherwise, you'll have to read in the output file, load the values, and start
>> your map/reduce job.
>>
>> Probably someone else will have a better answer. :)
>>
>>
>> On Wed, Jan 16, 2013 at 9:07 PM, jamal sasha <[EMAIL PROTECTED]> wrote:
>>
>>> Hi,
>>>   In the wordcount example:
>>> http://hadoop.apache.org/docs/r0.17.0/mapred_tutorial.html
>>> Let's say I run the above example and save the output.
>>> Now let's say I have a new input file. What I want to do is
>>> run the wordcount again, but this time updating the previous
>>> counts.
>>> For example:
>>> sample_input1.txt  //foo bar foo bar bar bar
>>> After the first run:
>>> 1) foo 2
>>> 2) bar 4
>>>
>>> Save it in output1.txt
>>>
>>> Now sample_input2.txt //bar hello world
>>> Now the result I am looking for is:
>>> 1) foo 2
>>> 2) bar 5
>>> 3) hello 1
>>> 4) world 1
>>>
>>> How do I achieve this in MapReduce?
>>>
>>>
>>
>
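
The multiple-inputs suggestion from the thread can be sketched in plain Java, using the example data from the original question. This is only a simulation of the per-record logic: no Hadoop classes are involved, and every class and method name below is hypothetical. One "mapper" turns new raw text into (word, 1) pairs, a second turns previous output lines into (word, count) pairs, and a single "reducer" sums the values per word. In a real job, the two map functions would be the bodies of two Mapper classes registered for the two input paths, and the sum would live in the Reducer.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the map and reduce steps; not Hadoop API code.
final class IncrementalWordCountSketch {

    // "Mapper" for the new raw input: emits (word, 1) for every token.
    static List<Map.Entry<String, Long>> mapRawLine(String line) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(new SimpleEntry<>(token, 1L));
            }
        }
        return out;
    }

    // "Mapper" for the previous output: emits (word, count) from a
    // "word<whitespace>count" line and skips anything malformed.
    static List<Map.Entry<String, Long>> mapPreviousOutputLine(String line) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        String[] parts = line.trim().split("\\s+");
        if (parts.length == 2) {
            out.add(new SimpleEntry<>(parts[0], Long.parseLong(parts[1])));
        }
        return out;
    }

    // "Shuffle and reducer": groups by word and sums all values for it.
    static Map<String, Long> reduce(List<Map.Entry<String, Long>> pairs) {
        Map<String, Long> totals = new TreeMap<>();
        for (Map.Entry<String, Long> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Long::sum);
        }
        return totals;
    }

    static Map<String, Long> run(List<String> previousOutput, List<String> newInput) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        for (String line : previousOutput) {
            pairs.addAll(mapPreviousOutputLine(line));
        }
        for (String line : newInput) {
            pairs.addAll(mapRawLine(line));
        }
        return reduce(pairs);
    }

    public static void main(String[] args) {
        List<String> output1 = Arrays.asList("foo\t2", "bar\t4");   // first run's result
        List<String> sampleInput2 = Arrays.asList("bar hello world");
        // Prints {bar=5, foo=2, hello=1, world=1}
        System.out.println(run(output1, sampleInput2));
    }
}
```

Because both mappers emit the same (word, number) shape, the reducer does not need to know which input a pair came from; it just adds.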