Thanks for the help.
I just implemented it as suggested: I process the new file and then
join it with the previous results. But can I modify the original document
so that it holds the updated counts plus the new word counts?
My inputs are step1_word_count_output.txt + new_raw_input.
The output I want would be saved in step1_word_count_output.txt.
That is to say, I just want to end up with one output file.
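
For reference, the two-inputs idea from the quoted reply below can be sketched in plain Java. This is only a simulation of the logic under stated assumptions, not a Hadoop job: the class and method names (IncrementalWordCount, mapPreviousOutput, mapNewInput, reduce, merge) are made up for illustration, and the previous output is assumed to hold well-formed "word count" lines separated by whitespace.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of "two mappers, one reducer"; no Hadoop classes are used.
// All names here are hypothetical and exist only for this sketch.
class IncrementalWordCount {

    // "Mapper" for the previous output: each line is assumed to be "word count".
    static List<Map.Entry<String, Integer>> mapPreviousOutput(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            String[] parts = trimmed.split("\\s+");
            // Emit (word, previousCount); assumes the line is well formed.
            pairs.add(new SimpleEntry<>(parts[0], Integer.parseInt(parts[1])));
        }
        return pairs;
    }

    // "Mapper" for the new raw input: emit (word, 1) for every token.
    static List<Map.Entry<String, Integer>> mapNewInput(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(new SimpleEntry<>(word, 1));
                }
            }
        }
        return pairs;
    }

    // "Reducer": sum all values that share the same word.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>(); // sorted by word
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    // Run both "mappers", then feed everything to the single "reducer".
    static Map<String, Integer> merge(List<String> previousOutput, List<String> newInput) {
        List<Map.Entry<String, Integer>> all = new ArrayList<>(mapPreviousOutput(previousOutput));
        all.addAll(mapNewInput(newInput));
        return reduce(all);
    }

    public static void main(String[] args) {
        // Sample data from the original question in this thread.
        Map<String, Integer> totals = merge(
                Arrays.asList("foo\t2", "bar\t4"),
                Arrays.asList("bar hello world"));
        totals.forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```

With the sample data from the original question, this gives foo 2, bar 5, hello 1, world 1 (printed in alphabetical order because of the TreeMap). In a real job, the two inputs would presumably be wired up with MultipleInputs, using KeyValueTextInputFormat for the previous output and TextInputFormat for the new input. As far as I know, Hadoop refuses to write into an output directory that already exists, so the merged result would go to a fresh directory and then replace the old file.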
On Wed, Jan 16, 2013 at 7:30 PM, <[EMAIL PROTECTED]> wrote:
> Hi Jamal
> You can use Distributed Cache only if the file to be distributed is small.
> MapReduce is meant to deal with larger datasets, so you should expect the
> output file to get larger.
> The simple, straightforward approach: process the second data set, then
> merge the first output with the second output. You can use
> KeyValueTextInputFormat to load the outputs into a second MR job.
> Otherwise, you can use MultipleInputs here: in the mappers, emit the new
> input file as 'word 1' and the previous output file as 'word $count', and
> do the aggregation in the reducer.
> Bejoy KS
> Sent from remote device, Please excuse typos
> *From: * jamal sasha <[EMAIL PROTECTED]>
> *Date: *Wed, 16 Jan 2013 18:54:04 -0800
> *To: *[EMAIL PROTECTED]<[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
> *ReplyTo: * [EMAIL PROTECTED]
> *Subject: *Re: modifying existing wordcount example
> Thanks for giving your thoughts.
> I was reading about some libraries in Hadoop, and I feel like the
> distributed cache might help me.
> But I picked up Hadoop very recently (and Java along with it), and I am
> not able to work out how to actually code it :(
> On Wed, Jan 16, 2013 at 6:13 PM, Chris Embree <[EMAIL PROTECTED]> wrote:
>> Can you instead copy input1 and input2 together?
>> Or process both files on the second pass?
>> Otherwise, you'll have to read in the output file, load the values, and
>> start your map/reduce job.
>> Probably someone else will have a better answer. :)
>> On Wed, Jan 16, 2013 at 9:07 PM, jamal sasha <[EMAIL PROTECTED]> wrote:
>>> In the wordcount example:
>>> Let's say I run the above example and save the output.
>>> But let's say that I now have a new input file. What I want to do is
>>> basically run the wordcount again, but modifying the previous output.
>>> For example:
>>> sample_input1.txt //foo bar foo bar bar bar
>>> After first run:
>>> 1) foo 2
>>> 2) bar 4
>>> Save it in output1.txt
>>> Now sample_input2.txt //bar hello world
>>> Now the result I am looking for is:
>>> 1) foo 2
>>> 2) bar 5
>>> 3) hello 1
>>> 4) world 1
>>> How do I achieve this in MapReduce?