Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Plain View
MapReduce >> mail # user >> Cumulative value using mapreduce


+
Sarath 2012-10-04, 13:58
+
Bertrand Dechoux 2012-10-04, 16:20
+
Ted Dunning 2012-10-04, 17:52
+
java8964 java8964 2012-10-04, 19:02
+
Bertrand Dechoux 2012-10-04, 21:21
+
Sarath 2012-10-05, 04:56
+
Bertrand Dechoux 2012-10-05, 06:38
+
Ted Dunning 2012-10-05, 05:50
+
Steve Loughran 2012-10-05, 14:43
+
Jane Wayne 2012-10-05, 15:21
Copy link to this message
-
Re: Cumulative value using mapreduce
i'm reading the other posts. i had assume you had +1 reducers.

if you just have 1 reducer, then no matter what, every key-value pair goes
there. so, in that case, i agree with java8964. you emit all records with
one key to that one reducer. make sure you apply secondary sorting (that
means you will have to come up with a composite key). when the data comes
into the reducer, just keep a running count and emit each time.

On Fri, Oct 5, 2012 at 11:21 AM, Jane Wayne <[EMAIL PROTECTED]>wrote:

> there's probably a million ways to do it, but it seems like it can be
> done, per your question. off the top of my head, you'd probably want to do
> the cumulative sum in the reducer. if you're savy, maybe even make the
> reducer reusable as a combiner (looks like this problem might have an
> associative and commutative reducer).
>
> the difficulty with this problem is that for n input records, you will
> have n output records (looking at your example). furthermore, each n-th
> output record requires information from all the previous (n-1) records. so,
> if you have 1 billion input records, it's looking like you may have to move
> a lot of intermediary key-value pairs to your reducer.
>
> here's a suggestion and please critique, perhaps i may learn something.
> let's take a naive approach. i assume you have this data in a text file
> with CSV. i assume the Tx Ids are sequential, and you know what the
> start/stop Tx Id is. the mapper/reducer "pseudocode" looks like the
> following.
>
> map(byteOffset, text) {
>  data = parse(text)
>  for i=data.txId to stopTxId
>   emit(i, data)
> }
>
> reduce(txId, datas) {
>  cr = 0
>  dr = 0
>
>  while datas.hasMoreItems
>   data = data.nextItem //iterate
>   if "dr" == data.crDrIndicator
>    dr += data.amount
>   else
>    cr += data.amount
>
>  emit(txId, {cr, dr})
> }
>
> what's not desirable about this pseudocode?
> 1. lots of intermediary key-value pairs
> 2. no combiner
> 3. requires knowledge of background information and certain assumptions
> 4. will definitely create "stragglers" (some mappers/reducers will take
> longer to complete than others)
> 5. overflow issues with the cumulative sum?
>
> i thought about the secondary sorting idea, but i'm still not sure how
> that can work. what would you sort on?
>
> one of the things i learned in programming 101, get the algorithm to work
> first, then optimize later. hope this helps. please feel free to critique.
> would love to learn some more.
>
> On Fri, Oct 5, 2012 at 12:56 AM, Sarath <
> [EMAIL PROTECTED]> wrote:
>
>>  Thanks for all your responses. As suggested will go through the
>> documentation once again.
>>
>> But just to clarify, this is not my first map-reduce program. I've
>> already written a map-reduce for our product which does filtering and
>> transformation of the financial data. This is a new requirement we've got.
>> I have also did the logic of calculating the cumulative sums. But the
>> output is not coming as desired and I feel I'm not doing it right way and
>> missing something. So thought of taking a quick help from the mailing list.
>>
>> As an example, say we have records as below -
>>   Txn ID
>>  Txn Date
>>  Cr/Dr Indicator
>>  Amount
>>   1001
>>  9/22/2012
>>  CR
>>  1000
>>   1002
>>  9/25/2012
>>  DR
>>  500
>>   1003
>>  10/1/2012
>>  DR
>>  1500
>>   1004
>>  10/4/2012
>>  CR
>>  2000
>>
>> When this file passed the logic should append the below 2 columns to the
>> output for each record above -
>>   CR Cumulative Amount
>>  DR Cumulative Amount
>>   1000
>>  0
>>   1000
>>  500
>>   1000
>>  2000
>>   3000
>>  2000
>>
>> Hope the problem is clear now. Please provide your suggestions on the
>> approach to the solution.
>>
>> Regards,
>> Sarath.
>>
>>
>> On Friday 05 October 2012 02:51 AM, Bertrand Dechoux wrote:
>>
>> I indeed didn't catch the cumulative sum part. Then I guess it begs for
>> what-is-often-called-a-secondary-sort, if you want to compute different
>> cumulative sums during the same job. It can be more or less easy to
+
java8964 java8964 2012-10-05, 14:03
+
Sarath 2012-10-19, 06:03