Re: Cumulative value using mapreduce
i'm reading the other posts. i had assumed you had more than one reducer.

if you just have 1 reducer, then no matter what, every key-value pair goes
there. so, in that case, i agree with java8964. you emit all records with
one key to that one reducer. make sure you apply secondary sorting (that
means you will have to come up with a composite key). when the data comes
into the reducer, just keep a running count and emit each time.
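to make the composite key idea concrete, here's a minimal sketch against hadoop's java api. the class and field names (CumulativeSum, TxKey, RunningTotalReducer, the csv layout txId,date,indicator,amount) are made up for illustration, not code from this thread:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Reducer;

public class CumulativeSum {

  // composite key: a constant bucket field so every record meets at the
  // single reducer, plus the txId as the secondary sort field.
  public static class TxKey implements WritableComparable<TxKey> {
    int bucket;   // always 0 here
    long txId;

    public void write(DataOutput out) throws IOException {
      out.writeInt(bucket);
      out.writeLong(txId);
    }
    public void readFields(DataInput in) throws IOException {
      bucket = in.readInt();
      txId = in.readLong();
    }
    public int compareTo(TxKey o) {  // sort by bucket first, then txId
      int c = Integer.compare(bucket, o.bucket);
      return c != 0 ? c : Long.compare(txId, o.txId);
    }
  }

  // with a grouping comparator that compares only the bucket field,
  // all records arrive in one reduce call, already sorted by txId.
  public static class RunningTotalReducer
      extends Reducer<TxKey, Text, LongWritable, Text> {
    @Override
    protected void reduce(TxKey key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      long cr = 0, dr = 0;
      for (Text v : values) {                 // values are in txId order
        String[] f = v.toString().split(","); // txId,date,indicator,amount
        if ("DR".equals(f[2])) dr += Long.parseLong(f[3]);
        else                   cr += Long.parseLong(f[3]);
        ctx.write(new LongWritable(Long.parseLong(f[0])),
                  new Text(cr + "," + dr));   // emit running totals per record
      }
    }
  }
}

the grouping comparator (set via job.setGroupingComparatorClass) and a partitioner that looks only at the bucket field are the fiddly parts; without them the txId ordering inside the reduce call isn't guaranteed.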

On Fri, Oct 5, 2012 at 11:21 AM, Jane Wayne <[EMAIL PROTECTED]> wrote:

> there's probably a million ways to do it, but it seems like it can be
> done, per your question. off the top of my head, you'd probably want to do
> the cumulative sum in the reducer. if you're savvy, maybe even make the
> reducer reusable as a combiner (looks like this problem might have an
> associative and commutative reducer).
>
> the difficulty with this problem is that for n input records, you will
> have n output records (looking at your example). furthermore, each n-th
> output record requires information from all the previous (n-1) records. so,
> if you have 1 billion input records, it's looking like you may have to move
> a lot of intermediary key-value pairs to your reducer.
>
> here's a suggestion and please critique, perhaps i may learn something.
> let's take a naive approach. i assume you have this data in a CSV text
> file. i assume the Tx Ids are sequential, and you know what the
> start/stop Tx Id is. the mapper/reducer "pseudocode" looks like the
> following.
>
> map(byteOffset, text) {
>  data = parse(text)
>  for i=data.txId to stopTxId
>   emit(i, data)
> }
>
> reduce(txId, datas) {
>  cr = 0
>  dr = 0
>
>  while datas.hasMoreItems
>  data = datas.nextItem // advance the iterator
>   if "dr" == data.crDrIndicator
>    dr += data.amount
>   else
>    cr += data.amount
>
>  emit(txId, {cr, dr})
> }
>
> what's not desirable about this pseudocode?
> 1. lots of intermediary key-value pairs
> 2. no combiner
> 3. requires knowledge of background information and certain assumptions
> 4. will definitely create "stragglers" (some mappers/reducers will take
> longer to complete than others)
> 5. overflow issues with the cumulative sum?
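a rough, runnable java translation of the pseudocode above, under the same assumptions it states (csv records laid out as txId,date,indicator,amount, and a stop Tx Id known up front; STOP_TX_ID and the class names are made up for illustration):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class NaiveCumulative {
  static final long STOP_TX_ID = 1004;  // hypothetical; must be known up front

  public static class ReplicatingMapper
      extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().split(",");  // txId,date,indicator,amount
      long txId = Long.parseLong(f[0]);
      // replicate this record to every txId at or after its own, so each
      // reducer key sees the full history up to that point.
      for (long i = txId; i <= STOP_TX_ID; i++) {
        ctx.write(new LongWritable(i), line);
      }
    }
  }

  public static class SumReducer
      extends Reducer<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void reduce(LongWritable txId, Iterable<Text> records, Context ctx)
        throws IOException, InterruptedException {
      long cr = 0, dr = 0;
      for (Text r : records) {
        String[] f = r.toString().split(",");
        if ("DR".equals(f[2])) dr += Long.parseLong(f[3]);
        else                   cr += Long.parseLong(f[3]);
      }
      ctx.write(txId, new Text(cr + "," + dr));  // cumulative totals at txId
    }
  }
}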
>
> i thought about the secondary sorting idea, but i'm still not sure how
> that can work. what would you sort on?
>
> one of the things i learned in programming 101, get the algorithm to work
> first, then optimize later. hope this helps. please feel free to critique.
> would love to learn some more.
>
> On Fri, Oct 5, 2012 at 12:56 AM, Sarath <
> [EMAIL PROTECTED]> wrote:
>
>>  Thanks for all your responses. As suggested, I will go through the
>> documentation once again.
>>
>> But just to clarify, this is not my first map-reduce program. I've
>> already written a map-reduce job for our product which does filtering and
>> transformation of the financial data. This is a new requirement we've got.
>> I have also implemented the logic for calculating the cumulative sums. But
>> the output is not coming out as desired, and I feel I'm not doing it the
>> right way and am missing something. So I thought of taking quick help from
>> the mailing list.
>>
>> As an example, say we have records as below -
>>   Txn ID   Txn Date    Cr/Dr Indicator   Amount
>>   1001     9/22/2012   CR                1000
>>   1002     9/25/2012   DR                500
>>   1003     10/1/2012   DR                1500
>>   1004     10/4/2012   CR                2000
>>
>> When this file is processed, the logic should append the two columns
>> below to the output for each record above -
>>   CR Cumulative Amount   DR Cumulative Amount
>>   1000                   0
>>   1000                   500
>>   1000                   2000
>>   3000                   2000
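As a quick sanity check, a plain sequential loop (no mapreduce) reproduces the expected columns above; the literals are copied straight from the example records:

public class CumulativeCheck {
  public static void main(String[] args) {
    String[][] txns = {
      {"1001", "CR", "1000"}, {"1002", "DR", "500"},
      {"1003", "DR", "1500"}, {"1004", "CR", "2000"}
    };
    long cr = 0, dr = 0;
    for (String[] t : txns) {
      if ("DR".equals(t[1])) dr += Long.parseLong(t[2]);
      else                   cr += Long.parseLong(t[2]);
      // prints 1000/0, 1000/500, 1000/2000, 3000/2000 in turn
      System.out.println(t[0] + " -> CR " + cr + ", DR " + dr);
    }
  }
}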
>>
>> Hope the problem is clear now. Please provide your suggestions on the
>> approach to the solution.
>>
>> Regards,
>> Sarath.
>>
>>
>> On Friday 05 October 2012 02:51 AM, Bertrand Dechoux wrote:
>>
>> I indeed didn't catch the cumulative sum part. Then I guess it begs for
>> what-is-often-called-a-secondary-sort, if you want to compute different
>> cumulative sums during the same job. It can be more or less easy to