RE: Cumulative value using mapreduce

Are you allowed to change the order of the data in the output? If you want to calculate the cumulative sum per Cr/Dr indicator, it will be easy if the business allows you to group the output by Cr/Dr indicator.
For example, you can do it very easily with the approach I described in my original email if you CAN change the output to something like the following:
Txn ID   Cr/Dr Indicator   Amount   CR Cumulative Amount   DR Cumulative Amount
1001     CR                1000     1000                   0
1004     CR                2000     3000                   0
1002     DR                 500     0                      500
1003     DR                1500     0                      2000
As you can see, you have to group your output by the Cr/Dr indicator. If you want to keep the original order, then it is hard; at least I cannot think of a way on short notice.
But if you are allowed to change the order of the output, then this is a cumulative sum with grouping (in this case, group 1 for CR and group 2 for DR).
1) In the mapper, emit your data keyed by the Cr/Dr indicator, which will group the data by CR/DR. So all CR data will go to one reducer, and all DR data will go to one reducer.
2) Besides grouping the data, if you want the output sorted by the Amount (for example) within each group, then you have to do a secondary sort. Search for "secondary sort". Then, for each group, the data arriving at the reducer will be sorted by amount. If you don't need that ordering, just skip the secondary sort.
3) In each reducer, the arriving data should already be grouped. The default partitioner for an MR job is the HashPartitioner. Depending on what hashCode() returns for 'CR' and 'DR', the data for these 2 groups could go to different reducers (assuming you are running with multiple reducers), or to the same reducer. But even if they go to the same reducer, they will arrive as 2 separate groups. So the output of your reducers will be grouped (and sorted by key, by the way).
4) In your reducers, for each group, you will get an array of values. For CR, you will get all the CR records in the array. What you need to do is iterate over the array, calculate the cumulative sum at every element, and emit each record together with its cumulative sum.
5) In the end, your output could be multiple files, as each reducer generates one file. You can merge them into one file, or just leave them as they are in HDFS.
6) For best performance, if you have huge data AND you know all the possible values of THE indicator, you may want to consider using your own custom Partitioner instead of the HashPartitioner. What you want is something like a round-robin distribution of your keys across the available reducers, instead of a random distribution by hash value. Keep in mind that random distribution DOES NOT work well when the number of distinct keys is small.
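The steps above can be sketched in plain Python. This is only a local simulation of the map, shuffle and reduce phases, not Hadoop code: the function names (map_record, shuffle, reduce_group, run_job, assign_partitions) are made up for illustration, and the sample records are the four rows from the table earlier in this message.

```python
from collections import defaultdict

# Sample records from the table above: (txn_id, indicator, amount).
RECORDS = [
    (1001, "CR", 1000),
    (1002, "DR", 500),
    (1003, "DR", 1500),
    (1004, "CR", 2000),
]


def map_record(record):
    """Step 1: emit the Cr/Dr indicator as the key and the record as the value."""
    txn_id, indicator, amount = record
    return indicator, (txn_id, amount)


def shuffle(pairs):
    """Stand-in for the framework's shuffle: collect values per key, keys sorted."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_group(indicator, values):
    """Step 4: walk one group and emit each record with its running total."""
    running = 0
    for txn_id, amount in values:
        running += amount
        cr_total = running if indicator == "CR" else 0
        dr_total = running if indicator == "DR" else 0
        yield txn_id, indicator, amount, cr_total, dr_total


def run_job(records):
    """Run map, shuffle and reduce in sequence on a small in-memory input."""
    grouped = shuffle(map_record(r) for r in records)
    return [row for key, values in grouped for row in reduce_group(key, values)]


def assign_partitions(known_keys, num_reducers):
    """Step 6: deal the known keys out to reducers round-robin (hypothetical helper)."""
    return {key: i % num_reducers for i, key in enumerate(sorted(known_keys))}


if __name__ == "__main__":
    for row in run_job(RECORDS):
        print(row)
    print(assign_partitions(["CR", "DR"], 2))
```

In a real job the framework does the shuffle, and the round-robin mapping would live inside a custom Partitioner; the sketch only shows why a fixed key-to-reducer assignment guarantees that CR and DR land on different reducers when two are available.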
Date: Fri, 5 Oct 2012 10:26:43 +0530
Subject: Re: Cumulative value using mapreduce
      Thanks for all your responses. As suggested, I will go through
      the documentation once again.


      But just to clarify, this is not my first map-reduce program. I've
      already written a map-reduce job for our product which does
      filtering and transformation of the financial data. This is a new
      requirement we've got. I have also written the logic for
      calculating the cumulative sums. But the output is not coming out
      as desired, and I feel I'm not doing it the right way and am
      missing something. So I thought of asking the mailing list for
      some quick help.


      As an example, say we have records as below -

            Txn ID            Txn Date            Cr/Dr Indicator

      When this file is passed in, the logic should append the 2
      columns below to the output for each record above -

            CR Cumulative Amount            DR Cumulative Amount

      Hope the problem is clear now. Please provide your suggestions on
      the approach to the solution.

      On Friday 05 October 2012 02:51 AM, Bertrand Dechoux wrote:

    I indeed didn't catch the cumulative sum part. Then I
      guess it calls for what is often called a secondary sort, if you
      want to compute different cumulative sums during the same job. It
      can be more or less easy to implement depending on which
      API/library/tool you are using. Ted's comments on performance are
      spot on.
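
To make the secondary sort idea concrete, here is a minimal plain-Python sketch, again a simulation and not Hadoop code. It assumes a composite key of (indicator, amount): sorting uses the full key, while grouping uses only the indicator, so each group reaches the reduce step already ordered by amount. The function name secondary_sort_job and the shuffled input order are assumptions for illustration.

```python
from itertools import groupby

# Same four transactions as in the thread, deliberately out of order.
UNSORTED = [
    (1003, "DR", 1500),
    (1004, "CR", 2000),
    (1002, "DR", 500),
    (1001, "CR", 1000),
]


def secondary_sort_job(records):
    """Simulate a secondary sort followed by a per-group cumulative sum."""
    # Map: build a composite key (indicator, amount); the value is the txn id.
    pairs = [((indicator, amount), txn_id) for txn_id, indicator, amount in records]

    # Sort comparator: order by the full composite key.
    pairs.sort(key=lambda kv: kv[0])

    out = []
    # Grouping comparator: only the indicator part of the key defines a group.
    for indicator, group in groupby(pairs, key=lambda kv: kv[0][0]):
        running = 0
        for (_, amount), txn_id in group:
            running += amount
            out.append((txn_id, indicator, amount, running))
    return out


if __name__ == "__main__":
    for row in secondary_sort_job(UNSORTED):
        print(row)
```

In an actual job the same split would be expressed with a partitioner and a grouping comparator that look only at the indicator, plus a sort comparator over the whole composite key; how that is wired up depends on the API or tool in use.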