MapReduce, mail # user - Cumulative value using mapreduce


Re: Cumulative value using mapreduce
Bertrand Dechoux 2012-10-05, 06:38
Hi,

The provided example records are perfect. With them, I doubt there will be
any confusion about what kind of data is available and how it should be
manipulated. However, "the output is not coming as desired" is vague. It's
hard to say why you are not getting your expected result without a bit more
information about what has been done.

The aim is to compute cumulative credit & debit amounts (like you said)
using a sequence of records that needs to be sorted by date (and by
transaction id, if the order inside the day is relevant and if the
transaction id is monotonically increasing). The mapper won't have much
logic and will only be responsible for transforming the records so that the
sort happens as expected. The <key,value> would be something like
<[date,transactionId],[CR/DR,amount]>. The reducer would then apply the
logic of calculating the cumulative sums.
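
To make the idea concrete, here is a minimal sketch in plain Python (not the Hadoop API) that simulates the map, sort and reduce steps on the sample records from the quoted message. The function names map_record, reduce_sorted and run are hypothetical, and a single reducer seeing all records in key order is assumed.

```python
from datetime import datetime

# Sample records from the thread: (txn_id, txn_date, indicator, amount),
# deliberately given out of order to show that the sort does the work.
RECORDS = [
    (1003, "10/1/2012", "DR", 1500),
    (1001, "9/22/2012", "CR", 1000),
    (1004, "10/4/2012", "CR", 2000),
    (1002, "9/25/2012", "DR", 500),
]


def map_record(record):
    """Mapper: emit <[date, transactionId], [CR/DR, amount]>."""
    txn_id, txn_date, indicator, amount = record
    # Parse the date so keys sort chronologically rather than as strings.
    date = datetime.strptime(txn_date, "%m/%d/%Y").date()
    return (date, txn_id), (indicator, amount)


def reduce_sorted(pairs):
    """Reducer: walk the key-sorted pairs and keep running CR/DR totals."""
    cr_total = dr_total = 0
    for (_date, txn_id), (indicator, amount) in pairs:
        if indicator == "CR":
            cr_total += amount
        else:
            dr_total += amount
        yield txn_id, cr_total, dr_total


def run(records):
    # sorted() stands in for the framework's shuffle/sort phase
    # (assumption: one reducer receives every record).
    shuffled = sorted(map_record(r) for r in records)
    return list(reduce_sorted(shuffled))


result = run(RECORDS)
for row in result:
    print(row)
```

The printed rows carry the transaction id followed by the CR and DR cumulative amounts, which can be compared with the expected columns given in the quoted message.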

I can see different variations, such as:
* what exactly the reducer input value should be: [CR/DR,amount] or only a
signed amount. It doesn't change the logic much but it could help reduce
the volume of data. Alternatives for serialization and compression should
also be explored.
* whether several reducers should be used or not. More than one could be
used, but then, in order to have the full cumulative sums, a kind of
post-reduce merge should be performed. The last results of a file will be
CR/DR offsets that should be applied to the results of the next file. The
partitioning will greatly depend on the processed time range and the
associated data volumes.
* what group should be used by the reducer: only one group (with all
values sorted inside this single group), or one group per date with internal
sorting per transaction id, or one group per [date,transactionId]. I
honestly don't know the impact that each would have without doing
benchmarks.
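
The post-reduce merge mentioned in the second point could look like the sketch below. It is plain Python, not Hadoop code. It assumes the records are range-partitioned by date (here a hypothetical one-partition-per-month split), so that the partition order matches the global order. The names local_cumulative and merge_with_offsets are made up for illustration.

```python
def local_cumulative(partition):
    """Per-reducer pass: cumulative (CR, DR) sums within one sorted partition."""
    cr = dr = 0
    out = []
    for txn_id, indicator, amount in partition:
        if indicator == "CR":
            cr += amount
        else:
            dr += amount
        out.append((txn_id, cr, dr))
    return out


def merge_with_offsets(partition_outputs):
    """Post-reduce merge: shift each file by the last totals of the previous one."""
    cr_offset = dr_offset = 0
    merged = []
    for output in partition_outputs:
        shifted = [(t, cr + cr_offset, dr + dr_offset) for t, cr, dr in output]
        merged.extend(shifted)
        if shifted:
            # The last (already shifted) row becomes the offset for the next file.
            _, cr_offset, dr_offset = shifted[-1]
    return merged


# Assumption: each partition is already sorted by [date, transactionId].
september = [(1001, "CR", 1000), (1002, "DR", 500)]
october = [(1003, "DR", 1500), (1004, "CR", 2000)]

outputs = [local_cumulative(p) for p in (september, october)]
merged = merge_with_offsets(outputs)
print(merged)
```

The merge itself is sequential, but it only has to read the last row of each file to obtain the offsets, so the expensive per-record work stays parallel.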

Yet, all these details might be far from your real problem. So if you
provide more details about your actual computation and results, you might
receive more constructive answers with regard to your problem.

Regards

Bertrand

On Fri, Oct 5, 2012 at 6:56 AM, Sarath <[EMAIL PROTECTED]> wrote:

> Thanks for all your responses. As suggested, I will go through the
> documentation once again.
>
> But just to clarify, this is not my first map-reduce program. I've already
> written a map-reduce for our product which does filtering and
> transformation of the financial data. This is a new requirement we've got.
> I have also written the logic for calculating the cumulative sums. But the
> output is not coming as desired and I feel I'm not doing it the right way
> and am missing something. So I thought of asking the mailing list for some
> quick help.
>
> As an example, say we have records as below -
>   Txn ID   Txn Date    Cr/Dr Indicator   Amount
>   1001     9/22/2012   CR                1000
>   1002     9/25/2012   DR                500
>   1003     10/1/2012   DR                1500
>   1004     10/4/2012   CR                2000
>
> When this file is passed in, the logic should append the below 2 columns
> to the output for each record above -
>   CR Cumulative Amount   DR Cumulative Amount
>   1000                   0
>   1000                   500
>   1000                   2000
>   3000                   2000
>
> Hope the problem is clear now. Please provide your suggestions on the
> approach to the solution.
>
> Regards,
> Sarath.
>
>
> On Friday 05 October 2012 02:51 AM, Bertrand Dechoux wrote:
>
> I indeed didn't catch the cumulative sum part. Then I guess it begs for
> what-is-often-called-a-secondary-sort, if you want to compute different
> cumulative sums during the same job. It can be more or less easy to
> implement depending on which API/library/tool you are using. Ted's comments
> on performance are spot on.
>
>  Regards
>
>  Bertrand
>
> On Thu, Oct 4, 2012 at 9:02 PM, java8964 java8964 <[EMAIL PROTECTED]> wrote:
>
>>  I did the cumulative sum in a Hive UDF, as one of the projects for my
>> employer.
>>
>>  1) You need to decide the grouping elements for your cumulative. For
>> example, an account, a department etc. In the mapper, combine these