-
How to calculate delta in a column?
Eric Yang 2010-12-31, 06:12
Hi,
What is the most efficient way to calculate the delta of columns? Consider this:
(key1, 1, 2, 3)
(key1, 2, 4, 5)
(key2, 1, 2, 4)
(key1, 3, 6, 9)
(key2, 2, 4, 6)
The expected transformation output should look like this:
(key1, 1, 2, 2)
(key1, 1, 2, 4)
(key2, 1, 2, 2)
The idea is to group by f0 and compute f1 (current value) - f1 (previous value). How would I write this in Pig?
If there is an underflow value, it should reset to 0. For example:
(key1, 1, 2, 3)
(key1, 0, 0, 0)
(key1, 2, 3, 4)
The output should be:
(key1, 0, 0, 0)
(key1, 2, 3, 4)
I haven't been able to find a solution on Google. Anyone?
regards, Eric
-
Re: How to calculate delta in a column?
Dmitriy Ryaboy 2010-12-31, 09:16
You can't do this without a way of ordering the data for the same key.
If you do have a way to do this (a timestamp or some such), you can group by key, order the resulting group inside the foreach, and then run it through a UDF (you can even make this UDF accumulative).
grouped = group data by key;
deltas = foreach grouped {
    -- after the group, the bag of tuples is named after the input relation (data)
    ordered_tuples = order data by ordinal;
    -- the grouping key is exposed as the field "group"
    generate group, FLATTEN(calculateDeltas(ordered_tuples));
}

-D
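For what it's worth, the logic that a UDF like calculateDeltas would need can be sketched outside Pig. This is a rough Python sketch, not a Pig UDF. The function names and the second "ordinal" field in each record are hypothetical, since the sample data in the question has no ordering column. It also assumes the "underflow" rule means that any negative per-column difference is clamped to 0. It only ever holds the previous row, which is roughly the shape an accumulative UDF would take.

```python
from collections import defaultdict
from operator import itemgetter


def calculate_deltas(ordered_rows):
    """Yield per-column differences between consecutive rows.

    `ordered_rows` is an iterable of equal-length numeric tuples that is
    already sorted. A negative difference is clamped to 0 (an assumed
    reading of the "underflow" rule in the question).
    """
    previous = None
    for row in ordered_rows:
        if previous is not None:
            yield tuple(max(cur - prev, 0) for cur, prev in zip(row, previous))
        previous = row  # only the previous row is kept in memory


def deltas_by_key(records):
    """Group (key, ordinal, v1, v2, ...) records by key, order, and diff."""
    groups = defaultdict(list)
    for key, ordinal, *values in records:
        groups[key].append((ordinal, tuple(values)))

    result = []
    for key, rows in groups.items():
        rows.sort(key=itemgetter(0))  # order by the hypothetical ordinal field
        for delta in calculate_deltas(values for _, values in rows):
            result.append((key, *delta))
    return result


# The question's sample data, with a made-up ordinal as the second field.
records = [
    ("key1", 1, 1, 2, 3),
    ("key1", 2, 2, 4, 5),
    ("key2", 1, 1, 2, 4),
    ("key1", 3, 3, 6, 9),
    ("key2", 2, 2, 4, 6),
]
print(deltas_by_key(records))
```

With the ordinals above, this prints the three tuples from the expected output in the question. Feeding the rows (1, 2, 3), (0, 0, 0), (2, 3, 4) to calculate_deltas gives (0, 0, 0) and (2, 3, 4), which matches the reset example.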
-
Re: How to calculate delta in a column?
Eric Yang 2011-01-01, 20:19
You are right; in my example, there should be a timestamp column. Thanks, I will look into writing the UDF.
regards, Eric