-
Working on multiple rows
Christian Decker 2010-08-28, 18:10
The title might be a bit misleading, but I hope you can help me. I have some data (let's say a web log file) and I want to be able to compare multiple items with each other. For example, I want to know which items are popular in certain user groups, which means that I want to find items that got many successive hits from users of that group in a short period of time. Until now I have only worked on rows in isolation: items could be filtered or modified without any knowledge of other records. This task requires considering multiple records at once, and I have no clue how to approach it.
Any suggestions?
Regards, Chris
-
Re: Working on multiple rows
Thejas M Nair 2010-08-29, 01:21
Can you give the multiple rows an id and use that? In your example, can you assign a user-group id for each type of user (or maybe a map with attributes, if a user can belong to multiple groups), and then process using that attribute or id? (I might not have understood the problem correctly; an example of input and output data might help.)
-Thejas
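A minimal Python sketch of the idea above (not Pig, and not necessarily what Thejas had in mind): each hit is expanded to one row per group the user belongs to, and hits are then counted per (group, item). The sample data, the `user_groups` mapping and the function name `hits_per_group` are all hypothetical.

```python
from collections import Counter

# Hypothetical sample data: (user, item) hits and a user -> groups mapping.
hits = [("alice", "/a"), ("bob", "/a"), ("carol", "/b"), ("alice", "/b")]
user_groups = {
    "alice": ["admins", "staff"],
    "bob": ["staff"],
    "carol": ["guests"],
}

def hits_per_group(hits, user_groups):
    """Count hits per (group, item).

    A user who belongs to several groups contributes one hit to each of them.
    Users missing from the mapping are ignored.
    """
    counts = Counter()
    for user, item in hits:
        for group in user_groups.get(user, []):
            counts[(group, item)] += 1
    return counts

counts = hits_per_group(hits, user_groups)
```

In Pig the same shape would presumably be a join against the group mapping followed by a GROUP, but that depends on how the group membership is stored.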
-
Re: Working on multiple rows
Mridul Muralidharan 2010-08-29, 16:08
Taking a guess, you could group things based on your criterion and condition.
Something simple like:
a) group by usergroup (might be too expensive? The number of records across timestamps for users in a group might be large!).
b) group by (usergroup, timestamp / window) [this will lose accuracy near the window boundaries, see below]: manageable, but less accurate.
Other, more sensible variations are possible depending on your input, etc. Something like:
-- Caveat: if users click at the 9th and the 11th minute, the two hits land in
-- different buckets and we miss out on data. So typically, adjust accordingly
-- for error, or replicate the input, or use something more complicated than
-- this simple snippet :-)
-- Window length (assumes timestamp is in seconds). The parentheses keep
-- "timestamp / $WINDOW" from expanding to "timestamp / 60 * 10".
%default WINDOW '(60 * 10)'
A = LOAD '$MY_INPUT' AS (user:chararray, user_grp:chararray, timestamp:long);
-- B = GROUP A by user_grp PARALLEL $PARALLELISM;
B = GROUP A by (user_grp, timestamp / $WINDOW) PARALLEL $PARALLELISM;
C = FILTER B by COUNT(A) > $THRESHOLD;

Of course, I hope I am not misunderstanding your query entirely!
Regards, Mridul
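One way to read the "replicate input" remark is to assign every record to two overlapping windows, so that hits on either side of a bucket boundary still meet in a shared window. The following Python sketch (not Pig) illustrates that reading under stated assumptions: timestamps are in seconds, windows are 10 minutes long and start every 5 minutes, and the names `overlapping_windows` and `count_hits` are hypothetical. With this layout, any two hits less than half a window apart are guaranteed to share at least one window.

```python
from collections import defaultdict

WINDOW = 600        # window length in seconds (assumption: 10 minutes)
HALF = WINDOW // 2  # a new window starts every HALF seconds, so neighbours overlap

def overlapping_windows(timestamp):
    """Return the ids of the two windows that contain timestamp.

    Window j covers the half-open range [j * HALF, j * HALF + WINDOW).
    """
    k = timestamp // HALF
    return (k - 1, k)

def count_hits(records):
    """Count hits per (user_grp, window id); each record is counted in two windows."""
    counts = defaultdict(int)
    for _user, user_grp, timestamp in records:
        for window_id in overlapping_windows(timestamp):
            counts[(user_grp, window_id)] += 1
    return counts

# Hits at the 9th and 11th minute fall into different plain 10-minute buckets
# (540 // 600 != 660 // 600) but share the overlapping window with id 1.
records = [("u1", "g1", 540), ("u2", "g1", 660)]
counts = count_hits(records)
```

The trade-off is that every record is processed twice and a single burst can be reported in two adjacent windows, so results may need de-duplication after filtering by threshold.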