Re: Map Reduce Theory Question, getting OutOfMemoryError while reducing
Use a custom partitioner and grouping comparator as described here
http://pkghosh.wordpress.com/2011/04/13/map-reduce-secondary-sort-does-it-all/

in effect, make the time part of the key for sorting, but not for grouping
or partitioning.
Also, you might consider frameworks like Pig.
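
A minimal sketch of that plumbing, assuming the map output key is a Text of
the form "userId\ttimestamp" (timestamp zero-padded to a fixed width so that
lexicographic order matches chronological order); class and field names here
are illustrative, not from the thread:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

// Route every record for a user to the same reducer, ignoring the time part.
// (In real code these two classes go in separate files.)
public class UserPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        String user = key.toString().split("\t", 2)[0];
        return (user.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// Group on the user part only, so one reduce() call sees all of a user's
// records, while the full (user, time) key still controls the sort order.
public class UserGroupingComparator extends WritableComparator {
    public UserGroupingComparator() { super(Text.class, true); }
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        String ua = a.toString().split("\t", 2)[0];
        String ub = b.toString().split("\t", 2)[0];
        return ua.compareTo(ub);
    }
}

Wired up in the driver with job.setPartitionerClass(UserPartitioner.class)
and job.setGroupingComparatorClass(UserGroupingComparator.class); the default
sort comparator on the full Text key then yields time order within each user.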

On Thu, Jun 28, 2012 at 4:20 PM, Berry, Matt <[EMAIL PROTECTED]> wrote:

> My end goal is to have all the records sorted chronologically, regardless
> of the source file. To present it formally:
>
> Let there be X servers.
> Let each server produce one chronological log file that records who
> operated on the server and when.
> Let there be Y users.
> Assume a given user can operate on any number of servers simultaneously.
> Assume a given user can perform any number of operations a second.
>
> My goal would be to have Y output files, each containing the records for
> only that user, sorted chronologically.
> So working backwards from the output.
>
> In order for records to be written chronologically to the file:
> - All records for a given user must arrive at the same reducer (or the
> file IO will mess with the order)
> - All records arriving at a given reducer must be chronological with
> respect to a given user
>
> In order for records to arrive at a reducer in chronological order with
> respect to a given user:
> - The sorter must be set to sort by time and operate over all records for
> a user
>
> In order for the sorter to operate over all records for a user:
> - The grouper must be set to group by user, or not group at all (each
> record is a group)
>
> In order for all records for a given user to arrive at the same reducer:
> - The partitioner must be set to partition by user (i.e., user number mod
> number of partitions)
>
> From this vantage point I see two possible ways to do this.
> 1. Set the Key to be the user number, set the grouper to group by key.
> This results in all records for a user being aggregated (very large)
> 2. Set the Key to be {user number, time}, set the grouper to group by
> key. This results in each record being emitted to the reducer one at a time
> (lots of overhead)
>
> Neither of those seems very favorable. Is anyone aware of a different
> means to achieve that goal?
>
>
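
Option 2's per-record overhead goes away with the secondary-sort setup
sketched above: keep {user number, time} in the key so the framework sorts by
time, but group on the user alone, so the reducer still receives a single
reduce() call per user whose values iterate in time order, streamed rather
than buffered. A hypothetical composite key for that approach, again
illustrative rather than from the thread:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Composite key: the shuffle sorts on (user, time); a grouping comparator
// that compares only userId turns this into one group per user.
public class UserTimeKey implements WritableComparable<UserTimeKey> {
    private long userId;
    private long timestamp;

    public UserTimeKey() {}  // no-arg constructor required by Hadoop

    public UserTimeKey(long userId, long timestamp) {
        this.userId = userId;
        this.timestamp = timestamp;
    }

    public long getUserId() { return userId; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(userId);
        out.writeLong(timestamp);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        userId = in.readLong();
        timestamp = in.readLong();
    }

    // Full ordering used by the sort phase: user first, then time.
    @Override
    public int compareTo(UserTimeKey o) {
        int c = Long.compare(userId, o.userId);
        return c != 0 ? c : Long.compare(timestamp, o.timestamp);
    }

    @Override
    public int hashCode() { return Long.hashCode(userId); }

    @Override
    public boolean equals(Object obj) {
        return obj instanceof UserTimeKey
                && compareTo((UserTimeKey) obj) == 0;
    }
}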
> From: Steve Lewis [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, June 28, 2012 3:43 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Map Reduce Theory Question, getting OutOfMemoryError while
> reducing
>
> It is NEVER a good idea to hold items in memory - after all, this is big
> data and you want it to scale.
> I do not see what stops you from reading one record, processing it, and
> writing it out without retaining it.
> It is OK to keep statistics while iterating through a key and output them
> at the end, but holding all values for a key is almost never a good idea
> unless you can guarantee limits on their number.
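
A minimal sketch of that streaming pattern for this job, assuming the
hypothetical UserTimeKey composite key sketched above, with MultipleOutputs
providing the one-output-file-per-user requirement:

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class UserLogReducer extends Reducer<UserTimeKey, Text, NullWritable, Text> {
    private MultipleOutputs<NullWritable, Text> out;

    @Override
    protected void setup(Context ctx) {
        out = new MultipleOutputs<NullWritable, Text>(ctx);
    }

    // Grouped by user and sorted by (user, time), the values arrive already
    // in chronological order; each record is written the moment it is seen,
    // so nothing is ever buffered in memory.
    @Override
    protected void reduce(UserTimeKey key, Iterable<Text> records, Context ctx)
            throws IOException, InterruptedException {
        for (Text record : records) {
            out.write(NullWritable.get(), record, "user-" + key.getUserId());
        }
    }

    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
        out.close();
    }
}

The memory footprint stays flat at one record, no matter how many records a
user has.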
> On Thu, Jun 28, 2012 at 2:37 PM, Berry, Matt <[EMAIL PROTECTED]> wrote:
> I have a MapReduce job that reads in several gigs of log files and
> separates the records based on who generated them. My MapReduce job looks
> like this:
>
>
--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com