Re: OutOfMemory during Plain Java MapReduce
As always, what Harsh said :)

Looking at your reducer code, it appears that you are trying to compute the distinct set of user IDs for a given reduce key. Rather than building that set in memory, apply a secondary sort to the reduce values; then, while iterating over them, watch for the user ID to change. Whenever it changes, write out the key with the newly seen value (a sketch follows the example below).

Your output will change from this:

  key, [value 1, value2, ... valueN]

to this:

  key, value1
  key, value2
       ...
  key, valueN

Whether this is suitable for your follow-on processing is the next question, but this approach will scale to whatever data you can throw at it.
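To make this concrete, here is a minimal sketch of such a reducer (class and type names are made up; it assumes the job is already configured with a composite key, a partitioner on the natural key, and a grouping comparator, so that each key's values arrive sorted by user ID):

  import java.io.IOException;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;

  // Hypothetical reducer: relies on a secondary sort so that the
  // user IDs for each key arrive in sorted order.
  public class DistinctUserReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      String previous = null;
      for (Text value : values) {
        // Hadoop reuses the Text instance, so copy only the current
        // value to a String. That is one value's worth of memory.
        String current = value.toString();
        if (!current.equals(previous)) {
          // With sorted input, any change marks a new distinct ID.
          context.write(key, value);
          previous = current;
        }
      }
    }
  }

Only the single previous value is retained at any point, so the reducer's memory use stays constant regardless of how many values a key has.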

Paul
On 8 March 2013 10:57, Harsh J <[EMAIL PROTECTED]> wrote:
> Hi,
>
> When you implement code that stores copies of values in memory for
> every record (even if only for a single key), things are going to break
> in big-data-land. Practically, post-partitioning, the number of values
> for a given key can be huge given the source data, so you cannot hold
> them all in memory and then write them out in one go. You'd need to
> write something out continuously if you really want to do this, or use
> an alternative form of key-value storage where updates can be made
> incrementally (Apache HBase is one example of such a store).
>
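A minimal sketch of that incremental route, using the classic (pre-1.0) HBase client API (the table name "user_apps" and column family "u" are made up):

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  // Hypothetical helper: each (key, userId) pair becomes one small Put,
  // so nothing accumulates on the reducer's heap.
  public class IncrementalUserStore {
    private final HTable table;

    public IncrementalUserStore(Configuration conf) throws IOException {
      table = new HTable(HBaseConfiguration.create(conf), "user_apps");
    }

    public void add(String key, String userId) throws IOException {
      Put put = new Put(Bytes.toBytes(key));
      // One column per user ID. Re-writing the same ID is idempotent,
      // so the distinct set materializes in the table, not in memory.
      put.add(Bytes.toBytes("u"), Bytes.toBytes(userId), Bytes.toBytes(1L));
      table.put(put);
    }

    public void close() throws IOException {
      table.close();
    }
  }
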
> This has been discussed before IIRC, and if the goal is to store the
> outputs in a file, then it's better to serialize them directly to an
> open file as they are produced, rather than keeping them in a data
> structure and serializing it all at the end. The caveats that apply if
> you open your own file from a task are described at
> http://wiki.apache.org/hadoop/FAQ#Can_I_write_create.2BAC8-write-to_hdfs_files_directly_from_map.2BAC8-reduce_tasks.3F.
>
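For the direct-write route, a minimal sketch of a reducer that streams records to a side file instead of buffering them (names are made up; it assumes the new-API FileOutputFormat, whose work output path is only promoted if the task attempt commits, which addresses the FAQ's caveat about speculative and re-run attempts):

  import java.io.IOException;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  // Hypothetical reducer that serializes each record as it is
  // produced rather than collecting everything and writing at the end.
  public class StreamingSideFileReducer extends Reducer<Text, Text, Text, Text> {
    private FSDataOutputStream out;

    @Override
    protected void setup(Context context)
        throws IOException, InterruptedException {
      // A per-attempt file under the task's work output directory.
      Path workDir = FileOutputFormat.getWorkOutputPath(context);
      FileSystem fs = workDir.getFileSystem(context.getConfiguration());
      out = fs.create(new Path(workDir, "users-" + context.getTaskAttemptID()));
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException {
      for (Text value : values) {
        out.writeBytes(key + "\t" + value + "\n"); // stream, do not buffer
      }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
      out.close();
    }
  }
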
> On Fri, Mar 8, 2013 at 4:35 AM, Christian Schneider
> <[EMAIL PROTECTED]> wrote:
> > I had a look at the stack trace and it says the problem is in the reducer:
> > userSet.add(iterator.next().toString());
> >
> > Error: Java heap space
> > attempt_201303072200_0016_r_000002_0: WARN : mapreduce.Counters - Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
> > attempt_201303072200_0016_r_000002_0: WARN : org.apache.hadoop.conf.Configuration - session.id is deprecated. Instead, use dfs.metrics.session-id
> > attempt_201303072200_0016_r_000002_0: WARN : org.apache.hadoop.conf.Configuration - slave.host.name is deprecated. Instead, use dfs.datanode.hostname
> > attempt_201303072200_0016_r_000002_0: FATAL: org.apache.hadoop.mapred.Child - Error running child : java.lang.OutOfMemoryError: Java heap space
> > attempt_201303072200_0016_r_000002_0: at java.util.Arrays.copyOfRange(Arrays.java:3209)
> > attempt_201303072200_0016_r_000002_0: at java.lang.String.<init>(String.java:215)
> > attempt_201303072200_0016_r_000002_0: at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:542)
> > attempt_201303072200_0016_r_000002_0: at java.nio.CharBuffer.toString(CharBuffer.java:1157)
> > attempt_201303072200_0016_r_000002_0: at org.apache.hadoop.io.Text.decode(Text.java:394)
> > attempt_201303072200_0016_r_000002_0: at org.apache.hadoop.io.Text.decode(Text.java:371)
> > attempt_201303072200_0016_r_000002_0: at org.apache.hadoop.io.Text.toString(Text.java:273)
> > attempt_201303072200_0016_r_000002_0: at com.myCompany.UserToAppReducer.reduce(RankingReducer.java:21)
> > attempt_201303072200_0016_r_000002_0: at com.myCompany.UserToAppReducer.reduce(RankingReducer.java:1)
> > attempt_201303072200_0016_r_000002_0: at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:164)
> > attempt_201303072200_0016_r_000002_0: at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:610)
> > attempt_201303072200_0016_r_000002_0: at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:444)
> > attempt_201303072200_0016_r_000002_0: at org.apache.hadoop.mapred.Child$4.run(Child.java:268)