Re: OutOfMemory during Plain Java MapReduce
Thanks Paul and Harsh for your tips!
I implemented the secondary sort and the related mapper.
It is a very good way to get a unique set.
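For reference, the grouping side of such a secondary sort could look roughly like the sketch below (only a sketch: it assumes AppAndUserKey is a WritableComparable exposing the getAppIdText() accessor used in the reducer further down, and the class name is just illustrative):

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Groups all composite keys with the same app id into a single reduce() call,
// while the full (app id, user id) sort order still puts duplicate user ids
// next to each other.
public class AppIdGroupingComparator extends WritableComparator {

    protected AppIdGroupingComparator() {
        super(AppAndUserKey.class, true); // createInstances = true so compare() gets real keys
    }

    @Override
    public int compare(final WritableComparable a, final WritableComparable b) {
        // Compare only the app id part; the user id is ignored for grouping.
        return ((AppAndUserKey) a).getAppIdText()
                .compareTo(((AppAndUserKey) b).getAppIdText());
    }
}

// Wired up in the driver, e.g.:
// job.setGroupingComparatorClass(AppIdGroupingComparator.class);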

The original question of how to translate the "huge" values (i.e. a large
list of users for one key) into the format I need is still somewhat open.

If the reducer gets this input:
key1, Iterator[value1, value2, value3, ..., valueN]
key2, Iterator[value1, value2, value3, value4, value5, value6, ..., valueN]
...

How do I write this to a text file formatted like this:
key1 value1 value2 value3 ... valueN N
key2 value1 value2 value3 value4 value5 value6 ... valueN N
...

As Harsh said in a previous mail, I wrote a reducer that writes to HDFS
directly.
But I still don't know whether this is a good idea or just a workaround.

This is the reducer I came up with:

//-----------------------------------------------
// The Reducer
// It writes to HDFS directly, so no OutputFormat is needed.
//-----------------------------------------------
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.util.Random;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class UserToAppReducer extends Reducer<AppAndUserKey, Text, Text, Text> {

    private static final int BUFFER_SIZE = 5 * 1024 * 1024;

    private BufferedWriter br;

    @Override
    protected void setup(final Context context) throws IOException, InterruptedException {
        final FileSystem fs = FileSystem.get(context.getConfiguration());
        final Path outputPath = FileOutputFormat.getOutputPath(context);

        // One file per reduce task attempt; the random suffix avoids name collisions.
        final String fileName = "reducer" + context.getTaskAttemptID().getId()
                + "_" + context.getTaskAttemptID().getTaskID().getId()
                + "_" + new Random(System.currentTimeMillis()).nextInt(10000);

        this.br = new BufferedWriter(
                new OutputStreamWriter(fs.create(new Path(outputPath, fileName))), BUFFER_SIZE);
    }

    @Override
    protected void reduce(final AppAndUserKey appAndUserKey, final Iterable<Text> userIds,
            final Context context) throws IOException, InterruptedException {
        Text lastUserId = new Text();
        long count = 0;

        this.br.append(appAndUserKey.getAppIdText().toString()).append('\t');

        // The secondary sort puts duplicate user ids next to each other, so it is
        // enough to compare each value against the previously written one.
        for (final Text text : userIds) {
            if (lastUserId.equals(text)) {
                continue;
            }

            this.br.append(text.toString()).append('\t');

            count++;
            lastUserId = new Text(text); // copy: Hadoop reuses the Text instance it passes in
        }

        this.br.append(String.valueOf(count)).append("\n").append('\n');
    }

    @Override
    protected void cleanup(final Context context) throws IOException, InterruptedException {
        this.br.close();
    }
}
Is this the best way to achieve this (with plain MapReduce)?

Or is it better to return some composite keys and use a custom OutputFormat?
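For comparison, returning the joined values from the reducer might look roughly like the sketch below; with the stock TextOutputFormat it already produces the "key<TAB>values<TAB>count" layout, so a custom OutputFormat might not even be needed. (Assumptions: the joined line for one key fits in memory, the class name is illustrative, and getAppIdText() is the accessor used above.)

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Alternative reducer: build one line per key and let TextOutputFormat write it.
public class UserToAppLineReducer extends Reducer<AppAndUserKey, Text, Text, Text> {

    @Override
    protected void reduce(final AppAndUserKey key, final Iterable<Text> userIds,
            final Context context) throws IOException, InterruptedException {
        final StringBuilder line = new StringBuilder();
        Text lastUserId = new Text();
        long count = 0;

        for (final Text userId : userIds) {
            if (lastUserId.equals(userId)) {
                continue; // duplicates are adjacent thanks to the secondary sort
            }
            line.append(userId.toString()).append('\t');
            count++;
            lastUserId = new Text(userId); // copy: Hadoop reuses the Text instance
        }
        line.append(count);

        // TextOutputFormat writes "key<TAB>value\n", which yields the desired layout.
        context.write(key.getAppIdText(), new Text(line.toString()));
    }
}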
Thanks a lot!

Best Regards,
Christian.
2013/3/9 Harsh J <[EMAIL PROTECTED]>

> Paul's way is much easier than the serialization approach I
> mentioned earlier. I didn't pay attention to the logic used, only
> the implementation, my bad :)
>
> On Fri, Mar 8, 2013 at 5:39 PM, Paul Wilkinson <[EMAIL PROTECTED]> wrote:
> > As always, what Harsh said :)
> >
> > Looking at your reducer code, it appears that you are trying to compute
> the
> > distinct set of user IDs for a given reduce key. Rather than computing
> this
> > by holding the set in memory, use a secondary sort of the reduce values,
> > then while iterating over the reduce values, look for changes of user id.
> > Whenever it changes, write out the key and the newly found value.
> >
> > Your output will change from this:
> >
> >   key, [value 1, value2, ... valueN]
> >
> > to this:
> >
> >   key, value1
> >   key, value2
> >        ...
> >   key, valueN
> >
> > Whether this is suitable for your follow-on processing is the next
> question,
> > but this approach will scale to whatever data you can throw at it.
> >
> > Paul
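A minimal sketch of the pattern Paul describes above, assuming the secondary sort already orders the user ids within each key and reusing the getAppIdText() accessor from the reducer earlier in the thread (the class name is illustrative):

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Emits one (app id, user id) pair per distinct user id.
public class DistinctUserReducer extends Reducer<AppAndUserKey, Text, Text, Text> {

    @Override
    protected void reduce(final AppAndUserKey key, final Iterable<Text> userIds,
            final Context context) throws IOException, InterruptedException {
        Text lastUserId = new Text();

        for (final Text userId : userIds) {
            if (lastUserId.equals(userId)) {
                continue; // same user id as the previous value: skip the duplicate
            }
            context.write(key.getAppIdText(), userId); // write out key and newly found value
            lastUserId = new Text(userId); // copy: Hadoop reuses the Text instance
        }
    }
}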
> >
> >
> > On 8 March 2013 10:57, Harsh J <[EMAIL PROTECTED]> wrote:
> >>
> >> Hi,
> >>
> >> When you implement code that starts memory-storing value copies for
> >> every record (even if of just a single key), things are going to break
> >> in big-data-land. Practically, post-partitioning, the # of values for
> >> a given key can be huge given the source data, so you cannot hold it
> >> all in and then write in one go. You'd probably need to write out