Re: OutOfMemory during Plain Java MapReduce
Thanks Paul and Harsh for your tips!
I implemented the secondary sort and the related mapper.
This is a very good idea to get a unique set.
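
(For reference, the usual driver wiring for such a secondary sort looks roughly like this. It is only a sketch: UserToAppMapper, AppPartitioner and AppGroupingComparator are placeholder names, not my actual classes.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Secondary-sort wiring: the composite AppAndUserKey (app id + user id) drives
// the sort order, while partitioning and grouping look at the app id only, so
// one reduce() call sees all user ids of one app in sorted order.
public class UserToAppJob {
    public static void main(final String[] args) throws Exception {
        final Job job = Job.getInstance(new Configuration(), "user-to-app");
        job.setJarByClass(UserToAppJob.class);

        job.setMapperClass(UserToAppMapper.class);
        job.setReducerClass(UserToAppReducer.class);

        job.setMapOutputKeyClass(AppAndUserKey.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        job.setPartitionerClass(AppPartitioner.class);               // hash on the app id only
        job.setGroupingComparatorClass(AppGroupingComparator.class); // group reduce calls by app id only

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}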

The original question of how to translate the "huge" values (in terms of a
large list of users for one key) into the format I need is still somewhat
open.

If the reducer gets this input:
key1, Iterator[value1, value2, value3, ..., valueN]
key2, Iterator[value1, value2, value3, value4, value5, value6, ..., valueN]
...

How do I write this to a text file formatted like this:
key1 value1 value2 value3 ... valueN N
key2 value1 value2 value3 value4 value5 value6 ... valueN N
...

As Harsh suggested in a previous mail, I wrote a reducer that writes to HDFS
directly.
But I still don't know whether that is a good idea or just a workaround.

For now, this is the reducer I came up with:

//-----------------------------------------------
// The Reducer
// It writes to HDFS directly, so no OutputFormat is needed.
//-----------------------------------------------
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.util.Random;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class UserToAppReducer extends Reducer<AppAndUserKey, Text, Text, Text> {

    private static final int BUFFER_SIZE = 5 * 1024 * 1024;

    private BufferedWriter br;

    @Override
    protected void setup(final Context context) throws IOException, InterruptedException {
        final FileSystem fs = FileSystem.get(context.getConfiguration());
        final Path outputPath = FileOutputFormat.getOutputPath(context);

        // One output file per reduce task attempt.
        final String fileName = "reducer" + context.getTaskAttemptID().getId()
                + "_" + context.getTaskAttemptID().getTaskID().getId()
                + "_" + new Random(System.currentTimeMillis()).nextInt(10000);

        this.br = new BufferedWriter(
                new OutputStreamWriter(fs.create(new Path(outputPath, fileName))), BUFFER_SIZE);
    }

    @Override
    protected void reduce(final AppAndUserKey appAndUserKey, final Iterable<Text> userIds,
            final Context context) throws IOException, InterruptedException {
        // Hadoop reuses the Text instance handed out by the values iterator,
        // so keep a copy of the previous user id rather than a reference to it.
        final Text lastUserId = new Text();
        long count = 0;

        this.br.append(appAndUserKey.getAppIdText().toString()).append('\t');

        for (final Text userId : userIds) {
            // Values arrive sorted (secondary sort), so duplicates are adjacent.
            if (lastUserId.equals(userId))
                continue;

            this.br.append(userId.toString()).append('\t');

            count++;
            lastUserId.set(userId);
        }

        this.br.append(String.valueOf(count)).append("\n").append('\n');
    }

    @Override
    protected void cleanup(final Context context) throws IOException, InterruptedException {
        this.br.close();
    }
}
Is this the best way to achieve this (with plain MapReduce)?

Or is it better to return some composite keys and use a custom OutputFormat?
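
To illustrate one alternative (just a sketch, reusing the class declaration and
getAppIdText() from above): the reducer builds the line and hands it to
context.write(), so the standard TextOutputFormat takes care of the files. The
whole line for one key then lives in a StringBuilder, so this only helps if a
single key's distinct user list fits comfortably in memory.

// Sketch: same dedup logic, but output goes through the framework instead of a
// hand-rolled HDFS writer. TextOutputFormat writes the part files.
@Override
protected void reduce(final AppAndUserKey appAndUserKey, final Iterable<Text> userIds,
        final Context context) throws IOException, InterruptedException {
    final Text lastUserId = new Text();
    final StringBuilder line = new StringBuilder();
    long count = 0;

    for (final Text userId : userIds) {
        if (lastUserId.equals(userId))
            continue; // duplicates are adjacent thanks to the secondary sort

        line.append(userId.toString()).append('\t');
        count++;
        lastUserId.set(userId);
    }
    line.append(count);

    // key = app id, value = tab-separated distinct user ids plus their count
    context.write(appAndUserKey.getAppIdText(), new Text(line.toString()));
}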
Thanks a lot!

Best Regards,
Christian.
2013/3/9 Harsh J <[EMAIL PROTECTED]>

> Paul's way is much easier than the serialization approach I mentioned
> earlier. I didn't pay attention to the logic used, but just to the
> implementation, my bad :)
>
> On Fri, Mar 8, 2013 at 5:39 PM, Paul Wilkinson <[EMAIL PROTECTED]> wrote:
> > As always, what Harsh said :)
> >
> > Looking at your reducer code, it appears that you are trying to compute
> the
> > distinct set of user IDs for a given reduce key. Rather than computing
> this
> > by holding the set in memory, use a secondary sort of the reduce values,
> > then while iterating over the reduce values, look for changes of user id.
> > Whenever it changes, write out the key and the newly found value.
> >
> > Your output will change from this:
> >
> >   key, [value 1, value2, ... valueN]
> >
> > to this:
> >
> >   key, value1
> >   key, value2
> >        ...
> >   key, valueN
> >
> > Whether this is suitable for your follow-on processing is the next
> question,
> > but this approach will scale to whatever data you can throw at it.
> >
> > Paul
> >
> >
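
So if I follow, the loop Paul describes would look roughly like this (only a
sketch, reusing the AppAndUserKey/Text types and getAppIdText() from my reducer
above):

// One output record per distinct user id, detected by comparing against the
// previous value of the secondary-sorted stream. Nothing is buffered per key.
@Override
protected void reduce(final AppAndUserKey appAndUserKey, final Iterable<Text> userIds,
        final Context context) throws IOException, InterruptedException {
    final Text lastUserId = new Text();

    for (final Text userId : userIds) {
        if (lastUserId.equals(userId))
            continue; // duplicate of the previous user id

        context.write(appAndUserKey.getAppIdText(), userId);
        lastUserId.set(userId);
    }
}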
> > On 8 March 2013 10:57, Harsh J <[EMAIL PROTECTED]> wrote:
> >>
> >> Hi,
> >>
> >> When you implement code that starts memory-storing value copies for
> >> every record (even if of just a single key), things are going to break
> >> in big-data-land. Practically, post-partitioning, the # of values for
> >> a given key can be huge given the source data, so you cannot hold it
> >> all in and then write in one go. You'd probably need to write out