AW: mortbay, huge files and the ulimit
Christoph Schmitz 2012-08-29, 14:23
Hi Elmar,

I don't know about the technicalities of your problem, but why don't you use a reduce-side join in the first place? (I.e., a sort-merge join instead of a hash join.)

<short-explanation-skip-if-you-already-know>
Assuming you want to self-join a table of the form (FROM, TO) in order to compute all paths of length two, you'd emit each (FROM, TO) tuple twice, once with FROM as key and once with TO as key, plus a tag indicating which is which:

(1, 47), (1, 12) and (47, 15) become, in the Reducer:

((1, F), (1, 12))
((1, F), (1, 47))
((12, T), (1, 12))
((15, T), (47, 15))
((47, F), (47, 15))
((47, T), (1, 47))

F and T indicate whether the key was a FROM or a TO value.

You'll make Hadoop group by the first component of the key, which will allow you to see both of the "47" tuples that would be joined in one reducer. From those, you build the resulting (1, 47, 15) tuple.

(Detailed description in the O'Reilly/Tom White book.)
</short-explanation-skip-if-you-already-know>
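
In code, a minimal sketch of that tagged self-join might look like the snippet below. It is only an illustration, not tested against your data: it assumes the input is plain "FROM,TO" lines, all class names are made up, and for brevity the F/T tag travels in the value instead of the key (which sidesteps the custom partitioner and grouping comparator you would use if the tag stays in the key).

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PathOfLengthTwo {

    // Emits every (FROM, TO) edge twice: under its FROM value (tag F)
    // and under its TO value (tag T).
    public static class EdgeMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] edge = line.toString().split(",");   // assumes "FROM,TO" lines
            String from = edge[0].trim(), to = edge[1].trim();
            ctx.write(new Text(from), new Text("F," + from + "," + to));
            ctx.write(new Text(to),   new Text("T," + from + "," + to));
        }
    }

    // For join key k, pairs every edge (x, k) with every edge (k, y)
    // and emits the path x -> k -> y.
    public static class PathReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            List<String> endsHere   = new ArrayList<String>();  // x from T-tagged edges (x, k)
            List<String> startsHere = new ArrayList<String>();  // y from F-tagged edges (k, y)
            for (Text v : values) {
                String[] parts = v.toString().split(",");        // tag, from, to
                if ("T".equals(parts[0])) {
                    endsHere.add(parts[1]);
                } else {
                    startsHere.add(parts[2]);
                }
            }
            for (String x : endsHere) {
                for (String y : startsHere) {
                    ctx.write(new Text(x + "," + key.toString()), new Text(y));
                }
            }
        }
    }
}

The point is that only the tuples sharing one join key are buffered at a time, instead of one entire side of the join as with an in-memory hash map.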

Hope this helps (and kind regards to the Kassel gang ;-),

Christoph

-----Original Message-----
From: Björn-Elmar Macek [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, August 29, 2012 15:54
To: [EMAIL PROTECTED]
Subject: mortbay, huge files and the ulimit

Hi there,

I am currently running a job where I self-join a 63-gigabyte CSV file
on 20 physically distinct nodes with 15 GB each:

While the mapping works just fine and is cheap, the reducer does the
main work: it holds a hashmap with elements to join against and finds
join tuples for every incoming key-value pair.
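
(Purely to illustrate the pattern I mean - this is not the actual code, and every name below is invented - such a reducer-side in-memory hash join has a build phase that loads one side of the join into a HashMap and a probe phase that looks up every further incoming record, so the map grows with the size of the build side:)

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InMemoryHashJoinSketch {

    // join key -> FROM values of all buffered edges (x, key)
    private final Map<String, List<String>> buildSide =
            new HashMap<String, List<String>>();

    // build phase: remember each edge (from, to) under its TO value
    public void build(String from, String to) {
        List<String> bucket = buildSide.get(to);
        if (bucket == null) {
            bucket = new ArrayList<String>();
            buildSide.put(to, bucket);
        }
        bucket.add(from);
    }

    // probe phase: an incoming edge (from, to) is joined against all
    // buffered edges (x, from), yielding paths x -> from -> to
    public List<String> probe(String from, String to) {
        List<String> paths = new ArrayList<String>();
        List<String> predecessors = buildSide.get(from);
        if (predecessors != null) {
            for (String x : predecessors) {
                paths.add(x + "," + from + "," + to);
            }
        }
        return paths;
    }
}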

The job works perfectly on small files of around 2 gigabytes, but starts
to get "unstable" as the file size goes up; this becomes evident from
the TaskTracker's logs, which say:

ERROR org.mortbay.log: /mapOutput
java.lang.IllegalStateException: Committed
     at org.mortbay.jetty.Response.resetBuffer(Response.java:1023)
     at org.mortbay.jetty.Response.sendError(Response.java:240)
     at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:3945)
     at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
     at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
     at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
     at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
     at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:835)
     at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
     at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
     at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
     at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
     at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
     at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
     at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
     at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
     at org.mortbay.jetty.Server.handle(Server.java:326)
     at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
     at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
     at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
     at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
     at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
     at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
     at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)

And while this is no problem at the beginning of the reduce phase, where
it happens only rarely and on a few nodes, it becomes critical as the
progress rises. The reason for this (as far as I know from reading
articles) is memory or file handle problems. I addressed the memory
problem by continuously purging the map of outdated elements every 5
million processed key-value pairs, and I set mapred.child.ulimit to
100000000 (ulimit in the shell tells me it is 400000000).

Anyway, I am still running into those mortbay errors, and I am starting
to wonder whether Hadoop can manage the job with this algorithm at all.
By pure naive math it should: I explicitly assigned 10 GB of memory to
each JVM on each node and set mapred.child.java.opts to "-Xmx10240m
-XX:+UseCompressedOops -XX:-UseGCOverheadLimit" (it's a 64-bit
environment, and large data structures cause the GC to throw
exceptions). This would naively make 18 slave machines with 10 GB each,
resulting in an overall memory of 180 GB - three times as much as
needed... I would think. So if the Partitioner distributes the data
roughly equally across all nodes, I should not run into any errors,
should I?
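
(For reference, a minimal sketch of where those two settings would live if set programmatically on the job - the property names and values are the ones quoted above, everything else is made up:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JobSetupSketch {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        // 10 GB heap per child JVM, compressed oops, GC overhead limit disabled
        conf.set("mapred.child.java.opts",
                 "-Xmx10240m -XX:+UseCompressedOops -XX:-UseGCOverheadLimit");
        // virtual memory limit (in KB) for each child process
        conf.set("mapred.child.ulimit", "100000000");
        return new Job(conf, "self-join");
    }
}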

Can anybody help me with this issue?

Best regards,
Elmar