Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
MapReduce >> mail # user >> Hadoop CompositeInputFormat block matrix-vector multiplication


Copy link to this message
-
Hadoop CompositeInputFormat block matrix-vector multiplication
Hi guys,

I am trying to implement a block matrix-vector multiplication algorithm
with Hadoop according to the schematics from
http://i.stanford.edu/~ullman/mmds/ch5.pdf page 162. My matrix is going to
be sparse and the vector dense which is exactly what is required in
PageRank as well. The vector is assumed to not fit in memory. Just to
reiterate: I am not trying to implement PageRank, I want to implement the
matrix-vector multiplication strategy as described in the PDF.

The way I was thinking about a possible implementation is to use
CompositeInputFormat and basically perform a map-side join of a matrix
block with a vector block and sum up the intermediate result blocks that
contribute to the final result block in the combiner/reducer. The
difference between this approach and a general map-side join is that I
would need to join several blocks of the matrix with a single block of the
vector. I know about the requirements of a map-side join in Hadoop
concerning splits, ordering etc. and in terms of splitting I would take
care of that by having one file for each block named accordingly. To me it
looks like I need to modify the code that determines which files shall be
joined so Hadoop wouldn't want to join files from the two paths with the
same name anymore, but rather according to my defined scheme. However, I
don't know if that's a valid approach and where this part of the code is
located. I looked into the Hadoop source, but couldn't find it.

I know that I could do this in 2 passes with a reduce-side join, but
efficiency is critical here because these matrix-vector multiplication need
to be executed in large amounts. Distributed Cache is out of the question
because the vector doesn't fit in memory.

Does anyone have an idea how to tackle this problem?
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB