OK, interesting. Just to confirm: is it okay to distribute quite large
files through the DistributedCache? Dataset B could be on the order of
gigabytes. Also, if I have far fewer nodes than elements/blocks in A,
then the probability that every node will have to read (almost) every
block of B is quite high. So, assuming the DistributedCache is okay here
in general, it would be more efficient than reading B from HDFS. What
about the case where I have m*n nodes, though? Then every node would
receive all of B while only needing a small fraction of it, right? Could
you elaborate on this in a few sentences, just to be sure I understand
Hadoop correctly?
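
To make sure we mean the same thing, here is roughly how I picture the
two alternatives (just a sketch; the path /data/B and the class/method
names are placeholders of mine):

  import java.net.URI;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.filecache.DistributedCache;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class JoinSketch {

    // Alternative 1, driver side: register B in the DistributedCache.
    // The framework then copies the *entire* file to the local disk of
    // every node that runs a task, before the tasks start.
    static void cacheB(Configuration conf) throws Exception {
      DistributedCache.addCacheFile(new URI("/data/B"), conf);
    }

    // Alternative 1, mapper side: read the locally cached copy of B.
    static Path localB(Configuration conf) throws Exception {
      Path[] cached = DistributedCache.getLocalCacheFiles(conf);
      return cached[0]; // local-disk replica of B on this node
    }

    // Alternative 2, mapper side: open B directly from HDFS, so only
    // the byte ranges a task actually reads travel over the network.
    static FSDataInputStream openB(Configuration conf) throws Exception {
      FileSystem fs = FileSystem.get(conf);
      return fs.open(new Path("/data/B"));
    }
  }

If I read the docs right, that is exactly the trade-off in my question:
alternative 1 always ships all of B to every node, while alternative 2
only transfers what each task actually reads.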

Thanks,
Sigurd

2012/9/10 Harsh J <[EMAIL PROTECTED]>