Re: performance consideration when writing to HBase from MR job
Amandeep Khurana 2010-06-05, 23:01
> a) all the Puts are collected in Reduce or Map (if there is no reduce) and
> a batch write is done
> b) writing out each <K,V> pair using context.write(k, v)
> If a) is considered instead of b) then wouldn't there be a violation of
> semantics w.r.t KEYOUT, VALUEOUT (because <K, V> is not being output)?? Is
> this OK?
1. If you can write from the mapper, you avoid the overhead of shuffling
and sorting between the map and reduce phases.
2. It would not make much difference whether you use the HBase API directly
in the mapper/reducer to write to the table, or write to the context and
use one of the output formats that writes to the table.
However, if you plan to use the bulkload utility (HBASE-48 jira), you will
get much better performance than using the HBase API directly.
Regarding the semantics: no, there would not be a problem as long as you
create your Puts properly.
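
To make option (a) concrete, here is a rough sketch of the buffering pattern
in plain Java. It deliberately does not use the HBase API: RowSink is a
hypothetical stand-in for whatever table client you write through, and
BatchingWriter, its batch size and the String[] row shape are assumptions for
illustration only. In a real job, add() would be called where you would
otherwise call context.write(k, v), and flush() at the end of the task.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a table client; NOT part of the HBase API.
interface RowSink {
    void writeBatch(List<String[]> rows);
}

// Collects rows in memory and hands them to the sink in batches (option a).
class BatchingWriter {
    private final RowSink sink;
    private final int batchSize;
    private final List<String[]> buffer = new ArrayList<>();

    BatchingWriter(RowSink sink, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        this.sink = sink;
        this.batchSize = batchSize;
    }

    // Called once per record instead of emitting the pair to the framework.
    void add(String rowKey, String value) {
        buffer.add(new String[] {rowKey, value});
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Must also be called once at the end of the task, or the last
    // partial batch is never written.
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.writeBatch(new ArrayList<>(buffer)); // pass a copy, then reuse the buffer
        buffer.clear();
    }
}

class BatchingWriterDemo {
    public static void main(String[] args) {
        List<Integer> batchSizes = new ArrayList<>();
        BatchingWriter writer = new BatchingWriter(rows -> batchSizes.add(rows.size()), 2);
        for (int i = 0; i < 5; i++) {
            writer.add("row" + i, "value" + i);
        }
        writer.flush(); // write the trailing partial batch
        System.out.println("batches written: " + batchSizes.size());
    }
}
```

The task emits nothing through KEYOUT/VALUEOUT in this pattern, which is the
case the question asks about. The one thing to be careful about is the final
flush, since a task that exits without it silently drops the tail of its data.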