Going directly to HFiles has the following pitfalls:
1. You'll miss out on data that's still in the memstore and has not
been flushed to an HFile yet.
2. If you have deletes, you'll probably still see the deleted data in
some HFiles, since a compaction hasn't taken place yet to throw it
out.
3. You may read different values for the same cell, since different
versions can reside in different HFiles.
In short, you miss out on the reconciliation that gets done at the RS level.
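To make the reconciliation point concrete, here's a toy sketch in plain Java (not the HBase API; the class, method, and row names are all made up) of why a direct single-HFile read goes wrong in exactly those three ways, while the merged read a RegionServer performs does not:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model: each "store" (memstore or HFile) maps rowkey -> newest cell
// it holds. A null value stands in for a delete tombstone.
public class HFilePitfallsDemo {

    public static final class Cell {
        public final String value; // null == delete marker
        public final long ts;
        public Cell(String value, long ts) { this.value = value; this.ts = ts; }
    }

    // Reading one HFile directly: returns whatever that file says, no merging.
    public static String readSingleFile(Map<String, Cell> hfile, String row) {
        Cell c = hfile.get(row);
        return (c == null || c.value == null) ? null : c.value;
    }

    // What the RS does (much simplified): merge memstore + all HFiles,
    // keep the newest cell per row, and honor delete markers.
    public static String readMerged(List<Map<String, Cell>> stores, String row) {
        Cell newest = null;
        for (Map<String, Cell> store : stores) {
            Cell c = store.get(row);
            if (c != null && (newest == null || c.ts > newest.ts)) newest = c;
        }
        return (newest == null || newest.value == null) ? null : newest.value;
    }

    public static void main(String[] args) {
        Map<String, Cell> olderHFile = new HashMap<>();
        olderHFile.put("row1", new Cell("v1", 100));  // stale version
        olderHFile.put("row2", new Cell("v2", 100));  // later deleted

        Map<String, Cell> newerHFile = new HashMap<>();
        newerHFile.put("row1", new Cell("v1b", 200)); // newer version
        newerHFile.put("row2", new Cell(null, 200));  // delete tombstone

        Map<String, Cell> memstore = new HashMap<>();
        memstore.put("row3", new Cell("v3", 300));    // not flushed yet

        // Pitfall 1: reading flushed files never sees row3 at all.
        // Pitfall 2: the older HFile still returns the deleted row2.
        // Pitfall 3: the older HFile returns the stale row1 version.
        System.out.println(readSingleFile(olderHFile, "row2")); // v2 (deleted!)
        System.out.println(readSingleFile(olderHFile, "row1")); // v1 (stale)

        List<Map<String, Cell>> all = Arrays.asList(memstore, newerHFile, olderHFile);
        System.out.println(readMerged(all, "row2")); // null (delete honored)
        System.out.println(readMerged(all, "row1")); // v1b (newest version)
        System.out.println(readMerged(all, "row3")); // v3 (memstore seen)
    }
}
```

Until a major compaction rewrites the store into a single HFile with tombstones and shadowed versions dropped, only the merged read gives you the answer a client scan would see.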
If all you want to do is run MR jobs over the data in HBase, why not
consider flat files and just run them over those? Maybe run Hive
queries, since you mentioned that. Why use HBase at all?
(I'm not trying to shoo you away from HBase. Just curious what you are
trying to accomplish)
On Feb 9, 2012, at 12:19 AM, Tim Robertson <[EMAIL PROTECTED]> wrote:
> Hi all,
> Can anyone elaborate on the pitfalls or implications of running
> MapReduce using an HFileInputFormat extending FileInputFormat?
> I'm sure scanning goes through the RS for good reasons (guessing
> handling splits, locking, RS monitoring etc) but can it ever be "safe"
> to run MR over HFiles directly? E.g. for scenarios like a region
> split, would the MR just get stale data or would _bad_things_happen_?
> For our use cases we could tolerate stale data, the occasional MR
> failure on a node dropping out, and if we could detect a region split
> we can suspend MR jobs on the HFile until the split is finished. We
> don't anticipate huge daily growth, but a lot of scanning and random reads.
> I knocked up a quick example porting the Scala version of HFIF to
> Java, and full data scans appear to be an order of magnitude
> quicker (30 -> 3 mins), but I suspect this is *fraught* with dangers.
> If not, I'd like to try and take this further, possibly with Hive.
>  https://gist.github.com/1120311
>  http://pastebin.com/e5qeKgAd