HBase >> mail # user >> HFileInputFormat for MapReduce
Re: HFileInputFormat for MapReduce
>
> From the limitations you mention, 1) and 2) we can live with, but 3)
> could be why my quick tests are already giving incorrect record
> counts.  That sounds like a show stopper straight away right?
>
> One option for us would be HBase for the primary store for random
> access, and periodic (e.g. 12 hourly) exports to HDFS for all the full
> scanning.  Would you consider that sane?
>
>
Is your primary access pattern scans or random reads and writes? If it's
primarily scans, you could consider just keeping flat files. If you need
random reads and writes over a small amount of data, sqoop it out to MySQL
or such. If you need random reads and writes across all your data (I'm
assuming big numbers), sure, have HBase as your authoritative store and
scan over it. If you also need to do lots of scans, and often, export the
data out periodically (to minimize the "staleness" of the exported copy)
and scan over that.
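A periodic export like the one discussed above is usually done with the Export MapReduce job that ships with HBase. A rough sketch; the table name and HDFS output path are placeholders, and the optional trailing arguments are what you'd use to restrict a run to a time window:

```shell
# Sketch, assuming HBase's bundled Export MR job is on the classpath.
# 'mytable' and the output path are placeholders.

# Full export of the table's current contents:
hbase org.apache.hadoop.hbase.mapreduce.Export mytable /exports/mytable/full

# Export accepts optional [<versions> [<starttime> [<endtime>]]] args,
# so a 12-hourly job could export only cells written in the last window
# (timestamps in epoch millis; values here are illustrative):
hbase org.apache.hadoop.hbase.mapreduce.Export mytable \
  /exports/mytable/incr-001 1 1338508800000 1338552000000
```

The output is SequenceFiles of Results on HDFS, which your full scans can then read without touching the live cluster.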

Having said that, scan performance being an order of magnitude slower
doesn't seem right. You might be able to tune the cluster and extract
better performance.
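On the tuning side, the usual first knob for full scans is scanner caching (rows fetched per RPC), which defaults very low. A hedged example using RowCounter as a stand-in for a scan-heavy job, assuming the job parses -D generic options (true for drivers that go through GenericOptionsParser/ToolRunner); 'mytable' and the value 1000 are placeholders to tune for your row size:

```shell
# Sketch: raise client-side scanner caching for a scan-heavy MR job
# instead of the default of one row per RPC round trip.
hbase org.apache.hadoop.hbase.mapreduce.RowCounter \
  -Dhbase.client.scanner.caching=1000 mytable
```

The same setting can be made the cluster-wide client default via hbase.client.scanner.caching in hbase-site.xml, or per-scan with Scan.setCaching() in application code.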