HBase >> mail # user >> best approach for write and immediate read use case


Re: best approach for write and immediate read use case
>What would be the behavior for inserting data using map reduce job? would
>the recently added records be in the memstore? or I need to load them for
>read queries after the insert is done?

Using MR you have two options for insertion. One creates the HFiles
directly as output (using HFileOutputFormat); here the memstore does not
come into the picture. In the other, there will be HTable#put() calls from
the mappers, so the memstore does come into the picture. (Both are
map-only jobs.) When you use the ImportTsv tool and specify
"importtsv.bulk.output", it goes the first way. JFYI, have a look at the
ImportTsv tool documentation.

-Anoop-
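[Editor's note: for later readers, the two paths described above might be set up roughly like the job-configuration sketch below. This is untested and uses the 0.94-era client API; `MyKeyValueMapper`, `MyPutMapper`, `my_table`, and `outputDir` are hypothetical placeholders, not names from this thread.]

```java
// Path 1: have the MR job write HFiles directly (no memstore involved),
// then bulk-load the files into the table.
Job bulkJob = new Job(conf, "hfile-load");
bulkJob.setMapperClass(MyKeyValueMapper.class);   // emits (ImmutableBytesWritable, KeyValue)
HTable table = new HTable(conf, "my_table");
HFileOutputFormat.configureIncrementalLoad(bulkJob, table);  // wires partitioner/reducer for HFile output
FileOutputFormat.setOutputPath(bulkJob, new Path(outputDir));
// after bulkJob completes:
new LoadIncrementalHFiles(conf).doBulkLoad(new Path(outputDir), table);

// Path 2: have the mappers issue Puts against the table
// (these writes go through the memstore).
Job putJob = new Job(conf, "put-load");
putJob.setMapperClass(MyPutMapper.class);         // emits (ImmutableBytesWritable, Put)
TableMapReduceUtil.initTableReducerJob("my_table", null, putJob);
putJob.setNumReduceTasks(0);                      // map-only, as noted above
```

ImportTsv with "importtsv.bulk.output" set drives the first path for you; without it, ImportTsv writes through puts instead.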

On Sat, Aug 24, 2013 at 4:10 AM, Gautam Borah <[EMAIL PROTECTED]> wrote:

> Thanks Ted for your response, and clarifying the behavior for using HTable
> interface.
>
> What would be the behavior for inserting data using map reduce job? would
> the recently added records be in the memstore? or I need to load them for
> read queries after the insert is done?
>
> Thanks,
> Gautam
>
>
> On Fri, Aug 23, 2013 at 2:43 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
>
> > Assuming you are using 0.94, the default value
> > for hbase.regionserver.global.memstore.lowerLimit is 0.35
> >
> > Meaning, the memstore on each region server would be able to hold
> > 3000M * 0.35 / 60 = 17.5 mil records (roughly).
> >
> > bq. If I use HTable interface, would the inserted data be in the HBase
> > cache, before flushing to the files, for immediate read queries?
> >
> > Yes.
> >
> > Cheers
> >
> >
> > On Fri, Aug 23, 2013 at 12:01 PM, Gautam Borah <[EMAIL PROTECTED]> wrote:
> >
> > > Hi,
> > >
> > > Average size of my records is 60 bytes - 20 bytes key and 40 bytes
> > > value; the table has one column family.
> > >
> > > I have set up a cluster for testing - 1 master and 3 region servers.
> > > Each has a heap size of 3 GB, single cpu.
> > >
> > > I have pre-split the table into 30 regions. I do not have to keep data
> > > forever, I could purge older records periodically.
> > >
> > > Thanks,
> > >
> > > Gautam
> > >
> > >
> > >
> > > On Fri, Aug 23, 2013 at 3:20 AM, Ted Yu <[EMAIL PROTECTED]> wrote:
> > >
> > > > Can you tell us the average size of your records and how much heap is
> > > > given to the region servers?
> > > >
> > > > Thanks
> > > >
> > > > On Aug 23, 2013, at 12:11 AM, Gautam Borah <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > Hello all,
> > > > >
> > > > > I have a use case where I need to write 1 million to 10 million
> > > > > records periodically (with intervals of 1 minute to 10 minutes),
> > > > > into an HBase table.
> > > > >
> > > > > Once the insert is completed, these records are queried immediately
> > > > > from another program - multiple reads.
> > > > >
> > > > > So, this is one massive write followed by many reads.
> > > > >
> > > > > I have two approaches to insert these records into the HBase table -
> > > > >
> > > > > Use HTable or HTableMultiplexer to stream the data to the HBase table.
> > > > >
> > > > > or
> > > > >
> > > > > Write the data to the HDFS store as a sequence file (avro in my case),
> > > > > run a map reduce job using HFileOutputFormat, and then load the output
> > > > > files into the HBase cluster.
> > > > > Something like,
> > > > >
> > > > >   LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
> > > > >   loader.doBulkLoad(new Path(outputDir), hTable);
> > > > >
> > > > > In my use case which approach would be better?
> > > > >
> > > > > If I use the HTable interface, would the inserted data be in the HBase
> > > > > cache, before flushing to the files, for immediate read queries?
> > > > >
> > > > > If I use a map reduce job to insert, would the data be loaded into the
> > > > > HBase cache immediately? or would only the output files be copied to
> > > > > the respective hbase table specific directories?
> > > > >
> > > > > So, which approach is better for write and then immediate multiple
> > > > > reads?
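[Editor's note: Ted's back-of-envelope memstore estimate in this thread can be reproduced as a standalone calculation. The figures below - 3 GB ("3000M") heap, the 0.94 default lowerLimit of 0.35, and 60-byte records - all come from the messages above; the class name is illustrative.]

```java
// Reproduces the memstore capacity estimate from the thread:
// heap * hbase.regionserver.global.memstore.lowerLimit / record size.
public class MemstoreEstimate {
    public static void main(String[] args) {
        double heapBytes = 3000.0 * 1_000_000;  // "3000M" heap per region server
        double lowerLimit = 0.35;               // 0.94 default for global.memstore.lowerLimit
        double recordBytes = 60.0;              // 20-byte key + 40-byte value
        double records = heapBytes * lowerLimit / recordBytes;
        System.out.printf("~%.1f million records per region server%n", records / 1_000_000);
        // prints: ~17.5 million records per region server
    }
}
```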