HBase, mail # user - best approach for write and immediate read use case


Thread:
Gautam Borah 2013-08-23, 07:11
Ted Yu 2013-08-23, 10:20
Gautam Borah 2013-08-23, 19:01
Ted Yu 2013-08-23, 21:43
Gautam Borah 2013-08-23, 22:40
Re: best approach for write and immediate read use case
Anoop John 2013-08-24, 04:55
> What would be the behavior for inserting data using a map reduce job? Would
> the recently added records be in the memstore, or do I need to load them for
> read queries after the insert is done?

Using MR you have two options for insertion. One creates the HFiles directly
as the job output (using HFileOutputFormat); here the memstore does not come
into the picture at all. In the other, the mappers call HTable#put(), so the
memstore does come into the picture. (Both are mapper-only jobs.) When you use
the ImportTsv tool and pass "importtsv.bulk.output", it goes the 1st way.
JFYI, have a look at the ImportTsv tool documentation.
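
For illustration, here is a rough sketch of how the two job setups could look
(0.94-era APIs; the table name "my_table", column family "cf", the input path
and the toy record parsing are just assumptions for the example, not anything
from this thread):

  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.KeyValue;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
  import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
  import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
  import org.apache.hadoop.hbase.util.Bytes;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class InsertJobSketch {

    static final byte[] CF = Bytes.toBytes("cf");  // assumed column family
    static final byte[] Q  = Bytes.toBytes("q");   // assumed qualifier

    // Option 1 mapper: emits KeyValues; the job writes HFiles, no memstore involved.
    static class HFileMapper
        extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
      protected void map(LongWritable k, Text line, Context ctx)
          throws IOException, InterruptedException {
        // Toy parsing: first 20 bytes are the row key, the rest is the value.
        String s = line.toString();
        byte[] row = Bytes.toBytes(s.substring(0, 20));
        ctx.write(new ImmutableBytesWritable(row),
                  new KeyValue(row, CF, Q, Bytes.toBytes(s.substring(20))));
      }
    }

    // Option 2 mapper: emits Puts; TableOutputFormat applies them via HTable#put(),
    // so they land in the memstore and are readable immediately.
    static class PutMapper
        extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
      protected void map(LongWritable k, Text line, Context ctx)
          throws IOException, InterruptedException {
        String s = line.toString();
        byte[] row = Bytes.toBytes(s.substring(0, 20));
        Put put = new Put(row);
        put.add(CF, Q, Bytes.toBytes(s.substring(20)));
        ctx.write(new ImmutableBytesWritable(row), put);
      }
    }

    // Option 1: write HFiles, then bulk load them afterwards with
    // LoadIncrementalHFiles#doBulkLoad, as in the snippet quoted below.
    static Job bulkLoadJob(Configuration conf) throws Exception {
      Job job = new Job(conf, "write-hfiles");
      job.setJarByClass(InsertJobSketch.class);
      job.setMapperClass(HFileMapper.class);
      job.setMapOutputKeyClass(ImmutableBytesWritable.class);
      job.setMapOutputValueClass(KeyValue.class);
      FileInputFormat.addInputPath(job, new Path("/input/records"));
      FileOutputFormat.setOutputPath(job, new Path("/tmp/hfiles"));
      // Sets up the partitioner/reducer so the HFiles line up with the table's regions.
      HFileOutputFormat.configureIncrementalLoad(job, new HTable(conf, "my_table"));
      return job;
    }

    // Option 2: mapper-only job that writes through TableOutputFormat (i.e. HTable#put()).
    static Job putJob(Configuration conf) throws Exception {
      Job job = new Job(conf, "puts-from-mappers");
      job.setJarByClass(InsertJobSketch.class);
      job.setMapperClass(PutMapper.class);
      FileInputFormat.addInputPath(job, new Path("/input/records"));
      TableMapReduceUtil.initTableReducerJob("my_table", null, job);
      job.setNumReduceTasks(0);  // mapper-only, as mentioned above
      return job;
    }

    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      bulkLoadJob(conf).waitForCompletion(true);  // or putJob(conf)
    }
  }

With ImportTsv the same split shows up as: passing -Dimporttsv.bulk.output=<dir>
makes it write HFiles for a later bulk load (the 1st way), and leaving it out
makes it write through puts (the 2nd way); see the tool documentation for the
exact flags.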

-Anoop-

On Sat, Aug 24, 2013 at 4:10 AM, Gautam Borah <[EMAIL PROTECTED]> wrote:

> Thanks Ted for your response, and for clarifying the behavior when using the
> HTable interface.
>
> What would be the behavior for inserting data using a map reduce job? Would
> the recently added records be in the memstore, or do I need to load them for
> read queries after the insert is done?
>
> Thanks,
> Gautam
>
>
> On Fri, Aug 23, 2013 at 2:43 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
>
> > Assuming you are using 0.94, the default value
> > for hbase.regionserver.global.memstore.lowerLimit is 0.35
> >
> > Meaning, the memstore on each region server would be able to hold roughly
> > 3000M * 0.35 / 60 = 17.5 million records.
> >
> > bq. If I use HTable interface, would the inserted data be in the HBase
> > cache, before flushing to the files, for immediate read queries?
> >
> > Yes.
> >
> > Cheers
> >
> >
> > On Fri, Aug 23, 2013 at 12:01 PM, Gautam Borah <[EMAIL PROTECTED]> wrote:
> >
> > > Hi,
> > >
> > > Average size of my records is 60 bytes - 20 bytes key and 40 bytes
> > > value; the table has one column family.
> > >
> > > I have set up a cluster for testing - 1 master and 3 region servers,
> > > each with a heap size of 3 GB and a single CPU.
> > >
> > > I have pre-split the table into 30 regions. I do not have to keep data
> > > forever; I can purge older records periodically.
> > >
> > > Thanks,
> > >
> > > Gautam
> > >
> > >
> > >
> > > On Fri, Aug 23, 2013 at 3:20 AM, Ted Yu <[EMAIL PROTECTED]> wrote:
> > >
> > > > Can you tell us the average size of your records and how much heap is
> > > > given to the region servers?
> > > >
> > > > Thanks
> > > >
> > > > On Aug 23, 2013, at 12:11 AM, Gautam Borah <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > Hello all,
> > > > >
> > > > > I have a use case where I need to write 1 million to 10 million
> > > > > records periodically (at intervals of 1 to 10 minutes) into an HBase
> > > > > table.
> > > > >
> > > > > Once the insert is completed, these records are queried immediately
> > > > > from another program - multiple reads.
> > > > >
> > > > > So, this is one massive write followed by many reads.
> > > > >
> > > > > I have two approaches to insert these records into the HBase table -
> > > > >
> > > > > Use HTable or HTableMultiplexer to stream the data to the HBase table.
> > > > >
> > > > > or
> > > > >
> > > > > Write the data to HDFS as a sequence file (avro in my case), run a map
> > > > > reduce job using HFileOutputFormat, and then load the output files
> > > > > into the HBase cluster. Something like,
> > > > >
> > > > >  LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
> > > > >  loader.doBulkLoad(new Path(outputDir), hTable);
> > > > >
> > > > > In my use case, which approach would be better?
> > > > >
> > > > > If I use HTable interface, would the inserted data be in the HBase
> > > > > cache, before flushing to the files, for immediate read queries?
> > > > >
> > > > > If I use a map reduce job to insert, would the data be loaded into the
> > > > > HBase cache immediately, or would only the output files be copied to
> > > > > the respective hbase table specific directories?
> > > > >
> > > > > So, which approach is better for write and then immediate multiple
> > > > > reads?