HBase, mail # user - EC2 Elastic MapReduce HBase install recommendations


Pal Konyves 2013-05-08, 02:01
Marcos Luis Ortiz Valmase... 2013-05-08, 02:31
ramkrishna vasudevan 2013-05-08, 02:41
Marcos Luis Ortiz Valmase... 2013-05-08, 02:42
Andrew Purtell 2013-05-09, 04:04
Amandeep Khurana 2013-05-09, 04:12
Michel Segel 2013-05-09, 04:47
Pal Konyves 2013-05-09, 09:39
Michel Segel 2013-05-09, 12:32
Pal Konyves 2013-05-12, 02:14
Ted Yu 2013-05-12, 02:25
Re: EC2 Elastic MapReduce HBase install recommendations
Asaf Mesika 2013-05-12, 05:13
We ran into that as well.
You need to make sure that all rowkeys in a List of Put are unique when you
send it; otherwise, as Ted said, the for loop that acquires the row locks will
run multiple times for any rowkey that repeats itself.
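
One way to get there is to merge the writes per rowkey on the client before building the batch. The sketch below is only an illustration in plain Python, not the HBase client API: the `(rowkey, column, value)` tuples and the `merge_by_rowkey` helper are hypothetical stand-ins, and it assumes that a later write to the same column should simply replace an earlier one in the same batch (which ignores timestamps and versions).

```python
def merge_by_rowkey(writes):
    """Collapse (rowkey, column, value) writes so each rowkey appears once.

    Hypothetical helper: within one batch, a later write to the same
    (rowkey, column) replaces the earlier one (last write wins).
    """
    merged = {}
    for rowkey, column, value in writes:
        # One dict of columns per rowkey; repeated rowkeys are folded in.
        merged.setdefault(rowkey, {})[column] = value
    return merged


# Four writes that touch only two distinct rowkeys.
batch = [
    ("row-1", "cf:a", "1"),
    ("row-2", "cf:a", "2"),
    ("row-1", "cf:b", "3"),
    ("row-1", "cf:a", "4"),
]

merged = merge_by_rowkey(batch)
for rowkey, columns in merged.items():
    # Each rowkey would become a single Put carrying all of its columns.
    print(rowkey, columns)
```

With Pal's numbers (4 million inserts landing on about 5000 rows), each rowkey repeats roughly 800 times on average, so merging like this before sending would shrink the number of per-row operations considerably.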

On Sunday, May 12, 2013, Ted Yu wrote:

> High collision rate means high contention at taking the row locks.
> This results in poor write performance.
>
> Cheers
>
> On May 11, 2013, at 7:14 PM, Pal Konyves <[EMAIL PROTECTED]> wrote:
>
> > Hi,
> >
> > I decided not to do any tuning, because my whole project is about
> > experimenting with HBase (it's a school project). However, it turned out
> > that my sample data generated lots of rowkey collisions: 4 million inserts
> > resulted in only about 5000 rows. The data in the columns were different,
> > though. When I changed my sample dataset to have no collisions in the
> > rowkey, the performance increased by a factor of 10. Why is that?
> >
> > Thanks,
> > Pal
> >
> >
> > On Thu, May 9, 2013 at 2:32 PM, Michel Segel <[EMAIL PROTECTED]> wrote:
> >
> >> What I am saying is that by default, you get two mappers per node.
> >> x4large can run HBase with more mapred slots, so you will want to tune the
> >> defaults based on machine size. Not just mapred, but also HBase stuff too.
> >> You need to do this on startup of the EMR cluster though...
> >>
> >> Sent from a remote device. Please excuse any typos...
> >>
> >> Mike Segel
> >>
> >> On May 9, 2013, at 2:39 AM, Pal Konyves <[EMAIL PROTECTED]> wrote:
> >>
> >>> Principally I chose to use Amazon because they are supposedly high
> >>> performance, and more importantly, HBase is already set up if I choose
> >>> it as an EMR Workflow. I wanted to save the time of setting up the
> >>> cluster manually on EC2 instances.
> >>>
> >>> Are you saying I will reach higher performance if I set up HBase on
> >>> the cluster manually, instead of using the default Amazon HBase
> >>> distribution? Or is it worth tuning the Amazon distribution with a
> >>> bootstrap action? How long does it take to set up the cluster with HDFS
> >>> manually?
> >>>
> >>> I will also try larger instance types.
> >>>
> >>>
> >>> On Thu, May 9, 2013 at 6:47 AM, Michel Segel <[EMAIL PROTECTED]> wrote:
> >>>
> >>>> With respect to EMR, you can run HBase fairly easily.
> >>>> You can't run MapR with HBase on EMR; stick with Amazon's release.
> >>>>
> >>>> And you can run it but you will want to know your tuning parameters up
> >>>> front when you instantiate it.
> >>>>
> >>>>
> >>>>
> >>>> Sent from a remote device. Please excuse any typos...
> >>>>
> >>>> Mike Segel
> >>>>
> >>>> On May 8, 2013, at 9:04 PM, Andrew Purtell <[EMAIL PROTECTED]> wrote:
> >>>>
> >>>>> M7 is not Apache HBase, or any HBase. It is a proprietary NoSQL
> >>>>> datastore with (I gather) an Apache HBase compatible Java API.
> >>>>>
> >>>>> As for running HBase on EC2, we recently discussed some particulars, see
> >>>>> the latter part of this thread: http://search-hadoop.com/m/rI1HpK90gu
> >>>>> where I hijack it. I wouldn't recommend launching HBase as part of an EMR
> >>>>> flow unless you want to use it only for temporary random access storage,
> >>>>> in which case use m2.2xlarge/m2.4xlarge instance types. Otherwise, set up
> >>>>> a dedicated HBase backed storage service on high I/O instance types. The
> >>>>> fundamental issue is that IO performance on the EC2 platform is fair to
> >>>>> poor.
> >>>>>
> >>>>> I have also noticed a large difference in baseline block device latency
> >>>>> if using an old Amazon Linux AMI (< 2013) or the latest AMIs from this
> >>>>> year. Use the new ones, they cut the latency long tail in half. There were
> >>>>> some significant kernel level improvements I gather.
> >>>>>
> >>>>>
> >>>>> On Wed, May 8, 2013 a