HBase >> mail # user >> Re: Hbase import Tsv performance (slow import)


Re: Hbase import Tsv performance (slow import)
Yeah, we never used the HBase client API (puts) for loading a batch of
millions of records. Can you tell me where the output HFile(s) from the MR
job are stored in HDFS by default?
On Tue, Oct 23, 2012 at 11:31 PM, Anoop John <[EMAIL PROTECTED]> wrote:

> I think, given your explanation of the need for a unique ID, it is okay. No
> need to worry about data loss. As long as you can make sure you generate a
> unique ID, things are fine. MR will make sure the job runs on the whole data
> set and the output is persisted in files. Yes, these files are HFile(s) only.
> Finally, the HBase cluster is used for loading the HFiles into the region
> stores. Bulk loading huge data this way will be much, much faster than normal
> put()s.
>
> -Anoop-
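
To make the "output is persisted in HFile(s)" step concrete, here is a minimal conceptual sketch in plain Python. It does not use the HBase API; the function name `tsv_to_sorted_cells` and the `ROW_KEY` column marker are assumptions made for this illustration only. It shows the one property the bulk-load path relies on: the job emits cells already sorted by row key and column, so the resulting files can be handed to the region stores without passing through put(), the WAL, or the MemStore.

```python
def tsv_to_sorted_cells(tsv_text, columns):
    """Turn TSV lines into (row_key, column, value) cells in sorted order.

    `columns` names each TSV field; the field called "ROW_KEY" supplies the
    row key (a hypothetical convention for this sketch).
    """
    key_index = columns.index("ROW_KEY")
    cells = []
    for line in tsv_text.splitlines():
        fields = line.split("\t")
        if len(fields) != len(columns):
            continue  # skip malformed lines instead of failing the whole run
        row_key = fields[key_index].encode("utf-8")
        for i, (column, value) in enumerate(zip(columns, fields)):
            if i != key_index:
                cells.append((row_key, column.encode("utf-8"), value.encode("utf-8")))
    # Byte-wise sort by row key, then column. A real HFile also orders by
    # timestamp and cell type, which this sketch leaves out.
    cells.sort()
    return cells


if __name__ == "__main__":
    data = "row2\tb\t20\nrow1\ta\t10\n"
    for cell in tsv_to_sorted_cells(data, ["ROW_KEY", "cf:name", "cf:count"]):
        print(cell)
```

The input lines arrive in arbitrary order, but the cells come out grouped by row key, which is the layout the final load step expects.
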
>
> On Wed, Oct 24, 2012 at 11:44 AM, anil gupta <[EMAIL PROTECTED]>
> wrote:
>
> > Anoop: The only thing is, suppose some mappers crashed. Then the MR
> > framework will run that mapper again on the same data set. Will the
> > unique ID then be different?
> >
> > Anil: Yes, even for the same data set the unique ID will be different.
> > The unique ID does not depend on the data.
> >
> > Thanks,
> > Anil Gupta
> >
> > On Tue, Oct 23, 2012 at 11:07 PM, Anoop John <[EMAIL PROTECTED]>
> > wrote:
> >
> > > > Is there a way that I can explicitly turn on WAL for bulk loading?
> > > No.
> > > How do you generate the unique ID? Remember that the initial steps won't
> > > need the HBase cluster at all. MR generates the HFiles, and the output
> > > will be in files only. Mappers also write their output to files. The
> > > only thing is, suppose some mappers crashed. Then the MR framework will
> > > run that mapper again on the same data set. Will the unique ID then be
> > > different? I think you don't need to worry about data loss from the
> > > HBase side, so the WAL is not required.
> > >
> > > -Anoop-
> > >
> > >
> > >
> > >
> > > On Wed, Oct 24, 2012 at 10:58 AM, anil gupta <[EMAIL PROTECTED]>
> > > wrote:
> > >
> > > > That's a very interesting fact. You made it clear, but my custom bulk
> > > > loader generates a unique ID for every row in the map phase. So not
> > > > all of my data is in CSV or text files. Is there a way that I can
> > > > explicitly turn on WAL for bulk loading?
> > > >
> > > > On Tue, Oct 23, 2012 at 10:14 PM, Anoop John <[EMAIL PROTECTED]>
> > > > wrote:
> > > >
> > > > > Hi Anil,
> > > > > In the case of bulk loading, data is not put into HBase one row at
> > > > > a time. The MR job creates its output as HFiles: it creates the KVs
> > > > > and writes them to file in the order an HFile expects. The file is
> > > > > then loaded into HBase at the end. Only this final step uses the
> > > > > HBase RS, so there is no point in a WAL there. Am I making it clear
> > > > > for you? The data is already present as raw data in some txt or csv
> > > > > file :)
> > > > >
> > > > > -Anoop-
> > > > >
> > > > > On Wed, Oct 24, 2012 at 10:41 AM, Anoop John <[EMAIL PROTECTED]>
> > > > > wrote:
> > > > >
> > > > > > Hi Anil
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Wed, Oct 24, 2012 at 10:39 AM, anil gupta <[EMAIL PROTECTED]>
> > > > > > wrote:
> > > > > >
> > > > > >> Hi Anoop,
> > > > > >>
> > > > > >> As per your last email, did you mean that the WAL is not used
> > > > > >> with the HBase Bulk Loader? If so, how do we ensure "no data
> > > > > >> loss" in case of a RegionServer failure?
> > > > > >>
> > > > > >> Thanks,
> > > > > >> Anil Gupta
> > > > > >>
> > > > > >> On Tue, Oct 23, 2012 at 9:55 PM, ramkrishna vasudevan <
> > > > > >> [EMAIL PROTECTED]> wrote:
> > > > > >>
> > > > > >> > As Kevin suggested, we can make use of a bulk load that goes
> > > > > >> > through the WAL and MemStore. Or the second option will be to
> > > > > >> > use the output of the mappers to create HFiles directly.
> > > > > >> >
> > > > > >> > Regards
> > > > > >> > Ram
> > > > > >> >
> > > > > >> > On Wed, Oct 24, 2012 at 8:59 AM, Anoop John <

Thanks & Regards,
Anil Gupta
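
The point about re-run mappers can be illustrated with a small simulation. This is only a sketch in plain Python under stated assumptions: the helper names (`random_row_id`, `deterministic_row_id`, `run_map_attempt`, `run_task_with_retry`) are hypothetical and not part of Hadoop or HBase, and the model assumes, as Hadoop's output commit normally works, that only one successful attempt per task has its output kept. Under that assumption a retried mapper producing different random IDs does not duplicate or lose rows. A position-derived ID is included only as an optional alternative for anyone who wants retries to be reproducible.

```python
import hashlib
import uuid


def random_row_id(_split, _offset, _line):
    """ID that does not depend on the data, as described in the thread."""
    return uuid.uuid4().hex


def deterministic_row_id(split, offset, line):
    """Hypothetical alternative: derive the ID from record position and content."""
    return hashlib.sha1(f"{split}:{offset}:{line}".encode("utf-8")).hexdigest()


def run_map_attempt(split, lines, make_id):
    """Simulate one attempt of a map task over one input split."""
    output = {}
    offset = 0
    for line in lines:
        output[make_id(split, offset, line)] = line
        offset += len(line.encode("utf-8")) + 1  # +1 for the newline
    return output


def run_task_with_retry(split, lines, make_id):
    """The first attempt 'crashes' and is discarded; only the retry is committed."""
    _discarded = run_map_attempt(split, lines, make_id)
    return run_map_attempt(split, lines, make_id)


if __name__ == "__main__":
    rows = ["a,1", "b,2", "a,1"]
    committed = run_task_with_retry("part-0", rows, random_row_id)
    print(len(committed), "rows committed, one ID per input row")
```

With random IDs the two attempts share no keys, yet the committed output still holds exactly one entry per input row. With the deterministic variant both attempts yield identical keys.
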