Re: Hbase import Tsv performance (slow import)


Nick maillard 2012-10-24, 11:40
ramkrishna vasudevan 2012-10-24, 13:47
Nick maillard 2012-10-24, 10:15
Sonal Goyal 2012-10-24, 11:18
Nick maillard 2012-10-24, 10:05
Nick maillard 2012-10-24, 09:23
Nick maillard 2012-10-24, 14:35
Kevin Odell 2012-10-24, 16:18
anil gupta 2012-10-24, 16:30
Nick maillard 2012-10-24, 16:29
nick maillard 2012-10-24, 19:08
Nick maillard 2012-10-23, 17:13
Nicolas Liochon 2012-10-23, 17:32
Kevin Odell 2012-10-23, 17:47
lars hofhansl 2012-10-25, 04:10
Nick maillard 2012-10-23, 15:48
Anoop John 2012-10-24, 03:29
ramkrishna vasudevan 2012-10-24, 04:55
anil gupta 2012-10-24, 05:09
Anoop John 2012-10-24, 05:11
Anoop John 2012-10-24, 05:14
anil gupta 2012-10-24, 05:28
Anoop John 2012-10-24, 06:07
anil gupta 2012-10-24, 06:14
Anoop John 2012-10-24, 06:31
anil gupta 2012-10-24, 06:43
ramkrishna vasudevan 2012-10-24, 05:52
Re: Hbase import Tsv performance (slow import)
Yes, the uniqueId is not part of the CSV file. In my bulk loader I use a
combination of nodeId+processId+counter as the unique ID for each row. I have
to use the uniqueId since the remaining part of the rowkey is not unique.
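
For illustration, a minimal sketch of such a nodeId+processId+counter scheme;
the class and names here are invented for the example, not taken from any
actual loader:

import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical generator for row-key prefixes of the form nodeId-pid-counter.
public class UniqueIdGenerator {
    private final String prefix;
    private final AtomicLong counter = new AtomicLong();

    public UniqueIdGenerator(String nodeId) {
        // RuntimeMXBean#getName() returns "pid@hostname" on typical JVMs.
        String pid = ManagementFactory.getRuntimeMXBean().getName().split("@")[0];
        this.prefix = nodeId + "-" + pid + "-";
    }

    public String next() {
        return prefix + counter.getAndIncrement();
    }
}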

I think there are two approaches to solve this problem:
1. Generate HFiles through MR and then do an incremental load. I am fine with
this approach as we will have the entire trace of the data in HFiles.
2. Use prePut observers? I am already using the prePut hook for some other
purpose.

Thanks,
Anil Gupta
On Tue, Oct 23, 2012 at 10:52 PM, ramkrishna vasudevan <
[EMAIL PROTECTED]> wrote:

> Anil,
> When you do ImportTSV, the data that is present in the TSV file alone
> will be parsed and loaded into HBase.
> How are you planning to generate the UniqueID? It seems like your data is
> in a CSV file but the unique ID that you need is not part of the TSV.
> Now you need the rows to be loaded into HBase through the WAL.
>
> I would suggest that you first load the existing TSV file into
> one HTable.
> Then from that table you can do a bulk load into another table using your
> custom mapper.  Here you can use the logic of generating a unique ID for
> every row that comes out of the loaded table.
> Here we can make the data be inserted into the new table through normal
> Puts, which will use the WAL and memstore.
>
> Regards
> Ram
>
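
A minimal sketch of the staging-table idea above, assuming the 0.94-era HBase
API (class name, table contents, and the hardcoded ID prefix are illustrative):
a TableMapper reads each row from the first table, prefixes a generated unique
ID, and re-emits a normal Put, which goes through the WAL and memstore.

import java.io.IOException;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;

// Re-keys rows from the staging table and emits them as normal Puts.
public class RekeyMapper extends TableMapper<ImmutableBytesWritable, Put> {
    private long counter = 0;  // combine with nodeId/processId in practice

    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
            throws IOException, InterruptedException {
        // Prefix the original row key with a generated unique ID (illustrative).
        byte[] newRow = Bytes.add(Bytes.toBytes("node1-42-" + (counter++) + "-"), row.get());
        Put put = new Put(newRow);
        for (KeyValue kv : value.raw()) {
            put.add(kv.getFamily(), kv.getQualifier(), kv.getValue());
        }
        context.write(new ImmutableBytesWritable(newRow), put);
    }
}

Wired up with TableMapReduceUtil.initTableMapperJob for the source table and
TableMapReduceUtil.initTableReducerJob (IdentityTableReducer) for the
destination, this writes through the normal client path rather than HFiles.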
> On Wed, Oct 24, 2012 at 10:58 AM, anil gupta <[EMAIL PROTECTED]>
> wrote:
>
> > That's a very interesting fact. You made it clear, but my custom Bulk
> > Loader generates a unique ID for every row in the map phase. So, not all
> > of my data is in the CSV or text file. Is there a way that I can
> > explicitly turn on the WAL for bulk loading?
> >
> > On Tue, Oct 23, 2012 at 10:14 PM, Anoop John <[EMAIL PROTECTED]>
> > wrote:
> >
> > > Hi Anil
> > >                 In case of bulk loading, it is not like the data is put
> > > into HBase one by one. The MR job will create output in the HFile
> > > format: it will create the KVs and write them to the file in the order
> > > an HFile expects. Then the file is loaded into HBase as the final step.
> > > Only for this final step is the HBase RS used, so there is no point in
> > > the WAL there...  Am I making it clear for you?   The data is already
> > > present in the form of raw data in some txt or csv file  :)
> > >
> > > -Anoop-
> > >
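
For reference, a condensed sketch of the HFile route Anoop describes (paths
and table name are made up; the 0.94-era API is assumed). The MR job writes
HFiles via HFileOutputFormat, and only the final doBulkLoad step touches the
region servers, which is why no WAL is involved:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "bulk-load-example");
        job.setJarByClass(BulkLoadDriver.class);
        // ... set the mapper that turns raw lines into Puts here ...

        HTable table = new HTable(conf, "mytable");
        // Configures the partitioner, reducer and output format from the
        // table's region boundaries, so HFiles line up with regions.
        HFileOutputFormat.configureIncrementalLoad(job, table);
        FileOutputFormat.setOutputPath(job, new Path("/tmp/hfiles"));
        job.waitForCompletion(true);

        // Final step: move the HFiles under the table's regions (the only
        // point where region servers are involved).
        new LoadIncrementalHFiles(conf).doBulkLoad(new Path("/tmp/hfiles"), table);
    }
}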
> > > On Wed, Oct 24, 2012 at 10:41 AM, Anoop John <[EMAIL PROTECTED]>
> > > wrote:
> > >
> > > > Hi Anil
> > > >
> > > >
> > > >
> > > > On Wed, Oct 24, 2012 at 10:39 AM, anil gupta <[EMAIL PROTECTED]
> > > >wrote:
> > > >
> > > >> Hi Anoop,
> > > >>
> > > >> As per your last email, did you mean that the WAL is not used while
> > > >> using the HBase Bulk Loader? If yes, then how do we ensure "no data
> > > >> loss" in case of a RegionServer failure?
> > > >>
> > > >> Thanks,
> > > >> Anil Gupta
> > > >>
> > > >> On Tue, Oct 23, 2012 at 9:55 PM, ramkrishna vasudevan <
> > > >> [EMAIL PROTECTED]> wrote:
> > > >>
> > > >> > As Kevin suggested, we can make use of a load that goes through
> > > >> > the WAL and memstore.  Or the second option will be to use the
> > > >> > output of the mappers to create HFiles directly.
> > > >> >
> > > >> > Regards
> > > >> > Ram
> > > >> >
> > > >> > On Wed, Oct 24, 2012 at 8:59 AM, Anoop John <
> [EMAIL PROTECTED]>
> > > >> wrote:
> > > >> >
> > > >> > > Hi
> > > >> > >     Using the ImportTSV tool, you are trying to bulk load your
> > > >> > > data. Can you check and tell how many mappers and reducers there
> > > >> > > were? Out of the total time, what is the time taken by the
> > > >> > > mapper phase and by the reducer phase?  Seems like an MR-related
> > > >> > > issue (maybe some conf issue). In this bulk load case most of
> > > >> > > the work is done by the MR job. It will read the raw data,
> > > >> > > convert it into Puts and write to HFiles. The MR output is
> > > >> > > HFiles itself. The next part in ImportTSV will just put the
> > > >> > > HFiles under the table region
Thanks & Regards,
Anil Gupta
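
For reference, a typical ImportTSV invocation of the kind discussed in this
thread (table name, column mapping, and paths are illustrative). Without
-Dimporttsv.bulk.output the tool writes through normal Puts, and thus the WAL;
with it, the job writes HFiles that are then handed to the bulk loader:

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.columns=HBASE_ROW_KEY,cf:col1,cf:col2 \
  -Dimporttsv.bulk.output=/tmp/hfiles \
  mytable /input/data.tsv

hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/hfiles mytable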
Jonathan Bishop 2012-10-25, 15:57
anil gupta 2012-10-25, 20:33
anil gupta 2012-10-25, 20:35
Anoop Sam John 2012-10-26, 04:07
Nicolas Liochon 2012-10-23, 16:46