Re: importing a large table
Rita 2012-03-31, 20:26
Heh. Thanks for the links. I already read the Do's and Don'ts :-). The
video's volume is rather low.

I am already using lzo as my compression method. My regions are set to
30GB in resident memory.

On Sat, Mar 31, 2012 at 1:19 PM, Marcos Ortiz <[EMAIL PROTECTED]> wrote:

> Well, doing some calculations, you have 18 TB of data, divided in 9200
> regions, you have approximately 2.4 GB per region. Is this correct?
>
> Well, my first advice is that you have to disable the automatic split
> mechanism in HBase. It is better to do this manually, otherwise you will
> have an insane number of regions in a short time.
>
> The second is to enable compression (Gzip, LZO, Snappy) in all your HBase
> cluster. This gives you less data to work with, and less network
> overhead.
>
> Omer, one of the software engineers at the LA Hadoop User Group, gave an
> excellent talk about HBase called: "HBase Do's and Don'ts". I recommend
> that you see this talk.
>
> See the post first in Cloudera's blog:
> http://www.cloudera.com/blog/2011/04/hbase-dos-and-donts/
>
> - Video
> http://www.meetup.com/LA-HUG/pages/Video_from_April_13th_HBASE_DO%27S_and_DON%27TS/
>
> On 3/31/2012 5:33 AM, Rita wrote:
>
>> I have close to 9200 regions. Is there an example I can follow? Or are
>> there tools to do this already?
>>
>> On Fri, Mar 30, 2012 at 10:11 AM, Marcos Ortiz <[EMAIL PROTECTED]> wrote:
>>
>> On 03/30/2012 04:54 AM, Rita wrote:
>>
>>> Thanks for the responses. I am using 0.90.4-cdh3. I exported the table
>>> using hbase exporter. Yes, the previous table still exists but on a
>>> different cluster. My region servers are large, close to 12GB in size.
>>
>> What is the total number of your regions?
>>
>>> I want to understand regarding Hfiles. We export the table as a
>>> series of Hfiles and then import them in?
>>
>> Yes. The simplest way to do this is using the TableOutputFormat, but
>> if you use the HFileOutputFormat instead, the process will be more
>> efficient, because using this feature (bulk loads) will use less CPU
>> and network. With a MapReduce job, you prepare your data using the
>> HFileOutputFormat (Hadoop's TotalOrderPartitioner class is used to
>> partition the map output into disjoint ranges of the key space,
>> corresponding to the key ranges of the regions in the table).
>>
>>> What is the difference between that and the regular MR export job?
>>
>> The main difference with regular MR jobs is the output: instead of
>> using the classic output formats like TextOutputFormat,
>> MultipleOutputFormat, SequenceFileOutputFormat, etc., you will use
>> the HFileOutputFormat, which writes the native data file type for
>> HBase (HFile).
>>
>>> The idea sounds good because it sounds simple on the surface :-)
>>>
>>> On Fri, Mar 30, 2012 at 12:08 AM, Stack <[EMAIL PROTECTED]> wrote:
>>>
>>>> On Thu, Mar 29, 2012 at 7:57 PM, Rita <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I am importing a 40+ billion row table which I exported several
>>>>> months ago. The data size is close to 18TB on hdfs (3x replication).
>>>>
>>>> Does the table from back then still exist? Or do you remember what
>>>> the key spread was like? Could you precreate the old table?
>>>>
>>>>> My problem is when I try to import it with mapreduce it takes a few
>>>>> days -- which is ok -- however when the job fails for whatever
>>>>> reason, I have to restart everything. Is it possible to import the
>>>>> table in chunks like, import 1/3, 2/3, and then finally 3/3 of the
>>>>> table?
>>>>
>>>> Yeah. Funny how the plug gets pulled on the rack when the three