Re: Best practices for loading data into hbase
Mohammad Tariq 2013-05-31, 20:31
I am sorry to barge in when heavyweights are already involved here, but
just out of curiosity, why don't you use Sqoop <http://sqoop.apache.org/> to
import the data directly from your existing systems into HBase instead of
first taking a dump and then doing the import? Sqoop allows us to do
incremental imports as well.
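To make the idea concrete, here is a rough plain-Java model of what an append-style incremental import does. It is only a sketch under my own assumptions, not Sqoop code or its API: the class, the Row type and the in-memory list standing in for the HBase table are all made up for illustration. Each run remembers the highest value seen in a "check column" and imports only the rows above it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of an append-style incremental import.
// Not Sqoop code: it only mimics the "check column / last value" idea.
public class IncrementalImportSketch {

    // A source row; id plays the role of the monotonically increasing check column.
    public static final class Row {
        public final long id;
        public final String payload;

        public Row(long id, String payload) {
            this.id = id;
            this.payload = payload;
        }
    }

    // Highest check-column value imported so far.
    private long lastValue;

    // Stands in for the target HBase table in this sketch.
    private final List<Row> imported = new ArrayList<>();

    public IncrementalImportSketch(long initialLastValue) {
        this.lastValue = initialLastValue;
    }

    // Imports only rows whose id is greater than the stored last value,
    // then advances the last value. Returns the number of rows imported.
    public int runImport(List<Row> source) {
        int count = 0;
        long max = lastValue;
        for (Row r : source) {
            if (r.id > lastValue) {
                imported.add(r);
                count++;
                max = Math.max(max, r.id);
            }
        }
        lastValue = max;
        return count;
    }

    public long getLastValue() {
        return lastValue;
    }

    public int importedCount() {
        return imported.size();
    }

    public static void main(String[] args) {
        List<Row> source = new ArrayList<>();
        source.add(new Row(1, "a"));
        source.add(new Row(2, "b"));

        IncrementalImportSketch job = new IncrementalImportSketch(0);
        System.out.println(job.runImport(source)); // initial import: 2 rows

        source.add(new Row(3, "c"));
        System.out.println(job.runImport(source)); // second run: only the new row, 1
    }
}
```

In Sqoop itself, as far as I know, this corresponds to the `--incremental append` mode together with `--check-column` and `--last-value`, so the bookkeeping above is done for you.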
Pardon me if this sounds childish.
On Sat, Jun 1, 2013 at 1:56 AM, Ted Yu <[EMAIL PROTECTED]> wrote:
> bq. Once we process an xml file and we populate our 3 "production" hbase
> tables, could we bulk load another xml file and append this new data to our
> 3 tables or would it write over what was written before?
> You can bulk load another XML file.
> bq. should we process our input xml file with 3 MapReduce jobs instead of 1
> You don't need to use 3 jobs.
> Looks like you were using CDH. Mind telling us the version numbers for HBase
> and Hadoop?
> On Fri, May 31, 2013 at 1:19 PM, David Poisson <[EMAIL PROTECTED]> wrote:
> > Hi,
> > We are still very new at all of this hbase/hadoop/mapreduce stuff. We
> > are looking for the best practices that will fit our requirements. We are
> > currently using the latest Cloudera VMware image (single node) for our
> > development tests.
> > The problem is as follows:
> > We have multiple sources in different formats (xml, csv, etc.), which are
> > dumps of existing systems. As one might expect, there will be an initial
> > "import" of the data into hbase,
> > and afterwards the systems would most likely dump whatever data they
> > accumulated since the initial import into hbase or since the last data
> > dump. Another thing: we need an
> > intermediate step, so that we can ensure all of a source's data can be
> > successfully processed, something which would look like:
> > XML data file --(MR JOB)--> Intermediate (hbase table or hfile?) --(MR
> > JOB)--> production tables in hbase
> > We're guessing we can't use something like a transaction in hbase, so we
> > thought about using an intermediate step. Is that how things are normally
> > done?
> > As we import data into hbase, we will be populating several tables that
> > link data parts together (account X in System 1 == account Y in System
> > 2) as tuples in 3 tables. Currently,
> > this is being done by a mapreduce job which reads the XML source and uses
> > MultiTableOutputFormat to "put" data into those 3 hbase tables. This
> > isn't that fast using our test sample (2 minutes for 5 MB), so we are
> > looking at optimizing the loading of data.
> > We have been researching bulk loading but we are unsure of a couple of
> > things:
> > Once we process an xml file and we populate our 3 "production" hbase
> > tables, could we bulk load another xml file and append this new data to our
> > 3 tables or would it write over what was written before?
> > In order to bulk load, we need to output a file using HFileOutputFormat.
> > Since MultiHFileOutputFormat doesn't seem to officially exist yet (still in
> > the works, right?), should we process our input xml file
> > with 3 MapReduce jobs instead of 1 and output an hfile for each, which
> > could then become our intermediate step (if all 3 hfiles were created
> > without errors, then the process was successful: bulk load
> > into hbase)? Can you experiment with bulk loading on a VMware VM? We're
> > experiencing problems with the partitions file not being found, with the
> > following exception:
> > java.lang.Exception: java.lang.IllegalArgumentException: Can't read partitions file
> > at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:404)
> > Caused by: java.lang.IllegalArgumentException: Can't read partitions file
> > at
> > at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:70)