HBase, mail # user - A list of HBase backup options


Re: A list of HBase backup options
Otis Gospodnetic 2011-03-10, 20:31
Hi,
> > 1) make an export/backup of 1 table at a time using
> > org.apache.hadoop.hbase.mapreduce.Export from HBASE-1684
>
> This is actually checked in.  See:
>
> ./bin/hadoop jar hbase-0.X.X.jar
>
> > 2) copy 1 table at a time using
> > http://hbase.apache.org/docs/current/api/org/apache/hadoop/hbase/mapreduce/CopyTable.html
> >
> > 3) use distcp to copy the whole /hbase part of HDFS
> > 4) replicate the whole cluster - http://hbase.apache.org/replication.html
> > 5) count on HDFS replication and live without the standard backup
> >
> >
> > What I'm not sure about is the following:
> >
> > 1) Is any one of the above options "hot", meaning that it can be used while
> > the source cluster is running and that it produces a consistent backup (a
> > snapshot or checkpoint of the source cluster's data)?
> > I imagine only replication of the whole cluster (point 4) above) is really
> > "hot"?
> >
>
> Options 1) and 2) will give you a snapshot on a table at a particular
> instance in time.  You'll get the state of the row at the time the
> MapReduce job crosses that row.

Hm, isn't this contradictory?  That is, doesn't "snapshot of a table at a
particular instance in time" mean that I'd get a snapshot of *all* rows at a
single point in time, and not the value of each row at the moment the Export or
CopyTable MR job crosses it?
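
To make the distinction concrete, here is a toy sketch in plain Python.  It
does not use any HBase API; the table is just a dict, and the names
(point_in_time_snapshot, scan_with_concurrent_writes, the row keys and values)
are made up for illustration.  It assumes the copy job reads rows in key order
while writes keep arriving.

```python
def point_in_time_snapshot(table):
    """What a true snapshot would return: every row as of one instant."""
    return dict(table)


def scan_with_concurrent_writes(table, writes):
    """Copy rows one at a time, in key order, while writes keep landing.

    `writes` maps a scan step to a (row, value) write that is applied
    just before the scan reads the row for that step.
    """
    copied = {}
    for step, row in enumerate(sorted(table)):
        if step in writes:
            written_row, written_value = writes[step]
            table[written_row] = written_value
        # The copy sees the row as it is when the scan crosses it.
        copied[row] = table[row]
    return copied


table = {"row1": "a0", "row2": "b0", "row3": "c0"}
start_state = point_in_time_snapshot(table)

# row1 changes after the scan has passed it; row3 changes before the scan
# reaches it.
writes = {1: ("row1", "a1"), 2: ("row3", "c1")}
scanned = scan_with_concurrent_writes(table, writes)
end_state = dict(table)

print("start  :", start_state)
print("scanned:", scanned)
print("end    :", end_state)
```

In this sketch the scanned copy mixes the old row1 with the new row3, so it
matches neither the state when the job started nor the state when it finished.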

Also, it seems like all options are per-table, right?  There is nothing other
than near real-time full-cluster replication that would back up all tables at
once?
This is important when you have multiple tables storing data that depend on each
other.  Imagine tables A and B, where table B depends on A.  If you back up A
first, then by the time you back up B, it may reference some data in A that your
backup of A doesn't contain.  If you flip the order and back up B first, then by
the time you back up A, it may contain some extra data that B's backup doesn't
refer to.

Simply put, the backup copies of these 2 tables won't be in sync.
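
The same scenario as a toy sketch in plain Python, again with no HBase API
involved.  The tables are dicts, rows in B hold the key of the A row they
depend on, and all names (backup_in_order, dangling_refs, unreferenced_rows) are
hypothetical.  It assumes one write to each table lands between the two
backups.

```python
def backup_in_order(order):
    """Back up tables A and B one after the other, with writes in between."""
    table_a = {"a1": "payload"}
    table_b = {"b1": "a1"}  # each row in B stores the key of a row in A
    tables = {"A": table_a, "B": table_b}
    backups = {}

    first, second = order
    backups[first] = dict(tables[first])

    # The application keeps writing while the first backup is finishing.
    table_a["a2"] = "payload"
    table_b["b2"] = "a2"

    backups[second] = dict(tables[second])
    return backups["A"], backups["B"]


def dangling_refs(backup_a, backup_b):
    """Keys referenced from B's backup that are missing from A's backup."""
    return sorted(ref for ref in backup_b.values() if ref not in backup_a)


def unreferenced_rows(backup_a, backup_b):
    """Rows in A's backup that nothing in B's backup refers to."""
    referenced = set(backup_b.values())
    return sorted(key for key in backup_a if key not in referenced)


a_first = backup_in_order(("A", "B"))
b_first = backup_in_order(("B", "A"))

print("A first, dangling refs in B :", dangling_refs(*a_first))
print("B first, extra rows in A    :", unreferenced_rows(*b_first))
```

Backing up A first leaves B's backup pointing at a row that A's backup lacks;
backing up B first leaves A's backup with a row that B's backup never mentions.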

How do people deal with this?

Would it make sense to document this sort of stuff on
http://hbase.apache.org/book/book.html ?

Thanks,
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/