HBase >> mail # user >> A list of HBase backup options


Re: A list of HBase backup options
Hi,
> > 1) make an export/backup of 1 table at a time using
> >  org.apache.hadoop.hbase.mapreduce.Export from HBASE-1684
>
> This is actually checked in.  See:
>
> ./bin/hadoop jar hbase-0.X.X.jar
>
> > 2) copy 1 table at a time using
> >
>http://hbase.apache.org/docs/current/api/org/apache/hadoop/hbase/mapreduce/CopyTable.html
>
> >
> > 3) use distcp to copy the whole /hbase part of HDFS
> > 4) replicate the whole cluster - http://hbase.apache.org/replication.html
> > 5) count on HDFS replication and live without the standard backup
> >
> >
> > What I'm not sure about is the following:
> >
> > 1) Is any one of the above options "hot", meaning that it can be used
> > while the source cluster is running and that it produces a consistent
> > backup (a snapshot or checkpoint of the source cluster's data)?
> > I imagine only replication of the whole cluster (point 4) above) is
> > really "hot"?
> >
>
> Options 1) and 2) will give you a snapshot of a table at a particular
> instance in time.  You'll get the state of a row at the time the
> MapReduce job crosses that row.
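For reference, the invocations behind options 1)-3) look roughly like the
following.  This is a sketch from memory of the HBase MapReduce driver and
Hadoop distcp usage of that era; the jar name, table names, paths, and
NameNode addresses are placeholders, not values from this thread:

```shell
# Option 1: Export one table to a directory of sequence files on HDFS
# (placeholder jar name and table/path arguments)
./bin/hadoop jar hbase-0.X.X.jar export mytable /backups/mytable

# Option 2: CopyTable - copy one table to another table (same or peer cluster)
./bin/hadoop jar hbase-0.X.X.jar copytable --new.name=mytable_copy mytable

# Option 3: distcp the whole /hbase directory between clusters
# (only safe if HBase is quiesced; placeholder NameNode addresses)
./bin/hadoop distcp hdfs://src-nn:8020/hbase hdfs://dst-nn:8020/hbase-backup
```

Note that option 3) copies HFiles out from under a live cluster unless HBase
is shut down or quiesced first, which is part of why it isn't "hot".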

Hm, isn't this contradictory?  That is, doesn't "snapshot of a table at a
particular instance in time" mean that I'd get a snapshot of *all* rows at a
single point in time, and not the value of each row at whatever moment the
Export or CopyTable MR job happens to cross it?

Also, it seems like all options are per-table, right?  There is nothing other
than near-real-time full-cluster replication that would back up all tables at
once?
This matters when multiple tables store data that depend on each other.
Imagine tables A and B, where B depends on A.  If I first back up A, then by
the time I back up B, B may reference some data in A that my backup of A
doesn't contain.  If I flip the order and first back up B, then by the time I
back up A it may contain some extra data that B's backup doesn't refer to.

Simply put, the backup copies of these 2 tables won't be in sync.
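The skew is easy to demonstrate with a toy shell simulation (files standing
in for tables; everything here is hypothetical, not an HBase command):

```shell
# Files A and B stand in for two HBase tables; rows in B reference row ids
# in A.  Back up A first, let writes continue, then back up B.
printf 'row1\n' > A
printf 'ref:row1\n' > B
cp A A.bak                       # "back up" table A
printf 'row2\n' >> A             # the cluster keeps taking writes...
printf 'ref:row2\n' >> B
cp B B.bak                       # "back up" table B later
# B's backup now references row2, which A's backup has never seen.
grep -q 'ref:row2' B.bak && ! grep -q 'row2' A.bak && echo 'dangling reference'
```

Reversing the copy order just flips the problem: A's backup then carries rows
that B's backup never refers to.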

How do people deal with this?

Would it make sense to document this sort of stuff on
http://hbase.apache.org/book/book.html ?

Thanks,
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/