Re: issues copying data from one table to another
Can you disable the table?
How much free disk space do you have?

Is this a production cluster?
Can you upgrade to CDH3u5?

Are you running a capacity scheduler or fair scheduler?

Just out of curiosity, what would happen if you could disable the table, alter the table's max file size, and then attempt to merge regions?  Note: I've never tried this, don't know if it's possible, just thinking outside of the box...
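For what it's worth, the disable/alter half of that idea would look roughly like the following against the 0.90.x admin API. This is only a sketch -- the table name and the new max file size are made up, and as noted above the merge step itself is untested (region merges would have to go through the separate org.apache.hadoop.hbase.util.Merge utility):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class RaiseMaxFileSize {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    byte[] table = Bytes.toBytes("source_table");     // hypothetical table name

    admin.disableTable(table);                        // table has to be offline to alter it
    HTableDescriptor desc = admin.getTableDescriptor(table);
    desc.setMaxFileSize(10L * 1024 * 1024 * 1024);    // e.g. raise MAX_FILESIZE to 10 GB
    admin.modifyTable(table, desc);
    admin.enableTable(table);                         // re-enable once the alter is done
  }
}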

Outside of that... the safest way to do this would be to export the table. You'll get 2800 mappers, so if you are using a scheduler, you just put this into a queue that limits the number of concurrent mappers.
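If you go that route, the queue can be set per job, either with -Dmapred.job.queue.name on the command line or programmatically. A minimal Java sketch against the 0.90.x mapreduce package, assuming the capacity scheduler (a fair-scheduler pool is set via a different property); the queue name and paths here are made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.Export;
import org.apache.hadoop.mapreduce.Job;

public class ThrottledExport {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Route the ~2800 export mappers into a queue that caps concurrent tasks.
    conf.set("mapred.job.queue.name", "bulk-low");    // hypothetical queue name
    Job job = Export.createSubmittableJob(conf,
        new String[] { "source_table", "/exports/source_table" });  // made-up table/path
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}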

When you import the data into your new table, you can run it on an even more restrictive queue so that you have less of an impact on your system.  The downside is that it's going to take a bit longer to run. Again, it's probably the safest way to do this....
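The import side is the same pattern, just pointed at a tighter queue (again, the names here are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.Import;
import org.apache.hadoop.mapreduce.Job;

public class ThrottledImport {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    conf.set("mapred.job.queue.name", "bulk-lowest");  // hypothetical, more restrictive queue
    Job job = Import.createSubmittableJob(conf,
        new String[] { "new_table", "/exports/source_table" });     // made-up table/path
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}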

HTH,

-Mike

On Aug 17, 2012, at 2:17 PM, Norbert Burger <[EMAIL PROTECTED]> wrote:

> Hi folks -- we're running CDH3u3 (0.90.4).  I'm trying to export data
> from an existing table that has far too many regions (2600+ for only 8
> regionservers) into one with a more reasonable region count for this
> cluster (256).  Overall data volume is approx. 3 TB.
>
> I thought initially that I'd use the bulkload/importtsv approach, but
> it turns out this table's schema has column qualifiers made from
> timestamps, so it's impossible for me to specify a list of target
> columns for importtsv.  From what I can tell, the TSV interchange
> format requires your data to have the same colquals throughout.
>
> I took a look at CopyTable and Export/Import, which both appear to
> wrap the HBase client API (emitting Puts from a mapper).  But I'm
> seeing significant performance problems with this approach, to the
> point that I'm not sure it's feasible.  Export appears to work OK, but
> when I try importing the data back from HDFS, the rest of our cluster
> drags to a halt -- client writes (even those not associated with the
> Import) start timing out.  Fwiw, import already disables autoFlush
> (via TableOutputFormat).
>
> From [1], one option I could try would be to disable the WAL.  Are there
> other techniques I should try?  Has anyone implemented a
> bulkloader which doesn't use the TSV format?
>
> Norbert
>
> [1] http://hbase.apache.org/book/perf.writing.html
>
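P.S. On the WAL option from [1]: in 0.90 that's a per-Put setting, so a custom import mapper would have to flip it on every Put it emits. A rough, untested sketch (table, row, and qualifier names are made up), with the usual caveat that skipping the WAL means edits can be lost if a regionserver dies mid-import:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class NoWalPut {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "new_table");  // made-up table name
    Put put = new Put(Bytes.toBytes("rowkey"));
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("20120817T1417"), Bytes.toBytes("value"));
    put.setWriteToWAL(false);   // skip the write-ahead log; unflushed edits are lost on a crash
    table.put(put);
    table.close();
  }
}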