Re: hdfs data into Hbase
Some quick thoughts: your size is bound to increase, because recall that the
row key is stored in every cell. So if your CSV has, say, 5 columns and you
import it into HBase using the first column as the key, you end up with
essentially 9 columns (1 for the row key, plus 2 each for the remaining 4
'rowkey-column' pairs). I know that is a very crude, high-level estimate.
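
To make that estimate a bit more concrete, here is a rough back-of-the-envelope
sketch (not from the original thread). It assumes the standard KeyValue on-disk
layout (key length, value length, row length, row, family length, family,
qualifier, timestamp, type, value); the row key, qualifier, and value sizes are
made-up numbers loosely modeled on the station/date columns in the job quoted below:

public class CellSizeEstimate {
    public static void main(String[] args) {
        // Hypothetical sizes, for illustration only.
        int rowKeyLen = 20;       // e.g. a station/date row key
        int familyLen = 1;        // column family "s"
        int avgQualifierLen = 5;  // e.g. "YEAR", "MONTH", "HOUR"
        int avgValueLen = 4;      // e.g. "2013"
        int numCells = 6;         // 7 CSV columns, first one becomes the row key

        // KeyValue layout: 4B key len + 4B value len + 2B row len + row
        // + 1B family len + family + qualifier + 8B timestamp + 1B type + value
        int perCell = 4 + 4 + 2 + rowKeyLen + 1 + familyLen
                + avgQualifierLen + 8 + 1 + avgValueLen;

        int csvLine = rowKeyLen + numCells * (avgValueLen + 1); // rough CSV bytes per line
        int hbaseBytes = numCells * perCell;                    // same line stored as cells

        System.out.printf("CSV ~%d bytes, HBase cells ~%d bytes (about %.0fx), "
                + "times 3 for HDFS replication ~%d bytes%n",
                csvLine, hbaseBytes, (double) hbaseBytes / csvLine, hbaseBytes * 3);
    }
}

With those made-up numbers, a 50-byte CSV line turns into roughly 300 bytes of
cells, and around 900 bytes once the default 3x HDFS replication is counted,
which is the kind of blow-up being described.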

Also, how are you measuring the size in HDFS after the import into HBase? Are
you excluding the replication of the data?
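
One way to check that, sketched below as a small standalone tool rather than
anything from the original thread, is to compare the logical size of the table's
directory with the raw space consumed, which includes replication. The path to
measure is passed in as an argument, since the exact location under the HBase
root depends on your setup:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SizeWithAndWithoutReplication {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // args[0]: the directory to measure, e.g. the table's directory under the HBase root
        ContentSummary cs = fs.getContentSummary(new Path(args[0]));
        System.out.println("logical bytes:  " + cs.getLength());         // file sizes only
        System.out.println("space consumed: " + cs.getSpaceConsumed());  // includes replication
    }
}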

Regards,
Shahab
On Mon, Sep 9, 2013 at 5:06 AM, kun yan <[EMAIL PROTECTED]> wrote:

> Hello everyone, I wrote a MapReduce program to import data from HDFS into
> HBase, but the data grew a lot after the import: my original data is 69 MB
> on HDFS, yet after importing it into HBase my HDFS usage increased by 3 GB.
> What am I doing wrong in my program?
>
> thanks
>
> public class MRImportHBaseCsv {
>
>     public static void main(String[] args) throws IOException,
>             InterruptedException, ClassNotFoundException {
>         Configuration conf = new Configuration();
>         conf.set("fs.defaultFS", "hdfs://hydra0001:8020");
>         conf.set("yarn.resourcemanager.address", "hydra0001:8032");
>         Job job = createSubmitTableJob(conf, args);
>         job.submit();
>     }
>
>     public static Job createSubmitTableJob(Configuration conf, String[] args)
>             throws IOException {
>         String tableName = args[0];
>         Path inputDir = new Path(args[1]);
>         Job job = new Job(conf, "HDFS_TO_HBase");
>         job.setJarByClass(HourlyImporter.class);
>         FileInputFormat.setInputPaths(job, inputDir);
>         job.setInputFormatClass(TextInputFormat.class);
>         job.setMapperClass(HourlyImporter.class);
>         // ++++ insert into table directly using TableOutputFormat ++++
>         TableMapReduceUtil.initTableReducerJob(tableName, null, job);
>         job.setNumReduceTasks(0);
>         TableMapReduceUtil.addDependencyJars(job);
>         return job;
>     }
>
>     static class HourlyImporter extends
>             Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
>
>         private long ts;
>         // var column family
>         static byte[] family = Bytes.toBytes("s");
>         static String columns = "HBASE_ROW_KEY,STATION,YEAR,MONTH,DAY,HOUR,MINUTE";
>
>         @Override
>         protected void cleanup(Context context) throws IOException,
>                 InterruptedException {
>             ts = System.currentTimeMillis();
>         }
>
>         @Override
>         protected void map(LongWritable key, Text value, Context context)
>                 throws IOException, InterruptedException {
>             ArrayList<String> columnsList = Lists.newArrayList(Splitter.on(',')
>                     .trimResults().split(columns));
>
>             String line = value.toString();
>             ArrayList<String> columnValues = Lists.newArrayList(Splitter
>                     .on(',').trimResults().split(line));
>             byte[] bRowKey = Bytes.toBytes(columnValues.get(0));
>             ImmutableBytesWritable rowKey = new ImmutableBytesWritable(bRowKey);
>
>             Put p = new Put(Bytes.toBytes(columnValues.get(0)));
>             for (int i = 1; i < columnValues.size(); i++) {
>                 p.add(family, Bytes.toBytes(columnsList.get(i)),
>                         Bytes.toBytes(columnValues.get(i)));
>             }
>             context.write(rowKey, p);
>         }
>     }
> }
>
>
> --
>
> In the Hadoop world I am just a novice exploring the entire Hadoop
> ecosystem; I hope one day I can contribute my own code.
>
> YanBit
> [EMAIL PROTECTED]
>