Hive, mail # user - RCFile in java MapReduce


Re: RCFile in java MapReduce
Yin Huai 2012-01-10, 03:16
I have some experience using RCFile with the new MapReduce API, through the
HCatalog project ( http://incubator.apache.org/hcatalog/ ).

For the output part, your main method needs the following:

    job.setOutputFormatClass(RCFileMapReduceOutputFormat.class);
    // numCols is the total number of columns in your output table
    RCFileMapReduceOutputFormat.setColumnNumber(job.getConfiguration(), numCols);
    RCFileMapReduceOutputFormat.setOutputPath(job, new Path(outputPath));
    RCFileMapReduceOutputFormat.setCompressOutput(job, true);

The Map class would look like this:

    import java.io.IOException;
    import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable;
    import org.apache.hadoop.hive.serde2.columnar.BytesRefWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public static class Map
        extends Mapper<Object, Text, NullWritable, BytesRefArrayWritable> {

      private int numCols;
      private BytesRefArrayWritable bytes;

      @Override
      protected void setup(Context context)
          throws IOException, InterruptedException {
        // Read back the column count set by setColumnNumber() in main
        numCols = context.getConfiguration()
            .getInt("hive.io.rcfile.column.number.conf", 0);
        bytes = new BytesRefArrayWritable(numCols);
      }

      @Override
      public void map(Object key, Text line, Context context)
          throws IOException, InterruptedException {
        bytes.clear();
        // Each input line is one row, with columns separated by '|'
        String[] cols = line.toString().split("\\|");
        for (int i = 0; i < numCols; i++) {
          byte[] fieldData = cols[i].getBytes("UTF-8");
          // Wrap the column's bytes and store them at position i of the row
          BytesRefWritable cu =
              new BytesRefWritable(fieldData, 0, fieldData.length);
          bytes.set(i, cu);
        }
        context.write(NullWritable.get(), bytes);
      }
    }

Basically, you need to convert each row to a BytesRefArrayWritable object
(which is bytes in the example above).
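
One detail of the row-to-columns step is worth illustrating. Java's String.split drops trailing empty strings by default, so a row whose last columns are empty yields fewer than numCols pieces, and indexing cols[i] up to numCols would then fail. The sketch below shows one way to guard against that using only the JDK. The class name RowSplitter and the method splitRow are hypothetical names introduced for illustration, and the sketch assumes '|' is a separator between columns rather than a terminator at the end of every line.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of Hive or HCatalog): splits one text row
// into per-column byte arrays before they are wrapped for RCFile output.
final class RowSplitter {
    private RowSplitter() {}

    /**
     * Splits a '|'-separated line into exactly numCols UTF-8 byte arrays.
     * The limit -1 keeps trailing empty fields, which split("\\|") drops.
     */
    static byte[][] splitRow(String line, int numCols) {
        String[] cols = line.split("\\|", -1);
        if (cols.length != numCols) {
            throw new IllegalArgumentException(
                "expected " + numCols + " columns but found "
                    + cols.length + ": " + line);
        }
        byte[][] fields = new byte[numCols][];
        for (int i = 0; i < numCols; i++) {
            fields[i] = cols[i].getBytes(StandardCharsets.UTF_8);
        }
        return fields;
    }

    public static void main(String[] args) {
        // The last column is empty but is still kept as a zero-length field.
        byte[][] fields = splitRow("1|alice|", 3);
        System.out.println(fields.length);
        System.out.println(fields[2].length);
    }
}
```

In a mapper like the one above, each returned array could be wrapped in a BytesRefWritable exactly as the loop already does. Whether a malformed row should raise an error or be skipped depends on the data, so the exception here is only one possible choice.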

For the input part, I do not know how to use RCFileMapReduceInputFormat to
write a MapReduce job for a join operation, so I wrote a custom
InputFormat and RecordReader.
You can find these two classes (MultiRCFileMapReduceInputFormat and
MultiRCFileMapReduceRecordReader) at
http://www.cse.ohio-state.edu/~huai/RCFile/ .
At that link, TestPrintTables.java is an example program that converts
tables in RCFile format to text. I hope the example is
self-explanatory. If you need to

I hope this helps.

Thanks,

Yin

On Wed, Dec 14, 2011 at 8:54 AM, Dominik Wiernicki <[EMAIL PROTECTED]> wrote:

> Hi,
>
> Can someone show me how to use RCFile in a plain MapReduce job (as Input and
> Output Format)?
> Please.
>
>
>