Re: Writing small files to one big file in hdfs
Finally figured it out. I needed to use SequenceFileAsTextInputFormat.
The lack of examples makes this difficult when you are starting out.
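
For readers who land here looking for the missing example, here is a minimal sketch of a map-only job that reads a SequenceFile through SequenceFileAsTextInputFormat. It assumes the newer org.apache.hadoop.mapreduce API as shipped in Hadoop 2.x or later (older releases expose the same input format under org.apache.hadoop.mapred). The class names SequenceFileAsTextJob and PassThroughMapper and the job name are made up for illustration, and the input and output paths are taken from the command line.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileAsTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver class; the name is an assumption for this sketch.
public class SequenceFileAsTextJob {

    // Both key and value arrive as Text, because the input format converts
    // each stored key and value with toString() before calling map().
    public static class PassThroughMapper extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(key, value);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "read-sequence-file-as-text");
        job.setJarByClass(SequenceFileAsTextJob.class);

        // Without this line the job falls back to TextInputFormat,
        // which hands the mapper LongWritable keys (byte offsets).
        job.setInputFormatClass(SequenceFileAsTextInputFormat.class);

        job.setMapperClass(PassThroughMapper.class);
        job.setNumReduceTasks(0); // map-only job
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // SequenceFile input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Since the file in this thread was written with Text keys and Text values, plain SequenceFileInputFormat with the same Mapper<Text, Text, ...> signature should also work. The AsText variant is the safer choice when the stored key or value classes might not be Text.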

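A likely explanation for the ClassCastException quoted below: the key type the mapper receives is decided by the job's input format, not by the code that wrote the file. The default TextInputFormat produces LongWritable byte offsets as keys, so a mapper declared with Text keys fails at run time. The toy model below reproduces that mechanism in plain Java with no Hadoop dependency. OffsetKey, TextKey, SimpleMapper and runMapper are hypothetical stand-ins invented for this sketch, not Hadoop classes.

```java
public class KeyTypeMismatchDemo {

    // Hypothetical stand-in for LongWritable (a byte-offset key).
    static final class OffsetKey {
        final long offset;
        OffsetKey(long offset) { this.offset = offset; }
    }

    // Hypothetical stand-in for Text (a string key).
    static final class TextKey {
        final String text;
        TextKey(String text) { this.text = text; }
    }

    // Simplified mapper contract: the mapper author picks the key type K.
    interface SimpleMapper<K> {
        String map(K key, String value);
    }

    // A mapper written for text keys, similar in spirit to Mapper<Text, Text, ...>.
    static final class TextKeyMapper implements SimpleMapper<TextKey> {
        @Override
        public String map(TextKey key, String value) {
            return key.text + "=" + value;
        }
    }

    // The "framework" only sees Object keys produced by the input format.
    // Generics are erased, so a mismatch shows up at run time, not at compile time.
    @SuppressWarnings("unchecked")
    static String runMapper(SimpleMapper<?> mapper, Object key, String value) {
        try {
            return ((SimpleMapper<Object>) mapper).map(key, value);
        } catch (ClassCastException e) {
            return "ClassCastException";
        }
    }

    public static void main(String[] args) {
        SimpleMapper<TextKey> mapper = new TextKeyMapper();
        // Input format that yields text keys: the mapper runs normally.
        System.out.println(runMapper(mapper, new TextKey("1329871800000"), "line"));
        // Input format that yields offset keys: the cast inside the mapper fails.
        System.out.println(runMapper(mapper, new OffsetKey(0L), "line"));
    }
}
```

The first call prints the joined key and value. The second prints ClassCastException, which mirrors "LongWritable cannot be cast to Text" when the input format and the mapper signature disagree.
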
On Tue, Feb 21, 2012 at 4:50 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:

> It looks like the values arrive in the mapper as binary instead of Text. Is
> this expected from a sequence file? I originally wrote the SequenceFile with
> Text values.
>
>
> On Tue, Feb 21, 2012 at 4:13 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
>
>> I need some more help. I wrote a sequence file using the code below, but
>> when I run the MapReduce job I get "java.lang.ClassCastException:
>> org.apache.hadoop.io.LongWritable cannot be cast to
>> org.apache.hadoop.io.Text", even though I didn't use LongWritable when I
>> originally wrote to the sequence file.
>>
>> // Code to write to the sequence file. There is no LongWritable here.
>>
>> org.apache.hadoop.io.Text key = new org.apache.hadoop.io.Text();
>> BufferedReader buffer = new BufferedReader(new FileReader(filePath));
>> String line = null;
>> org.apache.hadoop.io.Text value = new org.apache.hadoop.io.Text();
>>
>> try {
>>     writer = SequenceFile.createWriter(fs, conf, path, key.getClass(),
>>             value.getClass(), SequenceFile.CompressionType.RECORD);
>>     int i = 1;
>>     long timestamp = System.currentTimeMillis();
>>     while ((line = buffer.readLine()) != null) {
>>         // Every record gets the same key: the timestamp taken before the loop.
>>         key.set(String.valueOf(timestamp));
>>         value.set(line);
>>         writer.append(key, value);
>>         i++;
>>     }
>> } finally {
>>     // Close the writer and the reader so buffered records are flushed.
>>     org.apache.hadoop.io.IOUtils.closeStream(writer);
>>     org.apache.hadoop.io.IOUtils.closeStream(buffer);
>> }
>>
>>
>> On Tue, Feb 21, 2012 at 12:18 PM, Arko Provo Mukherjee <[EMAIL PROTECTED]> wrote:
>>
>>> Hi,
>>>
>>> I think the following link will help:
>>> http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
>>>
>>> Cheers
>>> Arko
>>>
>>> On Tue, Feb 21, 2012 at 2:04 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
>>>
>>> > Sorry, maybe it's something obvious, but I was wondering: when map or
>>> > reduce gets called, what classes are used for the key and value? If I used
>>> > "org.apache.hadoop.io.Text value = new org.apache.hadoop.io.Text();",
>>> > would map be called with the Text class?
>>> >
>>> > public void map(LongWritable key, Text value, Context context) throws
>>> > IOException, InterruptedException {
>>> >
>>> >
>>> > On Tue, Feb 21, 2012 at 11:59 AM, Arko Provo Mukherjee <[EMAIL PROTECTED]> wrote:
>>> >
>>> > > Hi Mohit,
>>> > >
>>> > > I am not sure that I understand your question.
>>> > >
>>> > > But you can write into a file using:
>>> > > BufferedWriter output = new BufferedWriter(
>>> > >         new OutputStreamWriter(fs.create(my_path, true)));
>>> > > output.write(data);
>>> > >
>>> > > Then you can pass that file as the input to your MapReduce program.
>>> > >
>>> > > FileInputFormat.addInputPath(jobconf, new Path(my_path));
>>> > >
>>> > > From inside your Map/Reduce methods, I think you should NOT be tinkering
>>> > > with the input / output paths of that Map/Reduce job.
>>> > > Cheers
>>> > > Arko
>>> > >
>>> > >
>>> > > On Tue, Feb 21, 2012 at 1:38 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
>>> > >
>>> > > > Thanks. How does MapReduce work on a sequence file? Is there an
>>> > > > example I can look at?
>>> > > >
>>> > > > On Tue, Feb 21, 2012 at 11:34 AM, Arko Provo Mukherjee <[EMAIL PROTECTED]> wrote:
>>> > > >
>>> > > > > Hi,
>>> > > > >
>>> > > > > Let's say all the smaller files are in the same directory.
>>> > > > >
>>> > > > > Then you can do:
>>> > > > >
>>> > > > > BufferedWriter output = new BufferedWriter(
>>> > > > >         new OutputStreamWriter(fs.create(output_path, true)));  // Output path
>>> > > > >
>>> > > > > FileStatus[] output_files = fs.listStatus(new Path(input_path));  // Input directory
>>> > > > >
>>> > > > > for (int i = 0; i < output_files.length; i++)
>>> > > > >
>>> > > > > {
>>> > > > >
>>> > > > > *   BufferedReader reader = new