|
Mohit Anchlia
2012-02-21, 17:15
Joey Echeverria
2012-02-21, 17:23
Bejoy Ks
2012-02-21, 17:25
Mohit Anchlia
2012-02-21, 17:30
Bejoy Ks
2012-02-21, 18:25
Bill Graham
2012-02-21, 18:41
Mohit Anchlia
2012-02-21, 19:27
Arko Provo Mukherjee
2012-02-21, 19:34
Mohit Anchlia
2012-02-21, 19:38
Arko Provo Mukherjee
2012-02-21, 19:59
Mohit Anchlia
2012-02-21, 20:04
Arko Provo Mukherjee
2012-02-21, 20:18
Mohit Anchlia
2012-02-22, 00:13
Mohit Anchlia
2012-02-22, 00:50
Edward Capriolo
2012-02-22, 02:42
Mohit Anchlia
2012-02-22, 03:31
|
-
Writing small files to one big file in hdfs
Mohit Anchlia 2012-02-21, 17:15
We have small XML files. Currently I am planning to append these small files to one file in HDFS so that I can take advantage of splits, larger blocks and sequential IO. What I am unsure about is whether it's OK to append one file at a time to this HDFS file.

Could someone suggest if this is OK? I would like to know how others do it.
-
Re: Writing small files to one big file in hdfs
Joey Echeverria 2012-02-21, 17:23
I'd recommend making a SequenceFile[1] to store each XML file as a value.

-Joey

[1] http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/io/SequenceFile.html

--
Joseph Echeverria
Cloudera, Inc.
-
Re: Writing small files to one big file in hdfs
Bejoy Ks 2012-02-21, 17:25
Mohit

Rather than just appending the content to a normal text file, you can create a sequence file with the content of each small file as a value.

Regards
Bejoy.K.S
-
Re: Writing small files to one big file in hdfs
Mohit Anchlia 2012-02-21, 17:30
On Tue, Feb 21, 2012 at 9:25 AM, Bejoy Ks <[EMAIL PROTECTED]> wrote:
> Rather than just appending the content into a normal text file or
> so, you can create a sequence file with the individual smaller file
> content as values.

Thanks. I was planning to use Pig's org.apache.pig.piggybank.storage.XMLLoader for processing. Would it work with a sequence file?

The text file that I was referring to would be in HDFS itself. Is that still different from using a sequence file?
-
Re: Writing small files to one big file in hdfs
Bejoy Ks 2012-02-21, 18:25
Hi Mohit

AFAIK XMLLoader in Pig is not suited to sequence files. Please post the same question to the Pig user group for a workaround.

SequenceFile is a preferred option when we want to store small files in HDFS that need to be processed by MapReduce, as it stores data in key-value format. Since SequenceFileInputFormat is available at your disposal, you don't need any custom input format to process it with MapReduce. It is a cleaner and better approach than just appending the contents of small XML files into a big file.
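To make the key-value idea concrete, here is a minimal plain-Java sketch. It is not the SequenceFile on-disk format and uses no Hadoop classes; it only illustrates storing each small file as one record (file name as key, content as value) so that record boundaries survive inside one big file. The class `RecordContainer` and its methods `pack` and `unpack` are hypothetical names made up for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only: a length-prefixed key/value container.
class RecordContainer {

    // Packs every (file name -> content) pair as one self-delimiting record.
    static byte[] pack(Map<String, String> files) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (Map.Entry<String, String> file : files.entrySet()) {
                byte[] value = file.getValue().getBytes(StandardCharsets.UTF_8);
                out.writeUTF(file.getKey());   // key: the file name
                out.writeInt(value.length);    // length prefix marks the record boundary
                out.write(value);              // value: the file content
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the records back; each small file comes out whole, with its name.
    static Map<String, String> unpack(byte[] container) {
        Map<String, String> files = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(container))) {
            while (in.available() > 0) {
                String key = in.readUTF();
                byte[] value = new byte[in.readInt()];
                in.readFully(value);
                files.put(key, new String(value, StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return files;
    }
}
```

With plain appending, a reader has to rediscover where each XML document starts and ends; with records, each map call would presumably receive exactly one small file.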
-
Re: Writing small files to one big file in hdfs
Bill Graham 2012-02-21, 18:41
You might want to check out File Crusher:
http://www.jointhegrid.com/hadoop_filecrush/index.jsp

I've never used it, but it sounds like it could be helpful.
-
Re: Writing small files to one big file in hdfs
Mohit Anchlia 2012-02-21, 19:27
I am trying to find examples that demonstrate using sequence files, including writing to one and then running MapReduce on it, but I am unable to find any. Could you please point me to some examples of sequence files?
-
Re: Writing small files to one big file in hdfs
Arko Provo Mukherjee 2012-02-21, 19:34
Hi,

Let's say all the smaller files are in the same directory. Then you can do:

    // Needs java.io.* plus org.apache.hadoop.fs.FileSystem, FileStatus and Path.
    // Assumed from usage: fs is a FileSystem, output_path a Path, input_path a String.
    BufferedWriter output = new BufferedWriter(
        new OutputStreamWriter(fs.create(output_path, true))); // Output path
    FileStatus[] input_files = fs.listStatus(new Path(input_path)); // Input directory
    for (int i = 0; i < input_files.length; i++)
    {
        BufferedReader reader = new BufferedReader(
            new InputStreamReader(fs.open(input_files[i].getPath())));
        String data = reader.readLine();
        while (data != null)
        {
            output.write(data);
            output.newLine();          // readLine() strips the line terminator
            data = reader.readLine();  // read the next line, otherwise the loop never ends
        }
        reader.close();
    }
    output.close();

In case you have the files in multiple directories, call the code for each of them with different input paths.

Hope this helps!

Cheers
Arko
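To see what such a line-by-line merge produces, the loop can be tried without a cluster. This is a minimal sketch, assuming local files, with `java.nio.file` standing in for the Hadoop `FileSystem` calls; the class `LocalMerge` and its method `merge` are hypothetical names, and the output file is assumed to live outside the input directory.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical local stand-in for the HDFS merge loop.
class LocalMerge {

    // Appends every regular file in inputDir, line by line, to outputFile.
    // Returns the number of lines written.
    static int merge(Path inputDir, Path outputFile) {
        int lines = 0;
        try {
            List<Path> inputs = new ArrayList<>();
            try (DirectoryStream<Path> dir = Files.newDirectoryStream(inputDir)) {
                for (Path candidate : dir) {
                    if (Files.isRegularFile(candidate)) {
                        inputs.add(candidate);
                    }
                }
            }
            Collections.sort(inputs); // directory listing order is unspecified, so sort
            try (BufferedWriter output = Files.newBufferedWriter(outputFile, StandardCharsets.UTF_8)) {
                for (Path input : inputs) {
                    try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
                        String data = reader.readLine();
                        while (data != null) {
                            output.write(data);
                            output.newLine();         // restore the terminator readLine() removed
                            lines++;
                            data = reader.readLine(); // advance to the next line
                        }
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }
}
```

Note that when each input is a complete XML document, the merged file holds several root elements back to back, so it is no longer a single well-formed XML document and nothing in it records where one input ended.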
-
Re: Writing small files to one big file in hdfs
Mohit Anchlia 2012-02-21, 19:38
Thanks. How does MapReduce work on a sequence file? Is there an example I can look at?
-
Re: Writing small files to one big file in hdfs
Arko Provo Mukherjee 2012-02-21, 19:59
Hi Mohit,

I am not sure that I understand your question.

But you can write into a file using:

    // my_path is taken to be a String here, so it is wrapped in a Path.
    BufferedWriter output = new BufferedWriter(
        new OutputStreamWriter(fs.create(new Path(my_path), true)));
    output.write(data);

Then you can pass that file as the input to your MapReduce program:

    FileInputFormat.addInputPath(jobconf, new Path(my_path));

From inside your Map/Reduce methods, I think you should NOT be tinkering with the input / output paths of that Map/Reduce job.

Cheers
Arko
-
Re: Writing small files to one big file in hdfs
Mohit Anchlia 2012-02-21, 20:04
Sorry, maybe it's something obvious, but I was wondering: when map or reduce gets called, what classes are used for the key and value? If I used

    org.apache.hadoop.io.Text value = new org.apache.hadoop.io.Text();

would the map be called with the Text class?

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
-
Re: Writing small files to one big file in hdfs
Arko Provo Mukherjee 2012-02-21, 20:18
Hi,

I think the following link will help:
http://hadoop.apache.org/common/docs/current/mapred_tutorial.html

Cheers
Arko
-
Re: Writing small files to one big file in hdfs
Mohit Anchlia 2012-02-22, 00:13
Need some more help. I wrote a sequence file using the code below, but now when I run the MapReduce job I get "java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.Text", even though I didn't use LongWritable when I originally wrote to the sequence file.

    // Code to write to the sequence file. There is no LongWritable here
    org.apache.hadoop.io.Text key = new org.apache.hadoop.io.Text();
    BufferedReader buffer = new BufferedReader(new FileReader(filePath));
    String line = null;
    org.apache.hadoop.io.Text value = new org.apache.hadoop.io.Text();
    try {
        writer = SequenceFile.createWriter(fs, conf, path, key.getClass(),
            value.getClass(), SequenceFile.CompressionType.RECORD);
        int i = 1;
        long timestamp = System.currentTimeMillis();
        while ((line = buffer.readLine()) != null) {
            key.set(String.valueOf(timestamp));
            value.set(line);
            writer.append(key, value);
            i++;
        }
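One plausible explanation, which the thread so far does not confirm, is that the job was left on the default line-oriented input format (TextInputFormat). That format hands the mapper a LongWritable byte offset as the key regardless of what was written into the file; the usual remedy would be `job.setInputFormatClass(SequenceFileInputFormat.class)` together with a mapper declared with Text keys. The plain-Java sketch below is only an analogy with no Hadoop dependency: `lineRecords`, `storedRecords` and `mapKeyAsText` are hypothetical stand-ins for the two input formats and for a mapper that expects text keys.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Analogy only: plain Java stand-ins, not Hadoop classes.
class InputFormatAnalogy {

    // Like a line-oriented input format: key = character offset of the line, value = the line.
    static List<Map.Entry<Object, Object>> lineRecords(String rawText) {
        List<Map.Entry<Object, Object>> records = new ArrayList<>();
        long offset = 0;
        for (String line : rawText.split("\n")) {
            records.add(new AbstractMap.SimpleEntry<Object, Object>(Long.valueOf(offset), line));
            offset += line.length() + 1;
        }
        return records;
    }

    // Like a record-oriented input format: keys and values come back with the
    // types they were stored with (here String and String).
    static List<Map.Entry<Object, Object>> storedRecords(Map<String, String> stored) {
        List<Map.Entry<Object, Object>> records = new ArrayList<>();
        for (Map.Entry<String, String> entry : stored.entrySet()) {
            records.add(new AbstractMap.SimpleEntry<Object, Object>(entry.getKey(), entry.getValue()));
        }
        return records;
    }

    // Like a mapper declared with text keys: the cast fails when the reader supplies offsets.
    static String mapKeyAsText(Map.Entry<Object, Object> record) {
        String key = (String) record.getKey(); // throws ClassCastException for Long keys
        return key + " -> " + record.getValue();
    }
}
```

The point of the analogy is that the key and value classes seen by the mapper are decided by the input format that reads the file, not by the classes used when the file was written.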
-
Re: Writing small files to one big file in hdfsMohit Anchlia 2012-02-22, 00:50
It looks like the values are arriving in the mapper as binary instead of Text. Is
this expected from a sequence file? I initially wrote the SequenceFile with Text values.

On Tue, Feb 21, 2012 at 4:13 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> Need some more help. I wrote sequence file using below code but now when I
> run mapreduce job I get "file.java.lang.ClassCastException:
> org.apache.hadoop.io.LongWritable cannot be cast to
> org.apache.hadoop.io.Text" even though I didn't use LongWritable when I
> originally wrote to the sequence
>
> //Code to write to the sequence file. There is no LongWritable here
>
> org.apache.hadoop.io.Text key = new org.apache.hadoop.io.Text();
> BufferedReader buffer = new BufferedReader(new FileReader(filePath));
> String line = null;
> org.apache.hadoop.io.Text value = new org.apache.hadoop.io.Text();
>
> try {
>     writer = SequenceFile.createWriter(fs, conf, path, key.getClass(),
>             value.getClass(), SequenceFile.CompressionType.RECORD);
>     int i = 1;
>     long timestamp = System.currentTimeMillis();
>     while ((line = buffer.readLine()) != null) {
>         key.set(String.valueOf(timestamp));
>         value.set(line);
>         writer.append(key, value);
>         i++;
>     }
-
Re: Writing small files to one big file in hdfsEdward Capriolo 2012-02-22, 02:42
On Tue, Feb 21, 2012 at 7:50 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> It looks like in mapper values are coming as binary instead of Text. Is
> this expected from sequence file? I initially wrote SequenceFile with Text
> values.

You might want to look at https://github.com/edwardcapriolo/filecrush
-
Re: Writing small files to one big file in hdfsMohit Anchlia 2012-02-22, 03:31
Finally figured it out. I needed to use SequenceFileAsTextInputFormat.
The lack of examples is what makes this difficult when you are starting out.

On Tue, Feb 21, 2012 at 4:50 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> It looks like in mapper values are coming as binary instead of Text. Is
> this expected from sequence file? I initially wrote SequenceFile with Text
> values.
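
A note on the writer code quoted in this thread: it appends one record per line of input and reuses the same timestamp as the key for every record, whereas the suggestion at the start of the thread was to store each XML file as a single value. Below is a minimal, Hadoop-free sketch of that one-record-per-file idea, with the file name as the key and the whole document as the value. The names SmallFilePacker, pack and unpack are hypothetical, and the length-prefixed layout is only an illustration; it is not the SequenceFile on-disk format and leaves out sync markers, compression and splitting.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of Hadoop): packs small documents as key/value records. */
final class SmallFilePacker {

    /** Writes one length-prefixed (name, content) record per document. */
    static byte[] pack(Map<String, String> documents) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (Map.Entry<String, String> doc : documents.entrySet()) {
                writeField(out, doc.getKey());    // key: the file name, unique per record
                writeField(out, doc.getValue());  // value: the whole file, not a single line
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the records back in the order they were written. */
    static Map<String, String> unpack(byte[] packed) {
        Map<String, String> documents = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(packed))) {
            while (in.available() > 0) {
                String name = readField(in);
                documents.put(name, readField(in));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return documents;
    }

    private static void writeField(DataOutputStream out, String text) throws IOException {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        out.writeInt(utf8.length);  // length prefix tells the reader where the field ends
        out.write(utf8);
    }

    private static String readField(DataInputStream in) throws IOException {
        byte[] utf8 = new byte[in.readInt()];
        in.readFully(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Map<String, String> documents = new LinkedHashMap<>();
        documents.put("a.xml", "<doc>\n  <id>1</id>\n</doc>");
        documents.put("b.xml", "<doc><id>2</id></doc>");
        byte[] packed = pack(documents);
        // Multi-line documents survive intact because each file is one value.
        System.out.println(unpack(packed).equals(documents));
    }
}
```

With a real SequenceFile the same shape would presumably apply: set the key to something unique per document (such as the file name) and append the entire file content as one value, so that a mapper receives a complete XML document per record.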