Hadoop >> mail # user >> Efficiently Stream into Sequence Files?


Re: Efficiently Stream into Sequence Files?
Have you looked at TFile?

On Mar 12, 2010, at 5:22 AM, Scott Whitecross wrote:

> Hi -
>
> I'd like to create a job that pulls small files from a remote server
> (using FTP, SCP, etc.) and stores them directly as sequence files on
> HDFS.  Looking at the SequenceFile API, I don't see an obvious way to
> do this.  It looks like what I have to do is pull each remote file to
> local disk, then read it into memory to place it in the sequence
> file.  Is there a better way?
>
> Looking at the API, am I forced to use the append method?
>
>     FileSystem hdfs = FileSystem.get(context.getConfiguration());
>     FSDataOutputStream outputStream = hdfs.create(new Path(outputPath));
>     writer = SequenceFile.createWriter(context.getConfiguration(),
>             outputStream, Text.class, BytesWritable.class, null, null);
>
>     // read the remote file into remotefilebytes
>
>     writer.append(filekey, remotefilebytes);
>
>
> The alternative would be to have one job pull the remote files, and  
> a secondary job write them into sequence files.
>
> I'm using the latest Cloudera release, which I believe is Hadoop 0.20.1.
>
> Thanks.
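The local-disk hop the question describes can indeed be skipped: the remote stream can be buffered straight into memory and the resulting bytes handed to `writer.append()`. Below is a minimal, self-contained sketch of that buffering step; the `readFully` helper and the simulated remote stream are illustrative (not part of any Hadoop or FTP API), and the SequenceFile calls from the quoted snippet appear only as comments:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class StreamBuffer {
    // Buffer an InputStream (e.g. an FTP/SCP download) fully into memory,
    // skipping the intermediate write to local disk. The resulting byte[]
    // is what you would wrap in a BytesWritable and pass to
    // SequenceFile.Writer.append(key, value) in the actual job.
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the stream opened against the remote server.
        InputStream remote =
                new ByteArrayInputStream("small file contents".getBytes("UTF-8"));
        byte[] remotefilebytes = readFully(remote);
        System.out.println(remotefilebytes.length);
        // In the job, roughly:
        //   writer.append(new Text(fileName),
        //                 new BytesWritable(remotefilebytes));
    }
}
```

Since the whole value must be passed to `append()` at once, each file still has to fit in memory; this avoids the temp-file round trip but is only suitable for small files, as in the question.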