Re: Efficiently Stream into Sequence Files?
Have you looked at TFile?
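A minimal sketch of how that might look, going from memory of the 0.20-era org.apache.hadoop.io.file.tfile.TFile API (check the javadoc in your release for the exact signatures). TFile.Writer writes raw key/value byte pairs straight to an FSDataOutputStream, so nothing has to touch local disk:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.file.tfile.TFile;

    public class TFileSketch {
        public static void writeOne(Configuration conf, Path out,
                                    String key, byte[] value) throws IOException {
            FileSystem fs = FileSystem.get(conf);
            FSDataOutputStream fsdos = fs.create(out);
            try {
                // args: stream, min block size, compression ("none"/"gz"/"lzo"),
                // comparator ("" = unsorted; "memcmp" requires sorted appends), conf
                TFile.Writer writer = new TFile.Writer(fsdos, 64 * 1024,
                        TFile.COMPRESSION_NONE, "", conf);
                try {
                    writer.append(key.getBytes("UTF-8"), value);
                } finally {
                    writer.close();
                }
            } finally {
                fsdos.close();
            }
        }
    }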

On Mar 12, 2010, at 5:22 AM, Scott Whitecross wrote:

> Hi -
>
> I'd like to create a job that pulls small files from a remote server
> (using FTP, SCP, etc.) and stores them directly to sequence files on
> HDFS. Looking at the SequenceFile API, I don't see an obvious way to
> do this. It looks like what I have to do is pull the remote file to
> disk, then read the file into memory to place in the sequence file.
> Is there a better way?
>
> Looking at the API, am I forced to use the append method?
>
>     FileSystem hdfs = FileSystem.get(context.getConfiguration());
>     FSDataOutputStream outputStream = hdfs.create(new Path(outputPath));
>     writer = SequenceFile.createWriter(context.getConfiguration(),
>             outputStream, Text.class, BytesWritable.class, null, null);
>
>     // read in file to remotefilebytes
>
>     writer.append(filekey, remotefilebytes);
>
>
> The alternative would be to have one job pull the remote files, and
> a secondary job write them into sequence files.
>
> I'm using the latest Cloudera release, which I believe is Hadoop 0.20.1.
>
> Thanks.
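On the quoted code: you shouldn't need to stage the remote file on local disk at all. You can read the remote InputStream straight into an in-memory buffer and append that as a BytesWritable. A minimal sketch, assuming the files are small enough to buffer in memory; openRemote() is a hypothetical stand-in for whatever FTP/SCP client you use:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class RemoteToSequenceFile {
        public static void copyAll(Configuration conf, Path outputPath,
                                   Iterable<String> remoteNames) throws IOException {
            FileSystem fs = FileSystem.get(conf);
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, outputPath, Text.class, BytesWritable.class);
            try {
                byte[] chunk = new byte[64 * 1024];
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                for (String name : remoteNames) {
                    buffer.reset();
                    InputStream in = openRemote(name); // hypothetical FTP/SCP call
                    try {
                        int n;
                        while ((n = in.read(chunk)) != -1) {
                            buffer.write(chunk, 0, n); // buffer in RAM, not on local disk
                        }
                    } finally {
                        in.close();
                    }
                    writer.append(new Text(name), new BytesWritable(buffer.toByteArray()));
                }
            } finally {
                writer.close();
            }
        }

        // Hypothetical stand-in: open an InputStream to the named remote file.
        private static InputStream openRemote(String name) throws IOException {
            throw new UnsupportedOperationException("plug in your FTP/SCP client");
        }
    }

Note that BytesWritable still materializes each whole value in memory at append time, which is why this only suits small files; for larger files, your two-job alternative (one pulling, one packing) avoids holding everything in RAM.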