Loading file to HDFS with custom chunk structure
I want to load a SegY <http://en.wikipedia.org/wiki/SEG_Y> file onto HDFS
of a 3-node Apache Hadoop cluster.

To summarize, a SegY file consists of:

   1. 3200 bytes *textual header*
   2. 400 bytes *binary header*
   3. Variable bytes *data*
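
To make this layout concrete, here is a minimal Python sketch that derives the size of one trace from the binary header. It assumes the standard SEG-Y layout: big-endian fields, the samples-per-trace count at bytes 3221-3222 of the file, the data sample format code at bytes 3225-3226, a 240-byte header in front of every trace, and the same length for every trace. The name `trace_length` is purely illustrative.

```python
import struct

TEXT_HEADER_BYTES = 3200
BINARY_HEADER_BYTES = 400
TRACE_HEADER_BYTES = 240  # assumed: standard SEG-Y trace header size

# Bytes per sample for the common SEG-Y data sample format codes.
BYTES_PER_SAMPLE = {1: 4, 2: 4, 3: 2, 5: 4, 8: 1}


def trace_length(binary_header: bytes) -> int:
    """Return the size in bytes of one trace (trace header + samples).

    Assumes every trace in the file has the same number of samples.
    """
    if len(binary_header) != BINARY_HEADER_BYTES:
        raise ValueError("binary header must be exactly 400 bytes")
    # Offsets are relative to the start of the 400-byte binary header.
    (samples,) = struct.unpack(">H", binary_header[20:22])
    (fmt,) = struct.unpack(">H", binary_header[24:26])
    if fmt not in BYTES_PER_SAMPLE:
        raise ValueError("unsupported sample format code: %d" % fmt)
    return TRACE_HEADER_BYTES + samples * BYTES_PER_SAMPLE[fmt]
```

With this size known, every valid split point in the data section lies at offset 3600 + k * trace_length for some whole number k.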

Almost all of the file's size (99.99%) comes from the variable-length data, which is
a collection of thousands of contiguous traces. For a SegY file to make
sense, it must contain the textual header, the binary header, and at least one trace of
data. What I want to achieve is to split a large SegY file across the Hadoop
cluster so that a smaller SegY file is available on each node for local

The scenario is as follows:

   1. The SegY file is large (above 10 GB) and resides on the
   local file system of the NameNode machine.
   2. The file is to be split across the nodes in such a way that each node has a
   small SegY file with a strict structure: 3200 bytes *textual header* +
   400 bytes *binary header* + variable bytes *data*.

Obviously, I can't blindly use FSDataOutputStream or hadoop fs -copyFromLocal, as this may not
preserve the format required for the chunks of the larger file.
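
The structure required for each chunk can be illustrated with a minimal Python sketch that pre-splits the file on the local file system, so that every piece repeats the two headers and holds a whole number of traces. It assumes fixed-length traces, big-endian header fields at the standard SEG-Y offsets, and a 240-byte trace header. The names `split_segy`, `traces_per_part`, and the `part-NNNNN.sgy` file naming are hypothetical. It says nothing about where HDFS would place each piece once uploaded.

```python
import os
import struct

HEADER_BYTES = 3200 + 400  # textual header + binary header
TRACE_HEADER_BYTES = 240   # assumed: standard SEG-Y trace header size
BYTES_PER_SAMPLE = {1: 4, 2: 4, 3: 2, 5: 4, 8: 1}


def split_segy(src_path, out_dir, traces_per_part):
    """Split a SegY file into smaller SegY files; return their paths.

    Each part is: original 3600 header bytes + up to traces_per_part traces.
    Assumes every trace has the same length (hypothetical helper).
    """
    if traces_per_part < 1:
        raise ValueError("traces_per_part must be at least 1")
    parts = []
    part = None
    count = 0
    with open(src_path, "rb") as src:
        headers = src.read(HEADER_BYTES)
        if len(headers) != HEADER_BYTES:
            raise ValueError("file is shorter than the SegY headers")
        # Absolute offsets 3220 and 3224 fall inside the binary header.
        (samples,) = struct.unpack(">H", headers[3220:3222])
        (fmt,) = struct.unpack(">H", headers[3224:3226])
        if fmt not in BYTES_PER_SAMPLE:
            raise ValueError("unsupported sample format code: %d" % fmt)
        trace_bytes = TRACE_HEADER_BYTES + samples * BYTES_PER_SAMPLE[fmt]
        try:
            while True:
                trace = src.read(trace_bytes)  # stream one trace at a time
                if not trace:
                    break
                if len(trace) != trace_bytes:
                    raise ValueError("file ends in the middle of a trace")
                if part is None or count == traces_per_part:
                    if part is not None:
                        part.close()
                    path = os.path.join(out_dir, "part-%05d.sgy" % len(parts))
                    part = open(path, "wb")
                    part.write(headers)  # every part starts with both headers
                    parts.append(path)
                    count = 0
                part.write(trace)
                count += 1
        finally:
            if part is not None:
                part.close()
    return parts
```

For example, `split_segy("big.sgy", "parts", 100000)` would write files of at most 100000 traces each into an existing `parts` directory (both names are placeholders); each piece is then a self-contained SegY file that could be copied to HDFS as an ordinary file.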

Please advise on how I should proceed.

Thanks and regards!