Re: Loading file to HDFS with custom chunk structure
Mohammad Tariq 2013-01-16, 15:49
Since SEGY files are flat binary files, you might have a tough time dealing with them, as there is no native InputFormat for them. You can strip off the EBCDIC + binary header (the initial 3600 bytes) and store the SEGY file as Sequence Files, where each trace (trace header + trace data) would be the value and the trace number could be the key. Otherwise you have to write a custom InputFormat to deal with that. The Sequence File approach would also improve performance, since Sequence Files are already in key-value form.

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com

On Wed, Jan 16, 2013 at 9:13 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
> Look at the block size concept in Hadoop and see if that is what you are
> looking for
>
> Sent from my iPhone
>
> On Jan 16, 2013, at 7:31 AM, Kaliyug Antagonist <[EMAIL PROTECTED]> wrote:
>
> I want to load a SegY <http://en.wikipedia.org/wiki/SEG_Y> file onto HDFS
> of a 3-node Apache Hadoop cluster.
>
> To summarize, the SegY file consists of:
>
> 1. 3200 bytes *textual header*
> 2. 400 bytes *binary header*
> 3. Variable bytes *data*
>
> The 99.99% size of the file is due to the variable bytes data which is a
> collection of thousands of contiguous traces. For any SegY file to make
> sense, it must have the textual header + binary header + at least one trace of
> data. What I want to achieve is to split a large SegY file onto the Hadoop
> cluster so that a smaller SegY file is available on each node for local
> processing.
>
> The scenario is as follows:
>
> 1. The SegY file is large in size (above 10 GB) and is resting on the
> local file system of the NameNode machine
> 2. The file is to be split on the nodes in such a way each node has a
> small SegY file with a strict structure - 3200 bytes *textual header* + 400 bytes
> *binary header* + variable bytes *data*
>
> As obvious, I can't blindly use
> FSDataOutputStream or hadoop fs -copyFromLocal as this may not ensure the
> format in which the chunks of the larger file are required
>
> Please guide me as to how I must proceed.
>
> Thanks and regards !
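
As a rough illustration of the suggestion in the reply (strip the 3600-byte header, then treat each trace as a value keyed by its trace number), here is a minimal Python sketch. It is not from the thread and rests on several assumptions: the file is big-endian, there are no extended textual headers, and every trace has the sample count stored in the binary header (fixed-length traces). The name `iter_traces` and the constants are hypothetical, chosen only for this example.

```python
import struct

TEXT_HEADER_LEN = 3200    # EBCDIC textual header
BINARY_HEADER_LEN = 400   # binary file header
FILE_HEADER_LEN = TEXT_HEADER_LEN + BINARY_HEADER_LEN  # the 3600 bytes to strip
TRACE_HEADER_LEN = 240    # fixed-size header in front of every trace

# Bytes per sample for the common SEG-Y data sample format codes.
BYTES_PER_SAMPLE = {1: 4, 2: 4, 3: 2, 5: 4, 8: 1}


def iter_traces(stream):
    """Yield (trace_no, trace_bytes) pairs from a binary SEG-Y stream.

    Each value is one trace header plus its trace data. Assumes that all
    traces share the sample count given in the binary header.
    """
    file_header = stream.read(FILE_HEADER_LEN)
    if len(file_header) < FILE_HEADER_LEN:
        raise ValueError("file is shorter than the 3600-byte SEG-Y header")

    # Binary header fields (1-based file positions): samples per trace at
    # bytes 3221-3222, data sample format code at bytes 3225-3226.
    (samples_per_trace,) = struct.unpack(">H", file_header[3220:3222])
    (format_code,) = struct.unpack(">H", file_header[3224:3226])
    if format_code not in BYTES_PER_SAMPLE:
        raise ValueError("unsupported sample format code: %d" % format_code)

    trace_len = TRACE_HEADER_LEN + samples_per_trace * BYTES_PER_SAMPLE[format_code]

    trace_no = 0
    while True:
        trace = stream.read(trace_len)
        if not trace:
            break  # clean end of file
        if len(trace) < trace_len:
            raise ValueError("truncated trace at end of file")
        trace_no += 1
        yield trace_no, trace
```

Each yielded pair could then be appended to a Sequence File (for instance with Hadoop's Java `SequenceFile.Writer`, using a `LongWritable` key and a `BytesWritable` value). Under the same fixed-length assumption, prepending the saved 3600-byte header to any group of traces would give a smaller file with the textual header + binary header + data layout the original question asks for.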