Re: Loading file to HDFS with custom chunk structure
Kaliyug Antagonist 2013-01-22, 15:12
I'm already using Cloudera SeismicHadoop and do not wish to follow its approach here.
Suppose software installed on every node expects a SegY file for processing. If I call this software via the Hadoop Streaming API, and the software must get a reasonably large file to process, I'll have to do something to pull bytes from HDFS, say from a SequenceFile. These bytes must consist of the fixed textual header + the fixed binary header + n × (trace header + trace data). How do I achieve this?

On Wed, Jan 16, 2013 at 9:26 PM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:

> You might also find this link <https://github.com/cloudera/seismichadoop> useful.
>
> Warm Regards,
> Tariq
> https://mtariq.jux.com/
> cloudfront.blogspot.com
>
> On Wed, Jan 16, 2013 at 9:19 PM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:
>
>> Since SEGY files are flat binary files, you might have a tough
>> time dealing with them, as there is no native InputFormat for
>> them. You can strip off the EBCDIC + binary header (the initial 3600
>> bytes) and store the SEGY file as Sequence Files, where each
>> trace (trace header + trace data) would be the value and the
>> trace no. could be the key.
>>
>> Otherwise you have to write a custom InputFormat to deal with
>> that. It would enhance the performance as well, since Sequence
>> Files are already in key-value form.
>>
>> Warm Regards,
>> Tariq
>> https://mtariq.jux.com/
>> cloudfront.blogspot.com
>>
>> On Wed, Jan 16, 2013 at 9:13 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
>>
>>> Look at the block size concept in Hadoop and see if that is what you
>>> are looking for
>>>
>>> Sent from my iPhone
>>>
>>> On Jan 16, 2013, at 7:31 AM, Kaliyug Antagonist <[EMAIL PROTECTED]> wrote:
>>>
>>> I want to load a SegY <http://en.wikipedia.org/wiki/SEG_Y> file onto
>>> HDFS of a 3-node Apache Hadoop cluster.
>>>
>>> To summarize, the SegY file consists of:
>>>
>>> 1. 3200 bytes *textual header*
>>> 2. 400 bytes *binary header*
>>> 3. Variable bytes *data*
>>>
>>> 99.99% of the file's size is due to the variable bytes of data, which is a
>>> collection of thousands of contiguous traces. For any SegY file to make
>>> sense, it must have the textual header + binary header + at least one trace of
>>> data. What I want to achieve is to split a large SegY file onto the Hadoop
>>> cluster so that a smaller SegY file is available on each node for local
>>> processing.
>>>
>>> The scenario is as follows:
>>>
>>> 1. The SegY file is large (above 10GB) and resides on the
>>> local file system of the NameNode machine
>>> 2. The file is to be split across the nodes in such a way that each node has
>>> a small SegY file with a strict structure - 3200 bytes *textual
>>> header* + 400 bytes *binary header* + variable bytes *data*
>>>
>>> Obviously, I can't blindly use FSDataOutputStream or hadoop fs -copyFromLocal,
>>> as this may not ensure the format in which the chunks of the larger file
>>> are required.
>>>
>>> Please guide me as to how I must proceed.
>>>
>>> Thanks and regards!
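
A minimal sketch of the layout discussed in this thread, written in plain Python for brevity rather than against the Hadoop Streaming or SequenceFile APIs: the 3600-byte file header (3200 textual + 400 binary) is copied verbatim in front of each group of traces, so every chunk has the header + n traces shape. It assumes that every trace (trace header + trace data) has the same length in bytes, which the thread does not state. The names `split_segy`, `trace_len`, and `traces_per_chunk` are hypothetical.

```python
TEXT_HEADER_LEN = 3200    # textual header size, as given in the thread
BINARY_HEADER_LEN = 400   # binary header size, as given in the thread
FILE_HEADER_LEN = TEXT_HEADER_LEN + BINARY_HEADER_LEN


def split_segy(stream, trace_len, traces_per_chunk):
    """Yield byte strings shaped like small SegY files.

    stream            binary file-like object positioned at the start of the file
    trace_len         bytes per trace (trace header + trace data); assumed fixed
    traces_per_chunk  maximum number of traces to place in each chunk
    """
    header = stream.read(FILE_HEADER_LEN)
    if len(header) != FILE_HEADER_LEN:
        raise ValueError("input is shorter than the 3600-byte file header")

    while True:
        body = stream.read(trace_len * traces_per_chunk)
        if not body:
            break  # no traces left
        if len(body) % trace_len != 0:
            raise ValueError("input ends with a partial trace")
        # Each chunk repeats the original headers unchanged, then whole traces.
        yield header + body
```

For example, `split_segy(open(path, "rb"), trace_len, 1000)` would yield chunks of up to 1000 traces each. How `trace_len` is derived from the binary header, and whether any header field needs updating per chunk, is left out of this sketch.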