Re: Loading file to HDFS with custom chunk structure
Mohammad Tariq 2013-01-22, 18:03
First of all, the software will get just the block residing on that DN
and not the entire file.
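
To make that concrete, here is a rough sketch (not anything HDFS itself provides) that works out which traces end up cut in two by block boundaries. It assumes fixed-length traces and a 64 MB block size; the function name and parameters are made up for illustration.

```python
TEXT_HEADER = 3200    # bytes, textual header
BINARY_HEADER = 400   # bytes, binary header

def traces_cut_by_blocks(file_size, trace_len, block_size=64 * 1024 * 1024):
    """Return the 0-based indices of traces that straddle a block boundary.

    Assumes every trace (trace header + samples) is exactly trace_len bytes.
    """
    data_start = TEXT_HEADER + BINARY_HEADER
    cut = []
    boundary = block_size
    while boundary < file_size:
        offset = boundary - data_start  # position of the boundary inside the trace data
        # Boundaries that fall inside the 3600 header bytes are ignored here.
        if offset > 0 and offset % trace_len != 0:
            cut.append(offset // trace_len)
        boundary += block_size
    return cut
```

Any index returned here is a trace whose first part lives in one block and whose remainder lives in the next, and only the first block carries the 3600 header bytes.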
What is your primary intention? To process the SEGY data using MR,
or through the tool you are talking about? I tried something similar
through SU, but it didn't quite work for me, and because of time
constraints I could not continue with it. So I can't comment on that with
And, if you are OK with converting the SEGY files into SequenceFiles
and doing the processing on those, then you don't actually need any other
tool. You just have to think about how to implement the processing
algorithm you want as an MR job. In fact, a few processing procedures can
be implemented very easily, as libraries are already available for them.
For example, Apache provides libraries for FFT and inverse FFT and so
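
As a minimal sketch of the conversion idea (plain Python on an in-memory byte string, not Hadoop code): split the file into (trace no., trace bytes) records, and glue the saved 3600 header bytes back in front of any subset of traces to get a small, structurally valid SegY file. It assumes fixed-length traces, no extended textual headers, and the standard binary header positions for samples per trace (bytes 3221-3222) and sample format code (bytes 3225-3226). All function names are made up for illustration.

```python
import struct

HEADER_LEN = 3600        # 3200-byte textual header + 400-byte binary header
TRACE_HEADER_LEN = 240   # standard SEG-Y trace header size
# Bytes per sample for the common data sample format codes.
BYTES_PER_SAMPLE = {1: 4, 2: 4, 3: 2, 5: 4, 8: 1}

def trace_length(headers):
    """Length in bytes of one trace, read from the binary header (big-endian)."""
    (samples,) = struct.unpack(">H", headers[3220:3222])
    (fmt,) = struct.unpack(">H", headers[3224:3226])
    return TRACE_HEADER_LEN + samples * BYTES_PER_SAMPLE[fmt]

def split_traces(segy):
    """Return (headers, [(trace_no, trace_bytes), ...]) with 0-based trace numbers."""
    headers = segy[:HEADER_LEN]
    step = trace_length(headers)
    body = segy[HEADER_LEN:]
    if len(body) % step != 0:
        raise ValueError("trace data is not a whole number of fixed-length traces")
    records = [(i, body[off:off + step])
               for i, off in enumerate(range(0, len(body), step))]
    return headers, records

def assemble_segy(headers, traces):
    """Rebuild a SegY byte string: original headers followed by the given traces."""
    return headers + b"".join(traces)
```

The records from `split_traces` are what would go into the SequenceFile as key and value; `assemble_segy` is what a streaming wrapper would have to do before handing bytes to a tool that expects a complete SegY file. Header fields that describe the whole survey are copied unchanged here, which may or may not be acceptable to a given tool.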
On Tue, Jan 22, 2013 at 8:42 PM, Kaliyug Antagonist <
[EMAIL PROTECTED]> wrote:
> I'm already using Cloudera SeismicHadoop and do not wish to take its track.
> Suppose there is software installed on every node that expects a
> SegY file for processing. Now, suppose I wish to call this software via
> the Hadoop Streaming API. Since the software must get a reasonably
> large file for processing, I'll have to do something to pull bytes from
> HDFS, say from a SequenceFile. These bytes must have the fixed textual
> header + fixed binary header + n(trace header + trace data) - how do I
> achieve this?
> On Wed, Jan 16, 2013 at 9:26 PM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:
>> You might also find this link useful: <https://github.com/cloudera/seismichadoop>
>> Warm Regards,
>> On Wed, Jan 16, 2013 at 9:19 PM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:
>>> Since SEGY files are flat binary files, you might have a tough
>>> time dealing with them, as there is no native InputFormat for
>>> them. You can strip off the EBCDIC + binary header (the initial 3600
>>> bytes) and store the SEGY file as SequenceFiles, where each
>>> trace (trace header + trace data) would be the value and the
>>> trace no. could be the key.
>>> Otherwise you have to write a custom InputFormat to deal with
>>> that. The SequenceFile approach would help performance as well,
>>> since SequenceFiles are already in key-value form.
>>> Warm Regards,
>>> On Wed, Jan 16, 2013 at 9:13 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
>>>> Look at the block size concept in Hadoop and see if that is what you
>>>> are looking for
>>>> On Jan 16, 2013, at 7:31 AM, Kaliyug Antagonist <
>>>> [EMAIL PROTECTED]> wrote:
>>>> I want to load a SegY <http://en.wikipedia.org/wiki/SEG_Y> file onto
>>>> HDFS of a 3-node Apache Hadoop cluster.
>>>> To summarize, the SegY file consists of:
>>>> 1. 3200 bytes *textual header*
>>>> 2. 400 bytes *binary header*
>>>> 3. Variable bytes *data*
>>>> 99.99% of the file's size is due to the variable-length data, which is a
>>>> collection of thousands of contiguous traces. For any SegY file to make
>>>> sense, it must have the textual header+binary header+at least one trace of
>>>> data. What I want to achieve is to split a large SegY file onto the Hadoop
>>>> cluster so that a smaller SegY file is available on each node for local
>>>> The scenario is as follows:
>>>> 1. The SegY file is large (above 10 GB) and resides on the
>>>> local file system of the NameNode machine
>>>> 2. The file is to be split across the nodes in such a way that each node has
>>>> a small SegY file with a strict structure - 3200 bytes *textual
>>>> header* + 400 bytes *binary header* + variable bytes *data*
>>>> Obviously, I can't blindly use FSDataOutputStream or hadoop fs -copyFromLocal,
>>>> as this may not ensure the format in which the chunks of the larger file