Re: Loading file to HDFS with custom chunk structure
Kaliyug Antagonist 2013-01-22, 15:12
I'm already using Cloudera SeismicHadoop and do not wish to take that route.

Suppose there is software installed on every node that expects a SegY
file for processing. Now, suppose I wish to call this software via the Hadoop
Streaming API. Since the software must get a reasonably large file for
processing, I'll have to do something to pull bytes from HDFS, say from a
SequenceFile. These bytes must form the fixed textual header + the fixed
binary header + n x (trace header + trace data) - how do I achieve this?
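
For concreteness, this is roughly the read-side glue I have in mind: rebuild a
small but valid SegY file on local disk from (a) the 3600-byte textual+binary
header saved as a side file in HDFS and (b) a SequenceFile of traces keyed by
trace number, then point the node-local software at that file. All paths, class
names and key/value types below are placeholders I made up for the sketch, not
working code:

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;

// Placeholder class and paths: rebuilds header + n traces into one local SegY file.
public class RebuildLocalSegy {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path headerPath = new Path("/segy/headers.bin"); // 3200 B textual + 400 B binary header
    Path tracesPath = new Path("/segy/traces.seq");  // SequenceFile<LongWritable, BytesWritable>

    OutputStream out = new BufferedOutputStream(new FileOutputStream("/tmp/chunk.segy"));
    try {
      // 1. Copy the fixed 3600-byte header verbatim.
      FSDataInputStream headerIn = fs.open(headerPath);
      try {
        IOUtils.copyBytes(headerIn, out, conf, false);
      } finally {
        headerIn.close();
      }

      // 2. Append the traces; each value already holds trace header + trace data.
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, tracesPath, conf);
      LongWritable traceNo = new LongWritable();
      BytesWritable trace = new BytesWritable();
      try {
        while (reader.next(traceNo, trace)) {
          out.write(trace.getBytes(), 0, trace.getLength());
        }
      } finally {
        reader.close();
      }
    } finally {
      out.close();
    }
  }
}

The idea is that the Streaming task would run something like this first and then
invoke the SegY software on /tmp/chunk.segy - is that a sane way to feed it?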
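For the write side, i.e. producing that SequenceFile from the original SegY file
in the first place (along the lines of what Tariq suggests below), I picture
something like the following. The trace length is assumed constant here purely
to keep the sketch short; a real loader would read the sample count and format
from the binary header and handle variable-length traces. Again, all names and
paths are made up:

import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;

// Placeholder class and paths: splits a local SegY file into a header side file
// plus one SequenceFile record per trace.
public class SegyToSequenceFile {
  private static final int TEXT_HEADER = 3200;
  private static final int BIN_HEADER = 400;
  private static final int TRACE_HEADER = 240;

  public static void main(String[] args) throws Exception {
    String localSegy = args[0];
    int samplesPerTrace = Integer.parseInt(args[1]);    // simplification: constant per file
    int traceLen = TRACE_HEADER + samplesPerTrace * 4;  // assumes 4-byte samples

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    DataInputStream in = new DataInputStream(
        new BufferedInputStream(new FileInputStream(localSegy)));

    // 1. Save the textual + binary header (3600 bytes) as a small side file in HDFS.
    byte[] header = new byte[TEXT_HEADER + BIN_HEADER];
    in.readFully(header);
    FSDataOutputStream headerOut = fs.create(new Path("/segy/headers.bin"));
    headerOut.write(header);
    headerOut.close();

    // 2. One record per trace: key = trace number, value = trace header + trace data.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("/segy/traces.seq"), LongWritable.class, BytesWritable.class);
    byte[] trace = new byte[traceLen];
    long traceNo = 0;
    try {
      while (true) {
        try {
          in.readFully(trace);
        } catch (EOFException eof) {
          break;                                        // no more complete traces
        }
        writer.append(new LongWritable(traceNo++), new BytesWritable(trace));
      }
    } finally {
      writer.close();
      in.close();
    }
  }
}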
On Wed, Jan 16, 2013 at 9:26 PM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:

> You might also find this link <https://github.com/cloudera/seismichadoop> useful.
>
> Warm Regards,
> Tariq
> https://mtariq.jux.com/
> cloudfront.blogspot.com
>
>
> On Wed, Jan 16, 2013 at 9:19 PM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:
>
>> Since SEGY files are flat binary files, you might have a tough
>> time dealing with them, as there is no native InputFormat for
>> them. You can strip off the EBCDIC+binary header (the initial 3600
>> bytes) and store the SEGY file as a SequenceFile, where each
>> trace (trace header + trace data) would be the value and the
>> trace no. could be the key.
>>
>> Otherwise, you will have to write a custom InputFormat to deal with
>> that. The SequenceFile approach would also enhance performance, since
>> SequenceFiles are already in key-value form.
>>
>> Warm Regards,
>> Tariq
>> https://mtariq.jux.com/
>> cloudfront.blogspot.com
>>
>>
>> On Wed, Jan 16, 2013 at 9:13 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
>>
>>> Look at the block size concept in Hadoop and see if that is what you
>>> are looking for.
>>>
>>> Sent from my iPhone
>>>
>>> On Jan 16, 2013, at 7:31 AM, Kaliyug Antagonist <[EMAIL PROTECTED]> wrote:
>>>
>>> I want to load a SegY <http://en.wikipedia.org/wiki/SEG_Y> file onto
>>> the HDFS of a 3-node Apache Hadoop cluster.
>>>
>>> To summarize, the SegY file consists of:
>>>
>>>    1. a 3200-byte *textual header*
>>>    2. a 400-byte *binary header*
>>>    3. variable bytes of *data*
>>>
>>> 99.99% of the file's size is due to the variable-length data, which is a
>>> collection of thousands of contiguous traces. For any SegY file to make
>>> sense, it must have the textual header + binary header + at least one trace of
>>> data. What I want to achieve is to split a large SegY file across the Hadoop
>>> cluster so that a smaller SegY file is available on each node for local
>>> processing.
>>>
>>> The scenario is as follows:
>>>
>>>    1. The SegY file is large (above 10 GB) and resides on the
>>>    local file system of the NameNode machine
>>>    2. The file is to be split across the nodes in such a way that each node
>>>    has a small SegY file with a strict structure: 3200-byte *textual
>>>    header* + 400-byte *binary header* + variable bytes of *data*
>>>
>>> Obviously, I can't blindly use FSDataOutputStream or hadoop fs -copyFromLocal,
>>> as these may not preserve the format in which the chunks of the larger file
>>> are required.
>>>
>>> Please guide me as to how I must proceed.
>>>
>>> Thanks and regards!
>>>
>>>
>>
>