Mohit Anchlia 2013-01-16, 15:43
Re: Loading file to HDFS with custom chunk structure
Since SEGY files are flat binary files, you might have a tough
time dealing with them, as there is no native InputFormat for
them. You can strip off the EBCDIC+binary header (the initial
3600 bytes) and store the SEGY file as a SequenceFile, where
each trace (trace header + trace data) would be the value and
the trace number would be the key.

Otherwise you would have to write a custom InputFormat to deal
with it. The SequenceFile approach would improve performance as
well, since SequenceFiles are already in key-value form.
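
For illustration, a minimal sketch of that conversion, assuming every
trace has the same length (a constant samples-per-trace and bytes-per-sample,
passed in by hand here); the class name and argument layout are made up for
the example, not taken from the thread:

import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;

// Sketch: strip the 3600-byte SEGY file header and write each trace
// (240-byte trace header + samples) into a SequenceFile on HDFS,
// keyed by trace number. Assumes all traces have the same length.
public class SegyToSequenceFile {

    private static final int TEXT_HEADER_LEN  = 3200; // EBCDIC textual header
    private static final int BIN_HEADER_LEN   = 400;  // binary header
    private static final int TRACE_HEADER_LEN = 240;  // per-trace header

    public static void main(String[] args) throws IOException {
        String localSegy    = args[0];                   // local SEGY file
        String hdfsOut      = args[1];                   // target SequenceFile on HDFS
        int samplesPerTrace = Integer.parseInt(args[2]); // value from the binary header
        int bytesPerSample  = Integer.parseInt(args[3]); // e.g. 4

        int traceLen = TRACE_HEADER_LEN + samplesPerTrace * bytesPerSample;

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        DataInputStream in = new DataInputStream(new FileInputStream(localSegy));
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path(hdfsOut), LongWritable.class, BytesWritable.class);
        try {
            // Skip the textual + binary file header (first 3600 bytes).
            in.readFully(new byte[TEXT_HEADER_LEN + BIN_HEADER_LEN]);

            byte[] trace = new byte[traceLen];
            long traceNo = 0;
            while (true) {
                try {
                    in.readFully(trace);   // one trace header + trace data
                } catch (EOFException eof) {
                    break;                 // no more complete traces
                }
                writer.append(new LongWritable(traceNo++), new BytesWritable(trace));
            }
        } finally {
            IOUtils.closeStream(writer);
            IOUtils.closeStream(in);
        }
    }
}

A MapReduce job can then read the result with SequenceFileInputFormat, with
each trace arriving as one key-value record.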

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com
On Wed, Jan 16, 2013 at 9:13 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:

> Look at the block size concept in Hadoop and see if that is what you are
> looking for.
>
> Sent from my iPhone
>
> On Jan 16, 2013, at 7:31 AM, Kaliyug Antagonist <
> [EMAIL PROTECTED]> wrote:
>
> I want to load a SegY <http://en.wikipedia.org/wiki/SEG_Y> file onto the
> HDFS of a 3-node Apache Hadoop cluster.
>
> To summarize, the SegY file consists of:
>
>    1. a 3200-byte *textual header*
>    2. a 400-byte *binary header*
>    3. variable-length *data* (the traces)
>
> 99.99% of the file's size is due to the variable-length data, which is a
> collection of thousands of contiguous traces. For any SegY file to make
> sense, it must have the textual header + binary header + at least one
> trace of data. What I want to achieve is to split a large SegY file across
> the Hadoop cluster so that a smaller SegY file is available on each node
> for local processing.
>
> The scenario is as follows:
>
>    1. The SegY file is large (above 10 GB) and is resting on the local
>    file system of the NameNode machine.
>    2. The file is to be split across the nodes in such a way that each
>    node has a small SegY file with a strict structure - 3200-byte *textual
>    header* + 400-byte *binary header* + variable-length *data*. Obviously,
>    I can't blindly use FSDataOutputStream or hadoop fs -copyFromLocal, as
>    this may not preserve the format in which the chunks of the larger file
>    are required.
>
> Please guide me as to how I must proceed.
>
> Thanks and regards!
>
>
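
As a footnote on the block-size suggestion quoted above: the HDFS block size
only controls where the file is cut physically, not where the logical record
(trace) boundaries fall, so the individual blocks are not self-contained SegY
files. A hedged sketch of what setting a per-file block size while copying
through FSDataOutputStream looks like (the class name and the 256 MB value
are arbitrary examples, not from the thread):

import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Sketch: copy a local file into HDFS with an explicit per-file block size.
// HDFS still cuts the blocks at arbitrary byte offsets, so each block is
// not a standalone SEGY file on its own.
public class CopyWithBlockSize {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        long blockSize    = 256L * 1024 * 1024; // 256 MB, example value
        short replication = 3;
        int bufferSize    = 4096;

        InputStream in = new FileInputStream(args[0]);        // local SEGY file
        FSDataOutputStream out = fs.create(new Path(args[1]), // HDFS target
                true, bufferSize, replication, blockSize);
        try {
            IOUtils.copyBytes(in, out, bufferSize);
        } finally {
            IOUtils.closeStream(out);
            IOUtils.closeStream(in);
        }
    }
}

Because a reader still needs the 3600-byte header plus whole traces to make
sense of any chunk, the SequenceFile or custom InputFormat route described
earlier in the thread is the usual answer.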
Mohammad Tariq 2013-01-16, 15:56
Mohammad Tariq 2013-01-16, 15:38