Hadoop, mail # user - Manually splitting files in blocks


Re: Manually splitting files in blocks
Nick Dimiduk 2010-03-26, 18:06
Inline

On Fri, Mar 26, 2010 at 7:49 AM, Yuri K. <[EMAIL PROTECTED]> wrote:

>
> OK, so far so good, and thanks for the reply. I'm trying to implement a
> custom file input format, but I can only set it in the job configuration:
> job.setInputFormatClass(CustomFileInputFormat.class);
>
>
This is exactly right. The custom input code ends up bundled in your job jar
and is available to the job at runtime just like any other dependency
library. Alternatively, you could package your new input format into its own
jar and "install" it onto the cluster by pushing it out to
$HADOOP_HOME/lib on every machine. Unless you're building common
infrastructure for a disparate set of users, I'd recommend the former
approach.
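
For example, a minimal driver for the first approach might look like the
untested sketch below (written against the org.apache.hadoop.mapreduce API;
MyMapper and the output types are placeholders for whatever your job
actually uses):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CustomFormatDriver {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "custom-input-format-example");
    // setJarByClass tells Hadoop which jar to ship to the cluster; if
    // CustomFileInputFormat is packaged in the same jar, it travels along.
    job.setJarByClass(CustomFormatDriver.class);
    job.setInputFormatClass(CustomFileInputFormat.class);
    job.setMapperClass(MyMapper.class);         // hypothetical mapper class
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}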

> How do I make Hadoop apply the file format, or the custom file split,
> when I upload new files to HDFS? Do I need a custom upload interface for
> that, or is there a Hadoop config option?
>

My understanding (please correct me, list) is that Hadoop will always split
your files based on the block size setting. The InputSplit and RecordReaders
are used by jobs to retrieve chunks of files for processing - that is, there
are two separate splits happening here: one "physical" split for storage and
one "logical" split for processing.

Cheers,
-Nick
> ANKITBHATNAGAR wrote:
> >
> > Yuri K. wrote:
> >>
> >> Dear Hadoopers,
> >>
> >> I'm trying to find out how and where Hadoop splits a file into blocks
> >> and decides to send them to the datanodes.
> >>
> >> My specific problem: I have two types of data files.
> >> One large file is used as a database file, where information is
> >> organized like this:
> >> [BEGIN DATAROW]
> >> ... lots of data 1
> >> [END DATAROW]
> >>
> >> [BEGIN DATAROW]
> >> ... lots of data 2
> >> [END DATAROW]
> >> and so on.
> >>
> >> The other, smaller files contain raw data that is to be compared
> >> against a datarow in the large file.
> >>
> >> So my question is: is it possible to manually control how Hadoop splits
> >> the large data file into blocks? Obviously I want each begin-end section
> >> to sit in a single block to optimize performance; then I can replicate
> >> the smaller files on every node, and the nodes can work independently
> >> of one another.
> >>
> >> thanks, yk
> >>
> >
> >
> > You should create a CustomInputSplit and a CustomRecordReader (one that
> > recognizes the start and end tags).
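
P.S. To make Ankit's suggestion above concrete, here is a rough, untested
sketch of a record reader that groups the lines between the begin/end
markers into one record. It delegates to LineRecordReader and assumes an
unsplittable input format like the one earlier in this thread (if files can
be split mid-record, a marker pair can straddle split boundaries and this
breaks):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class DataRowRecordReader extends RecordReader<LongWritable, Text> {
  private final LineRecordReader lines = new LineRecordReader();
  private final LongWritable key = new LongWritable();
  private final Text value = new Text();

  @Override
  public void initialize(InputSplit split, TaskAttemptContext context)
      throws IOException, InterruptedException {
    lines.initialize(split, context);
  }

  @Override
  public boolean nextKeyValue() throws IOException, InterruptedException {
    StringBuilder row = new StringBuilder();
    boolean inRow = false;
    while (lines.nextKeyValue()) {
      String line = lines.getCurrentValue().toString();
      if (line.startsWith("[BEGIN DATAROW]")) {
        // Key the record by the byte offset of its BEGIN marker.
        key.set(lines.getCurrentKey().get());
        inRow = true;
      } else if (line.startsWith("[END DATAROW]") && inRow) {
        value.set(row.toString());
        return true;
      } else if (inRow) {
        row.append(line).append('\n');
      }
    }
    return false; // no further complete datarows in this split
  }

  @Override
  public LongWritable getCurrentKey() { return key; }

  @Override
  public Text getCurrentValue() { return value; }

  @Override
  public float getProgress() throws IOException, InterruptedException {
    return lines.getProgress();
  }

  @Override
  public void close() throws IOException {
    lines.close();
  }
}

You would return this from createRecordReader() in your
CustomFileInputFormat instead of the plain LineRecordReader.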