Hadoop >> mail # user >> [Input split] File manipulation


Re: [Input split] File manipulation
The default split size is 64 MB, which is the default block size, and you
can change it through configuration.
What file type is your input file? If it's gz, it cannot be split, and
you will always get only one mapper task for that file.
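
To make the arithmetic concrete, here is a rough sketch in plain Java. It is not Hadoop code and uses no Hadoop classes; the class and method names are made up for illustration, and minSize and maxSize stand in for whatever minimum and maximum split sizes are configured. It assumes the split size is the block size clamped between those two values, and it ignores the small tolerance the real splitter allows on the last split.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of split sizing, for illustration only (hypothetical names).
class SplitSizeSketch {

    // Clamp the block size between the configured minimum and maximum split size.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Lengths of the splits produced for a single file.
    // A non-splittable file (e.g. gzip) always becomes one split.
    static List<Long> splitLengths(long fileLength, long splitSize, boolean splittable) {
        List<Long> lengths = new ArrayList<>();
        if (fileLength <= 0) {
            return lengths; // empty file: no splits in this model
        }
        if (!splittable) {
            lengths.add(fileLength);
            return lengths;
        }
        long remaining = fileLength;
        // Every split except possibly the last has exactly splitSize bytes.
        while (remaining > splitSize) {
            lengths.add(splitSize);
            remaining -= splitSize;
        }
        lengths.add(remaining); // last split takes whatever is left
        return lengths;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long splitSize = computeSplitSize(64 * mb, 1, Long.MAX_VALUE);
        System.out.println(splitLengths(200 * mb, splitSize, true));
        System.out.println(splitLengths(200 * mb, splitSize, false));
    }
}
```

With a 200 MB text file and a 64 MB split size, this model gives three 64 MB splits plus one 8 MB split, so four map tasks; the same file as gz gives a single 200 MB split and one map task.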
On Wed, Aug 18, 2010 at 12:03 AM, Erik Test <[EMAIL PROTECTED]> wrote:
> I'm expecting to come across millions of data points.
>
> Thanks for the response by the way. I thought that Hadoop set the number of
> splits, regardless of file size, to just 1 by default.
> Erik
>
>
> On 17 August 2010 11:44, Jeff Zhang <[EMAIL PROTECTED]> wrote:
>
>> What size is your input? If the input is large enough, you do
>> not need to worry about the splitting: only one split (the last split)
>> has a different size; all the other splits have the same size.
>>
>>
>>
>> On Tue, Aug 17, 2010 at 7:50 AM, Erik Test <[EMAIL PROTECTED]> wrote:
>> > Hello,
>> >
>> > I'm trying to determine how to split a file evenly so each map task has a
>> > similar workload. The input I will have is a list of coordinates like this:
>> >
>> > 2,8
>> > 3,9
>> > 4,10
>> > 5,7
>> > 6,2
>> > 7,3
>> > 8,1
>> > 9,0
>> > 10,4
>> >
>> > Since there are 9 inputs in this example, I would like to split the records
>> > so that there would be 3 map tasks.
>> >
>> > I've been looking into different text input format classes but I'm still not
>> > sure how to split the input file the way I would like to.
>> >
>> > Does anyone have advice or suggestions on how I can go about manipulating the
>> > input splits by specifying the number of lines in an input split?
>> >
>> > Erik
>> >
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>

--
Best Regards

Jeff Zhang