Hadoop, mail # user - Splitting files on new line using hadoop fs


Re: Splitting files on new line using hadoop fs
Mohit Anchlia 2012-02-22, 20:29
On Wed, Feb 22, 2012 at 12:23 PM, <[EMAIL PROTECTED]> wrote:

> Hi Mohit
>        AFAIK there is no default mechanism for this in Hadoop. During an
> HDFS copy, a file is split into blocks based only on the configured block
> size. When the file is processed with MapReduce, the record reader takes
> care of newlines, even if a line spans multiple blocks.
>
> Could you explain the use case that requires this during the HDFS copy
> itself?
>
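
The record-reader behaviour described in the quoted reply can be illustrated
with a small simulation. This is only a simplified sketch in Python, not
Hadoop's actual code: the names read_split and read_all_splits are
hypothetical, and the file is modelled as an in-memory byte string cut at
fixed block_size offsets. Each split skips the partial line it starts in
(unless it starts at offset 0) and reads past its own end to finish its last
line, so every line is returned exactly once.

```python
def read_split(data: bytes, start: int, length: int) -> list[bytes]:
    """Return the lines owned by the split [start, start + length).

    Simplified model: a split that does not begin at offset 0 discards
    everything up to and including the first newline, because that line
    is read by the previous split.
    """
    end = start + length
    pos = start
    if start != 0:
        newline = data.find(b"\n", start)
        if newline == -1:
            return []  # the rest of the file belongs to an earlier split
        pos = newline + 1

    lines = []
    # Keep reading while the line *starts* at or before the split end,
    # even if the line itself runs into the next block.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        stop = len(data) if newline == -1 else newline
        lines.append(data[pos:stop])
        pos = stop + 1
    return lines


def read_all_splits(data: bytes, block_size: int) -> list[list[bytes]]:
    """Read every fixed-size split of data and return the lines per split."""
    return [
        read_split(data, start, min(block_size, len(data) - start))
        for start in range(0, len(data), block_size)
    ]


if __name__ == "__main__":
    sample = b"alpha\nbravo\ncharlie\ndelta\n"
    for index, lines in enumerate(read_all_splits(sample, block_size=8)):
        print(index, lines)
```

Note that this models a line-oriented reader only. A record that spans
several lines, such as an XML document, is handled correctly only if its
loader treats split boundaries in a comparable way.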

I am using Pig's XMLLoader from piggybank to read XML documents concatenated
in a text file. But the Pig script doesn't work when the file is big enough
that Hadoop splits it.

Any suggestions on how I can make it work? Below is the simple script that I
would like to enhance once it starts working. Please note that it works
for small files.
register '/root/pig-0.8.1-cdh3u3/contrib/piggybank/java/piggybank.jar'

raw = LOAD '/examples/testfile5.txt' using
org.apache.pig.piggybank.storage.XMLLoader('<abc>') as (document:chararray);

dump raw;

>
> ------Original Message------
> From: Mohit Anchlia
> To: [EMAIL PROTECTED]
> ReplyTo: [EMAIL PROTECTED]
> Subject: Splitting files on new line using hadoop fs
> Sent: Feb 23, 2012 01:45
>
> How can I copy large text files using "hadoop fs" such that split occurs
> based on blocks + new lines instead of blocks alone? Is there a way to do
> this?
>
>
>
> Regards
> Bejoy K S
>
> From handheld, Please excuse typos.
>
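
Since the thread says there is no built-in way to make "hadoop fs" split on
newlines, one possible workaround for the original question is to cut the
file into newline-aligned pieces locally before copying. The following is a
minimal Python sketch under stated assumptions, not a Hadoop feature: the
function name newline_aligned_chunks is hypothetical, the data is held in
memory for brevity, and max_size is assumed to be no larger than the HDFS
block size so that each piece fits in a single block.

```python
def newline_aligned_chunks(data: bytes, max_size: int) -> list[bytes]:
    """Cut data into chunks of at most max_size bytes that end on a line
    boundary. A single line longer than max_size becomes its own
    oversized chunk rather than being cut in the middle."""
    chunks = []
    current = bytearray()
    for line in data.splitlines(keepends=True):
        # Start a new chunk if adding this line would exceed the limit.
        if current and len(current) + len(line) > max_size:
            chunks.append(bytes(current))
            current = bytearray()
        current += line
    if current:
        chunks.append(bytes(current))
    return chunks


if __name__ == "__main__":
    sample = b"aa\nbbb\nc\ndddd\n"
    for chunk in newline_aligned_chunks(sample, max_size=6):
        print(chunk)
```

Each chunk could then be written to its own local file and uploaded
separately with "hadoop fs -put". This only aligns pieces on line
boundaries; it does not keep a multi-line record such as an XML document
together.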