Hadoop >> mail # user >> Splitting files on new line using hadoop fs


Re: Splitting files on new line using hadoop fs
On Wed, Feb 22, 2012 at 12:23 PM, <[EMAIL PROTECTED]> wrote:

> Hi Mohit
>        AFAIK Hadoop has no built-in mechanism for this. During an HDFS
> copy, a file is split into blocks based solely on the configured block
> size. When the file is processed with MapReduce, the record reader takes
> care of newlines, even when a line spans multiple blocks.
>
> Could you explain more about the use case that requires this during the
> HDFS copy itself?
>
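
The record-reader behaviour described in the quoted reply can be illustrated with a small simulation. This is only a sketch in Python, not Hadoop's actual code: `read_split` is a hypothetical helper, and the convention it uses (a split skips its first partial line and reads past its own end to finish its last line) is an assumption modelled on how line-oriented readers are commonly described.

```python
def read_split(data: bytes, start: int, length: int) -> list[bytes]:
    """Return the lines owned by the byte range [start, start + length).

    Hypothetical helper for illustration only; not a Hadoop API.
    """
    end = start + length
    pos = start
    if start != 0:
        # Skip up to and including the first newline: that line (or its
        # tail) is read by the reader of the previous split.
        nl = data.find(b"\n", start)
        if nl == -1:
            return []
        pos = nl + 1
    lines = []
    # A line is owned by this split if it begins at or before `end`, so the
    # last line may be read past the split boundary.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        stop = len(data) if nl == -1 else nl
        lines.append(data[pos:stop])
        pos = stop + 1
    return lines


data = b"aaa\nbbbb\ncc\n"
# Cut the data into 5-byte "blocks" that ignore line boundaries.
for start in range(0, len(data), 5):
    print(start, read_split(data, start, min(5, len(data) - start)))
```

Even though the 5-byte boundaries fall in the middle of lines, every line is returned whole by exactly one split, which is why splitting on newlines at copy time is not needed for plain line-oriented input.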

I am using Pig's XMLLoader from piggybank to read XML files that are
concatenated into a single text file. But the Pig script doesn't work when
the file is big enough for Hadoop to split it.

Any suggestions on how I can make it work? Below is my simple script, which
I would like to enhance once it starts working. Please note that it works
for small files.

register '/root/pig-0.8.1-cdh3u3/contrib/piggybank/java/piggybank.jar';

raw = LOAD '/examples/testfile5.txt' using
org.apache.pig.piggybank.storage.XMLLoader('<abc>') as (document:chararray);

dump raw;
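
For XML records, the same idea would have to apply to tags rather than newlines: an element should be read by the split in which its start tag begins, and the reader should continue past the end of the split until the closing tag. The Python sketch below shows that rule under stated assumptions. `read_elements` is a hypothetical helper, elements are assumed to be non-nested and without attributes, and this is not a description of how piggybank's XMLLoader is implemented.

```python
def read_elements(data: bytes, start: int, length: int, tag: bytes) -> list[bytes]:
    """Return the <tag>...</tag> elements owned by [start, start + length).

    Hypothetical helper for illustration only; assumes non-nested elements
    with no attributes on the start tag.
    """
    open_tag = b"<" + tag + b">"
    close_tag = b"</" + tag + b">"
    end = start + length
    records = []
    pos = start
    while True:
        begin = data.find(open_tag, pos)
        # An element belongs to the split in which its start tag begins.
        if begin == -1 or begin >= end:
            break
        close = data.find(close_tag, begin)
        if close == -1:
            break  # unterminated element: nothing complete to return
        stop = close + len(close_tag)
        # The slice may extend beyond `end`; reading past the boundary is
        # what keeps an element that straddles two splits in one piece.
        records.append(data[begin:stop])
        pos = stop
    return records


data = b"<abc>1</abc>\n<abc>22</abc>\n<abc>333</abc>\n"
# Use 10-byte splits so that elements cross split boundaries.
for start in range(0, len(data), 10):
    print(start, read_elements(data, start, min(10, len(data) - start), b"abc"))
```

A loader that instead stops reading at the end of its split would return truncated elements whenever a document crosses a block boundary, which would match the symptom of a script that works only on small files.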

>
> ------Original Message------
> From: Mohit Anchlia
> To: [EMAIL PROTECTED]
> ReplyTo: [EMAIL PROTECTED]
> Subject: Splitting files on new line using hadoop fs
> Sent: Feb 23, 2012 01:45
>
> How can I copy large text files using "hadoop fs" such that split occurs
> based on blocks + new lines instead of blocks alone? Is there a way to do
> this?
>
>
>
> Regards
> Bejoy K S
>
> From handheld, Please excuse typos.
>