-
Splitting files on new line using hadoop fs
Mohit Anchlia 2012-02-22, 20:15
How can I copy large text files using "hadoop fs" so that the split occurs on block boundaries plus newlines, instead of on blocks alone? Is there a way to do this?
-
Re: Splitting files on new line using hadoop fs
bejoy.hadoop@... 2012-02-22, 20:23
Hi Mohit, AFAIK Hadoop has no built-in mechanism for this. During an HDFS copy, a file is split into blocks based solely on the configured block size. When the file is processed with MapReduce, the record reader takes care of the newlines, even if a line spans multiple blocks.
Could you explain more about the use case that requires this at HDFS copy time?
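The record-reader behaviour described above can be illustrated with a small simulation. This is only a sketch of the general idea, not Hadoop's actual code: the function names `split_ranges`, `read_split`, and `read_all` are hypothetical, and the "file" is just an in-memory byte string. Each split skips the line fragment it starts in (unless it starts at offset 0) and reads past its own end to finish the last line it began, so every line is returned exactly once even though the byte ranges ignore newlines.

```python
from typing import List, Tuple


def split_ranges(total: int, block_size: int) -> List[Tuple[int, int]]:
    """Cut [0, total) into (start, length) ranges of at most block_size bytes."""
    return [(start, min(block_size, total - start))
            for start in range(0, total, block_size)]


def read_split(data: bytes, start: int, length: int) -> List[bytes]:
    """Return the lines owned by the byte range [start, start + length)."""
    end = start + length
    pos = start
    if start != 0:
        # Not the first split: discard everything up to the first newline.
        # That line is consumed by the reader of the previous split.
        newline = data.find(b"\n", start)
        if newline == -1:
            return []
        pos = newline + 1
    lines = []
    # A line belongs to this split if it begins at or before `end`,
    # so the reader may run past `end` to finish its last line.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        stop = len(data) if newline == -1 else newline
        lines.append(data[pos:stop])
        pos = stop + 1
    return lines


def read_all(data: bytes, block_size: int) -> List[bytes]:
    """Read every split independently and concatenate the results."""
    lines = []
    for start, length in split_ranges(len(data), block_size):
        lines.extend(read_split(data, start, length))
    return lines


if __name__ == "__main__":
    sample = b"alpha\nbravo charlie\ndelta\necho\n"
    for start, length in split_ranges(len(sample), 8):
        print((start, length), read_split(sample, start, length))
```

With an 8-byte "block size", the second range falls entirely inside the line `bravo charlie` and yields no records, while the first range reads beyond its end to return that line whole.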
Regards Bejoy K S
From handheld, please excuse typos.
-
Re: Splitting files on new line using hadoop fs
Mohit Anchlia 2012-02-22, 20:29
I am using Pig's XMLLoader from piggybank to read XML files concatenated into a single text file. The Pig script stops working once the file is big enough for Hadoop to split it.
Any suggestions on how I can make it work? Below is my simple script, which I would like to enhance once it starts working. Please note that it works for small files.
register '/root/pig-0.8.1-cdh3u3/contrib/piggybank/java/piggybank.jar';
raw = LOAD '/examples/testfile5.txt' using org.apache.pig.piggybank.storage.XMLLoader('<abc>') as (document:chararray);
dump raw;
-
Re: Splitting files on new line using hadoop fs
bejoy.hadoop@... 2012-02-22, 20:44
Hi Mohit, I'm not an expert in Pig, and Pig-specific queries are better directed to the Pig user group. I'll try to help with some basic troubleshooting.
It sounds strange that Pig's XMLLoader can't load larger XML files that consist of multiple blocks. Or is it that Pig can't load the concatenated files you are trying? If so, the cause could be that you are simply appending the contents of multiple XML files into a single file.
Pig users can suggest workarounds, based on how they load small XML files that are stored efficiently.
Regards Bejoy K S
From handheld, Please excuse typos.
-
Re: Splitting files on new line using hadoop fs
Mohit Anchlia 2012-02-22, 22:57
Thanks, I did post this question to that group. All XML documents are separated by a newline, so I don't think that is the issue.
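One way to double-check that assumption is to parse each line of the concatenated file on its own. The sketch below is only a troubleshooting aid under stated assumptions: it assumes exactly one XML document per line, it is meant to run against a local copy of the file, and the function name `find_bad_lines` is hypothetical. It uses Python's standard `xml.etree.ElementTree` parser and says nothing about how XMLLoader itself reads the data.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, List, Tuple


def find_bad_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, error) for each non-empty line that is not
    a single well-formed XML document."""
    bad = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue  # blank separator lines are ignored
        try:
            ET.fromstring(text)
        except ET.ParseError as exc:
            bad.append((number, str(exc)))
    return bad


if __name__ == "__main__":
    sample = [
        "<abc><id>1</id></abc>",
        "<abc><id>2</id>",  # unclosed element: reported as line 2
        "",
        "<abc/>",
    ]
    for number, error in find_bad_lines(sample):
        print(f"line {number}: {error}")
```

For a real file, an open file handle can be passed instead of the list, for example `find_bad_lines(open("testfile5.txt", encoding="utf-8"))`, where the local file name is an assumption. A document that itself contains a newline, or two documents on one line, would show up as a parse error here.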