Hadoop, mail # user - Splitting files on new line using hadoop fs


Re: Splitting files on new line using hadoop fs
Mohit Anchlia 2012-02-22, 22:57
Thanks, I did post this question to that group. All XML documents are
separated by a new line, so that shouldn't be the issue, I think.

On Wed, Feb 22, 2012 at 12:44 PM, <[EMAIL PROTECTED]> wrote:

> Hi Mohit
> I'm not an expert in Pig, and it'd be better to use the Pig user group for
> Pig-specific queries, but I'll try to help you with some basic
> troubleshooting.
>
> It sounds strange that Pig's XMLLoader can't load larger XML files that
> span multiple blocks. Or is it that Pig can't load the concatenated file
> you are trying with? If that is the case, the problem could stem from
> simply appending the contents of multiple XML files into a single file.
>
> Pig users can suggest workarounds for how they store and load large
> numbers of small XML files efficiently.
>
> Regards
> Bejoy K S
>
> From handheld, Please excuse typos.
> ------------------------------
> *From: *Mohit Anchlia <[EMAIL PROTECTED]>
> *Date: *Wed, 22 Feb 2012 12:29:26 -0800
> *To: *<[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
> *Subject: *Re: Splitting files on new line using hadoop fs
>
>
> On Wed, Feb 22, 2012 at 12:23 PM, <[EMAIL PROTECTED]> wrote:
>
>> Hi Mohit
>>        AFAIK there is no default mechanism available for this in
>> Hadoop. A file is split into blocks based only on the configured block
>> size during the HDFS copy. When the file is processed with MapReduce,
>> the record reader takes care of new lines, even if a line spans multiple
>> blocks.
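[Editor's note: as a rough illustration of the record-reader behavior described above (a sketch, not Hadoop's actual LineRecordReader code), each split reader can yield only complete lines by skipping the partial line at its start, which belongs to the previous split, and reading past its end to finish the last line it started:]

```python
# Sketch: line-oriented record reading over fixed-size "blocks".
# Every reader except the first skips past the first newline in its split
# (that partial line belongs to the previous split); every reader finishes
# any line it starts, even if the line ends beyond the split boundary.

def read_split(data: bytes, start: int, end: int) -> list[bytes]:
    pos = start
    if start != 0:
        # Skip the tail of a line owned by the previous split.
        nl = data.find(b"\n", start)
        if nl == -1:
            return []
        pos = nl + 1
    lines = []
    while pos < end:  # only begin lines that start inside this split
        nl = data.find(b"\n", pos)
        if nl == -1:
            lines.append(data[pos:])
            break
        lines.append(data[pos:nl])
        pos = nl + 1
    return lines

data = b"alpha\nbravo\ncharlie\ndelta\n"
block = 8  # artificial block size that cuts "bravo" in two
splits = [(i, min(i + block, len(data))) for i in range(0, len(data), block)]
records = [line for s, e in splits for line in read_split(data, s, e)]
print(records)  # every line comes out whole, exactly once
```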
>>
>> Could you explain more about the use case that demands this at HDFS copy
>> time itself?
>>
>
>  I am using Pig's XMLLoader in piggybank to read XML files concatenated
> into a text file. But the Pig script doesn't work when the file is big
> enough that Hadoop splits it.
>
> Any suggestions on how I can make it work? Below is the simple script
> that I would like to enhance, once it works. Please note it does work for
> small files.
>
>
> register '/root/pig-0.8.1-cdh3u3/contrib/piggybank/java/piggybank.jar'
>
> raw = LOAD '/examples/testfile5.txt' using
> org.apache.pig.piggybank.storage.XMLLoader('<abc>') as (document:chararray);
>
> dump raw;
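[Editor's note: conceptually, XMLLoader('<abc>') yields one chararray record per span from an opening <abc> to its closing </abc>. The sketch below (not piggybank's actual implementation; the <abc> tag is taken from the script above) illustrates that extraction on a concatenated file:]

```python
# Rough sketch of what XMLLoader('<abc>') conceptually extracts:
# every <abc>...</abc> span in the input becomes one record.
import re

def extract_documents(text: str, tag: str) -> list[str]:
    # Non-greedy match from an opening tag to its closing tag.
    pattern = re.compile(r"<%s\b.*?</%s>" % (tag, tag), re.DOTALL)
    return pattern.findall(text)

concatenated = "<abc><v>1</v></abc>\n<abc><v>2</v></abc>\n"
print(extract_documents(concatenated, "abc"))
# one record per concatenated <abc>...</abc> document
```

The trouble described in this thread begins when a document straddles a block boundary: extraction works only if the reader, like the line-based record reader, is willing to read across splits.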
>
>>
>> ------Original Message------
>> From: Mohit Anchlia
>> To: [EMAIL PROTECTED]
>> ReplyTo: [EMAIL PROTECTED]
>> Subject: Splitting files on new line using hadoop fs
>> Sent: Feb 23, 2012 01:45
>>
>> How can I copy large text files using "hadoop fs" such that splits occur
>> based on blocks + new lines instead of blocks alone? Is there a way to do
>> this?
>>
>>
>>
>> Regards
>> Bejoy K S
>>
>> From handheld, Please excuse typos.
>>
>
>