Re: Help with XMLLoader
Hi Mohit,
  We use XMLLoader for wiki data, which is around a 52 GB (uncompressed) file.
Not sure what is causing the problem here. Can you give it a try with Pig 0.9?
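In the meantime, if a record is being broken across input splits on the big
file, one workaround you could try (just a sketch; it assumes XMLLoader's
input format honors mapred.min.split.size on 0.20.2-cdh3u3) is to force the
whole file into a single split:

-- Raise the minimum split size above the file size (4 GB here, in bytes)
-- so the whole 2 GB input goes to one mapper and no <abc> element can
-- straddle a split boundary.
set mapred.min.split.size 4294967296
raw = LOAD '/examples/bigfile' using
org.apache.pig.piggybank.storage.XMLLoader('abc') as (document:chararray);
dump raw;

Compressing the input with gzip would have a similar single-mapper effect,
since gzip files are not splittable.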
Thanks
Vivek
On 2/22/12 9:19 PM, "Mohit Anchlia" <[EMAIL PROTECTED]> wrote:

> On Tue, Feb 21, 2012 at 9:57 PM, Vivek Padmanabhan
> <[EMAIL PROTECTED]>wrote:
>
>> Hi Mohit,
>>  XMLLoader looks for the start and end tags of the given string argument.
>> In the given input there are no end tags, and hence it reads 0 records.
>>
>> Example:
>> raw = LOAD 'sample_xml' using
>> org.apache.pig.piggybank.storage.XMLLoader('abc') as (document:chararray);
>> dump raw;
>>
>> cat sample_xml
>> <abc><def></def></abc>
>> <abc><def></def></abc>
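>>
>> With matching end tags like the above, the dump should emit one tuple per
>> <abc> element, since XMLLoader returns each matched element (tags
>> included) as a single chararray, i.e. something like:
>>
>> (<abc><def></def></abc>)
>> (<abc><def></def></abc>)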
>>
>
> Thanks! I got past this. But now I am facing a different problem. When I
> have a big file that gets split across multiple nodes, Pig is not able to
> read the records. It returns 0 records found.
>
> I created a big file (2 GB) with lots of xml roots like the one above.
> Then I did hadoop fs -copyFromLocal bigfile /examples
>
> But when I run the pig script it returns 0 records. If I reduce the file
> size to a few MB then it works fine. How can I resolve this?
>
>>
>> Thanks
>> Vivek
>>  On 2/21/12 11:02 PM, "Mohit Anchlia" <[EMAIL PROTECTED]> wrote:
>>
>>> I am trying to use XMLLoader to process the files but it doesn't seem to
>>> be quite working. For the first pass I am just trying to dump all the
>>> contents but it's saying 0 records found:
>>>
>>> bash-3.2$ hadoop fs -cat /examples/testfile.txt
>>>
>>> <abc><def></def><abc>
>>>
>>> <abc><def></def><abc>
>>>
>>> register 'pig-0.8.1-cdh3u3/contrib/piggybank/java/piggybank.jar'
>>>
>>> raw = LOAD '/examples/testfile.txt' using
>>> org.apache.pig.piggybank.storage.XMLLoader('<abc>') as
>>> (document:chararray);
>>>
>>> dump raw;
>>>
>>> 2012-02-21 09:22:18,947 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>>> - 50% complete
>>>
>>> 2012-02-21 09:22:24,998 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>>> - 100% complete
>>>
>>> 2012-02-21 09:22:24,999 [main] INFO
>>> org.apache.pig.tools.pigstats.PigStats
>>> - Script Statistics:
>>>
>>> HadoopVersion PigVersion UserId StartedAt FinishedAt Features
>>>
>>> 0.20.2-cdh3u3 0.8.1-cdh3u3 hadoop 2012-02-21 09:22:12 2012-02-21 09:22:24
>>> UNKNOWN
>>>
>>> Success!
>>>
>>> Job Stats (time in seconds):
>>>
>>> JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MaxReduceTime
>>> MinReduceTime AvgReduceTime Alias Feature Outputs
>>>
>>> job_201202201638_0012 1 0 2 2 2 0 0 0 raw MAP_ONLY
>>> hdfs://dsdb1:54310/tmp/temp1968655187/tmp-358114646,
>>>
>>> Input(s):
>>>
>>> Successfully read 0 records (402 bytes) from: "/examples/testfile.txt"
>>>
>>> Output(s):
>>>
>>> Successfully stored 0 records in:
>>> "hdfs://dsdb1:54310/tmp/temp1968655187/tmp-358114646"
>>>
>>> Counters:
>>>
>>> Total records written : 0
>>>
>>> Total bytes written : 0
>>>
>>> Spillable Memory Manager spill count : 0
>>>
>>> Total bags proactively spilled: 0
>>>
>>> Total records proactively spilled: 0
>>>
>>> Job DAG:
>>>
>>> job_201202201638_0012
>>>
>>>
>>>
>>> 2012-02-21 09:22:25,004 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>>> - Success!
>>>
>>> 2012-02-21 09:22:25,011 [main] INFO
>>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths
>>> to process : 1
>>>
>>> 2012-02-21 09:22:25,011 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
>>> input paths to process : 1
>>>
>>> grunt> quit