Pig >> mail # user >> Help with XMLLoader


Re: Help with XMLLoader
Hi Mohit,
  We use XMLLoader for wiki data, which is around 52 GB uncompressed.
Not sure what is causing the problem here. Can you give it a try with Pig 0.9?
Thanks
Vivek
On 2/22/12 9:19 PM, "Mohit Anchlia" <[EMAIL PROTECTED]> wrote:

> On Tue, Feb 21, 2012 at 9:57 PM, Vivek Padmanabhan
> <[EMAIL PROTECTED]>wrote:
>
>> Hi Mohit,
>>  XMLLoader looks for the start and end tag for a given string argument. In
>> the given input there are no end tags, and hence it reads 0 records.
>>
>> Example:
>> raw = LOAD 'sample_xml' using
>> org.apache.pig.piggybank.storage.XMLLoader('abc') as (document:chararray);
>> dump raw;
>>
>> cat sample_xml
>> <abc><def></def></abc>
>> <abc><def></def></abc>
>>
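The tag-matching behavior Vivek describes can be sketched in Python. This is a simplified emulation, not the actual piggybank implementation; `extract_records` is a hypothetical helper and does not handle nested tags of the same name:

```python
def extract_records(text, tag):
    # Emulate XMLLoader's scan: a record is everything from <tag> to the
    # next matching </tag>; input with no closing tag yields no records.
    start, end = "<%s>" % tag, "</%s>" % tag
    records, pos = [], 0
    while True:
        s = text.find(start, pos)
        if s == -1:
            break
        e = text.find(end, s)
        if e == -1:
            break  # no closing tag: the rest of the input is dropped
        records.append(text[s:e + len(end)])
        pos = e + len(end)
    return records

good = "<abc><def></def></abc>\n<abc><def></def></abc>"
bad = "<abc><def></def><abc>\n<abc><def></def><abc>"  # closing tags missing
print(len(extract_records(good, "abc")))  # 2
print(len(extract_records(bad, "abc")))   # 0
```

With matched `</abc>` tags the scan finds two records; with the original input, which never closes `<abc>`, it finds none, matching the "Successfully read 0 records" output below.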
>
> Thanks! I got past this. But I am facing a different problem. When I have a
> big file that splits across multiple nodes, Pig is not able to read the
> records. It returns 0 records.
>
> I created a big file (2 GB) with lots of XML roots like the above. Then I do hadoop
> fs -copyFromLocal bigfile /examples
>
> But when I run the Pig script it returns 0 records. If I reduce the file size
> to a few MB then it works fine. How can I resolve this?
>
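One way to reproduce the multi-split case with known-good input is to generate a well-formed test file larger than the HDFS block size (64 MB by default). A minimal Python sketch; the filename and record count are arbitrary choices, not from the thread:

```python
# Write ~90 MB of well-formed <abc> records so the file spans
# multiple HDFS blocks (assuming the default 64 MB block size).
# Filename and record count are arbitrary for this sketch.
record = "<abc><def>payload</def></abc>\n"  # 30 bytes per record
with open("bigfile.xml", "w") as f:
    for _ in range(3000000):
        f.write(record)
```

Copying this into HDFS (`hadoop fs -copyFromLocal bigfile.xml /examples`) and running the same LOAD/dump script would separate the two failure modes: if a fully well-formed file still reads 0 records past the split threshold, the problem is in how XMLLoader handles records near split boundaries rather than in the input.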
>>
>> Thanks
>> Vivek
>>  On 2/21/12 11:02 PM, "Mohit Anchlia" <[EMAIL PROTECTED]> wrote:
>>
>>> I am trying to use XMLLoader to process the files but it doesn't seem to be
>>> quite working. For the first pass I am just trying to dump all the contents
>>> but it's saying 0 records found:
>>>
>>> bash-3.2$ hadoop fs -cat /examples/testfile.txt
>>>
>>> <abc><def></def><abc>
>>>
>>> <abc><def></def><abc>
>>>
>>> register 'pig-0.8.1-cdh3u3/contrib/piggybank/java/piggybank.jar'
>>>
>>> raw = LOAD '/examples/testfile.txt' using
>>> org.apache.pig.piggybank.storage.XMLLoader('<abc>') as (document:chararray);
>>>
>>> dump raw;
>>>
>>> 2012-02-21 09:22:18,947 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>>> - 50% complete
>>>
>>> 2012-02-21 09:22:24,998 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>>> - 100% complete
>>>
>>> 2012-02-21 09:22:24,999 [main] INFO org.apache.pig.tools.pigstats.PigStats
>>> - Script Statistics:
>>>
>>> HadoopVersion PigVersion UserId StartedAt FinishedAt Features
>>>
>>> 0.20.2-cdh3u3 0.8.1-cdh3u3 hadoop 2012-02-21 09:22:12 2012-02-21 09:22:24
>>> UNKNOWN
>>>
>>> Success!
>>>
>>> Job Stats (time in seconds):
>>>
>>> JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MaxReduceTime
>>> MinReduceTime AvgReduceTime Alias Feature Outputs
>>>
>>> job_201202201638_0012 1 0 2 2 2 0 0 0 raw MAP_ONLY
>>> hdfs://dsdb1:54310/tmp/temp1968655187/tmp-358114646,
>>>
>>> Input(s):
>>>
>>> Successfully read 0 records (402 bytes) from: "/examples/testfile.txt"
>>>
>>> Output(s):
>>>
>>> Successfully stored 0 records in:
>>> "hdfs://dsdb1:54310/tmp/temp1968655187/tmp-358114646"
>>>
>>> Counters:
>>>
>>> Total records written : 0
>>>
>>> Total bytes written : 0
>>>
>>> Spillable Memory Manager spill count : 0
>>>
>>> Total bags proactively spilled: 0
>>>
>>> Total records proactively spilled: 0
>>>
>>> Job DAG:
>>>
>>> job_201202201638_0012
>>>
>>>
>>>
>>> 2012-02-21 09:22:25,004 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>>> - Success!
>>>
>>> 2012-02-21 09:22:25,011 [main] INFO
>>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths
>>> to process : 1
>>>
>>> 2012-02-21 09:22:25,011 [main] INFO
>>> org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input
>>> paths to process : 1
>>>
>>> grunt> quit
>>
>>