Home | About | Sematext search-lucene.com search-hadoop.com
Pig >> mail # user >> Help with XMLLoader


Re: Help with XMLLoader
On Tue, Feb 21, 2012 at 9:57 PM, Vivek Padmanabhan <[EMAIL PROTECTED]> wrote:

> Hi Mohit,
>  XMLLoader looks for the start and end tags for a given string argument. In
> the given input there are no end tags, hence it reads 0 records.
>
> Example:
> raw = LOAD 'sample_xml' using
> org.apache.pig.piggybank.storage.XMLLoader('abc') as (document:chararray);
> dump raw;
>
> cat sample_xml
> <abc><def></def></abc>
> <abc><def></def></abc>
>
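Vivek's point about paired tags can be illustrated with a small Python sketch. This is a hypothetical simplification of the behaviour he describes, not XMLLoader's actual source: a record is emitted only for a complete `<abc>…</abc>` pair, so input with unclosed start tags yields zero records.

```python
import re

def extract_records(text, tag):
    # Emit one record per complete <tag>...</tag> pair, loosely
    # mimicking how XMLLoader delimits records by start and end tag.
    pattern = re.compile(r"<{0}>.*?</{0}>".format(re.escape(tag)), re.DOTALL)
    return pattern.findall(text)

# Input with matching end tags, as in Vivek's sample_xml: two records.
good = "<abc><def></def></abc>\n<abc><def></def></abc>"
print(len(extract_records(good, "abc")))   # 2

# Input like Mohit's testfile.txt, with no </abc> end tags: zero records.
bad = "<abc><def></def><abc>\n<abc><def></def><abc>"
print(len(extract_records(bad, "abc")))    # 0
```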

Thanks! I got past this. But I am facing a different problem. When I have a
big file that is split across multiple nodes, Pig is not able to read the
records; it returns 0 records found.

I created a big 2 GB file with lots of XML roots like the above, then copied
it with hadoop fs -copyFromLocal bigfile /examples.

But when I run the Pig script it returns 0 records. If I reduce the file to a
few MB it works fine. How can I resolve this?
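One way to picture this symptom, purely as an assumption about the cause (the thread does not confirm it, and real Hadoop record readers can coordinate across split boundaries): if each mapper scanned only its own byte range, a record whose tags straddle a split boundary would be lost. A toy Python sketch:

```python
import re

def scan(chunk, tag="abc"):
    # Count complete <tag>...</tag> records found inside one chunk.
    return len(re.findall(r"<{0}>.*?</{0}>".format(tag), chunk, re.DOTALL))

data = "<abc><def></def></abc><abc><def></def></abc>"

# Whole file in one split: both records are found.
print(scan(data))                  # 2

# Cut the byte stream mid-record, as an arbitrary split boundary would,
# and scan each half independently: the straddling record vanishes.
left, right = data[:30], data[30:]
print(scan(left) + scan(right))    # 1
```

A small file fits in a single split, which would match the observation that the same script works on a few-MB input.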

>
> Thanks
> Vivek
>  On 2/21/12 11:02 PM, "Mohit Anchlia" <[EMAIL PROTECTED]> wrote:
>
> > I am trying to use XMLLoader to process the files but it doesn't seem to
> be
> > quite working. For the first pass I am just trying to dump all the
> contents
> > but it's saying 0 records found:
> >
> > bash-3.2$ hadoop fs -cat /examples/testfile.txt
> >
> > <abc><def></def><abc>
> >
> > <abc><def></def><abc>
> >
> > register 'pig-0.8.1-cdh3u3/contrib/piggybank/java/piggybank.jar'
> >
> > raw = LOAD '/examples/testfile.txt' using
> > org.apache.pig.piggybank.storage.XMLLoader('<abc>') as
> (document:chararray);
> >
> > dump raw;
> >
> > 2012-02-21 09:22:18,947 [main] INFO
> >
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> > - 50% complete
> >
> > 2012-02-21 09:22:24,998 [main] INFO
> >
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> > - 100% complete
> >
> > 2012-02-21 09:22:24,999 [main] INFO
> org.apache.pig.tools.pigstats.PigStats
> > - Script Statistics:
> >
> > HadoopVersion PigVersion UserId StartedAt FinishedAt Features
> >
> > 0.20.2-cdh3u3 0.8.1-cdh3u3 hadoop 2012-02-21 09:22:12 2012-02-21 09:22:24
> > UNKNOWN
> >
> > Success!
> >
> > Job Stats (time in seconds):
> >
> > JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MaxReduceTime
> > MinReduceTime AvgReduceTime Alias Feature Outputs
> >
> > job_201202201638_0012 1 0 2 2 2 0 0 0 raw MAP_ONLY
> > hdfs://dsdb1:54310/tmp/temp1968655187/tmp-358114646,
> >
> > Input(s):
> >
> > Successfully read 0 records (402 bytes) from: "/examples/testfile.txt"
> >
> > Output(s):
> >
> > Successfully stored 0 records in:
> > "hdfs://dsdb1:54310/tmp/temp1968655187/tmp-358114646"
> >
> > Counters:
> >
> > Total records written : 0
> >
> > Total bytes written : 0
> >
> > Spillable Memory Manager spill count : 0
> >
> > Total bags proactively spilled: 0
> >
> > Total records proactively spilled: 0
> >
> > Job DAG:
> >
> > job_201202201638_0012
> >
> >
> >
> > 2012-02-21 09:22:25,004 [main] INFO
> >
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> > - Success!
> >
> > 2012-02-21 09:22:25,011 [main] INFO
> > org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths
> > to process : 1
> >
> > 2012-02-21 09:22:25,011 [main] INFO
> > org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total
> input
> > paths to process : 1
> >
> > grunt> quit
>
>