Mohit Anchlia 2012-02-21, 17:32
I am trying to use XMLLoader to process some files, but it doesn't seem to be working. As a first pass I am just trying to dump all the contents, but Pig reports that 0 records were read:
bash-3.2$ hadoop fs -cat /examples/testfile.txt
<abc><def></def><abc>
<abc><def></def><abc>
register 'pig-0.8.1-cdh3u3/contrib/piggybank/java/piggybank.jar'
raw = LOAD '/examples/testfile.txt' using org.apache.pig.piggybank.storage.XMLLoader('<abc>') as (document:chararray);
dump raw;
2012-02-21 09:22:18,947 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2012-02-21 09:22:24,998 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2012-02-21 09:22:24,999 [main] INFO org.apache.pig.tools.pigstats.PigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
0.20.2-cdh3u3 0.8.1-cdh3u3 hadoop 2012-02-21 09:22:12 2012-02-21 09:22:24 UNKNOWN
Success!
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MaxReduceTime MinReduceTime AvgReduceTime Alias Feature Outputs
job_201202201638_0012 1 0 2 2 2 0 0 0 raw MAP_ONLY hdfs://dsdb1:54310/tmp/temp1968655187/tmp-358114646,
Input(s):
Successfully read 0 records (402 bytes) from: "/examples/testfile.txt"
Output(s):
Successfully stored 0 records in: "hdfs://dsdb1:54310/tmp/temp1968655187/tmp-358114646"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_201202201638_0012
2012-02-21 09:22:25,004 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2012-02-21 09:22:25,011 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2012-02-21 09:22:25,011 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
grunt> quit
-
Re: Help with XMLLoader
Mohit Anchlia 2012-02-21, 22:16
It looks like the records are not read when the file is big. Could the way the input is split be causing it to fail?
-
Re: Help with XMLLoader
Vivek Padmanabhan 2012-02-22, 05:57
Hi Mohit,
XMLLoader looks for the start and end tags for the given string argument. The given input has no end tags (each line ends with <abc> rather than </abc>), and hence it read 0 records.

Example:
raw = LOAD 'sample_xml' using org.apache.pig.piggybank.storage.XMLLoader('abc') as (document:chararray);
dump raw;

cat sample_xml
<abc><def></def></abc>
<abc><def></def></abc>
Thanks
Vivek
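The start-tag/end-tag behaviour described in the reply above can be illustrated with a small Python sketch. This is a simplified stand-in written for illustration, not the piggybank XMLLoader code; the function name extract_records and the two sample strings are assumptions made here. It only shows why input with no closing </abc> tag yields no records while the corrected sample yields two.

```python
def extract_records(text, tag):
    """Return every substring running from <tag> to the next </tag>.

    Simplified illustration only; the real loader works on input
    splits and streams, which this sketch ignores.
    """
    start_tag = "<" + tag + ">"
    end_tag = "</" + tag + ">"
    records = []
    pos = 0
    while True:
        start = text.find(start_tag, pos)
        if start == -1:
            break  # no further start tag
        end = text.find(end_tag, start + len(start_tag))
        if end == -1:
            break  # start tag without an end tag: nothing is emitted
        end += len(end_tag)
        records.append(text[start:end])
        pos = end
    return records


# Input as posted in the first message (closing tag written as <abc>).
broken = "<abc><def></def><abc>\n<abc><def></def><abc>\n"
# Input as in the sample_xml example (proper </abc> closing tag).
fixed = "<abc><def></def></abc>\n<abc><def></def></abc>\n"

print(len(extract_records(broken, "abc")))  # 0
print(len(extract_records(fixed, "abc")))   # 2
```

Note that the tag is passed as 'abc', without angle brackets, matching the LOAD statement in the example.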
-
Re: Help with XMLLoader
Mohit Anchlia 2012-02-22, 15:49
Thanks! I got past this, but I am now facing a different problem. When I have a big file that is split across multiple nodes, Pig is not able to read the records. It reports 0 records found.
I created a big file (2 GB) containing lots of XML root elements like the ones above. Then I ran hadoop fs -copyFromLocal bigfile /examples
But when I run the Pig script it returns 0 records. If I reduce the size of the file to a few MB, it works fine. How can I resolve this?
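The reproduction steps in the message above (build a large file of repeated XML records, then copy it to HDFS) could be made repeatable with a small generator. The following Python sketch is only an assumption about how such a file might be produced; the helper name write_test_file and the 1 KB demo size are hypothetical, and the record shape is taken from the sample_xml example. Because it returns the number of records written, the count Pig reports can be compared with the expected count at different file sizes.

```python
import os
import tempfile

# Same shape as the records in the sample_xml example.
RECORD = "<abc><def></def></abc>\n"


def write_test_file(path, target_bytes):
    """Write whole records to path until at least target_bytes are written.

    Returns the number of records written, which is the count a
    correct load of the file should report.
    """
    count = 0
    written = 0
    # newline="\n" keeps the byte count identical on every platform.
    with open(path, "w", encoding="ascii", newline="\n") as out:
        while written < target_bytes:
            out.write(RECORD)
            written += len(RECORD)
            count += 1
    return count


if __name__ == "__main__":
    # Hypothetical demo size; a real test would use a much larger target.
    demo_path = os.path.join(tempfile.gettempdir(), "bigfile")
    n = write_test_file(demo_path, 1024)
    print(n, "records,", os.path.getsize(demo_path), "bytes")
```

Generating files at a few increasing sizes, uploading each with hadoop fs -copyFromLocal, and comparing the record count would narrow down the size at which the count drops to 0.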
-
Re: Help with XMLLoader
Vivek Padmanabhan 2012-02-24, 06:05
Hi Mohit,
We use XMLLoader for wiki data, which is a file of around 52 GB (uncompressed), so I am not sure what is causing the problem here. Can you give it a try with Pig 0.9?
Thanks
Vivek