Re: XMLLoader does not work with BIG wikipedia dump
Prashant Kommireddi 2012-03-28, 19:14
Did you set heap size to 0?
Sent from my iPhone

On Mar 28, 2012, at 12:12 PM, "Herbert Mühlburger" <[EMAIL PROTECTED]> wrote:

> Hi,
>
> On 28.03.12 18:28, Jonathan Coveney wrote:
>> - dev@pig
>> + user@pig
>
> You are right, this fits better on user@pig.
>
>> What command are you using to run this? Are you upping the max heap?
>
> I created a Pig script wiki.pig with the following content:
>
> ===
> register piggybank.jar;
>
> pages = load
> '/user/herbert/enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2'
> using org.apache.pig.piggybank.storage.XMLLoader('page') as
> (page:chararray);
> pages = limit pages 1;
> dump pages;
> ===
>
> and used the command:
>
> % pig wiki.pig
>
> to run the Pig script.
>
> I use the current Hadoop 1.0.1. My version of Pig is checked out from trunk
> and built by myself.
>
> The only thing I customized was setting HADOOP_HEAPSIZE 00 in
> hadoop-env.sh (the default heap size was 1000MB).
>
> Kind regards,
> Herbert
>
>> 2012/3/28 Herbert Mühlburger <[EMAIL PROTECTED]>
>>
>>> Hi,
>>>
>>> I would like to use Pig to work with Wikipedia dump files. It works
>>> successfully with an input file of around 8GB whose XML element
>>> content is not too big.
>>>
>>> In my current case I would like to use the file
>>> "enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2"
>>> (around 2GB compressed), which can be found here:
>>>
>>> http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2
>>>
>>> Is it possible that, because the content of the <page></page> XML
>>> element could potentially become very large (several GB, for instance),
>>> the XMLLoader of Piggybank has problems loading elements split by <page>?
>>>
>>> Hopefully somebody can help me with this.
>>>
>>> I've tried to call the following PIG Latin script:
>>>
>>> ========
>>> register piggybank.jar;
>>>
>>> pages = load '/user/herbert/enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2'
>>> using org.apache.pig.piggybank.storage.XMLLoader('page')
>>> as (page:chararray);
>>> pages = limit pages 1;
>>> dump pages;
>>> ========
>>>
>>> and always get the following error (the generated logfile is attached):
>>>
>>> ========
>>> 2012-03-28 14:49:54,695 [main] INFO org.apache.pig.Main - Apache Pig version 0.11.0-SNAPSHOT (rexported) compiled Mrz 28 2012, 08:21:45
>>> 2012-03-28 14:49:54,696 [main] INFO org.apache.pig.Main - Logging error messages to: /Users/herbert/Documents/workspace/pig-wikipedia/pig_1332938994693.log
>>> 2012-03-28 14:49:54,936 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /Users/herbert/.pigbootup not found
>>> 2012-03-28 14:49:55,189 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000
>>> 2012-03-28 14:49:55,403 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:9001
>>> 2012-03-28 14:49:55,845 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: LIMIT
>>> 2012-03-28 14:49:56,021 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
>>> 2012-03-28 14:49:56,067 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
>>> 2012-03-28 14:49:56,067 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
>>> 2012-03-28 14:49:56,171 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
>>> 2012-03-28 14:49:56,187 [main] INFO org.apache.pig.backend.hadoop.