Re: XMLLoader does not work with BIG wikipedia dump
Prashant Kommireddi 2012-03-28, 19:14
Did you set heap size to 0?

Sent from my iPhone

On Mar 28, 2012, at 12:12 PM, "Herbert Mühlburger"
<[EMAIL PROTECTED]> wrote:

> Hi,
>
> On 28.03.12 18:28, Jonathan Coveney wrote:
>> - dev@pig
>> + user@pig
>
> You are right, it fits better on user@pig.
>
>> What command are you using to run this? Are you upping the max heap?
>
> I created a pig script wiki.pig with the following content:
>
> ===
> register piggybank.jar;
>
> pages = load
> '/user/herbert/enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2'
> using org.apache.pig.piggybank.storage.XMLLoader('page') as
> (page:chararray);
> pages = limit pages 1;
> dump pages;
> ===
>
> and used the command:
>
>  % pig wiki.pig
>
> to run the pig script.
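>
> I have not raised the Pig client heap itself, which also matters for
> dump since the results come back through the client JVM. A minimal
> sketch of raising it, assuming the pig launcher honors the PIG_HEAPSIZE
> environment variable (a value in MB; the 4096 here is purely
> illustrative):
>
>  % PIG_HEAPSIZE=4096 pig wiki.pig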
>
> I use the current Hadoop 1.0.1. My version of Pig is checked out from
> trunk and built by myself.
>
> The only thing I customized was setting HADOOP_HEAPSIZE 00 in
> hadoop-env.sh (the default heap size was 1000MB).
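>
> For reference, that setting takes a value in MB; a sketch of the
> hadoop-env.sh line, with an illustrative value rather than my actual
> one:
>
>  export HADOOP_HEAPSIZE=2000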
>
> Kind regards,
> Herbert
>
>> 2012/3/28 Herbert Mühlburger <[EMAIL PROTECTED]>
>>
>>> Hi,
>>>
>>> I would like to use Pig to work with Wikipedia dump files. It works
>>> successfully with an input file of around 8GB in size that does not
>>> contain overly large XML element content.
>>>
>>> In my current case I would like to use the file
>>> "enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2"
>>> (around 2GB of compressed size), which can be found here:
>>>
>>> http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2
>>>
>>> Is it possible that, because the content of a <page></page> XML element
>>> can potentially become very large (several GB, for instance), the
>>> Piggybank XMLLoader has problems loading elements split by <page>?
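>>>
>>> If each matched element is indeed materialized as a single chararray
>>> record, every <page> would have to fit into one map task's heap. As a
>>> quick test one could raise that heap from the script itself; a sketch,
>>> assuming the Hadoop 1.x property name and an illustrative 4 GB value:
>>>
>>> set mapred.child.java.opts '-Xmx4096m';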
>>>
>>> Hopefully somebody can help me with this.
>>>
>>> I've tried to call the following Pig Latin script:
>>>
>>> ========
>>> register piggybank.jar;
>>>
>>> pages = load '/user/herbert/enwiki-latest-pages-meta-history1.xml-
>>> p000000010p000002162.bz2' using org.apache.pig.piggybank.storage.XMLLoader('page')
>>> as (page:chararray);
>>> pages = limit pages 1;
>>> dump pages;
>>> ========
>>>
>>> and always get the following error (the generated logfile is attached):
>>>
>>> ========
>>> 2012-03-28 14:49:54,695 [main] INFO  org.apache.pig.Main - Apache Pig
>>> version 0.11.0-SNAPSHOT (rexported) compiled Mar 28 2012, 08:21:45
>>> 2012-03-28 14:49:54,696 [main] INFO  org.apache.pig.Main - Logging error
>>> messages to: /Users/herbert/Documents/workspace/pig-wikipedia/pig_1332938994693.log
>>> 2012-03-28 14:49:54,936 [main] INFO  org.apache.pig.impl.util.Utils -
>>> Default bootup file /Users/herbert/.pigbootup not found
>>> 2012-03-28 14:49:55,189 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine
>>> - Connecting to hadoop file system at: hdfs://localhost:9000
>>> 2012-03-28 14:49:55,403 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine
>>> - Connecting to map-reduce job tracker at: localhost:9001
>>> 2012-03-28 14:49:55,845 [main] INFO  org.apache.pig.tools.pigstats.ScriptState
>>> - Pig features used in the script: LIMIT
>>> 2012-03-28 14:49:56,021 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler
>>> - File concatenation threshold: 100 optimistic? false
>>> 2012-03-28 14:49:56,067 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
>>> - MR plan size before optimization: 1
>>> 2012-03-28 14:49:56,067 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer
>>> - MR plan size after optimization: 1
>>> 2012-03-28 14:49:56,171 [main] INFO  org.apache.pig.tools.pigstats.ScriptState
>>> - Pig script settings are added to the job
>>> 2012-03-28 14:49:56,187 [main] INFO  org.apache.pig.backend.hadoop.