Re: XMLLoader does not work with BIG wikipedia dump
Did you set heap size to 0?

Sent from my iPhone
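
(For reference: the heap settings involved here are environment variables; a minimal sketch with purely illustrative values, assuming the stock hadoop-env.sh and pig launcher scripts:

  # hadoop-env.sh -- max heap, in MB, for the Hadoop daemons
  export HADOOP_HEAPSIZE=2000

  # shell that launches pig -- max heap, in MB, for the Pig client JVM
  export PIG_HEAPSIZE=4096
)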

On Mar 28, 2012, at 12:12 PM, "Herbert Mühlburger"
<[EMAIL PROTECTED]> wrote:

> Hi,
>
> On 28.03.12 18:28, Jonathan Coveney wrote:
>> - dev@pig
>> + user@pig
>
> You are right, it fits better on user@pig.
>
>> What command are you using to run this? Are you upping the max heap?
>
> I created a pig script wiki.pig with the following content:
>
> ===
> register piggybank.jar;
>
> pages = load
> '/user/herbert/enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2'
> using org.apache.pig.piggybank.storage.XMLLoader('page') as
> (page:chararray);
> pages = limit pages 1;
> dump pages;
> ===
>
> and used the command:
>
>  % pig wiki.pig
>
> to run the pig script.
>
> I use the current Hadoop 1.0.1. My version of Pig is checked out from trunk
> and built by myself.
>
> The only thing I customized was setting HADOOP_HEAPSIZE 00 in
> hadoop-env.sh (the default heap size was 1000 MB).
>
> Kind regards,
> Herbert
>
>> 2012/3/28 Herbert Mühlburger<[EMAIL PROTECTED]>
>>
>>> Hi,
>>>
>>> I would like to use Pig to work with Wikipedia dump files. It works
>>> successfully with an input file of around 8 GB in size whose XML element
>>> content is not too big.
>>>
>>> In my current case I would like to use the file "enwiki-latest-pages-meta-
>>> history1.xml-p000000010p000002162.bz2" (around 2 GB of compressed
>>> size) which can be found here:
>>>
>>> http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-meta-history1.xml-p000000010p000002162.bz2
>>>
>>> Is it possible that, because the content of the <page></page> xml element
>>> could potentially become very large (several GB, for instance), the
>>> XMLLoader of Piggybank has problems loading elements split by <page>?
>>>
>>> Hopefully anybody could help me with this.
>>>
>>> I've tried to call the following PIG Latin script:
>>>
>>> ========
>>> register piggybank.jar;
>>>
>>> pages = load '/user/herbert/enwiki-latest-pages-meta-history1.xml-
>>> p000000010p000002162.bz2' using org.apache.pig.piggybank.storage.XMLLoader('page')
>>> as (page:chararray);
>>> pages = limit pages 1;
>>> dump pages;
>>> ========
>>>
>>> and always get the following error (the generated logfile is attached):
>>>
>>> ========
>>> 2012-03-28 14:49:54,695 [main] INFO  org.apache.pig.Main - Apache Pig version 0.11.0-SNAPSHOT (rexported) compiled Mrz 28 2012, 08:21:45
>>> 2012-03-28 14:49:54,696 [main] INFO  org.apache.pig.Main - Logging error messages to: /Users/herbert/Documents/workspace/pig-wikipedia/pig_1332938994693.log
>>> 2012-03-28 14:49:54,936 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /Users/herbert/.pigbootup not found
>>> 2012-03-28 14:49:55,189 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:9000
>>> 2012-03-28 14:49:55,403 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:9001
>>> 2012-03-28 14:49:55,845 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: LIMIT
>>> 2012-03-28 14:49:56,021 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
>>> 2012-03-28 14:49:56,067 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
>>> 2012-03-28 14:49:56,067 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
>>> 2012-03-28 14:49:56,171 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
>>> 2012-03-28 14:49:56,187 [main] INFO  org.apache.pig.backend.hadoop.
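
(For the XMLLoader case itself, the heap that matters is the one of the map tasks rather than the client or the daemons: the script declares each matched <page>...</page> element as a single chararray field, so the whole element has to fit in a mapper's memory. A minimal sketch of raising the task heap, with an illustrative value -- mapred.child.java.opts is the standard Hadoop 1.x property, and an element of several GB is unlikely to fit in any practical heap regardless:

  % pig -Dmapred.child.java.opts=-Xmx2048m wiki.pig
)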