Pig >> mail # user >> debugging java.lang.IndexOutOfBoundsException


Re: debugging java.lang.IndexOutOfBoundsException
Looks like one of your files is not being parsed correctly. By default, PigStorage assumes your file is tab-delimited.
On 03.08.2013 at 2:49, "Jesse Jaggars" <[EMAIL PROTECTED]> wrote:
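One way to confirm this (not from the original thread, just a debugging sketch) is to scan the decompressed TSV files and report any rows whose field count doesn't match the 255-column schema the poster describes. PigStorage splits on the raw delimiter without quote handling, so a plain tab count should mirror how it sees each line:

```python
import sys

EXPECTED_FIELDS = 255  # column count from the poster's .pig_schema

def find_bad_rows(path, expected=EXPECTED_FIELDS):
    """Return (line_number, field_count) pairs for rows that don't match the schema."""
    bad = []
    with open(path, encoding="utf-8", errors="replace") as fh:
        for lineno, line in enumerate(fh, start=1):
            # Count fields the way a naive delimiter split would: tabs + 1.
            nfields = line.rstrip("\n").count("\t") + 1
            if nfields != expected:
                bad.append((lineno, nfields))
    return bad

if __name__ == "__main__":
    for path in sys.argv[1:]:
        for lineno, nfields in find_bad_rows(path):
            print(f"{path}:{lineno}: expected {EXPECTED_FIELDS} fields, got {nfields}")
```

Running this over each day's (decompressed) file should point at the file, and the exact line, that makes `tuple.get()` fail with Index: 0, Size: 0.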

> Hey folks,
>
> I'm a brand new user and I'm working on my first 'real' script. The idea is
> to count web traffic hits by day, user, and url. At the end I want to join
> some account information
> for each user. I'm running into an issue and I'm not sure how to go about
> debugging my work.
>
> The sso_to_account.csv is basically user,account_number\n and the  web_data
> is a TSV file with 255 columns. I built a .pig_schema file for that file
> and placed it alongside the data.
> I have one compressed file for each day of data. Picking the first day of
> data and running it alone produces the correct output. But running the
> following:
>
> pig -f pig_scripts/web_users_by_day.pig results in failed jobs with the
> following stack trace:
>
> java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>         at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>         at java.util.ArrayList.get(ArrayList.java:322)
>         at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:116)
>         at
> org.apache.pig.builtin.PigStorage.applySchema(PigStorage.java:280)
>         at org.apache.pig.builtin.PigStorage.getNext(PigStorage.java:244)
>         at
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:211)
>         at
> org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:532)
>         at
> org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1178)
>         at org.apache.hadoop.mapred.Child.main(Child.java:249)
>
> I found the following bug on jira, but it doesn't seem related:
> https://issues.apache.org/jira/browse/PIG-3051
>
> This issue looks much more relevant:
> https://issues.apache.org/jira/browse/PIG-2127
> , but the comments say it is resolved.
>
> I removed any extra Windows-style carriage returns with the following job
> before running the script:
>
> hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-1.1.2.24.jar
> -D mapred.output.compress=true -D
> mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec -D
> mapred.reduce.tasks=0 -mapper "tr -d '\r'" -reducer NONE -input
> /user/jjaggars/web_data/in -output /user/jjaggars/web_data/clean
>
> Here is my (slightly sanitized) script:
>
> accounts = LOAD '/user/jjaggars/sso_to_account.csv' USING PigStorage(',')
> AS (user:chararray, account:chararray);
> web_data = LOAD '/user/jjaggars/web_data/clean/*.lzo' USING
> PigStorage('\t', 'schema');
> logged_in = FILTER web_data BY evar37 is not null AND date_time is not null
> AND evar23 is not null;
> working_set = FOREACH logged_in GENERATE SUBSTRING(date_time, 0, 11) AS
> date, REPLACE(evar37, '"', '') AS user, evar23 AS url;
> by_day = GROUP working_set BY (date, user, url);
> hits_by_day = FOREACH by_day GENERATE FLATTEN(group) as (date, user, url),
> COUNT(working_set) AS hits;
> hits_with_account = JOIN hits_by_day BY user, accounts BY user;
> final = FOREACH hits_with_account GENERATE hits_by_day::date,
> hits_by_day::user, hits_by_day::url, accounts::account, hits_by_day::hits;
> STORE final INTO 'hits_by_day' USING PigStorage();
>
> Here's some version info:
>
> $ pig --version
> Apache Pig version 0.11.2-SNAPSHOT (r: unknown)
> compiled Aug 02 2013, 11:28:54
>
> $ hadoop version
> Hadoop 1.1.2.24
> Subversion  -r