Re: debugging java.lang.IndexOutOfBoundsException
Looks like one of your files is not being parsed correctly. By default,
PigStorage assumes the file is tab-delimited, so a line that splits into fewer
fields than the schema declares will fail just like this when the schema is
applied.
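Something along these lines might help you find the offending rows (a rough,
untested sketch; it assumes the cleaned input path and the 255-column width
from your message below):

-- load each line whole and flag rows that do not split into 255 tab-separated fields
-- (assumes the lzo codec is picked up the same way it is for your PigStorage load)
raw = LOAD '/user/jjaggars/web_data/clean/*.lzo' USING TextLoader() AS (line:chararray);
-- the -1 limit keeps trailing empty fields, so the count reflects the real row width
counted = FOREACH raw GENERATE SIZE(STRSPLIT(line, '\t', -1)) AS nfields, line;
bad = FILTER counted BY nfields != 255;
bad_sample = LIMIT bad 20;
DUMP bad_sample;
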
On 03.08.2013 at 2:49, "Jesse Jaggars" <[EMAIL PROTECTED]> wrote:

> Hey folks,
>
> I'm a brand new user and I'm working on my first 'real' script. The idea is
> to count web traffic hits by day, user, and url. At the end I want to join
> some account information
> for each user. I'm running into an issue and I'm not sure how to go about
> debugging my work.
>
> The sso_to_account.csv is basically user,account_number per line, and the
> web_data is a TSV file with 255 columns. I built a .pig_schema file for that
> file and placed it alongside the data.
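> Trimmed to two of the 255 columns, the .pig_schema is JSON along these lines
> (the column names below are just two of the real fields, and the exact key
> layout is whatever Pig's JsonMetadata writes, so treat this as a sketch;
> 55 is Pig's type code for chararray):
>
> {"fields":[
>   {"name":"date_time","type":55,"description":null,"schema":null},
>   {"name":"evar23","type":55,"description":null,"schema":null}
> ],"version":0,"sortKeys":[],"sortKeyOrders":[]}
>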
> I have one compressed file for each day of data. Picking the first day of
> data and running it alone produces the correct output. But running the whole
> script:
>
> pig -f pig_scripts/web_users_by_day.pig
>
> results in failed jobs with the following stack trace:
>
> java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>         at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>         at java.util.ArrayList.get(ArrayList.java:322)
>         at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:116)
>         at
> org.apache.pig.builtin.PigStorage.applySchema(PigStorage.java:280)
>         at org.apache.pig.builtin.PigStorage.getNext(PigStorage.java:244)
>         at
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:211)
>         at
> org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:532)
>         at
> org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1178)
>         at org.apache.hadoop.mapred.Child.main(Child.java:249)
>
> I found the following bug on JIRA, but it doesn't seem related:
> https://issues.apache.org/jira/browse/PIG-3051
>
> This issue looks much more relevant, but the comments say it is resolved:
> https://issues.apache.org/jira/browse/PIG-2127
>
> I removed any stray Windows-style carriage returns with the following job
> before running the script:
>
> hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-1.1.2.24.jar
> -D mapred.output.compress=true -D
> mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec -D
> mapred.reduce.tasks=0 -mapper "tr -d '\r'" -reducer NONE -input
> /user/jjaggars/web_data/in -output /user/jjaggars/web_data/clean
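>
> As a sanity check on the cleaned output, something along these lines should
> report zero (or no rows at all) once the carriage returns are gone; a rough
> sketch that I have not run:
>
> clean = LOAD '/user/jjaggars/web_data/clean/*.lzo' USING TextLoader() AS (line:chararray);
> -- rows that still contain a carriage return
> with_cr = FILTER clean BY INDEXOF(line, '\r', 0) >= 0;
> cr_all = GROUP with_cr ALL;
> cr_count = FOREACH cr_all GENERATE COUNT(with_cr);
> DUMP cr_count;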
>
> Here is my (slightly sanitized) script:
>
> accounts = LOAD '/user/jjaggars/sso_to_account.csv' USING PigStorage(',')
> AS (user:chararray, account:chararray);
> web_data = LOAD '/user/jjaggars/web_data/clean/*.lzo' USING
> PigStorage('\t', 'schema');
> logged_in = FILTER web_data BY evar37 is not null AND date_time is not null
> AND evar23 is not null;
> working_set = FOREACH logged_in GENERATE SUBSTRING(date_time, 0, 11) AS
> date, REPLACE(evar37, '"', '') AS user, evar23 AS url;
> by_day = GROUP working_set BY (date, user, url);
> hits_by_day = FOREACH by_day GENERATE FLATTEN(group) as (date, user, url),
> COUNT(working_set) AS hits;
> hits_with_account = JOIN hits_by_day BY user, accounts BY user;
> final = FOREACH hits_with_account GENERATE hits_by_day::date,
> hits_by_day::user, hits_by_day::url, accounts::account, hits_by_day::hits;
> STORE final INTO 'hits_by_day' USING PigStorage();
>
> Here's some version info:
>
> $ pig --version
> Apache Pig version 0.11.2-SNAPSHOT (r: unknown)
> compiled Aug 02 2013, 11:28:54
>
> $ hadoop version
> Hadoop 1.1.2.24
> Subversion  -r