Pig >> mail # user >> cast to tuple errors

cast to tuple errors
Hey pig gurus -

I'm having an issue with cast-to-tuple errors, such as:

ERROR 2999: Unexpected internal error.
org.apache.pig.data.DefaultDataBag$DefaultDataBagIterator cannot be cast to

Any help understanding where I've gone wrong would be appreciated!

Given some Apache logs I'd like to see the percentage of responses by
response code by minute. Basically, I'd like to generate the following:

day_hour_min  response_code  response_code_count  total_responses
200101011458            200                    9               10
200101011458            503                    1               10
I'm using the following steps; the describe output for each relation is
below. Note that `counted' looks correct.

data: {date: chararray,hour: chararray,minute: chararray,response_code:

grouped_by_minute: {group: (date: chararray,hour: chararray,minute:
chararray),data: {date: chararray,hour: chararray,minute:
chararray,response_code: chararray}}

grouped_by_minute_by_response_code: {group: (date: chararray,hour:
chararray,minute: chararray,response_code: chararray),data: {date:
chararray,hour: chararray,minute: chararray,response_code: chararray}}

counted_by_minute: {group: (date: chararray,hour: chararray,minute:
chararray),count: long}
counted_by_minute_by_response_code: {group: (date: chararray,hour:
chararray,minute: chararray,response_code: chararray),count: long}

counted: {counted_by_minute_by_response_code::group: (date: chararray,hour:
chararray,minute: chararray,response_code:
long,counted_by_minute::group: (date: chararray,hour: chararray,minute:
chararray),counted_by_minute::count: long}
Everything works up until my join, where illustrate gives the above
exception. Strangely, I can store the output but it only contains the date,
hour, and minute fields -- missing the counts. For example:

20100110 1 9 20100110 1 9

counted = join
  counted_by_minute_by_response_code by (group.date, group.hour,
  counted_by_minute by (group.date, group.hour, group.minute)
  parallel 1;
I've tried writing this a few ways now and always hit a problem when
referencing members of the group tuple. For example, I concatenated
date+hour+minute into a single timebucket field and got one step further,
but then ran into what I believe is the same issue when doing the following:

counted_pct = foreach counted generate
  counted_by_minute_by_response_code::group.timebucket as timebucket,
  counted_by_minute_by_response_code::group.response_code as response_code,
  counted_by_minute_by_response_code::count as response_code_count,
  counted_by_minute::count as response_code_count_total,
  (float)counted_by_minute_by_response_code::count /
    (float)counted_by_minute::count as response_code_pct;

Here I got "java.lang.ClassCastException: java.lang.String cannot be cast to
org.apache.pig.data.Tuple" when referencing timebucket or response_code.
Removing those two items allowed the script to complete, although the output
was not very useful.
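
One workaround that may be worth trying (a sketch only, not confirmed as the
fix for either exception): flatten the group tuple right after each group, so
that the join keys and the final foreach refer to plain top-level fields
instead of members of a tuple. The load path, the aliases by_minute and
by_minute_code, and the field names cnt and total are placeholders introduced
for illustration; the schema follows the describe output above.

```pig
-- Hypothetical input: the load path and the default loader are placeholders.
data = load 'access_log' as (date:chararray, hour:chararray,
                             minute:chararray, response_code:chararray);

-- Total responses per minute; flatten turns the group tuple into fields.
by_minute = group data by (date, hour, minute);
counted_by_minute = foreach by_minute generate
  flatten(group) as (date, hour, minute),
  COUNT(data) as total;

-- Responses per minute per response code, flattened the same way.
by_minute_code = group data by (date, hour, minute, response_code);
counted_by_minute_by_response_code = foreach by_minute_code generate
  flatten(group) as (date, hour, minute, response_code),
  COUNT(data) as cnt;

-- Join on plain fields rather than on group.date, group.hour, group.minute.
counted = join
  counted_by_minute_by_response_code by (date, hour, minute),
  counted_by_minute by (date, hour, minute);

-- After the join, fields are disambiguated with relation::field.
counted_pct = foreach counted generate
  counted_by_minute_by_response_code::date as date,
  counted_by_minute_by_response_code::hour as hour,
  counted_by_minute_by_response_code::minute as minute,
  counted_by_minute_by_response_code::response_code as response_code,
  counted_by_minute_by_response_code::cnt as response_code_count,
  counted_by_minute::total as total_responses,
  (float)counted_by_minute_by_response_code::cnt /
    (float)counted_by_minute::total as response_code_pct;
```

With the groups flattened, describe on counted should show only scalar
fields, which makes it easier to check whether the cast errors come from
projecting fields out of the group tuple.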

Any thoughts on what the problem might be?