cast to tuple errors
Hey pig gurus -

I'm having an issue with cast-to-tuple errors, such as:

ERROR 2999: Unexpected internal error.
org.apache.pig.data.DefaultDataBag$DefaultDataBagIterator cannot be cast to
org.apache.pig.data.Tuple

Any help understanding where I've gone wrong would be appreciated!

DETAILS:

Given some Apache logs I'd like to see the percentage of responses by
response code by minute. Basically, I'd like to generate the following:

"""
day_hour_min  response_code  response_code_count  total_responses  response_code_pct
200101011458  200            9                    10               0.9
200101011458  503            1                    10               0.1
"""
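To make the target concrete, here is a minimal sketch of the same calculation in Python (not Pig, purely for illustration). The names `rows` and `response_pct` and the sample records are made up for this example; they are chosen to mirror the table above, with ten requests falling in one minute.

```python
from collections import Counter

def response_pct(rows):
    """Return (day_hour_min, code, code_count, total, pct) per minute and code."""
    per_minute = Counter()
    per_minute_code = Counter()
    for date, hour, minute, code in rows:
        bucket = date + hour + minute  # e.g. "20010101" + "14" + "58"
        per_minute[bucket] += 1
        per_minute_code[(bucket, code)] += 1
    return [
        (bucket, code, n, per_minute[bucket], n / per_minute[bucket])
        for (bucket, code), n in sorted(per_minute_code.items())
    ]

# Hypothetical sample: nine 200s and one 503 in the same minute.
rows = [("20010101", "14", "58", "200")] * 9 + [("20010101", "14", "58", "503")]
for row in response_pct(rows):
    print(*row)
```

Each output row carries both the per-code count and the per-minute total, which is what the Pig script is meant to produce.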
I'm using the following steps; here is what describe says for each relation. Note that `counted' looks correct.

"""
data: {date: chararray,hour: chararray,minute: chararray,response_code: chararray}

grouped_by_minute: {group: (date: chararray,hour: chararray,minute: chararray),data: {date: chararray,hour: chararray,minute: chararray,response_code: chararray}}

grouped_by_minute_by_response_code: {group: (date: chararray,hour: chararray,minute: chararray,response_code: chararray),data: {date: chararray,hour: chararray,minute: chararray,response_code: chararray}}

counted_by_minute: {group: (date: chararray,hour: chararray,minute: chararray),count: long}

counted_by_minute_by_response_code: {group: (date: chararray,hour: chararray,minute: chararray,response_code: chararray),count: long}

counted: {counted_by_minute_by_response_code::group: (date: chararray,hour: chararray,minute: chararray,response_code: chararray),counted_by_minute_by_response_code::count: long,counted_by_minute::group: (date: chararray,hour: chararray,minute: chararray),counted_by_minute::count: long}
"""
Everything works up until my join, where illustrate gives the above
exception. Strangely, I can store the output but it only contains the date,
hour, and minute fields -- missing the counts. For example:

"""
20100110 1 9 20100110 1 9
"""

counted = join
  counted_by_minute_by_response_code by (group.date, group.hour, group.minute),
  counted_by_minute by (group.date, group.hour, group.minute)
  parallel 1;

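The row shape that the `counted' schema describes can be sketched in Python (again not Pig, only an illustration). The function name `join_on_minute` and the counts 9, 1 and 10 are hypothetical; the date, hour and minute values are borrowed from the stored output above. The point is only that every joined row should keep both counts alongside the key fields.

```python
def join_on_minute(by_code, by_minute):
    """Inner-join per-code counts with per-minute totals on (date, hour, minute)."""
    totals = {group: count for group, count in by_minute}
    joined = []
    for group, count in by_code:
        key = group[:3]  # (date, hour, minute), dropping response_code
        if key in totals:
            joined.append((group, count, key, totals[key]))
    return joined

# Hypothetical counts for a single minute.
by_minute = [(("20100110", "1", "9"), 10)]
by_code = [
    (("20100110", "1", "9", "200"), 9),
    (("20100110", "1", "9", "503"), 1),
]
for row in join_on_minute(by_code, by_minute):
    print(row)
```
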
I've tried writing this a few ways now and always have an issue when
referencing members of the group tuple. For example, I concatenated date,
hour, and minute into a single timebucket field and got one step further,
but then ran into what I believe is the same issue when doing the following:

counted_pct = foreach counted generate
  counted_by_minute_by_response_code::group.timebucket as timebucket,
  counted_by_minute_by_response_code::group.response_code as response_code,
  counted_by_minute_by_response_code::count as response_code_count,
  counted_by_minute::count as response_code_count_total,
  (float)counted_by_minute_by_response_code::count / (float)counted_by_minute::count as response_code_pct;

Here I got "java.lang.ClassCastException: java.lang.String cannot be cast to
org.apache.pig.data.Tuple" when referencing timebucket or response_code.
Removing those two fields allowed the script to complete, although the
output was not very useful without them.

Any thoughts on what the problem might be?

Thanks!
Travis