Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Pig >> mail # dev >> Handle NULL values in Cube dimensions

Copy link to this message
Re: Handle NULL values in Cube dimensions
Thanks Jon for your thoughts.

I have a patch which renames the null values in dimension values to "unknown" and use null for rollups. For a sample input tuple

red, null, 12

a = cube inp by ($0, $1);
the above query will emit following combinations

red, unknown, 12
 , unknown, 12
red, , 12
 , , 12

Please let me know if anyone have different opinion on this.
If the above choice looks good I can go ahead and submit the patch in JIRA. Thanks.

-- Prasanth

On Jun 8, 2012, at 11:06 PM, Jonathan Coveney wrote:

> you could always make the value pluggable, going with Unknown for now, and
> then down the line if we want, we could add an "ONNULL" value to the parser
> that sets it.
> 2012/6/8 Prasanth J <[EMAIL PROTECTED]>
>> Thanks Alan and Dmitriy for your thoughts.
>> I think we have two different approaches now.
>> In one approach, if we encounter a null in dimension values we can just
>> label it as "unknown" and use "NULL" string to represent rollups. Whereas,
>> in other approach, if we encounter a null in dimension values, use the null
>> value as such but use "*" or any other string for rollups.
>> Both approaches looks good to me. Please let me know which one should I go
>> ahead with.
>> Thanks
>> -- Prasanth
>> On Jun 8, 2012, at 12:22 PM, Alan Gates wrote:
>>> Option 1 (throwing an error) is bad.  It violates "Pigs eat anything"
>> (see http://pig.apache.org/philosophy.html).
>>> Do we need to give users an ability to name this unknown column?  Why
>> not just label it "unknown" and be done?
>>> Alan.
>>> On Jun 6, 2012, at 2:24 PM, Prasanth J wrote:
>>>> Hello everyone
>>>> I would like to bring up this discussion about the ways for handling
>> NULL values in dimensions specified for cubing. For example, if we have a
>> dimension color with following values
>>>> red
>>>> blue
>>>> null
>>>> green
>>>> how do we differentiate if the null value represent rollup of all
>> colors values or actual null value?
>>>> SQL way:
>>>> There are 2 ways in which SQL server analysis services handles null
>> values in dimensions
>>>> 1) Throw error when it encounters null values in dimension values
>>>> 2) Ignore error by adding the null values to UnknownMembers. By default
>> UnknownMembers will be named as "Unknown". The name for UnknownMembers can
>> also be specified by the user.
>>>> Do we need to handle both ways in Pig? I think the first way (throwing
>> error) is pretty straightforward.
>>>> For the second way (ignoring error), what is the best way to provide
>> support for user specified name for UnknownMembers?
>>>> Please share your thoughts about how we can handle this scenario for
>> different datatypes in Pig.
>>>> Thanks
>>>> -- Prasanth