Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Plain View
Hive >> mail # user >> Lateral Views and Multi Table Insert


Copy link to this message
-
Lateral Views and Multi Table Insert
Hi,

I have a question about using lateral views with multi table insert.
I have a table of data that represents raw log data, the structure of
which makes it onerous to query directly largely because it requires
UNIONTYPE columns.  So, I transform that raw table into 3 new tables,
a primary table and 2 1-to-many tables.

The raw table is similar to this:

CREATE TABLE IF NOT EXISTS events_raw (
  event_id STRING,
  event_data_0 INT,
  event_data_1 BIGINT,
  packed_event_data_2 UNIONTYPE <
    INT,
    STRUCT <
      event_data_2:INT,
      event_data_2_sub_0:BOOLEAN,
      event_data_2_sub_1:BOOLEAN>>,
  packed_event_data_3 UNIONTYPE <
    BIGINT,
    ARRAY <
      STRUCT <
        event_data_3_metadata_key:STRING,
        event_data_3_metadata_value:STRING>>>,
  packed_event_data_4 UNIONTYPE <
    BOOLEAN,
    STRUCT <
      event_data_4:BOOLEAN,
      event_data_4_metadata:ARRAY <
        STRUCT <
          event_data_4_metadata_key:STRING,
          event_data_4_metadata_value:STRING>>>>);

This is to be transformed into these tables:

CREATE TABLE IF NOT EXISTS events (
event_id STRING,
event_data_0 INT,
event_data_1 BIGINT,
event_data_2 INT,
event_data_2_sub_0 BOOLEAN,
event_data_2_sub_1 BOOLEAN,
event_data_3 BIGINT,
event_data_4 BOOLEAN);

CREATE TABLE IF NOT EXISTS event_data_3_metadata (
event_id STRING,
metadata_key STRING,
metadata_value STRING);

CREATE TABLE IF NOT EXISTS event_data_4_metadata (
event_id STRING,
metadata_key STRING,
metadata_value STRING);

The only way I know how to unpack and/or explode the UNIONTYPEs is to
create custom UDTFs for each UNIONTYPE column.  For example, I created
an unpack_packed_event_data_2 function which maps an single
UnionObject to a STRUCT<event_data_2:INT, event_data_2_sub_0:BOOLEAN,
event_data_2_sub_1:BOOLEAN>.  Similarly, I created UDTFs to explode
the ARRAY elements contained in the UNIONTYPE columns.

Using those UDTFs I devised these queries to build the transformed tables:

FROM
  events_raw
LATERAL VIEW
  unpack_event_data_2 (packed_event_data_2) event_data_2_struct AS
    event_data_2,
    event_data_2_sub_0,
    event_data_2_sub_1
LATERAL VIEW
  unpack_event_data_3 (packed_event_data_3) event_data_3_struct AS
    event_data_3
LATERAL VIEW
  unpack_event_data_4 (packed_event_data_4) event_data_4_struct AS
    event_data_4
INSERT INTO TABLE events
SELECT
  event_id,
  event_data_0,
  event_data_1,
  event_data_2_struct.event_data_2,
  event_data_2_struct.event_data_2_sub_0,
  event_data_2_struct.event_data_2_sub_1,
  event_data_3_struct.event_data_3,
  event_data_4_struct.event_data_4);

FROM
  events_raw
LATERAL VIEW
  explode_event_data_3 (packed_event_data_3) event_data_3_array_element AS
    metadata_key,
    metadata_value
INSERT INTO TABLE event_data_3_metadata
SELECT
  event_id,
  event_data_3_array_element.metadata_key,
  event_data_3_array_element.metadata_value);

FROM
  events_raw
LATERAL VIEW
  explode_event_data_4 (packed_event_data_4) event_data_4_array_element AS
    metadata_key,
    metadata_value
INSERT INTO TABLE event_data_3_metadata
SELECT
  event_id,
  event_data_4_array_element.metadata_key,
  event_data_4_array_element.metadata_value);

This works correctly, the tables are filled with the appropriate
number of rows.  However, the raw table is scanned 3 times to
accomplish this and that is very costly given the amount of data.
When I combine those 3 statements into one Multi Table Insert:

FROM
  events_raw
LATERAL VIEW
  unpack_event_data_2 (packed_event_data_2) event_data_2_struct AS
    event_data_2,
    event_data_2_sub_0,
    event_data_2_sub_1
LATERAL VIEW
  unpack_event_data_3 (packed_event_data_3) event_data_3_struct AS
    event_data_3
LATERAL VIEW
  unpack_event_data_4 (packed_event_data_4) event_data_4_struct AS
    event_data_4
LATERAL VIEW
  explode_event_data_3 (packed_event_data_3) event_data_3_array_element AS
    metadata_key,
    metadata_value
LATERAL VIEW
  explode_event_data_4 (packed_event_data_4) event_data_4_array_element AS
    metadata_key,
    metadata_value
INSERT INTO TABLE events
SELECT
  event_id,
  event_data_0,
  event_data_1,
  event_data_2_struct.event_data_2,
  event_data_2_struct.event_data_2_sub_0,
  event_data_2_struct.event_data_2_sub_1,
  event_data_3_struct.event_data_3,
  event_data_4_struct.event_data_4
INSERT INTO TABLE event_data_3_metadata
SELECT
  event_id,
  event_data_3_array_element.metadata_key,
  event_data_3_array_element.metadata_value
INSERT INTO TABLE event_data_4_metadata
SELECT
  event_id,
  event_data_4_array_element.metadata_key,
  event_data_4_array_element.metadata_value;

The query fails with:

[Hive Error]: Query returned non-zero code: 10, cause: FAILED: Error
in semantic analysis: Column packed_event_data_3 Found in more than
One Tables/Subqueries.

I don't know how to get around having separate unpack_event_data_3 and
explode_event_data_3 functions.  Combining them would seem to marry
the functions' output signatures and in instances when the BIGINT is
type of the UNIONTYPE there shouldn't be a row of NULL values in the
event_data_4_metadata table and vice versa.

Is there a better way to do this?

Thanks,
Jim Krehl