Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Plain View
Hive >> mail # user >> Lateral Views and Multi Table Insert


Copy link to this message
-
Lateral Views and Multi Table Insert
Hi,

I have a question about using lateral views with multi table insert.
I have a table of data that represents raw log data, the structure of
which makes it onerous to query directly largely because it requires
UNIONTYPE columns.  So, I transform that raw table into 3 new tables,
a primary table and 2 1-to-many tables.

The raw table is similar to this:

CREATE TABLE IF NOT EXISTS events_raw (
  event_id STRING,
  event_data_0 INT,
  event_data_1 BIGINT,
  packed_event_data_2 UNIONTYPE <
    INT,
    STRUCT <
      event_data_2:INT,
      event_data_2_sub_0:BOOLEAN,
      event_data_2_sub_1:BOOLEAN>>,
  packed_event_data_3 UNIONTYPE <
    BIGINT,
    ARRAY <
      STRUCT <
        event_data_3_metadata_key:STRING,
        event_data_3_metadata_value:STRING>>>,
  packed_event_data_4 UNIONTYPE <
    BOOLEAN,
    STRUCT <
      event_data_4:BOOLEAN,
      event_data_4_metadata:ARRAY <
        STRUCT <
          event_data_4_metadata_key:STRING,
          event_data_4_metadata_value:STRING>>>>);

This is to be transformed into these tables:

CREATE TABLE IF NOT EXISTS events (
event_id STRING,
event_data_0 INT,
event_data_1 BIGINT,
event_data_2 INT,
event_data_2_sub_0 BOOLEAN,
event_data_2_sub_1 BOOLEAN,
event_data_3 BIGINT,
event_data_4 BOOLEAN);

CREATE TABLE IF NOT EXISTS event_data_3_metadata (
event_id STRING,
metadata_key STRING,
metadata_value STRING);

CREATE TABLE IF NOT EXISTS event_data_4_metadata (
event_id STRING,
metadata_key STRING,
metadata_value STRING);

The only way I know how to unpack and/or explode the UNIONTYPEs is to
create custom UDTFs for each UNIONTYPE column.  For example, I created
an unpack_packed_event_data_2 function which maps an single
UnionObject to a STRUCT<event_data_2:INT, event_data_2_sub_0:BOOLEAN,
event_data_2_sub_1:BOOLEAN>.  Similarly, I created UDTFs to explode
the ARRAY elements contained in the UNIONTYPE columns.

Using those UDTFs I devised these queries to build the transformed tables:

FROM
  events_raw
LATERAL VIEW
  unpack_event_data_2 (packed_event_data_2) event_data_2_struct AS
    event_data_2,
    event_data_2_sub_0,
    event_data_2_sub_1
LATERAL VIEW
  unpack_event_data_3 (packed_event_data_3) event_data_3_struct AS
    event_data_3
LATERAL VIEW
  unpack_event_data_4 (packed_event_data_4) event_data_4_struct AS
    event_data_4
INSERT INTO TABLE events
SELECT
  event_id,
  event_data_0,
  event_data_1,
  event_data_2_struct.event_data_2,
  event_data_2_struct.event_data_2_sub_0,
  event_data_2_struct.event_data_2_sub_1,
  event_data_3_struct.event_data_3,
  event_data_4_struct.event_data_4);

FROM
  events_raw
LATERAL VIEW
  explode_event_data_3 (packed_event_data_3) event_data_3_array_element AS
    metadata_key,
    metadata_value
INSERT INTO TABLE event_data_3_metadata
SELECT
  event_id,
  event_data_3_array_element.metadata_key,
  event_data_3_array_element.metadata_value);

FROM
  events_raw
LATERAL VIEW
  explode_event_data_4 (packed_event_data_4) event_data_4_array_element AS
    metadata_key,
    metadata_value
INSERT INTO TABLE event_data_3_metadata
SELECT
  event_id,
  event_data_4_array_element.metadata_key,
  event_data_4_array_element.metadata_value);

This works correctly, the tables are filled with the appropriate
number of rows.  However, the raw table is scanned 3 times to
accomplish this and that is very costly given the amount of data.
When I combine those 3 statements into one Multi Table Insert:

FROM
  events_raw
LATERAL VIEW
  unpack_event_data_2 (packed_event_data_2) event_data_2_struct AS
    event_data_2,
    event_data_2_sub_0,
    event_data_2_sub_1
LATERAL VIEW
  unpack_event_data_3 (packed_event_data_3) event_data_3_struct AS
    event_data_3
LATERAL VIEW
  unpack_event_data_4 (packed_event_data_4) event_data_4_struct AS
    event_data_4
LATERAL VIEW
  explode_event_data_3 (packed_event_data_3) event_data_3_array_element AS
    metadata_key,
    metadata_value
LATERAL VIEW
  explode_event_data_4 (packed_event_data_4) event_data_4_array_element AS
    metadata_key,
    metadata_value
INSERT INTO TABLE events
SELECT
  event_id,
  event_data_0,
  event_data_1,
  event_data_2_struct.event_data_2,
  event_data_2_struct.event_data_2_sub_0,
  event_data_2_struct.event_data_2_sub_1,
  event_data_3_struct.event_data_3,
  event_data_4_struct.event_data_4
INSERT INTO TABLE event_data_3_metadata
SELECT
  event_id,
  event_data_3_array_element.metadata_key,
  event_data_3_array_element.metadata_value
INSERT INTO TABLE event_data_4_metadata
SELECT
  event_id,
  event_data_4_array_element.metadata_key,
  event_data_4_array_element.metadata_value;

The query fails with:

[Hive Error]: Query returned non-zero code: 10, cause: FAILED: Error
in semantic analysis: Column packed_event_data_3 Found in more than
One Tables/Subqueries.

I don't know how to get around having separate unpack_event_data_3 and
explode_event_data_3 functions.  Combining them would seem to marry
the functions' output signatures and in instances when the BIGINT is
type of the UNIONTYPE there shouldn't be a row of NULL values in the
event_data_4_metadata table and vice versa.

Is there a better way to do this?

Thanks,
Jim Krehl
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB