Sunita Arvind 2013-06-26, 22:47
Stephen Sprague 2013-06-26, 23:11
Sunita Arvind 2013-06-26, 23:38
-Re: unable to partition the table
Stephen Sprague 2013-07-01, 20:31
ok. so i just learned of a perl script called "json_pp" (for json
pretty_prrint) that is a binary included in the distro for the perl JSON
module. You gotta figure there are analogous tools in other languages as
well but given this is a binary it doesn't matter much what language its
written in - it just works.
so doing this: $ hive -e 'select <your_json_column> from <your_table>' |
you'd actually be able to see what your json looks like in a human readable
get the binary (and the module) here: https://metacpan.org/module/JSON
On Wed, Jun 26, 2013 at 4:38 PM, Sunita Arvind <[EMAIL PROTECTED]>wrote:
> Ok. Thanks Stephen. I will try that out.
> Will update the group if I am able to get this to work. For now, I will
> continue with non-partitioned table.
> On Wed, Jun 26, 2013 at 7:11 PM, Stephen Sprague <[EMAIL PROTECTED]>wrote:
>> it would appear to be that you may partition only by non-nested columns.
>> I would recommend transforming your original dataset into one where the
>> first column is YYYYMM and the rest is your json object. During this
>> transformation you may also wish to make further optimizations as well
>> since you'll be scanning every record.
>> as always my 2 cents only.
>> On Wed, Jun 26, 2013 at 3:47 PM, Sunita Arvind <[EMAIL PROTECTED]>wrote:
>>> I am unable to create a partitioned table.
>>> The error I get is:
>>> FAILED: ParseException line 37:16 mismatched input
>>> '"jobs.values.postingDate.year"' expecting Identifier near '(' in column
>>> I tried referring to the columns in various ways,
>>> S.jobs.values.postingDate.year, with quotes, without quotes, get the same
>>> error. Also tried creating a partition by year alone. Still get the same
>>> Here is the create table statement:
>>> create external table linkedin_JobSearch (
>>> jobs STRUCT<
>>> values : ARRAY<STRUCT<
>>> company : STRUCT<
>>> id : STRING,
>>> name : STRING>,
>>> postingDate : STRUCT<
>>> day : STRING>,
>>> descriptionSnippet : STRING,
>>> expirationDate : STRUCT<
>>> locationDescription : STRING>>>
>>> PARTITIONED BY ("jobs.values.postingDate.year" STRING,
>>> "jobs.values.postingDate.month" STRING)
>>> ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
>>> WITH SERDEPROPERTIES (
>>> LOCATION '/user/sunita/Linkedin/JobSearch';
>>> I need to be able to partition this information. Please help.
Stephen Sprague 2013-07-01, 20:43