Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive >> mail # user >> Data are not displayed correctly on hive tables


Copy link to this message
-
Re: Data are not displayed correctly on hive tables
Hi Roshan,
You are right. '\n' in your XML content is going to give you problems. The table you created in Hive assumes one record = 1 '\n' terminated row from your file. I would recommend sanitizing you data before you load it in to get rid of '\n's.

Mark

Mark Grover, Business Intelligence Analyst
OANDA Corporation

www: oanda.com www: fxtrade.com

----- Original Message -----
From: "mperformer" <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Tuesday, May 8, 2012 6:22:03 PM
Subject: Re: Data are not displayed correctly on hive tables

Hi Mark

Thanks for the reply.

In the HDFS, the row looks like:

1~Order Conf Req~<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<message scope="" type="">
<messageId>0</messageId>
<creatorId>0</creatorId>
<creatorType/>
<workspaceId>0</workspaceId>
<workspaceType/>
<createdTime>2006-06-01T17:59:09.413+10:00</createdTime>
<status>0</status>
<from/>
<to/>
<cc/>
<bcc/>
<nickname/><subject>Order Conf </subject>
<body>Prod

Qty Amt

Dlvry Date

Price
</body>
<smspreview/>
<email bool="true">
<type/>
<body>Prod

Qty Amt

Dlvry Date

Price
</body>
<footer/>
</email>
<voice bool="true">
<type/>
<header/>

<body/>
<footer/>
<other/>
</voice><web bool="false"><type></type><header></header><body></body><footer></footer></web>
<notes/>
<alert bool="false">
<value/>
</alert>
<dlr bool="true">
<type>ReportUnDelivered</type>
<rule>Read</rule>
<period>5 mins</period><publishToWeb>false</publishToWeb><expiryValue>0</expiryValue><expiryPeriod>0</expiryPeriod>
</dlr>
</message>

~PROJECT~General Message~No description~null~true~2~28~0

I used "~" as my field separator. But when I do a select query to my hive table, it only shows the " <?xml version="1.0" encoding="UTF-8" standalone="yes"?>" part from XML data.

Seems there are new line character (\n) in my XML piece and it is the default row terminator in HIVE. If this is the case, is there any way to use another character as the row terminator?

Many Thanks.

/Roshan

On Mon, May 7, 2012 at 11:38 PM, Mark Grover < [EMAIL PROTECTED] > wrote:
Hi Roshan,
The following snippet summarizes the delimiters for your Hive table:

colelction.delim \u0002
field.delim \u0001
mapkey.delim \u0003
serialization.format \u0001
Your fields are delimited by \u0001, collections are delimited by \u0002 and the delimiter between the key and value in any maps is \u0003. Can you verify that your XML content doesn't contain any of these characters?

If this still doesn't help, could you pick an affected row and share what the XML appears as in Hive and what it is expected to be?

Good luck!
Mark
Mark Grover, Business Intelligence Analyst
OANDA Corporation

www: oanda.com www: fxtrade.com

----- Original Message -----
From: "mperformer" < [EMAIL PROTECTED] >
To: [EMAIL PROTECTED]
Sent: Sunday, May 6, 2012 11:34:55 PM
Subject: Re: Data are not displayed correctly on hive tables

Hi Mark

Many thanks for your reply. Please find the below output.

hive> describe formatted messagetemplate;
OK
# col_name data_type comment

messagetemplateid bigint None
messagetemplatename string None
datacol string None
messagetemplatetype string None
messagetype string None
messagetemplatedescription string None
originatingtemplateid bigint None
edited boolean None
userid bigint None
projectid bigint None
responsetemplateid bigint None

# Detailed Table Information
Database: default
Owner: root
CreateTime: Mon May 07 12:06:59 EST 2012
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://app6:9100/mnt/hive-test/warehouse/messagetemplate
Table Type: MANAGED_TABLE
Table Parameters:
comment This is the messagetemplate table
transient_lastDdlTime 1336356473

# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
colelction.delim \u0002
field.delim \u0001
mapkey.delim \u0003
serialization.format \u0001
Time taken: 3.2 seconds

Thanks again.

./Roshan.
On Mon, May 7, 2012 at 1:06 PM, Mark Grover < [EMAIL PROTECTED] > wrote:
Could you share the output of the following command in Hive:
describe formatted messagetemplate

My hunch is that your Hive table is using a delimiter (e.g. '\t') that appears in the content of your XML.

Mark Grover, Business Intelligence Analyst
OANDA Corporation

www: oanda.com www: fxtrade.com

From: "mperformer" < [EMAIL PROTECTED] >
To: [EMAIL PROTECTED]
Sent: Sunday, May 6, 2012 8:34:27 PM
Subject: Data are not displayed correctly on hive tables

Hi

I am using

• Hadoop 0.20.2
• Hive 0.8.1
• Sqoop 1.4.1-incubating

in my sample project. Currently I am importing data from PostgreSQL to Hive table using Sqoop. My database table in PostgreSQL has 4 columns and one column stores a bit large XML file as TEXT data type. The same column defined in HIVE as string, but after that column data is not importing and shows as null;

Table structure in PostgreSQL

CREATE TABLE public.messagetemplate (
messagetemplateid BIGSERIAL,
messagetemplatename TEXT,
data TEXT,
messagetemplatetype TEXT,
CONSTRAINT pk_messagetemplate PRIMARY KEY(messagetemplateid)
) WITHOUT OIDS;

Table structure in Hive

hive> desc messagetemplate;
OK
messagetemplateid bigint
messagetemplatename string
data string
messagetemplatetype string
The data column store the XML file as text, but during the import to hive, all data are imported properly (checked the files in HDFS). But using HIVE select statement, it only shows small part from the XML text and the rest column (last column) is null.

Could someone please help me to sort this out. Thanks.
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB