Starting with Hive - writing custom SerDe (Hive user mailing list)

Fernando Andrés Doglio Turissini  2012-11-29, 13:38
Connell, Chuck  2012-11-29, 14:51
Connell, Chuck  2012-11-29, 14:53

Re: Starting with Hive - writing custom SerDe
Similarly, Pig is pretty nice for this type of data cleansing, and it's a
little more flexible.

However, for completeness, I'll mention an alternative to writing a UDF:
use the Hive streaming feature, where you stream the data through a program
you write in any language you want and format the data as you see fit. This
will be less efficient at runtime, and the query is more complex to write,
although you should be able to hide that complexity behind a view. Here is
a blog post demonstrating this feature with R.

http://picklesanddata.com/data/?p=58

We also cover Hive streaming in Programming Hive (O'Reilly).
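
To make that concrete, here is a rough sketch of what the streaming route
can look like. The table, column, and field names below are made up for
illustration, and the script is only a minimal example, not a full parser:

    # clean_logs.py -- a hypothetical streaming script. Hive pipes the
    # selected columns to stdin, one row per line, tab-separated, and turns
    # whatever we print to stdout back into rows.
    #
    # Invoked from Hive roughly like this (ADD FILE ships the script to the
    # cluster; the view hides the TRANSFORM plumbing from later queries):
    #
    #   ADD FILE clean_logs.py;
    #   CREATE VIEW clean_logs AS
    #   SELECT TRANSFORM (raw_line)
    #          USING 'python clean_logs.py'
    #          AS (ts STRING, host STRING, bytes STRING)
    #   FROM raw_logs;

    import sys

    for line in sys.stdin:
        # Split on runs of whitespace and emit Hive's default NULL marker
        # (\N) for bare hyphens, assuming they stand in for missing values.
        fields = ["\\N" if f == "-" else f for f in line.split()]
        print("\t".join(fields))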

Again, it's probably better either to cleanse the data before putting it in
HDFS or to go to the trouble of writing a UDF.
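
If you do pre-process instead (along the lines of Chuck's suggestion quoted
below), the script can be tiny. This is only a sketch; the file names, table
definition, and field list are assumptions about the log format:

    # prep_logs.py -- a hypothetical cleanup pass that rewrites the messy
    # log lines as plain tab-separated text before loading them into Hive.
    #
    # Usage (names are made up):
    #   python prep_logs.py < traffic.log > traffic.tsv
    #
    # Matching table and load, using the default delimited-text SerDe:
    #   CREATE TABLE traffic (ts STRING, host STRING, bytes STRING)
    #   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
    #   LOAD DATA LOCAL INPATH 'traffic.tsv' INTO TABLE traffic;

    import sys

    for line in sys.stdin:
        # Split on runs of whitespace and rejoin with single tabs; anything
        # fancier (hyphen placeholders, validation) would go here.
        sys.stdout.write("\t".join(line.split()) + "\n")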

dean

On Thu, Nov 29, 2012 at 8:53 AM, Connell, Chuck <[EMAIL PROTECTED]> wrote:

> I meant PLAIN tab-separated text.
>
> *From:* Connell, Chuck [mailto:[EMAIL PROTECTED]]
> *Sent:* Thursday, November 29, 2012 9:51 AM
> *To:* [EMAIL PROTECTED]
> *Subject:* RE: Starting with Hive - writing custom SerDe
>
> You might save yourself a lot of work by pre-processing the data, before
> putting it into Hive. A Python script should be able to find all the
> fields, and change the data to plan tab-separated text. This will load
> directly into Hive, and removes the need to write a custom SerDe.
>
> Chuck Connell
>
> Nuance R&D Data Team
>
> Burlington, MA
>
> *From:* Fernando Andrés Doglio Turissini [mailto:[EMAIL PROTECTED]]
> *Sent:* Thursday, November 29, 2012 8:39 AM
> *To:* [EMAIL PROTECTED]
> *Subject:* Starting with Hive - writing custom SerDe
>
> Hello everyone, I'm starting to play around with Hive, and I have to load
> a traffic data log file into a table. My problem is that the lines of the
> file don't really have a nice separator for each field (on the same line,
> there are several blanks, hyphens, or single spaces used as
> separators)...
>
> So after looking around for a while, I found that I have to write a custom
> SerDe in order to tell Hive how to parse those lines.
>
> I've also found that I can only write them using Java (unlike UDFs for Pig,
> for instance, which can be written in other languages). Is this correct?
>
> Furthermore, I wanted to know if anyone can point me in the direction of
> some documentation that describes the process of writing a SerDe.
> I've found examples around the internet, but none of them explain what
> exactly each method is supposed to do (I'm talking about the methods
> supplied by the SerDe interface).
>
> Thanks in advance!
>
> Best!
>
> Fernando
>

--
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330