Hive user mailing list: Writing Custom Serdes for Hive


Re: Writing Custom Serdes for Hive
The reason I am asking (and maybe YC reads this list and can chime in) is that he has written a connector for MongoDB. It's simple: basically it
connects to a MongoDB, maps columns (primitives only) to MongoDB fields,
and allows you to select out of Mongo. Pretty sweet actually, and with
Mongo, things are really fast for small tables.
That being said, I noticed that his connector basically gets all rows from
a MongoDB collection every time it's run. We wanted to see if we
could extend it to do some simple MongoDB-level filtering based on the
passed query. Basically have a fail-open approach: if it saw something
it thought it could optimize in the MongoDB query to limit data, it would;
otherwise, it would default to the original approach of getting all the
data.
For example:

select * from mongo_table where name rlike 'Bobby\\sWhite'

Current method: the connector does db.collection.find(), which gets all the
documents from MongoDB, and then Hive does the regex.

Thing we want to try: "Oh, one of our defined mongo columns has an rlike, ok
send this instead: db.collection.find({name: /Bobby\sWhite/})". Less data
would need to be transferred. Yes, Hive would still run the rlike on
the data... *shrug*, at least it's running it on far less data. Basically,
if we could determine shortcuts, we could use them.
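The fail-open shortcut described above could be sketched roughly like this: inspect the predicate Hive hands us, and if it's an RLIKE on a column we've mapped to a Mongo field, emit the equivalent Mongo regex filter; anything we don't recognize falls back to an empty filter (fetch everything, let Hive do the work). All of the names here are hypothetical illustration, not YC's actual connector code:

```java
import java.util.Set;

// Hypothetical sketch of the "fail open" pushdown idea: if the WHERE clause
// is an RLIKE on a column mapped to a Mongo field, build a Mongo regex
// filter; anything we don't understand falls back to {} (fetch all
// documents and let Hive apply the predicate, as the connector does today).
public class MongoPushdown {
    private final Set<String> mappedColumns;

    public MongoPushdown(Set<String> mappedColumns) {
        this.mappedColumns = mappedColumns;
    }

    /** Returns a Mongo shell filter string, or "{}" when we can't optimize. */
    public String filterFor(String column, String operator, String pattern) {
        if ("rlike".equalsIgnoreCase(operator) && mappedColumns.contains(column)) {
            return "{\"" + column + "\": /" + pattern + "/}";
        }
        return "{}"; // fail open: get all rows, Hive still runs the predicate
    }

    public static void main(String[] args) {
        MongoPushdown p = new MongoPushdown(Set.of("name", "city"));
        // select * from mongo_table where name rlike 'Bobby\\sWhite'
        System.out.println(p.filterFor("name", "rlike", "Bobby\\sWhite"));
        // unrecognized predicate: fall back to fetching everything
        System.out.println(p.filterFor("age", ">", "30"));
    }
}
```

Note the key property: a wrong or unrecognized predicate never loses rows, it only misses an optimization, since Hive re-applies the full predicate either way.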
Just trying to understand Serdes and how we are completely not using them
as intended :)
On Tue, Oct 16, 2012 at 10:42 AM, Connell, Chuck
<[EMAIL PROTECTED]> wrote:

> A serde is actually used the other way around… Hive parses the query,
> writes MapReduce code to solve the query, and the generated code uses the
> serde for field access.
>
> Standard way to write a serde is to start from the trunk regex serde, then
> modify as needed…
>
> http://svn.apache.org/viewvc/hive/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java?revision=1131106&view=markup
>
> Also, nice article by Roberto Congiu…
>
> http://www.congiu.com/a-json-readwrite-serde-for-hive/
>
> Chuck Connell
> Nuance R&D Data Team
> Burlington, MA
>
> *From:* John Omernik [mailto:[EMAIL PROTECTED]]
> *Sent:* Tuesday, October 16, 2012 11:30 AM
> *To:* [EMAIL PROTECTED]
> *Subject:* Writing Custom Serdes for Hive
>
> We have a maybe obvious question about a serde. When a serde is invoked,
> does it have access to the original Hive query? Ideally the original query
> could provide the Serde some hints on how to access the data on the
> backend.
>
> Also, are there any good links/documentation on how to write Serdes? Kinda
> hard to google for, for some reason.
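The "start from the trunk regex serde" suggestion in the quoted reply boils down to: a deserializer matches each input line against a java.util.regex pattern and exposes the capture groups as columns. A stripped-down, hypothetical illustration of that core idea (no Hive dependencies; a real SerDe would implement the org.apache.hadoop.hive.serde2 interfaces and hand rows back through an ObjectInspector rather than a List):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, stripped-down illustration of what a regex-based SerDe does
// per row: match the line against a pattern and expose each capture group
// as a column. A real Hive SerDe implements Deserializer/SerDe and returns
// rows through an ObjectInspector instead of a List<String>.
public class MiniRegexDeserializer {
    private final Pattern pattern;
    private final int numColumns;

    public MiniRegexDeserializer(String regex, int numColumns) {
        this.pattern = Pattern.compile(regex);
        this.numColumns = numColumns;
    }

    /** Returns one String per column, or all nulls if the line doesn't match. */
    public List<String> deserialize(String line) {
        List<String> row = new ArrayList<>(numColumns);
        Matcher m = pattern.matcher(line);
        if (!m.matches()) {
            for (int i = 0; i < numColumns; i++) row.add(null);
            return row; // like RegexSerDe, unmatched lines become null columns
        }
        for (int i = 1; i <= numColumns; i++) row.add(m.group(i));
        return row;
    }

    public static void main(String[] args) {
        // Log-style line: host, quoted request, status code
        MiniRegexDeserializer serde = new MiniRegexDeserializer(
                "(\\S+) \"([^\"]*)\" (\\d+)", 3);
        System.out.println(serde.deserialize("10.0.0.1 \"GET /index.html\" 200"));
    }
}
```

Note that this is read-side only; a full SerDe also declares the column types it produces, which is where the ObjectInspector machinery comes in.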