Re: ETL workflow experiences with Hive
Wow, this turned out to be a great discussion! Thanks everyone for providing
detailed feedback. As has already been said many times before, this mailing
list has been immensely helpful.

Please do keep responding as you can. I think information like this will be
tremendously helpful for people and teams who are evaluating Hadoop/Hive or
are still in the initial design phases!

On Tue, Dec 15, 2009 at 4:03 PM, Jason Michael <[EMAIL PROTECTED]> wrote:

>  We do things a little differently than some of the responses I’ve seen so
> far.  Our client software pings a group of apache servers with specific
> URLs/query strings at 15-20 points during its lifecycle, coinciding with
> “interesting” events during the course of the user’s experience.  No data is
> returned, we just store the request in the apache log for consumption.  Each
> request contains a UUID specific to that client’s current session.
>
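(For illustration only: a minimal client-side sketch, in Java, of the beaconing described above. The host beacon.example.com, the query-string parameter names, and the event names are all invented for this sketch; only the idea of firing a GET per lifecycle event, carrying a per-session UUID, and ignoring the response comes from the message.)

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.UUID;

public class Beacon {
    // One UUID per client session; every beacon carries it so the hourly
    // logs can later be joined back into a session-level view.
    private static final String SESSION_ID = UUID.randomUUID().toString();

    // Hypothetical host and query-string layout.
    static void ping(String event) throws Exception {
        URL url = new URL("http://beacon.example.com/b?event=" + event
                + "&sid=" + SESSION_ID + "&ts=" + System.currentTimeMillis());
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        conn.getResponseCode(); // no body expected; the request only needs to land in the access log
        conn.disconnect();
    }

    public static void main(String[] args) throws Exception {
        ping("session_start");
        ping("video_play");
    }
}
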
> We parse the hourly apache logs using cascading to join up all the various
> requests on the UUID, providing us a session-level view of the data.  We do
> a few more basic transforms of the data, and then write it to HDFS as a set
> of SequenceFiles.  We then use hive to create an external table pointed at
> the data’s location.  This lets us do a quick validation query.  If the
> query passes, we load the data into a new partition on our fact table for
> that date and hour.
>
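(For illustration only: one way the external-table / validate / load-partition hand-off above could be driven programmatically, through Hive's JDBC interface. The driver class, the jdbc:hive2 endpoint, and the table names, columns, and paths are assumptions made for this sketch rather than Jason's actual setup; the HiveQL is standard syntax, but details vary by Hive version.)

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HourlyLoad {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; older Hive releases used a different class and URL scheme.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        String dt = "2009-12-15", hr = "13";
        String stagingPath = "/etl/sessions/dt=" + dt + "/hr=" + hr;

        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "etl", "");
        Statement stmt = conn.createStatement();

        // External table pointed at the hour's freshly written SequenceFiles.
        stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS sessions_staging "
                + "(session_id STRING, payload STRING) "
                + "STORED AS SEQUENCEFILE LOCATION '" + stagingPath + "'");

        // Quick validation query before committing the hour.
        ResultSet rs = stmt.executeQuery(
                "SELECT COUNT(*) FROM sessions_staging WHERE session_id IS NULL");
        rs.next();
        if (rs.getLong(1) > 0) {
            throw new IllegalStateException("validation failed for " + dt + " hour " + hr);
        }

        // Publish the hour as a new partition on the fact table.
        stmt.execute("INSERT OVERWRITE TABLE session_fact PARTITION (dt='" + dt
                + "', hr='" + hr + "') SELECT session_id, payload FROM sessions_staging");

        stmt.close();
        conn.close();
    }
}
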
> Here’s where Hive has really helped us.  Our primary fact table contains
> something on the order of 20-30 different fields, the values of which are
> arrived at by applying business logic in most cases.  For example, some
> fields are simply taken directly from the underlying beacons, such as IP
> address.  But then some are, say, the timestamp difference between two
> events.  When we first started off, we executed this business logic during
> the ETL process and stored the results in the hive table.  We quickly saw
> that this would be a problem if we changed the definition of any of the
> fields, however.  We would need to rerun ETL for the entire dataset, which
> could take days.  So we decided instead to take all that business logic out
> of the ETL process and put it in a custom SerDe.
>
> ETL now does only a few transforms, mostly to get the beacons aggregated to
> a session grain as mentioned above. The SerDe defines the fields in the fact
> table, and defines an implementing class/method for each.  The first time
> the data is deserialized and a field requested, the implementing method
> executes the business logic and caches and returns the result.  So now if a
> definition changes, we simply update our SerDe and release the new build to
> our users.  No rerun necessary.
>
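(For illustration only: the lazy evaluate-and-cache pattern described above, stripped of the real SerDe plumbing. A proper Hive SerDe also involves the Deserializer interface and ObjectInspectors, which are omitted here; the field names and their business logic are invented.)

import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class LazySessionRow {
    // Raw beacon data for one session, keyed by event name -> timestamp in millis.
    private final Map<String, Long> rawEvents;
    private final Map<String, Object> cache = new HashMap<>();
    private final Map<String, Function<Map<String, Long>, Object>> evaluators = new HashMap<>();

    public LazySessionRow(Map<String, Long> rawEvents) {
        this.rawEvents = rawEvents;
        // Hypothetical field definitions; in the real system each field maps to
        // an implementing class/method holding the business logic.
        evaluators.put("time_to_first_play",
                raw -> raw.get("video_play") - raw.get("session_start"));
        evaluators.put("session_length_ms",
                raw -> raw.get("session_end") - raw.get("session_start"));
    }

    // Business logic runs only on the first request for a field; later requests hit the cache.
    public Object getField(String name) {
        return cache.computeIfAbsent(name, n -> evaluators.get(n).apply(rawEvents));
    }
}

Because the stored SequenceFiles stay at the session grain, redefining a field like time_to_first_play only means shipping a new build of this logic, which is the "no rerun necessary" point Jason makes.
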
> We’re very happy with how it’s all worked out and, as another poster said,
> very appreciative of all the help the mailing list has provided.
>
> Jason
>
>
>
> On 12/14/09 1:00 PM, "Vijay" <[EMAIL PROTECTED]> wrote:
>
> Can anyone share their ETL workflow experiences with Hive? For example, how
> do you transform data from log files to Hive tables? Do you use hive with
> map/reduce scripts or do you use hive programmatically? Or do you do
> something entirely different? I haven't found any samples or details about
> the programmatic usage of hive.
>
> Thanks in advance,
> Vijay
>
>