Hive >> mail # user >> ETL workflow experiences with Hive

Re: ETL workflow experiences with Hive
Wow, this turned out to be a great discussion! Thanks everyone for providing
detailed feedback. As has already been said many times before, this mailing
list has been immensely helpful.

Please do keep responding as you can. I think information like this will be
tremendously helpful for people and teams who are evaluating Hadoop/Hive or
are in the initial design phases!

On Tue, Dec 15, 2009 at 4:03 PM, Jason Michael <[EMAIL PROTECTED]> wrote:

>  We do things a little differently than some of the responses I’ve seen so
> far.  Our client software pings a group of Apache servers with specific
> URLs/query strings at 15-20 points during its lifecycle, coinciding with
> “interesting” events during the course of the user’s experience.  No data is
> returned; we just store the request in the Apache log for consumption.  Each
> request contains a UUID specific to that client’s current session.
> We parse the hourly Apache logs using Cascading to join up all the various
> requests on the UUID, providing us with a session-level view of the data.
> We do a few more basic transforms of the data and then write it to HDFS as
> a set of SequenceFiles.  We then use Hive to create an external table
> pointed at the data’s location.  This lets us do a quick validation query.
> If the query passes, we load the data into a new partition on our fact
> table for that date and hour.
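The join-on-UUID sessionization step described above is done with Cascading in the actual pipeline; the core idea can be sketched in plain Java. The record shape and names here (`Beacon`, `sessionize`) are hypothetical illustrations, not taken from the post:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of sessionization: group per-request beacon records by
// their session UUID, yielding one session-level view per client session.
// The real pipeline does this with Cascading over hourly Apache logs.
public class Sessionizer {

    // One parsed log line: the session UUID plus the raw request payload.
    public record Beacon(String sessionId, String payload) {}

    // Collect all beacons that share a UUID into one session-level list,
    // preserving the order in which sessions first appear.
    public static Map<String, List<Beacon>> sessionize(List<Beacon> beacons) {
        Map<String, List<Beacon>> sessions = new LinkedHashMap<>();
        for (Beacon b : beacons) {
            sessions.computeIfAbsent(b.sessionId(), k -> new ArrayList<>())
                    .add(b);
        }
        return sessions;
    }
}
```

In the real system the grouped sessions would then go through the remaining transforms and be written out as SequenceFiles for Hive to read.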
> Here’s where Hive has really helped us.  Our primary fact table contains
> something on the order of 20-30 different fields, most of whose values are
> arrived at by applying business logic.  For example, some fields are simply
> taken directly from the underlying beacons, such as IP address, while
> others are, say, the timestamp difference between two events.  When we
> first started off, we executed this business logic during the ETL process
> and stored the results in the Hive table.  We quickly saw, however, that
> this would be a problem if we changed the definition of any of the fields:
> we would need to rerun ETL for the entire dataset, which could take days.
> So we decided instead to take all that business logic out of the ETL
> process and put it in a custom SerDe.
> ETL now does only a few transforms, mostly to get the beacons aggregated to
> a session grain as mentioned above.  The SerDe defines the fields in the
> fact table, along with an implementing class/method for each.  The first
> time the data is deserialized and a field is requested, the implementing
> method executes the business logic, caches the result, and returns it.  So
> now if a definition changes, we simply update our SerDe and release the new
> build to our users.  No rerun necessary.
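The compute-on-first-access-and-cache pattern described above can be sketched as follows. The class name, the example field (elapsed time between two events), and the method names are all made up for illustration; they are not from the actual SerDe:

```java
// Sketch of the lazy-evaluation idea from the post: a derived field is
// not materialized during ETL but computed on first access, then cached.
// When the field's definition changes, only this method changes -- no
// re-run of ETL over the historical data is needed.
public class SessionRow {
    private final long startMillis;
    private final long endMillis;

    // Cache for the lazily computed field; null means "not computed yet".
    private Long elapsedMillis;

    public SessionRow(long startMillis, long endMillis) {
        this.startMillis = startMillis;
        this.endMillis = endMillis;
    }

    // First call runs the "business logic" (here, a simple timestamp
    // difference) and caches the result; later calls return the cache.
    public long getElapsedMillis() {
        if (elapsedMillis == null) {
            elapsedMillis = endMillis - startMillis;
        }
        return elapsedMillis;
    }
}
```

A real Hive SerDe would expose such values through the deserialization interfaces rather than plain getters, but the caching logic is the same shape.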
> We’re very happy with how it’s all worked out and, as another poster said,
> very appreciative of all the help the mailing list has provided.
> Jason
> On 12/14/09 1:00 PM, "Vijay" <[EMAIL PROTECTED]> wrote:
> Can anyone share their ETL workflow experiences with Hive? For example, how
> do you transform data from log files to Hive tables? Do you use hive with
> map/reduce scripts or do you use hive programmatically? Or do you do
> something entirely different? I haven't found any samples or details about
> the programmatic usage of Hive.
> Thanks in advance,
> Vijay