Re: ETL workflow experiences with Hive
Vijay 2009-12-16, 21:47
Wow, this turned out to be a great discussion! Thanks everyone for providing
detailed feedback. As has already been said many times before, this mailing
list has been immensely helpful.
Please do keep responding as you can. I think information like this will be
tremendously helpful for people and teams who are evaluating Hadoop/Hive or
are in the initial design phases!
On Tue, Dec 15, 2009 at 4:03 PM, Jason Michael <[EMAIL PROTECTED]> wrote:
> We do things a little differently than some of the responses I’ve seen so
> far. Our client software pings a group of apache servers with specific
> URLs/query strings at 15-20 points during its lifecycle, coinciding with
> “interesting” events during the course of the user’s experience. No data is
> returned, we just store the request in the apache log for consumption. Each
> request contains a UUID specific to that client’s current session.
> We parse the hourly Apache logs using Cascading to join up all the various
> requests on the UUID, providing us a session-level view of the data. We do
> a few more basic transforms of the data, and then write it to HDFS as a set
> of SequenceFiles. We then use Hive to create an external table pointed at
> the data’s location. This lets us do a quick validation query. If the
> query passes, we load the data into a new partition on our fact table for
> that date and hour.
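
A rough sketch of the session join described above, written in plain Python rather than Cascading, purely to illustrate the idea of grouping beacon requests by UUID. The `/beacon` path and the `uuid`, `event`, and `ts` query parameters are hypothetical names chosen for the example; the real beacon format is not given in this thread.

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qs


def sessionize(request_urls):
    """Group beacon requests by session UUID, with events ordered by timestamp."""
    sessions = defaultdict(list)
    for url in request_urls:
        params = parse_qs(urlsplit(url).query)
        # Skip requests missing any of the (hypothetical) required parameters.
        if not all(key in params for key in ("uuid", "event", "ts")):
            continue
        sessions[params["uuid"][0]].append(
            (int(params["ts"][0]), params["event"][0])
        )
    # One record per session: its events sorted by timestamp.
    return {uuid: sorted(events) for uuid, events in sessions.items()}


# Hypothetical request lines as they might be pulled from an hourly log.
requests = [
    "/beacon?uuid=a1&event=start&ts=100",
    "/beacon?uuid=b2&event=start&ts=105",
    "/beacon?uuid=a1&event=play&ts=130",
    "/beacon?event=orphan&ts=140",  # no UUID, so it is dropped
]
sessions = sessionize(requests)
```

Each value in `sessions` is the session-level view of one client's events, which is the grain the later steps (writing to HDFS, validating, loading a partition) would work from.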
> Here’s where Hive has really helped us. Our primary fact table contains
> something on the order of 20-30 different fields, the values of which are
> arrived at by applying business logic in most cases. For example, some
> fields are simply taken directly from the underlying beacons, such as IP
> address. But then some are, say, the timestamp difference between two
> events. When we first started off, we executed this business logic during
> the ETL process and stored the results in the Hive table. We quickly saw
> that this would be a problem if we changed the definition of any of the
> fields, however. We would need to rerun ETL for the entire dataset, which
> could take days. So we decided instead to take all that business logic out
> of the ETL process and put it in a custom SerDe.
> ETL now does only a few transforms, mostly to get the beacons aggregated to
> a session grain as mentioned above. The SerDe defines the fields in the fact
> table, and defines an implementing class/method for each. The first time
> the data is deserialized and a field requested, the implementing method
> executes the business logic and caches and returns the result. So now if a
> definition changes, we simply update our SerDe and release the new build to
> our users. No rerun necessary.
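
The compute-on-first-access-and-cache behaviour described above can be sketched as follows. This is not a Hive SerDe (those are written in Java against Hive's SerDe interfaces); it is only a minimal Python illustration of the pattern, and the class name, field names, and raw record keys are all assumptions made up for the example.

```python
class SessionRow:
    """Wraps one raw session record; derived fields are computed on first access."""

    # Field name -> function deriving the value from the raw record.
    # Changing a definition here changes query results without rewriting stored data.
    FIELD_LOGIC = {
        # Taken directly from the underlying record.
        "ip_address": lambda raw: raw["ip"],
        # Timestamp difference between two events.
        "time_to_play": lambda raw: raw["play_ts"] - raw["start_ts"],
    }

    def __init__(self, raw):
        self._raw = raw
        self._cache = {}
        self.computations = 0  # counts how often business logic actually runs

    def get(self, field):
        # Run the business logic only the first time a field is requested.
        if field not in self._cache:
            self._cache[field] = self.FIELD_LOGIC[field](self._raw)
            self.computations += 1
        return self._cache[field]


row = SessionRow({"ip": "10.0.0.1", "start_ts": 100, "play_ts": 130})
first = row.get("time_to_play")   # computed from the raw record
second = row.get("time_to_play")  # served from the cache
```

Because the stored data stays at the raw session grain, redefining `time_to_play` only means shipping a new version of the field logic, which matches the "no rerun necessary" point above.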
> We’re very happy with how it’s all worked out and, as another poster said,
> very appreciative of all the help the mailing list has provided.
> On 12/14/09 1:00 PM, "Vijay" <[EMAIL PROTECTED]> wrote:
> Can anyone share their ETL workflow experiences with Hive? For example, how
> do you transform data from log files to Hive tables? Do you use hive with
> map/reduce scripts or do you use hive programmatically? Or do you do
> something entirely different? I haven't found any samples or details about
> the programmatic usage of hive.
> Thanks in advance,