|
Vijay
2009-12-14, 21:00
Bryan Talbot
2009-12-15, 17:31
Edward Capriolo
2009-12-15, 18:27
Ken.Barclay@...
2009-12-15, 19:00
Bobby Rullo
2009-12-15, 19:57
Jason Michael
2009-12-16, 00:03
Vijay
2009-12-16, 21:47
|
-
ETL workflow experiences with HiveVijay 2009-12-14, 21:00
Can anyone share their ETL workflow experiences with Hive? For example, how
do you transform data from log files to Hive tables? Do you use hive with map/reduce scripts or do you use hive programmatically? Or do you do something entirely different? I haven't found any samples or details about the programmatic usage of hive. Thanks in advance, Vijay
-
Re: ETL workflow experiences with HiveBryan Talbot 2009-12-15, 17:31
I've just started using hive but I'll share my experiences with loading data. We have raw log files in HDFS. These files we want to keep and not change. They sometimes also have more fields than we want to have available in hive tables, so here's how we import that data. A hive script creates an external table (e.g., real_table_stg) with the location pointing to the raw data file to be imported. The script then "insert into real_table select <stuff> from real_table_stg". The <stuff> might be as simple as "*" or it might filter or cleanse raw data from the external table.
One notable issue I've had is the lack of variable substitution in hive scripts. This seems to make every hive script become a script template requiring preprocessing to replace values that must change for each run: location, partition names, etc. I'm currently using groovy scripts to perform the template processing and run the hive jobs. I have also found it convenient to use hive's M/R support in some cases where expressing a transformation in SQL is hard. -Bryan On Dec 14, 2009, at Dec 14, 1:00 PM, Vijay wrote: > Can anyone share their ETL workflow experiences with Hive? For example, how do you transform data from log files to Hive tables? Do you use hive with map/reduce scripts or do you use hive programmatically? Or do you do something entirely different? I haven't found any samples or details about the programmatic usage of hive. > > Thanks in advance, > Vijay
-
Re: ETL workflow experiences with HiveEdward Capriolo 2009-12-15, 18:27
> One notable issue I've had is the lack of variable substitution in hive scripts. This seems to make every hive script become a script template requiring preprocessing to replace values that must change for each run:
Interesting. In a nutshell that is why I created: https://issues.apache.org/jira/browse/HIVE-617 In the most simple case, I needed to substitute X with the current day, month or hour. At first I spent some time writing some shell scripts to generate hive scripts to run with bin/hive -f. This expanded into this complicated t-sql,bash,hive generator and I stopped myself. I settled on Hive-617 because in most cases I really only needed to do substitutions using GregorianCalendar, but I got the benefit of skipping the CLI and using Hive components QueryProcessor directly, also I could launch other MR jobs in the same Class program. I would be interested to see your groovy integration though.
-
RE: ETL workflow experiences with HiveKen.Barclay@... 2009-12-15, 19:00
Hi Vijay,
I built a web-based application in Python that allows users to search Solaris syslog files by entering a date or date range, the log file type (User, Auth, Mail, etc.) and a string or regex to search for. We use syslog-ng to aggregate our logs. To get all our log files into Hive, I wrote a Python program that connected to the Hive Server, created a table partitioned by date and log type, then executed a LOAD DATA LOCAL INPATH command on each log file. The full path of each file was parsed to extract the date and type info so it could be placed in the right partition. The program can take as an argument either a directory under which a whole bunch of files exist (used to get all our logs into the system initially) or it can take a list of individual files (run nightly to keep the system up-to-date with the latest logs.) I then wrote a Python CGI script that establishes a connection to the Hive server and reads the date range to search and other info from the web application and uses it to generate Hive QL queries using the partition parameters. Results are returned to the browser or saved to a file on the user's desktop (their choice.) It works great and the performance is surprisingly fast. One issue I came across was that I could not create a partitioned table that uses a RegexSerDe all in one go - this would just give some "Error parsing string" error. I got around this by breaking the command into two pieces using ALTER TABLE: client.execute("CREATE TABLE SYSLOG(month STRING, day STRING, time STRING, host STRING, logline STRING) PARTITIONED BY(ds STRING, type STRING) STORED AS TEXTFILE") client.execute("ALTER TABLE SYSLOG SET SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES('input.regex'='^([^ ]+) {1,2}([^ ]+) ([^ ]+) ([^ ]+) (.+)$')") Next step is to use mod_python or some other mechanism to improve the performance further. Now that the details of using Hive for log searching are all worked out, I'm going to work on a system that handles much larger files (our Weblogic application logs can be 30G/day per server, and we have a lot of servers...) Of course we have to build out a proper cluster also - just running a standalone instance at the moment. I have to say the hive-user list was indispensable in getting this all to work. Cheers Ken From: Vijay [mailto:[EMAIL PROTECTED]] Sent: Monday, December 14, 2009 1:00 PM To: [EMAIL PROTECTED] Subject: ETL workflow experiences with Hive Can anyone share their ETL workflow experiences with Hive? For example, how do you transform data from log files to Hive tables? Do you use hive with map/reduce scripts or do you use hive programmatically? Or do you do something entirely different? I haven't found any samples or details about the programmatic usage of hive. Thanks in advance, Vijay
-
Re: ETL workflow experiences with HiveBobby Rullo 2009-12-15, 19:57
We do a couple of different things. First, we have a bunch of logs
that are just key/value pairs of transaction-id/server-events-in-json- form. I have scripts which add a new partition for every day's log data. To hive, it's just a two column field, although the second column is a huge json field to hive. To make that column queryable, I created a bunch of UDF's which understand our log format, so I can do stuff like: select count(tid) from txns where UserAgent(txn) like '%Chrome%' and FooBar(txn) = 'baz'; There's also tables which get generated from that transactions table, which are simple columnar tables. I've written python wrappers for many Hive commands: execute_hql(), add_partition(), drop_partition(), etc. Bobby On Dec 14, 2009, at 1:00 PM, Vijay wrote: > Can anyone share their ETL workflow experiences with Hive? For > example, how do you transform data from log files to Hive tables? Do > you use hive with map/reduce scripts or do you use hive > programmatically? Or do you do something entirely different? I > haven't found any samples or details about the programmatic usage of > hive. > > Thanks in advance, > Vijay
-
Re: ETL workflow experiences with HiveJason Michael 2009-12-16, 00:03
We do things a little differently than some of the responses I've seen so far. Our client software pings a group of apache servers with specific URLs/query strings at 15-20 points during its lifecycle, coinciding with "interesting" events during the course of the user's experience. No data is returned, we just store the request in the apache log for consumption. Each request contains a UUID specific to that client's current session.
We parse the hourly apache logs using cascading to join up all the various requests on the UUID, providing us a session-level view of the data. We do a few more basic transforms of the data, and then write it to HDFS as a set of SequenceFiles. We then use hive to create an external table pointed at the data's location. This lets us do a quick validation query. If the query passes, we load the data into a new partition on our fact table for that date and hour. Here's where Hive has really helped us. Our primary fact table contains something on the order of 20-30 different fields, the values of which are arrived at by applying business logic in most cases. For example, some fields are simply taken directly from the underlying beacons, such as IP address. But then some are, say, the timestamp difference between two events. When we first started off, we executed this business logic during the ETL process and stored the results in the hive table. We quickly saw that this would be a problem if we changed the definition of any of the fields, however. We would need to rerun ETL for the entire dataset, which could take days. So we decided instead to take all that business logic out of the ETL process and put it in a custom SerDe. ETL now does only a few transforms, mostly to get the beacons aggregated to a session grain as mentioned above. The SerDe defines the fields in the fact table, and defines an implementing class/method for each. The first time the data is deserialized and a field requested, the implementing method executes the business logic and caches and returns the result. So now if a definition changes, we simply update our SerDe and release the new build to our users. No rerun necessary. We're very happy with how it's all worked out and, as another poster said, very appreciative of all the help the mailing list has provided. Jason On 12/14/09 1:00 PM, "Vijay" <[EMAIL PROTECTED]> wrote: Can anyone share their ETL workflow experiences with Hive? For example, how do you transform data from log files to Hive tables? Do you use hive with map/reduce scripts or do you use hive programmatically? Or do you do something entirely different? I haven't found any samples or details about the programmatic usage of hive. Thanks in advance, Vijay
-
Re: ETL workflow experiences with HiveVijay 2009-12-16, 21:47
Wow, this turned out to be a great discussion! Thanks everyone for providing
detailed feedback. As has already been said many times before, this mailing list has been immensely helpful. Please do keep responding as you can. I think information like this will be tremendously helpful for people and teams evaluating hadoop/hive or are in the initial design phases! On Tue, Dec 15, 2009 at 4:03 PM, Jason Michael <[EMAIL PROTECTED]>wrote: > We do things a little differently than some of the responses I’ve seen so > far. Our client software pings a group of apache servers with specific > URLs/query strings at 15-20 points during its lifecycle, coinciding with > “interesting” events during the course of the user’s experience. No data is > returned, we just store the request in the apache log for consumption. Each > request contains a UUID specific to that client’s current session. > > We parse the hourly apache logs using cascading to join up all the various > requests on the UUID, providing us a session-level view of the data. We do > a few more basic transforms of the data, and then write it to HDFS as a set > of SequenceFiles. We then use hive to create an external table pointed at > the data’s location. This lets us do a quick validation query. If the > query passes, we load the data into a new partition on our fact table for > that date and hour. > > Here’s where Hive has really helped us. Our primary fact table contains > something on the order of 20-30 different fields, the values of which are > arrived at by applying business logic in most cases. For example, some > fields are simply taken directly from the underlying beacons, such as IP > address. But then some are, say, the timestamp difference between two > events. When we first started off, we executed this business logic during > the ETL process and stored the results in the hive table. We quickly saw > that this would be a problem if we changed the definition of any of the > fields, however. We would need to rerun ETL for the entire dataset, which > could take days. So we decided instead to take all that business logic out > of the ETL process and put it in a custom SerDe. > > ETL now does only a few transforms, mostly to get the beacons aggregated to > a session grain as mentioned above. The SerDe defines the fields in the fact > table, and defines an implementing class/method for each. The first time > the data is deserialized and a field requested, the implementing method > executes the business logic and caches and returns the result. So now if a > definition changes, we simply update our SerDe and release the new build to > our users. No rerun necessary. > > We’re very happy with how it’s all worked out and, as another poster said, > very appreciative of all the help the mailing list has provided. > > Jason > > > > On 12/14/09 1:00 PM, "Vijay" <[EMAIL PROTECTED]> wrote: > > Can anyone share their ETL workflow experiences with Hive? For example, how > do you transform data from log files to Hive tables? Do you use hive with > map/reduce scripts or do you use hive programmatically? Or do you do > something entirely different? I haven't found any samples or details about > the programmatic usage of hive. > > Thanks in advance, > Vijay > > |