Hi, I currently have a project to process data using MapReduce (MR). I have some thoughts about the design, and I'm looking for feedback and advice.
Currently in this project, I have a lot of event data related to email tracking coming into HDFS. The events are email-tracking data, like email_sent, email_open, email_bnc, link_click, etc. Our online system can give me the raw data in either of the following two formats:
1) The delta data lists individual events, by type and timestamp. For example:
email_sent, t1, .......
email_read, t2, .......
email_sent, t3, ......
If I choose this format, I can split each event type into its own data set and store them in HDFS, partitioned by time. This is the easiest for ETL, but not very useful for consuming the data, since most business analysis wants to link all the events for one email into a chain, like email_sent, email_read, email_click, email_bounce, etc.
So I want to join the data during ETL, and store it in HDFS in the shape the end users will most likely query.
But linking the email events is very expensive: it is hard to figure out which email_sent a given email_read belongs to, and most importantly, to recover the original email_sent timestamp.
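To make the problem concrete, here is a minimal sketch of a naive linking pass, assuming each event carries only a type, an email address, and a timestamp (the field names are my own illustration, not the real schema). It attaches each follow-up event to the most recent prior email_sent for the same address, which shows why linking from format 1) alone is fragile:

```python
from collections import defaultdict

def link_events(events):
    """Naively link follow-up events to the most recent prior email_sent
    for the same address. `events` is a list of (event_type, address, ts)
    tuples, assumed sorted by ts. Fragile by design: if several sends to
    the same address are in flight, 'most recent prior send' can attach
    an event to the wrong chain."""
    chains = defaultdict(list)   # (address, sent_ts) -> list of events
    last_sent = {}               # address -> sent_ts of the latest send
    for etype, addr, ts in events:
        if etype == "email_sent":
            last_sent[addr] = ts
            chains[(addr, ts)].append((etype, ts))
        elif addr in last_sent:
            chains[(addr, last_sent[addr])].append((etype, ts))
        # else: orphan event whose email_sent was in an earlier delta
    return dict(chains)
```

Note the `else` branch: any event whose email_sent arrived in an earlier delta is an orphan here, which is exactly the expensive lookup problem described above.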
Fortunately, our online system (backed by a big Cassandra cluster) can give me the data in another format:
2) The delta data includes the whole email chain for the delta period. For example:
email_sent, t1 .....
email_sent, email_read, t2......
email_sent, t3.......
email_sent, email_read, link_click, t4 .......
But here is the trade-off: even though it is a delta, it doesn't contain ONLY the delta. In the example above, the 2nd line is an email_read event, and it nicely links back to its email_sent, but that original email_sent was most likely already delivered to me in an earlier delta. So I have to merge the email_read into the original email_sent record, which already exists in HDFS, and HDFS only supports append. To make this whole thing work, I have to replace the original record in HDFS.
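Since HDFS files are append-only, "replace" in practice usually means rewriting the affected partition: read the old records plus the delta records, let the delta version win per key, and write a fresh file that is swapped in for the old one. A minimal sketch of that merge, assuming records are dicts identified by a hypothetical (email_address, sent_ts) key:

```python
def merge_partition(existing, delta):
    """Produce the new contents of one hourly partition for an
    append-only store. A delta record replaces the existing record
    with the same (email_address, sent_ts) key, because its event
    chain is a superset of the old one; brand-new keys are added."""
    merged = {(r["email_address"], r["sent_ts"]): r for r in existing}
    for r in delta:
        merged[(r["email_address"], r["sent_ts"])] = r  # delta wins
    # In the real job: write merged.values() to a new partition file,
    # then atomically rename it over the old one.
    return sorted(merged.values(),
                  key=lambda r: (r["email_address"], r["sent_ts"]))
```

The important design point is that the unit of replacement is the whole partition file, not an individual record, which is why partitioning by the original email_sent hour (so only touched hours are rewritten) matters so much.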
We generate about 200-300M events per day, so it is a challenge to get this right.
Here are my initial thoughts; I'm looking for any feedback and advice:
1) Ideally, the data will be stored in HDFS, partitioned hourly by the original email_sent timestamp.
2) Within each hour, it may be a good idea to store the data as a MapFile, using (email_address + timestamp) as the key, so an index can be built on top of it and later lookups are fast.
3) When a delta comes in (daily, as a first step), use the 2nd format above as the source raw data, and run a first MR job to group the data hourly by the email_sent timestamp. Here is one of the questions I am not sure how to best address: each reducer will receive all the records whose original email_sent falls in that hour, and I want to join them back against that hour's existing data to do the replacement. But I don't think I can do this directly in the reducer, right? Ideally, at this point I'd do a merge join, maybe based on the timestamp. So I'd have to save the reducer output to some temp location in HDFS and do the join in another MR job, right? Without the first MR job, I won't know how many hourly partitions the incoming delta will touch.
4) The challenging part is that the delta also contains a lot of new email_sent events, so the above join really has to be a full outer join.
5) The ratio of new email_sent to updates of old records is about 9:1, at least based on my current research. Would it really make sense to consume the data in both format 1) and format 2)? From 1) I can get all the new email_sent events and discard the rest; then I'd need to go to 2) to identify the part that needs merging. But in that case I'd have to consume two big data dumps, which sounds like a bad idea.
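For point 3) and 4), the join the second job would perform can be sketched as a one-pass full outer merge join over two streams that are both sorted by the key, which is exactly what a sorted MapFile layout buys you. This is only an illustration of the logic, again with a hypothetical (email_address, sent_ts) key:

```python
def full_outer_merge_join(existing, delta):
    """One-pass full outer merge join of two record streams, both
    sorted by (email_address, sent_ts). Mirrors what the second MR
    job (or a map-side merge over sorted MapFiles) would do:
      - key only in existing -> keep the old record unchanged
      - key only in delta    -> new email_sent, emit as-is
      - key in both          -> the delta record replaces the old one
    """
    key = lambda r: (r["email_address"], r["sent_ts"])
    out, i, j = [], 0, 0
    while i < len(existing) and j < len(delta):
        ke, kd = key(existing[i]), key(delta[j])
        if ke < kd:
            out.append(existing[i]); i += 1
        elif kd < ke:
            out.append(delta[j]); j += 1
        else:
            out.append(delta[j]); i += 1; j += 1  # delta wins
    out.extend(existing[i:])
    out.extend(delta[j:])
    return out
```

With the 9:1 new-to-update ratio, most delta records fall through the "only in delta" branch, which is cheap; the cost of the job is dominated by re-reading and rewriting the touched hourly partitions, which argues for consuming only format 2) rather than both dumps.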
Bottom line, I would like to know:
1) What file format should I consider for storing my data? Does a MapFile make sense?
2) What should I pay attention to when designing these MR jobs? How can I make the whole pipeline efficient?
3) Even within a MapFile, what format should I use to serialize the value object? Google Protobuf? Apache Avro? Something else, and why?