-Re: Parallel data stream processing
Hong Tang 2009-10-10, 08:05
MapReduce is indeed inherently a batch processing model, where each
job's outcome is deterministically determined by the input and the
operators (map, reduce, combiner) as long as the input stays immutable
and the operator is deterministic and side-effect free. Such a model
allows the framework to recover from failures without having to
understand the semantics of the operators (unlike SQL). This is
important because failures are bound to happen (frequently) for a
large cluster assembled from commodity hardware.
A typical technique to bridge a batch system and a real-time system is
to pair with the batch system with an incremental processing component
that computes delta on top of some aggregated result. The incremental
processing part would also serve real-time queries, so the data are
typically stored in memory. Some times you have to choose some
approximation algorithms for the incremental part, and periodically
reset the internal state with the more precise batch processing
results (e.g. top-k queries).
Hope this helps, Hong
On Oct 9, 2009, at 11:02 PM, Ricky Ho wrote:
> I'd like to get some Hadoop experts to verify my understanding ...
> To my understanding, within a Map/Reduce cycle, the input data set
> is "freeze" (no change is allowed) while the output data set is
> "created from scratch" (doesn't exist before). Therefore, the map/
> reduce model is inherently "batch-oriented". Am I right ?
> I am thinking whether Hadoop is usable in processing many data
> streams in parallel. For example, thinking about a e-commerce site
> which capture user's product search in many log files, and they want
> to run some analytics on the log files at real time.
> One naïve way is to chunkify the log and perform Map/Reduce in small
> batches. Since the input data file must be freezed, therefore we
> need to switch subsequent write to a new logfile. However, the
> chunking approach is not good because the cutoff point is quite
> arbitrary. Imagine if I want to calculate the popularity of a
> product based on the frequency of searches within last 2 hours (a
> sliding time window). I don't think Hadoop can do this computation.
> Of course, if we don't mind a distorted picture, we can use a
> jumping window (1-3 PM, 3-5 PM ...) instead of a sliding window,
> then maybe OK. But this is still not good, because we have to wait
> for two hours before getting the new batch of result. (e.g. At 4:59
> PM, we only have the result in the 1-3 PM batch)
> It doesn't seem like Hadoop is good at handling this kind of
> processing: "Parallel processing of multiple real time data stream
> processing". Anyone disagree ? The term "Hadoop streaming" is
> confusing because it means completely different thing to me (ie: use
> stdout and stdin as input and output data)
> I'm wondering if a "mapper-only" model would work better. In this
> case, there is no reducer (ie: no grouping). Each map task keep a
> history (ie: sliding window) of data that it has seen and then write
> the result to the output file.
> I heard about the "append" mode of HDFS, but don't quite get it.
> Does it simply mean a writer can write to the end of an existing
> HDFS file ? Or does it mean a reader can read while a writer is
> appending on the same HDFS file ? Is this "append-mode" feature
> helpful in my situation ?