

I have a large volume of streaming log data. Each record contains a timestamp, which is very important to the analysis. For example, the data format looks like this:
(1) 20:30:21 01/April/2012 AAAAA.............
(2) 20:30:51 01/April/2012 BBBB.............
(3) 21:30:21 01/April/2012 bbbb.............
Moreover, new data arrives every few minutes. I have to calculate the probability of the occurrence of "bbbb" given the occurrence of "BBBB" (where BBBB occurs earlier than bbbb). So it is really time-dependent.
I wonder if Hadoop is the right platform for this job? Is there any package available for this kind of work?
Thank you.
Zhiwei
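A minimal Python sketch of the calculation described in the question, under stated assumptions. It assumes that P(bbbb | BBBB) is estimated as the fraction of BBBB records that are followed by a bbbb record before the next BBBB; the thread does not define the estimate, so this is only one possible reading. The function names and the fourth sample record are hypothetical and exist only for illustration.

```python
from datetime import datetime

# Timestamp layout taken from the sample records, e.g. "20:30:21 01/April/2012".
# %B matches the full month name and depends on the locale (English assumed).
TS_FORMAT = "%H:%M:%S %d/%B/%Y"


def parse_record(line):
    """Split 'HH:MM:SS DD/Month/YYYY payload' into (datetime, payload)."""
    time_part, date_part, payload = line.split(maxsplit=2)
    return datetime.strptime(f"{time_part} {date_part}", TS_FORMAT), payload


def conditional_probability(lines, cause="BBBB", effect="bbbb"):
    """Estimate P(effect | cause) as the fraction of cause records that are
    followed by an effect record before the next cause record.
    This definition is an assumption, not something stated in the thread."""
    records = sorted(parse_record(line) for line in lines)  # order by timestamp
    causes = 0
    followed = 0
    pending = False  # True while the latest cause has not yet been matched
    for _, payload in records:
        if payload.startswith(cause):
            causes += 1
            pending = True
        elif payload.startswith(effect) and pending:
            followed += 1
            pending = False
    return followed / causes if causes else None


# The first three records mirror the question; the fourth is made up.
sample = [
    "20:30:21 01/April/2012 AAAAA.............",
    "20:30:51 01/April/2012 BBBB.............",
    "21:30:21 01/April/2012 bbbb.............",
    "22:00:00 01/April/2012 BBBB.............",
]
print(conditional_probability(sample))  # 0.5
```

The match is case-sensitive, so "bbbb" and "BBBB" are treated as different events, as in the question.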

Re: Stream data processing
Zhiwei,
How quickly do you have to get the result out once the new data is added? How far back in time do you have to look for BBBB from the occurrence of bbbb? Do you have to do this for all combinations of values or is it just a small subset of values?
Bobby Evans
In reply to Zhiwei Lin, 5/21/12 3:01 PM.

Re: Stream data processing
Hi Robert,
Thank you.
> How quickly do you have to get the result out once the new data is added?
If possible, I hope to get the result instantly.
> How far back in time do you have to look for BBBB from the occurrence of bbbb?
The time window is not constant. It depends on the last occurrence of BBBB before bbbb, so I need to look back through the history to find the last BBBB in this case.
> Do you have to do this for all combinations of values or is it just a small subset of values?
I think this depends on the time of the last occurrence of BBBB in the history. If BBBB occurs rarely, then the early data has to be taken into account.
HDFS is definitely a good place to store the data I have (the daily log is over 1 GB), but I am not sure whether Map/Reduce can help with the stated problem.
Zhiwei
In reply to Robert Evans, 21 May 2012 22:07.
Best wishes.
Zhiwei
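The "last occurrence of BBBB" lookup described above does not necessarily require rescanning the history. A rough Python sketch, assuming records arrive in timestamp order and that only the most recent unmatched BBBB matters: it keeps one timestamp and two counters as state and updates them per record. The class and method names are hypothetical, and this is plain Python rather than the API of any stream processing framework.

```python
from datetime import datetime, timedelta
from typing import Optional


class OnlineEstimator:
    """Running estimate of P(effect | cause), updated one record at a time."""

    def __init__(self, cause: str = "BBBB", effect: str = "bbbb"):
        self.cause = cause
        self.effect = effect
        self.cause_count = 0
        self.followed_count = 0
        # Timestamp of the most recent cause that has not been matched yet.
        self.last_cause_time: Optional[datetime] = None

    def update(self, timestamp: datetime, payload: str) -> Optional[timedelta]:
        """Consume one record. Returns the time gap to the last cause when
        this record is an effect that matches it, otherwise None."""
        if payload.startswith(self.cause):
            self.cause_count += 1
            self.last_cause_time = timestamp
            return None
        if payload.startswith(self.effect) and self.last_cause_time is not None:
            gap = timestamp - self.last_cause_time
            self.followed_count += 1
            self.last_cause_time = None  # each cause is matched at most once
            return gap
        return None

    @property
    def probability(self) -> Optional[float]:
        """Fraction of causes followed by an effect so far (None if no cause yet)."""
        if self.cause_count == 0:
            return None
        return self.followed_count / self.cause_count


# Feed the three sample records from the question, in arrival order.
estimator = OnlineEstimator()
estimator.update(datetime(2012, 4, 1, 20, 30, 21), "AAAAA.............")
estimator.update(datetime(2012, 4, 1, 20, 30, 51), "BBBB.............")
gap = estimator.update(datetime(2012, 4, 1, 21, 30, 21), "bbbb.............")
print(gap, estimator.probability)
```

Because the state is constant in size, the estimate is available immediately after each record, however far back the last BBBB occurred.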

Re: Stream data processing
If you want the results to come out instantly, Map/Reduce is not the proper choice. Map/Reduce is designed for batch processing. It can do small batches, but the overhead of launching the Map/Reduce jobs can be very high compared to the amount of processing you are doing. I personally would look into using Storm, S4, or some other real-time stream processing framework. From what you have said, it sounds like you probably want to use Storm, as it can be used to guarantee that each event is processed once and only once. You can also store your results in HDFS if you want, perhaps through HBase, if you need to do further processing on the data.
Bobby Evans
In reply to Zhiwei Lin, 5/22/12 5:02 AM.

Re: Stream data processing
Hi Bobby,
Thank you. Great help.
Zhiwei
In reply to Robert Evans, 22 May 2012 14:52.
Best wishes.
Zhiwei

