I am not aware of any tool that does this for you directly, but it should not be too difficult to write what you want using Pig or Hive. I am not as familiar with Jaql, but I assume you can do it there too. It might be simpler, though, to write it as a plain Map/Reduce job, because raw Map/Reduce lets us do things that the higher-level languages disallow so that they can apply their optimizations.
What I would do is have the mapper scan through each entry and look for transitions of $value across $threshold, recording the time at which each one occurred. Within a partition you can then look for 30+ second windows where $value > $threshold and output them to the reducer. The trick is that you need to pay special attention to the beginning and end of the partition, because a window may start in one partition and finish in the next. So the mapper also needs to send the reducer the state at the beginning and at the end of each partition, and how long it was in that state. The reducer can then combine these pieces and see whether the joined window meets the 30+ second criterion. If it does, output it with the rest; otherwise drop it.
Windows already known to be 30+ seconds long can be sent to any reducer, so they can have any key. For the boundary pieces to be joined correctly, however, they all have to reach a single reducer, so they should share one very specific key. You could also try to divide the boundary pieces among several reducers if you have to scale very, very large, but that would be rather difficult to get right.
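To make the boundary handling concrete, here is a rough sketch in plain Python rather than real Hadoop code. It assumes each partition is a time-sorted list of (timestamp_in_seconds, value) pairs, that partitions are numbered in time order, and that a window lasts from its first to its last above-threshold sample. The names map_partition, reduce_edges and find_windows, the "edge"/"done" keys and the THRESHOLD value are all made up for illustration.

```python
from itertools import groupby

THRESHOLD = 100.0   # hypothetical threshold
MIN_SECONDS = 30    # "30+ seconds", taken here as >= 30


def map_partition(index, records):
    """Mapper sketch: emit (key, payload) pairs for one partition."""
    out = []
    last = len(records) - 1
    pos = 0
    # groupby splits the records into runs that are above / not above.
    for above, group in groupby(records, key=lambda r: r[1] > THRESHOLD):
        group = list(group)
        first_pos, pos = pos, pos + len(group)
        if not above:
            continue
        start, end = group[0][0], group[-1][0]
        at_start = first_pos == 0       # run touches partition start
        at_end = pos - 1 == last        # run touches partition end
        if at_start or at_end:
            # Possibly incomplete: must go to the single "edge" reducer.
            out.append(("edge", (index, start, end, at_start, at_end)))
        elif end - start >= MIN_SECONDS:
            # Complete and long enough: any reducer can take it.
            out.append(("done", (start, end)))
    return out


def reduce_edges(fragments):
    """Reducer sketch for the "edge" key: join pieces across partitions."""
    windows = []
    current = None  # (partition index, start, end, still open at end)
    for index, start, end, at_start, at_end in sorted(fragments):
        joins = (current is not None and current[3] and at_start
                 and index == current[0] + 1)
        if joins:
            current = (index, current[1], end, at_end)
        else:
            if current is not None:
                windows.append((current[1], current[2]))
            current = (index, start, end, at_end)
    if current is not None:
        windows.append((current[1], current[2]))
    return [(s, e) for s, e in windows if e - s >= MIN_SECONDS]


def find_windows(partitions):
    """Local stand-in for the shuffle: route payloads by key."""
    done, edges = [], []
    for index, records in enumerate(partitions):
        for key, payload in map_partition(index, records):
            (edges if key == "edge" else done).append(payload)
    return sorted(done + reduce_edges(edges))
```

In a real job the "done" records could be written out under any key, while every "edge" record would carry the same key so that one reducer sees all of them.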
On 3/29/12 4:02 AM, "banermatt" <[EMAIL PROTECTED]> wrote:
I'm developing a log file anomaly detection system on a Hadoop cluster.
I'm looking for a way to process a query like: "select all values when
value>threshold for a duration>30 seconds". Do you know a tool which could
help me to process such a query?
I read up on the scripting languages Pig, Hive and Jaql, which seem to have
very similar applications. I tried them but I was not able to do what I
Thank you in advance,