Wednesday, March 7, 2012

Real-Time Data Mining Discussion

I am about to prepare a paper on the field of real-time data mining. Real-time here means incrementally training an existing model as soon as new data arrives.

There are a number of papers introducing algorithms for incremental association analysis, incremental clustering, and so on. Stream mining is a closely related field. The main reasons for implementing incremental algorithms are a) the large amount of data to be mined and b) the high rate at which new data arrives every day.

Using classical batch mining algorithms, models that have become outdated for some reason would have to be re-trained, which can be very time-consuming for billions of records. And once the training is completed, it would have to be restarted yet again because a new bulk of data has arrived in the meantime.
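To make the contrast concrete, here is a minimal sketch of the incremental idea: a running mean that folds each new record into the existing estimate in O(1) time, instead of re-scanning the full history the way a batch re-training would. The class name and structure are purely illustrative, not taken from any particular library.

```python
class RunningMean:
    """Toy incremental model: a mean updated record-by-record."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        # Incorporate one new observation without touching older records.
        self.n += 1
        self.mean += (x - self.mean) / self.n


m = RunningMean()
for x in [2.0, 4.0, 6.0]:
    m.update(x)
print(m.mean)  # 4.0 -- same result as a batch mean over all three values
```

The same pattern (keep sufficient statistics, update them per record) is what incremental clustering and incremental association-analysis algorithms generalize to richer model structures.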

The question that I would like to discuss now is: For what real world applications would it be a meaningful or even essential to use real-time training of models?

Two main reasons could determine the answer to that question:

    You just want to incorporate new data into existing models in order to increase the prediction accuracy of your model, or

    Your underlying data is subject to more or less massive changes (also referred to as concept drift) and you want to adapt your mining model continuously to that new reality.

I'm looking for some examples or ideas where one of these cases applies and it would be a good idea to involve incremental mining algorithms.

I'm looking forward to inspiring some discussion on that issue.

Whenever you model a control system (like validation edits for an application process), you get the ability to tune the controls to stop unwanted behavior. Users that are subject to the new controls will, over time, begin to understand the controls and start to look for weaknesses in them that make their input tasks easier to accomplish. This may lead to new "unwanted" behaviors, and it would be great to have a control model that learns on the fly and adjusts when it identifies new unwanted behaviors.

If you haven't already prepared the paper, here's another potential application. Let's say you are doing data mining on stock price movements. You're passing in some sort of stock price history and relating it to day of week, day of month, month, year of presidency, moon cycle, what have you. Statistics generally show, for example, that stock prices move up on Fridays more often than on Mondays, and this is thought to be due to short sellers covering their positions so they aren't caught by unexpected events over the weekend.

Let's say you want to update these statistics daily, shortly after market close, to keep your trading strategies up-to-date with current market conditions. You don't want to retrain the model with 100+ years of stock data every day, so it'd be much faster to be able to do incremental updates. This becomes particularly important for options and futures trading (though there's not 100 years of data for that), as for every underlying security there are potentially dozens or even hundreds of contracts trading on the market.
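The daily-update idea in this comment can be sketched with per-weekday counters that are bumped once after each market close, rather than recomputed over a century of prices. The class and its methods are hypothetical, invented here just to illustrate the incremental bookkeeping:

```python
from collections import defaultdict


class WeekdayUpStats:
    """Incremental up-move statistics keyed by weekday."""

    def __init__(self):
        self.up = defaultdict(int)     # count of up-days per weekday
        self.total = defaultdict(int)  # count of all days per weekday

    def update(self, weekday, prev_close, close):
        # One O(1) update per new trading day -- no historical re-scan.
        self.total[weekday] += 1
        if close > prev_close:
            self.up[weekday] += 1

    def up_fraction(self, weekday):
        t = self.total[weekday]
        return self.up[weekday] / t if t else None


stats = WeekdayUpStats()
stats.update("Fri", 100.0, 101.5)  # Friday up-move
stats.update("Fri", 101.5, 100.2)  # Friday down-move
print(stats.up_fraction("Fri"))  # 0.5
```

The same per-key counter scheme scales naturally to the options and futures case, where each of the dozens or hundreds of contracts per underlying simply gets its own set of counters.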
