Outlier Detection from Time-series Data Stream by Leveraging Change Detection

Motivation:
Outlier detection is one of the most interesting areas of data mining. It has many applications, such as intrusion detection, medical anomaly detection, and sensor anomaly detection. Detecting outliers is challenging in newer data types such as data streams, spatio-temporal data, and time-series data, and effective and efficient methods are needed to tackle these challenges. Identifying and analyzing outliers in a given time series is important in many applications, because outliers often appear as peaks, and peaks are useful topological features of a time series. In power distribution data, peaks indicate sudden high demand. In server CPU utilization data, peaks indicate a sharp increase in workload. In network data, peaks correspond to bursts in traffic. In financial data, peaks indicate an abrupt rise in price or volume.
Outlier detection has long been used to detect and, where appropriate, remove anomalous observations from data. Outliers arise from mechanical faults, changes in system behavior, fraudulent behavior, instrument error, and similar causes. In this paper, we propose a method to identify such errors and remove their contaminating effect on the data set, and so purify the data for processing. Earlier outlier detection methods were largely arbitrary, but principled and systematic techniques, drawn from the full gamut of Computer Science and Statistics, are now used. We also survey contemporary techniques for outlier detection, identify their respective motivations, and distinguish their advantages and disadvantages in a comparative review.
Outlier (or anomaly) detection is a general challenge in computer science, and undetected outliers can cause problems that are hard to resolve. In large systems, outlier detection is especially important because it affects the behavior of the system as a whole. Effective algorithms exist that detect anomalies without requiring any change to the system, but there is still a need for faster and more cost-effective algorithms. I therefore think that developing a system that can detect anomalies or outliers correctly and within a short period of time will be very helpful for the field of data mining. I find this topic not only challenging but also extremely interesting and useful as a subject for my research.
Research Proposal:
Introduction:
Outlier detection is one of the most interesting areas in the context of Data Mining/Knowledge discovery. Outlier detection is also referred to as anomaly detection, event detection, novelty detection, deviant discovery, fault detection, intrusion detection, or misuse detection [GGAH14].
Moreover, a subtle difference between the definitions of outlier and anomaly is mentioned in
[Agg13b, p. 4]:
“outlier refers to a data point, which could either be considered an abnormality or noise,
whereas an anomaly refers to a special kind of outlier, which is of interest to an analyst.”
Figure 1: The spectrum from normal data to outliers [Agg13b]
Here, we will use the terms outlier and anomaly interchangeably. Some well-established
definitions of outliers are:
“An outlying observation or ‘outlier’ is one that appears to deviate markedly from other
members of the sample in which it occurs.” [Gru69]
“An outlier is an observation which deviates so much from the other observations as to
arouse suspicions that it was generated by a different mechanism.” [Haw80]
“an observation (or a set of observations) which appears to be inconsistent with the remainder
of that set of data” [BL94]
These seemingly vague definitions cover a broad spectrum of outliers, which provides the opportunity to define an outlier differently in each application domain. Outlier detection is therefore the process of effectively detecting outliers according to a particular definition of outlier, and a general-purpose outlier detection technique is highly unlikely to exist.
Several books provide an extensive overview of this field. [HKP11, Ch. 12] and [Agg15, Ch. 9-10]
provide a broad overview of outlier detection, but the most comprehensive book on outlier
detection is [Agg13b]. There are also several excellent surveys in the literature, such as [HA04,
CBK09, KKZ09]. Some surveys focus on a particular domain: [ZMH10, MSME15]
cover outlier detection methods for wireless sensor networks.
Figure 2: Taxonomy of outlier detection in WSN [ZMH10]
[CBK12] covers topics related to discrete sequences. [SG14] presents the research issues
of outlier detection for data streams. For temporal/time-series data, [Fu11, EA12, GGAH14]
provide a detailed overview of the topic.
Figure 3: Taxonomy of outlier detection in temporal data [GGAH14]
Moreover, [Gam10, Ch. 11] and [Agg13b, Ch. 8] provide an overview of outlier detection for
time-series data streams.
In general, outlier detection techniques can be categorized into several groups: (i) statistical
methods; (ii) nearest-neighbor methods; (iii) classification methods; (iv) clustering methods;
(v) information-theoretic methods; and (vi) spectral decomposition methods [CBK09, ZMH10].
On the other hand, [KKZ09] categorizes outlier detection techniques into (i) statistical tests;
(ii) depth-based methods; (iii) deviation-based methods; (iv) distance-based methods; (v)
density-based methods; and (vi) high-dimensional methods. Each method has its strengths
and weaknesses, and the choice of method largely depends on the application domain.
An anomaly detection problem has been identified as having four main aspects [CBK09]. The first
is the nature of the data, such as univariate vs. multivariate or discrete vs. continuous. The second
is the availability of data labels, which determines whether the problem can be treated with a
supervised, semi-supervised, or unsupervised method. The third is the type of anomaly: point,
contextual, or collective. Recently, a new type called the contextual collective anomaly
has been proposed in [JZXL14]. The fourth is the output of the anomaly detection method, which
is produced as scores or labels.
Recently, the research direction of outlier detection has been moving towards “Outlier Ensembles”,
following the influential paper of the same title by Charu Aggarwal [Agg13c]. Moreover, [ZCS14]
extends the research issues for outlier ensembles with a focus on unsupervised methods, and
[MMA14] emphasizes combining techniques from both supervised and unsupervised approaches to
leverage the idea of outlier ensembles.
Literature Review:
Data Stream vs. Time-series:
Data streams have brought a new setting to computing: processing a stream of data as
opposed to static, multiple-access data. Data streams are temporally ordered, fast changing, and
potentially infinite. Wireless sensor network traffic, telecommunications, online transactions
in the financial market or retail industry, web click streams, video surveillance, and weather or
environment monitoring are some sources of data streams. Because such data typically cannot be
stored in full in a data repository, the effective and efficient management and online analysis of
data streams bring new challenges.
Knowledge discovery from data streams is a broad topic covered in several books,
such as [Agg07, Gam10], [LRU14, Ch. 4], and [Agg15, Ch. 12]. As sensor data is one of the sources of
data streams, an extensive analysis from this perspective can be found in [GGO+08, Agg13a].
In many application domains, a data stream includes a temporal attribute, with each data
point carrying either an implicit or an explicit timestamp. Real-time sensor data, medical data,
and mechanical system diagnosis data are such examples. These are also examples of time-series data.
Traditionally, it is assumed that time-series data can be stored easily and that established offline
analysis and mining methods can be applied. In a streaming setting, however, the focus shifts
towards online data mining, and this requirement makes the offline algorithms infeasible.
In [Agg13b, p. 260], it is observed that the problems of outlier detection in streaming time-series
data and in multidimensional data streams are very different. The former requires the
analysis of each series as a unit, whereas the latter requires the analysis of each multidimensional
point as a unit.
Outlier detection in a time-series can be divided into two categories: values at specific time
stamps are classified as outliers because of sudden changes (contextual anomalies), or entire
time-series or large subsequences within a time series are classified as outliers because of their
unusual shapes (collective anomalies) [Agg13b, p. 227].
In what follows, we use the terms time-series data stream and streaming time-series
data interchangeably.
Time-series Data Stream:
We are particularly motivated by three research issues raised in the context of data streams [SG14]:
“Research Issue 2- A data point has to be compared with the other data points with same
temporal context (occurred within the time period which is semantically related to the timestamp
of the data point).”
“Research Issue 6- An outlier detection technique for data streams should not assume any
kind of fixed data distribution.”
“Research Issue 14- An outlier detection technique for multiple data streams should be able
to compare data points with the same or different schemas in order to detect outliers.”
Change Detection in Data Stream:
Another important task in the processing of time-series data streams is change detection. For
temporal data, the task of change detection is closely related to anomaly detection, but the two
are distinct:
“It should be emphasized that change analysis and outlier detection (in temporal data) are
very closely related areas, but not necessarily identical” [Agg13b, p. 25].
Figure 4: Different types of change [GZB+14]
The following modes of change have been identified in the literature: concept drift
(gradual change) and concept shift (abrupt change). [Gam10, Ch. 3] and [Agg07, Ch. 5] each devote
a separate chapter to change detection for data streams. Detecting concept drift is more
difficult than detecting concept shift. [SG09, GZB+14] provide an extensive overview of detecting
concept change. In contrast with anomaly detection, concept drift detection compares two
distributions rather than comparing a given data point against a model prediction.
A sliding window of the most recent examples is usually maintained and then compared
against the learned hypothesis or performance indicators, or even just against a previous time window.
Much of the difference between the algorithms below lies in the way the sliding windows of recent
examples are maintained and in the types of statistical tests performed (except for CVFDT),
though some algorithms, notably the ADWIN family, allow different statistical tests to be used.
In particular, the statistical tests range from a comparison of the means of old and new data, to order
statistics [KBDG04], sequential hypothesis testing [MvdBW07], velocity density estimation
[Agg03], a density test method [SWJR07], and Kullback-Leibler (KL) divergence [DKVY06]. Many
of the results specifically address multidimensional data. Different tests are suitable for different
situations; [DKP11] compares the applicability of several of the above-mentioned tests.
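The simplest of these tests, a comparison of the means of old and new data, can be sketched as follows. This is only an illustrative sketch, not a method taken from the cited works: the function names, the fixed reference window, the window size, and the use of a two-sample Hoeffding bound as the decision threshold are all assumptions made here, and values are assumed to lie in a known range.

```python
import math
from collections import deque


def hoeffding_epsilon(n_ref, n_cur, delta, value_range=1.0):
    """Two-sample Hoeffding threshold for the difference of two window means.

    With probability at least 1 - delta, the means of two samples drawn from
    the same bounded distribution differ by less than this value.
    """
    return value_range * math.sqrt(
        (1.0 / n_ref + 1.0 / n_cur) * math.log(2.0 / delta) / 2.0
    )


def first_change_index(stream, window=50, delta=0.01, value_range=1.0):
    """Return the index at which the sliding window first deviates from the
    reference window, or None if no change is flagged (hypothetical helper)."""
    reference = []                   # first `window` items, kept fixed
    current = deque(maxlen=window)   # most recent `window` items
    for i, x in enumerate(stream):
        if len(reference) < window:
            reference.append(x)
            continue
        current.append(x)
        if len(current) < window:
            continue
        eps = hoeffding_epsilon(len(reference), len(current), delta, value_range)
        mean_ref = sum(reference) / len(reference)
        mean_cur = sum(current) / len(current)
        if abs(mean_cur - mean_ref) > eps:
            return i
    return None
```

The detection is delayed until enough new values have entered the sliding window to move its mean past the threshold, which illustrates why window maintenance and the choice of test dominate the behavior of such detectors.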
The following is a sample of algorithms for detecting concept drift. Publicly available
implementations exist for some of them: in particular, the MOA software environment
for online learning from evolving data streams (http://moa.cms.waikato.ac.nz/) incorporates
the ADWIN family of algorithms mentioned below.
1. CUSUM/PH test: Probably the oldest algorithm for change detection, CUSUM maintains
a cumulative sum of (adjusted) examples seen so far: $g_0 = 0$ and $g_t = \max(0,\, g_{t-1} + (r_t - v))$
in its simplest form (assuming only positive change), where $r_t$ is the value observed at time $t$
and $v$ is the magnitude of change that is tolerated. Whenever the cumulative sum $g_t$
exceeds a given threshold, a change is detected. A similar idea with a different cumulative
variable is used in the Page-Hinkley (PH) test.
2. CVFDT: The CVFDT [HSD01] algorithm is an early algorithm that proposed an incremental
approach for building and maintaining a decision tree (Hoeffding tree) in the
face of changes or concept drift that occur in a data stream environment. This algorithm
does not need an external classifier: it checks the incoming data against the decision tree
it is maintaining, and when that tree does not adequately describe the data, it switches to an
alternative tree. A number of implementations are available.
3. ADWIN: A common theme amongst change detection algorithms is maintaining a sliding
window of new or relevant data. Bifet and Gavaldà [BG07] proposed an adaptive windowing
scheme called ADWIN; the second version, ADWIN2, is now available, as well as a version
with a Kalman filter. In ADWIN, the detection of change is based on statistical methods, in
particular on the use of the Hoeffding bound. An implementation of ADWIN is available
at http://adaptive-mining.sourceforge.net/?page_id=20; ADWIN and k-ADWIN
are incorporated into http://moa.cms.waikato.ac.nz/.
4. OnePassSampler: Recently, a faster algorithm named OnePassSampler has been proposed
[SPK13]. This algorithm does not perform the extensive within-window comparisons of
ADWIN; instead, it uses a sequential hypothesis testing strategy. The statistical test involves
computing sample means and using the Bernstein bound to estimate the error. It appears to
perform well in terms of false positive and true positive rates; however, its detection
delay is higher.
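As an illustration of the first algorithm in the list, the one-sided CUSUM recursion can be sketched in a few lines. The function name, the parameter names, and the choice to reset the cumulative sum after each alarm are assumptions made for this sketch; the input values play the role of the observations r_t.

```python
def cusum_detect(values, v, threshold):
    """One-sided CUSUM: g_0 = 0, g_t = max(0, g_{t-1} + (r_t - v)).

    Returns the list of time indices at which g_t exceeded `threshold`.
    Resetting g to zero after an alarm is a design choice of this sketch.
    """
    g = 0.0
    alarms = []
    for t, r in enumerate(values):
        g = max(0.0, g + (r - v))   # accumulate only positive deviations
        if g > threshold:
            alarms.append(t)
            g = 0.0                 # restart monitoring after the alarm
    return alarms
```

A larger v makes the test ignore small deviations, while a larger threshold trades a lower false alarm rate for a longer detection delay.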
Proposed Research Methodology:
Contextual (Point) Anomaly Detection Framework:
Input: A univariate time-series data stream $X = \{x_1, x_2, x_3, \ldots, x_{t-1}, x_t, \ldots\}$, where each measurement
has an explicit or implicit timestamp associated with it.
Output: A decision on whether $x_{t+1}$ is an anomaly (based on the definition of anomaly for the
specific domain).
Assumptions: i) No ground truth is available, which makes supervised techniques less applicable.
ii) Near real-time anomaly detection is needed, which makes offline methods infeasible. That
is, detection for $x_{t+1}$ must be completed before the arrival of $x_{t+2}$.
iii) Only domains where the data arrival rate is within a certain limit are considered. This makes the
second assumption fairly relaxed.
Contextual anomaly detection methods for the aforementioned setting are typically
deviation-based [Agg13b, p. 229]. We are instead interested in using a non-parametric statistical method
within a sliding window for online anomaly detection. Moreover, we are interested in using an
external change detection mechanism to detect concept drift (gradual change), so that we
can adapt to changes in the underlying data distribution when detecting anomalies.
Unified techniques for change point and outlier detection are presented in [TY06, KS09,
SZLH13], while the use of a change detection mechanism for outlier detection is presented in
[BPZ+09, PBZ+10]. The primary motivation of that work, however, was not anomaly detection but better
model prediction in the presence of concept drift (further review needed). We, on the other
hand, are interested in adapting the general framework for model prediction in [PBZ+10] with
slight modifications:
Input: $X = \{x_1, x_2, x_3, \ldots, x_{t-1}, x_t, \ldots\}$.
1) Use ADWIN2 [BG07] to detect the concept change point $c$ (issues: replacing outliers and
normalization).
2) Learn the model $F(x)$ from $X_{new} = \{x_c, \ldots, x_t\}$.
3) Different values of the confidence parameter $\delta$ may be used for ensembles.
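To make the first two steps concrete, the sketch below shows the core idea of adaptive windowing: the oldest items are dropped for as long as some split of the window shows a significant difference between the means of its two parts. This is a simplified, exhaustive version written for illustration, not the bucket-based ADWIN2 implementation of [BG07]; the function name is hypothetical, values are assumed to lie in [0, 1], and the cut threshold follows the Hoeffding-style bound as we understand it from [BG07].

```python
import math


def adwin_style_shrink(window, delta=0.002):
    """Return the suffix of `window` that remains after dropping old items.

    An item is dropped from the old end whenever some split (W0 | W1) of the
    current window has |mean(W0) - mean(W1)| above the cut threshold.
    """
    window = list(window)
    changed = True
    while changed and len(window) >= 2:
        changed = False
        n = len(window)
        total = sum(window)
        head_sum = 0.0
        for split in range(1, n):
            head_sum += window[split - 1]
            n0, n1 = split, n - split
            mean0 = head_sum / n0
            mean1 = (total - head_sum) / n1
            m = 1.0 / (1.0 / n0 + 1.0 / n1)
            # Threshold with delta' = delta / n, as assumed for this sketch.
            eps_cut = math.sqrt(math.log(4.0 * n / delta) / (2.0 * m))
            if abs(mean0 - mean1) > eps_cut:
                window = window[1:]   # drop the oldest item and re-check
                changed = True
                break
    return window
```

Under these assumptions, the change point c can be read off as the number of dropped items, that is, the original window length minus the length of the returned list, and the returned list plays the role of X_new. Different values of `delta` give more or less aggressive shrinking, which is what an ensemble over confidence values would exploit.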
Building on these steps, our outlier detection framework will be:
1) Remove obvious outliers from $X_{new}$ using the Z-value test (or another suitable method) to make
the next model more robust [Agg13b, p. 125].
2) Apply a non-parametric statistical method such as Kernel Density Estimation (KDE) [Sil86]
to detect anomalies.
3) Use only the first window of data as the training set to model normal behavior with respect
to the context (within the window).
4) The general KDE algorithm has $O(n^2)$ computational complexity, but once the model is
learned, the computational cost of outlier detection for each item is very low. A more efficient
method may still be needed.
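The framework steps above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the final method: all function names are hypothetical, the kernel is Gaussian, the bandwidth uses the normal-reference rule of thumb, and the density threshold is taken as a low quantile of the training densities, none of which is fixed by the proposal.

```python
import math
import statistics


def remove_obvious_outliers(window, z_max=3.0):
    """Step 1: drop points whose Z-value exceeds z_max."""
    mu = statistics.fmean(window)
    sigma = statistics.pstdev(window)
    if sigma == 0:
        return list(window)
    return [x for x in window if abs(x - mu) / sigma <= z_max]


def rule_of_thumb_bandwidth(sample):
    """Normal-reference bandwidth h = 1.06 * sigma * n^(-1/5) (assumed choice)."""
    return 1.06 * statistics.stdev(sample) * len(sample) ** (-1 / 5)


def kde_density(x, sample, h):
    """Step 2: Gaussian kernel density estimate at x, O(n) per query."""
    norm = 1.0 / (len(sample) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in sample)


def fit_density_floor(sample, h, quantile=0.01):
    """Step 3: learn a density threshold from the training window only.

    Scoring every training point costs O(n^2), matching step 4.
    """
    densities = sorted(kde_density(x, sample, h) for x in sample)
    return densities[int(quantile * len(densities))]


def is_anomaly(x, sample, h, density_floor):
    """Flag x when its estimated density falls below the learned floor."""
    return kde_density(x, sample, h) < density_floor
```

Once the bandwidth and the density floor have been learned from the first window, each new measurement needs only a single density evaluation, which is what keeps the per-item cost low.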
We see the following issues and questions for research.
The research questions are:
- How can the system be made to work more accurately?
- How can the system be more efficient?
- How does the system differ from other algorithms?
- How should changes that appear over time be dealt with?
The sub-questions are:
- Would the system be user friendly?
- Would the system be cost effective?
- Would the system be able to find exactly correct results?
References:
[Agg03] Charu C Aggarwal. A framework for diagnosing changes in evolving data streams.
In Proceedings of the 2003 ACM SIGMOD international conference on Management
of data, pages 575-586. ACM, 2003.
[Agg07] Charu C Aggarwal. Data streams: models and algorithms, volume 31. Springer,
2007.
[Agg13a] Charu C Aggarwal. Managing and mining sensor data. Springer Science & Business
Media, 2013.
[Agg13b] Charu C Aggarwal. Outlier analysis. Springer Science & Business Media, 2013.
[Agg13c] Charu C Aggarwal. Outlier ensembles: position paper. ACM SIGKDD Explorations
Newsletter, 14(2):49-58, 2013.
[Agg15] Charu C Aggarwal. An introduction to data mining. In Data Mining, pages 1-26.
Springer, 2015.
[BG07] Albert Bifet and Ricard Gavaldà. Learning from time-changing data with adaptive
windowing. In SDM, volume 7, page 2007. SIAM, 2007.
[BL94] Vic Barnett and Toby Lewis. Outliers in statistical data, volume 3. Wiley New
York, 1994.
[BPZ+09] Jorn Bakker, Mykola Pechenizkiy, I Žliobaitė, Andriy Ivannikov, and Tommi
Kärkkäinen. Handling outliers and concept drift in online mass flow prediction
in CFB boilers. In Proceedings of the Third International Workshop on Knowledge
Discovery from Sensor Data, pages 13-22. ACM, 2009.
[CBK09] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A
survey. ACM Computing Surveys (CSUR), 41(3):15, 2009.
[CBK12] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection for
discrete sequences: A survey. Knowledge and Data Engineering, IEEE Transactions
on, 24(5):823-839, 2012.
[DKP11] Tamraparni Dasu, Shankar Krishnan, and Gina Maria Pomann. Robustness of
change detection algorithms. In Advances in Intelligent Data Analysis X, pages
125-137. Springer, 2011.
[DKVY06] Tamraparni Dasu, Shankar Krishnan, Suresh Venkatasubramanian, and Ke Yi.
An information-theoretic approach to detecting changes in multi-dimensional data
streams. In Proc. Symp. on the Interface of Statistics, Computing Science, and
Applications, 2006.
[EA12] Philippe Esling and Carlos Agon. Time-series data mining. ACM Computing
Surveys (CSUR), 45(1):12, 2012.
[Fu11] Tak-chung Fu. A review on time series data mining. Engineering Applications of
Artificial Intelligence, 24(1):164-181, 2011.
[Gam10] João Gama. Knowledge Discovery from Data Streams. Chapman and Hall / CRC
Data Mining and Knowledge Discovery Series. CRC Press, 2010.
[GGAH14] Manish Gupta, Jing Gao, Charu Aggarwal, and Jiawei Han. Outlier detection
for temporal data. Synthesis Lectures on Data Mining and Knowledge Discovery,
5(1):1-129, 2014.
[GGO+08] Auroop R Ganguly, João Gama, Olufemi A Omitaomu, Mohamed Gaber, and
Ranga Raju Vatsavai. Knowledge discovery from sensor data. CRC Press, 2008.
[Gru69] Frank E Grubbs. Procedures for detecting outlying observations in samples.
Technometrics, 11(1):1-21, 1969.
[GZB+14] João Gama, Indrė Žliobaitė, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid
Bouchachia. A survey on concept drift adaptation. ACM Computing Surveys
(CSUR), 46(4):44, 2014.
[HA04] Victoria J Hodge and Jim Austin. A survey of outlier detection methodologies.
Artificial Intelligence Review, 22(2):85-126, 2004.
[Haw80] Douglas M Hawkins. Identification of outliers, volume 11. Springer, 1980.
[HKP11] Jiawei Han, Micheline Kamber, and Jian Pei. Data Mining: Concepts and
Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 3rd edition,
2011.
[HSD01] Geoff Hulten, Laurie Spencer, and Pedro Domingos. Mining time-changing data
streams. In Proceedings of the seventh ACM SIGKDD international conference
on Knowledge discovery and data mining, pages 97-106. ACM, 2001.
[JZXL14] Yexi Jiang, Chunqiu Zeng, Jian Xu, and Tao Li. Real time contextual collective
anomaly detection over multiple data streams. 2014.
[KBDG04] Daniel Kifer, Shai Ben-David, and Johannes Gehrke. Detecting change in data
streams. In Proceedings of the Thirtieth international conference on Very large
data bases-Volume 30, pages 180-191, 2004.
[KKZ09] Hans-Peter Kriegel, Peer Kröger, and Arthur Zimek. Outlier detection techniques.
In Tutorial at the 13th Pacific-Asia Conference on Knowledge Discovery and Data
Mining, 2009.
[KS09] Yoshinobu Kawahara and Masashi Sugiyama. Change-point detection in time-series
data by direct density-ratio estimation. In SDM, volume 9, pages 389-400.
SIAM, 2009.
[LRU14] Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. Mining of Massive
Datasets. Cambridge University Press, 2014.
[MMA14] Barbora Micenková, Brian McWilliams, and Ira Assent. Learning outlier ensembles:
The best of both worlds - supervised and unsupervised. 2014.
[MSME15] Dylan McDonald, Stewart Sanchez, Sanjay Madria, and Fikret Ercal. A survey of
methods for finding outliers in wireless sensor networks. Journal of Network and
Systems Management, 23(1):163-182, 2015.
[MvdBW07] S Muthukrishnan, Eric van den Berg, and Yihua Wu. Sequential change detection
on data streams. In Data Mining Workshops, 2007. ICDM Workshops 2007.
Seventh IEEE International Conference on, pages 551-550. IEEE, 2007.
[PBZ+10] Mykola Pechenizkiy, Jorn Bakker, I Žliobaitė, Andriy Ivannikov, and Tommi
Kärkkäinen. Online mass flow prediction in CFB boilers with explicit detection
of sudden concept drift. ACM SIGKDD Explorations Newsletter, 11(2):109-116,
2010.
[SG09] Raquel Sebastião and João Gama. A study on change detection methods. In 4th
Portuguese Conf. on Artificial Intelligence, Lisbon, 2009.
[SG14] Shiblee Sadik and Le Gruenwald. Research issues in outlier detection for data
streams. ACM SIGKDD Explorations Newsletter, 15(1):33-40, 2014.
[Sil86] Bernard W Silverman. Density estimation for statistics and data analysis, volume
26. CRC press, 1986.
[SPK13] Sripirakas Sakthithasan, Russel Pears, and Yun Sing Koh. One pass concept
change detection for data streams. In Advances in Knowledge Discovery and Data
Mining, pages 461-472. Springer, 2013.
[SWJR07] Xiuyao Song, Mingxi Wu, Christopher Jermaine, and Sanjay Ranka. Statistical
change detection for multi-dimensional data. In Proceedings of the 13th ACM
SIGKDD international conference on Knowledge discovery and data mining, pages
667-676. ACM, 2007.
[SZLH13] Wei-xing Su, Yun-long Zhu, Fang Liu, and Kun-yuan Hu. On-line outlier and
change point detection for time series. Journal of Central South University,
20:114-122, 2013.
[TY06] Jun-ichi Takeuchi and Kenji Yamanishi. A unifying framework for detecting outliers
and change points from time series. Knowledge and Data Engineering, IEEE
Transactions on, 18(4):482-492, 2006.
[ZCS14] Arthur Zimek, Ricardo JGB Campello, and Jörg Sander. Ensembles for unsupervised
outlier detection: challenges and research questions a position paper. ACM
SIGKDD Explorations Newsletter, 15(1):11-22, 2014.
[ZMH10] Yang Zhang, Nirvana Meratnia, and Paul Havinga. Outlier detection techniques
for wireless sensor networks: A survey. Communications Surveys & Tutorials,
IEEE, 12(2):159-170, 2010.