High Dimensional Outlier Detection

Authors' Names

Omid Khormali

Presentation Type

Oral Presentation

Abstract/Artist Statement

In statistics and data science, outliers are data points that differ greatly from the other values in a data set. They matter when analyzing large data sets because they can distort how the data as a whole is perceived. It is therefore important to detect outliers and deal with them adequately. Recently, [V. Menon and S. Kalyani, Structured and Unstructured Outlier Identification for Robust PCA: A Non iterative, Parameter free Algorithm, arXiv:1809.04445v1] presented a novel outlier detection algorithm that a) does not require knowledge of the outlier fraction, b) does not require knowledge of the dimension of the underlying subspace, c) is computationally simple and fast, and d) can handle both structured and unstructured outliers. In this research, we improve this algorithm by reducing its complexity from O(n^2 m) to O((n/log(n))^2 m), where n is the number of data points and m is the dimension of the space.
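
To give a sense of scale for the stated complexity reduction, the following is a minimal sketch that compares the two operation counts. It assumes a natural logarithm and ignores constant factors, neither of which the abstract specifies; the function names are hypothetical and the code does not implement either algorithm.

```python
import math


def original_cost(n: int, m: int) -> float:
    """Operation count proportional to n^2 * m (constants ignored)."""
    return float(n) ** 2 * m


def reduced_cost(n: int, m: int) -> float:
    """Operation count proportional to (n / log n)^2 * m.

    Assumption: natural logarithm; the abstract does not state the base.
    """
    return (n / math.log(n)) ** 2 * m


def speedup_factor(n: int) -> float:
    """Ratio original_cost / reduced_cost, which simplifies to (log n)^2."""
    return math.log(n) ** 2


if __name__ == "__main__":
    m = 50  # illustrative dimension, not taken from the source
    for n in (1_000, 100_000, 10_000_000):
        ratio = original_cost(n, m) / reduced_cost(n, m)
        print(f"n={n}: ratio={ratio:.1f}, (log n)^2={speedup_factor(n):.1f}")
```

Under these assumptions the ratio does not depend on m and grows as (log n)^2, so the saving becomes more pronounced as the number of data points increases.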

Mentor Name

Brian Steele

Feb 22nd, 9:20 AM Feb 22nd, 9:35 AM

UC 333
