Title

Visualizing Communication: Pattern Recognition on the Enron E-mail Corpus

Presentation Type

Poster

Abstract

Despite rapid advances in computer hardware in recent years, the methods used to view and process large, complicated sets of information have changed little. E-mail inboxes are an excellent example: people have difficulty discerning patterns and relationships in data presented as a raw list, and the principal axis of e-mail (time received) is particularly unhelpful. The list nevertheless remains the prevailing layout, simply because messages have too many dimensions for any single attribute to serve as a meaningful ordering. An efficient way to overcome this is multidimensional analysis: by combining multiple features into composite ones, we map the e-mail onto a lower-dimensional manifold for navigation and visualization. We seek to arrange the data into distinct categories or clusters based on the reduced representation, and we compare this approach to a recent Dirichlet-based method.

This project analyzes a representative subset of the approximately 500,000 e-mails, spanning 150 users, from the Federal Energy Regulatory Commission's investigation into the Enron Corporation. The research applies and compares a set of commonly used pattern recognition techniques to discover topical clusters in the corpus of unstructured text. As in the literature, each document is represented as a unigram bag-of-words feature vector over a subset of the most common terms across all messages. For dimensionality reduction, we apply and compare the traditional linear methods of Principal Components Analysis (PCA) and Multidimensional Scaling (MDS), and run k-means clustering on the messages in each reduced representation. The clustered data is then reduced further in dimension and visualized for accessible comparison. We also build a generative Latent Dirichlet Allocation (LDA) topic model, a recent innovation in the literature, on the same unigram features and compare its performance against the dimensionality-reduction and clustering methods.
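
The bag-of-words, PCA, and k-means steps described above can be illustrated with a minimal sketch. This is not the project's implementation: it assumes NumPy is available, the function names (bag_of_words, pca, kmeans) are hypothetical, and the six toy messages are invented stand-ins rather than Enron data. MDS and LDA are omitted for brevity.

```python
from collections import Counter

import numpy as np


def bag_of_words(docs, vocab_size):
    """Unigram counts over the vocab_size most common terms in the corpus."""
    tokenized = [doc.lower().split() for doc in docs]
    totals = Counter(term for doc in tokenized for term in doc)
    vocab = [term for term, _ in totals.most_common(vocab_size)]
    index = {term: j for j, term in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for i, doc in enumerate(tokenized):
        for term in doc:
            if term in index:  # terms outside the vocabulary are dropped
                X[i, index[term]] += 1
    return X, vocab


def pca(X, n_components):
    """Project mean-centered rows of X onto the top principal directions."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T


def kmeans(Z, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm with randomly chosen initial centers."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        dists = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center; keep the old one if its cluster is empty.
        new_centers = np.array([
            Z[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


# Invented toy messages (NOT from the Enron corpus): two rough topics.
docs = [
    "gas pipeline price gas",
    "pipeline gas price",
    "gas price pipeline contract",
    "meeting lunch schedule",
    "lunch meeting friday",
    "schedule meeting lunch friday",
]

X, vocab = bag_of_words(docs, vocab_size=8)  # documents x terms count matrix
Z = pca(X, n_components=2)                   # reduced representation
labels, centers = kmeans(Z, k=2)             # cluster in the reduced space
print(vocab)
print(labels)
```

Clustering in the reduced space rather than on the raw counts mirrors the order of operations in the abstract. Note that classical MDS applied to Euclidean distances recovers the same coordinates as PCA up to sign, so any difference between the two representations would depend on the distance measure or MDS variant used, which the abstract does not specify.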

Apr 12th, 11:00 AM Apr 12th, 12:00 PM

UC Ballroom
