Azqa Nadeem | Real-time Sequence Clustering

Description:

In cybersecurity, malicious behavior is known to evolve rapidly to evade detection. For instance, network traffic must be continuously monitored to accurately characterize malware capabilities at any given time.

This creates the need for a real-time sequence clustering approach that can handle evolving data distributions (i.e., due to concept drift). Existing real-time clustering algorithms have limited support for sequential data. Besides, a highly efficient algorithm is required to cluster the large volumes of network traffic required for real-time behavioral analytics. Furthermore, the algorithms that support sequential data do not provide interpretability, which is a problem in itself since the analysts need to be able to understand the evolving behaviors captured by a cluster. There is currently no way to cluster large sequential datasets in real-time while supporting interpretability and concept drift.

In this project, we aim to develop the first interpretable clustering algorithm that efficiently clusters sequences in real-time (Nadeem & Verwer, 2022). Specifically, we aim to create a linear and real-time version of the k-medoids algorithm. This algorithm should be able to summarize network traffic, i.e., represent important concepts using its k medoids. We will then investigate whether such an algorithm improves the quality of malware capability assessment compared to standard approaches.

Code repo: SECLEDS

References

ECML

SECLEDS: Sequence Clustering in Evolving Data Streams via Multiple Medoids
and Medoid Voting

Nadeem, Azqa, and Verwer, Sicco

In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), 2022

Abstract PDF Code Slides

Sequence clustering in a streaming environment is challenging because it is computationally expensive, and the sequences may evolve over time. K-medoids or Partitioning Around Medoids (PAM) is commonly used to cluster sequences since it supports alignment-based distances, and the k-centers being actual data items helps with cluster interpretability. However, offline k-medoids has no support for concept drift, while also being prohibitively expensive for clustering data streams. We therefore propose SECLEDS, a streaming variant of the k-medoids algorithm with constant memory footprint. SECLEDS has two unique properties: i) it uses multiple medoids per cluster, producing stable highquality clusters, and ii) it handles concept drift using an intuitive Medoid Voting scheme for approximating cluster distances. Unlike existing adaptive algorithms that create new clusters for new concepts, SECLEDS follows a fundamentally different approach, where the clusters themselves evolve with an evolving stream. Using real and synthetic datasets, we empirically demonstrate that SECLEDS produces high-quality clusters regardless of drift, stream size, data dimensionality, and number of clusters. We compare against three popular stream and batch clustering algorithms. The state-of-the-art BanditPAM is used as an offline benchmark. SECLEDS achieves comparable F1 score to BanditPAM while reducing the number of required distance computations by 83.7%. Importantly, SECLEDS outperforms all baselines by 138.7% when the stream contains drift. We also cluster real network traffic, and provide evidence that SECLEDS can support network bandwidths of up to 1.08 Gbps while using the (expensive) dynamic time warping distance.