Description:
In cybersecurity, malicious behavior is known to evolve rapidly to evade detection.
For instance, network traffic must be continuously monitored to accurately characterize malware capabilities at any given time.
This creates the need for a real-time sequence clustering approach that can handle evolving
data distributions (i.e., due to concept drift). Existing real-time clustering algorithms have limited support for sequential data.
Besides, a highly efficient algorithm is required to cluster the large volumes of network traffic required for real-time behavioral analytics.
Furthermore, the algorithms that support sequential data do not provide interpretability, which is a problem in itself since the
analysts need to be able to understand the evolving behaviors captured by a cluster. There is currently no way to cluster large sequential datasets
in real-time while supporting interpretability and concept drift.
In this project, we aim to develop the first interpretable clustering algorithm that efficiently clusters sequences in real-time (Nadeem & Verwer, 2022).
Specifically, we aim to create a linear and real-time version of the k-medoids algorithm. This algorithm should be able to summarize network traffic,
i.e., represent important concepts using its k medoids. We will then investigate whether such an algorithm improves the quality of malware capability assessment
compared to standard approaches.
Code repo: SECLEDS
References
SECLEDS: Sequence Clustering in Evolving Data Streams via Multiple Medoids
and Medoid Voting
Nadeem, Azqa,
and Verwer, Sicco
In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD),
2022
Sequence clustering in a streaming environment is challenging because it is computationally expensive, and the sequences may evolve over time. K-medoids or Partitioning Around Medoids (PAM)
is commonly used to cluster sequences since it supports alignment-based distances, and the k-centers being actual data items helps with cluster
interpretability. However, offline k-medoids has no support for concept drift, while also being prohibitively expensive for clustering data streams.
We therefore propose SECLEDS, a streaming variant of the k-medoids
algorithm with constant memory footprint. SECLEDS has two unique
properties: i) it uses multiple medoids per cluster, producing stable highquality
clusters, and ii) it handles concept drift using an intuitive Medoid
Voting scheme for approximating cluster distances. Unlike existing adaptive
algorithms that create new clusters for new concepts, SECLEDS follows
a fundamentally different approach, where the clusters themselves
evolve with an evolving stream. Using real and synthetic datasets, we
empirically demonstrate that SECLEDS produces high-quality clusters
regardless of drift, stream size, data dimensionality, and number of clusters.
We compare against three popular stream and batch clustering algorithms.
The state-of-the-art BanditPAM is used as an offline benchmark.
SECLEDS achieves comparable F1 score to BanditPAM while reducing
the number of required distance computations by 83.7%. Importantly,
SECLEDS outperforms all baselines by 138.7% when the stream contains
drift. We also cluster real network traffic, and provide evidence
that SECLEDS can support network bandwidths of up to 1.08 Gbps
while using the (expensive) dynamic time warping distance.