Agentless Monitoring for Continuous Real-Time Analytics
Traditional agent-based monitoring solutions are not portable
enough for cloud native monitoring. Instead, software engineers are
increasingly embracing CNCF-graduated Prometheus for its easy deployment
and metrics exposition, a large number of third-party exporters, active data scraping,
multi-dimensional data model, flexible PromQL, single node storage,
and built-in alerting/visualization support. However, Prometheus lacks horizontal scalability,
multi-tenancy, and distributed storage. We are solving these problems by
leveraging Prometheus' remote write API, through which
we receive data from multiple Prometheus instances to ensure horizontal scalability
and multi-tenancy. We also store metrics for our time series analytics.
Unlike other standard open source projects such as M3 and Cortex,
our monitoring platform is optimized for continuous real-time analytics.
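
To make the multi-tenant ingestion idea concrete, here is a minimal sketch in Python. It assumes the remote write payloads have already been decoded into label sets and (timestamp, value) samples, and that each request carries a tenant identifier. The `TenantStore` class and its method names are hypothetical and are not part of Prometheus or of our platform's API.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Sample = Tuple[int, float]               # (timestamp in ms, value)
SeriesKey = Tuple[Tuple[str, str], ...]  # sorted (label, value) pairs


class TenantStore:
    """Hypothetical in-memory store: tenant -> label set -> samples."""

    def __init__(self) -> None:
        self._data: Dict[str, Dict[SeriesKey, List[Sample]]] = defaultdict(
            lambda: defaultdict(list)
        )

    def ingest(self, tenant: str, labels: Dict[str, str],
               samples: List[Sample]) -> None:
        # Sorting the labels gives every series a stable identity,
        # regardless of which Prometheus instance sent it.
        key = tuple(sorted(labels.items()))
        self._data[tenant][key].extend(samples)

    def query(self, tenant: str, **matchers: str) -> Dict[SeriesKey, List[Sample]]:
        """Return only this tenant's series whose labels equal every matcher."""
        result: Dict[SeriesKey, List[Sample]] = {}
        for key, samples in self._data.get(tenant, {}).items():
            labels = dict(key)
            if all(labels.get(k) == v for k, v in matchers.items()):
                result[key] = sorted(samples)  # order samples by timestamp
        return result


if __name__ == "__main__":
    store = TenantStore()
    # Two Prometheus instances, one per tenant, writing the same metric name.
    store.ingest("team-a", {"__name__": "up", "job": "api"}, [(1000, 1.0)])
    store.ingest("team-b", {"__name__": "up", "job": "api"}, [(1000, 0.0)])
    print(store.query("team-a", job="api"))
```

Keying the store by tenant first means a query can never see another tenant's series, even when metric names and labels collide. A production system would replace the in-memory dictionaries with distributed storage.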
Time Series Anomaly Monitoring for Early Alarm Detection
Traditional health rule-based failure detection is increasingly
losing efficacy in handling complex seasonal patterns and irregular
spikes in microservices. Additionally, writing and maintaining health rules for a Kubernetes-based
production system creates significant operational overhead.
Our semi-supervised real-time anomaly detection algorithm recognizes
early failure symptoms without the need for health rules and opens up
the opportunity for preventive resolution. The hard machine learning challenges we are solving are related
to data drift, multi-scale contextual data, and the absence of labelled data.
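
As a point of reference for what rule-free, label-free detection can look like, the sketch below flags outliers in a stream with a rolling median and median absolute deviation (MAD). This is a simple, well-known baseline and not our semi-supervised algorithm. The class name, window size, and threshold are illustrative assumptions.

```python
from collections import deque
from statistics import median


class RollingMadDetector:
    """Baseline streaming detector (an assumption, not the production method)."""

    def __init__(self, window: int = 30, threshold: float = 3.5,
                 min_history: int = 10) -> None:
        self.history = deque(maxlen=window)  # sliding window of recent values
        self.threshold = threshold
        self.min_history = min_history

    def update(self, value: float) -> bool:
        """Score the new value against recent history, then remember it."""
        is_anomaly = False
        if len(self.history) >= self.min_history:
            med = median(self.history)
            mad = median(abs(x - med) for x in self.history)
            if mad > 0:
                # 0.6745 rescales MAD so the score is comparable to a
                # z-score when the data are roughly normal.
                score = 0.6745 * abs(value - med) / mad
            else:
                score = 0.0 if value == med else float("inf")
            is_anomaly = score > self.threshold
        # Always appending lets the window follow slow drift; the median
        # and MAD stay robust to the occasional spike it absorbs.
        self.history.append(value)
        return is_anomaly


if __name__ == "__main__":
    values = [10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 10.3,
              10.1, 9.9, 10.0, 10.2, 25.0, 10.1]
    detector = RollingMadDetector()
    flagged = [i for i, v in enumerate(values) if detector.update(v)]
    print(flagged)
```

No health rule or labelled example is needed here: the threshold is relative to each series' own recent behaviour. A fixed window does not handle seasonality or multi-scale context, which is where the harder challenges listed above come in.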
Unified Time Series Anomaly Comparison for Fault Isolation
Fault isolation is a hard problem in a distributed environment because of complex and dynamic interdependencies between
different operating layers. Our time series anomaly detection algorithm comes with powerful slicing and
dicing capabilities to help software engineers efficiently identify dynamic interdependencies between different time series coming
from different operating layers (e.g., container-level metrics and cluster-level metrics). To solve this problem, we are tapping
into our continuous time series anomaly detection capability that works across all different kinds of time series data.
Time Series Relevance Ranking for Fast Alarm Triaging
Manual alarm triaging does not scale for large-scale applications and introduces biases.
Our time series relevance ranking algorithm triages alarms at scale by tapping into interdependencies
in streaming data. A high relevance rank means a high relative influence on the streaming data.
So, an alarm generated from
a time series with a high relevance rank would be prioritized in our design.
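
One way to picture ranking by relative influence is a PageRank-style score over a dependency graph of time series. The sketch below is an illustration under assumptions: the text above does not specify how relevance is computed, the edge list is hand-written for the example (an edge `(a, b)` means "a is influenced by b"), and the metric names are hypothetical.

```python
from typing import Dict, List, Tuple


def pagerank(edges: List[Tuple[str, str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Power-iteration PageRank; rank flows from a series to what influences it."""
    nodes = sorted({n for edge in edges for n in edge})
    out: Dict[str, List[str]] = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            # A node with no outgoing edges spreads its rank evenly.
            targets = out[n] or nodes
            share = damping * rank[n] / len(targets)
            for t in targets:
                new[t] += share
        rank = new
    return rank


def triage(alarms: List[str], scores: Dict[str, float]) -> List[str]:
    """Order alarms so the most influential series comes first."""
    return sorted(alarms, key=lambda name: scores.get(name, 0.0), reverse=True)


if __name__ == "__main__":
    edges = [
        ("checkout_latency", "db_cpu"),
        ("search_latency", "db_cpu"),
        ("cart_errors", "checkout_latency"),
        ("db_cpu", "node_disk_io"),
    ]
    scores = pagerank(edges)
    print(triage(["search_latency", "db_cpu", "node_disk_io"], scores))
```

In this toy graph the series that many others depend on accumulate the highest score, so their alarms are surfaced before alarms on leaf-level symptoms.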
Time Series Similarity Ranking for Fast Alarm Resolution
A common alarm resolution strategy involves standard statistical correlation detection algorithms (Pearson, Kendall, Spearman),
which are not sufficiently accurate for time series
data analyses. Another strategy, time series cross-correlation, often requires manual exploratory data analysis.
Our time series similarity ranking resolves this trade-off between accuracy and efficiency.
It clusters streaming data based on the pairwise optimal time series distances.
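
To illustrate ranking by pairwise time series distance, the sketch below uses dynamic time warping (DTW), a classic distance that finds the optimal alignment between two series and so tolerates lag. DTW is an assumption chosen for illustration; the text above does not name the distance we use. The clustering step is omitted, and the series names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # stretch b
                                    cost[i][j - 1],      # stretch a
                                    cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def rank_by_similarity(query: Sequence[float],
                       candidates: Dict[str, Sequence[float]]
                       ) -> List[Tuple[str, float]]:
    """Return (name, distance) pairs, most similar first."""
    scored = [(name, dtw_distance(query, s)) for name, s in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    alarm_series = [0, 0, 1, 5, 1, 0, 0, 0]
    candidates = {
        "lagged_spike": [0, 0, 0, 0, 1, 5, 1, 0],  # same shape, two steps later
        "flat": [0, 0, 0, 0, 0, 0, 0, 0],
        "ramp": [0, 1, 2, 3, 4, 5, 6, 7],
    }
    for name, distance in rank_by_similarity(alarm_series, candidates):
        print(name, distance)
```

The lagged spike aligns with the alarm series at zero cost even though the two are shifted in time, which is the kind of relationship a pointwise correlation coefficient tends to understate.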
Touch-Controlled AI for Lower False Positives in Time Series Analytics
Traditional black-box AI algorithms used for time series analytics lead to a high number of false positives because they
fail to accept dynamic context in real time. Our touch-controlled AI solves this problem by accepting user feedback from the
click of a mouse or the tap of a finger. Such a declarative approach facilitates seamless user feedback and improves
predictive fidelity for time series anomaly detection and ranking.