Autonomous Time Series Analytics

Our mission is to organize application and infrastructure metrics for enterprises and make them useful for outage risk mitigation and performance optimization.

$10K/min revenue loss

due to app degradation

10% bounce rate

per second of app delay

25% of engineering time

spent on incidents

Raw time series from microservices carry deep insight into application health and optimization opportunities. However, extracting that insight is hard for current tools, which depend on manual benchmarking and agent-based architectures. With our time series algorithms and agentless architecture, we help enterprises extract that insight and improve application performance.

Agentless Monitoring for Continuous Real-Time Analytics

Traditional agent-based monitoring solutions are not portable enough for cloud native monitoring. Instead, software engineers are increasingly embracing CNCF-graduated Prometheus for its easy deployment and metrics exposition, large number of third-party exporters, active data scraping, multi-dimensional data model, flexible PromQL, single-node storage, and built-in alerting and visualization support. However, Prometheus lacks horizontal scalability, multi-tenancy, and distributed storage. We solve these problems through Prometheus' remote write API: we receive data from multiple Prometheus instances, which gives us horizontal scalability and multi-tenancy, and we store the metrics for our time series analytics. Unlike other open source projects such as M3 and Cortex, our monitoring platform is optimized for continuous real-time analytics.
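As a rough illustration of how samples arriving from many Prometheus instances could be spread across ingest nodes per tenant, here is a minimal sketch. The functions `series_key` and `shard_for`, the tenant names, and the modulo-hash scheme are all assumptions made for this example; they do not describe our actual ingestion path or the remote write wire format.

```python
import hashlib


def series_key(tenant, labels):
    """Build a stable identity for a series; label order must not matter."""
    parts = [f"{name}={labels[name]}" for name in sorted(labels)]
    return tenant + "|" + ",".join(parts)


def shard_for(tenant, labels, num_shards):
    """Map one tenant's series to one of num_shards hypothetical ingest nodes."""
    digest = hashlib.sha256(series_key(tenant, labels).encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_shards


labels = {"__name__": "http_requests_total", "job": "api"}
shard = shard_for("team-a", labels, num_shards=4)  # same input always gives the same shard
```

Including the tenant in the key keeps each tenant's series separate even when label sets are identical. A production system would more likely use consistent hashing so that adding a node moves fewer series.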

Time Series Anomaly Monitoring for Early Alarm Detection

Traditional failure detection based on health rules is losing efficacy in handling the complex seasonal patterns and irregular spikes found in microservices. Additionally, writing and maintaining health rules for a Kubernetes-based production system demands significant operating overhead. Our semi-supervised real-time anomaly detection algorithm recognizes early failure symptoms without the need for health rules and opens up the opportunity for preventive resolution. The hard machine learning challenges we are solving are related to data drift, multi-scale contextual data, and the absence of labelled data.
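To make the idea of rule-free detection concrete, the sketch below flags points that deviate strongly from a trailing window using a median/MAD modified z-score. This is a simple, fully unsupervised stand-in chosen for illustration, not our semi-supervised algorithm; the function name, window size, and threshold are assumptions.

```python
from collections import deque
from statistics import median


def rolling_mad_anomalies(values, window=20, threshold=3.5):
    """Return indices whose value deviates strongly from the trailing window."""
    history = deque(maxlen=window)
    anomalies = []
    for i, x in enumerate(values):
        if len(history) == window:
            med = median(history)
            mad = median(abs(v - med) for v in history)
            # Guard against a flat window, where the MAD is zero.
            scale = mad if mad > 0 else 1e-9
            score = 0.6745 * abs(x - med) / scale  # modified z-score
            if score > threshold:
                anomalies.append(i)
                continue  # keep the outlier out of the baseline
        history.append(x)
    return anomalies


pattern = [10, 11, 10, 12, 11, 10, 11, 12, 10, 11]
series = pattern * 3 + [50] + pattern
print(rolling_mad_anomalies(series, window=10))  # [30]
```

No threshold on the raw metric is written by hand here: the baseline is learned from recent data. A sketch this small does not handle seasonality or drift, which is where the harder problems named above begin.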

Unified Time Series Anomaly Comparison for Fault Isolation

Fault isolation is a hard problem in a distributed environment because of the complex and dynamic interdependencies between operating layers. To address it, we build on our continuous time series anomaly detection capability, which works across all kinds of time series data. It comes with slicing and dicing capabilities that help software engineers efficiently identify dynamic interdependencies between time series from different operating layers (e.g., container-level metrics and cluster-level metrics).

Time Series Relevance Ranking for Fast Alarm Triaging

Manual alarm triaging does not scale for large-scale applications and introduces biases. Our time series relevance ranking algorithm triages alarms at scale by tapping into interdependencies in streaming data. A high relevance rank means a high relative influence on the streaming data, so an alarm generated from a highly ranked time series is prioritized in our design.
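One way to picture relevance ranking is to score each series by how strongly it moves with all the others and to sort alarms by that score. The sketch below uses total absolute Pearson correlation as a deliberately simple proxy for influence; the scoring rule, function names, and sample data are assumptions for illustration, not our ranking algorithm.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def relevance_scores(series):
    """Score each named series by its total absolute correlation with the others."""
    return {
        name: sum(abs(pearson(values, other))
                  for peer, other in series.items() if peer != name)
        for name, values in series.items()
    }


def triage(alarms, series):
    """Order alarm source names so the most relevant series comes first."""
    scores = relevance_scores(series)
    return sorted(alarms, key=lambda name: scores[name], reverse=True)


metrics = {
    "cpu": [1, 2, 3, 4, 5],
    "latency": [2, 4, 6, 8, 10],
    "disk": [3, 3, 3, 3, 3],
}
print(triage(["disk", "cpu"], metrics))  # ['cpu', 'disk']
```

In this toy data the constant `disk` series is unrelated to everything else, so an alarm on it is ranked below an alarm on `cpu`.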

Time Series Similarity Ranking for Fast Alarm Resolution

A common alarm resolution strategy relies on standard statistical correlation measures (Pearson, Kendall, Spearman), which are not sufficiently accurate for time series analysis. Another strategy, time series cross-correlation, is more accurate but often requires manual exploratory data analysis. Our time series similarity ranking resolves this trade-off between accuracy and efficiency. It clusters streaming data based on pairwise optimal time series distances.
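As an illustration of ranking by a pairwise time series distance, the sketch below uses dynamic time warping (DTW), which tolerates time shifts that plain correlation penalizes. DTW is an assumed choice for this example; the text above does not say which distance we use, and the sketch only ranks candidates rather than clustering them.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def rank_by_similarity(target, candidates):
    """Return candidate names sorted from most to least similar to target."""
    return sorted(candidates, key=lambda name: dtw_distance(target, candidates[name]))


target = [0, 0, 1, 2, 1, 0, 0]
candidates = {
    "flat": [0, 0, 0, 0, 0, 0, 0],
    "inverted": [0, 0, -1, -2, -1, 0, 0],
    "shifted": [0, 1, 2, 1, 0, 0, 0],
}
print(rank_by_similarity(target, candidates))  # ['shifted', 'flat', 'inverted']
```

The shifted copy of the spike has a DTW distance of zero to the target, so it ranks first even though the two series are misaligned point by point.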

Touch-Controlled AI for Lower False Positives in Time Series Analytics

Traditional black box AI algorithms used for time series analytics lead to a high number of false positives because they cannot accept dynamic context in real time. Our touch-controlled AI solves this problem by accepting user feedback with the click of a mouse or the tap of a finger. Such a declarative approach facilitates seamless user feedback and improves predictive fidelity for time series anomaly detection and ranking.
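A very small sketch of the feedback idea: each click or tap is treated as a verdict on one scored point and nudges the detection threshold accordingly. The class `FeedbackThreshold`, its parameters, and the additive update rule are hypothetical and only illustrate the loop, not how our models incorporate feedback.

```python
class FeedbackThreshold:
    """Anomaly threshold that moves in response to user verdicts."""

    def __init__(self, threshold=3.0, step=0.5, floor=1.0, ceiling=10.0):
        self.threshold = threshold
        self.step = step
        self.floor = floor
        self.ceiling = ceiling

    def is_anomaly(self, score):
        return score > self.threshold

    def feedback(self, score, is_real_incident):
        """Apply one user verdict for a point with the given anomaly score."""
        flagged = self.is_anomaly(score)
        if flagged and not is_real_incident:
            self.threshold += self.step  # false positive: be less sensitive
        elif not flagged and is_real_incident:
            self.threshold -= self.step  # missed incident: be more sensitive
        # Keep the threshold inside its allowed range.
        self.threshold = min(self.ceiling, max(self.floor, self.threshold))


detector = FeedbackThreshold()
print(detector.is_anomaly(3.2))                # True
detector.feedback(3.2, is_real_incident=False)  # user marks it as a false alarm
print(detector.is_anomaly(3.2))                # False
```

Verdicts that agree with the detector leave the threshold unchanged, so only corrections change its behavior.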