Category Direction - Fleet Visibility

The following page may contain information related to upcoming products, features and functionality. It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes. Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features or functionality remain at the sole discretion of GitLab Inc.


Stage	Verify
Maturity	Complete
Content Last Reviewed	`2024-07-18`

Introduction and how you can help

Thanks for visiting this direction page on the Fleet Visibility category at GitLab. This page belongs to the Runner Group within the Verify Stage and is maintained by Darren Eastman.

Strategy and Themes

Our vision is that, as customers integrate AI more deeply into their development processes, they can manage a GitLab Runner fleet at scale from one unified dashboard and have deep visibility into CI/CD pipeline execution metrics that correlate easily with the specific CI/CD build environment.

Adequate Fleet Visibility starts with providing at-a-glance insights into all the statuses (online, offline, stale) of all runner-build servers in your organization. Fleet Visibility will allow you to determine the group or project a runner is associated with while also surfacing critical metrics such as runner build queue performance, failure rates, and the most heavily used runners.

This category provides platform administrators and developers with the metrics they need to identify CI pipeline performance or reliability issues and determine which component to focus on when using trial-and-error approaches to optimization. This category aims to integrate the proper CI build fleet and pipeline metrics within the GitLab UI to eliminate developer pain points with automated CI/CD builds that negatively impact productivity. Those pain points include developers trying to determine if and how to optimize CI job duration or troubleshoot CI/CD job failures due to the build environment.

By correlating the insights provided to platform administrators at the admin or group levels exposed in the Fleet Dashboard with the CI/CD job and pipeline execution metrics (runner queue depth, job duration trends, job failure rates, pipeline reliability) at the project level exposed in CI Insights, organizations will spend less time building custom observability reporting dashboards for GitLab CI/CD pipelines. Developers and platform administrators will have a shared understanding of CI/CD performance trends across the platform.

For executives, as the AI-powered GitLab DevSecOps platform enables your development teams to deliver secure software faster, Fleet Visibility provides your operations team the visibility they need to operate a CI/CD build infrastructure at scale cost-effectively and efficiently on any public or private cloud infrastructure.

In short, we aim to ensure you can operate a build fleet supporting an AI DevSecOps platform at scale efficiently and cost-effectively.

1 year plan

Runner Fleet Dashboard

The Runner Fleet Dashboard - Admin View: Starter Metrics was released in 16.5. The included metrics widgets for the initial release are as follows:

Fleet Health (Postgres DB)
Top Active Runners (Postgres DB)
Wait Time to Pick up a Job (ClickHouse DB)
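
As a rough illustration of what a wait-time metric such as the one listed above summarizes, the sketch below computes nearest-rank percentiles over a list of queue wait times in seconds. It is not the dashboard's implementation: the function names, the choice of percentiles, and the sample values are all assumptions made for this example.

```python
def nearest_rank_percentile(values, pct):
    """Return the nearest-rank percentile of a non-empty list.

    pct is an integer in the range 1..100.
    """
    if not values:
        raise ValueError("values must not be empty")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def wait_time_summary(queued_seconds, percentiles=(50, 90, 99)):
    """Summarize job pickup wait times (hypothetical helper) as percentiles."""
    return {f"p{p}": nearest_rank_percentile(queued_seconds, p) for p in percentiles}


# Hypothetical sample: seconds each job waited before a runner picked it up.
sample_waits = [2.0, 3.5, 1.0, 40.0, 5.0, 4.0, 2.5, 3.0, 6.0, 120.0]
print(wait_time_summary(sample_waits))
```

Percentiles are used here rather than an average because a small number of very long waits, like the 120-second value in the sample, would otherwise dominate the summary.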

We followed this up by releasing the Runner Fleet Dashboard for Groups in GitLab 17.1. With this release, customers on GitLab.com and self-managed instances can manage runner fleets at the group level. We have heard from many customers that they need this capability and APIs to augment the current observability tooling and custom dashboards they rely on to monitor and operate GitLab CI/CD and Runners.

The next major goal for FY25 is to ensure that we have a supported solution for self-managed customers who need to use the Runner Fleet Dashboard, and specifically the more advanced metrics, such as the wait time to pick up a job, that rely on ClickHouse, an open-source columnar database that provides fast query performance for large datasets.

With the release of the Fleet Dashboard for Groups in 17.1, we plan to use the next few milestones to gather and analyze customer feedback. That feedback will guide the next evolution of the Fleet Dashboard strategy for Q4 FY25 and into FY26. While we have already identified additional metrics, such as runner failure trends, that could be valuable to include in the dashboard, recent customer feedback suggests that simply extending the metrics data model and enabling customers to create their own reports and visualizations is likely the most valuable future iteration.

Regarding prediction, one prevalent theme in customer conversations is determining when there may be a slowdown in runner queue performance. Another critical pain point often cited by customers is configuring the Runner Fleet to find the optimal balance between compute costs and developer efficiency, as measured by reduced CI/CD job durations. These are classic prediction problems, so we aim to explore whether we can reduce the cost of prediction and fleet operational costs for our customers by incorporating ML/AI into the Fleet Dashboard. With ClickHouse as the database layer and a new analytics database table structure for Runner Fleet, we believe the foundational elements are in place to make this next evolution a reality in FY26.

Many customer conversations have also shown that operating a CI/CD build fleet on Kubernetes can be complex. Platform engineering teams sometimes must spend months configuring and optimizing the Kubernetes environment to run their organization's CI/CD jobs reliably. In Runner Core, we have seen the immense value to platform teams of simply enabling the printing of Kubernetes events directly in the CI/CD job log. How can we expose the right metrics for the Kubernetes CI/CD build infrastructure in the Fleet Dashboard? Will doing so simplify the operational burden for those customers who use Kubernetes as the CI/CD build infrastructure? We aim to explore these questions and strategies in depth as we plan the Fleet Dashboard's future roadmap.

CI/CD Insights

The first phase in the unified Fleet Visibility strategy is solving the critical visibility problems developers and platform administrators have with using GitLab CI/CD. Those visibility problems include the fact that there is no built-in report in GitLab CI/CD analytics for a developer to determine if a CI job is running as expected from a duration perspective or whether a CI job is unhealthy, as represented by an increase in job failures. As a result, customers have had to create custom reporting systems or implement third-party observability tools using data exposed in the GitLab jobs API.
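
To make the kind of custom reporting described above concrete, here is a minimal sketch that groups job records by job name and computes a failure rate and mean duration for each. It assumes the records have already been fetched and are dictionaries with `name`, `status`, and `duration` keys, in the style of what the GitLab jobs API returns; the helper name `summarize_jobs` and the sample data are hypothetical, and pagination, authentication, and retries are deliberately left out.

```python
from collections import defaultdict

# Only jobs that ran to completion count toward the failure rate (an assumption
# of this sketch; canceled, skipped, and running jobs are ignored).
FINISHED_STATUSES = {"success", "failed"}


def summarize_jobs(jobs):
    """Return per-job-name run count, failure rate, and mean duration in seconds."""
    runs_by_name = defaultdict(list)
    for job in jobs:
        if job.get("status") in FINISHED_STATUSES:
            runs_by_name[job["name"]].append(job)

    summary = {}
    for name, runs in runs_by_name.items():
        failures = sum(1 for run in runs if run["status"] == "failed")
        durations = [run["duration"] for run in runs if run.get("duration") is not None]
        summary[name] = {
            "runs": len(runs),
            "failure_rate": failures / len(runs),
            "mean_duration_s": sum(durations) / len(durations) if durations else None,
        }
    return summary


# Hypothetical sample records, not real pipeline data.
sample_jobs = [
    {"name": "rspec", "status": "success", "duration": 100.0},
    {"name": "rspec", "status": "failed", "duration": 140.0},
    {"name": "rspec", "status": "success", "duration": 120.0},
    {"name": "rspec", "status": "failed", "duration": 40.0},
    {"name": "lint", "status": "success", "duration": 30.0},
    {"name": "lint", "status": "running", "duration": None},
]

for name, stats in summarize_jobs(sample_jobs).items():
    print(name, stats)
```

A report like this answers "is this job failing more than usual?" only if it is run repeatedly and compared over time, which is part of why customers end up maintaining separate tooling for it.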

Our goal for the second half of FY25 is to improve the GitLab CI/CD analytics view to incorporate the critical metrics our customers tell us they need to use, monitor, and optimize GitLab CI/CD efficiently. That includes providing insights into the pipeline and CI/CD job performance metrics with drill-down capabilities to individual jobs so developers can quickly identify CI/CD jobs with abnormal failure rates, leading to proactive CI/CD job optimization and improved reliability. Providing visibility into all aspects of CI/CD job performance is only the foundation. We intend to develop solutions using this data that seek to eliminate developer pain and frustration related to slow CI/CD job start times, less than optimal CI/CD job duration, and job failures that result in lost developer productivity as developers troubleshoot CI/CD failures instead of working on coding tasks.

What is next for us

In the next three months (August to October), we are focused on the following:

CI/CD Insights

Improve the GitLab CI/CD analytics view

What we are currently working on

In 17.3 (August) and 17.4 (September), we are working on adding the foundational APIs required to deliver the new GitLab CI/CD analytics view.

What we recently completed

In the past three months, we have shipped the following key features:

What is Not Planned Right Now

In the near term, we are not focused on design or development efforts to improve Runners usability in CI/CD settings at the project level.

While improvements in this view could be valuable to the software developer persona, feedback from customers indicates that providing meaningful CI insights that cover vital metrics, such as CI job success and failure rates, job duration, average job retries, and average queue time for each job, is more valuable for customers and is a critical enabler for broader CI adoption.

Best in Class Landscape

BIC (Best In Class) is an indicator of forecasted near-term market performance based on a combination of factors, including analyst views, market news, and feedback from the sales and product teams. It is critical that we understand where GitLab appears in the BIC landscape.

At GitLab, a critical challenge is simplifying the administration and management of a CI/CD build fleet at enterprise scale. This effort is one foundational pillar to realizing the vision of GitLab Duo AI-optimized DevSecOps. Competitors are also investing in this general category. Earlier this year, GitHub announced a new management experience that provides a summary view of GitHub-hosted runners. This signals that the industry will focus on reducing the maintenance and configuration overhead of managing a CI/CD build environment at scale.

We also now see additional features on the GitHub public roadmap signaling an increased investment in the category we coined here at GitLab, 'Runner Fleet.' These features suggest that GitHub aims to provide a first-class experience for managing GitHub Actions runners and to include features in the UI that simplify runner queue management and resolve performance bottlenecks. With this level of planned investment, it is clear that the market recognizes that simplifying the administrative maintenance and overhead of the CI build fleet is critical for large customers and will help enable deeper product adoption.

Indirect competitor Actuated is the first solution that we have seen whose product includes a dashboard for Runners and build queue visibility. This is another strong signal that providing solutions that reduce the CI/CD build infrastructure's management overhead is valuable for organizations with mature DevOps practices.

In the CI Insights arena, a few startups, such as Trunk.io, are providing CI visibility solutions for GitHub Actions. The Datadog CI Visibility product is a mature, full-featured offering that provides CI/CD insights for GitLab CI/CD using the GitLab jobs API as the foundational layer.

To ensure that our GitLab customers can fully realize the value of GitLab's product vision, we must provide solutions that eliminate the complexities, manual tasks, and operational overhead, and reduce the costs, of delivering a CI build environment at scale. Our goal in FY25 is to include good enough Fleet Visibility solutions that customers not yet fully invested in third-party observability or custom tooling can use out of the box to observe, analyze, and optimize CI jobs or troubleshoot CI job failures natively in GitLab.

Key Capabilities

The key capabilities and pain points that customers raise when describing fleet management and CI insights are as follows:

What is the root cause of a CI pipeline failure?
CI/CD job failure rate trends
CI/CD job duration trends
CI/CD job retry rate trends
Runner queue visibility (wait time)
Runner Fleet management metrics
Frictionless upgrades
Security
Cost visibility for runners hosted on public cloud infrastructure
Fleet autoscaling
Fleet cost management while maintaining internal service level objectives (SLOs)
Automatic fleet configuration optimization
Managing runner sprawl
Configuring and managing a heterogeneous runner fleet (container builds on Linux, container builds on Windows, shell builds on Windows, shell builds on macOS)
Self-service runner creation for the developer persona
Automating choosing the right cloud and compute to host a Runner based on CI/CD build performance

Top Competitive Solutions

Runner Fleet is still a nascent category; competitors like GitHub are beginning to invest in this area. On their future roadmap, GitHub plans to introduce seamless management of GitHub-hosted and self-hosted runners. This feature aims to deliver a "single management plane to manage all runners for a team using GitHub." GitHub also plans to offer Actions Performance Metrics to provide organizations with deep insights into critical CI/CD performance metrics. One example of how the cloud infrastructure market can evolve is Active Assist for Google Cloud, a solution for optimizing cloud operations and reducing costs. We can therefore imagine a future where Microsoft and GitHub bring to market AI-based solutions that integrate GitHub Actions with infrastructure on Azure. Our GitLab competitive position is solid in that we will continue to invest in features and capabilities to ensure that customers can use GitLab Runners efficiently on any cloud provider.

In the insights space, Datadog has a CI and Test Visibility offering, and CircleCI has had an insights offering for some time. While there is no native GitHub Actions functionality, there are several offerings in the marketplace for collecting test/run data and displaying it on a dashboard.

Harness.io Software Engineering Insights (https://developer.harness.io/docs/software-engineering-insights) includes several configurable CI/CD job and pipeline metric widgets. The CI/CD job count metric category aims to provide developers with insights into CI/CD job success and failure rates. We also see additional investment planned by Harness.io with their Pipeline Analytics feature slated for Q3 2024 (August 2024+).

In the Test Reduction area, Sealights offers a CutTests solution, and Redefine.dev is a new player in the space that uses AI to reduce future test runs for faster pipelines.