
Flaky Test Detection and Scoring

How FixSense detects flaky tests, assigns flakiness scores, and helps teams systematically eliminate intermittent CI failures.

The Cost of Flaky Tests

Flaky tests — tests that pass and fail without any code changes — are among the biggest drains on CI/CD productivity. Industry research consistently finds:

  • Engineers spend 5-10 hours per week investigating failures that turn out to be flaky
  • Teams with high flaky rates re-run CI pipelines 3-5x more often, wasting compute and time
  • Flaky tests erode trust in CI — developers start ignoring failures, letting real regressions slip through
  • 2-15% of tests in a typical project are flaky at any given time

The worst part: flaky tests are hard to find. A test that fails once every 20 runs looks like a legitimate failure each time it breaks. Without tracking history, every occurrence triggers the same investigation.

What Makes a Test Flaky

Timing and Race Conditions

The most common cause. Tests assume operations complete in a fixed order or within a fixed time:

// Flaky — grabs the element once; a re-render detaches the handle
const button = await page.$('.save-button');
await button.click();

// Stable — the locator re-resolves and auto-waits until the element is actionable
await page.locator('.save-button').click();

Playwright's auto-waiting helps, but complex flows with multiple async operations still create race conditions — especially in CI where execution is slower.
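One frequent race in multi-step flows is clicking a button and then waiting for the network call it triggers: if the response returns quickly, the wait is registered too late and times out. A sketch of the start-the-wait-first pattern (the `/api/save` URL and function name are illustrative):

```javascript
// Start listening for the response *before* triggering the click;
// otherwise a fast response can arrive before the listener exists.
async function saveAndWait(page) {
  const [response] = await Promise.all([
    page.waitForResponse((r) => r.url().includes('/api/save') && r.ok()),
    page.locator('.save-button').click(),
  ]);
  return response;
}
```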

Shared State Between Tests

Tests that depend on data created by previous tests:

  • Database records from a prior test
  • Browser cookies or localStorage set in another test
  • Files created on disk during the test run

When test execution order changes (parallelization, sharding), these dependencies break.

Network Dependencies

Tests that call real external services:

  • Third-party APIs with rate limits or occasional downtime
  • CDNs that return different responses based on region
  • OAuth providers with session timeouts

Environment Variability

  • Time-dependent logic — tests that break at midnight, on weekends, or at month boundaries
  • Locale/timezone differences — date formatting assertions that depend on the CI runner's locale
  • Screen size assumptions — responsive layouts that render differently in headless mode
  • Resource pressure — tests that pass on a dedicated runner but fail on a shared one under load
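
Locale and timezone flakiness in particular has a cheap fix: assert on a canonical UTC string instead of a locale-formatted date. A minimal sketch (the helper name is hypothetical):

```javascript
// Locale-formatted dates (toLocaleDateString) vary by runner;
// an ISO string in UTC is identical on every machine.
function formatForAssertion(date) {
  return date.toISOString().slice(0, 10); // always YYYY-MM-DD in UTC
}
```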

Manual vs. Automated Detection

Manual Approach

Without tooling, teams detect flaky tests by:

  1. Noticing the same test failing across unrelated PRs
  2. Re-running CI and seeing the test pass without changes
  3. Developers reporting "this test is flaky" in Slack

This is slow, unreliable, and depends on individual memory. Tests can be flaky for months before anyone connects the dots.

Automated Detection with FixSense

FixSense tracks every test failure across all your CI runs and applies pattern detection to identify flakiness automatically.

How FixSense Scores Flakiness

Every failed test receives a flakiness score from 0 to 100:

Score Range   Meaning
0-20          Almost certainly a real failure — consistent pattern, tied to code changes
21-50         Possibly flaky — some intermittent signals but could be a legitimate issue
51-80         Likely flaky — strong intermittent pattern, timing-related indicators
81-100        Almost certainly flaky — classic flaky signatures, fails randomly across PRs

The score is based on multiple signals:

  • Failure consistency — does this test fail every time, or only sometimes?
  • Cross-PR correlation — does it fail across unrelated pull requests?
  • Error pattern analysis — does the error match known flaky signatures (timeouts, race conditions, network errors)?
  • Retry behavior — does the test pass on retry without code changes?
  • Historical trend — has this test been flagged as flaky before?
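
FixSense's actual model and weighting are internal; purely as an illustration, combining these signals into a 0-100 score might look something like this (all field names and weights are invented for the sketch):

```javascript
// Illustrative only — the real FixSense weights are not public.
function flakinessScore(history) {
  let score = 0;
  // Intermittent failure (fails sometimes, not always) is the core signal.
  if (history.failRate > 0 && history.failRate < 0.5) score += 30;
  // Failures across unrelated PRs, capped at 5 occurrences.
  score += Math.min(history.crossPrFailures, 5) * 6;
  // Error text matches a known flaky signature (timeout, race, network).
  if (history.matchesFlakySignature) score += 20;
  // The same commit passed on retry without code changes.
  if (history.passedOnRetry) score += 15;
  // Previously flagged as flaky.
  if (history.flaggedBefore) score += 5;
  return Math.min(score, 100);
}
```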

The flakiness score appears on every analysis card in your dashboard and in PR comments, so you can immediately tell whether a failure needs investigation or is a known flaky test.

Features

Per-Test Flakiness Score

Every analysis includes the flakiness score front and center. When a developer sees a PR comment from FixSense showing a flakiness score of 85, they know they can merge with confidence — the failure is not related to their changes.

Top Failing Tests Dashboard

The dashboard highlights your most frequently failing tests. This is where flaky tests surface naturally — a test that appears in the top 5 week after week is almost certainly flaky and should be prioritized for fixing.

Failure Categories

FixSense categorizes each failure as Regression, Flaky, Test Maintenance, or Environment. The Flaky category combines the flakiness score with error pattern analysis to give a definitive classification.

PR Comments with Context

When a flaky test fails on a PR, the automated comment includes:

  • The flakiness score
  • How many times this test has failed recently
  • Whether the failure is related to the PR's code changes
  • A suggested fix if the flaky pattern is addressable

Trend Tracking

Monitor your overall flaky test rate over time. See whether your stabilization efforts are working or if new flaky tests are being introduced faster than old ones are fixed.

Test Quarantine (Coming Soon)

Automatically quarantine tests with consistently high flakiness scores — they still run but their failures do not block the pipeline. Quarantined tests appear in a separate dashboard section so they stay visible without disrupting development flow.

Reducing Your Flaky Test Rate

Once FixSense identifies your flaky tests, follow this playbook:

1. Fix the Top 5 First

The Pareto principle applies: a small number of tests cause most flaky failures. Start with the tests that have the highest flakiness scores and the highest failure frequency.

2. Replace Hard Waits with Smart Waits

// Bad — arbitrary timeout, still flaky
await page.waitForTimeout(3000);
await expect(page.locator('.data')).toBeVisible();

// Good — waits for the actual condition
await expect(page.locator('.data')).toBeVisible({ timeout: 15000 });

3. Isolate Test Data

Each test should create its own data and clean up after itself. Never depend on data from another test or a shared seed.

4. Mock Unstable Dependencies

External APIs, email services, and payment providers should be mocked in E2E tests. Use Playwright's page.route() to intercept and stub network requests.
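
With page.route(), the test fulfills the request locally so the real provider is never contacted. A sketch, assuming a hypothetical /api/charge endpoint and response shape:

```javascript
// The stub payload is a pure function, so it can be unit-tested;
// the endpoint path and body shape below are hypothetical.
function stubChargeResponse() {
  return {
    status: 200,
    contentType: 'application/json',
    body: JSON.stringify({ id: 'ch_test', status: 'succeeded' }),
  };
}

// Wire the stub into a test (assumes a Playwright `page` fixture).
async function stubPayments(page) {
  await page.route('**/api/charge', (route) => route.fulfill(stubChargeResponse()));
}
```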

5. Set a Team KPI

Track your flaky test rate as a team metric. A healthy target is less than 5% flaky rate. Review the FixSense dashboard weekly in your team standup.

Getting Started

  1. Install FixSense at fix-sense.vercel.app — takes under 2 minutes
  2. Let it collect data — flakiness scoring improves as FixSense sees more CI runs
  3. Review your dashboard weekly — focus on the top failing tests widget
  4. Fix systematically — use the AI-suggested fixes as a starting point