
Flaky Test Detection and Scoring

How FixSense detects flaky tests, assigns flakiness scores, and helps teams systematically eliminate intermittent CI failures.

The Cost of Flaky Tests

Flaky tests — tests that pass and fail without any code changes — are among the biggest drains on CI/CD productivity. Industry research consistently finds:

  • Engineers spend 5-10 hours per week investigating failures that turn out to be flaky
  • Teams with high flaky rates re-run CI pipelines 3-5x more often, wasting compute and time
  • Flaky tests erode trust in CI — developers start ignoring failures, letting real regressions slip through
  • 2-15% of tests in a typical project are flaky at any given time

The worst part: flaky tests are hard to find. A test that fails once every 20 runs looks like a legitimate failure each time it breaks. Without tracking history, every occurrence triggers the same investigation.

What Makes a Test Flaky

Timing and Race Conditions

The most common cause. Tests assume operations complete in a fixed order or within a fixed time:

// Flaky — grabs the element once; a re-render detaches the handle
const button = await page.$('.save-button');
await button.click();

// Stable — the locator re-resolves and auto-waits until the element is actionable
await page.locator('.save-button').click();

Playwright's auto-waiting helps, but complex flows with multiple async operations still create race conditions — especially in CI where execution is slower.
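One frequent race in multi-step flows is clicking a button and then waiting for the network call it triggers: if the response returns quickly, the wait is registered too late and times out. A sketch of the start-the-wait-first pattern (the `/api/save` URL and function name are illustrative):

```javascript
// Start listening for the response *before* triggering the click;
// otherwise a fast response can arrive before the listener exists.
async function saveAndWait(page) {
  const [response] = await Promise.all([
    page.waitForResponse((r) => r.url().includes('/api/save') && r.ok()),
    page.locator('.save-button').click(),
  ]);
  return response;
}
```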

Shared State Between Tests

Tests that depend on data created by previous tests:

  • Database records from a prior test
  • Browser cookies or localStorage set in another test
  • Files created on disk during the test run

When test execution order changes (parallelization, sharding), these dependencies break.

Network Dependencies

Tests that call real external services:

  • Third-party APIs with rate limits or occasional downtime
  • CDNs that return different responses based on region
  • OAuth providers with session timeouts

Environment Variability

  • Time-dependent logic — tests that break at midnight, on weekends, or at month boundaries
  • Locale/timezone differences — date formatting assertions that depend on the CI runner's locale
  • Screen size assumptions — responsive layouts that render differently in headless mode
  • Resource pressure — tests that pass on a dedicated runner but fail on a shared one under load
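
Locale and timezone flakiness in particular has a cheap fix: assert on a canonical UTC string instead of a locale-formatted date. A minimal sketch (the helper name is hypothetical):

```javascript
// Locale-formatted dates (toLocaleDateString) vary by runner;
// an ISO string in UTC is identical on every machine.
function formatForAssertion(date) {
  return date.toISOString().slice(0, 10); // always YYYY-MM-DD in UTC
}
```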

Manual vs. Automated Detection

Manual Approach

Without tooling, teams detect flaky tests by:

  1. Noticing the same test failing across unrelated PRs
  2. Re-running CI and seeing the test pass without changes
  3. Developers reporting "this test is flaky" in Slack

This is slow, unreliable, and depends on individual memory. Tests can be flaky for months before anyone connects the dots.

Automated Detection with FixSense

FixSense tracks every test failure across all your CI runs and applies pattern detection to identify flakiness automatically.

How FixSense Scores Flakiness

Every failed test receives a flakiness score from 0 to 100:

Score Range   Meaning
0-20          Almost certainly a real failure — consistent pattern, tied to code changes
21-50         Possibly flaky — some intermittent signals but could be a legitimate issue
51-80         Likely flaky — strong intermittent pattern, timing-related indicators
81-100        Almost certainly flaky — classic flaky signatures, fails randomly across PRs

The score is based on multiple signals:

  • Failure consistency — does this test fail every time, or only sometimes?
  • Cross-PR correlation — does it fail across unrelated pull requests?
  • Error pattern analysis — does the error match known flaky signatures (timeouts, race conditions, network errors)?
  • Retry behavior — does the test pass on retry without code changes?
  • Historical trend — has this test been flagged as flaky before?
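
FixSense's actual model and weighting are internal; purely as an illustration, combining these signals into a 0-100 score might look something like this (all field names and weights are invented for the sketch):

```javascript
// Illustrative only — the real FixSense weights are not public.
function flakinessScore(history) {
  let score = 0;
  // Intermittent failure (fails sometimes, not always) is the core signal.
  if (history.failRate > 0 && history.failRate < 0.5) score += 30;
  // Failures across unrelated PRs, capped at 5 occurrences.
  score += Math.min(history.crossPrFailures, 5) * 6;
  // Error text matches a known flaky signature (timeout, race, network).
  if (history.matchesFlakySignature) score += 20;
  // The same commit passed on retry without code changes.
  if (history.passedOnRetry) score += 15;
  // Previously flagged as flaky.
  if (history.flaggedBefore) score += 5;
  return Math.min(score, 100);
}
```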

The flakiness score appears on every analysis card in your dashboard and in PR comments, so you can immediately tell whether a failure needs investigation or is a known flaky test.

Features

Per-Test Flakiness Score

Every analysis includes the flakiness score front and center. When a developer sees a PR comment from FixSense showing a flakiness score of 85, they know they can merge with confidence — the failure is not related to their changes.

Top Failing Tests Dashboard

The dashboard highlights your most frequently failing tests. This is where flaky tests surface naturally — a test that appears in the top 5 week after week is almost certainly flaky and should be prioritized for fixing.

Failure Categories

FixSense categorizes each failure as Regression, Flaky, Test Maintenance, or Environment. The Flaky category combines the flakiness score with error pattern analysis to give a definitive classification.

PR Comments with Context

When a flaky test fails on a PR, the automated comment includes:

  • The flakiness score
  • How many times this test has failed recently
  • Whether the failure is related to the PR's code changes
  • A suggested fix if the flaky pattern is addressable

Trend Tracking

Monitor your overall flaky test rate over time. See whether your stabilization efforts are working or if new flaky tests are being introduced faster than old ones are fixed.

Test Quarantine (Coming Soon)

Automatically quarantine tests with consistently high flakiness scores — they still run but their failures do not block the pipeline. Quarantined tests appear in a separate dashboard section so they stay visible without disrupting development flow.

Reducing Your Flaky Test Rate

Once FixSense identifies your flaky tests, follow this playbook:

1. Fix the Top 5 First

The Pareto principle applies: a small number of tests cause most flaky failures. Start with the tests that have the highest flakiness scores and the highest failure frequency.

2. Replace Hard Waits with Smart Waits

// Bad — arbitrary timeout, still flaky
await page.waitForTimeout(3000);
await expect(page.locator('.data')).toBeVisible();

// Good — waits for the actual condition
await expect(page.locator('.data')).toBeVisible({ timeout: 15000 });

3. Isolate Test Data

Each test should create its own data and clean up after itself. Never depend on data from another test or a shared seed.

4. Mock Unstable Dependencies

External APIs, email services, and payment providers should be mocked in E2E tests. Use Playwright's page.route() to intercept and stub network requests.
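
With page.route(), the test fulfills the request locally so the real provider is never contacted. A sketch, assuming a hypothetical /api/charge endpoint and response shape:

```javascript
// The stub payload is a pure function, so it can be unit-tested;
// the endpoint path and body shape below are hypothetical.
function stubChargeResponse() {
  return {
    status: 200,
    contentType: 'application/json',
    body: JSON.stringify({ id: 'ch_test', status: 'succeeded' }),
  };
}

// Wire the stub into a test (assumes a Playwright `page` fixture).
async function stubPayments(page) {
  await page.route('**/api/charge', (route) => route.fulfill(stubChargeResponse()));
}
```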

5. Set a Team KPI

Track your flaky test rate as a team metric. A healthy target is less than 5% flaky rate. Review the FixSense dashboard weekly in your team standup.

Getting Started

  1. Install FixSense at fix-sense.vercel.app — takes under 2 minutes
  2. Let it collect data — flakiness scoring improves as FixSense sees more CI runs
  3. Review your dashboard weekly — focus on the top failing tests widget
  4. Fix systematically — use the AI-suggested fixes as a starting point