Flaky Test Detection and Scoring
How FixSense detects flaky tests, assigns flakiness scores, and helps teams systematically eliminate intermittent CI failures.
The Cost of Flaky Tests
Flaky tests — tests that pass and fail without any code changes — are among the biggest drains on CI/CD productivity. Industry research consistently shows:
- Engineers spend 5-10 hours per week investigating failures that turn out to be flaky
- Teams with high flaky rates re-run CI pipelines 3-5x more often, wasting compute and time
- Flaky tests erode trust in CI — developers start ignoring failures, letting real regressions slip through
- 2-15% of tests in a typical project are flaky at any given time
The worst part: flaky tests are hard to find. A test that fails once every 20 runs looks like a legitimate failure each time it breaks. Without tracking history, every occurrence triggers the same investigation.
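The compounding effect is easy to underestimate. A back-of-envelope sketch (the counts and rates below are illustrative, and failures are assumed independent):

```typescript
// Back-of-envelope: small per-test flake rates compound across a suite.
// Assumes independent failures; 50 tests and 5% are illustrative numbers.
function probAnyFlakeFails(flakyTestCount: number, perTestFailRate: number): number {
  return 1 - Math.pow(1 - perTestFailRate, flakyTestCount);
}

// 50 flaky tests, each failing 5% of the time:
const pipelineRedChance = probAnyFlakeFails(50, 0.05);
// ≈ 0.92: roughly 9 out of 10 full-suite runs go red from flakiness alone
```

Even with individually "rare" flakes, nearly every pipeline run fails, which is why per-suite retries hide the problem rather than fix it.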
What Makes a Test Flaky
Timing and Race Conditions
The most common cause. Tests assume operations complete in a fixed order or within a fixed time:
```typescript
// Flaky — button may not be clickable yet
await page.click('.save-button');

// Stable — waits for the element to be actionable
await page.locator('.save-button').click();
```

Playwright's auto-waiting helps, but complex flows with multiple async operations still create race conditions — especially in CI, where execution is slower.
Shared State Between Tests
Tests that depend on data created by previous tests:
- Database records from a prior test
- Browser cookies or localStorage set in another test
- Files created on disk during the test run
When test execution order changes (parallelization, sharding), these dependencies break.
Network Dependencies
Tests that call real external services:
- Third-party APIs with rate limits or occasional downtime
- CDNs that return different responses based on region
- OAuth providers with session timeouts
Environment Variability
- Time-dependent logic — tests that break at midnight, on weekends, or at month boundaries
- Locale/timezone differences — date formatting assertions that depend on the CI runner's locale
- Screen size assumptions — responsive layouts that render differently in headless mode
- Resource pressure — tests that pass on a dedicated runner but fail on a shared one under load
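Locale and timezone flakiness is the cheapest of these to prevent: pin both explicitly instead of inheriting the CI runner's defaults. A sketch using the standard Intl API (the date and expected output are illustrative):

```typescript
// Pin locale and timezone explicitly so the assertion produces the same
// string on a laptop in Berlin and a CI runner in us-east-1.
const releaseDate = new Date(Date.UTC(2024, 0, 15, 23, 30)); // Jan 15, 23:30 UTC

// Flaky: output depends on the runner's locale and timezone.
// releaseDate.toLocaleDateString()  // "1/15/2024"? "16.1.2024"? Who knows.

// Stable: every environment formats this identically.
const formatted = new Intl.DateTimeFormat('en-US', {
  timeZone: 'UTC',
  year: 'numeric',
  month: 'short',
  day: 'numeric',
}).format(releaseDate);
// formatted === "Jan 15, 2024"
```

The same principle applies to clocks: freeze or inject time in tests rather than reading the runner's wall clock.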
Manual vs. Automated Detection
Manual Approach
Without tooling, teams detect flaky tests by:
- Noticing the same test failing across unrelated PRs
- Re-running CI and seeing the test pass without changes
- Developers reporting "this test is flaky" in Slack
This is slow, unreliable, and depends on individual memory. Tests can be flaky for months before anyone connects the dots.
Automated Detection with FixSense
FixSense tracks every test failure across all your CI runs and applies pattern detection to identify flakiness automatically.
How FixSense Scores Flakiness
Every failed test receives a flakiness score from 0 to 100:
| Score Range | Meaning |
|---|---|
| 0-20 | Almost certainly a real failure — consistent pattern, tied to code changes |
| 21-50 | Possibly flaky — some intermittent signals but could be a legitimate issue |
| 51-80 | Likely flaky — strong intermittent pattern, timing-related indicators |
| 81-100 | Almost certainly flaky — classic flaky signatures, fails randomly across PRs |
The score is based on multiple signals:
- Failure consistency — does this test fail every time, or only sometimes?
- Cross-PR correlation — does it fail across unrelated pull requests?
- Error pattern analysis — does the error match known flaky signatures (timeouts, race conditions, network errors)?
- Retry behavior — does the test pass on retry without code changes?
- Historical trend — has this test been flagged as flaky before?
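FixSense's exact weighting is internal, but a toy sketch shows how signals like these could combine into a 0-100 score (the weights, field names, and thresholds below are invented for illustration, not FixSense's implementation):

```typescript
// Illustrative only: a toy flakiness scorer combining the signals above.
// Weights are invented for this sketch, not FixSense internals.
interface TestSignals {
  intermittent: boolean;          // fails only sometimes on the same commit
  failsAcrossUnrelatedPRs: boolean;
  matchesFlakySignature: boolean; // timeout / race / network error patterns
  passesOnRetry: boolean;         // passes on retry without code changes
  previouslyFlaggedFlaky: boolean;
}

function flakinessScore(s: TestSignals): number {
  let score = 0;
  if (s.intermittent) score += 30;
  if (s.failsAcrossUnrelatedPRs) score += 25;
  if (s.matchesFlakySignature) score += 15;
  if (s.passesOnRetry) score += 20;
  if (s.previouslyFlaggedFlaky) score += 10;
  return Math.min(score, 100);
}

// A test that fails intermittently across PRs and passes on retry
// lands in the "likely flaky" band (51-80).
const score = flakinessScore({
  intermittent: true,
  failsAcrossUnrelatedPRs: true,
  matchesFlakySignature: false,
  passesOnRetry: true,
  previouslyFlaggedFlaky: false,
}); // 75
```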
The flakiness score appears on every analysis card in your dashboard and in PR comments, so you can immediately tell whether a failure needs investigation or is a known flaky test.
Features
Per-Test Flakiness Score
Every analysis includes the flakiness score front and center. When a developer sees a PR comment from FixSense showing a flakiness score of 85, they know they can merge with confidence — the failure is not related to their changes.
Top Failing Tests Dashboard
The dashboard highlights your most frequently failing tests. This is where flaky tests surface naturally — a test that appears in the top 5 week after week is almost certainly flaky and should be prioritized for fixing.
Failure Categories
FixSense categorizes each failure as Regression, Flaky, Test Maintenance, or Environment. The Flaky category combines the flakiness score with error pattern analysis to give a definitive classification.
PR Comments with Context
When a flaky test fails on a PR, the automated comment includes:
- The flakiness score
- How many times this test has failed recently
- Whether the failure is related to the PR's code changes
- A suggested fix if the flaky pattern is addressable
Trend Tracking
Monitor your overall flaky test rate over time. See whether your stabilization efforts are working or if new flaky tests are being introduced faster than old ones are fixed.
Test Quarantine (Coming Soon)
Automatically quarantine tests with consistently high flakiness scores — they still run but their failures do not block the pipeline. Quarantined tests appear in a separate dashboard section so they stay visible without disrupting development flow.
Reducing Your Flaky Test Rate
Once FixSense identifies your flaky tests, follow this playbook:
1. Fix the Top 5 First
The Pareto principle applies: a small number of tests cause most flaky failures. Start with the tests that have the highest flakiness scores and the highest failure frequency.
2. Replace Hard Waits with Smart Waits
```typescript
// Bad — arbitrary timeout, still flaky
await page.waitForTimeout(3000);
await expect(page.locator('.data')).toBeVisible();

// Good — waits for the actual condition
await expect(page.locator('.data')).toBeVisible({ timeout: 15000 });
```

3. Isolate Test Data
Each test should create its own data and clean up after itself. Never depend on data from another test or a shared seed.
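A simple way to enforce this is a helper that namespaces every record to the test that created it, so parallel shards can never collide (the helper below is a sketch, not part of FixSense):

```typescript
// Sketch: give every test its own data namespace instead of a shared seed.
// A timestamp plus a process-local counter keeps records unique; when
// sharding across machines, fold a worker id into the suffix too.
let seq = 0;

function uniqueTestUser(testName: string): { email: string; username: string } {
  const suffix = `${Date.now()}-${++seq}`;
  return {
    email: `${testName}-${suffix}@example.test`,
    username: `${testName}-${suffix}`,
  };
}

const a = uniqueTestUser('checkout-flow');
const b = uniqueTestUser('checkout-flow');
// a.email !== b.email: two runs of the same test cannot collide
```

Pair this with per-test cleanup (or a throwaway database per run) so leftover records never leak into the next test.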
4. Mock Unstable Dependencies
External APIs, email services, and payment providers should be mocked in E2E tests. Use Playwright's page.route() to intercept and stub network requests.
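Keeping the stub payload in a plain function makes it unit-testable before wiring it into page.route() (the endpoint and response fields below are hypothetical, not a real provider's schema):

```typescript
// Sketch: build the stub payload as a plain function, then hand it to
// Playwright's page.route(). Endpoint and fields are hypothetical.
function paymentStub(amountCents: number) {
  return {
    status: 200,
    contentType: 'application/json',
    body: JSON.stringify({ id: 'pay_test_123', amountCents, state: 'succeeded' }),
  };
}

// In a Playwright test:
// await page.route('**/api/payments', route =>
//   route.fulfill(paymentStub(4999))
// );
// The test now never leaves the browser to hit the real payment provider.

const stub = paymentStub(4999);
```

Register routes before navigating so the first request is already intercepted.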
5. Set a Team KPI
Track your flaky test rate as a team metric. A healthy target is less than 5% flaky rate. Review the FixSense dashboard weekly in your team standup.
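The metric itself is simple: flagged-flaky tests divided by total tests. FixSense computes it for you; the sketch below only illustrates the arithmetic:

```typescript
// Sketch: flaky rate = flagged-flaky tests / total tests, as a percentage.
function flakyRatePercent(totalTests: number, flakyTests: number): number {
  if (totalTests === 0) return 0;
  return (flakyTests / totalTests) * 100;
}

// 12 flagged-flaky tests in a 400-test suite:
const rate = flakyRatePercent(400, 12); // 3 → comfortably under the 5% target
```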
Getting Started
- Install FixSense at fix-sense.vercel.app — takes under 2 minutes
- Let it collect data — flakiness scoring improves as FixSense sees more CI runs
- Review your dashboard weekly — focus on the top failing tests widget
- Fix systematically — use the AI-suggested fixes as a starting point