Writing good checks

A check is one natural-language sentence describing something that must be true about a PR. The reviewer agent reads the description on every PR, looks at the diff (and the surrounding code when it needs to), and answers pass, fail, or needs_review with file:line evidence.

The quality of the answer depends almost entirely on the quality of the description.

The shape of a good description

Write the description the way you'd write a comment on a teammate's PR:

Production code under src/ must not contain print() statements. Use the logger from src/observability/logger.py instead.

That sentence does four useful things:

  1. States the rule ("must not contain print()").
  2. Scopes it ("under src/" — not tests, not scripts).
  3. Names the standard ("Use the logger from ...") so the agent can cite a concrete alternative in its evidence.
  4. Stays specific. "Don't log sensitive data" is too broad; "don't pass user.email or user.ssn into logger.info" is checkable.

Status meanings

StatusWhat it means
passThe agent verified the rule holds for this PR's changes.
failThe agent found a concrete violation. The evidence will reference file:line.
needs_reviewThe agent isn't sure — the rule is ambiguous, the evidence is partial, or the code the rule covers wasn't touched in a way the agent could resolve. A human should look.

needs_review is a real outcome, not a synonym for "I didn't try." When you see it, click into the run and read the agent's reasoning — it'll usually tell you what evidence it was missing.

Patterns that work

Anchor to paths, function names, or extensions. The agent has read access to the whole repo and uses paths to scope its search.

Every new file under migrations/ includes a down_revision and a non-trivial downgrade() body.

Say what to do instead. Rules with a positive alternative get cleaner fail evidence.

External HTTP calls use the http_client wrapper from lib/http.ts so retries and timeouts are applied. Don't use fetch or axios directly.

Be explicit about what counts as "new". "New tables" is ambiguous — added in this PR? Added in the last release?

SQL files added or modified under models/staging/ have a matching test file under tests/staging/ with the same basename.

Lean on file conventions you already enforce in review. If your team already says "all React components must be functional, not class-based," that's a great check — you already know how to spot violations, so the agent can too.

Patterns that don't

Vague quality bars. The agent can't tell you whether something is "clean," "readable," or "well-designed." It can tell you whether a specific structural property holds.

"Code is well-tested." → too vague. "Every public function in src/api/handlers/ has at least one assertion in a sibling *_test.py file." → checkable.

Whole-codebase claims. A check evaluates the changes in a PR plus the code they touch. It is not a static analyzer over the whole repo.

"There are no circular imports anywhere." → not a check; use a linter. "New imports added in this PR don't create a cycle with core/." → checkable.

Things that need to run the code. The agent can read code and run shell commands in a sandbox (no network), but it can't reach your databases, your staging API, or your production logs.

"The new endpoint returns the same JSON shape as the old one in prod." → not a check. "The new endpoint's response model in schemas.py has the same fields as OldEndpointResponse." → checkable.

Compound rules. Two rules that share a sentence make it hard for the agent to give you a meaningful fail with line refs.

"All new functions are typed AND have docstrings AND are tested." → split into three.

Severity and tiering

Two optional fields tune how a check is evaluated.

severity (low | medium | high, default medium) controls how the failure renders in the PR comment and in any policy-pack scoring you build on top — it doesn't change the model's behavior.

tier_hint (auto | triage_only | deep, default auto) controls which tier of the evaluator runs the check. Most checks should stay on auto, which lets a cheap triage pass handle obvious pass / not applicable cases and escalates to a deeper agent only when needed. Use:

  • deep for checks that require following code across multiple files or interpreting complex logic. Costs more credits but produces stronger reasoning.
  • triage_only for checks that are obviously a pass-through if the diff doesn't touch the relevant code. Cheap; no escalation.

See Credits and billing for the credit cost of each tier.

Iterating on a check

When a check gives results you don't like:

  1. Open the run on CrossCheck and read the agent's evidence. The reasoning is usually telling.
  2. Tighten the scope. Add a path constraint or call out a specific file pattern.
  3. Name the alternative. Replace "don't do X" with "do Y instead of X."
  4. Re-run the PR with the Re-review button.

If the description is right but the agent keeps mis-evaluating, set tier_hint: deep and try again.

Where checks live

  • In the dashboard. Open a check group under /<your-org>/check-groups, click Add check.
  • In cross-check.yaml at your repo root. YAML checks run alongside dashboard checks; they never replace them. See cross-check.yaml.

Either way the format is the same: a name and a description.