SARAL v1.3 · Field Results Archive mode · Frozen (Jan 2026)

Field Results: When rules meet discretion

Primary empirical observation

Every triage instance in the archived pilot cohort (N=260) was marked ELIGIBLE_BY_RULE based on structured inputs. However, rule eligibility failed to bind outcomes: operators leveraged AI risk flags and unstructured field notes to penalize and withhold approval from 52% of the mathematically eligible applicants.

N = 260 triage instances · 100% rule-eligible · 52% override rate
  • Pattern A (Protective Schema Gap): Surfacing unstructured evidence provided operators with the contextual justification needed to override formal eligibility.
  • Pattern B (Automation Bias): Operators leveraged non-binding AI risk scores as bureaucratic cover, withholding approval in 100% of cases flagged as medium/high risk, though sample size limits causal claims.
  • Pattern C (Discriminatory Variance): Qualitative patterns suggest override discretion introduces systemic penalties based on unwritten socio-economic proxies, rather than formal fraud.
  • Implication: "Human-in-the-loop" oversight can act as a restrictive gatekeeping mechanism when the human is biased; surfacing more data does not automatically equate to fairer outcomes.
Tags: Automation Bias · Discretionary Gatekeeping · Schema Limitations

System at a glance

Where SARAL sits in the decision chain

Final outcomes are determined at the discretion layer; SARAL is advisory. The pilot tests whether formal rule presentation changes operator behavior under realistic discretionary conditions.

System flow diagram: SARAL decision chain — policy rules + tabular data + field notes -> decision support (advisory DSS) -> human discretion. SARAL consumes policy rules and tabular inputs; decisive evidence often sits in field notes and operator discretion.

Methodology

Study design & measurement

Study design

The field deployment used a quasi-RCT workflow to test whether formal rule presentation changes operator behavior under discretionary conditions. SARAL outputs were advisory and non-binding throughout.

  • Design type: quasi-RCT workflow (operator-facing configurations varied across intake)
  • Intervention: structured rule trace + conflict cues (and, in some configurations, a non-binding AI risk score)
  • Constraint: final authority remained with human operators and local gatekeeping rules

Causal claims are bounded to proximal behavioral shifts (attention, overrides, friction), not downstream welfare outcomes.

Unit of analysis

The unit is a triage instance: one intake interaction producing a SARAL recommendation and a final operator action.

  • N: 260 triage instances (archived remote cohort)
  • Setting: rural Maharashtra
  • Schemes: multi-scheme intake (PMAY, UJJ, etc.)

Information environment

SARAL consumed policy rules + structured fields; decision-critical facts often appeared in unstructured field notes.

  • Structured: tabular intake fields used by the rule engine
  • Unstructured: field notes with disqualifying evidence and context
  • Discretion layer: operator judgment + administrative caps

Primary outcomes

  • Final action: Approve / Reject / Escalate
  • Override rate: divergence between recommendation and action
  • Processing time: triage time (proxy for cognitive friction)
  • Conflict presence: surfaced disagreements between structured rules and note evidence

“Accuracy” is treated as alignment between effective practice and strict policy eligibility.

Telemetry and logging

  • Latency: wait and processing time stamps
  • Decisions: recommendation + final action
  • Reasons: operator override reason codes (when provided)
  • Sequence: exposure order to notes/conflict cues when instrumented

Logs describe operator behavior under the workflow; they do not imply production monitoring.

Operational definitions (used throughout)

Archived cohort: The 260 instances reviewed by remote volunteers without field access (excluding 1 in-field test case).

Override: recommendation ≠ final operator action.

Schema gap: decisive evidence exists but is absent from structured fields.

Binding constraint: factor determining whether tool output translates into action.

Unstructured evidence: field notes not evaluated by rules.
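The definitions above can be expressed as simple predicates. A minimal sketch in Python; the argument names are illustrative, not field names from the SARAL export:

```python
def is_override(recommendation: str, final_action: str) -> bool:
    """Override: the recommendation differs from the final operator action."""
    return recommendation != final_action


def has_schema_gap(evidence_in_notes: bool, evidence_in_structured: bool) -> bool:
    """Schema gap: decisive evidence exists (in the notes) but is absent
    from the structured fields the rule engine evaluates."""
    return evidence_in_notes and not evidence_in_structured
```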

Data handling

  • Identifiers: names removed; direct identifiers excluded from archive
  • Notes: excluded or heavily sanitized prior to export
  • Contact data: phone numbers converted to irreversible citizen hashes; not retained post-session
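The archive does not specify the hashing construction. One plausible sketch of an irreversible citizen hash is a keyed HMAC whose session key is discarded afterwards; the algorithm choice and key handling here are assumptions, not the pilot's documented implementation:

```python
import hashlib
import hmac


def citizen_hash(phone: str, session_key: bytes) -> str:
    """Keyed, one-way hash of a phone number. Discarding session_key after
    the session makes the mapping irreversible and unlinkable across sessions."""
    return hmac.new(session_key, phone.encode("utf-8"), hashlib.sha256).hexdigest()
```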

Qualitative coding framework

  • Method: Open, inductive thematic coding of the N=135 override notes.
  • Process: Single-researcher manual review with a blind second-pass intra-coder reliability check.
  • Categories: Notes were assigned to mutually exclusive heuristic buckets (e.g., asset visibility, kinship penalties) representing the dominant rejection logic.
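An intra-coder reliability check of this kind can be quantified with Cohen's kappa between the first pass and the blind second pass. The report does not state which statistic was used, so this is an illustrative sketch only:

```python
from collections import Counter


def cohens_kappa(pass1, pass2):
    """Chance-corrected agreement between two label sequences
    (here: the same coder's first and blind second pass over the notes)."""
    n = len(pass1)
    observed = sum(a == b for a, b in zip(pass1, pass2)) / n
    c1, c2 = Counter(pass1), Counter(pass2)
    expected = sum(c1[k] * c2[k] for k in c1) / (n * n)
    return (observed - expected) / (1 - expected)
```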

Operator heterogeneity (exploratory; descriptive only)

Allocation was rotational, not randomized; operators did not process in parallel. Differences likely reflect case-mix + learning + batching/time-of-day, so comparisons are descriptive, not causal.

Metric note: In this export, rule_result is constant (ELIGIBLE_BY_RULE) across all rows, so Override % below is computed as final_action != APPROVE (i.e., “non-approval despite rule-eligibility”). Latencies are shown in milliseconds (ms).

Operator    | N   | Override % | Escalate % | Approve % | Reject % | Request Docs % | Median Triage (ms) | Median Meta (ms)
volunteer_1 | 97  | 63.9       | 13.4       | 36.1      | 32.0     | 18.6           | 26,344             | 82,000
volunteer_2 | 100 | 48.0       | 10.0       | 52.0      | 21.0     | 17.0           | 29,903             | 66,500
volunteer_3 | 63  | 39.7       | 6.3        | 60.3      | 17.5     | 15.9           | 21,697             | 51,000
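Given the metric note above, per-operator Override % reduces to the share of non-APPROVE final actions. A minimal sketch over hypothetical log rows (the tuples are illustrative, not the archived data):

```python
from collections import defaultdict


def override_rates(rows):
    """Override % per operator, computed as final_action != 'APPROVE'
    (valid here because every row in the export is ELIGIBLE_BY_RULE)."""
    totals, overrides = defaultdict(int), defaultdict(int)
    for operator, final_action in rows:
        totals[operator] += 1
        if final_action != "APPROVE":
            overrides[operator] += 1
    return {op: 100.0 * overrides[op] / totals[op] for op in totals}


# Illustrative rows only:
sample = [
    ("volunteer_1", "REJECT"),
    ("volunteer_1", "APPROVE"),
    ("volunteer_2", "ESCALATE"),
    ("volunteer_2", "APPROVE"),
]
```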

Acknowledgement: Operator review for this archived cohort (N=260) was conducted by three remote volunteer operators who had no direct access to the field (excluding 1 instance processed by a field operator). Identities are withheld to keep the archive non-personal and publication-safe.

Findings


1. The Protective Schema Gap

Contrary to the assumption that missing data harms applicants, the "schema gap" was structurally protective in a punitive welfare environment. Because 100% of the cohort was mathematically ELIGIBLE_BY_RULE based on structured inputs, algorithmic autonomy would have resulted in universal approval. However, surfacing unstructured field notes in the UI provided operators with the contextual justification needed to override the algorithm and withhold approval in 52% of cases.


Interpretation: Surfacing more information does not automatically result in fairer outcomes; it can arm gatekeepers.

2. Asymmetric Automation Bias

While the homogeneous applicant pool compressed AI risk scores into a narrow range (median 0.156), behavioral shifts at the margins were absolute. When the algorithm displayed "Low" risk, operators approved ~50% of cases. However, when flagged as "Medium/High" risk (n=7), the approval rate plummeted to 0% (all 7 cases were rejected or escalated). While consistent with automation bias, the sample size precludes definitive causal attribution. Patterns suggest operators leveraged the "High Risk" flag as institutional cover to withhold approval, though underlying note-risk correlation cannot be entirely ruled out.
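The asymmetry described above reduces to approval rate conditioned on the displayed risk flag. A sketch with assumed flag labels and illustrative rows:

```python
from collections import defaultdict


def approval_rate_by_flag(rows):
    """Approval rate grouped by the risk flag shown to the operator.
    rows: (risk_flag, final_action) pairs; the flag labels are assumptions."""
    totals, approved = defaultdict(int), defaultdict(int)
    for flag, action in rows:
        totals[flag] += 1
        if action == "APPROVE":
            approved[flag] += 1
    return {flag: approved[flag] / totals[flag] for flag in totals}
```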


Interpretation: Risk scores may be leveraged as institutional cover for borderline rejections.

3. Discretion as Discriminatory Variance

Based on single-researcher thematic coding of the 135 override notes, discretion appeared to be used rarely to catch formal fraud. Instead, qualitative patterns suggest discriminatory variance: operators routinely bypassed formal eligibility to penalize applicants based on arbitrary kinship, geographic, and visual proxies (e.g., “Lives with parents who have land” or “Structure looks like a bungalow”), enforcing local unwritten moral economies.


Interpretation: Discretion operates as an enforcement mechanism for local socio-economic biases.

4. Friction Correlates with Rejection

Telemetry data challenges the assumption that longer review times equate to careful, fair oversight. Approvals were fast and relatively frictionless (median triage latency: ~20 seconds). Rejections and document escalations, by contrast, carried significantly higher cognitive friction (median triage latency: ~24 to ~42 seconds). The pattern suggests that when operators spent twice as long scrutinizing a file, they were hunting for a discrepancy to justify a denial rather than validating carefully.
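The friction proxy here is just the median triage latency grouped by final action. A minimal sketch; field names and rows are illustrative:

```python
from collections import defaultdict
from statistics import median


def median_triage_by_action(rows):
    """Median triage latency (ms) per final action — the cognitive-friction
    proxy used above. rows: (final_action, triage_ms) pairs."""
    by_action = defaultdict(list)
    for action, ms in rows:
        by_action[action].append(ms)
    return {action: median(latencies) for action, latencies in by_action.items()}
```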


Interpretation: Slower reviews indicate discrepancy hunting rather than careful validation.

Observed failure modes (taxonomy)

Categorization of instances where correct rule execution failed to bind practical outcomes.

Failure mode      | Description                                                                                        | Where it appears
SCHEMA_GAP        | Decision-critical evidence exists but is not represented in tabular fields evaluated by rules.     | All schemes / Intake
NOTE_OVERRIDE     | Operators reject mathematically eligible cases based on disqualifying evidence in free-text notes. | Triage stage
CAP_GATEKEEP      | Formally rule-eligible instances are rejected due to external constraints (caps, local limits).    | UJJ / Local offices
ATTENTION_FAILURE | Recommendations or conflict cues are ignored due to cognitive load or process constraints.         | Triage stage


Limitations

Pilot scope

Single-region deployment (Maharashtra) and limited institutional embedding constrain external validity and downstream outcome tracing.

Data shape

Welfare intake is demographically and procedurally constrained; sample homogeneity compresses risk-score variance and limits observable automation-bias effects.

Design constraints

Non-binding recommendations and a quasi-RCT workflow limit causal claims about downstream welfare access. Effects measured here are proximal: attention, overrides, and friction.

Volunteer generalizability

Volunteer operators lack the localized political pressures and career incentives of actual bureaucrats. Consequently, these findings represent experimental behavioral patterns; the 52% override rate may understate the exclusionary friction present in live administrative environments.


Interpretation (bounded claims)

What the evidence supports

  • Structured schema is an incomplete abstraction of decision-relevant facts.
  • Operator attention and discretion remain the binding constraint on triage.
  • Power over outcomes resides in human judgment and external gatekeeping, not classification.

What this is not

  • Not a claim that algorithmic systems inherently improve welfare access.
  • Not a generalizable proof of downstream poverty outcomes.
  • Not an assertion that modeling can replace institutional capacity or fix data production.

Theoretical Context

These findings empirically anchor Michael Lipsky’s street-level bureaucracy in the digital age, demonstrating how frontline gatekeeping heuristics evolve alongside new technology. By showing how operators leverage AI risk flags as institutional cover for discretionary exclusion, the pilot provides behavioral evidence for Virginia Eubanks’ digital poorhouse and Ben Green’s critique of algorithmic transparency: surfacing more data often arms biased gatekeepers with contextual ammunition rather than inherently protecting marginalized applicants.

Data & privacy

Release status

This archive hosts instruments, protocols, and reproducible artifacts from the SARAL pilot. Names and direct identifiers are removed. Free-text verification notes are excluded or heavily sanitized prior to any export. Phone numbers used during intake were converted to irreversible citizen hashes and were not retained after the session.

PII not published · Archive mode

SARAL suggestions are non-binding. Operator decisions remain authoritative. Released artifacts are sanitized for academic reproducibility, not deployment claims.