Federal Data Collection Platform Rescue: How We Saved a National Survey in 4 Months

When a federal agency had 30 days left, 30 percent of the platform built, and a million daily validation rules with no engine to run them, we stopped pushing the failing plan and rebuilt it from the ground up.

Agency and system names anonymized for security. Full briefing available under mutual NDA.

7 min read

Client
Federal agency (anonymized)
Domain
Demographic and education data, national annual survey
Engagement
Project rescue and platform build
30%
Complete on arrival, with one month to original deadline
1M+
Validation rules processed daily
4 months
From rescue start to production delivery

The situation

A federal agency was preparing to field a national survey it conducts on an annual cycle, capturing demographic and education data that feeds downstream policy and funding decisions. The supporting data collection platform had been in active development for roughly two years.

It had very little to show.

When ExeQut was brought in, the platform was about 30 percent complete with one month remaining on the original delivery window. Requirements lived in spreadsheets and analyst notebooks. The validation logic that needed to run on every incoming response, more than a million rules per day, had no engine to execute it. The team had run out of runway, the budget was effectively spent, and the agency was facing the prospect of either fielding a broken instrument or skipping a year of national data collection entirely.

Two years of build. Thirty percent complete. One month to go.

Skipping a year was not really an option. Annual cadence is the entire point of an annual survey. Every missed cycle leaves a gap in the time series that downstream analysts and policy teams work around for the life of the program.

The challenge

Three problems had to be solved in parallel.

1. Requirements capture had no home. Survey content and validation rules existed in fragments across documents, notebooks, and people's heads. There was no structured way to elicit, version, or hand them off to engineering.

2. Validation had to run at scale. Over a million rules per day, traceable to source, with performance that would not collapse during peak collection windows.

3. The clock and the budget were both done. No realistic option to buy our way out with a commercial rule engine or fresh infrastructure spend. Recovery had to start in days, not weeks.

The approach

Pushing harder on a plan that was already failing was not going to work. We led a controlled reset of the engagement, negotiated a realistic delivery target, and rebuilt the operating model and architecture around what working software actually required.

A working operating model

We assembled an integrated delivery team of roughly 10 contractor staff working alongside 3 dedicated federal subject matter experts. That integration was the foundation. Federal SMEs sat with engineers, owned survey content decisions, and reviewed working software at the end of every iteration.

Ad hoc status reporting was replaced with a disciplined agile cadence: short iterations, demos at the end of every sprint, one prioritized backlog, and a clear path for late-breaking requirements to land without resetting the plan. For the first time in the life of the program, the agency had visibility into actual progress measured in working features rather than slideware.

A custom rule engine, by necessity

With the budget effectively closed, a commercial rule engine was not an option. We built one in .NET, paired with a normalized data model for capturing validation logic. Rules were authored once, versioned, and executed by the engine at scale. Rule definition was decoupled from rule execution, so onboarding the full library of more than a million daily validations did not require writing one-off code per rule.

This was the single most important architectural call of the engagement.

A million rules in handwritten code would have been unmaintainable on day one, and impossible by year two.

Survey content as configuration, not code

Questions, skip logic, branching, and metadata lived in a structured knowledge base that the platform imported automatically at deploy time. Federal SMEs could revise survey content without filing engineering tickets. Changes flowed through repeatable import pipelines.

That property mattered more than it might sound. A survey instrument that evolves between annual cycles cannot afford a development cycle for every wording change. Treating content as configuration is what makes year-over-year operation possible at this scale.

Elastic AWS infrastructure

Hosting the platform on AWS was one of the genuine unlocks of the rescue. Elastic Beanstalk gave us auto-scaling out of the box, which meant the environment could expand and contract with demand instead of being sized for a worst case that we then prayed we had estimated correctly. During collection windows, the platform absorbed peak load without manual intervention. Outside of those windows, we were not paying for capacity we did not need.

Just as important for a federal program with one shot to get it right: we could load test the application aggressively in production-equivalent conditions, push it past expected traffic, and watch how it behaved, without taking the system down or rebuilding infrastructure to do it. That was not a luxury available to the team a few months earlier.

For a national survey with one annual window to get it right, the ability to load test under real conditions and auto-scale through peaks was the difference between confidence and crossed fingers.

Underneath Beanstalk, an open-source .NET job queue handled asynchronous validation work. Heavy validation was decoupled from request handling so the user-facing application stayed responsive even when validation throughput was running flat out. Workloads could be isolated, retried, and observed independently of the web tier, which kept respondent experience smooth and gave operators clear signal when something needed attention.

Deployment and migrations on rails

Deployment moved from manual procedures to scripted, configuration-driven releases. Shell scripts handled environment provisioning and application rollout. Entity Framework managed database schema migrations as versioned, repeatable steps tied to each release. Environment promotion became predictable, rollbacks became safe, and the team could ship fixes quickly during the field period.

The outcome

ExeQut delivered the platform four months after entering the engagement. The annual national survey was fielded on schedule. The system processed over a million validations per day against incoming responses. The agency had a working data collection platform for the first time in two years.

Originally a two-year build. Recovered and shipped in four months.

What we did not fully anticipate at the time was how long the architecture would last. The configuration-based design meant the survey could evolve without rebuilding the platform. ExeQut has continued to support the program without interruption since the rescue, through every annual cycle, as the survey instrument and the policy context around it have changed.

A platform that almost did not field at all has now run uninterrupted across more annual cycles than anyone counted on at the start.

What we took from it

Federal project rescues tend to turn on the same set of decisions. A few of them are worth pulling out.

  • Reset the plan before resetting the team. Most stalled federal programs do not have a people problem. They have a structure problem. Fixing the operating model first is what gives the team a chance.
  • Integrate federal SMEs into the delivery team. Requirements that travel through document handoffs decay. Requirements that come from a federal expert sitting in the same standup do not.
  • Configuration is a long-term decision disguised as a short-term one. Treating content and rules as configuration looks like extra work in month one. It pays back every cycle after.
  • Build for the second year. The platform that ships on time is necessary. The platform that runs every year after that is the actual product.
  • Working software is the only credible status report. No deck recovers a stalled program. A demo on a sprint boundary does.

If you are running a federal program in trouble, the right intervention is rarely to push harder. It is to stop, rebuild the plan around what working software actually requires, and build the kind of architecture that the second year, and the tenth, can keep running on.

Want the unredacted briefing?

Agency, systems, architecture, vendors, and outcomes. We walk you through the full engagement under mutual NDA.

Request a private briefing