Why Database-Backed State?
The JSON file is a serialization format masquerading as a storage engine. At scale, that distinction breaks everything.
The common argument: "The JSON file IS the state. It works." The reality: at 50 resources it's invisible. At 50,000 resources it's the bottleneck for every team in the org.
The Blob Problem
Terraform's state is a single atomic blob. It stores not just the resources themselves, but every dependency relationship, every attribute, every computed output, and every provider reference. Every operation (plan, apply, import, state show, state list) must:
- Read the entire file into memory (even to check one resource)
- Parse and resolve the full dependency graph (even if only one branch matters)
- Write the entire file back (even if one attribute changed)
- Lock the entire file (even if two teams touch unrelated resources)
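The read-everything pattern above can be sketched in a few lines of Python. This is an illustration, not Terraform's actual code path; the file layout (a top-level "resources" array with an "address" field) is a simplified stand-in for the real state schema:

```python
import json

def get_resource(state_path: str, address: str):
    """Look up ONE resource -- but the whole blob must be read and parsed first."""
    with open(state_path) as f:
        state = json.load(f)  # reads and parses the entire file into memory
    # Linear scan: no index, no way to jump straight to the row we want
    for res in state.get("resources", []):
        if res.get("address") == address:
            return res
    return None
```

The cost of this lookup is proportional to the size of the whole file, not to the one resource being inspected.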
The file size is not just a function of resource count. It scales with the complexity of the dependency graph, the number of computed attributes per resource, and the depth of module nesting. A VPC with 200 resources and deep cross-references can produce a state file as large as a flat list of 2,000 independent resources.
- 50 resources: ~500KB state file. Invisible.
- 5,000 resources: 50-100MB. Painful.
- 50,000 resources: 500MB+. Broken.
Database approach
Each resource is a row. Dependencies are stored as indexed foreign keys in a separate table, not embedded inside a monolithic JSON tree. oxid plan for one resource reads that row and walks only its dependency chain, not the entire graph. State for 500,000 resources with complex cross-references does not slow down a plan for 1 resource.
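A minimal sketch of that layout, using Python's built-in sqlite3 as a stand-in (the table and column names here are illustrative assumptions, not Oxid's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE resources (
        id      INTEGER PRIMARY KEY,
        address TEXT UNIQUE,
        attrs   TEXT
    );
    -- Dependencies live in their own indexed table, not inside a JSON tree
    CREATE TABLE dependencies (
        resource_id   INTEGER REFERENCES resources(id),
        depends_on_id INTEGER REFERENCES resources(id)
    );
    CREATE INDEX idx_deps ON dependencies(resource_id);
""")

def dependency_chain(conn, address):
    """Walk only the chain under one resource -- never the whole graph."""
    row = conn.execute("SELECT id FROM resources WHERE address = ?",
                       (address,)).fetchone()
    if row is None:
        return []
    chain, stack, seen = [], [row[0]], set()
    while stack:
        rid = stack.pop()
        if rid in seen:
            continue
        seen.add(rid)
        chain.append(rid)
        stack += [r[0] for r in conn.execute(
            "SELECT depends_on_id FROM dependencies WHERE resource_id = ?",
            (rid,))]
    return chain
```

The index on `dependencies(resource_id)` is what keeps the walk proportional to the size of one resource's chain, independent of total state size.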
The Locking Problem
Terraform state locking, whether via the legacy DynamoDB table or the newer S3-native lock file (use_lockfile = true), is a distributed mutex on the entire state:
- Team A modifying a VPC blocks Team B modifying an S3 bucket, even though they are completely unrelated
- Long-running applies (20-30 min) hold the lock the entire time
- Lock stuck? Manual terraform force-unlock, and hope nobody else is mid-write
- Splitting state into smaller workspaces is the "solution", but that is just admitting the model is broken
S3 native locking removed the DynamoDB dependency, but it did not fix the actual problem. It is still one lock for the entire state file. Removing a piece of infrastructure from the locking setup does not change the fact that the locking granularity is wrong.
Database approach
Row-level locking. Postgres supports SELECT ... FOR UPDATE on individual resources. Two teams, two resources, zero contention. Transactions provide atomicity: either the whole apply commits or nothing does.
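The atomicity half of that claim can be demonstrated with stdlib sqlite3 (row-level FOR UPDATE is a Postgres feature; SQLite is used here only to show the all-or-nothing transaction semantics, and the schema is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE resources (address TEXT PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO resources VALUES (?, ?)",
                 [("aws_vpc.main", "ok"), ("aws_s3_bucket.logs", "ok")])
conn.commit()

def apply_all(conn, updates):
    """Apply every update or none: the transaction is the unit of change."""
    try:
        with conn:  # the connection context manager wraps one transaction
            for address, status in updates:
                cur = conn.execute(
                    "UPDATE resources SET status = ? WHERE address = ?",
                    (status, address))
                if cur.rowcount == 0:
                    # Any failure rolls back every earlier update in the batch
                    raise LookupError(f"unknown resource {address}")
    except LookupError:
        return False
    return True
```

A failed apply leaves the state exactly as it was, with no partially written blob to reconcile.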
This isn't novel; this is how every production system stores mutable data. The question is why infrastructure state is the exception.
The Query Problem
"Show me all EC2 instances with tag Environment=production across the state." How does one answer that?
This isn't a convenience feature. This is operational visibility. When there's an incident at 2 AM and someone needs to know "which security groups reference vpc-abc123", that query takes 5ms in Postgres and requires writing a custom parser against a 200MB JSON blob in Terraform.
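The database side of that comparison is an ordinary indexed join. A sketch with stdlib sqlite3 (a tags side-table is one plausible layout; Oxid's real schema is not shown in this document):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE resources (id INTEGER PRIMARY KEY, type TEXT, name TEXT);
    CREATE TABLE tags (resource_id INTEGER REFERENCES resources(id),
                       key TEXT, value TEXT);
    CREATE INDEX idx_tags ON tags(key, value);  -- makes the 2 AM query fast
""")

def instances_with_tag(conn, key, value):
    """'All EC2 instances with this tag' as one indexed query."""
    return [r[0] for r in conn.execute("""
        SELECT r.name FROM resources r
        JOIN tags t ON t.resource_id = r.id
        WHERE r.type = 'aws_instance' AND t.key = ? AND t.value = ?
    """, (key, value))]
```

The Terraform equivalent is piping the whole state export through jq, which reparses the entire blob on every question asked.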
The Drift Detection Problem
Terraform detects drift by running terraform plan. This loads the entire state, calls ReadResource for every single resource, compares, and reports. For large infrastructure this takes 30 to 60 minutes and holds the state lock the entire time.
With database-backed state, drift detection can be:
- Incremental: check 100 resources at a time, no global lock
- Continuous: a background process polls subsets on a schedule
- Targeted: "just check the network layer" without touching compute state
- Concurrent: multiple drift checks run across different resource types simultaneously
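The incremental mode above reduces to batching plus a per-batch comparison. A minimal sketch (the batch size and the read callbacks are assumptions for illustration):

```python
def drift_batches(resource_ids, batch_size=100):
    """Yield fixed-size batches so each check touches only its own rows."""
    for i in range(0, len(resource_ids), batch_size):
        yield resource_ids[i:i + batch_size]

def check_drift(batch, read_actual, read_desired):
    """Compare actual vs desired state for one batch only.

    read_actual / read_desired are caller-supplied lookups (e.g. a cloud
    API call and a database row fetch); no global lock is ever taken.
    """
    return [rid for rid in batch if read_actual(rid) != read_desired(rid)]
```

Because each batch is independent, the batches can be scheduled continuously, filtered to one layer, or run concurrently, which is exactly the list above.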
The Disaster Recovery Problem
"What's the recovery plan when the state file gets corrupted?"
The typical answer: "Version the S3 bucket." Which means recovery = roll back the entire state to a prior version. Any resources created between backup and corruption are now orphaned. Any resources destroyed are now ghosts. Reconciliation is manual, resource by resource.
Database approach
- Point-in-time recovery (Postgres WAL, SQLite snapshots)
- Transaction log enables replaying/undoing individual operations
- Per-resource history: “show me every change to this VPC in the last 30 days”
- Backup/restore with battle-tested tools (pg_dump, streaming replication)
- Restore one resource without rolling back everything else
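The per-resource history item is just an append-only change log. A sketch with stdlib sqlite3 (table and column names are illustrative):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE changes (
        id          INTEGER PRIMARY KEY,
        address     TEXT,
        operation   TEXT,   -- 'create' | 'update' | 'delete'
        occurred_at REAL
    );
""")

def record(conn, address, operation):
    """Append one change to the log; nothing is ever overwritten."""
    conn.execute(
        "INSERT INTO changes (address, operation, occurred_at) VALUES (?, ?, ?)",
        (address, operation, time.time()))

def history(conn, address, since):
    """'Every change to this resource since T' is a single indexed query."""
    return conn.execute(
        "SELECT operation, occurred_at FROM changes "
        "WHERE address = ? AND occurred_at >= ? ORDER BY occurred_at",
        (address, since)).fetchall()
```

Restoring one resource then means replaying or inverting its own log entries, leaving every other row untouched.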
The Collaboration Problem
Terraform's "solution" to multi-team state management is workspace splitting: split the infra into 10-50 separate state files. Now cross-stack references require terraform_remote_state data sources. Circular dependencies become impossible. Refactoring requires terraform state mv across files.
This is essentially manual sharding of a flat file. Databases solved this decades ago.
- Move resources between workspaces
- Query across workspaces
One database, many workspaces, native foreign keys between resources. No need to split infra into arbitrary boundaries because the storage engine can't handle it.
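Both operations fall out of a workspaces table. A sketch with stdlib sqlite3 (again, an illustrative schema, not Oxid's actual one):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE workspaces (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE resources (
        id           INTEGER PRIMARY KEY,
        workspace_id INTEGER REFERENCES workspaces(id),
        address      TEXT
    );
""")

def move_to_workspace(conn, address, workspace):
    """Moving a resource between workspaces is a one-row UPDATE."""
    with conn:
        conn.execute("""
            UPDATE resources
            SET workspace_id = (SELECT id FROM workspaces WHERE name = ?)
            WHERE address = ?""", (workspace, address))

def find_everywhere(conn, pattern):
    """Query across every workspace with one JOIN -- no remote_state wiring."""
    return conn.execute("""
        SELECT w.name, r.address FROM resources r
        JOIN workspaces w ON w.id = r.workspace_id
        WHERE r.address LIKE ?""", (pattern,)).fetchall()
```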
The Security Problem
Open any terraform.tfstate file. Every secret (database passwords, API keys, private keys) is sitting right there in plaintext JSON. Terraform's own docs say:
“State may contain sensitive data... treat the state itself as sensitive.”
The "solution" is encrypting the S3 bucket at rest. But anyone with state file access sees everything. There's no per-attribute encryption, no column-level access control, no audit log of who read what.
Database approach
- Sensitive attributes encrypted per-column (Postgres pgcrypto, application-level)
- Row-level security policies: "team X can only see resources with tag team=X"
- Full audit trail: who queried which resources, when
- Standard database RBAC instead of “whoever has S3 access sees everything”
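The row-level-security idea can be approximated even in SQLite with a filtered view; real Postgres RLS enforces the same policy inside the engine rather than by convention. A sketch (schema and team names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE resources (id INTEGER PRIMARY KEY, address TEXT, team TEXT);
    -- Hand team X a view, not the base table: they can only ever see
    -- their own rows. Postgres row-level security does this natively.
    CREATE VIEW team_x_resources AS
        SELECT id, address FROM resources WHERE team = 'team-x';
""")
```

Grants on views (or RLS policies) replace "whoever can read the S3 object sees every secret in the state".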
The Migration / Refactoring Problem
terraform state mv is the most terrifying command in Terraform. It reads the entire state blob, modifies the in-memory representation, writes the entire blob back, has no undo, and if interrupted may leave a partially modified state.
Moving 100 resources between modules? That's 100 sequential state mv commands, each rewriting the entire file.
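In a database, the same refactor is one atomic statement. A sketch with stdlib sqlite3 (the module column and names are illustrative assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE resources (id INTEGER PRIMARY KEY, module TEXT, name TEXT)")
# 100 resources living under the old module path
conn.executemany("INSERT INTO resources (module, name) VALUES (?, ?)",
                 [("module.old_net", f"subnet_{i}") for i in range(100)])
conn.commit()

def move_module(conn, src, dst):
    """Re-home every resource in one transaction; returns rows moved.

    Interrupted? The transaction rolls back and nothing moved. There is
    no partially modified blob to repair.
    """
    with conn:
        return conn.execute(
            "UPDATE resources SET module = ? WHERE module = ?",
            (dst, src)).rowcount
```

One UPDATE replaces 100 sequential full-file rewrites, and the transaction log doubles as the undo that state mv never had.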
But What About...
"But JSON is simple and portable."
→JSON is a serialization format. Nobody stores a user database as a JSON file on S3 and argues "it's simple." State is mutable, concurrent, relational data. That's exactly what databases were invented for. Oxid can still export to JSON anytime.
"We've been fine with Terraform state for years."
→How many times has a stuck lock blocked a deploy? How many orphaned resources exist because someone ran state rm wrong? How long does terraform plan take? These are all symptoms of the same root cause.
"What if the database goes down?"
→SQLite is an embedded file with the same availability as a JSON file, but with ACID transactions. Postgres offers streaming replication, automated failover, and managed services like RDS. A database has better uptime guarantees than an S3 + DynamoDB lock table setup.
"This sounds like over-engineering."
→It's the opposite. Terraform state management requires: S3 bucket + versioning + encryption + DynamoDB table + IAM policies + locking logic + workspace splitting + backup procedures + drift scripts. That's nine systems to approximate what one database provides natively.
Terraform's state file is a spreadsheet. It works at 50 rows. But at 50,000 rows across 10 teams, with concurrency managed by file locks, recovery handled by S3 versioning, and queries powered by jq, it is time to store infrastructure state the same way every other piece of critical data gets stored: in a database.