Storage Backends

The storage backend controls how PHI token mappings are stored and retrieved. Select it per request with the storage_type query parameter, or set it once in the dashboard.


Recommendation

Use DynamoDB Token-Based for production

DynamoDB Token-Based is the recommended backend for most production deployments. It's fast in both lookup directions, scales automatically with usage, costs a fraction of a cent per operation, and doesn't require you to manage Record IDs. Pick this unless you have a specific reason not to.


At a glance

Backend                 Production-ready   Lookup directions
DynamoDB Token-Based    Yes                Both directions
DynamoDB Record-Based   Yes                Both directions
AWS KMS                 Yes                Token -> value only
File                    Dev only           Both directions

DynamoDB Token-Based

How it works: Each PHI value gets an HMAC token. The token is deterministic: for a given secret, the same value always produces the same token. The token is the primary key in a DynamoDB table, with the original value stored as an attribute of the item. Lookups by token are direct key reads.
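A minimal sketch of the idea, not the service's actual implementation. It assumes HMAC-SHA256 with a truncated hex digest and a hypothetical `PHI_` prefix (the real token format is not specified here), and it uses a plain dictionary in place of the DynamoDB table. Because the token is deterministic, a value -> token lookup can be done by recomputing the HMAC and checking that the key exists, rather than by scanning stored values.

```python
import hashlib
import hmac
from typing import Dict, Optional

def make_token(value: str, secret_key: str) -> str:
    """Derive a deterministic token from a PHI value (format is illustrative)."""
    digest = hmac.new(
        secret_key.encode("utf-8"), value.encode("utf-8"), hashlib.sha256
    ).hexdigest()
    return "PHI_" + digest[:16]  # prefix and truncation length are assumptions

# Stand-in for the DynamoDB table: token (primary key) -> original value.
table: Dict[str, str] = {}

def anonymize_value(value: str, secret_key: str) -> str:
    token = make_token(value, secret_key)
    table[token] = value  # same value always writes the same key
    return token

def token_to_value(token: str) -> Optional[str]:
    return table.get(token)  # direct key read

def value_to_token(value: str, secret_key: str) -> Optional[str]:
    token = make_token(value, secret_key)  # recompute, then check the key
    return token if token in table else None
```

Changing the secret changes every token, which is why the secret must stay the same across all operations.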

Why we recommend it

  1. Bidirectional lookups — both token -> value and value -> token work without a Record ID
  2. Fast — token-keyed reads are ~10 ms per lookup
  3. Scales automatically — DynamoDB on-demand mode grows with your traffic, no capacity planning
  4. Cheap at scale (~$0.00005 per anonymize request at a typical 30 PHI entities per document)
  5. Persistent and durable — DynamoDB replicates across three Availability Zones automatically
  6. No per-record state — anonymize and deanonymize are stateless from the caller's perspective; you don't need to remember a Record ID to restore your data
  7. Deterministic tokens — the same PHI value always produces the same token, which simplifies de-duplication and pipeline integration

What you need

Requirement   Notes
Secret Key    HMAC secret used for token generation. Must stay the same across all operations or tokens won't match.
AWS Region    Region where the DynamoDB table lives.

Configuration example

curl -X POST "http://<host>:8888/anonymize/phi?storage_type=DynamoDBTokenBased&aws_region=us-east-1&secret_key=YOUR_SECRET" \
  -H "API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text": "Patient John Smith, SSN 123-45-6789."}'

In the dashboard: Configuration -> Storage Backend -> DynamoDB — Token.
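If the secret contains characters such as `&`, `=`, or `/`, the query string in the curl command above needs percent-encoding. The following sketch builds the same URL with the Python standard library. The host and the secret are placeholders, and the helper name `build_anonymize_url` is hypothetical.

```python
from urllib.parse import urlencode

def build_anonymize_url(host: str, storage_type: str, **params: str) -> str:
    """Build the /anonymize/phi URL with percent-encoded query parameters."""
    query = urlencode({"storage_type": storage_type, **params})
    return f"http://{host}:8888/anonymize/phi?{query}"

url = build_anonymize_url(
    "localhost",                 # placeholder host
    "DynamoDBTokenBased",
    aws_region="us-east-1",
    secret_key="s3cr&t/key=1",   # placeholder; &, / and = get encoded
)
print(url)
```

The same helper applies to the other backends by changing `storage_type` and the extra parameters.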


DynamoDB Record-Based

How it works: Each anonymize call writes a single DynamoDB item keyed by Record ID. All PHI mappings for that document are grouped together under that one key. Lookups and deanonymization require the Record ID.

When to choose it

  • You need granular per-record retrieval or deletion (e.g. patient requests right-to-be-forgotten and you must purge their tokens)
  • You already track stable Record IDs (case numbers, patient IDs) and want them as the unit of organization
  • Your compliance workflow requires the ability to enumerate all PHI per record in one operation

Trade-offs

  • Record ID is required for every operation — anonymize, deanonymize, lookup. Lose it and the data is unrecoverable.
  • Read-modify-write pattern — storing a new PHI value requires reading the existing item, appending, then writing back. This is slower than Token-Based and prone to contention under heavy concurrent writes for the same Record ID.
  • Per-partition throughput limits — if many requests use the same Record ID at once, DynamoDB throttling can occur.
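The contention described in these trade-offs can be illustrated with a small in-memory model. This is a generic optimistic-concurrency sketch, not the service's code: the `RecordStore` class, its version counter, and the retry loop are all assumptions, loosely analogous to a DynamoDB conditional write. Two writers that read the same item before either writes cannot both succeed, so one of them has to re-read and retry.

```python
class RecordStore:
    """In-memory stand-in for a table keyed by Record ID (not the real backend)."""

    def __init__(self) -> None:
        self._items = {}  # record_id -> (version, {token: value})

    def read(self, record_id):
        version, mappings = self._items.get(record_id, (0, {}))
        return version, dict(mappings)  # copy, as a network read would return

    def write_if_version(self, record_id, expected_version, mappings) -> bool:
        """Conditional write: succeeds only if nobody wrote since our read."""
        current_version, _ = self._items.get(record_id, (0, {}))
        if current_version != expected_version:
            return False
        self._items[record_id] = (current_version + 1, dict(mappings))
        return True

def add_mapping(store, record_id, token, value, max_retries=5) -> None:
    """Read the item, append one mapping, write back; retry on conflict."""
    for _ in range(max_retries):
        version, mappings = store.read(record_id)
        mappings[token] = value
        if store.write_if_version(record_id, version, mappings):
            return
    raise RuntimeError("too much contention on record " + record_id)

store = RecordStore()
# Two writers read the same (empty) item before either one writes.
v_a, m_a = store.read("patient-12345")
v_b, m_b = store.read("patient-12345")
m_a["tok_a"] = "John Smith"
m_b["tok_b"] = "123-45-6789"
first = store.write_if_version("patient-12345", v_a, m_a)    # accepted
second = store.write_if_version("patient-12345", v_b, m_b)   # rejected, stale read
add_mapping(store, "patient-12345", "tok_b", "123-45-6789")  # re-reads, then writes
```

Every rejected write costs an extra read and write, which is one reason this pattern slows down when many requests target the same Record ID.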

What you need

Requirement   Notes
Secret Key    Same as Token-Based.
Record ID     A stable identifier (e.g. patient-12345, case-2026-001). You must pass this consistently across anonymize, deanonymize, and lookup.
AWS Region    DynamoDB region.

Configuration example

curl -X POST "http://<host>:8888/anonymize/phi?storage_type=DynamoDBGroupedByRecord&aws_region=us-east-1&secret_key=YOUR_SECRET" \
  -H "API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text": "Patient John Smith...", "record_id": "patient-12345"}'

AWS KMS

How it works: Each PHI value is encrypted individually with an AWS KMS RSA-4096 key, and the ciphertext is embedded directly in the token, so no database is involved. To deanonymize, each token is sent back to KMS and decrypted on the fly.

When to choose it

  • You explicitly want no separate database for PHI mappings
  • You need defense-in-depth encryption at rest with hardware-backed keys
  • Your security review requires KMS-managed encryption keys with rotation, IAM-based access control, and CloudTrail audit on every decrypt

Trade-offs

  • Value -> token lookup is not supported: KMS RSA encryption uses randomized OAEP padding, so the same PHI value produces a different ciphertext on every call. You can decrypt tokens but cannot search for "what's the token for John Smith?"
  • Per-call AWS cost: KMS charges ~$0.0001 per encrypt or decrypt call. With ~30 PHI values per document, this adds ~$0.003 per anonymize request and another ~$0.003 per deanonymize request. At ~$0.00005 per request for DynamoDB Token-Based, that is roughly 60× more per request.
  • Slower: each PHI value triggers a network round-trip to KMS, both during anonymize and during deanonymize.
  • Tokens are larger: an RSA-4096 ciphertext is 512 bytes before encoding, much bigger than an HMAC token, so anonymized text is significantly longer than the original.
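The cost figures in this list are simple multiplication, and a short calculation makes it easy to plug in your own volumes. The per-call and per-request prices below are the approximate values quoted in this guide, not live AWS pricing, and the monthly request count is a hypothetical example.

```python
PHI_VALUES_PER_DOCUMENT = 30          # typical figure quoted in this guide
KMS_COST_PER_CALL = 0.0001            # USD per encrypt or decrypt call, as quoted
DYNAMODB_COST_PER_REQUEST = 0.00005   # USD per anonymize request, as quoted

def kms_cost(requests: int, phi_values: int = PHI_VALUES_PER_DOCUMENT) -> float:
    """KMS cost in USD for a number of anonymize (or deanonymize) requests."""
    return requests * phi_values * KMS_COST_PER_CALL

def dynamodb_cost(requests: int) -> float:
    """DynamoDB Token-Based cost in USD for a number of anonymize requests."""
    return requests * DYNAMODB_COST_PER_REQUEST

per_request_kms = kms_cost(1)
ratio = per_request_kms / DYNAMODB_COST_PER_REQUEST
monthly_requests = 1_000_000  # hypothetical volume, for illustration only

print(f"KMS per request:      ${per_request_kms:.4f}")
print(f"KMS / DynamoDB ratio: {ratio:.0f}x")
print(f"Monthly KMS:          ${kms_cost(monthly_requests):,.2f}")
print(f"Monthly DynamoDB:     ${dynamodb_cost(monthly_requests):,.2f}")
```

Deanonymizing the same documents through KMS adds the same amount again, since each token needs its own decrypt call.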

What you need

Requirement   Notes
KMS Key ARN   An RSA-4096 key. The API server's IAM role needs kms:Encrypt and kms:Decrypt permissions on it. The CloudFormation template creates the key automatically.
AWS Region    Region where the KMS key exists.

Configuration example

curl -X POST "http://<host>:8888/anonymize/phi?storage_type=KMS&aws_region=us-east-1&kms_key_id=arn:aws:kms:..." \
  -H "API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text": "Patient John Smith..."}'

File

How it works: Token mappings are written to a JSON file on the API server's local disk.

When to choose it

  • Local development only. Quick sanity checks without spinning up AWS resources.

Why not in production

  • Single-instance only — file is on local disk; multiple API servers can't share it
  • No backup or replication — disk failure = total data loss
  • No concurrent-write safety — risk of corruption under load
  • Doesn't scale — every operation reads/writes the entire mappings file
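To see why whole-file reads and writes stop scaling, consider this sketch. The actual file layout used by the File backend is not documented here, so the flat JSON object and the `store_mapping` helper are assumptions. The only point is that each new mapping rewrites everything stored so far.

```python
import json
import os
import tempfile

def store_mapping(path: str, token: str, value: str) -> int:
    """Add one mapping to the JSON file; return the number of bytes rewritten."""
    mappings = {}
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            mappings = json.load(f)      # read the entire file
    mappings[token] = value
    data = json.dumps(mappings)
    with open(path, "w", encoding="utf-8") as f:
        f.write(data)                    # rewrite the entire file
    return len(data.encode("utf-8"))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mappings.json")
    sizes = [store_mapping(path, f"tok_{i}", f"value {i}") for i in range(100)]

print(f"first write: {sizes[0]} bytes, 100th write: {sizes[-1]} bytes")
```

The bytes written per operation grow with the number of stored mappings, and nothing here protects against two processes writing at the same time.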

What you need

Requirement        Notes
Secret Key         HMAC secret for token generation.
Output Directory   Server-side path where the JSON file is saved (optional).

Decision tree

Are you in production?
├─ No -> File (development only)
└─ Yes
   ├─ Do you need to search by PHI value (e.g. "what's the token for John Smith?")
   │  ├─ Yes -> DynamoDB Token-Based (recommended) or DynamoDB Record-Based
   │  └─ No (token -> value lookups only) -> any backend works
   ├─ Do you need to delete all PHI for a specific record at once?
   │  ├─ Yes -> DynamoDB Record-Based
   │  └─ No -> DynamoDB Token-Based (recommended)
   └─ Do you require KMS-managed encryption with no separate database?
      ├─ Yes -> AWS KMS (accept the cost trade-off)
      └─ No -> DynamoDB Token-Based (recommended)
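The tree lists its three production questions side by side without saying which one wins when several apply. The function below is one possible reading, offered as an assumption rather than an official rule: a KMS requirement is checked first, then per-record deletion, with Token-Based as the default. It rejects the one combination the guide rules out (KMS together with value -> token lookup). The function and parameter names are hypothetical.

```python
def choose_backend(
    production: bool,
    need_value_lookup: bool = False,
    need_per_record_delete: bool = False,
    require_kms_no_database: bool = False,
) -> str:
    """Return a backend name following one reading of the decision tree."""
    if not production:
        return "File"
    if require_kms_no_database:
        if need_value_lookup:
            # KMS tokens cannot be searched by PHI value.
            raise ValueError("AWS KMS does not support value -> token lookup")
        return "AWS KMS"
    if need_per_record_delete:
        return "DynamoDB Record-Based"
    return "DynamoDB Token-Based"

print(choose_backend(production=True))
print(choose_backend(production=True, need_per_record_delete=True))
```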

Cost

For the per-request and per-hour cost breakdown across all backends — including how storage choice compares to Bedrock, EC2, and other AWS costs — see the Cost Analysis.

Short version: DynamoDB Token-Based is ~$0.00005 per request; KMS is ~$0.003 per request. At scale, the cost of the KMS backend can exceed your EC2 cost. Both are still smaller than the Bedrock cost (~$0.009/request), which is the largest part of the bill regardless of which backend you choose.


Switching backends

You can change backends at any time by passing a different storage_type query parameter — there is no migration required at the server level. However:

  • Data anonymized with one backend cannot be deanonymized using a different backend. The token formats and storage locations are incompatible.
  • If you want to migrate existing anonymized data, you must deanonymize with the old backend and re-anonymize with the new one.
  • We recommend choosing a backend at deployment time and sticking with it.
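The migration order described above (restore with the old backend, then anonymize again with the new one) can be sketched as a small loop. The two callables stand in for whatever client code you use to call the API with the old and new storage_type. They are hypothetical, and the string-replacing stand-ins below exist only to make the example runnable.

```python
from typing import Callable, Iterable, List

def migrate_documents(
    anonymized_docs: Iterable[str],
    deanonymize_old: Callable[[str], str],
    anonymize_new: Callable[[str], str],
) -> List[str]:
    """Re-anonymize documents under a new backend, one document at a time."""
    migrated = []
    for doc in anonymized_docs:
        original = deanonymize_old(doc)  # must use the backend that made the tokens
        migrated.append(anonymize_new(original))
    return migrated

# Stand-ins for real API calls (illustrative only).
def fake_deanonymize_old(text: str) -> str:
    return text.replace("OLD_TOKEN_1", "John Smith")

def fake_anonymize_new(text: str) -> str:
    return text.replace("John Smith", "NEW_TOKEN_1")

result = migrate_documents(
    ["Patient OLD_TOKEN_1 was admitted."],
    fake_deanonymize_old,
    fake_anonymize_new,
)
print(result)
```

Between the two calls the original PHI exists in memory in clear text, so a migration like this should run inside the same trust boundary as the API server.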

Summary

For nearly all production use cases, DynamoDB Token-Based is the right choice. It's the fastest, cheapest, and most flexible option. Pick another backend only if you have a specific reason: granular per-record deletion (Record-Based), or no-database encryption requirements (KMS).