Phase 2.5(b) Design — Wire upload_attachments to persist + GCS¶
Status: Approved for implementation
Author: Claude (Chat) + Rajesh
Date: 2026-05-08
Scope: forms-backend/routes_phase2.py:upload_attachments wire-up + forms-frontend/src/lib/api.ts field_key contract
Out of scope: ClamAV scanning (Phase 2.5(b.2)), MIME/size env-var cleanup (Phase 2.5(b.cleanup)), Phase 2.5(c) finalize/routing, Phase 2.5(d) template render, Phase 2.5(e) HappyFox API client, ownership/authorization checks (Phase 2.6 hardening)
Executive summary¶
Replace the POST /api/submissions/<id>/attachments stub with a real handler that:
- Validates each uploaded file (size, MIME type, non-empty)
- Uploads files to GCS at
<submission_id>/<attachment_id>/<original_name> - Inserts
Attachmentrows withclamav_status='pending'(ClamAV deferred to b.2) - Audit-logs each upload as
attachment_uploaded - Returns
AttachmentUploadResponsematching contract
Plus one frontend-side change: send field_key per file via multipart form parts. This is required because the schema mandates field_key NOT NULL on attachments, but the current SPA sends a flat files: File[] array with no field_key.
The data layer is fully designed (Attachment model, schema CHECK constraints, indexes, AuditLog enum values, GCS client) — this commit is purely handler wire-up + contract alignment.
Three commits in sequence¶
| # | Commit | What | Scope |
|---|---|---|---|
| 1 | This design doc | Capture Phase 2.5(b) reasoning in MkDocs | mkdocs-portal |
| 2 | Frontend field_key contract | api.ts sends field_key per file as multipart parts | forms-frontend |
| 3 | Backend wire-up | routes_phase2.py upload_attachments handler | forms-backend |
Sequence rationale: ship frontend first. The new files[<field_key>] multipart shape doesn't break the current backend stub — the stub returns {attachments: []} regardless of input. So frontend ships, user-visible behavior unchanged. Then backend ships, the new contract becomes load-bearing.
The reverse sequence (backend first, accept both old and new contracts) would require the backend to handle two contract shapes — increases complexity for no benefit since we control both sides.
Decisions baked in¶
| # | Decision | Choice |
|---|---|---|
| 1 | ClamAV strategy | Defer to Phase 2.5(b.2) — persist with clamav_status='pending'. The schema partial index supports a future async scan worker. |
| 2 | MIME/size source | Read Cloud Run env vars (current behavior). File env-var-cleanup as Phase 2.5(b.cleanup). |
| 3 | GCS object path | <submission_id>/<attachment_id>/<original_name> — three-level, collision-safe |
| 4 | Partial success | All-or-nothing. GCS uploads first, then single-transaction DB inserts. On any failure, delete uploaded GCS files; abort transaction. |
| 5 | field_key transport | Frontend multipart parts — files[<field_key>] named parts; backend reads field_key from the form-data multipart structure |
| 6 | Authorization scope | Authentication only, no per-submission ownership check. Matches Phase 1 + Phase 2.5(a) patterns. Phase 2.6 hardening pass can introduce ownership semantics across all Phase 2 endpoints uniformly if needed. |
Commit 2: Frontend field_key contract¶
Current state (forms-frontend/src/lib/api.ts)¶
Per Code's discovery — uploadAttachments(submissionId, files: File[]) builds FormData with each file appended under flat name 'files', POSTs to /api/submissions/${submissionId}/attachments. No field_key per file.
Target state¶
Function signature changes to accept attachments as { field_key: string, file: File }[]:
async uploadAttachments(
submissionId: string,
attachments: { field_key: string; file: File }[]
): Promise<AttachmentUploadResponse> {
const formData = new FormData();
for (const { field_key, file } of attachments) {
// Multipart part name encodes field_key:
formData.append(`files[${field_key}]`, file, file.name);
}
const response = await fetch(
`/api/submissions/${encodeURIComponent(submissionId)}/attachments`,
{ method: 'POST', body: formData, headers: { Authorization: `Bearer ${token}` } }
);
if (!response.ok) throw new Error(...);
return response.json();
}
The files[<field_key>] part-name pattern is a common convention for transmitting structured data via multipart form-data. Backend parses by iterating request.files keys and extracting field_key from the bracket notation.
Caller updates¶
Wherever uploadAttachments is called from FormFill (or wherever attachment fields are rendered), the caller now passes the field_key alongside each file. TODO Code: find the call site and verify it has access to field_key — likely the FieldRenderer or FormFill component iterating attachment-typed fields.
Why this is safe¶
- The frontend change ships before the backend change. Backend still returns the same stub response (
{attachments: []}). User-visible behavior unchanged until commit 3. - The multipart shape is backward compatible at the HTTP level — the server can iterate
request.filesand find thefiles[<field_key>]keys regardless of whether it parses the bracket notation.
Commit 3: Backend wire-up¶
What this commit changes¶
One file: forms-backend/routes_phase2.py. The existing stub at upload_attachments (~30 lines) is replaced with a working handler (~100 lines including helpers).
May also touch: imports (add google.cloud.storage, re, uuid, hashlib, Attachment model). One new helper for GCS upload + cleanup.
Detailed handler logic¶
import re
import uuid
import hashlib
from google.cloud import storage as gcs
# Pre-existing helpers from Phase 2.5(a) that we reuse:
# _audit_v2 — already defined in routes_phase2.py
#
# New module-level GCS client (lazy):
_gcs_client = None
def _get_gcs_client() -> gcs.Client:
global _gcs_client
if _gcs_client is None:
# TODO Code: verify the correct config attribute name for the GCP
# project — may be config.GCP_PROJECT, config.PROJECT_ID, or
# something else. Check config.py at apply time.
_gcs_client = gcs.Client(project=config.GCP_PROJECT)
return _gcs_client
def _parse_field_key(part_name: str) -> str | None:
"""Extract field_key from multipart part name 'files[<field_key>]'.
Returns None if part name doesn't match the expected pattern.
"""
m = re.match(r'^files\[([^\]]+)\]$', part_name)
return m.group(1) if m else None
def _validate_file(file_storage, content: bytes) -> tuple[bool, str | None]:
"""Returns (is_valid, error_reason).
TODO Code: verify the correct config attribute names for size and
MIME limits. Cloud Run env vars are MAX_UPLOAD_BYTES and
ALLOWED_MIME_TYPES; config.py constants per discovery may be named
ATTACHMENT_MAX_BYTES and ATTACHMENT_ALLOWED_MIME. Use whichever
config.py actually exposes — those are the runtime values.
"""
if len(content) == 0:
return False, "empty_file"
if len(content) > config.MAX_UPLOAD_BYTES:
return False, f"file_too_large (>{config.MAX_UPLOAD_BYTES} bytes)"
if file_storage.mimetype not in config.ALLOWED_MIME_TYPES:
return False, f"mime_type_not_allowed ({file_storage.mimetype})"
return True, None
def _scan_status_to_wire(db_status: str) -> str:
"""Map DB clamav_status to contract scan_status.
DB has: pending, clean, infected (and possibly failed/error states
added in Phase 2.5(b.2)). Contract has: 'clean' | 'infected' | 'pending'.
Until Phase 2.5(b.2) ClamAV scanning ships, all uploads land as
'pending' in DB and pass through to the wire as 'pending'.
"""
return db_status if db_status in ('clean', 'infected', 'pending') else 'pending'
@bp.post("/submissions/<submission_id>/attachments")
@require_role("submitter")
def upload_attachments(submission_id):
"""Upload attachments for a submission.
Mirrors Phase 2.5(a) discipline: validate, persist transactionally,
audit-log each operation.
Frontend sends multipart form-data with parts named 'files[<field_key>]'.
Each part's filename is the original_name; field_key comes from the
multipart key.
Authorization: requires 'submitter' role (enforced by decorator).
No per-submission ownership check — matches Phase 1 + Phase 2.5(a)
patterns. Phase 2.6 hardening can introduce ownership semantics
across all Phase 2 endpoints uniformly if needed.
"""
user = current_user()
# Iterate request.files for multipart parts matching files[<field_key>]
upload_items = [] # list of (field_key, file_storage)
for part_name, file_storage in request.files.items(multi=True):
field_key = _parse_field_key(part_name)
if field_key is None:
continue # Ignore parts that don't match the expected pattern
upload_items.append((field_key, file_storage))
if not upload_items:
return jsonify(error="no_files"), 400
with Session(get_engine()) as session:
# Verify submission exists (existence check only, not ownership)
submission = session.get(Submission, submission_id)
if submission is None:
return jsonify(error="submission_not_found"), 404
# Read all file contents up front (we need bytes for SHA-256
# and for size validation; also gives us a stable reference
# for both validation and upload)
items_with_content = []
for field_key, fs in upload_items:
fs.stream.seek(0)
content = fs.stream.read()
items_with_content.append((field_key, fs, content))
# Pre-validate all files (fail fast before any GCS upload)
for field_key, fs, content in items_with_content:
ok, reason = _validate_file(fs, content)
if not ok:
_audit_v2(session, "attachment_rejected",
actor_username=user.username,
target_id=submission_id, target_type="submission",
details={"field_key": field_key,
"filename": fs.filename,
"reason": reason},
success=False)
session.commit()
return jsonify(error=reason, field_key=field_key,
filename=fs.filename), 422
# Upload to GCS (all files); track uploaded paths for cleanup on failure
gcs_client = _get_gcs_client()
bucket = gcs_client.bucket(config.ATTACHMENTS_BUCKET)
uploaded_paths = [] # for rollback if anything fails downstream
attachments = []
try:
for field_key, fs, content in items_with_content:
attachment_id = uuid.uuid4()
sha256 = hashlib.sha256(content).hexdigest()
gcs_object = f"{submission_id}/{attachment_id}/{fs.filename}"
blob = bucket.blob(gcs_object)
blob.upload_from_string(content, content_type=fs.mimetype)
uploaded_paths.append(gcs_object)
attachments.append(Attachment(
id=attachment_id,
submission_id=submission_id,
field_key=field_key,
original_name=fs.filename,
mime_type=fs.mimetype,
size_bytes=len(content),
gcs_bucket=config.ATTACHMENTS_BUCKET,
gcs_object=gcs_object,
sha256=sha256,
clamav_status='pending', # Decision 1: defer to b.2
))
# Single-transaction insert of all rows + audit logs
for att in attachments:
session.add(att)
_audit_v2(session, "attachment_uploaded",
actor_username=user.username,
target_id=str(att.id), target_type="attachment",
details={"submission_id": submission_id,
"field_key": att.field_key,
"filename": att.original_name,
"size_bytes": att.size_bytes,
"sha256": att.sha256})
session.commit()
# Refresh to get server-default timestamps
for att in attachments:
session.refresh(att)
except Exception:
# Cleanup: delete any GCS objects we uploaded before the failure
for path in uploaded_paths:
try:
bucket.blob(path).delete()
except Exception:
pass # best-effort cleanup; don't mask the original error
raise # let Flask error handler return 500
# Build wire response
return jsonify({
"attachments": [
{
"id": str(att.id),
"filename": att.original_name,
"size_bytes": att.size_bytes,
"mime_type": att.mime_type,
"scan_status": _scan_status_to_wire(att.clamav_status),
"uploaded_at": att.uploaded_at.isoformat(),
}
for att in attachments
]
}), 201
What we're NOT doing¶
| Item | Why deferred |
|---|---|
| ClamAV scanning | Phase 2.5(b.2) — needs sidecar/worker architecture decision |
| Per-submission ownership check | Phase 2.6 hardening — should be applied uniformly across all Phase 2 endpoints, not just attachments |
| Streaming uploads (stream-to-GCS without buffering) | Premature optimization; current 25MB limit is fine in-memory |
| Resumable uploads | Not needed for 25MB file size limit |
| Per-attachment encryption (DEK like submission_fields) | Attachments aren't encrypted at the application layer; rely on GCS default encryption + bucket-level KMS. Could add app-layer envelope encryption in Phase 2.6 if PII-sensitive forms accumulate in attachments. |
| Antivirus scan on download path | Phase 2.5(b.2) territory |
| Attachment retention/lifecycle in GCS | Phase 2.6 — GCS lifecycle rules are infra-level, not handler-level |
| MIME/size env-var consolidation | Phase 2.5(b.cleanup) — pick config.py vs Cloud Run env vars vs schema CHECK as source of truth |
| Filename sanitization (spaces, special chars) | Phase 2.5(b.2) or download-path work — current behavior is "preserve original_name as-is" which works in GCS but may be awkward on download |
| SOC / observability integration (Wazuh, Cloud Logging, Prometheus, IR runbook, DR procedure, drill) | Tracked separately as T1.6 in priorities tracker — applies to forms portal as a whole, sequenced after Phase 2.5 functional work completes |
Risk assessment¶
| Risk | Severity | Mitigation |
|---|---|---|
| GCS upload succeeds, DB insert fails → orphan GCS objects | Low-medium | Best-effort cleanup loop deletes uploaded paths on exception. Worst case: orphan files in GCS with no DB row. GCS lifecycle rules can clean these later; not a data correctness issue. |
| File contents loaded fully into memory before upload | Low | 25MB max × ~5 typical attachments = 125MB worst case. Cloud Run instance memory is 512MB minimum. Acceptable. |
| MIME type spoofing (Content-Type header lies) | Medium | Allowed-MIME validation reads Content-Type header from multipart, which the client controls. Mitigation: future ClamAV scan in 2.5(b.2) will inspect actual bytes. For now, trust boundary is authenticated GPUS staff. |
Multipart part name not matching files[<key>] |
Low | Parts that don't match are silently ignored. If frontend (commit 2) ships with a different naming pattern than expected, all uploads silently 400 (no_files). Mitigation: explicit verification at deploy time. |
| ClamAV deferred → unscanned files in GCS | Medium | Trust boundary: authenticated GPUS staff. Mitigation: documented gap; ClamAV is Phase 2.5(b.2). |
| No per-submission ownership check → user-A can attach files to user-B's submission if they have the UUID | Low | Submission UUIDs are not user-discoverable (they're UUIDs, not enumerable). Audit log captures actor_username providing accountability. Phase 2.6 hardening can introduce ownership scoping uniformly. |
| Forms portal events not flowing into SOC pipeline | Medium | Tracked as T1.6 in priorities. Forms portal is operationally invisible to SOC since Phase 1 cutover; this is pre-existing debt, not new debt introduced by 2.5(b). |
Verification plan¶
After commit 2 (frontend) deploys:
1. Sign in to forms.greenpeace.us
2. Navigate to a form with an attachment field. TODO Code: identify which forms have attachment fields by querying DB or reading form definitions.
3. Open Web Inspector → Network. Submit form with an attachment. Check that POST .../attachments request has multipart parts named files[<field_key>].
4. Verify response is still {attachments: []} (stub unchanged).
After commit 3 (backend) deploys: 1. Submit form with attachment via SPA. Verify response shape matches contract:
{
"attachments": [
{
"id": "<uuid>",
"filename": "<original-name>",
"size_bytes": <int>,
"mime_type": "<type>",
"scan_status": "pending",
"uploaded_at": "2026-05-08T..."
}
]
}
SELECT id, submission_id, field_key, original_name, mime_type,
size_bytes, gcs_bucket, gcs_object, sha256, clamav_status,
uploaded_at
FROM attachments
WHERE submission_id = '<test-uuid>'
ORDER BY uploaded_at;
clamav_status='pending', gcs_object matching <submission_id>/<attachment_id>/<original_name>, sha256 populated.
3. GCS-side verification:
Expected: directory containing <attachment_id>/<original-name> files matching the DB rows.
4. Audit log:
SELECT action, actor_username, target_id, jsonb_pretty(details), created_at
FROM audit_log
WHERE action IN ('attachment_uploaded', 'attachment_rejected')
ORDER BY created_at DESC LIMIT 10;
attachment_uploaded row per file with actor_username='rchhetry'.
5. Negative tests:
- Upload a file > 25MB → 422 with field_key and filename in response
- Upload a .exe (or other non-allowed MIME) → 422
- Upload an empty (0-byte) file → 422 with reason "empty_file"
Open questions surfaced¶
- Multiple files for the same field_key: schema allows
(submission_id, field_key)to repeat — is that intentional (multi-file attachment fields) or a missed UNIQUE constraint? Skim for any UI that would render this. Not blocking; tracked for Phase 2.6 review. - Filename sanitization: GCS object path
<submission_id>/<attachment_id>/My Document.pdfworks in GCS but creates URL-encoding pain on download. Phase 2.5(b.2) or download-path work could canonicalize at upload time. - SOC / observability integration: forms portal is operationally invisible to SOC. Tracked as T1.6 in priorities; sequenced after Phase 2.5 functional work.
These don't block 2.5(b); they're tracked items.