Skip to content

Phase 2.5(b) Design — Wire upload_attachments to persist + GCS

Status: Approved for implementation Author: Claude (Chat) + Rajesh Date: 2026-05-08 Scope: forms-backend/routes_phase2.py:upload_attachments wire-up + forms-frontend/src/lib/api.ts field_key contract Out of scope: ClamAV scanning (Phase 2.5(b.2)), MIME/size env-var cleanup (Phase 2.5(b.cleanup)), Phase 2.5(c) finalize/routing, Phase 2.5(d) template render, Phase 2.5(e) HappyFox API client, ownership/authorization checks (Phase 2.6 hardening)


Executive summary

Replace the POST /api/submissions/<id>/attachments stub with a real handler that:

  1. Validates each uploaded file (size, MIME type, non-empty)
  2. Uploads files to GCS at <submission_id>/<attachment_id>/<original_name>
  3. Inserts Attachment rows with clamav_status='pending' (ClamAV deferred to b.2)
  4. Audit-logs each upload as attachment_uploaded
  5. Returns AttachmentUploadResponse matching contract

Plus one frontend-side change: send field_key per file via multipart form parts. This is required because the schema mandates field_key NOT NULL on attachments, but the current SPA sends a flat files: File[] array with no field_key.

The data layer is fully designed (Attachment model, schema CHECK constraints, indexes, AuditLog enum values, GCS client) — this commit is purely handler wire-up + contract alignment.


Three commits in sequence

# Commit What Scope
1 This design doc Capture Phase 2.5(b) reasoning in MkDocs mkdocs-portal
2 Frontend field_key contract api.ts sends field_key per file as multipart parts forms-frontend
3 Backend wire-up routes_phase2.py upload_attachments handler forms-backend

Sequence rationale: ship frontend first. The new files[<field_key>] multipart shape doesn't break the current backend stub — the stub returns {attachments: []} regardless of input. So frontend ships, user-visible behavior unchanged. Then backend ships, the new contract becomes load-bearing.

The reverse sequence (backend first, accept both old and new contracts) would require the backend to handle two contract shapes — increases complexity for no benefit since we control both sides.


Decisions baked in

# Decision Choice
1 ClamAV strategy Defer to Phase 2.5(b.2) — persist with clamav_status='pending'. The schema partial index supports a future async scan worker.
2 MIME/size source Read Cloud Run env vars (current behavior). File env-var-cleanup as Phase 2.5(b.cleanup).
3 GCS object path <submission_id>/<attachment_id>/<original_name> — three-level, collision-safe
4 Partial success All-or-nothing. GCS uploads first, then single-transaction DB inserts. On any failure, delete uploaded GCS files; abort transaction.
5 field_key transport Frontend multipart partsfiles[<field_key>] named parts; backend reads field_key from the form-data multipart structure
6 Authorization scope Authentication only, no per-submission ownership check. Matches Phase 1 + Phase 2.5(a) patterns. Phase 2.6 hardening pass can introduce ownership semantics across all Phase 2 endpoints uniformly if needed.

Commit 2: Frontend field_key contract

Current state (forms-frontend/src/lib/api.ts)

Per Code's discovery — uploadAttachments(submissionId, files: File[]) builds FormData with each file appended under flat name 'files', POSTs to /api/submissions/${submissionId}/attachments. No field_key per file.

Target state

Function signature changes to accept attachments as { field_key: string, file: File }[]:

async uploadAttachments(
  submissionId: string,
  attachments: { field_key: string; file: File }[]
): Promise<AttachmentUploadResponse> {
  const formData = new FormData();
  for (const { field_key, file } of attachments) {
    // Multipart part name encodes field_key:
    formData.append(`files[${field_key}]`, file, file.name);
  }
  const response = await fetch(
    `/api/submissions/${encodeURIComponent(submissionId)}/attachments`,
    { method: 'POST', body: formData, headers: { Authorization: `Bearer ${token}` } }
  );
  if (!response.ok) throw new Error(...);
  return response.json();
}

The files[<field_key>] part-name pattern is a common convention for transmitting structured data via multipart form-data. Backend parses by iterating request.files keys and extracting field_key from the bracket notation.

Caller updates

Wherever uploadAttachments is called from FormFill (or wherever attachment fields are rendered), the caller now passes the field_key alongside each file. TODO Code: find the call site and verify it has access to field_key — likely the FieldRenderer or FormFill component iterating attachment-typed fields.

Why this is safe

  • The frontend change ships before the backend change. Backend still returns the same stub response ({attachments: []}). User-visible behavior unchanged until commit 3.
  • The multipart shape is backward compatible at the HTTP level — the server can iterate request.files and find the files[<field_key>] keys regardless of whether it parses the bracket notation.

Commit 3: Backend wire-up

What this commit changes

One file: forms-backend/routes_phase2.py. The existing stub at upload_attachments (~30 lines) is replaced with a working handler (~100 lines including helpers).

May also touch: imports (add google.cloud.storage, re, uuid, hashlib, Attachment model). One new helper for GCS upload + cleanup.

Detailed handler logic

import re
import uuid
import hashlib
from google.cloud import storage as gcs

# Pre-existing helpers from Phase 2.5(a) that we reuse:
#   _audit_v2 — already defined in routes_phase2.py
#
# New module-level GCS client (lazy):
_gcs_client = None

def _get_gcs_client() -> gcs.Client:
    global _gcs_client
    if _gcs_client is None:
        # TODO Code: verify the correct config attribute name for the GCP
        # project — may be config.GCP_PROJECT, config.PROJECT_ID, or
        # something else. Check config.py at apply time.
        _gcs_client = gcs.Client(project=config.GCP_PROJECT)
    return _gcs_client


def _parse_field_key(part_name: str) -> str | None:
    """Extract field_key from multipart part name 'files[<field_key>]'.
    Returns None if part name doesn't match the expected pattern.
    """
    m = re.match(r'^files\[([^\]]+)\]$', part_name)
    return m.group(1) if m else None


def _validate_file(file_storage, content: bytes) -> tuple[bool, str | None]:
    """Returns (is_valid, error_reason).

    TODO Code: verify the correct config attribute names for size and
    MIME limits. Cloud Run env vars are MAX_UPLOAD_BYTES and
    ALLOWED_MIME_TYPES; config.py constants per discovery may be named
    ATTACHMENT_MAX_BYTES and ATTACHMENT_ALLOWED_MIME. Use whichever
    config.py actually exposes — those are the runtime values.
    """
    if len(content) == 0:
        return False, "empty_file"
    if len(content) > config.MAX_UPLOAD_BYTES:
        return False, f"file_too_large (>{config.MAX_UPLOAD_BYTES} bytes)"
    if file_storage.mimetype not in config.ALLOWED_MIME_TYPES:
        return False, f"mime_type_not_allowed ({file_storage.mimetype})"
    return True, None


def _scan_status_to_wire(db_status: str) -> str:
    """Map DB clamav_status to contract scan_status.

    DB has: pending, clean, infected (and possibly failed/error states
    added in Phase 2.5(b.2)). Contract has: 'clean' | 'infected' | 'pending'.
    Until Phase 2.5(b.2) ClamAV scanning ships, all uploads land as
    'pending' in DB and pass through to the wire as 'pending'.
    """
    return db_status if db_status in ('clean', 'infected', 'pending') else 'pending'


@bp.post("/submissions/<submission_id>/attachments")
@require_role("submitter")
def upload_attachments(submission_id):
    """Upload attachments for a submission.

    Mirrors Phase 2.5(a) discipline: validate, persist transactionally,
    audit-log each operation.

    Frontend sends multipart form-data with parts named 'files[<field_key>]'.
    Each part's filename is the original_name; field_key comes from the
    multipart key.

    Authorization: requires 'submitter' role (enforced by decorator).
    No per-submission ownership check — matches Phase 1 + Phase 2.5(a)
    patterns. Phase 2.6 hardening can introduce ownership semantics
    across all Phase 2 endpoints uniformly if needed.
    """
    user = current_user()

    # Iterate request.files for multipart parts matching files[<field_key>]
    upload_items = []  # list of (field_key, file_storage)
    for part_name, file_storage in request.files.items(multi=True):
        field_key = _parse_field_key(part_name)
        if field_key is None:
            continue  # Ignore parts that don't match the expected pattern
        upload_items.append((field_key, file_storage))

    if not upload_items:
        return jsonify(error="no_files"), 400

    with Session(get_engine()) as session:
        # Verify submission exists (existence check only, not ownership)
        submission = session.get(Submission, submission_id)
        if submission is None:
            return jsonify(error="submission_not_found"), 404

        # Read all file contents up front (we need bytes for SHA-256
        # and for size validation; also gives us a stable reference
        # for both validation and upload)
        items_with_content = []
        for field_key, fs in upload_items:
            fs.stream.seek(0)
            content = fs.stream.read()
            items_with_content.append((field_key, fs, content))

        # Pre-validate all files (fail fast before any GCS upload)
        for field_key, fs, content in items_with_content:
            ok, reason = _validate_file(fs, content)
            if not ok:
                _audit_v2(session, "attachment_rejected",
                          actor_username=user.username,
                          target_id=submission_id, target_type="submission",
                          details={"field_key": field_key,
                                   "filename": fs.filename,
                                   "reason": reason},
                          success=False)
                session.commit()
                return jsonify(error=reason, field_key=field_key,
                               filename=fs.filename), 422

        # Upload to GCS (all files); track uploaded paths for cleanup on failure
        gcs_client = _get_gcs_client()
        bucket = gcs_client.bucket(config.ATTACHMENTS_BUCKET)
        uploaded_paths = []  # for rollback if anything fails downstream
        attachments = []

        try:
            for field_key, fs, content in items_with_content:
                attachment_id = uuid.uuid4()
                sha256 = hashlib.sha256(content).hexdigest()

                gcs_object = f"{submission_id}/{attachment_id}/{fs.filename}"
                blob = bucket.blob(gcs_object)
                blob.upload_from_string(content, content_type=fs.mimetype)
                uploaded_paths.append(gcs_object)

                attachments.append(Attachment(
                    id=attachment_id,
                    submission_id=submission_id,
                    field_key=field_key,
                    original_name=fs.filename,
                    mime_type=fs.mimetype,
                    size_bytes=len(content),
                    gcs_bucket=config.ATTACHMENTS_BUCKET,
                    gcs_object=gcs_object,
                    sha256=sha256,
                    clamav_status='pending',  # Decision 1: defer to b.2
                ))

            # Single-transaction insert of all rows + audit logs
            for att in attachments:
                session.add(att)
                _audit_v2(session, "attachment_uploaded",
                          actor_username=user.username,
                          target_id=str(att.id), target_type="attachment",
                          details={"submission_id": submission_id,
                                   "field_key": att.field_key,
                                   "filename": att.original_name,
                                   "size_bytes": att.size_bytes,
                                   "sha256": att.sha256})

            session.commit()

            # Refresh to get server-default timestamps
            for att in attachments:
                session.refresh(att)

        except Exception:
            # Cleanup: delete any GCS objects we uploaded before the failure
            for path in uploaded_paths:
                try:
                    bucket.blob(path).delete()
                except Exception:
                    pass  # best-effort cleanup; don't mask the original error
            raise  # let Flask error handler return 500

        # Build wire response
        return jsonify({
            "attachments": [
                {
                    "id": str(att.id),
                    "filename": att.original_name,
                    "size_bytes": att.size_bytes,
                    "mime_type": att.mime_type,
                    "scan_status": _scan_status_to_wire(att.clamav_status),
                    "uploaded_at": att.uploaded_at.isoformat(),
                }
                for att in attachments
            ]
        }), 201

What we're NOT doing

Item Why deferred
ClamAV scanning Phase 2.5(b.2) — needs sidecar/worker architecture decision
Per-submission ownership check Phase 2.6 hardening — should be applied uniformly across all Phase 2 endpoints, not just attachments
Streaming uploads (stream-to-GCS without buffering) Premature optimization; current 25MB limit is fine in-memory
Resumable uploads Not needed for 25MB file size limit
Per-attachment encryption (DEK like submission_fields) Attachments aren't encrypted at the application layer; rely on GCS default encryption + bucket-level KMS. Could add app-layer envelope encryption in Phase 2.6 if PII-sensitive forms accumulate in attachments.
Antivirus scan on download path Phase 2.5(b.2) territory
Attachment retention/lifecycle in GCS Phase 2.6 — GCS lifecycle rules are infra-level, not handler-level
MIME/size env-var consolidation Phase 2.5(b.cleanup) — pick config.py vs Cloud Run env vars vs schema CHECK as source of truth
Filename sanitization (spaces, special chars) Phase 2.5(b.2) or download-path work — current behavior is "preserve original_name as-is" which works in GCS but may be awkward on download
SOC / observability integration (Wazuh, Cloud Logging, Prometheus, IR runbook, DR procedure, drill) Tracked separately as T1.6 in priorities tracker — applies to forms portal as a whole, sequenced after Phase 2.5 functional work completes

Risk assessment

Risk Severity Mitigation
GCS upload succeeds, DB insert fails → orphan GCS objects Low-medium Best-effort cleanup loop deletes uploaded paths on exception. Worst case: orphan files in GCS with no DB row. GCS lifecycle rules can clean these later; not a data correctness issue.
File contents loaded fully into memory before upload Low 25MB max × ~5 typical attachments = 125MB worst case. Cloud Run instance memory is 512MB minimum. Acceptable.
MIME type spoofing (Content-Type header lies) Medium Allowed-MIME validation reads Content-Type header from multipart, which the client controls. Mitigation: future ClamAV scan in 2.5(b.2) will inspect actual bytes. For now, trust boundary is authenticated GPUS staff.
Multipart part name not matching files[<key>] Low Parts that don't match are silently ignored. If frontend (commit 2) ships with a different naming pattern than expected, all uploads silently 400 (no_files). Mitigation: explicit verification at deploy time.
ClamAV deferred → unscanned files in GCS Medium Trust boundary: authenticated GPUS staff. Mitigation: documented gap; ClamAV is Phase 2.5(b.2).
No per-submission ownership check → user-A can attach files to user-B's submission if they have the UUID Low Submission UUIDs are not user-discoverable (they're UUIDs, not enumerable). Audit log captures actor_username providing accountability. Phase 2.6 hardening can introduce ownership scoping uniformly.
Forms portal events not flowing into SOC pipeline Medium Tracked as T1.6 in priorities. Forms portal is operationally invisible to SOC since Phase 1 cutover; this is pre-existing debt, not new debt introduced by 2.5(b).

Verification plan

After commit 2 (frontend) deploys: 1. Sign in to forms.greenpeace.us 2. Navigate to a form with an attachment field. TODO Code: identify which forms have attachment fields by querying DB or reading form definitions. 3. Open Web Inspector → Network. Submit form with an attachment. Check that POST .../attachments request has multipart parts named files[<field_key>]. 4. Verify response is still {attachments: []} (stub unchanged).

After commit 3 (backend) deploys: 1. Submit form with attachment via SPA. Verify response shape matches contract:

{
  "attachments": [
    {
      "id": "<uuid>",
      "filename": "<original-name>",
      "size_bytes": <int>,
      "mime_type": "<type>",
      "scan_status": "pending",
      "uploaded_at": "2026-05-08T..."
    }
  ]
}
2. MAPLE-side DB inspection (cloud-sql-proxy + psql, pattern from 2026-04-30):
SELECT id, submission_id, field_key, original_name, mime_type,
       size_bytes, gcs_bucket, gcs_object, sha256, clamav_status,
       uploaded_at
FROM attachments
WHERE submission_id = '<test-uuid>'
ORDER BY uploaded_at;
Expected: 1 row per uploaded file, clamav_status='pending', gcs_object matching <submission_id>/<attachment_id>/<original_name>, sha256 populated. 3. GCS-side verification:
gsutil ls -l gs://gpus-forms-attachments/<test-submission-id>/
Expected: directory containing <attachment_id>/<original-name> files matching the DB rows. 4. Audit log:
SELECT action, actor_username, target_id, jsonb_pretty(details), created_at
FROM audit_log
WHERE action IN ('attachment_uploaded', 'attachment_rejected')
ORDER BY created_at DESC LIMIT 10;
Expected: one attachment_uploaded row per file with actor_username='rchhetry'. 5. Negative tests: - Upload a file > 25MB → 422 with field_key and filename in response - Upload a .exe (or other non-allowed MIME) → 422 - Upload an empty (0-byte) file → 422 with reason "empty_file"

Open questions surfaced

  1. Multiple files for the same field_key: schema allows (submission_id, field_key) to repeat — is that intentional (multi-file attachment fields) or a missed UNIQUE constraint? Skim for any UI that would render this. Not blocking; tracked for Phase 2.6 review.
  2. Filename sanitization: GCS object path <submission_id>/<attachment_id>/My Document.pdf works in GCS but creates URL-encoding pain on download. Phase 2.5(b.2) or download-path work could canonicalize at upload time.
  3. SOC / observability integration: forms portal is operationally invisible to SOC. Tracked as T1.6 in priorities; sequenced after Phase 2.5 functional work.

These don't block 2.5(b); they're tracked items.