# DevSecOps Roadmap
Label relative necessity:
- 1: trivial/stretch goal
- 2: nice to have
- 3: want to have
- 4: important
- 5: need/essential
TODO: ~~reduce epic size,~~ create tickets for 3 epics below, architecture diagrams
## Historical Priorities
#### Improved DevOps infrastructure:
- Continuous deployment reliability and robustness improvements
- Production operational responsiveness
- End-to-end testing (a.k.a. integration smoke tests)
- Cybersecurity assessment, scanning, and vulnerability management
#### Additional backlog features:
- Improved Kibana Reporting and Analytics
- User Management UI (OFA Admins and Regional Staff)
- User Access Release 3.5+ improvements
- Research-oriented environment w/ Blue/Green testing
- User-feedback features and bugs not yet accounted for
## Current DevOps Priorities
* [~~CI path-filtering front/back separation~~](https://github.com/raft-tech/TANF-app/issues/2457)
* [single clamav scanner](https://github.com/raft-tech/tanf-app/issues/2429)
* [nexus standup](https://github.com/raft-tech/tanf-app/issues/2116)
- [~~split out gunicorn_start.sh~~](https://github.com/raft-tech/TANF-app/issues/2347)
- deploy custom manifest (see [building the buildpack](https://github.com/cloudfoundry/python-buildpack#building-the-buildpack))
- docker directly -- ATO questions
- remove makemigrations for deploys
- value: deployment speed, reduced complexity, stable dev experience
* [spike on pipeline refactor](https://github.com/raft-tech/TANF-app/issues/2419)
* [catch migration failures in pipeline](https://github.com/raft-tech/TANF-app/issues/1623)
* deploy over failed deployments
* split out use-cases, draft
* [integrate hashicorp vault into deployments](https://github.com/raft-tech/tanf-app/issues/2225)
* database seed - to be drafted by George
- replace deploy-backend.sh
- terraform?
- value: prevent unplanned debugging of failed deploys
- spike to research the Cloud Foundry/cloud.gov way of doing our deployments
History: https://app.mural.co/t/raft2792/m/raft2792/1677172199979/6dae0de24469eb9093af73bdf45a6f1760659abc?sender=60470f8f-3c49-4c8d-a8b8-939dc74f72f7
### Continuous Integration Revamp - Nov/Dec/Jan
-5- 1373, 2000, var storage, sending vars #circleci cfg
- value: reduced complexity, reduce techdebt, break up monolith
-4- 2115, 2116 # pre-built docker image and/or improve buildpacks
- value: reduce build time, reduce local errors, simplifying circleci processes
### Integration test - Nov/Dec/Jan
-5- 310, 2141 - initial cypress setup (auth'd "hello world")
-5- 2274 - user approval
-5- 2282 - file upload
-2- 1492 - pa11y improvement
### Continuous Deployments - Feb/March
1623 - fail pipeline if deployment fails -- need to split up tasks
2429 - singular prod ClamAV scanner
value: frees up substantial RAM, possibly allowing more dev environments
-5i- terraform on-demand CD (branch-name vs sandbox)
value: replaces ad hoc bash scripts with state-machine/desired-state deploys; idempotency, infrastructure-as-code, no manual intervention needed to fix failures, robustness/confidence
-3i- better deployment condition checks
value: shorter feedback loop while developing
-3- redeploy over previous failed build
value: robustness, certainty that we always have the app running
-3i- UX demo environment # sub-epic
-3i- true rolling deployments/investigate --no-route
- value: no downtime unless code is broken
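
Item 1623 above (fail the pipeline if deployment fails, with tasks split up) comes down to running each deploy task as its own step and propagating the first non-zero exit code. A minimal Python sketch of that idea; the function name `run_steps` and the step commands in the usage note are hypothetical and are not taken from the existing deploy scripts:

```python
import subprocess
import sys
from typing import List, Sequence


def run_steps(steps: Sequence[List[str]]) -> int:
    """Run each command in order; stop at the first failure.

    Returns 0 if every step succeeded, otherwise the exit code of the
    first failing step. Later steps are not executed after a failure.
    """
    for cmd in steps:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(
                f"step failed ({result.returncode}): {' '.join(cmd)}",
                file=sys.stderr,
            )
            return result.returncode
    return 0
```

A CI job could end with something like `sys.exit(run_steps([["./migrate.sh"], ["./deploy.sh"]]))` (script names are placeholders), so a failed migration marks the job as failed instead of passing silently.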
### Cloud.gov improvements - Apr/May?
-4i- domain name mocking for dev/staging
- value: testing system-tier features before prod, confidence boost
-2i- live env var reloads
- value: develop against deployed environments; keeps deployed envs flexible
-1- domain name mocking for local
### Continuous Integration Improvements - Apr - should be done after CD revamp
-3i- 1623 # circleci catch migration/deployment failures
-4i- 831 # logging aggregation
-3i- 2242, 1854 # bugs - trufflehog, overwriting env vars for different spaces
-3i- 1620, 215, 216 # what's deployed where
value: lineage of code/traceability
-1- 1705, 1786 # linting
value: catch silent lint failures for deployments
-1- 1530 # container optimize
-1??- circleci frontend/backend separation # don't rebuild/retest unchanged code
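
The frontend/backend separation item above amounts to mapping changed file paths to the components that need to be rebuilt and retested. A minimal Python sketch of that decision, assuming hypothetical directory prefixes `frontend/` and `backend/` (the real repository layout and the CircleCI wiring are not described in this note):

```python
from typing import Iterable, Set

# Hypothetical path prefixes; adjust to the real repository layout.
COMPONENT_PREFIXES = {
    "frontend": ("frontend/",),
    "backend": ("backend/",),
}


def components_to_build(changed_paths: Iterable[str]) -> Set[str]:
    """Return the components whose files changed.

    A path outside the known prefixes (for example CI config) is treated
    as affecting every component, so shared changes are never skipped.
    """
    affected: Set[str] = set()
    for path in changed_paths:
        owners = {
            name
            for name, prefixes in COMPONENT_PREFIXES.items()
            if path.startswith(prefixes)
        }
        if not owners:
            # Unknown or shared path: rebuild everything to stay safe.
            return set(COMPONENT_PREFIXES)
        affected |= owners
    return affected
```

Falling back to a full build for unrecognized paths is a deliberately conservative choice: it trades some build time for never skipping tests that a shared change might break.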
### Key Management - March
-3- 1977, 2225 # hashicorp vault
- value: secure way to rotate jwt_keys, infra-as-code, supplants part of bash deploy scripts
-4i- 2118 # persist env vars across rebuilds
- value: clean/efficient deployments, prevent errors, infra-as-code, robustness
-1- automated key rotation
2435 - env vars for staging/prod
### Continuous Deployments Enhancements - May
-3i- terraform auto-deploy all branches # branchname.raft.aws.com vs branchname.app.cloud.gov
-3- terraform auto take-down
value: cleanup, prevent littering, free space, prevent staleness
-2i- auto-rollback on failed deploys
value: mainly useful for prod; pairs well with 215
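
The auto take-down item above needs a rule for deciding which branch environments are stale. A minimal Python sketch, assuming environments are named after their branches and that a small set of environments is never torn down; the function name and the `protected` defaults are hypothetical:

```python
from typing import Iterable, List


def stale_environments(
    deployed: Iterable[str],
    active_branches: Iterable[str],
    protected: Iterable[str] = ("prod", "staging"),
) -> List[str]:
    """Return deployed environments with no matching active branch.

    Environments listed in `protected` are never reported as stale.
    The result is sorted so repeated runs produce the same order.
    """
    active = set(active_branches)
    keep = set(protected)
    return sorted(
        env for env in set(deployed) if env not in active and env not in keep
    )
```

The returned names could then be handed to whatever performs the actual teardown (for example a Terraform destroy per environment); that step is intentionally left out of the sketch.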
### Security Level Up - Apr/May
-2i- SAST/DAST - static/dynamic code analysis
-1i- authenticated app crawler/exploiter
-3i- addressing vulnerable dependencies/version updates
### ops dashboard/notifications
- sentry for first pass (831)
- live monitoring feeds for prod, develop
- Splunk, Logstash, Datadog
- smoke test/pings
- dynamically control "down for maintenance" page
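
The smoke test/pings item above can start as a simple HTTP status check per environment. A minimal Python sketch using only the standard library; the function names are hypothetical, and the URLs to check are left to the caller since this note does not list them:

```python
from typing import Callable, Dict, Iterable
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if no response arrived."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status
    except HTTPError as err:
        # The server answered, but with an error status (4xx/5xx).
        return err.code
    except (URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return 0


def smoke_test(
    urls: Iterable[str],
    fetch: Callable[[str], int] = http_status,
) -> Dict[str, bool]:
    """Map each URL to True if it answered with a 2xx status."""
    return {url: 200 <= fetch(url) < 300 for url in urls}
```

Passing `fetch` as a parameter keeps the check testable without network access, and the resulting dictionary could feed a notification or dashboard step.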
## Lift-shift off Cloud.gov
- Docker vs. buildpacks: start once ATO questions are answered. No timeline; this might end up being a big "why" for contract renewal: lift-shift to AWS GovCloud. Blocked by 2115/2116 as well as ATO.
- ATO questions
- evaluate dev-local CF
- hardening dockerfiles
- simplifying docker-compose