A CI/CD framework that exposes a RESTful API.
CI/CD job triggering and job status checks are RESTful API calls. E.g. to run a unit test:
curl --request POST -H "Authorization: Bearer abc.def.ghi" \
https://my-ci-server.com/jobs/unit-test/runs \
-d '{"commitSha": "0000000000000000000000000000000000000000"}'
curl --request POST -H "Authorization: Bearer abc.def.ghi" \
https://my-ci-server.com/jobs/unit-test/runs \
-d '{"branchName": "master"}'
Check a unit-test run status:
curl --request GET -H "Authorization: Bearer abc.def.ghi" \
https://my-ci-server.com/jobs/unit-test/runs/12345
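A client checking run status would typically poll until the run reaches a terminal state. A minimal sketch, assuming a `status` field with values like "in progress"/"done" (the real response shape is not settled, and `fetch_status` below is a hypothetical stand-in for the API call):

```shell
# fetch_status is a stand-in for the real API call, e.g. something like:
#   fetch_status() {
#     curl -s -H "Authorization: Bearer $TOKEN" \
#       "https://my-ci-server.com/jobs/unit-test/runs/$1" | jq -r .status
#   }

# Poll until the run reaches a terminal state, then print that state.
wait_for_run() {
  run_id=$1
  while :; do
    status=$(fetch_status "$run_id")
    case "$status" in
      done|failed|aborted)
        echo "$status"
        return 0
        ;;
    esac
    sleep "${POLL_INTERVAL:-5}"
  done
}
```

This is the loop a command-line client (or the library mentioned later for pipeline jobs) could provide out of the box.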
The framework may have multiple clients. The clients listed below should be part of official framework support:
- UI interface (may be server-side rendered)
- Command line tool
- GitHub PR hooks
Framework configuration (see list below) should be set up by API calls. A command-line interface can then wrap those calls, and can in turn be wrapped as infrastructure-as-code (maybe through Terraform) that sets up those properties in an idempotent way. The relevant data stays in the database (deployed application/files stay unchanged).
- Job specific:
- Job name.
- Job type specific attributes:
- Git repo jobs (the trigger from GitHub/GitLab/... PR hooks is a different thing, not defined here; this job can be triggered manually or from a PR hook):
- Centralized location of the git repo.
- Relative path of the job config file in the git repo.
- Freestyle jobs:
- Everything.
- Job specific input parameters.
- These should not live in a version-controlled config file, because then the setup of a run on one branch (not yet merged into master) would leak into a run on a different branch.
- Overall config:
- What kinds of servers (number of cores) a job can run on. Though this may be hard to customize.
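As a sketch of the API-driven setup, creating a git-repo job might look like the following; the `/jobs` endpoint shape and the field names are assumptions, not settled design:

```shell
# Assemble the job-creation payload; all field names (jobName, repoUrl,
# configPath) and the example repo are hypothetical.
JOB_NAME="unit-test"
PAYLOAD="{\"jobName\": \"$JOB_NAME\", \"repoUrl\": \"git@github.com:example/repo.git\", \"configPath\": \"ci/job.yml\"}"

# A CLI or Terraform provider would then wrap a call like:
#   curl --request POST -H "Authorization: Bearer $TOKEN" \
#     https://my-ci-server.com/jobs -d "$PAYLOAD"
echo "$PAYLOAD"
```

Because the payload fully describes the job, re-sending it can be made idempotent (create-or-update), which is what the Terraform wrapper needs.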
Job configuration/the logic of a particular job (see list below) should live inside the repo under test:
- Job command/script.
- (Dockerized) environment the script runs in.
- Unlike CircleCI, which needs special-format Dockerfiles (choose one of the official ones and extend it to your needs through the `.circleci/config.yml` `run` command, or create a specific one under strong CircleCI constraints), it should be able to re-use the Dockerfile used for the production or dev environment of the related project.
- External dependencies (database, queueing system, ...):
- Should be defined as sidecar containers.
- Cannot expose a port and let the main script call `localhost:port`, as that would cause port clashes when multiple jobs run on the same host machine.
- Will need to set up a network between the containers.
- This feature becomes very similar to docker-compose. Wonder if we can use it as a dependency library.
- Alternatively, we can always `RUN apt-get install postgresql` in the main container and access localhost. But that would prevent users from re-using their prod container for tests.
- A list of environment variables (used to pass in secrets/credentials).
- The actual credentials need to be saved in the backend, associated with the job itself. The run will error out if it cannot find values for some of the secrets/credentials.
- As a first step, save plain text in the database.
- Finally we should move them to a secured backend persistent storage, e.g. HashiCorp Vault.
- Following The Twelve-Factor App, they should be passed as environment variables one at a time.
- Actually pass in the secrets/credentials via `docker --env ENV=foo`. Unfortunately we cannot do `docker -e ENV` and define `ENV` on the host, as that would cause `ENV` name clashes between secrets/credentials belonging to different jobs.
- An alternative approach is to use Docker swarm secrets. Need to investigate more on the possibilities/pros/cons of this approach.
- If users need to do git operations inside of the job container, we need to use SSH agent forwarding for credentials (volume-mounting `.ssh/id_rsa` into the container is not secure, and doesn't work if the user uses a passphrase). Refer here, here, here for details. We probably still need to create `.ssh/id_rsa` at some point, save it to some secure persistent storage (because otherwise it cannot be used by multiple slaves), give the user `.ssh/id_rsa.pub`, and let them use it to set up 3rd-party git servers.
- Timeout
- There should probably also be a global (umbrella) timeout shared by all jobs.
- What kinds of results it should save, and where they are inside the container after the job finishes.
- If the plan is to "docker volume link" the results out, and let the slave machine (where the slave agent stays) upload the results to persistent storage, we'll have a problem cleaning up those result files (as they are created/owned by the user inside of the docker container). To resolve it, we'll need `--user $(id -u):$(id -g)` in the docker command, as suggested here.
- Resource quota(?)
- If the existing jobs on a slave machine have used up all of its resource quota, new jobs will be blocked from being sent to that machine. Not sure if that is needed, as we can also use slave CPU percentage to block sending new jobs.
- No need to define what kinds of slaves (CPU cores, ...) the job wants to run on, since slaves should just be multi-core boxes with multiple jobs running on them (otherwise we cannot autoscale them based on CPU usage).
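Pulling several of the points above together, here is a hypothetical dry-run sketch of the `docker run` invocation a slave agent might assemble for one job: a per-run network for sidecars instead of exposed host ports, secrets injected via `--env` one at a time, and `--user` so result files can be cleaned up. All concrete names (`job-image`, `ci/run.sh`, `DB_URL`) are made up; the command is echoed rather than executed.

```shell
# Per-run identifiers; a real agent would generate these.
RUN_ID=12345
NET="job-${RUN_ID}-net"

# A sidecar would be started first on the same network, e.g.:
#   docker network create "$NET"
#   docker run -d --name "job-${RUN_ID}-pg" --network "$NET" postgres
# The job then reaches it by container name, never by localhost:port.

# Assemble (but do not execute) the main job container command.
CMD="docker run --rm --network $NET --user $(id -u):$(id -g) --env DB_URL=postgres://job-${RUN_ID}-pg:5432/test job-image ./ci/run.sh"
echo "$CMD"
```

Naming everything after the run ID is what makes the cleanup step trivial: destroy every container and network carrying that prefix.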
It doesn't matter whether the job config (in the repo) and the infrastructure-as-code live in the same production repo, since they have separate deployments anyway (the same question as whether app code and AWS setup share a repo, so the same rule may apply to both). However, job configuration should probably go in the same place as the production infrastructure-as-code.
We should completely hide docker operations (and especially forbid volume linking) so docker can ensure each run is cleaned up completely.
Slaves should be a customized extension of a docker image with the slave agent baked in. The slave agent talks to docker via the Java Docker API Client.
There's no need to enable SSH on the slave box, as communication between the master API and the slave agent goes through a message broker.
Jobs run in docker containers inside this slave docker container. It is very important that (1) the necessary ports are opened, and (2) credentials can be passed in, in case the job wants to communicate with the outside world (e.g. curl/git clone from outside, upload to S3, ...).
Since everything runs on docker, we may consider using a registry to share the docker cache across hosts/slaves. If a step is not defined in the Dockerfile (e.g. library installation inside the script), the installation should be done every time (rather than home-baked dependency caching as in e.g. CircleCI). Also refer here (a 4-year-old guide which may be out of date but describes the problem clearly) and here (newer/updated toolsets).
Even if both master and slave have multiple machines (masters are traditional API machines with a load balancer balancing the API calls), we still want them to be separate machines. Reasons:
- The master (handling simple tasks and responding to users quickly) can use a pure API framework, stay stateless, and be easy to re-deploy and kill when necessary.
- Slave overload will not cause the master to freeze/stop responding.
- Master and slave can follow different scaling rules.
Roles for master/slave machines:
- The master machine has the API server running on it. It is in charge of:
- Exposing API endpoints: managing (create/update/delete) jobs, serving queries of historical job run data, and acting as the gateway for triggering new jobs.
- When there's a new job request, recording it: creating a run record in the database ("trigger time", ...) with status "in progress", and passing the job information to a slave for execution.
- The slave machine hosts a long-running agent. The agent is in charge of:
- Starting the job.
- Monitoring the job execution progress.
- Uploading the console output/testing results to persistent storage(s), and updating the database run record (completion time and status "done") without needing a detour through the master.
Master/slave communication:
- Triggering should be done by message queues. The master sends the task to a message queuing system (there may be multiple queues based on "resource quota"). When a slave has spare capacity, it actively goes to the message queue to grab messages and works on them. This is a better option than remote calls (like the various options provided by Spring remoting support).
- With this loose coupling (via a queueing system), there's no need for master/slave to keep an (SSH/...) connection while a job is executed.
- It naturally acts as a buffer, so
- Slave machines are not overloaded.
- Unstarted runs (if all slaves are busy) are safe across machine restarts/redeploys.
- It naturally distributes run tasks across multiple machines, regardless of how many master/slave machines we have.
- Open question: How to implement job aborting within message queue infrastructure?
- A working but tedious approach: the master sets the run to an `abort` state, and the slave, while executing the job, checks the status periodically.
- Maybe when the slave starts the job, it records/communicates to the master who it is (saving to the database?), so that the master later has a way to find it directly and send a kill signal to it.
- Not ideal because it is a very different communication protocol which needs a separate setup.
- Since slave host machines and docker nodes are both disposable, it doesn't make much sense to save their identities.
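The decoupled triggering above can be mimicked with a toy spool file standing in for the real broker (RabbitMQ/SQS). The `enqueue`/`dequeue` names are made up, and a real broker additionally provides acknowledgements and durability that this sketch does not:

```shell
# A plain file acts as the queue: one task (run id) per line, FIFO order.
QUEUE=$(mktemp)

# Master side: publish a run task and forget about it.
enqueue() { echo "$1" >> "$QUEUE"; }

# Slave side: called only when the slave has spare capacity.
# Pops the oldest task, or returns nonzero if the queue is empty.
dequeue() {
  task=$(head -n 1 "$QUEUE")
  [ -n "$task" ] || return 1
  tail -n +2 "$QUEUE" > "$QUEUE.tmp" && mv "$QUEUE.tmp" "$QUEUE"
  echo "$task"
}
```

The buffering property falls out for free: if no slave calls `dequeue`, tasks simply accumulate instead of overloading anyone.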
- Sharing/communicating job status: via the database.
- No need for the slave agent (as client) to communicate back to the master (server) to notify that the job is done.
- The master is supposed to be stateless. It doesn't keep job status in memory, so there's no need for the slave to notify it.
- Every time the run status is queried, the master should consult the database. The master may cache the "done" case, since a done run stays done forever. The caching policy should explicitly mark "in progress" as a non-cacheable state.
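The cacheability rule above can be captured in one place; state names beyond "done" are illustrative assumptions:

```shell
# Decide whether a run status may be cached by the master.
# Terminal states never change, so they are safe to cache forever;
# "in progress" (or anything unknown) must always go back to the database.
is_cacheable() {
  case "$1" in
    done|failed|aborted) return 0 ;;  # terminal: cacheable
    *) return 1 ;;                    # "in progress" etc.: never cache
  esac
}
```

Keeping the rule in a single predicate makes it hard for a future cache layer to accidentally cache a non-terminal state.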
The slave agent can be an application baked into the slave image (an extension of a docker image as described above -- not the container the actual job runs in).
- Jenkins uses a different approach: it `scp`s the slave agent every single time a new job starts (to resolve the problem of legacy agent versions), overwriting the agent shared by all jobs on the slave. That's mostly because Jenkins slave machines (set up manually and staying persistent) suffer configuration drift. We have no need to do this, as our slaves (being docker containers) are disposable, and can be cleaned up every time we want to upgrade the slave agent version.
- This also saves the API master from needing to know `scp`/Apache MINA SSHD.
The slave is very likely to be implemented with RabbitMQ and Spring Cloud Stream (this article (originally about E2E testing) illustrates a good implementation). Slaves should grab/distribute tasks based on a combined concern of CPU and resource quota.
- A slave whose CPU is below some threshold (and which has not been notified for graceful shutdown) should grab new tasks.
- Cons:
- Potentially risky if a job uses significantly different amounts of resources in different stages. CPU may go up unexpectedly and eventually freeze that slave box/affect all jobs running on it.
- For example, a testing job which uses a single thread for initialization, then executes tests using multiple CPUs in parallel.
- Need to know the details of how docker manages/distributes resources among multiple containers running on the same box.
- Will use, and be tightly coupled to, backend infrastructure: the load balancer in the orchestration framework.
- A slave which has a certain amount of unused resource quota (and has not been notified for graceful shutdown) should grab new tasks.
- Cons:
- Resource quota is custom business logic, which means we need to implement the load-balancing logic in our master/slave agents: the master consults all slave agents about their remaining quota, and then chooses one of them.
- If users mis-configure a resource quota to be too small, a particular job may consume too many resources, and finally cause the slave machine to freeze/enter an unhealthy state.
- To mitigate these cons, a slave should only grab new tasks when both of the above conditions are met.
- The lowest needs to always be above some threshold, because otherwise the autoscaling mechanism finds it very hard to scale down, as no node is ever completely idle.
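The combined grab rule (CPU below a threshold AND enough unused quota AND no pending graceful shutdown) might look like this; the 80% threshold and integer quota units are illustrative assumptions:

```shell
CPU_THRESHOLD=80  # percent; illustrative value, would be configurable

# A slave should grab a new task only when ALL conditions hold:
# no graceful shutdown requested, current CPU under the threshold,
# and the task's quota fits into the slave's remaining quota.
should_grab_task() {
  cpu_pct=$1; free_quota=$2; task_quota=$3; shutting_down=$4
  [ "$shutting_down" = "no" ] &&
    [ "$cpu_pct" -lt "$CPU_THRESHOLD" ] &&
    [ "$free_quota" -ge "$task_quota" ]
}
```

The agent would evaluate this predicate before each attempt to pull from the message queue, so an overloaded or draining slave simply stops consuming.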
Slave auto-scaling:
- There should be multiple tasks running on the same (resource-rich) slave machine, rather than one machine per job.
- Pros:
- Various jobs running together on a single hosting machine can smooth CPU and other resource usage.
- Scale-up and scale-down can be less spiky (not sure if this can be achieved by a smart auto-scaling policy).
- Cons:
- Runs are not independent. A bad run may overload the machine and affect other jobs running on it (depending on how docker distributes hardware resources).
- The slave agent needs to be able to create job-specific docker containers, execute the job inside them, and clean up the containers afterwards. May consider using the Java Docker API Client.
- The slave agent should be able to manage multiple jobs executed together on the same machine. See "auto-scaling" below for reasons.
- Graceful shutdown (scale-down only happens when the agent has finished all jobs in hand) is a prerequisite for auto-scaling.
- Cannot use retrying/re-running a job as a workaround, as:
- The job (e.g. a deployment) may not be idempotent.
- Users are urgently waiting for the result.
- Need to send a notification to a slave machine for graceful shutdown, and only shut it down after it has finished all tasks.
- Not sure if that can be supported by existing infrastructure (which is mostly about killing a machine if anything may go wrong -- health check endpoint/...).
- Auto-scaling policy:
- CPU usage:
- May not be compatible with resource quota, as a job may have a high resource quota but use only a little CPU in some phases.
- May conflict with the decision of when a slave should grab new jobs.
- The length of the message queue (SQS + AWS supports that):
- When there's a queue (so the policy has useful input), that means everybody is already waiting to be kicked off.
- May cause some important (e.g. fire-fighting) jobs to not be kicked off immediately.
- May use a combination of the two.
- Not sure if that can be supported by existing infrastructure.
- Slave auto-scaling should be implemented (probably through Kubernetes).
For git repo jobs, the repo needs to be fetched twice, once by the master and once by the slave (or the master needs to send the content to the slave through RPC/scp). To minimize network overhead, the master fetches only the single config file (to learn the job name, server type, and other metadata), while the slave fetches the whole repo. The slave's git fetch should happen inside the docker container rather than on the host machine (so we know it is completely gone when the container is destroyed).
Note that while single-branch clone is supported by GitHub, single-commit fetch is supported by GitLab but not GitHub. This fact may limit the performance of our git operations.
Also, if we want to run a task based only on what changed in git, we'll need to fetch more than one commit, which is a more complicated task.
Results should be queried from API endpoints, for example /jobs/123/tasks/456/results/console or /jobs/123/tasks/456/results/junit or ... We may provide /jobs/123/results with a JSON return type to describe what kinds of results that particular job has. Detailed results can be returned in different IANA content types, e.g. JUnit reports as application/zip of a folder of XML files. We may use some data lake solution (e.g. MinIO (preferred), Hadoop HDFS, or AWS S3) to store those result files. Parsing/prettifying e.g. a JUnit report is entirely a frontend plugin concern.
Pipelines should be a client-side setup/a layer on top of the RESTful API layer, as they break the independence between jobs/RESTful endpoints. They can be implemented as:
- Batch operation
- A pipeline microservice on top of the job microservice, which defines its own endpoints and calls the job endpoints as dependencies.
- Pros:
- Can build a UI on top of the pipeline API through common sense.
- Cons:
- There are a lot of similarities between the two microservices, e.g. how to define input parameters.
- Simple code logic (passing input to downstream jobs) needs to be saved/implemented in SQL/RESTful calls. That introduces a lot of complexity.
- Users can define a job, and inside of the job `curl` dependent jobs' RESTful endpoints.
- Cons:
- Need to pass credentials into the container in which we `curl`.
- Users need to implement the status-check loop themselves inside the job logic, though we may provide a library for them to do so.
The command-line client shouldn't have a shared pipeline. Users can have their own shell script with multiple steps.
In case the pipeline definition is in code (not necessarily in the same repo as the endpoint jobs), consider having the pipeline layer git-fetch a single file from a single commit (GitHub does not support this; GitLab does) or from a single branch (GitHub supports this). Then each step/endpoint git-fetches separately.
git clone <source> -b <branch-name> --single-branch --depth 1
git init
git fetch --depth=1 <source>
git checkout -f <commit-sha>
Artifact management:
- JFrog Artifactory.
- Upload to/download from S3/Docker Hub/ECR/...
Pain points of existing CI/CD tools:
- UI/client tied together with job logic: the Jenkins UI is defined in the same place as the backend job logic. There's no easy way to have different forms of clients for different purposes.
- Whitelisting a wide range of IPs: to use CircleCI, you need to expose your entire cluster to a very wide list of CircleCI IPs.
- Hard to define jobs which are not PR-triggered in the all-in-one YAML config.
- Hard to define multiple (maybe not quite related) jobs all based on the same git repo.