At Thirdbridge, we believe that project-oriented teams deliver superior quality results, and do so more quickly. Given that they are responsible for the entire value creation flow, these teams can increase their velocity by eliminating bottlenecks themselves. Moreover, entrusting end-to-end flow responsibility to our developer teams makes their work even more engaging and motivating.
In practice, however, we've faced real challenges in breaking down the many silos within the organization. Not surprisingly, the biggest one has been making teams autonomous with respect to their infrastructure, particularly teams that must manage a Kubernetes cluster.
A Bit of History
Long ago, when Thirdbridge consisted of just a few programmers fresh out of school, our infrastructure was rudimentary. All our projects were hosted on the same server, and deployments were done manually: connecting via SSH, syncing the git history, then restarting the process. Very quickly, we realized that if we wanted the company to grow, we had to get more serious about our infrastructure. Based on the various challenges we faced at the time, we began looking for a technology that could help us level up.
- Automatic Horizontal Scaling: Some of our clients experienced highly variable traffic loads, and we sought an elastic solution to automatically scale our systems.
- Manual Horizontal Scaling: Automatic scaling works well in most cases when traffic increases steadily. However, some of our clients ran significant advertising campaigns with mobile notifications, causing almost instantaneous traffic spikes. For these scenarios, the chosen technology needed to allow us to manually scale the system in preparation for these events.
- Progressive Deployment: Manually stopping and restarting systems always caused a small service interruption. For small projects, these interruptions weren't really problematic. However, for our larger clients, we wanted technology that would allow us to deploy new versions without any interruption.
- Continuous Deployment: Manually deploying new versions wasn't a scalable approach. We were looking for an automatable solution with a few lines of configuration in our GitLab pipeline.
- Autonomous System: We were still traumatized by a bug that caused an infinite loop in one of our systems. Since the project was written in JavaScript (single-threaded), the infinite loop paralyzed the entire project for a few hours. We were therefore looking for a technology that could continuously monitor our various processes and restart them in case of a major problem.
- Docker Support: We were beginning to realize the potential of containerization and were looking for a Docker-compatible solution.
After studying the technologies available at the time, we settled on Kubernetes.
Back to the Present
Many years have passed since we made this choice. Overall, we consider it a very good decision! Kubernetes has allowed us to solve all the technical challenges we encountered, and we are aware that we have only scratched the surface of the features offered by this tool.
However, as Uncle Ben told Peter Parker in a deep discussion about elastic systems: "With great power comes great complexity."
Now that Thirdbridge is approaching 50 employees, the company is divided into several teams, each responsible for one or more projects. As mentioned initially, we want these teams to be as autonomous as possible, and for developers to be maximally empowered.
Over time, we realized that the steep learning curve of Kubernetes made it difficult to adopt this technology within multiple teams. The result was predictable and disappointing. The few people with advanced Kubernetes expertise quickly became bottlenecks. Teams lacked the confidence to innovate and improve their infrastructure.
Faced with this situation, we made the decision to give AWS ECS a chance. It was not an easy decision but rather the result of several compromises.
Simplicity vs. Functionality
A common compromise in the web development world (and probably in many other domains) is reducing the scope of features in exchange for a better experience.
For example, Vercel offers a better experience to developers than AWS for deploying certain modern web frameworks like NextJS or SvelteKit. The same paradigm applies to serverless functions in JavaScript. By accepting some initial constraints, Cloudflare Workers offer a better experience and more advantageous pricing than AWS Lambda.
The first question we had to answer was whether the features offered by ECS covered the needs listed above, as well as the new needs we had added over time (e.g., advanced observability, CloudFormation support, private DNS resolution, etc.).
The answer was positive. We were able to replicate the majority of the features present in our existing Kubernetes clusters.
The reduction in complexity was significant!
No need to run a Fluentd DaemonSet for logging; we can now simply use FireLens.
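As a sketch, here is roughly what routing a container's logs through FireLens can look like in an ECS task definition. The names, image, and log group below are illustrative, not our actual configuration, and the IAM execution/task roles are omitted for brevity:

```yaml
# Hypothetical excerpt from a CloudFormation template.
TaskDefinition:
  Type: AWS::ECS::TaskDefinition
  Properties:
    Family: api
    RequiresCompatibilities: [FARGATE]
    NetworkMode: awsvpc
    Cpu: "256"
    Memory: "512"
    ContainerDefinitions:
      # FireLens sidecar that collects and routes the app's logs
      - Name: log-router
        Image: public.ecr.aws/aws-observability/aws-for-fluent-bit:stable
        Essential: true
        FirelensConfiguration:
          Type: fluentbit
      # Application container: its stdout/stderr go through FireLens
      - Name: app
        Image: my-registry/api:latest
        Essential: true
        LogConfiguration:
          LogDriver: awsfirelens
          Options:
            Name: cloudwatch_logs
            region: us-east-1
            log_group_name: /ecs/api
            log_stream_prefix: api-
            auto_create_group: "true"
```

The `Options` map is passed straight to the Fluent Bit `cloudwatch_logs` output plugin, so the same mechanism can route logs to other destinations by swapping the plugin name and options.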
No need to set up a Horizontal Pod Autoscaler, a Metrics Server, and a Cluster Autoscaler; we can simply use CloudWatch alarms to set up automatic horizontal scaling.
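For instance, a target-tracking policy (which creates the underlying CloudWatch alarms for you) can replace that whole Kubernetes stack. This is a minimal sketch; the cluster and service names are illustrative:

```yaml
# Hypothetical excerpt from a CloudFormation template.
ScalableTarget:
  Type: AWS::ApplicationAutoScaling::ScalableTarget
  Properties:
    ServiceNamespace: ecs
    ResourceId: service/my-cluster/my-service
    ScalableDimension: ecs:service:DesiredCount
    MinCapacity: 2      # can be raised manually ahead of a marketing push
    MaxCapacity: 10

ScalingPolicy:
  Type: AWS::ApplicationAutoScaling::ScalingPolicy
  Properties:
    PolicyName: cpu-target-tracking
    PolicyType: TargetTrackingScaling
    ScalingTargetId: !Ref ScalableTarget
    TargetTrackingScalingPolicyConfiguration:
      PredefinedMetricSpecification:
        PredefinedMetricType: ECSServiceAverageCPUUtilization
      TargetValue: 60
      ScaleOutCooldown: 60
      ScaleInCooldown: 120
```

Bumping `MinCapacity` before a notification campaign also covers the manual-scaling requirement from our original list.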
That being said, we're not naive. We're aware that Kubernetes offers more features than ECS. However, for our use cases, these additional features are unnecessary and introduce too much complexity.
AWS Dependency
The second concern we had was vendor lock-in.
Kubernetes is supported by most major cloud players and is also available through several open-source products like k3s. In contrast, ECS is a closed, proprietary technology owned by AWS. But even worse, the fact that ECS is a proprietary technology is just the tip of the vendor lock-in iceberg.
The subtler part is that ECS relies on several other AWS services to function: for example, AWS Secrets Manager for secret management, AWS Cloud Map for internal DNS resolution, or AWS EventBridge for creating periodic tasks.
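To give one concrete illustration of this coupling, injecting a secret into a container ties the task definition directly to Secrets Manager (and requires the task's execution role to be allowed to read that secret). A hedged sketch, with illustrative names:

```yaml
# Hypothetical excerpt from a CloudFormation template.
DatabaseSecret:
  Type: AWS::SecretsManager::Secret
  Properties:
    Name: prod/api/database-url

# Inside the task definition's ContainerDefinitions:
#   - Name: app
#     Image: my-registry/api:latest
#     Secrets:
#       - Name: DATABASE_URL          # exposed as an environment variable
#         ValueFrom: !Ref DatabaseSecret  # Ref returns the secret's ARN
```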
Even for a system with a relatively low level of complexity, half a dozen AWS services may be needed to replicate some basic Kubernetes features.
However, once this reality is accepted, there are certain benefits. At Thirdbridge, we are strong proponents of infrastructure as code. My colleagues and I believe that every serious project should use this approach to define its infrastructure.
One difficulty we had with Kubernetes is that you need to know two syntaxes to define a system. All Kubernetes-related components (e.g., Deployments, CronJobs, DaemonSets, etc.) are written in Kubernetes definition files, while other resources (a CloudFront CDN, a Route 53 record, an S3 bucket) are defined in CloudFormation files.
By migrating to ECS, we can define our infrastructures using exclusively CloudFormation. It's a small detail, but it greatly simplifies things. We hope that by lowering the complexity barrier, junior developers will be able to more easily familiarize themselves with infrastructure as code.
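Concretely, the resources that used to live in two different files and syntaxes can now sit side by side in one template. A rough sketch (names and subnets are placeholders, and the referenced task definition would be declared in the same file):

```yaml
# Hypothetical excerpt from a single CloudFormation template.
Resources:
  # Previously a CloudFormation-only concern
  AssetsBucket:
    Type: AWS::S3::Bucket

  # Previously a Kubernetes Deployment + Service
  ApiService:
    Type: AWS::ECS::Service
    Properties:
      Cluster: my-cluster
      TaskDefinition: !Ref ApiTaskDefinition  # declared elsewhere in this file
      DesiredCount: 2
      LaunchType: FARGATE
      NetworkConfiguration:
        AwsvpcConfiguration:
          Subnets: [subnet-aaa, subnet-bbb]
```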
Serverless Adoption
Although Fargate is also available for EKS, we never used this approach. Our reasoning was that running administrative Pods (e.g., CoreDNS, cert-manager, etc.) in dedicated Fargate containers was unnecessary. Since these Pods usually consume few resources, it makes sense to co-locate them on the same EC2 instance.
Now that we have migrated to ECS, using Fargate feels more natural. Paying a small premium to no longer manage EC2 instances seems like a good compromise. It's true that, most of the time, no interaction with the EC2 instances was required; but when a problem did arise, it was usually complex and difficult to resolve.
Moreover, even though we now use Lambda functions regularly for certain types of tasks, we believe that a long-running process has significant advantages for more traditional REST APIs where the majority of execution time is IO-bound anyway.
For now, we consider Fargate to be the serverless abstraction that makes the most sense for several of our systems.
In Conclusion
In all transparency, this change saddens me a little. I love Kubernetes and have invested a considerable amount of time in learning this technology.
However, my feelings are not relevant to the smooth operation of Thirdbridge. What matters is laying the foundations to build superior quality software solutions.
We believe that by reducing infrastructure complexity while offering a wide range of features, ECS will enable teams to more easily take ownership of their project infrastructures, thus fostering innovation and increasing velocity. So for now, ECS will be our default technology for our infrastructures.
However, we remain aware of the limitations and compromises associated with this choice. We will also closely monitor the progress of Cloudflare Workers, whose execution and pricing model is better suited to classic REST APIs than AWS Lambda's. Now that their JavaScript execution environment is approaching parity with Node.js, this technology could be an interesting choice in the future.