Wednesday, October 02, 2019

Cloud Foundry on Kubernetes: missing features

I already laid out some thoughts on the issues we need to deal with when adding a Cloud Foundry (CF) layer on top of Kubernetes (K8s) in my previous blog post.

The CF on K8s topic sees a lot of interest, especially since the creation of two projects in the CF Foundation: Eirini, a thin layer that replaces the CF Diego scheduler with K8s, and Quarks, which provides CF components as containers.
Eirini provides a way to run CF apps on K8s by replacing Diego with the K8s scheduler. You might have guessed that this alone is not enough. As mentioned in the previous post, we also need the Cloud Controller (CC) API to successfully mimic CF.

Well, the simplest way would be to have the CC and all dependent components running in K8s pods. To achieve this, Eirini deploys a containerized CF - SUSE Cloud Foundry (SCF) - on K8s. SCF contains glue code to make everything work on K8s. Quarks aims to get rid of this glue and replace it with components that work natively on K8s, but until then SCF is the way to go.

Additionally, there is an Eirini pod running the glue code between the CF components and K8s. This includes translating CC LRPs (long-running processes) into K8s pods, handling logs and routes, etc.
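
To make this more concrete, here is a rough sketch of what such a translation could look like using the JavaScript K8s client (@kubernetes/client-node, shown with its classic positional-argument API); the LRP fields, namespace and StatefulSet layout are illustrative assumptions, not Eirini's actual code:

// Sketch only: turn a Cloud Controller LRP into a K8s StatefulSet.
const k8s = require('@kubernetes/client-node');

const kc = new k8s.KubeConfig();
kc.loadFromDefault();
const apps = kc.makeApiClient(k8s.AppsV1Api);

// Hypothetical LRP shape - real Eirini LRPs carry more metadata.
async function deployLrp(lrp) {
  const labels = { 'cf-app-guid': lrp.guid };
  await apps.createNamespacedStatefulSet('eirini', {
    metadata: { name: lrp.guid, labels },
    spec: {
      replicas: lrp.instances,
      serviceName: lrp.guid,
      selector: { matchLabels: labels },
      template: {
        metadata: { labels },
        spec: {
          containers: [{
            name: 'app',
            image: lrp.image,            // droplet-based or docker image
            env: lrp.env,                // [{ name, value }, ...]
            ports: [{ containerPort: 8080 }]
          }]
        }
      }
    }
  });
}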

Feature parity?

To get a clear picture of what needs to be done to reach feature parity between Diego and Eirini, we need to look at the differences between:
  • Garden and K8s container security
  • Diego and Eirini features
Let's discuss the gaps in these two groups.

Containers security

The Eirini and Garden teams have already assessed this, so the table below is based on their investigation:

[Table: Garden vs K8s container security comparison - see the remarks below]

Remarks:
#1: Possible with mutating webhooks
#2: https://github.com/kubernetes/enhancements/issues/127
#3: Application is restarted after reaching the limit. The limit is configured globally for every application.
#4: Fewer masked paths than Garden/Docker (e.g. /proc/scsi)
#5: Different implementation; not less secure
#6: Not used in Eirini, Garden or K8s; https://github.com/kubernetes/frakti
#7: AppArmor is used


Low-level details and comments are available in the Garden team story.

Diego vs Eirini features

Eirini CI runs the CF Acceptance Tests (CATS), so here is a feature comparison based on the CATS results and on the SAP CP feature set (Diego-based):

[Table: Diego vs Eirini feature comparison - see the remarks below]

Remarks:
#1: Not tested
#2: Performance and log-cache route as impact; SCF/Quarks issue?
#3: Old SCF version; scalability tests will be needed
#4: use CredHub or K8s secrets? SCF/Quarks issue?
#5: no v3 support in SCF 
#7: part of Eirini migration plans; cluster per tenant/org? What about CF flow?
#8: Bits-service as private registry or K8s re-configuration
#9: Not tested; Flaky tests
#10: Isolation segments dependency; clusters in K8s?
#11: Reuse K8s services
#12: eirinix project; mutating webhooks 
#13: Not tested. Waiting on API v3 in SCF?
#14: eirinix project; (persi-eirini)

There are several features that are not tested with Eirini, mainly due to flaky CATS.

Log Cache is a feature that adds performance improvements over existing functionality, so at least for initial adoption we can live without it. To add this feature we would need to add components to SCF.

Container Networking is supported in most CF offerings out there. It is currently being tested on SAP Cloud Platform. However, at scale we have seen scalability issues with it even in vanilla CF.

CredHub is currently not supported with Eirini. The reason for this is that K8s has built-in secrets support and can also rotate secrets.
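
For comparison, storing a credential as a K8s secret from Node.js is a single call against the API; a minimal sketch (the namespace and secret layout are made up, and the client call is shown in the classic @kubernetes/client-node style):

// Sketch: keep a service credential in a K8s secret instead of CredHub.
const k8s = require('@kubernetes/client-node');

const kc = new k8s.KubeConfig();
kc.loadFromDefault();
const core = kc.makeApiClient(k8s.CoreV1Api);

async function storeCredential(name, value) {
  await core.createNamespacedSecret('eirini', {
    metadata: { name },
    type: 'Opaque',
    data: { password: Buffer.from(value).toString('base64') }  // K8s expects base64
  });
}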

Eirini uses an SCF 2.16.4 fork that currently does not come with CF API v3 support, so Container Networking, Rolling Deployments, Security Groups and other v3 features cannot be used or tested.

Docker support is currently integrated in CC and Eirini. It is supported as long as staging of apps is done with Diego stagers.

Isolation Segments and the related routing are not supported. This feature is needed to migrate Diego-backed apps to Eirini, but it is not very valuable inside the K8s world, as we already have clusters for isolation.

Support for a private Docker registry can be achieved in two ways: either use the existing Bits-Service that is part of SCF (deprecated), or configure K8s itself to pull from a private registry. Currently neither is configurable via Eirini.
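
On the K8s side this boils down to a docker-registry pull secret referenced from the app pods; a sketch of the two objects involved (the names and registry URL are illustrative):

// Sketch: an image pull secret plus the pod spec fragment that references it.
const dockerConfig = {
  auths: {
    'registry.example.com': {
      auth: Buffer.from('user:password').toString('base64')
    }
  }
};

const pullSecret = {
  metadata: { name: 'private-registry' },
  type: 'kubernetes.io/dockerconfigjson',
  data: {
    '.dockerconfigjson': Buffer.from(JSON.stringify(dockerConfig)).toString('base64')
  }
};

// Once the secret exists, the app pods only need to reference it:
const podSpecFragment = {
  imagePullSecrets: [{ name: 'private-registry' }],
  containers: [{ name: 'app', image: 'registry.example.com/org/app:latest' }]
};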

The eirinix project brings SSH and Volume Services support to Eirini. It is based on mutating webhooks.
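
Conceptually a mutating webhook is just an HTTPS endpoint that receives an AdmissionReview for every new app pod and answers with a JSON patch. A bare-bones sketch of such a handler (illustrative only, not the eirinix implementation; in reality it must be served over TLS and registered via a MutatingWebhookConfiguration):

// Sketch: a minimal mutating admission webhook that injects an SSH sidecar.
const express = require('express');
const app = express();
app.use(express.json());

app.post('/mutate', (req, res) => {
  const review = req.body;  // AdmissionReview sent by the K8s API server
  const patch = [{
    op: 'add',
    path: '/spec/containers/-',
    value: { name: 'ssh-sidecar', image: 'example/ssh-proxy', ports: [{ containerPort: 2222 }] }
  }];
  res.json({
    apiVersion: 'admission.k8s.io/v1',
    kind: 'AdmissionReview',
    response: {
      uid: review.request.uid,
      allowed: true,
      patchType: 'JSONPatch',
      patch: Buffer.from(JSON.stringify(patch)).toString('base64')
    }
  });
});

app.listen(8443);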


Cloud Foundry on Kubernetes

As you probably know, Cloud Foundry (CF) is an opinionated "Open Source Cloud Application Platform" in the PaaS space. It works with existing / pre-allocated VMs. CF then uses the VMs to spawn containers in order to increase workload density and speed of creation.

"Containers" in the paragraph above should rang a bell and you're probably thinking about similarities with Kubernetes (K8s). If not you should start thinking about this now :) Because there are a lot of similarities. 

"Kubernetes (K8s) is an open-source system for automating deployment, scaling, and management of containerized applications" (#2). As I mentioned this is quite close to what CF does.

Of course there are lots of differences as well. While K8s can easily handle stateful workloads, CF refuses to support them, or at least does not make things easy for you. One consequence of this is that running your own DB or queue (Kafka, for instance) is much easier on K8s.

On the other side, CF offers buildpacks to lift the burden of building secure and up-to-date Docker images for your application, and it can take care of your app's health and scaling via a combination of the opinionated requirements it imposes (12 Factor App) and the services offered in the CF ecosystem.

So while the choice for services might be clear (VMs if you don't need scale, or a K8s cluster for elasticity), the application development space is in turmoil.

There are numerous frameworks for packaging, deploying, CI/CD, management and FaaS. You might have heard of or used some of these: Helm, Argo CD, Knative. Not only do these projects change scope and deliverables quite fast, but there is also a consolidation effort required from developers to make use of all of them. The dynamic is not quite like Node.js modules and frameworks with their twice-per-day releases, but you can still feel the disturbance in the force with every update of a "minor" K8s version.

While K8s can offer much, there are people who like the simplicity and the restrictions imposed by the 12 Factor approach of CF (including me).

So what would happen if we used the power of K8s and added CF as a layer on top of it? One would expect this to be a quick and easy task based on what we have discussed so far, but there are a number of things to consider.

Applications

Both CF and K8s run apps in containers. No big deal, right? Turns out there are different approaches to spawning containers. 

CF containerization goes back to VMware's VCAP (VMware Cloud Application Platform), which used a shell script and a bit of glue code in native C, several years before Docker was born. It is now the Garden project, which allows CF to create containers on different back-ends like Windows, Linux and runC. We'll talk about the last one in the next paragraph.

Docker (and K8s) redefined the container world for good. They offer not only isolation but also a standardized way to pack and run your code. To have the "run" part standardized, Docker extracted the runtime code from Docker and donated it to the Open Container Initiative (OCI) as the runC project.

You might already be thinking: "OK, would it be fine to just swap Garden for Docker? Garden already uses the runC backend by default in CF, so not a big deal, right?" Well, it will work. Actually it works, as proven by several attempts. Mostly.

The biggest issues however are: 
  • Garden adds more security rules and restrictions than the runC defaults. Some of these restrictions helped avoid CVEs reported for K8s (CVE-2019-5736)
  • Users lose buildpacks and need to add additional build steps to their CI/CD pipelines. There is an ongoing effort to have buildpacks "translate source code into OCI images" (#3)

Orchestration, APIs

CF uses Diego as its workload scheduler. It is driven by the CF API - the Cloud Controller. These two components define a feature set that users expect to find in every CF installation. We need to support as much of this feature set as possible on top of K8s to consider the "merge" successful.

Routing

Both CF and K8s are experimenting with Istio/Envoy as a way to handle load balancing, security/isolation and service discovery.

Scalability

Istio does not scale to handle the 250,000 app instances required by some of the leading CF providers.

K8s does not handle more than 5,000 nodes, and existing CF installations already have VM counts that exceed this limit (some close to 7,500 VMs).

So it's obvious that a single K8s cluster cannot be used to replace the biggest CF instances. Not that creating such a behemoth was a good idea in the first place.

We need to think about using multiple clusters, and this approach has additional advantages like better isolation, easier operations and potentially a better onboarding experience.

Services

The CF ecosystem offers the OSBAPI (Open Service Broker API) interface to abstract the interaction with external services (such as DBs, machine learning APIs, etc). This comes in quite handy, as CF apps should not care where services are running.
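
A broker is just an HTTP service implementing a handful of well-known endpoints; a minimal sketch of the catalog and provision calls (the service and plan names are made up):

// Sketch: a minimal Open Service Broker API (OSBAPI) broker with two endpoints.
const express = require('express');
const broker = express();
broker.use(express.json());

// Advertise the services this broker can provision.
broker.get('/v2/catalog', (req, res) => {
  res.json({
    services: [{
      id: 'example-db-service-id',
      name: 'example-db',
      description: 'Example database service',
      bindable: true,
      plans: [{ id: 'small-plan-id', name: 'small', description: '1 CPU, 1GB RAM' }]
    }]
  });
});

// Provision a service instance (e.g. create a DB somewhere else entirely).
broker.put('/v2/service_instances/:instance_id', (req, res) => {
  // ... create the backing resources here ...
  res.status(201).json({});
});

broker.listen(8080);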

Having services run in a separate K8s cluster is a nice idea, considering again the isolation, scalability and operational aspects.

While we can run all stateful services in K8s, there are several stateless services that are quite happy running in CF-managed mode. CF mode here might simply mean a K8s cluster overseen by CF components/pods, besides K8s itself.

Provisioning

CF is now provisioned using BOSH on VMs. I see no reason to keep this provisioning model, especially since 90% of the cloud providers already offer managed K8s clusters. 

It seems we are moving from tools that try to bridge the IaaS and PaaS worlds (CF on AWS, CF on Azure) to tools that work with managed K8s clusters (CF on GKE, CF on EKS). Or in other words, the IaaS layer is moving from VMs to K8s clusters / containers.


References:

Tuesday, April 30, 2019

Cloud Foundry Abacus v2

SAP has been contributing to CF Abacus since late 2015.

In the beginning Abacus was an IBM-only project. We saw potential in it, as SAP needed a solution to meter CF usage, and we decided to check it out. Georgi Sabev and I did the initial investigation and found that it could fit our needs.

Once we moved to use it in production, a 10-person team worked on Abacus to make it part of the SAP metering, analytics and billing pipeline.

Along the way we found several significant architectural issues that made us start planning version 2 of Abacus.

 
MongoDB & Deployment

To enable the use of Abacus in SAP we had to solve two problems: which database to use (Abacus used CouchDB) and how to install the plethora of applications.

We decided to support the more popular MongoDB and to create Concourse deployment pipelines.


DB size

However, we quickly discovered that by default Abacus is configured to log every change, much like financial systems do.

For every input document sent to Abacus, it stores several documents in its internal DBs:
  • original document
  • normalized document
  • collector document (from previous step)
  • doc with applied metering function
  • metered doc (from previous step)
  • doc with applied accumulator fn
  • accumulated doc (from previous step)
  • doc with applied aggregated fn

It turned out that we were doing map-reduce with an input:output ratio of 1:10. For a 1KB input doc we keep 10KB of data in the Abacus DBs, as we store every iteration.

As we used Abacus only for metering, to limit the impact we:
  • configured SAMPLING
  • implemented app-level sharding
  • added a housekeeper app that deletes data older than 3 months
  • dropped the pricing and rating functions and data from the documents

Sampling is a way to keep only one document from the reduce stages for a certain period of time. This increases the risk of losing data in case of an app or DB crash, but reduces the amount of data we keep.
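
In essence, sampling keys the stored document by the start of its time window rather than by every change, so writes within the same window overwrite each other; a simplified sketch (the real Abacus key layout is more involved):

// Sketch: collapse writes into one document per sampling window.
const SAMPLING = 86400000; // 1 day in milliseconds

function sampledDocId(orgId, resourceInstanceId, timestamp) {
  const windowStart = Math.floor(timestamp / SAMPLING) * SAMPLING;
  // Every write within the same window maps to the same id and replaces
  // the previous iteration instead of adding a new document.
  return `k/${orgId}/${resourceInstanceId}/t/${windowStart}`;
}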

Back then, sharding was not available for our MongoDB service instances, so the maximum amount of data we could store was 800GB. This meant an input of 80GB could get us in trouble. That's why we implemented an additional layer that allowed us to round-robin the Abacus data across normal MongoDB instances, until we got a MongoDB offering that supports data in the TB range.
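
The app-level sharding was essentially a deterministic mapping from a document's partition key to one of the bound MongoDB service instances; a simplified sketch using hashing (the environment variable names are illustrative):

// Sketch: spread Abacus data across several "normal" MongoDB instances.
const crypto = require('crypto');

const dbUris = [process.env.DB_URI_0, process.env.DB_URI_1, process.env.DB_URI_2];

function dbForKey(partitionKey) {
  const hash = crypto.createHash('md5').update(partitionKey).digest();
  return dbUris[hash.readUInt32BE(0) % dbUris.length];
}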

The housekeeper was also needed to comply with legal requirements, so we went ahead and implemented it, which also reduced the size of the DB.

One of the last things we did was to drop the redundant data. We left it for last because dropping it also changed the reporting API output, so we considered it a breaking change.


Scalability

Once we were confident that Abacus worked fine, we concentrated on providing a smooth update experience.

The micro-service stages of Abacus are tightly bound together and they keep state. This means they cannot be scaled independently or updated via a normal rolling or blue-green deployment.

To solve this problem we decided to redesign the map stages of the Abacus pipeline - the collector and meter applications - and turn them into real, stateless microservices. To externalize the state we put RabbitMQ between them.

This buffer allowed us to handle spikes in input documents that would otherwise overload the Abacus pipeline, by distributing the load over time.
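
With amqplib this boils down to the collector publishing to a durable queue and the meter consuming at its own pace; a condensed sketch (the queue name and connection handling are simplified):

// Sketch: decouple the collector and meter stages with a durable RabbitMQ queue.
const amqp = require('amqplib');

const QUEUE = 'abacus-usage';  // illustrative queue name

async function createChannel() {
  const connection = await amqp.connect(process.env.RABBIT_URI);
  const channel = await connection.createChannel();
  await channel.assertQueue(QUEUE, { durable: true });
  return channel;
}

// Collector side: accept the document, enqueue it, respond immediately.
function publishUsage(channel, usageDoc) {
  channel.sendToQueue(QUEUE, Buffer.from(JSON.stringify(usageDoc)), { persistent: true });
}

// Meter side: consume at its own pace, ack only after successful processing.
async function consumeUsage(channel, processDoc) {
  channel.prefetch(10);  // limit in-flight documents during spikes
  await channel.consume(QUEUE, async (msg) => {
    await processDoc(JSON.parse(msg.content.toString()));
    channel.ack(msg);
  });
}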

The maximum speed we achieved in real life was ~10 documents per second with ~100 applications. Again a ratio of 1:10. As the ratio hints, this is due to the DB design, which requires per-organization locking in the reduce stages and causes DB contention, as we write to the same document.


Zero-downtime update

Scalability and updates are tightly coupled. Once we added the buffer in front of the reduce stages, we could separate Abacus into two types of components:
  • user-facing: collector, RabbitMQ, meter, plugins, reporting
  • internal: accumulator, aggregator
The user-facing components had to be highly available and always on. That was easy, as they were now real micro-services.

The internal components that were problematic before are now isolated behind the buffer, and we can update them with downtime, as users will not notice their absence.


Custom functions

At this point we had a working system and could add more users.

Abacus allows great flexibility in defining the metering, accumulation and aggregation functions, so we thought this was going to be easy for our users.

And here the devil was in the details. This was not easy. By all accounts it was a terrible experience for them. An Abacus user has to:
  • have basic JavaScript knowledge
  • understand what all of the functions do
  • understand how data flows between the functions
  • understand the time-window concepts
  • try the functions out on a real system
We tried to solve some of the issues above with a plan-tester tool for trying out plans locally. It helped visualize the flow and get the functions to output what you wanted.
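
For illustration, the functions a plan author had to provide looked roughly like this (heavily simplified: in a real Abacus plan they are serialized as strings inside a JSON plan document and take additional time-window arguments):

// Sketch of a metering plan - illustrative field names and simplified signatures.
const meteringPlan = {
  plan_id: 'example-object-storage',
  measures: [{ name: 'storage', unit: 'BYTE' }],
  metrics: [{
    name: 'storage_gb',
    unit: 'GIGABYTE',
    // meter: convert the raw measure into the metered unit
    meter: (m) => m.storage / (1024 * 1024 * 1024),
    // accumulate: keep the maximum seen within the current time window
    accumulate: (a, qty) => Math.max(a || 0, qty),
    // aggregate: roll accumulated quantities up to space / org level
    aggregate: (a, prev, curr) => (a || 0) + curr - prev
  }]
};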

However, there were more problems with JavaScript functions evaluated at runtime:
  • Node.js was never meant to support this
  • slow execution (close to a second per call)
  • no clean security model
So although custom functions looked like a good idea, we came to the conclusion that they cannot be used, at least not by untrusted parties, with Node.js / V8.

An example of perfectly "working" code in a function is the allocation of a big array. This passes all the security checks we added, but causes the Abacus application to crash repeatedly, as we retry the calculation with an exponential backoff.
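
For example, something as innocent-looking as this passes validation but can take an application instance down:

// Sketch: a "valid" plan function that exhausts the V8 heap at runtime.
const meter = (m) => {
  const big = new Array(100000000).fill(0);  // allocates close to a gigabyte of memory
  return big.length + m.storage;
};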


Reporting

The reporting endpoint of Abacus is supposed to return already-computed data. With the introduction of time-based functions this contract was broken: we need to calculate the metrics at the exact time the report is requested and to call the summarize function on multiple entities in the report.

For an organization with several thousand spaces, each with several thousand apps, this can take up to 5 minutes, while the usual request-response cycle is hard-limited to 2 minutes (AWS, GCP, ...).

Besides that, Node.js apps do not cope well with CPU-intensive work, so we had to interleave these calculations to allow the reporting app to keep responding.
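
Interleaving here simply means chunking the summarize loop and yielding back to the event loop between chunks, roughly like this (a sketch, not the actual reporting code):

// Sketch: yield to the event loop between chunks of CPU-heavy summarization.
async function summarizeAll(entities, summarize) {
  const results = [];
  const CHUNK = 100;
  for (let i = 0; i < entities.length; i += CHUNK) {
    for (const entity of entities.slice(i, i + CHUNK))
      results.push(summarize(entity));
    // Give the event loop a chance to serve other requests (health checks, etc).
    await new Promise((resolve) => setImmediate(resolve));
  }
  return results;
}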

The generated report, on the other hand, can become enormous: up to 2GB. To customize and reduce the report, Abacus uses GraphQL. However, the GraphQL schema is applied only after all the data is fetched and processed, because the DB schema is different from the report schema.

Contributions

SAP has contributed lots of code trying to stabilize Abacus.

The LOC statistics below show the introduction of the RabbitMQ buffer and the removal of CouchDB support, rating and pricing between 2017 and 2019:

[Chart: Lines of code]
The SAP commits show that we moved from small changes in 2015-2016 to bigger chunks of work (buffer, yarn, removal of babel).

[Chart: Commits]
The chart above hints that, instead of gradually adding functionality to Abacus, we were trying to solve bigger problems.

For most of the issues we had, we ended up blaming several architectural concepts that did not fit our use case and requirements. We managed to fix or work around some of the most pressing ones, such as updates and scalability.


Architecture problems

However, these are some of the high-level problems we still face with the current (May 2019) Abacus codebase:
  • reduce stages of Abacus are tightly bound together, keep state and can't be scaled independently; handling rollbacks when errors occur is complex
  • DoS vulnerability: the API allows metering plans to be defined as Node.js code
  • the Query API (GraphQL) is limited to the service level and does not support the service instance level
  • open source community: IBM no longer contributes to Abacus, SAP remains the only contributor
  • the API is tailored to CF (org/space) and does not fit the SAP CP domain model (global account/subaccount)
More technical issues include (in no particular order):
  • Components communicate via HTTPS, going through the standard CF access flow (Load Balancer, HA, GoRouter) which is sub-optimal and can lead to excess network traffic
  • The MongoDB storage used cannot scale horizontally, leading to manual partitioning with multiple service instances (data is not balanced over MongoDB instances)
  • Abacus service discovery model for reduce stages is orthogonal to CF deployment model, resulting in slow updates and development cycles
  • Users can send and store arbitrary data in Abacus (with various size and structure), making it hard to commercialize
  • Users have a hard time understanding, designing, and getting plan functions to work correctly
  • Reports can take forever to construct and can lead to OOM
  • If users abuse the metering model, their aggregation documents can surpass the MongoDB limit of 16 MB leading to failures
  • the accumulator and aggregator apps rely on being singletons. Should there at any point be two instances (which can happen because of network issues), the accumulated data can become incorrect

Abacus v2

All of the architectural issues above, and the lack of other pressing issues, lead us to think that we have to start a v2 implementation.

Momchil Atanasov started redesigning Abacus and creating new concepts for this version. Some of them:
  • Custom Metering Language (instead of JS functions)
  • Hierarchical Accumulation - allows monthly accumulation to be built from daily accumulations
  • Usage Sampler - replaces time-based functions with usage sampling

Momchil created two prototypes that validated these ideas. Using PostgreSQL he managed to reach ~800 docs/sec, and using Cassandra he got up to 4000 docs/sec.

The project, however, was put on hold and replaced with internal SAP solutions in order to, among other things:
  • reduce TCO
  • reuse existing software that already does metering
  • support the SAP CP domain model
  • meter CF, K8s and custom infrastructure

Cloud Foundry router logs format

If you need to parse Cloud Foundry router logs like this:

2018-06-12T16:04:00.92+0300 [RTR/5] OUT abacus-usage-collector-1.cf.com - [2018-06-12T13:04:00.719+0000] "GET /healthcheck HTTP/1.1" 200 0 16 "-" "-" "-" "10.0.73.146:61163" x_forwarded_for:"-" x_forwarded_proto:"https" vcap_request_id:"d7aa3a8f-ff20-48c4-4a20-cb7f3f1de808" response_time:0.20847815 app_id:"71321d59-bd06-41d6-becd-7e702a6aa59b" app_index:"0" x_b3_traceid:"b7bb54f29b40de64" x_b3_spanid:"b7bb54f29b40de64" x_b3_parentspanid:"-"

Perhaps this definition from a deprecated parser on GitHub will help:
{ 'msg' => '%{HOSTNAME:hostname} - \[(?<time>%{MONTHDAY}/%{MONTHNUM}/%{YEAR}:%{TIME} %{INT})\] \"%{WORD:verb} %{URIPATHPARAM:path} %{PROG:http_spec}\" %{BASE10NUM:status:int} %{BASE10NUM:request_bytes_received:int} %{BASE10NUM:body_bytes_sent:int} \"%{GREEDYDATA:referer}\" \"%{GREEDYDATA:http_user_agent}\" %{HOSTPORT} x_forwarded_for:\"%{GREEDYDATA:x_forwarded_for}\" x_forwarded_proto:\"%{GREEDYDATA:x_forwarded_proto}\" vcap_request_id:%{NOTSPACE:vcap_request_id} response_time:%{NUMBER:response_time:float} app_id:%{NOTSPACE}' }
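
If you are parsing these lines in Node.js instead of Logstash, a regular expression matched against the sample above may be a simpler starting point (a sketch that only covers the leading fields and the trailing key:value pairs):

// Sketch: parse the access-log portion of a gorouter line (everything after "[RTR/x] OUT").
const RTR_LINE = /^(?<host>\S+) - \[(?<time>[^\]]+)\] "(?<verb>\S+) (?<path>\S+) (?<proto>[^"]+)" (?<status>\d+) (?<bytes_received>\d+) (?<bytes_sent>\d+)/;

function parseRouterLog(line) {
  const accessLog = line.includes(' OUT ') ? line.split(' OUT ')[1] : line;
  const match = RTR_LINE.exec(accessLog);
  if (!match) return null;
  const fields = { ...match.groups };
  // Collect the trailing key:"value" pairs (x_forwarded_for, response_time, app_id, ...).
  for (const [, key, value] of accessLog.matchAll(/([a-z_][a-z0-9_]*):"?([^"\s]*)"?/g))
    fields[key] = value;
  return fields;
}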

id_rsa.pub: invalid format, error in libcrypto

After I upgraded my Linux and got Python 3.10 by default, it turned out that Ansible 2.9 will no longer run and is unsupported together with...