Distributed systems, Microservices architecture lecture

These are my notes for the 2-day training about microservice architectures given by Sam Newman. It consisted on a selection of patterns to build a microservice based system and some others to migrate from a monolithic application. The list of patterns include their pro/cons, domain of applicability and technologies that implement them.

Definition of microservice architecture

The microservice architectural style is an approach to developing a single application as a suite of small services, each running in its own process and communicating with lightweight mechanisms, often an HTTP resource API. These services are built around business capabilities and independently deployable by fully automated deployment machinery.

My feeling about the training

The trainer had excellent presentation skills and a broad knowledge derived from extensive consulting experience. Although roughly 40% of the content was obvious, 60% did provide some valid insights. It was mostly a lecture listing patterns, no hands-on training. It would have been nice to have some working code to bring back home …

Why microservices ?

Theoritically microservices architectures enable :

Faster time to market : 1 team owns the service (less coordination), less risk per change
Availability : services should be designed to mitigate the impact of unreachable dependencies
Scalability and resiliency : scale horizontally only the services that are in your hotpath

On the other side you will struggle with :

Testability : the number of dependencies needed to perform integration testing explode
Monitoring and traceability : many logs, complex service call graphs
Automation : service discoverability, deployment
Consistency : transactions and normalized data models are not available

The layer architecture (UI, business logic, persistence) favors competence based teams (middleware, db..). This introduces change friction because each minor change has to be coordinated amongst several teams. Some studies suggest a strong correlation between shared ownership and defects.

Domain driven design (DDD)

When designing services you struggle between :

High cohesion : logic is not duplicated across many services
Loose coupling : services can evolve independently

Domain driven design is a method to split the system into micro services. Because business domains change less often than the technology stack => interface stability.

Domain models should be confined to an explicit scope. It is not cost effective to define a model to the company scale. Bounded contexts follow this idea by controlling the semantics of a given object across different parts of the organisation. Example customer model : On a retail site, a customer has different attributes for search, delivery, payment services. If you build a service to handle all possible attributes of customer then you introduce coupling between search, delivery, payment services.

DDD methodology : Event storming

The event storming method is decomposes a system into bounded contexts. Some guidelines :

All stakeholders should be invited, not only developers
You can focus on domain events or capabilities
- event : client chose item to purchase
- capability : present to client a list of purchasable items
Keep the level of detail to what can be understood by a non-expert
Use the same vocabulary as your customer and be clear on semantics for each (term, context) tuple

Monolith migration

The most important metric is the number of services on your system. It correlates to the investement needed in infrastructure to ease the pain of managing lots of independent services. An interesting example is gilt. By plotting the number of deployed services vs headcount, you notice it took several years for this company to reduce the cost of adding new sevices.

Table wrapper anti-pattern : one way to decouple a piece of the database is to create that acts as a proxy to the table. However the coupling introduced by the data schema is not solved by this pattern. The service proxying the database should expose methods to transition the data between valid states.

How to identify code to migrate ?

You need to balance the ease of migration with the rate of change of the sub-system. There is no profit (other than experience) in migrating a sub-system that never changes. These tools can provide some insights into code dependency, tech debt and even social metrics !

code scene : datamines your github to output really interesting stuff
ndepends : like sonarqube but maybe more advanced ?

Patterns for monolith migration

strangler (fairly obvious…)
data sharding (no magic here either …)

Monoliths for greenfields

At the end of the day unless you have the right automation and experience, the monolith is always the fatest and cheaper way to start building value for your organisation. Many fancy companies started like that (etsy).

Better have a monolith with strong boundaries in terms of code packaging and database schemas that can be evolved later.

Service discovery

How do you enable communication between servers ? Communication entails more than just knowing the endpoint and sending JSON/HTTP.

Dynamic service discovery (and health pings)
Call tracing
Connection pooling
Circuit breakers
Fancy routing : staged rollouts, load balancing

You may distribute client libraries for each service, but it starts to get complicated on a polyglot environment.

Service mesh (or side car pattern)

Service mesh is middleware to provide the advanced service discovery features above. Recommended implementations : linkerd, istio.

Choreography Vs Orchestration

Collaboration styles are related to messaging patterns

Event based (publish/subscribe) asynchronous communication for choreography
Request/response synchronous communication for orchestration

Orchestration	Choreogaphy
self documenting business logic	code does not provide a complete picture of the business process
centralized error management	how can we even know the process actually finished ?
orchestrator has to know about all subcomponents, it will be involved in most of the changes	the only coupling between subcomponents is the event to listen to
no need for middleware	message broker/queue needed to distribute (persist?) the events

The biggest shortcoming of the choreography collaboration is that you will always need an orchestrator to check all subcomponents finished correctly. You need to limit the coupling introduced by this reconciliation orchestrator by only listening to the end events pushed by each subcomponent.

To avoid shared state we can use the data on the outside pattern.

Retry policy

On monoliths, most of the calls are local so the failure behaviour is binary, either it works or not. On the contrary calls crossing microservices boundaries may have transient failures. However the answer is not always to retry, on a complex call flow this with nested retry attempts can produce very high latency.

Distinguish between client and servers errors : if the server replies the request is invalid, retrying will not help.
You need a queue if you want to retry asynchronous flows

Slow failure

This is on of the worst types of failures because it can block all of its callers. Indeed, if threads are used by the callers to make service calls, a slow responding (or timing out) service will hog all resources.

Use the bulkhead pattern to partition your resources based on who you are calling
Use a self-healing circuit breaker : it lets a fraction of the trafic reach the faulty service to detect when it is back online
Set the right timeouts : use the 90th percentile round trip time to have an idea on the order of magnitude
Try to provide a graceful degraded mode if service is unavailable

If possible you can spool retry operations to be dequeued later or put a queue in the middle so that clients can enqueue and return independently of whether the service is online.

Tracing, performance/business KPI, monitoring, alerting

To track complex call graphs you need to include a correlation id on each message. You cannot always delegate this to your service mesh for asynchronous communication. To avoid pain all log records should be consistent independently of the language used. Include in your set of KPIs both technical and business metrics.

Do not only build alerts for errors. Define a model to define normal operation and use alerting to warn you when the system is not working normally. Example : a KPI to indicate normal behavior may be the revenue generated per hour.

Some useful tools :

Security

Microservices are a double edged sword when it comes to security.

Increased surface of attack
- Communication between services lends itself to man-in-the-middle or impersonation
- Polyglot environments have a higher risk of an unpatched vulnerability
Defence in depth by protecting each service individually an attacker needs to compromise more hosts.
- Codespaces horror story : define multiple security domains.

Attack trees : a methodology for threat modelling

The roots of the trees are the goals an attacker may have
Assign a profit to each goal
The child nodes are the different methods to reach that goal
- Iterate by attaching to the method nodes the actions needed
Annotate each leaf node with the cost for the attacker
- Propagate the costs to parent nodes by choosing the cheaper child

Confused deputy attack

You need to enforce permissions based on both user and service. Otherwise a service that is permissioned to access sensitive information may be used by an attacker. If the attacker can request data for other users without a strong authentication proof, the data is not secure.

Random bits of interesting stuff

morning paper : a selection + summary of CS related papers
pat helland writings : data on the outside
orleans: an alternative for akka in .net (maintened by microsoft and support on azure), soon to have a version 2.0
consul : a service discovery packed with features
- DNS interface to have simple service discovery without any consul specific integration
- Health checks to keep the list of services consistent
- Datacenter and network topology (by tomography) aware
- Also acts as a consistent key-value store
- Templates : a text file with references to values stored in consul. The references will by resolved to the latest value automatically.
sock shop : a demo application that can be deployed to aws, kuberneter, mesos, swarm to compare the different PaaS
weave scope : yet another monitoring tool showing the runtime topology of your application and infrasctructure