Sunday, November 29, 2015

Data consistency in distributed systems: From ACID to BASE

Brewer's CAP conjecture, later proven by Gilbert and Lynch, established that it is impossible for a distributed system to simultaneously provide consistency, availability and partition tolerance. This has led to the design of distributed systems that provide weaker consistency guarantees while remaining highly available even with node and communication failures - BASE (Basically Available, Soft state, Eventually consistent).

In traditional databases, the two-phase commit protocol guarantees consistency: an update is either committed to all of the N nodes or to none at all. Waiting for all N nodes to respond increases the latency of the operation and also hurts availability, since the update cannot proceed if even one of the N nodes fails to respond.

In eventually consistent systems like Amazon's Dynamo, an update is considered complete once it has been written to a subset of the nodes. It is eventually propagated to all the nodes, so a subsequent read against a node may return stale data if the update has not yet reached that node. As a result, these systems can hold different "versions" of the same data and must be able to reconcile them. Dynamo uses vector clocks to order the versions when a causal order exists - the latest version wins - and otherwise relies on the client to reconcile the conflicting versions.

It is possible to avoid conflicts entirely by using consensus algorithms like Paxos or Raft, which ensure that only one entity can perform an update. The general consensus problem is about reaching agreement across a set of distributed processes that can fail due to faults in the infrastructure (network or node failures) or for malicious reasons (Byzantine failures). Most practical implementations of these algorithms adopt a leader/master based approach where all writes go to the master and are then propagated to replicas, as in MongoDB. When the master fails, a leader election process is initiated that elects a new master while demoting the failed one.

MongoDB supports various consistency vs. latency tradeoffs by allowing you to specify different write concerns: unacknowledged, journaled (written to the journal on the master), or replica acknowledged (written to one or more replicas in addition to the master). Jepsen (a tool that tests how distributed systems behave under network partitions) has shown that, even with the write concern set to acknowledge writes on a majority of the replicas in a cluster, MongoDB can still end up with missing writes!
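
As a rough illustration of the version reconciliation step, here is a minimal Python sketch - my own illustration, not Dynamo's actual implementation - of comparing two vector-clocked versions: if one clock descends from the other, the versions are causally ordered and the latest one wins; otherwise they are concurrent and a client-supplied reconcile function must merge them.

# Minimal sketch of vector-clock comparison for version reconciliation.
# Illustrative only; node names and the reconcile callback are hypothetical.

def descends(a, b):
    """True if vector clock a includes every event in b (a >= b)."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def resolve(version_a, version_b, reconcile):
    """Return the surviving version, or ask the client to reconcile."""
    clock_a, value_a = version_a
    clock_b, value_b = version_b
    if descends(clock_a, clock_b):
        return version_a          # a causally follows b: latest wins
    if descends(clock_b, clock_a):
        return version_b          # b causally follows a: latest wins
    # Concurrent updates: no causal order, so the client must merge them.
    merged_clock = {n: max(clock_a.get(n, 0), clock_b.get(n, 0))
                    for n in set(clock_a) | set(clock_b)}
    return merged_clock, reconcile(value_a, value_b)

# Two concurrent shopping-cart updates, reconciled by a set union.
v1 = ({"node1": 2, "node2": 1}, {"milk"})
v2 = ({"node1": 1, "node2": 2}, {"eggs"})
print(resolve(v1, v2, lambda a, b: a | b))
# concurrent versions: merged clock plus the union of both carts

Running this prints the merged clock and the union of the two carts, which is the kind of client-side reconciliation the Dynamo paper describes with its shopping-cart example.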

Sunday, August 10, 2014

Cluster management tools

Cluster management tools provide abstractions to run software applications on a collection of hardware (physical machines or VMs in a cloud). The tools allow you to declaratively specify the resources - e.g. tasks, services - required by the application. The mapping of resources to processes on the hardware is the responsibility of the tools.
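
To make the mapping step concrete, here is a toy Python sketch - my own illustration, not any particular tool's algorithm - in which tasks declare the CPU and memory they need and a first-fit pass assigns each task to a machine with spare capacity:

# Toy first-fit placement of declaratively specified tasks onto machines.
# Purely illustrative; real cluster managers use constraints, priorities,
# preemption and smarter bin-packing heuristics.

machines = [
    {"name": "node-1", "cpu": 4.0, "mem": 8192},
    {"name": "node-2", "cpu": 2.0, "mem": 4096},
]

tasks = [
    {"name": "web",    "cpu": 1.0, "mem": 1024},
    {"name": "cache",  "cpu": 0.5, "mem": 2048},
    {"name": "worker", "cpu": 2.0, "mem": 4096},
]

def first_fit(tasks, machines):
    placement = {}
    for task in tasks:
        for m in machines:
            if m["cpu"] >= task["cpu"] and m["mem"] >= task["mem"]:
                m["cpu"] -= task["cpu"]       # reserve the resources
                m["mem"] -= task["mem"]
                placement[task["name"]] = m["name"]
                break
        else:
            placement[task["name"]] = None    # no machine had capacity
    return placement

print(first_fit(tasks, machines))
# {'web': 'node-1', 'cache': 'node-1', 'worker': 'node-1'}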

Kubernetes

Kubernetes currently provides the following abstractions: pods, replication controllers and services. Pods are collections of containers that are akin to application virtual hosts. Services are used to set up proxies pointing to pods; the pods a service proxies to are identified by labels. Replication controllers monitor a population of pods, also selected by labels, and Kubernetes ensures that the specified number of replicas is available. Kubernetes does not yet allow you to specify resource requirements (CPU, memory) for the application, but there are plans to support this. This talk by Brendan Burns provides a good introduction to Kubernetes. Kubernetes is still under active development.
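
As a rough sketch of the replication controller idea - my own Python illustration, not the Kubernetes API - the controller repeatedly compares the number of pods matching its label selector against the desired replica count and creates or removes pods to close the gap:

# Toy reconciliation loop in the spirit of a replication controller.
# Illustrative only; the real controller runs inside Kubernetes and talks
# to the API server rather than manipulating an in-memory list.

def matches(pod, selector):
    return all(pod["labels"].get(k) == v for k, v in selector.items())

def reconcile(pods, selector, desired_replicas, make_pod):
    current = [p for p in pods if matches(p, selector)]
    if len(current) < desired_replicas:
        for _ in range(desired_replicas - len(current)):
            pods.append(make_pod())              # scale up
    elif len(current) > desired_replicas:
        for p in current[desired_replicas:]:
            pods.remove(p)                       # scale down
    return pods

pods = [{"name": "web-1", "labels": {"app": "web"}}]
make_web_pod = lambda: {"name": "web-new", "labels": {"app": "web"}}
reconcile(pods, {"app": "web"}, 3, make_web_pod)
print(len(pods))   # 3 pods now match the selector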

Mesos

Mesos implements sharing of computing resources using resource offers. A resource offer is a list of free resources on one or more slaves. The decision about which offered resources to use is made by the programs (frameworks) sharing the cluster; Mesos does not collect resource requirements from these programs - instead, a program can "reject" offers it does not want. Mesos isolates resources using OS mechanisms such as Linux containers, which allow CPU and memory usage to be constrained. The isolation mechanism is pluggable via external isolator plugins such as the Deimos plugin for Docker, which lets Mesos offer Docker containers with a specific image along with CPU and memory constraints as resources. Mesosphere is a startup offering Mesos on EC2 among other platforms. Currently, there does not seem to be any auto-scaling support for a Mesos cluster on EC2 - you preallocate EC2 instances to the cluster and Mesos offers resources (e.g. Docker containers) from that set of instances. This talk on Mesos for cluster management by Ben Hindman is a good introduction to Mesos. For an academic perspective, see this Mesos paper from Berkeley.
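
To illustrate the offer model, here is a small Python sketch - again my own illustration, not the actual Mesos framework API - of the framework side of the protocol: Mesos advertises free resources per slave, and the framework launches its pending task on an offer that fits while declining the rest.

# Toy illustration of Mesos-style resource offers: the framework, not Mesos,
# decides which offered resources to use. Offer fields are simplified.

offers = [
    {"slave": "slave-1", "cpu": 0.5, "mem": 256},
    {"slave": "slave-2", "cpu": 2.0, "mem": 2048},
]

task = {"name": "web", "cpu": 1.0, "mem": 512}

def schedule(task, offers):
    accepted, declined = None, []
    for offer in offers:
        if accepted is None and offer["cpu"] >= task["cpu"] and offer["mem"] >= task["mem"]:
            accepted = (task["name"], offer["slave"])   # launch the task here
        else:
            declined.append(offer["slave"])             # offer returns to Mesos
    return accepted, declined

print(schedule(task, offers))
# (('web', 'slave-2'), ['slave-1'])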

There are several schedulers available for Mesos: Aurora, Marathon and, soon, Kubernetes-Mesos.

Aurora

Apache Aurora was open-sourced by Twitter. Job definitions are written in Python.
# hello_world.aurora - Process, Task, Resources, Job and MB are provided by
# Aurora's configuration DSL when it evaluates this file.
import os

# A process is simply a named command line to run.
hello_world_process = Process(name = 'hello_world', cmdline = 'echo hello world')

# A task groups processes and declares the resources they need.
hello_world_task = Task(
  resources = Resources(cpu = 0.1, ram = 16 * MB, disk = 16 * MB),
  processes = [hello_world_process])

# A job binds the task to a cluster and a role.
hello_world_job = Job(
  cluster = 'cluster1',
  role = os.getenv('USER'),
  task = hello_world_task)

jobs = [hello_world_job]

Jobs are submitted using a command-line client:
aurora create cluster1/$USER/test/hello_world hello_world.aurora

Marathon

Marathon - by Mesosphere. It has a well-documented REST API to create, start, stop and scale tasks. When creating a task, you specify the number of instances to run and Marathon will ensure that those instances stay running on the Mesos cluster. Marathon also allows constraints to indicate that tasks should run only on certain slaves, etc. You would have to run a load balancer (e.g. HAProxy) on each host to proxy traffic from outside to the tasks. Jobs are defined in JSON format:

{
  "id": "ubuntu",
  "container": {
    "image": "docker:///libmesos/ubuntu",
    "options": []
  },
  "instances": 1,
  "cpus": 0.5,
  "mem": 512,
  "uris": [],
  "cmd": "while sleep 10; do date -u +%T; done"
}
Submit the job to Marathon via the REST API:
curl -X POST -H "Content-Type: application/json" localhost:8080/v2/apps -d@ubuntu.json
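
Scaling is likewise just a REST call. As a hedged sketch (using the Python requests library and assuming Marathon's documented PUT /v2/apps/{appId} endpoint accepts a partial update of "instances"), bumping the ubuntu app to 3 instances looks like this:

# Sketch: scale the "ubuntu" app to 3 instances via Marathon's REST API.
# Assumes Marathon is listening on localhost:8080 and that
# PUT /v2/apps/{appId} accepts a partial update of "instances".
import requests

resp = requests.put("http://localhost:8080/v2/apps/ubuntu",
                    json={"instances": 3})
print(resp.status_code, resp.text)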

Kubernetes-Mesos

Kubernetes-Mesos - Work has just started for running Kubernetes pods on Mesos.

Omega

Google has been working on the problem of scheduling jobs on a cluster to maximize utilization.  Here is an overview of their work on the Omega scheduler.

Other tools

Apache Helix (https://github.com/linkedin/helix/) - open-sourced by LinkedIn.
Autoscaling on EC2: http://aws.amazon.com/autoscaling/

Sunday, May 4, 2014

Meteor, load balancing and sticky sessions

Meteor clients establish a long-lived connection with the server, uniquely identified by a session identifier, to support the DDP protocol. DDP allows Meteor clients to make RPC calls and also allows the server to keep the client updated with changes to data, i.e., Mongo documents. Meteor uses SockJS, which provides a cross-browser WebSocket-like API that falls back on long polling when WebSockets are not supported along the path to the server.

Meteor with sockjs long-polling

Consider the following setup: Nginx is acting as a load balancer in front of two or more Meteor servers. The Nginx server configuration looks like this:
upstream meteor_server_lp {
   server localhost:3000;
   server localhost:3001;
}
server {
        listen       8084;
        server_name  localhost;

        location / {
            proxy_pass  http://meteor_server_lp;
        }

}
This configuration of Nginx does not support WebSockets, so the Meteor clients will use long polling. Since long polling re-establishes connections every so often due to connection timeouts, such a configuration requires sticky sessions to ensure that a client is directed to the same server it previously established its session with. A SockJS request that is directed to the wrong server by the Nginx load balancer will fail with a 404 Not Found. The solution is to compile Nginx with the sticky module and make the upstream sticky like this:
upstream meteor_server_lp {
    sticky;
   server localhost:3000;
   server localhost:3001;
}
server {
        listen       8084;
        server_name  localhost;

        location / {
            proxy_pass  http://meteor_server_lp;
        }

}
When using this setup behind an AWS load balancer, sticky sessions also need to be enabled on the load balancer - application-controlled stickiness keyed on the "route" cookie set by Nginx's sticky module.

Meteor with web sockets

Nginx, since 1.3.13, supports WebSockets using the HTTP/1.1 protocol switching mechanism. The configuration file looks like this:
upstream meteor_server {
   server localhost:3000;
   server localhost:3001;
}
server {
        listen       8082;
        server_name  localhost;

        location / {
            proxy_pass  http://meteor_server;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";
        }
}
This works as is, without the need for sticky sessions. A WebSocket connection between the client and one of the two servers is established when the client first connects, and is re-established if the client reconnects or one of the servers goes down. AWS load balancers don't support WebSockets with HTTP listeners, but they work with a TCP listener setup. That, however, means any SSL termination must occur on the instances being load balanced.

Friday, March 16, 2012

HAProxy and SSL

HAProxy does not have support for SSL. A common solution is to use Stud to terminate SSL and send unencrypted traffic to the backends.
Terminating SSL in the load balancer is not considered a good idea because it does not scale.
It is considered better to use web servers like Nginx with SSL session caching enabled.
A good benchmark comparing Nginx, Stud and Stunnel is here: http://vincent.bernat.im/en/blog/2011-ssl-benchmark.html.
Another benchmark comparing Stud, Stunnel and Nginx: http://matt.io/entry/uq, and the follow-up, which establishes that Nginx is just as performant as Stud - the key is picking the right cipher:
http://matt.io/technobabble/hivemind_devops_alert:_nginx_does_not_suck_at_ssl/ur

Sunday, February 19, 2012

Tomcat with HAProxy/Nginx

Tomcat is usually fronted with an HTTP server for various reasons - security, load balancing and additional functionality like URL rewriting. The most common options for the proxy are Apache HTTPD, HAProxy and Nginx.

Compile HAProxy from source
$ make TARGET=generic
$ sudo make install

Resources:
http://www.tomcatexpert.com/blog/2010/07/12/trick-my-proxy-front-tomcat-haproxy-instead-apache
http://www.mulesoft.com/tomcat-proxy-configuration
http://haproxy.1wt.eu/download/1.2/doc/architecture.txt

Tuesday, January 31, 2012

Comet technology

Server push and long polling - good descriptions here: http://code.google.com/p/google-web-toolkit-incubator/wiki/ServerPushFAQ
Maturity of Comet implementations: http://cometdaily.com/maturity.html
Best Comet/Streaming server: Caplin Liberator (http://www.caplin.com/caplin_liberator.php)

Sunday, August 14, 2011

Building C++/.NET apps with MSBuild 4.0

In .NET 4.0/VS 2010, Microsoft replaced vcbuild.exe with msbuild.exe.
To build both managed .NET apps and native C++ apps, you only need .NET 4 along with the Windows 7 SDK:
http://www.microsoft.com/download/en/details.aspx?displayLang=en&id=8279
There is no need to install VS 2010.
Here is a walkthrough for a simple hello world C++ app:
http://msdn.microsoft.com/en-us/library/dd293607.aspx
With VS 2010, vcbuild.exe is no longer used to build C++ projects.

For VS 2008 solution files, you will need Microsoft Windows 7 SDK and .NET 3.5:
http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=3138
After installation, use the CMD shell (Programs->Windows 7 SDK->Cmd) to invoke msbuild on solution files.
This version of MSBuild (3.x) uses vcbuild.exe to build C++ projects.

There is no need to install VS 2008.