Saturday, May 15, 2021

Using Queueing Theory to Simulate Server Load

Introduction

One of the projects we've been working on at work involves multivariate time series modelling. We've been evaluating a particular model on a variety of real and synthetically generated datasets, trying to figure out its strengths and weaknesses. Real, clean datasets are hard to come by and may not cover all the scenarios you want to test your model against. We kept coming up with new scenarios and wondering how the model would react to them. Of course, you can write ad-hoc code to introduce those scenarios into your existing dataset, but that doesn't scale: for every new scenario you need to write a bunch of complex code, which is time consuming and error prone.

Queueing Theory

This led me to look into queueing theory, which is commonly used to model server load for tasks such as capacity planning and resource estimation, as well as simulation. Queueing theory is a branch of study designed to model systems involving queues, scarce resources and delays. It can help answer questions such as, "if the number of job requests to the system doubles tomorrow, how much more CPU do we need so that the average response latency remains the same?". Queueing theory predates computer science and is a part of Operations Research, where it has been employed to study the resources required to provide a service. It was first developed in 1909 by Agner Krarup Erlang, a Danish engineer, to model the number of telephone calls arriving at an exchange. It was also used to model the flow of packets in packet-switched networks during the development of TCP/IP in the 70s.

Queueing Theory by Example

Let's say we have a system consisting of a single server with a single CPU and a practically unbounded queue. There is a random process generating jobs at some fixed rate; the jobs arrive and sit in the queue on the server. The jobs are served first come, first served, and each job has some resource requirement in order to be serviced. Let's also assume that the server is capable of serving n jobs per second.

Concretely, we can model the job-generating process as a Poisson process with rate λ, i.e. on average λ jobs arrive per second. Similarly, we can think of the resource requirement of a job as a probability distribution. If all jobs are homogeneous (i.e. require a similar amount of work), we could model it as a normal distribution with parameters (𝛍, 𝛔).
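As a minimal sketch of this setup (using numpy; the rate and work parameters are made-up values):

import numpy as np

rng = np.random.default_rng(seed=42)

lam = 50                   # assumed: an average of 50 job arrivals per second
mu, sigma = 10_000, 1_000  # assumed: mean/std of work (CPU cycles) per job

# Number of jobs arriving in one second under a Poisson arrival process.
num_jobs = rng.poisson(lam)

# Work requirement of each job, assuming homogeneous jobs.
job_sizes = rng.normal(loc=mu, scale=sigma, size=num_jobs)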

Once we have this basic structure in place, we can answer more complex questions. For example, if the number of jobs coming to the server doubles (while the resource requirement of each job stays the same as before), how much more CPU do we need to keep the average response latency the same? Or, if the resource requirement of each job doubles, what will the average response latency of the server be?

Simulating Server Load using Queueing Theory

Now let's talk about how we can use queueing theory to model a server and simulate data from that model. The goal is to model the resource usage of a server hosting a microservice. The server receives requests over the network at some rate, and each request requires some amount of resources in order to be served, e.g. CPU, memory, network, disk etc.

Modelling the Requests

It's possible to model the requests as a random variable following some probability distribution (e.g. exponential or Poisson), but in my case I decided to use an existing time series to represent the request rate. The time series represented the number of requests to a popular ride sharing app. I chose it because it was real-world data with nice weekly and daily seasonality patterns and reflected the kind of data we wanted to simulate. For those unfamiliar with time series data, it consists of observations made at fixed time steps. In this case, the data consisted of the number of requests every minute.

Not All Requests Are the Same:

We could model the resource requirement of all requests as identical, but in real-world situations things are not that simple. For example, for a database, some queries might be simple one-row index lookups which can be served very fast (especially if the data is sitting in cache), others may require reading a large number of rows, and still others may read a large number of rows without using any index or primary key, resulting in slow responses. We decided to model this kind of behavior in our simulation. To do so, we split the requests into 6 categories and defined a categorical (or multinomial) distribution to decide which category a request falls into. For example:

request_job_sizes ~ Categorical(n=6, probs=[1/6, 1/6, 1/6, 1/10, 1/10, 3/10])

This means each request has a 1/6 probability of being in each of the first 3 categories, a 0.1 probability of being in categories 4 and 5, and a 0.3 probability of being in category 6. At simulation time we sample from this distribution to decide the request's job category.

We can also assume that the resource requirements increase with the category, i.e. a category 1 request requires the minimum amount of resources, whereas a category 6 request requires the maximum.
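A quick sketch of sampling categories with numpy, using the probabilities defined above:

import numpy as np

rng = np.random.default_rng()
probs = [1/6, 1/6, 1/6, 1/10, 1/10, 3/10]  # must sum to 1

# Sample a category (1..6) for each of 100 incoming requests.
categories = rng.choice(np.arange(1, 7), size=100, p=probs)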

To model a real-world business service, we probably want to include the time of day as a factor as well. For example, during day hours there might be more requests of certain sizes than others, and during the night the behavior changes. Similar differences can be modelled for weekdays vs weekends. As an example, we could use the above distribution for business hours, from 9 AM to 9 PM, and a different distribution for the night hours, such as probs = [0.35, 0.35, 0.1, 0.2, 0, 0], which basically means that during the night the larger requests stop coming and small requests grow in number. A sketch of this is shown below.
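As a sketch, the simulator can pick the distribution based on the hour of each timestamp (the hour boundaries are the ones assumed above):

def category_probs(hour):
    # Business hours (9 AM to 9 PM): all 6 categories occur.
    if 9 <= hour < 21:
        return [1/6, 1/6, 1/6, 1/10, 1/10, 3/10]
    # Night hours: large requests stop, small requests dominate.
    return [0.35, 0.35, 0.1, 0.2, 0.0, 0.0]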

Another possibility is to model the requests as a Markov chain, for example when we are modelling an application or business workflow where the user follows a certain sequence of steps. For instance, in an e-commerce application, once you log in there is a high probability of doing a search. Similarly, after you click on checkout there is a high probability of going to the payments page rather than going back to search for another item. Such behavior can also be modelled. Say our system consists of 3 (as an example) possible states and starts in state-1. We can define a state transition distribution like this, where P(s′ | s) is the probability of moving to state s′ given the current state s:

P(state-1 | state-1) = 0.5,  P(state-2 | state-1) = 0.3,  P(state-3 | state-1) = 0.2

P(state-1 | state-2) = 0.1,  P(state-2 | state-2) = 0.3,  P(state-3 | state-2) = 0.6

P(state-1 | state-3) = 0.8,  P(state-2 | state-3) = 0.0,  P(state-3 | state-3) = 0.2

Each row conditions on the current state and sums to 1.

We could then define a categorical distribution of request sizes for each of these states.
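A minimal sketch of sampling a state sequence from this transition matrix (rows are the current state, columns the next state):

import numpy as np

rng = np.random.default_rng()

# transition[i][j] = P(next = state j+1 | current = state i+1)
transition = np.array([
    [0.5, 0.3, 0.2],
    [0.1, 0.3, 0.6],
    [0.8, 0.0, 0.2],
])

state = 0  # the system starts in state-1
states = []
for _ in range(1000):
    state = rng.choice(3, p=transition[state])
    states.append(state)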

Modelling Resource Requirements for Request Jobs

For each of the 6 request job categories, we can then define the resource requirements.

Modelling CPU:

We need to define the CPU requirement for each request type. One possible formulation is this:

cpu | request_category_1 ~ Normal(𝛍=10000, 𝛔=1000)

This means that the CPU requirement of a category 1 request is normally distributed with the given mean and standard deviation. The CPU requirement here is defined in terms of the number of CPU cycles required to service the job. If we know the maximum number of CPU cycles per second, we can easily convert this to a percentage CPU requirement. We can define similar CPU models for the other 5 request categories.
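For instance, assuming a 2 GHz CPU (2 × 10⁹ cycles per second, the figure used in the config further below), a sampled requirement converts to a utilization percentage like this:

import numpy as np

rng = np.random.default_rng()
CPU_CYCLES_PER_SECOND = 2 * 10 ** 9  # assumed: a 2 GHz CPU

# Sample the cycles needed by a category 1 request and convert to % CPU.
cycles_needed = rng.normal(loc=10_000, scale=1_000)
cpu_percent = 100.0 * cycles_needed / CPU_CYCLES_PER_SECOND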

Apart from modelling the CPU load caused by the requests, we may also want to factor in some constant background noise from processes and daemons running on the server, which cause a small load at all times.
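For example, mirroring the cpu baseline in the config at the end of this post, the background load could be sampled as a small normally distributed fraction of the CPU at each timestamp:

import numpy as np

rng = np.random.default_rng()

# Baseline CPU load from daemons, one sample per minute for a day
# (loc/scale mirror the cpu baseline_distribution_spec in the config below).
baseline_cpu_fraction = rng.normal(loc=0.08, scale=0.02, size=1440)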

Modelling Memory:

We can assume that handling each request adds some load to the memory utilization of the server. Similar to CPU, we can define a memory usage distribution for each of the request categories. For the sake of brevity I will skip reproducing them here, but they can be modelled using any of the common distributions, such as normal, uniform, Student's t etc.

Modelling Network:

Since the requests arrive at the server over the network, each incoming request causes some traffic in, and the response generated by the server causes some traffic flowing out. These can be modelled similarly to CPU, using distributions of our choice. The interesting part of modelling the network is the background noise, i.e. the traffic that is present at all times because of background processes, daemons and things like health checks (or pings). Usually this is a very small amount of traffic compared to the traffic from actual service requests. It is best modelled using a long-tailed distribution such as the Pareto or Weibull distribution; these are commonly used for data where the majority of values are small but large spikes occur once in a while.
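A sketch of sampling background traffic from a Pareto distribution (the shape and scale values are assumptions):

import numpy as np

rng = np.random.default_rng()

shape, scale = 3.0, 100.0  # assumed: mostly ~100 bytes, occasional spikes

# numpy's pareto() samples a Pareto II (Lomax) starting at 0; shifting by 1
# and scaling yields a classical Pareto with minimum value `scale`.
background_bytes = (rng.pareto(shape, size=60) + 1) * scale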

Modelling Disk:

Disk activity caused by the requests can also be modelled similarly to the other resources, although in our case we decided to model the requests such that they don't cause any disk activity, so the disk is independent of the requests being served by the server. Instead, we modelled disk activity caused by a daemon process running on the server at regular intervals and performing disk reads and writes.

Modelling disk activity is slightly different from modelling CPU or memory. Disks are very slow to read and write, and the access pattern can result in variable latencies. For example, on spinning magnetic disks, data residing on one of the outer tracks can be read faster than data residing on an inner track. Also, the disk controller reads ahead and caches nearby data in addition to the requested data, so that if a request for that nearby data comes next, it can be served from cache. The operating system also caches disk data in its page cache and can serve future reads much faster. We may want to model these behaviors in our simulation as well.

In our case, I decided to model the variation in disk latency due to sector location (outer track vs inner track) using a Bernoulli distribution with a 50% chance of reading from an outer track. Based on whether we are reading from an outer or inner track, we can define the speed of the disk, for example 7200 RPM for the outer track and 5000 RPM for the inner track. We can model the number of bytes read/written as a normal distribution and, based on the disk speed, derive a latency for reads and writes:

Disk read ~ Normal(mean=10000 bytes, std=500 bytes)

Disk write ~ Normal(mean=4000 bytes, std=200 bytes)

P(outer track) = 0.5

P(inner track) = 0.5
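Putting this together, a sketch of sampling one read and estimating its transfer time (the bytes-per-rotation constant is a made-up assumption):

import numpy as np

rng = np.random.default_rng()

BYTES_PER_ROTATION = 65536  # assumed: bytes transferred per disk rotation

# Outer vs inner track chosen by a Bernoulli(0.5) draw, setting the speed.
rpm = 7200 if rng.random() < 0.5 else 5000
bytes_per_second = (rpm / 60.0) * BYTES_PER_ROTATION

read_bytes = rng.normal(loc=10_000, scale=500)
read_latency_seconds = read_bytes / bytes_per_second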


Simulating the Data:

Let's first take a look at what the request data looks like (this is the real-world time series mentioned above, not simulated data). It serves as the base for simulating server resource utilization.


This data represents the number of requests coming to our server at each timestamp. To simulate the server resources utilized, we follow the process below.


Define limits of the system:

First we define the maximum resources available on the system, and then what portion of them can actually be utilized to service requests; in real-world deployments we don't let 100% of the resources be consumed. We first define the maximum capacities, such as:

Number of CPU cycles per second: for example, a 2 GHz CPU provides 2,000,000,000 cycles per second

Maximum RAM: e.g. 32 GB

Maximum Network Bandwidth: e.g. 1 Gbps

Max disk speed: e.g. 7200 RPM for spinning disks, or 500 MBps for a SATA SSD

Then we define the maximum allowed usage of these resources by the service. For example, we could say that only 80% of the CPU and memory can be utilized by the service, as in the sketch below.
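A sketch of these limits as simple constants (the values mirror the examples above):

# Maximum capacities of the simulated system.
SYSTEM_LIMITS = {
    "cpu_cycles_per_second": 2 * 10 ** 9,     # 2 GHz CPU
    "memory_bytes": 32 * 2 ** 30,             # 32 GB RAM
    "network_bytes_per_second": 125_000_000,  # 1 Gbps
}

MAX_UTILIZATION = 0.80  # the service may use at most 80% of each resource

USABLE_LIMITS = {name: value * MAX_UTILIZATION
                 for name, value in SYSTEM_LIMITS.items()}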

Actual Simulation:

At every timestamp we read the number of requests k. For each of the k requests, we determine their job category by sampling from the categorical distributions we defined in our model.

Once we know the job category for each of the k requests, we start to simulate the resource utilization caused by each of those. We do this by sampling from the respective probability distributions we defined for each of the resources. 

Each job coming to the server sits in a queue. The simulator will pick them up and simulate their load. The simulator needs to make sure that load on the system at no point exceeds the maximum allowed load. If such a situation occurs, all the jobs in the queue need to wait till the next timestep at which point some of the currently running jobs may have finished, making space for the new ones.

Let's take an example where we simulate just the CPU load. At the beginning of the simulation the queue is empty and no CPU is being utilized. Now we get k requests with varying job sizes. We go through each of the k jobs and, depending on its size, sample from the corresponding CPU distribution to determine the number of CPU cycles required to service it, then reduce the available CPU cycles for the current timestamp accordingly. We do this for each of the k jobs.

If all k jobs were satisfied within less than 80% of the CPU, the queue will be empty at the next timestamp, because the CPU was able to serve all the requests. On the other hand, if the CPU load reached 80% after just the first m jobs, then at the beginning of the next timestamp we will have k-m jobs waiting in the queue, and we have to give them preference (per the first come, first served policy) before taking on any new requests.
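A condensed sketch of this loop for CPU alone (the per-tick request counts and the job-size sampler are assumed to come from the models above):

from collections import deque

def simulate_cpu(requests_per_tick, sample_cycles, budget_per_tick):
    # requests_per_tick: job counts, one per timestamp (from the time series)
    # sample_cycles: function returning the CPU cycles one job needs
    # budget_per_tick: usable cycles per timestamp (e.g. 80% of the maximum)
    queue = deque()
    usage = []
    for k in requests_per_tick:
        # New arrivals join the back of the queue (first come, first served).
        queue.extend(sample_cycles() for _ in range(k))
        spent = 0
        # Serve from the front until this tick's budget is exhausted; jobs
        # left over wait in the queue for the next tick.
        while queue and spent + queue[0] <= budget_per_tick:
            spent += queue.popleft()
        usage.append(spent)
    return usage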

Note that this is a simplified scenario where we are simulating only CPU data. If we were simulating CPU and memory together, things would be more interesting: even if there were CPU cycles available to serve more requests, memory might be saturated, forcing the remaining jobs to wait.

We can follow this process for all the system resources simultaneously to simulate an actual server. The complicated part is simulating the response latency, because it is not an actual system resource; it is a function of the collective resource availability of the system, and we need to factor in the time a job spends waiting in the queue. For that, we need a clock in the simulator so that we know how many ticks have passed between a job arriving and being serviced.
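One way to sketch this: stamp each job with the tick at which it arrives, and compute the latency when it completes (one tick here corresponds to the one-minute resolution of the request data):

from dataclasses import dataclass, field

@dataclass
class LatencyTracker:
    tick: int = 0  # advanced by the simulator once per timestamp
    arrivals: dict = field(default_factory=dict)

    def on_arrival(self, job_id):
        self.arrivals[job_id] = self.tick

    def on_completion(self, job_id):
        # Latency = ticks spent waiting in the queue plus being served.
        return self.tick - self.arrivals.pop(job_id)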

Here is a sample of the simulated data generated by this process:

We came up with a generic config format to make running such simulations easy. The config used for the data above is provided below. There is potential to simplify this by moving to a custom DSL-based config, which would be simpler while giving more expressive power in defining the behavior of the simulation.

{
  "job_metrics": [
    {
      "name": "http_requests",
      "dist_type": "categorical",
      "num_categories": 10,
      "metric_type": "job",
      "source": "uber-aprsep-14.csv",
      "categorical_probs": {
        "comment": "using different probabilities during day vs night hours and weekdays and weekends",
        "day_of_weeks": {
          "0": {
            "9": "[1/20., 1/20., 1/20., 1/20., 1/10., 1/10., 1/10., 1/10., 2/10., 2/10.]",
            "21": "[1/10.] * 10"
          },
          "5": {
            "0": "[1/6.] * 6 + [0.0] * 4"
          }
        }
      },
      "affected_resources": [
        {
          "name": "cpu",
          "distribution_spec": {
            "distribution_type": "uniform",
            "values_type": "absolute",
            "distribution_parameters": [
              {"low": "10 ** 6", "high": "5 * 10 ** 6"},
              {"low": "5 * 10 ** 6", "high": "8 * 10 ** 6"},
              {"low": "8 * 10 ** 6", "high": "10 ** 7"},
              {"low": "10 ** 7", "high": "5 * 10 ** 7"},
              {"low": "5 * 10 ** 7", "high": "10 ** 8"},
              {"low": "10 ** 8", "high": "5 * 10 ** 8"},
              {"low": "5 * 10 ** 8", "high": "8 * 10 ** 8"},
              {"low": "8 * 10 ** 8", "high": "10 ** 9"},
              {"low": "10 ** 9", "high": "5 * 10 ** 9"},
              {"low": "5 * 10 ** 9", "high": "10 ** 10"}
            ]
          }
        },
        {
          "name": "network_rx_bytes",
          "distribution_spec": {
            "distribution_type": "normal",
            "values_type": "absolute",
            "distribution_parameters": [
              {"loc": "10 ** 3", "scale": "100"},
              {"loc": "2 * 10 ** 4", "scale": "5000"},
              {"loc": "5.0 * 10 ** 5", "scale": "10000"},
              {"loc": "8 * 10 ** 5", "scale": "10000"},
              {"loc": "10 ** 6", "scale": "10 ** 4"},
              {"loc": "5 * 10 ** 6", "scale": "50000"},
              {"loc": "10 ** 7", "scale": "50000"},
              {"loc": "5 * 10 ** 7", "scale": "30000"},
              {"loc": "10 ** 8", "scale": "10 ** 6"},
              {"loc": "10 ** 9", "scale": "10 ** 7"}
            ]
          }
        },
        {
          "name": "network_tx_bytes",
          "distribution_spec": {
            "distribution_type": "lognormal",
            "values_type": "absolute",
            "distribution_parameters": [
              {"loc": "1", "scale": "0.5"},
              {"loc": "2", "scale": "0.6"},
              {"loc": "2.5", "scale": "0.8"},
              {"loc": "3", "scale": "1"},
              {"loc": "4", "scale": "1.3"},
              {"loc": "4.5", "scale": "1.5"},
              {"loc": "5", "scale": "2"},
              {"loc": "6", "scale": "2.5"},
              {"loc": "7", "scale": "3"},
              {"loc": "8", "scale": "3.5"}
            ]
          }
        },
        {
          "name": "memory_used_bytes",
          "distribution_spec": {
            "distribution_type": "normal",
            "values_type": "absolute",
            "distribution_parameters": [
              {"loc": "1e3", "scale": "1e2"},
              {"loc": "5 * 1e3", "scale": "1e3"},
              {"loc": "1e4", "scale": "5 * 1e3"},
              {"loc": "5 * 1e4", "scale": "8 * 1e3"},
              {"loc": "1e5", "scale": "1e4"},
              {"loc": "5 * 1e5", "scale": "1e4"},
              {"loc": "1e6", "scale": "5 * 1e4"},
              {"loc": "5 * 1e6", "scale": "1e5"},
              {"loc": "1e7", "scale": "5 * 1e5"},
              {"loc": "5e7", "scale": "1e6"}
            ]
          }
        }
      ]
    },
    {
      "name": "disk_jobs",
      "dist_type": "categorical",
      "num_categories": 9,
      "source": "sim_disk_jobs.csv",
      "metric_type": "resource",
      "categorical_probs": {
        "comment": "we model this as having 9 kind of jobs, representing small, medium, large and read/write/read+write categories",
        "day_of_weeks": {
          "0": {
            "0": "[0.33, 0.33, 0.33, 0, 0, 0, 0, 0, 0]",
            "6": "[0.33, 0.33, 0.33, 0, 0, 0, 0, 0, 0]",
            "18": "[0.0, 0.0, 0.0, 0.3, 0.3, 0.3, 0.025, 0.025, 0.05]"
          }
        }
      },
      "affected_resources": [
        {
          "name": "cpu",
          "distribution_spec": {
            "distribution_type": "normal",
            "values_type": "absolute",
            "distribution_parameters": [
              {"loc": "0.08 * 2 * 10 ** 9", "scale": "0.02 * 2 * 10 ** 9"},
              {"loc": "0.1 * 2 * 10 ** 9", "scale": "0.03 * 2 * 10 ** 9"},
              {"loc": "0.4 * 2 * 10 ** 9", "scale": "0.1 * 2 * 10 ** 9"},
              {"loc": "0.08 * 2 * 10 ** 9 * 10", "scale": "0.01 * 2 * 10 ** 9 * 10"},
              {"loc": "0.2 * 2 * 10 ** 9 * 10", "scale": "0.02 * 2 * 10 ** 9 * 10"},
              {"loc": "0.3 * 2 * 10 ** 9 * 10", "scale": "0.02 * 2 * 10 ** 9 * 10"},
              {"loc": "0.1 * 2 * 10 ** 8", "scale": "0.06 * 2 * 10 ** 8"},
              {"loc": "0.4 * 2 * 10 ** 8", "scale": "0.1 * 2 * 10 ** 8"},
              {"loc": "0.5 * 2 * 10 ** 8", "scale": "0.1 * 2 * 10 ** 8"}
            ]
          }
        },
        {
          "name": "disk_read_bytes",
          "distribution_spec": {
            "distribution_type": "exponential",
            "values_type": "absolute",
            "distribution_parameters": [
              {"scale": "100.0"},
              {"scale": "200.0"},
              {"scale": "500.0"},
              {"scale": "0"},
              {"scale": "0"},
              {"scale": "0"},
              {"scale": "200.0"},
              {"scale": "400.0"},
              {"scale": "600.0"}
            ]
          }
        },
        {
          "name": "disk_write_bytes",
          "distribution_spec": {
            "distribution_type": "normal",
            "values_type": "absolute",
            "distribution_parameters": [
              {"loc": "0", "scale": "0"},
              {"loc": "0", "scale": "0"},
              {"loc": "0", "scale": "0"},
              {"loc": "1e5", "scale": "2e3"},
              {"loc": "1e6", "scale": "1e4"},
              {"loc": "5e6", "scale": "1e4"},
              {"loc": "100", "scale": "10"},
              {"loc": "1e3", "scale": "1e2"},
              {"loc": "1e4", "scale": "1e3"}
            ]
          }
        },
        {
          "name": "memory_used_bytes",
          "distribution_spec": {
            "distribution_type": "exponential",
            "values_type": "absolute",
            "distribution_parameters": [
              {"scale": "100"},
              {"scale": "500"},
              {"scale": "1000"},
              {"scale": "10"},
              {"scale": "20"},
              {"scale": "30"},
              {"scale": "150"},
              {"scale": "600"},
              {"scale": "1000"}
            ]
          }
        }
      ]
    }
  ],
  "resource_metrics": [
    {
      "name": "cpu",
      "baseline_distribution_spec": {
        "distribution_type": "normal",
        "distribution_parameters": [{"loc": "0.08", "scale": "0.02"}]
      },
      "baseline_usage_type": "percent"
    },
    {
      "name": "disk_read_bytes",
      "baseline_distribution_spec": {
        "distribution_type": "normal",
        "distribution_parameters": [{"loc": "10 ** 2", "scale": "20"}]
      },
      "baseline_usage_type": "absolute"
    },
    {
      "name": "network_rx_bytes"
    },
    {
      "name": "disk_write_bytes",
      "baseline_distribution_spec": {
        "distribution_type": "normal",
        "distribution_parameters": [{"loc": "1e2", "scale": "20"}]
      },
      "baseline_usage_type": "absolute"
    },
    {
      "name": "network_tx_bytes"
    },
    {
      "name": "memory_used_bytes",
      "baseline_distribution_spec": {
        "distribution_type": "pareto",
        "distribution_parameters": [{"a": "1.0"}]
      },
      "baseline_usage_type": "absolute"
    }
  ]
}


