
Tackling Database Resource Utilization
By Andrii Hrachov, July 2022


A developers’ guide

Photo by Campaign Creators on Unsplash

Database resource utilisation is frequently overlooked.

In the early project stages, when there is neither load nor a large amount of data, there isn’t much sense in hyper-optimising everything. The main focus is to build features to launch an application.

When the live traffic comes in, some performance degradation starts to appear. Usually, there is some extra indexing happening, but still, at this stage, it’s easier to add extra hardware than trying to “calculate bytes” and spend a tremendous amount of time on measuring, debugging, and fixing performance bottlenecks.

Software engineering is an endless journey of finding the right balance at the right point in time. Remember the famous saying:

“Premature optimization is the root of all evil.” — Donald Knuth

This is always true for writing code/designing a system, and it's also true for database usage tuning. Good engineering always considers ROI (return on investment), and until your company becomes "big tech" (200+ engineers), it usually makes more sense to invest time and energy in building product features and activities that bring more clients, revenue, traffic, etc.

When the company has big cloud budgets, you often won't do database optimisation either, because there is always the possibility to scale up hardware, which is an efficient and fast way to tackle the problem. In some cases, it makes more sense. (Again, engineering time is the most precious resource, isn't it?)

At some point, scaling the hardware is no longer feasible. The amount of data and load grows significantly, and scaling the existing hardware is either too costly or doesn't bring any benefit. For example, a single unoptimised query or a database schema flaw can eat up 100% of 96 vCPUs, which is the maximum on Google Cloud SQL. History knows such cases.

Looking a bit ahead: query optimisation is part of tackling database resource utilisation. A major part, but usually you cannot get away with only optimising your queries if you have a big distributed application. Some places will probably have to be revisited, some parts rewritten and moved around, and even the DB engine itself may have to be changed.

This article does not aim to give an in-depth performance course across several databases (which would take more than a single article); the purpose is rather to give a high-level overview of common things happening inside various database systems and how to mitigate common problems.
Let's start.

Throughout system design articles and software engineering experience, you have probably encountered concepts like "write-intensive" or "read-intensive." What do they mean? Well, it's pretty simple:

Write-intensive — the system is oriented towards handling mostly writes

Read-intensive — the system is oriented towards handling mostly reads

A good example of a write-intensive system is an IoT telemetry data ingestion system. While it handles millions of writes per second (on large systems), data reads happen infrequently, either in the form of pre-aggregated data (dashboards) or real-time analytics over a sliding time window (monitoring). Neither of these tasks induces a heavy read workload, especially compared to the number of writes.

For a read-intensive system, imagine a news website. Millions of visits per day, hundreds of thousands of customers, and millions of read QPS. However, edits/updates happen very infrequently (how many times per day is content edited on a news website? Tens of times, probably).

A mixed write/read-intensive system is one where writes and reads are more or less equal and produce the same amount of load on the system. This is the most complex scenario, involving a lot of tradeoffs, dancing around the CAP theorem, and aligning with business stakeholders.

While the concept is very simple, based on common logic and on predicting usage patterns from the business purpose, it is surprising how often the step of defining this crucial characteristic is skipped.

Understanding whether our system is read-heavy or write-heavy directly affects one of the most important parts of our design: database selection.

If you have predictable behaviour, you have the luxury to choose a specialised database for your needs, and this will give you a better load/price ratio.

In the above examples, a news website can use something like Elasticsearch. It has many useful features, like good read/aggregation performance and index coverage, and it's usually a go-to solution for full-text search. But everything comes with a price, and Elasticsearch is no different: fast read performance and scalability come at the price of a slow ingestion rate.

For write-intensive workloads with occasional queries, little performs better than column-based databases like Cassandra or ClickHouse. (There are many more than these two, but these are probably the most popular.)

But in most cases, you will probably end up with a mixed write/read workload and the need to balance the two somehow. So the choice will be a general-purpose database, SQL or NoSQL, like PostgreSQL or MongoDB.

In this article, we will use MongoDB Atlas monitoring metrics for explanatory purposes. Similar tooling (open source and paid) can be found for every popular database.

"If you don't know where you're going, any road will take you there." — Lewis Carroll

It's quite hard to achieve something without quantifiable objectives. "We need the system to work fast" is a fuzzy objective, and everybody has their own feeling of what "fast" actually is.

For some systems, having 95% of HTTP requests served under one second is OK; for others, one millisecond is already too much. First, we need to define what “fast” means for us in numbers that can be measured.

“(only) What gets measured, gets managed.” — Peter Drucker

An important thing to mention here: the initial "null hypothesis" is not forever. Set the 95th percentile at 100ms and aspire to 50ms. It's a live, constantly changing target, and there is no shame in revisiting the objectives and moving them up or down. The only thing that matters is the business need. If the business is OK with 100ms, and 50ms would cost four times more in terms of hardware, there is no need to squeeze the system to 50ms.
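
As a small illustration of what a measurable objective looks like, here is a sketch of computing the 95th percentile (nearest-rank method) from a list of measured latencies. The numbers are made up; the mongo shell is just JavaScript, so this runs there as well as in Node.

// Hypothetical latencies in milliseconds collected from monitoring.
const latencies = [12, 48, 70, 95, 110, 130, 85, 64, 101, 99].sort((a, b) => a - b);
const p95 = latencies[Math.ceil(0.95 * latencies.length) - 1];
print("p95 latency:", p95, "ms"); // compare this number against the objective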

CPU, memory, and disk utilisation (and also network).

The database is an "application" in a wide sense, so at a high level it can be understood the way we understand our own applications. Our applications also use CPU, memory, and, occasionally, disk. To properly tune the database, we need to understand what's happening there, at least at a high level. So, how are these three pillars of resource usage allocated? Let's see.

The CPU

CPU is used in every operation. It's usually the main metric for DB monitoring and the most descriptive one. Inside every database, computational work is performed, whether we scan indexes, read data from disk, or write data. However, CPU usage heavily depends on the other two pillars, and optimising memory and disk usage can significantly affect it. That's why, as explained below, we should address CPU usage last in the chain.

A readable breakdown of our CPU usage in the Atlas monitoring dashboard. Notice IOWAIT (time when the CPU sits idle waiting for I/O and cannot be used for other tasks)

The memory

Like the CPU, memory is used for pretty much everything. We allocate some space for our DB objects when we perform operations, and for a very important part of our database: the cache for disk reads. Why is the cache so important?

The more memory the database has, the less frequently the disk will be accessed.
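
For MongoDB with the WiredTiger engine, a rough way to see how the cache is doing is the serverStatus command. A minimal sketch; the stat field names vary slightly between versions, so treat them as illustrative:

// Inspect WiredTiger cache usage in the mongo shell.
const cache = db.serverStatus().wiredTiger.cache;
print("bytes currently in cache:", cache["bytes currently in the cache"]);
print("maximum bytes configured:", cache["maximum bytes configured"]);
print("pages read into cache (i.e., disk reads):", cache["pages read into cache"]);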

The Disk

Reading from cache is especially important when we are talking about disk utilisation and disk performance. In this article, we won't touch on specific cases like magnetic HDDs, self-managed SSDs, or NVMe SSDs; a whole separate article could be written on that topic, and there is a tremendous amount of specifics that differ from database to database and from hardware to hardware. To simplify things, we will take the most common setup: a regular SSD on some cloud provider (AWS, GCP, Azure, etc.).

As stated before, memory is used in almost every operation, and usually, the easiest way to tackle memory problems is to add more memory. But let’s step back and think about what will happen when we are out of memory.

When we deploy our applications, we have the luxury of limiting our application memory usage via Kubernetes limits, Linux cgroups, ulimit, etc., and we can restart them if they run out of memory. It’s not ideal, but we can usually afford that.

Note: If you are developing mission-critical software for medical devices, planes, or space missions, please close this article and forget what is written here.

Good memory usage metric; we still have a lot of memory available

Databases, however, have stricter operational requirements: even when they run out of memory, the DB still has to operate. We don't want our data to be corrupted across many applications, do we?

To address that, the operating system can lend us a hand in the form of memory swap. Here's a very brief explanation of how swap works: it's a dedicated place on the disk that holds "virtual memory," which looks the same from the app/database perspective, but instead of being read from RAM, the data is read from disk. Reading from disk is orders of magnitude slower than reading from RAM, resulting in more disk read operations, longer execution times, longer thread waiting times (IOWAIT), and, as a result, increased CPU usage.

Even an SSD is quite slow; max latency can be up to 21ms

As a first step, it is advised to watch swap metrics and swap occurrences, adding memory if needed.

A good sign on the swap metric is no swap at all

Disk capacity is usually not the issue (and when it is, it can easily be fixed). Often, the real problem with the disk is throughput.

When deployed to the cloud, there will always be an input/output operations per second (IOPS) limit. Whenever the database performs a disk read or write, IOPS capacity is consumed. If we run out of IOPS, operations will be throttled, which results in thread waiting times. It will look like the CPU is overloaded, but it will actually be idle, waiting for a disk operation to complete, and those cycles cannot be used to perform other tasks (scan indexes, etc.).

Monitoring IOPS. It looks like we have twice as many writes as reads in our system

Tackling read IOPS is pretty straightforward: usually, adding more memory helps. Even if no swap is happening, with more memory we will have more data cached, and therefore less disk access. Hurray!

Write IOPS are slightly more complex because of write amplification. Each time we write something to the database, many things happen: the DB has to write to the journal (if enabled), insert the data into the particular disk area, and update all the indexes that the data touches. We're most interested in that last part.

The easiest way to reduce write IOPS is to reduce the number of indexes.

The idea is pretty simple: it's easier to write 1x than 10x. Doing so will reduce write IOPS, disk access frequency, and IOWAIT, and, as a result, again, reduce CPU usage.

Also, when we deploy to the cloud, the IOPS limit can usually be increased easily. This is a good first response, and the total cost will not be affected as much.

When we are sure we aren't running out of memory and disk IOPS is not the problem, let's move on to the application code.

OK, increasing memory helps, but what if increasing resources is not an option? For example, when we would have to upgrade to another tier that costs twice as much and comes with more powerful CPUs that we don't need.

Let's think again about what memory is used for. The most obvious, mission-critical thing is loading indexes for fast access. Reading from RAM is far faster than reading from disk, so a database first loads indexes into memory, where they can be quickly accessed.

Cache reads, MB/sec

These structures (B-trees; don't confuse them with binary trees!) take up space. There are many other index structures out there as well (LSM trees, hash tables, etc.), but the point stands: in the case of large collections/tables, index size can grow to be comparable to the collection size, and several indexes together can take up more disk space/memory than the entire data in a collection.
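
In MongoDB, a quick way to compare how much space the indexes occupy versus the data itself is the collection stats. A small sketch, assuming the persons collection used later in this article (in newer shells the $collStats aggregation stage can be used instead):

// Compare data size vs. index sizes for a collection.
const stats = db.persons.stats();
print("data size:        ", stats.size);
print("total index size: ", stats.totalIndexSize);
printjson(stats.indexSizes); // size of each individual index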

So, as a first step, watch out for what you index! Avoid wildcard indexes and excessive indexing on fields you don’t need!

Most databases come with tooling that allows you to watch index size and utilisation. Remove indexes you don't need; reduce the size of the indexes you do need. Pretty simple, isn't it?
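
For MongoDB, per-index usage counters are exposed through the $indexStats aggregation stage. A hedged sketch of how unused indexes could be spotted and removed; the index name at the end is hypothetical:

// List per-index access counters collected since the last server restart.
db.persons.aggregate([{ $indexStats: {} }]).forEach(idx => {
  print(idx.name, "- accesses:", idx.accesses.ops);
});
// Once an index is confirmed to be unused (and not backing a constraint):
// db.persons.dropIndex("someUnusedIndex_1")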

Indexes and index usage stats from the "MongoDB Compass" tool. Similar tools exist for every database system.

A more advanced technique (if your application logic allows it) is partial indexing. Let's say your data removal approach is soft deletion, which marks records with a deleted: true field. This field is then used in every query to ensure removed data never reaches the code that queries for it.

So a logical question appears: do we still need to index fields in these deleted documents? We don't, so a partial index expression can be created. The old-to-new index size ratio will vary with the business model; in our case, we reduced the index size by 3x-5x (from 6GB to 1.2GB).
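
As a sketch of what such a partial index could look like in MongoDB, assuming the soft-deletion flag described above:

// Index only documents that are not soft-deleted.
db.persons.createIndex(
  { lastName: 1 },
  { partialFilterExpression: { deleted: false } }
)
// Queries must include the same filter so the planner can pick the partial index.
db.persons.find({ lastName: "White", deleted: false })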

Reducing index size means a smaller memory footprint, which leaves room for other operations, more caching, less disk access, and thus less CPU usage.

As we already discovered, the CPU is used everywhere and is affected by pretty much everything. But before jumping to query optimisation, which is actually the first thing that comes to mind when talking about optimising a DB, we need to ensure we have no side effects. Doing this will make sure the measured results are clean and not skewed by disk IOPS or a lack of memory.

Optimising queries is a very broad topic that differs from database to database. So, again, that would take another big article. Here, we’ll briefly describe the common concepts.

So, how do we measure this? There is a variety of tools and built-in profilers shipped with every database. For MongoDB, the best metric to track is the ratio of scanned keys/documents to returned documents.
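
In MongoDB, this ratio can be checked for any individual query with explain(). A minimal sketch, using the hypothetical persons collection that appears in the examples later in the article:

// Check scanned vs. returned counts for a query.
const res = db.persons.find({ firstName: "Walther" }).explain("executionStats");
const s = res.executionStats;
print("keys examined:", s.totalKeysExamined);
print("docs examined:", s.totalDocsExamined);
print("docs returned:", s.nReturned);
// Ideally, examined/returned is close to 1.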

Full table/collection scans

Querying data without an index results in a full collection scan, which means that for a collection of 100,000 records we get a 10,000/1 ratio when retrieving 10 of them. That's a large number, and it's good to remember this involves disk read operations, memory load operations, and the iteration itself. You can think of it as a big "for loop" that has to read from disk first and then iterate over each record. So make sure your application doesn't perform unindexed queries; it takes a great toll on the CPU.

Full collection scan shows how many documents are examined without an index

As we already know, the most popular structure databases use to index records is the B-tree. A lookup in a B-tree is an O(log N) operation, which means that for one billion records in a balanced B-tree there will be only about 30 hops. That is what makes indexing so efficient: as the dataset grows, the number of lookups performed increases only slightly. Our CPU feels relief, and we can perform significantly more queries per second on an indexed set.
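
A quick back-of-the-envelope check of that 30-hop figure (the mongo shell is just JavaScript):

// log2 of one billion records ≈ number of levels to traverse in a balanced tree.
Math.ceil(Math.log2(1e9)) // 30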

The query performs 200x faster with an index

Compound indexing and complex queries

However, as with everything else, it's not as simple as it looks, especially regarding complex queries and compound indexes (indexes that span multiple fields). You can encounter queries that use an index while performance is still poor and the scanned keys/documents to returned objects ratio is still high. It looks complicated, but if we think it over a little, everything has its reason. For example:

If we query by one field that exists in an index and one that doesn't, the index only narrows down what would otherwise be a full collection/table scan.

Let's say we have a persons collection with a million records (or more), and we want to find all the persons named Walther who are 30 years old.

db.persons.createIndex({firstName: 1})
db.persons.find({firstName: 'Walther', age: 30})

firstName is indexed, and age is not. What will happen on the DB side? The DB will reduce our dataset to records with the firstName "Walther" using the B-tree, then perform a scan over this subset of data. Again: inefficient O(N) operations, a poor scanned/returned ratio, increased CPU usage.
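
Given the example above, a minimal fix is to include both queried fields in a compound index so the whole predicate is served from the B-tree:

db.persons.createIndex({firstName: 1, age: 1})
db.persons.find({firstName: 'Walther', age: 30}) // both fields now hit the index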

Another thing to remember when using compound B-tree indexes is that the order of the index keys matters.

That's because the tree is split by the first key, then by the second, and then by the third. B-tree internals are a bit more complicated than that, but let's not go there unless we want to build our own DB. What does it mean for us? Consider this code:

db.persons.createIndex({firstName: 1, age: 1})
db.persons.find({age: 30})

Having a compound index with name first and age second, and querying by age only, will result in an inefficient index scan, or even a full scan if the DB query planner decides that will be faster. (Why go to the index if it will be a full scan anyway?)

Again, big scanned/returned ratio, low performance, high CPU usage.

The same effect occurs if we skip some fields of the index in the query. Let's say we have a compound index on firstName, age, and lastName. What will happen if we query only by firstName and lastName?

db.persons.createIndex({firstName: 1, age: 1, lastName: 1})
db.persons.find({firstName: 'Walther', lastName: "White"})

Pretty much the same as in the previous example, with the only exception that the DB will use only part of the index (the firstName prefix). After that, it will perform a full sweep over the remaining subset, as the natural key order was broken. On the other hand, MongoDB can use an index prefix: a leading subset of the original big compound index.

So if we query our persons collection only by firstName and age, it will use the index perfectly, with a one-to-one scanned/returned ratio, as sketched below. This leads to low CPU usage and high performance. It's a win. How indexes work is explained in detail in the MongoDB University performance course, which is highly recommended if you want to do a deep dive.
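
For illustration, reusing the compound index from the previous snippet:

// {firstName: 1, age: 1, lastName: 1} serves these queries via its prefixes.
db.persons.find({firstName: 'Walther'})          // uses the {firstName} prefix
db.persons.find({firstName: 'Walther', age: 30}) // uses the {firstName, age} prefix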

Note on in-memory sorts

Even with the most optimised queries, the sort stage can be problematic. Why does it happen? Even if we fetch data using an index, if our B-tree doesn't know the requested order, it will not return results sorted the way we need. We still instruct the database to do the sorting, so what will the database do? Pretty much the same thing we would do in our application: use a quicksort/mergesort algorithm to return the results.

And, depending on the subset returned from the B-tree, that can mean database-side sorting over several thousands of records, even when we specified a query limit and query criteria. Imagine the memory/CPU footprint of doing this in our application. The database is no different.

Query with high keys/returned ratio and in-memory sort
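
Besides the profiler view, an in-memory sort can also be spotted with explain(): a blocking SORT stage in the winning plan means the database sorts the results itself instead of reading them in index order. A small sketch (the sort field is chosen so that, under the indexes from this article, no index order matches):

const plan = db.persons
  .find({firstName: "Walther"})
  .sort({lastName: 1}) // no index provides this order after the equality match
  .explain("executionStats");
printjson(plan.queryPlanner.winningPlan); // a "SORT" stage means an in-memory sort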

How to tackle in-memory sorts?

Make sure that our compound B-tree indexes include the fields that need to be sorted and that they come after the fields queried by equality (the Equality-Sort-Range rule). And, of course, make sure the query doesn't skip fields (the sort is also part of it).

db.persons.createIndex({firstName: 1, lastName: 1, age: 1})
db.persons.find({firstName: 'Walther', lastName: "White"}).sort({age: 1})

So, again: a perfect index match == fewer operations performed == reduced CPU/memory usage. A huge win.

Advanced — ultrafast index-only queries

We did all the indexing we needed: the scanned/returned ratio is one or close to one, we have no in-memory sorts, CPU usage is OK, and memory is within limits, but the performance still doesn't satisfy us.

Let's think about what happens after our database gets some data from the index.
Absolutely correct: it will read the disk to fetch the data (if it's not in the cache). We already know that disk reads take a heavy toll on resource utilisation and block the CPU (IOWAIT). Can we avoid them somehow?

If our query needs only firstName and age from our persons collection, we can instruct the database to return only these fields. That's where the magic happens, because we already have them in the index. The DB query planner is smart enough to understand that, so it will fetch the data directly from the index, completely omitting the disk/cache read stage.

db.persons.createIndex({firstName: 1, lastName: 1, age: 1})
db.persons.find({firstName: 'Walther', lastName: "White"}, {firstName: 1, age: 1, _id: false}).sort({age: 1}) // project only indexed fields and exclude _id so the query is covered

This is the most efficient way to retrieve data from the database, as it will be completely in-memory and can result in tremendous performance uplift, reducing CPU/disk usage significantly.

A covered query ("index-only scan" in Postgres terms). Notice 0 documents examined: it executed completely in memory, without touching the disk. The 6ms execution time is because a shard merge still needs to be performed in our system; usually these queries take 1ms or less.

A lot of specifics will differ between databases. Some of them are completely in-memory (e.g., Memcached, Redis), so disk utilisation is not a problem there. Some are highly optimised for commodity hardware (ClickHouse), so even magnetic disks cause fewer problems than in other databases. Some are more memory-efficient, some less, but the main approach to tackling resource utilisation stays the same:

  • The database performs work.
  • Work consumes resources.
  • Understanding how the database works and how the resources are allocated is the first step to reducing resource usage, saving money, leaving a smaller carbon footprint, and increasing overall performance and engineering satisfaction.

To wrap up, here is a short checklist:
  1. Is your system read- or write-heavy? Do you have a proper database for the workload?
  2. Revisit your monitoring solution. Ask yourself a question — do you have detailed monitoring of what’s going on?
  3. Check metrics. Are you running out of IOPS, memory, or CPU? Do you have a swap happening?
  4. Revisit indexes and clean up those not used. Use partial indexes where it’s feasible. What is the index size? How frequently are they used?
  5. Revisit compound indexes again to check proper keys’ ordering according to access patterns. What is the scanned index keys/documents ratio? Do you have in-memory sorts?
  6. Profile slow queries and cover hotspots with proper indexing.
  7. Do a deep dive — enroll in the performance course that covers your database system. You can learn more about the different tools/approaches available there.

That’s it. Happy hacking!


