What Is a Database Schema?
That’s actually a bit of a difficult question to answer! Like many database terms, database schema gets used in a variety of contexts. Depending on the context, it can mean different things.
In the context of generally discussing a database and how it’s organized, the term database schema describes the structure of the data and how the elements within the database relate to each other. This information is often presented visually with diagrams that illustrate how different tables and elements connect to give you a logical view of the entire database. However, the schema itself exists in the database as the coded rules that define the data’s structure and relationships.
In this context, the database schema is typically something that’s typically planned prior to the creation of a new database. It reflects the design of your database, and it’s always best to design before you start building! An architect or developer considers what is required based on the data model or application, and then makes decisions about the tables, data types, foreign key constraints, etc. that will be included. The resulting design is called the database schema.
To sum up, a database schema:
- Defines how data in a database is structured
- Defines how elements within a database relate to each other
- Accomplishes the above through the implementation of coded rules
In this article, we’ll focus primarily on the above definition of database schema, as architects and developers must think carefully about their schema design prior to creating and implementing a database. But first, to avoid confusion we need to understand a few other ways the term database schema gets used.
Other Definitions of “Database Schema”
In SQL databases, the term database schema can also have another, more specific meaning. The term “schema” is sometimes used to represent a named group of database objects. This type of schema is more precisely called user-defined schema, so to avoid confusion, we’ll use that term for the remainder of this article.
A single database may have multiple user-defined schemas, and different user-defined schemas can be used to give different users or user groups access to different parts of the database.
For example, let’s say we have a logistics team and we want to give them access to
shipments tables while keeping them from accessing the PII (personally identifiable information) in our
customers table. We could do this by creating a user-defined schema called
logistics that contains only the database objects (
shipments) that we want the logistics team to be able to access.
To review, a user-defined schema is:
- A collection of user-chosen database objects such as tables, views, etc.
- Useful for security and access management
The term database schema is also sometimes used to refer to other things, such as a flowchart that provides a visual representation of a database (i.e. an entity-relationship diagram, or ER diagram). For the purposes of this article, though, we’re going to focus on our first definition of database schema – the big-picture database design that we need to think about before creating any database.
What About “Schemaless” NoSQL Databases Like MongoDB?
Historically, one of the limitations of using a database management system with a defined schema was that because schemas enforce structure, it’s difficult to store unstructured data. Another limitation was that changing a database’s schema after the database was in production required downtime. That meant that developers either had to put a lot of thought into the construction of their schema up front or risk paying a heavy price for making changes further down the line.
“Schemaless” NoSQL database solutions address some of these problems, but they’re not a good solution for applications with heavy transactional workloads, as many of them struggle with consistency at scale. Additionally, developers setting up a NoSQL database for their application will have to ensure that any necessary constraints on the data are in place at the application level, since the database itself doesn’t have a schema that enforces these constraints.
Thankfully, modern distributed SQL databases make it possible for developers to “have their cake and eat it too.” For example, CockroachDB supports online schema changes, which means developers can quickly spin up a database for their projects and modify the schema over time without having to take their database offline. CockroachDB also supports the storage of unstructured data via the
JSONB data type.
Best Practices for Great Database Schema Design
Depending on the database you’re using, it can be quite important to think carefully about your schema before you create your database. While CockroachDB supports online schema changes, making it easy to adjust and adapt your schema over time with zero production downtime, schema changes can be a bit more painful with legacy databases.
So how can you get it right the first time? Every use case is different, and your specific needs are going to define how your database will be designed.
The first step in designing any database schema is to build a thorough understanding of all of the data you’ll be storing. Creating a “data dictionary” that lists and defines every column of data you intend to store is a way to approach this task that’ll also leave you with a helpful final document you can share to help others understand your data in the future.
Once you’ve got that down, here are some important things to consider for the construction of your database schema:
Figure out exactly what data you’ll want to store, and then begin to separate it into tables and columns. The goal here is to finish with a structure that makes sense, with thematically-organized tables and minimal or no repetition across tables.
Give your tables and columns descriptive names, and avoid redundancies such as calling a table
orders_table – it should simply be called
orders. You also need to avoid using any reserved words in the names for your tables, columns, etc.
Plan the constraints you’ll add to each table, including the primary key and any foreign keys, and the format (single-column or multi-column) and data type you’d like for each constraint.
This is an area where you’ll want to take your chosen database into account, because performance can vary. For example, while using an auto-incrementing column as a primary key can work well for a traditional single-node database, if you’re using a distributed database such as CockroachDB this approach can create a “hotspot” where a single node ends up handling most of the workload.
To spread the workload out and take advantage of the distributed architecture, it’s best to use an auto-generated UUID.
This is just one example. Regardless of the database system you’re using, it’s a good idea to understand the best practices for primary and foreign key constraints before you design your schema.
Consider the appropriate data types for your columns, keeping in mind that columns with a foreign key relationship must share the same data type as the parent column. Here, too, it is wise to see if there are best practices that are specific to your database technology. For example, CockroachDB supports time-related data types including
TIMESTAMPTZ, and recommends using
TIMESTAMPTZ as a best practice. Other systems may support and recommend other formats.
It’s never too early to start thinking about performance! It’s a good idea to create secondary indexes for any columns you expect to use for sorting or filtering data. That said, don’t go crazy here, as having too many indexes can slow write performance and eat up the available node memory.
Security and Encryption
Database security is a consideration that goes far beyond just schema but is also relevant to schema design if your database supports selective encryption. If your database supports it, encrypting tables with PII while leaving non-sensitive tables in plaintext will enable you to get the best performance from your database without compromising security.
Geolocating data close to the users who access it is a best practice for performance, and locating data in the same country where a user is located is also a legal requirement in some countries. If performance or regulatory compliance are likely to be important for your application, and if you have a database that supports geopartioning by table and even by row, it may be worth considering whether specific tables should be associated with specific locations as part of your schema design.
Arguably as important as the schema design itself is documenting your schema design so that other people can understand it. At a minimum, you’ll probably want to create an ER diagram that illustrates your database structure visually, and a data dictionary that provides additional information and context for each table and column in the database.