A Beginner’s Introduction to Professional Database Design

Understanding the Fundamentals of Effective Database Structuring

Jun 07, 2023

Database design is the process of creating a well-structured database that efficiently (1) stores and (2) retrieves data. Databases have formal / technical properties that a designer should strive to achieve in order to enforce data integrity constraints that are desired and expected from a set of business requirements.

A well-designed database ensures that data integrity is enforced and provides great performance and scalability, especially in the flexibility of adding or removing independent applications that communicate with the underlying data.

This is a shorter guide, compared to the other theoretical and architectural database posts, which focuses on outlining the key concepts in database design. It allows a new reader or experienced developer to thoughtfully think through these considerations as they begin or continue their understanding into formal database theory and design.

Key Concepts

Data Modeling
Normalization
Indexing

Data Modeling

Data modeling involves defining the structure of your database, including tables, relationships between tables, and constraints. It helps in organizing and representing data in a way that meets the requirements of your application.

Data modeling involves:

Entities and Attributes
- Entities represent objects or concepts
- Attributes are the properties that describe them
- For example, in a customer database, an entity could be "Customer" with attributes like (Name, Email, Date of Birth, …)
Relationships
- Relationships define how entities relate to each other
- For example, a "Customer" entity might have a relationship with an "Order" entity, indicating that customers place orders
Diagrams
- Entity-Relationship Diagrams (ERDs) are used to visually represent the data model

To recap:

A data model is nothing more than a structured way to model / represent an underlying system, given how relational databases store and retrieve data
An ER diagram is a visual representation of a data model that will allow us to take a model and eventually convert it into relations (tables) inside our database system

Types of Data Models

Like the design of any complex piece of software, the database design process has several key steps
In the database design process, we typically begin with business requirements that must be converted into a set of entities and relationships, not only for our database design but also for the applications that will have user or system interactivity.
Converting requirements into a final design requires various iterations and the use of data models helps us move towards the final specification in a systematic manner. This is where different types of data models come in to achieve this incremental modeling process.

Types

Conceptual Data Model
- High-level overview of business concepts and relationships
- We’ll consider visual and structured data diagrams as well as the final relations and properties
Logical Data Model
- More detailed, without focusing on how the data is physically implemented
- We’ll strongly consider the data integrity mechanisms to ensure our data is always correct and sound
Physical Data Model
- Specific to the database management system (DBMS)
- Details the tables, columns, indexes, and relationships
- We consider what database system we’ll use as well as its capabilities

Normalization

Normalization is a technique used to minimize redundancy
The technique involves reorganizing your tables and fields in a way that minimizes duplication and dependencies between fields - for example, storing the age and date of birth, both, in a database would be considered redundant and normalization helps us (1) identify these inefficiencies and (2) systematically address them such that we arrive at the best database implementation

Normal Forms

Normal forms are ‘levels’ of achievement for a certain amount of normalization that has been achieved
There are different normal forms, which come with a set of guidelines that logically check if a database design is within that normal form - the benefit of a database being classified into a normal form is that it guarantees that a set of integrity problems will not exist at that level
Recipes / algorithms allow us to jump between one normal form into another, either breaking up our tables further (known as decomposition) or bringing our tables back together (known as synthesis)

The different normal forms include

First Normal Form (1NF)
- Ensures each column contains atomic (indivisible) values
- Each entry in a column is of the same data type
Second Normal Form (2NF)
- Meets all the requirements of 1NF
- Ensures that all non-key attributes are fully functionally dependent on the primary key
Third Normal Form (3NF)
- Meets all the requirements of 2NF
- Ensures that all attributes are not only fully dependent on the primary key but also independent of each other (no transitive dependency)
Boyce-Codd Normal Form (BCNF)
- A stricter version of 3NF
- Addresses certain types of anomalies not covered by 3NF
Higher Normal Forms
- 4NF and 5NF (among others) - address multivalued dependencies and join dependencies
- Higher normal forms reduce redundancy even further

What is Database Normalization? How to Normalize a Database

Normalization Example 1 - Passes

Below is a simple database for managing a library's books, authors, and publishers

Entities

Books
Authors
Publishers

Attributes

Books
- BookID (Primary Key)
- Title
- AuthorID (Foreign Key)
- PublisherID (Foreign Key)
- YearPublished
Authors
- AuthorID (Primary Key)
- AuthorName
Publishers
- PublisherID (Primary Key)
- PublisherName

Relationships

Each book has one author
Each book has one publisher
Each author can write multiple books
Each publisher can publish multiple books

Normalization Steps

First Normal Form (1NF)

1NF states that

Each column contains atomic values
Each record entry in a column is of the same data type
Each table must have a primary key

Given the tables defined below, this meets the three conditions above:

Each column contains an atomic data type
Each record entry insertion is enforced by the database system and guarantees that we always have the same data type per column
Each table defined has a primary key

CREATE TABLE Books (
    BookID INT PRIMARY KEY,
    Title VARCHAR(255),
    AuthorID INT,
    PublisherID INT,
    YearPublished INT
);

CREATE TABLE Authors (
    AuthorID INT PRIMARY KEY,
    AuthorName VARCHAR(255)
);

CREATE TABLE Publishers (
    PublisherID INT PRIMARY KEY,
    PublisherName VARCHAR(255)
);

Second Normal Form (2NF)

2NF states that:

All non-key attributes are fully functionally dependent on the primary key
No partial dependency of any column on the primary key

Given the same database outline in 1NF:

Key attributes in our example refer to the primary keys, since we don’t define any other UNIQUE/key column
Each non-key attribute in the Books table is dependent on the entire primary key (BookID)
We can guarantee this because all attributes are directly related to the record stored - it would not make sense to have any of these fields independently / separately stored, since they’re all related - and the reason they’re dependent on the primary key is because the primary key serves as a unique identifier

Third Normal Form (3NF)

3NF states that:

All attributes are not only fully dependent on the primary key but also independent of each other
This means that there are no transitive dependencies

In the example carried forward so far:

There are no transitive dependencies as attributes
Attributes, Title, AuthorID, PublisherID, and YearPublished, are all directly dependent on BookID

Normalization Example 2 - Fails

In this second example, we look at a different database that comes with certain normalization issues in order to illustrate what these issues might look like
We’ll look at a single table called Orders

Attributes

OrderID (Primary Key)
CustomerName
ProductID
ProductName
Quantity
TotalPrice
OrderDate

Design and Normalization Issues:

CREATE TABLE Orders (
    OrderID INT PRIMARY KEY,
-- Multiple customer names in a single field
    CustomerName VARCHAR(255), 
-- Multiple product IDs in a single field
    ProductID VARCHAR(255),
-- Multiple product names in a single field   	
    ProductName VARCHAR(255),
    Quantity INT,
    TotalPrice DECIMAL,
    OrderDate DATE
);

First Normal Form (1NF)

The Orders table has multiple values in the CustomerName field
- Example - John Doe, Jane Smith
- A customer name is not atomic, since it refers to a combination of the first, middle and last names
ProductID and ProductName are not atomic if multiple products can be in a single order

Both of these observations violate the 1NF rule of atomicity

Second Normal Form (2NF)

Assuming we fix the 1NF issue by separating products into their own table:

There is a partial dependency if we have combined keys (e.g., OrderID + ProductID)
If TotalPrice is dependent on ProductID and Quantity, it should be in a separate table to ensure full functional dependency

CREATE TABLE Orders (
    OrderID INT,
    ProductID INT,
    Quantity INT,
    TotalPrice DECIMAL,
    OrderDate DATE,
    PRIMARY KEY (OrderID, ProductID)
);

CREATE TABLE Customers (
    OrderID INT,
    CustomerName VARCHAR(255)
);

Third Normal Form (3NF)

Assuming we fix the 2NF issue by properly linking Orders with Customers:

If ProductName is included in Orders, it creates a transitive dependency (ProductID -> ProductName)

CREATE TABLE Orders (
    OrderID INT PRIMARY KEY,
    CustomerID INT,
    OrderDate DATE
);

CREATE TABLE OrderDetails (
    OrderID INT,
    ProductID INT,
    Quantity INT,
    TotalPrice DECIMAL,
    PRIMARY KEY (OrderID, ProductID)
);

CREATE TABLE Customers (
    CustomerID INT PRIMARY KEY,
    CustomerName VARCHAR(255)
);

CREATE TABLE Products (
    ProductID INT PRIMARY KEY,
    ProductName VARCHAR(255)
);

Summary of Normalization Examples

The first design example meets the first four normal forms, ensuring data integrity and reducing redundancy.
The second design example demonstrates common issues at each normalization step, highlighting partial dependencies, transitive dependencies, and other violations.

Indexing

Indexing is a database optimization technique
Indexing is a technique that improves the speed of data retrieval operations on a database table
As you know in software design, nothing is free - no decision comes without a trade-off:
- In order to get better speed, we must sacrifice (1) storage size and (2) maintenance work
- By getting better speed, we increase the cost of additional storage and maintenance overhead

Types of Indexes

Primary Index
- Automatically created by the database system when a primary key is defined
- A primary index uniquely identifies each record in the table, using the primary key
Secondary Index
- Created manually and on non-primary key columns
- Its purpose is to improve query performance
- Secondary indices can be unique or non-unique
Clustered Index
- Determines the physical order of data in the table
- There can be only one clustered index per table
Non-clustered Index
- Does not alter the physical order of the table
- These can be created on any column, allowing multiple non-clustered indexes per table

Index Structures

Indices are represented in software as different types of data structures. Given that indices are used in order to achieve performance / speed-related benefits, unique data structures must be used to implement these benefits.

The most prevalent data structures used to represent index structures are outlined below:

B-tree Indexes
- A type of Tree data structure
- These are balanced tree structures that maintain sorted data and allow searches, insertions, deletions, and sequential access
Hash Indexes
- Hash indices are a data structure that utilize a hash function to map search keys to corresponding records, providing fast data retrieval for exact match queries
Bitmap Indexes
- Bitmap indices are a data structure that utilize bitmaps
- These are efficient for columns with a limited number of distinct values, often used in data warehousing

Pros and Cons of Indexing

Pros

Improved Query Performance - Faster retrieval of records, especially for large datasets
Efficient Sorting - Speeds up ORDER BY operations
Reduced I/O Operations - Decreases the number of disk accesses required for query execution

Cons

Storage Overhead - Additional space is required to store indexes
Maintenance Cost - Indexes need to be updated when data is inserted, updated, or deleted, which can impact performance

How Graph Database Indexing Works in NebulaGraph

Design Steps

In our final section of this guide, we’ll look at a concrete example of taking a database design from a purely conceptual idea all the way through until the end, where we have a functioning database. The steps that we’ll go through are the first three steps of the six steps outlined below. Given that the later three steps are advanced topics, these will be reserved for the more advanced and technical blog posts in database theory and design.

Steps

Requirement analysis
- Understand what data is to be stored
- What apps must be built on top
- What operations are most frequent
Conceptual database design
- Develop a high-level description of the data to be stored in the database and its constraints
- This step involves the ER model (semantic model)
Logical database design
- Convert the conceptual database design into a database schema in the data model of the DBMS
- Converting the ER schema into a relational database schema
Schema refinement
- Analyze the collection of relations in the relational database schema to identify problems and refine it
- Theory of normalization - relations-restructuring
Physical database design
- Performance criteria and requirements
- Clustering of tables and creation of views
Application and security design

Design Example

Designing a Library Database

Let's apply what we've learned to design a simple library database

Step 1 - Conceptual Model (ER Diagram)

The first step in designing a database is to create a conceptual model using an ER diagram
This diagram visually represents the entities and relationships of a database

Entities and Relationships

Books
- Represents the collection of books in the library
- Attributes: book_id, title, author, published_year
Borrower
- Represents the member who borrow books from the library
- Attributes: member_id, name, email
Loans
- Represents the records of books being borrowed and returned
- Attributes: loan_id, book_id, member_id, loan_date, return_date

Relationships

Books and Loans
- A book can be loaned to multiple members over time
- A loan record refers to one book
Members and Loans
- A member can borrow multiple books over time
- A loan record refers to one member

In the ER diagram:

Books and Members have a many-to-many relationship through Loans.
This many-to-many relationship is resolved by creating the Loans entity, which acts as a junction table.

Step 2 - Logical Schema (SQL)

After defining the conceptual model, the next step is to translate it into a logical schema using SQL
The logical schema defines the structure of the database in terms of tables, columns, data types, and relationships

SQL Commands

The books table stores information about each book in the library
The members table stores information about each library member
The loans table records each instance of a book being borrowed by a member
- Includes the loan and return dates
Foreign keys (book_id, member_id) in the loans table create relationships with the books and members tables

CREATE TABLE books (
    book_id INT PRIMARY KEY,
    title VARCHAR(100) NOT NULL,
    author VARCHAR(50) NOT NULL,
    published_year INT
);

CREATE TABLE members (
    member_id INT PRIMARY KEY,
    name VARCHAR(50) NOT NULL,
    email VARCHAR(100) UNIQUE
);

CREATE TABLE loans (
    loan_id INT PRIMARY KEY,
    book_id INT,
    member_id INT,
    loan_date DATE,
    return_date DATE,
    FOREIGN KEY (book_id) REFERENCES books (book_id),
    FOREIGN KEY (member_id) REFERENCES members (member_id)
);

Step 3 - Physical Design Considerations

Physical design involves optimizing the database for performance, storage, and scalability
Key considerations include indexing, data types, and constraints

Indexing

Indexes on Foreign Keys:

Adding indexes on book_id and member_id columns in the loans table can significantly speed up queries that join the loans table with the books and members tables

CREATE INDEX idx_book_id ON loans(book_id); 
CREATE INDEX idx_member_id ON loans(member_id);

Data Types and Constraints

Choose appropriate data types for each column to ensure efficient storage and retrieval
For example
- VARCHAR for strings
- INT for integers
Use constraints to enforce data integrity
- NOT NULL to ensure that critical fields (e.g., title, name) are always populated
- UNIQUE to ensure that email addresses are unique across members

CREATE TABLE books (
    book_id INT PRIMARY KEY,
    title VARCHAR(100) NOT NULL,
    author VARCHAR(50) NOT NULL,
    published_year INT CHECK (published_year > 0)
);

CREATE TABLE members (
    member_id INT PRIMARY KEY,
    name VARCHAR(50) NOT NULL,
    email VARCHAR(100) UNIQUE NOT NULL
);

CREATE TABLE loans (
    loan_id INT PRIMARY KEY,
    book_id INT,
    member_id INT,
    loan_date DATE NOT NULL,
    return_date DATE,
    FOREIGN KEY (book_id) REFERENCES books (book_id),
    FOREIGN KEY (member_id) REFERENCES members (member_id)
);

CREATE INDEX idx_book_id ON loans(book_id);
CREATE INDEX idx_member_id ON loans(member_id);

Best Practices in Database Design

In this final section, we’ll quickly analyze three core components of best practices related to database design:

Data Integrity
Performance Optimization
Scalability and Flexibility

Data Integrity

Data integrity guarantees the accuracy, consistency, and reliability of data stored in a database

Here are some best practices:

Use Constraints
- Example SQL constraints - NOT NULL, UNIQUE, and FOREIGN KEY
- Employ constraints to enforce rules on data manipulation
- NOT NULL ensures essential fields are always populated
- UNIQUE prevents duplicate entries
- FOREIGN KEY maintains referential integrity between related tables
Validation Rules
- Implement validation rules at the database level to ensure that only valid data is stored
- This reduces the risk of errors and inconsistencies

Performance Optimization

Optimizing database performance involves designing and tuning the database to ensure efficient query execution and data retrieval. Key practices include:

Indexing
- Design indexes carefully based on query patterns and workload
- Indexes help speed up data retrieval operations by allowing the database engine to quickly locate relevant rows
- Identify frequently queried columns and create indexes on them
- Consider composite indexes for queries that involve multiple columns
- Regularly monitor and optimize indexes to ensure they continue to provide optimal performance as data volumes and query patterns change
Query Optimization
- Write efficient SQL queries that minimize resource consumption and maximize throughput
- Use query execution plans and profiling tools to identify and resolve performance bottlenecks

Scalability and Flexibility

Normalization
- Apply normalization techniques to reduce redundancy and improve data integrity
- Normalized databases are typically easier to scale and modify
Partitioning
- Partition large tables into smaller, more manageable segments based on criteria such as range, list, or hash - this can improve query performance and facilitate data distribution on devices/regions
Replication
- Implement database replication to create redundant copies of data across multiple servers
- Replication enhances availability and fault tolerance while also supporting read scaling by offloading read operations to replica servers
Flexible Schema Design
- Use a flexible schema design that accommodates evolving application requirements
- JSON or XML can provide flexibility in storing diverse data formats

Example

-- Example of creating a partitioned table
CREATE TABLE sales (
    sale_id INT PRIMARY KEY,
    sale_date DATE,
    amount DECIMAL(10, 2),
    ...
)

PARTITION BY RANGE (YEAR(sale_date)) (
    PARTITION p1 VALUES LESS THAN (2020),
    PARTITION p2 VALUES LESS THAN (2021),
    PARTITION p3 VALUES LESS THAN (2022),
    PARTITION p4 VALUES LESS THAN MAXVALUE
);

Next Steps

Databases

Deep Dive into Database Design - Part 1

Eduardo Garza

June 15, 2023

This is an extended guide on the theoretical and academic approach to database design. While databases have had an explosion of adoption in the modern world and their implementation transmitted down to software developers building sophisticated web applications, mobile applications, APIs, and other systems requiring storage capabilities, databases had h…

Read full story

Databases

Deep Dive into Database Design - Part 2

Eduardo Garza

June 15, 2023

This is the second part of a two-part guide on database design. Recall that this guide is split into two sections: Part 1 covers database systems, database design and the relational model. Examples are provided to explore the different aspects of these topics in detail.

Read full story

Databases

Modern Database Options

Eduardo Garza

June 23, 2023

Alright, let's talk about modern databases. The landscape of data storage and management has evolved a lot, and if you're still sticking to just SQL, you're missing out. This post will break down some of the latest technologies, why they matter, and how to use them. We'll cover SQL databases, NoSQL options, and some other cool stuff that's making waves.

Read full story