Deep Dive into Database Design - Part 1

Mastering Advanced Concepts and Best Practices in Database Architecture

Jun 15, 2023

This is an extended guide on the theoretical and academic approach to database design. While databases have had an explosion of adoption in the modern world and their implementation transmitted down to software developers building sophisticated web applications, mobile applications, APIs, and other systems requiring storage capabilities, databases had humble beginnings in formal academic institutions that defined their overall design as a rigorously mathematical language. Most developers are only familiar with the database interaction languages and tools that they touch every day, like MySQL and Postgres (among many others), but what most are less familiar with is how to properly implement database design in a way that is technically sound and logically correct in an intention to maximally reduce or eliminate redundancy, duplication and out-of-sync commonalities.

This is an extensive guide that looks at database design starting from an academic approach and then transitions to practical examples using relational databases. I approach this subject very technically, defining a clear strategy for arriving at a database design implementation, but only after we have completed steps like requirements planning, ER diagrams, schema development and schema normalization.

This guide is split into two sections:

Part 1 covers database systems, database design and the relational model. Examples are provided to explore the different aspects of these topics in detail.
Part 2 covers the database design refinement process. Once we’ve achieved an initial design, the objective becomes to maintain data integrity and constraints while maximally eliminating redundancy.

Part 1 (This Post) - Initial Database Design
- Introduction to database systems
- Database design
- The relational model
Part 2 (Next Post) - Normalization and Refined Database Design
- Schema normalization
- Schema refinement
- Functional dependencies
- Normal forms
- Going Deeper - BCNF
- Going Deeper - 3NF
- Properties of decompositions
- Normalization
- Synthesis

Section 1 - Intro to database systems

We’re going to begin by defining a set of terms:

A database is a collection of data
Entities are the main objects in a database that typically correspond to the core tables
- Examples of entities include … students, faculty, courses, classrooms
Relationships are the “links” (associations) between entities
- An example of a relationship is a student object enrolling into a course object
- Relationships are also commonly their own tables (initially, at least) although during the schema normalization and refinement process, they are often combined and merged with entity tables to avoid the duplication of information in our storage system

File systems vs DBMS

A database manage system (DBMS) is just an application software available on computers
In the earliest of times in the evolution of computers, databases were initially file systems - you would utilize files to store data. This is in fact how a lot of non-technical users continue to save information, like passwords or personal notes and it works in a very friendly form.
The difference between databases and file systems is that databases are a specialized type of storage solution that focuses explicitly on being hyper-optimized to the storage and retrieval of data.
In contrast, file systems focus more on their functionality related to user experience and interactivity

To further define some terms:

A database language, like MySQL, is simply that syntax that is used to communicate between some application and the underlying database system.
A database system is just a live server, running in the background on your computer or a cloud server, that takes SQL commands as inputs and returns structured data as output
Since a database system is a background process on computers, the most direct interaction a developer was with it is through the command line, where the database system comes with a set of commands/functions to be able to interact with the underlying database system, databases and stored data
With that in mind, the purpose of a DBMS is to provide a GUI and extended functionality to make the interaction with the underlying database system simpler and more efficient - after all, remembering and writing CLI commands can be quite cumbersome especially on repeatable tasks

Advantages of DBMS

Data independence
- DBMS provides an abstraction view to hide the data representation from storage
Efficient data access
- Techniques used for faster data access
Integrity and security
- With integrity constraints, it acts like a mathematical model that is proof-safe
Concurrent access and crash recovery
- A DBMS schedules concurrent access

Describing a DBMS

A DBMS allows users to define the data to be stored in terms of a data model
A data model just refers to how data is represented conceptually or practically
- As you will see, there are different types of data models
- Some data models act as an intermediary way to represent a database during the design process - such as a semantic data model
- Other data models act as the final way a database is represented by the actual database system - such as a relational database model, which apply to all relational database systems
A semantic data model is a more abstract, high-level data model that makes it easier for a user to come up with an initial description of the data
- A widely used semantic data model is the ER model / ER diagram
A database design starts with a semantic model and moves towards a relational model (one that can be implemented on the DBMS)

Example of a data model for a university database:

Steps

Semantic data model - Develop an ER diagram
- The ER diagram serves as the conceptual/semantic data model
- This is an initial design of our database to identify all the entities and relationships
Relational model - SQL tables
- The relational model serves as an iterative representation of the final database representation
- The relational model has the final form in schema format

The ER Diagram:

The Relational Model:

The Relational model

The description of data in terms of a data model is called a schema

The schema for a relation specifies

The name of the relation
The name of each field / attribute / column
The data type of each field (string, integer, boolean, etc)

After a schema is defined, we need to define rules that apply to what is allowed and not allowed for this data:

The rules are known as constraints, or integrity constraints
Integrity constraints specify the conditions for which the records in a relation are valid or invalid

Abstraction in a DBMS

Returning now to our database definition - we will define three stages in our database design process:

The conceptual design
- Focuses on the high level structure of the database
- This is an intermediary representation, completely independent of the actual database system
- The objective is to capture the complete set of requirements
The physical design
- This involves the actual implementation of the database
- We translate the high-level design into specific technical designs - for we store/retrieve data
- This is more related to the efficiency considerations related to storing/retrieval
The external design
- This focuses on how the database is interacted with from the user’s perspective
- This refers more to the definition of views for our development team or APIs and their relation actions for the applications that we develop for end-users

The database description consists of a schema at each of these three levels of abstraction.

A data definition language (DDL) is used to define the external and conceptual schemas.

1 | Conceptual schema

Also known as the logical schema
Describes stored items in terms of the data model of the DBMS
Describes entities and relationships

The objective is to capture the overall requirements through the following:

Entity-Relationship Model (ER Model)
- The most common tool used in conceptual design is the ER diagram
- This is a visual representation of the entities (objects), their attributes, and relationships between them
Entities and Attributes
- These are the main objects (entities) in the database and their properties (attributes)
- Properties are anything that describe the entity or relationship object
Relationships
- Defines how entities are related to one another

Example

Students (
  sid: string, 
  name: string, 
  login: string, 
  age: integer, 
  gpa: real
) 
Faculty (
  fid: string, 
  fname: string, 
  sal: real
) 
Courses ( 
  cid: string, 
  cname: string, 
  credits: integer
) 
Rooms (
  nw: integer, 
  address: string, 
  capacity: integer
) 
Enrolled (
  sid: string, 
  cid: string, 
  grade: string
) 
Teaches (
  fid: string, 
  cid: string
) 
Meets_In ( 
  cid: string, 
  rno: integer, 
  time: string
)

2 | Physical schema

Describes additional storage details
Summarizes how the relations in the conceptual schema are actually stored on secondary storage devices
- Storage devices are most often disks and tapes

Considerations

Storage Structures
- Deciding on how data will be stored in the database
- This includes data files, indexes, and storage allocation
Indexes and Access Methods
- Creating indexes to improve query performance and defining how data will be accessed
Data Partitioning and Clustering
- Organizing data in a way that optimizes performance and storage use
Performance Tuning
- Ensuring the database meets performance requirements by adjusting parameters and structures

3 | External schema

Allow data access to be customized and authorized at the level of users or groups
There can be only 1 conceptual and 1 physical schema, but there can be multiple external schemas
Consists of a collection of 1 more views and relations from the conceptual schema

Section 2 - Database design

The entity-relationship (ER) data model allows us to describe the data involved in terms of objects and relationships

Used to develop an initial database design
The ER model is used in a phase called conceptual database design

Steps in Database Design

Requirement analysis
- Understand what data is to be stored
- What apps must be built on top
- What operations are most frequent
Conceptual database design
- Develop a high-level description of the data to be stored in the database and its constraints
- This step involves the ER model (semantic model)
Logical database design
- Convert the conceptual database design into a database schema in the data model of the DBMS
- Converting the ER schema into a relational database schema
Schema refinement
- Analyze the collection of relations in the relational database schema to identify problems and refine it
- Theory of normalization - relations-restructuring
Physical database design
- Performance criteria and requirements
- Clustering of tables and creation of views
Application and security design

Key constraints

Key constraints are defined on associations between entities and relationships:

Participation:
- TOTAL - This means that every record must have participation in the relationship table
- PARTIAL - This means that some records in either entity may not exist in the relationship table
Relationships
- Many to Many
- 1 to Many
- Many to 1
- 1 to 1

Weak Entities

Some data item that would not make sense to keep if some other item didn’t exist
Dependency of data upon a strong entity
The owner entity set and weak entity set must participate in a 1 to many relationship set
The weak entity must have TOTAL participation

Class Hierarchies

Entity sets can be set as super and subclasses
Attributes of a subclass are inherited from the superclass
The symbol is a triangle with the “IS A” label

Two types of constraints for class hierarchies:

Overlap: OVERLAP / NO-OVERLAP
- Whether two subclasses are allowed to contain the same entity
- The default is NO-OVERLAP
Covering: COVER / NONE
- Whether the entities in the subclasses collectively include all entities in the superclass
- Do all instances defined in the subclasses make up the entire set of the entities in the superclass?

Example of OVERLAP constraints

Employee superclass
Hourly_Employees and Contract_Employees are the subclasses
- Could X be both of these subclasses? No
Hourly_Emplyees and Senior_Employees are the subclasses
- Could X be both of these subclasses? Yes
- Denoted as “Contract_Employees OVERLAPS Senior_Employees”

Example of COVERING constraints

Motor_Vehicles superclass
Motorboats and Cars subclasses
Motorboats and Cars COVER Motor_Vehicles
- Any entity in the superclass must be at least one entity in the subclasses

Aggregation

A relationship between a collection of entities and relationships
Aggregation allows that a relationship set can participate in another relationship set (with an entity)
A DASHED BOX is drawn around the relationship set that acts like an entity

Section 3 - The relational model

The relational model is simple and elegant:

A database is a collection of one or more relations
Each relation is a table with rows and columns

The data definition language (DDL) is a subset of features inside SQL that is standard for creating, manipulating, and querying data in a relational DBMS.

An important component of a data model is the set of constructs it provides for specifying conditions that must be satisfied by the data - known as integrity constraints

This is the process of converting an ER diagram into a relational database schema.

Relational model

A relation consists of a relation schema and a relation instance
The schema describes the relations name, the name of each field/attribute and the domain of each field.
The relation instance is a table
The relation schema describes the column heads for the table

Example

Students (sid: string, name: string, login: string, age: integer, gpa: real)

An instance of a relation is a set of rows/tuples/records.

A relational database is a collection of relations with distinct relation names.

The relational database schema is the collection of schemas for the relations in the database.

Integrity constraints

An integrity constraint is a condition specified on a database schema
It restricts data that can be stored in the instance
If the database instance satisfied all the ICs, it is a legal instance
A DBMS enforces ICs and only permits legal instances

Types of constraints

Key constraints
Foreign key constraints
General constraints

1 | Key constraints

A key constraint is a statement that a certain minimal subset of the fields of a relation is a unique identifier
A set of fields that uniquely identifies a tuple to a key constraint is called a candidate key (or just key)

Candidate key

Two distinct rows/tuples in a legal instance cannot have identical values in all the fields of a key
No subset of the set of fields in a key is a unique identifier for a tuple (thus, it’s minimal)

Out of all available candidate keys, one is identified as the primary key.

In SQL

… UNIQUE (name, age) …
… CONSTRAINT StudentsKey PRIMARY KEY (sid) …

A key is created using the UNIQUE constraint
A primary key is declared using the PRIMARY KEY constraint
CONSTRAINT constraint_name is used to name a constraint
- When the constraint is violated, the constraint_name is returned to identify the error

CREATE TABLE Students(
  sid CHAR(20) ,
  name CHAR (30) ,
  login CHAR(20) ,
  age INTEGER,
  gpa REAL,
  UNIQUE (name, age),
  CONSTRAINT StudentsKey PRIMARY KEY (sid)
)

2 | Foreign key constraints

With foreign key constraints, deletes that eliminate the primary key originator tuple will violate the foreign key constraint of another, and therefore it should be rejected
These constraints are specified with ON DELETE and ON UPDATE

Example

# Enrolled(studid: string, cid: string, gTade: string) 

CREATE TABLE Enrolled(
  studid CHAR(20) ,
  cid CHAR(20),
  grade CHAR(10),
  PRIMARY KEY (studid, cid),
  FOREIGN KEY (studid) REFERENCES Students
)

3 | General constraints

The main three constraints are:

Domain
Primary key
Foreign key

Additional constraints can be added using assertions or code outside SQL.

Enforcing ICs

For domain / PRIMARY KEY / UNIQUE constraints:

If an INSERT/DELETE/UPDATE causes a violation, it is rejected.

Every potential violation is checked at the end of each SQL statement execution, although it can be deferred to the end of a transaction execution

For foreign key constraints:

We discuss referential integrity enforcement steps taken by the DBMS
INSERT does not cause any problems
UPDATE and DELETE can cause violations of referential integrity

Suppose there is a Students table with a primary key and an Enrolled table that has that key as a foreign key.

Three basic questions:

What should we do if an Enrolled row is inserted with a student_id that does not appear in the Students table?
- In this case, the INSERT is rejected
Question 2 - What do we do if a Students row is deleted?
Question 3 - What do we do if the PK value of a Students row is updated?
- The options for these are the same as for Q2

Three options for Question 2 / 3:

Delete all the rows in the Enrolled table that refer to the deleted Students row.
Disallow (reject) the deletion of the Students row if an Enrolled row refers to it.
Set the student_id column to the some default value for every Enrolled row that refers to the deleted row in the Students table
For every row in Enrolled that refers to it, set the student_id column to null
- However, if Enrolled has a PK that includes student_id (foreign key), this can become a violation and therefore this fourth option is not always available

SQL allows us to choose any of the four options on the DELETE and UPDATE commands.

CREATE TABLE Enrolled(
  studid CHAR(20) ,
  cid CHAR(20) ,
  grade CHAR(10),
  PRIMARY KEY (studid, dd),
  FOREIGN KEY (studid) REFERENCES Students 
    ON DELETE CASCADE 
    ON UPDATE NO ACTION
)

What the table above states is that:

When the student_id in Students is deleted, you should also delete the Enrolled record that refers to that id.
When the student_id in Students is updated, no change should be made in the Enrolled record
ON DELETE / ON UPDATE can be specified for each foreign key
The default option is NO ACTION (on, both, DELETE / UPDATE) - this means that the action is to be rejected
- “ON UPDATE NO ACTION” can be omitted since it is using the default
- If ON UPDATE used CASCADE, this means that the student_id in Enrolled would be updated for all items that refer to the PK student_id that was changed
Another option is “ON DELETE SET DEFAULT”
- The default would be specified in the table creation query (beow)
Another option is “ON DELETE SET NULL”
- But in this case, the foreign key property cannot be part of the PK of this table

CREATE TABLE Students … sid CHAR(20) DEFAULT “53666”

In this circumstance, a DEFAULT PK does not make sense, so it would make more sense to CASCADE

Transactions and Constraints

A program running against a database is called a transaction.

It can contain several statements - queries, inserts, updates
By default, if any violation occurs during the statement, every action within the statement is rolled back

Example ‘Transaction’:

Consider the following: Students and Courses are two tables that require each other as foreign key attributes. But the creation (INSERT) of one is violated since neither can be created without the representation of the other.

In this case, you need to defer the checking of the constraint and perform it as a transaction

CREATE TABLE Students (
  sid CHAR(20) ,
  name CHAR(30),
  login CHAR (20) ,
  age INTEGER,
  honors CHAR(10) NOT NULL,
  gpa REAL
  PRIMARY KEY (sid),
  FOREIGN KEY (honors) REFERENCES Courses (cid)
)

CREATE TABLE Courses (
  cid CHAR(10),
  cname CHAR (10) ,
  credits INTEGER,
  grader CHAR(20) NOT NULL,
  PRIMARY KEY (dd)
  FOREIGN KEY (grader) REFERENCES Students (sid)
)

At the end of an INSERT statement, add the following:

INSERT … SET CONSTRAINT ConstraintFoo DEFERRED;

SQL allows a constraint to be in DEFERRED or IMMEDIATE mode:

A constraint in deferred mode is checked at commit time (default)

Querying Data

SELECT …
FROM …
WHERE …

Example

SELECT S.name, E.cid 
FROM Students S, Enrolled E 
WHERE S.sid = E.studid AND E.grade = 'A'

Logical database design

Local database design is the design that transitions from an ER design to relational design:

Entities

Each entity set becomes a table
Each attribute in an entity set becomes an attribute of the table

Relationships

A relationship set, like the entity set, is mapped to a relation in a relational model

Attributes include:

The PK attributes of each participating entity as FK fields
The descriptive attributes of the relationship set

The collection of non-descriptive attributes is a superkey

If there are no key constraints (like 1-to-many), this set is a candidate key

Example

CREATE TABLE Reports_To (
    supervisor_ssn CHAR(11),
    subordinate_ssn CHAR(11),
    PRIMARY KEY (
        supervisor_ssn,
        subordinate_ssn
    ),
    FOREIGN KEY (supervisor_ssn) REFERENCES Employees (ssn),
    FOREIGN KEY (subordinate_ssn) REFERENCES Employees (ssn)
)

Relationships with Key Constraints

CREATE TABLE Manages (
    ssn CHAR(11),
    did INTEGER,
    since DATE,
    PRIMARY KEY (did),
    FOREIGN KEY (ssn) REFERENCES Employees,
    FOREIGN KEY (did) REFERENCES Departments
)

Since each Department has one unique Employee, the PK of Manages is set as the employee_id (ssn).

Given that department_id (did) is the PK of Manages, the Departments table could potentially consume the entire Manages table and they could be combined together into one table.
This approach eliminates the need for a separate Manages relation
The drawback is the space wasted if several departments have no managers, since the ssn (employee_id) would be NULL for a lot of entries inside Departments_Manages if there are many departments with no managers

CREATE TABLE DepLMgr (
    did INTEGER,
    dname CHAR(20),
    budget REAL,
    ssn CHAR(11),
    since DATE,
    PRIMARY KEY (did),
    FOREIGN KEY (ssn) REFERENCES Employees
)

In this modified table, ssn (employee_id) can be NULL - which indicates departments with no managers.

Relationships with Participation Constraints

Suppose that every requirement must have one manager (TOTAL participation) of the Departments in the Manages table.

CREATE TABLE DepLMgr (
    did INTEGER,
    dname CHAR(20),
    budget REAL,
    ssn CHAR(11) NOT NULL,
    since DATE,
    PRIMARY KEY (did),
    FOREIGN KEY (ssn) REFERENCES Employees ON DELETE NO ACTION
)

DepLMgr states that the ssn (employee_id) field cannot be NULL. Therefore each record in this table must reference a record from the Employees table.

“NO ACTION” specifies that Employee records cannot be deleted while they are referenced by this table.

They are rejected

Participation constraints (TOTAL / PARTIAL) are captured using the “NOT NULL” statement at a table creation.

Constraints that require assertions: Having each Employees entity manage at least one department

Weak Entity Sets

Always a 1 to many relationship
It must have a key constraint and TOTAL participation
The weak entity only has a partial key
When the owner entity is deleted, all the weak entity references must be also deleted

CREATE TABLE Dep_Policy (
    pname CHAR(20),
    age INTEGER,
    cost REAL,
    ssn CHAR(11),
    PRIMARY KEY (pname, ssn),
    FOREIGN KEY (ssn) REFERENCES Employees ON DELETE CASCADE
)

The simple difference is the “ON DELETE CASCADE” to entries in Employees.

Class Hierarchies

There are two approaches to handling hierarchies:

Each each entity set to a relation for both the superclass and the subclasses
- Example - Creating relations for Employees, Hourly_Employees, and ContractEmployees
- A constraint to consider is that if the superclass is deleted, the delete must be cascaded to the subclass
Just create relations for the subclasses, which will contain all the fields, including those of the superclass

Approach 1 is general and always applicable.

The drawback is the requirement of JOINs when you are examining multiple sub-tables

Approach 2 requires full COVERING constraints in the subclasses, since the superclass is not represented alone.

Also, is there are OVERLAP constraints, you are storing a lot of potential duplicate data
Queries must examine two different queries to search for items too

Views

A view is a table whose rows are not explicitly stored in the database
Data is computed as needed from a view definition

CREATE VIEW B - Students (name, sid, course) AS
SELECT S.sname, S.sid, E.cid
FROM Students S, Enrolled E
WHERE
    S.sid = E.studid
    AND E.grade = 'B';

Then the view can be used as a base table.

We can DELETE/UPDATE items from Students directly through B-Students
- If a change to Students affects GoodStudents, by default SQL will allow the update of GoodStudents - but using “WITH CHECK OPTION”, you can prevent changes that will affect the view table
We can insert a GoodStudents row by inserting a row into Students and using NULL for the values of Students that do not appear in GoodStudents

Eduardo Garza

Deep Dive into Database Design - Part 1

Mastering Advanced Concepts and Best Practices in Database Architecture

Contents

Section 1 - Intro to database systems

File systems vs DBMS

Advantages of DBMS

Describing a DBMS

Steps

The Relational model

Abstraction in a DBMS

1 | Conceptual schema

Example

2 | Physical schema

Considerations

3 | External schema

Section 2 - Database design

Steps in Database Design

Key constraints

Weak Entities

Class Hierarchies

Aggregation

Section 3 - The relational model

Relational model

Example

Integrity constraints

Types of constraints

1 | Key constraints

In SQL

2 | Foreign key constraints

Example

3 | General constraints

Enforcing ICs

Transactions and Constraints

Example ‘Transaction’:

Querying Data

Example

Logical database design

Entities

Relationships

Example

Relationships with Key Constraints

Relationships with Participation Constraints

Weak Entity Sets

Class Hierarchies

Views