Section 4: Normalization | CISP 350 Evrist

Normalization is a systematic design discipline that restructures relational databases so each fact is stored exactly once. It reduces data redundancy, improves integrity, and simplifies updates by organizing information into smaller, related tables linked through keys. The process proceeds through successive normal forms (1NF, 2NF, 3NF, and optionally BCNF or higher), each eliminating a specific type of dependency anomaly.

§1 Why Normalize? #

The benefits of normalization fall into three closely linked areas:

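Reducing Data Redundancy: Store each fact once instead of duplicating it across rows. This saves storage space and prevents copies from drifting out of sync. Smaller tables can also speed up some operations, although queries that need joins may become slower (see §3).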
Improving Data Integrity: By minimizing redundancy, normalization also reduces the risk of data inconsistencies. When data is updated, it only needs to be updated in a single place, ensuring data integrity.
Simplifying Data Modification: A normalized database is easier to maintain. Insertion, update, and deletion of data can be done with less risk of creating anomalies.

§1.1 Example of Unnormalized vs. Normalized Data #

Consider a simple online store’s order tracking system. An unnormalized approach might store all information in a single table:

Unnormalized Orders Table:

OrderID	CustomerID	CustomerName	CustomerEmail	ProductID	ProductName	Quantity	Price
101	1	Alice	alice@email.com	P1	Keyboard	1	75.00
101	1	Alice	alice@email.com	P2	Mouse	1	25.00
102	2	Bob	bob@email.com	P1	Keyboard	2	75.00

Notice the repeated CustomerName, CustomerEmail, and ProductName. This leads to problems:

Update Anomaly: If Alice changes her email, it must be updated in multiple rows.
Insertion Anomaly: We can’t add a new product until a customer orders it.
Deletion Anomaly: If we delete Bob’s only order, we lose the information that he is a customer.
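The update and deletion anomalies above can be reproduced in a few lines. This is a minimal sketch using Python's built-in sqlite3 module; the table name OrdersFlat and the replacement address alice@new.example are assumptions made for illustration, not part of the original example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE OrdersFlat (
        OrderID INTEGER, CustomerID INTEGER, CustomerName TEXT,
        CustomerEmail TEXT, ProductID TEXT, ProductName TEXT,
        Quantity INTEGER, Price REAL
    )
""")
conn.executemany(
    "INSERT INTO OrdersFlat VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (101, 1, "Alice", "alice@email.com", "P1", "Keyboard", 1, 75.00),
        (101, 1, "Alice", "alice@email.com", "P2", "Mouse", 1, 25.00),
        (102, 2, "Bob", "bob@email.com", "P1", "Keyboard", 2, 75.00),
    ],
)

# Update anomaly: a careless update touches only one of Alice's two rows.
conn.execute(
    "UPDATE OrdersFlat SET CustomerEmail = ? "
    "WHERE CustomerID = 1 AND ProductID = 'P1'",
    ("alice@new.example",),  # hypothetical new address
)
alice_emails = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT CustomerEmail FROM OrdersFlat "
        "WHERE CustomerID = 1 ORDER BY CustomerEmail"
    )
]

# Deletion anomaly: removing Bob's only order removes every trace of Bob.
conn.execute("DELETE FROM OrdersFlat WHERE OrderID = 102")
bob_rows = conn.execute(
    "SELECT COUNT(*) FROM OrdersFlat WHERE CustomerName = 'Bob'"
).fetchone()[0]

print(alice_emails)  # the same customer now has two different emails
print(bob_rows)      # no rows mention Bob any more
```

The insertion anomaly is the mirror image: there is no sensible row to insert for a product nobody has ordered yet, because the order and customer columns would have to be left empty.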

A normalized design would split this into three tables:

Customers Table:

CustomerID	CustomerName	CustomerEmail
1	Alice	alice@email.com
2	Bob	bob@email.com

Products Table:

ProductID	ProductName	Price
P1	Keyboard	75.00
P2	Mouse	25.00

Orders Table:

OrderID	CustomerID	ProductID	Quantity
101	1	P1	1
101	1	P2	1
102	2	P1	2

ERD:

erDiagram
    Customers {
        int CustomerID PK
        string CustomerName
        string CustomerEmail
    }
    Products {
        string ProductID PK
        string ProductName
        decimal Price
    }
    Orders {
        int OrderID PK
        int CustomerID FK
        string ProductID PK, FK
        int Quantity
    }
    Customers ||--o{ Orders : "places"
    Products ||--o{ Orders : "in"

Because order 101 spans two rows, OrderID alone cannot identify a row; the key of the Orders table is the pair (OrderID, ProductID).

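Now each customer fact and each product fact is stored only once. If Alice’s email changes, it’s a single update in the Customers table. We can add new products to the Products table without an order, and customers exist independently of their orders. The Orders table still repeats CustomerID on every line of order 101; moving CustomerID into a separate order-header table keyed by OrderID would remove that as well, following the partial-dependency rule in §2.2. Even so, this design is already easier to maintain and harder to make inconsistent than the single table.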
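The three tables can be written as a schema and exercised directly. The sketch below uses Python's sqlite3 module and makes a few assumptions: the Orders key is taken to be (OrderID, ProductID) because order 101 appears in two rows, the column types are guesses, and the product P3 and the address alice@new.example are hypothetical values added only to show the behavior.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks FKs only when enabled
conn.executescript("""
    CREATE TABLE Customers (
        CustomerID    INTEGER PRIMARY KEY,
        CustomerName  TEXT NOT NULL,
        CustomerEmail TEXT NOT NULL
    );
    CREATE TABLE Products (
        ProductID   TEXT PRIMARY KEY,
        ProductName TEXT NOT NULL,
        Price       REAL NOT NULL
    );
    CREATE TABLE Orders (
        OrderID    INTEGER NOT NULL,
        CustomerID INTEGER NOT NULL REFERENCES Customers(CustomerID),
        ProductID  TEXT    NOT NULL REFERENCES Products(ProductID),
        Quantity   INTEGER NOT NULL,
        PRIMARY KEY (OrderID, ProductID)  -- assumed composite key
    );
    INSERT INTO Customers VALUES
        (1, 'Alice', 'alice@email.com'), (2, 'Bob', 'bob@email.com');
    INSERT INTO Products VALUES
        ('P1', 'Keyboard', 75.00), ('P2', 'Mouse', 25.00);
    INSERT INTO Orders VALUES
        (101, 1, 'P1', 1), (101, 1, 'P2', 1), (102, 2, 'P1', 2);
""")

# One row changes when Alice's email changes.
cur = conn.execute(
    "UPDATE Customers SET CustomerEmail = 'alice@new.example' WHERE CustomerID = 1"
)
rows_updated = cur.rowcount

# A product can exist before anyone orders it (hypothetical product P3).
conn.execute("INSERT INTO Products VALUES ('P3', 'Monitor', 199.00)")

# Deleting Bob's only order leaves Bob in the Customers table.
conn.execute("DELETE FROM Orders WHERE OrderID = 102")
bob_still_known = conn.execute(
    "SELECT COUNT(*) FROM Customers WHERE CustomerName = 'Bob'"
).fetchone()[0]

# An order for a customer that does not exist is rejected by the foreign key.
try:
    conn.execute("INSERT INTO Orders VALUES (103, 99, 'P1', 1)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True

print(rows_updated, bob_still_known, fk_rejected)
```

The PRAGMA line is specific to SQLite; most other database systems enforce declared foreign keys without it.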

§2 Normal Forms #

Normal forms are a series of guidelines that help to achieve a normalized database. They are numbered from First Normal Form (1NF) to higher forms like Boyce-Codd Normal Form (BCNF), 4NF, and 5NF. Generally, 3NF (Third Normal Form) is considered sufficient for most applications.

§2.1 First Normal Form (1NF): atomic values, no repeating groups #

A table is in 1NF if every column holds atomic values: each cell contains a single value, not a list of values, and there are no repeating groups of columns (such as Course1, Course2, Course3).

Example:
An unnormalized table:

StudentID	Name	Courses
1	Alice	CS101, MA203
2	Bob	PH201

To bring this to 1NF, we would create separate rows for each course:

StudentID	Name	Course
1	Alice	CS101
1	Alice	MA203
2	Bob	PH201

Notice that Alice now appears in two rows with the same StudentID and Name values.

This is better because each row now represents a single student-course enrollment. This structure prevents update anomalies (if a course name changes, you don’t have to parse a string to fix it) and makes it much easier to query the data, such as finding all students enrolled in ‘CS101’. The duplication of student information is addressed in later normal forms.
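As a rough illustration of the conversion, the sketch below splits the comma-separated Courses cell into one row per enrollment and then runs the CS101 lookup mentioned above. It assumes the list separator is a comma and uses a hypothetical table name, Enrollments, with (StudentID, Course) as its key.

```python
import sqlite3

# The unnormalized rows, with a list packed into the third column.
unnormalized = [
    (1, "Alice", "CS101, MA203"),
    (2, "Bob", "PH201"),
]

# One output row per (student, course) pair; strip() removes the space
# that follows each comma.
enrollments = [
    (student_id, name, course.strip())
    for student_id, name, courses in unnormalized
    for course in courses.split(",")
]

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Enrollments (
        StudentID INTEGER, Name TEXT, Course TEXT,
        PRIMARY KEY (StudentID, Course)
    )
""")
conn.executemany("INSERT INTO Enrollments VALUES (?, ?, ?)", enrollments)

# With atomic values, the lookup is a plain equality test, not string parsing.
cs101_students = [
    row[0]
    for row in conn.execute(
        "SELECT Name FROM Enrollments WHERE Course = 'CS101' ORDER BY Name"
    )
]

print(enrollments)
print(cs101_students)
```

Against the original table, the same lookup would need a pattern match on the Courses string, which is slower and easy to get wrong (a search for CS10 would also match CS101).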

Another 1NF Example:

An unnormalized table for project assignments:

ProjectID	ProjectName	TeamMembers
101	Website Redesign	Alice, Bob, Charlie
102	API Development	David, Alice

To bring this to 1NF, we create a separate row for each team member on a project:

ProjectID	ProjectName	TeamMember
101	Website Redesign	Alice
101	Website Redesign	Bob
101	Website Redesign	Charlie
102	API Development	David
102	API Development	Alice

§2.2 Second Normal Form (2NF): 1NF + no partial dependencies #

A table is in 2NF if it is in 1NF and every non-primary-key attribute is fully dependent on the primary key. A partial dependency exists when a non-key attribute depends on only part of a composite key, and 2NF forbids it. For that reason, 2NF is only a concern for tables with composite primary keys.

Example:
Start with a single table with a composite key (StudentID, CourseID):

StudentID	CourseID	CourseName	Grade
1	CS101	Intro to CS	A
1	MA203	Calculus I	B
2	CS101	Intro to CS	C

Here, CourseName depends only on CourseID, not the full primary key. This is a partial dependency. To achieve 2NF, we split the table:

Students_Courses Table:

StudentID	CourseID	Grade
1	CS101	A
1	MA203	B
2	CS101	C

Courses Table:

CourseID	CourseName
CS101	Intro to CS
MA203	Calculus I

This is better because CourseName is now stored only once for each CourseID. If a course name changes (e.g., “Intro to CS” becomes “Introduction to Computer Science”), you only have to update it in one place: the Courses table. In the original table, you would have to find and update every row for every student taking that course, which is inefficient and risks data inconsistency.
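One way to spot a partial dependency in existing data is to check whether part of the key already determines a non-key column. The helper below, determines, is a hypothetical function written for this sketch, and Enrollment is an assumed name for the single starting table. A check like this can only show that the sample rows are consistent with a dependency; whether the dependency really holds is a question about the business rules, not the data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Enrollment (
        StudentID INTEGER, CourseID TEXT, CourseName TEXT, Grade TEXT,
        PRIMARY KEY (StudentID, CourseID)
    );
    INSERT INTO Enrollment VALUES
        (1, 'CS101', 'Intro to CS', 'A'),
        (1, 'MA203', 'Calculus I',  'B'),
        (2, 'CS101', 'Intro to CS', 'C');
""")


def determines(conn, table, lhs, rhs):
    """Return True if, in the current rows, each lhs value maps to one rhs value."""
    # Identifiers are interpolated into the SQL, so pass trusted names only.
    query = (
        f"SELECT COUNT(*) FROM (SELECT {lhs} FROM {table} "
        f"GROUP BY {lhs} HAVING COUNT(DISTINCT {rhs}) > 1)"
    )
    return conn.execute(query).fetchone()[0] == 0


# CourseID alone fixes CourseName: a partial dependency on the composite key.
name_by_course = determines(conn, "Enrollment", "CourseID", "CourseName")
# CourseID alone does not fix Grade, so Grade needs the full key.
grade_by_course = determines(conn, "Enrollment", "CourseID", "Grade")

# Decompose: course facts move to their own table, stored once per course.
conn.executescript("""
    CREATE TABLE Courses AS
        SELECT DISTINCT CourseID, CourseName FROM Enrollment;
    CREATE TABLE Students_Courses AS
        SELECT StudentID, CourseID, Grade FROM Enrollment;
""")
course_rows = conn.execute("SELECT COUNT(*) FROM Courses").fetchone()[0]

print(name_by_course, grade_by_course, course_rows)
```

CREATE TABLE ... AS copies data but not constraints, so a real migration would declare the keys explicitly; it is used here only to keep the sketch short.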

Another 2NF Example:

Consider an OrderItems table with a composite key (OrderID, ProductID):

OrderID	ProductID	ProductName	Quantity	UnitPrice
1	101	Keyboard	1	75.00
1	102	Mouse	1	25.00
2	101	Keyboard	2	75.00

Here, ProductName and UnitPrice depend only on ProductID, not the full (OrderID, ProductID) key. This is a partial dependency. To achieve 2NF, we split the table:

Order_Items Table:

OrderID	ProductID	Quantity
1	101	1
1	102	1
2	101	2

Products Table:

ProductID	ProductName	UnitPrice
101	Keyboard	75.00
102	Mouse	25.00

§2.3 Third Normal Form (3NF): 2NF + no transitive dependencies #

A table is in 3NF if it is in 2NF and there are no transitive dependencies. A transitive dependency exists when a non-key attribute depends on another non-key attribute instead of depending directly on the key.

Example:
Start with a single table:

StudentID	Name	DepartmentID	DepartmentName
1	Alice	101	Computer Science
2	Bob	102	Physics
3	Charlie	101	Computer Science
4	David	103	Mathematics

Here, DepartmentName depends on DepartmentID, which in turn depends on StudentID (the primary key). This is a transitive dependency because a non-key attribute (DepartmentName) depends on another non-key attribute (DepartmentID). To achieve 3NF, we split the table:

Students Table:

StudentID	Name	DepartmentID
1	Alice	101
2	Bob	102
3	Charlie	101
4	David	103

Departments Table:

DepartmentID	DepartmentName
101	Computer Science
102	Physics
103	Mathematics

This is better because it eliminates redundancy and prevents update anomalies. If the “Computer Science” department changes its name, we only need to update it in one row in the Departments table. In the original table, we would have to update it for both Alice and Charlie, risking inconsistency. It also prevents insertion/deletion anomalies: we can now add a new department before any students are assigned to it.
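The split can be checked mechanically: joining the two new tables should give back exactly the original rows, and a department rename should touch one row. The sketch below assumes the starting table is called StudentsFlat, and the new department name Computing and department 104 are hypothetical values used only for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific switch
conn.executescript("""
    CREATE TABLE StudentsFlat (
        StudentID INTEGER PRIMARY KEY, Name TEXT,
        DepartmentID INTEGER, DepartmentName TEXT
    );
    INSERT INTO StudentsFlat VALUES
        (1, 'Alice',   101, 'Computer Science'),
        (2, 'Bob',     102, 'Physics'),
        (3, 'Charlie', 101, 'Computer Science'),
        (4, 'David',   103, 'Mathematics');

    CREATE TABLE Departments (
        DepartmentID   INTEGER PRIMARY KEY,
        DepartmentName TEXT NOT NULL
    );
    CREATE TABLE Students (
        StudentID    INTEGER PRIMARY KEY,
        Name         TEXT NOT NULL,
        DepartmentID INTEGER NOT NULL REFERENCES Departments(DepartmentID)
    );

    -- Departments first, so the foreign key in Students can be satisfied.
    INSERT INTO Departments
        SELECT DISTINCT DepartmentID, DepartmentName FROM StudentsFlat;
    INSERT INTO Students
        SELECT StudentID, Name, DepartmentID FROM StudentsFlat;
""")

# Joining the two tables reproduces the original rows: nothing was lost.
original = conn.execute(
    "SELECT * FROM StudentsFlat ORDER BY StudentID"
).fetchall()
rejoined = conn.execute("""
    SELECT s.StudentID, s.Name, s.DepartmentID, d.DepartmentName
    FROM Students s
    JOIN Departments d ON d.DepartmentID = s.DepartmentID
    ORDER BY s.StudentID
""").fetchall()
lossless = original == rejoined

# Renaming a department is a single-row update (hypothetical new name).
cur = conn.execute(
    "UPDATE Departments SET DepartmentName = 'Computing' WHERE DepartmentID = 101"
)
rows_touched = cur.rowcount

# A department can exist before it has students (hypothetical department).
conn.execute("INSERT INTO Departments VALUES (104, 'Chemistry')")

print(lossless, rows_touched)
```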

Another 3NF Example:

Consider an Employees table:

EmployeeID	Name	Department	DepartmentHead
1	Eve	Marketing	Frank
2	Grace	Marketing	Frank
3	Heidi	Sales	Ivan

Here, DepartmentHead depends on Department. Since Department is not a key, this is a transitive dependency (EmployeeID -> Department -> DepartmentHead). To achieve 3NF, we split the table:

Employees Table:

EmployeeID	Name	Department
1	Eve	Marketing
2	Grace	Marketing
3	Heidi	Sales

Departments Table:

Department	DepartmentHead
Marketing	Frank
Sales	Ivan

This is better because it eliminates redundancy and prevents update anomalies. If Marketing gets a new department head, we only need to update one row in the Departments table. In the original table, we would have to update it for both Eve and Grace, risking inconsistency. It also prevents insertion/deletion anomalies: we can now add a new department and its head before any employees are assigned to it, and we can delete the last employee in a department without losing the information about who heads that department.

§2.4 Boyce-Codd Normal Form (BCNF) and higher forms #

Boyce-Codd Normal Form (BCNF) is a stricter version of 3NF. A table is in BCNF if for every non-trivial functional dependency X -> Y, X is a superkey. In simple terms, every determinant must be a candidate key. Higher normal forms like 4NF and 5NF exist to handle more complex multi-valued and join dependencies, but are less commonly used in practice.

Example:

Consider a table that tracks which instructor teaches which course for a student.

A student can take multiple courses.
An instructor teaches only one course.
A course can be taught by multiple instructors.

Student	Course	Instructor
Alice	Biology	Prof. Smith
Alice	Chemistry	Prof. Jones
Bob	Biology	Prof. Davis

The functional dependencies are:

(Student, Course) -> Instructor (A student in a specific course has one instructor)
Instructor -> Course (An instructor teaches only one course)

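The candidate keys for this table are (Student, Course) and (Student, Instructor). The second pair is also a key because Instructor determines Course, so a student and an instructor together identify the whole row.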

The dependency Instructor -> Course violates BCNF because Instructor is a determinant but it is not a superkey. To bring this to BCNF, we decompose the table:

Student_Instructor Table:

Student	Instructor
Alice	Prof. Smith
Alice	Prof. Jones
Bob	Prof. Davis

Instructor_Course Table:

Instructor	Course
Prof. Smith	Biology
Prof. Jones	Chemistry
Prof. Davis	Biology

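Now, in both tables, the only determinants are candidate keys, so the design is in BCNF. The cost is that the dependency (Student, Course) -> Instructor no longer lives in a single table, so a key constraint cannot enforce it; checking it requires joining the two tables.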
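The decomposition can be tried out as follows. In this sketch the primary key on Instructor is what enforces Instructor -> Course. The last part inserts a hypothetical extra row to illustrate a known side effect of this particular decomposition: the rule that a student has one instructor per course spans both tables, so no key can enforce it and it has to be checked with a query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific switch
conn.executescript("""
    CREATE TABLE Instructor_Course (
        Instructor TEXT PRIMARY KEY,   -- enforces Instructor -> Course
        Course     TEXT NOT NULL
    );
    CREATE TABLE Student_Instructor (
        Student    TEXT NOT NULL,
        Instructor TEXT NOT NULL REFERENCES Instructor_Course(Instructor),
        PRIMARY KEY (Student, Instructor)
    );
    INSERT INTO Instructor_Course VALUES
        ('Prof. Smith', 'Biology'),
        ('Prof. Jones', 'Chemistry'),
        ('Prof. Davis', 'Biology');
    INSERT INTO Student_Instructor VALUES
        ('Alice', 'Prof. Smith'),
        ('Alice', 'Prof. Jones'),
        ('Bob',   'Prof. Davis');
""")

# Joining the two tables rebuilds the original three-column table.
rejoined = conn.execute("""
    SELECT si.Student, ic.Course, si.Instructor
    FROM Student_Instructor si
    JOIN Instructor_Course ic ON ic.Instructor = si.Instructor
    ORDER BY si.Student, ic.Course
""").fetchall()

# A second course for the same instructor violates the primary key.
try:
    conn.execute("INSERT INTO Instructor_Course VALUES ('Prof. Smith', 'Physics')")
    second_course_rejected = False
except sqlite3.IntegrityError:
    second_course_rejected = True

# Hypothetical extra row: Alice gets a second Biology instructor.
# No key in either table objects, so the rule must be checked by query.
conn.execute("INSERT INTO Student_Instructor VALUES ('Alice', 'Prof. Davis')")
conflicts = conn.execute("""
    SELECT si.Student, ic.Course, COUNT(*)
    FROM Student_Instructor si
    JOIN Instructor_Course ic ON ic.Instructor = si.Instructor
    GROUP BY si.Student, ic.Course
    HAVING COUNT(*) > 1
""").fetchall()

print(rejoined)
print(second_course_rejected, conflicts)
```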

§3 Denormalization #

Denormalization is the process of intentionally introducing redundancy into a database to improve query performance. While normalization is good for data integrity and reducing redundancy, it can sometimes lead to complex queries with many joins, which can be slow. Denormalization can help by reducing the number of joins needed for a query.

Example:

Imagine you have a normalized database for a blog with Posts and Users tables.

Users Table:

UserID	UserName
1	Alice
2	Bob

Posts Table:

PostID	UserID	Title	Content
101	1	First Post	…
102	2	My Post	…

To display a list of posts with their author’s name, you would need to perform a JOIN:

SELECT p.Title, u.UserName
FROM Posts p
JOIN Users u ON p.UserID = u.UserID;

If this query is run very frequently and performance is critical, you might denormalize by adding the UserName directly to the Posts table.

Denormalized Posts Table:

PostID	UserID	UserName	Title	Content
101	1	Alice	First Post	…
102	2	Bob	My Post	…

Now, you can get the same result with a simpler, faster query without a join:

SELECT Title, UserName FROM Posts;

The trade-off is that if a user changes their name, you must update it in the Users table and in every post they have written in the Posts table, which introduces redundancy and potential for inconsistency.
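One common way to manage that cost is to let the database propagate the change. The sketch below is an assumption-laden illustration, not a recommendation: the trigger name sync_post_username and the new name Alicia are hypothetical, and the Content column is left out because its values are elided in the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Users (
        UserID   INTEGER PRIMARY KEY,
        UserName TEXT NOT NULL
    );
    CREATE TABLE Posts (
        PostID   INTEGER PRIMARY KEY,
        UserID   INTEGER NOT NULL REFERENCES Users(UserID),
        UserName TEXT NOT NULL,   -- redundant copy of Users.UserName
        Title    TEXT NOT NULL
    );
    INSERT INTO Users VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO Posts VALUES
        (101, 1, 'Alice', 'First Post'),
        (102, 2, 'Bob',   'My Post');

    -- Hypothetical trigger: copy a rename into every post by that user.
    CREATE TRIGGER sync_post_username
    AFTER UPDATE OF UserName ON Users
    BEGIN
        UPDATE Posts SET UserName = NEW.UserName WHERE UserID = NEW.UserID;
    END;
""")

# The rename is written once; the trigger fans it out to Posts.
conn.execute("UPDATE Users SET UserName = 'Alicia' WHERE UserID = 1")

# The join-free read from the text now sees the new name.
listing = conn.execute(
    "SELECT Title, UserName FROM Posts ORDER BY PostID"
).fetchall()

# Consistency check: posts whose copied name disagrees with Users.
stale = conn.execute("""
    SELECT COUNT(*)
    FROM Posts p
    JOIN Users u ON u.UserID = p.UserID
    WHERE p.UserName <> u.UserName
""").fetchone()[0]

print(listing, stale)
```

The trigger keeps reads cheap but makes every rename more expensive, since it rewrites one row per post. Without it, the consistency query at the end is the kind of check that would be needed to find stale copies.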

§4 Trade-offs #

There is a classic trade-off between storage and processing cost. Normalization reduces storage by minimizing redundant data, but can increase processing cost because queries need more joins. Denormalization increases storage by adding redundant data, but can reduce processing cost by avoiding those joins. Choosing between them means balancing data integrity against query performance.

Normalized Databases (OLTP-Optimized):
- Pros: Excellent for Online Transaction Processing (OLTP) systems where data is frequently inserted, updated, and deleted. Data integrity is high, and redundancy is low. This prevents anomalies and saves storage.
- Cons: Retrieving data often requires joining multiple tables, which can be computationally expensive and slow down complex queries.
Denormalized Databases (OLAP-Optimized):
- Pros: Excellent for Online Analytical Processing (OLAP) systems (like data warehouses) where the primary activity is reading and analyzing large amounts of data. Queries are faster because fewer joins are needed.
- Cons: Data redundancy is high, which uses more storage. Inserts and updates are more complex and slower, and there is a higher risk of data inconsistency because a single piece of data is stored in multiple places.

§5 Academic vs. Practical Application #

In academia, normalization is often taught as a strict set of rules to be followed. In practice, database design is often more flexible. A common approach is to normalize for data that needs to be referenced and updated (like customer information), and denormalize for data that is primarily used for reporting and analysis, where performance is a key concern. The choice often comes down to whether you need to maintain the integrity of a value as a reference (normalize) or if you just need the value itself at a point in time (can be denormalized).

When designing a database, consider the following:

Understand the Application’s Needs: Is it a write-heavy application (like a bank transaction system) or a read-heavy one (like a reporting dashboard)? This is the most important factor.
Start with Normalization: A good rule of thumb is to start with a design in 3NF.
Denormalize Selectively: Only denormalize when you have identified a specific performance bottleneck. Use performance monitoring tools to find slow queries and determine if a join is the cause.
Consider Alternatives: Instead of denormalizing tables, you might use database features like materialized views or indexed views, which store the results of a query. This can provide the performance benefits of denormalization while keeping the underlying base tables normalized.
Application-Level Caching: Sometimes, performance issues can be solved by caching frequently accessed data in the application layer, rather than changing the database schema.
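The materialized-view idea from the list above can be approximated even where the feature is missing. SQLite, used here only for illustration, has no materialized views, so this sketch emulates one with a summary table that is rebuilt on demand. The names PostListing and refresh_post_listing, and the new user name Alicia, are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Users (
        UserID   INTEGER PRIMARY KEY,
        UserName TEXT NOT NULL
    );
    CREATE TABLE Posts (
        PostID INTEGER PRIMARY KEY,
        UserID INTEGER NOT NULL REFERENCES Users(UserID),
        Title  TEXT NOT NULL
    );
    INSERT INTO Users VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO Posts VALUES (101, 1, 'First Post'), (102, 2, 'My Post');
""")


def refresh_post_listing(conn):
    """Rebuild the precomputed listing from the normalized base tables."""
    conn.executescript("""
        DROP TABLE IF EXISTS PostListing;
        CREATE TABLE PostListing AS
            SELECT p.PostID, p.Title, u.UserName
            FROM Posts p
            JOIN Users u ON u.UserID = p.UserID;
    """)


def author_of(conn, post_id):
    """Read the author name from the precomputed table, with no join."""
    return conn.execute(
        "SELECT UserName FROM PostListing WHERE PostID = ?", (post_id,)
    ).fetchone()[0]


refresh_post_listing(conn)

# The base tables stay normalized, so a rename is still a single-row update.
conn.execute("UPDATE Users SET UserName = 'Alicia' WHERE UserID = 1")
conn.commit()

stale_name = author_of(conn, 101)   # listing not refreshed yet
refresh_post_listing(conn)
fresh_name = author_of(conn, 101)   # listing rebuilt from base tables

print(stale_name, fresh_name)
```

The design choice is the same one a real materialized view makes: reads are served from a precomputed copy, and the copy is only as fresh as its last refresh. How often to refresh depends on how much staleness the reports can tolerate.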

§6 Questions #

Always provide at least a brief explanation for each of your answers.
Please number each response to directly correspond with the question. Provide clear, concise answers that fully address each question without unnecessary elaboration or essay-style writing.

What is the primary goal of normalization in a relational database?

Which normal form eliminates repeating groups so that every column contains only atomic values?

Why is First Normal Form (1NF) a critical first step in normalization?

Provide an example of a partial dependency and an example of a transitive dependency.

Suppose a table has a composite PK (StudentID, CourseID) and a non-key column CourseName that depends only on CourseID. Which normal form is violated?

StudentID	CourseID	CourseName
1001	C101	Physics
1002	C101	Physics
1001	C102	Math
1003	C101	Physics

What type of dependency is removed when a table is moved from 2NF to 3NF?

Name the normal form that requires every determinant to be a candidate key.

What are “update anomalies” and how does normalization help prevent them? Provide an example.

Explain the three types of data anomalies that can occur in an unnormalized database.

When might you intentionally choose to denormalize a database design?

When intentionally adding redundant data to speed up reads, what process are you performing?

True or False: A fully normalized schema is always the best choice for an OLAP workload. Explain why.