Section 4: Normalization

Normalization is a systematic design discipline that restructures relational databases so each fact is stored exactly once. It reduces data redundancy, improves integrity, and simplifies updates by organizing information into smaller, related tables linked through keys. The process proceeds through successive normal forms: 1NF, 2NF, 3NF, and optionally BCNF or higher-each eliminating specific types of dependency anomalies.

§1 Why Normalize? #

Normalization is a design discipline that restructures a relational database so that each fact is stored exactly once. The benefits fall into three tightly-linked areas:

  1. Reducing Data Redundancy: Eliminate duplicate data in a database. This saves storage space and prevents inconsistencies, and also helps with query performance.
  2. Improving Data Integrity: By minimizing redundancy, normalization also reduces the risk of data inconsistencies. When data is updated, it only needs to be updated in a single place, ensuring data integrity.
  3. Simplifying Data Modification: A normalized database is easier to maintain. Insertion, update, and deletion of data can be done with less risk of creating anomalies.

§1.1 Example of Unnormalized vs. Normalized Data #

Consider a simple online store’s order tracking system. An unnormalized approach might store all information in a single table:

Unnormalized Orders Table:

OrderIDCustomerIDCustomerNameCustomerEmailProductIDProductNameQuantityPrice
1011Alicealice@email.comP1Keyboard175.00
1011Alicealice@email.comP2Mouse125.00
1022Bobbob@email.comP1Keyboard275.00

Notice the repeated CustomerName, CustomerEmail, and ProductName. This leads to problems:

A normalized design would split this into three tables:

Customers Table:

CustomerIDCustomerNameCustomerEmail
1Alicealice@email.com
2Bobbob@email.com

Products Table:

ProductIDProductNamePrice
P1Keyboard75.00
P2Mouse25.00

Orders Table:

OrderIDCustomerIDProductIDQuantity
1011P11
1011P21
1022P12

ERD:

erDiagram Customers { int CustomerID PK string CustomerName string CustomerEmail } Products { string ProductID PK string ProductName decimal Price } Orders { int OrderID PK int CustomerID FK string ProductID FK int Quantity } Customers ||--o{ Orders : "places" Products ||--o{ Orders : "in"

Now, each piece of information is stored only once. If Alice’s email changes, it’s a single update in the Customers table. We can add new products to the Products table without an order, and customers exist independently of their orders. This design is cleaner, more efficient, and easier to maintain.

§2 Normal Forms #

Normal forms are a series of guidelines that help to achieve a normalized database. They are numbered from First Normal Form (1NF) to higher forms like Boyce-Codd Normal Form (BCNF), 4NF, and 5NF. Generally, 3NF (Third Normal Form) is considered sufficient for most applications.

§2.1 First Normal Form (1NF): atomic values, no repeating groups #

A table is in 1NF if all its columns contain atomic values, meaning each cell holds a single value, not a list of values. Each column should only have one entry, and there should be no repeating groups of columns.

Example:
An unnormalized table:

StudentIDNameCourses
1AliceCS101, MA203
2BobPH201

To bring this to 1NF, we would create separate rows for each course:

StudentIDNameCourse
1AliceCS101
1AliceMA203
2BobPH201

Notice there are now two rows with duplicate StudentID and Name values

This is better because each row now represents a single student-course enrollment. This structure prevents update anomalies (if a course name changes, you don’t have to parse a string to fix it) and makes it much easier to query the data, such as finding all students enrolled in ‘CS101’. The duplication of student information is addressed in later normal forms.

Another 1NF Example:

An unnormalized table for project assignments:

ProjectIDProjectNameTeamMembers
101Website RedesignAlice, Bob, Charlie
102API DevelopmentDavid, Alice

To bring this to 1NF, we create a separate row for each team member on a project:

ProjectIDProjectNameTeamMember
101Website RedesignAlice
101Website RedesignBob
101Website RedesignCharlie
102API DevelopmentDavid
102API DevelopmentAlice

§2.2 Second Normal Form (2NF): 1NF + no partial dependencies #

A table is in 2NF if it is in 1NF and every non-primary-key attribute is fully dependent on the primary key. This means there are no partial dependencies. This form is relevant for tables with composite primary keys.

Example:
Start with a single table with a composite key (StudentID, CourseID):

StudentIDCourseIDCourseNameGrade
1CS101Intro to CSA
1MA203Calculus IB
2CS101Intro to CSC

Here, CourseName depends only on CourseID, not the full primary key. This is a partial dependency. To achieve 2NF, we split the table:

Students_Courses Table:

StudentIDCourseIDGrade
1CS101A
1MA203B
2CS101C

Courses Table:

CourseIDCourseName
CS101Intro to CS
MA203Calculus I

This is better because CourseName is now stored only once for each CourseID. If a course name changes (e.g., “Intro to CS” becomes “Introduction to Computer Science”), you only have to update it in one place: the Courses table. In the original table, you would have to find and update every row for every student taking that course, which is inefficient and risks data inconsistency.

Another 2NF Example:

Consider an OrderItems table with a composite key (OrderID, ProductID):

OrderIDProductIDProductNameQuantityUnitPrice
1101Keyboard175.00
1102Mouse125.00
2101Keyboard275.00

Here, ProductName and UnitPrice depend only on ProductID, not the full (OrderID, ProductID) key. This is a partial dependency. To achieve 2NF, we split the table:

Order_Items Table:

OrderIDProductIDQuantity
11011
11021
21012

Products Table:

ProductIDProductNameUnitPrice
101Keyboard75.00
102Mouse25.00

§2.3 Third Normal Form (3NF): 2NF + no transitive dependencies #

A table is in 3NF if it is in 2NF and there are no transitive dependencies. A transitive dependency is when a non-key attribute depends on another non-key attribute.

Example:
Start with a single table:

StudentIDNameDepartmentIDDepartmentName
1Alice101Computer Science
2Bob102Physics
3Charlie101Computer Science
4David103Mathematics

Here, DepartmentName depends on DepartmentID, which in turn depends on StudentID (the primary key). This is a transitive dependency because a non-key attribute (DepartmentName) depends on another non-key attribute (DepartmentID). To achieve 3NF, we split the table:

Students Table:

StudentIDNameDepartmentID
1Alice101
2Bob102
3Charlie101
4David103

Departments Table:

DepartmentIDDepartmentName
101Computer Science
102Physics
103Mathematics

This is better because it eliminates redundancy and prevents update anomalies. If the “Computer Science” department changes its name, we only need to update it in one row in the Departments table. In the original table, we would have to update it for both Alice and Charlie, risking inconsistency. It also prevents insertion/deletion anomalies: we can now add a new department before any students are assigned to it.

Another 3NF Example:

Consider an Employees table:

EmployeeIDNameDepartmentDepartmentHead
1EveMarketingFrank
2GraceMarketingFrank
3HeidiSalesIvan

Here, DepartmentHead depends on Department. Since Department is not a key, this is a transitive dependency (EmployeeID -> Department -> DepartmentHead). To achieve 3NF, we split the table:

Employees Table:

EmployeeIDNameDepartment
1EveMarketing
2GraceMarketing
3HeidiSales

Departments Table:

DepartmentDepartmentHead
MarketingFrank
SalesIvan

This is better because it eliminates redundancy and prevents update anomalies. If Marketing gets a new department head, we only need to update one row in the Departments table. In the original table, we would have to update it for both Eve and Grace, risking inconsistency. It also prevents insertion/deletion anomalies: we can now add a new department and its head before any employees are assigned to it, and we can delete the last employee in a department without losing the information about who heads that department.

§2.4 Boyce-Codd Normal Form (BCNF) and higher forms #

Boyce-Codd Normal Form (BCNF) is a stricter version of 3NF. A table is in BCNF if for every non-trivial functional dependency X -> Y, X is a superkey. In simple terms, every determinant must be a candidate key. Higher normal forms like 4NF and 5NF exist to handle more complex multi-valued and join dependencies, but are less commonly used in practice.

Example:

Consider a table that tracks which instructor teaches which course for a student.

StudentCourseInstructor
AliceBiologyProf. Smith
AliceChemistryProf. Jones
BobBiologyProf. Davis

The functional dependencies are:

  1. (Student, Course) -> Instructor (A student in a specific course has one instructor)
  2. Instructor -> Course (An instructor teaches only one course)

The candidate keys for this table are (Student, Course).

The dependency Instructor -> Course violates BCNF because Instructor is a determinant but it is not a superkey. To bring this to BCNF, we decompose the table:

Student_Instructor Table:

StudentInstructor
AliceProf. Smith
AliceProf. Jones
BobProf. Davis

Instructor_Course Table:

InstructorCourse
Prof. SmithBiology
Prof. JonesChemistry
Prof. DavisBiology

Now, in both tables, the only determinants are candidate keys, so the design is in BCNF.

§3 Denormalization #

Denormalization is the process of intentionally introducing redundancy into a database to improve query performance. While normalization is good for data integrity and reducing redundancy, it can sometimes lead to complex queries with many joins, which can be slow. Denormalization can help by reducing the number of joins needed for a query.

Example:

Imagine you have a normalized database for a blog with Posts and Users tables. Users Table:

UserIDUserName
1Alice
2Bob

Posts Table:

PostIDUserIDTitleContent
1011First Post
1022My Post

To display a list of posts with their author’s name, you would need to perform a JOIN:

SELECT p.Title, u.UserName
FROM Posts p
JOIN Users u ON p.UserID = u.UserID;

If this query is run very frequently and performance is critical, you might denormalize by adding the UserName directly to the Posts table.

Denormalized Posts Table:

PostIDUserIDUserNameTitleContent
1011AliceFirst Post
1022BobMy Post

Now, you can get the same result with a simpler, faster query without a join:

SELECT Title, UserName FROM Posts;

The trade-off is that if a user changes their name, you must update it in the Users table and in every post they have written in the Posts table, which introduces redundancy and potential for inconsistency.

§4 trade-offs #

There is a classic trade-off between storage and processing cost. Normalization reduces storage costs by minimizing redundant data, but can increase processing costs due to the need for more joins. Denormalization, on the other hand, increases storage costs by adding redundant data, but can reduce processing costs by making queries faster. The trade-off between normalization and denormalization is a fundamental concept in database design, balancing data integrity and query performance.

§5 academic vs practical application #

In academia, normalization is often taught as a strict set of rules to be followed. In practice, database design is often more flexible. A common approach is to normalize for data that needs to be referenced and updated (like customer information), and denormalize for data that is primarily used for reporting and analysis, where performance is a key concern. The choice often comes down to whether you need to maintain the integrity of a value as a reference (normalize) or if you just need the value itself at a point in time (can be denormalized).

When designing a database, consider the following:

§6 Questions #

Always provide at least a brief explanations for your answers.
Please number each response to directly correspond with the question. Provide clear, concise answers that fully address each question without unnecessary elaboration or essay-style writing.

  1. What is the primary goal of normalization in a relational database?
  1. Which normal form eliminates repeating groups so that every column contains only atomic values?
  1. Why is First Normal Form (1NF) a critical first step in normalization?
  1. Provide an example of a partial dependency and a transitive dependency
  1. Suppose a table has a composite PK (StudentID, CourseID) and a non-key column CourseName that depends only on CourseID. Which normal form is violated?
    StudentIDCourseIDCourseName
    1001C101Physics
    1002C101Physics
    1001C102Math
    1003C101Physics
  1. What type of dependency is removed when a table is moved from 2NF to 3NF?
  1. Name the normal form that requires every determinant to be a candidate key.
  1. What are “update anomalies” and how does normalization help prevent them? Provide an example.
  1. Explain the three types of data anomalies that can occur in an unnormalized database.
  1. When might you intentionally choose to denormalize a database design?
  1. When intentionally adding redundant data to speed up reads, what process are you performing?
  1. True or False: A fully normalized schema is always the best choice for an OLAP workload. Explain why.