§1 Introduction to Databases #
Data is essential to modern applications and websites. Every interaction, from liking a social media post, tracking live scores in a multiplayer game, or managing items in an online shopping cart, involves information that is stored and manipulated within a database system. These databases are sophisticated digital repositories designed for the efficient storage, retrieval, and manipulation of vast amounts of information. In this course, you will learn fundamental concepts of data storage and manipulation.
§2 What is a Database? #
A database is an organized collection of information. While the concept can encompass various forms, such as a contact book, a spreadsheet, or a physical filing cabinet, in the context of websites and computer applications, a database typically refers to a computer program. These programs are formally known as Database Management Systems (DBMS). Users interact with databases using a specialized programming language, the most common of which is Structured Query Language, or SQL for short.
A DBMS is designed to efficiently store, retrieve, manage, and update structured data, which is typically stored electronically in a computer system. Databases are essential for systematically managing large amounts of information.
§3 What are Uses for a Database? #
Databases are ubiquitous and are used in almost every industry and application where data needs to be managed. Common uses include:
- Business Operations: Managing customer data, sales, inventory, and financial records.
- Web Applications: Storing user profiles, content, and e-commerce product catalogs.
- Healthcare: Maintaining patient records, medical histories, and appointments.
- Education: Storing student grades, course details, and faculty information.
- Scientific Research: Managing experimental data and research findings.
- Government: Storing census data, tax records, and public safety information.
Databases are used in almost every modern computer system. Your smartphone uses hundreds of small databases for all sorts of purposes. For example: every major messaging app uses a sqlite database on-device to manage all information related to your conversations.
You can find dozens of these databases on your computer right now by running the following in a command prompt/terminal:
You should get output like this:
... dozens of files...
./Library/Messages/CollaborationNoticeCache/collaborationNotices.db
./Library/Messages/NickNameCache/nickNameKeyStore.db
./Library/Messages/NickNameCache/nicknameRecordsStore.db
./Library/Messages/NickNameCache/handleSharingPreferences.db
./Library/Messages/NickNameCache/unknownSenderRecordInfoStore.db
./Library/Messages/NickNameCache/archivedNicknamesKeyStore.db
./Library/Messages/NickNameCache/handledNicknamesKeyStore.db
./Library/Messages/NickNameCache/pendingNicknamesKeyStore.db
./Library/Messages/chat.db
./Library/Messages/LiteSegmentStore.db
./Library/Messages/prewarm.db
./Library/Messages/Sync/sync.db
§4 Why Use a Database Over Other Storage Types? #
While simple datasets can be stored in text files or spreadsheets, databases offer significant advantages over other approaches, especially as data volume and complexity grow:
- Data Integrity: Databases enforce rules (known as constraints) to ensure data is accurate and consistent, preventing errors like duplicate entries or invalid data types.
- Data Security: They provide robust security features, allowing fine-grained control over exactly who can access, modify, or delete specific data.
- Concurrency Control: Multiple users or applications can access and modify data simultaneously without corrupting it. Databases manage concurrent reading and writing efficiently.
- Data Redundancy Reduction: Databases provide tools to minimize duplicate data, saving storage space.
- Efficient Data Retrieval: Optimized indexing and query languages (like SQL) allow for very fast retrieval of specific information, even from massive datasets.
- Scalability: Databases can be scaled to handle massive amounts of data and user traffic.
- Backup and Recovery: They offer built-in mechanisms for backing up data and recovering it in case of system failures.
This course primarily focuses on Relational Databases, though there are many specialized databases that excel at specific use-cases, we will briefly cover some examples of alternate databases as well.
§5 Analogy: Database vs. Filing Cabinet or Spreadsheet #
To understand the benefits of a database, let’s compare it to more traditional storage methods:
§5.1 Filing Cabinet #
Imagine a large office with many filing cabinets. Each cabinet might represent a different category of information (e.g., “Customer Records,” “Employee Files”). Inside each cabinet, there are folders, and inside each folder, there are individual documents.
A loose database equivalent of this would be:
- Filing Cabinet: Represents the entire database system.
- Cabinet Section/Drawers: Represents a specific table within the database (e.g.,
Customerstable). - Folder: Represents a record or row in a table (e.g., a single customer’s details).
- Document: Represents an individual piece of data or field within a record (e.g., customer’s address).
§5.2 Spreadsheet (e.g., Microsoft Excel) #
A spreadsheet is like a single, large table where you can store data in rows and columns.
The database equivalent of this would be:
- Worksheet: A single table in a database.
- Row: A record in a table.
- Column: A field or attribute in a table.
§5.3 Limitations of Filing Cabinets and Spreadsheets Compared to Databases: #
| Feature | Filing Cabinet | Spreadsheet | Database |
|---|---|---|---|
| Search/Retrieval | Manual, time-consuming, prone to human error. | Limited search functions, slow for large datasets. | Fast, complex queries. |
| Data Integrity | No built-in validation; easy to misfile. | No enforced rules; easy to enter incorrect data. | Enforces data types, unique values, relationships, and validation rules. |
| Concurrency | Only one person can access a folder at a time. | Difficult for multiple users to edit simultaneously. | Multiple users can read/write concurrently without data corruption.* |
| Security | Physical lock; anyone with key has full access. | File password; limited granular control. | Granular permissions (who can read/write specific data). |
| Scalability | Limited by physical space. | Performance degrades rapidly with large data. | Designed to handle vast amounts of data and users. |
| Relationships | Manual cross-referencing. | Difficult to link data across multiple sheets. | Defines and enforces relationships between different data sets. |
| Redundancy | Easy to have duplicate documents. | Easy to duplicate rows/columns. | Provides tools to minimize redundancy. |
* users of a SQL database do not actually read/write to the same row of data at the exact same time, the database enforces “concurrency” so that subsequent updates wait a short amount of time for the previous update to complete (typically just a few milliseconds) to ensure the data is always consistent.
§6 Examples of Popular Database Software #
There are many database management systems (DBMS) available, categorized by their underlying data model. Some popular examples:
§6.1 Relational Databases (SQL Databases) #
These are the most common type of database, storing data in tables with predefined schemas and relationships. Well suited for storing structured data.
- MySQL: Open-source, widely used for web applications (e.g., WordPress, Drupal).
- PostgreSQL: Open-source, known for its robustness, extensibility, and compliance with SQL standards.
- Microsoft SQL Server: Commercial, popular in enterprise environments, especially with Microsoft technologies.
- Oracle Database: Commercial, very powerful and scalable, used by large enterprises.
- SQLite: Open-source, embedded, file-based database, ideal for local storage in applications (e.g., mobile apps, desktop software).
There are also Cloud-Based SQL database services, which provide a managed database where the provider takes care of updates and management.
- Amazon RDS (Relational Database Service): Offers managed instances of MySQL, PostgreSQL, Oracle, SQL Server, and MariaDB.
- Google Cloud SQL: Managed relational database service for MySQL, PostgreSQL, and SQL Server.
- Azure SQL Database: Managed relational database service for Microsoft SQL Server.
This is a non-exhaustive list; there are many, many more DBMS software available.
§6.2 NoSQL Databases (Non-Relational Databases) #
Document Databases: Store data together as a “Document” object, particularly well suited to scale across many servers and hold enormous datasets, at the trade-off of certain query capabilities.
- MongoDB: Stores data in flexible, JSON-like documents. Popular for web applications, content management, and mobile apps.
- Couchbase: Another document-oriented database.
Key-Value Stores: A simple and flexible structure that stores any data along with an identifier, or key. These are often extremely fast but serve a narrow use case.
- Redis: In-memory data structure store, used as a database, cache, and message broker. Excellent for real-time applications, caching, and session management.
- DynamoDB (AWS): Fully managed key-value and document database service.
Graph Databases: A specialized structure that focuses on relationships between data.
- Neo4j: Stores data as nodes and relationships, optimized for highly connected data. Used for social networks, recommendation engines, and fraud detection.
Column-Family Stores: Designed for distributed systems and high throughput.
- Apache Cassandra: Designed for high availability and linear scalability across many commodity servers. Used for large-scale data with high write throughput (e.g., IoT data, time-series data).
- HBase: Open-source, non-relational, distributed database modeled after Google’s Bigtable.
§7 What are NoSQL Databases? #
NoSQL (Not Only SQL) databases are non-relational database management systems designed for specific use cases requiring flexible schemas and/or high scalability for large volumes of unstructured or semi-structured data. They emerged to address the limitations of traditional SQL databases when handling massive, unstructured data.
Key Characteristics of NoSQL Databases:
- Flexible Schemas: No rigid schema enforcement, allowing for easier data structure evolution.
- Horizontal Scalability: Designed to distribute data and workload across many servers.
- Varied Data Models: Employ different models optimized for specific needs, unlike the single tabular model of SQL databases.
- NoSQL Query Languages: Use their own APIs or query languages, not SQL.
The choice between a relational and NoSQL database depends heavily on the specific requirements of the application, including data structure, scalability needs, consistency requirements, and query patterns. Many modern applications use a combination of both database types for different needs.
§8 Simple Database Architecture: How Applications Interact with the Database Server #
A typical database architecture involves an application interacting with a database server.
- Application (Client): The software users interact with (e.g., website, mobile app) that initiates data requests.
- Database Driver/Connector: A library that translates the application’s requests into a format the database server understands (e.g., SQL queries).
- Network: The medium over which requests are sent from the application to the database server. (Note: This step is not applicable for embedded databases like SQLite).
- Database Server: A dedicated service running the database management system (DBMS) software. It receives, processes, and executes requests, interacting with the physical data storage before sending results back to the application.
- Data Storage: The physical location (e.g., disk drives) where the database server manages and stores the actual data files.
§9 Questions #
Always provide at least a brief explanations for your answers.
Please number each response to directly correspond with the question. Provide clear, concise answers that fully address each question without unnecessary elaboration or essay-style writing.
- What are some advantages of using a database instead of Excel spreadsheets?
- What is the most used database software (counted by number of deployed instances/installs)?
- Briefly describe an ideal application for each of the 5 database types: Relational Database (SQL), Document Database, Key-Value Store, Graph, and Column-Oriented.
- Which feature of a database becomes most important if 500 customer-service reps need to update the same product-inventory file at the same time?
- What built-in database feature prevents two different users from simultaneously overwriting the same customer’s phone number?
- Which of the five database categories would you use to track real-time chat-message counts for an online game, and why?
- If you needed to model highly connected social-network “friend” relationships efficiently, which of the five database types should you pick?
- A weather-monitoring system ingests millions of time-stamped sensor readings per hour. Which of the five database categories is designed for this workload?
- If you need to store a structured user contact list, including names, addresses, phone numbers, and other information. Which of the five database categories will be well-suited this workload?