Section 8: Subqueries & Complex Query Management

This section is a much more advanced exploration of subqueries, a powerful SQL feature for solving complex data retrieval problems. You will learn about different types of subqueries, the distinction between correlated and non-correlated subqueries, and essential operators used with them. Additionally, we will explore techniques like Common Table Expressions (CTEs) and Views to manage and simplify complex SQL queries.

§1 Review of Subqueries (Nested Queries) #

A subquery (also known as a nested query or inner query) is a SELECT statement embedded within another SQL query. It acts as a component of a larger query, allowing you to perform operations that might be difficult or impossible with a single, flat query.

The primary purpose of subqueries is to solve complex problems by breaking them down into smaller, more manageable parts. They enable you to use the result of one query as input for another, facilitating advanced filtering, calculations, and data manipulation.

For example, you might use a subquery to find the average order value and then use that average in an outer query to list all orders above that average.

§2 Types of Subqueries #

Subqueries can be categorized based on the number of values they return:

§2.1 Scalar Subqueries #

A scalar subquery returns a single value (one row, one column). Because they return a single value, they can be used in most places where a single expression is expected, such as in the SELECT list, WHERE clause, HAVING clause, or even as an argument to a function.

Syntax Example (in SELECT clause):

SELECT
    column1,
    (SELECT MAX(column_name) FROM another_table) AS max_value
FROM
    your_table;

Scalar Subquery Example:
Find the average unit_price of all products and display it alongside each product’s name.
(this example uses ROUND(value,2) to avoid floating point rounding errors)

Note how the value of average_price is identical in all columns, a scalar only returns a single value.

§2.2 Row Subqueries #

A row subquery returns a single row but can return multiple columns. They are typically used in the WHERE or HAVING clauses, often with comparison operators (=, >, <, >=, <=, !=/<>) or with IN and NOT IN when comparing multiple columns.

Syntax Example:

SELECT column1, column2
FROM table1
WHERE (column1, column2) = (SELECT col_a, col_b FROM table2 WHERE condition);

The tuple syntax:
(column1, column2) = (SELECT col_a, col_b FROM table2 WHERE condition);
The expression (column1, column2) is a tuple, and the row subquery must return exactly one row with the same number of columns so the comparison can succeed.

Row Subquery Example:
Find products that have the same unit_price and category_id as a specific product (e.g., product with id = 1).

§2.3 Table Subqueries #

A table subquery returns a table (multiple rows and multiple columns). These are highly versatile and can be used in the FROM clause (where they are often called derived tables), or with operators like IN, NOT IN, EXISTS, and NOT EXISTS.

Syntax Example (in FROM clause - Derived Table):

SELECT outer_col1, outer_col2
FROM (SELECT col_a, col_b FROM table1 WHERE condition) AS derived_table
WHERE derived_table.col_a > 10;

Table Subquery Example (Derived Table):
Find the total quantity of products ordered for each category.

Table subqueries allow you to create complex queries by creating intermediate tables. In the example above, try removing the SUM() and the GROUP BY clause:

SELECT c.name AS category_name, derived_orders.total_qty
FROM categories AS c
JOIN (
  SELECT p.category_id, oi.qty AS total_qty
  FROM products AS p
  JOIN order_items AS oi ON p.id = oi.product_id
) AS derived_orders 
  ON c.id = derived_orders.category_id;

§3 Correlated vs. Non-Correlated Subqueries #

Understanding the execution flow of subqueries is important for performance and correct logic.

§3.1 Non-Correlated Subqueries #

A non-correlated subquery is a subquery that can be executed independently of the outer query. Its results are evaluated once, and then those results are passed to the outer query. The inner query does not reference any columns from the outer query.

Characteristics:

Executes once.
Results are cached and used by the outer query.
Generally more efficient than correlated subqueries.

Non-Correlated Subquery Example:
List all products whose unit_price is greater than the average unit_price of all products.

§3.2 Correlated Subqueries #

A correlated subquery is a subquery that depends on the outer query for its values. It executes once for each row processed by the outer query. The inner query references one or more columns from the outer query.

Characteristics:

Executes repeatedly, once for every row of the outer query.
Can be less efficient for large datasets due to repeated execution.
Often used with EXISTS, NOT EXISTS, or for row-by-row comparisons.

Correlated Subquery Example:
Find all categories that have at least one product with a unit_price greater than $13.

In a correlated subquery, the value after SELECT is effectively ignored, as the query is evaluating rows that simply EXISTS, so any valid value can be placed after SELECT in this query.
SELECT 1 FROM products AS p 
WHERE p.category_id = c.id AND p.unit_price > 13
This query is returning any row that belongs to the current category, where the unit_price is greater than $13, this effectively returns all categories that contain a product worth more than $13.

§4 Operators Used with Subqueries #

Several operators are commonly used to integrate subquery results into the outer query’s logic.

§4.1 `IN` / `NOT IN`: #

Used with row or table subqueries.
IN: Checks if a value exists within the set of values returned by the subquery.
NOT IN: Checks if a value does not exist within the set of values returned by the subquery.

NOT IN has undefined behavior when encountering NULL data, use NOT EXISTS instead if NULL may be encountered.

Example (IN):
Find customers who have placed orders.

§4.2 `EXISTS` / `NOT EXISTS`: #

Used with table subqueries, typically correlated.
EXISTS: Returns TRUE if the subquery returns any rows; otherwise, FALSE. It’s efficient because it stops processing as soon as it finds the first matching row.
NOT EXISTS: Returns TRUE if the subquery returns no rows; otherwise, FALSE.

Example (EXISTS):
List categories that have at least one product.

§4.3 `ANY` / `ALL`: #

Used with scalar or row subqueries, often with comparison operators.
ANY (or SOME): Returns TRUE if the comparison is TRUE for any value in the set returned by the subquery.
- > ANY: Greater than at least one value (i.e., greater than the minimum).
- < ANY: Less than at least one value (i.e., less than the maximum).
ALL: Returns TRUE if the comparison is TRUE for all values in the set returned by the subquery.
- > ALL: Greater than every value (i.e., greater than the maximum).
- < ALL: Less than every value (i.e., less than the minimum).

Example (> ANY):
Find products with a unit_price greater than any unit_price in the ‘Electronics’ category.

SELECT name, unit_price
FROM products
WHERE unit_price > ANY (
  SELECT p.unit_price FROM products AS p 
  JOIN categories AS c ON p.category_id = c.id 
  WHERE c.name = 'Electronics'
);

SQLite does not support the ANY or ALL operators, this can be worked around by using the IN operator with a more specific subquery. Postgres, MySQL, and Oracle all support the ANY/All operators.

§5 Techniques for Managing Complex Queries #

As queries grow in complexity, maintaining readability, reusability, and performance becomes challenging. SQL offers several techniques to manage this.

§5.1 Breaking Down Complex Problems #

The most fundamental technique is to decompose a large, complex problem into smaller, more manageable sub-problems. Each sub-problem can then be addressed with a simple query or subquery. This makes debugging easier and improves overall understanding.

Example: Find customers who have ordered products from at least 2 different categories

Let’s break this down:

Identify what data we need: We need to connect customers to the categories of products they’ve ordered
Break it into sub-problems:
1. Find all products each customer has ordered
2. Determine the categories of those products
3. Count distinct categories per customer
4. Filter for customers with 3+ categories

Build the solution incrementally:

Find all products ordered by each customer

SELECT o.customer_id, oi.product_id
FROM orders o
JOIN order_items oi ON o.id = oi.order_id

Add category information

SELECT o.customer_id, p.category_id
FROM orders o
JOIN order_items oi ON o.id = oi.order_id
JOIN products p ON oi.product_id = p.id

Add customer information

SELECT c.name, p.category_id
FROM orders o
JOIN order_items oi ON o.id = oi.order_id
JOIN products p ON oi.product_id = p.id
JOIN customers c ON c.id = o.customer_id;

Count distinct categories per customer

SELECT c.name, COUNT(DISTINCT p.category_id) as category_count
FROM orders o
JOIN order_items oi ON o.id = oi.order_id
JOIN products p ON oi.product_id = p.id
JOIN customers c ON c.id = o.customer_id
GROUP BY c.name

Filter for customers with 2+ categories

SELECT name, category_count
FROM (
    SELECT c.name, COUNT(DISTINCT p.category_id) as category_count
    FROM orders o
    JOIN order_items oi ON o.id = oi.order_id
    JOIN products p ON oi.product_id = p.id
    JOIN customers c ON c.id = o.customer_id
    GROUP BY c.name
) AS customer_categories
WHERE category_count >= 2

Or alternatively using a CTE for better readability:

WITH customer_categories AS (
    SELECT c.name, COUNT(DISTINCT p.category_id) as category_count
    FROM orders o
    JOIN order_items oi ON o.id = oi.order_id
    JOIN products p ON oi.product_id = p.id
    JOIN customers c ON c.id = o.customer_id
    GROUP BY c.name
)
SELECT name, category_count
FROM customer_categories
WHERE category_count >= 2

Complex problems become manageable when broken into smaller pieces. Start simple, test each piece, then assemble your solution. The most elegant queries often emerge from patient, systematic construction

§5.2 Common Table Expressions (CTEs) #

Common Table Expressions (CTEs), introduced with the WITH clause, are temporary, named result sets that you can reference within a single SELECT, INSERT, UPDATE, or DELETE statement. They significantly improve query readability and can be used to break down complex, multi-step calculations.

A CTE can be referenced multiple times within the same query, which allows avoiding redundant code. They are necessary for recursive queries (e.g., hierarchical data).

Syntax:

WITH cte_name (column1, column2, ...) AS (
    SELECT column1, column2, ...
    FROM table_name
    WHERE condition
)
SELECT *
FROM cte_name
WHERE condition;

CTE Example:
Calculate the total order value for each order and then find orders with a total value greater than $20.

§6 More Resources #

§7 Questions #

Please number each response to directly correspond with the question. Provide clear, concise answers that fully address each question without unnecessary elaboration or essay-style writing. Provide only the SQL query unless otherwise specified.

What is a subquery in SQL?

What clauses can scalar subqueries be used with?

True or False: A subquery can return multiple rows and multiple columns.

Explain the primary difference between a correlated and a non-correlated subquery.

Which operators require the subquery to be correlated?

Which type of subquery is generally more efficient for large datasets: correlated or non-correlated?

In the example SELECT name, (SELECT MAX(unit_price) FROM products) FROM products;, how many times is the subquery executed?

When would you use the EXISTS operator with a subquery?

What does EXISTS (SELECT 1 ...) return when the subquery finds at least one row?

Why is SELECT 1 used inside an EXISTS subquery?

When would you choose NOT EXISTS instead of NOT IN with a subquery (consider how NULL values are handled)?

What is a Common Table Expression (CTE) and what keyword is used to define it?

What is the primary benefit of using a CTE over a nested subquery?

Provide the operator that must be used when comparing a tuple (col1, col2) to a row subquery.

The remaining questions will use the database below, use the text box to test queries for the questions.
For these questions, I do not want the SQL outputs, only the SQL query.

You do not have to use subqueries to solve all of these, JOINs are acceptable.

Use the (view schema) link to view the list of tables and their structure.

Display every product name and price together with the single highest unit price found in the products table.

Return every product whose supplier_id, category_id matches exactly the product with sku ‘WIDG-B’.

Return the name and supplier_id of any product whose supplier_id equals that of the cheapest ‘Electronics’ item.

List every product name together with its total qty ordered, do not omit products with 0 qty ordered.

Show each product name, its unit_price, and the average price of its category.

List products whose unit_price is above the overall average price.

List categories that contain at least one product priced above $10 using EXISTS.

Show the names of customers who have at least one order in any status.

Display products that have never been ordered.

List every product that belongs to a category, using IN.

Find customers that currently have zero orders (use EXISTS/NOT EXISTS).

Use a CTE named OrderTotals to compute the total value (qty*unit_price) for each order, then return order_id and total_value for orders worth more than 20.

Create a CTE named CntProd that lists every category name and the COUNT() of its products; then use it to show categories having fewer than 3 products.