Divide one complicated SQL query into a few simpler sub-queries.
The "divide and conquer" method works well in SQL queries. If you can take a part of a query into a sub-query - do that (unless it can impair performance)! From the readability viewpoint, it's better to write a number of simple SELECT statements than mix everything into one unreadable and bugs-prone cocktail.
Let's compare two the approaches.
The situation: Country names are stored in the country table (PK country_id). Departments are located in different countries (i.e. dept table has the country_id FK). Each employee is assigned to a department (i.e. emp table has the dept_id FK).
The question: what is the country name of the employee #12345?
*** BAD code (the "all in one heap" approach), the regular tables join syntax: ***
SELECT country.country_name INTO v_country_name FROM country, dept, emp WHERE country.country_id = dept.country_id AND dept.dept_id = emp.dept_id AND emp.emp_id = 12345;
*** Another BAD code (the "all in one heap" approach), the ANSI tables join syntax: ***
SELECT country_name INTO v_country_name FROM country WHERE country_id =
(SELECT country_id FROM dept WHERE dept_id =
(SELECT dept_id FROM emp WHERE emp_id = 12345));
If the overall query has not returned the expected result (the country name), you definitely know where to start the investigation. Maybe, you will find the problem simply looking at the code since each WHERE clause is simple and you can concentrate on each table rather than "see the forest but not see the trees".
This example is very primitive (only to illustrate the idea), so breaking it down into sub-queries doesn't really make the life much easier, but SQLs in real applications can be much more complicated and consist of more tables. To make that magic possible, please use only surrogate primary keys (in short, each table should have a SINLGE primary key field).
There is also third method - SELECT each time INTO a variable and pass that variable to the subsequent query as retrieval argument. It is the best practice in terms of readability and ease of debugging:
*** Another GOOD code (using interim variables): ***
SELECT dept_id INTO v_dept_id -- this variable will be used as the parameter of the next query FROM emp WHERE emp_id = 12345;
SELECT country_id INTO v_country_id -- this variable will be used as the parameter of the next query FROM dept WHERE dept_id = v_dept_id;
SELECT country_name FROM country WHERE country_id = v_country_id;
This approach should be used when each query is complex (not as straightforward as in the example ). The advantages:
1. Clarity of presentation - it's easy to see what happens.
2. You can check each variable for NULL and raise an error if any of them failed to be obtained - in accordance with the best practices of defensive programming. In the previous two methods, you can validate only the final result (v_country_name).
3. If the final result is wrong (even though NOT NULL), you can see in the debugger which step exactly was incorrect. In the real life, you can investigate issues with very complicated SELECTs involved, so you want the bug search area to be as narrowed down as possible.
The last method (with interim variables) should be used only within stored procedures. Using it in SQL, embedded in a layer other than DB, will worsen performance because the application travels twice (to the database server and back) for eachSELECT (it means 6 trips instead of 2 in the given example).
Embed a sub-query in the SELECT statement.
Another useful life-hack is to embed a sub-query in the SELECT statement (so, that sub-query will supply the value for exactly one field of the main SELECT). Since such a sub-query will be invoked once per each record of the recordset, it's named a correlated sub-query (in contrast to regular one in the previous example, which is invoked only one time regardless the number of rows, retrieved by the main SELECT). That approach is very handy when you need to grab the description of an application code, stored in a catalog table (like order status), for example:
*** BAD code (the "all in one heap" approach): ***
SELECT orders.order_id, orders.order_date, order_status.order_status_desc FROM orders, order_status WHERE order_status.order_status_id = orders.order_status_id AND orders.order_id = 12345;
*** GOOD code (the main query stays very primitive): ***
SELECT order_id, order_date, (SELECT order_status_desc FROM order_status WHERE order_status.order_status_id = orders.order_status_id) AS order_status_desc FROM orders WHERE order_id = 12345;
Here is a real-life example. Imagine which mess would you see in the WHERE clause if it will be one complicated query rather than five simple!
SELECT co_cd, prov_cd, (SELECT prod_desc FROM cd_prod WHERE cd_prod.prod_cd = br_item_cov_par.prod_cd) AS prod_desc, (SELECT plan_desc FROM cd_plan WHERE cd_plan.plan_cd = br_item_cov_par.plan_cd) AS plan_desc, (SELECT item_ins_desc FROM cd_item_ins WHERE cd_item_ins.item_ins_typ = br_item_cov_par.item_ins_typ) AS item_ins_desc, (SELECT cov_desc FROM cd_cov WHERE cd_cov.cov_cd = br_item_cov_par.cov_cd) AS cov_desc, Count(1) AS row_count FROM br_item_cov_par WHERE row_sta = 'A' AND tbl_exp_dt IS NULL GROUP BY co_cd, prov_cd, prod_cd, plan_cd, rider_no, item_ins_typ, cov_cd
Caution! The method of breaking of one complicated query into simpler sub-queries must be used carefully when you work with tables having a huge amount of records. The method really makes the code more elegant and easy, but you should make sure it doesn't affect the execution speed, especially when you deal with correlated sub-queries (just to remind - they are executed for each row, returned by the main SELECT). If you have any concern regarding performance, check the execution plan. There is a good chance that the smart database optimizer will build the same plan for both the methods (the "braking into sub-queries" and the "everything in one unreadable heap").
From the book "OCA Oracle DB 12c SQL Fundamentals I":
Star Transformation
An extension of the use of subqueries as an alternative to a join is to enable the star transformation often needed in data warehouse applications. Consider the large SALES table in the demo SH schema used for recording sales transactions. Each record captures a particular product sold to a particular customer through a particular channel. These attributes are identified by lookup codes used as foreign keys to dimension tables with rows that describe each product, customer, and channel. To identify all sales of an item called “Comic Book Heroes” to customers in the city of Oxford through Internet orders, one could run the following query:
SELECT count(quantity_sold) FROM sales s, products p, customers c, channels ch WHERE s.prod_id=p.prod_id AND s.cust_id=c.cust_id AND s.channel_id=ch.channel_id AND p.prod_name='Comic Book Heroes' AND c.cust_city='Oxford' AND ch.channel_desc='Internet';
This query uses the WHERE clause to join the tables and then to filter the results. The following is an alternative query that will yield the same result:
SELECT count(quantity_sold) FROM sales WHERE prod_id IN (SELECT prod_id FROM products WHERE prod_name='Comic Book Heroes') AND cust_id IN (SELECT cust_id FROM customers WHERE cust_city='Oxford') AND channel_id IN (SELECT channel_id FROM channels WHERE channel_desc='Internet');
The rewrite of the first statement to the second is the star transformation. Apart from being an inherently more elegant structure (most SQL developers with any sense of aesthetics will agree with that), there are technical reasons why the database may be able to execute it more efficiently than the original query. Also, star queries are easier to maintain; it is very simple to add more dimensions to the query or to replace the single literals ('Comic Book Heroes', 'Oxford', and 'Internet') with lists of values.
Simplify the query using an inline view (derived table) defined in the WHERE clause
An inline view, also known as a derived table or subquery factoring, is a virtual table created within the context of a larger SQL query. It allows you to use the result of a subquery as a temporary table in the main query:
SELECT t2.column1, view1.column2 FROM (SELECT column1 /* needed to allow the inline view to be joined */, column2 /* to be SELECTed */ FROM table1 WHERE condition) AS view1, table2 AS t2 WHERE t2.column1 = view1.column1;
Note that the view is absolutely standalone - its SELECT can be run independently. If you will reference any element of the outer SELECT in the inline view, you will get an error.
Inline views are commonly used to:
• Simplify complex queries by breaking them down into smaller, more manageable parts. • Avoid code duplication.
Please see an example of avoiding code duplication below (it's my real code).
Firstly, the "SELECT Sum..." subquery was used only once (the 1st occurrence), so everything was fine. But later I copy-pasted it to a couple of other places, and got this code:
,(SELECT Sum(p2.sum__writ_prem__orig - p2.sum__writ_prem__fixed) -- ############## the 1st occurrence FROM cs_fix__pol_ver AS p2 WHERE p2.pol_no = p.pol_no AND p2.pol_term_eff_dt = p.pol_term_eff_dt) AS term_diff
,CASE WHEN (SELECT Sum(p2.sum__writ_prem__orig - p2.sum__writ_prem__fixed) -- ############## copy-pasted here... :-( FROM cs_fix__pol_ver AS p2 WHERE p2.pol_no = p.pol_no AND p2.pol_term_eff_dt = p.pol_term_eff_dt) > 0 THEN 'X' ELSE NULL END AS term_overcharged FROM cs_fix__pol_ver AS p WHERE (SELECT Sum(p2.sum__writ_prem__orig - p2.sum__writ_prem__fixed) -- ############## ...and here :-( FROM cs_fix__pol_ver AS p2 WHERE p2.pol_no = p.pol_no AND p2.pol_term_eff_dt = p.pol_term_eff_dt) <> 0
SELECT DISTINCT p.pol_no ,p.pol_term_eff_dt ,term_view.diff_amt AS term_diff ,CASE WHEN term_view.diff_amt > 0 THEN 'X' ELSE NULL END AS term_overcharged FROM cs_fix__pol_ver AS p ,(SELECT pol_no -- ############## refactored into an inline view, no code duplication! :-) ,pol_term_eff_dt ,Sum(sum__writ_prem__orig - sum__writ_prem__fixed) AS diff_amt FROM cs_fix__pol_ver GROUP BY pol_no, pol_term_eff_dt) AS term_view WHERE term_view.pol_no = p.pol_no AND term_view.pol_term_eff_dt = p.pol_term_eff_dt AND term_view.diff_amt <> 0
If any field of the view is computed (i.e. it represents an expression) then an alias is a must ("AS diff_amt"). The derived table's fields are used in the regular way - as if it would be a normal table, not a view. As you see, some of the fields (pol_no and pol_term_eff_dt) are needed only for the sake of joining the view with other tables.