by Ursego » 22 Feb 2013, 08:10
Each table should have a SINLGE-field primary key.
Field(s), used as primary keys in tables of major entities, should be simple numerators (1, 2, 3 etc.), with no other real-life context (like Driver Licence number or SIM card number) - even if you are sure their values are unique and mandatory. Keys, having no other context besides being records' unique identifiers, are named "surrogate (synthetic) keys" - in contrast to "natural keys".
For catalog tables (like "order status" or "customer type"), mnemonic codes can be used (like "A" or "ACT" for Active, "D" or "DEL" for Deleted etc.) to make viewing data in tables and variables values in debugger more enjoyable.
Let's say, you are creating an information system for an existing company where each worker has an employee number. Don't use that number as the employees table's PK! Instead, create a field like emp_id. In addition, you can create a field to store the existing employee number (with NOT NULL and UNIQUE constraints if needed) - as well as for other "real-life" fields like Social Insurance Number. The database theory calls such fields "candidate keys" - they are like candidates to be elected as primary keys, but don't allow them be elected!
Many data architects don't realize which problems are waiting for developers if natural keys are used.
PROBLEM #1: combination of many fields (sometimes 5-8!!!) in a primary key. As a result, the developers are forced to:
1. Declare and pass between objects multiple variables or even whole structures/classes while ONE PRIMITIVE VARIABLE would be enough to carry out the work - to identify the record.
2. Write SQL queries with extra-inflated, haircurling WHERE clauses having massifs of additional lines written only for the sake of joining tables. Important business conditions, restricting the retrieved/updated population, can simply be lost in that mess - bugs are welcome!
PROBLEM #2: the information system needs to be changed to support a change in the business. Examples of such a situation:
1. The company changes employees numbering after a merge with another company.
2. Duplicate Person ID cards numbers found (that happened in one of the countries I have worked in).
3. An entity was defined incorrectly during the initial system analysis. Later, when a part of the system has already been developed, the correct solution is found, and it is related to the real-life fields, used as PKs.
In any of these cases, you will change the functionality (relationships between tables as well as database and application objects) spending a lot of expensive time to rebuild and retest the system. But if you have used surrogate PKs then you don't need to perform that time-consuming and bugs-prone work. You are OK because the primary key's meaning has not been changed: it was a primitive numerator yesterday, and it will stay the same forever, surviving any business changes in the future!
I wonder why in developers' forums there are discussions which primary keys - surrogate or natural - should be used. We have always to think what will happen in the worst case, and I have described above what will happen if you use a natural key - a lot of headache and more chances of bugs. But what will happen in the worst case if surrogate keys are used? In that case the table will have one extra field. Not a big trouble.
What regarding the PKs of the "many" side tables in one-to-many relations?
The rule is applicable. For example, the "order_item" table should have the order_item_id PK (which is the numerator through the whole table), while order_id should be a regular field (even though NOT NULL and a FK).
Don't make a Foreign Key a part of a complex Primary Key!
It's a bad idea to make the PK which is the combination of order_id and an additional numerator, unique per order (like items 1, 2, 3 for order 1, then items 1, 2, 3 for order 2 etc.). It's easier to write (and understand)
- Code: Select all
WHERE order_item_id = 6
than
- Code: Select all
WHERE order_id = 2 AND item_num = 3
Of course, never make a date field a part of a PK as well! For example, if insurance policies have versions, created in different dates, make the PK policy_version_id rather than a combination of policy_id and version_date!
What regarding the PKs of many-to-many tables?
An entity is whatever to which a table is dedicated in the DB.
There can be 3 kinds of entities:
1. Thing (employee, address, vehicle, order, order item etc.).
2. Event (for example, an employee comes to work, or a car goes through the Canada-USA border, or a patient gets an injection).
3. Many-to-many relation between other entities (like placement of a student in a course, or link between customers and their addresses).
The rule of single-field surrogate PK is NOT mandatory for many-to-many entities.
However, you (or the data architect) can decide to apply it if the table's data will be retrieved and worked with in code. For example, the "student_placement" table can include more fields then just student_id and course_id - for example, student's mark. In such a case, a surrogate numerator student_placement_id can be created, which will simplify the standard DB operations. In that case, you should make both the FKs NOT NULL, and their combination - UNIQUE. That means, if the table is going to be huge, it makes sense to use the combination of student_id and course_id as the PK (rather than student_placement_id) from performance perspective (to eliminate the housekeeping of an extra index).