by Ursego » 22 Feb 2013, 08:10
A field, used as a primary key of an application table (i.e. not a catalog table like a status or an entity type), should be a simple numerator (1, 2, 3 etc.), with no other meaning or context (like Driver Licence number or SIM card number) - even if you are 100% sure their values are unique and mandatory.
Keys, having no other meaning or context besides being records' unique identifiers in the given table, are named "surrogate (synthetic) keys" - in contrast to "natural keys".
Let's say, you are creating an information system for an existing company where each worker has an employee number. Don't use that number as the employees table's PK! Instead, create a field like emp_id. In addition, you can create a field to store the existing employee number (with NOT NULL and UNIQUE constraints if needed) - as well as for other "real-life" fields like Social Security Number. The database theory calls such fields "candidate keys" - they are like candidates to be elected as primary keys, but don't allow them to win the elections!
However, in catalog tables (such as "order status" or "customer type"), mnemonic codes can be used as the PK values (like "A" or "ACT" for Active, "D" or "DEL" for Deleted etc.). That is even recommended - to make viewing data in tables and debugger more enjoyable.
Now we come to another rule:
Each table should have a SINLGE-field primary key.
In other words:
Don't make a Foreign Key a part of a complex Primary Key!
Even if your table is the "many" side of the "one-to-many" relation, create a single-field PK, and another non-PK field which is the pointer (FK) to the "one"-side table. For example, the tables are REPORT (PK is REPORT_ID) and REPORT_LINE. You are deciding what the PK of REPORT_LINE should be. Probably your first wish would be using the combination of REPORT_ID and a numerator which is unique per REPORT_ID (like REPORT_LINE_NUM):
1 - 1;
1 - 2;
1 - 3;
2 - 1;
2 - 2;
2 - 3;
But the REPORT_ID field does have "other meaning or context" - it's the identifier of another table! Instead, create the single field REPORT_LINE_ID as the PK, and make the REPORT_ID a mandatory not-null FK:
1; 1
2; 1
3; 1
4; 2
5; 2
6; 2
REPORT_LINE is a full-fledged entity (no worse than REPORT), so why would you want to deny it the privilege of having its own identity field?
In SQL statements (especially complex ones), it's easier to write (and understand)
- Code: Select all
WHERE REPORT_LINE_ID = 3
than
- Code: Select all
WHERE REPORT_ID = 1 AND REPORT_LINE_NUM = 2
In addition to that, these tables will be able to utilize the standard PK generation mechanism rather then ugly SELECTing Max(REPORT_LINE_NUM) and adding 1 to it (or assigning 1 if it's NULL).
Of course, never make a date field a part of a PK as well! For example, if insurance policies have versions, created in different datetimes, make the PK policy_version_id rather than a combination of policy_id and version_dt!
Do you want troubles?
Many data architects don't realize which difficulties are waiting for developers if natural keys are used. Let's list them:
PROBLEM #1: combination of many fields (sometimes 5-8!!!) in a primary key. As a result, the developers are forced to:
1. Declare and pass between objects multiple variables or even whole structures/classes while ONE PRIMITIVE VARIABLE can do the work - to identify the record.
2. Write SQL queries with extra-inflated, ugly and confusing WHERE clauses having massifs of additional lines written only for the sake of joining tables. Important business conditions can simply be lost in that mess - bugs are welcome!
PROBLEM #2: the information system needs to be changed to support a change in the business. Examples of such a situation:
1. The company changes employees numbering after a merge with another company.
2. Duplicate Person ID cards numbers found (that happened in one of the countries where I have worked).
3. An entity was defined incorrectly during the initial system analysis. Later, when a part of the system has already been developed, the correct solution is found, and it is related to the real-life fields, used as PKs. An example from a telephone company where I happened to work. A customer buys a contract for two phones, one for husband and one for wife. But what is a phone? Phone number, IEMI (a global device instance identifier), SIM card number? Suppose the client changed the phone to a new model, with a new IEMI, and inserted the old SIM card. This means that the device itself is no good as the entity, because the SIM card is the "brain" of the phone. But it could also be lost along with the device, and then another device is bought for the same phone number. Fine, the SIM card number is our PK! But no shit - a person can change the phone number on the old device and the same SIM card. In this situation, the correct entity is the subscriber. It can change its different attributes when and as much as it wants! But before reaching this truth, the architects of the system created a bunch of tables with long and ugly PKs instead of just subscriber_id!
In any of these cases, you will will be forced to change the functionality (relationships between tables as well as database and application objects) spending a lot of expensive time to rebuild and retest the system. But if you have used surrogate PKs then you don't need to perform that time-consuming and bugs-prone work. You are OK because the primary key's meaning has not been changed: it was a primitive numerator yesterday, and it will stay the same forever, surviving any business changes in the future!
I wonder why in developers' forums there are discussions which primary keys - surrogate or natural - should be used. We have always to think what will happen in the worst case, and I have described above what can happen if you use a natural key - a lot of headache and more chances of bugs. But what will happen in the worst case if surrogate keys are used? In that case the table will have one extra field. Not a big trouble.