Home

---

Some Considerations for Assigning a Primary Key in a Database Table

 

Preamble

 

Database developers must decide on how they will allocate primary keys in their database tables.

 

There are many developers who have strong beliefs about this or that way to do it.  In particular, concerning the usage of artificial or surrogate fields.

 

There simply is no consensus on that question, and a definitive statement on it is therefore not possible.  I will certainly reveal my own preferences, but it is not my primary intention here to argue the relative merits of natural versus surrogate keys.

 

Instead, the focus here is on attempting to clarify the questions that the developer needs to take into account, and to dispel some of the confusion that I often see around this subject.

 

Bottom Line

 

In general, each table in a relational database should have a primary key assigned.  This is a basic database design principle.

 

The primary key is made up of one or more fields in the table. The value of the data entered into the primary key must be unique across all records.  The primary key serves as the basis of Relationships with other tables, often as the basis of joins in queries, and other operational functions within the application.

 

Options

 

The developer is free to use whatever they want as the primary key.

 

The purpose of a primary key in a table is to ensure that each record is uniquely identified. If there is already a field, or combination of fields, in the table whose data uniquely identifies each record, then there is absolutely no reason why this should not serve as the primary key for the table.  This is called a “natural key”.  So, this is one option.

 

Another option is adding a field to the table, to contain meaningless record identification data, and assign this as the primary key.  Access provides the AutoNumber data type, which lends itself to this purpose, or otherwise there are various methods of using procedures within the application to generate ID numbers etc.

 

In my opinion, both the above are valid and useful approaches when used appropriately. Which approach is appropriate under various circumstances is a judgement call that becomes easier with experience.

 

Normalisation

 

Simply stated, using an AutoNumber primary key violates normalisation rules when there is a valid natural key available in the data.

 

The choice of an AutoNumber or other artificial data for use as primary key is based on other considerations, and really has nothing to do with normalisation, or data integrity, or any data modelling principles.

 

Characteristics of Candidate Data

 

It is apparent that in almost all tables you would find a field or combination of fields whose values are unique across all rows.  It is also apparent that a primary key should be natural wherever possible.  In practice, however, many developers resort to the use of surrogate keys.  Why is this?

 

It generally comes down to a consideration of the characterisitcs of any natural data which could be a candidate for primary key status.  These are some of the factors:

  • Can the records be uniquely identified by the data in a single field, or does it require a combination of fields?
  • How short and simple is the candidate primary key data?
  • How stable is the candidate primary key data?
  • Is the table storing core operational data?  Or is it a simple lookup table?  Or is it a junction table to resolve a many-to-many relationship?

 

These factors define a continuum, and most developers eventually get themselves comfortable somewhere along that continuum, at the point that defines where they draw the line between using natural or surrogate primary keys.

 

Some take the view that only if the data itself presents no genuine primary key, should one resort to a surrogate key (either an AutoNumber or a derived sequence number).

 

At the other end of the spectrum are those who routinely put an AutoNumber field in almost every table... though probably most have a point at which they make an exception.

 

For example, in the case of simple lookup tables, one field only, by definition unique values in that field, the purpose of which is to simply provide the Row Source for comboboxes in order to expedite and validate data entry into a core table, typical examples being Categories or Colours – in those cases I expect most would make an exception.

 

I once saw a database that included a table of US states.  In this case, the usual two fields – the name of the state, and the two-letter state abbreviation code.  The data in both of these fields are unique, of course, as well as being simple, short, and stable.  The state abbreviation, in other words, meets all the criteria for making an ideal primary key.  But no, in this example there was also an AutoNumber field StateID defined as the primary key!  Quite bizarre.

 

Convenience and Performance

 

In many tables there is a single field that contains simple unique data. Membership numbers, registration numbers, serial numbers, product codes, the list is endless.  This also applies to the example of lookup tables previously mentioned.

 

In such cases, there are many advantages to using these fields as the primary keys, and surrogate AutoNumbers are completely unnecessary.  Queries become simpler, sometimes hugely so, and therefore more efficient, as well as reducing development time.  Commonly used UI approaches such as cascading comboboxes become simpler.  Data becomes more comprehensible.  And so on.

 

Developers who use AutoNumber fields even in these circumstances have said that they don’t have to stop and think about it on a table by table basis.  Just whack in an AutoNumber and you are assured you have a unique identifier.  I have seen it claimed that this makes the development process faster because it is convenient.  Well, that depends how quickly you think, I suppose.

 

The only real downside is that the database may be bigger, because stored foreign key values are likely to be bigger with natural keys.  Maybe in a very large database this may need to be taken into account.

 

But the balance starts to shift if the data is likely to be subject to frequent change, for example.  Access provides for this to be easily handled, via enforcing Cascade Updates when defining Relationships with Referential Integrity.  So the occasional renaming of a category or some such will be fine, but some developers might feel that data instability is an indication for a surrogate key.

 

Similarly, it seems more manageable and efficient if the primary key data is short, and does not contain special characters.  The full botanical names for plants, for example, might feel a bit complex to be using as the basis of joins between tables, and some developers might feel that this would be an indication for a surrogate key.  The operative word here is “seems” – the distinction is more psychological than structural.

 

The real cut-off point for many developers, however, comes when the natural data will only be unique on the basis of a combination of fields.  In practice, this scenario often occurs.  A natural key would involve a composite (multi-field) primary key.  Which means multiple joins between tables in queries, some data manipulation code becomes more complex (sometimes considerably so), more complex query criteria are sometimes required, etc.  In other words, the balance of advantage shifts towards the use of a surrogate key such as an AutoNumber field.

 

Logic Lapses

 

One can often see a recommendation to put an AutoNumber field in a table, that seems to be based on something like this type of reasoning:

 

An AutoNumber is essentially meaningless data that therefore should not be exposed to the user.

If we had an AutoNumber, it would be the primary key.

Therefore the primary key should not be exposed to the user.

There is already a field in the table that uniquely identifies each record.

But this data has meaning, and thus must be exposed to the user.

Therefore it shouldn’t be the primary key.

Therefore we need an AutoNumber field to be the primary key.

 

Suffice it to say that this is not an example of sound logic!

 

Here’s another one, somewhat related:

 

A primary key’s purpose is to be a unique idenifier.

Therefore its data can’t have any meaning.

An AutoNumber has no meaning.

Therefore a primary key must be an AutoNumber.

 

Again, obviously the logic doesn’t hold up.  There is absolutely no reason why a field with meaningful data can’t be the primary key.

 

Tricks and Traps

 

1. Of course, in order to use natural data as primary key, you have to be certain that the data will be unique and universal.  That means stop and think, as occasionally this is not as easy as it first appears.  The classic example is developers in the US using a SSN (Social Security Number) as a primary key.  In theory fine, but my American friends assure me that in practice, for various reasons both legal and illegal, there are people who do not have a SSN, and there are also plenty of cases of duplicated numbers floating around.  So, SSN is not a good candidate.

 

2. There are a number of ways of providing a computer-generated number to uniquely identify records.  Using the AutoNumber data type is one method, but there are other approaches that are commonly employed.  Sometimes these numbers end up having a meaning within the business rules of the application.  Invoice numbers and order numbers are common examples.  If the user cares that there may sometimes be gaps in the numbering, AutoNumbers are not suitable in such instances.

 

3. If an AutoNumber or other artificial primary key field is introduced, because the data requires a combination of natural fields to ensure uniqueness, there will still be a requirement for the data in that combination of existing fields to be unique.  In many cases, therefore, it may be advisable or necessary to set a unique index on that composite of fields, and/or write code to validate this data.

 

Conclusion

 

There are a number of factors to consider when assigning a primary key to a database table.

 

There is no absolute right or wrong, as evidenced by the fact that no matter which approach you choose, you can find experienced, professional Access developers who do the same.

 

Taking into account all the considerations, the following is what I presently consider to be the most rational and manageable approach:

 

 

Steve Schapel