Principal Component Analysis

Part of my thesis involves running a statistical technique known as Principal Component Analysis (PCA). PCA takes a group of statistics and, well, makes sense of them.

Actually, that’s not quite correct. PCA gives you columns of numbers called eigenvectors. The numbers are just numbers; it’s up to you to make sense of them.
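For the curious, here is a minimal sketch of where those columns come from. It assumes the eigenvectors are taken from the correlation matrix of the indicators, which is one common way to run PCA but not necessarily how my thesis software does it. The data is made up, and the function name `pca_eigenvectors` is just something I invented for illustration.

```python
import numpy as np


def pca_eigenvectors(data):
    """Return eigenvalues and eigenvectors of the correlation matrix, largest eigenvalue first.

    `data` has one row per area and one column per indicator.
    """
    corr = np.corrcoef(data, rowvar=False)            # indicator-by-indicator correlations
    eigenvalues, eigenvectors = np.linalg.eigh(corr)  # eigh returns ascending eigenvalues
    order = np.argsort(eigenvalues)[::-1]             # reorder so the largest comes first
    return eigenvalues[order], eigenvectors[:, order]


# Hypothetical data: 200 areas, 3 indicators. The third indicator is built
# to track the first one closely, so PCA should pair them up.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
data = np.column_stack([
    base[:, 0],
    base[:, 1],
    base[:, 0] + 0.1 * rng.normal(size=200),
])

eigenvalues, eigenvectors = pca_eigenvectors(data)
first_column = eigenvectors[:, 0]  # one "column of numbers", one value per indicator
print(first_column)
```

Each column of `eigenvectors` holds one number per indicator. In this toy data, the first and third indicators should carry the largest values in the first column, with the same sign, because they were built to move together.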

To illustrate the value of PCA, consider the following list. Each item in the list is a measure, or indicator, of the vulnerability that a person has in the event of an earthquake. Do they make sense?

  • Population – 65 years and older
  • Family – Single parent
  • Income – low
  • Education – High school or less
  • Unemployment
  • Renter
  • English – not spoken at home
  • Occupation – arts or service
  • Social dependence
  • Population growth
  • Family – Average size
  • Population – 14 years and under
  • Aboriginal identity

Okay, they make sense by themselves, but do they make sense as a group? Which is the most important? Are some related to each other, in the sense that when one indicator is high, another indicator is also high?

PCA takes these indicators and gives you columns of numbers. Here is one such column:

Renter -0.517039016
Family – Average size 0.421954814
Population – 14 years and under 0.409067414
Income – low -0.385866351
Occupation – arts or service -0.242320211
Unemployment -0.218579446
English – not spoken at home 0.204608869
Aboriginal identity -0.192174588
Population – 65 years and older 0.150799633
Social dependence -0.147162482
Education – High school or less -0.057844989
Family – Single parent -0.055617301
Population growth -0.047615417

Notice that I sorted the rows by size, ignoring the sign: the values farthest from zero, whether positive or negative, are at the top, and the values closest to zero are at the bottom. The top numbers show us the most important indicators. The farther a number is from zero, the more important the indicator.
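If you want to do this sort yourself, here is a small sketch in plain Python. The values are the ones from the column above; the variable names `loadings` and `ranked` are mine, chosen only for illustration.

```python
# Values copied from the column above: indicator name -> number in the eigenvector.
loadings = {
    "Renter": -0.517039016,
    "Family – Average size": 0.421954814,
    "Population – 14 years and under": 0.409067414,
    "Income – low": -0.385866351,
    "Occupation – arts or service": -0.242320211,
    "Unemployment": -0.218579446,
    "Aboriginal identity": -0.192174588,
    "Social dependence": -0.147162482,
    "Education – High school or less": -0.057844989,
    "Family – Single parent": -0.055617301,
    "Population growth": -0.047615417,
    "Population – 65 years and older": 0.150799633,
    "English – not spoken at home": 0.204608869,
}

# Sort by absolute value so that sign does not affect the ranking.
ranked = sorted(loadings.items(), key=lambda item: abs(item[1]), reverse=True)

for name, value in ranked:
    print(f"{name:<35} {value: .9f}")
```

Sorting on `abs(value)` rather than the raw value is what keeps a strongly negative indicator such as Renter at the top instead of at the bottom.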

So what do these numbers tell us? Look at the first four indicators. Renter and Income – low are both negative, while Family – Average size and Population – 14 years and under are both positive. Indicators that share a sign tend to rise and fall together, and indicators with opposite signs tend to move in opposite directions. So areas of the city that have high percentages of renters tend to be low in income, and they do not have large families or many residents 14 and under. Conversely, the numbers tell us that large families and population 14 and under go together, but not with renters.
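The same reading can be sketched in code by splitting the top four indicators into two groups by sign. This is only an illustration of the reasoning; the names `top_four`, `negative` and `positive` are mine.

```python
# The four values farthest from zero in the column above.
top_four = {
    "Renter": -0.517039016,
    "Family – Average size": 0.421954814,
    "Population – 14 years and under": 0.409067414,
    "Income – low": -0.385866351,
}

# Indicators with the same sign tend to be high (or low) in the same areas.
negative = sorted(name for name, value in top_four.items() if value < 0)
positive = sorted(name for name, value in top_four.items() if value > 0)

print("Move together:", negative)
print("Move together:", positive)
print("The two groups move in opposite directions.")
```

Which group ends up negative is arbitrary: flipping the sign of every number in the column gives an equally valid eigenvector. What matters is which indicators land in the same group.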

Are these patterns true? Let’s look at the map in several views.

[Map: Renters]
First, look at the Renters layer. This layer shows us that many people in the West End (the red part at top left), a densely packed area of 100,000 people in 100 square blocks, rent their dwelling units.

[Map: Population 14 and under]
Next, look at the Population – 14 years and under layer. This layer tells us that in the south, Population 14 years and under is high, while in the West End it is low.

[Map: Large family]
Finally, look at the Family – average size layer. This layer is similar to the Population – 14 years and under layer; the south part of Vancouver has large families, while in the West End, families are not large.

These layers confirm the patterns that the eigenvector column indicated. PCA “makes sense” of data by showing us patterns in the data. That is its value.

Now if you have been paying attention, you might say, “Whoa! Why do we need PCA when we can look at a map and get an even more accurate picture?” Well, that’s another question, one that I don’t have time to answer here. Gotta get back to my thesis.

Now where are those eigenvectors again…?