This may be beating a dead horse, but I thought it would be fun to examine the question of data centering, or mean subtraction, for principal component analysis (PCA). So, I created a program that does a side by side comparison of PCA on simple noise with proper averaging and with Michael Mann styled improper averaging.
This was motivated by Steve McIntyre’s observation…
“We [Steve McIntyre and Ross McKitrick] also observed that they[Michael E. Mann, Raymond S. Bradley and Malcolm K. Hughes] had modified the principal components calculation so that it intentionally or unintentionally mined for hockey stick shaped series. It was so powerful in this respect that I could even produce a HS from random red noise.”
The basic idea of principal component analysis (PCA)
PCA is used to determine the minimum number of factors that explain the maximum variance in multiple data sets. In the case of the hockey stick, each data set represents a chronological set of measurements, usually a tree ring chronology. These chronologies may vary over time in similar ways, and in theory these variations are governed by common factors. The single factor that explains the greatest amount of variance is the 1st principal component. The factor that explains the next greatest amount of variance is the 2nd principal component, and so on. In the case of the hockey stick, the first principal component is assumed to be the temperature. With this assumption, understanding how the first principal component changes with time is the same as understanding how the temperature changes with time.
PCA is a method to extract common modes of variation from a set of proxies.
The following bullets give a brief explanation of the mathematical procedure for determining the principal components. See the tutorial by Jonathon Shlens at New York University’s Center for Neural Science for a nicer, more detailed explanation.
- Start with m data sets of n points each. For example, m tree ring chronologies each covering n years.
- Calculate the mean and standard deviation for each of the m data sets.
- Subtract each mean from its corresponding data set. This is called centering the data.
- Normalize each data set by dividing it by its standard deviation.
- Create an m x n data matrix where each of the m rows has n data points, say, one point per year.
- Calculate the covariance for each possible pair of data sets by multiplying the data matrix by its own transpose, yielding a square, m x m, symmetric covariance matrix. (Strictly speaking, a covariance matrix also involves dividing by n − 1, but this uniform scaling does not change the eigenvectors.)
- Find the eigenvalues and eigenvectors of the covariance matrix.
- Multiply the eigenvector corresponding to the largest eigenvalue by the original m x n data matrix to get the 1st principal component. Do the same with the eigenvector corresponding to the second largest eigenvalue to get the 2nd principal component, and so on.
- The magnitude of the eigenvalue tells the amount of variance that is explained by its corresponding principal component.
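The steps above can be sketched in a few lines of NumPy. This is a minimal illustration of the procedure as described in the bullets (covariance matrix left unscaled, as in step 6), using toy white-noise data rather than real proxies:

```python
import numpy as np

rng = np.random.default_rng(0)

m, n = 5, 100                      # m proxies, n years each
X = rng.normal(size=(m, n))        # toy data: m rows of n points

# Steps 2-4: center each row on its own mean, scale by its std. dev.
X = X - X.mean(axis=1, keepdims=True)
X = X / X.std(axis=1, keepdims=True)

# Step 6: m x m covariance matrix (unscaled, as in the text)
C = X @ X.T

# Step 7: eigendecomposition; eigh returns eigenvalues in ascending order
vals, vecs = np.linalg.eigh(C)
order = np.argsort(vals)[::-1]     # re-sort descending
vals, vecs = vals[order], vecs[:, order]

# Step 8: principal components = eigenvectors applied to the data matrix
pcs = vecs.T @ X                   # row k is the (k+1)-th principal component

# Step 9: each eigenvalue's share of the total variance
print(vals / vals.sum())
```

Because each normalized row has a sum of squares of exactly n, the eigenvalues always add up to m times n, so dividing by their sum gives the fraction of variance each component explains.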
The data centering, or mean subtraction (step 3 in the list above), is where one of the hockey stick controversies arises. McIntyre and McKitrick showed that Mann did not subtract the mean of all of the points in each roughly 1000-year data set, but rather subtracted the mean of only the last 80 or 90 points (years). They claim that this flawed process would yield a 1st principal component that looks like a hockey stick, even when the proxy data was made up of simple red noise.
Here is an explanation of Mann’s error in mathematical and graphical formats:
Let every proxy be given a number. Then the 1st proxy is P1, the second proxy is P2, and the jth proxy is Pj.
Each proxy is made up of a series of points representing measurements in chronological order. For a particular proxy, Pj, the ith point is denoted by Pji.
Here are two synthesized examples of proxy data. We can call them Pj and Pk.
We center and normalize each data set by subtracting its average from itself, and then dividing by its standard deviation. We can call the new re-centered and normalized data Rj and Rk. Rj and Rk have the same shape as Pj and Pk, but they are both centered around zero and vary between about plus and minus 2.5, as shown below…
Some words about covariance
The covariance, σ, of two data sets, or proxies, is a measure of how similar their variations are. If the shapes of two properly re-centered and normalized data sets, say Rj and Rk are similar, their covariance will be relatively large. If their shapes are very different, then their covariance will be smaller. It is easy to calculate the covariance of two data sets: simply multiply the corresponding terms of each data set and then add them together…
σjk = Rj1Rk1 + Rj2Rk2 + … + RjnRkn = Σ RjiRki
If Rj and Rk are exactly the same, then σjk will be n, exactly the number of points in each data set (for example, the number of years in the chronology). This is a consequence of the data sets being centered and divided by their standard deviations. If Rj and Rk are not exactly the same, then σjk will be less than n. In the extreme case where Rj and Rk have absolutely no underlying similarity, then σjk could be zero.
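This behavior is easy to check numerically. The short sketch below (my own illustration, not from the original demo) centers and normalizes two white-noise series and computes the covariance sum: a series with itself gives exactly n, while two unrelated series give something much smaller:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100

def center_normalize(p):
    """Subtract the mean and divide by the (population) standard deviation."""
    return (p - p.mean()) / p.std()

Rj = center_normalize(rng.normal(size=n))
Rk = center_normalize(rng.normal(size=n))

sigma_jj = np.sum(Rj * Rj)   # a series with itself: exactly n
sigma_jk = np.sum(Rj * Rk)   # two unrelated series: much smaller than n

print(sigma_jj, sigma_jk)
```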
Some words about noise
Two data sets of totally random, white noise will approach this extreme case of no underlying similarity. So their covariance, σ, will be very small. This is easy to understand when you consider that, on the average, each pair of corresponding points in the covariance calculation whose product is positive will be offset by another pair whose product is negative, giving a sum that tends to zero. But there are other types of noise, such as red noise, which exhibit more structure and are said to be “autocorrelated.” In fact, the two data sets shown above, Pj and Pk (or Rj and Rk), are red noise. The covariance of two red noise data sets will be less than n, and if there are enough data points in each set (Rj and Rk each have 100 points) then σjk will likely be small.
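One common way to synthesize red noise (and the one I assume here; the original demo may differ in details) is an AR(1) process, where each point is a fraction of the previous point plus fresh white noise. The autocorrelation parameter rho controls how much structure the series has:

```python
import numpy as np

def red_noise(n, rho, rng):
    """AR(1) series: each point is rho times the last plus white noise."""
    x = np.empty(n)
    x[0] = rng.normal()
    for i in range(1, n):
        x[i] = rho * x[i - 1] + rng.normal()
    return x

rng = np.random.default_rng(2)
white = red_noise(100, 0.0, rng)   # rho = 0 gives ordinary white noise
red = red_noise(100, 0.9, rng)     # rho near 1 gives structured red noise

# The mean of a short sub-interval of red noise can wander far from
# the full-series mean; for white noise it tends to stay near zero.
print(red[-20:].mean() - red.mean())
```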
One of the important differences between white noise and red noise for this discussion is that the average of white noise over short sub-intervals of the entire data set will be close to zero. But that will not necessarily be the case for red noise, which you can visually confirm by looking at the plots of Rj and Rk, shown above.
What about incorrect centering
McIntyre and McKitrick found that Mann improperly performed step 3, the subtraction of the average from each data set. Instead, Mann subtracted the average of only the last 80 or 90 points (years) from data sets that were about 1000 years long. For most sets of pure white noise this approach has little effect, because the average of the entire data set and the average of a subset of the data set are usually nearly identical. But for red noise, improper centering tends to have a much greater effect: because of the structure inherent in red noise, the average of a subset may be very different from the average of the entire data set.
Here is what the two data sets, shown above, look like when they are improperly centered using the mean of only the last 20 points out of 100 points…
What effect does improper re-centering (for example, subtracting the average of only the last 80 years of 1000-year data sets) have on the covariance of two data sets? Let’s call the improperly re-centered and normalized data R’j and R’k, where R’j = Rj + Mj , R’k = Rk + Mk , Rj and Rk are correctly centered, and Mj and Mk are the additional improper offsets. Then the improper covariance, σ’jk, between R’j and R’k is given by…
σ’jk = Σ R’jiR’ki
= Σ (Rji + Mj)(Rki + Mk)
= Σ RjiRki + Σ RjiMk + Σ RkiMj + Σ MjMk
= Σ RjiRki + Mk Σ Rji + MjΣ Rki + n MjMk
But remember, Rj and Rk are properly centered, so Σ Rji and Σ Rki each equal zero. So…
σ’jk = Σ RjiRki + n MjMk
And Σ RjiRki = σjk. This leaves…
σ’jk = σjk + nMjMk
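This identity is easy to verify numerically. The sketch below (my own check, with arbitrary made-up offsets Mj and Mk) computes the improper covariance directly and compares it with σjk + nMjMk:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100

# Properly centered, normalized series
Rj = rng.normal(size=n); Rj = (Rj - Rj.mean()) / Rj.std()
Rk = rng.normal(size=n); Rk = (Rk - Rk.mean()) / Rk.std()

Mj, Mk = 0.7, -0.4                 # arbitrary improper offsets

sigma = np.sum(Rj * Rk)                        # proper covariance
sigma_prime = np.sum((Rj + Mj) * (Rk + Mk))    # improper covariance

# The two quantities below agree, confirming sigma' = sigma + n*Mj*Mk
print(sigma_prime, sigma + n * Mj * Mk)
```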
In cases where the product of Mj and Mk has the same sign as σjk, the absolute value of σ’jk will be larger than the absolute value of σjk. This falsely indicates that R’j and R’k have variations that are more similar than they really are. The PCA algorithm will then give a higher weight to these proxies in the eigenvector that is used to construct the first principal component.
Mann Averaging Error Demo
I have written a piece of code that demonstrates the effect of Mann’s centering error. This code is written in LabVIEW version 7.1. You can get my source code, but you will need the LabVIEW 7.1 Full Development System or a later version that contains the “Eigenvalues and Vectors.vi” in order to run it. You can also modify the code if you desire.
This demo generates synthetic proxies of red noise with autocorrelation between 0.0 and 1.0. If the autocorrelation is set to 0.0, then white noise is generated. If the autocorrelation is set to 1.0, then brown noise is generated. Principal component analysis is then performed on these proxies two different ways: with proper centering and with Mann style improper centering.
Each of the synthetic proxies is shown on the top plot in sequence. After the proxies are synthesized, PCA is performed and the eigenvalues and principal components are shown in sequence. Eigenvalues and principal components for correct centering are shown on the right. Eigenvalues and principal components for improper centering are shown on the left.
After all the principal components have been shown, the synthetic proxy graph at the top of the window defaults to the first proxy, and the principal component windows at the bottom default to the 1st principal component. The operator then has the opportunity to examine the individual proxies by changing the number in the yellow “View Proxy #” box and to examine the principal components by changing the number in the yellow “View Principal Component #” box.
The operator can also select the “Overlay of all Proxies” tab at the top right corner of the window to see all proxies overlaid before centering.
This demo lets the operator select the following parameters…
- Number of data points (years) per proxy. The default is set to 1000, but you can make this anything you want.
- Number of proxies. The default is 70, but you can select anything you want.
- Autocorrelation. Set to 0.0 for white noise, between 0.0 and 1.0 for red noise, and 1.0 for brown noise. The higher the autocorrelation is set, the more random structure there will be in synthetic proxies. The default is 0.98, giving highly structured noise, but you can experiment with other settings.
- Number of years to average over. This determines how many years are used for the improper Mann style centering. The default is set to 80, because this is approximately what Mann used.
- Include/Don’t Include Hockey Stick Proxy. When “Don’t Include Hockey Stick Proxy” is selected, all proxies are noise. When “Include Hockey Stick Proxy” is selected, the first proxy will have a hockey stick shape superimposed on noise; all other proxies will be noise. The default setting is “Don’t Include Hockey Stick Proxy.”
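For readers without LabVIEW, the core of the experiment can be sketched in Python using the default parameters above (1000 years, 70 proxies, autocorrelation 0.98, centering on the last 80 years). This is a simplified sketch under my own assumptions (AR(1) red noise, normalization by the full-series standard deviation); it is not a reproduction of the LabVIEW demo or of Mann’s exact algorithm:

```python
import numpy as np

def red_noise(n, rho, rng):
    """AR(1) red noise: each point is rho times the last plus white noise."""
    x = np.zeros(n)
    for i in range(1, n):
        x[i] = rho * x[i - 1] + rng.normal()
    return x

def pc1(X, avg_last=None):
    """First principal component. avg_last=None means proper centering;
    otherwise center on the mean of only the last avg_last points."""
    if avg_last is None:
        mu = X.mean(axis=1, keepdims=True)
    else:
        mu = X[:, -avg_last:].mean(axis=1, keepdims=True)
    Xc = X - mu
    Xc = Xc / Xc.std(axis=1, keepdims=True)
    vals, vecs = np.linalg.eigh(Xc @ Xc.T)   # ascending eigenvalues
    return vecs[:, -1] @ Xc                  # PC of the largest eigenvalue

rng = np.random.default_rng(4)
years, proxies, rho = 1000, 70, 0.98
X = np.stack([red_noise(years, rho, rng) for _ in range(proxies)])

proper = pc1(X)                 # centered on the full 1000-year mean
flawed = pc1(X, avg_last=80)    # Mann-style centering on the last 80 years
```

Plotting `flawed` will typically show the blade-and-shaft shape, while `proper` looks like more noise; the improper centering leaves offsets in the data (the standard deviation is shift-invariant, so normalization does not remove them), and those offsets inflate the covariances exactly as derived above.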
Play around with the settings. Try these…
- Set the autocorrelation to 0.0 (pure white noise) and select “Don’t Include Hockey Stick Proxy.” This is the combination that is least likely to result in a hockey stick for the flawed first principal component. Run it several times. Amazingly, you are likely to see a small, noisy hockey stick for the first flawed principal component.
- Set the autocorrelation to 0.0 (pure white noise), and select “Include Hockey Stick Proxy.” This will give one noisy hockey stick proxy and pure white noise for all other proxies. The first flawed principal component will be a crystal clear hockey stick with a dominating eigenvalue. Notice that the first proper principal component is just noise.
- Set the “Years” and “Average of last” years to the same value. Since proper centering means averaging over all years, this will result in the “flawed” results actually being correct and identical to the “proper” results.
- Set “Average of last” years to 80 (default) and try various autocorrelation settings. You will find that any autocorrelation setting will usually result in a hockey stick first principal component.
Here are some screen shots of the Mann Averaging Error Demo…
Voila! A Hockey Stick from noise…
Please let me know of any bugs or suggestions for enhancements. If anyone is interested in a LabVIEW 8.6 version, let me know and I will make it available.