Data Science

Want to use Principal Component Analysis? Answer these five questions first.

They say PCA helps you reduce the dimensionality of your data. Fair enough! But should you always use it? Find out here when using PCA is a bad decision.

Aayush Malik
3 min read · Aug 21, 2024
Photo by Glen Carrie on Unsplash

If you work in business intelligence or data analytics, you have probably heard of Principal Component Analysis. There are excellent sources out there that clearly explain what a principal component is, what principal component analysis does, why and when to use it, how it works, how many principal components you need for your data, and the maths behind it. However, not many resources online talk about validating whether your data allows you to use PCA. This article tries to answer questions such as:

  1. What assumptions does my data need to satisfy before I go the PCA way?
  2. How should I find out if PCA is the right choice for me?
  3. Should we always use PCA whenever we want to prevent overfitting and reduce the curse of dimensionality?
  4. How can we test whether PCA will work for our data?
  5. What alternatives are there to PCA?

Assumptions of PCA

Before deciding to use PCA, you need to make sure a few assumptions hold for your dataset.

Linear Relationship

PCA tries to fit an ellipsoid to the data. At its core, it is a linear transformation method. Therefore, the first assumption is that your covariates/features have a linear relationship. This can be verified with domain expertise as well as statistical tests such as Bartlett’s test of sphericity.

Bartlett’s Test of Sphericity

This test is used to make sure that the correlation matrix of the variables in your dataset diverges significantly from the identity matrix. An identity matrix is a matrix in which all of the values along the diagonal are 1 and all of the other values are 0.
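As a minimal sketch, assuming a pandas DataFrame df of numeric features, Bartlett’s statistic can be computed directly from the determinant of the correlation matrix (packages such as factor_analyzer also ship a ready-made helper for this):

```python
import numpy as np
import pandas as pd
from scipy import stats

def bartlett_sphericity(df: pd.DataFrame):
    """Bartlett's test of sphericity: H0 = the correlation matrix is the identity."""
    n, p = df.shape                       # observations, variables
    corr = df.corr().to_numpy()           # correlation matrix R
    # Test statistic: -(n - 1 - (2p + 5)/6) * ln|R|, chi-square with p(p-1)/2 dof
    chi_square = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(corr))
    dof = p * (p - 1) / 2
    p_value = stats.chi2.sf(chi_square, dof)
    return chi_square, p_value

# Hypothetical usage: df is your numeric feature DataFrame
# chi2, p = bartlett_sphericity(df)
# A small p-value (e.g. < 0.05) suggests the correlations are strong enough for PCA.
```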

Kaiser-Meyer-Olkin (KMO) Measure

The KMO measure assesses the sampling adequacy of your data, specifically focusing on whether your dataset is suitable for PCA.
It provides a value between 0 and 1, with higher values indicating better suitability. A KMO value closer to 1 suggests that the variables in your dataset have a high degree of common variance, making them suitable for PCA.
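A quick sketch of the overall KMO measure, assuming the same kind of numeric DataFrame df as above (the factor_analyzer package exposes an equivalent calculate_kmo helper):

```python
import numpy as np
import pandas as pd

def kmo_measure(df: pd.DataFrame) -> float:
    """Overall Kaiser-Meyer-Olkin measure of sampling adequacy."""
    corr = df.corr().to_numpy()              # correlation matrix R
    inv_corr = np.linalg.inv(corr)           # R^-1
    # Anti-image (partial) correlations: a_ij = -S_ij / sqrt(S_ii * S_jj)
    d = np.sqrt(np.diag(inv_corr))
    partial = -inv_corr / np.outer(d, d)
    np.fill_diagonal(partial, 0.0)
    np.fill_diagonal(corr, 0.0)              # keep only off-diagonal terms
    r2 = (corr ** 2).sum()
    a2 = (partial ** 2).sum()
    return r2 / (r2 + a2)

# Hypothetical usage: values above roughly 0.6 are usually read as adequate for PCA.
# print(kmo_measure(df))
```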

If the data isn’t linearly related, you can still use kernel PCA, which accounts for non-linear relationships. This is, however, not a foolproof method, and other dimensionality reduction techniques may be needed in that case.
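A minimal sketch with scikit-learn’s KernelPCA on a toy non-linear dataset (the RBF kernel and gamma value here are illustrative choices, not recommendations):

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Toy non-linear data: concentric circles that ordinary (linear) PCA cannot separate
X, _ = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)

# RBF kernel PCA passes the data through a non-linear feature map before projecting
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10)
X_kpca = kpca.fit_transform(X)
print(X_kpca.shape)  # (500, 2)
```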

Normal Distribution Assumption

The second assumption is that the underlying distribution of your covariates is normal. If it is not, you need to apply non-linear transformation methods to make it so. To test your data analytically for normality, there are several test procedures, the best known being the Kolmogorov-Smirnov test, the Shapiro-Wilk test, and the Anderson-Darling test.
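A quick sketch of running two of these checks with SciPy on a single feature column x (the sample here is synthetic and only stands in for one of your features):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=300)    # stand-in for one feature column

# Shapiro-Wilk: H0 = the sample comes from a normal distribution
stat_sw, p_sw = stats.shapiro(x)

# Anderson-Darling: compares the statistic against tabulated critical values
result_ad = stats.anderson(x, dist="norm")

print(f"Shapiro-Wilk p-value: {p_sw:.3f}")            # p > 0.05 -> no evidence against normality
print(f"Anderson-Darling statistic: {result_ad.statistic:.3f}")
```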

Scale Variation and Outlier Problems

PCA is seriously affected by outliers and by variation in the scale of features/covariates. Simple methods that bring all the covariates to standard values between 0 and 1, together with the removal of outliers, can drastically improve the outcomes of PCA. Therefore, check for these two aspects too.
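As a sketch, assuming a numeric DataFrame df: MinMaxScaler rescales every feature to the 0–1 range, and a simple z-score filter is one illustrative way to drop gross outliers first (the threshold of 3 is a rule of thumb, not a hard rule):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def drop_outliers_zscore(df: pd.DataFrame, threshold: float = 3.0) -> pd.DataFrame:
    """Keep only rows whose features all lie within `threshold` standard deviations."""
    z = (df - df.mean()) / df.std(ddof=0)
    return df[(z.abs() < threshold).all(axis=1)]

# Hypothetical df of numeric features on very different scales
df = pd.DataFrame({"a": np.random.randn(200) * 10 + 50,
                   "b": np.random.randn(200) * 0.1})

clean = drop_outliers_zscore(df)                 # remove extreme rows first
scaled = MinMaxScaler().fit_transform(clean)     # then rescale every feature to [0, 1]
print(scaled.min(axis=0), scaled.max(axis=0))
```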

Multicollinearity Assumption

The covariates in your data shouldn’t be perfectly correlated. To check this, you can use a simple correlation matrix, or a measure such as the Variance Inflation Factor (VIF). If you find multicollinearity, the first step is to remove the features/covariates that are highly correlated, because they may interfere with the outcomes of PCA.
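A small sketch of computing VIF per feature with statsmodels, assuming the same kind of numeric DataFrame df (the cut-off of 10 is a common rule of thumb, not a hard rule):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical numeric feature DataFrame
df = pd.DataFrame({"x1": np.random.randn(300),
                   "x2": np.random.randn(300)})
df["x3"] = df["x1"] * 0.9 + np.random.randn(300) * 0.1   # deliberately collinear with x1

X = sm.add_constant(df)                                    # VIF is computed against a constant
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
    index=df.columns,
)
print(vif)  # values above ~10 are often read as problematic multicollinearity
```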

Alternatives to PCA

If you come to the conclusion that you cannot use PCA for your data, there are alternative dimensionality reduction methods such as Linear Discriminant Analysis (LDA) and t-SNE (t-distributed stochastic neighbour embedding), the latter being a form of non-linear dimensionality reduction.

While PCA produces new component variables meant to maximise data variance, LDA produces component variables that also maximise class separation in the data.
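A brief sketch of both alternatives with scikit-learn, using a bundled toy dataset (note that LDA needs class labels, while t-SNE does not):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.manifold import TSNE

X, y = load_iris(return_X_y=True)

# LDA: supervised, at most (n_classes - 1) components, maximises class separation
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

# t-SNE: unsupervised, non-linear embedding mainly used for visualisation
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

print(X_lda.shape, X_tsne.shape)  # (150, 2) (150, 2)
```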

Conclusion

As this article has argued, PCA, though very helpful, should only be used after testing all of its assumptions. You are going to summarise a real-world phenomenon with a few variables, so you need to make sure you are doing it correctly, both statistically and rationally.

Reach out to me for any questions or discussions.



Written by Aayush Malik

Open Data | Causal Inference | Machine Learning | Data Visualization and Communication | https://www.linkedin.com/in/aayushmalik/
