Should I use PCA with categorical data?

Can principal component analysis be applied to data sets that contain a mixture of continuous and categorical variables?


Although a PCA applied to binary data would yield results comparable to those of a multiple correspondence analysis (factor scores and eigenvalues are linearly related), there are more appropriate techniques for dealing with mixed data types, namely Multiple Factor Analysis for mixed data, available in the FactoMineR R package. If your variables can be considered as structured subsets of descriptive attributes, then Multiple Factor Analysis is also an option.
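To make the mixed-data idea concrete, here is a minimal NumPy sketch of one common way such an analysis is set up: continuous columns are standardized, each categorical variable is expanded into indicator columns that are divided by the square root of the category proportion and then centered, and an ordinary PCA (via SVD) is run on the combined table. This reflects my understanding of the weighting used in factor analysis of mixed data; it is not the FactoMineR implementation, and the function name `famd_style_pca` and the toy data are made up for illustration.

```python
import numpy as np

def famd_style_pca(numeric, categorical):
    """PCA on a mixed table (illustrative sketch, not FactoMineR's code).

    numeric:     (n, p) float array of continuous variables
    categorical: list of length-n label sequences, one per categorical variable
    """
    n = numeric.shape[0]
    # Continuous variables: center and scale to unit variance.
    blocks = [(numeric - numeric.mean(axis=0)) / numeric.std(axis=0)]
    for labels in categorical:
        labels = np.asarray(labels)
        levels = np.unique(labels)
        # Indicator (dummy) matrix: one column per observed category.
        indicators = (labels[:, None] == levels[None, :]).astype(float)
        proportions = indicators.mean(axis=0)
        # Assumed weighting: divide by sqrt(category proportion), then center.
        scaled = indicators / np.sqrt(proportions)
        blocks.append(scaled - scaled.mean(axis=0))
    table = np.hstack(blocks)
    # Ordinary PCA of the combined table via SVD.
    U, s, Vt = np.linalg.svd(table, full_matrices=False)
    eigvals = s**2 / n      # variance carried by each component
    scores = U * s          # coordinates of the individuals
    return eigvals, scores

# Hypothetical toy data: two continuous variables and one 3-level factor.
rng = np.random.default_rng(0)
numeric = rng.normal(size=(100, 2))
color = rng.choice(["red", "green", "blue"], size=100)
eigvals, scores = famd_style_pca(numeric, [color])
```

With this weighting each continuous variable contributes a variance of 1 and a categorical variable with K observed levels contributes K - 1, so no single variable type dominates the components merely because of how it is coded.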

The challenge with categorical variables is to find a suitable way to represent distances between variable categories and individuals in the factorial space. To overcome this problem, you can look for a nonlinear transformation of each variable, whether it be nominal, ordinal, polynomial, or numerical, with optimal scaling. This is explained in detail in Gifi Methods for Optimal Scaling in R: The Package homals, and an implementation is available in the corresponding R package homals.

A Google search for "pca for discrete variables" gives this nice overview by S. Kolenikov (@StasK) and G. Angeles. To complement the previous answer: PCA is really an analysis of the eigenvectors of the covariance matrix, so the problem is how to compute the "correct" covariance matrix. One approach is to use polychoric correlation.
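The point that PCA only needs a covariance (or correlation) matrix can be sketched in a few lines of NumPy. The example below uses an ordinary Pearson correlation matrix on simulated continuous data as a stand-in; for ordinal variables one would instead pass in a polychoric correlation matrix estimated elsewhere (that estimation is not shown here). The function name `pca_from_correlation` and the simulated data are assumptions for illustration.

```python
import numpy as np

def pca_from_correlation(R):
    """Eigen-decompose a symmetric correlation/covariance matrix.

    Returns eigenvalues in descending order with matching eigenvectors
    (one eigenvector per column).
    """
    eigvals, eigvecs = np.linalg.eigh(R)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

# Simulated data: three continuous variables sharing one latent factor.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 1))
X = latent + 0.5 * rng.normal(size=(500, 3))

# Pearson correlation here; a polychoric matrix could be substituted for R.
R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = pca_from_correlation(R)
explained = eigvals / eigvals.sum()   # share of variance per component
```

Because the decomposition only ever sees `R`, the choice of correlation estimate is the only place where the measurement level of the variables enters the analysis.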

I recommend a look at Linting & Kooij, 2012 "Nonlinear Principal Component Analysis with CATPCA: A Tutorial", Journal of Personality Assessment ; 94 (1).


This article is set up as a tutorial for nonlinear principal components analysis (NLPCA), systematically guiding the reader through the process of analyzing actual data on personality assessment by the Rorschach Inkblot Test. NLPCA is a more flexible alternative to linear PCA that can handle the analysis of possibly nonlinearly related variables with different types of measurement level. The method is particularly suited to analyzing nominal (qualitative) and ordinal (e.g., Likert-type) data, possibly combined with numeric data. The program CATPCA from the Categories module in SPSS is used in the analyses, but the method description can easily be generalized to other software packages.

I don't have permission to comment on a post yet, so I'm adding my comment as a separate answer. Please bear with me.

Following up on @Martin F's comment: I recently came across nonlinear PCAs. I have explored them as a possible alternative for two situations. The first is when a continuous variable approaches the distribution of an ordinal variable as the data become sparse. This happens a lot in genetics: as the minor allele frequency of a variable keeps decreasing, you are left with a very low number of counts, you can no longer justify treating the variable as continuous, and you have to relax the distributional assumptions by making it either an ordinal or a categorical variable. Nonlinear PCA can handle both cases. Note, however, that nonlinear PCAs are not used often and their behavior has not yet been extensively tested (this may apply only to the genetics field, so please take it with a grain of salt). It is a fascinating option all the same. I hope I have added my 2 cents (hopefully relevant) to the discussion.

There is a recently developed approach to such problems: Generalized Low Rank Models (GLRM).

One of the works that uses this technique is even called PCA on a data frame.

PCA can be posed like this:

For an $n \times m$ matrix $M$,

find an $n \times k$ matrix $\hat{X}$ and a $k \times m$ matrix $\hat{Y}$ (this implicitly encodes the rank-$k$ constraint) such that

$$\hat{X}, \hat{Y} = \operatorname{argmin}_{X, Y} \|M - XY\|_F^2 .$$

The "generalized" in GLRM stands for changing $\|\cdot\|_F^2$ to something else and adding a regularization term.
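For the plain PCA case, where the loss is the squared Frobenius norm of M − XY and there is no regularization, the minimizer can be read off a truncated SVD (the Eckart-Young result). The sketch below shows this in NumPy on random, column-centered data; the function name `low_rank_factors` and the toy matrix are assumptions for illustration, and it covers only this special case, not a general GLRM solver.

```python
import numpy as np

def low_rank_factors(M, k):
    """Return X (n x k) and Y (k x m) minimizing ||M - X @ Y||_F^2.

    Uses the truncated SVD, which gives a best rank-k approximation
    in Frobenius norm.
    """
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    X = U[:, :k] * s[:k]   # absorb the singular values into the left factor
    Y = Vt[:k, :]
    return X, Y

# Hypothetical toy data: a 20 x 6 matrix, column-centered as in PCA.
rng = np.random.default_rng(0)
M = rng.normal(size=(20, 6))
M = M - M.mean(axis=0)

X_hat, Y_hat = low_rank_factors(M, k=2)
loss = np.linalg.norm(M - X_hat @ Y_hat, "fro") ** 2
```

The remaining loss equals the sum of the squared singular values that were discarded. Once the Frobenius loss is swapped for per-column losses suited to categorical or ordinal data, this closed form no longer applies and an iterative method such as alternating minimization over X and Y is typically needed.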

# Rstats package:

Implements principal component analysis, orthogonal rotation, and multiple factor analysis for a mixture of quantitative and qualitative variables.

Example from Vignette shows results for continuous and categorical output
