Unveiling Hidden Patterns: A Beginner’s Guide to Factor Analysis in Data Science

Introduction to Factor Analysis
Factor analysis is a statistical method for uncovering latent variables (called factors) that underlie a larger set of observed variables. It makes data easier to interpret by grouping correlated variables into a smaller number of unobserved factors that still represent the data as a whole.
Mathematically, factor analysis assumes that:
Observed variables = Linear combinations of latent factors + error terms.
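In matrix notation, the relationship above is commonly written as shown here. The symbols are introduced for illustration and are not notation used elsewhere in this article: x is the vector of p observed variables, f the vector of m latent factors, Λ the p × m matrix of factor loadings, ε the vector of error terms, Φ the covariance matrix of the factors (the identity when factors are uncorrelated), and Ψ the diagonal matrix of error variances.

```latex
x = \mu + \Lambda f + \varepsilon,
\qquad
\operatorname{Cov}(x) = \Lambda \Phi \Lambda^{\top} + \Psi
```

The second expression is what makes the model useful in practice: the correlations among the observed variables are explained by a small loading matrix plus variable-specific error.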
Tools such as SPSS and AMOS can carry out this analysis. Grouping correlated variables into common factors makes data easier to interpret and helps analysts spot patterns that are not obvious in large datasets. Factor analysis is therefore about two things: finding structure in data and reducing its dimensionality. This makes it well suited to surveys, social science research, and psychological testing. In practice, SPSS is commonly used for Exploratory Factor Analysis (EFA), which searches for structure, and AMOS for Confirmatory Factor Analysis (CFA), which tests a model that has already been specified.

Purpose of Factor Analysis
- Data Reduction: Combining many variables into a smaller set of factors that is easier to interpret, without losing important information.
- Finding Structure: Identifying the patterns or structures that underlie a set of variables.
- Scale Development: Building psychometric scales or survey instruments by checking that items are grouped correctly.
- Addressing Multicollinearity: Replacing highly correlated predictors in a regression with a smaller set of factors.
Factor analysis also gives a quantitative handle on abstract constructs such as anxiety, intelligence, or customer satisfaction.
Key Features
- Latent Variables: Unobserved constructs that are inferred from observed variables. Reducing the number of dimensions makes the data easier to interpret.
- Correlated Inputs: The method assumes that the observed variables are correlated with one another.
- Factor Loadings: These show the strength and direction of the relationship between each variable and each factor.
- Orthogonal and Oblique Rotations: Rotating the factor solution makes the loadings easier to interpret.

Core Applications of Factor Analysis
1. Scale Development
Factor analysis helps group candidate survey items into coherent subscales. For example, when building a motivation scale, FA might show which questions cluster into intrinsic motivation and which into extrinsic motivation.
2. Construct Validation
It checks whether survey items actually load on the latent constructs they are meant to measure. For a motivation measure, FA can show whether the items written for intrinsic motivation load together and separately from those written for extrinsic motivation.
3. Latent Structure Identification
It helps uncover constructs that have not yet been hypothesized but could explain patterns among observed variables. This makes it useful for exploratory research and for generating new theories.
4. Data Reduction
Researchers can condense many correlated variables into a small set of factors. This makes datasets easier to work with while retaining most of the important information. It is especially useful for grouping and summarizing variables when there are many of them.
Types of Factor Analysis: EFA vs. CFA
There are two primary kinds of factor analysis. Which one to use depends on the study’s purpose and on how much is already known about the factor structure:
Exploratory Factor Analysis (EFA)
When there is no existing theory, Exploratory Factor Analysis (EFA) is a data-driven method that tries to find the underlying structure of a group of observed variables. The objective is to find out how many hidden constructs (or factors) may explain the correlations between the observable data.
Below is a step-by-step explanation of procedures used in EFA:
1. Data Suitability Checks
It’s important to make sure the dataset is suitable for factor analysis before using EFA.
a. Sample Size
i. At least 5–10 observations for each variable (a total sample of more than 100 is preferable).
ii. Larger samples give more stable and reliable factor solutions.
b. Linearity and Normality
i. EFA assumes that the relationships between variables are linear.
ii. EFA tolerates mild violations of normality.
c. Correlation Matrix Inspection
i. EFA presupposes that the variables are at least moderately correlated.
ii. A correlation matrix is computed to inspect these relationships.
2. Tests for Factorability
These tests make sure that the dataset is good for finding useful factors.
a. Kaiser-Meyer-Olkin (KMO) Test
i. Measures sampling adequacy, that is, whether the variables share enough common variance to be factored.
ii. A KMO value above 0.6 is acceptable; a value above 0.8 is preferable.
iii. Shows whether the patterns of correlations are compact enough for factor extraction to be reliable.
b. Bartlett’s Test of Sphericity
i. Tests whether the correlation matrix differs significantly from an identity matrix. An identity matrix would mean that the variables are uncorrelated.
ii. A p-value less than 0.05 means the variables are correlated enough for EFA to proceed.
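The two factorability checks above can be sketched with NumPy alone. This is a minimal illustration, not the output of SPSS or any specific package: the function names and the simulated survey data are assumptions made up for the example. The Bartlett function returns the chi-square statistic and its degrees of freedom; a p-value would come from a chi-square distribution with that many degrees of freedom.

```python
import numpy as np


def bartlett_sphericity(data):
    """Bartlett's chi-square statistic and degrees of freedom for an (n, p) array."""
    n, p = data.shape
    corr = np.corrcoef(data, rowvar=False)
    # ln|R| is 0 for an identity matrix and more negative as correlations grow.
    _, log_det = np.linalg.slogdet(corr)
    statistic = -(n - 1 - (2 * p + 5) / 6) * log_det
    dof = p * (p - 1) // 2
    return statistic, dof


def kmo_overall(data):
    """Overall KMO: squared correlations relative to squared partial correlations."""
    corr = np.corrcoef(data, rowvar=False)
    inv = np.linalg.inv(corr)
    scale = np.sqrt(np.diag(inv))
    partial = -inv / np.outer(scale, scale)  # partial correlations
    off_diag = ~np.eye(corr.shape[0], dtype=bool)
    r2 = np.sum(corr[off_diag] ** 2)
    q2 = np.sum(partial[off_diag] ** 2)
    return r2 / (r2 + q2)


# Simulated survey (made-up data): 300 respondents, 6 items, 2 latent factors.
rng = np.random.default_rng(0)
factors = rng.normal(size=(300, 2))
loadings = np.array([[0.8, 0], [0.8, 0], [0.8, 0], [0, 0.8], [0, 0.8], [0, 0.8]])
data = factors @ loadings.T + 0.5 * rng.normal(size=(300, 6))

statistic, dof = bartlett_sphericity(data)
kmo = kmo_overall(data)
print(f"Bartlett chi-square = {statistic:.1f} on {dof} df, KMO = {kmo:.2f}")
```

With 6 variables the test has 15 degrees of freedom, so a statistic above roughly 25 corresponds to p < 0.05.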
3. Factor Extraction Methods
This step estimates the factors from the correlation matrix.
a. Principal Component Analysis (PCA)
i. Often confused with EFA, although it is mainly a data reduction technique.
ii. Analyzes total variance (common + unique).
iii. Useful as a point of comparison, but it is not a true factor analysis method.
b. Principal Axis Factoring (PAF)
i. Extracts only the common variance shared across variables.
ii. Suitable when the goal is to identify latent constructs.
c. Maximum Likelihood (ML)
i. Assumes that the data follow a multivariate normal distribution.
ii. Makes statistical tests and confidence intervals possible.
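To make the idea of extracting only common variance concrete, here is a rough sketch of iterated principal axis factoring in NumPy. It is an illustration under stated assumptions rather than the exact routine used by SPSS: the function name, the stopping rule, and the small population correlation matrix are all made up for the example. The key move is replacing the 1s on the diagonal of the correlation matrix with communality estimates before taking eigenvectors.

```python
import numpy as np


def principal_axis_factoring(corr, n_factors, max_iter=500, tol=1e-8):
    """Iterated PAF on a correlation matrix; returns loadings and communalities."""
    corr = np.asarray(corr, dtype=float)
    # Starting communalities: squared multiple correlations.
    communalities = 1.0 - 1.0 / np.diag(np.linalg.inv(corr))
    for _ in range(max_iter):
        reduced = corr.copy()
        np.fill_diagonal(reduced, communalities)  # keep common variance only
        eigvals, eigvecs = np.linalg.eigh(reduced)
        top = np.argsort(eigvals)[::-1][:n_factors]
        loadings = eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0.0, None))
        updated = np.sum(loadings ** 2, axis=1)
        converged = np.max(np.abs(updated - communalities)) < tol
        communalities = updated
        if converged:
            break
    return loadings, communalities


# Hypothetical population model: 6 items, 2 uncorrelated factors.
true_loadings = np.array(
    [[0.8, 0], [0.7, 0], [0.6, 0], [0, 0.8], [0, 0.7], [0, 0.6]]
)
corr = true_loadings @ true_loadings.T
np.fill_diagonal(corr, 1.0)

loadings, communalities = principal_axis_factoring(corr, n_factors=2)
print(np.round(communalities, 3))
```

Because the input matrix was built from known loadings, the recovered communalities should approach the squared true loadings (0.64, 0.49, 0.36 for each factor). PCA, by contrast, would leave the diagonal at 1 and so analyze total variance.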
4. Determining Number of Factors to Retain
There are several ways to decide how many factors to retain:
a. Eigenvalues > 1 (Kaiser Criterion)
i. Keep factors that have eigenvalues larger than 1.
ii. Such factors explain more variance than a single observed variable.
b. Scree Plot
i. A plot of the eigenvalues against the factor number.
ii. Find the “elbow” point, where the slope levels off. Keep the factors that come before this point.
c. Parallel Analysis
i. Compares the eigenvalues of the real data with those of randomly generated data.
ii. More accurate than the eigenvalue > 1 rule.
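The Kaiser criterion and parallel analysis can both be sketched from the eigenvalues of the correlation matrix. The version below is a simplified illustration: it uses eigenvalues of the full correlation matrix, compares them with the 95th percentile of eigenvalues from random normal data of the same shape, and runs on simulated data. The function names, the percentile, and the number of simulations are assumptions chosen for the example.

```python
import numpy as np


def correlation_eigenvalues(data):
    """Eigenvalues of the correlation matrix, largest first."""
    corr = np.corrcoef(data, rowvar=False)
    return np.sort(np.linalg.eigvalsh(corr))[::-1]


def parallel_analysis(data, n_sims=100, percentile=95, seed=0):
    """Count leading eigenvalues that exceed those of random data."""
    rng = np.random.default_rng(seed)
    n, p = data.shape
    observed = correlation_eigenvalues(data)
    random_eigs = np.empty((n_sims, p))
    for i in range(n_sims):
        random_eigs[i] = correlation_eigenvalues(rng.normal(size=(n, p)))
    threshold = np.percentile(random_eigs, percentile, axis=0)
    keep = 0
    for obs, thr in zip(observed, threshold):
        if obs <= thr:  # stop at the first eigenvalue random data can match
            break
        keep += 1
    return keep, observed, threshold


# Simulated survey (made-up data): 300 respondents, 6 items, 2 latent factors.
rng = np.random.default_rng(0)
factors = rng.normal(size=(300, 2))
loadings = np.array([[0.8, 0], [0.8, 0], [0.8, 0], [0, 0.8], [0, 0.8], [0, 0.8]])
data = factors @ loadings.T + 0.5 * rng.normal(size=(300, 6))

n_parallel, observed, threshold = parallel_analysis(data)
n_kaiser = int(np.sum(observed > 1.0))
print("Kaiser:", n_kaiser, "Parallel analysis:", n_parallel)
```

Plotting `observed` against the factor number would give the scree plot described above. On clean simulated data the rules agree; on real survey data the Kaiser criterion often keeps more factors than parallel analysis does.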
5. Factor Rotation
Rotation makes the result easier to understand by making the factor loadings simpler.
a. Orthogonal Rotation (Varimax)
i. Assumes that the factors are uncorrelated (independent).
ii. Makes the columns of the factor loading matrix easier to read.
b. Oblique Rotation (Promax, Oblimin)
i. Allows the factors to be correlated.
ii. Often more realistic in psychological and social research.
iii. Produces a pattern matrix (each variable's unique loading on each factor), a structure matrix (correlations between variables and factors), and a factor correlation matrix.
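As an illustration of what an orthogonal rotation does, the sketch below implements the widely used SVD-based varimax iteration in NumPy. It is a minimal version without Kaiser normalization, so its output may differ slightly from SPSS defaults. The function name and the small loading matrix are made up for the example: a clean two-factor structure is deliberately blurred by a 30 degree rotation, and varimax is asked to recover it.

```python
import numpy as np


def varimax(loadings, max_iter=100, tol=1e-6):
    """Rotate a (p, k) loading matrix to maximize the varimax criterion."""
    p, k = loadings.shape
    rotation = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        previous = criterion
        rotated = loadings @ rotation
        col_ss = np.sum(rotated ** 2, axis=0)  # column sums of squares
        gradient = loadings.T @ (rotated ** 3 - rotated * col_ss / p)
        u, s, vt = np.linalg.svd(gradient)
        rotation = u @ vt  # nearest orthogonal matrix to the gradient
        criterion = np.sum(s)
        if previous != 0 and criterion / previous < 1 + tol:
            break
    return loadings @ rotation, rotation


# Hypothetical simple structure, then mixed by a 30 degree rotation.
simple = np.array([[0.8, 0], [0.7, 0], [0.6, 0], [0, 0.8], [0, 0.7], [0, 0.6]])
angle = np.deg2rad(30)
mix = np.array([[np.cos(angle), -np.sin(angle)], [np.sin(angle), np.cos(angle)]])
unrotated = simple @ mix

rotated, rotation = varimax(unrotated)
print(np.round(rotated, 2))
```

The rotation changes how variance is shared out between factors but not how much of each variable is explained: the row sums of squared loadings (communalities) are identical before and after. Column order and signs of the rotated factors are arbitrary.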
6. Interpretation of Factor Loadings
i. Factor Loadings: These are the correlation coefficients between observed variables and latent factors.
A loading above 0.4 or below -0.4 is typically considered meaningful.
ii. Items are grouped according to the factor on which they load most strongly.
iii. Factors are named according to the conceptual meaning of the variables that group together.
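The grouping rule above can be sketched in a few lines. The item names and loadings are invented for illustration, and the handling of cross-loading items (setting them aside for review) is one common convention, not a rule stated in this article.

```python
import numpy as np

# Hypothetical rotated loadings for five survey items on two factors.
items = ["q1", "q2", "q3", "q4", "q5"]
loadings = np.array([
    [0.78, 0.10],
    [-0.65, 0.22],  # negative loadings count by absolute value
    [0.12, 0.81],
    [0.45, 0.48],   # salient on both factors
    [0.20, 0.15],   # salient on neither factor
])


def assign_items(items, loadings, cutoff=0.4):
    """Group items by the single factor on which |loading| exceeds the cutoff."""
    groups = {f"factor_{j + 1}": [] for j in range(loadings.shape[1])}
    cross_loading, unassigned = [], []
    for name, row in zip(items, np.abs(loadings)):
        salient = np.flatnonzero(row > cutoff)
        if salient.size == 0:
            unassigned.append(name)
        elif salient.size > 1:
            cross_loading.append(name)
        else:
            groups[f"factor_{salient[0] + 1}"].append(name)
    return groups, cross_loading, unassigned


groups, cross_loading, unassigned = assign_items(items, loadings)
print(groups, cross_loading, unassigned)
```

Items flagged as cross-loading or unassigned are candidates for rewording or removal before the scale is finalized.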
7. Reliability Testing
After identifying factors:
i. The internal consistency of items loading on each factor is measured by Cronbach’s Alpha.
An alpha above 0.7 is generally taken to indicate acceptable reliability.
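Cronbach’s Alpha for the items loading on one factor can be computed directly from the item variances and the variance of the summed score. The sketch below uses simulated responses; the function name and the data are assumptions made for the example.

```python
import numpy as np


def cronbach_alpha(item_scores):
    """Cronbach's alpha for an (n_respondents, k_items) array."""
    item_scores = np.asarray(item_scores, dtype=float)
    k = item_scores.shape[1]
    item_variances = item_scores.var(axis=0, ddof=1)
    total_variance = item_scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances.sum() / total_variance)


# Simulated responses (made-up data): three items driven by one trait.
rng = np.random.default_rng(1)
trait = rng.normal(size=(200, 1))
responses = 0.8 * trait + 0.5 * rng.normal(size=(200, 3))

alpha = cronbach_alpha(responses)
print(f"alpha = {alpha:.2f}")
```

Alpha rises with both the number of items and their average intercorrelation, so a long scale can reach 0.7 even when individual items are only modestly related.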
Purpose & Uses of EFA
Exploratory Factor Analysis (EFA) is a method used in the beginning phases of research to find the underlying structure among a huge number of variables. It is particularly helpful when there is no existing theory or model and the objective is to find hidden structures or elements that explain the correlations that have been seen. Researchers use EFA to combine similar variables, cut down on the number of dimensions, and create new scales or measurement models that may be used in further studies to confirm and analyze their findings.
Key Features
i. An inductive technique that is based on data
ii. The number of factors is not set ahead of time
iii. Factors are found using eigenvalues and visual aids like the Scree plot
iv. After using rotation methods (Varimax for orthogonal and Promax for oblique), loadings are interpreted.
Confirmatory Factor Analysis (CFA)
Confirmatory Factor Analysis (CFA) is a statistical method for testing whether a hypothesized factor structure fits the data that has been collected. The researcher specifies the model by stating which observed variables load onto which latent factors and whether the factors are correlated. Once the model is confirmed to be identified and its parameters are estimated with a method such as Maximum Likelihood (ML), it is evaluated using goodness-of-fit indices like CFI, RMSEA, and SRMR.
Purpose
Its main purpose is to test measurement models, assess construct validity, and confirm theoretical constructs such as job satisfaction, motivation, or anxiety. CFA is usually used in later phases of research, once a candidate scale has been developed using procedures like EFA.
Process of Performing CFA
1. Model Specification
Define:
i. Which observed variables load on which latent factors.
ii. Whether the factors are correlated with each other.
iii. Any error covariances that make sense in theory.
2. Model Identification
Make sure that the model can be mathematically estimated by having:
i. Enough data points: the number of unique variances and covariances among the observed variables must be at least the number of parameters to estimate.
ii. Enough indicators for each factor (at least three observed variables for each factor is a good rule of thumb).
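The counting behind this rule can be sketched as a small function. It assumes a simple textbook setup that is not spelled out in this article: every indicator loads on exactly one factor, one loading per factor is fixed to 1 to set the scale, all factors are allowed to correlate, and there are no error covariances. The function name is hypothetical.

```python
def cfa_degrees_of_freedom(n_indicators, n_factors):
    """Degrees of freedom for a simple-structure CFA (assumptions in the text)."""
    # Unique variances and covariances among the observed variables.
    data_points = n_indicators * (n_indicators + 1) // 2
    free_loadings = n_indicators - n_factors  # one loading per factor fixed to 1
    error_variances = n_indicators
    factor_variances = n_factors
    factor_covariances = n_factors * (n_factors - 1) // 2
    free_parameters = (
        free_loadings + error_variances + factor_variances + factor_covariances
    )
    return data_points - free_parameters


print(cfa_degrees_of_freedom(2, 1))   # -1: under-identified
print(cfa_degrees_of_freedom(3, 1))   # 0: just-identified
print(cfa_degrees_of_freedom(20, 4))  # 164: over-identified
```

A single factor with three indicators has exactly zero degrees of freedom, which is why three indicators per factor is the usual minimum. Non-negative degrees of freedom are necessary for identification but do not guarantee it.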
3. Model Estimation
Use estimation techniques like:
i. Maximum Likelihood (ML)
ii. Generalized Least Squares (GLS)
Software options: AMOS, LISREL, Mplus, R (lavaan), Python (semopy)
4. Model Evaluation (Goodness-of-Fit Indices)
Evaluate how well the model fits the actual data using several fit indices:
| Fit Index | Acceptable Value | Interpretation |
| Chi-Square (χ²) | p > 0.05 | Low values = good fit |
| CFI (Comparative Fit Index) | > 0.90 (good), > 0.95 (excellent) | Compares model fit to null model |
| RMSEA (Root Mean Square Error of Approximation) | < 0.08 (good), < 0.05 (excellent) | Measures approximation error |
| SRMR (Standardized Root Mean Square Residual) | < 0.08 | Measures difference between observed and predicted correlations |
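Two of the indices in the table can be computed by hand from chi-square values, which helps show what they measure. The sketch below uses the commonly cited formulas for RMSEA and CFI; some programs divide by N rather than N - 1 in RMSEA, so reported values can differ slightly. All input numbers are hypothetical and the function names are chosen for the example.

```python
import math


def rmsea(chi2, dof, n_obs):
    """Root Mean Square Error of Approximation from the model chi-square."""
    return math.sqrt(max(chi2 - dof, 0.0) / (dof * (n_obs - 1)))


def cfi(chi2_model, dof_model, chi2_baseline, dof_baseline):
    """Comparative Fit Index relative to a baseline (null) model."""
    model_excess = max(chi2_model - dof_model, 0.0)
    baseline_excess = max(chi2_baseline - dof_baseline, model_excess, 0.0)
    if baseline_excess == 0.0:
        return 1.0
    return 1.0 - model_excess / baseline_excess


# Hypothetical results: model chi-square 250 on 164 df from 400 respondents,
# baseline (null) model chi-square 3000 on 190 df.
model_rmsea = rmsea(250.0, 164, 400)
model_cfi = cfi(250.0, 164, 3000.0, 190)
print(f"RMSEA = {model_rmsea:.3f}, CFI = {model_cfi:.3f}")
```

Both indices work from the amount by which chi-square exceeds its degrees of freedom, which is why a model can show acceptable RMSEA and CFI even when the chi-square test itself is significant in a large sample.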
5. Model Modification
i. If the fit is poor, use modification indices to find relationships that should be added or removed (such as error covariances).
ii. Only make changes that are justified by theory, to avoid data-driven overfitting.
6. Model Interpretation
i. Standardized factor loadings should be at least 0.5 (0.7 is better).
ii. To assess statistical significance, examine the standard errors and critical ratios.
iii. Check validity by examining the construct reliability (CR) and the average variance extracted (AVE).
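CR and AVE are both simple functions of the standardized loadings on a factor. The sketch below uses the commonly cited formulas and assumes standardized loadings with uncorrelated errors; the loadings are hypothetical and the function names are chosen for the example.

```python
def construct_reliability(loadings):
    """CR = (sum of loadings)^2 / ((sum of loadings)^2 + sum of error variances)."""
    total = sum(loadings) ** 2
    error = sum(1 - value ** 2 for value in loadings)
    return total / (total + error)


def average_variance_extracted(loadings):
    """AVE = mean of the squared standardized loadings."""
    return sum(value ** 2 for value in loadings) / len(loadings)


# Hypothetical standardized loadings for one factor with three indicators.
factor_loadings = [0.70, 0.80, 0.75]
cr = construct_reliability(factor_loadings)
ave = average_variance_extracted(factor_loadings)
print(f"CR = {cr:.3f}, AVE = {ave:.3f}")
```

Commonly cited conventions treat CR of at least 0.7 and AVE of at least 0.5 as adequate, which lines up with the loading guideline above: loadings near 0.7 give an AVE near 0.5.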
Key Features
i. A theory-based (deductive) approach
ii. The researcher sets the factor structure ahead of time
iii. Statistical indicators are used to check how well the model fits:
• CFI/TLI (a value above 0.90 indicates a good fit)
• RMSEA (less than 0.08 means the approximation error is acceptable)
• SRMR (< 0.08 means low residuals)
• Chi-square/df (less than 3 is usually okay)
The difference between EFA & CFA:
| Aspect | Exploratory Factor Analysis (EFA) | Confirmatory Factor Analysis (CFA) |
| Purpose | Discover underlying structure | Test a predefined structure |
| Theoretical Framework | Not required | Required |
| Approach | Data-driven (inductive) | Hypothesis-driven (deductive) |
| Factor Specification | Factors and loadings are derived from data | Factors and loadings are specified in advance |
| Rotation Used? | Yes (to clarify loadings) | No (model structure is fixed) |
| Model Fit Indices | Not applicable | Required (CFI, RMSEA, SRMR, etc.) |
| Best Suited For | New scale development, exploring unknown structures | Theory testing, scale validation, measurement invariance studies |
| Software Tools | SPSS, R (psych), Python | AMOS, LISREL, Mplus, R (lavaan) |
Principal Component Analysis
Principal Component Analysis (PCA) is a method used in statistics and machine learning to make large datasets easier to work with by transforming the original variables into a new set of uncorrelated variables known as principal components. The first few of these components retain much of the variation (information) in the original dataset.
PCA computes the eigenvalues and eigenvectors of the covariance or correlation matrix to find the directions (principal components) along which the data varies the most. The first principal component captures the most variance, the second captures the most remaining variance in a direction orthogonal to the first, and so on. PCA is a common way to compress data, make it easier to visualize, and prepare it for machine learning, where it can improve model performance.
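The eigen-decomposition described above can be sketched directly in NumPy. This is a bare-bones illustration on the covariance matrix with simulated data; the function name and the data are assumptions made for the example, and library implementations add refinements such as scaling options.

```python
import numpy as np


def pca(data, n_components):
    """PCA via eigen-decomposition of the covariance matrix."""
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]  # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :n_components]
    scores = centered @ components  # data projected onto the components
    explained = eigvals / eigvals.sum()
    return scores, components, explained


# Simulated data (made up): four variables driven by two underlying directions.
rng = np.random.default_rng(2)
latent = rng.normal(size=(200, 2))
mixing = np.array([[1.0, 0.8, 0.6, 0.0], [0.0, 0.3, 0.5, 1.0]])
data = latent @ mixing + 0.2 * rng.normal(size=(200, 4))

scores, components, explained = pca(data, n_components=2)
print(np.round(explained, 3))
```

The component scores are uncorrelated by construction, and the `explained` array shows how much of the total variance each component retains. Standardizing the variables first would make this equivalent to working from the correlation matrix.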
Factor Analysis vs Principal Component Analysis (PCA)
Factor Analysis (FA) and Principal Component Analysis (PCA) are often used interchangeably, but they are different methods that rest on distinct statistical assumptions:
| Feature | Factor Analysis (FA) | Principal Component Analysis (PCA) |
| Purpose | Identify latent constructs | Summarize total variance |
| Based on | Shared (common) variance | Total variance (common + unique + error) |
| Error Assumption | Accounts for measurement error | Does not account for measurement error |
| Use Case | Theory-driven: construct validation | Data-driven: dimensionality reduction |
| Output | Latent variables (factors) | Principal components (composite scores) |
Real-Life Example of EFA and CFA
Exploratory Factor Analysis (EFA) – Real-Life Example:
Context: A university sends out a survey with 30 questions to find out how happy students are with things like the quality of teaching, the campus amenities, the support services, and the extracurricular activities.
Application: The institution does not know how these items are structured, so it uses EFA to explore the data. The analysis shows that the questions fall into four main groups: Academic Experience, Infrastructure Satisfaction, Administrative Support, and Campus Life. With these groups, the university can better understand what drives student satisfaction and streamline future surveys.
Confirmatory Factor Analysis (CFA) – Real-Life Example:
Context: Based on the EFA above, the university creates a refined 20-item student satisfaction survey that is organized around the four factors identified earlier.
Application: CFA is applied to a new group of students to see whether the four-factor model fits the data well. The university confirms the structure and assesses the instrument's reliability using fit indices like CFI and RMSEA. This gives it the confidence to use the scale for regular student feedback and for benchmarking across institutions.
For an in-depth understanding, please refer to our book, “Academic Research Fundamentals: Research Writing and Data Analysis”, which is available as an eBook and in hardcopy.