TOPIC INFO – UGC NET (Geography)
SUB-TOPIC INFO – Geographical Techniques (UNIT 9)
CONTENT TYPE – Detailed Notes
What’s Inside the Chapter?
1. Explanation
2. Step by Step Explanation of PCA
2.1. Step 1: Standardisation
2.2. Step 2: Covariance Matrix Computation
2.3. Step 3: Compute the Eigenvectors and Eigenvalues of the Covariance Matrix to Identify the Principal Components
2.4. Step 4: Feature Vector
2.5. Last Step: Recast the Data Along the Principal Components Axes
3. Data Analysis
4. Cluster Analysis
4.1. K Means Clustering
4.2. Hierarchical Clustering
4.3. Applications of Cluster Analysis
Principal Component Analysis
UGC NET GEOGRAPHY
Geographical Techniques (UNIT 9)
- Principal component analysis (PCA) is a technique used to identify a smaller number of uncorrelated variables, known as principal components, from a larger set of data. The technique is widely used to emphasize variation and capture strong patterns in a data set.
- Invented by Karl Pearson in 1901, principal component analysis is a tool used in predictive models and exploratory data analysis. Principal component analysis is considered a useful statistical method and used in fields such as image compression, face recognition, neuroscience and computer graphics.
Explanation
- Principal component analysis helps make data easier to explore and visualize. It is a simple non-parametric technique for extracting information from complex and confusing data sets. Principal component analysis aims to capture the maximum amount of variance with the fewest number of principal components.
- One of the distinct advantages of principal component analysis is that once patterns are found in the data, data compression is also supported. Principal component analysis is used to reduce the number of variables, when there are too many predictors compared to the number of observations, or to avoid multicollinearity.
- It is closely related to canonical correlation analysis and makes use of an orthogonal transformation to convert a set of observations of correlated variables into a set of values known as principal components.
- The number of principal components used in principal component analysis is less than or equal to the smaller of the number of original variables and the number of observations. Principal component analysis is sensitive to the relative scaling of the original variables.
- Principal component analysis is widely used in many areas such as market research, social sciences and industries where large data sets are used. The technique can also help in providing a lower-dimensional picture of the original data. Only minimal effort is needed to reduce a complex and confusing data set into a simplified, useful set of information.
- Principal Component Analysis, or PCA, is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most of the information in the large set.
- Reducing the number of variables of a data set naturally comes at the expense of accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity. Smaller data sets are easier to explore and visualize, and machine learning algorithms can analyze data much faster without extraneous variables to process.
- So to sum up, the idea of PCA is simple – reduce the number of variables of a data set, while preserving as much information as possible.
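To make the idea concrete, here is a minimal sketch of dimensionality reduction with PCA, assuming Python with NumPy and scikit-learn; the small data set and variable names are hypothetical, chosen only for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data set: 6 observations of 3 correlated variables
data = np.array([
    [2.5, 2.4, 1.2],
    [0.5, 0.7, 0.3],
    [2.2, 2.9, 1.5],
    [1.9, 2.2, 1.1],
    [3.1, 3.0, 1.6],
    [2.3, 2.7, 1.4],
])

# Keep only the first two principal components
pca = PCA(n_components=2)
reduced = pca.fit_transform(data)

print(reduced.shape)                  # (6, 2): same observations, fewer variables
print(pca.explained_variance_ratio_)  # share of total variance kept by each component
```

The `explained_variance_ratio_` output shows how much of the original information (variance) each retained component preserves, which is exactly the trade-off described above.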
Step by Step Explanation of PCA
STEP 1: Standardisation
- The aim of this step is to standardize the range of the continuous initial variables so that each one of them contributes equally to the analysis.
- More specifically, the reason why it is critical to perform standardization prior to PCA is that the latter is quite sensitive to the variances of the initial variables. That is, if there are large differences between the ranges of the initial variables, those variables with larger ranges will dominate over those with small ranges (for example, a variable that ranges between 0 and 100 will dominate over a variable that ranges between 0 and 1), which will lead to biased results. So, transforming the data to comparable scales can prevent this problem.
- Mathematically, this can be done by subtracting the mean and dividing by the standard deviation for each value of each variable.
$$z = \frac{\text{value} - \text{mean}}{\text{standard deviation}}$$
Once the standardization is done, all the variables will be transformed to the same scale.
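As a quick illustration of this step, the sketch below standardizes each variable (column) of a small data matrix, assuming Python with NumPy; the numbers are hypothetical:

```python
import numpy as np

# Hypothetical data: rows are observations, columns are the initial variables
data = np.array([
    [170.0, 65.0],
    [160.0, 72.0],
    [180.0, 80.0],
    [175.0, 68.0],
])

# z = (value - mean) / standard deviation, applied column by column
standardized = (data - data.mean(axis=0)) / data.std(axis=0)

print(standardized.mean(axis=0))  # approximately 0 for every variable
print(standardized.std(axis=0))   # 1 for every variable: all on the same scale
```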
Step 2: Covariance Matrix Computation
- The aim of this step is to understand how the variables of the input data set are varying from the mean with respect to each other, or in other words, to see if there is any relationship between them. Because sometimes, variables are highly correlated in such a way that they contain redundant information. So, in order to identify these correlations, we compute the covariance matrix.
- The covariance matrix is a 𝑝 × 𝑝 symmetric matrix (where 𝑝 is the number of dimensions) that has as entries the covariances associated with all possible pairs of the initial variables. For example, for a 3-dimensional data set with 3 variables 𝑥, 𝑦, and 𝑧, the covariance matrix is a 3 × 3 matrix of this form:
$$\begin{bmatrix} \mathrm{Cov}(x,x) & \mathrm{Cov}(x,y) & \mathrm{Cov}(x,z) \\ \mathrm{Cov}(y,x) & \mathrm{Cov}(y,y) & \mathrm{Cov}(y,z) \\ \mathrm{Cov}(z,x) & \mathrm{Cov}(z,y) & \mathrm{Cov}(z,z) \end{bmatrix}$$
Covariance Matrix for 3-Dimensional Data
- Since the covariance of a variable with itself is its variance (Cov (𝑎, 𝑎) = Var (𝑎)), in the main diagonal (Top left to bottom right) we actually have the variances of each initial variable. And since the covariance is commutative (Cov (𝑎, 𝑏) = Cov (𝑏, 𝑎)), the entries of the covariance matrix are symmetric with respect to the main diagonal, which means that the upper and the lower triangular portions are equal.
What do the covariances that we have as entries of the matrix tell us about the correlations between the variables?
It’s actually the sign of the covariance that matters:
- If positive: the two variables increase or decrease together (correlated).
- If negative: one increases when the other decreases (inversely correlated).
Now that we know that the covariance matrix is no more than a table that summarizes the correlations between all the possible pairs of variables, let’s move to the next step.
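To make this concrete, the sketch below computes the covariance matrix of three variables, again assuming Python with NumPy; the variables and their values are hypothetical:

```python
import numpy as np

# Hypothetical standardized variables x, y, z (4 observations each)
x = np.array([ 1.2, -0.8,  0.5, -0.9])
y = np.array([ 1.0, -1.1,  0.7, -0.6])   # moves with x    -> positive covariance
z = np.array([-1.1,  0.9, -0.4,  0.6])   # moves against x -> negative covariance

# np.cov expects variables in rows, so stack them; the result is a 3 x 3 symmetric matrix
cov_matrix = np.cov(np.vstack([x, y, z]))

print(cov_matrix)
# Diagonal entries are the variances Cov(a, a) = Var(a);
# off-diagonal entries are symmetric, Cov(a, b) = Cov(b, a),
# and their signs show whether pairs of variables are correlated or inversely correlated.
```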
