Dimensionality Reduction - Popular Techniques and How to Use Them

Mateusz Przyborowski - January 14, 2024

In the vast fields of data science and machine learning, dimensionality reduction plays a pivotal role. This process condenses high-dimensional data into a lower-dimensional space, using linear techniques such as Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) or more complex non-linear methods. The goal is to transform the original data into fewer dimensions that retain the key information. Dimensionality reduction techniques enable machine learning algorithms to process high-dimensional data more efficiently by identifying significant variables and reducing storage requirements. They simplify the complexity of data and facilitate predictive modeling and feature extraction by reducing the number of features or variables in a data set while preserving as much relevant information as possible.

This article explores several popular methods for dimensionality reduction and analyzes their pros, cons, and potential use cases. By understanding these techniques, practitioners can effectively tackle high-dimensional data and enhance the performance of their machine learning models.

TL;DR

  • Dimensionality reduction aims to provide a better understanding of the data for both you and your models.
  • Without dimensionality reduction, some problems may be virtually unsolvable due to the curse of dimensionality of the original data.
  • There is no need to use overly complicated methods - sometimes a simple summary using basic statistics is enough.
  • For simple data sets, linear models such as PCA (principal component analysis) or LDA (linear discriminant analysis) are usually sufficient. It is important not to over-complicate the solution.
  • Different tasks may require different approaches - the best visualizations will be produced by t-SNE/UMAP, dimensionality reduction for the input of a machine learning model should be performed by a stricter method such as PCA or autoencoders, while significant redundancy in a very large data set is best handled with Random Projections.
  • Want to take your data analysis to the next level? Contact nexocode data science experts to learn how you can implement artificial intelligence in your business and gain a competitive edge.

Why Should We Reduce Anything?

With the tremendous growth of data in today’s world, many data sets possess an overwhelming number of features, making them challenging to handle and analyze. There are two main ways in which we can describe the size of our data set:

  • the number of distinct observations that constitute our resource;
  • the number of distinct features describing each observation.

It is worth mentioning that we are skipping a few steps of initial data preprocessing here. One may notice that there are types of data which cannot be understood simply as 2-dimensional arrays, with the first dimension listing cases and the second listing features. It is a valid remark, yet for most use cases, before applying conventional statistical models or machine learning methods, we need to establish at least a rough idea of how to interpret the data as multidimensional random variables following some probability distribution. With these assumptions, just before we try our models - and probably fail due to the curse of dimensionality - we can apply particular methods for decreasing the size of the data.

Dimensionality reduction is, however, more subtle than that. A proper setup for most statistical models assumes that we have defined a task or, more generally, a cost function that has to be minimized. In the case of supervised learning, it is a function based on the differences between desired labels and predicted outcomes. For unsupervised learning, we want our labeling to maintain some kind of regularity, based on a predefined metric over the feature space. This formulation suggests a natural approach: if we cannot achieve an appropriate clustering using the default feature space, maybe we should transform it so that our algorithms can produce better results.

Not-a-Method: Vectorizing and Mixture Models

If our data comes from a certain probability distribution, we can reduce its size by estimating the parameters of this distribution.

Let’s start with an example. Suppose that we want to build a machine learning model to predict sales for a retail store. Our data set consists of the shopping lists of individual customers. As we can imagine, there are thousands of different products; most of them are sold very rarely and have at least a few closely related alternatives. Of course, the sales history of a specific product may contain information about sales in general, but for a better representation (and, in fact, dimensionality reduction) we can forget about the precise product names and simply replace them with their categories. Although we accept the loss of at least some information, we hope that for this particular task we will preserve the most important features of our data set.

Another avenue opens when we deal with numerical data (or data that we can represent numerically). Most statistical methods assume that the data set follows some particular probability distribution. Therefore, to reduce the number of observations, we can analyze the (non)parametric statistics of the data set, e.g. the mean value or quantiles of some features. Again, this is very sensitive to the underlying information we want to retain and which is needed to achieve our goal. An incorrect choice of statistic may result in missing the necessary knowledge. For instance, if we are interested in an outlier detection task, a simple mean is a bad idea and we should instead look at higher moments such as variance or kurtosis.
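
As a rough illustration of this idea, the sketch below (in Python, assuming scikit-learn is available) summarizes a numeric feature with a few basic statistics and, alternatively, fits a two-component Gaussian mixture whose estimated parameters serve as a compact description of the data; the synthetic "basket value" observations and the number of components are illustrative assumptions rather than part of the original example.

```python
# Sketch: replacing raw observations with a few statistics or distribution parameters.
# Assumes numeric 1-D data; the two-component mixture is an illustrative choice.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical "basket value" observations drawn from two customer segments.
values = np.concatenate([rng.normal(20, 5, 5000), rng.normal(80, 15, 1000)])

# Simple (non)parametric summary: a handful of numbers instead of 6000 rows.
summary = {
    "mean": values.mean(),
    "std": values.std(),
    "quantiles": np.quantile(values, [0.25, 0.5, 0.75]),
    "kurtosis": ((values - values.mean()) ** 4).mean() / values.var() ** 2,
}

# Parametric alternative: estimate the parameters of an assumed mixture model.
gmm = GaussianMixture(n_components=2, random_state=0).fit(values.reshape(-1, 1))

print(summary)
print("mixture weights:", gmm.weights_, "means:", gmm.means_.ravel())
```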

Another approach is to utilize a broad theory of sampling techniques or methods known as data summarization. These can be very technical in nature and we will not discuss them here.

What to Keep in the Lower Dimensional Space?

Dimensionality reduction methods are usually divided by two characteristics: feature extraction vs feature selection and linear vs non-linear. To assess whether some features are important, it is necessary to have a labeled data set and a prepared model. There are techniques for extracting information about the importance of given features. Additionally, some machine learning models, e.g. deep neural networks, natively include feature extraction mechanisms. This approach is not very versatile for two reasons: a labeled data set is a luxury we sometimes cannot afford, and model preparation can be computationally expensive (which is usually the very reason we decide to perform dimensionality reduction in the first place). Moreover, such feature selection or feature extraction is closely tied to the model type, and different models may yield completely different results.
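
As a hedged sketch of such model-based importance techniques, the snippet below fits a random forest on a synthetic labeled data set and keeps only the features whose importance exceeds the mean; the data, the model choice, and the threshold are illustrative assumptions, not a recommendation of one particular approach.

```python
# Sketch: model-based feature selection, assuming a labeled numeric data set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in: 1000 labeled observations, 50 features, only 5 informative.
X, y = make_classification(n_samples=1000, n_features=50, n_informative=5,
                           random_state=0)

# Fit a forest and keep only the features whose importance exceeds the mean importance.
selector = SelectFromModel(RandomForestClassifier(n_estimators=200, random_state=0),
                           threshold="mean").fit(X, y)
X_reduced = selector.transform(X)

print(X.shape, "->", X_reduced.shape)
print("selected feature indices:", selector.get_support(indices=True))
```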

The distinction between linear and non-linear methods of dimensionality reduction follows from the type of relationships the method can detect and model. Linear methods typically rely on the assumption of linear separability existing in the data set. Although simple, they surprisingly often manage to capture most of the information contained in the data. Non-linear methods are based on the use of far more complicated and demanding models. They are usually used in cases when linear methods do not provide satisfactory results. This distinction is much more general and is what we will focus on in this article.

It is worth mentioning here that each method described in this article has its own extensions. Dimensionality reduction is an area of data science that is at the forefront of research and SotA ideas emerge every week, so for some methods, extensions could significantly change their characteristics. We would like to consider this overview as a short introduction. Discovering and understanding different approaches to improving a given method of dimensionality reduction is a beautiful journey and we encourage everyone to continue learning more about it.

Dimension Reduction with Linear Methods

Principal Component Analysis

Top: original image; bottom: image reconstructed by PCA using only 1.7% of the original data.

Principal Component Analysis (PCA) is a widely used - and probably the most popular - technique in data analysis for the dimensionality reduction task. It provides valuable insights into the underlying structure of complex data sets, allowing us to extract key features, reduce complexity, and visualize data. PCA is a mathematical procedure that transforms a high-dimensional data set into a new set of variables called principal components. These components are linear combinations of the original variables and capture the maximum amount of variance in the data. The first principal component accounts for the most variance, followed by the second, third, and so on.

Principal component analysis assumes that the relationships between variables are linear; if the important structure in the data is non-linear, PCA might not be the most suitable method. Furthermore, PCA captures only linear correlations and treats directions of high variance as the most informative ones, so if the relevant information lies elsewhere, its efficacy may be compromised.

Principal component analysis reduces the dimensionality of high-dimensional data while retaining the most informative features. This is particularly useful when dealing with large data sets. Additionally, PCA can filter out noise by selecting the principal components that capture the most significant variance, thereby enhancing the signal-to-noise ratio. While PCA is generally effective, it transforms the original features into new components, therefore the interpretability of these components may be reduced. This can be a limitation in scenarios where interpretability is crucial.

Example:

Suppose we have a data set related to customer shopping preferences in a retail store. It contains various features such as the amount spent on different product categories (e.g., groceries, clothing, electronics), the frequency of visits to the store, the average duration of each visit, and the customer’s demographic information (e.g., age, income, location), and PCA can project this data set onto the desired number of dimensions. These representations can be used for data visualization or clustering. The insights from PCA may enable the store to optimize inventory management, conduct personalized marketing campaigns, or improve the overall customer experience.
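
A minimal sketch of such a projection with scikit-learn might look as follows; the random stand-in data, the standardization step, and the choice of two components are assumptions made for illustration (in practice one would inspect the explained variance ratio before fixing the number of dimensions).

```python
# Sketch: PCA on a standardized customer-features table (assumed numeric).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Hypothetical stand-in for spend-per-category, visit frequency, age, income, ...
X = rng.normal(size=(500, 8))

# Standardizing first matters: PCA is sensitive to feature scales.
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)

print("explained variance ratio:", pca.explained_variance_ratio_)
print("projected shape:", X_2d.shape)  # (500, 2), ready for plotting or clustering
```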

Complexity:

Exact PCA is linear in the number of observations and cubic in their dimensionality. Additionally, there are numerous algorithms for computing a PCA approximation, including iterative methods. This makes it an excellent method for large data sets.

Advantages:

  1. PCA allows us to condense large data sets into a smaller set of variables while preserving the most informative aspects of the data.
  2. PCA helps us visualize high-dimensional data in a lower-dimensional space, making it easier to identify patterns, clusters, and outliers.
  3. By focusing on the principal components capturing the most variance, PCA effectively filters out noise and enhances the signal-to-noise ratio.

Disadvantages:

  1. PCA assumes linearity between variables, which can limit its effectiveness when dealing with non-linear relationships.
  2. As PCA transforms the original features into a subspace, the interpretability of these components may be reduced, making it challenging to understand the underlying meaning of the transformed variables.
  3. PCA can be sensitive to outliers since they can strongly influence the calculations, potentially leading to a skewed representation of the data.

Advised Use Conditions:

  1. Generally the first approach for most data sets.
  2. The data set should contain numeric data.
  3. The data should not be too obfuscated (e.g. limited number of outliers).

Linear Discriminant Analysis

LDA uses labels to better represent data in a lower-dimensional space.

In the realm of dimensionality reduction techniques, Linear Discriminant Analysis (LDA) stands out as a tool that can also be utilized as a classification model. By combining the concepts of feature extraction and classification, LDA provides a comprehensive approach to data analysis. At its core, LDA seeks a linear combination of features that maximally separates the different classes in the data set while simultaneously minimizing the within-class scatter. By transforming the original features, LDA not only reduces the dimensionality but also enhances the separability between classes. To achieve optimal results, LDA relies on a few key assumptions:

  • The data must follow a multivariate normal distribution within each class. In particular, this assumption excludes applications to non-numeric data.
  • The classes should have identical covariance matrices.
  • The features are assumed to be independent of each other.

LDA works best in scenarios where the goal is to maximize the separation between classes. It is especially effective when the classes are well-defined and linearly separable, and there is a need to reduce dimensionality while preserving class discrimination. The simplicity of this model allows for numerous extensions, including generalized discriminant analysis. On the other hand, the assumptions, such as the class distribution or the independence of features, are strict and usually violated in practice. Furthermore, in cases where the classes overlap significantly (lack of linear separability), LDA might struggle to find an optimal separation boundary. Another known issue is data sets with imbalanced class distributions, which may lead to biased results.

Example:

Consider a medical scenario where a data set contains various health parameters of patients (such as blood pressure, cholesterol levels, and BMI) and their corresponding diagnosis (e.g., healthy or having a specific disease). By applying LDA, we can uncover the most discriminative features that contribute to accurate disease classification. We do not want just any reduction in dimensionality, but one that takes into account the main goal of our research, i.e. helping in the diagnosis of patients. LDA allows for a better understanding of the underlying patterns and potentially aids in early disease detection or personalized treatment strategies.
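
A minimal sketch of this workflow is shown below, using scikit-learn's built-in breast cancer data as a stand-in for the patient parameters described above; the train/test split and the single discriminant component (forced by having only two classes) are illustrative choices.

```python
# Sketch: LDA as both a supervised reducer and a classifier,
# using scikit-learn's breast cancer data as a stand-in for patient parameters.
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# With two classes, LDA yields at most one discriminant axis.
lda = LinearDiscriminantAnalysis(n_components=1)
X_train_1d = lda.fit_transform(X_train, y_train)

print("reduced shape:", X_train_1d.shape)
print("test accuracy when used as a classifier:", lda.score(X_test, y_test))
```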

Complexity:

LDA is linear in the number of observations and cubic in their dimensionality. This makes it an excellent method for large data sets as long as their initial dimension is reasonably small.

Advantages:

  1. LDA transforms high-dimensional data into a lower-dimensional space while preserving the class discriminatory information.
  2. LDA aims to maximize the separation between classes, making it particularly effective in scenarios where distinct class boundaries exist. By identifying the most significant features, LDA can provide clear insights into the underlying data structure.
  3. The transformed features obtained from LDA can be easily interpreted and visualized, making it easier to understand the relationships between variables and classes. This interpretability facilitates better decision-making and domain-specific insights because it is more accessible to domain experts than black box models.

Disadvantages:

  1. LDA assumes that the classes can be separated by linear boundaries. This assumption may limit its effectiveness in scenarios where classes have complex non-linear relationships.
  2. LDA is sensitive to outliers, as they can significantly impact the estimation of class means and covariances. Outliers can distort the separation between classes and lead to suboptimal results.
  3. LDA is primarily designed for classification purposes and may not provide comprehensive insights into the overall data structure or relationships between variables that are not directly related to classification.

Advised Use Conditions:

  1. Moderately many initial dimensions in the data set.
  2. The data set should contain numeric data.
  3. A class or label for each observation.
  4. Need for explainable results.

Random Projections

The graph shows the relationship between model accuracy and the level of reduction of the original dataset using random projections. The data set consists of a sparse vectorization of texts and in its original form has over 130,000 dimensions. The red line indicates the result of the classifier using the original dataset.

Random Projections (RP) is a method used to reduce the dimensionality of very high-dimensional data while preserving its inherent structure. The technique is based on the intuition that a small number of well-chosen random projections can approximately preserve distances and relationships between data points. It works by randomly selecting a set of projection vectors and mapping the data points onto these vectors. The projection vectors are typically randomly generated, such as from a Gaussian distribution.

To perform the random projection, the algorithm computes the dot product between each data point and the projection vectors. This dot product represents the mapping of the data point onto the projection vectors. The resulting lower-dimensional representation of the data is obtained by concatenating these dot products.

The key idea behind random projections is that as long as the random projection vectors span the entire space well, the distances and relationships between data points in the high-dimensional space are approximately preserved in the lower-dimensional space. This allows for effective dimensionality reduction while maintaining the integrity of the data structure. Random projections can be particularly useful for large-scale data sets or situations where preserving exact distances between data points is not crucial.

The main assumption behind RP is that the random projection vectors span the entire space well. In other words, the projection vectors must be sufficiently diverse to capture the variability present in the data. This assumption ensures that the relationships between data points in the high-dimensional space are approximately maintained after the projection. Random Projections excel in scenarios where computational efficiency and simplicity are crucial. RP is particularly useful for large-scale data sets where traditional dimensionality reduction techniques may become computationally expensive or infeasible. It is also effective when the focus is on preserving the overall structure of the data.

The mathematical foundations of Random Projections are provided by the Johnson-Lindenstrauss lemma. It states that a set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved, and such an embedding can be realized by a random projection.
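
In one common quantitative form (stated here up to constants, which vary by source, as a reference rather than a derivation from this article), the lemma reads:

```latex
% Johnson-Lindenstrauss lemma (one common formulation; constants vary by source)
\textbf{Lemma.} Let $0 < \varepsilon < 1$ and let $X$ be a set of $N$ points in $\mathbb{R}^d$.
For a target dimension $k = O\!\left(\varepsilon^{-2} \log N\right)$ there exists a linear map
$f : \mathbb{R}^d \to \mathbb{R}^k$ such that for all $u, v \in X$:
\[
  (1 - \varepsilon)\,\lVert u - v \rVert^2
  \;\le\; \lVert f(u) - f(v) \rVert^2
  \;\le\; (1 + \varepsilon)\,\lVert u - v \rVert^2 .
\]
% In practice, $f$ is taken to be a suitably scaled random projection, which satisfies
% these inequalities with high probability.
```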

Example:

One practical application of Random Projections is text classification. Consider a large collection of text documents represented as high-dimensional vectors where each dimension corresponds to a unique word. By applying RP, we can significantly reduce the dimensionality of the feature space while retaining the underlying semantics. This enables faster and more efficient text classification algorithms, or serves as a suitable preprocessing step for other dimensionality reduction methods.
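
The sketch below illustrates this on a synthetic sparse matrix standing in for such a bag-of-words representation; the matrix size, the density, and the distortion level passed to the Johnson-Lindenstrauss helper are assumptions made for illustration.

```python
# Sketch: random projection of a sparse bag-of-words matrix (dimensions assumed).
import scipy.sparse as sp
from sklearn.random_projection import (SparseRandomProjection,
                                       johnson_lindenstrauss_min_dim)

# Hypothetical stand-in for 10,000 documents over a 130,000-word vocabulary.
X = sp.random(10_000, 130_000, density=0.001, format="csr", random_state=0)

# The Johnson-Lindenstrauss bound suggests a target dimension for a given distortion.
k = johnson_lindenstrauss_min_dim(n_samples=10_000, eps=0.2)
print("suggested target dimension:", k)

rp = SparseRandomProjection(n_components=k, random_state=0)
X_reduced = rp.fit_transform(X)
print(X.shape, "->", X_reduced.shape)
```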

Complexity:

For N as the number of observations, d as the number of dimensions and k as the number of desired projection vectors, the complexity is of the order O(Ndk).

Advantages:

  1. RP is highly efficient, making it suitable for big data (both many observations and dimensions) and real-time applications.
  2. The implementation of RP is straightforward and does not require complex optimization procedures.

Disadvantages:

  1. While RP can preserve the overall structure of the data, it may introduce some level of approximation in distance calculations, leading to a loss of accuracy.
  2. Random Projections are not suitable for all types of data sets. In cases where preserving precise distances or fine-grained details is critical, other dimensionality reduction techniques may be more appropriate.
  3. The effectiveness of RP heavily depends on the selection of random projection vectors. Choosing an optimal set of projections is a challenging task and may require experimentation to achieve desired results.

Advised Use Conditions:

  1. Data set that is too big to reasonably run a more sophisticated method on.
  2. Data set that is very sparse, but it is difficult to determine the most important dimensions.
  3. It is a good idea to reduce the size of the data set using RP first and then apply another dimensionality reduction method.

Dimensionality Reduction Process with Non-Linear Methods

Locally Linear Embeddings

LLE focuses on maintaining distances within local neighborhoods of observations.

Locally Linear Embeddings (LLE) uncovers the underlying structure of high-dimensional data by mapping it to a lower-dimensional space while preserving local geometrical relationships. LLE operates by constructing a neighborhood graph based on the proximity of data points. It assumes that each data point can be expressed as a linear combination of its neighbors. The algorithm then learns a low-dimensional representation of the data by minimizing the difference between these linear combinations in the original space and the embedding space.

LLE assumes that the data lies on a smooth, manifold-like structure embedded in a higher-dimensional space. It applies locally linear approximations to capture the intrinsic relationships between neighboring data points. Moreover, LLE assumes that the local structure of the data is preserved in the lower-dimensional space.

LLE excels in scenarios where the data exhibits nonlinear and intricate structures. It is particularly useful when dealing with high-dimensional data where traditional linear techniques might fall short. It also shines in tasks such as visualizing complex data sets with inherent patterns. While LLE is a flexible and powerful tool, there are situations where caution is necessary. It may struggle when the data contains outliers or noise, as these can disrupt the local linear relationships.

Example:

Let’s imagine a one-dimensional string embedded in n-dimensional space. The string can be twisted in a complicated way, which makes it difficult to determine its shape linearly, but we know that it is one-dimensional, so a single dimension is enough to fully describe it. In this scenario, linear dimensionality reduction methods will have difficulty producing a meaningful result, i.e., summarizing the string in one dimension, while LLE creates a projection based on neighborhoods and will therefore successfully capture the coherence of the input manifold.
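
A minimal sketch of this behaviour, substituting scikit-learn's swiss roll (a 2-dimensional manifold twisted in 3-D) for the twisted string, might look like this; the number of samples and neighbors are illustrative assumptions that normally require tuning.

```python
# Sketch: LLE unrolling a low-dimensional manifold embedded in 3-D
# (the swiss roll stands in for the "twisted string" of the example).
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, color = make_swiss_roll(n_samples=1500, noise=0.05, random_state=0)

# n_neighbors controls the size of the locally linear patches and needs tuning.
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
X_unrolled = lle.fit_transform(X)

print(X.shape, "->", X_unrolled.shape)  # (1500, 3) -> (1500, 2)
```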

Complexity:

LLE might face challenges when dealing with very large data sets due to its computational complexity O(N^2) for N as the number of observations.

Advantages:

  1. LLE can effectively capture intricate nonlinear relationships, making it suitable for complex data sets.
  2. LLE ensures that the local geometrical relationships between data points are preserved in the lower-dimensional space.

Disadvantages:

  1. LLE requires careful parameter tuning, such as the number of neighbors and the dimensionality of the embedding space. Improper parameter selection can lead to suboptimal results.
  2. LLE can be computationally expensive, particularly for large data sets. The algorithm involves solving a system of linear equations for each data point, which can be time-consuming.
  3. While LLE excels at preserving local relationships, it may struggle to capture the global structure of the data. In cases where global structure is crucial, alternative methods may be more suitable.

Advised Use Conditions:

  1. Small to medium numeric data sets.
  2. Evidence for non-linear relationships within the data set (e.g. when PCA fails spectacularly).
  3. Confidence in the quality of the data set (e.g. negligible proportion of outliers, no missing data, consistent data preparation).

t-distributed Stochastic Neighbor Embedding

t-SNE assumes that the differences between individual observations reflect a certain probability distribution.

t-distributed Stochastic Neighbor Embedding (t-SNE) is an algorithm commonly used for visualizing high-dimensional data in a lower-dimensional space. It works by first calculating pairwise similarities between data points in the high-dimensional space. Then, it constructs probability distributions that represent the relationships between the points. After that, t-SNE embeds the data points in a lower-dimensional space in a way that preserves these relationships. It uses gradient descent to adjust the positions of the points and minimize the divergence between the original high-dimensional distribution and the lower-dimensional one. Stochastic neighbor sampling is employed to speed up computation. The algorithm continues until convergence, generating a visualization that reveals complex structures and patterns in the data.

t-SNE assumes that the data points lie on a manifold, which is a lower-dimensional structure embedded within the higher-dimensional space. It aims to capture this manifold structure and reveal the underlying patterns. Additionally, it assumes that the pairwise similarities in the high-dimensional space are converted into probabilities, interpreting them as affinities. These affinities are compared to the affinities in the low-dimensional space, aiming to minimize the Kullback-Leibler divergence between them.

t-SNE is particularly effective in scenarios where the underlying data structure is complex and contains non-linear relationships. It excels at visualizing clusters, identifying patterns, and revealing hidden structures that may not be apparent in the original high-dimensional space. It is often used in exploratory data analysis, image recognition, natural language processing, and genomics research.

Example:

In genomics, t-SNE has been employed to analyze gene expression data and identify distinct cell populations in single-cell RNA sequencing. By reducing the dimensionality of the data, t-SNE enables researchers to visualize and interpret complex gene expression profiles, aiding in the discovery of novel cell types or disease subtypes.
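
As a small, hedged illustration of the usual pipeline (not of the genomics study itself), the sketch below pre-reduces scikit-learn's handwritten digits data with PCA and then embeds it with t-SNE; the perplexity, the number of PCA components, and the data set are assumptions for demonstration.

```python
# Sketch: PCA followed by t-SNE for a 2-D visualization of the digits data set.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)  # 1797 samples, 64 dimensions

# Common practice: pre-reduce with PCA to denoise and speed up t-SNE.
X_pca = PCA(n_components=30, random_state=0).fit_transform(X)

tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
X_2d = tsne.fit_transform(X_pca)

print(X_2d.shape)  # (1797, 2); colour the points by y to see the digit clusters
```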

Complexity:

Exact t-SNE might be inefficient for very large data sets due to its computational complexity O(N^2) for N as the number of observations, but there are many tricks implemented to speed up the computation, e.g. the Barnes-Hut approximation runs in O(N log N).

Advantages:

  1. t-SNE can effectively reveal complex structures in the data, making it a valuable tool for exploratory analysis and gaining insights into high-dimensional data sets.
  2. It is capable of preserving local structures, enabling the identification of clusters, outliers, and dense regions.
  3. The algorithm is flexible, making it scalable and applicable to various domains.

Disadvantages:

  1. t-SNE is computationally expensive, especially for very large data sets. The algorithm’s time complexity (for exact case) is quadratic, requiring careful consideration when working with big data.
  2. It can be sensitive to the choice of hyperparameters, such as the perplexity value. Different perplexity values may result in different visualizations, requiring experimentation and fine-tuning. Furthermore, only local convergence is guaranteed, and since t-SNE has a non-convex cost function, different initializations can yield different results.
  3. t-SNE cannot accurately preserve global structures, which means that the distances between data points in the output space may not reflect their true distances in the original high-dimensional space.
  4. Since t-SNE is an iterative, non-parametric method, it does not learn an explicit mapping, so it cannot be applied to new data points based on the results of previous computations.
  5. Interpretation of t-SNE visualizations requires caution. The algorithm emphasizes local structures, and global context may be lost. It is crucial to combine t-SNE results with domain knowledge for meaningful analysis.

Advised Use Conditions:

  1. Data set that is not too big.
  2. Evidence for non-linear relationships within the data set.
  3. A common practice is to first apply PCA to the data set and then apply t-SNE to the results.
  4. Need for data visualization, and preliminary exploratory analysis.

Uniform Manifold Approximation and Projection

UMAP assumes that the distances between individual observations reflect some multidimensional manifold and tries to approximate it.

Uniform Manifold Approximation and Projection (UMAP) is based on the mathematical theories of Riemannian geometry and algebraic topology. It aims to create a low-dimensional representation of the data set while preserving the intrinsic structure and relationships between data points. In a simplified overview:

  1. Constructing a graph: UMAP starts by constructing a k-nearest neighbor (KNN) graph to capture local relationships between data points. The graph connects each data point to its nearest neighbors.
  2. Fuzzy representations: UMAP uses a fuzzy-simplicial set theory to measure the strength of connections between data points by approximating a local Riemannian metric. It assigns probabilities to each connection based on this metric.
  3. Low-dimensional representations: UMAP optimizes the low-dimensional representation by minimizing the difference between the fuzzy-simplicial set in the original high-dimensional space and the low-dimensional space. The low-dimensional representation is based on the Euclidean metric and has the desired number of dimensions.
  4. Optimization: UMAP employs stochastic gradient descent to iteratively update the low-dimensional representation by minimizing the cross-entropy between the probabilities from the high-dimensional representation and the probabilities from the low-dimensional representation until it converges to a stable solution.

UMAP assumes that the high-dimensional data lies on a low-dimensional manifold, meaning that the data has an underlying structure that can be effectively captured in lower dimensions. This means that nearby data points in the high-dimensional space should also be close to each other in the low-dimensional representation. This assumption allows UMAP to preserve local structures and capture fine-grained relationships.

UMAP performs well when the data exhibits complex, nonlinear relationships. It can capture intricate patterns and reveal hidden structures that may not be apparent in the original high-dimensional space. UMAP strikes a balance between preserving global structures, such as clusters or groups, and maintaining local relationships, such as neighborhoods or similarities between nearby points. This flexibility makes UMAP ideal for exploratory data analysis and visualizing complex data sets. Additionally, UMAP is computationally efficient, making it suitable for real-time data analysis or interactive visualization tasks. It can handle large data sets with relative ease, allowing for efficient exploration and decision-making. On the other hand, if the data does not possess a well-defined underlying structure or manifold, UMAP may end up capturing and representing random (noise) relationships between data points.

Example:

Imagine a data set of customer reviews for a popular online store. Each review has multiple features, such as the rating, sentiment, and various textual attributes. Thanks to UMAP, we can visualize the high-dimensional nature of this data to gain insights about customer preferences and identify any underlying patterns or clusters. What is more, UMAP produces a manifold representation of the data that can be used in further tasks involving this data set (e.g. classification, clustering).
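
A minimal sketch with the umap-learn package is given below; the random numeric matrix stands in for the review features described above, and the parameter values are illustrative defaults rather than tuned choices.

```python
# Sketch: UMAP on a hypothetical table of review features (rating, sentiment, ...),
# assuming the umap-learn package is installed (pip install umap-learn).
import numpy as np
import umap

rng = np.random.default_rng(42)
# Stand-in for ~3000 reviews described by 20 numeric features.
X = rng.normal(size=(3000, 20))

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=42)
embedding = reducer.fit_transform(X)
print(embedding.shape)  # (3000, 2)

# Unlike t-SNE, the fitted reducer learns a reusable mapping:
# new reviews can later be embedded with reducer.transform(X_new).
```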

Complexity:

It is difficult to give a complexity bound for UMAP, since many of the underlying methods are very input-sensitive. Theoretical results give O(N^2) for N as the number of observations, but this is the worst-case scenario. There are classes of problems for which it is O(N log N), and the empirical complexity, as the authors claim, is approximately O(dN^1.14) (for d as the number of initial dimensions).

Advantages:

  1. UMAP excels at preserving both global and local structures of complex data sets, enabling a more faithful representation of the data.
  2. UMAP is highly scalable and can handle large data sets with millions of data points efficiently.
  3. UMAP offers flexibility in terms of parameterization - including different metrics for different types of data - allowing users to customize the trade-off between preserving global and local structures according to their specific needs.
  4. UMAP is known for its computational efficiency, making it suitable for real-time data analysis or interactive visualization tasks.

Disadvantages:

  1. UMAP has a few parameters that need to be carefully tuned, such as the number of nearest neighbors and the learning rate. Incorrect parameter settings may lead to suboptimal results.
  2. While UMAP provides valuable visual representations, the low-dimensional embeddings generated by UMAP might not have direct interpretability in terms of the original features.
  3. UMAP requires a sufficient number of observations to obtain a meaningful result. Otherwise, it is inclined to give importance to random noise in the data set.

Advised Use Conditions:

  1. At least 500 observations in the data set.
  2. Evidence for non-linear relationships within the data set.
  3. A common practice is to first apply PCA to the data set and then apply UMAP to the results.
  4. Need for data visualization, preliminary exploratory analysis.

Autoencoders (AE)

In autoencoders, the narrowest layer should compress the most important data information.

Autoencoders (AE) are a type of artificial neural network that learns to encode data into a lower-dimensional representation and subsequently decode it back into its original form. The network comprises two main components: an encoder and a decoder. The encoder learns to compress the input data into a latent representation, while the decoder reconstructs the original data from that representation. The trick is to design a neural network such that:

  • Input and output have the same dimension.
  • The latent layer of neurons has a significantly smaller dimension than the input.

The latent representation is a general concept - different architectures and types of AEs may use a simple numerical vector, a sparse representation, or even parameters of probability distributions. To help visualize how AEs work, recall that using only linear activation functions makes the latent representation resemble that of PCA. Moreover, using pre-trained layers in the encoder and decoder allows us to train AEs on data types that cause significant problems for other methods.

AEs assume that the input data possess meaningful patterns and structures. They aim to capture these patterns in the lower-dimensional representation. However, if the data lacks inherent structure or exhibits noise, their performance may suffer. AEs can identify the most informative and representative aspects, making them ideal for data sets with high-dimensional inputs. Furthermore, AEs can be used to detect anomalies or outliers in a data set: by training the network on normal instances, it learns to reconstruct them accurately, and any input that deviates significantly from its reconstruction can be flagged as an anomaly. AEs are less effective when the structure in the data is essentially linear; linear techniques like PCA might be more suitable in such cases. Another difficulty is that AEs (like most neural networks) require a substantial amount of training data to learn meaningful representations. When dealing with small data sets, they may not perform as well and might be prone to overfitting.

Example:

Consider a data set of images containing different types of animals. Each image is represented by a high-dimensional vector of pixel intensities. By training an autoencoder on these images, we can learn a lower-dimensional representation that captures features like shapes, textures, and patterns. This compression can enable efficient storage, faster processing, and even aid in classification tasks. The autoencoder can be trained on a subset of the images and then accurately encode and decode new, unseen images. The reduced representation can then be used as input for downstream classification algorithms.
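
As a hedged sketch (a dense network on flattened images rather than the convolutional architecture one would typically use for real photos), a minimal Keras autoencoder could look like this; the input and latent dimensions, the random training data, and the training schedule are all illustrative assumptions.

```python
# Sketch: a minimal dense autoencoder in Keras compressing flattened images.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

input_dim, latent_dim = 784, 32  # e.g. 28x28 grayscale images -> 32 numbers

encoder = keras.Sequential([
    layers.Input(shape=(input_dim,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(latent_dim, activation="relu"),  # the narrow "bottleneck" layer
])
decoder = keras.Sequential([
    layers.Input(shape=(latent_dim,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(input_dim, activation="sigmoid"),  # reconstruct pixel intensities
])
autoencoder = keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")

# Hypothetical training data: pixel values scaled to [0, 1].
X = np.random.default_rng(0).random((1024, input_dim)).astype("float32")
autoencoder.fit(X, X, epochs=5, batch_size=64, verbose=0)

# The encoder alone now performs the dimensionality reduction.
X_latent = encoder.predict(X, verbose=0)
print(X.shape, "->", X_latent.shape)  # (1024, 784) -> (1024, 32)
```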

Complexity:

AEs are a type of neural network, which is why their training is generally difficult and requires a lot of computing power. There are tricks to reduce the number of necessary training steps, such as transfer learning, but some problems may require more complex architectures.

Advantages:

  1. AEs can capture non-linear relationships between features, making them more effective in capturing complex patterns and structures in the data.
  2. By reconstructing the original data from the compressed representation, AEs indirectly provide a measure of data reconstruction quality, which can be useful in detecting anomalies or assessing data quality.
  3. AEs are a vast and diverse family of models that are highly customizable and may be adjusted to each problem in its own way.

Disadvantages:

  1. AEs can be computationally expensive to train, especially for large data sets or complex machine learning models. They may require more computational resources compared to simpler dimensionality reduction techniques.
  2. AEs are prone to overfitting, particularly when dealing with limited training data. Regularization techniques, such as dropout or adding explicit noise, may be necessary to mitigate this issue.
  3. The compressed representation learned by AEs may not always be easily interpretable by humans. While it captures important information, understanding the underlying meaning of each dimension can be challenging.
  4. AEs require customized architecture and meta-parameters for each problem. Designing the right neural network for a specific task can be a difficult task.

Advised Use Conditions:

  1. Large data set (but this is a highly task-dependent requirement).
  2. Access to high computing power.
  3. Evidence for non-linear relationships within the data set.
  4. Need for highly personalized solution, when other methods fail.

Summary

Visualization of the performance of 4 models on a dataset of handwritten digits.

Dimensionality reduction is highly important because it allows us to tackle the curse of dimensionality. As data sets grow larger and become more complex, the number of features or variables can increase exponentially. By reducing the number of dimensions in the data, we can eliminate irrelevant or redundant features, thereby simplifying the problem and improving the efficiency of our machine learning models.

Dimensionality reduction truly unlocks the potential to uncover hidden patterns and insights within high-dimensional data. With proper knowledge about the different methods, their bright and dark sides, as well as the recommended conditions for their use, you can significantly improve your data and the entire machine learning framework.

If you want to take your data analysis to the next level right away, the data professionals at nexocode can assist you with implementing AI and machine learning solutions in your business in order to gain an edge over your competitors. Get in touch with us about your data analytics today to learn more.

About the author

Mateusz Przyborowski

Senior Data Scientist

Linkedin profile

Mateusz is a mathematician working in data science and artificial intelligence. A graduate of mathematics at the University of Warsaw. His theoretical knowledge, combined with project experience in many AI application domains, allows him to carefully approach and manage various business and research tasks. Author and co-author of articles from IEEE Big Data, FedCSIS, IJCRS. Mathematics enthusiast and perfectionist, in his free time he reads SotA articles and works on his PhD in machine learning.
