Your browser has just loaded information about roughly 800 artworks from the collection at the Metropolitan Museum of Art. The museum has publicly released a large dataset about its collection [5]; we are displaying just a small fraction of it. The artworks are positioned randomly.
Hover over an artwork to see its details.
Each artwork includes basic metadata, such as its title, artist, date made, medium, and dimensions. Data scientists call the metadata attached to each data point (here, each artwork) its features. Below are 10 artworks from the dataset.
These features can be thought of as vectors existing in a high-dimensional space. We want to visualize the vectors, but we can’t show all the dimensions at once.
The answer is to project the data into a lower dimension, one that can be visualized. This kind of projection is called an embedding.
Computing a 1-dimensional embedding requires taking each artwork and computing a single number to describe it. A benefit of reducing to 1D is that the numbers, and the artworks, can be sorted on a line.
On the right you see the artworks positioned according to their average pixel brightness. Notice that the images are sorted, with the darkest images appearing at the top and the brightest images at the bottom!
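The brightness ordering described above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind this page: the three tiny grayscale "artworks" are synthetic, and the helper name `brightness_embedding` is hypothetical.

```python
import numpy as np

def brightness_embedding(images):
    """Map each grayscale image (2D array, values 0-255) to one number: its mean pixel value."""
    return np.array([float(np.mean(img)) for img in images])

# Three tiny synthetic "artworks" (hypothetical data): dark, bright, and mid-gray.
images = [
    np.full((4, 4), 20),
    np.full((4, 4), 230),
    np.full((4, 4), 128),
]

scores = brightness_embedding(images)  # one coordinate per artwork (the 1D embedding)
order = np.argsort(scores)             # artwork indices sorted from darkest to brightest
```

Here `scores` holds the single number that places each artwork on the line, and `order` gives the sequence in which the artworks would appear along it.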
Dimensionality reduction can be formulated mathematically in the context of a given dataset. Consider a dataset represented as a matrix $X$ of size $m \times n$, where $m$ is the number of rows of $X$ and $n$ is the number of columns.
Typically, the rows of the matrix are data points and the columns are features. Dimensionality reduction reduces the number of features of each data point, turning $X$ into a new matrix, $X'$, of size $m \times d$, where $d < n$. For visualizations we typically set $d$ to be 1, 2, or 3.
Say $m = n = 3$, that is, $X$ is a square matrix. Performing dimensionality reduction on $X$ and setting $d = 2$ will change it from a square matrix to a tall, rectangular matrix.
Each data point now has only two features, i.e., each data point has been reduced from a 3-dimensional vector to a 2-dimensional vector.
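To make the change of shape concrete, here is a small sketch of the simplest case, a linear reduction, in which a data matrix `X` with 3 rows (data points) and 3 columns (features) is multiplied by a 3 × 2 matrix `W`. Both matrices are made-up values chosen for illustration, and the names `X`, `W`, and `X_reduced` are assumptions of this sketch. Algorithms such as PCA choose the projection from the data rather than by hand, and non-linear methods do not reduce to a single matrix product.

```python
import numpy as np

# Hypothetical 3 x 3 data matrix: 3 data points (rows), 3 features (columns).
X = np.array([[1.0, 2.0, 4.0],
              [3.0, 0.0, 2.0],
              [5.0, 6.0, 8.0]])

# Hand-picked 3 x 2 projection matrix (an assumption for illustration):
# new feature 1 copies old feature 1; new feature 2 averages old features 2 and 3.
W = np.array([[1.0, 0.0],
              [0.0, 0.5],
              [0.0, 0.5]])

# The reduced matrix keeps one row per data point but has only 2 columns.
X_reduced = X @ W
```

The number of rows is unchanged, so every data point survives the reduction; only the number of columns, and therefore the length of each data point's vector, shrinks from 3 to 2.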
The same brightness feature can be used to position the artworks in 2D space instead of 1D. The pieces have more room to spread out.
On the right you see a simple 2-dimensional embedding based on image brightness, but this isn’t the only way to position the artworks. In fact, there are many, and some projections are more useful than others.
Use the slider to vary the influence that brightness and artwork age have in determining the embedding positions. As you move the slider from brightness to artwork age, the embedding shifts from separating bright and dark images to clustering recent, modern-day artworks in the bottom left corner, while older artworks move farther away (hover over an image to see its date).
We just showed an example of a user-driven embedding, where the exact influence of each feature is known in the embedding. However, you may have noticed that it’s hard to find meaningful combinations of feature weights.
State-of-the-art algorithms can search for a combination of features such that distances in the high-dimensional space are preserved as well as possible in the embedding. Use the tool below to project the artworks using three commonly used algorithms.
In this example we are performing reduction on the pixels of each image: each image is flattened into a single vector, where each pixel represents one feature. We then reduce these vectors to two dimensions.
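The flatten-then-reduce step can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the artworks: the images are random synthetic arrays, all assumed to be grayscale and the same size, and PCA (computed here with NumPy's singular value decomposition) stands in for whichever algorithm is selected.

```python
import numpy as np

rng = np.random.default_rng(0)

# 50 synthetic 16 x 16 grayscale images (hypothetical stand-ins for the artworks).
images = rng.integers(0, 256, size=(50, 16, 16)).astype(float)

# Flatten: each image becomes one row of 256 pixel features.
X = images.reshape(len(images), -1)

# PCA via SVD: center each feature, then project onto the top 2 principal directions.
X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
embedding = X_centered @ Vt[:2].T  # one (x, y) position per image
```

Because the singular values are returned in descending order, the first column of `embedding` carries at least as much variance as the second.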
There are many algorithms that compute a dimensionality reduction of a dataset. Simpler algorithms such as principal component analysis (PCA) find the linear projection that retains as much of the variance in the data as possible. More complicated algorithms, such as t-distributed stochastic neighbor embedding (t-SNE) [2], iteratively produce highly clustered embeddings. Unfortunately, whereas before the influence of each feature was explicitly known, one must now relinquish control to the algorithm to determine the best embedding, so it is not clear which features of the data are used to compute it. This can lead to misinterpreting what an embedding is showing [10].
Dimensionality reduction, and more broadly the field of unsupervised learning, is an active area of research where researchers are developing new techniques to create better embeddings. A new technique, uniform manifold approximation and projection (UMAP) [4], is a non-linear reduction that aims to create visually striking embeddings fast, scaling to larger datasets.
Dimensionality reduction is a powerful tool to better understand high-dimensional data. If you have your own dataset and wish to visualize it using dimensionality reduction, there are a number of different algorithms [3] and implementations available. In Python, the scikit-learn package [7, 8] provides APIs for many unsupervised dimensionality reduction algorithms, as well as manifold learning: an approach to non-linear dimensionality reduction.
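As a starting point, the sketch below runs scikit-learn's `PCA` and `TSNE` estimators on the same data. It assumes scikit-learn is installed and uses the small handwritten-digits dataset bundled with the library as a stand-in for your own data; the `perplexity` and `random_state` values are arbitrary choices, not recommendations.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Stand-in dataset: 8 x 8 digit images, already flattened to 64 features per row.
digits = load_digits()
X = digits.data[:200]  # keep 200 samples so t-SNE runs quickly

# Both estimators expose the same fit_transform interface.
pca_embedding = PCA(n_components=2).fit_transform(X)
tsne_embedding = TSNE(n_components=2, perplexity=30.0, random_state=0).fit_transform(X)
```

Each result has one row per sample and two columns, ready to be used as plot coordinates. UMAP is not part of scikit-learn; it is distributed separately as the umap-learn package.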
Regarding the three algorithms discussed above, you can find Python implementations of the algorithms we used for the artworks here: PCA, t-SNE [2], and UMAP [4].