About My Research (now outdated but still a theme and research direction that I am interested in pursuing!)
My thesis research is about multimodal embeddings that have semantic properties. Specifically, I am looking at two modalities: language (text) and vision (images).
What is an Embedding
You may be familiar with a vector as "something with direction and magnitude" from high school physics, but I think this overcomplicates it--a vector is just a way to define a point in space. In two dimensions, a "vector" is basically the same as what non-mathematicians would call a "point": an (x,y) coordinate pair. You can tack on a third number to get a vector in three dimensions, which is just a point (x,y,z). The x, y, z notation (i.e. the fact that we run out of letters after using z for the third dimension) is indicative of the fact that we mostly encounter vectors of three dimensions or fewer in the real world, since physical space is three-dimensional.
However, there is no reason we have to stop at 3D vectors. What about adding a fourth element to our point to get an (x,y,z,j) tuple? We now have a 4D vector! At four dimensions, it becomes hard to visualize the vector like we could in two or three dimensions, but you can see how it is a natural extension of the 3D vector. Just like 2D and 3D vectors, you can add or subtract 4D vectors and calculate how close together they are (for 2D vectors, we do this with the Pythagorean theorem). Importantly, the 4D vector can store more information than the 3D one--it has an extra j component that can capture something a 3D vector could not.
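The distance calculation mentioned above generalizes directly from the Pythagorean theorem to any number of dimensions. A minimal sketch in Python (the `distance` function here is just an illustration, not from any particular library):

```python
import math

def distance(a, b):
    # Generalized Pythagorean theorem: the square root of the sum
    # of squared component-wise differences. The same formula works
    # in 2, 3, 4, or any number of dimensions.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# 2D: the classic 3-4-5 right triangle
print(distance((0, 0), (3, 4)))              # 5.0

# 4D: same formula, just one more component per vector
print(distance((1, 2, 3, 4), (1, 2, 3, 6)))  # 2.0
```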
An embedding is a vector with very many dimensions. More accurately, an embedding is a high dimensional vector that represents some other type of information, such as a photo or word (embeddings are sometimes also called "representations"). Embeddings are important for many reasons, but one is that they allow us to do math with things with which we normally can't do math. Subtracting the word "cat" from the word "dog" is not a defined action, but subtracting the embedding for the word "cat" from the embedding for the word "dog" is possible, since embeddings are just vectors, and you can subtract vectors.
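To make the "you can subtract embeddings" point concrete, here is a toy sketch. The numbers below are made up purely for illustration--real embeddings have hundreds or thousands of learned dimensions:

```python
# Toy 4-dimensional "embeddings" (made-up numbers for illustration).
dog = [0.8, 0.1, 0.3, 0.5]
cat = [0.7, 0.2, 0.3, 0.4]

# "dog" - "cat" is undefined for the words themselves, but the
# embeddings are just vectors, so component-wise subtraction works.
difference = [d - c for d, c in zip(dog, cat)]
print(difference)
```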
Semantic Embeddings
Just because you are allowed to perform mathematical operations on embeddings doesn't mean that the results will make sense. You may measure the closeness (again, with a variation of the Pythagorean theorem) of the embedding for "dog" and the embedding for "cat" and notice that the "cat" embedding is closer to "pickle" than it is to "dog." This is why good embeddings are carefully designed so that the operations have intuitive results.
The first famous embeddings with this trait were the word2vec embeddings for language. Not only was the closeness operation sensible ("dog" is closer to "cat" than to "pickle"), but addition and subtraction worked: you could calculate "king" - "man" + "woman" and the result would be "queen"! This is no small feat and was widely praised in the research community.
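The analogy arithmetic can be sketched with hand-crafted toy vectors (real word2vec vectors are learned from text, and closeness is usually measured with cosine similarity rather than raw distance):

```python
import math

def cosine(a, b):
    # Cosine similarity: 1.0 means pointing the same direction.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hand-crafted 2D toy vectors; read the dimensions loosely as
# "royalty" and "maleness." Real embeddings are learned, not designed.
vocab = {
    "king":  [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man":   [0.1, 0.9],
    "woman": [0.1, 0.1],
}

# king - man + woman ≈ [0.9, 0.1], which is "queen" in this toy space
query = [k - m + w for k, m, w in
         zip(vocab["king"], vocab["man"], vocab["woman"])]

# Find the nearest word, excluding the three query words themselves
best = max((w for w in vocab if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(query, vocab[w]))
print(best)  # queen
```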
Multimodal Embeddings
We skipped over one part in explaining embeddings. How do you go from a word like "dog" to its embedding, which is a vector of numbers?
It turns out you do this using a function. In many cases the function is very complicated and cannot be written down in terms of symbols, but conceptually it is similar to something like f(x) = x^2, where x would be the word "dog," f would be the function, and f(x) would be the resulting embedding vector.
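In the simplest case, such a function is just a lookup table mapping each word in a vocabulary to its own vector. A minimal sketch (the vectors here are random stand-ins; in a real model they would be tuned during training):

```python
import random

random.seed(0)
DIM = 8  # real embeddings typically use hundreds of dimensions

# A lookup table from words to (here, random) vectors.
table = {w: [random.uniform(-1, 1) for _ in range(DIM)]
         for w in ["dog", "cat", "pickle"]}

def f(word):
    """The embedding function: map a word to its vector."""
    return table[word]

print(f("dog"))  # an 8-dimensional vector of numbers
```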
Designing these functions for various modalities, or types of data, is one of the main pursuits of modern deep learning. You will often see new machine learning papers talk about learning "good representations"--what they mean when they say that is that they found a function f that yields good results f(x) for any input x.
My research area is a slight variation where you want to learn two functions, say f(x) and g(y), at the same time, so that they work well together. Specifically, f(x) could generate an embedding for an image x and g(y) could generate an embedding for a text y. By working well together, we mean that, for example, the output vector of f for an image of a dog would be close to the output vector of g for the word "dog."
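The interface of the two-encoder setup can be sketched as below. The encoders here are hard-coded stubs purely to illustrate what "working well together" means; real image and text encoders are deep networks trained jointly (for example with a contrastive objective) so that matching pairs land close together:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

# Hypothetical pre-trained encoders: f for images, g for text.
# These stubs return fixed vectors just to show the interface.
def f(image):
    return {"dog_photo": [0.9, 0.2, 0.1]}[image]

def g(text):
    return {"dog": [0.8, 0.3, 0.1], "pickle": [0.0, 0.1, 0.9]}[text]

# If f and g work well together, the photo of a dog should sit
# closer to the word "dog" than to an unrelated word.
sim_dog = cosine(f("dog_photo"), g("dog"))
sim_pickle = cosine(f("dog_photo"), g("pickle"))
print(sim_dog > sim_pickle)  # True
```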
Ultimately, my goal for my thesis is to have the ability to replicate the original results from word2vec, but using both text and images interchangeably, so you could have something like:
"man" + 👑 = "king" = 🤴
Why do I (and why should you) care?
There are various potential applications and extensions of multimodal embeddings, but my interest mainly stems from the fact that I think multimodal embeddings, specifically those that allow semantic arithmetic in the representation space, are the future of sports analytics. I believe (as do other reputable sources) that separating individual performance from context is the underlying purpose and goal of sports analytics (ESPN's receiver analytics, for example). This is something humans struggle with, and it is difficult even with computers. The reason we now have ChatGPT and other AI advances is largely the ability of Transformer-based models to understand language in context using really good language embeddings. My hope is that this can be replicated in other domains, and specifically in sports analytics.
I believe I am one of a handful of people in the world with a very specific combination of technical training (in traditional CS, AI/ML, and math/statistics/data science) and knowledge of, and passion for, various sports. These two things are what I spend the majority of my time thinking about on a daily basis.