Greedy Heuristics
When you play Wordle, you choose words based on some abstract notion of maximizing information. A computer's approach is not dissimilar: at the start, every valid input word is still possible, and after each guess this list is pruned based on the feedback received. Since we eventually want to narrow the list down to a single word, an intuitive goal for a greedy algorithm is to reduce the size of the remaining word list as much as possible with each guess. However, because we must choose a guess before we know the resulting square colors, it is not as simple as minimizing a single list. For each word there is a set of possible square color patterns (3^5 = 243 combinations, minus the 5 impossible patterns with four greens and one yellow, leaving 238), each with its own resulting list of remaining words.
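To make the pruning step concrete, here is a minimal sketch in Python. The helper names `feedback` and `prune` are ours, not from the original article, and the word-list handling is simplified, but the logic mirrors how a guess and its returned colors cut down the candidate list (including Wordle's handling of repeated letters).

```python
from collections import Counter

def feedback(guess: str, answer: str) -> str:
    """Color pattern for `guess` against a known `answer`:
    'G' = green, 'Y' = yellow, 'B' = grey."""
    pattern = ["B"] * len(guess)
    unmatched = Counter()
    # First pass: mark greens and tally the answer letters left unmatched.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            pattern[i] = "G"
        else:
            unmatched[a] += 1
    # Second pass: mark yellows, consuming unmatched letters so that
    # repeated letters are not over-counted.
    for i, g in enumerate(guess):
        if pattern[i] == "B" and unmatched[g] > 0:
            pattern[i] = "Y"
            unmatched[g] -= 1
    return "".join(pattern)

def prune(candidates: list[str], guess: str, observed: str) -> list[str]:
    """Keep only the candidates consistent with the observed color pattern."""
    return [w for w in candidates if feedback(guess, w) == observed]
```

For instance, `prune(word_list, "aeros", "BBBBB")` would keep only the words that contain none of the letters a, e, r, o, s.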
​
This motivates finding a way to quantify a word's performance based on how well it cuts down the list of remaining words. This is where the probability space of a particular guess becomes useful. For example, at the start of the game there are 12,966 valid words. If we choose aeros as our first guess and all of the resulting squares are grey, the list of possible answers shrinks to 577 words; in other words, this outcome has a probability of 577/12,966 (even all-grey squares provide information, namely which letters are not in the word). Alternatively, if the first two squares come back green, that actually leaves only one possible answer, aecia, an outcome with probability 1/12,966. Simply summing all of the outcome probabilities for a particular word is no use, because the way this probability space is partitioned guarantees the sum equals 1. We instead want to use expected value.
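Continuing the sketch, and reusing the hypothetical `feedback` helper above, the probability of each color pattern for a given guess can be estimated by counting how many of the currently possible answers would produce it:

```python
from collections import Counter

def pattern_distribution(guess: str, possible_answers: list[str]) -> dict[str, float]:
    """Probability of each color pattern when `guess` is played, assuming
    every remaining answer is equally likely."""
    counts = Counter(feedback(guess, answer) for answer in possible_answers)
    total = len(possible_answers)
    return {pattern: n / total for pattern, n in counts.items()}
```

With the full word list, the entry for the all-grey pattern of aeros should come out to 577/12,966, the probability described above.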
This is where the concept of the "bit" is useful. Not to be confused with the binary digit of computer science, the bit in information theory describes the number of times your probability space is halved. By definition, the bit is given by
(1/2)^I = p
Source © 3Blue1Brown
where I is the number of "bits" of information gained when the observed outcome has probability p.
Solving for I gives a logarithmic expression: I = log2(1/p). Finally, we find the expected information gained from a particular word by summing, over every square color pattern, the bits gained from that pattern multiplied by its probability. We take the expected value rather than a plain average of the information returned because we want to weight the more probable outcomes more heavily and the improbable ones less.
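Putting the two pieces together, the expected information of a guess is the probability-weighted sum of the bits each pattern would yield. A minimal sketch, again reusing the hypothetical `pattern_distribution` helper:

```python
import math

def expected_information(guess: str, possible_answers: list[str]) -> float:
    """Expected bits of information gained by playing `guess`:
    E[I] = sum over patterns of p * log2(1/p)."""
    distribution = pattern_distribution(guess, possible_answers)
    return sum(p * math.log2(1 / p) for p in distribution.values())
```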
E[I] = Σ p(x) · log2(1/p(x)), summing over every possible color pattern x
Source © 3Blue1Brown
Now, at each stage of the game, it is as simple as playing the guess that maximizes this expected information gain.
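The greedy step then reduces to an argmax over the allowed guesses. A sketch under the same assumptions, noting that this brute-force scan costs on the order of |guesses| × |answers| feedback evaluations per turn:

```python
def best_guess(valid_guesses: list[str], possible_answers: list[str]) -> str:
    """Greedy choice: the guess with the highest expected information gain."""
    return max(valid_guesses, key=lambda g: expected_information(g, possible_answers))
```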