Submission | Wordle Helpers

Project Topic finalized Submission

Project Idea: Wordle Helper

Members: Jackson Muller, Andy Zhang, Dawson Hartman, Justin Yu

Addressing Deliverables:

User Interface:

For the UI, we will build and run the program in the Visual Studio IDE, where the user can enter guesses and results and see the program’s recommendation for the next best guess. We decided against a Matlab interface because we think writing the program in C++ in Visual Studio will make our code flow better. However, we will still use Matlab to perform frequency analysis on the word lists and potential guesses. These scripts will be run before, and alongside, the C++ program to make sure our analysis is correct. The interface will start as a command-line program, but we hope to move it to a website to make it more user friendly.

The interface will include error checking to make sure the user enters valid words (i.e., five letters, actual words, etc.). It will first recommend a few different starting words to the player. Once the player indicates which word they chose, they will enter the results from Wordle as an array of zeros, ones, and twos indicating the three states a letter can have: correct letter but wrong location, correct letter and right location, and wrong letter. The process then repeats for each subsequent guess.
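The input checks described above could be sketched as follows. This is only an illustration under several assumptions: the names `Mark`, `normalizeGuess`, `isKnownWord`, and `parseFeedback` are hypothetical, the mapping of digits to letter states (0 = wrong letter, 1 = wrong location, 2 = right location) is one possible choice that the proposal has not fixed, and the code assumes C++17 for `std::optional`.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical encoding of the three letter states; the digit
// assignment is an assumption, not something the proposal specifies.
enum class Mark : int { Absent = 0, Present = 1, Correct = 2 };

// Returns the lowercased guess if it is exactly five ASCII letters,
// otherwise an empty optional.
std::optional<std::string> normalizeGuess(const std::string& raw) {
    if (raw.size() != 5) return std::nullopt;
    std::string out;
    for (unsigned char c : raw) {
        if (!std::isalpha(c)) return std::nullopt;
        out.push_back(static_cast<char>(std::tolower(c)));
    }
    return out;
}

// Checks that a word appears in a word list that is sorted ascending.
bool isKnownWord(const std::vector<std::string>& sortedWords,
                 const std::string& word) {
    assert(std::is_sorted(sortedWords.begin(), sortedWords.end()));
    return std::binary_search(sortedWords.begin(), sortedWords.end(), word);
}

// Parses a result string such as "02110" into five marks, rejecting
// anything that is not exactly five digits in the range 0 to 2.
std::optional<std::array<Mark, 5>> parseFeedback(const std::string& raw) {
    if (raw.size() != 5) return std::nullopt;
    std::array<Mark, 5> marks{};
    for (std::size_t i = 0; i < 5; ++i) {
        if (raw[i] < '0' || raw[i] > '2') return std::nullopt;
        marks[i] = static_cast<Mark>(raw[i] - '0');
    }
    return marks;
}
```

A command-line loop would call `normalizeGuess` and `isKnownWord` on each word the player types and re-prompt whenever either check fails, then do the same with `parseFeedback` for the result string.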

Word Bank:

Our data set consists of word banks. Several word lists are available to us, but we are considering two main ones: the original Wordle “possible solutions” list and the Wordle list of guessable words. We plan to incorporate both lists into our solution to best emulate the original Wordle game. Our program will perform frequency analysis on these data sets and determine the best course of action from there.

Frequency Analysis:

We plan to incorporate frequency analysis in several ways. The first occurs during the initial “reading in” of the full word bank. In this first read-through, our program will record the frequency of each letter of the alphabet, producing a vector of length 26 whose entries are the letter frequencies. Our program will also keep track of letter positioning. For example, the letter “A” will have a vector of length 5 in which each element counts the number of times “A” appears in that position across all the words in the word bank. Thus, each of the 26 letters will have its own length-5 vector. Finally, each word will have a length-26 vector indicating which letters appear in that word. Using the information in these vectors, we will be able to determine the top few first guesses.
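As a rough sketch of the three kinds of vectors described above, the first read-through might look like the code below. The names `LetterStats`, `countLetters`, and `letterIndicator` are hypothetical, and the sketch assumes that words are already lowercase, exactly five letters long, and that every occurrence of a letter is counted (so a word with a repeated letter contributes more than once to the length-26 frequency vector).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical container for the word-bank statistics.
struct LetterStats {
    std::array<int, 26> total{};                      // count of each letter
    std::array<std::array<int, 5>, 26> byPosition{};  // count per letter per slot
};

// One pass over the word bank, filling both tables.
// Assumes lowercase five-letter words (an assumption of this sketch).
LetterStats countLetters(const std::vector<std::string>& words) {
    LetterStats stats;
    for (const std::string& w : words) {
        assert(w.size() == 5);
        for (std::size_t pos = 0; pos < 5; ++pos) {
            const int letter = w[pos] - 'a';
            assert(letter >= 0 && letter < 26);
            ++stats.total[static_cast<std::size_t>(letter)];
            ++stats.byPosition[static_cast<std::size_t>(letter)][pos];
        }
    }
    return stats;
}

// Length-26 vector with a 1 for every letter that appears in the word.
std::array<int, 26> letterIndicator(const std::string& word) {
    std::array<int, 26> v{};
    for (char c : word) v[static_cast<std::size_t>(c - 'a')] = 1;
    return v;
}
```

For the three-word bank {"crane", "slate", "eerie"}, `total` for the letter E is 5 and `byPosition` for E in the last slot is 3, since all three words end in E.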

Finding the best first guesses will involve a second read-through of the full list of words. Using cosine similarity, we will find the words whose length-26 vectors best match the direction of the length-26 vector of the entire word list. The five words with the best scores will be considered for the first guess. If two or more of the top words are anagrams of each other, we will use letter positions as the tie-breaker: for each word in question, we will compare the position of each of its letters against that letter’s length-5 vector and prefer the word whose positioning best aligns with the word bank.
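One way the scoring and tie-breaking above could be written is sketched below. The function names and type aliases are hypothetical, and the tie-breaker shown (summing the positional counts of a word’s five letters) is only one reasonable reading of “positioning that best aligns with the word bank”. Cosine similarity between vectors a and b is computed as their dot product divided by the product of their lengths.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

using LetterVector = std::array<int, 26>;
using PositionTable = std::array<std::array<int, 5>, 26>;

// Length-26 vector with a 1 for every letter present in the word.
LetterVector letterIndicator(const std::string& word) {
    LetterVector v{};
    for (char c : word) v[static_cast<std::size_t>(c - 'a')] = 1;
    return v;
}

// dot(a, b) / (|a| * |b|); returns 0 if either vector is all zeros.
double cosineSimilarity(const LetterVector& a, const LetterVector& b) {
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += static_cast<double>(a[i]) * b[i];
        normA += static_cast<double>(a[i]) * a[i];
        normB += static_cast<double>(b[i]) * b[i];
    }
    if (normA == 0.0 || normB == 0.0) return 0.0;
    return dot / (std::sqrt(normA) * std::sqrt(normB));
}

// Assumed tie-breaker: how often each letter of the word occupies
// the same slot across the word bank, summed over the five slots.
int positionalScore(const std::string& word, const PositionTable& byPosition) {
    assert(word.size() == 5);
    int score = 0;
    for (std::size_t pos = 0; pos < 5; ++pos)
        score += byPosition[static_cast<std::size_t>(word[pos] - 'a')][pos];
    return score;
}

// Returns the k best words: highest cosine similarity first, with
// exact ties (such as anagrams) broken by positional score.
// Scores are recomputed inside the comparator for brevity.
std::vector<std::string> topGuesses(std::vector<std::string> words,
                                    const LetterVector& bankFreq,
                                    const PositionTable& byPosition,
                                    std::size_t k) {
    auto better = [&](const std::string& x, const std::string& y) {
        const double sx = cosineSimilarity(letterIndicator(x), bankFreq);
        const double sy = cosineSimilarity(letterIndicator(y), bankFreq);
        if (sx != sy) return sx > sy;
        return positionalScore(x, byPosition) > positionalScore(y, byPosition);
    };
    k = std::min(k, words.size());
    std::partial_sort(words.begin(),
                      words.begin() + static_cast<std::ptrdiff_t>(k),
                      words.end(), better);
    words.resize(k);
    return words;
}
```

Because anagrams contain the same set of letters, their indicator vectors and therefore their cosine scores are identical, which is why a separate positional tie-breaker is needed at all.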

Machine Learning Methods Comparison:

Unsupervised Learning

Pros:

Allows for a list of best guesses that contain the most common letters and the most common locations for those letters
Uses same word banks as our approach
Letter arrays are easily comparable to the user input array
Provides a ranked list with weighted scores for comparison

Cons:

Only applied in the article to the first guess of the game
Doesn’t account for whether or not the suggested word is in the solution bank

COPOD

Pros:

Selects the guess that is closest to the mean of the data
Finds the outliers of the data set, but can also be used to find the values most like the mean of the distribution function

Cons:

Would need to be rerun for every subsequent guess, on a smaller list with a different letter distribution

Greedy Algorithm

Pros:

Can beat the game in 5 guesses or fewer in regular mode (when optimized)
Simple algorithm: chooses a word that reduces the possible word list the most

Cons:

Requires at most 8 guesses in hard mode
Unclear how much weight the algorithm gives to yellow versus green results from a guess, and how that affects the way it narrows down the word bank
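To make the greedy idea concrete, here is a minimal sketch of one common formulation, not necessarily the one used in the source we reviewed. It assumes that “reduces the possible word list the most” means minimizing the largest group of candidates that would remain after any single feedback pattern. The function names are hypothetical, and the digit encoding (0 = wrong letter, 1 = wrong location, 2 = right location) is an assumption.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Wordle-style feedback for a guess against a known answer, as a
// five-character string of '0', '1', and '2' (assumed encoding).
// Repeated letters are only marked '1' while unmatched copies remain.
std::string feedback(const std::string& guess, const std::string& answer) {
    assert(guess.size() == 5 && answer.size() == 5);
    std::string marks(5, '0');
    std::array<int, 26> unused{};  // answer letters not matched as '2'
    for (std::size_t i = 0; i < 5; ++i) {
        if (guess[i] == answer[i]) marks[i] = '2';
        else ++unused[static_cast<std::size_t>(answer[i] - 'a')];
    }
    for (std::size_t i = 0; i < 5; ++i) {
        if (marks[i] == '2') continue;
        int& left = unused[static_cast<std::size_t>(guess[i] - 'a')];
        if (left > 0) { marks[i] = '1'; --left; }
    }
    return marks;
}

// Size of the largest group of candidates sharing one feedback pattern.
std::size_t worstCaseRemaining(const std::string& guess,
                               const std::vector<std::string>& candidates) {
    std::map<std::string, std::size_t> buckets;
    for (const std::string& answer : candidates)
        ++buckets[feedback(guess, answer)];
    std::size_t worst = 0;
    for (const auto& entry : buckets) worst = std::max(worst, entry.second);
    return worst;
}

// Picks the guess whose worst-case remaining list is smallest;
// the earliest such guess wins ties.
std::string greedyGuess(const std::vector<std::string>& guesses,
                        const std::vector<std::string>& candidates) {
    assert(!guesses.empty());
    std::string best = guesses.front();
    std::size_t bestWorst = worstCaseRemaining(best, candidates);
    for (const std::string& g : guesses) {
        const std::size_t w = worstCaseRemaining(g, candidates);
        if (w < bestWorst) { best = g; bestWorst = w; }
    }
    return best;
}
```

In this formulation yellow and green results carry no explicit weights: each distinct feedback pattern simply defines a group of candidate answers, and only the group sizes matter. Whether the algorithm we reviewed works the same way is still an open question for us.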