Wordle Solver and Evaluator

Final project for 15418/618. Songyu Han and Joel Ye

Project Page

Summary: We will implement a Wordle solver based on information-theoretic heuristics and evaluate its performance on the GHC and PSC machines.


Proposal (Posted on main for convenience)

Background

Wordle is a game in which the goal is to guess a 5-letter English word in as few guesses as possible. After each guess, you are told which letters in your guess appear in the solution, and whether they are also in the correct position. Below, we show a three-round game.

A basic strategy is to minimize the average-case number of guesses for answers drawn from a common word bank (roughly 1-10K words). This strategy is elaborated in this 3B1B video and has three components:

  1. Compute the entropy of the candidate answer words (with probabilities drawn from a lookup table), given the state of the board.
  2. Compute the expected information gain \( I(g) \) of a guess, which equals the entropy of the feedback pattern under the current answer distribution (a sketch of this computation follows the list).
  3. Identify the guess with the highest expected information gain. Loosely, the guess whose feedback we are least sure about provides the most useful clue for future moves. This 1-step heuristic supports an approximate policy that scores each word by combining its probability of being correct at the current round with a value function \( f(I(g)) \) that estimates the number of future guesses still needed.

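To make steps 1 and 2 concrete, here is a minimal sketch (in C++, our likely implementation language) of computing \( I(g) \) for a single guess: bucket the remaining candidate answers by the feedback pattern they would produce, then take the entropy of that distribution. The helper names and signatures here are illustrative, not our final implementation.

```cpp
#include <array>
#include <cmath>
#include <string>
#include <vector>

// 5 letters x 3 colors per letter -> 3^5 = 243 possible feedback patterns.
constexpr int kNumPatterns = 243;

// Encode the gray/yellow/green feedback for (guess, answer) as an integer
// in [0, 243). Assumes lowercase a-z words; repeated letters are handled
// with a remaining-count pass.
int feedback_pattern(const std::string& guess, const std::string& answer) {
  std::array<int, 5> color{};       // 0 = gray, 1 = yellow, 2 = green
  std::array<int, 26> remaining{};  // unmatched letter counts in the answer
  for (int i = 0; i < 5; ++i) {
    if (guess[i] == answer[i]) color[i] = 2;
    else remaining[answer[i] - 'a']++;
  }
  for (int i = 0; i < 5; ++i) {
    if (color[i] == 0 && remaining[guess[i] - 'a'] > 0) {
      color[i] = 1;
      remaining[guess[i] - 'a']--;
    }
  }
  int code = 0;
  for (int i = 4; i >= 0; --i) code = code * 3 + color[i];
  return code;
}

// I(g): entropy (in bits) of the feedback distribution induced by `guess`
// over the current candidates and their (normalized) probabilities.
double expected_information_gain(const std::string& guess,
                                 const std::vector<std::string>& candidates,
                                 const std::vector<double>& probs) {
  std::array<double, kNumPatterns> mass{};  // probability mass per pattern
  for (size_t i = 0; i < candidates.size(); ++i) {
    mass[feedback_pattern(guess, candidates[i])] += probs[i];
  }
  double entropy = 0.0;
  for (double p : mass) {
    if (p > 0.0) entropy -= p * std::log2(p);
  }
  return entropy;
}
```
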
Further, proposed strategies like this one must be evaluated on a test set of answers. This evaluation is typically treated as data-parallel across answers. There may be a chance to improve the heuristic policy with value iteration if the evaluator is fast enough.

The Challenge

There are three nested layers of loops, each of which works over roughly 1-10K words.

  1. The evaluation loop (for each answer in the test bank)
  2. The guess loop (for each legal guess: which is the best next move?)
    • We can experiment with how much per-worker minibatch aggregation to do before comparing across workers.
  3. The candidate answer loop (compute each candidate's probability by weighting its lookup-table probability by its compatibility with the guess feedback); a sketch of this update follows the list.
    • The working set of candidates is dynamically sized based on the state of the board, so there is a workload-balancing challenge in the innermost loop.
    • Computing \( p(x) \) for different candidates involves no interdependencies, but the probabilities must be synchronized and gathered to compute the final result.

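As a sketch of the innermost loop's per-candidate work item (again illustrative C++, reusing the hypothetical `feedback_pattern` helper from the earlier sketch): after observing a feedback pattern for a guess, incompatible candidates are zeroed out, and the survivors are re-weighted by their lookup-table probability and renormalized.

```cpp
#include <string>
#include <vector>

// Declared in the earlier sketch.
int feedback_pattern(const std::string& guess, const std::string& answer);

// Posterior over candidates after observing `pattern` for `guess`:
// keep candidates whose feedback matches, weight them by their prior
// (lookup-table) probability, and renormalize. The final normalization
// is the gather/synchronization point noted above.
std::vector<double> update_candidates(const std::string& guess, int pattern,
                                      const std::vector<std::string>& candidates,
                                      const std::vector<double>& prior) {
  std::vector<double> posterior(candidates.size(), 0.0);
  double total = 0.0;
  for (size_t i = 0; i < candidates.size(); ++i) {
    if (feedback_pattern(guess, candidates[i]) == pattern) {
      posterior[i] = prior[i];  // compatible: keep prior weight
      total += prior[i];
    }
  }
  if (total > 0.0) {
    for (double& p : posterior) p /= total;
  }
  return posterior;
}
```
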
This workload is characterized by a high communication-to-computation ratio; the per-element computation is effectively a few multiplications and max operations.

At minimum, we expect to try parallelizing the inner two loops, over different guesses and different candidates. There are a number of implementation details that are interesting for a parallel programming class. For example, parallelizing the guess loop (2) may have memory-locality benefits when a single worker compares against multiple candidates, but the same can be said for parallelizing over candidates across all possible guesses. Further, the workload profile shifts over rounds of guessing, because the candidate pool should shrink exponentially. This may motivate a different distribution of work in the inner loop, or staggered work in the outer loop.
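
For example, a first cut at parallelizing the guess loop with OpenMP might look like the sketch below: each thread scores a block of guesses with a private running best, and the per-thread bests are reduced to a single argmax at the end (the per-worker aggregation mentioned above). The function names carry over from the earlier sketches, and the scheduling choice is only a starting point, not a settled design.

```cpp
#include <omp.h>

#include <string>
#include <vector>

// From the earlier sketch (declaration only).
double expected_information_gain(const std::string& guess,
                                 const std::vector<std::string>& candidates,
                                 const std::vector<double>& probs);

// Returns the index of the guess with the highest expected information gain.
// Each thread keeps a private best over its chunk of guesses; the per-thread
// bests are compared once per thread at the end.
int best_guess(const std::vector<std::string>& guesses,
               const std::vector<std::string>& candidates,
               const std::vector<double>& probs) {
  int best_idx = -1;
  double best_gain = -1.0;
  #pragma omp parallel
  {
    int local_idx = -1;
    double local_gain = -1.0;
    // Dynamic scheduling hedges against imbalance if guess costs vary.
    #pragma omp for schedule(dynamic, 64)
    for (int g = 0; g < static_cast<int>(guesses.size()); ++g) {
      double gain = expected_information_gain(guesses[g], candidates, probs);
      if (gain > local_gain) {
        local_gain = gain;
        local_idx = g;
      }
    }
    // Cross-worker comparison: one critical section entry per thread.
    #pragma omp critical
    {
      if (local_gain > best_gain) {
        best_gain = local_gain;
        best_idx = local_idx;
      }
    }
  }
  return best_idx;
}
```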

Resources

We will use the code accompanying the 3B1B video as a reference for algorithmic correctness, and the video itself for overall guidance. However, that implementation is a naive for-loop in Python. We will use OpenMP on GHC and move to PSC for late-stage testing; the main implementation will likely depend on a shared-memory abstraction. Time permitting, we may compare against an MPI-based implementation in which workers are dedicated to different subsets of candidates.

Goals and Deliverables

Plan to Achieve: Minimally, we expect to:

We also feel it is likely we will be able to provide a CUDA implementation of this solver, but it is unclear whether this problem would benefit from high thread-count parallelism.

Hope to Achieve:

Deliverables at Poster Day:

Platform Choice

We will begin with low worker-count parallelism to get a sense of the problem scale. The problem may not be large enough to justify many more cores or a GPU implementation (the naive approach was sufficient for the reference video). If we do not get a satisfactory speedup, a GPU implementation would be interesting, as the innermost string comparison is a heterogeneous operation; this would perhaps motivate using a large precomputed lookup table instead of computing feedback on the fly.

Proposal Schedule (Progress Updated as of May 6th)