Algorithm for classifying words for hangman difficulty levels as "Easy", "Medium", or "Hard"

Algorithm Problem Overview


What is a good algorithm to determine the "difficulty" of a word for a hangman game, so that the game can select words to match a specified difficulty level?

Difficulty would seem related to the number of guesses required, the relative frequency of usage of letters (e.g. words with many uncommon letters may be harder to guess), and potentially the length of the word.

There are also some subjective factors to (attempt to) compensate for, such as the likelihood a word is in the vocabulary of the player and can be recognised, allowing the player to move from a guessing strategy based on letter frequencies alone to guessing from a list of known matching words.

My attempt so far is below, in Ruby. Any suggestions on how to improve the categorisation?

def classify_word(w)
  n = w.chars.uniq.length # number of unique characters in w
  if n < 5 && w.length > 4
    return WordDifficulty::Easy
  end
  if n > w.length / 2
    return WordDifficulty::Hard
  else
    return WordDifficulty::Medium
  end
end

I am writing a hangman game that I would like my children to play; I am rather too old to be attempting "homework", which may be why the question is receiving so many downvotes... Words are drawn randomly from large word databases, which include many obscure words, and are filtered by the difficulty level determined for each word.

Algorithm Solutions


Solution 1 - Algorithm

1. Introduction

Here's a way to approach this problem systematically: if you have an algorithm that plays hangman well, then you can take the difficulty of each word to be the number of wrong guesses that your program would take if guessing that word.

2. Aside on hangman strategy

There's an idea that's implicit in some of the other answers and comments, that the optimal strategy for the solver would be to base their decisions on the frequency of letters in English, or on the frequency of words in some corpus. This is a seductive idea, but it's not quite right. The solver does best if it accurately models the distribution of words chosen by the setter, and a human setter may well be choosing words based on their rarity or avoidance of frequently used letters. For example, although E is the most frequently used letter in English, if the setter always chooses from the words JUGFUL, RHYTHM, SYZYGY, and ZYTHUM, then a perfect solver does not start by guessing E!

The best approach to modelling the setter depends on the context, but I guess that some kind of Bayesian inductive inference would work well in a context where the solver plays many games against the same setter, or against a group of similar setters.

3. A hangman algorithm

Here I'll outline a solver that is pretty good (but far from perfect). It models the setter as choosing words uniformly from a fixed dictionary. It's a greedy algorithm: at each stage it guesses the letter that minimizes the number of misses, that is, words that do not contain the guess. For example, if no guesses have been made so far, and the possible words are DEED, DEAD and DARE, then:

  • if you guess D or E, there are no misses;
  • if you guess A, there's one miss (DEED);
  • if you guess R, there are two misses (DEED and DEAD);
  • if you guess any other letter, there are three misses.

So either D or E is a good guess in this situation.

(Thanks to Colonel Panic in comments for pointing out that correct guesses are free in hangman—I totally forgot this in my first attempt!)

4. Implementation

Here's an implementation of this algorithm in Python:

from collections import defaultdict
from string import ascii_lowercase

def partition(guess, words):
    """Apply the single letter 'guess' to the sequence 'words' and return
    a dictionary mapping the pattern of occurrences of 'guess' in a
    word to the list of words with that pattern.

    >>> words = 'deed even eyes mews peep star'.split()
    >>> sorted(list(partition('e', words).items()))
    [(0, ['star']), (2, ['mews']), (5, ['even', 'eyes']), (6, ['deed', 'peep'])]

    """
    result = defaultdict(list)
    for word in words:
        key = sum(1 << i for i, letter in enumerate(word) if letter == guess)
        result[key].append(word)
    return result

def guess_cost(guess, words):
    """Return the cost of a guess, namely the number of words that don't
    contain the guess.

    >>> words = 'deed even eyes mews peep star'.split()
    >>> guess_cost('e', words)
    1
    >>> guess_cost('s', words)
    3

    """
    return sum(guess not in word for word in words)

def word_guesses(words, wrong=0, letters=''):
    """Given the collection 'words' that match all letters guessed so far,
    generate tuples (wrong, nguesses, word, guesses) where
    'word' is the word that was guessed;
    'guesses' is the sequence of letters guessed;
    'wrong' is the number of these guesses that were wrong;
    'nguesses' is len(guesses).

    >>> words = 'deed even eyes heel mere peep star'.split()
    >>> from pprint import pprint
    >>> pprint(sorted(word_guesses(words)))
    [(0, 1, 'mere', 'e'),
     (0, 2, 'deed', 'ed'),
     (0, 2, 'even', 'en'),
     (1, 1, 'star', 'e'),
     (1, 2, 'eyes', 'en'),
     (1, 3, 'heel', 'edh'),
     (2, 3, 'peep', 'edh')]

    """
    if len(words) == 1:
        yield wrong, len(letters), words[0], letters
        return
    best_guess = min((g for g in ascii_lowercase if g not in letters),
                     key=lambda g: guess_cost(g, words))
    best_partition = partition(best_guess, words)
    letters += best_guess
    for pattern, words in best_partition.items():
        for guess in word_guesses(words, wrong + (pattern == 0), letters):
            yield guess

5. Example results

Using this strategy it's possible to evaluate the difficulty of guessing each word in a collection. Here I consider the six-letter words in my system dictionary:

>>> words = [w.strip() for w in open('/usr/share/dict/words') if w.lower() == w]
>>> six_letter_words = set(w for w in words if len(w) == 6)
>>> len(six_letter_words)
15066
>>> results = sorted(word_guesses(six_letter_words))

The easiest words to guess in this dictionary (together with the sequence of guesses needed for the solver to guess them) are as follows:

>>> from pprint import pprint
>>> pprint(results[:10])
[(0, 1, 'eelery', 'e'),
 (0, 2, 'coneen', 'en'),
 (0, 2, 'earlet', 'er'),
 (0, 2, 'earner', 'er'),
 (0, 2, 'edgrew', 'er'),
 (0, 2, 'eerily', 'el'),
 (0, 2, 'egence', 'eg'),
 (0, 2, 'eleven', 'el'),
 (0, 2, 'enaena', 'en'),
 (0, 2, 'ennead', 'en')]

and the hardest words are these:

>>> pprint(results[-10:])
[(12, 16, 'buzzer', 'eraoiutlnsmdbcfg'),
 (12, 16, 'cuffer', 'eraoiutlnsmdbpgc'),
 (12, 16, 'jugger', 'eraoiutlnsmdbpgh'),
 (12, 16, 'pugger', 'eraoiutlnsmdbpcf'),
 (12, 16, 'suddle', 'eaioulbrdcfghmnp'),
 (12, 16, 'yucker', 'eraoiutlnsmdbpgc'),
 (12, 16, 'zipper', 'eraoinltsdgcbpjk'),
 (12, 17, 'tuzzle', 'eaioulbrdcgszmnpt'),
 (13, 16, 'wuzzer', 'eraoiutlnsmdbpgc'),
 (13, 17, 'wuzzle', 'eaioulbrdcgszmnpt')]

The reason that these are hard is because after you've guessed -UZZLE, you still have seven possibilities left:

>>> ' '.join(sorted(w for w in six_letter_words if w.endswith('uzzle')))
'buzzle guzzle muzzle nuzzle puzzle tuzzle wuzzle'

6. Choice of wordlist

Of course when preparing wordlists for your children you wouldn't start with your computer's system dictionary, you'd start with a list of words that you think they are likely to know. For example, you might have a look at Wiktionary's lists of the most frequently used words in various English corpora.

For example, among the 1,700 six-letter words in the 10,000 most common words in Project Gutenberg as of 2006, the most difficult ten are these:

[(6, 10, 'losing', 'eaoignvwch'),
 (6, 10, 'monkey', 'erdstaoync'),
 (6, 10, 'pulled', 'erdaioupfh'),
 (6, 10, 'slaves', 'erdsacthkl'),
 (6, 10, 'supper', 'eriaoubsfm'),
 (6, 11, 'hunter', 'eriaoubshng'),
 (6, 11, 'nought', 'eaoiustghbf'),
 (6, 11, 'wounds', 'eaoiusdnhpr'),
 (6, 11, 'wright', 'eaoithglrbf'),
 (7, 10, 'soames', 'erdsacthkl')]

(Soames Forsyte is a character in the Forsyte Saga by John Galsworthy; the wordlist has been converted to lower-case so it wasn't possible for me to quickly remove proper names.)

Solution 2 - Algorithm

A really simple way would be to compute a score based on the lack of vowels in the word, the number of unique letters, and the commonness of each letter:

letters = 'etaoinshrdlcumwfgypbvkjxqz'
vowels = set('aeiou')

def difficulty(word):
    unique = set(word)
    positions = sum(letters.index(c) for c in word)
    
    return len(word) * len(unique) * (7 - len(unique & vowels)) * positions

words = ['the', 'potato', 'school', 'egypt', 'floccinaucinihilipilification']

for word in words:
    print(difficulty(word), word)

And the output:

432 the
3360 potato
7200 school
7800 egypt
194271 floccinaucinihilipilification

You could then score the words with:

        score < 2000   # Easy
 2000 < score < 10000  # Medium
10000 < score          # Hard
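These cut-offs could be wrapped in a small helper, sketched here in Python (the band names and thresholds are just the ones above):

```python
def classify(score):
    """Map a difficulty score to a band using the cut-offs above."""
    if score < 2000:
        return 'Easy'
    elif score < 10000:
        return 'Medium'
    return 'Hard'

print(classify(432))     # the -> Easy
print(classify(3360))    # potato -> Medium
print(classify(194271))  # floccinaucinihilipilification -> Hard
```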

Solution 3 - Algorithm

You can use the Monte Carlo Method to estimate the difficulty of a word:

  • Simulate a game by guessing a random letter each time, weighted by each letter's frequency in your target language, and count how many guesses it takes your randomized player to arrive at a solution. Note that since each guess eliminates a letter, this process is finite, and it returns a number from 1 to 26, inclusive.
  • Repeat this process 2*N times, where N is the number of unique letters in your word,
  • Calculate the score by averaging the results of 2*N runs,
  • Determine the complexity level: scores less than ten indicate an easy word, and scores above sixteen indicate a hard word; everything else is medium.
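The steps above can be sketched in Python. The frequency weights below are illustrative stand-ins, and the function names are hypothetical, not from the answer:

```python
import random
from statistics import mean

# Illustrative relative letter frequencies (most to least common);
# any published frequency table for the target language would do.
FREQ = {c: f for c, f in zip('etaoinshrdlcumwfgypbvkjxqz',
                             range(26, 0, -1))}

def simulate_game(word):
    """Guess random letters weighted by frequency until every letter
    of 'word' has been guessed; return the number of guesses."""
    remaining = set(FREQ)      # letters not yet guessed
    unsolved = set(word)       # letters of the word still hidden
    guesses = 0
    while unsolved:
        pool = sorted(remaining)
        guess = random.choices(pool, weights=[FREQ[c] for c in pool])[0]
        remaining.discard(guess)
        unsolved.discard(guess)
        guesses += 1
    return guesses

def score(word):
    """Average 2*N simulated games, N = number of unique letters."""
    runs = 2 * len(set(word))
    return mean(simulate_game(word) for _ in range(runs))

def level(s):
    if s < 10: return 'Easy'
    if s > 16: return 'Hard'
    return 'Medium'
```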

Solution 4 - Algorithm

Previous similar discussion around the same topic: https://stackoverflow.com/questions/5141092/determine-the-difficulty-of-an-english-word

I like the answer at the end of the link ^. For a kids' hangman game, just apply an approach like Scrabble's.

Assign a point value to each letter, then just add up the letters.
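As a sketch of that idea in Python, using the standard Scrabble letter values:

```python
# Standard Scrabble letter values.
SCRABBLE = {c: v for v, letters in {1: 'eaionrtlsu', 2: 'dg',
                                    3: 'bcmp', 4: 'fhvwy', 5: 'k',
                                    8: 'jx', 10: 'qz'}.items()
            for c in letters}

def scrabble_score(word):
    """Add up the point value of every letter in the word."""
    return sum(SCRABBLE[c] for c in word.lower())

print(scrabble_score('cat'))   # 3 + 1 + 1 = 5
print(scrabble_score('jazz'))  # 8 + 1 + 10 + 10 = 29
```

You'd then pick the score bands for Easy/Medium/Hard empirically to suit your wordlist.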

Solution 5 - Algorithm

Just do it! Play hangman against the word. Count how many forfeits (i.e. incorrect guesses) it takes to beat it.

You'll need a strategy to play. Here's a human(ish) strategy. From the dictionary, strike out all the words that don't fit the reveals so far. Guess the letter most frequent among the words remaining.

If your strategy is randomised, you can define your measure as the expected number of forfeits, and estimate that empirically.
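This strategy can be sketched in Python (the `forfeits` name and the details are mine; ties between equally frequent letters are broken arbitrarily):

```python
from collections import Counter

def forfeits(secret, dictionary):
    """Play the human(ish) strategy against 'secret': keep only words
    that match the reveals so far, guess the letter occurring in the
    most remaining words, and count the incorrect guesses."""
    candidates = [w for w in dictionary if len(w) == len(secret)]
    guessed, wrong = set(), 0
    while set(secret) - guessed:
        # Letter occurring in the most remaining candidate words.
        counts = Counter(c for w in candidates for c in set(w) - guessed)
        guess = counts.most_common(1)[0][0]
        guessed.add(guess)
        if guess in secret:
            # Keep words with the guess in exactly the revealed positions.
            positions = {i for i, c in enumerate(secret) if c == guess}
            candidates = [w for w in candidates
                          if {i for i, c in enumerate(w) if c == guess}
                          == positions]
        else:
            wrong += 1
            candidates = [w for w in candidates if guess not in w]
    return wrong
```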


Another deterministic strategy, from a hangman bot I wrote a few years ago: guess the letter that minimises the number of words remaining in the case the guess is incorrect (i.e. optimise the worst case). Today I dislike this strategy for being too mechanical; I prefer the one above.

Solution 6 - Algorithm

A while back I wrote a hangman solver using the obvious algorithm: given an initial dictionary of all possible words, at each turn we choose the letter that occurs in the most words remaining in the dictionary, then remove non-matching words (depending on the response) from the dictionary.

The algorithm isn't quite as straightforward as this, since there are often several letters which each occur in the same number of words in the dictionary. In this case, the choice of letter can make a significant difference to how many guesses are required for a word. Among the tied letters, we pick the one for which the resulting information about the placement of that letter (if it is indeed in the word) gives the maximum information about the system (the letter with the maximum information entropy). E.g. if the two remaining possible words are 'encyclopedia' and 'encyclopedic', the letter 'c' has the same probability of appearing as e, n, y, l, o, p, d, i (i.e. it is guaranteed to be in the word), but we should ask about 'c' first since it has a nonzero information entropy.
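The entropy tie-break can be sketched like this (a hypothetical `pattern_entropy` helper, not the answer's actual C++ implementation):

```python
import math
from collections import Counter

def pattern_entropy(letter, words):
    """Shannon entropy (in bits) of the distribution of position
    patterns that 'letter' produces across the remaining 'words'."""
    patterns = Counter(
        frozenset(i for i, c in enumerate(w) if c == letter)
        for w in words)
    total = sum(patterns.values())
    return sum((n / total) * math.log2(total / n)
               for n in patterns.values())

words = ['encyclopedia', 'encyclopedic']
# 'c' and 'e' are both guaranteed to be in the word, but guessing 'c'
# distinguishes the two candidates while guessing 'e' does not.
print(pattern_entropy('c', words))  # 1.0
print(pattern_entropy('e', words))  # 0.0
```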

Source (C++, GPL) is here

The result of all this is a list of words, with the number of guesses required for each one: difficulty.txt (630KB). The hardest word for this algorithm to find is "will" (with 14 failed guesses); the i and double l are guessed pretty quickly, but then the options include bill, dill, fill, gill, hill, kill, mill, pill, rill, till, will, and from then on the only option is to guess each letter in turn. Somewhat counterintuitively, longer words are guessed much more quickly (there just aren't that many of them to choose from).

Of course, in a human game of hangman, psychology (and breadth of vocabulary) play a much greater role than this algorithm accounts for...

Solution 7 - Algorithm

First, of course, you'd generate a list of unique letters. Then sort by frequency (in English or whatever language -- there are lists for this), with less frequent letters having a higher difficulty.

Then you need to decide whether you combine the scores by adding, multiplying, or using some other scheme.
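For instance, summing each unique letter's frequency rank is one simple additive scheme (a sketch; the ordering is the common ETAOIN ranking also used in Solution 2):

```python
# Letters ordered from most to least frequent in English.
FREQ_ORDER = 'etaoinshrdlcumwfgypbvkjxqz'

def rarity_score(word):
    """Sum the frequency rank (1 = most common) of each unique
    letter, so rarer letters contribute a higher difficulty."""
    return sum(FREQ_ORDER.index(c) + 1 for c in set(word.lower()))

print(rarity_score('tea'))   # t=2, e=1, a=3 -> 6
print(rarity_score('jazz'))  # j=23, a=3, z=26 -> 52
```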

Solution 8 - Algorithm

You're getting downvoted because you're asking us to build a very complex algorithm for you.

Why don't you just create three arrays (easy, medium, and hard) and populate each with a hundred or so words? It would take about 20 minutes.

I promise your kids will get bored of hangman long before they burn through a few hundred games... :D

Solution 9 - Algorithm

Well, potentially there could be a lot of things involved:

  1. As everyone said, the frequency of the individual letters;
  2. The length of a word should definitely count, but not in a linear way: in a long word, even random guesses are likely to hit some letters, while a short one can be hard to get;
  3. Also, the words themselves should be considered - "bipartite" might be a word for people on SO, but maybe not for non technical population.

Actually, you could try to co-evolve several strategies, half of them for deciding the worth of a word, and half of them for trying to win the game. The latter group will try to maximize the score while the first one tries to minimize it. After a while there could be a pattern, and then the half for deciding the worth of a word may give you some benchmarks.

Solution 10 - Algorithm

Start with a list of words and launch a Google search for each one. Let the number of hits serve as a (coarse) proxy of the term's difficulty.

In a refined version you'd group words by a synonym relation based on a thesaurus and determine the most difficult word of a category by counting the results of Google searches.

Taking the notion of n-grams one step further, the difficulty of a word could be rated by the frequency of its syllables in prose. Depends on the quality of the syllable statistics, of course. You'd probably have to differentiate between lexemes and function words (determiners, conjunctions, etc.) and normalize by the number of syllables in the word (feels like overkill as I write...).

Solution 11 - Algorithm

I like the idea of building an algorithm that learns and changes depending on the users. At the beginning, you can implement any of the algorithms suggested to come up with the list; then, as more people play the game, you assign a weight to each of the words depending on the number of guesses (which is continually tracked and recalculated). This avoids the problem of complex but popular words being rated as difficult even though they are well known to players.
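One way to sketch that feedback loop (the class name and thresholds here are hypothetical):

```python
from collections import defaultdict

class AdaptiveDifficulty:
    """Track observed wrong-guess counts per word and re-rate words
    as more games are played (thresholds are illustrative)."""
    def __init__(self):
        self.games = defaultdict(int)
        self.wrong = defaultdict(int)

    def record(self, word, wrong_guesses):
        """Record one finished game against 'word'."""
        self.games[word] += 1
        self.wrong[word] += wrong_guesses

    def rating(self, word):
        if not self.games[word]:
            return 'Medium'   # no data yet: assume the middle band
        avg = self.wrong[word] / self.games[word]
        if avg < 3:
            return 'Easy'
        if avg > 7:
            return 'Hard'
        return 'Medium'
```

For example, after `record('zephyr', 9)` the tracker would rate 'zephyr' as 'Hard', however a seed algorithm had classified it.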

Solution 12 - Algorithm

Compute the value of each letter of a word in Scrabble points: E=1, D=2, V=4, X=8 and so on. Add them up and divide by the number of letters to get an average letter value, and use that to score the word. Compute the average for each word in a large dictionary, and determine the break points between quartiles. Call words in the lowest quartile "easy", words in the two middle quartiles "medium", and words in the highest quartile "hard".
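A sketch of this quartile scheme in Python (using `statistics.quantiles`, which needs Python 3.8+; the break points depend entirely on the wordlist):

```python
from statistics import quantiles

# Standard Scrabble letter values.
POINTS = {c: v for v, cs in {1: 'eaionrtlsu', 2: 'dg', 3: 'bcmp',
                             4: 'fhvwy', 5: 'k', 8: 'jx',
                             10: 'qz'}.items()
          for c in cs}

def avg_value(word):
    """Average Scrabble point value per letter."""
    return sum(POINTS[c] for c in word) / len(word)

def classify_all(words):
    """Split 'words' into Easy / Medium / Hard at the quartile
    break points of their average letter values."""
    q1, q2, q3 = quantiles([avg_value(w) for w in words], n=4)
    def band(w):
        v = avg_value(w)
        if v <= q1:
            return 'Easy'     # lowest quartile
        if v <= q3:
            return 'Medium'   # two middle quartiles
        return 'Hard'         # highest quartile
    return {w: band(w) for w in words}
```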

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | grrussel | View Question on Stackoverflow
Solution 1 - Algorithm | Gareth Rees | View Answer on Stackoverflow
Solution 2 - Algorithm | Blender | View Answer on Stackoverflow
Solution 3 - Algorithm | Sergey Kalinichenko | View Answer on Stackoverflow
Solution 4 - Algorithm | Alan Waage | View Answer on Stackoverflow
Solution 5 - Algorithm | Colonel Panic | View Answer on Stackoverflow
Solution 6 - Algorithm | Chris Johnson | View Answer on Stackoverflow
Solution 7 - Algorithm | Hot Licks | View Answer on Stackoverflow
Solution 8 - Algorithm | BBagi | View Answer on Stackoverflow
Solution 9 - Algorithm | zw324 | View Answer on Stackoverflow
Solution 10 - Algorithm | collapsar | View Answer on Stackoverflow
Solution 11 - Algorithm | Michael Lai | View Answer on Stackoverflow
Solution 12 - Algorithm | user448810 | View Answer on Stackoverflow