BigWord

2016-11-18

This post is an update on a small side project I just started. It is called BigWord and it aims to find the biggest word in a dictionary based on a multiset of letters.

Sample Usage

For instance, it finds out that the biggest English words made with the letters from “linuxwars” are “urinals” and “insular” and that the biggest English words made with “applepies” are “pappies” and “applies”.

$ ./bigword linuxwars
> urinals
> insular
$ ./bigword applepies
> pappies
> applies

Implementation Notes

The existing code is written using C++14 and should run under any system which has reasonable C++14 compiler. The program relies on the public domain word list shipped with Fedora, but you may use any plain text dictionary you want.

It loads all words which may be constructed with the provided letter multiset, ignoring words which are too big in order to improve performance, and then traverses the word vector from longest to shortest trying to match as many as possible of the biggest subsets of the provided multiset.

Even though it is performant enough for a fairly big word list, I think there must be a more efficient way to find the biggest subsets. Currently, the first abovementioned example finishes after 200 ms on a list of 479,000 words.

Ideally, I would like even the worst-case queries to finish in less than 100 ms for word lists with less than one million words, as suggested in Usability Engineering by Jakob Nielsen.

Some possible improvements are dumping the processed word vector after the first execution and reusing it in the next executions and sorting the input text file by word length so IO can be stopped earlier, but these possibilities add undesired complexity to the setup the user would need.

The following code snippet shows how letter counts are currently compared.

  static bool is_contained(const LetterCount &a, const LetterCount &b) {
    if (a.letter_count > b.letter_count) {
      return false;
    }
    size_t remaining = b.letter_count - a.letter_count;
    for (size_t i = 0; i < alphabet_size; i++) {
      if (a.counters[i] > b.counters[i]) {
        return false;
      }
      // By catching excessive differences we may fail early.
      const size_t difference = b.counters[i] - a.counters[i];
      if (difference > remaining) {
        return false;
      }
      remaining -= difference;
    }
    return true;
  }

There may be a faster way to do it, but I am failing to spot it.

You can find the project source code (licensed under the ISC license) at the GitHub repository.