What are we trying to do?

Basically, we want to find some nice starting words for Wordle. That’s pretty easy to do, so we’d also like to find a few ways to eliminate as many characters as possible from the board. If possible, we’d also like to rank these different ways by which is most likely to be beneficial (I basically know nothing about statistics, so this one’s a big maybe).

Character Frequencies

We could probably find a data file somewhere with a good metric for char frequencies, but that would be cheating, so we’re gonna generate them ourselves using War and Peace from Project Gutenberg.

This is what the getCharFreqs() function does for us. Graphing it, we get:

[Figure: Character Frequencies]

This does not match other sources exactly, but it’s close enough for our purposes. We’re not trying to break some encryption here, just get a decent rating system.
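A minimal sketch of how frequencies like these could be computed. The name char_freqs is made up for this example and is not the actual getCharFreqs() code. It assumes only the letters a-z are counted and that the total is taken over letters alone, so its numbers won't necessarily line up with the weights shown later.

```python
from collections import Counter
import string


def char_freqs(text: str) -> dict[str, float]:
    """Relative frequency of each letter a-z in text (hypothetical helper)."""
    # Assumption: case is ignored and everything that is not a-z is skipped.
    counts = Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)
    total = sum(counts.values())
    if total == 0:
        return {}
    # Letters that never occur get a frequency of 0.0.
    return {ch: counts[ch] / total for ch in string.ascii_lowercase}


# Tiny stand-in for the real book text.
sample = "Well, Prince, so Genoa and Lucca are now just family estates."
top5 = sorted(char_freqs(sample).items(), key=lambda kv: kv[1], reverse=True)[:5]
for letter, freq in top5:
    print(letter, round(freq, 3))
```

To run it on the whole book, read the Project Gutenberg text file into a string and pass that in instead of the sample sentence.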

5 Letter English Words

We need an English word list. We’ll grab words.txt from https://github.com/dwyl/english-words/.

In the interest of eliminating as many characters as possible, we want only 5 letter words that have unique characters. The word wells wouldn’t be a good guess, for example, because it eliminates at most 4 chars instead of 5.

We also don’t really care about words that have the same letters as some word already in our collection. In other words, we don’t want any anagrams; they’re not helpful.

Also, our word list seems to contain a lot of crap that does not spellcheck, so we’re gonna use the enchant Python library to get rid of anything that doesn’t pass.

After we’ve filtered out everything we don’t care about, we’re left with some 1765 words. These can be found in 5_letter_words_no_2_have_same_letters.txt.
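The filtering steps above could look roughly like this. It’s a sketch, not the real code: filter_words and its is_real_word parameter are names invented for the example. The spell check is passed in as a plain function so the sketch runs without enchant installed.

```python
from typing import Callable, Iterable


def filter_words(
    words: Iterable[str],
    is_real_word: Callable[[str], bool] = lambda w: True,
) -> list[str]:
    """Keep 5-letter words with 5 distinct letters, one word per letter set."""
    seen_letter_sets: set[frozenset[str]] = set()
    kept: list[str] = []
    for raw in words:
        word = raw.strip().lower()
        # Only plain a-z words of exactly 5 letters.
        if len(word) != 5 or not (word.isascii() and word.isalpha()):
            continue
        letters = frozenset(word)
        if len(letters) != 5:  # repeated letter, e.g. "wells"
            continue
        if letters in seen_letter_sets:  # anagram of a word we already kept
            continue
        if not is_real_word(word):  # spell check hook
            continue
        seen_letter_sets.add(letters)
        kept.append(word)
    return kept


print(filter_words(["wells", "earth", "heart", "hater", "abets", "xx"]))
```

With pyenchant installed, the check could be plugged in as is_real_word=enchant.Dict("en_US").check, where the "en_US" dictionary tag is an assumption. Because every kept word has 5 distinct letters, two words with the same letter set are always anagrams, so comparing letter sets is enough.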

Ranking them according to the frequencies we found before, we can see the top 20.

WORD WEIGHT
antes 0.3319598561162775
ethos 0.3241091520476795
inset 0.32198422231073054
earth 0.3219567110827106
other 0.319435994815384
anted 0.31838150195048354
inert 0.3173110650784351
heist 0.3166542345094592
ashen 0.3145674453386288
death 0.3130515141492122
hones 0.31204672907130226
arose 0.3112192038375662
heron 0.30737357183900677
dents 0.3071281591799654
doter 0.3048187791642464
helot 0.3041522571399452
alone 0.30230431544874253
deist 0.30203701885832157
alert 0.3009609546896327
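A ranking like the one above can be sketched as follows, assuming a word’s weight is simply the sum of its letters’ frequencies. The names word_weight and rank_words and the toy frequency numbers are made up for illustration; they are not the values measured from War and Peace.

```python
def word_weight(word: str, freqs: dict[str, float]) -> float:
    """Sum of the per-letter frequencies (assumed weighting scheme)."""
    return sum(freqs.get(ch, 0.0) for ch in word)


def rank_words(words: list[str], freqs: dict[str, float]) -> list[tuple[str, float]]:
    """Return (word, weight) pairs, heaviest first."""
    scored = [(w, word_weight(w, freqs)) for w in words]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy frequencies, invented for this example only.
toy_freqs = {"e": 0.12, "t": 0.09, "a": 0.08, "n": 0.07, "s": 0.06,
             "u": 0.03, "g": 0.02, "y": 0.02, "k": 0.01}

for word, weight in rank_words(["gunky", "antes"], toy_freqs):
    print(word, weight)
```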

Mutually Exclusive Sequences

We want a sequence of words that have no letters in common between them. If we could find 5 of these, we could eliminate 25 letters from the board and our 6th guess would be pretty much guaranteed. Except that you may get a bunch of letters out of order, and then you’re kinda screwed and your last guess is basically a Hail Mary. In reality, I’m not sure that there actually is a sequence of 5 such words, but we can find a bunch of sequences of 4 words, and some smaller sequences of 3 and 2.

This is what the wordGroups() function does for us: it takes all the words we found in the previous step and finds all mutually exclusive sequences.

[swamp befit chord gunky] is one of the sequences; the rest can be seen in mutually_exclusive_words.txt.
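One way to find such sequences is a backtracking search over letter bitmasks. This is a sketch under assumptions, not the actual wordGroups() code: letter_mask and disjoint_groups are invented names, and the words are assumed to be lowercase a-z.

```python
def letter_mask(word: str) -> int:
    """Encode the letters of a lowercase a-z word as bits 0..25 of an int."""
    mask = 0
    for ch in word:
        mask |= 1 << (ord(ch) - ord("a"))
    return mask


def disjoint_groups(words: list[str], min_size: int = 2) -> list[tuple[str, ...]]:
    """All groups of at least min_size words that share no letters."""
    masks = [letter_mask(w) for w in words]
    groups: list[tuple[str, ...]] = []

    def extend(start: int, used: int, chosen: list[str]) -> None:
        if len(chosen) >= min_size:
            groups.append(tuple(chosen))
        # Only look at later words so each group is produced once.
        for i in range(start, len(words)):
            if masks[i] & used == 0:  # no letter in common with the group
                chosen.append(words[i])
                extend(i + 1, used | masks[i], chosen)
                chosen.pop()

    extend(0, 0, [])
    return groups


found = disjoint_groups(["swamp", "befit", "chord", "gunky", "earth"])
print(("swamp", "befit", "chord", "gunky") in found)
```

Two words are disjoint exactly when the bitwise AND of their masks is zero, which keeps the inner check cheap. This version also records smaller groups that could still be extended, and the number of groups grows quickly with the size of the word list, so a full run may take a while.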

Rating Our Sequences

The simplest way to order them would be weighting each word by char frequency and summing the weights, but that’s not super useful, since if you choose a large sequence you’re gonna eliminate most of the chars anyway.

It would be nice if we could rate them in a way such that the words in the sequence contained frequently occurring characters in a sorted order, so we could eliminate the 5 most common with the 1st word, the next 5 with the 2nd, and so on. However, I am bad at statistics, sooooo, we’re just gonna sort the words in each sequence by char frequency ascending, weight each word based on its index, and sum them. Hopefully we promote sequences of the kind we’re looking for by weighting things in this way, but it could also be pointless. At any rate, it’s not worse than the basic metric.

This is all accomplished by the orderLists() function.
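The index weighting described above could be sketched like this. The details are guesses: it assumes a word’s weight is the sum of its letter frequencies and that the index multiplier starts at 1. The names order_sequence and word_weight are invented here and are not the real orderLists() code.

```python
def word_weight(word: str, freqs: dict[str, float]) -> float:
    """Sum of the per-letter frequencies (assumed weighting scheme)."""
    return sum(freqs.get(ch, 0.0) for ch in word)


def order_sequence(
    sequence: list[str], freqs: dict[str, float]
) -> tuple[list[str], float]:
    """Sort words by weight ascending, then sum position * weight."""
    ordered = sorted(sequence, key=lambda w: word_weight(w, freqs))
    # Assumption: positions are 1-based, so the heaviest word counts the most.
    score = sum(
        position * word_weight(word, freqs)
        for position, word in enumerate(ordered, start=1)
    )
    return ordered, score


# Toy two-letter "words" and frequencies, invented for the example.
toy_freqs = {"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.4}
print(order_sequence(["cd", "ab"], toy_freqs))
```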

Ordering Sequences by Length

As you would probably expect, there is no sequence of 4 words that has a lower weight than a sequence of 3 or 2 words. The same is true for sequences of 3 and 2 words, so let’s split them up and put them in their own files. Most likely you’re going to want sequences of different lengths anyway. If you really just yolo with a sequence of 4 every day, you’re kinda nuts.
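A minimal sketch of that split, assuming each scored sequence is a (words, weight) pair. The name split_by_length and the sample data are made up, and the file writing itself is left out.

```python
from collections import defaultdict


def split_by_length(
    scored: list[tuple[tuple[str, ...], float]]
) -> dict[int, list[tuple[tuple[str, ...], float]]]:
    """Group (sequence, weight) pairs by sequence length, heaviest first."""
    by_length: dict[int, list[tuple[tuple[str, ...], float]]] = defaultdict(list)
    for sequence, weight in scored:
        by_length[len(sequence)].append((sequence, weight))
    for entries in by_length.values():
        entries.sort(key=lambda entry: entry[1], reverse=True)
    return dict(by_length)


# Sample data with rounded, illustrative weights.
scored = [
    (("about", "heirs"), 0.80),
    (("bulgy", "acorn", "heist"), 1.52),
    (("houri", "abets"), 0.79),
]
for length, entries in sorted(split_by_length(scored).items()):
    print(length, entries)
```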

This is what writeWordLists() does, and it produces the following files.

These are the first 20 entries of the sequence tables.

Sequences of Length 4

WORD SEQUENCE WEIGHT
fluky chimp drown abets 2.16944352914033
jumpy child frown abets 2.1604457944398554
jumpy child frown abets 2.1604457944398554
jumpy child grown abets 2.1577137419209214
wimpy flung chord abets 2.1574777080896137
fluky chomp grind abets 2.1560293044598824
jumpy flown acids berth 2.153416675680762
jumpy wolds acing berth 2.1373085390474986
gimpy flunk chord abets 2.136348459715121
jumpy folds acing berth 2.135157035965303
bulky fjord acing thews 2.132318377437792
jumpy flown bahts cider 2.130440736125431
frump cling howdy abets 2.1265310154706887
dumpy growl chink abets 2.1223608760325305
funky chimp world abets 2.1167920407516316
frump wonky child abets 2.116688248391375
frump wonky child abets 2.116688248391375
grump wonky child abets 2.115777564218397
dumpy cling abhor wefts 2.1126412842241242

Sequences of Length 3

WORD SEQUENCE WEIGHT
bulgy acorn heist 1.5172636082101008
clump abhor inset 1.5145287420428462
dimly abuts heron 1.4884040173895972
bulgy acids other 1.4801509616112074
blush acing doter 1.4788119776383737
clump rhino abets 1.4779916428392337
bulgy adios tench 1.4731953103360311
clump abhor dents 1.4699605526505506
psych afoul inert 1.4684411825576311
dimly abhor cents 1.467447339445411
dimly abhor cents 1.467447339445411
album point herds 1.4666360708464143
bulgy point ashed 1.4647471749407726
chump abide snort 1.4620529503603659
bulgy adorn ethic 1.4587578555497962
crimp abuts honed 1.4573297727134886
bulky acing ethos 1.4560254904032708
clump abhor deist 1.4546871316856191
chump abode instr 1.4545980328221453

Sequences of Length 2

WORD SEQUENCE WEIGHT
about heirs 0.8043817257918388
houri abets 0.7962790438847854
bijou earth 0.7836932823208471
adopt inure 0.7740434064652636
bijou death 0.7658828884538502
abuts opine 0.7577505069256391
shout abide 0.7566419294874721
acing house 0.7538026457047782
cause point 0.737957741503251
amuse point 0.7376745009056822
botch adieu 0.7355580121136939
abhor cutie 0.7338704483767438
argue point 0.730399656859956
abuse point 0.7291629021094235
ukase point 0.725501407762043
mouth abide 0.7250096445611866
ought abide 0.7224079577477558
youth abide 0.7208285631573386
about hiker 0.7167709696519892

If you want to look at all the code, the main file that does all of this is freqeuncy.py. It’s really gross code though, so… have fun?

Good luck with wordle!!