What are we trying to do?
Basically we want to find some nice starting words for wordle, but thats pretty easy to do so we would also like to know a few ways to eliminate as many characters as possible from the board. Also if possible rank these different ways according to whats most probably beneficial (I basically know nothing about statistics so this ones a big maybe).
Character Frequencies
We could probably find some data file somewhere that has a good metric for char frequencies, but that would be cheating, so were gonna generate them ourselves using War and Peace from project Gutenberg.
This is what the getCharFreqs()
function does for us. Graphing it we get
This does not match exactly with other sources, but its close enough for our purposes. We’re not trying to break some encryption here, just get a decent rating system.
5 Letter english words
We need a word list of english. We’ll grab words.txt from https://github.com/dwyl/english-words/.
In the interest of eliminating as many characters as possible we want only 5 letter words that have unique characters. The word wells wouldn’t be a good guess for example because we only eliminate a max of 4 chars instead of 5.
We also don’t really care about words that have the same letter as some word already in our collection. In otherwords we don’t want any anagrams, they’re not helpful.
Also our wordlist seems to contain a lot of crap that does not spellcheck, so we’re gonna use the enchant python library to get rid of anything that doesn’t spell check.
After we’ve filtered out everything we don’t care about we’re left with some 1765 words. These can be found in 5_letter_words_no_2_have_same_letters.txt
Ranking them according to the frequencies we found before we can see the top 20.
WORD | WEIGHT |
---|---|
antes | 0.3319598561162775 |
ethos | 0.3241091520476795 |
inset | 0.32198422231073054 |
earth | 0.3219567110827106 |
other | 0.319435994815384 |
anted | 0.31838150195048354 |
inert | 0.3173110650784351 |
heist | 0.3166542345094592 |
ashen | 0.3145674453386288 |
death | 0.3130515141492122 |
hones | 0.31204672907130226 |
arose | 0.3112192038375662 |
heron | 0.30737357183900677 |
dents | 0.3071281591799654 |
doter | 0.3048187791642464 |
helot | 0.3041522571399452 |
alone | 0.30230431544874253 |
deist | 0.30203701885832157 |
alert | 0.3009609546896327 |
Mutually Exclusive Sequences
We want a sequence of words that have no letters in common between them. If we could find 5 of these we could eliminate 25 words from the board and our 6th guess would be pretty much guaranteed. Except for the fact that you may get a bunch of letters out of order then your kinda screwed and your last guess is basically a hail mary. In reality I’m not sure that there actually is a sequence of 5 such words, but we can find a bunch of sequences of 4 words, and some smaller sequences of 3 and 2.
This is what the wordGroups()
function does for us, it basically just takes all the words we found in the previous step and finds all mutually exclusive
sequences.
[swamp befit chord gunky] is of the sequences the rest can be seen in mutually_exclusive_words.txt
Rating Our Seqeunces
The simplest way to order them would weighting each word by char frequency and summing the weights, but thats not super useful, since if you choose a large seqeunce your gonna eliminate most of the chars anyway.
It would be nice if we could rate them in a way such that the words in the sequence contained frequently occuring characters in a sorted order. So we could elimnate the 5 most common with the 1st word, the next 5 with the 2nd, and so on. However I am bad at statistics, sooooo, we’re just gonna sort the sequence by char frequency ascending, weight the words based on index, and sum them. Hopefully we promote sequences of the kind we’re looking for by weighting things in this way, but it could also be pointless. At any rate its not worse, than the basic metric.
This is all accomplished by the orderLists()
function.
Ordering Sequences by Length
As you would probably expect, there is no sequence of 4 words that has a lower weight than a sequence of 3, or 2 words. The same is true for sequences of 3 and 2 words, so lets split them up and put them in their own file. Most likely your going to want seqeunces of different length anyways. If you really just yolo with a sequece of 4 everyday your kinda nuts.
This what writeWordLists()
does and it produces the following files.
- group_0.txt sequences of length 4
- group_1.txt sequences of length 3
- group_2.txt sequences of length 2
These are the first 20 entries of the sequence tables
Sequences of Length 4
WORD SEQUENCE | WEIGHT |
---|---|
fluky chimp drown abets | 2.16944352914033 |
jumpy child frown abets | 2.1604457944398554 |
jumpy child frown abets | 2.1604457944398554 |
jumpy child grown abets | 2.1577137419209214 |
wimpy flung chord abets | 2.1574777080896137 |
fluky chomp grind abets | 2.1560293044598824 |
jumpy flown acids berth | 2.153416675680762 |
jumpy wolds acing berth | 2.1373085390474986 |
gimpy flunk chord abets | 2.136348459715121 |
jumpy folds acing berth | 2.135157035965303 |
bulky fjord acing thews | 2.132318377437792 |
jumpy flown bahts cider | 2.130440736125431 |
frump cling howdy abets | 2.1265310154706887 |
dumpy growl chink abets | 2.1223608760325305 |
funky chimp world abets | 2.1167920407516316 |
frump wonky child abets | 2.116688248391375 |
frump wonky child abets | 2.116688248391375 |
grump wonky child abets | 2.115777564218397 |
dumpy cling abhor wefts | 2.1126412842241242 |
Sequences of Length 3
WORD SEQUENCE | WEIGHT |
---|---|
bulgy acorn heist | 1.5172636082101008 |
clump abhor inset | 1.5145287420428462 |
dimly abuts heron | 1.4884040173895972 |
bulgy acids other | 1.4801509616112074 |
blush acing doter | 1.4788119776383737 |
clump rhino abets | 1.4779916428392337 |
bulgy adios tench | 1.4731953103360311 |
clump abhor dents | 1.4699605526505506 |
psych afoul inert | 1.4684411825576311 |
dimly abhor cents | 1.467447339445411 |
dimly abhor cents | 1.467447339445411 |
album point herds | 1.4666360708464143 |
bulgy point ashed | 1.4647471749407726 |
chump abide snort | 1.4620529503603659 |
bulgy adorn ethic | 1.4587578555497962 |
crimp abuts honed | 1.4573297727134886 |
bulky acing ethos | 1.4560254904032708 |
clump abhor deist | 1.4546871316856191 |
chump abode instr | 1.4545980328221453 |
Sequences of Length 2
WORD SEQUENCE | WEIGHT |
---|---|
about heirs | 0.8043817257918388 |
houri abets | 0.7962790438847854 |
bijou earth | 0.7836932823208471 |
adopt inure | 0.7740434064652636 |
bijou death | 0.7658828884538502 |
abuts opine | 0.7577505069256391 |
shout abide | 0.7566419294874721 |
acing house | 0.7538026457047782 |
cause point | 0.737957741503251 |
amuse point | 0.7376745009056822 |
botch adieu | 0.7355580121136939 |
abhor cutie | 0.7338704483767438 |
argue point | 0.730399656859956 |
abuse point | 0.7291629021094235 |
ukase point | 0.725501407762043 |
mouth abide | 0.7250096445611866 |
ought abide | 0.7224079577477558 |
youth abide | 0.7208285631573386 |
about hiker | 0.7167709696519892 |
If you want to look at all the code the main file that does all of this is freqeuncy.py its really gross code though so… have fun?