Regular Expressions A Search Tool for Text

By Eric E. Cohen, CPA
As part of my classes for universities and businesses, where I teach students, professors and businesspeople how to leverage emerging technologies for accounting, audit and tax, I teach a wide variety of techniques for working with information. Many financial professionals are familiar with SQL (structured query language), a venerable standard for getting the information you want from almost any structured data source. Another tool in the information-gathering toolbox is called Regular Expressions, or RegEx. RegEx can be used on its own, and is also part of most SQL implementations to facilitate getting the information you need from unstructured data sources, such as blocks of text.

To illustrate the power of RegEx, let’s use the popular Web-based game recently purchased by The New York Times, called Wordle. If you are like me, your friends started posting a series of images to their social media with rows of black, yellow and green boxes, reflecting the success of their guessing the letters in a five-letter word using your guesses of what the word might be. The black boxes in their post means that a particular letter in their guess word is not in the puzzle answer; yellow means a letter is in the word, but in the wrong place; green means a correct letter in the correct space. Those of us who used to play the game Mastermind, where you guessed a series of colored pegs, will find it easy to adapt to.

So, for example, if the answer is “banjo,” and your first guess is “debit,” the letters d, b, i and t would be black, and the letter b would be yellow; it is in the answer, but in the wrong place. For each guess, you get more feedback of letters in, or not in, the answer, and whether any of the words put letters in the correct place.

The game gives six chances to guess the word of the day. Each guess must be a five-letter word from the Wordle word list, which is quite expansive; you probably won’t find a lot of them on the IELTS or SATs. The answer will be from a smaller list of much more common words, although enthusiasts will gripe about some of them (RUPEE being a recent example, common amongst crossword enthusiasts, but perhaps not as well known to the general public, which has been drawn into Wordle).

There are many articles and videos on the best strategies, primary focused on using words initially based on common letter usage. I like to get the vowels out of the way, and then work on common letters. Sometimes, I am left scratching my head for words to make from unused letters. What five letter word has an “m,” a “g” and a “k” in it?

Of course, a game is worth whatever effort you want to put into it and, if you want to do it without helps, you can; I know people who solve it in one browser, then solve it again with more knowledge in a different browser, so they can post better results in their social media. I do prefer to do it on my own, but had wondered whether online helps might make it easier. There are many sources, such as the Merriam-Webster Word Finder, that can help with words that begin, end or contain a string; looking for “arb” will give you words starting or ending with “arb” or that have “arb,” together and in that order, within, but doesn’t help with a list of five-letter words that have an “a,” and “r” and a “b” in any order.

Can RegEx Help?
I will be preparing a complete exercise for my training courses, but wanted to share some quick ideas of what you can do with RegEx.

You can access a list of five letter words you may be able to use for your guesses for Wordle at this link. If you already have an analytical tool you like to use, such as Alteryx or Tableau, which supports RegEx in whole or part, you may be able to use that tool to reference the file directly. If not, the website RegEx101.com is a great learning tool for working with the file contents, but you will need to copy and paste the word list into the “Test String” area.

The file has around 13,000 words in it. I know this because typing \W somewhat unhelpfully highlights each word but returns the statistics how many matches there are.

Contains: There are six words that have “eric” in them, either as the first or last four letters. I have just learned the meaning of the word “xeric,” meaning “characterized by, relating to, or requiring only a small amount of moisture.” I found this out by typing the string eric.

Starts/ends with: There are 77 matches that contain the string ja. (By typing “ja.”) If I want to know how many words start with “ja,” I find 54 of them, using the code for “beginning of a line”: ^. So, typing ^ja gives me the lines starting with ja. Likewise, the code for “end of a line” is “$.” Typing ja$ tells me that there are 11 matches, far more than I might have guessed. What beside Ouija and ninja? A number of words I think are proper names or otherwise not often used in my circle.

OK, so far, we could have done these things with Merriam-Webster. What if I have “b,” space, “n,” space, “o”? No help from M-W, but RegEx has our back. The period (.) or .+ can be used as a wild card, so typing b.+n.+o returns seven matches, including bingo, bongo, banjo and bunco.

Having already moved beyond M-W, everything else is gravy, and RegEx can do just about anything. A simple example: you know “b,” “n” and “o” are in the first, third or last place, but you aren’t sure which. You can put alternative choices within brackets, such as [bno]. So typing [bno].+[bno].+[bno] (allowing duplicates in this simple case), brings 14 matches from our set, adding words like “blown” (allowing the switch in place of the n and the o). Typing a ^ within the choice brackets negates the set; [^bno] would mean match anything except those letters.

Let’s Do Today’s Wordle
I’ll type one of my standard openers, ADIEU. Only the letter e is in the answer word, but in the wrong place. Of the 12,972 words, almost half of them have an “e” in the fourth position, leaving 6,486 that do not have an “e” in the fourth position. I find this by typing .+.+.+[^e].+. But I also have letters not in any position, so I could make that more restrictive: just as the [^e] means there’s no “e” in the fourth position, I know there is no “a”, “d”, “i” or “u” in any position, so I could exclude them as well, as I will later in this example.

To round out the vowels and introduce more letters, my next guess is SYNCH. I see that the first letter is correct and in the right position, and the other letters are not in the word at all. But noting the s in the first position, s.+.+[^e].+, narrows me to 1,283 choices.

I like to keep getting common letters identified, and my next guess (not leveraging the knowledge of the s or e) is PROWL. I see the “l” is part of the word, but not in the last position. So the captured knowledge – which letters do not appear, which letters do not appear in certain positions and the position of the s – is noted. I then types[^adiuynchprow][^adiuynchprow][^adieuynchprow][^adieuynchprowl], which leaves me with 124 possibilities left.

With today’s puzzle, I could try to further use common letters leveraging RegEx and the word list, but I went on instincts, trying SLEET, SPELT, and finally getting it in six with SMELT.

This is just a small example of using RegEx to solve a puzzle like Wordle, and far more sophisticated solutions can leverage more advanced functionality, such as lookahead, lookbehind and atomic groups. Interestingly, after I wrote this, I went to see if others had done similar experiments, and found a wide variety of methods and use of RegEx formulas. But the real goal here is to showcase that RegEx exists, is an important tool in text querying, and may be useful in your analytical work.

Regular Expressions are an important tool in the “search and analyze text/unstructured data” toolkit. As part of SQL queries, it is used as part of the filter (WHERE) mechanism. On its own, it can be used to better search text and unstructured data for many purposes.

 

Leave a comment

Comments

  • No comments found