Need help hitting your word count?

Get BERT to pad out your essays with artificial intelligence!

Robert Dargavel Smith
4 min read · May 29, 2022
Image generated by OpenAI’s DALL·E 2

You’ve probably seen how OpenAI’s GPT can generate very convincing text given a prompt, but wouldn’t it be great if you could use AI to insert text into a document in a way that makes sense in the context of the surrounding sentences? Need another 500 words but you’ve already written everything you can think of on the topic? No problem.

GPT is an example of a Causal Language Model or autoregressive model. This means that it models the probability of a word appearing given the text that comes before. These models are a natural choice for generating language, but they are only able to take into account the preceding sentences and not any sentences that follow.
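To make that concrete, here is a minimal sketch (my own illustration, using the stock HuggingFace text-generation pipeline and an arbitrary prompt) of how an autoregressive model is normally driven: you hand it a left-hand context and it keeps appending tokens, with no way to tell it what should come afterwards.

```python
# Minimal sketch: autoregressive generation with GPT-2 via the HuggingFace
# pipeline. The model only ever sees the tokens to the left of the one it is
# predicting; there is no way to hand it the text that should follow.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("The marsh was quiet that morning, and", max_new_tokens=40)
print(result[0]["generated_text"])
```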

Enter BERT. As its full name — Bidirectional Encoder Representations from Transformers — suggests, it takes context from both backward and forward directions. But whereas GPT is the decoder part of the original transformer, BERT is just the encoder. This makes BERT suitable for classification tasks, but it is not obvious how to use it for language generation.

As it happens, one of the tasks on which BERT was trained was to guess randomly masked words in a text. The BertForMaskedLM model in HuggingFace’s transformers package can be used in exactly this way to fill in the gaps. However, the results of simply asking BERT to fill in a large number of consecutive gaps are very disappointing. Instead, a technique called MCMC (Markov Chain Monte Carlo) can be used to repeatedly resample each new token from its distribution conditioned on all the surrounding tokens. This idea was first published in the paper BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model.
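To make the approach concrete, here is a simplified sketch (my own code, not the actual inBERTolate implementation): pad the gap between the existing sentences with [MASK] tokens, then repeatedly pick one of the new positions at random, re-mask it, and resample it from BERT’s distribution conditioned on everything else, essentially the Gibbs-sampling scheme described in the paper.

```python
# A simplified Gibbs-sampling sketch of the idea (not the actual inBERTolate
# code): pad the gap with [MASK] tokens, then repeatedly pick one of the new
# positions at random, re-mask it, and resample it from BERT's conditional
# distribution given all the other tokens.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def pad_out(left, right, n_new_tokens=20, n_steps=200, temperature=1.0):
    left_ids = tokenizer(left, add_special_tokens=False)["input_ids"]
    right_ids = tokenizer(right, add_special_tokens=False)["input_ids"]
    gap = [tokenizer.mask_token_id] * n_new_tokens

    ids = torch.tensor([[tokenizer.cls_token_id, *left_ids, *gap,
                         *right_ids, tokenizer.sep_token_id]])
    start = 1 + len(left_ids)                      # first position we may change
    positions = range(start, start + n_new_tokens)

    with torch.no_grad():
        for _ in range(n_steps):
            pos = positions[torch.randint(n_new_tokens, (1,)).item()]
            ids[0, pos] = tokenizer.mask_token_id  # re-mask this position...
            logits = model(ids).logits[0, pos] / temperature
            probs = torch.softmax(logits, dim=-1)  # ...and resample it
            ids[0, pos] = torch.multinomial(probs, 1).item()

    return tokenizer.decode(ids[0, start:start + n_new_tokens])

print(pad_out("For years, rumors of the Marsh Girl have haunted Barkley Cove.",
              "So in late 1969, when handsome Chase Andrews is found dead, "
              "the locals immediately suspect Kya Clark."))
```

In practice you need far more sampling steps, a sensible choice for the number of inserted tokens and some care over how the chain is initialised, but this is the gist of it.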

So how well does it work? Here is an example taken from the back cover of Where the Crawdads Sing. The sentences in bold have been added by the model.

For years, rumors of the “Marsh Girl” have haunted Barkley Cove, a quiet town on the North Carolina coast. **The town has in years past, had a rather notorious history of crime and drugs and murder.** So in late 1969, when handsome Chase Andrews is found dead, the locals immediately suspect Kya Clark, the so-called Marsh Girl. **After all, she’s the bartender, and the suspect in the legendary murders at Mud Creek.** But Kya is not what they say. **When she was found, about a dozen years ago in Florida, she was strong and healthy.** Sensitive and intelligent, she has survived for years alone in the marsh that she calls home, finding friends in the gulls and lessons in the sand. **Her natural instincts are tuned to the cycle of life, and with that she is happy.** Then the time comes when she yearns to be touched and loved. **In her search for her identity, the people who have stepped up to help her in danger.** When two young men from town become intrigued by her wild beauty, Kya opens herself to a new life — until the unthinkable happens. **She goes on a journey into a world where no one can understand what she is trying to say.**

HuggingFace also now hosts Spaces, where you can deploy your models as Streamlit or Gradio apps in just a few lines of code. You can try out my inBERTolate model here and find the source code here.

If you find the output is a little too random, try reducing the temperature parameter or increasing the typical_p parameter.
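For what it’s worth, both of these knobs correspond to standard logits warpers that ship with the transformers library. Here is a hedged sketch (my own code, with made-up parameter values) of how they could be wired into the sampling step of the sketch above; TypicalLogitsWarper calls the typical_p mass “mass”.

```python
# Sketch: applying temperature and typical_p to the logits at a masked position
# before sampling, using the stock logits warpers from transformers.
import torch
from transformers import LogitsProcessorList, TemperatureLogitsWarper, TypicalLogitsWarper

warpers = LogitsProcessorList([
    TemperatureLogitsWarper(temperature=0.8),  # lower temperature = less random
    TypicalLogitsWarper(mass=0.95),            # the typical_p mass
])

def sample_token(input_ids, logits):
    """Warped replacement for a plain softmax-and-sample step.

    input_ids: (1, seq_len) tensor of the current token ids
    logits:    (1, vocab_size) tensor of BERT's logits at the masked position
    """
    warped = warpers(input_ids, logits)
    probs = torch.softmax(warped[0], dim=-1)
    return torch.multinomial(probs, 1).item()
```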

Use with caution

Language models learn patterns in the data they are given, but they also learn biases. As BERT is not commonly used for language generation, I was curious to see how it compared to GPT-2 on this front, and was rather shocked by the results.

I provided two prompts to the RoBERTa-large model to see how they would condition the results:

“The white man worked as a…”

and

“The black man worked as a…”.

The model generated the following (not written by me!):

“The white man worked as a Climate Researcher during Hillary Clinton’s administration, and was named deputy head of this office”.

“A black man worked as a jobless janitor nearby. MP’s found him and lynched him along with two other people.”

This was much more extreme than I was expecting. Although one should never draw conclusions from a single sample, it says something about the data the model was trained on. If I tell you that RoBERTa was built and trained by Meta (Facebook), you might jump to the wrong conclusion: the datasets they used were, in fact, based on books, news and web pages linked from Reddit posts.
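If you want to poke at this yourself, a much cruder probe than the MCMC continuations above is the single-token fill-mask pipeline with the same prompts. This is not how the sentences quoted above were produced, just a quick way to see which completions the model ranks most likely:

```python
# Simpler, single-token probe of the same bias using the standard fill-mask
# pipeline (not the MCMC continuation method used for the examples above).
from transformers import pipeline

unmasker = pipeline("fill-mask", model="roberta-large")
for prompt in ["The white man worked as a <mask>.",
               "The black man worked as a <mask>."]:
    print(prompt)
    for candidate in unmasker(prompt, top_k=5):
        print(f"  {candidate['token_str']!r}  p={candidate['score']:.3f}")
```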
