What are LLMs good at—and where do they struggle?

Keanan Koppenhaver
Technical PMM @ Retool

Jul 2, 2024

If you’re involved in anything even remotely technology-adjacent (and you likely are given that you’re on the Retool blog right now), you’ve probably heard about large language models (LLMs). They’re the underlying technology that powers tools like ChatGPT, Claude, and others.

A large language model is a model that’s consumed a ton of text content on the internet, has learned how people use words and sentences, and is able to replicate that usage in response to prompts or questions. Since ChatGPT launched at the end of 2022, social media (and seemingly every other media channel) has been flooded with demo after demo showing off the “magic” of products and features powered by large language models.

LLMs are very generalizable, perhaps more so than any AI technology we’ve seen in the past. This means that instead of just being amazing at chess, or at recognizing whether an object in a photo is a hot dog or not, LLMs can be good at quite a variety of tasks (check out our latest State of AI report for more details on popular AI use cases).

They also unlock all sorts of automation in so many different areas, which has led them to become massively popular. But they’re not magic. LLMs have weaknesses, and you may run into significant gaps—including, but not limited to, trouble following your prompt, or answers that simply tell you what the model thinks you want to hear, even if they’re not correct.

It’s important to keep both LLMs’ strengths and weaknesses in mind as you decide how and where to use them in your work. In this article, we’ll take a look at a few types of tasks that LLMs are great at and a few areas where today’s models tend to struggle, to help you get back to experiencing some of that sparkle we all felt the first time we used ChatGPT.

The biggest strengths of LLMs

At their core, large language models are just that: language models. This means a good general rule to keep in mind is that if the task at hand involves language in one way or another, an LLM can likely help you. Things like generating an “explain like I’m 5 years old” response or imitating a specific person or publication’s writing style (like writing a rap about WordPress in the style of Eminem) are well within the capabilities of most large language models. If your task strays outside of language, though, you may encounter a bit more difficulty. With that in mind, let’s take a look at some of the types of tasks LLMs are good at.

Text generation/transformation

LLMs tend to be good at producing lots of language. So if you ask an LLM to generate a few paragraphs of lorem ipsum text, a couple paragraphs explaining a concept you don’t understand, or a list of study questions for a chapter of a textbook, you’re likely to get something pretty usable. And because the interface for most LLMs is some sort of chatbot, you can ask follow-up questions or provide additional requests pretty easily. For example: “You mentioned the concept of the Fermi paradox, can you explain that more?”, “Generate 10 more study questions similar to question #3”, or “Can we generate bacon ipsum text instead?”

LLMs are also very good at transforming language between formats. For example, maybe you want your explanation of black holes formatted as a series of tweets. Or if you’re a developer learning a new programming language, you might ask how a concept from a language you’re familiar with works in the context of the language you’re trying to learn. Remember that LLMs have consumed nearly every type of language on the public internet (from human languages to programming languages), from tweets to blog posts to scientific articles, and can emulate almost any style they’ve been exposed to.
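If you’re building rather than chatting, the same kinds of requests work over an API. Here’s a minimal sketch of a transformation request, assuming the openai Python SDK and an OPENAI_API_KEY in your environment (the model name, prompts, and sample text are all illustrative):

```python
# A minimal sketch of a transformation request. Assumes the openai Python SDK
# and an OPENAI_API_KEY in your environment; model name and text are illustrative.
from openai import OpenAI

client = OpenAI()

explanation = (
    "A black hole is a region of spacetime where gravity is so strong that "
    "nothing, not even light, can escape it."
)

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative; use whichever model you have access to
    messages=[
        {"role": "system", "content": "You reformat text without changing its meaning."},
        {"role": "user", "content": f"Rewrite this as a series of three tweets:\n\n{explanation}"},
    ],
)
print(response.choices[0].message.content)
```

Swapping out the user message is all it takes to go from “rewrite as tweets” to “explain this Python concept in Ruby terms.”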

Brainstorming

As a kind of subset of text generation, LLMs are pretty good at brainstorming ideas. Core to the brainstorming process is generating a lot of ideas, and LLMs tend to do that very well. Asking for 10 ideas will likely give you a decent list, but if you need 50 or 100 or more, LLMs are readily able to provide those. The current generation of models can return results almost as fast as whoever is prompting them can think of a follow-up request, so you can get into a flow where you’re generating ideas in groups of 5 or 10, giving the LLM feedback (“give me more ideas like #5” or “can we combine ideas from #6 and #7”), and iterating super quickly.

The solution or idea you end up with won’t be entirely the LLM’s creation, and you’ll definitely have to provide some coaching and bring in some of your own knowledge—but as a pure idea generation machine, large language models are pretty unmatched.
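That feedback loop is easy to script, too. Here’s a sketch of what iterative brainstorming might look like in code, keeping the conversation history so follow-ups like “more ideas like #3” have context (same assumed SDK as above; the product and follow-ups are made up):

```python
# A sketch of an iterative brainstorming loop: keep the conversation history
# so follow-ups like "more like #3" have context. (openai SDK assumed; the
# product, prompts, and follow-ups are made up.)
from openai import OpenAI

client = OpenAI()
messages = [{"role": "user", "content": "Brainstorm 10 names for a coffee-subscription app."}]
follow_ups = ["Give me 5 more ideas like #3.", "Combine ideas #1 and #7 into one."]

while True:
    reply = client.chat.completions.create(model="gpt-4o", messages=messages)
    ideas = reply.choices[0].message.content
    print(ideas, "\n")
    messages.append({"role": "assistant", "content": ideas})
    if not follow_ups:
        break
    messages.append({"role": "user", "content": follow_ups.pop(0)})
```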

Summarizing

Because of how they were created and trained, LLMs are great at summarization. At its core, the training process of an LLM involves compressing information: distilling lengthy text into more concise forms without losing essential meaning, which is exactly what you’re doing when creating a summary. In addition, the transformer architecture that LLMs are largely based on is particularly good at understanding context and relationships in text. Attention mechanisms within transformers allow the model to focus on the relevant parts of the input when generating summaries.

  • Did you know: transformers are the “T” in “GPT”?

These summaries can be focused and guided just like the other language generation tasks we discussed earlier. For example, you can ask the LLM for a summary of a text that focuses on a particular character or a specific concept, or for one that ignores a certain part of the text. One thing to keep in mind: because LLMs generate text one token at a time without counting words, you can’t reliably ask for a summary of an exact word count. However, you can continue to ask for shorter, longer, simpler, or more complex summaries until you get one that works for you.
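As a sketch of what a guided summary might look like in code (same assumed SDK; the focus instructions and source text are illustrative):

```python
# A sketch of a guided summary: steer the focus in the prompt, and ask for
# "shorter" or "longer" in follow-ups rather than an exact word count, which
# models can't reliably hit. (openai SDK assumed; prompt is illustrative.)
from openai import OpenAI

client = OpenAI()
chapter_text = "..."  # your source text here

prompt = (
    "Summarize the following text in a few short paragraphs. Focus on the "
    "protagonist's motivations and ignore the subplot entirely.\n\n" + chapter_text
)
summary = client.chat.completions.create(
    model="gpt-4o",  # illustrative
    messages=[{"role": "user", "content": prompt}],
)
print(summary.choices[0].message.content)
```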

Writing code (sort of)

With the advent of tools ranging from GitHub Copilot to the infamous Devin, there’s been a level of anxiety that LLM-based products could put programmers out of jobs. After all, code is really just language with a very specific structure, and there are tons of code samples, documentation repositories, and tutorials available on the public internet for LLMs to learn from.

However, “specific structure” is where it gets tricky. The rigidity of most programming languages can lead LLMs to struggle with producing 100% accurate code—which is usually a requirement if you’re having an LLM write code for you. LLMs can be great at generating code, and are always getting better—but if you’re looking for precision, we’re not quite there yet. Which leads us to…

The biggest weaknesses of LLMs

As we mentioned earlier, even though large language models can seem like magic, there are definitely some areas where that illusion breaks down…

Perfect recall/citations

One of the best features of the current wave of LLMs is that you can upload reference documents—PDFs, other text documents, images, and more—that the LLM will ingest and use when generating responses. Being able to chat with a PDF unlocks all sorts of possibilities for quicker data extraction and more informed reading.

This unlocks the ability to generate new writing with citations or references to the original text. Think of pulling quotes from a call transcript or referencing specific statistics from an uploaded PDF in a generated summary. However, LLMs haven’t perfected this quite yet. If the source material is long enough to use up a large portion of the LLM’s context window, the model can find it difficult to pull accurate quotes or statistics and will often confidently hallucinate something that wasn’t in the original source material.

This is why most LLMs have the important disclaimer to “check important info”—relying on citations supposedly pulled from source material (whether that source is the LLM’s training data or a document you uploaded) can be risky.
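Until models get better at this, a little defensive scripting can help. Here’s a sketch of one possible grounding check, assuming you still have the original source text on hand: flag any quoted span in the model’s output that doesn’t appear verbatim in the source. (This is deliberately crude; it won’t catch paraphrased hallucinations.)

```python
# A sketch of a simple grounding check: flag quoted spans in the model's
# output that don't appear verbatim in the source you gave it. The source
# and answer below are made-up examples.
import re

def unverified_quotes(source: str, llm_output: str) -> list[str]:
    """Return quoted spans from llm_output that aren't found in source."""
    quotes = re.findall(r'"([^"]+)"', llm_output)
    return [q for q in quotes if q not in source]

source = "Customer call notes: they asked about SSO and said timing is flexible."
answer = 'The customer said "we need SSO by Q3" and noted timing is flexible.'

for quote in unverified_quotes(source, answer):
    print(f'Quote not found in source, verify before trusting: "{quote}"')
```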

Calculations

Some people argue that math is its own version of language—but it’s not one that large language models are very adept at speaking (yet).

While not exactly a calculation, you can see this if you ask any LLM to produce a random number for you. In most cases, you’ll see that the numbers generated are anything but random, often over-indexing on 42 (a reference to The Hitchhiker’s Guide to the Galaxy that’s very popular in the training data) and 7 (one of the two most frequent choices when a human is asked to pick a number between 1 and 10).

This is (one reason) why it’s important to understand what LLMs actually are and at least a bit about how they function. They’re not (currently) configured to be able to perform mathematical calculations or choose random numbers but will often generate responses that would lead you to think otherwise. (Until you look at the numbers…)
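You can run this experiment yourself. Here’s a sketch that asks for a “random” number repeatedly and tallies the replies (same assumed SDK; model name illustrative; expect a heavy skew toward favorites like 42 and 7):

```python
# A sketch of the experiment: ask for a "random" number many times and tally
# the replies. (openai SDK assumed; model name illustrative; 25 calls to
# keep costs down.)
from collections import Counter
from openai import OpenAI

client = OpenAI()
counts = Counter()

for _ in range(25):
    reply = client.chat.completions.create(
        model="gpt-4o",  # illustrative
        messages=[{
            "role": "user",
            "content": "Pick a random number between 1 and 100. Reply with only the number.",
        }],
    )
    counts[reply.choices[0].message.content.strip()] += 1

print(counts.most_common(5))  # don't be surprised if 42 and 7 dominate
```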

Recently, many LLMs have adopted the practice of writing code to solve these sorts of problems instead of trying to solve them with the language model directly. So far, this seems to be helping increase the accuracy of calculations because, as we’ve discussed, they can write code reasonably well—and writing code actually is a good way to perform calculations.

An example of how LLMs approach calculations
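Here’s a sketch of that code-as-calculator pattern: ask the model for code rather than an answer, then run the code yourself. (Same assumed SDK; the prompt is illustrative, and in production you’d sandbox the execution rather than calling eval directly.)

```python
# A sketch of the code-as-calculator pattern: ask for code, not an answer,
# then run the code yourself. Never eval untrusted model output in real
# systems; sandbox it. (openai SDK assumed; prompt is illustrative.)
from openai import OpenAI

client = OpenAI()

reply = client.chat.completions.create(
    model="gpt-4o",  # illustrative
    messages=[{
        "role": "user",
        "content": "Reply with a single Python expression (no prose, no backticks) "
                   "that computes 10 years of 5% compound interest on $1,000.",
    }],
)
expression = reply.choices[0].message.content.strip()
print(eval(expression))  # demo only: sandbox or review before executing for real
```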

Multi-step problems

Because large language models are focused on predicting the next word in a sentence—based on what they’ve observed in their training data—most agree that they don’t have an inherent ability to reason. This means they often get tripped up by logic puzzles, complicated multi-part questions, and anything else that requires multiple steps to solve (unless they’ve already seen something very similar in their training data).

While you can specify the steps you want the LLM to follow as part of your prompt, or use techniques such as telling the LLM to “take a deep breath and think step-by-step,” the model can still get tangled. By default, you’ll find that LLMs aren’t great at breaking down a bigger problem into smaller steps on their own.
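In practice, “specifying the steps” can be as simple as numbering them in the prompt. A hypothetical example (the scenario and steps are made up; send it with the same API call shown earlier):

```python
# A sketch of spelling out the plan yourself rather than trusting the model
# to decompose the problem. The prompt text is the important part; send it
# with the same API call shown earlier. (Scenario and steps are made up.)
prompt = """You are helping plan a product launch.
Work through these steps in order, labeling each one in your answer:
1. List the three biggest risks.
2. Propose one mitigation for each risk.
3. Only after steps 1 and 2, summarize the plan in two sentences."""
```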

To combat this problem, OpenAI has released a new series of models, called o1, that are “designed to spend more time thinking before they respond.” According to the announcement, these new models can “reason through complex tasks and solve harder problems than previous models in science, coding, and math.”

While the use of these types of models is severely rate-limited (for now), people are already experimenting with different use cases and finding all types of multi-step problems that can be solved with this new series of models.

How you can improve the performance of LLMs

Even though large language models are limited in some ways, there are things you can do to help them get better at these types of tasks. Here are some quick tips on improving your AI output.

For a deeper dive, take a look at these five ways to immediately improve your AI models.

Break down multi-step problems into discrete steps

If you know the prompt you’re giving an LLM is best solved in multiple steps, explain your goal and then specify each of the individual steps you want the LLM to take before providing you with your desired output. This takes the planning step of problem solving away from the LLM and allows it to provide a better result. Prompt chaining, or feeding the output of one prompt into another, is also a useful method for breaking down multi-step problems, and allows you to debug the output of each prompt step individually to get even more consistent output.
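Here’s a minimal sketch of prompt chaining under the same SDK assumption, with the transcript and prompts as stand-ins: the output of one call becomes the input to the next, so each step can be inspected on its own.

```python
# A minimal sketch of prompt chaining: the output of one prompt becomes the
# input to the next, so each step can be inspected and debugged on its own.
# (openai SDK assumed; the transcript and prompts are illustrative.)
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    reply = client.chat.completions.create(
        model="gpt-4o",  # illustrative
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content

transcript = "..."  # your meeting transcript here

# Step 1 extracts; step 2 transforms. Inspect `decisions` before step 2 runs.
decisions = ask("List the key decisions from this meeting transcript:\n" + transcript)
email = ask("Turn these decisions into a short follow-up email:\n" + decisions)
print(email)
```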

Provide more input/output examples to get better output

Often known as few-shot prompting, providing the LLM with a few examples of “good” answers as part of your prompt gives it even more insight into the output you’re looking for. Since LLMs primarily work by pattern matching, giving them a specific pattern to match makes them even more effective than leaving them to pattern match against their entire training data set.
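A sketch of what a few-shot prompt might look like (the tickets and labels are made up):

```python
# A sketch of a few-shot prompt: a couple of labeled examples give the model
# a concrete pattern to match. (The tickets and labels are made up.)
prompt = """Classify each support ticket as billing, bug, or feature-request.

Ticket: "I was charged twice this month."
Label: billing

Ticket: "The export button does nothing in Safari."
Label: bug

Ticket: "It would be great if I could tag teammates."
Label:"""
```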

Experiment with different models

There are tons of different foundational LLMs out there (to say nothing of the fine-tuned models on sites like Hugging Face) that have each been trained in different ways and on slightly different data sets. This means each model will respond to the same prompt differently. Depending on what type of task you’re looking to accomplish, you might find you get more accurate and reliable results from one model versus another. Currently, there are no hard and fast rules about what makes one model more useful for a certain task than another, so experiment with different models on your specific prompts.
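A quick comparison script is only a few lines. Here’s a sketch that runs one prompt across several models (model names are illustrative; use whichever ones you have access to):

```python
# A sketch of running one prompt across several models to compare the output
# side by side. (openai SDK assumed; model names are illustrative.)
from openai import OpenAI

client = OpenAI()
prompt = "Explain vector embeddings to a product manager in three sentences."

for model in ["gpt-4o", "gpt-4o-mini"]:  # illustrative model names
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"--- {model} ---\n{reply.choices[0].message.content}\n")
```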

When you build an AI app inside Retool, this sort of experimentation is super simple. In a single dropdown, you can choose between any LLMs that you’ve connected to Retool, whether they’re proprietary foundational models or fine-tuned models hosted on your own infrastructure.

An example of a Retool app that lets you compare LLMs side-by-side.

Try it for yourself—enter a prompt and see how both OpenAI’s GPT-4 and GPT-3.5 Turbo models respond in our app.

Strengthen the output of your LLMs

Hopefully you now have a better idea of how to get better output from your LLMs, and what kinds of problems they can best help you solve. Large language models are only going to get better and the sooner you get comfortable experimenting with them and tweaking their output, the better positioned you’ll be to build apps on top of AI that really work, really well.

The easiest way to experiment with AI and build AI apps that you can actually ship to production is to use Retool. With pre-built AI components, queries you can run to automate workflows, and a built-in vector database all in one place, you can stop wrangling your AI stack and start building quickly.

If AI is the what, Retool is the how—and it’s never been easier to get started. Sign up for free and build an AI app today.

Keanan Koppenhaver
Technical PMM @ Retool
Keanan educates the community about how they can use low- and no-code tools to transform their day-to-day work, even if they wouldn't consider themselves a "developer". He writes about AI and LLMs.
Jul 2, 2024