LLMs are still wholly unreliable: a case study with CSS
Published by marco on
This is a 50-minute video of a guy who’s really good at using and teaching CSS asking three LLMs pointed and tricky questions about it.
It’s a bit long for what it is but I think there were some interesting things to learn. First of all, it’s very clear that Kevin hasn’t actually read very much about how LLMs work or how to prompt them. This is OK—because that means he’s just like most people trying to use these tools.
Overall, Kevin was frustrated with the answers he got from Gemini, ChatGPT, and Claude. Despite his frustration, he still ascribes too much ability to these text-generators. His questions, though phrased as a regular person might well phrase them, are the wrong kind for these machines, because he’s often pre-loading the context with information that the machine then folds into its answer, nearly always incorrectly.
On top of that, CSS has a lot of fiddly bits, like numeric specificities, which the LLMs either consistently get wrong or get right no more often than a coin-toss. There is no way for these general LLMs to know these things. You’d have to add a filter on top of them to weed out incorrect answers—which is moving away from the utility of a general-purpose question-answering machine.
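To make concrete what “numeric specificities” means, here’s a quick sketch of my own (not from the video): specificity is usually counted as a triple of IDs, classes, and element types, and the higher triple wins.

/* Specificity is counted as (IDs, classes/attributes/pseudo-classes, elements). */
a        { color: red; }    /* (0,0,1) */
.nav a   { color: green; }  /* (0,1,1): one class plus one element */
a:hover  { color: black; }  /* (0,1,1): a pseudo-class counts the same as a class */
#menu a  { color: blue; }   /* (1,0,1): a single ID outranks any number of classes */

It’s exactly this kind of fiddly, rule-based arithmetic that the machines in the video kept garbling.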
Already after the first question or two, he could have summed up with “the machines don’t know anything about CSS, so the massive amounts of text that they generate will almost always include something that will waste your time.”
Instead, he says,
“The only thing I would say here is, at least it’s so bad—this answer—that if somebody were reading this, they would know that it’s wrong.”
Oh, wow. That statement is absolutely not true for anyone who was actually seeking help, rather than Kevin, who’s an expert testing the machine. Unfortunately, people generally aren’t asking these machines questions to which they already know the answer.
I know from personal experience that students will just copy/paste the responses directly back into their own projects. They will not have any idea why it doesn’t work. They won’t be able to see that the massive amount of generated text—which hardly anyone reads, by the way[1]—disagrees with the code, which would at least warn them that perhaps the code isn’t correct. Or perhaps the description isn’t correct. Or perhaps they should just read and learn the material instead of wasting time with a digital idiot savant with CTE.
He keeps saying things like “Gemini is just bad at specificities” or “it doesn’t understand the system it’s built for itself here,” which are just completely nonsensical statements. The LLMs don’t understand the relationships between the pieces of text they produce. They simply can’t. It’s like expecting a car to fly.
The questions he asks are very likely to get incorrect answers, or correct answers with incorrect explanations. And since he uses multiple-choice questions, even a correct answer is likely to be luck. Why? Because the text-generator is based on probabilities, with a bit of “temperature” adjustment to introduce variability so that it feels like it was written by a person. That doesn’t help at all for very specific questions with very specific answers. LLMs are better for tasks where there is no right answer, where subjective style outweighs correctness.
Still, there is quite a bit of daylight between the LLMs. Gemini and CoPilot were much more often confidently wrong on this subset of questions than Claude was. Kevin’s final scores for 13 questions were: CoPilot: -4, Gemini: -4, Claude: 9. He concluded with,
“Claude is definitely the winner. It still got enough things wrong that I’m always a little bit nervous trusting these tools. They’re going to continue to get better but, just be really careful if you’re using them. […] it always says things with the utmost confidence, so just don’t copy-paste code they’re giving you. Try and understand the code they’re giving you and see if it actually makes sense. Especially, like, they’ll just say stuff isn’t true that is true and vice versa. They’ll make stuff up that isn’t true and say that it’s true and then their source will be some completely random GitHub repo. So be a little bit careful with these tools if you’re using them.”
Kevin’s not forceful enough in his conclusion. He says that he’s “a little bit nervous trusting them”, which I’m pretty sure is not what he means. What I think he means is, “don’t trust them,” i.e., “[t]ry and understand the code they’re giving you and see if it actually makes sense,” which, if you’re not already an expert, may prove difficult.
He also says that “[t]hey’re going to continue to get better,” but this statement is utterly without proof. He doesn’t understand their mechanism but just assumes that “progress” will fix everything. That’s OK, he’s a designer and CSS expert, not a market analyst, but I thought it was important to point out that people tend to repeat completely unsubstantiated claims like this until they’re all just reciting religious cant, and anyone who asks whether it’s actually true is called out as a heretic for even asking the question.
His final sentence is “be a little bit careful with these tools if you’re using them,” which is too soft. He means to say that people should be very careful with the answers. (And also, you don’t have to worry about the tools’ output if you’re not using the tools.)
People don’t read articles written by humans. They like and forward them having barely read the headline. What are the odds that they’re doing anything more than scrolling past all of the text to grab the highlighted code sample? The boilerplate responses from these machines train people to skip over text, because there’s often so much of it.
For example, there’s a point where Claude returns a very good answer explaining why, of the list ci, rlh, vb, and Q, the one that doesn’t exist is ci. Kevin says “I don’t know why Q is even capitalized or what it even means.” He’s literally showing and ostensibly reading the line that says “It’s equal to 1/40th of 1cm.” This apparently doesn’t compute for him because it’s only when he reads the list, where it says it’s a “unit from traditional typography, representing a quarter of a millimeter,” that the penny drops and he groks it.
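For reference, a quick sketch of my own (not from the video): rlh, vb, and Q are all real CSS units, while ci is not.

/* rlh: the root element’s line-height; vb: 1% of the viewport’s size in the
   block direction; Q: a quarter-millimeter (1Q = 1/40 of 1cm). */
h1    { margin-block: 2rlh; }
main  { max-block-size: 80vb; }
.card { padding: 8Q; }   /* 8Q = 2mm */
/* ci is not a CSS unit; a declaration like width: 10ci is simply invalid. */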
This is the wild part of all of this: the answer is so convincing and it happens to be correct, in this case, as the unit is a Quart (Wikipedia), but how are you supposed to believe it? It might just as well have made it up, unless you already knew the answer in advance. All of the machines made up the specificity rules, often getting them reversed and completely wrong. You cannot use these machines to learn this kind of stuff. You can use them to learn APIs, but not how things work.
You should only ever use this information as a jumping-off point, verifying the answer you think you got with other sources. Sometimes the answers include sources, like MDN, W3Schools, or the W3C, which you could just have checked in the first place instead of posing such questions to an LLM.
In another place, Kevin reads translate as transform, which goes to show that not just LLMs can get things wrong. 🙄
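(The mix-up is easy to make, to be fair. A quick sketch of my own, not from the video: CSS now has both the classic transform property and a standalone translate property.)

/* The classic way: translate as a function of the transform property. */
.box { transform: translateX(50px); }

/* The newer way: translate as its own property, alongside rotate and scale. */
.box { translate: 50px 0; }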