Wrapping text the hard way

Published by marco on

The work journal 2024-03-27T16:03:51 conversation: 01ht0afgwryks5fepkvvm0kn28 by Simon Willison (GitHub) describes the author’s process of using AI prompting to write a console text-wrapping algorithm.

He prompted with “JavaScript that takes a big string of text and word wraps it at the specified width, adding newlines where necessary.” The answers meandered around a solution space that seemed over-engineered and not particularly fruitful. They all used regular expressions, which seems like overkill when you’re really only trying to detect a single character. A simple parser would be more effective; easier to understand, debug, and maintain; and more likely to be performant.

The original wrapping code was:

function wordWrap(text, width) {
  const regex = new RegExp(`(?![^\\n]{1,${width}}$)([^\\n]{1,${width}})\\s`, 'g');
  return text.replace(regex, '$1\n');
}

Things got hairy when Willison asked for a solution that includes an indent before each wrapped line. It’s honestly hard to tell whether the example will actually do that, largely because of the non-obvious regular expressions (which make use of negative lookahead assertions).

function wordWrap(text, width, indent = '') {
  const regex = new RegExp(`(?![^\\n]{1,${width}}$)([^\\n]{1,${width}})\\s`, 'g');
  return text.replace(regex, `$1\n${indent}`).replace(/^/gm, indent);
}

When Willison noted that the indent appears everywhere but at the front of the text, the proposed code was:

function wordWrap(text, width, indent = '') {
  const regex = new RegExp(`(?![^\\n]{1,${width}}$)([^\\n]{1,${width}})\\s`, 'g');
  return indent + text.replace(regex, `$1\n${indent}`);
}

This looks promising, as the eye is drawn to the indent + at the beginning of the answer. If you look carefully, though, you’ll see that the second call to replace() has gone missing.

This is dangerous, right? Imagine not having sufficient tests for this code. You can’t really understand it, it looks vaguely right, and pieces that used to work are randomly dropped in response to “refinement prompts”. He tried a few more times, but kept losing rather than gaining fidelity in the response, with one or more corner cases no longer being supported.

When he told it to stop using regular expressions, it responded with:

function wordWrap(text, width, indent = '') {
  const lines = [];
  let currentLine = '';

  text.split('\n').forEach(line => {
    if (line === '') {
      lines.push('');
    } else {
      line.split(' ').forEach(word => {
        if (currentLine.length + word.length <= width) {
          currentLine += (currentLine ? ' ' : '') + word;
        } else {
          lines.push(indent + currentLine);
          currentLine = word;
        }
      });
      lines.push(indent + currentLine);
      currentLine = '';
    }
  });

  return lines.join('\n');
}

This actually seems better, and Willison claims it works, but it still seems only OK. The response allocates a lot (every call to split() creates an array plus a substring for each line or word) and it fails to separate concerns.

There are a few pieces to this request:

Split the string into paragraphs
Wrap each paragraph to lines of a given width, including an optional prefix on each line
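
To make that separation concrete, here is a minimal sketch that gives each piece its own function. The names splitParagraphs, wrapParagraph, and wrapText are hypothetical, and the sketch assumes that a paragraph is one input line and that the prefix counts toward the width; neither choice comes from Willison's conversation.

```javascript
// Concern 1: split the text into paragraphs.
// Assumption: one paragraph per input line.
function splitParagraphs(text) {
  return text.split('\n');
}

// Concern 2: wrap a single paragraph, prefixing every output line.
// Assumption: the prefix counts toward the width.
function wrapParagraph(paragraph, width, prefix = '') {
  const available = width - prefix.length;
  if (available < 1) {
    throw new RangeError('prefix leaves no room for text');
  }
  if (paragraph === '') {
    return ['']; // keep blank lines blank
  }
  const lines = [];
  let current = '';
  for (const word of paragraph.split(' ')) {
    if (current === '') {
      current = word; // an over-long word simply gets a line of its own
    } else if (current.length + 1 + word.length <= available) {
      current += ' ' + word; // the +1 accounts for the joining space
    } else {
      lines.push(prefix + current);
      current = word;
    }
  }
  lines.push(prefix + current);
  return lines;
}

// Composition: the only place that knows about both concerns.
function wrapText(text, width, prefix = '') {
  return splitParagraphs(text)
    .flatMap((paragraph) => wrapParagraph(paragraph, width, prefix))
    .join('\n');
}

console.log(JSON.stringify(wrapText('aaa bbb ccc\n\nddd', 9, '  ')));
// "  aaa bbb\n  ccc\n\n  ddd"
```

Each function can now be tested on its own, and changing what counts as a paragraph does not touch the wrapping logic.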

You can play with his version at Wrap text at specified width by Simon Willison (Observable).

When I threw the code into WebStorm and added a test, I discovered that it doesn’t wrap at the desired width.

It fails to take the length of the indent into account when wrapping the text.
It doesn’t sanity-check that the indent isn’t bigger than the desired wrap length.

At least the algorithm doesn’t fall into the pathological trap where a word that is too long on its own to fit within the desired width results in an infinite loop.
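
These observations are easy to reproduce. The snippet below repeats the non-regex function unchanged and runs it against two small inputs. The inputs and the longest helper are made up for illustration; they are not the test I ran in WebStorm.

```javascript
// Copy of the non-regex version quoted above, unchanged.
function wordWrap(text, width, indent = '') {
  const lines = [];
  let currentLine = '';

  text.split('\n').forEach(line => {
    if (line === '') {
      lines.push('');
    } else {
      line.split(' ').forEach(word => {
        if (currentLine.length + word.length <= width) {
          currentLine += (currentLine ? ' ' : '') + word;
        } else {
          lines.push(indent + currentLine);
          currentLine = word;
        }
      });
      lines.push(indent + currentLine);
      currentLine = '';
    }
  });

  return lines.join('\n');
}

// Hypothetical helper: length of the longest line in the output.
const longest = (s) => Math.max(...s.split('\n').map((l) => l.length));

// Width 7 with a two-character indent: the first line comes out 9 wide.
const wrapped = wordWrap('aaa bbb ccc', 7, '  ');
console.log(JSON.stringify(wrapped)); // "  aaa bbb\n  ccc"
console.log(longest(wrapped)); // 9

// A word longer than the width terminates, but leaves an empty first line.
console.log(JSON.stringify(wordWrap('abcdefghij', 5))); // "\nabcdefghij"
```

The over-long word does not loop forever, but it does produce an empty line before it, because the empty currentLine is pushed before the word is stored.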

I noodled around with a solution that uses generators to avoid allocation until you actually snip text. The algorithm looks like this:

Pass in a text, desired width, and indent.
Get a generator for all line breaks.
Get a generator for all word breaks in a line.
Yield a generator for all line spans.
Yield a generator from that for all lines.

This strategy ensures that there are no substring allocations until you actually need them. You can get the first line or two without allocating more than the substrings for those lines. Unlike the split()-based version above, finding the word breaks allocates no substrings at all.
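
The steps above could be sketched roughly as follows. This is an illustration of the idea, not the code I actually wrote: the function names are hypothetical, and it assumes that words are separated by spaces only and that the indent counts toward the width. The generators yield small [start, end) index pairs, so the only substrings created are the finished lines.

```javascript
const SPACE = 32; // char code of ' '

// Yield [start, end) spans for each input line, split at '\n'.
function* lineSpans(text) {
  let start = 0;
  while (start <= text.length) {
    let end = text.indexOf('\n', start);
    if (end === -1) end = text.length;
    yield [start, end];
    start = end + 1;
  }
}

// Yield [start, end) spans for each word between `from` and `to`.
function* wordSpans(text, from, to) {
  let i = from;
  while (i < to) {
    while (i < to && text.charCodeAt(i) === SPACE) i++; // skip spaces
    if (i >= to) return;
    const start = i;
    while (i < to && text.charCodeAt(i) !== SPACE) i++; // scan the word
    yield [start, i];
  }
}

// Yield [start, end) spans for each wrapped output line.
function* wrappedSpans(text, width) {
  for (const [from, to] of lineSpans(text)) {
    let start = -1; // start of the pending output line, -1 if none yet
    let end = -1;
    for (const [wordStart, wordEnd] of wordSpans(text, from, to)) {
      if (start === -1) {
        start = wordStart; // first word always starts a line, even if too long
        end = wordEnd;
      } else if (wordEnd - start <= width) {
        end = wordEnd; // the word fits, including the spaces before it
      } else {
        yield [start, end];
        start = wordStart;
        end = wordEnd;
      }
    }
    // Flush the pending line; a blank input line yields an empty span.
    yield start === -1 ? [from, from] : [start, end];
  }
}

// Yield finished lines; slice() here is the only substring allocation.
function* wrapLines(text, width, indent = '') {
  const available = width - indent.length;
  if (available < 1) {
    throw new RangeError('indent leaves no room for text');
  }
  for (const [start, end] of wrappedSpans(text, available)) {
    yield start === end ? '' : indent + text.slice(start, end);
  }
}

// Only the first line is ever materialized here.
console.log(wrapLines('the quick brown fox', 10).next().value); // the quick

console.log([...wrapLines('aaa bbb ccc\n\nabcdefghij', 7, '  ')]);
// [ '  aaa', '  bbb', '  ccc', '', '  abcdefghij' ]
```

Because generators are lazy, the RangeError for an oversized indent is only thrown when the first line is requested, not when wrapLines is called.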

Is it a more complex algorithm? Of course. Is it relatively easy to understand, especially with the requisite tests? Yes. Does it do its job much more efficiently? Absolutely.