The Two Sides of an Error (and a Short Survey in the Go language)

Estimated reading time: 20 to 30 minutes

Special thanks to Alexandre Salle, Pietro Menna and other colleagues for the valuable discussions on this theme.

What is the purpose of writing errors when programming? And how can we help people using our code by writing good errors? This is a summary of a few discussions I had on the subject and a short survey on how this matter is handled in the Go language.

Motivation: the Perspective of the Caller

Let's imagine you're writing code and you call a function that might return an error. What can you do with that error? There are two cases (ignoring the error is not an option!):

1. You can take action. For example, retry if it's a temporary problem, or maybe free some memory and call the function again, or even set the system to a degraded state while still keeping it functional. Or,
2. You cannot take action. In that case, you can return the error, throw an exception, log the error, exit, etc., depending on where your code lies on the system stack.

With that in mind, we can think about what a good error offers the caller:

1. Meaningful information for the code to take action at runtime. For example, an error of type TemporaryError informs the caller that they can retry the operation. EndOfFileError can inform the caller that they've reached the end of input, and so on. And,
2. Good error messages containing debug information, such as the underlying causes and stack traces. This allows for taking action at debug time.

Notice these two traits mirror the two cases listed before. It's important to distinguish what the code can do about the error from what the IT guy can do about it, since these are very different enterprises. The next two sections go through these two traits in more detail.
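As a minimal sketch of the first trait, assume a hypothetical TemporaryError type (it is not part of any library) and a stand-in operation called flaky that fails on its first two attempts. The caller retries only because the error type says a retry makes sense, while the underlying cause stays in the message for whoever reads the logs.

```go
package main

import (
	"errors"
	"fmt"
)

// TemporaryError is a hypothetical error type (not from any library) that
// tells the caller a retry may succeed.
type TemporaryError struct {
	Cause error // underlying cause, kept for the debug message only
}

func (e *TemporaryError) Error() string {
	return "temporary failure: " + e.Cause.Error()
}

// flaky is a stand-in operation that fails temporarily on attempts 1 and 2.
func flaky(attempt int) error {
	if attempt < 3 {
		return &TemporaryError{Cause: fmt.Errorf("attempt %d: connection reset", attempt)}
	}
	return nil
}

// callWithRetry retries only while the error is a TemporaryError.
// It returns the number of attempts made and the last error seen.
func callWithRetry(maxAttempts int) (int, error) {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = flaky(attempt)
		var tmp *TemporaryError
		if err == nil || !errors.As(err, &tmp) {
			// Success, or an error the code cannot act on: stop here.
			return attempt, err
		}
	}
	return maxAttempts, err
}

func main() {
	attempts, err := callWithRetry(5)
	fmt.Println(attempts, err) // prints "3 <nil>"
}
```

With a budget of only two attempts, the same call gives up and returns the TemporaryError, whose message still carries the cause for debug time.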

Runtime Decisions and Abstraction Levels

Suppose you're calling a function called Compute(), which is a remote procedure call (RPC). Many things can go wrong. The call stack may look like this:

Level	Function Call
RPC	`Compute()`
HTTP	`Get()`
TCP	`Recv()`
IP	OS level
…	…

If Compute() receives an error that originated in a corrupt IP packet or a failed DNS lookup, the caller of Compute() shouldn't be expected to recover from it. Just imagine writing scientific computing code and having if statements to deal with network issues. Yikes. It is the job of Compute() to distinguish which errors are recoverable and which ones are not. The caller can't deal with every single error from the lower layers, because the branching factor makes this undesirable or even infeasible.

Each function should summarize the errors from lower layers and give the caller only meaningful information to answer questions like: What can I do about this error here in the code? Is this error actionable at runtime?

This relates to point 1 mentioned at the beginning of this text.
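One way to picture this summarizing step is the sketch below. Everything in it is an assumption made for illustration: the lower-layer error variables, the ComputeError type, its Retryable field and the httpGet stand-in are all hypothetical. The point is only that many lower-layer failures reach the caller as a single yes-or-no question.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical lower-layer failures; a real network stack exposes many more.
var (
	errReadTimeout  = errors.New("tcp: read timeout")
	errCorruptPkt   = errors.New("ip: corrupt packet")
	errHostNotFound = errors.New("dns: could not resolve host")
)

// ComputeError is the only error type callers of Compute need to know about.
type ComputeError struct {
	Retryable bool  // the one fact the caller can act on at runtime
	Cause     error // lower-layer detail, used only in the message
}

func (e *ComputeError) Error() string {
	return "Compute() failed: " + e.Cause.Error()
}

// httpGet is a stand-in for the HTTP layer; it fails with whatever it is given.
func httpGet(failure error) error { return failure }

// Compute summarizes every lower-layer failure into a single ComputeError.
// In this sketch only a read timeout is considered worth retrying.
func Compute(failure error) error {
	err := httpGet(failure)
	if err == nil {
		return nil
	}
	return &ComputeError{Retryable: err == errReadTimeout, Cause: err}
}

// shouldRetry is all the branching the caller has to do.
func shouldRetry(err error) bool {
	var ce *ComputeError
	return errors.As(err, &ce) && ce.Retryable
}

func main() {
	for _, failure := range []error{nil, errReadTimeout, errCorruptPkt, errHostNotFound} {
		err := Compute(failure)
		fmt.Printf("err=%v retry=%v\n", err, shouldRetry(err))
	}
}
```

The caller never mentions TCP, IP or DNS. Which failures count as retryable is a decision that lives inside Compute, at the level of abstraction where it can actually be made.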

Debug Time Decisions and Log Messages

Let's change perspectives. You're not writing code anymore, but debugging code or looking at system logs. In this case, you want the error messages to be complete. You want to see the errors from the first layer that failed all the way up to the main routine of the program.

Imagine you're debugging the program that uses Compute() and you see this message:

Compute() failed: Connectivity error

You'll certainly be frustrated. Compare this to:

Compute() failed: Connectivity error: HTTP request failed: GET http://loclhost:8080: Could not resolve host: loclhost

Now you've found the culprit.

Doesn't this feel very different from what the code wants to see? It is now the responsibility of the error to carry all of its underlying causes. Every useful detail. Notice, however, that the information "could not resolve host" does not help the program recover, but is extremely helpful for the programmer debugging it. Therefore, it is hidden in the message, not in an error code or error type.
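A message like the complete one above can be built by having each layer prepend its own context to the error it received. The sketch below is only an illustration: resolve and httpGet are hypothetical stand-ins for the lower layers, and the message strings are copied from the example rather than from any real library.

```go
package main

import "fmt"

// resolve is a stand-in for a DNS lookup that always fails.
func resolve(host string) error {
	return fmt.Errorf("Could not resolve host: %s", host)
}

// httpGet is a stand-in for the HTTP layer. It adds its own context in front
// of the error coming from below.
func httpGet(url, host string) error {
	if err := resolve(host); err != nil {
		return fmt.Errorf("HTTP request failed: GET %s: %v", url, err)
	}
	return nil
}

// Compute adds the top-level context. The caller only learns that Compute
// failed; the full chain of causes lives in the message.
func Compute() error {
	if err := httpGet("http://loclhost:8080", "loclhost"); err != nil {
		return fmt.Errorf("Compute() failed: Connectivity error: %v", err)
	}
	return nil
}

func main() {
	fmt.Println(Compute())
}
```

Each layer contributes one segment, so the final message reads from the highest level of abstraction down to the first failure.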

Looking at Real Code

Sampling Code

To see if I could map these hypotheses to real applications, I took samples of Go code. I went to the GitHub monthly trending Go repositories and downloaded (go get) about 20 projects and their dependencies. Thanks to Go's ubiquitous "if err ..." statements, I could randomly sample projects for points where errors are encountered. I used this shell command:

grep -A5 'if err ' $(find . -type f -name '*.go' | grep -vE '(test.go|vendor)' | shuf | head -100)

This recursively finds all .go files starting in the current directory, shuffles them, takes the first 100 and finds in them the lines containing "if err " (mind the space). Then it prints the match and the 5 subsequent lines. Phew. If you run this in your $GOPATH/src folder you'll see output that looks like this:

...
./golang.org/x/tools/cmd/toolstash/cmp.go:	if err != nil {
./golang.org/x/tools/cmd/toolstash/cmp.go-		log.Fatal(err)
./golang.org/x/tools/cmd/toolstash/cmp.go-	}
./golang.org/x/tools/cmd/toolstash/cmp.go-	defer f1.Close()
./golang.org/x/tools/cmd/toolstash/cmp.go-
./golang.org/x/tools/cmd/toolstash/cmp.go-	f2, err := os.Open(outfile + ".stash.log")
--
./github.com/kubernetes/kubernetes/pkg/proxy/ipvs/proxier.go:			if err != nil {
./github.com/kubernetes/kubernetes/pkg/proxy/ipvs/proxier.go-				glog.Errorf("Failed to add destination: %v, error: %v", newDest, err)
./github.com/kubernetes/kubernetes/pkg/proxy/ipvs/proxier.go-				continue
./github.com/kubernetes/kubernetes/pkg/proxy/ipvs/proxier.go-			}
./github.com/kubernetes/kubernetes/pkg/proxy/ipvs/proxier.go-		}
./github.com/kubernetes/kubernetes/pkg/proxy/ipvs/proxier.go-		// Delete old endpoints
--
./github.com/tools/godep/rewrite.go:	if err != nil {
./github.com/tools/godep/rewrite.go-		return err
./github.com/tools/godep/rewrite.go-	}
./github.com/tools/godep/rewrite.go-	ast.SortImports(fset, f)
./github.com/tools/godep/rewrite.go-	tpath := name + ".temp"
...

I ran this command many times, scanning the output. The goal here is to see what happened when an error was found in code. I didn't do proper statistics, but after reading many of those snippets, it seemed to me the handling could be grouped into a few categories. You can run the above command and see if you find the same categories. I just recommend doing so before reading any further, to avoid confirmation bias. You may also disagree about how the sampling was done, or think that the command was just plain wrong. In any case, I've found the handling to fall into these four classes, ordered from most common to least common:

return err. Can't do anything nor add information, so just return.
return fmt.Errorf(..., err). Can't do anything, but debug information is prepended to the underlying error.
log.Fatalf(..., err), log.Errorf(..., err), etc. This seems to be most common in source files for executables. Places like main.go files, files with the same name as their parent folder, or files in a cmd folder.
return newTypedError(message, err). The underlying error is wrapped in a new type of error, raising the level of abstraction. This seemed surprisingly uncommon.
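For reference, the four classes can be placed side by side in one small program. This is a sketch, not code from any of the sampled projects: load, errDisk and SettingsError are hypothetical, and log.Printf stands in for log.Fatalf so the program keeps running.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// errDisk and load are hypothetical stand-ins for any failing lower-level call.
var errDisk = errors.New("disk read failed")

func load() error { return errDisk }

// SettingsError is a hypothetical typed error used for class 4.
type SettingsError struct{ Err error }

func (e *SettingsError) Error() string { return "settings unavailable: " + e.Err.Error() }

// Class 1: nothing to do and nothing to add, so just return.
func passThrough() error {
	if err := load(); err != nil {
		return err
	}
	return nil
}

// Class 2: prepend debug information; the result is a generic error.
func annotate() error {
	if err := load(); err != nil {
		return fmt.Errorf("loading settings: %v", err)
	}
	return nil
}

// Class 3: log near the top of the program. log.Printf is used here instead
// of log.Fatalf so the sketch does not exit.
func logOnly() bool {
	if err := load(); err != nil {
		log.Printf("loading settings: %v", err)
		return false
	}
	return true
}

// Class 4: wrap the cause in a new type, raising the level of abstraction.
func typed() error {
	if err := load(); err != nil {
		return &SettingsError{Err: err}
	}
	return nil
}

func main() {
	fmt.Println(passThrough())
	fmt.Println(annotate())
	fmt.Println(logOnly())
	fmt.Println(typed())
}
```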

I've also selected some examples which I thought were representative of the two uses for errors proposed here.

Some Examples

The following are code snippets to illustrate the concepts of cases 1 and 2, which I've distinguished in the beginning of this text. They were all collected by the code sampling technique I mentioned, but I've made the formatting a bit nicer. The first comment tells you where to find the code.

Runtime Decision Making

This is the caller's perspective of case number 1. I considered here cases in which the code takes action due to the error.

Here, branching occurs on a special error type:

// In the standard library's net/http/httputil/persist.go
if err != nil {
	if err == io.ErrUnexpectedEOF {
		// A close from the opposing client is treated as a
		// graceful close, even if there was some unparse-able
		// data before the close.
		sc.re = ErrPersistEOF
		return nil, sc.re

// In the standard library's runtime/pprof/internal/profile/legacy_profile.go
if err != nil {
	if err == errUnrecognized {
		// Recognize assignments of the form: attr=value, and replace
		// $attr with value on subsequent mappings.
		if attr := strings.SplitN(l, delimiter, 2); len(attr) == 2 {
			attrs = append(attrs, "$"+strings.TrimSpace(attr[0]), strings.TrimSpace(attr[1]))
			r = strings.NewReplacer(attrs...)

In this example, the code turns on a flag and continues processing:

// In the standard library's go/ast/resolve.go
if err != nil {
	p.errorf(spec.Path.Pos(), "could not import %s (%s)", path, err)
	importErrors = true
	continue
}

Abstracting Before Returning to the Caller

This is the error's perspective of case number 1. I considered it abstracting when an error variable is collapsed into a single kind of error, when more than one bit of information is summarized in one error type, or when errors are reinterpreted before they're returned to the caller.

Below, a special condition receives a name:

// In the standard library's net/http/h2_bundle.go
if err == io.EOF && cs.bytesRemain > 0 {
	err = io.ErrUnexpectedEOF
	cs.readErr = err
	return n, err
}

// In github.com/ethereum/go-ethereum/core/vm/interpreter.go
if err != nil || !contract.UseGas(cost) {
	return nil, ErrOutOfGas
}

This time an error is reinterpreted:

// In the standard library's net/http/h2_bundle.go
if err == http2ErrNoCachedConn {
	return nil, ErrSkipAltProtocol
}

// In github.com/golang/go/src/cmd/go/build.go
if err != nil {
	return false
}

And, here, potentially many types of errors are collapsed into one type:

// In the standard library's os/env.go
if err != nil {
	return NewSyscallError("setenv", err)
}
return nil

// In the standard library's runtime/pprof/internal/profile/legacy_profile.go
if err != nil {
	return nil, errUnrecognized
}

Logging the Error

This is the caller's perspective of case number 2. Plenty of examples of simply logging the error were found. In these cases, the error message is exposed to whoever is looking at the terminal output. Notice how, even without context, you can see the errors seem to be "non-actionable": things related to hardware failure, invalid input, hard network problems, failed system calls, etc. The code can't recover from these, but the programmer sitting in the chair can plug in a network cable, optimize loops, fix the syntax error, and so on.

// In github.com/kubernetes/kubernetes/test/e2e/common/autoscaling_utils.go
if err != nil {
	framework.Logf("ConsumeCPU failure: %v", err)
	return false, nil
}

// In golang.org/x/tools/cmd/godoc/handlers.go
t, err := template.New(name).Funcs(pres.FuncMap()).Parse(string(data))
if err != nil {
	log.Fatal("readTemplate: ", err)
}
return t

// In github.com/ethereum/go-ethereum/metrics/influxdb/influxdb.go
_, _, err := r.client.Ping()
if err != nil {
	log.Printf("got error while sending a ping to InfluxDB, trying to recreate client. err=%v", err)
	if err = r.makeClient(); err != nil {
		log.Printf("unable to make InfluxDB client. err=%v", err)
	}
}

// In golang.org/x/tools/cmd/toolstash/cmp.go
f2, err := os.Open(outfile + ".stash.log")
if err != nil {
	log.Fatal(err)
}

Adding Debug Information

This is the error's perspective of case number 2. These were cases in which the underlying error was not available to the caller, but its information was appended to the error message. Again, you can notice patterns similar to the aforementioned ones. These errors are irrecoverable, so a generic error type is returned. The caller can't do anything and the underlying cause belongs in log messages, not in the caller's code.

// In github.com/golang/go/src/cmd/go/build.go
if err != nil {
	os.Remove(dst)
	return fmt.Errorf("copying %s to %s: %v", src, dst, err)
}

// In github.com/ethereum/go-ethereum/whisper/whisperv5/whisper.go
if err != nil {
	return "", fmt.Errorf("failed to generate ID: %s", err)
}

// In github.com/alecthomas/chroma/style.go
if err != nil {
	return nil, fmt.Errorf("invalid entry for %s: %s", ttype, err)
}

How to Allow for Good Runtime and Debug Time Decisions

The Amount of Information an Error Exposes

This is related to runtime decision making. When either a generic error is returned (the return fmt.Errorf(..., err) cases) or no error is returned (return nil), the information of the underlying error is collapsed into one bit. The caller of this function has two cases to distinguish: either an error occurred or it didn't.

When we define error types in Go (or exception types in Java, or special return values in C, etc.), we are giving the caller more information. A function that can return two kinds of error gives the caller three possible outcomes: errors of the first type, errors of the second type and no errors at all.
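The three outcomes can be sketched as follows. The lookup function, its in-memory store and the two error variables are hypothetical names chosen for the illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Two hypothetical kinds of error exposed by lookup.
var (
	ErrNotFound   = errors.New("key not found")
	ErrPermission = errors.New("permission denied")
)

var store = map[string]string{"color": "blue"}

// lookup has three possible outcomes: a value, ErrNotFound or ErrPermission.
func lookup(key string) (string, error) {
	if key == "secret" {
		return "", ErrPermission
	}
	value, ok := store[key]
	if !ok {
		return "", ErrNotFound
	}
	return value, nil
}

// get branches once per outcome that lookup exposes.
func get(key string) string {
	value, err := lookup(key)
	switch {
	case err == nil:
		return value
	case errors.Is(err, ErrNotFound):
		return "default" // actionable: fall back to a default value
	default:
		return "" // ErrPermission: nothing to do here, give up on this key
	}
}

func main() {
	fmt.Printf("%q %q %q\n", get("color"), get("size"), get("secret"))
}
```

Had lookup returned only a generic error, get could not tell the missing key from the forbidden one and would have to treat both the same way.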

Go has the special trait that errors are values. This allows for flexible error handling techniques. The standard library has many ways of creating and exposing errors to its users: using variables, types, methods, anonymous functions, etc. The principle is still the same: to convey the relevant information about the error to the calling code at the appropriate level of abstraction.
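Three of those styles can be seen together in the sketch below, which uses io.EOF (a variable), *os.PathError (a type) and the Timeout method of the net.Error interface. The timeoutErr type, the classify function and the sample path are hypothetical and exist only to make the example self-contained.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net"
	"os"
)

// timeoutErr is a hypothetical error that satisfies the net.Error interface.
type timeoutErr struct{}

func (timeoutErr) Error() string   { return "i/o timeout" }
func (timeoutErr) Timeout() bool   { return true }
func (timeoutErr) Temporary() bool { return true }

// classify inspects an error in three different ways.
func classify(err error) string {
	var pathErr *os.PathError
	var netErr net.Error
	switch {
	case err == nil:
		return "ok"
	case errors.Is(err, io.EOF): // compared against a variable
		return "end of input"
	case errors.As(err, &pathErr): // matched by concrete type
		return "path problem: " + pathErr.Path
	case errors.As(err, &netErr) && netErr.Timeout(): // asked through a method
		return "timed out"
	default:
		return "unknown"
	}
}

// sampleErrors returns one error for each branch of classify.
func sampleErrors() []error {
	return []error{
		nil,
		io.EOF,
		&os.PathError{Op: "open", Path: "/etc/app.conf", Err: os.ErrNotExist},
		timeoutErr{},
		errors.New("boom"),
	}
}

func main() {
	for _, err := range sampleErrors() {
		fmt.Println(classify(err))
	}
}
```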

Hiding (But Not Losing) the Underlying Cause

As we can see from the code samples, sometimes information is hidden from the caller. When we wrap the underlying error by doing

return fmt.Errorf(..., err)

we are hiding from the caller the real cause of the error. Notice, however, that we're not hiding it from the person debugging the output of the code. They still see the underlying error in the final message, because that might be useful for debugging.
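The sketch below makes that concrete under a few assumptions: fetch, errHostNotFound and causeVisible are hypothetical names. Formatting the cause with %v turns it into plain text, so the calling code can no longer match it, yet the text is still there for a person to read.

```go
package main

import (
	"errors"
	"fmt"
)

// errHostNotFound is a hypothetical low-level error.
var errHostNotFound = errors.New("could not resolve host")

// fetch is a stand-in for a lower layer that always fails.
func fetch() error { return errHostNotFound }

// Compute flattens the cause into the message with %v.
func Compute() error {
	if err := fetch(); err != nil {
		return fmt.Errorf("Compute() failed: %v", err)
	}
	return nil
}

// causeVisible reports whether calling code can still match the low-level error.
func causeVisible(err error) bool {
	return errors.Is(err, errHostNotFound)
}

func main() {
	err := Compute()
	fmt.Println(causeVisible(err)) // false: hidden from the code
	fmt.Println(err)               // Compute() failed: could not resolve host
}
```

Since Go 1.13, writing %w instead of %v keeps the cause reachable through errors.Is and errors.As, which is a deliberate choice to expose it to the caller rather than hide it.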

Conclusions

When returning errors to the caller, it's important to distinguish the two types of decision making developers go through: runtime and debug time. Taking into account well-known themes in computer science, such as information hiding and abstraction, we must be careful not to conflate those matters. Information that is useless to the caller (though possibly useful to someone debugging) belongs inside the error message, and only the information necessary for runtime decision making should be exposed.