Hunting an Idris2 Codegen Bug

Background

I had been working on an annoyingly difficult Ph.D.-related project recently, and when it finally did work, Idris itself crashed. And spectacularly so! I got an error message which I’d never seen before:

Exception in string-append: erased is not a string

This wasn’t an INTERNAL ERROR, which typically show up when there is an idris_crash call somewhere, so it had to be something to do with the compiled Idris code. And it had to be something serious, because erased values are (as the name suggests) not meant to be there; they’re never meant to be used at runtime! So something was up with the Idris compiler. That’s scary, because there is a lot that makes Idris2 tick and, like all compiled code, it’s really mostly meant to be machine-generated and executed, not read by humans…

After asking for help on the Idris discord, both stefan-hoeck and dunham suggested ways to get more readable compiled code out (by compiling to NodeJS) and to potentially get a stack trace (by using Racket Scheme instead of Chez). With that, I was able to narrow down the problem to the built-in mod function. The mod function is part of the Idris2 prelude’s Integral interface, and is generally implemented along the following lines:

Integral Int where
  div x y =
    case y == 0 of
         False => prim__div_int x y

  mod x y =
    case y == 0 of
         False => prim__mod_int x y

This, as you may have noticed, is neither total nor covering. So what happens with that? My first thought was to “fix” this by filling in the missing True case and adding a descriptive crash message in those cases. However, dunham figured out that the issue was actually more serious than that. He defined mod exactly the same, just outside the Integral interface, and got the expected crash message!

partial
mod : Int -> Int -> Int
mod a b = case b == 0 of False => prim__mod_Int a b

partial
main : IO ()
main = printLn (mod 10 0)
ERROR: Unhandled input for Main.case block in mod at src.Error:9:1--10:31

The error I was seeing was only reproducible when using the built-in Prelude.Num.mod, suggesting that there was something wrong with the code generation/compilation itself!

The problem

As part of his investigation, dunham had discovered that the generated scheme code did contain the correct error message, but somehow the generated function call was wrong:

(define PreludeC-45Num-u--mod_Integral_Int 
  (lambda (arg-0 arg-1) 
          (let ((sc0 (PreludeC-45EqOrd-u--C-61C-61_Eq_Int arg-1 (blodwen-toSignedInt 0 63)))) 
            (cond ((equal? sc0 0) (blodwen-euclidMod arg-0 arg-1)) 
                  (else ((Builtin-idris_crash 'erased) "Unhandled input for Prelude.Num.case block in mod at Prelude.Num:131:3--133:40"))))))

That final line evaluating Builtin-idris_crash 'erased, followed by an error string, should just be the evaluation of Builtin-idris_crash applied to that very error string. An extra 'erased was getting in there somehow.

To get a sense of whether this problem was local to the Scheme backends or a more general problem, I checked the behaviour on the other code-gen targets. Unfortunately, I could reproduce the erroneous code generation on both the NodeJS and the Reference-counting C (RefC) backends, meaning the problem was universal.

Show NodeJS code

/* Prelude.Num.mod */
function Prelude_Num_mod_Integral_Int($0, $1) {
 switch(Prelude_EqOrd_x3dx3d_Eq_Int($1, Number(_truncBigInt32(0n)))) {
  case 0: return _mod($0, $1);
  default: return Builtin_idris_crash(undefined)('Unhandled input for Prelude.Num.case block in mod at Prelude.Num:131:3--133:40');
 }
}

Show RefC code

switch(extractInt(var_6)){
      case 0 :
      {
        tmp_15 = mod_Int64(var_0, var_1);
        break;
      }
      default :
      {
        // start Builtin_idris_crash(NULL)                   // Prelude.Num:131:3--133:40
        Value_Arglist *arglist_16 = newArglist(0,1);
        arglist_16->args[0] =  newReference(NULL);
        Value *(*fPtr_17)(Value_Arglist*) = Builtin_idris_crash_arglist;
                                                             // Prelude.Num:131:3--133:40
        Value *closure_17 = (Value*)makeClosureFromArglist(fPtr_17, arglist_16);
                                                             // Prelude.Num:131:3--133:40
        // end   Builtin_idris_crash(NULL)                   // Prelude.Num:131:3--133:40
        Value * var_4 = trampoline(closure_17);              // Prelude.Num:131:3--133:40
        Value * var_5 = (Value*)makeString("Unhandled input for Prelude.Num.case block in mod at Prelude.Num:131:3--133:40");
                                                             // Prelude.Num:131:3--133:40
        tmp_15 = tailcall_apply_closure(var_4, var_5);
        removeReference(var_5);
        removeReference(var_4);
      }
    }

Hunting down the bug

Idris comes with logging functionality for exactly this kind of scenario. So to help me narrow down what was wrong, I tried adding some %logging around the affected code. Unfortunately, neither the compile.casetree or just the compile log topics yielded anything. Fortunately, I had the expected error string: it started with “Unhandled input”. So armed with ripgrep and the Idris2 source tree, I tried searching for that.

This led me to TTImp.ProcessDef. Ah. Going down a TTImp hole is rarely fun. TTImp is an abbreviation for “Type Theory with Implicits” and is the underlying representation of Idris2 code. As far as I understand, this is roughly the “assembly” of Idris – all Idris code gets turned into TTImp. TTImp is relatively simple, while being powerful enough to express everything you need to be able to reason about a complete Idris program. So yeah, digging through that stuff is seldom fun, but sometimes you just have to.

The error message itself was part of a larger structure of local definitions (a massive where-block) and function calls, which traced back to a function called mkRunTime, specifically lines 804-806:

           let clauses = case cov of
                              MissingCases _ => addErrorCase clauses_init
                              _ => clauses_init

This code was checking if there were any non-covered inputs to the given case-block and if so, used the addErrorCase function (one of the local definitions) to traverse the list of defined case pattern clauses until it reached the final one, at which point it appended a MkCrash clause with the “Unhandled input” error. However, as you can see from the code above, there were no log messages being emitted when this was happening, which explained why I wasn’t getting any log output.

Adding some logging involved dealing with Core (the type the Idris compiler uses for state), which is always a bit delicate: there is a lot going on, lots of datatypes in scope, lots of compiler and/or functional programming specific language in use, and lots of functions which manipulate these things in various ways. But I managed to add a new log topic: compile.casetree.missing. Now I could add that in before the addErrorCase call and log when these MkCrash cases were being added.

This initially seemed to do nothing, until I realised that I needed to put the %logging pragmas around the interface implementation itself and then recompile Idris2 to see the effects. Otherwise, the compiler happily skipped over the already-compiled prelude and just compiled the file that was using it. With that fixed: hurray, log output at last!

Getting output only led to more confusion though. The log outputs of the file with a custom mod function and the prelude were effectively identical:

LOG compile.casetree.missing:5: Adding uncovered error for [[a, b]: ($resolved2578 [__]@b[1] [__]@a[0] $resolved2456) = ($resolved55 a[0] b[1])]
LOG compile.casetree:5: Clauses are [[a, b]: ($resolved2578 [__]@b[1] [__]@a[0] $resolved2456) = ($resolved55 a[0] b[1]), [a, b]: ($resolved2578 [__] [__] [__]) = (Builtin.idris_crash [__] "Unhandled input for Issue2950.case block in mod at Issue2950:9:1--9:52")]
LOG compile.casetree:5: simpleCase: Clauses:
  (Issue2950.case block in mod [__]@{pat0::1} [__]@{pat0::0} Prelude.Basics.False) = (prim__mod_Int {pat0::0} {pat0::1})
  (Issue2950.case block in mod [__] [__] [__]) = (Builtin.idris_crash [__] "Unhandled input for Issue2950.case block in mod at Issue2950:9:1--9:52")
LOG compile.casetree.missing:5: Adding uncovered error for [[x, y]: ($resolved1587 [__]@y[1] [__]@x[0] $resolved386) = ($resolved55 x[0] y[1])]
LOG compile.casetree:5: Clauses are [[x, y]: ($resolved1587 [__]@y[1] [__]@x[0] $resolved386) = ($resolved55 x[0] y[1]), [x, y]: ($resolved1587 [__] [__] [__]) = (Builtin.idris_crash [__] "Unhandled input for Prelude.Num.case block in mod at Prelude.Num:132:3--134:40")]
LOG compile.casetree:5: simpleCase: Clauses:
  (Prelude.Num.case block in mod [__]@{pat0::1} [__]@{pat0::0} Prelude.Basics.False) = (prim__mod_Int {pat0::0} {pat0::1})
  (Prelude.Num.case block in mod [__] [__] [__]) = (Builtin.idris_crash [__] "Unhandled input for Prelude.Num.case block in mod at Prelude.Num:132:3--134:40")

Sure, one used a,b and the other x,y, and there was some different numbering of the machine-generated names, but other than that the clauses were exactly the same! I tried following what happened at the end of the mkRunTime stuff, but this only confirmed that the runtime trees were the same:

Runtime tree for Issue2950.case block in mod:
  case {arg:2} : Prelude.Basics.Bool of
    Prelude.Basics.False => (prim__mod_Int {arg:1}[1] {arg:0}[0])
    _ =>
      (Builtin.idris_crash [__] "Unhandled input for Issue2950.case block in mod at Issue2950:9:1--9:52")
Runtime tree for Prelude.Num.case block in mod:
  case {arg:2} : Prelude.Basics.Bool of
    Prelude.Basics.False => (prim__mod_Int {arg:1}[1] {arg:0}[0])
    _ =>
      (Builtin.idris_crash [__] "Unhandled input for Prelude.Num.case block in mod at Prelude.Num:132:3--134:40")

On one hand, this was good, because it meant the TTImp was fine (I knew that the custom mod version gave the expected error message, and the runtime tree for the built-in mod matched that). On the other, this was annoying because it meant the next step was to try to dig through the compiler to figure out how the completely fine TTImp got turned into not-completely-fine Scheme (or JavaScript or C).

Digging through the compiler

Initially, I tried looking in src/Compiler/Scheme/Common.idr. However, when that didn’t bring up anything, my next attempt was to find the source and handling of Clauses datatypes. Unfortunately rg MkClause src/Compiler came up blank. What about searching for Erased instead? Surely there must be a type for erased? Well yes, but there is more than one, depending on which code-gen backend, type of term, and intermediate representation one is referring to. So that was slightly too wide a net. Following that, I tried to find places handling a Crash, which was more manageable, except all the functions seemed sensible and not broken.

Okay, well, what about the clauses variable that the TTImp processing was creating? Where did it get passed to? A function called getPMDef. After more rg-ing, it led me to src/Core/Case/CaseBuilder.idr. That seemed like a promising name! So, open that, find the getPMDef definition, and try to underst-

While I was doing this, I was also occasionally chatting with dunham on the Idris discord, and he had just now accidentally discovered the problem while trying to implement a fix. His initial idea was the same as the one I had: manually define each and every True crash case in Prelude.Num. In order to do this, dunham had to import Builtin. However, after removing the manually defined crashes to try to find the reason for the bug, the problem remained fixed. We spent a brief moment puzzling over whether we’d simply been using an old Idris build, or if there were some stale .ttc files or something, before dunham realised that he had forgotten to remove the import Builtin statement when undoing his changes. And this import statement fixed the issue!

The source of the problem

Normally, when you use a function, the Idris compiler checks if it is in scope and throws an error if it isn’t. That’s kind of a core functionality part of a compiler. However, when handling the TTImp case stuff, we were constructing function calls directly, bypassing all the checks and just going “in this case, generate a function call of name idris_crash with the following argument(s)”. Since Builtin was never imported in Prelude.Num, we were generating a compiled thing which didn’t actually have idris_crash in scope! This then led to the incorrect code generation we were seeing (presumably due to some further compiler passes or inlining or erasure steps).

“How did this not get caught earlier?” you might ask. It turns out that all the other modules in Prelude which need to crash do import Builtin. So this was a case of an easily missable, subtle error, which only appeared if you happened to call the built-in mod or div with 0, which most people and programs who use them take care to avoid.

Also, it turned out the issue had been caught earlier! In January this year (2023), issue #2865 was opened by andorp. It described the issue with mod, as well as the weird error message involving “erased”. So a bit of searching on GitHub after narrowing down the problematic function would’ve saved us a lot of trouble. Well at least I got a fun dive into the source code out of it…

Fixing things

With the problem identified, the first idea was to add a check. We do have the context in scope during mkRunTime, so the idea was to just check whether idris_crash from the namespace Builtin was in scope whenever we needed to compile an incomplete case block, and complain (with a descriptive error message) if it wasn’t.

However, prompted by z_snail, a much better solution ended up being to use prim__crash directly. This, as the name suggests, is a lower-level primitive function, which is helpful because A) it seemed more appropriate to generate low-level crashes in this situation where we were handling semi-compiled Idris; and B) it doesn’t need Builtin – it just gets compiled to blodwen-error-quit, which is defined in the support files required to get Idris going in the first place. In short, it was a better, simpler, and safer option.

This was implemented by dunham, along with a test case to make sure the bug wouldn’t accidentally get reintroduced, in Idris2 pull request #2952. And with that, the problem was fixed!!

Conclusion

These bugs are always a good excuse to learn more about the internal workings of Idris2. Although I did not find the solution myself and instead went down a rabbit hole of compiler calls, I gained some useful knowledge along the way, e.g. where the case-tree stuff lives, and how we handle it. Not only is this great for me personally, but it also allows me to “draw in” more of the Map of the Source Code, which is an immensely useful resource for future bug hunts! And thanks to dunham and z_snail finding the root problem and a solution, I now have a better intuition for what might cause weird bugs, if one should come up again. All around a day well spent.


I hope this was insightful and that you learned something. Thanks for reading! : )

Acknowledgements

  • Stefan Höck (stefan-hoeck) for helping me get more readable code-gen output and teaching me how to easily inspect different compiled outputs.
  • Steve Dunham (dunham) for helping me find the cause of the bug and for implementing the fix.
  • Zoe Stafford (Z-snails/z_snail) for the reminder that prim__crash was a thing.
  • Andor Penzes (andorp) for originally discovering the bug. I should’ve searched the issue tracker once I narrowed down the problem, sorry ^^;;
Thomas Ekström Hansen
Thomas Ekström Hansen
PhD student in Computer Science

My interests include information visualisation, formal methods, and low-level programming.

Next
Previous

Related