For those unfamiliar, the mem2reg (memory to register) pass is the algorithm Noir uses to replace references with the values those references hold. It is an optimization pass that occurs at several points in our SSA pipeline, as various other optimizations simplify the program and references become easier to track.

Now onto the main topic: our current mem2reg algorithm is insufficient. In particular, it is missing:

- Support for references passed into function calls.
- Support for aliased references, e.g. references stored inside arrays.

Without delving into the "why" too much further, the short of it is that we need to expand the mem2reg pass to support these features, and we need to do so cleanly.
Consider each block and reference in isolation. For any given block `bN` and reference `vN`, the value of `vN` at the beginning of `bN` will always be equal to the unification of each value `vN` has at the end of each predecessor block of `bN`.

- Two values `A` and `B` from two blocks unify iff `A = B`. If the value from either predecessor block differs, the result of the unification is "Unknown".
- If `bN` has no predecessors, the starting value of `vN` is also "Unknown". If it helps, you can think of this as unifying an empty set of predecessor blocks, which always results in an "Unknown" value.

So the value of a reference at the start of a block is either Unknown or Known(value). We can represent this cleanly in an enum:
enum ReferenceValue {
Unknown,
Known(ValueId),
}
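Unification then falls out directly. As a minimal sketch (with `ValueId` as a stand-in for the compiler's actual SSA value identifier type):

```rust
// Stand-in for the compiler's actual SSA value identifier type.
type ValueId = u32;

impl ReferenceValue {
    /// Unify the values a reference holds at the end of two predecessor
    /// blocks: the result is Known iff both sides are Known and equal.
    fn unify(self, other: ReferenceValue) -> ReferenceValue {
        match (self, other) {
            (ReferenceValue::Known(a), ReferenceValue::Known(b)) if a == b => {
                ReferenceValue::Known(a)
            }
            _ => ReferenceValue::Unknown,
        }
    }
}
```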
Now that we have the value of a reference at the start of a block, to find its value at the end of the block there are several cases:

- No store to `vN` within `bN` -> the value of `vN` at the end of `bN` equals its value at the beginning.
- A `store value in vN` within `bN` -> the value of `vN` after the store is `Known(stored_value)`.
- A function call within `bN` taking `vN` as an argument -> the value of `vN` after the call is `Unknown`.
At any point while going through the instructions of a block, if we find a `vA = load vN` in which the `ReferenceValue` for `vN` is `Known(vB)`, we can replace the value `vA` with the value `vB`.
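To illustrate these rules, here's a rough sketch of a single block's scan; the `Instruction` shape below is a simplified stand-in for illustration, not the compiler's actual SSA types:

```rust
use std::collections::HashMap;

// Simplified stand-ins for the real SSA instructions.
enum Instruction {
    Store { address: ValueId, value: ValueId },
    Load { result: ValueId, address: ValueId },
    Call { arguments: Vec<ValueId> },
}

/// Scan one block, updating each reference's value and collecting the
/// `load` results that can be replaced with known values.
fn scan_block(
    instructions: &[Instruction],
    values: &mut HashMap<ValueId, ReferenceValue>,
) -> HashMap<ValueId, ValueId> {
    let mut replacements = HashMap::new();
    for instruction in instructions {
        match instruction {
            // A store makes the reference's value Known.
            Instruction::Store { address, value } => {
                values.insert(*address, ReferenceValue::Known(*value));
            }
            // A load of a Known reference means the result can be replaced.
            Instruction::Load { result, address } => {
                if let Some(ReferenceValue::Known(value)) = values.get(address) {
                    replacements.insert(*result, *value);
                }
            }
            // Passing a reference to a function invalidates what we know about it.
            Instruction::Call { arguments } => {
                for argument in arguments {
                    if values.contains_key(argument) {
                        values.insert(*argument, ReferenceValue::Unknown);
                    }
                }
            }
        }
    }
    replacements
}
```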
With these rules we have a relatively simple framework to handle allocations in each block. To expand this to a whole function, we could traverse each block in execution order, remembering the ReferenceValue
of each reference at the end of each block and feeding those into the unified inputs of each subsequent block.
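A sketch of that unification step over predecessors (an empty predecessor set yields `Unknown`, matching the rule above):

```rust
/// The value of a reference at the start of a block: unify the values it
/// holds at the end of every predecessor block.
fn block_start_value(predecessor_end_values: Vec<ReferenceValue>) -> ReferenceValue {
    predecessor_end_values
        .into_iter()
        .reduce(ReferenceValue::unify)
        .unwrap_or(ReferenceValue::Unknown)
}
```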
Compared to our current mem2reg algorithm, this version is fairly similar but provides a simpler base to build upon. Notably, it does not require any logic for finding loops within the program and extends to passing references to function calls cleanly.
Consider a simple looping program:
fn main() {
let mut foo = 0;
for i in 0..3 {
foo += 1;
println(foo);
}
println(foo);
}
And its SSA representation before any optimizations:
b0():
v0 = allocate // let mut foo;
store Field 0 in v0 // foo = 0;
jmp b1(Field 0)
b1(v1: Field): // for i in 0..3 {
v2 = lt v1, Field 3
jmpif v2, then: b2, else: b3
b2():
v3 = load v0
v4 = add v3, Field 1
store v4 in v0 // foo += 1;
v5 = load v0
call println(v5) // println(foo);
v6 = add v1, Field 1
jmp b1(v6)
b3():
v7 = load v0
call println(v7) // println(foo);
return
Let's apply the mem2reg algorithm we have so far to this code:
Block `b0`:

- `b0` has no predecessors, so the starting value of `v0` is `Unknown`.
- There is a `store Field 0 in v0` within, so the value of `v0` by the end of the block is `Field 0`.

Block `b1`:

- Predecessors are `b0` and `b2`. Since we haven't explored `b2` yet, it must occur later in this analysis. So our starting value for `v0` is `Unknown` (after unifying `Known(Field 0)` and `Unknown`).
- There are no stores in `b1`, so the value of `v0` at the end of the block is also `Unknown`.

Block `b2`:

- The only predecessor is `b1`, so our starting value for `v0` is `Unknown`.
- We find `v3 = load v0`, but we do not know the value of `v0`, so it is kept in the code.
- There is a `store v4 in v0`, so the value of `v0` is now `Known(v4)`.
- We find `v5 = load v0` and replace `v5` with `v4`.
- The final value of `v0` is `v4` since there are no later stores.

Block `b3`:

- The only predecessor is `b1`, so the value of `v0` is inherited as `Unknown`.
- We cannot replace the `v7 = load v0` with anything yet.
- The final value of `v0` is `Unknown`.

There weren't many loads to replace in this example, but we can consider how changing the code may affect the result. E.g. what if we added a `store Field 7 in v0` in `b1`? That would change the initial value of `v0` in both blocks `b2` and `b3` to `Known(Field 7)`, so we'd replace both `v3` and `v7` with `Field 7`. It may seem unintuitive at first to replace the load at the end of the loop, thus ignoring all the mutations within the loop, but note that the loop block `b2` always jumps back to `b1`, which only afterwards jumps to `b3` to end the loop. Thus, there would be no way to get to the end of the loop without going through the new `store Field 7 in v0` first.
Hopefully I've managed to convince you, the reader, so far that this new algorithm is simple and sound. Now that that is done, let's break this trust and disprove that assumption:
If handling function calls were all we needed, there would be no reason to change our mem2reg algorithm or write this whole document explaining it.
Let's consider the elephant in the room, aliasing:
fn main(v0: bool) {
let v1 = &mut 0;
let mut v2 = [v1, v1];
if v0 {
*v2[0] = 1;
} else {
*v2[1] = 2;
}
// What is the value of v1?
}
And in SSA:
b0(v0: u1):
v1 = allocate
store Field 0 in v1
v2 = allocate
store [v1, v1] in v2
jmpif v0, then: b1, else: b2
b1():
v3 = load v2
v4 = array_get v3, index 0
store Field 1 in v4
jmp b3()
b2():
v5 = load v2
v6 = array_get v5, index 1
store Field 2 in v6
jmp b3()
b3():
// What is the value of v1?
return
Saving you the trouble of applying the algorithm yourself, the result would be `Field 0`, as if it were not modified in either block (*). This is because we've only been tracking references via their value ids (`v1` here) so far. To properly handle this case we'd need to employ one of two strategies:

- Track the aliases of each reference directly, so we'd know `v1` was aliased by the array `[v1, v1]` which was then stored in `v2`.
- Conservatively fall back to `ReferenceValue::Unknown` for `v1` despite there being no conflicting direct stores to it.

(*) Note that Noir's current mem2reg algorithm does not actually fail on this case yet, but more complex aliasing cases will still fail, e.g. https://github.com/noir-lang/noir/issues/2316.
Unfortunately, there's no single, standard algorithm for optimizing out loads to arbitrary pointers. Instead, there are many different trade-offs to consider along the way. We must also note that identifying all pointer aliases correctly has been proven to be undecidable (Landi, 1992), so we can abandon the notion of always identifying every alias. It can be helpful to get some background information on what options are available though. I've found this resource which seems like a good starter:
I'll be going through each option to slowly build up the pass we need (you can find this list starting on slide 8):
In Noir's SSA, the only compound values that can contain references are arrays of references like `[&mut T]`, so this simplifies things somewhat. We also do not need to worry about structures at all since they are not present in our SSA form. With these observations we can start to build out the algorithm a bit more.
Most of the points above are already covered by the existing algorithm structure that was outlined. The most notable addition is the first point on tracking aliases directly. We can add some additional rules to track alias sets (a sketch of a possible representation appears below):

- When a reference `vN = allocate` is first created, it is initially aliased only to itself: `{vN}`.
- On a `store v1 in v2` where `v1` is also a reference, we add `*v2` to `v1`'s alias set: `{v1, *v2}`.
- On an array creation `vA = [..., vN, ...]`, we add `vA[i]` to `vN`'s alias set: `{vN, vA[i]}`.
- On an array update `vA = array_set vB, index i, value vN`, we likewise add `vA[i]`: `{vN, vA[i]}`.
- If the index is not known at compile time, we record an alias at an unknown index: `{vN, vA[?]}`.
- We must also handle the slice intrinsics `slice_push_back`, `slice_insert`, `slice_remove`, etc.

Luckily, Noir's core semantics have a fairly small surface area, so references and arrays are our only worries. We don't have to worry about pointer arithmetic, pointer casting, long jumps, etc.
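As mentioned above, here's a sketch of one possible representation for these alias sets (the names are illustrative, not the compiler's actual types):

```rust
use std::collections::HashSet;

/// One possible shape for the entries of an alias set.
#[derive(Clone, PartialEq, Eq, Hash)]
enum AliasTarget {
    /// The reference itself, e.g. `v1`.
    Reference(ValueId),
    /// A reference stored behind another reference, e.g. `*v2`.
    Dereference(ValueId),
    /// A reference stored at a known array index, e.g. `vA[i]`.
    ArrayElement(ValueId, usize),
    /// A reference stored at an index unknown at compile time, e.g. `vA[?]`.
    UnknownArrayElement(ValueId),
}

type AliasSet = HashSet<AliasTarget>;

/// Rule 1: a fresh allocation `vN = allocate` aliases only itself.
fn new_allocation(reference: ValueId) -> AliasSet {
    HashSet::from([AliasTarget::Reference(reference)])
}
```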
Taking the example with aliasing from earlier:
b0(v0: u1):
v1 = allocate
store Field 0 in v1
v2 = allocate
store [v1, v1] in v2
jmpif v0, then: b1, else: b2
b1():
v3 = load v2
v4 = array_get v3, index 0
store Field 1 in v4
jmp b3()
b2():
v5 = load v2
v6 = array_get v5, index 1
store Field 2 in v6
jmp b3()
b3():
// What is the value of v1?
return
Since we're now storing alias sets in addition to the `ReferenceValue` enum from before, I'll write this as the tuple of `(alias set, reference value)`. Now, let's apply our algorithm:

Block `b0`:

- `v1` is `({v1}, Known(0))` initially.
- `v2` is `({v2}, Known([v1, v1]))`.
- Since the stored array `[v1, v1]` contains `v1`, we modify `v1`'s entry to `({v1, *v2[0], *v2[1]}, Known(0))`.

Block `b1`:

- The loaded array `v3` now gets added to `v1`'s alias set: `{v1, *v2[0], *v2[1], v3[0], v3[1]}`.
- The same happens for the array_get result `v4`: `{v1, *v2[0], *v2[1], v3[0], v3[1], v4}`.
- On the `store Field 1 in v4`, since we know `v4` is only in a single alias set (it can be in multiple if we are unsure which reference it is aliased to), we can actually change our `ReferenceValue` here as well: `({v1, ...}, Known(1))`. If `v4` had aliased several references, we would have had to change all of those references to `ReferenceValue::Unknown`.

Block `b2`:

- The same as `b1` but using `v5`, `v6`, and storing `Field 2` instead.

Block `b3`:

- `b3` has two predecessors, so we merge the reference values:
  - `({v1, *v2[0], *v2[1], v3[0], v3[1], v4}, Known(1))` from `b1`
  - `({v1, *v2[0], *v2[1], v5[0], v5[1], v6}, Known(2))` from `b2`
  - `({v1, *v2[0], *v2[1], v3[0], v3[1], v4, v5[0], v5[1], v6}, Unknown)` is the result
- So we cannot optimize out any `load`s of `v1` in `b3`!

This should be a workable algorithm, but it could use some fine tuning. There are some items we could improve upon, for example:
… And that's it! As long as there are no major objections to this approach, we will likely be seeing a greatly changed mem2reg in the near future. Also, I've done absolutely no proofreading, so let me know what you think!
At the time of writing, the algorithm above has been in use for quite a while and has generally worked well. Over time, however, we began focusing more on code size and on optimizing unconstrained code. Compared to constrained code, handling different control flow patterns well matters much more for unconstrained code, since they won't be flattened into a single block later on. The mem2reg pass as it exists above doesn't optimize looping in particular as well as it could. Consider the program:
unconstrained fn main() {
let mut foo = 7;
for i in 0 .. 10 {}
assert_eq(foo, 7);
}
We can see from the above that although it is mutable, foo
is never modified within the loop (or anywhere), so we should be able to optimize the assert to assert_eq(7, 7)
, then remove it entirely. When converting this program to SSA we get:
brillig(inline) fn main f0 {
b0():
v1 = allocate -> &mut Field
store Field 7 at v1
jmp b1(u32 0)
b1(v0: u32):
v5 = lt v0, u32 10
jmpif v5 then: b3, else: b2
b3():
v9 = add v0, u32 1
jmp b1(v9)
b2():
v6 = load v1 -> Field
constrain v6 == Field 7
return
}
Which we can then run mem2reg on to get… the same SSA back. Why is this happening? It's due to how references are tracked across blocks. Going step by step, mem2reg will visit:

- `b0`: We find that `*v1 = 7`, ok.
- `b1`:
  - From `b0`, we have `*v1 = 7`.
  - We haven't visited `b3` yet, so `v1` is `Unknown` there, resulting in it being unknown in `b1` as well.
- `b3`: Only predecessor is `b1`, so it inherits `v1` as unknown.
- `b2`: Only predecessor is `b1`, so it inherits `v1` as unknown.
  - Since `v1` is unknown, we can't optimize out the load in this block.

To better handle this case, we need a way to distinguish between Unknown (due to program logic that can't be known at compile-time) and Unknown (because a predecessor block hasn't been visited yet). Notably, we can't just introduce a `NotYetKnown` value for this case and allow `unify(Known(v), NotYetKnown) = Known(v)`, since that would break mutations in loops. Consider the code:
brillig(inline) fn main f0 {
b0():
v1 = allocate -> &mut Field
store Field 7 at v1
jmp b1(u32 0)
b1(v0: u32):
v5 = lt v0, u32 10
jmpif v5 then: b3, else: b2
b3():
v9 = add v0, u32 1
v10 = load v1 -> Field <-- new instruction
constrain v10 == Field 7 <-- new instruction, `assert_eq(foo, 7)` in the loop
store v9 at v1 <-- new instruction, `foo = i;` in the loop
jmp b1(v9)
b2():
v6 = load v1 -> Field
constrain v6 == Field 7
return
}
If we had `NotYetKnown` with the unification change above, we'd perform:

- `b0`: We know `*v1 = 7`.
- `b1`: Unify `v1` from predecessors:
  - `*v1 = 7` in `b0`.
  - `v1` is `NotYetKnown` in `b3`.
  - From `unify(Known(7), NotYetKnown)`, we know `*v1 = 7` in `b1`.
- `b3`: We inherit `*v1 = 7` from `b1`.
  - The load is replaced: `v10 = Field 7` (!)
  - The constrain becomes `constrain 7 == 7` (!)
  - There is a `store v9 at v1`, so we now know `*v1 = v9`.
- `b2`: We inherit `*v1 = 7` from `b1` (!)
This is incorrect, however. We're mutating `foo = i;`
within the loop so we'd expect to fail the assert_eq(foo, 7);
on the second iteration of the loop since foo
would be 0
then, but the constrain was optimized out, so this is not what happens. Instead the program executes successfully when it should have failed. Another sign we can see of this failure is b2
after the loop inherits the value of v1
from before the loop, completely ignoring any mutations that happened within the loop. Finally, a third sign of failure is if we go back to b1
and check its predecessors again we find that they'd now be Known(7)
and Known(v9)
, which should unify to Unknown
, not 7
. All of these issues resulted from the unification change with NotYetKnown
, so this is not a solution.
We can modify the above approach a bit by changing the unification rule to unify(Known(v), NotYetKnown) = NotYetKnown
instead. This is identical to the rule for Unknown
and we'll generally treat these values as unknown as well. We'll also have unify(NotYetKnown, Unknown) = Unknown
.
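A sketch of the extended enum with these rules (again with `ValueId` as a stand-in):

```rust
#[derive(Copy, Clone, PartialEq, Eq)]
enum ReferenceValue {
    Unknown,
    NotYetKnown,
    Known(ValueId),
}

impl ReferenceValue {
    fn unify(self, other: ReferenceValue) -> ReferenceValue {
        use ReferenceValue::*;
        match (self, other) {
            // Agreeing known values stay known.
            (Known(a), Known(b)) if a == b => Known(a),
            // Conflicting known values are unknown.
            (Known(_), Known(_)) => Unknown,
            // A genuinely unknown side poisons the result, including
            // unify(NotYetKnown, Unknown) = Unknown.
            (Unknown, _) | (_, Unknown) => Unknown,
            // Otherwise a side is NotYetKnown and the result stays provisional:
            // unify(Known(v), NotYetKnown) = NotYetKnown.
            _ => NotYetKnown,
        }
    }
}
```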
To use these new `NotYetKnown` values, we'll explicitly handle loops within mem2reg. This can be done by finding the strongly-connected components (SCCs) in the control flow graph of the program. Any SCC larger than a single block is a loop consisting of at least a header block (`b1` above) and a body block (`b3` above), although the body will often be larger than a single block.
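For illustration, SCCs can be found with a standard algorithm like Tarjan's; here's a sketch using the petgraph crate on a toy CFG shaped like the example above (the real pass may well implement this directly instead):

```rust
use petgraph::algo::tarjan_scc;
use petgraph::graph::DiGraph;

fn main() {
    // A toy CFG mirroring the example: b0 -> b1, b1 -> {b3, b2}, b3 -> b1.
    let mut cfg = DiGraph::<&str, ()>::new();
    let b0 = cfg.add_node("b0");
    let b1 = cfg.add_node("b1");
    let b2 = cfg.add_node("b2");
    let b3 = cfg.add_node("b3");
    cfg.extend_with_edges([(b0, b1), (b1, b3), (b1, b2), (b3, b1)]);

    // Any SCC with more than one node is a loop: a header plus its body.
    for scc in tarjan_scc(&cfg) {
        if scc.len() > 1 {
            let blocks: Vec<_> = scc.iter().map(|node| cfg[*node]).collect();
            println!("loop found: {blocks:?}"); // prints ["b3", "b1"] (in some order)
        }
    }
}
```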
After identifying where loops are, change the iteration order of blocks in mem2reg to be over SCCs in control-flow order instead of over blocks in control-flow order. For blocks not in loops this will be identical to the previous behavior, but for looping blocks this means we'll have a set of blocks to iterate over at the same time. For looping blocks we will:

- Treat unvisited predecessors as `NotYetKnown` instead of `Unknown`.
  - For `b1` above, this would be `*v1 = unify(Known(7), NotYetKnown) = NotYetKnown`.
- If the reference is mutated within the loop, we'd eventually perform `unify(Known(v1), Known(v2)) = Unknown`. Alternatively, the value could also become `Unknown` within the loop, at which point it'd still unify to `Unknown`.
- After every block in the loop has been visited once, recheck the loop header and see whether the value coming from the loop's back edge is still `NotYetKnown`. If this is the case, instead of unifying, the known value of the reference becomes the value from the pre-header block.
  - In the mutation example we'd perform `unify(Known(7), Known(v9)) = Unknown`, which correctly identifies the variable is mutated within the loop and can't be optimized out.
  - In the original example we'd perform `unify(Known(7), NotYetKnown)`, and at this point we could manually override the result to `Known(7)`.
- If any value changed from `NotYetKnown` to `Known`, we can iterate over the loop blocks once more at this point to potentially optimize them further.

Going back to the original example:
unconstrained fn main() {
let mut foo = 7;
for i in 0 .. 10 {}
assert_eq(foo, 7);
}
With the SSA:
brillig(inline) fn main f0 {
b0():
v1 = allocate -> &mut Field
store Field 7 at v1
jmp b1(u32 0)
b1(v0: u32):
v5 = lt v0, u32 10
jmpif v5 then: b3, else: b2
b3():
v9 = add v0, u32 1
jmp b1(v9)
b2():
v6 = load v1 -> Field
constrain v6 == Field 7
return
}
This loop-aware algorithm would visit:

1. `b0`: find that `*v1 = 7`.
2. `b1`: `*v1 = unify(Known(7), NotYetKnown) = NotYetKnown` (unifying the values of `v1` from predecessors `b0` & `b3`).
3. `b3`: inherit `*v1 = NotYetKnown`.
4. Check `b1` again: `*v1` is still `NotYetKnown` on the back edge, so we ignore `b3` and set it to `Known(7)` from `b0` instead. Because a value changed from `NotYetKnown` to `Known`, we iterate the loop blocks again:
5. Visit `b1` again, looking for further optimizations now that we know `v1` (there are none).
6. Visit `b3` again, looking for further optimizations now that we know `v1` (there are none).
   - `*v1` in `b3` is overwritten to `Known(7)` since we ran this block again.
7. `b2`: `*v1 = unify(Known(7), Known(7)) = Known(7)`, so the load in `b2` can finally be optimized out.
This algorithm can now optimize this looping case successfully. If we look at the mutation example in the loop, the main change is in step 4: when we recheck the unify of `b1` we'd have `*v1 = unify(Known(7), Known(v9)) = Unknown`. Since no value changed from `NotYetKnown` to `Known`, we don't iterate the loop blocks again. After the loop in `b2`, the value of `*v1` would still be `Unknown`, so we correctly do not optimize out the constrain in the mutation case.
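To make the header rule concrete, here's a sketch of how a loop header's value could be resolved after the first pass over the loop (a hypothetical helper, building on the `unify` sketch above):

```rust
/// After one pass over a loop's blocks, resolve a reference's value at the
/// loop header. If every back edge is still NotYetKnown, the loop never
/// mutated the reference, so take the pre-header's value; otherwise unify.
fn resolve_loop_header(
    pre_header: ReferenceValue,
    back_edge_values: &[ReferenceValue],
) -> ReferenceValue {
    let still_unvisited = back_edge_values
        .iter()
        .all(|value| matches!(value, ReferenceValue::NotYetKnown));
    if still_unvisited {
        pre_header
    } else {
        back_edge_values
            .iter()
            .copied()
            .fold(pre_header, ReferenceValue::unify)
    }
}
```

With the first example this yields `Known(7)`; in the mutation example the back edge carries `Known(v9)`, so the fold gives `unify(Known(7), Known(v9)) = Unknown`.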
Let's see how this handles nested loops:
unconstrained fn main() {
let mut foo = 7;
for i in 0 .. 2 {
for j in 0 .. 3 {}
}
assert_eq(foo, 7);
}
Variables defined inside a loop:
unconstrained fn main() {
for _ in 0 .. 2 {
let mut foo = 7;
for j in 0 .. 3 {}
assert_eq(foo, 7);
}
}
Another case, with branching and mutation inside the loop:

let mut leading_zeroes = 0;
let mut stop = false;
let bits: [u1; 64] = C.to_be_bits();
for i in 0..64 {
if (bits[i] == 0) & !stop {
leading_zeroes += 1;
} else {
stop = true;
}
}
assert_eq(leading_zeroes, 30);
In SSA:

acir(inline) fn main f0 {
b0():
v1 = allocate -> &mut Field
store Field 0 at v1
v3 = allocate -> &mut u1
store u1 0 at v3
v6 = make_array [u1 0, u1 0, u1 0, u1 0, u1 0, u1 0, u1 0, u1 0, u1 0, u1 0, u1 0, u1 0, u1 0, u1 0, u1 0, u1 0, u1 0, u1 0, u1 0, u1 0, u1 0, u1 0,
u1 0, u1 0, u1 0, u1 0, u1 0, u1 0, u1 0, u1 0, u1 1, u1 0, u1 0, u1 0, u1 0, u1 0, u1 0, u1 0, u1 0, u1 0, u1 0, u1 0, u1 0, u1 0, u1 0, u1 0,
u1 0, u1 0, u1 0, u1 0, u1 0, u1 0, u1 0, u1 0, u1 0, u1 0, u1 0, u1 0, u1 0, u1 0, u1 0, u1 0, u1 1, u1 0] : [u1; 64]
inc_rc v6
jmp b1(u32 0)
b1(v0: u32):
v9 = lt v0, u32 64
jmpif v9 then: b3, else: b2
b3():
v13 = array_get v6, index v0 -> u1
v14 = not v13
v15 = load v3 -> u1
v16 = not v15
v17 = mul v14, v16
jmpif v17 then: b5, else: b4
b5():
v18 = load v1 -> Field
v20 = add v18, Field 1
store v20 at v1
jmp b6()
b6():
v22 = add v0, u32 1
jmp b1(v22)
b4():
store u1 1 at v3
jmp b6()
b2():
v10 = load v1 -> Field
v12 = eq v10, Field 30
constrain v10 == Field 30
return
}
Tracing the loop-aware pass:

- `b0`: `*v1 = 0`, `*v3 = false`
- Loop SCC: `{b1, b3, b4, b5, b6}`
- `b1` (preds `b0`, `b6`):
  - `*v1 = 0 U NYK = NYK`
  - `*v3 = false U NYK = NYK`
- `b3`: inherit from `b1`
- `b4`: inherit from `b3`, then `*v3 = true`
- `b5`: inherit from `b3`, then `*v1 = v20`
- `b6` (preds `b4`, `b5`):
  - `*v1 = NYK U v20 = NYK`
  - `*v3 = true U NYK = NYK`
- Back at `b1`:
  - `*v1 = 0 U NYK = NYK`
  - `*v3 = false U NYK = NYK`
  - (treating the leftover `NYK` as unknown would instead give `v U NYK = Unknown`)
- Recheck `b1`'s unifications with the override rule:
  - `*v1 = 0 U NYK = NYK` (override from `Unknown`)
  - `*v3 = false U NYK = NYK` (override from `Unknown`)

Now this is breaking:
fn main() {
let mut acc: Field = 0;
let array = [[1, 2], [3, 4]];
for i in 0..2/*57*/ {
for j in 0..2/*57*/ {
acc += array[i][j];
}
}
assert(acc != 0);
}
In SSA:

acir(inline) fn main f0 {
b0():
v2 = allocate -> &mut Field
store Field 0 at v2
v6 = make_array [Field 1, Field 2] : [Field; 2]
v9 = make_array [Field 3, Field 4] : [Field; 2]
inc_rc v6
inc_rc v9
v10 = make_array [v6, v9] : [[Field; 2]; 2]
inc_rc v10
jmp b1(u32 0)
b1(v0: u32):
v13 = lt v0, u32 2
jmpif v13 then: b3, else: b2
b3():
jmp b4(u32 0)
b4(v1: u32):
v18 = lt v1, u32 2
jmpif v18 then: b6, else: b5
b6():
v21 = load v2 -> Field
v22 = array_get v10, index v0 -> [Field; 2]
inc_rc v22
v23 = array_get v22, index v1 -> Field
v24 = add v21, v23
store v24 at v2
v25 = add v1, u32 1
jmp b4(v25)
b5():
v20 = add v0, u32 1
jmp b1(v20)
b2():
v14 = load v2 -> Field
v15 = eq v14, Field 0
v16 = not v15
constrain v15 == u1 0
return
}
Tracing with a single flat SCC:

- `b0`: `*v2 = 0`
- Loop SCC: `{b1, b3, b4, b5, b6}`
- `b1`: `*v2 = 0 U NYK = NYK` (overridden)
- `b4`: `*v2 = NYK U NYK = NYK`
- `b6`: `*v2 = v24`
- `*v2 = 0 U NYK` (!) analyzed in wrong order

If the loop check were recursive:

- `b0`: `*v2 = 0`
- Outer loop SCC: `{b1, b3, b4, b5, b6}`
- `b1`: `*v2 = 0 U NYK = NYK` (overridden)
- Inner loop SCC: `{b4, b6}`
- `b4`: `*v2 = NYK U NYK = NYK`
- `b6`: `*v2 = v24`
- Recheck `b4`: `*v2 = NYK U v24 = Unknown`
- Recheck `b1`: `*v2 = 0 U Unknown = Unknown`, so the load in `b2` is correctly kept.
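Finally, a sketch of what the recursive loop check could look like, building on the petgraph example from earlier: after handling the outer SCC's header, remove it and re-run SCC detection to uncover inner loops like `{b4, b6}` (illustrative only; `remove_node` shifts petgraph indices, so a real implementation would key blocks by id instead):

```rust
use petgraph::algo::tarjan_scc;
use petgraph::graph::{DiGraph, NodeIndex};

/// Find loops nested inside an outer loop by deleting the outer header from a
/// copy of the CFG and re-running SCC detection on what remains.
fn inner_loops<'a>(cfg: &DiGraph<&'a str, ()>, outer_header: NodeIndex) -> Vec<Vec<&'a str>> {
    let mut pruned = cfg.clone();
    pruned.remove_node(outer_header);
    tarjan_scc(&pruned)
        .into_iter()
        .filter(|scc| scc.len() > 1) // single blocks are not loops
        .map(|scc| scc.into_iter().map(|node| pruned[node]).collect())
        .collect()
}
```

For the nested example, removing the outer header `b1` from `{b1, b3, b4, b5, b6}` leaves `{b4, b6}` as the only multi-block SCC, matching the trace above.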