Program obfuscation via local mixing

Let's assume that a reversible circuit with

n^{*}

wires and

m^{*}

gates permutes the input such that the output permutation is indistinguishable from a strong pseudo-random permutation. The authors use this property of reversible circuits to build an algorithm that satisfies ROI (ROI is also program obfuscation and is similar to iO) for all reversible circuits.

Reversible circuits

Reversible circuits were first explored to express computation such that it can be reveresed. For example, a + b = c is not a reversible operation. Because it is not possible to derive a, b from c. However, if the circuit outputs b as an auxilary value, the computation becomes reversible. Reversible circuits apparently are also natural fit for quantum circuits (a quick search for "reversible circuits and quantum computation" will produce many useful resources). The other property that reversible circuits have, and the one we're interested in, is that if you chain enough reversible gates the reversible circuit can be called a strong pesudorandom permutation. In others words, the reversible circuits permutes the input bits such that the output is indistinguishable from the truly random permutation to an adversary with polynomial resources.

A reversible circuit is expressed as collection of

n

parallel wires and

m

gates and, usually, the computation flow is from left to right. Each gate mutates at least 1 wire based on signals of other wires. There're several varieties of gates but we're only interested in Toffoli gate.

Toffoli gate is defined by 2 control wires, control function, and a target wire. The gate applies the control function on the control wires and sets the target wire as XOR of output of control function by itself. For example, if a, b are the control wires,

ϕ

is the control function, and c is the target wire. Then the gate computes:

c = ϕ (a, b) \oplus c

Observe that with 2-bit controls there can only be

2^{2^{4}} = 16

control functions. We refer to these gates as the "base gates". A base gate is defined over

n

wires but it only touches 3 wires (2 controls and 1 target) called active wires. Another observation to make is any base gate is an inverse of itself and there exist a identity base gate with control function

ϕ (a, b) = 0

Psuedorandomness of reversible circuits and t-wise independence

Gow96 conjectured that reversible circuits with sufficiently many gates where each gate is a 3-bit permutation is strong-psuedo random permutation. Gowers proved that reversible circuit with

m = O (n^{3} t^{3} \log ϵ^{- 1})

ϵ

close to t-wise independence. Note that Gowers's result only applies to any 3-bit permutation and not just the base gates.

Function

F (x)

is called

ϵ

close to t-wise independent if the distance between following two tuples is

\leq ϵ

$F (x_{1}), F (x_{2}), . . . F (x_{t})$
$y_{1}, y_{2}, . . ., y_{t}$

where

y_{i}

is sampled from truly random permutation distribution. The distance function is defined as the function

d (p, q)

in BH06 on page 3.

HMMR05, HB05 later improved the bound to

m = O (n^{3} t^{2} + n^{2} t \log ϵ^{- 1})

and worked only with the base gates. They also conjectured that 4-wise independece suffices for cyrptographic grade pseudomness (I suppose there're attacks with 3-wise independence, which makes one think 4-wise is plausibly secure. But there's no elaboration on the conjecture in the paper).

Note: There're other recent papers that improve the bound and provide proofs. For example, this and this.

Random output indistinguishibility (ROI) obfuscators

Any function

O

is ROI with inner stretch function

ξ

for input ensemble of circuits

C_{n_{κ}, m_{κ}}

if it satisfies the following property for any reversible circuit

C \in C_{n_{κ}, m_{κ}}

O (C) \approx_{c} π (\hat{C}) : \hat{C} \leftarrow E_{C, ξ (κ, n_{κ}, m_{κ})}

where

E_{C, m}

denotes set of all circuits that compute permutation

C

with

m

gates.

Definition of ROI is different from iO. But they provide similar functionality. ROI, like iO, requires the obfuscator to output circuits that are indistinguishable from a random circuit that computes the same function (/permutation) as the input circiut

C

Separable ROI obfuscators

A ROI obfuscator is

m_{κ}

left (right resp.) separable if

the computational complexity of
$m_{κ}$ prefix (suffix resp.) of the obfuscated circuit is low (ie
$C C (C_{[0, m^{κ}]}) < m_{κ} / 2$ )
Post processing algorithm is of the form
$π (C) = π_{1} (C_{[1, m_{κ}]}) | π_{2} (C_{[m_{κ}, *]})$ (resp.
$π (C) = π_{1} (C_{[1, - m_{κ}]}) | π_{2} (C_{[- m_{κ}, *]})$ )

Observer that an obfuscator

O

m_{κ}

-right separable if

O^{†}

(

O^{†} = (O (C^{†}))^{†}

) is

m_{κ}

left separable.

An obfuscator

O

is called

m_{κ}

separable obfuscator if it is both

m_{κ}

right and left separable.

Random input & output (RIO) obfuscators

RIO is built from function(s) that satisfy its two properties. It isn't necessary for the same function to satisfy both properties because the two properties are used at different places. However, it is important to highlight here that ROI is built from RIO based on the new assumptions authors introduce. In particular, assumption 8, which covers assumptions 4&7.

Property 1:

Any obfuscator function

O

statisfies property 1 on input ensemble

C_{n_{κ}, m_{κ}}

with post processing function

π

and inner stretch

ξ

if the following holds:

(C_{1}, C_{2}) \approx_{c} O (C_{1}^{'}, C_{2}^{'})

where

(C_{1}, C_{2}) \leftarrow O (C); C \leftarrow_{R} C_{n_{κ}, m_{κ}}

and

(C_{1}^{'}, C_{2}^{'}) = (π (C_{1}^{'}), π (C_{2}^{'})) \leftarrow_{R} E_{P_{C^{'}}, ξ (κ, n_{κ}, m_{κ})}; C^{'} \leftarrow_{R} C_{n_{κ}, m_{κ}}

Property 1 says that running obfuscator function

O

twice on same circuit

C

must output obfuscated circuits that are not co-relatable.

Property 2:

Any obfuscator function

O

satisifes property 2 on input ensemble

C_{n_{κ}, 2 m_{κ}}

for sufficiently large leeway

λ

and fixed circuit

Z_{1}, Z_{2} \in C_{n_{κ}, m_{κ}}

if the following holds:

(O (C), {\hat{C}}_{1}, {\hat{C}}_{2}) \approx_{c} (π (C^{'}), {\hat{C^{'}}}_{1}, {\hat{C^{'}}}_{2})

where

C \leftarrow_{R} C_{n_{κ}, 2 m_{κ}}; {\hat{C}}_{1} \leftarrow_{R} E_{P_{C_{[1, m_{κ}]} | Z_{1}}, m_{κ} λ}; {\hat{C}}_{2} \leftarrow_{R} E_{P_{Z_{2} | C_{[m_{κ}, *]}}, m_{κ} λ}

and

C^{'} \leftarrow_{R} C_{n_{κ}, 2 m_{κ}}; {\hat{C^{'}}}_{1} \leftarrow_{R} E_{P_{C_{[1, m_{κ}]}^{'} | Z_{1}}, m_{κ} λ}; {\hat{C}}_{2} \leftarrow_{R} E_{P_{Z_{2} | C_{[m_{κ}, *]}^{'}}, m_{κ} λ}

Property 2 says that obfuscation of the circuit must not be co-relatable with original circuit even when given oracle access to

m_{κ}

-gate prefix of the original circuit.

Assumptions

Assumption 4 states that there exist

n^{*}, m^{*}

such that a random circuit in ensemble

C_{n^{*}, m^{*}}

is strong pseudorandom permutation (SPRP).

Assumption 5 extends assumption 4. For any random circuit

Q \leftarrow_{R} C_{n^{*}, m}

, circuit

C \leftarrow_{R} E_{P_{Q}, m_{κ}}

with fixed permutation

P_{Q}

where

m_{κ} \geq m^{*} m

, and

m^{*} \leq l_{κ} \leq (m - 1) m^{*}

following two distributions are indistinguishable.

Oracle access to
$l_{k}$ -gate prefix of
$C$ and the remainder of
$C$
Oracle access to
$Q_{1} | C^{'}$ and
$C^{' †} | Q_{2}$ , where
$Q = Q_{1} | Q_{2}$ , and
$C$ ' is a random circuit sampled from
$C_{n^{*}, m^{*}}$ .

Assumption 5 says that oracle access to sufficiently long prefix, with at-least

m^{*}

gates (recall

m^{*}

satisfies assumption 4), of circuit

C

and the remainder where

C

is functionally equivalent to

Q

is indistinguishable from two SPRPs that jointly compute

Q

Assumption 7 is relevant for theorem 16 and is the main assumption, along with assumption 8 (assumption 8 is an extension of assumption 7). It says even if adversary is given the random circuits, either tuple

(C_{0, l_{κ}}, C_{l_{κ,}})

(Q_{1} | C^{'}, C^{^{'} †} | Q_{2})

, something like assumption 5 still holds. First observe that now any tuple given is clearly distinguishible from a random tuple. Nonethless, authors assume that the sufficient inner stretch should erase any information, other than functionality, of

Q

Define

Q = H_{1} | H_{2}

where

H_{i} \in C_{n^{*}, m_{i}}

and

l_{i, κ} = m_{i} m^{#}

. Assumption 7 states that there exists

m^{#}

for which the following two distributions are indisginuishable

${\hat{C}}_{1} | {\hat{C}}_{2}$ where
${\hat{C}}_{1}$ is a random circuit with
$l_{1, κ}$ gates that computes permutation
$P_{H_{1} | C}$ and
${\hat{C}}_{2}$ is a random circuit with
$l_{2, κ}$ gates that computes permutation
$P_{C^{†} | H_{2}}$ .
A random circuit
$C$ that computes permutation
$Q$ with
$l_{1, κ} + l_{2, κ}$ gates.

Assumption 8 combines assumption 4&7 and states that there exist

n^{*}, m^{*}

for which assumption 4 holds. Additionally there exist

m^{#}

that satisfies assumption 7 with respect to SPRP of assumption 4.

ROI for all reversible circuits from RIO

Theorem 16

Given functions

O_{1}

π

O_{2}

such that:

$O_{1}$ is an obfuscator that satisfies property 1 of RIO with input ensemble
$C_{n^{*}, m^{*}}$ , inner stretch function
$ξ (κ, n^{*}, m^{*}) \leq m^{#}$ where
$n^{*}, m^{*}, m^{#}$ statisfy assumption 8, and post processing algorithm
$π$ .
$O_{2}$ is an obfuscator that statisfies property 2 of RIO with input ensemble
$π (C) | π^{†} (C)$ where
$C \leftarrow C_{n^{*}, m^{*}}$ .

we can build obfuscator function that satisfies ROI for all reversible circuits.

Let input circuit be a reversible circuit with each gate in the set of base gates. Theorem 16 is proven by first constructing Random identity generator (RIG). RIG is then used to construct a Gate obfuscator (GO) for any base gate. GO is used to obfuscate each gate in the input circuit. After which obfuscated circuits of the gates are concatenated together using

O_{2}

. The resulting obfuscated circuit statisfies ROI.

Random identity generator (RIG)

Random identity generator (RIG) is an

m^{#}

-separeable ROI obfuscator for identity permutation with inner stretch

2 m^{#}

. It is used to (efficiently) sample obfuscated circuit that computes identity permutation on

n^{*}

input wires.

Sampling a circuit that computes identity permutation with

2 m^{#}

gates is not efficient in practice. RIG uses

O_{1}

and assumption 8 to sample such a circuit efficeintly in the following way:

Sample
$C \leftarrow C_{n^{*}, m^{*}}$
$C^{'} = O (C), C^{″} = O (C)$
Output
$C^{'} | (C^{″})^{†}$

It is clear that circuit

C^{'} | (C^{″})^{†}

computes an identity permutation and has inner stretch

2 m^{#}

. Step 2 justifies why

O_{1}

should satisfy property 1 of RIO. That is

C^{'}

and

C^{″}

must not be co-relteable even though they are obfuscation of the same input circuit

C

Assumption 8 is used to prove that (informally)

C^{'} | (C^{″})^{†}

is indistinguishable from a random circuit

I

with

2 m^{#}

gates that computes the identity permutation.

Gate obfuscator (GO)

Gate obfuscator (GO) is a

m^{#}

-separeable ROI obfuscator with inner stretch

2 m^{#}

for a single base gate

β

. It is built using RIG.

To obfuscate a base gate

β

Repeatedly sample RIG until it outputs a circuit
$C$ of which first gate is the base gate
$β$
Remove the first gate from
$C$ and insert in its place the identity base gate

The resulting cricuit

C

is an obfuscated circuit for base gate

β

Step 1 of GO needs to be efficient and, with the following modifications, indeed it is. Observe that circular rotation of a reversible circuit that computes identity permutation, still computes the identity permutation. Hence, modify the algorithm to find the first gate that has same control function as the base gate

β

and circularly rotate the circuit s.t. the found gate is the first gate. If base gate

β

has control wires

ω_{0}, ω_{1}

and target wire

ω_{2}

and the first gate has control wires

ω_{0}^{'}, ω_{1}^{'}

, and target wire

ω_{2}^{'}

swap wire

ω_{0}^{'}

with

ω_{0}

ω_{1}^{'}

with

ω_{1}

, and

ω_{2}^{'}

with

ω_{2}

C

and output the resulting circuit.

Soldering obfuscated circuits

After obfuscating each gate in input circuit

C

using GO we solder them together into a final obfuscated circuit using "append and solder" operation.

To solder two different obfuscated circuit outputs of GO,

Γ_{1}

and

Γ_{2}

. Take the

m^{#}

-gate suffix of

Γ_{1}

and

m^{#}

-gate prefix of

Γ_{2}

and feed their concatenation to

O_{2}

. Then output the following,

Γ_{1, [1, m^{#}]} | O_{2} (Γ_{1, (m^{#}, *]} | Γ_{2, [1, m^{#}]}) | Γ_{2, (m^{#}, *]}

We feed

Γ_{1, (m^{#}, *]} | Γ_{2, [1, m^{#}]}

O_{2}

which explains why

O_{2}

must have input ensemble as

π (C_{1}) | π (C_{2})^{†}

where

C_{1}, C_{2} \leftarrow_{R} C_{n^{*}, m^{*}}

. It's because GO is

m^{#}

-separable and, hence,

m^{#}

-gate suffix of

Γ_{1}

and

m^{#}

-gate prefix of

Γ_{2}

where the first is output of

(O_{1})^{†}

and the second is output of

O_{1}

are post-processed versions of random circuits sampled from

C_{n^{*}, m^{*}}

Assuming

O_{1}

is secure, sub-circuits

Γ_{1, [1, - m^{#}]}

(resp.

Γ_{2, (- m^{#}, *]}

) are oracle accesses to

m^{#}

-gate prefix (resp. inverse of

m^{#}

-gate suffix) of random identity circuit sampled by GO for

Γ_{1}

(respectively

Γ_{2}

). Hence,

O_{2}

must satisfy property 2 for RIO.

Summary of obfuscation

Let

C

be input reversible circuits with gates

γ_{1} | γ_{2} | . . . | γ_{m}

For each gate
$γ_{i}$ sample
$Γ_{i} \leftarrow G O (γ_{i})$
Set output obfuscated circuit
$C^{O} = Γ_{1}$ . For
$j$ in range
$[2, m]$ , set

$C^{O} = C_{[1, - m^{#}]}^{O} | O_{2} (C_{(- m^{#}, *]}^{O} | Γ_{j, [1, m^{#}]}) | Γ_{j, (m^{#}, *]}$
Output
$C^{O}$

Implementing RIO

The words local mixing in the title are due to the approach authors introduce to build an RIO that satisfies property 1 and 2 simultaneously.

Authors first observe that constructing an RIO that provably satisfies property 1 and runs in polynomial time is hard because it implies CoNP = NP (Question: How so? In particular why having polysize witness of transformation from functionally equivalent and same size circuit

$C_{1}$ to
$C_{2}$ imply CoNP=NP?). Which is why they settle for an algorithm that assures

C_{1}

and

C_{2}

are not efficiently discernible.

Satisfying property 1 of RIO requires transforming the input circuit with

m^{*}

gates to a functionally equivalent circuit with

m^{#}

gates. To do so the algorithm first introduces redundancy to the circuit by (1) randomly isolating small sub-circuits and (2) replacing them with bigger funtionally equivalent circuits. After which the resulting circuit will have

m^{#}

gates. However, the resulting circuit can easily be reversed if care is not taken to hide the traces of introduced redundancy. This is acheived by again randomly selecting sub-circuits and replacing them with funtionally equivalent but different circuits of same size. The hope is that the latter stage, since sub-circuits are randomly chosen, diffuses introduced redundancy over larger and larger parts of the circuit. The first stage is referred to as "inflationary" and the second stage is referred to as as "kneading".

Local mixing

Inflationary stage and kneading stage only differ in size of the circuits being replaced. Otherwise, both stages take local & independent sub-section of the circuit and replace it with a different functionally equivalent circuit. A single step in either stage is referred to as local mixing step.

The step is divided into 3 main parts:

Selecting an independent sub-section
$C^{o u t}$ with
$ℓ^{o u t}$ gates at random
Finding functionally equivalent to
$C^{o u t}$ but different circuit
$C^{i n}$ with
$ℓ^{i n}$ gates.
Replacing
$C^{o u t}$ with
$C^{i n}$ without affecting functionality of the original circuit.

Step 1: Selecting
$C^{o u t}$

Observe that naive selection of

ℓ^{o u t}

consecutive gates in

C

and replacing them with

C^{i n}

can easily be reveresed if most gates in

C^{o u t}

are commutitative (i.e. swappable). It's because with high probability

C^{i n}

will be some permutation with gates present in

C^{o u t}

with pair of random base gate and their inverses to pad up to

ℓ^{i n}

gates. For example,

γ_{1}, . . ., γ_{ℓ^{o u t}}, β, β^{- 1}

where

γ_{1}, . . ., γ_{o u t}

are

C^{o u t}

gates and

γ_{1}, . . ., γ_{o u t}, β, β^{- 1}

are

C^{i n}

gates.

Skeleton graph:

In order to avoid the obvious attack, the reversible circuit is first converted to skeleton graph which only stores dependency among the gates. In the skeleton graph each gate is a node and there exist an edge between two gates

i < j

iff gate

i

collides with gate

j

and there exists no such gate

k

i < k < j

, that collides with both gate

i

and gate

j

Selecting

C^{o u t}

at random:

Observe that a convex subgraph with

ℓ^{o u t}

nodes in the skeleton graph is locally replaceable. That is, the convex subgraph can be swapped out with functionally equivalent circuit

C^{i n}

with

ℓ^{i n}

gates.

Selection of random

C^{o u t}

simply requires selecting a convex subgraph with

ℓ^{o u t}

nodes at random.

Step 2: Finding
$C^{i n}$

Simple brute force approach to find a functionally equivalent to C^out circuit

C^{i n}

with

ℓ^{i n}

gates is feasible as long as

ℓ^{i n}

is small enough. In our implementation,

ℓ^{i n}

is fixed to 4.

Step 3: Replacing
$C^{o u t}$ with
$C^{i n}$

Since

C^{o u t}

is a convex subgraph the predecessors of

C^{o u t}

and successors of

C^{o u t}

are disjoint. At a high level, replacement operation writes the circuit such that gates of

C^{o u t}

are placed consecutively (i.e. predecessors of

C^{o u t}

are placed before

C^{o u t}

and successors of

C^{o u t}

are placed after

C^{o u t}

). It then replaces

C^{o u t}

with

C^{i n}

and updates the skeleton graph with necessary edges.

To correctly update the skeleton graph efficiently two additional auxilary data structures are required: direct incoming connections (DIC) and direct outgoing connections (DOC).

DIC and DOC are maps that map gates to a set. For any gate

j

, DIC[j] maps to a set of gates that are

< j

and collide with

j

. For any gate

i

, DOC[i] maps to a set of gates that are

> i

and collide with

i

. Note that any two gates

i

and

j

, for

i < j

have an edge in skeleton graph iff they collide and DOC[i] and DIC[j] are disjoint.

Following is pseudocode of the replacement algorithm:


def inplace_topological_sort(nodes: &mut [Nodes]):
    '''
    Topologically sort the nodes in-place as per SG(C)
    '''

def find_all_predecessors(convex_set: [Nodes]):
    '''
    Finds union of predecessors of the nodes in the input convex_set
    '''
    
def find_all_successors(convex_set: [Nodes]):
    '''
    Finds union of successors of the nodes in the input convex_set
    '''

def replace_cout_with_cin(C^out, C^in, DOC, DIC):
    '''
    Replaces C^out with C^in
    '''

    # Since C^out is convex `all_predecessors` and `all_successors` must be disjoint.
    all_predecessors = find_all_predecessors(C^out)
    all_successors = find_all_successors(C^out)

    # Find the nodes that are neither predecessor nor successors of C^out
    all_outsiders = all_nodes(SG(C)) 

    inplace_topological_sort(all_predecessors)
    inplace_topological_sort(all_successors)

    # Update DOC and DIC with C^in. And update SG(C) with new edges. 
    for gate i in C^in:
        # Direct outgoing connections within C^in
        for j in C^in[i+1, ]:
            if i collides with j:
                insert j in DOC[i]
                insert i in DOC[j]

        # Direct outgoing connections from C^in to C^out's successors
        for gate j in all_successors:
            if i collides with j:
                insert j in DOC[i]
                insert i in DOC[j]

        # Treat all outsiders as successors
        for gate j in all_outsiders:
            if i collides with j:
                insert j in DOC[i]
                insert i in DOC[j]

    for gate i in C^in:
        reverse(for gate j in all_predecessors):
            if i collides with j:
                insert i in DOC[j]
                insert j in DIC[i]

                # Remove unecessary edges in SG(C). 
                # That is, remove any edge from j to any node in DOC[i]
                # This is because *now* there exist a node between j and i's
                # direct connection which collides with both, the gate i
                # itself.
                for node in intersection(DOC[j], DOC[i]):
                    SG(C).remove_edge_if_exists(j, node)


                # Add an edge in SG(C) following the rules of skeleton graph. 
                # That is, iff DOC[i] and DIC[j] are disjoint
                if is_disjoint(DOC[j], DIC[i]):
                    SG(C).add_new_edge(j, i)

    # Find union of DOCs/DICs of C^out (excluding any vertex in C^out)
    let union_cout_docs = {}
    let union_cout_dics = {}

    # Delete vertices in C^out from DOCs, DICs, and SG(C)
    for node in C^out:
        delete from SG(C)
        delete from DOC including from the internal sets 
        delete from DIC including from the internal sets


    # Re-connect missing links
    for direct_inc_predecessor in union_cin_dics:
        for direct_out_successor in union_cout_dics:
            if is_disjoint(DOC[direct_inc_predecessor], DIC[direct_out_successor]):
                SG(C).add_new_edge(direct_inc_predecessor, direct_out_successor)

RIO strategy 1 and strategy 2

Strategy 1 is simpler than strategy 2. However authors claim both strategies are equivalent. Strategy 2 is same as the one we've seen before; inflationary stage followed by kneading stage. In the inflationary stage

ℓ^{o u t} < e l l^{i n}

. In the kneading stage

ℓ^{o u t} = ℓ^{i n} = ℓ^{k n d}

and

ℓ^{k n d} \geq ℓ^{i n}

of inflationary stage.

Strategy 1 does not differentiates between inflationary and kneading stage. Instead at each local mixing step it randomly samples

ℓ^{o u t}

from the range

[2, ℓ^{i n}]

Concrete parameters for ROI

For strategy 2 authors set

m^{*} = n l o g^{4} n

and

m^{#} = n^{3}

. They run inflationary stage for

n^{4}

iterations, and kneading stage for

n^{5}

iterations.

However, for strategy 1 authors have drastically cut down iterations to

O (m κ)

where

κ \approx n

. For example, for n = 128 only

\approx 460 K

iterations are required for security parameter for

κ = 128

Additional resources

The t-wise Independence of Substitution-Permutation Networks (https://eprint.iacr.org/2021/507.pdf)

t=2, that is almost pair-wise independence, implies security against any attack that takes advantage of statistical distance between pair of outputs (for ex, linear/differenctial/truncated-differential cryptanalysis).
There exist a class of diffrential attacks called higher order differentials that consider the k-th derivative (Lai94). Note that k-independence permutation is secure against k^{th} differential attack. More so, any pairwise (t=2) independce is secure against truncated differentials (Knu94)
There are 3 main results:
- Theorem 3.7 & 3.8: a 3 round SPN is
  $ϵ$ close to 2-wise independent where
  $ϵ$ is small
- AES is almost 2-wise independent. To be specific they prove that 6 round AES on 8-bit blocks is
  $ϵ = 0.472$ close to pairwise independence and then use MPR amplification lemma to claim
  $6 r$ -round AES is
  $2^{r - 1} {0.472}^{r}$ close to pairwise independence.
- An existential result that says that there exists t+1 round key alternating cipher (independent keys per round; heuristic assumption of key-scheduling is ok!) so that it is almost t-wise independent. Existential means they don't concretely define the permutations.
Video: https://ntt-research.com/cis-stefano-tessaro-2021summit-summary/
Another related video on t-wise independence, permutations, and block ciphers: https://www.youtube.com/watch?v=FPWPQ6L8Thk

Pseudorandom Permutations from Random Reversible Circuits (https://arxiv.org/pdf/2404.14648)

Depth of reversible circuit: Maximum no. of gates through which any wire passes. A gate is defined as permutation on c << n wires, where n is total no. of wires.
Authors claim that all difficulty in theorem 9 (their main theorem) comes from "analysing the spectral gap of the natural random walk on k-tuples of strings"
- What is spectral gap of natural random walk?
What is LP21 and how does OWF connect with Kolmogorov's complexity?
Check papers on analysis of security properties of SPNs
HMMR05 conjecture that any exp(-n)-approximate 4-wise independence permutation composed of reversible gates forms a pseudorandom permutation.
HOw does statistical indistinguishbility compares to computational indistinguishibility ?
-> Statistical: distinguish two statistical distributions, requires as many samples than the security parameter to be able to distinguish, for example distinguish
$x \in_{u} Z_{2^{128}}$ from
$x^{'} \in_{u} Z_{2^{128} - 1}$ . Provides security against an adversary with unbounded computation.
-> Computational: only one sample is needed but as many operations as the security parameter are needed to be able to distinguish from the ideal distribution, for example breaking the LWE problem. Does not provide security against an adversary with unbounded computation.
Connection between OWFs, reversible circuits, and hardness of inverting block ciphers
- Authors view problem of inverting block ciphers as finding the reversilbe circuit that computes the permutation. They call it minimum reversible circuit size problem (MRCSP), which is analogous to mininum circuit size problem (MCSP) (in KC00).
- KC00 prove that whenever the input output pairs fed to MCSP solver are computed using a pseudo-random function then MCSP solver does not run in polynomial time. Hence, exitence of OWFs precude the existence of polynomial time solver for MCSP.
- There's a theorem that says that if OWFs exist, then for a constant d, and for large enough n there exist a negl(n)pseudo-random function with boolean with gates at-most n^d.
- Authors show that we can re-write feistal network (a pseudo-random function) with n^d gate as a reversible circuit. Their contribution is to apply the permutation again on the ancilla bits so that they don't leak information about the input.
- Analogous to MCSP, a reversible circuit that computes a pseudo-random permutation with n^d gate is MRCSP hard. Hence, if OWFs exists then there exists negl(n)-pseudo random function with boolean circuit of size n^d. Thus, there exists a reversible circuit with n^d gates that computes negl(n)-pseudo random function.
In CMR20 authors build a pseudo-random function with
$2 \log n + \log_{3} n$ gates. Based on conjecture in HMMR05 setting k = 4 with
$ϵ = 2^{- n}$ gives a pseudo-random function, of which the authors provide more evidence. Then as per theorem 9 a random brickwork reversible circuit of size
$O (n^{2})$ suffices to build pseudo-random function. I'm curious whether there's any to be drawnbetween theorem 9 and 3-stage construction of CMR20. In particular, CMR20 provides quantum mechanical evidence for pseudo-randmoness of 3-stage construction but no proof. Do you think there can exist a proof for pseudo-randomness of reversible circuits with strctured gates with depth
$\log n$ ?

Quantum statistical mechanics of encryption: reaching the speed limit of classical block ciphers (https://arxiv.org/abs/2011.06546)

The paper formulates attacks on fiestal block ciphers using out-of-order correlators (OTOCs) and entropy. Decay of OTOCs signal chaos in quantum systems and saturation of entropy limits information that can be extracted by an adversary. They show that for insecure instantiation of fiestal block cipher with 1, 2, or 3 rounds, OTOC does not vanishes to very small values and has significant residual entropy (i.e. distance from maximum entropy). However with 4 rounds OTOCs vanishes exponentially while entropy saturates.

Based on the above evidence they construct a 3-stage reversible circuit with

n \log n

3-bit permutation gates for which OTOCs vanish exponentially and entropy saturates. Hence, the 3-stage cipher can be claimed to a strong pseudo-random permutation secure against an adversary with polynomial no. of queries (they further claim that reversible circuits are also secure against quantum adeversaries).

The 3-stage construction consists of 1 inflationary stage with

\log n

layers, followed by 1 proliferation stage with

\log n

layers, and another inflationary stage with

\log n

layers. Inflationary stage only consists of gates from the set of 144 inflationary gates. Proliferation stage only consits of gates from the set of 10K proliferatory gates (however one of the authors mentioned that any random 3-bit permutation gate would do as a proliferatory gate in this video).

Let

n^{'}

be power of 3

\geq n

. For each layer, divide

n^{'}

into sets of triplets such that collection of layers within a stage forms hierarchical tree structure. For example, l1 = (0, 1, 2) (3, 4, 5) (5, 6, 7)… and l2 = (0, 3, 6) (1, 4, 7) (2, 5, 6)… etc. Assign a random inflationary/prolifatory gate to each triplet. Let

π

be a random surjective map from

n^{'}

n

. Apply

π

to elements of each triplets in every layer (if n' = n, then map is 1 to 1. If n' > n then repetitions are inevitable) and output the reversible circuit for the stage. Repeat the procedure for all 3 stages and output the concatenation of the 3 reversible circuits.

13oct

Sampling a random subgraph with ell gates and finding a functionally equivalent reversible circuits with same no. of gates tends to get really hard as ell grows. It becomes in-feasible beyond ell=5. I suspect the reason is the following: A random subgraph has complexity gap 0. Which means the set of circuits that compute the same function as the random subgraph with same no. of gates are exponentially small with m.
- Note that this does not includes the case where we know random subgraph has CG > 0. In which case, assuming ell_in is sufficiently larger than ell_out, finding funtionally equivalent circuit is quite fast.
The observation above makes me wonder whether the kneading stage work. At the moment, I think it will. This is because inflationary stage intentionally introduces complexity redudancy in the circuit. Thus a random subgraph will have complexity gap way higher than 0 (Question what will be the complexity gap?). This means a subgraph with ell_knd (ell_knd is way higher than ell_in,ell_out) will have CG > 0. Hence, it should be easier to find a functionally equivalent circuit ell_knd gates. However, as kneading stage progresses CG is expected to reduce below CG for any random circuit with ell_knd gates. Therefore find replacement circuits may become harder.
Question what will be the complexity gap of random subgraph with ell_knd gates after inflationary stage?
- My initial response was that it should be proportional to ell_out - ell_in.
Is the size of set of circuits that compute permutation P with CG§ = x exponential in x? In other words, does it become easier to find a circuit if CG is high.
- If it is then it should be easier to find replacements for kneading stage at the start. However with each iteration the difference between a CC of a random subgraph with ell_knd gates of the circuit and CC of random circuit with ell_knd gates reduces. Thus overtime finding a replacement for a random subgraph will become harder. I'm curious beyond what point do we consider it enough and terminate the process?

15 oct

The time it takes to find a circuit with \ell^in gates seems to grow with \ell^out. It seems to mostly depend on no. of acive wires in C^out, which infact grows with \ell^out (with high probability it is \approx 2\ell^out + 1). At the moment, as far as I can see, it seem to be unfeasible to find repalcement circuit C^in when \omega^out >= 7. I've asked other folks whether the way I'm sampling the circuits is correct. Whether there exist a better strategy to sample circuits.
It will be nice idea to get a average sampling runtime estimates for various values of \ell^out and \omega^out.
- I'm also curious to abou how does the average sampling runtime changes with probability of sampling 3-way gates and 2-way gates.
- I took a random circuit with \omega^out = 6, \ell^out = 5, and tried finding a functionally equivalent circuit with \ell^in = 10. I set two_prob to 1 (ie no 3-way gates and only 2 way gates). Even after 100M rounds of randomly sampling different circuits (there were repetition) I couldn't find a funtionally equivalent circuit. This makes me think that probably the brute force approach is not the way to go. We need a better strategy.
  - Just to assure that setting two_prob = 1 wasn't the cuplrit I ran again with two_prob = 0.5 (i.e. probability of sampling 2-way and 3-way gate is same) for the same parameters. The only difference in this case is several circuits are sampled more than once and all of them have 1 thing in common - all of their gates are 3-way CNOTs. I guess it is because, although the probabity of sampling 3-way is equal to 2-way, the space of reversible circuits with just 3-way gates is way smaller because a single 3-way gate decomposes into 4 2-way gates. So for \ell^out = 10 one only needs composition of at-most 3 3-way gates.
  - This makes me wonder whether setting equal probability for 3-way and 2-way gates is fair.
    - I changed two_prob = 0.8 but nothing changes. Still many collisions because of 0.2 chances of 3-ways gates

17 oct

I treid splitting a circuit with many gates into smaller circuits with less gates and find replacement circuits for each whenever n is big (for ex, n >= 7). Well we're able to find the circuit but it is exactly as the input circuit. Splitting the circuit into less gates reduces the search space and we're able to find exactly the same circuit. For exaple with \ell^out = 7 and n = 15. I split the circuit into first 3 gates (first circuit) and last 4 gates (second circuit). Then I tried finding their funcitonally equivalent counterparts. Splliting into smaller gates reduces the search space from n(n-1)(n-2)*7 to n(n-1)(n-2)*3 and n(n-1)(n-2)*4 respc. But this still does not help becauss we increase the probability of finding the same circuits
Let's think through what's the point of the two stages - Inflationary and kneading stage.
- Inflationary stage: This stage introduces redudancy to the circuit. It sets c * \ell^out = \ell^in. More so, finding replacement circuit for this stage, as far as I understand, is pretty easy.
- Kneading stage: This is the more problematic stage. The point of this stage is hide the introduced redudancy in the inflationary stage. It takes \ell^knd gate sub-circuits and finds their \ell^knd gate replacement. Authors claim that doing so spreads the complexity gap over larger and larger region. Their reasoning goes as follows:
  - Let C^out be the \ell^knd circuit. Note that C^out may have, assuming \ell^knd > \ell^in, several C^in from previous stage within itself. Hence CC(C^out) >= \sum CC(C^in_j) where C^in_j is j^th C^in circuit. With each iteration number of C^in within any C^out will decrease beacuse of the previous iteration, hence CG(C^out) will grdually decrease and after sufficiently many iterations will equivalent to CG of any random circuit with \ell^knd gates. Hence, there's no way for an attacker distinguish any part of circuit from a random sub-circuit with \ell^knd gates. There's no thread left to start pulling from.
  - I have the following quyestions:
    - If at each iteration we find a funtionally equivalent circuit which just has a different shape (ie contains no information about orginal set of C^in circuits) the CG still remains equivalent to CC(C^out) >= \sum CC(C^in_j) (C^out being the circuit being replaced). This means touching a particular sub-circuit, C^out, which is a collection of several C^in of inflationary stage just once isn't enough. Hence, we need to run iterations of kneading stage quite a few times.

18 oct

I noticed for the case \ell^out = 2 (or even 3/4) and \ell^in = 5 (or honestly any other > \ell^out value) the C^in contains all gates of C^out. I do not know whether this is obvious. But I expect it to make it easier to attack if redundancy is not spread out properly in the kneading stage.

Let's assume the redundacny is not spread out in the kneading stage. Then with high probability there exist a convex subcircuit with \ell^in gates and \omega^out active wires (we assume that active wires of C^in and C^out are unchaged). To reverse the mixing, the attacker first identifies a sub-circuit with \ell^in gates and \omega^out active wires and tries to find a replacement circuit with \ell^out gates. If the attacker has no other information the attack will take in the worst case
$O (2^{2^{ω^{o u t}}})$ iterations. But if the attacker knows that C^in contains gates of C^out with high probability then attacker needs to only try \ell^in p \ell^out (ie no. of way to select \ell^out gates from \ell^in gates).

Below are a few samples of C^out and C^in.
```
C^out 
---------------
 0 1 2 3 4 5 6
---------------
 x I x x O I x
 x x I I x O x
 I x x O x x I
---------------

C^in 
---------------
 0 1 2 3 4 5 6
---------------
 x x I I x x O
 x I x x O I x
 x x I I x O x
 x x I I x x O
 I x x O x x I
---------------
```
```
C^out
---------------
 0 1 2 3 4
---------------
 x O I x I
 I x x I O
---------------

C^in
---------------
 0 1 2 3 4
---------------
 x O I x I
 x I x I O
 I O x I x
 x I x I O
 I O x I x
---------------
```
Probably this attack can be avoided if \omega^in > \omega^out (ie C^in has more active wires than C^out). But I don't know whether this is necessary to do so. Or shall we safely assume that kneading stage will spread the complexity gap throughout really well.

Oct 24

Aprt from finding functionally equivalent circuit C^in, the other thing that starts to consume majority of time for bigger graphs is "finding convex subgraph".
- The existing approach to find convex subgraph was naive and did a lot of extra work. More so, it iterated over all edges that all nodes in subgraph has, which can be many if the graph is really big.
- I trimmed down the runtime with the following simple rule: Recursively add elements to the convex subset via random selection of 1 edge belonging to 1 node in the convex set. If the subset is no more convex start afresh. If the resutling subset is convex and is not of desried size, recurse to find the next element. Don't get into the trap of checking each edge of each node.
- This rule seems to work but still takes long for anything over 40K gates. I'm wondering about why it increases with no. of gates. The primary candidate for this seems to be the DFS algorithm that checks whether the potential candidate node has any path with existing nodes in the convex set which includes an external node not in the set. The runtime of DFS does depends on the graph size but can easily be pruned with some auxiliary information given that we're generally interested in small selecting convex-subgraphs (ie subgraph with at-most 4 to 5 nodes).
  - What can this auxilary information be?
    - Can we solve this for simple case of 2?
I refer to the funciton that implements the new way of finding the convex set (the one that does not falls into trap of edges; the recrusive way) as blah. I started to test blah for different graph sizes and there's clear difference in runtime (note that blah is called repeteadly with a new random source node until it finds a convex subset). For m=40K on average (although per call runtime depends on node; averaged over 1000 runs) it takes 90ms. For m=20K on average it takes 23.5ms. For m=10K it takes 6ms. For m=5K it takes 1.5ms
While chatting with Han I figured that instead of trying to find a convex subgraph of some fixed size we can instead be more flexible with the size. For example, we're mostly interested in sizes 2, 3, 4, 5 and we should select the subgraph that is either of size 2, 3, 4, or 5. It remains to check whether the function returns the subgraphs of sizes 2, 3, 4, 5 with equal probability. If it does then this is equivalent to randomly sampling a size
$\in [0, 5]$ and finding a subgraph.
Another question popping in my mind is how efficient would this approach be for proof verification + FHE decryption. (Making this point bold to come to later: TODO)

Oct 25

I wanted to assign levels to each node in the graph such that a predecessor is guaranteed to have level < its successors. I took the following approach:
- Topologically sort the graph and then for each node assign the level = max(all parents' levels) + 1.
- The approach works but to topoligically sort a graph with 50K gates it takes 10s. This makes me wonder whether there're parallel algorithms for topolical sorting.
Why assign levels to each node: There are two reasons
- (a) Prune DFS in "find convex graph". That is, safely terminate DFS recursion if current node's level > than candidate's level.
- (b) Fast C^out replacement with C^in: Assume there exist a hash map G_k that stores gates touching wire
  $k$ . Then to draw edges from/to gates in C^in we just iterate through gates in G_{\alpha} where gate i in C^in touches wire \alpha and check for collision. If the colliding gate has level > C^out then drawn an outgoing edge otherwise draw an incoming edge. Note that C^out's level is a range and the colliding gate may have equal level in which case randomly make them predecessor or successor.
Ok, looks like topological sort will hard and we cannot use it. Assuming the graph is dense, sorting takes roughly n^2.
I wonder would it be more performant, or at-least equivalent to what we've at the moment, to pick a random subgraph and check whether it's convex. I see no difference other than we're ruling out non-convex sets much faster than before. Another side benefit is we know the topological ordering of the graph in advance. Hence, need not to worry about finding order of convex set later.

Oct 26

Topological sort of skeleton graph with 50K nodes on average takes 30ms but SC version of the same graph on average takes 10s. The only difference between skeleton graph and SC graph of a circuit is skeleton does not have any transitive collision edges. I suspect the reason for a large slowdown in SC is because SC is very very dense.
I'm thinking to divide the replacement operation into the following steps
- All immPred, and any of their predecessors, must come before C^in. Similarily all immSucc, and any of their successors must come after C^in. Hencce,
  - for each immPred, find the first gate in each dependency graph of C^in it collides with. Then recrusively balance out the collisions for immPred's predecessors. For example if immPred collides with i^th gate in k^th dependency chain of C^in, then recursively find collisions for immPred's predecessors for gates [0, i) of the dependency chain.
  - for each immSucc, find the last gate in each dependency graph it collides with. Then recursively balance out the collisions for immSucc's successors. For example if immSUcc collides with i^th gate in k^th dependency chain of C^in. Then recursively find collisions for immSucc's successors for [i+1, ) of the dependency chain.
  - Add all the new edges to the graph
- It is possible that a immPred was connected with a immSucc via C^out but may not be connected via C^in. This incorrectly implies immSucc does not depend on immPred. But infact any immPred must come before any of the successors of C^out. Likewise any immSucc must come after any of the predecessors of C^out. To settle this case we take the following approach:
  - For each such immPred, immSucc pair: for each path outgrowing from immSucc find the first gate immPred collides with. Draw an edge from immPred to that gate. Let that gate be gate i on j^th path starting from immSucc. The chain of gates on j^th path uptill, and not including, j^th gate must now be checked for collisions against predecessors of immPred. This is similar to settling dependecy chains with predecessors.
- C^in contains new gates which bring new collisions with gates which is neither a predecessor of C^out nor a successor. These are the outsiders. Find the outsiders (to find outsiders, find all predecessors and successors of C^out and remove union of predecessors, successors, and C^out from set of all nodes) and check for collision. We make the choice of making any colliding outsider a predecessor (although doing so is not necessary).

Oct 27

I implemented an algorithm to settle cases where some immPred, immSucc pair are no more connected via C^in. However it turns out as the graph grows this algorithm takes really long because there are many such pairs in the kneading stage which results in many (successor) paths to investigate, which in turns results in many such dependecy chains to settle. I also added an algorithm that prunes recursive calls if same dependency chain is checked against any predecessor node more than once. But it does not help much. I parallelised settlement, but there are still cases where the algorithm takes > 5s. Check this branch for code.

Oct 28

I went back to the naive way. That is,
- Don't use skeleton graph. Instead use SC.
- To replace C^out with C^in find all predecessors/successors of C^out. For each predecessor draw in outgoing edge to any gate it collides with in C^in. For each successor draw in incoming edge to any gate it collides with.
- For each outsider, draw an edge to any colliding gate within C^in.
Obfuscation is in process. Still at inflationary stage (which runs for 50K times). A few observations:
- Not a single call to local_mixing returned false with max convex interation se to 10K and max replacement iterations set to 1M. Which not a single call to find_convex_subcircuit and find_replacement returned false. This is not surprising at inflationary stage. It will be nice to check whether this holds true for kneading stage.
- Majority of time per iteration is spent on 2 things:
  (1) find convex subcircuit: roughly 85%.
  (2) find all predecessors: roughly 10-12%
  
  Question: How does find convex subcircuit time decrease with no. of cores.
  Question: Why "find all predecessors" is more expensive than "fina all successors"? The difference is quite a bit. The later is in double digit micro-seconds whereas the latter is in double digit milli-seconds. Or am I mistaken?
- Find replacement circuit: takes 500us; Not surprising given we're at still at inflationary stage.

Oct 29

Question: Why "find all predecessors" is more expensive than "fina all successors"?
- Answer: It's because the distribution of convex sub-circuits is skewed towards the end of the graph. This means it is very often the case that a convex sub-circuit has many more predecessors than successors.
Question: How does find convex subcircuit time decrease with no. of cores.
- Answer: It does not improves linearly with no. of cores. However, it does improve.
Han implemented a way to assign levels to each node in the graph. Using which we were able to prune the DFS used in find convex sub-circuit which gave a huge improvement in find convex circuit's runtime. However, the function to assign levels itself takes a lot of time. After parallelising it across cores, it takes roughly 290ms on 32 cores.
- Question: I wonder whether we can adjust levels in-place after replcing C^out with C^in. So that we don't we need to touch the entire graph per iteration to update levels.
I suspected that the reason convex subcircuit is often skewed towards the end of the graph is because it is easier to find them towards the end of the graph. However, this does not seem to be the case. I restricted the start nodes (start nodes are randomly picked to start finding convex subgraph) to first quarter of the graph and observed the runtime to find a convex subcircuit. The runtime did not change but "find convex circuit" function did often find sub-circuits that has approximately equal no. of predecessors and successors.
I compared predecessors/successors distribution of convex sub-circuit and runtime to find convex sub-circuit for a random skeleton graph against a random SC graph, both with 50K gates. On average over 10 runs, find_convex often returns convex subcircuit with even distribution for skeleton graph and un-even distribution for SC graph (I wasn't expecting this). More so, as expected, find_convex returns in 12ms for skeleton graph and 950ms for SC graph.
- This hints we must use skeleton graph for all graph related operations and use SC, or some other representation, for replacement operations.

Oct 30

Since the core of program obfuscation via local mixing is constructing an RIG, we will simply narrow down to testing RIG. RIG requires to obfuscate a random reversible circuit using RIO obfuscator twice and concatenate the first output with the inverse of the second. Thus we obtain a obfuscated reversible circuit that computes the identity permutation. They are two elements here (1) RIO obfuscator (2) reversible circuit that can be guaranteed to produce permutation that are indistinguishable from a true random permutation by a polynomial time adversary.
Reversible circuit for strong-psudo random permutation: I referred to the following paper. It shows some evidence (quantum-mehcanical which I don't happen to grasp) that a reversible circuit with 3 stages - inflationary -> proliferation -> (again) inflationary where each stage is log_2{n} layers and is tree-structured is a secure block-cipher. Hence, produces pseudo random permutation on its inputs. Following is my explaination of how to construct it:
- Construction:
  - Let
    $L_{I 0, x}, L_{P, x}, L_{I 1, x}$ for x = 0, 1, 2 equal
    $\log n$ where
    $n$ is number of wires and
    $I 0$ is the first inflationary stage, P is the middle proliferation stage, and I1 is the second inflationary stage. L_{y,x} represent the no. of layers in stage y.
  - For each stage, iterate through the layers for
    $i \in [0, L_{S, x})$ where S is the stage. At layer i randomly divide n wires into sets of size 3. Each set must be unique (I realise it may not always be possible, in which case repetition is allowed to fill the set. That is, make sure that union of all set touches all wires at-least once). Also pick a gate for each such set. If S is inflationary then the gate must be picked randomly from set of 144 inflationary gates. If S is proliferation then the gate can be a random 3 bit permutation.
  - Output
    $L_{I 0, x} | L_{P, x} | L_{I 1, x}$
- Notes:
  - In local mixing paper, although authors have referred to the construction above to claim that m can be as low as O(n) for concreteness they set m = O(n4log{n}). I believe it's because the authors are working with just 16 base permutations, not all 3-bit permutations.

Oct 31

So the new algorithm finally seems to work. It hasn't crashed for long. The problem was in the routine function. routine is supposed to handle edges from predecessors of C^out to their direct connections which they don't happen to have in the incoming graph because both the predecessor and C^out collide with the direct connection. The rule is pretty simple: Any node will only have an edge to other colliding node, if there's no other node in between which collides with both. Routine function takes this rule and applies it recursively.
- It makes two observations (1) that any predecessor of C^out will not have a direct edge with its direct connection after replacement only if C^out had a direct connection as well (2) If A^P is predecessor A where A is predecessor C^out and set of direct connections of C^out is DC. If A manages to regain path to its direct connetions in DC. Then A^P only has to care about the left over elements in DC.
- So the routine function collects union of direct connections of C^out. And passes it to its predecessors. Each predecessors, for each node in the intersection of its direct connections and C^out's directions checks whether it has path from itself to the node. If not, it then draws a direct edge. Then the predecessors chucks out the nodes at intersection and passes it over to its predecessors.
- One simple optimisation is: If a node (ie direct/indirect predecessor of C^out) has been visited once in the process. It does not need to visited twice. This is because if it receives different set of left over direct collisions of C^out from its two different successors and one of them consist a direct connection of itself but the other does not. Then the direct connection is rendered a don't care because one of the node's (direct/indirect) successor has already created a path (the successor(s) that passed over the set that does not have the direct connection) to the direct connection.
The benefit of using skeleton graph for graph algorithms in the new algorithm is apparent. For example, at inflationary mixing step 5000 "find convex sub-circuit" using old algorithm takes (ie using SC graph) 50ms whereas with the new algorithm it takes 3ms on average. Finding predecessors and successor has sped up a lot as well. On single core "find predecessor" with new algorithm takes 4ms on average whereas with old algorithm it takes roughly 4ms on >= 32 core machine.
The slowest part of new algorithm still remains the "routine" function. Recall that routine function is called per predecessor and the runtime per predecessor is very non-uninform. As far as I'm able to parse the traces, whenever a mixing step takes >1s it is just 2-3 predecessor routine function calls that consume the majority of the time. It seems likely that it is the "find_path" function that takes the longest. And there are few predecessors that have big intersections which leads blowing up their runtime.
Text with Han on parallelisation:
- The new algorithm has following main functions:
  1. It first finds topologically sorted predecessors of C^out
  2. It then finds topologically sorted successors of C^out.
  3. It then finds topologically sorted outsiders of C^out.
  4. It then updates direct connections and transitive connections from C^in nodes to successors of C^out. Any two nodes will have a direct connection if they collide. Any two nodes will have a transitive connection if they collide and do not have any node in between that collides with both. Additionally all outsiders that collide are treated as successors.
  5. It then updates direct connections and transitive connections from predecessors to C^in nodes. Additionally it removes any transitive edge that a predecessor, which collides with node A in C^in, may have with node A's (newly discovered) direct connections (because now node A of C^in manages that transitive connection).
  6. It then calls the routine function for each immediate predecessor of C^out. I will tell you reasoning behind routine function later.
  I wanted to narrow down to scope of parallelisation.
  
  Finding topologically sorted predecessors and successors can easily be parallelised with what we've at the moment. We simply find all predecessors (using parallelised DFS) and then sort them as per graph levels. The same logic applies for topologically sorted successors.
  
  Point (4) and (5) are pretty trivial to parallelise as well. But they happen to be very cheap. We should prioritise parallelising other things before we tackle 4,5.
  
  The new algorithm seems to work well btw. "Find convex sub-circuit" is roughly 3-4ms after 5000 inflationary local mixing steps whereas with the old version it is 50ms.
  
  The main bottleneck still is calling the routine function per immediate predecessor. For few predecessors it takes 100s of milliseconds. So I don't think we should parallelise it per predecessor. Moreover, if we parallelise per predecessor we will be visiting all predecessors again and again which we don't need to. I also suspect the majority of the time is spent in the has_path_connecting function. Which makes me think we should instead parallelise has_path_connecting across a intersection set.
  
  It could be the case that as the graph grows a single invocation to has_path_connecting may take a quite a bit of time. I don't know how has_path_connecting behaves with big graphs. I will test it now.
  
  At the moment, I'm parallelising has_path_connecting across intersection set and checking whether it gives any significant speed ups.
  
  If parallelising has_path_connecting across intersection set gives significant speeds ups and runtime of has_path_connecting does not degrade significantly on larger graphs on average. Then we're good to go. Otherwise, we will have to implement parallel variant of has_path_connecting (which should be easy given we already have parallel DFS).
  
  Then we will have to figure out some heuristic for when to use parallel has_path_connecting per node in any intersection set vs to parallelise has_path_connecting across nodes in the intersection set. A naive heuristic could be: for small graphs always parallelise across the set. For big graphs if the intersection set is small use parallel variant of has_path_connecting otherwise parallelise across intersection set

Nov 4

We modified the the main program to run two different type of obfsucation strategies. The first is a rather simple variant of the second strategy which authors claim to use just for exposition purposes. Intuitively if you believe in strategy 2 you may believe in strategy 1 (but trust strategy 2 is the criticial piece here). Anyways following are the two strategies (I assume n <= 128)
1. Strategy 1: Fix ell^in to 4. And then randomly select any value in range [2, 4]. Set that value as ell^out. Then run the local mixing procesure for n^3 times.
2. Strategy 2: Divide the entire procedure into 2 stages - (a) infationary stage (b) kneading stage. Run inflationary stage with ell^out 2 and ell^in 4 for n^3 times. Then run the kneading stage with ell^out = ell^in = 4.
For obvious reason by default I prefer strategy 1. And if authors claim that both are equivalent, then, in practice, assuming local mixing is secure, strategy 1 makes sense.
We sampled a random block cipher reversible circuit with n=64 (ie no. of input bits) and obfuscated the circuit by running local mixing for 4000 times (for testing purposes). We repeated the experiment for 20 times and we observed the following -
- Contrary to what I expected, local mixing step failed only for a few hundred times (roughly 300 on average). And all times it failed because it couldn't find a replacement circuit; It never failed because it couldn't find a convex sub-circuit.
- Based on this observation, I think it makes sense to simply bump up repalcement iterations (probably 1M -> 10M) and add more cores so that runtime remains unchanged.
- I also expect as the graph gets bigger (we woudln't run local mixing step for just 4K times in practice. Instead we will run it for 270K times for n = 64) finding convex subgraph will also become harder. Which is why, for the next experiment, I should also bump up convex iterations (from 1K to 10K).

Nov 5

yesterday I observed that as obfuscation procedure reaches roughly 10K iterations C^out deletion starts to take occasionally take a lot of time. For example, I notice iterations where it took 40s. So I figured I should (for obvious reasons; Remember we need to run 250K iterations) fix it. I've created an issue on petgraph just to check what they maintainers of petgraph think.
My first attempt is to never delete C^out. Instead mantain a map of removed nodes. And truncate any graph related operation (ie DFS etc.) the moment it reaches any node in the set of removed nodes. Intuitivelly this should work. But at the moment this naive approach seems to introduce cycles in the graph. I don't know what's the cause. I suspect it could be the way we 're updating the levels of the graph'
Let's assume the naive approach of not deleting C^out introduces graph cycles. This means we cannot use any in-built graph algrothms and hence will have to switch to our implementation of topological sort (we only use in-built topological sort in our implementation) which completely ignores removed nodes. We already maintain a data structure graph_neighbours that stores incoming and outgoing edges of each node. For removed nodes, graph_neighbours stores empty lists for incoming and outgoing edges. If any remove node is a neighbour of any active node, then graph neighbour does not count the edge in real graph as an edge. Moreover, any algorithms which uses graph_neighbours simply filters out removed nodes. I implemented topological sort by first calculating level of each active node and then sorting the active nodes as per calculated level. Turns out multi-threaded variant of calcualting graph-level is somewhat buggy. So I ended up using single threaded version. After which obfuscation seems to run, but tests are still under-way.
We had one obfuscation running with "direct skeleton graph node-deletion" codebase (the slow one) for sometime. I was suppose to run for 50K iterations but I had to kill it because node deletion in skeleton starts to tak >30s in some iterations. I just checked its trace. Following are a few observations:
- It ran for total 15725 iterations.
- Local mixing step returned false 412 times. Everytime because find repalcement circuit returned false. We had bumped up replacment iterations from 1M to 10M for this run. But seems like some circuits are really hard to find replacement for. At all times ell^out was either 3 or 4 (I suspect majority of the time it was 4), which intuitivelly makes sense.
- Local mixing step 14K onwards. Following things (other than C^out deletion) take majority of the time (Most to least):
  - Remove edges AND Process predecessors
  - Find all predecessors
"Find all predecessors" can be sped up with more cores. "Process predecessors" is already parallelised across C^in gates (which are set to 4). We can further parallelise it but doing so will significantly increase complexity and I'm not sure how worthy would it be.

"Remove edges" comprises of two parts. The first is finding what edges to remove (which is parallelised). The second is removing the edges. I figured "finding" which edges to remove takes roughly 97% of the time.

Nov 7

After reducing runtime for "Process predecessors" with parallelism and "Remove edges" the new bottleneck is "adding new edges". This is due to some peculiarity of the library. We use update_edge because it only adds an edge if none exists between the source and the target (having duplicated edges will increase runtime of graph traversal algorithms). This takes O(k) time where k is no. of edges connected with the source. add_edge adds an edge irrespective of whether one exists and takes O(1) time. I suspect as the gaph grows in size, no. of edges connected per node grows as well. Hence, bigger graphs, let's say the one after 50K iteations, seems to take quite a while to add an edge (in some cases more than a few seconds!). I think getting around this problem should be easy: Maintain a cache of edges that exist in the graph. Keep the cache in sync with the graph. And only add an edge (with add_edge) if it does not exists in the graph. This should roughly take O(l) time where l is no. of new edges to add. Whereas using update_edge takes O(kl) time to add l edges.
I've been wondering why do we need a graph after all. Can't we maintain maps of outgoing and incoming "edges" per node and just work with it.

because we use update_edge function that avoids multiple edges between two same nodes. If we were to use add_edge

Han

2024/09/27 07:08:33

Because for n wire circuit, the probability of circuit C to have same starting gate as \gamma_i will be

On page 21 it suggests to sample until the first gate has desired control function, then rotate the wires directly, so we need to sample at most 16 times.

Janmajayamall

2024/09/27 10:55:48

does the next sentence cover this ?

2024/09/27 16:05:02

Ah yes

Program obfuscation via local mixing

Reversible circuits

Random output indistinguishibility (ROI) obfuscators

Random input & output (RIO) obfuscators

Assumptions

ROI for all reversible circuits from RIO

Random identity generator (RIG)

Gate obfuscator (GO)

Soldering obfuscated circuits

Summary of obfuscation

Implementing RIO

Local mixing

Step 1: Selecting Cout

Step 2: Finding Cin

Step 3: Replacing Cout with Cin

RIO strategy 1 and strategy 2

Concrete parameters for ROI

Additional resources

The t-wise Independence of Substitution-Permutation Networks (https://eprint.iacr.org/2021/507.pdf)

Pseudorandom Permutations from Random Reversible Circuits (https://arxiv.org/pdf/2404.14648)

Quantum statistical mechanics of encryption: reaching the speed limit of classical block ciphers (https://arxiv.org/abs/2011.06546)

Read more

Obfuscation bounty

Phantom zone v0 blog post (old version)

ADP

frog game

Step 1: Selecting
$C^{o u t}$

Step 2: Finding
$C^{i n}$

Step 3: Replacing
$C^{o u t}$ with
$C^{i n}$