Let's assume that a reversible circuit with \(n^*\) wires and \(m^*\) gates permutes its input such that the output permutation is indistinguishable from a strong pseudo-random permutation. The authors use this property of reversible circuits to build an algorithm that satisfies ROI (a notion of program obfuscation similar to iO) for all reversible circuits.
Reversible circuits were first explored to express computation such that it can be reversed. For example, \(a + b = c\) is not a reversible operation, because it is not possible to derive \(a, b\) from \(c\) alone. However, if the circuit also outputs \(b\) as an auxiliary value, the computation becomes reversible. Reversible circuits are also a natural fit for quantum circuits (a quick search for "reversible circuits and quantum computation" will produce many useful resources). The other property reversible circuits have, and the one we're interested in, is that if you chain enough reversible gates the resulting circuit behaves as a strong pseudorandom permutation. In other words, the reversible circuit permutes the input bits such that the output is indistinguishable from a truly random permutation to an adversary with polynomial resources.
A reversible circuit is expressed as a collection of \(n\) parallel wires and \(m\) gates and, usually, the computation flows from left to right. Each gate mutates at least 1 wire based on the signals of other wires. There are several varieties of gates but we're only interested in the Toffoli gate.
A Toffoli gate is defined by 2 control wires, a control function, and a target wire. The gate applies the control function to the control wires and XORs the output into the target wire. For example, if \(a, b\) are the control wires, \(\phi\) is the control function, and \(c\) is the target wire, then the gate computes \(c = \phi(a, b)\oplus c\).
Observe that with 2-bit controls there can only be \(2^{2^2} = 16\) control functions. We refer to these gates as the "base gates". A base gate is defined over \(n\) wires but it only touches 3 wires (2 controls and 1 target) called active wires. Another observation is that any base gate is an inverse of itself, and there exists an identity base gate with control function \(\phi(a, b) = 0\).
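As a quick illustration (a hedged sketch: gates are encoded as `(a, b, c, phi)` with `phi` a 4-entry truth table over the control bits — an encoding of our own choosing, not the paper's):

```python
from itertools import product

def control_functions():
    # A 2-bit control function is any map {0,1}^2 -> {0,1}, so there
    # are 2^(2^2) = 16 of them, each a 4-bit truth table.
    return [tuple(bits) for bits in product([0, 1], repeat=4)]

def apply_base_gate(state, a, b, c, phi):
    """Apply a base gate with control wires a, b, target wire c, and
    control function phi (truth table indexed by the control bits)."""
    state = list(state)
    state[c] ^= phi[2 * state[a] + state[b]]
    return tuple(state)

# A base gate is its own inverse: applying it twice is the identity.
phi_and = (0, 0, 0, 1)  # the classic Toffoli: phi(a, b) = a AND b
s = (1, 1, 0)
once = apply_base_gate(s, 0, 1, 2, phi_and)
twice = apply_base_gate(once, 0, 1, 2, phi_and)
assert len(control_functions()) == 16
assert once == (1, 1, 1) and twice == s
```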
Pseudorandomness of reversible circuits and t-wise independence
Gow96 conjectured that a reversible circuit with sufficiently many gates, where each gate is a 3-bit permutation, is a strong pseudorandom permutation. Gowers proved that a reversible circuit with \(m = O(n^3t^3 \log{\epsilon^{-1}})\) gates is \(\epsilon\)-close to t-wise independence. Note that Gowers's result applies to circuits built from arbitrary 3-bit permutation gates, not just the base gates.
A function \(F\) is called \(\epsilon\)-close to t-wise independent if, for any \(t\) distinct inputs \(x_1, \dots, x_t\), the distance between the following two tuples is \(\leq \epsilon\):
\[(F(x_1), \dots, F(x_t)) \quad \text{and} \quad (y_1, \dots, y_t)\]
where \(y_i\) is the output on \(x_i\) of a permutation sampled from the truly random permutation distribution. The distance function is the function \(d(p,q)\) defined in BH06 on page 3.
HMMR05, HB05 later improved the bound to \(m = O(n^3t^2 + n^2t \log{\epsilon^{-1}})\) and worked only with the base gates. They also conjectured that 4-wise independence suffices for cryptographic-grade pseudorandomness (I suppose there are attacks with 3-wise independence, which makes one think 4-wise is plausibly secure. But there's no elaboration on the conjecture in the paper).
Note: there are other recent papers that improve the bound and provide proofs. For example, this and this.
A function \(O\) is an ROI obfuscator with inner stretch function \(\xi\) for an input ensemble of circuits \(\mathbb{C}_{n_{\kappa}, m_{\kappa}}\) if it satisfies the following property for any reversible circuit \(C \in \mathbb{C}_{n_{\kappa}, m_{\kappa}}\):
\[O(C) \approx_c \pi(\hat{C}) : \hat{C} \leftarrow \mathcal{E}_{C, \xi(\kappa, n_{\kappa}, m_{\kappa})}\]
where \(\mathcal{E}_{C, m}\) denotes the set of all circuits with \(m\) gates that compute the permutation of \(C\).
The definition of ROI is different from iO, but they provide similar functionality. ROI, like iO, requires the obfuscator to output circuits that are indistinguishable from a random circuit that computes the same function (/permutation) as the input circuit \(C\).
Separable ROI obfuscators
An ROI obfuscator is \(m_{\kappa}\)-left (resp. right) separable if
Observe that an obfuscator \(O\) is \(m_{\kappa}\)-right separable if \(O^{\dagger}\) (\(O^{\dagger}(C) = (O(C^{\dagger}))^{\dagger}\)) is \(m_{\kappa}\)-left separable.
An obfuscator \(O\) is called an \(m_{\kappa}\)-separable obfuscator if it is both \(m_{\kappa}\)-right and \(m_{\kappa}\)-left separable.
RIO is built from function(s) that satisfy its two properties. It isn't necessary for the same function to satisfy both properties because the two properties are used at different places. However, it is important to highlight here that ROI is built from RIO based on the new assumptions the authors introduce. In particular, assumption 8, which covers assumptions 4 and 7.
Property 1:
An obfuscator function \(O\) satisfies property 1 on input ensemble \(\mathbb{C}_{n_{\kappa}, m_{\kappa}}\) with post-processing function \(\pi\) and inner stretch \(\xi\) if the following holds:
\[(C_1, C_2) \approx_{c} (C_1', C_2')\]
where
\[(C_1, C_2) \leftarrow (O(C), O(C)); \quad C \leftarrow_{R} \mathbb{C}_{n_{\kappa}, m_{\kappa}}\]
and
\[(C_1', C_2') = (\pi(\hat{C}_1), \pi(\hat{C}_2)); \quad \hat{C}_1, \hat{C}_2 \leftarrow_{R} \mathcal{E}_{P_{C'}, \xi(\kappa, n_{\kappa}, m_{\kappa})}; \quad C' \leftarrow_{R} \mathbb{C}_{n_{\kappa}, m_{\kappa}} \]
Property 1 says that running the obfuscator \(O\) twice on the same circuit \(C\) must output obfuscated circuits that are not co-relatable.
Property 2:
An obfuscator function \(O\) satisfies property 2 on input ensemble \(\mathbb{C}_{n_{\kappa}, 2m_{\kappa}}\) for sufficiently large leeway \(\lambda\) and fixed circuits \(Z_1, Z_2 \in \mathbb{C}_{n_{\kappa}, m_{\kappa}}\) if the following holds:
\[(O(C), \hat{C}_1, \hat{C}_2) \approx_{c} (\pi(C'), \hat{C'}_1, \hat{C'}_2)\]
where
\[C \leftarrow_{R} \mathbb{C}_{n_{\kappa}, 2m_{\kappa}}; \hat{C}_1 \leftarrow_{R} \mathcal{E}_{P_{C_{[1,m_{\kappa}]}|Z_1}, m_{\kappa}\lambda};\hat{C}_2 \leftarrow_{R} \mathcal{E}_{P_{Z_2|C_{[m_{\kappa},*]}}, m_{\kappa}\lambda}\]
and
\[C' \leftarrow_{R} \mathbb{C}_{n_{\kappa}, 2m_{\kappa}}; \hat{C'}_1 \leftarrow_{R} \mathcal{E}_{P_{C'_{[1,m_{\kappa}]}|Z_1}, m_{\kappa}\lambda};\hat{C'}_2 \leftarrow_{R} \mathcal{E}_{P_{Z_2|C'_{[m_{\kappa},*]}}, m_{\kappa}\lambda}\]
Property 2 says that the obfuscation of the circuit must not be co-relatable with the original circuit even when given oracle access to the \(m_{\kappa}\)-gate prefix of the original circuit.
Assumption 4 states that there exist \(n^*, m^*\) such that a random circuit from the ensemble \(\mathbb{C}_{n^*, m^*}\) is a strong pseudorandom permutation (SPRP).
Assumption 5 extends assumption 4. For any random circuit \(Q \leftarrow_{R} \mathbb{C}_{n^*, m}\), circuit \(C \leftarrow_{R} \mathcal{E}_{P_Q, m_{\kappa}}\) with fixed permutation \(P_{Q}\) where \(m_{\kappa} \geq m^*m\), and \(m^* \leq l_{\kappa} \leq (m-1)m^*\), the following two distributions are indistinguishable.
Assumption 5 says that oracle access to a sufficiently long prefix of \(C\), with at least \(m^*\) gates (recall \(m^*\) satisfies assumption 4), together with the remainder, where \(C\) is functionally equivalent to \(Q\), is indistinguishable from two SPRPs that jointly compute \(Q\).
Assumption 7 is relevant for theorem 16 and is the main assumption, along with assumption 8 (assumption 8 is an extension of assumption 7). It says that even if the adversary is given the random circuits, either the tuple \((C_{[1, l_{\kappa}]}, C_{(l_{\kappa}, *]})\) or \((Q_1 | C', C'^{\dagger} | Q_2)\), something like assumption 5 still holds. First observe that now any tuple given is clearly distinguishable from a random tuple. Nonetheless, the authors assume that sufficient inner stretch should erase any information, other than functionality, of \(Q\).
Define \(Q = H_1 | H_2\) where \(H_i \in \mathbb{C}_{n^*, m_{i}}\) and \(l_{i, \kappa} = m_i m^{\#}\). Assumption 7 states that there exists \(m^{\#}\) for which the following two distributions are indistinguishable.
Assumption 8 combines assumptions 4 and 7 and states that there exist \(n^*, m^*\) for which assumption 4 holds. Additionally, there exists \(m^{\#}\) that satisfies assumption 7 with respect to the SPRP of assumption 4.
Theorem 16
Given functions \(O_1\), \(\pi\), \(O_2\) such that:
we can build obfuscator function that satisfies ROI for all reversible circuits.
Let the input circuit be a reversible circuit with each gate in the set of base gates. Theorem 16 is proven by first constructing a random identity generator (RIG). RIG is then used to construct a gate obfuscator (GO) for any base gate. GO is used to obfuscate each gate in the input circuit, after which the obfuscated circuits of the gates are concatenated together using \(O_2\). The resulting obfuscated circuit satisfies ROI.
A random identity generator (RIG) is an \(m^{\#}\)-separable ROI obfuscator for the identity permutation with inner stretch \(2m^{\#}\). It is used to (efficiently) sample an obfuscated circuit that computes the identity permutation on \(n^*\) input wires.
Sampling a circuit that computes the identity permutation with \(2m^{\#}\) gates is not efficient in practice. RIG uses \(O_1\) and assumption 8 to sample such a circuit efficiently in the following way:
It is clear that circuit \(C' | (C'')^{\dagger}\) computes an identity permutation and has inner stretch \(2m^{\#}\). Step 2 justifies why \(O_1\) should satisfy property 1 of RIO. That is, \(C'\) and \(C''\) must not be co-relatable even though they are obfuscations of the same input circuit \(C\).
Assumption 8 is used to prove that (informally) \(C' | (C'')^{\dagger}\) is indistinguishable from a random circuit \(I\) with \(2m^{\#}\) gates that computes the identity permutation.
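A toy sketch of the RIG construction (hedged: `o1` and `sample_random_circuit` are stand-in interfaces of our own, not the paper's; since base gates are self-inverse, inverting a circuit is just reversing its gate order):

```python
def invert(circuit):
    # base gates are their own inverses, so the inverse circuit is the
    # same gates in reverse order
    return list(reversed(circuit))

def rig(o1, sample_random_circuit):
    """Sample C at random, obfuscate it twice with o1, and output
    C' | (C'')†, which computes the identity permutation."""
    c = sample_random_circuit()
    c1, c2 = o1(c), o1(c)
    return c1 + invert(c2)
```

With a trivial (insecure) `o1 = lambda c: list(c)`, the output is simply \(C | C^{\dagger}\), which evaluates to the identity on every input; a real \(O_1\) instead outputs a longer, re-randomized, functionally equivalent circuit.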
A gate obfuscator (GO) is an \(m^{\#}\)-separable ROI obfuscator with inner stretch \(2m^{\#}\) for a single base gate \(\beta\). It is built using RIG.
To obfuscate a base gate \(\beta\):
The resulting circuit \(C\) is an obfuscated circuit for base gate \(\beta\).
Step 1 of GO needs to be efficient and, with the following modifications, indeed it is. Observe that a circular rotation of a reversible circuit that computes the identity permutation still computes the identity permutation. Hence, modify the algorithm to find the first gate that has the same control function as the base gate \(\beta\) and circularly rotate the circuit so that the found gate is the first gate. If base gate \(\beta\) has control wires \(\omega_0, \omega_1\) and target wire \(\omega_2\), and the first gate has control wires \(\omega_0', \omega_1'\) and target wire \(\omega_2'\), swap wire \(\omega_0'\) with \(\omega_0\), \(\omega_1'\) with \(\omega_1\), and \(\omega_2'\) with \(\omega_2\) in \(C\), and output the resulting circuit.
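The rotation trick rests on a simple fact: if \(C = A|B\) computes the identity then \(B = A^{\dagger}\), so the rotated circuit \(B|A = A^{\dagger}|A\) is also the identity. A small check (the `(a, b, c, phi)` gate encoding is our own, not the paper's):

```python
from itertools import product

def eval_circuit(gates, state):
    # gates are (control a, control b, target c, phi truth table)
    state = list(state)
    for (a, b, c, phi) in gates:
        state[c] ^= phi[2 * state[a] + state[b]]
    return tuple(state)

g1 = (0, 1, 2, (0, 0, 0, 1))
g2 = (1, 2, 0, (0, 1, 1, 1))
identity_circuit = [g1, g2, g2, g1]  # palindrome of self-inverse gates

# every circular rotation still computes the identity permutation
for k in range(4):
    rotated = identity_circuit[k:] + identity_circuit[:k]
    for x in product([0, 1], repeat=3):
        assert eval_circuit(rotated, x) == x
```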
After obfuscating each gate in the input circuit \(C\) using GO, we solder the results together into a final obfuscated circuit using the "append and solder" operation.
To solder two obfuscated circuits output by GO, \(\Gamma_1\) and \(\Gamma_2\): take the \(m^{\#}\)-gate suffix of \(\Gamma_1\) and the \(m^{\#}\)-gate prefix of \(\Gamma_2\), feed their concatenation to \(O_2\), and output the following:
\[\Gamma_{1, [1, m^{\#}]} | O_2(\Gamma_{1, (m^{\#}, *]} | \Gamma_{2, [1, m^{\#}]}) | \Gamma_{2, (m^{\#}, *]} \]
We feed \(\Gamma_{1, (m^{\#}, *]} | \Gamma_{2, [1, m^{\#}]}\) to \(O_2\), which explains why \(O_2\) must have input ensemble \(\pi(C_1) | \pi(C_2)^{\dagger}\) where \(C_1, C_2 \leftarrow_{R} \mathbb{C}_{n^*, m^*}\). It's because GO is \(m^{\#}\)-separable and, hence, the \(m^{\#}\)-gate suffix of \(\Gamma_1\) and the \(m^{\#}\)-gate prefix of \(\Gamma_2\), where the first is the output of \((O_1)^{\dagger}\) and the second is the output of \(O_1\), are post-processed versions of random circuits sampled from \(\mathbb{C}_{n^*, m^*}\).
Assuming \(O_1\) is secure, sub-circuits \(\Gamma_{1, [1, -m^{\#}]}\) (resp. \(\Gamma_{2, (-m^{\#}, *]}\)) are oracle accesses to the \(m^{\#}\)-gate prefix (resp. inverse of the \(m^{\#}\)-gate suffix) of the random identity circuit sampled by GO for \(\Gamma_1\) (resp. \(\Gamma_2\)). Hence, \(O_2\) must satisfy property 2 of RIO.
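On gate lists, the append-and-solder operation is straightforward to sketch (hedged: `o2` is a stand-in for the \(O_2\) interface):

```python
def solder(gamma1, gamma2, m_sharp, o2):
    """Output gamma1's m#-gate prefix, then O2 applied to
    (gamma1's suffix | gamma2's m#-gate prefix), then gamma2's suffix."""
    middle = o2(gamma1[m_sharp:] + gamma2[:m_sharp])
    return gamma1[:m_sharp] + middle + gamma2[m_sharp:]
```

With the trivial `o2 = lambda c: c`, soldering reduces to plain concatenation \(\Gamma_1 | \Gamma_2\), which makes the functional correctness of the operation easy to see.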
Let \(C\) be the input reversible circuit with gates \(\gamma_1 | \gamma_2 |... |\gamma_m\).
The words local mixing in the title are due to the approach the authors introduce to build an RIO that satisfies properties 1 and 2 simultaneously.
The authors first observe that constructing an RIO that provably satisfies property 1 and runs in polynomial time is hard because it implies coNP = NP (Question: how so? In particular, why does having a poly-size witness of the transformation from a circuit \(C_1\) to a functionally equivalent, same-size circuit \(C_2\) imply coNP = NP?). Which is why they settle for an algorithm that assures \(C_1\) and \(C_2\) are not efficiently discernible.
Satisfying property 1 of RIO requires transforming the input circuit with \(m^{*}\) gates into a functionally equivalent circuit with \(m^{\#}\) gates. To do so, the algorithm first introduces redundancy into the circuit by (1) randomly isolating small sub-circuits and (2) replacing them with bigger functionally equivalent circuits. After this, the resulting circuit will have \(m^{\#}\) gates. However, the resulting circuit can easily be reversed if care is not taken to hide the traces of the introduced redundancy. This is achieved by again randomly selecting sub-circuits and replacing them with functionally equivalent but different circuits of the same size. The hope is that the latter stage, since sub-circuits are randomly chosen, diffuses the introduced redundancy over larger and larger parts of the circuit. The first stage is referred to as "inflationary" and the second stage as "kneading".
The inflationary stage and the kneading stage only differ in the size of the circuits being replaced. Otherwise, both stages take a local, independent sub-section of the circuit and replace it with a different functionally equivalent circuit. A single step in either stage is referred to as a local mixing step.
The step is divided into 3 main parts:
Observe that naive selection of \(\ell^{out}\) consecutive gates in \(C\) and replacing them with \(C^{in}\) can easily be reversed if most gates in \(C^{out}\) are commutative (i.e. swappable). It's because, with high probability, \(C^{in}\) will be some permutation of the gates present in \(C^{out}\) with pairs of random base gates and their inverses padding up to \(\ell^{in}\) gates. For example, \(\gamma_1, ..., \gamma_{\ell^{out}}, \beta, \beta^{-1}\) where \(\gamma_1, ..., \gamma_{\ell^{out}}\) are \(C^{out}\) gates and \(\gamma_1, ..., \gamma_{\ell^{out}}, \beta, \beta^{-1}\) are \(C^{in}\) gates.
Skeleton graph:
In order to avoid the obvious attack, the reversible circuit is first converted to a skeleton graph which only stores dependencies among the gates. In the skeleton graph each gate is a node, and there exists an edge between two gates \(i \lt j\) iff gate \(i\) collides with gate \(j\) and there exists no gate \(k\), \(i \lt k \lt j\), that collides with both gate \(i\) and gate \(j\).
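A minimal sketch of skeleton-graph construction (hedged assumptions: gates are encoded as `(controls, target)`, and two base gates "collide", i.e. fail to commute, when the target of one is among the other's controls — the paper's exact collision predicate may differ):

```python
def collides(g1, g2):
    # assumed collision rule: the target of one gate is a control of
    # the other, so the two gates cannot be swapped
    (c1, t1), (c2, t2) = g1, g2
    return t1 in c2 or t2 in c1

def skeleton_graph(gates):
    """Edge (i, j), i < j, iff gates i and j collide and no gate k,
    i < k < j, collides with both."""
    edges = set()
    for i in range(len(gates)):
        for j in range(i + 1, len(gates)):
            if collides(gates[i], gates[j]) and not any(
                collides(gates[i], gates[k]) and collides(gates[k], gates[j])
                for k in range(i + 1, j)
            ):
                edges.add((i, j))
    return edges
```

For example, with gates `((0, 1), 2)`, `((2, 3), 4)`, `((4, 0), 1)`, gate 0 collides with gate 2, but gate 1 sits between them and collides with both, so only the edges (0, 1) and (1, 2) survive.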
Selecting \(C^{out}\) at random:
Observe that a convex subgraph with \(\ell^{out}\) nodes in the skeleton graph is locally replaceable. That is, the convex subgraph can be swapped out for a functionally equivalent circuit \(C^{in}\) with \(\ell^{in}\) gates.
Selecting a random \(C^{out}\) simply requires selecting a convex subgraph with \(\ell^{out}\) nodes at random.
A simple brute-force approach to find a circuit \(C^{in}\) with \(\ell^{in}\) gates that is functionally equivalent to \(C^{out}\) is feasible as long as \(\ell^{in}\) is small enough. In our implementation, \(\ell^{in}\) is fixed to 4.
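The brute-force search can be sketched as rejection sampling (hedged: gate encoding `(a, b, c, phi)` and the sampler are our own illustration, not the implementation's exact method; it is only feasible because the number of active wires `w` is tiny, so exhaustive evaluation over all \(2^w\) inputs is cheap):

```python
import random
from itertools import product

def eval_circuit(gates, state):
    state = list(state)
    for (a, b, c, phi) in gates:
        state[c] ^= phi[2 * state[a] + state[b]]
    return tuple(state)

def random_gate(w):
    a, b, c = random.sample(range(w), 3)
    phi = tuple(random.randint(0, 1) for _ in range(4))
    return (a, b, c, phi)

def find_replacement(c_out, w, l_in, max_iters=1_000_000):
    """Rejection-sample an l_in-gate circuit over w wires computing the
    same permutation as c_out, checked by exhaustive evaluation."""
    inputs = list(product([0, 1], repeat=w))
    target = [eval_circuit(c_out, x) for x in inputs]
    for _ in range(max_iters):
        cand = [random_gate(w) for _ in range(l_in)]
        if [eval_circuit(cand, x) for x in inputs] == target:
            return cand
    return None
```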
Since \(C^{out}\) is a convex subgraph, the predecessors of \(C^{out}\) and the successors of \(C^{out}\) are disjoint. At a high level, the replacement operation rewrites the circuit such that the gates of \(C^{out}\) are placed consecutively (i.e. predecessors of \(C^{out}\) are placed before \(C^{out}\) and successors of \(C^{out}\) after it). It then replaces \(C^{out}\) with \(C^{in}\) and updates the skeleton graph with the necessary edges.
To update the skeleton graph correctly and efficiently, two additional auxiliary data structures are required: direct incoming connections (DIC) and direct outgoing connections (DOC).
DIC and DOC are maps from gates to sets. For any gate \(j\), DIC[j] is the set of gates \(\lt j\) that collide with \(j\). For any gate \(i\), DOC[i] is the set of gates \(\gt i\) that collide with \(i\). Note that any two gates \(i \lt j\) have an edge in the skeleton graph iff they collide and DOC[i] and DIC[j] are disjoint.
Following is pseudocode of the replacement algorithm:
def inplace_topological_sort(nodes: &mut [Nodes]):
    '''
    Topologically sort the nodes in-place as per SG(C)
    '''

def find_all_predecessors(convex_set: [Nodes]):
    '''
    Finds the union of predecessors of the nodes in the input convex_set
    '''

def find_all_successors(convex_set: [Nodes]):
    '''
    Finds the union of successors of the nodes in the input convex_set
    '''

def replace_cout_with_cin(C^out, C^in, DOC, DIC):
    '''
    Replaces C^out with C^in
    '''
    # Since C^out is convex, `all_predecessors` and `all_successors` must be disjoint.
    all_predecessors = find_all_predecessors(C^out)
    all_successors = find_all_successors(C^out)
    # Find the nodes that are neither predecessors nor successors of C^out
    all_outsiders = all_nodes(SG(C)) - all_predecessors - all_successors - C^out
    inplace_topological_sort(all_predecessors)
    inplace_topological_sort(all_successors)

    # Update DOC and DIC with C^in. And update SG(C) with new edges.
    for gate i in C^in:
        # Direct outgoing connections within C^in
        for gate j in C^in[i+1, ]:
            if i collides with j:
                insert j in DOC[i]
                insert i in DIC[j]
        # Direct outgoing connections from C^in to C^out's successors
        for gate j in all_successors:
            if i collides with j:
                insert j in DOC[i]
                insert i in DIC[j]
        # Treat all outsiders as successors
        for gate j in all_outsiders:
            if i collides with j:
                insert j in DOC[i]
                insert i in DIC[j]

    for gate i in C^in:
        reverse(for gate j in all_predecessors):
            if i collides with j:
                insert i in DOC[j]
                insert j in DIC[i]
                # Remove unnecessary edges in SG(C).
                # That is, remove any edge from j to any node in DOC[i].
                # This is because *now* there exists a node between j and i's
                # direct connection which collides with both: the gate i itself.
                for node in intersection(DOC[j], DOC[i]):
                    SG(C).remove_edge_if_exists(j, node)
                # Add an edge in SG(C) following the rules of the skeleton graph.
                # That is, iff DOC[j] and DIC[i] are disjoint
                if is_disjoint(DOC[j], DIC[i]):
                    SG(C).add_new_edge(j, i)

    # Find union of DOCs/DICs of C^out (excluding any vertex in C^out)
    let union_cout_docs = union(DOC[node] for node in C^out) - C^out
    let union_cout_dics = union(DIC[node] for node in C^out) - C^out
    # Delete vertices in C^out from DOCs, DICs, and SG(C)
    for node in C^out:
        delete from SG(C)
        delete from DOC including from the internal sets
        delete from DIC including from the internal sets
    # Re-connect missing links
    for direct_inc_predecessor in union_cout_dics:
        for direct_out_successor in union_cout_docs:
            if is_disjoint(DOC[direct_inc_predecessor], DIC[direct_out_successor]):
                SG(C).add_new_edge(direct_inc_predecessor, direct_out_successor)
Strategy 1 is simpler than strategy 2. However, the authors claim both strategies are equivalent. Strategy 2 is the same as the one we've seen before: an inflationary stage followed by a kneading stage. In the inflationary stage \(\ell^{out} < \ell^{in}\). In the kneading stage \(\ell^{out} = \ell^{in} = \ell^{knd}\), with \(\ell^{knd} \geq \ell^{in}\) of the inflationary stage.
Strategy 1 does not differentiate between the inflationary and kneading stages. Instead, at each local mixing step it randomly samples \(\ell^{out}\) from the range \([2, \ell^{in}]\).
For strategy 2 the authors set \(m^* = n\log^4{n}\) and \(m^{\#} = n^3\). They run the inflationary stage for \(n^4\) iterations, and the kneading stage for \(n^5\) iterations.
However, for strategy 1 the authors drastically cut down the iterations to \(O(m\kappa)\) where \(\kappa \approx n\). For example, for \(n = 128\) only \(\approx 460K\) iterations are required for security parameter \(\kappa = 128\).
The paper formulates attacks on Feistel block ciphers using out-of-time-order correlators (OTOCs) and entropy. Decay of OTOCs signals chaos in quantum systems, and saturation of entropy limits the information that can be extracted by an adversary. They show that for insecure instantiations of a Feistel block cipher with 1, 2, or 3 rounds, the OTOC does not vanish to very small values and there is significant residual entropy (i.e. distance from maximum entropy). However, with 4 rounds OTOCs vanish exponentially while entropy saturates.
Based on the above evidence they construct a 3-stage reversible circuit with \(n\log{n}\) 3-bit permutation gates for which OTOCs vanish exponentially and entropy saturates. Hence, the 3-stage cipher can be claimed to be a strong pseudo-random permutation secure against an adversary with a polynomial no. of queries (they further claim that the reversible circuits are also secure against quantum adversaries).
The 3-stage construction consists of 1 inflationary stage with \(\log{n}\) layers, followed by 1 proliferation stage with \(\log{n}\) layers, and another inflationary stage with \(\log{n}\) layers. The inflationary stages only consist of gates from the set of 144 inflationary gates. The proliferation stage only consists of gates from the set of 10K proliferatory gates (however, one of the authors mentioned in this video that any random 3-bit permutation gate would do as a proliferatory gate).
Let \(n'\) be a power of 3 \(\geq n\). For each layer, divide \(n'\) into sets of triplets such that the collection of layers within a stage forms a hierarchical tree structure. For example, l1 = (0, 1, 2) (3, 4, 5) (6, 7, 8)… and l2 = (0, 3, 6) (1, 4, 7) (2, 5, 8)… etc. Assign a random inflationary/proliferatory gate to each triplet. Let \(\pi\) be a random surjective map from \(n'\) to \(n\). Apply \(\pi\) to the elements of each triplet in every layer (if \(n' = n\), the map is 1-to-1; if \(n' > n\) then repetitions are inevitable) and output the reversible circuit for the stage. Repeat the procedure for all 3 stages and output the concatenation of the 3 reversible circuits.
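The hierarchical triplet layers can be generated with base-3 strides (a sketch under the assumption that layer \(d\) groups indices that differ only in base-3 digit \(d\); this matches the l1/l2 example above):

```python
import math

def stage_layers(n_prime):
    """Return log3(n_prime) layers, each partitioning range(n_prime)
    into triplets; successive layers form a hierarchical tree."""
    depth = round(math.log(n_prime, 3))
    layers = []
    for d in range(depth):
        stride = 3 ** d
        layer = []
        for base in range(0, n_prime, 3 * stride):
            for off in range(stride):
                layer.append((base + off,
                              base + off + stride,
                              base + off + 2 * stride))
        layers.append(layer)
    return layers
```

For \(n' = 9\) this produces exactly the two layers in the example: `(0, 1, 2) (3, 4, 5) (6, 7, 8)` and `(0, 3, 6) (1, 4, 7) (2, 5, 8)`.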
13oct
15 oct
17 oct
18 oct
I noticed for the case \(\ell^{out} = 2\) (or even 3/4) and \(\ell^{in} = 5\) (or honestly any other value \(> \ell^{out}\)) the \(C^{in}\) contains all gates of \(C^{out}\). I do not know whether this is obvious. But I expect it to make attacks easier if the redundancy is not spread out properly in the kneading stage.
Let's assume the redundancy is not spread out in the kneading stage. Then with high probability there exists a convex subcircuit with \(\ell^{in}\) gates and \(\omega^{out}\) active wires (we assume that the active wires of \(C^{in}\) and \(C^{out}\) are unchanged). To reverse the mixing, the attacker first identifies a sub-circuit with \(\ell^{in}\) gates and \(\omega^{out}\) active wires and tries to find a replacement circuit with \(\ell^{out}\) gates. If the attacker has no other information the attack will take in the worst case \(O(2^{2^{\omega^{out}}})\) iterations. But if the attacker knows that \(C^{in}\) contains the gates of \(C^{out}\) with high probability, then the attacker only needs to try \({}^{\ell^{in}}P_{\ell^{out}}\) possibilities (i.e. the no. of ways to select \(\ell^{out}\) gates from \(\ell^{in}\) gates in order).
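For concreteness, comparing the two search spaces with the hypothetical numbers above (\(\ell^{in} = 5\), \(\ell^{out} = 2\), and say \(\omega^{out} = 6\); the \(2^{2^{\omega}}\) form is taken from the note above):

```python
from math import perm

l_in, l_out, w_out = 5, 2, 6
with_hint = perm(l_in, l_out)    # ordered selections of l_out gates from l_in
brute_force = 2 ** (2 ** w_out)  # generic search over the active wires
```

Here `with_hint` is 20 while `brute_force` is \(2^{64}\), which is the gap the note is worried about.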
Below are a few samples of C^out and C^in.
C^out
---------------
0 1 2 3 4 5 6
---------------
x I x x O I x
x x I I x O x
I x x O x x I
---------------
C^in
---------------
0 1 2 3 4 5 6
---------------
x x I I x x O
x I x x O I x
x x I I x O x
x x I I x x O
I x x O x x I
---------------
C^out
---------------
0 1 2 3 4
---------------
x O I x I
I x x I O
---------------
C^in
---------------
0 1 2 3 4
---------------
x O I x I
x I x I O
I O x I x
x I x I O
I O x I x
---------------
Probably this attack can be avoided if \(\omega^{in} > \omega^{out}\) (i.e. \(C^{in}\) has more active wires than \(C^{out}\)). But I don't know whether this is necessary. Or shall we safely assume that the kneading stage will spread the complexity gap throughout really well?
Oct 24
blah
I started to test blah for different graph sizes and there's a clear difference in runtime (note that blah is called repeatedly with a new random source node until it finds a convex subset). For m=40K on average it takes 90ms (although per-call runtime depends on the node; averaged over 1000 runs). For m=20K on average it takes 23.5ms. For m=10K it takes 6ms. For m=5K it takes 1.5ms.
Oct 25
Oct 26
Oct 27
Oct 28
Not a single call to local_mixing returned false with max convex iterations set to 10K and max replacement iterations set to 1M. Which means not a single call to find_convex_subcircuit or find_replacement returned false. This is not surprising at the inflationary stage. It will be nice to check whether this holds true for the kneading stage.
The majority of time per iteration is spent on 2 things:
(1) find convex subcircuit: roughly 85%.
(2) find all predecessors: roughly 10-12%.
Question: How does "find convex subcircuit" time decrease with the no. of cores?
Question: Why is "find all predecessors" more expensive than "find all successors"? The difference is quite a bit: the latter is in double-digit microseconds whereas the former is in double-digit milliseconds. Or am I mistaken?
Find replacement circuit: takes 500us. Not surprising given we're still at the inflationary stage.
Oct 29
Oct 30
Oct 31
The new algorithm has the following main functions:
I wanted to narrow down the scope of parallelisation.
Finding topologically sorted predecessors and successors can easily be parallelised with what we have at the moment. We simply find all predecessors (using parallelised DFS) and then sort them as per graph levels. The same logic applies for topologically sorted successors.
Points (4) and (5) are pretty trivial to parallelise as well. But they happen to be very cheap. We should prioritise parallelising other things before we tackle 4 and 5.
The new algorithm seems to work well btw. "Find convex sub-circuit" is roughly 3-4ms after 5000 inflationary local mixing steps whereas with the old version it is 50ms.
The main bottleneck is still calling the routine per immediate predecessor. For a few predecessors it takes 100s of milliseconds, so I don't think we should parallelise it per predecessor. Moreover, if we parallelise per predecessor we will be visiting all predecessors again and again, which we don't need to. I also suspect the majority of the time is spent in the has_path_connecting function, which makes me think we should instead parallelise has_path_connecting across an intersection set.
It could be the case that as the graph grows a single invocation of has_path_connecting may take quite a bit of time. I don't know how has_path_connecting behaves with big graphs. I will test it now.
At the moment, I'm parallelising has_path_connecting across the intersection set and checking whether it gives any significant speed-ups.
If parallelising has_path_connecting across the intersection set gives significant speed-ups and the runtime of has_path_connecting does not degrade significantly on larger graphs on average, then we're good to go. Otherwise, we will have to implement a parallel variant of has_path_connecting (which should be easy given we already have parallel DFS).
Then we will have to figure out some heuristic for when to use parallel has_path_connecting per node in an intersection set vs. parallelising has_path_connecting across nodes in the intersection set. A naive heuristic could be: for small graphs always parallelise across the set; for big graphs, if the intersection set is small use the parallel variant of has_path_connecting, otherwise parallelise across the intersection set.
Nov 4
We modified the main program to run two different types of obfuscation strategies. The first is a rather simple variant of the second strategy, which the authors claim to use just for exposition purposes. Intuitively, if you believe in strategy 2 you may believe in strategy 1 (but trusting strategy 2 is the critical piece here). Anyways, following are the two strategies (I assume n <= 128).
For obvious reasons, by default I prefer strategy 1. And if the authors claim that both are equivalent, then, in practice, assuming local mixing is secure, strategy 1 makes sense.
We sampled a random block cipher reversible circuit with n=64 (i.e. no. of input bits) and obfuscated the circuit by running local mixing 4000 times (for testing purposes). We repeated the experiment 20 times and observed the following -
Nov 5
Yesterday I observed that as the obfuscation procedure reaches roughly 10K iterations, C^out deletion occasionally starts to take a lot of time. For example, I noticed iterations where it took 40s. So I figured I should fix it (for obvious reasons; remember we need to run 250K iterations). I've created an issue on petgraph just to check what the maintainers of petgraph think.
My first attempt is to never delete C^out. Instead, maintain a map of removed nodes and truncate any graph-related operation (i.e. DFS etc.) the moment it reaches any node in the set of removed nodes. Intuitively this should work. But at the moment this naive approach seems to introduce cycles in the graph. I don't know the cause. I suspect it could be the way we're updating the levels of the graph.
Let's assume the naive approach of not deleting C^out introduces graph cycles. This means we cannot use any in-built graph algorithms and hence will have to switch to our own implementation of topological sort (we only use the in-built topological sort in our implementation) which completely ignores removed nodes. We already maintain a data structure graph_neighbours that stores the incoming and outgoing edges of each node. For removed nodes, graph_neighbours stores empty lists for incoming and outgoing edges. If any removed node is a neighbour of an active node, then graph_neighbours does not count the edge in the real graph as an edge. Moreover, any algorithm which uses graph_neighbours simply filters out removed nodes. I implemented topological sort by first calculating the level of each active node and then sorting the active nodes as per the calculated level. Turns out the multi-threaded variant of calculating graph levels is somewhat buggy, so I ended up using the single-threaded version. After which obfuscation seems to run, but tests are still under way.
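The level-based topological sort over active nodes can be sketched as follows (hedged: `graph_neighbours` here is a plain dict mapping node -> (incoming, outgoing) lists, a stand-in for our actual data structure; single-threaded, like the version we ended up using):

```python
from collections import deque

def topo_sort_by_levels(graph_neighbours, removed):
    """Topologically sort active nodes by graph level, ignoring any
    node in `removed`. Level of a node = 1 + max level among its
    active predecessors; a stable sort by level is a valid topological
    order because every edge strictly increases the level."""
    active = [v for v in graph_neighbours if v not in removed]
    indeg = {v: sum(1 for u in graph_neighbours[v][0] if u not in removed)
             for v in active}
    level = {v: 0 for v in active}
    q = deque(v for v in active if indeg[v] == 0)
    order = []
    while q:
        u = q.popleft()
        order.append(u)
        for w in graph_neighbours[u][1]:
            if w in removed:
                continue
            level[w] = max(level[w], level[u] + 1)
            indeg[w] -= 1
            if indeg[w] == 0:
                q.append(w)
    return sorted(order, key=lambda v: level[v])
```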
We had one obfuscation running with the "direct skeleton graph node-deletion" codebase (the slow one) for some time. It was supposed to run for 50K iterations but I had to kill it because node deletion in the skeleton graph starts to take >30s in some iterations. I just checked its trace. Following are a few observations:
"Find all predecessors" can be sped up with more cores. "Process predecessors" is already parallelised across C^in gates (which are set to 4). We can further parallelise it, but doing so will significantly increase complexity and I'm not sure how worthwhile it would be.
"Remove edges" comprises two parts. The first is finding which edges to remove (which is parallelised). The second is removing the edges. I figured "finding" which edges to remove takes roughly 97% of the time.
Nov 7
Adding edges is slow because we use the update_edge function, which avoids multiple edges between the same two nodes: it only adds an edge if none exists between the source and the target (duplicate edges would increase the runtime of graph traversal algorithms). This takes O(k) time where k is the no. of edges connected to the source. If we were to use add_edge instead, it adds an edge irrespective of whether one exists and takes O(1) time. I suspect that as the graph grows in size, the no. of edges per node grows as well. Hence, bigger graphs, say the one after 50K iterations, take quite a while to add an edge (in some cases more than a few seconds!). I think getting around this problem should be easy: maintain a cache of edges that exist in the graph, keep the cache in sync with the graph, and only add an edge (with add_edge) if it does not already exist. This should take roughly O(l) time where l is the no. of new edges to add, whereas using update_edge takes O(kl) time to add l edges.
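The edge-cache idea can be sketched with a toy adjacency structure (hedged: this is an illustration of the fix, not petgraph itself):

```python
class CachedEdgeGraph:
    """Adjacency lists plus a set of existing edges kept in sync, so an
    edge insert is an O(1) membership check + append instead of an
    update_edge-style O(k) scan over the source's edges."""
    def __init__(self):
        self.edges = set()
        self.adj = {}

    def add_edge(self, u, v):
        # only add if absent, mimicking update_edge's semantics in O(1)
        if (u, v) in self.edges:
            return False
        self.edges.add((u, v))
        self.adj.setdefault(u, []).append(v)
        return True
```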