---
title: Silicon Salon III Notes 2023-01-18
tags: Notes
---

# Silicon Salon Notes 2023-01-18

#### 2023-01-18 9am PDT

Slides (follow along): https://hackmd.io/mG1lQ2mkTKSqoz4ELYuXHg?view
These Notes: https://hackmd.io/1u3aWfCjQ_aqXc36rdADWQ?edit
Overflow notes (if we reach the limit): https://hackmd.io/5J57Ib_9QYe_cpj2kw-2IA?edit

## Attendees

NOTE: This document is public. Don't add your name here, or any private contact details, if you don't want to be listed as a participant in the final output from this Silicon Salon. [Chatham House rules apply](https://hackmd.io/@bc-silicon-salon/rkxbd6rFw9#/14) for all quotes.

* Christopher Allen (@ChristopherA)
* Bryan Bishop (@Kanzure)
*

## Notetaking

# Introduction

Intro: Christopher Allen

We are taking live notes. In particular, Bryan Bishop, who is an amazing transcriber of these types of events, is also going to be taking notes. There will be good solid notes here. We are hoping these salons will be a quarterly event on the intersection of hardware, wallets, silicon design, security, and blockchain. Again, this is a collaborative session. I don't see anyone here who has logged in via phone, but there is a way to follow from a phone. I encourage you to follow along with a laptop or smartphone. Many thanks to Bryan Bishop for his live transcription as we go along.

What is Blockchain Commons? We are a community interested in self-sovereign control of digital assets, whether those digital assets are cryptocurrencies or some form of token or digital identity or other kinds of authentication and authorization online. We really feel that our mission is to bring together people to collaboratively develop interoperable infrastructure for these problems. We don't want one-off solutions. We want decentralized solutions where everyone wins. We work hard to be a neutral not-for-profit organization working to help people control their digital destiny.

I am Christopher Allen. I am co-author of the TLS international standard and co-author of the new decentralized identifier standard at W3C. I am architecting a variety of privacy technologies, usually higher up in the stack than hardware. Part of my motivation for hosting these events through Blockchain Commons is that there's only so much we can do at our level of the stack. We need more security lower down.

What is Silicon Salon? What is it that the wallet companies need? What do the semiconductor designers need? What would be useful for academics to look into? We have three very different presentations today that are nonetheless related, from very different parts of our community. Once we have these requirements, what we want to do is collaboratively engineer interoperable specifications. We hope to get to that later on in the year, which means things like: what does it mean to be "open"? What can we do to support collaboration in engineering? We then want to evangelize this to the ecosystem and create reference code, APIs, libraries, and test suites to support what we are trying to understand here today at this salon. We have semiconductor designers such as Bunnie Studios, Tropic Square, and Crossbar; various hardware wallet manufacturers like Foundation and Proxy; and Validating Lightning Signer. We also have Unchained Capital, Bitmark, Chia, some bitcoin supporters, and also some advocacy organizations that care about security like Human Rights Foundation and Rebooting Web of Trust, plus academics, protocol developers, and cryptographers.
At our previous Silicon Salon, we focused on secure boot, firmware upgrade issues, and supply chain security. If you missed those, there are videos and transcripts on siliconsalon.info. What we're trying to explore today is silicon-logic-based cryptographic functionality. What are the opportunities for semiconductor acceleration of cryptography, as MPC and ZKP technologies are beginning to be deployed? They have special needs that we're interested in.

Our process today is that we will open with three presentations, where we limit the Q&A to questions specific to the presentation. We will take a brief break, and then go into a facilitated Q&A. We have some amazing participants today. Let's try to dive into the challenges and requirements of the future. Then, let's focus on what the next steps are for collaboration, including what the other topics should be and who else we should ask to participate in April when we hold our next Silicon Salon.

For those who are new to Blockchain Commons events: Chatham House Rules apply. You are free to use all information shared here, but when you quote, do not reveal the identity or affiliations of the speakers. We will record the presentations for YouTube and put the presenters' slides online, but the other discussions are only being recorded to produce an anonymized summary. It will include quotes but not names. Our goal here is partly to have the freedom of being open and honest about these topics, which sometimes means the right to not be taken out of context. Some companies involved in these discussions are public companies, and we have to be careful.

We will start with silicon and MPC requirements by Jo, with Crossbar/Cramium. Bunnie will then talk about a more open secure element chip. Kavya will be talking about XGCD in hardware. We're really looking forward to these presentations.

# Cramium Labs

Sung Hyun Jo is a co-founder of Crossbar, and has done work in non-volatile memory technologies and ReRAM. He holds many patents, with many citations, and got his doctorate from the University of Michigan.

Hi everyone, I am Sung Hyun Jo from Cramium and Crossbar. Today our presentation is about a secure processing unit. Before we start, let me first explain who we are. We are a subsidiary of Crossbar. Crossbar makes resistive memory (ReRAM) technology, superior to flash memory in terms of scalability, reliability, and security. We will discuss that in a moment. We have this in production at 28nm, and we're currently working with a customer on implementing resistive memory at 12nm. We have more than 300 patents, and we were named one of the top 200 semiconductor companies as rated by IEEE.

You have probably heard about decentralized computation, decentralized key management, and multi-party computation. Traditionally, the approach has been single-key-based security schemes, used in many applications like managing digital assets, making transactions, and digital certificates. This single key pair approach made sense when there was not enough computing power, or when there was no good algorithm for decentralized computing. These days, you have a good amount of computing power, and there has been lots of progress in multi-party computation (MPC). The single key scheme has become more risky. You could use multiple devices or multiple parties to manage a digital asset, its security, or succession of ownership. What about protection from a single point of failure?
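To make the distributed-key idea concrete, here is a toy sketch (not any real MPC protocol; actual schemes like GG18 or CGGMP21 work over elliptic curves and wrap every step in commitments and zero-knowledge proofs): each party keeps an additive share of the secret, publishes only a value derived from it, and the group public key is assembled from those public parts.

```python
# Toy additive key sharing in a multiplicative group mod p. Illustrative only:
# the group is far too small for real use, and no real protocol rounds,
# commitments, or proofs are shown.
import secrets

p = 2**127 - 1   # a Mersenne prime; toy group modulus
g = 3            # fixed base element
q = p - 1        # exponents work mod the group order (Fermat's little theorem)

# Key generation: each of three parties samples its own share locally.
shares = [secrets.randbelow(q) for _ in range(3)]

# Each party publishes only g^share mod p; the full secret never exists
# in any single place.
public_parts = [pow(g, x, p) for x in shares]

# Anyone can multiply the public parts together to get the group public key...
group_pub = 1
for y in public_parts:
    group_pub = (group_pub * y) % p

# ...which equals g^(sum of shares), even though no party ever held that sum.
assert group_pub == pow(g, sum(shares) % q, p)
print("group public key:", hex(group_pub))
```

There is no single point of failure here: compromising one device yields one share, not the key.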
Unfortunately there has not been much support from semiconductor manufacturers for MPC and other such schemes. We identified this problem, and Cramium was formed to help solve this challenge. There are some popular MPC protocols like GG18, Lindell17, and CGGMP21. These papers have been widely cited. Silicon or hardware support should be able to help these rapidly evolving protocols.

I'd like to show a limitation of software-only approaches. Here, I used the CGGMP21 library from Taurus Group. In this example, during the key generation phase, each device or member creates their own secret and collectively computes a group public key, which can be used to verify a group signature. Typically the key generation phase requires multiple rounds, like 4 or 5. In each round, everyone computes and verifies the data they receive from others. It takes as few as 3 rounds but maybe as many as 10. Although libraries like CGGMP21 or GG18 can in principle handle an unlimited number of parties, in practice they are limited to only a few. In this example, I show 3 or 4 parties involved. Key generation here already takes several seconds, or even more than 10 seconds. This is one of the reasons why software-only solutions are not good enough, and one of the challenges of distributed key management and multi-party computation.

What is causing these delays? On the next slide, you can see the breakdown for round 1, round 2, and round 3. Paillier key generation is taking the longest time. ECDSA covers more than 90% of blockchains, and it is not a linear scheme, so we need a homomorphic encryption scheme to make MPC work. Often a proof is required too, so zero-knowledge proofs. So we need hardware for homomorphic encryption, and to be able to address verification in hardware.

The signing phase is similar. Once the keys are generated, multiple parties come together to compute the group signature without revealing their individual secrets. That's what we do in the signing phase. Here again, for CGGMP21: with two parties, yes, maybe only 2 to 3 seconds, but for some applications that might be too long. And say there are 5 devices: that's already 10 seconds. That's unacceptable. This is one of the issues with MPC.

There are bigger challenges, too. Consider the cold storage architecture for bitcoin wallets: the typical structure has a CPU and memory, and neither the CPU nor the memory is necessarily secure. There is no physical countermeasure. We use a secure element, but these are discrete components, and the communication bus between them can be attacked. Due to the lack of semiconductor support, wallet vendors often have to use off-the-shelf components. It's expensive to do this, and these components are probably not secure. Individual components may be secure, but putting them together is not necessarily secure. That's another problem.

Finally, secure elements are not really designed for modern blockchain applications; they are designed for traditional cryptographic applications. As you know, there is a concern that potentially all the NIST curves have backdoors. ... Normally, secure elements have at most about 10 kilobytes to store a single key. Usually, to use a secure element, you have to sign a strict NDA. It's a very centralized approach; it requires trusting what the vendor offers, and there's no visibility and you cannot verify. There are many issues; I haven't even gone through all of them. Cramium wants to address these issues.
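Since Paillier key generation dominates those keygen rounds, a minimal textbook Paillier sketch (stdlib-only Python; toy code with no hardening, and the parameter choices are illustrative rather than taken from any of the libraries above) shows both where the time goes, namely searching for large random primes, and the additive homomorphism that ECDSA MPC protocols rely on:

```python
# Textbook Paillier: keygen (the slow part), encrypt/decrypt, and the
# additive homomorphism Enc(a) * Enc(b) -> a + b. Toy code; not hardened.
import math, secrets, time

def is_probable_prime(n, rounds=40):
    """Miller-Rabin probabilistic primality test."""
    if n < 2:
        return False
    for small in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        if n % small == 0:
            return n == small
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for _ in range(rounds):
        a = secrets.randbelow(n - 3) + 2
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False
    return True

def random_prime(bits):
    # Draw random odd candidates with the top bit set until one is prime.
    while True:
        c = secrets.randbits(bits) | (1 << (bits - 1)) | 1
        if is_probable_prime(c):
            return c

# Key generation: the prime search is what dominates MPC setup time.
start = time.time()
p, q = random_prime(1024), random_prime(1024)   # 2048-bit modulus
n = p * q
lam = math.lcm(p - 1, q - 1)
print(f"keygen: {time.time() - start:.1f}s")    # typically seconds, per party

def encrypt(m):
    r = secrets.randbelow(n - 1) + 1
    return (pow(n + 1, m, n * n) * pow(r, n, n * n)) % (n * n)   # g = n + 1

def decrypt(c):
    u = pow(c, lam, n * n)
    return ((u - 1) // n) * pow(lam, -1, n) % n

# The property MPC wants: multiplying ciphertexts adds the plaintexts.
a, b = 12345, 67890
assert decrypt(encrypt(a) * encrypt(b) % (n * n)) == a + b
```

Every party has to run this kind of prime search during setup, which is why keygen dwarfs the other rounds and why hardware acceleration is attractive.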
Our approach is that, instead of having all these discrete components (memory, CPU, secure storage, and accelerator) connected together on an open bus, all of them are put into a single chip, monolithically integrated. All of them will be protected by the physical countermeasures of the secure chip. We also do not use flash; we use a secure memory, which gives more performance and reliability. We have some slides showing why ReRAM is better compared to flash memory. We do the manufacturing in the leading foundry, TSMC, and we're planning 22nm. Traditional secure elements are at 130nm down to maybe 40nm, so a 22nm advanced node will have lower power consumption and higher performance, and be more secure. The logic will also be much more dense, and thus physically harder to reverse engineer compared to other secure element chips. We are collaborating with Bunnie Studios on RISC-V. We will be focusing on cryptographic primitives that are used by blockchains. We will be taking an open hardware, open-source approach. You will be able to review and audit.

Here are some of the cryptographic primitives that we're planning. For the public key schemes, there will be ECC: ECDSA, Schnorr, and EdDSA on secp256k1, Ed25519/Ristretto, P-256/384, and then RSA. We will offer homomorphic encryption acceleration using the Paillier cryptosystem. Also various hash functions, AES encryption, HMAC/PBKDF2, key agreement like X25519 and others, and of course MPC acceleration.

We aim for a secure yet flexible chip. We will have two operation modes. In the first mode, the MCU issues commands to a state machine in the secure element, and the whole protocol runs inside the secure element, guarded and secure, as an additional layer of defense; secret data is never exposed to the MCU and stays inside the secure element. Only the result is sent back to the MCU. The other mode is more useful when the secure element does not support full protocol acceleration but only provides atomic-level operations. In this mode, the MCU coordinates multiple atomic-level operations, so you can build arbitrary protocols out of atomic-level commands. Secret data, again, is not exposed to the MCU. In new applications, you might have a secure element that doesn't support new functions; but what if you want to be able to use the SE and still be secure? This is our approach.

Let's also compare ReRAM vs flash memory and show why resistive memory is important for security. The physical mechanisms of flash memory and RRAM are different. Flash memory is charge-based: it stores electrons to represent information in binary. Whether or not a lot of electrons are stored represents the bit. In RRAM, however, we use metal ions, and we only need a few atoms. RRAM consists of a top and a bottom electrode with an insulator in the middle. Metal ions can form a conductive bridge between the two electrodes; when there is no bridge, the cell is in the insulating state because of the insulator in the middle. This is how we store information.

The difference is that, as you can guess, storing electrons is extremely challenging. Say we store the electrons in a gate; there are various ways for electrons to escape, like hopping to a defect and escaping from there, or via a sodium ion.... It's impossible to have an ideal insulator. There's always some kind of ion contamination or physical defects.
For this reason, you actually lose a few electrons very easily, even within a few days. When scaling down to 28nm or 22nm, you only have about 100 electrons, but the problem is that one electron escapes every day or two. After six months, there's not much information left, and retention is gone. That's why there is no flash memory below 28nm. All the secure elements and conventional chips are 40nm and up, because there's no practical memory solution. Resistive memory is based on only about 3 atoms, so you can scale down below 20nm; in fact they are already working on 12nm.

Resistive memory is inherently secure against invasive attacks like optical attacks. Here is an actual device fabricated in a foundry at 28nm; this is an individual memory cell, and we program this device. In the images shown here, we cannot even see which device is programmed to 0 or 1. The information is stored in only a few atoms. Even with the most advanced imaging techniques, it's basically impossible to distinguish between the cells. RRAM is protective against invasive attacks.

Another popular attack is an optical injection attack. RRAM provides good resistance to this mode of attack. ..... You might inject light into the wafer, and the metal layers can deflect the light. Metal can be so dense that light can't go through, but this is not the case for the silicon wafer itself. You can direct light through the silicon and see what's going on at the transistor level. ROM, flash, and others are built in those layers, so you can see through from behind the chip. These days it only takes a few minutes to get the contents of an SRAM this way. RRAM doesn't have this problem: it's built in the metal layers, not the wafer layer. It sits in the middle of the metal lines, so if you inject light from the top or the bottom, it's heavily protected from both sides.

.... Some examples of countermeasures against physical attacks include active shields, security layout, and security design (self-check, dynamic logic). We have fault injection protection like glue logic design, glue cells (trigger cells) throughout the chip, and isolated clock and voltage. There is side-channel resistance: clock jitter, power balancing, lifecycle protection, multi-stage boot, multi-signature firmware verification, and various kinds of TRNG.

Let's take a look at the TRNG briefly. We use multiple different entropy sources. We offer multiple independent sources of entropy, and we also provide the option, if the user doesn't trust hardware-generated entropy, to use external entropy of their choice. Having multiple independent entropy sources is important; we used a NIST test to demonstrate this. The test suite was NIST SP800-22, which measures the quality of randomness. We showed that this mixes entropy very well.

... We aim to provide a flexible and programmable computing platform, take an open development approach, offer a few megabytes of memory, and implement countermeasures and security. This would be useful for cryptography and emerging cryptocurrency applications. Let us know of any functionality or any crypto primitives that you would like us to consider.

A: The reference numbers from my simulations were from a 4-core laptop.

Q: So a pretty modern x86 laptop? Got it. Thanks.

Q: Would it be slower on a conventional MCU?

A: Yes. Normally, some SEs are based on an M3, so it would be 5x slower or even more.

Q: One of the things you didn't bring up was the fact that ReRAM can be integrated with CMOS more effectively than flash.
A: Yes. Resistive memory can scale down below 20nm. We are not occupying the transistor layer, so we are saving some area too. Flash memory has to occupy silicon space alongside the logic, but our resistive memory is built on top of the logic, so it can save space and be more dense. It's also faster and more robust: with fewer electrons involved in read/write, resistive memory gives you reliability and you can read and write it more quickly.

Q: Unlike EdDSA and ECDSA, Schnorr does not have a formal specification. There are many flavors. There is bitcoin's BIP340 Schnorr, which uses x-only pubkeys. Is that something you are planning on supporting?

A: Yes. We are supporting that directly. Schnorr standardization is already lagging, so we want to implement the bitcoin approach first.

Q: Just an anecdote: Paillier always seemed like an odd choice to me, because it's RSA-based yet used in MP-ECDSA with ECC keys. First, I think there may be future variants of MP-ECDSA that use ECC-based protocols like ECIES and Bulletproofs. Second, Paillier is a variant of RSA, so having microprocessor primitives that can do both RSA and Paillier in a general-purpose way would be valuable for many uses. Third, I know from personal experience that the Mercury wallet (which uses 2p-ECDSA) and statechains are limited by the slow speed of Paillier in the protocol. They have to generate Paillier keys for every swap participant, and a participant could disconnect after the key generation phase, presenting a DoS vector. Fast key generation and Paillier in silicon would mitigate this. The specific use of Paillier in that implementation is in this paper: https://eprint.iacr.org/2017/552

A: Yes, I agree. The reason why we support Paillier is that it is heavily used in the major ECDSA MPC protocols. However, the Paillier-related ZKPs have been one of the main sources of vulnerabilities when not implemented correctly. We offer both RSA and Paillier since they are very similar. There is active research on enabling ECDSA MPC without Paillier. We are closely tracking these efforts and may have dedicated accelerators for them (once there is some standardization).

# Bunnie Studios

Our next presenter is Andrew "bunnie" Huang, an American researcher and hacker with a doctorate from MIT. He wrote "Hacking the Xbox"; he's a mentor and advisor, and recently released Precursor, a very interesting RISC-V-based cryptographic device.

Reflections on F/OSS Design + Closed PDK

I am bunnie. I have made a device called Precursor, which is an FPGA-based device meant for security. The idea behind it is that we wanted to see how far you can go in terms of trusting your hardware, particularly in the face of supply chain challenges and verifiability issues. FPGAs are great because you can compile your design from source, so you can know your CPU doesn't have backdoors and you can inspect the RTL. When you compile it yourself, we do a randomization trick to make sure the cells end up in different locations, and you get your own unique bitstream. So someone would need to backdoor a lot of your cells, not just a few of them, and you would see a change via x-rays and other relatively easy methods of checking that the FPGA chip is behaving unexpectedly. The whole thing is RISC-V based, and it's not just a hardware exercise. We wrote our own operating system: a microkernel OS written in Rust from the ground up. It's meant to run secure cryptographic applications, and it can deal with a number of different protocols all the way up to the application layer.
The device itself can be used as a U2F/FIDO token; you can keep your plaintext passwords on it; it can do TOTP, wifi chat, and Matrix chat; and the ecosystem is starting to evolve around it and flesh out.

As for this current presentation: after we put in all this effort to build this ecosystem on RISC-V, and had been thinking about security and trust, Cramium approached us, and they are building a secure element. I'm helping advise them on the open-source aspects of this. In particular, in the original draft of the chip, it was just an ARM CPU. From the standpoint of trust, and of being able to know that the really important things are working right, having a proprietary CPU core from ARM that you can't inspect and can't know the inside of seems problematic. So we are going to graft in a RISC-V core that we have and integrate it into their chip. Hopefully with that we can get software compatibility with all the RISC-V developments around secure wallets and secure hardware.

This has led to a question: we're doing all this free and open-source design, but when we're targeting the Cramium process, it's on a closed 22nm TSMC PDK. It's under NDA, and we can't show you the process. If you can't trust the transistors, then why bother with anything else? This presentation doesn't fit the exact theme of this salon, but: so you care about security and you want to trust your hardware. You want to avoid security by obscurity, per Kerckhoffs' principle. So let's have open circuit boards, firmware, bootloaders, kernels, protocols, and applications. And also open chips, RTL, PDKs, masks, and chip fabs. The problem is that you can keep going down and down and down until you get to the nth turtle. The reason people get worried about this is: what happens if your BIOS is rooted? If your BIOS is rooted, then your kernel doesn't matter. But if your motherboard is JTAGged, then your BIOS doesn't matter. You can always work around any security countermeasure by going one level deeper and then trapping the upper security level in a state where it is running a simulation and the world is not what it appears to be. That's the root of the paranoia, really.

There has been a lot of talk about open PDKs, trusted fabs, and that kind of thing as a potential solution to this; it has even been discussed in the US Congress and the European Parliament. The problem with bringing fabs over: this diagram shows what a supply chain might look like. On the left-hand side you have chip designers, developers, git cloud, all this stuff. Then you have chip fabs, PDKs, PCB assembly, and the grey market in the middle; it goes through customs, distributors, and vendors, and finally to customers. An open PDK lets you put some checks on the left-hand side of this diagram, but it ignores all the other vectors for potentially swapping out your hardware and doing bad things to it along the way. This whole business of opening things all the way down doesn't really address the core problem at the end of the day.

In hardware, checking designs doesn't mean checking devices. There is no "hash function" or "digital signature" for physical hardware, at least not yet. In software, there are ways of doing reproducible builds and checking hashes of software binaries. You can even use untrusted hardware to run code. There's no such trust transfer in hardware, though. In a separate line of research, I'm trying to find ways to add some amount of verifiability to silicon through optical inspection. It won't be perfect, but it should be much better than what we have today.
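For contrast, the software-side trust transfer that hardware lacks is exactly this kind of check, shown here as a minimal sketch (the image bytes and digest below are stand-ins, not a real release artifact):

```python
# Software trust transfer: anyone can recompute a digest over the exact bytes
# they received and compare it with a value published out-of-band (ideally in
# a signed release manifest). Physical silicon has no analogous check.
import hashlib

firmware_image = b"\x7fELF...example bytes only..."            # stand-in image
published_digest = hashlib.sha256(firmware_image).hexdigest()  # stand-in for the published value

def verify(image: bytes, expected_hex: str) -> bool:
    return hashlib.sha256(image).hexdigest() == expected_hex

assert verify(firmware_image, published_digest)                # intact image passes
assert not verify(firmware_image + b"\x00", published_digest)  # any tampering fails
```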
So, if you're worried about attack vectors in chips, why not inspect the transistors? Well, there's an x-ray ptychographic tomography (PXCT) paper where they show a 3D extraction of a modern Intel CPU and of a detector ASIC chip. It's non-destructive, too. But they were using a particle-accelerator x-ray source to do this. Not everyone is going to have one of those in their house. If you want to check your chip, you have to FedEx it to this facility, and now you're not sure if someone swapped out your chip while you shipped it... So this doesn't solve the problem either.

I have bad news for people who want full security through all the turtles: there are no silver bullets in hardware security. There's no essential link between formal verification and security, because specifications worry about correctness but not security. Open source does not mean it's secure, unless you check it. Even if you do inspect it, there might be an evil maid afterward, so you need to keep it under lock and key and under active watch so that someone doesn't swap it. Auditors? Certifications? Certifications are not a public service; they have their own incentives. There's no bottom turtle for security in this stack that we can get to at the end of the day. Hardware security is a cost-benefit tradeoff, like fatality rates in automobile accidents. How much does it cost to break the security? How much do you lose if the security is broken? What is acceptable at the end of the day? How do you accurately assess these costs? The reason cost assessment is hard is that it's human nature to fear things that you don't understand well or have uncertainty about.

Here are some typical problems causing security issues: user error, phishing, protocol errors, software bugs, hardware bugs, side-channel vulnerabilities, direct readout of memory, etc. I would re-draw this stack so that the width of each layer reflects the actual size of its attack surface; it would look quite different. The most commonly exploited layer is user error. It's a big problem. It happens all the time. This is fundamentally the biggest issue, and as you make the bottom layers more and more secure, users are increasingly going to be the source of problems. Social engineering and phishing will be a big problem for a very long time. Protocol errors are going to exist: Heartbleed, these kinds of things. They will be out there. Software bugs are pretty common; we get CVEs all the time. We have hardware bugs like Spectre and other issues. And finally, direct readout is a relatively small one: you literally have to have the hardware in your hand to execute the attack, compared to user error or protocol errors. The attack surface is literally measured in microns, and when you try to do the direct read there's a high chance you actually destroy the chip you're trying to read. It's a threat, yes, but it shouldn't be up there with user risks or these other concerns.

More importantly, perhaps, when you have closed hardware (and this is the crux of it), with those secure elements where typically today there's an NDA and you can't know anything about the hardware, or barely anything about the API, and it runs a bunch of closed-source software blobs implementing ciphers you don't understand, your visibility into the attack surface gets hazy somewhere between the protocol and the software layer.
So if there is, say, an initial-value-setting bug, but you can't get through the NDA to see what the underlying library looks like, it can lead to protocol errors. Even though the hardware itself is under NDA, problems bubble up to the higher layers. This is why people want to get rid of NDAs on SEs: in order to analyze what look like bugs, we need to be able to analyze the lower layers.

With Cramium, we want to open up as much as possible; some components won't be opened up at this time. You can't get direct visibility all the way down to the direct-readout layer, but you can get some hardware-bug visibility, for example in the crypto cores, and we can analyze them in gate-level simulations and figure out what's going on. This makes software bugs and protocol errors much more tractable; as long as we give ourselves enough rope to hang ourselves with, we will be able to patch and solve bugs a lot faster.

Even though we have a closed PDK, I think there's still meaning in working on RTL-level FOSS designs. The pros of this approach: we can reduce software bugs, assisted by analysis of the hardware design, and analyze everything from the APIs down to the hardware. We can look for timing errors, run tests, and use glass-box analysis of the hardware to make sure we don't have software bugs. More importantly, there's the problem of hardware bugs: you won't ship perfect silicon. But when you have the RTL available, you can do analytical patching of hardware bugs. The antithesis of this is what's happening with Spectre mitigations: we have SGX in all these CPUs, and there are whole research departments churning out papers about timing errors in Intel microarchitecture because it's closed; we think we have mitigations, but we keep agonizingly working through this cat-and-mouse game of countermeasure and break, etc. If you had the full microarchitecture, you could factor it into the compiler for mitigations and have a much better chance of patching out hardware bugs like that. Furthermore, if there is a bad actor, a bug, or a backdoor, we can ask where it came from. Also, the gross morphology is more constrained: if your RTL calls for 1 megabyte of RAM and you find 2 MB of RAM on the chip, then you have a problem. So we can constrain the gross size of a chip. It won't protect against mask-level attacks, but you won't be able to emulate an entirely different chip on different hardware, etc.

Some of the cons? Maybe someone in the process or tapeout put something tricky inside. There's no improvement on the analytical side for the side-channel and readout layers; there's no layout to help you analyze those. It doesn't improve transistor or logic-gate inspection. At the end of the day, you're still standing on the backs of different turtles. It feels untethered, and it doesn't feel great, but it's better.

The strawman argument is: all things being equal, an open PDK is better. I agree with that. But the strongman argument is: you should make your secure chip on a 130nm PDK process and just use the whole reticle. Why go to 28 or 22nm? The obvious problem is physics and form-factor economics. There's something like a 20x cost difference going from a 16 mm^2 chip to a 400 mm^2 chip. There are also speed and power differences, and form factor differences.
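As a rough sanity check on that cost claim, here is a first-order die-count estimate (a sketch using the common dies-per-wafer rule of thumb; it ignores yield loss, which hits large dies hardest, ignores scribe lines and edge exclusion, and assumes equal wafer cost even though older-node wafers are cheaper, which is what pulls the real ratio back toward the quoted ~20x):

```python
# First-order dies-per-wafer estimate for a 16 mm^2 vs a 400 mm^2 die on a
# 300 mm wafer. Order-of-magnitude only: no yield, scribe, or edge exclusion.
import math

def dies_per_wafer(die_area_mm2, wafer_diameter_mm=300):
    r = wafer_diameter_mm / 2
    # Gross die count minus a standard edge-loss correction term.
    return (math.pi * r**2 / die_area_mm2
            - math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2))

small = dies_per_wafer(16)    # ~4,250 dies
large = dies_per_wafer(400)   # ~143 dies
print(f"{small:.0f} vs {large:.0f} dies/wafer -> ~{small / large:.0f}x per-die cost ratio")
```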
Even if you were willing to accept all these things, the analysis isn't quite fair, because look at today's 180nm open PDKs: my friend Sean Cross has gone through the current open 180nm PDKs and did a trial layout. This is an 8 kb RAM chip, and the register files take up a third of the chip, on the right-hand side. Normally they would be much smaller in a closed-source version of this. But the foundry invested no effort at all in the memory, so the memory macros are just flip-flops tiled out. You're not even close to the memory density you need for the analysis to come out equal. They are working on it, and it's not impossible, but it's not as simple as saying, well, you know, the Pentium was taped out at 130nm and ran at 2 GHz and had all this memory on chip, so why bother? Well, we're not Intel, we don't have their resources, and we don't have that same PDK. You can't make the comparison.

Another issue is opportunity cost. Outside the field of security research, security is considered more of a barrier to adoption. If you put too much security on a website or a bank, people don't sign up. It's hard to upsell security; consumers often don't want to pay more for it. If you say, well, it's open source, it's just more expensive, consumers will still say no thanks. That's just the reality.

Security also settles around standards; the old saying is "don't roll your own." First movers have the ability to set standards, and I assume this forum is trying to set standards by being a first mover. As an open-source community, if you ignore the standards process and just allow ARM and proprietary technology to become entrenched, then later on, once you have the open PDK, you will have a much harder time getting full open source into the stack. Microarchitectural lock-in is a really serious problem. The reason I felt it was important to engage with Cramium, even though they have a closed PDK, is that the original plan called for an ARM CPU only. If that became the standard for all secure elements for cryptocurrency, you would end up with a huge base of code running on ARM, shipped to millions of devices; and then, by the time ARM gets its act together and you might have open PDKs, you have this mountain of millions of devices that you have to climb to replace it all. Hopefully we can do it right the first time, at the ground level, instead.

So which is better, a bottom-up approach or a top-down approach? Why not both? Some people say it's a mistake to work on a top-down approach without the PDK, but I think that's missing the point. I don't think we all have to be playing on one side of this field; it's okay to play zone. It's important to think about the top-down perspective. At the RTL level, the architectural level, and the software level, we should have all of our gates open, so that when open-source PDKs arrive we have a fully open stack ready to go, instead of having to work our way up the chain. So that's it. I'll take questions.

Q: What about Precursor and Cramium?

A: If they have open RTL and a RISC-V core, could we target our software to their RISC-V core? Yeah. We wouldn't retire the FPGA SKU; we would still offer that for people who want inspectable RTL. But a cost-down SKU that would have a lower cost? FPGAs are really hard to buy right now and highly overpriced. Aside from that, we can reduce the cost of the Precursor and make it more accessible. It wouldn't be the same gold standard of inspectability, but I think for many users it would be acceptable.
Q: What about having an FPGA, an ARM, and a RISC-V, so that you have the ability to cross-check and review?

A: ...

Q: Your presentation was horrifying.

A: Another area of research I'm doing is trying to figure out how to make hardware more transparent and inspectable at the user level. There's some interesting work in infrared imaging. Reading transistor memory values off the backside with optics works both ways: there are attacker tools and defender tools. We can look at the transistors from the back and make sure the chip is correct from the back end. The problem is that these inspection tools cost $1 million each. But a much cheaper version turns out to be a DSLR camera from Sony: you remove the infrared filter, light up your chip with an infrared light source, and you actually can see transistors. Nobody is doing this. I think we can solve this problem. I think we're not screwed. I think there's a way to solve this; I think we can have standards and ways to get that trust back. We haven't made it a priority, though, for many reasons. What happens is that as you start to tighten the screws on protocol-layer bugs and those things, it becomes more worthwhile to do hardware supply chain exploits. In the crypto wallet space, you do see these pop up every now and then. It's exciting. I am excited to be working with Cramium because, if I have my way, we will have packaging where the backside is transparent, and we could have a tool that users can use to take a picture of the backside of their Cramium chip and show that it matches, to a few microns of resolution, what we expect to have. This rules out a large range of supply chain attacks. Not all of them, but a lot of them. This raises the bar again in terms of where we're at. I agree with you that hardware is terrifying; I just think there's still hope. It's a hard and lonely hill to climb. A lot of people don't care, and it's still a largely theoretical problem.

Q: Bunnie, are you familiar with Physically Unclonable Functions (PUFs), and have you thought about implementing them in silicon? Is there a useful way to do this to prove hardware?

A: Bob - I think with a modicum of point-of-use silicon verification, maybe PUFs become viable. Or perhaps, as Mark indicated, with something like ReRAM you kind of don't need a PUF, because you can bury a secret number in the ReRAM and that is your "unclonable secret"; and with 5-10 micron-level optical verification of the die at point of use confirming that things seem mostly in the right place (e.g., you don't have a malicious block replaying the PUF or something like that built into the silicon), one could at least significantly constrain the attack surface?

# XGCD hardware implementations

Kavya Sreedhar is an electrical engineering PhD student at Stanford advised by Mark Horowitz. Her current research explores how to efficiently accelerate the extended GCD computation for verifiable delay functions and modular inversion in cryptography. She previously worked with the Agile Hardware (AHA) Project in developing Lake, a parameterizable memory generator that can be configured at runtime to support different image processing and machine learning applications.

A fast large-integer extended GCD algorithm and hardware design for verifiable delay functions and modular inversion

The extended GCD (XGCD) is a fundamental operation in number theory; it computes Bezout coefficients x and y that satisfy the Bezout identity a*x + b*y = gcd(a, b).
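As a reference point for the computation being discussed, here is the textbook division-based extended Euclidean algorithm as a short Python sketch; note that the hardware design presented in this talk favors subtraction-based (Stein-style) iteration instead, as covered below.

```python
# Extended Euclidean algorithm: returns (g, x, y) with a*x + b*y == g == gcd(a, b),
# i.e. the Bezout identity. A modular inverse falls out directly when gcd(a, m) == 1.
def xgcd(a, b):
    old_r, r = a, b
    old_s, s = 1, 0
    old_t, t = 0, 1
    while r != 0:
        quotient = old_r // r
        old_r, r = r, old_r - quotient * r
        old_s, s = s, old_s - quotient * s
        old_t, t = t, old_t - quotient * t
    return old_r, old_s, old_t  # gcd, x, y

def mod_inverse(a, m):
    g, x, _ = xgcd(a % m, m)
    if g != 1:
        raise ValueError("not invertible")
    return x % m

# Example: invert 3 modulo the curve25519 prime 2^255 - 19.
p = 2**255 - 19
assert (3 * mod_inverse(3, p)) % p == 1
```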
It is useful for modular multiplicative inversion, RSA, elliptic curve crypto, ElGamal encryption, and so on. A lot of work in the 1980s and 1990s went into developing fast GCD algorithms; until recently, there was not much further development in this domain. Recently there has been a larger need for fast extended GCD implementations. The first driver is the realization that constant-time XGCD can be faster than Fermat's-little-theorem inversion for curve25519 and others, as in Ber06. The second is increased interest in squaring binary quadratic forms over a class group as a verifiable delay function (VDF); I think Chia has been looking into using this particular VDF in their protocol. XGCD takes about 91% of the execution time when profiling VDFs in software.

These applications have two distinct requirements: constant-time 255-bit XGCD for modular inversion, and 1024-bit XGCD for VDFs, which does not need to be constant-time because the inputs are not secret.

We looked at the performance of previous implementations. ... There is an FPGA paper that prototypes ... no ASIC data points, to our knowledge. On the 1024-bit side, there are two ASIC data points; the faster ASIC is 2x faster than software. The hardware points that do exist focus on iterative division-based XGCD algorithms and provide point solutions for either constant-time or non-constant-time execution. Our goal was to explore the broader design space across different axes, like target platform, algorithm, and application requirements. In contrast to prior hardware work, it's better for performance to use iterative subtraction rather than iterative division. We offer a unified hardware unit that can be configured for non-constant-time or constant-time operation. This is good because ASICs tend to be expensive. .... GCD algorithms are inherently iterative. ... The Stein '67 family of algorithms iteratively subtracts the two numbers...

Q: Obviously, curve25519 is evolving and has some issues with aggregatable multisig, so people have been moving to Ristretto. From what I understand there are no limits; can you support the 25519 equivalent in Ristretto? Or does that require some change in your code?

A: Ours focuses on the GCD. As long as the XGCD part of it is still just modular inversion, it's plug-and-play; the actual application doesn't depend on it. This is true for different bitwidths as well. We have been using curve25519 because it seemed like something the community was interested in. We can change a parameter in our generator.

Q: What about secp?

A: I'm not familiar with that.

Q: We can go into that later... secp256k1 is what bitcoin uses, and secp256r1 is the NIST standard. I just wondered if you had looked at either of those.

A: We have been focused on XGCD. It seemed to be the bottleneck.

----

# Discussion

Intros (skipping).

What should these teams be investigating next? Paillier has come up as an item. But what about simple things? Some people at Intel have said that a good monotonic counter that is constant-time and guaranteed could be useful for a lot of things, and they would like to have it as a primitive. They had something called "proof of elapsed time" (PoET) that simulates proof-of-work, so it's not a proof-of-stake-style mechanism but more of a trust-root-secured proof of elapsed time for blocks. But it requires a secure monotonic counter, and when it was implemented in SGX, that proved problematic. What are the other things that we need? Does anyone have some initial thoughts?
One of the things to work on next is fully accelerating the applications that we have identified. For verifiable delay functions, it would be interesting to see how similar they are to PoET, because they are also trying to do proof-of-time instead of proof-of-work; it's kind of like proof-of-spacetime, or Bram's proofs of sequential work. There is also the elliptic curve side, which is all modular operations, like modular adds, modular multiplies, and reduction to bring results back into the modulus. How can those operations be pipelined? It's all large-integer arithmetic, so there is room to play around with that. We're definitely open to thinking about different things and hardware tricks we can apply in this space. It's helpful to hear ideas that you are all thinking about or think might be useful.

As you do this work, how important is it that it is done in an open-source way? It's one thing to throw a bunch of source code over the wall, which is what a lot of companies call "open source," versus an open development community that is actively involved at multiple layers of the initiative. There's not a lot of great articulation in the community about how exactly higher levels of open development work in terms of the Open Source Initiative's open source definition. It's even harder when we start putting hardware into the equation. I had this nice little stack of okay, you can begin to do thi