# CN-XHV design comments

### Temporary module

I downloaded an [implementation of Groestl](http://cryptography.gmu.edu/athena/index.php?id=source_codes) because Lucca's Groestl module wasn't finished yet. We made some agreements beforehand, so it shouldn't be too hard to replace the module once Lucca has finished.

### Test constants

The number of rounds of each module has been reduced for test purposes, but that can easily be changed. The algorithm currently implemented is actually CN-HEAVY, but it takes only one NOT gate in the Shuffle module to implement CN-XHV instead.

### Keccak sharing inefficiencies

The Keccak hashes at the start and at the end of the algorithm are two different modules, because the input size and the number of steps differ. They could, however, be optimized to share the round function and save some area on the FPGA.

### Implode sharing inefficiencies

Implode consists of two iterations over the BRAM, followed by 16 extra AES rounds. These extra AES rounds are instantiated separately from the other ones, but they could be shared to save some area on the FPGA.

### Overly large busses

There are still some large busses in the design. The input (268 bytes) is read all at once, and Keccak also receives its input as one bus of 1600 bits. The Keccak output (200 bytes) is provided as one bus as well. The Keccak hash module also features some large busses, since it was based on my design for the EAGLE project. @Michiel you said we'd want to avoid large busses, but can you explain why?

### Division rounding errors

The division IP core either rounds the output of the division (instead of flooring) or lacks the required number of bits to be as accurate as C++ implementations. To solve this I added an extra multiplier which checks the result and then modifies the quotient if needed. I have no idea how to get rid of this, but it delays the "division time" by only 1 cycle.

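The check-and-correct step described above can be sketched in software. This is only an illustration, not the actual RTL: the function names `rounding_divide` and `correct_quotient` are hypothetical, the sketch assumes a non-negative dividend and a positive divisor, and it assumes the IP core's quotient is off by at most one. Signed operands would need extra sign handling.

```python
def rounding_divide(n: int, d: int) -> int:
    """Hypothetical stand-in for a divider that rounds to nearest (half up)."""
    return (2 * n + d) // (2 * d)


def correct_quotient(n: int, d: int, q_approx: int) -> int:
    """Turn a quotient that is off by at most one into floor(n / d).

    Assumption: n >= 0 and d > 0, so flooring and truncation coincide.
    """
    # The single extra multiplication: recompute the remainder implied
    # by the candidate quotient.
    remainder = n - q_approx * d
    if remainder < 0:
        return q_approx - 1  # candidate was rounded up by one
    if remainder >= d:
        return q_approx + 1  # candidate was one too small
    return q_approx  # candidate was already the floored quotient


if __name__ == "__main__":
    n, d = 7, 2
    q = rounding_divide(n, d)
    print("rounded:", q, "corrected:", correct_quotient(n, d, q))
```

In hardware terms, the multiplication and comparison map onto the extra multiplier stage, and the final adjustment is a conditional increment or decrement of the quotient.
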
### Explode idle time

The explode scratchpad module takes two clock cycles to write its output and initialize a new AES round. With a bit of modification, this could be done in one cycle.

### Implode vs explode round times

Implode takes longer than explode to perform one round and get new input, so explode always has to wait before performing the next round. I have not looked into why that is, but I expect it has something to do with read latency being larger than write latency. BRAM reads and writes are already performed during the AES rounds, so I don't know if this can be sped up. Even if it can, the impact on overall latency and throughput would be small.

### AES sharing

The design features AES rounds in the explode, implode and shuffle steps. It is possible to share AES blocks between explode and implode, but this would decrease throughput and make pipelining impossible. The shuffle step could, however, make use of the AES rounds of explode or implode without impeding them.