# STL 2 Assignment 3
## 2
The following are run:
`md5collgen -p pretext.txt -o out1.bin out2.bin` # Generate 2 binary files
`diff out1.bin out2.bin`
`md5sum out1.bin`
`md5sum out2.bin`
There are some differences in the collision blocks, but the MD5 hash of both files is the same.
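
The same check can be written as a minimal Python sketch, assuming both outputs have been read into memory. The helper name `is_collision` is hypothetical and not part of `md5collgen`.

```python
import hashlib


def is_collision(a: bytes, b: bytes) -> bool:
    """Return True when a and b differ as byte strings but share an MD5 digest."""
    return a != b and hashlib.md5(a).digest() == hashlib.md5(b).digest()
```

For the files above, it would be called as `is_collision(open("out1.bin", "rb").read(), open("out2.bin", "rb").read())`.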

## 3
Text comparison by using the command `bless out1.bin out2.bin`:


The obvious differences are circled.
If the prefix length is not a multiple of 64 bytes, `md5collgen` pads it with zero bytes up to the next multiple of 64.
--------------------------------------------
Content of the `prefix64.txt`:
`abcdefghijklmnopqrstuvwxyz0000000000000000000000000000000000000`


With a 64-byte prefix, no padding bytes are added:


The commands that were run:
`md5collgen -p prefix64.txt -o out1_64.bin out2_64.bin`
`diff out1.bin out1_64.bin`
`diff out2_64.bin out1_64.bin`
`md5sum out1_64.bin`
`md5sum out2_64.bin`
To compare the sizes of the generated files:

`out1.bin` and `out1_64.bin` have the same size, as do `out2.bin` and `out2_64.bin`.
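
The equal sizes follow from the padding rule. The sketch below is only an illustration: the function names are made up, and the 128-byte collision length is taken from the outputs observed in this report. Any prefix of 1 to 64 bytes is padded to 64 bytes, so every output file is 192 bytes.

```python
BLOCK = 64        # MD5 block size in bytes
COLLISION = 128   # two 64-byte collision blocks (assumption based on the outputs)


def zero_padding(prefix_len: int) -> int:
    """Zero bytes needed to bring the prefix up to a multiple of 64."""
    return (-prefix_len) % BLOCK


def output_size(prefix_len: int) -> int:
    """Expected size of each generated file: padded prefix plus collision blocks."""
    return prefix_len + zero_padding(prefix_len) + COLLISION


print(zero_padding(64), output_size(64))  # 0 192
print(zero_padding(26), output_size(26))  # 38 192
```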
The differences:


## 4
The content of mssd.txt is the following:
`mssd`
Appending `mssd.txt` to each of the binary files generated earlier gives two combined files that still differ from each other but have the same MD5 hash:
`cat out1.bin mssd.txt > combine1.txt`
`cat out2.bin mssd.txt > combine2.txt`
`md5sum combine1.txt combine2.txt`
`diff combine1.txt combine2.txt`
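
The reason the suffix does not break the collision is that MD5 processes its input block by block: once two inputs of equal length leave the hash in the same internal state, any common suffix keeps the digests equal. The sketch below only imitates this with `hashlib`'s `copy()`, which clones the internal state. It does not contain a real colliding pair, and the name `extend` is hypothetical.

```python
import hashlib


def extend(state, suffix: bytes) -> str:
    """Feed suffix into a copy of an MD5 state and return the hex digest."""
    h = state.copy()      # clone the internal state; the original is untouched
    h.update(suffix)
    return h.hexdigest()


# Stand-in for "two inputs that left MD5 in the same state": one state is
# simply cloned, since no real collision is embedded in this sketch.
state_a = hashlib.md5(b"A" * 64)
state_b = state_a.copy()

print(extend(state_a, b"mssd") == extend(state_b, b"mssd"))  # True
```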

## 5
For this section, we first need to compile the C program:
`gcc xyz.c -o xyz`
`bless xyz` # inspect the bytes of the compiled binary

The first `A` is at byte number 4161 (offset 4160):

The last `A` is at byte number 4560 (offset 4559):

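The array starts at offset 4160, which is a multiple of 64 (4160 = 65 × 64). The prefix is cut at 4224 = 66 × 64, the next 64-byte boundary, so the cut is block-aligned and still lies inside the array.
The collision blocks take up 128 bytes, so the suffix starts at byte 4224 + 128 + 1 = 4353 (`tail -c +N` counts from 1).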
`head -c 4224 xyz > prefix`
`md5collgen -p prefix -o out_A_xyz.bin out_B_xyz.bin`
`md5sum out_A_xyz.bin out_B_xyz.bin`
Two files are generated, each consisting of the prefix followed by a 128-byte collision block.

To extract the suffix:
Everything from byte 4224 + 128 + 1 = 4353 onwards is the suffix.
`tail -c +4353 xyz > suffix`
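
The offset arithmetic can be double-checked with a small sketch. The function name `split_offsets` is hypothetical; the array length of 400 bytes follows from the first and last `A` being at bytes 4161 and 4560.

```python
def split_offsets(array_start: int, array_len: int, prefix_len: int,
                  block: int = 64, collision: int = 128):
    """Validate a prefix cut and return (suffix_offset, tail_argument)."""
    if prefix_len % block != 0:
        raise ValueError("prefix length must be a multiple of the block size")
    array_end = array_start + array_len
    if not (array_start <= prefix_len and prefix_len + collision <= array_end):
        raise ValueError("collision blocks would fall outside the array")
    suffix_offset = prefix_len + collision      # 0-based offset of the suffix
    return suffix_offset, suffix_offset + 1     # tail -c +N counts from 1


print(split_offsets(4160, 400, 4224))  # (4352, 4353)
```

The second value is the argument passed to `tail -c +N`.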

output of `out_A_xyz.bin`:

```
41414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141ea8df2f947412e6ac5471d2ebcbef5e885824b586d5fe4ffc8979b5b1ba8ef550c1be14865ab22e7d6c7ac78d581bbd37714b316e660b2f2f18497b198584fad30a871e65fb13c505534416b1dc3bc7eb11615e636e6d11af6ccec462728624a6d64ce38e784c49578f4c2a863126f825577734d11a4743ceee3bdf41414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141
```
output of `out_B_xyz.bin`:

```
41414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141ea8df2f947412e6ac5471d2ebcbef5e8858243586d5fe4ffc8979b5b1ba8ef550c1be14865ab22e7d6c7ac7d591bbd37714b316e660b2f2f1497b198584fad30a871e65fb13c505534416b1dc3bc7eb19615e636e6d11af6ccec462728624a6d64ce38e784c49578f84c1a863126f825577734d11a47c3ceee3bdf41414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141414141
```
Only a few bytes differ, all within the collision block:

The hashes of the two files are still the same:

Although the contents of the collision blocks differ, the resulting hash is the same.
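
To locate the differing bytes without a hex editor, a short sketch such as the following could be used. The helper name `differing_offsets` is hypothetical.

```python
def differing_offsets(a: bytes, b: bytes) -> list:
    """Return the 0-based offsets at which two equal-length byte strings differ."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


print(differing_offsets(b"abcd", b"abXd"))  # [2]
```

Applied to the contents of `out_A_xyz.bin` and `out_B_xyz.bin`, every reported offset should be at or beyond 4224, the start of the collision block.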
## 6
Idea: if the contents of the two arrays differ, `Malicious` is printed; otherwise `Benign` is printed.
Create a new C program that holds static data in two arrays:
```c=
#include <stdio.h>
#include <string.h>
unsigned char x[400] = {"\
ZELDLINKZELDLINKZELDLINKZELDLINKZELD\
ZELDLINKZELDLINKZELDLINKZELDLINKZELD\
ZELDLINKZELDLINKZELDLINKZELDLINKZELD\
ZELDLINKZELDLINKZELDLINKZELDLINKZELD\
ZELDLINKZELDLINKZELDLINKZELDLINKZELD\
ZELDLINKZELDLINKZELDLINKZELDLINKZELD\
ZELDLINKZELDLINKZELDLINKZELDLINKZELD\
ZELDLINKZELDLINKZELDLINKZELDLINKZELD\
ZELDLINKZELDLINKZELDLINKZELDLINKZELD\
ZELDLINKZELDLINKZELDLINKZELDLINKZELD\
"};
unsigned char y[400] = {"\
ZELDLINKZELDLINKZELDLINKZELDLINKZELD\
ZELDLINKZELDLINKZELDLINKZELDLINKZELD\
ZELDLINKZELDLINKZELDLINKZELDLINKZELD\
ZELDLINKZELDLINKZELDLINKZELDLINKZELD\
ZELDLINKZELDLINKZELDLINKZELDLINKZELD\
ZELDLINKZELDLINKZELDLINKZELDLINKZELD\
ZELDLINKZELDLINKZELDLINKZELDLINKZELD\
ZELDLINKZELDLINKZELDLINKZELDLINKZELD\
ZELDLINKZELDLINKZELDLINKZELDLINKZELD\
ZELDLINKZELDLINKZELDLINKZELDLINKZELD\
"};
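int main(void)
{
    int i;
    /* Dump both arrays as hex so the collision blocks are visible. */
    for (i = 0; i < 400; i++) {
        printf("%02x", x[i]);
    }
    printf("\n\n");
    for (i = 0; i < 400; i++) {
        printf("%02x", y[i]);
    }
    printf("\n\n");
    /* memcmp compares all 400 bytes; strcmp would stop at the first zero byte. */
    if (memcmp(x, y, sizeof(x)) == 0)
        printf("Benign\n");
    else
        printf("Malicious\n");
    return 0;
}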
```
`gcc xyz_benign_malicious.c -o xyz_benign`

Running `xyz_benign` prints `Benign`:
`./xyz_benign`

The first `Z` of array `x` is at the same byte offset as the first `A` in the previous section.
The first `Z` of the second array, `y`, is at the following byte location:

-----------------------
To carry out the attack:

The idea is to split the binary around both arrays (`x_prefix`, `y_prefix`, `y_suffix`) and to generate a pair of collision blocks, `P` and `Q`, for `x_prefix`. Placing `P` in both arrays makes them equal, while placing `Q` in `x` and `P` in `y` makes them differ, yet both binaries keep the same MD5 hash.
As in the previous section, the binary is first split into a prefix and a suffix:

`head -c 4224 xyz_benign > x_prefix` : everything up to the point in `x` where the collision block starts.
`tail -c +4353 xyz_benign > x_suffix` : everything after the 128-byte collision block in `x`.
We can then run the following to create `y_prefix` and `y_suffix`:
`head -c 288 x_suffix > y_prefix` : the 288 bytes between the two collision blocks.
`tail -c +4769 xyz_benign > y_suffix` : everything after the collision block in `y`, since 4352 + 288 + 128 = 4768 bytes precede it.

To generate the collision from x_prefix:
`md5collgen -p x_prefix -o x_prefix_P x_prefix_Q`

`tail -c 128 x_prefix_P > P`
`tail -c 128 x_prefix_Q > Q`

To join it all back:
`cat x_prefix_P y_prefix P y_suffix > xy_benign`
`cat x_prefix_Q y_prefix P y_suffix > xy_malicious`
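
The reassembly can also be sketched in Python to make the layout explicit. This is an illustration under assumptions: the function name `rebuild` is hypothetical, the offsets are the ones used in this section, and `y_suffix` is taken to start at offset 4768 of the original binary (4352 + 288 + 128). Dummy blocks stand in for the real `P` and `Q`.

```python
def rebuild(original: bytes, x_block: bytes, y_block: bytes) -> bytes:
    """Place one 128-byte block in each array, keeping everything else intact."""
    if len(x_block) != 128 or len(y_block) != 128:
        raise ValueError("collision blocks must be 128 bytes long")
    return (original[:4224]         # x_prefix
            + x_block               # collision block inside x
            + original[4352:4640]   # y_prefix: 288 bytes between the two blocks
            + y_block               # collision block inside y
            + original[4768:])      # y_suffix: remainder of the binary


# Dummy data only; a real run would read xyz_benign, P and Q from disk.
original = bytes(6000)
P, Q = b"\x01" * 128, b"\x02" * 128
benign = rebuild(original, P, P)      # both arrays hold P, so they compare equal
malicious = rebuild(original, Q, P)   # x holds Q, y holds P, so they differ
print(len(benign) == len(malicious) == len(original))  # True
```

Keeping the total length equal to that of the original binary is what keeps all other offsets in the executable valid.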

To run the reconstructed programs:
`chmod +x xy_benign xy_malicious`
`./xy_benign`


`./xy_malicious`


Both have the same hash:
`md5sum xy_benign xy_malicious`

But the binary files differ:
`diff xy_benign xy_malicious`

## 7
`https://shattered.io/static/shattered.pdf` is downloaded and saved as `og1.pdf`.
`https://iacr.org/archive/crypto2005/36210017/36210017.pdf` is downloaded and saved as `og2.pdf`.

Originally, `og1.pdf` and `og2.pdf` have different MD5 hashes.
The idea is the same as in the earlier collisions, where the colliding bytes were placed in data that does not affect what is shown: static text, a static array in C, or non-rendered segments in a JPG. A PDF also has parts that are not rendered, such as metadata (title, author, creation date). By changing only such non-rendered parts while keeping the rendered content unchanged, it is possible to generate multiple PDF files with the same hash value.
To achieve an MD5 collision, the collision blocks have to go into a non-rendered component. The script at https://github.com/corkami/collisions/blob/master/scripts/pdf.py is used to generate the colliding files. The script needs the MuPDF tools: either download the executable on Windows or install the `mupdf-tools` package with `apt`.
Some modifications:
- Instead of building the byte string `template` from variables, the whole template is hardcoded; the exact contents matter little as long as they are identical in both PDFs.

The following is the edited variable. Note that the hardcoded values came from printing out the `KIDS1` and `KIDS2` variables.
```python=
template = b"""%%PDF-1.4
1 0 obj
<<
/Type /Catalog
%% for alignments (comments will be removed by merging or cleaning)
/MD5_is__ /REALLY_dead_now__
/Pages 2 0 R
%% to make sure we don't get rid of the other pages when garbage collecting
/Fakes 3 0 R
%% placeholder for UniColl collision blocks
/0123456789ABCDEF0123456789ABCDEF012
/0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0
>>
endobj
2 0 obj
<</Type/Pages/Count 6/Kids[80 0 R 91 0 R 98 0 R 107 0 R 111 0 R 118 0 R 121 0 R 124 0 R 127 0 R 130 0 R 133 0 R 136 0 R 139 0 R 142 0 R 145 0 R 152 0 R 155 0 R 170 0 R 173 0 R 176 0 R 188 0 R 191 0 R 243 0 R]>>
endobj
3 0 obj
<</Type/Pages/Count 6/Kids[80 0 R 91 0 R 98 0 R 107 0 R 111 0 R 118 0 R 121 0 R 124 0 R 127 0 R 130 0 R 133 0 R 136 0 R 139 0 R 142 0 R 145 0 R 152 0 R 155 0 R 170 0 R 173 0 R 176 0 R 188 0 R 191 0 R 243 0 R]>>
endobj
%% overwritten - was a fake page to fool merging
4 0 obj
<< >>
endobj
"""
```
Once the modification is done, the Python 3 script is run to generate `collision1.pdf` and `collision2.pdf`.
This is the command used:
`python3 pdf.py og1.pdf og2.pdf`
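
The generated pair can be checked under both hash functions with a sketch like the one below. The helper name `file_digests` is hypothetical; it streams the file in chunks so that large PDFs do not need to fit in memory.

```python
import hashlib


def file_digests(path: str, algorithms=("md5", "sha1"), chunk_size: int = 65536) -> dict:
    """Stream a file once and return {algorithm: hex digest}."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}
```

Calling `file_digests("collision1.pdf")` and `file_digests("collision2.pdf")` should give equal `md5` entries, while the `sha1` entries are expected to differ because the files are not byte-identical.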
The following is the screenshot of the generated file and the md5 hash values:

For the SHA-1 collision, the PDFs need to be converted to images and edited from there. To use nneonneo's SHA-1 collider, the two PDFs must have the same page count and the same page size. Both were a problem for our files, as seen in the following:

As such, a minor modification is made by inserting blank pages into the offending PDF (`og2.pdf`) using https://pdfux.com/add-blank-pages-pdf/, so that the existing content is not affected. The page sizes are then equalised using the tool at https://www.pdf2go.com/.
Running `collide.py og1.pdf og2.pdf` results in the following:

Their SHA-1 hashes are the same, as seen in the following:

The code is taken from:
- https://github.com/nneonneo/sha1collider

References:
- https://www.youtube.com/watch?v=13rzZkSVxsM
- https://security.stackexchange.com/questions/152341/how-did-the-shattered-io-group-manage-to-create-a-sha1-collision-for-a-pdf-that
- https://github.com/corkami/collisions