# Dataset description
## Facebook
The Facebook dataset is divided into 16 files, `data_2007.csv` through `data_2022.csv`.
Each file contains 10 columns:
- Image name: The collected image name, in the format `[aa]_[bb]_[cc]_[size].extension`
- Image URL: The collected URL, in one of three formats: `https://[any]fbid=[FBID]&[any]`, `https://[any]/p.[any]/[FBID]/[any]`, `https://[any]/basw.[any]/[FBID]/[any]`. These URLs were only roughly filtered from the raw URLs and may have lost query parameters that are needed for them to work. However, every URL contains the `[FBID]`, which can be used to reach the working URL through `https://www.facebook.com/photo/?fbid=[FBID]`.
- bb: The second component in the image name.
- FBID: The FBID in the image URL.
- difference: The difference between `[bb]` and `FBID`, that is, `difference = bb - FBID`.
- base: The base in the formula `bb - FBID = base * multiplier`. The base is `0` when `bb = FBID`, one of the five bases we discovered, or equal to the difference above if the difference is not divisible by any of the five bases.
- multiplier: The multiplier from the formula above.
- first: The first component of the image name. Equivalent to `[aa]`. Used for studying the relationships between multiple filenames.
- second: The second component of the image name. Equivalent to `[bb]`. Used for studying the relationships between multiple filenames. This column is the same as the bb column.
- third: The third component of the image name. Equivalent to `[cc]`. Used for studying the relationships between multiple filenames.
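The relationship between the image name, the URL and the `FBID` can be sketched in Python as follows. This is a minimal illustration, not the code used to build the dataset: the function names and the regular expressions are assumptions derived from the formats listed above, and the sketch assumes the first three underscore-separated parts of the name are always numeric.

```python
import re

# Assumed regex translations of the three URL formats described above.
_FBID_PATTERNS = [
    re.compile(r"fbid=(\d+)"),                       # https://[any]fbid=[FBID]&[any]
    re.compile(r"/(?:p|basw)\.[^/]*/(\d+)(?:/|$)"),  # .../p.[any]/[FBID]/... or .../basw.[any]/[FBID]/...
]


def split_image_name(name):
    """Return ([aa], [bb], [cc]) as integers from `[aa]_[bb]_[cc]_[size].extension`."""
    stem = name.rsplit(".", 1)[0]        # drop the extension
    aa, bb, cc = stem.split("_")[:3]     # ignore the trailing size component
    return int(aa), int(bb), int(cc)


def extract_fbid(url):
    """Return the FBID found in the URL as an integer, or None if no format matches."""
    for pattern in _FBID_PATTERNS:
        match = pattern.search(url)
        if match:
            return int(match.group(1))
    return None


def photo_url(fbid):
    """Build the URL that redirects to the working photo page."""
    return f"https://www.facebook.com/photo/?fbid={fbid}"
```

With these helpers, the `first`, `second` and `third` columns correspond to the tuple returned by `split_image_name`, and the `FBID` column to the value returned by `extract_fbid`.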
Sample entry:
`303946492_215954794088805_4730241505196182795_n.jpg, https://www.facebook.com/photo.php?fbid=215954800755471&set=pb.100070228564876.-2207520000.&type=3, 215954794088805, 215954800755471, -6666666, 3333333, -2, 303946492, 215954794088805, 4730241505196182795`
Meaning:
- Image name: 303946492_215954794088805_4730241505196182795_n.jpg
- Image URL: https://www.facebook.com/photo.php?fbid=215954800755471&set=pb.100070228564876.-2207520000.&type=3
- bb: 215954794088805
- FBID: 215954800755471
- difference: -6666666
- base: 3333333
- multiplier: -2
- first: 303946492
- second: 215954794088805
- third: 4730241505196182795
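The `difference`, `base` and `multiplier` columns of the sample entry can be reproduced with a short sketch. Several points here are assumptions rather than documented behavior: the function name `decompose` is hypothetical, the list of bases must be supplied by the caller (only `3333333` appears in this description), the first matching base wins when several divide the difference, and the multiplier is set to `0` for `base = 0` and to `1` when no base divides the difference.

```python
def decompose(bb, fbid, bases):
    """Return (difference, base, multiplier) with bb - fbid == base * multiplier."""
    difference = bb - fbid
    if difference == 0:
        # bb equals FBID; a multiplier of 0 is an assumed placeholder.
        return 0, 0, 0
    for base in bases:
        if difference % base == 0:
            return difference, base, difference // base
    # No known base divides the difference: the base is the difference itself.
    # A multiplier of 1 is assumed so that the formula still holds.
    return difference, difference, 1


# The sample entry above, using the one base shown in this description.
print(decompose(215954794088805, 215954800755471, [3333333]))
# (-6666666, 3333333, -2)
```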
## Twitter
1. `images_by_year.json`
This JSON file contains an object that stores all image file names, each with its corresponding Twitter ID and the creation time of the tweet, **not of the image file**. Entries are grouped by year and month. Each year is a key in this object whose value is a 2-dimensional array: the first dimension holds the 12 months of the year, and the second dimension holds the images crawled for the corresponding month.
For example, to access the 10th image in December 2022, use `obj['2022'][11][9]`. Each entry has the following form:
```json
{
"id": 1625151000184262657,
"created_at": 1676301166.0,
"file_name": "Fo2yIIAaMAIIUS6.jpg"
}
```
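The indexing scheme above can be wrapped in a few helper functions. This is a sketch under stated assumptions: the function names are hypothetical, months and positions are taken as 1-based for readability, and `created_at` is interpreted as a UNIX timestamp in seconds.

```python
import json
from datetime import datetime, timezone


def load_images(path="images_by_year.json"):
    """Load the year -> months -> entries object from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def get_image(images_by_year, year, month, position):
    """Return the entry at a 1-based month and 1-based position within that month."""
    return images_by_year[str(year)][month - 1][position - 1]


def tweet_time(entry):
    """Convert the tweet's `created_at` value to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(entry["created_at"], tz=timezone.utc)
```

For instance, `get_image(obj, 2022, 12, 10)` returns the same entry as `obj['2022'][11][9]`.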
2. `place_data_center.json`
This JSON file contains an object with 1466 locations crawled from the Twitter API. Each element stores key information about a location: its coordinates, its country name, and the number of images from that location for each data center ID.
For example:
```json
{
"id": "00467dd53ce7cbc5",
"data_center_ids": {
"10": 429,
"11": 29,
"13": 37,
"4": 1,
"1": 4
},
"coordinates": [103.26117487778697, 22.2514955],
"country": "Vietnam"
}
```
This shows that the location with id `00467dd53ce7cbc5` is in Vietnam, with most images being uploaded to data center 10 (429 images).
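The same reading can be obtained programmatically. The sketch below works on a single location element; the function name is hypothetical, and it assumes every element has a non-empty `data_center_ids` mapping.

```python
def dominant_data_center(place):
    """Return (data_center_id, image_count, share_of_images) for one location."""
    counts = place["data_center_ids"]
    dc_id = max(counts, key=counts.get)   # key with the highest image count
    total = sum(counts.values())
    return dc_id, counts[dc_id], counts[dc_id] / total


place = {
    "id": "00467dd53ce7cbc5",
    "data_center_ids": {"10": 429, "11": 29, "13": 37, "4": 1, "1": 4},
    "coordinates": [103.26117487778697, 22.2514955],
    "country": "Vietnam",
}
dc_id, count, share = dominant_data_center(place)
print(dc_id, count)  # 10 429
```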
## Flickr
The Flickr dataset consists of 2 binary files, `images` and `images_filtered`.
Each binary file contains approximately 1.6 million entries. Each entry holds an image's ID, date posted and date taken, as crawled from the Flickr API.
Each entry is a binary sequence 24 bytes long, with each of these values taking up 8 bytes. To extract information from this dataset, first convert each 8-byte sequence into a little-endian integer. The Flickr ID is kept as-is, while the other two numbers are UNIX timestamps. The example below extracts this information from a single entry.
```python
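def get_info(data):
    """Decode one 24-byte entry into (image_id, date_posted, date_taken)."""
    assert len(data) == 24, "chunk length must be 24"
    # Each field is an 8-byte little-endian integer.
    image_id = int.from_bytes(data[:8], byteorder='little')
    date_posted = int.from_bytes(data[8:16], byteorder='little')  # UNIX timestamp
    date_taken = int.from_bytes(data[16:], byteorder='little')  # UNIX timestamp
    return image_id, date_posted, date_taken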
```
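To process a whole file rather than a single entry, the 24-byte layout can be decoded with the standard `struct` module. This is a sketch, not the original extraction code: the function names are hypothetical, and the fields are assumed to be unsigned, which matches the default behavior of `int.from_bytes`.

```python
import struct
from datetime import datetime, timezone

# Three little-endian unsigned 64-bit integers: ID, date posted, date taken.
ENTRY = struct.Struct("<3Q")


def iter_entries(buffer):
    """Yield (image_id, date_posted, date_taken) for every 24-byte entry in a bytes object."""
    if len(buffer) % ENTRY.size != 0:
        raise ValueError("buffer length must be a multiple of 24 bytes")
    yield from ENTRY.iter_unpack(buffer)


def read_entries(path):
    """Read an entire dataset file (for example `images`) and return its entries as a list."""
    with open(path, "rb") as f:
        return list(iter_entries(f.read()))


def to_utc(timestamp):
    """Convert a UNIX timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)
```

At roughly 1.6 million entries of 24 bytes each, a file is under 40 MB, so reading it into memory in one call is a reasonable simplification here.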