# Dataset description

## Facebook

The Facebook dataset is divided into 16 files, from `data_2007.csv` to `data_2022.csv`. Each file contains 10 columns:

- Image name: The collected image name, in the format `[aa]_[bb]_[cc]_[size].extension`.
- Image URL: The collected URL, in one of three formats: `https://[any]fbid=[FBID]&[any]`, `https://[any]/p.[any]/[FBID]/[any]`, or `https://[any]/basw.[any]/[FBID]/[any]`. These URLs were only roughly filtered from the raw URLs and may have lost query parameters needed to make them work. However, all of them contain the `[FBID]`, which resolves to a working URL through `https://www.facebook.com/photo/?fbid=[FBID]`.
- bb: The second component of the image name.
- FBID: The FBID in the image URL.
- difference: The difference between `[bb]` and `FBID`, that is, `difference = bb - FBID`.
- base: The base in the formula `bb - FBID = base * multiplier`. The base is one of the following:
  - `0`, meaning `bb = FBID`;
  - one of the five bases we discovered;
  - the difference itself, if it is not divisible by any of the five bases.
- multiplier: The multiplier in the formula above.
- first: The first component of the image name, equivalent to `[aa]`. Used for studying the relationships between multiple filenames.
- second: The second component of the image name, equivalent to `[bb]`. Used for studying the relationships between multiple filenames. This column is identical to the bb column.
- third: The third component of the image name, equivalent to `[cc]`. Used for studying the relationships between multiple filenames.

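The column definitions can be illustrated with a minimal Python sketch that derives `first`, `second`, `third`, `FBID`, `difference`, `base`, and `multiplier` from an image name and URL. The function names (`split_name`, `extract_fbid`, `decompose`, `working_url`) are hypothetical and are not part of the dataset tooling. The five discovered bases are not listed in this description, so the caller must supply them; `3333333` is used in the usage lines only because it appears in the dataset's sample entry. The multiplier returned for a zero difference (`0`) and for a difference not divisible by any base (`1`) are assumptions.

```python
import re

# Patterns for the three URL shapes described in the column list.
_FBID_PATTERNS = (
    re.compile(r"fbid=(\d+)"),
    re.compile(r"/(?:p|basw)\.[^/]*/(\d+)/"),
)


def split_name(image_name):
    """Split `[aa]_[bb]_[cc]_[size].extension` into (aa, bb, cc) as integers."""
    aa, bb, cc, _size_and_ext = image_name.split("_", 3)
    return int(aa), int(bb), int(cc)


def extract_fbid(url):
    """Return the FBID found in a collected URL, or None if no pattern matches."""
    for pattern in _FBID_PATTERNS:
        match = pattern.search(url)
        if match:
            return int(match.group(1))
    return None


def working_url(fbid):
    """Build the redirecting URL from an FBID."""
    return f"https://www.facebook.com/photo/?fbid={fbid}"


def decompose(bb, fbid, bases):
    """Return (difference, base, multiplier) with bb - fbid = base * multiplier.

    `bases` must be supplied by the caller; the full list is not given here.
    """
    difference = bb - fbid
    if difference == 0:
        return 0, 0, 0  # assumption: multiplier is 0 when bb == FBID
    for base in bases:
        if difference % base == 0:
            return difference, base, difference // base
    # Not divisible by any known base: base equals the difference itself.
    return difference, difference, 1  # assumption: multiplier is 1 here


name = "303946492_215954794088805_4730241505196182795_n.jpg"
url = ("https://www.facebook.com/photo.php?fbid=215954800755471"
       "&set=pb.100070228564876.-2207520000.&type=3")
first, second, third = split_name(name)
fbid = extract_fbid(url)
print(decompose(second, fbid, bases=[3333333]))  # (-6666666, 3333333, -2)
```

Python's floor division keeps the base positive and lets the multiplier carry the sign, which matches the sign convention in the sample entry.
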
Sample entry:

`303946492_215954794088805_4730241505196182795_n.jpg, https://www.facebook.com/photo.php?fbid=215954800755471&set=pb.100070228564876.-2207520000.&type=3, 215954794088805, 215954800755471, -6666666, 3333333, -2, 303946492, 215954794088805, 4730241505196182795`

Meaning:

- Image name: `303946492_215954794088805_4730241505196182795_n.jpg`
- Image URL: `https://www.facebook.com/photo.php?fbid=215954800755471&set=pb.100070228564876.-2207520000.&type=3`
- bb: 215954794088805
- FBID: 215954800755471
- difference: -6666666
- base: 3333333
- multiplier: -2
- first: 303946492
- second: 215954794088805
- third: 4730241505196182795

## Twitter

1. `images_by_year.json`

   This JSON file contains an object that stores all image file names, each with its corresponding Twitter ID and the creation time of the tweet, **not of the image file**. Entries are grouped by year and by month. Each year is a key in the object, and its value is a 2-dimensional array: the first dimension represents the 12 months of the year, and the second holds the images crawled for that month. For example, to access the 10th image of December 2022, use `obj['2022'][11][9]`. Each entry has the following form:

   ```json
   {
     "id": 1625151000184262657,
     "created_at": 1676301166.0,
     "file_name": "Fo2yIIAaMAIIUS6.jpg"
   }
   ```

2. `place_data_center.json`

   This JSON file contains an object with 1466 locations crawled from the Twitter API. Each element stores the location's coordinates, its country name, and the number of images from that location for each data center ID. For example:

   ```json
   {
     "id": "00467dd53ce7cbc5",
     "data_center_ids": { "10": 429, "11": 29, "13": 37, "4": 1, "1": 4 },
     "coordinates": [103.26117487778697, 22.2514955],
     "country": "Vietnam"
   }
   ```

   This shows that the location with ID `00467dd53ce7cbc5` is in Vietnam, and that most of its images (429) were uploaded to data center 10.

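As a rough sketch of how the two Twitter JSON files might be read in Python, the helpers below index `images_by_year.json` by calendar month and pick the most frequent data center for one location. All function names are hypothetical. The sketch assumes the files have been loaded with the standard `json` module and that `created_at` is a UNIX timestamp in seconds, which is interpreted here in UTC.

```python
import json
from datetime import datetime, timezone


def load_json(path):
    """Load one of the dataset's JSON files, e.g. `images_by_year.json`."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def images_in_month(images_by_year, year, month):
    """Return the entries for a calendar month (1-12) of the given year."""
    # Years are string keys; months are stored at zero-based positions.
    return images_by_year[str(year)][month - 1]


def tweet_time(entry):
    """Convert an entry's `created_at` value to a UTC datetime (assumed seconds)."""
    return datetime.fromtimestamp(entry["created_at"], tz=timezone.utc)


def dominant_data_center(place):
    """Return (data_center_id, image_count) for the most used data center."""
    counts = place["data_center_ids"]
    dc_id = max(counts, key=counts.get)
    return dc_id, counts[dc_id]


# Usage with the example values from this section.
entry = {"id": 1625151000184262657, "created_at": 1676301166.0,
         "file_name": "Fo2yIIAaMAIIUS6.jpg"}
place = {"id": "00467dd53ce7cbc5",
         "data_center_ids": {"10": 429, "11": 29, "13": 37, "4": 1, "1": 4},
         "coordinates": [103.26117487778697, 22.2514955],
         "country": "Vietnam"}
print(tweet_time(entry).isoformat())
print(dominant_data_center(place))  # ('10', 429)
```

With this helper, `images_in_month(obj, 2022, 12)[9]` is equivalent to `obj['2022'][11][9]`.
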
## Flickr

The Flickr dataset consists of 2 binary files, `images` and `images_filtered`. Each file contains approximately 1.6 million entries, each holding an image's ID, date posted, and date taken, as crawled from the Flickr API. Each entry is a binary sequence 24 bytes long, with each of the three values taking up 8 bytes.

To extract information from this dataset, first convert each 8-byte sequence into an integer, reading it as little-endian. The Flickr ID is kept as-is, while the other two numbers are the corresponding UNIX timestamps. Below is example code that extracts this information from a single entry.

```python
def get_info(data):
    # Each entry is exactly 24 bytes: three 8-byte little-endian integers.
    assert len(data) == 24, "chunk length must be 24"
    photo_id = int.from_bytes(data[:8], byteorder='little')       # Flickr ID
    date_posted = int.from_bytes(data[8:16], byteorder='little')  # UNIX timestamp
    date_taken = int.from_bytes(data[16:], byteorder='little')    # UNIX timestamp
    return photo_id, date_posted, date_taken
```
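
To process a whole file rather than a single entry, one possible approach is to walk the raw bytes in 24-byte steps. The sketch below uses the standard `struct` module with the format `<QQQ` (three unsigned 64-bit little-endian integers), which matches the unsigned interpretation that `int.from_bytes` applies by default. The names `ENTRY`, `iter_entries`, and `read_entries` are hypothetical, and the sketch assumes the file consists of back-to-back 24-byte entries with no header or padding.

```python
import struct

# One entry: Flickr ID, date posted, date taken (each 8 bytes, little-endian).
ENTRY = struct.Struct("<QQQ")


def iter_entries(raw):
    """Yield (photo_id, date_posted, date_taken) tuples from raw file bytes."""
    if len(raw) % ENTRY.size != 0:
        raise ValueError("data length must be a multiple of 24 bytes")
    yield from ENTRY.iter_unpack(raw)


def read_entries(path):
    """Read every entry from a dataset file such as `images`."""
    with open(path, "rb") as f:
        return list(iter_entries(f.read()))


# Usage with two synthetic entries (illustrative values, not real data).
raw = ENTRY.pack(1, 2, 3) + ENTRY.pack(4, 5, 6)
print(list(iter_entries(raw)))  # [(1, 2, 3), (4, 5, 6)]
```

Reading the file in one call keeps the sketch short; for files of this size (about 1.6 million entries at 24 bytes each) the whole content fits comfortably in memory.
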
