---
tags: BigData
title: Lab 01 - Data Retrieval with SQL
---
# Lab 01 : Data Retrieval with SQL
**Course:** Big Data - IU S23
**Author:** Firas Jolha
## Dataset
- flamingo_dataset
- [original](https://github.com/esther3587/Catch-Pink-Flamingo-Game/raw/master/big_data_capstone_datasets_and_scripts/flamingo-data.zip)
- [preprocessed](https://bit.ly/3YfcnK4)
<!-- - (http://bit.ly/3wLnbUH) -->
# Agenda
[toc]
# Prerequisites
- Installed Hortonworks Data Platform (HDP) Sandbox
- Installed PostgreSQL server
# Objectives
- Build a relational PostgreSQL database for a mobile game data
- Retrieve data from a single table
- Retrieve data from multiple tables
- Aggregation and grouping
- Subqueries
<!-- - Group by queries -->
# Introduction
The Structured Query Language (SQL) is the most extensively used database language. SQL is composed of a data definition language (DDL), which allows the specification of database schemas; a data manipulation language (DML), which supports operations to retrieve, store, modify and delete data; and a data control language (DCL), which enables database administrators to configure security access to databases. In this lab, we will practice how to retrieve structured data from relational databases using SQL. We will store and retrieve the data of a mobile game called “Catch the Pink Flamingo”.
# Game Description
One of the products of an imaginary company Eglence Inc. is a highly popular mobile game called “Catch the Pink Flamingo”. The objective of the game is to catch as many Pink Flamingos as possible by following the missions provided by real-time prompts in the game and cover the map provided for each level. The levels get more complicated in mission speed and map complexity as the users move from level to level.
<center>
<img src='https://www.linkpicture.com/q/Opera-Snapshot_2023-01-17_224924_www.coursera.org.png' width="600" />
<p>
A snapshot of the game "Catch the Pink Flamingo"
</p>
</center>
It’s a multi-user game where the players have to catch Pink Flamingos that randomly pop up on a gridded world map based on missions that change in real-time. For the player or team to move to the next complexity level, they need to have at least one point in every map grid cell, i.e., cover the whole world map. An example mission would be “Catch the Flamingos on land with stars on their belly” in which the player should only click on flamingos that match the mission criteria, in this case, stars and being on land. If the player tags any other flamingo on the map, he/she or his/her team gets a negative point (-1) on that map location.
After the initial sign up, a player (user) is asked to play the Level 1 individually without joining any team. This is where the user gets trained as a player and starts building a game history. Level 1 is an easy entry to the game composed of only 64 (8x8) grid cells and longer, more obvious, fun missions. Upon completion of Level 1, the player gets asked if she/he wants to join any team or form a team and will continue the rest of the time as a team player even if that means the user is a 1-person team of her/his own. Each user is a member of at most one team.
At the beginning of each level, the game creates a brand new map with more cells than the level before. The complexity of the missions also increases. The missions change more frequently as the levels increase.
The players keep in touch via chat boards assigned to the teams and also via social media, e.g., Twitter. There are some things to consider while designing an information system for this game:
- **Ranking of Users:** Each user will be ranked individually by the speed and accuracy of their click to completion. The rankings get tracked in real-time and can be viewed both via the mobile app and the website for the game. In addition to score, speed and accuracy based ranking, the other players can see what parts of the map the user has the most points for. The players are also categorized based on their history as “rising star”, “veteran”, “coach”, “social butterfly” and “hot flamingo”. These refer to the qualities of players in addition to the game statistics.
- **Ranking of Teams:** The teams are ranked publicly. There is a maximum of 30 members in a team and a minimum of 1 member. The players “ask” to join a team and get voted in when 80% of the team members allow. A team may choose to “recruit” if they think a player can contribute or “outvote” a player if a player is not contributing. The players are also allowed to change their teams and bring all their points along. The competition is built on “point-based economy” and it is encouraged by the game providers. When all players leave a team, the team automatically gets removed from public and archived by Eglence Inc.
- **In-game Purchases:** Users are allowed in game purchases including binoculars to spot the mission specific flamingos, special flamingos that count for more than one grid point, ice blocks to freeze a mission for 20 seconds when needed, and trading cards to transfer the extra points from some grid cells to the ones without any points.
- **Game Completion:** The game never ends, meaning that there will always be a more complicated next level. A challenge for Eglence Inc. is to keep the game interesting and engaging for players who have been around for a long time.
<!-- - They make use of big data analytics to make sure the veteran players are still around. -->
:::warning
Note: We do not use the data collected from chat boards in this lab. We use only the data on in-game purchases, ad clicks and game-specific information.
:::
# Dataset Description
The data generation scripts create several log files recording the activities of people playing *Catch the Pink Flamingo*. The collected data can be visualized as an ER diagram below.
<!-- <center>
<img src='https://i.imgur.com/mCB3XtO.png' width="900"/>
<p>ER-diagram
</p>
</center>
-->
<!-- ![]() -->
<center>
<img src='https://i.imgur.com/sGWiuF4.png' width="900"/>
<p>ER-diagram
</p>
</center>
<!-- <center>
<img src='https://i.imgur.com/MthjOnU.png' width="900"/>
<p>ER-diagram
</p>
</center> -->
<!-- <center>
<img src='https://i.imgur.com/qfJnYuf.jpg' width="900"/>
<p>ER-diagram
</p>
</center>
-->
:::spoiler Actual ER-diagram
<!-- <center>
<img
src = 'https://i.imgur.com/5ZWY7Fd.png'
width="900"/>
<p>Actual ER-diagram
</p>
</center> -->
<!-- ![]() -->
<center>
<img src='https://i.imgur.com/Lz1y7Je.png' width="900"/>
<p>Actual ER-diagram
</p>
</center>
<!-- <center>
<img src='https://i.imgur.com/Kg0xcK8.png' width="900"/>
<p>Actual ER-diagram
</p>
</center>
-->
<!-- <center>
<img
src = 'https://i.imgur.com/sG9YivQ.jpg'
width="900"/>
<p>Actual ER-diagram
</p>
</center> -->
:::
<br>
The diagram consists of 8 tables corresponding to 8 csv files. You can download the required dataset for this lab from [here](https://github.com/esther3587/Catch-Pink-Flamingo-Game/raw/master/big_data_capstone_datasets_and_scripts/flamingo-data.zip). We present in the table below the description of the dataset files.






<!-- | File name | ---------------------------- |
| ad-clicks.csv | A line is added to this file when a player clicks on an advertisement in the Flamingo app. | **timestamp:** when the click occurred.<br /> **txId:** a unique id (within ad-clicks.log) for the click. <br /> **userSessionid:** the id of the user session for the user who made the click. <br /> **teamid:** the current team id of the user who made the click. <br /> **userid:** the user id of the user who made the click. <br /> **adId:** the id of the ad clicked on. <br /> **adCategory:** the category/type of ad clicked on. |
-->
:::danger
Note: The original dataset is only a sample and not complete. To preserve the relationships between the tables, you need to import only the samples which staisfy these relationships and drop others. I did it for you and you can download the preprocessed dataset from [here](http://bit.ly/3wLnbUH).
:::
# Build the Database
:::warning
Note: Explore the dataset using any data analysis tool (Python+Pandas, Excel, Google Sheet, ...etc) you have before building the database.
Another note: You do not need to create a new role/user. Just use the current role `postgres`.
:::
The dataset are csv files and need to be imported into PostgreSQL tables.
To copy the values from csv files into the table, you need to use Postgres `COPY` method od data-loading. Postgres's COPY comes in two separate variants, COPY and \COPY: COPY is server based, \COPY is client based. COPY will be run by the PostgreSQL backend (user "postgres"). The backend user requires permissions to read & write to the data file in order to copy from/to it. You need to use an absolute pathname with COPY. \COPY on the other hand, runs under the current $USER, and with that users environment. And \COPY can handle relative pathnames. The psql \COPY is accordingly much easier to use if it handles what you need. So, you can use the meta-command `\COPY` to import the data from csv files into PostgreSQL tables but you need to make sure that the primary keys and foreign keys are set according to the given ER diagram. You can learn more about ER diagrams from [here](https://www.lucidchart.com/pages/ER-diagram-symbols-and-meaning).
:::warning
Notice that you need to create tables before importing their data from files. You can detect the datatype of the column from its values.
:::
<!-- :::spoiler Examples
1. You can create a database `testdb` as follows:
``
2. You can import the names into a table names
::: -->
:::success
### Exercises
1. Create a database called `flamingodb`.
2. Access the database `flamingodb`.
3. Import the file `users.csv` into `users` table and ensure that you set `userid` as the primary key in the table.
***Sample solution:***
A. Create the table `users`:
flamingodb=# create table users(
timestamp timestamp not null,
userid numeric primary key,
nick varchar not null,
twitter varchar not null,
dob date not null,
country varchar(2) not null);
*NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "users_pkey" for table "users"
CREATE TABLE*
B. Import the table values from the file. Here we used options `csv` to indicate the format of the file and `header` to determine that the csv file has a header.
flamingodb=# \copy users from 'users.csv' csv header;
4. Import the file `team.csv` into `team` table and ensure that you set `teamid` as the primary key in the table.
5. Import the file `level-events.csv` into `levelevents` table and ensure that you set `eventid` as the primary key and `teamid` as the foreign key in the table from the table `team`.
6. In the same way, import all other files into the database `flamingodb`.
:::
:::info
Info: If you installed HDP using a hypervisor, then you can use the command `scp` to immediately copy files from the host machine to the local file system of the cluster node.
For instance, to transfer the dataset files in *flamingo-data* folder from the host machine to the folder `/data` in the cluster node, we run the following command on the terminal.
```sh
scp -P 2222 flamingo-data/* root@localhost:/data
```
Note: If you got issues as shown in the figure below, then you need to open the file `%USERPROFILE%/.ssh/known_hosts` and remove the previous keys (You can empty the file if all keys are not important).

`scp` will create a new key and exchange it with the cluster node.
:::
# Data Retrieval
In this part of the lab, we will practice on writing SQL queries to retrieve data from our database.
## Data Retrieval from a single table
You can return data from a single table by using `SELECT` command. This kind of data retrieval is the simplest way to get data from a single table.
:::success
### Exercises
1. Who are the players of the game?
2. Who are the players who were born before the year 2000?
3. Which teams have strength more than 80%?
4. What are the nicknames of the players who lived in Russia (country code is `RU`).
5. What are the top 5 most recent ad-clicks whose category is `fashion`?
6. Which players played the game on `Android` or `iPhone` platforms?
<!-- 13. Which team reached the top level in the game? -->
:::
:::warning
In PostgreSQL, double quotes `""` are used to indicate identifiers within the database, which are objects like tables, column names, and roles. In contrast, single quotes `''` are used to indicate string literals.
:::
## Data Retrieval from multiple tables (Joining tables)
:::success
### Exercises
7. Which players watched `electronics` ads.
8. Which players watched `electronics` ads and purchased items whose price is less than $20.
:::
## Aggregation and grouping
:::success
### Exercises
9. How many players who were playing from Russia?
10. How many players who were playing from all countries except Russia?
11. How many teams which had more than 10 members?
12. What is the average strength of the teams in the game?
:::
## Subqueries
:::success
### Exercises
13. Which teams whose strength is greater than the average strength of all teams?
14. Which teams were having more than half of the members from Russia?
:::
## Advanced Queries
:::success
### Exercises
15. What items purchased by members of the teams who are powerful than 70% of the teams.
<!-- 16. -->
:::
<!-- ## Set operations -->
<!-- ## Group by Queries -->
# References
- [PostgreSQL cheatsheet](https://quickref.me/postgres)
- [Catch the Pink Flamingo game description](https://eagronin.github.io/capstone-acquire/)
- Yasin N. Silva, Isadora Almeida, and Michell Queiroz. 2016. SQL: From Traditional Databases to Big Data. In Proceedings of the 47th ACM Technical Symposium on Computing Science Education (SIGCSE '16). Association for Computing Machinery, New York, NY, USA, 413–418. https://doi.org/10.1145/2839509.2844560
- [Comprehensive PostgreSQL Tutorial](https://www.postgresqltutorial.com)
- [psql 9.2 Documentation](https://www.postgresql.org/docs/9.2/app-psql.html)