---
tags: BigData
title: Lab 02 - Pandas DataFrame & Neo4j
---
# Lab 02 : Pandas DataFrame & Neo4j
**Course:** Big Data - IU S23
**Author:** Firas Jolha
## Dataset
- flamingo_dataset
  - [original](https://github.com/esther3587/Catch-Pink-Flamingo-Game/raw/master/big_data_capstone_datasets_and_scripts/flamingo-data.zip)
  - [preprocessed](https://bit.ly/3YfcnK4)
- chat_dataset
  - [original](https://github.com/esther3587/Catch-Pink-Flamingo-Game/raw/master/big_data_capstone_datasets_and_scripts/chat-data.zip)
# Agenda
[toc]
# Prerequisites
- Installed Hortonworks Data Platform (HDP) Sandbox
# Objectives
- Install python interpreter on Zeppelin Notebook
- Review on pandas DataFrame API
- Read and analyse data using pandas DataFrame API
- Install Neo4j on the cluster node
- Build a graph database
- Retrieve data from graph database using Cypher language
# Introduction
The data in relational databases is structured: you need to define the schema of relations and the constraints on the fields, and you store the data according to that schema and those constraints. This kind of database is not well suited for storing semistructured data (csv, xml, json, etc.) or unstructured data (e.g. Word, pdf, html, etc.). You can operate on structured data stored in relational databases using SQL, whereas semistructured data needs more flexible tools. In the first part of the lab, we will use one of the most common data analysis and manipulation packages in Python, called pandas, and introduce an analogy between SQL and pandas for retrieving data. In the second part, we will introduce the graph database as a NoSQL database. We will specifically use Neo4j, one of the most popular graph database management systems.
# Part 1: Pandas DataFrame
Pandas is an open-source Python library that is extensively used for data analysis and manipulation. To start working with pandas, import the package. The community-agreed alias for pandas is `pd`, so loading pandas as `pd` is the standard practice assumed throughout the pandas documentation.
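Loading the library then looks like this:

```python
# Import pandas under its conventional alias.
import pandas as pd

print(pd.__version__)
```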
<center>
<img src='https://i.imgur.com/XpIJLBx.png' width="300"/>
<p>pandas data table representation
</p>
</center>
<!-- ![]() -->
A [pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame) is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, categorical data and more) in columns. It is similar to a spreadsheet or a SQL table. You can create a dataframe as follows:
```python
df = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)
```
<!-- ![]() -->
<center>
<img src='https://i.imgur.com/uTh6hfI.png' width="140"/>
<p>Each column in a DataFrame is a Series
</p>
</center>
A [pandas Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html#pandas.Series) has no column labels, as it is just a single column of a DataFrame. A Series does have row labels. Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. You can create a `Series` from scratch as well:
```python
ages = pd.Series([22, 35, 58], name="Age")
```
Many pandas operations return a DataFrame or a Series. The `describe()` method is an example of a pandas operation returning some basic statistics of the numerical data. You can learn more about data structures in pandas from [here](https://pandas.pydata.org/docs/user_guide/dsintro.html#dsintro).
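For example, using the small DataFrame from above, `describe()` summarizes the numerical columns:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Braund, Mr. Owen Harris", "Allen, Mr. William Henry", "Bonnell, Miss. Elizabeth"],
        "Age": [22, 35, 58],
    }
)

# describe() skips non-numeric columns by default, so only "Age"
# is summarized: count, mean, std, min, quartiles and max.
stats = df.describe()
print(stats)
print(stats.loc["mean", "Age"])  # (22 + 35 + 58) / 3
```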
Here, we will show an analogy between the pandas library and SQL. We will use the dataset of the game "catch the pink flamingo" from the previous lab.
<center>
<img src='https://i.imgur.com/sGWiuF4.png' width="900"/>
<p>ER-diagram
</p>
</center>
## Installing pandas package
The cluster node in the HDP Sandbox does not have the pandas package installed, so you need to install it using pip. The Python version in the sandbox is 2.7, for which pip 20.3.4 needs to be installed using the `easy_install` script or any other method. After that, you need to install pandas version 0.24.2.
:::info
Search for a tutorial on your preferred search engine :slightly_smiling_face:
:::
:::warning
Do not update the Python version in the sandbox, otherwise some services will not work.
:::
:::spoiler simple steps to install pip and pandas without using yum
```sh
wget https://bootstrap.pypa.io/pip/2.7/get-pip.py
python get-pip.py
pip install pandas==0.24.2
```
:::
## Installing python interpreter on Zeppelin
[Apache Zeppelin](https://zeppelin.apache.org) is a web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala, Python, R and more.
<!-- -->
You can check this [tutorial](https://zeppelin.apache.org/docs/0.8.0/usage/interpreter/installation.html#3rd-party-interpreters) to see how you can add an interpreter to Zeppelin. You need to run the script `/usr/hdp/2.6.5.0-292/zeppelin/bin/install-interpreter.sh` and specify the name of the interpreter as an argument to the option `--name`. For Python, you need to pass `python` as follows:
```sh
/usr/hdp/2.6.5.0-292/zeppelin/bin/install-interpreter.sh --name python
```
:::danger
**Note:** Before you run the script, you need to add the line:
`export ZEPPELIN_INTERPRETER_DEP_MVNREPO="https://repo1.maven.org/maven2"`
at the beginning of the file `/usr/hdp/2.6.5.0-292/zeppelin/conf/zeppelin-env.sh`
<!-- as follows:
`sed -i '1iexport ZEPPELIN_INTERPRETER_DEP_MVNREPO="https://repo1.maven.org/maven2"' /usr/hdp/2.6.5.0-292/zeppelin/conf
/zeppelin-env.sh`
-->
<!-- This is due to the fact, that Zeppeling is not supported nowadays. -->

:::
Then run the script passing `python` as the interpreter.

After that, restart Zeppelin from the Ambari dashboard. You can access Zeppelin Notebook via the link http://localhost:9995 on your local machine.

You can add interpreters to Zeppelin by navigating to the interpreters window. Press `create`, select `python`, and add a name `python2` for the interpreter instance as shown below.
<!-- with custom properties to the list of Zeppelin interpreters as show in the figure below. -->
<!--  -->

---
<!-- Select python -->

Create a new note and select python2 as shown below.

<!-- Add a name

-->
This will create a new notebook where the default interpreter is `python2`, but you can still run cells with other interpreters by specifying the interpreter at the beginning of the cell, such as `%sh` for shell commands.
## Reading Data
The pandas I/O API provides read/write support for several file formats. You can check the full list of file formats from [here](https://pandas.pydata.org/docs/user_guide/io.html). You can read data from csv files via the [`pd.read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv) function and write to csv files using the [`DataFrame.to_csv`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html#pandas.DataFrame.to_csv) method. For example, you can easily read the data from the file `users.csv` as follows:
```python
users = pd.read_csv('users.csv')
```
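A self-contained sketch of a useful `read_csv` option, `parse_dates` (the inline CSV and its column names here are made up for illustration):

```python
import io

import pandas as pd

# Hypothetical data standing in for users.csv.
csv_text = """userId,nick,dob,country
1,ann,1985-04-02,RU
2,bob,1992-11-20,BR
"""

# parse_dates converts "dob" to datetime64 on load, so date
# comparisons such as users["dob"] < "1990-01-01" behave as expected.
users = pd.read_csv(io.StringIO(csv_text), parse_dates=["dob"])
print(users.dtypes)
```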
## Operations on Data
Most pandas operations return copies of the Series/DataFrame. To make the changes “stick”, you’ll need to either assign to a new variable:
```python
sorted_df = df.sort_values("col1")
```
or overwrite the original one:
```python
df = df.sort_values("col1")
```
You will see an `inplace=True` keyword for some methods but its usage [is discouraged](https://pandas.pydata.org/docs/user_guide/indexing.html#indexing-view-versus-copy).
### SELECT operation
In SQL, selection is done using a comma-separated list of the columns you'd like to select (or a `*` to select all columns). For example, we can retrieve the players' nick, dob and country via SQL:
```sql
-- 1.
SELECT nick, dob, country
FROM users;
```
With pandas, column selection is done by passing a list of column names to your DataFrame:
```python
# 1.
users[['nick', 'dob', 'country']]
```
In SQL, you can add a calculated column. For example, we can retrieve all team information together with the average strength of the teams (a window function keeps every row):
```sql
-- 2.
SELECT *, AVG(strength) OVER () AS avg_team_strength
FROM team;
```
With pandas, you can use the `DataFrame.assign()` method to append a new column:
```python
# 2.
team.assign(avg_team_strength=team["strength"].mean())
```
<!-- You can append a column to the dataframe in -->
:::success
**Exercises on pandas DataFrames**
1. Calculate the average price of the items and add the new column `avg_price` to the dataframe.
2. Calculate the age of the player and add the new column `age` to the dataframe.
:::
### WHERE operation
Filtering in SQL is done via a WHERE clause. We can retrieve the information of players whose country code is `BR`.
```sql
-- 3.
SELECT *
FROM users
WHERE country = 'BR';
```
DataFrames can be filtered in multiple ways, the most intuitive of which is boolean indexing.
```python
# 3.
users[users['country'] == 'BR']
```
The above statement simply passes a `Series` of `True`/`False` values to the DataFrame, returning all rows marked `True`. The expression `users['country']=='BR'` evaluates to a boolean `Series`, and the boolean indexing `users[users['country']=='BR']` returns the rows whose values in that series are `True`.
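A minimal self-contained sketch of that mechanism, using made-up data:

```python
import pandas as pd

users = pd.DataFrame({"nick": ["ann", "bob", "cid"], "country": ["RU", "BR", "BR"]})

mask = users["country"] == "BR"      # a boolean Series aligned to the rows
print(mask.tolist())                 # [False, True, True]
print(users[mask]["nick"].tolist())  # ['bob', 'cid']
```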
Just like SQL’s `OR` and `AND`, multiple conditions can be passed to a DataFrame using `|` (`OR`) and `&` (`AND`). For example, we can retrieve the players whose country code is `RU` and who were born before 1990.
```sql
-- 4.
SELECT *
FROM users
WHERE country = 'RU' AND dob < '1990-01-01';
```
In pandas, we can use boolean indexing.
```python
# 4.
users[(users["country"] == "RU") & (users["dob"] < '1990-01-01')]
```
:::info
NULL checking is done in pandas using the `notna()` and `isna()` methods whereas we use `IS NULL` and `IS NOT NULL` in SQL.
:::
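For instance (with made-up data), filtering on null and non-null values:

```python
import numpy as np
import pandas as pd

users = pd.DataFrame({"nick": ["ann", "bob"], "country": ["RU", np.nan]})

# SQL: WHERE country IS NOT NULL
print(users[users["country"].notna()]["nick"].tolist())  # ['ann']
# SQL: WHERE country IS NULL
print(users[users["country"].isna()]["nick"].tolist())   # ['bob']
```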
:::success
**Exercises on pandas DataFrame**
3. Which teams have a strength of more than 80% and a name shorter than 8 characters?
4. Which game clicks were hit by teams whose level is 5 or higher?
5. Which ads belong to the category "computers"?
6. Which items have a price of more than 5 and were purchased by different players?
:::
### GROUP BY operation
In pandas, SQL’s `GROUP BY` operations are performed using the similarly named [`groupby()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html#pandas.DataFrame.groupby) method. `groupby()` typically refers to a process where we’d like to split a dataset into groups, apply some function (typically aggregation), and then combine the groups together. For example, we can retrieve the number of players grouped by country code:
```sql
-- 5.
SELECT country, COUNT(*)
FROM users
GROUP BY country;
```
The pandas equivalent would be:
```python
# 5.
users.groupby("country").size()
```
:::warning
Notice that in the pandas code we used `size()` and not `count()`. This is because `count()` applies the function to each column, returning the number of `NOT NULL` records within each. Try running `users.groupby("country").count()`.
:::
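The difference is easy to see on a tiny made-up frame with a missing value:

```python
import numpy as np
import pandas as pd

users = pd.DataFrame(
    {"country": ["RU", "RU", "BR"], "dob": ["1985-04-02", np.nan, "1992-11-20"]}
)

# size() counts rows per group, missing values included.
print(users.groupby("country").size()["RU"])          # 2
# count() counts only non-null values per column per group.
print(users.groupby("country")["dob"].count()["RU"])  # 1
```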
Multiple functions can also be applied at once. For instance, say we'd like to see how the team strength differs by `teamCreationTime`. `agg()` allows you to pass a dictionary to your grouped DataFrame, indicating which functions to apply to specific columns.
```sql
-- 6.
SELECT teamcreationtime, AVG(strength), COUNT(*)
FROM team
GROUP BY teamcreationtime;
```
The pandas equivalent would be:
```python
# 6.
import numpy as np

team.groupby("teamCreationTime").agg({"strength": np.mean, "teamCreationTime": np.size})
```
Grouping by more than one column is done by passing a list of columns to the `groupby()` method. For example, we can retrieve the number of user sessions for each user on each platform.
```sql
-- 7.
SELECT userid, platformtype, COUNT(*)
FROM usersessions
GROUP BY userid, platformtype;
```
The pandas equivalent would be:
```python
# 7.
usersessions.groupby(["userid", "platformtype"]).agg({"usersessionid": [np.size]})
```
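One practical note: grouping by several columns produces a hierarchical index. If you prefer a flat table like the SQL result, `reset_index` helps. A sketch with made-up data (column names assumed from the usersessions table):

```python
import pandas as pd

# Hypothetical stand-in for the usersessions table.
usersessions = pd.DataFrame(
    {
        "userid": [1, 1, 1, 2],
        "platformtype": ["android", "android", "iphone", "iphone"],
        "usersessionid": [10, 11, 12, 13],
    }
)

# size() + reset_index(name=...) yields one flat row per group.
counts = usersessions.groupby(["userid", "platformtype"]).size().reset_index(name="count")
print(counts)
```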
:::success
**Exercises on pandas DataFrame**
7. What is the average team level and the number of players for each platform?
8. How many ads are there for each category?
9. How many teams watched each ad category?
10. How many game clicks were hit by each player in each team?
11. What is the average price of items purchased by each team?
12. What is the average price of items purchased by each team member?
13. What is the number of items purchased by each team?
14. What is the number of items purchased by each team member?
:::
### JOIN operation
In pandas, `JOIN`s can be performed with [`join()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html#pandas.DataFrame.join) or [`merge()`](https://pandas.pydata.org/docs/reference/api/pandas.merge.html#pandas.merge). By default, [`join()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html#pandas.DataFrame.join) will join the DataFrames on their indices. Each method has parameters allowing you to specify the type of join to perform (`LEFT`, `RIGHT`, `INNER`, `FULL`) or the columns to join on (column names or indices).
:::warning
If both key columns contain rows where the key is a null value, those rows will be matched against each other. This is different from usual SQL join behaviour and can lead to unexpected results.
:::
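A small made-up demonstration of this gotcha:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"userId": [1, np.nan], "nick": ["ann", "ghost"]})
right = pd.DataFrame({"userId": [1, np.nan], "price": [3.0, 9.0]})

# Unlike SQL, the NaN keys match each other, so we get 2 rows,
# not 1: the 1/1 match plus the NaN/NaN match.
merged = pd.merge(left, right, on="userId")
print(len(merged))  # 2
```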
#### INNER JOIN (Natural Join)
For example, we can retrieve all user info and item info by performing an inner join between the buyclicks and users tables (retrieves only users who purchased items).
```sql
-- 8.
SELECT *
FROM users
INNER JOIN buyclicks
ON users.userid = buyclicks.userid;
```
Similarly in pandas:
```python
# 8. pd.merge performs an inner join by default
pd.merge(users, buyclicks, on="userId")
```
`pd.merge()` also offers parameters for cases when you’d like to join one DataFrame’s column with another DataFrame’s index.
```python
# 8.
indexed_users = users.set_index("userId")
# Join on the column "userId" of the left dataframe "buyclicks"
# and on the index of the right dataframe "indexed_users".
pd.merge(buyclicks, indexed_users, left_on="userId", right_index=True)
```
#### LEFT OUTER JOIN
For example, we can retrieve the user info and item info by performing a left outer join between the buyclicks and users tables (retrieves all users, whether they purchased items or not).
```sql
-- 9.
SELECT *
FROM users
LEFT OUTER JOIN buyclicks
ON users.userid = buyclicks.userid;
```
In pandas, we use the `how` argument of the method `pd.merge`:
```python
# 9.
pd.merge(users, buyclicks, on="userId", how="left")
```
#### RIGHT OUTER JOIN
For example, we can retrieve the user info and item info by performing a right outer join between the buyclicks and users tables (retrieves all purchase records, whether or not they match a user).
```sql
-- 10.
SELECT *
FROM users
RIGHT OUTER JOIN buyclicks
ON users.userid = buyclicks.userid;
```
In pandas, we use the `how` argument of the method `pd.merge`:
```python
# 10.
pd.merge(users, buyclicks, on="userId", how="right")
```
#### FULL OUTER JOIN
For example, we can retrieve the user info and item info by performing a full outer join between the buyclicks and users tables (retrieves all users and all purchase records, matched where possible).
```sql
-- 11.
SELECT *
FROM users
FULL OUTER JOIN buyclicks
ON users.userid = buyclicks.userid;
```
In pandas, we use the `how` argument of the method `pd.merge`:
```python
# 11.
pd.merge(users, buyclicks, on="userId", how="outer")
```
<!-- ### UNION operation
-->
### LIMIT operation
In SQL, we use the `LIMIT` keyword in a `SELECT` statement when we want to retrieve only a specific number of records. In pandas, the `head` and `tail` methods do a similar job.
For example, we can retrieve only the first 10 players.
```sql
-- 12.
SELECT * FROM users
LIMIT 10;
```
In pandas, we use the `head` method, which performs the same job as `LIMIT`. pandas also provides the `tail` method, which returns the last records of the result.
```python
# 12.
users.head(10)
```
:::success
**Exercises on pandas DataFrame**
15. How many ads were watched by players who belong to teams whose strength is more than 80%?
16. What is the average price of items purchased by teams with more than 5 players?
17. What is the average price of items purchased by players who played on `Android` or `iPhone` platforms?
18. How many game clicks were made by users who played on the `iPhone` platform?
19. What is the average strength of teams that watched ads from more than one category (different categories)?
20. How many players performed assignments on `Android` platforms and watched ads from the category `fashion`?
:::
<!-- ### Additional operations -->
<!-- #### Top n rows with offset
#### Top n rows per group
-->
### UPDATE/DELETE operation
In addition to data retrieval, you can manipulate data in SQL using `DELETE` and `UPDATE` statements. pandas offers multiple ways to manipulate data; here we will use the `loc` indexer to update data in the dataframe.
For example, to halve the price of items whose price is more than 5, we can write:
```sql
-- 13.
UPDATE buyclicks
SET price = price / 2
WHERE price > 5;
```
Whereas in pandas:
```python
# 13.
buyclicks.loc[buyclicks["price"] > 5, "price"] /= 2
```
If we want to delete some records, for instance all players whose country code is `BR`:
```sql
-- 14.
DELETE FROM users
WHERE country = 'BR';
```
In pandas, we simply overwrite the original dataframe:
```python
# 14.
users = users.loc[users["country"] != "BR"]
```
Here, we showed how the pandas DataFrame, which is similar to a table in a relational database, provides flexible ways to retrieve and manipulate data. We encourage you to learn more about the methods provided by the pandas DataFrame API from the [official website](https://pandas.pydata.org/docs/user_guide/index.html). You can also learn more about the differences between SQL and pandas DataFrames from [this blog](https://ponder.io/pandas-vs-sql-food-court-michelin-style-restaurant/).
:::success
**Exercises on pandas DataFrame**
*Note: For this practice you can work on a copy of the dataframe.*
21. Delete all players who played on two different platforms.
22. Double the price of items that were purchased more than 3 times.
23. Delete all teams that have only 1 member.
24. Delete all items whose price is 20 or more.
25. Delete all ads from the `computers` category that were clicked by one player or less.
:::
# Part 2: Graph databases
A graph database stores nodes and relationships instead of tables or documents. Data is stored just like you might sketch ideas on a whiteboard: without restricting it to a pre-defined model, which allows a very flexible way of thinking about and using it.
Graph databases address big challenges many of us tackle daily. Modern data problems often involve many-to-many relationships over heterogeneous data, which creates the need to:
- Navigate deep hierarchies,
- Find hidden connections between distant items, and
- Discover inter-relationships between items.
Whether it is a social network, a payment network, or a road network, you will find that everything is an interconnected graph of relationships. And when we want to ask questions about the real world, many questions are about the relationships rather than about the individual data elements.
<center>
<img src='https://i.imgur.com/TolfC5Q.png' width="400"/>
<p>Example of a simple graph database
</p>
</center>
<!-- ![]() -->
Neo4j is an open-source, NoSQL, native graph database, publicly available since 2007, that provides an ACID-compliant transactional backend for your applications. In Neo4j, information is organized as nodes, relationships, and properties. We can retrieve data from Neo4j via its query language, called Cypher, which is a declarative query language similar to SQL but optimized for graphs.
<center>
<img src='https://i.imgur.com/JE0Cr95.png' width="400"/>
<p>Building blocks of the property graph model
</p>
</center>
The property graph model consists of mainly nodes and relationships. Nodes are the entities in the graph.
- Nodes can be tagged with labels, representing their different roles in your domain. (For example, Person).
- Nodes can hold any number of key-value pairs, or properties. (For example, name).
- Node labels may also attach metadata (such as index or constraint information) to certain nodes.
Whereas relationships provide directed, named, connections between two node entities (e.g. Person LOVES Person).
- Relationships always have a direction, a type, a start node, and an end node, and they can have properties, just like nodes.
- Nodes can have any number or type of relationships without sacrificing performance.
- Although relationships are always directed, they can be navigated efficiently in any direction.
## Download and Install Neo4j
Neo4j can be installed as a Zeppelin interpreter, but HDP 2.6.5 ships Zeppelin 0.7.3, which does not support the neo4j interpreter, so installing it there will not work. The latest Zeppelin version (0.10.0, as of Feb. 2023) supports Neo4j, but we do not need to install it. HDP 3.0 does ship Zeppelin 0.8.0, which supports the neo4j interpreter, but it is not worth installing HDP 3.0 just for Neo4j, since most of our work on the sandbox will not involve graph databases.
:::danger
If you are downloading Neo4j Desktop from Russia, select `other` from country list and use some proxy server.
:::
The **easy** approach is to install Neo4j Desktop on your local machine. You can find here (https://neo4j.com/docs/desktop-manual/current/installation/download-installation/) some information about installing Neo4j Desktop, and here (https://neo4j.com/download-center/) the download links. After you start the download, the website will give you a long activation key that you need to enter when you install the software, as follows:

The software will prepare the environment for installation as follows.
<!-- ![]() -->
<center>
<img src='https://i.imgur.com/XvkI0rv.png' width="300"/>
<p>Installing Neo4j Desktop
</p>
</center>
When the installation succeeds, it will show the following dashboard.

The **less easy** approach is to install Neo4j on your cluster node via `yum`, or offline via `rpm`. You can find here (https://yum.neo4j.com) some information about installing Neo4j via `yum`, and here (https://neo4j.com/docs/operations-manual/3.5/installation/linux/rpm/#linux-yum) some information about offline installation. After that, you need to forward the port 7474 to the local machine in order to access Neo4j Browser: first from the local machine to the guest machine, and then from the guest machine to the cluster node. You can forward the port from the local machine to the guest machine via the VirtualBox GUI, whereas port forwarding from the guest machine to the cluster node needs SSH local port forwarding; you can find instructions for different Linux distros here (https://linuxhint.com/ssh-port-forwarding-linux/).
:::warning
If you want to install Neo4j on the cluster, note that only older versions of Neo4j can be installed. We suggest version 3.5.35.1; you can see below the cypher shell dependencies.


:::
:::info
You can access the guest machine in VirtualBox by ssh into the `HostSSH` port specified in the `port forwarding rules` which can be reached from the network settings of the virtual machine. You can also add your rules to the list.
:::
You can access the Neo4j Browser usually at http://localhost:7474.
:::danger
If you can not install Neo4j due to some reason, you can open a free session of Neo4j Sandbox at https://neo4j.com/sandbox/ but you need proxy.
:::
<!-- ## Access Neo4j Browser
After
-->
## Dataset Description
In this part, we will use the chat dataset from the same game "catch the pink flamingo". The dataset consists of 6 csv files collected from the chat activities of teams and players.
| File name | Description | Fields |
| -------- | -------- | -------- |
| chat_create_team_chat.csv | A line is added to this file when a player creates a new chat with their team | userid, teamid, TeamChatSessionID, timestamp|
| chat_item_team_chat.csv | Creates nodes labeled ChatItems. Column 0 is User id, column 1 is the TeamChatSession id, column 2 is the ChatItem id (i.e., the id property of the ChatItem node), column 3 is the timestamp for an edge labeled “CreateChat”. Also creates an edge labeled “PartOf” from the ChatItem node to the TeamChatSession node. This edge has a timestamp property using the value from Column 3. | userid, teamchatsessionid, chatitemid, timestamp|
| chat_join_team_chat.csv | Creates an edge labeled “Joins” from User to TeamChatSession. The columns are the User id, TeamChatSession id and the timestamp of the Joins edge. | userid, TeamChatSessionID, timestamp |
| chat_leave_team_chat.csv | Creates an edge labeled “Leaves” from User to TeamChatSession. The columns are the User id, TeamChatSession id and the timestamp of the Leaves edge. | userid, teamchatsessionid, timestamp |
| chat_mention_team_chat.csv | Creates an edge labeled “Mentioned”. Column 0 is the id of the ChatItem, column 1 is the id of the User, and column 2 is the timestamp of the edge going from the chatItem to the User. | ChatItem, userid, timestamp |
| chat_respond_team_chat.csv | A line is added to this file when player with chatid2 responds to a chat post by another player with chatid1. | chatid1, chatid2, timestamp |
## Build the database
In Neo4j Desktop, create a new project as follows:

and add a local DBMs to the new project. Then, start the database as follows:

As you can see in the following screenshot, the database is active and we can access it as a default user `neo4j` by opening a dashboard with `Neo4j Browser`.

In order to load csv files to the database, you need to add them to the `import` of the database (do not add them to the project directory) as follows:

After copying/pasting the csv files, the `import` directory will be as follows:

<!--  -->
Each of the 6 CSV files described in the previous table needs to be loaded into Neo4j using the `LOAD CSV` command, which reads each row of the file and then assigns the imported values to the nodes and edges of the graph. For example, the code below loads the nodes and values from chat_join_team_chat.csv. Each row in this file has 3 values: userid, TeamChatSessionID and timestamp. As the code reads each row, it merges the value from the first column into a node of type “User”, the value from the second column into a node of type “TeamChatSession”, and uses the value from the third column as the `timestamp` property of a “Join” edge. The code also specifies that this edge links each User to that User’s TeamChatSession.
```sql
LOAD CSV FROM "file:///chat_join_team_chat.csv" AS row
MERGE (u:User {id: toInteger(row[0])})
MERGE (c:TeamChatSession {id: toInteger(row[1])})
MERGE (u)-[:Join {timestamp: row[2]}]->(c)
```
:::success
<!-- Given the schema description in the previous table and diagram in the following figure.
**Schema of graph database `chatgraphdb`**

-->
**Exercises**
1. Build the graph database `chatgraphdb`.
2. Access the graph database `chatgraphdb` as `neo4j` user.
3. Create all nodes in the graph. You can create index for the node properties which are used as identifiers of the data.
4. Build the relationships according to the data description given in the table.
:::
## Retrieve data
<!-- Working on Cypher language is straight-forward, and writing Cypher queries is -->
Cypher is Neo4j’s graph query language that lets you retrieve data from the graph. It is like SQL for graphs, and was inspired by SQL, so it lets you focus on what data you want out of the graph (not how to go get it). It is by far the easiest graph query language to learn because of its similarity to other languages and its intuitiveness.
An example Cypher query: we can retrieve the players whose id is greater than 100 and who joined a team chat session before "2016-06-01".
```sql
MATCH (u:User)-[j:Join]->(s:TeamChatSession)
WHERE u.id > 100 AND j.timestamp < "2016-06-01"
RETURN *
```
:::success
**Exercises on Cypher language**
1. Retrieve all players who added chat items before 2016-06-01.
2. Retrieve the number of players who joined team chat sessions.
3. Retrieve the players who replied to chats posted by other players.
4. How many chat sessions is a user participating in at the same time?
5. Which users are most active in a specific chat session?
6. Do users chat more (or less) before they leave a team?
7. Which teams are having more conversations?
:::
# References
- [Pandas documentation](https://pandas.pydata.org/docs/index.html)
- [Catch the Pink Flamingo game description](https://eagronin.github.io/capstone-acquire/)
- [Comparison with SQL](https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_sql.html)
- [Neo4j](https://neo4j.com/)