--- tags: IntroToBigData title: Lab 2 - Data Retrieval with SQL and Cypher --- Lab 2 - Data Retrieval with SQL and Cypher === **Course:** Intro to Big Data - IU S23 **Author:** Firas Jolha # Dataset - [Users and Posts](https://disk.yandex.ru/d/HlBEEgUJmkD6Xg) # Agenda [TOC] # Prerequisites - HDP Sandbox is installed - Familiarity with SQL and PostgreSQL server # Objectives - Access `PostgreSQL` database server - Review CRUD operations on a relational database - Install Neo4j on HDP Sandbox - Perform some CRUD operations on a graph database # Introduction The Structured Query Language (SQL) is the most extensively used database language. SQL is composed of a data definition language (DDL), which allows the specification of database schemas; a data manipulation language (DML), which supports operations to retrieve, store, modify and delete data. In this lab, we will practice how to retrieve structured data from relational databases using SQL. Then we will build a graph for the same dataset on Neo4j and return connected data via Cypher queries. # Dataset Descripion This dataset contains four tables about social media users. - **users. csv** contains data about the users of the social net. - **friends.csv** contains data about their friendship status - **posts.csv** contains data about the posts they made | File name | Description | Fields | | -------- | -------- | -------- | | users.csv | A line is added to this file when a user is subscribed to the socia media along with the age and subscription date time | userid, surname, name, age, timestamp| | posts.csv | The user can add multiple posts and for each post we have post type and post unix timestamp | postid, user, posttype, posttimestamp| friends.csv | The user can have friends and this relationship is mutual | friend1, friend2 | # Access PostgreSQL server In fact, HDP Sandbox comes with a pre-installed PostgreSQL server so you do not to install any additional software to use PostgreSQL. You can access it via `psql` as `postgres` user: ```sh [root@sandbox-hdp ~]# psql -U postgres ``` This will open a CLI to interact with PostgreSQL. You can access the databases using `\c` command and the tables via `\dt` command. For example, you can access the database `ambari` (a default database on HDP Sandbox) by running the following command in postgreSQL CLI. ```sh postgres=# \c ambari You are now connected to database "ambari" as user "postgres". ``` :::info The version of PostgreSQL in HDP 2.6.5 is 9.2.23 as shown below. So you should read the documentation for this version of PostgreSQL. ![](https://i.imgur.com/5pVBcTu.png) ::: <!-- # PostgreSQL Datatypes :::spoiler - Numerical types - INT - SERIAL - FLOAT(n) - n bytes - REAL - NUMERIC(precision, scale) - String types - CHAR(n) - VARCHAR(n) - TEXT - Datetime types - DATE - TIME - TIMESTAMP - TIMESTAMPZ - INTERVAL - Other types - BOOLEAN - ARRAYS - [] - MONEY ::: --> # PostgreSQL Meta-commands Anything you enter in psql that begins with an unquoted backslash is a psql meta-command that is processed by psql itself. These commands make psql more useful for administration or scripting. Meta-commands are often called slash or backslash commands. - Switch connection to a new database `dbname` as a user `username` ```sql \c <dbname> <username> ``` - List available databases ```sql \l ``` - List available tables ```sql \dt ``` - Describe a table `<table_name>` ```sql \d <table_name> ``` - Execute psql commands from a file `<file.sql>` ```sql \i <file.sql> ``` - Get help on psql commands ```sql \? \h CREATE TABLE ``` - Turn on query execution time ```sql \timing ``` :::info You use the same command `\timing` to turn it off. ::: - Quit psql ```sql \q ``` :::info You can run the shell commands in psql CLI via `\! `. For instance, to print the current working directory we write: ```powershell! \! pwd ``` ::: <!-- # PostgreSQL Data types | Category | Data types | Example | | -------- | -------- | -------- | | Numerical | INT, SERIAL, FLOAT(n_bytes), NUMERIC(precision, scale) | Text | --> <!-- - Numerical types - INT - SERIAL - FLOAT(n) - n bytes - REAL - NUMERIC(precision, scale) - String types - CHAR(n) - VARCHAR(n) - TEXT - Datetime types - DATE - TIME - TIMESTAMP - TIMESTAMPZ - INTERVAL - Other types - BOOLEAN - ARRAYS - [] - MONEY --> <!-- You can read this [cheat sheet](https://quickref.me/postgres) to refresh your knowledge in PostgreSQL which is **required for the next lab.** --> # Build the Database :::warning Note: Explore the dataset using any data analysis tool (Python+Pandas, Excel, Google Sheet, ...etc) you have before building the database. Another note: You do not need to create a new role/user. Just use the current role `postgres`. ::: The dataset are csv files and need to be imported into PostgreSQL tables. To copy the values from csv files into the table, you need to use Postgres `COPY` method od data-loading. Postgres's COPY comes in two separate variants, COPY and \COPY: COPY is server based, \COPY is client based. COPY will be run by the PostgreSQL backend (user "postgres"). The backend user requires permissions to read & write to the data file in order to copy from/to it. You need to use an absolute pathname with COPY. \COPY on the other hand, runs under the current $USER, and with that users environment. And \COPY can handle relative pathnames. The psql \COPY is accordingly much easier to use if it handles what you need. So, you can use the meta-command `\COPY` to import the data from csv files into PostgreSQL tables but you need to make sure that the primary keys and foreign keys are set according to the given ER diagram. You can learn more about ER diagrams from [here](https://www.lucidchart.com/pages/ER-diagram-symbols-and-meaning). :::warning Notice that you need to create tables before importing their data from files. You can detect the datatype of the column from its values. ::: 1. Create the database `social` ```sql! -- Create Database CREATE DATABASE social; -- Switch to the Database \c social; -- add User table CREATE TABLE IF NOT EXISTS Users( userid INT PRIMARY KEY NOT NULL, surname VARCHAR(50), name VARCHAR(50), age INT, _timestamp timestamp, temp BIGINT NOT NULL ); -- add Post table CREATE TABLE IF NOT EXISTS Posts( postid INT PRIMARY KEY NOT NULL, userid INT NOT NULL, posttype VARCHAR(50), posttimestamp timestamp, temp BIGINT NOT NULL ); -- add Friend table CREATE TABLE IF NOT EXISTS Friends( friend1 INT NOT NULL, friend2 INT NOT NULL ); -- add constrains ALTER TABLE Friends ADD CONSTRAINT fk_User_userid_friend1 FOREIGN KEY (friend1) REFERENCES Users (userid); ALTER TABLE Friends ADD CONSTRAINT fk_User_userid_friend2 FOREIGN KEY (friend2) REFERENCES Users (userid); ALTER TABLE Posts ADD CONSTRAINT fk_User_userid_userid FOREIGN KEY (userid) REFERENCES Users (userid); ``` 2. Upload the csv data into tables ```sql! \COPY Users(userid, surname, name, age, temp) FROM 'users.csv' DELIMITER ',' CSV HEADER; UPDATE Users SET _timestamp = TO_TIMESTAMP(temp); ALTER TABLE Users DROP COLUMN temp; \COPY Posts(postid, userid, posttype, temp) FROM 'posts.csv' CSV HEADER; UPDATE Posts SET posttimestamp = TO_TIMESTAMP(temp); ALTER TABLE Posts DROP COLUMN temp; \COPY Friends from 'friends.csv' csv header; ``` ## Fix Friends table ```sql! CREATE TABLE friends2 AS SELECT friend1, friend2 FROM friends UNION SELECT friend2, friend1 FROM friends; ALTER TABLE friends RENAME TO friends_old; ALTER TABLE friends2 RENAMT TO friends; ``` # Data Retrieval using SQL In this part of the lab, we will practice on writing SQL queries to retrieve data from our database. ## Data Retrieval from a single table You can return data from a single table by using `SELECT` command. This kind of data retrieval is the simplest way to get data from a single table. :::success ### Exercises 1. Who are the users who have posts? ```sql! SELECT DISTINCT userid FROM posts; ``` 2. What are the posts of the users? ```sql! SELECT DISTINCT postid FROM posts; ``` 3. Which posts published after 2012? ```sql! SELECT * FROM posts WHERE EXTRACT(YEAR FROM posttimestamp) > 2012; ``` 4. What is the oldest and recent dates of the posts? ```sql! SELECT MAX(posttimestamp), MIN(posttimestamp) FROM posts; ``` 5. What are the top 5 most recent posts whose type is `Image`? ```sql! SELECT * FROM posts WHERE posttype='Image' ORDER BY posttimestamp DESC LIMIT 5; ``` 6. Which top 5 users who posted an `Image` or `Video` recently? ```sql! SELECT userid, posttype FROM posts WHERE posttype IN ('Image', 'Video') ORDER BY posttimestamp DESC LIMIT 5; ``` 7. What is the age of the eldest and youngest persons? ```sql! SELECT MAX(age), MIN(age) FROM users; UPDATE users SET age = CASE WHEN age < 0 THEN -age ELSE age END; ``` ::: :::warning In PostgreSQL, double quotes `""` are used to indicate identifiers within the database, which are objects like tables, column names, and roles. In contrast, single quotes `''` are used to indicate string literals. ::: ## Data Retrieval from multiple tables (Joining tables) :::success ### Exercises 8. Who is the 2nd youngest person who published `Image` posts? ```sql! SELECT DISTINCT name, surname, age, posttype FROM posts JOIN users ON posts.userid = users.userid WHERE posts.posttype='Image' ORDER BY age ASC LIMIT 1 OFFSET 1; ``` 9. What are the posts that are published by people who are between 20 and 30? ```sql! SELECT postid, posttype, age, users.userid FROM posts JOIN users ON posts.userid = users.userid WHERE age <@ int4range(20, 30) ORDER BY age; ``` 10. What are the friends of the youngest person? ```sql! WITH youngest AS( SELECT userid FROM users ORDER BY age ASC LIMIT 1 ) SELECT friend1 AS user, friend2 AS friend FROM youngest, friends JOIN users ON friends.friend1 = users.userid WHERE users.userid = youngest.userid; ``` 11. How many friends for the surname `Thronton`? ```sql! SELECT COUNT(friend2) FROM users JOIN friends ON (users.userid = friends.friend1) WHERE users.surname = 'Thronton'; ``` ::: ## Aggregation and grouping :::success ### Exercises 12. How many posts for each user? ```sql! SELECT users.userid, COUNT(*) as c FROM users JOIN posts ON (users.userid = posts.userid) GROUP BY users.userid ORDER BY c DESC; ``` 13. Who are the friends of users who posted `Image`? ```sql! SELECT DISTINCT u2.userid, u2.name, u2.surname, u2.age, u.userid, u.name, u.surname FROM users u JOIN friends f ON f.friend1 = u.userid JOIN posts ON posts.userid = u.userid JOIN users u2 ON f.friend2 = u2.userid WHERE posts.posttype = 'Image'; ``` 14. How many friends of users whose age is less than 20? ```sql! SELECT COUNT(DISTINCT f1.friend2) as c FROM users u JOIN friends f1 ON (u.userid = f1.friend1) WHERE u.age < 20; ``` 15. How many friends of friends of users whose age is less than 20? ```sql! SELECT COUNT(DISTINCT f2.friend2) as c FROM users u JOIN friends f1 ON (u.userid = f1.friend1) JOIN friends f2 ON (f1.friend2 = f2.friend1) WHERE u.age < 20 AND f2.friend2 <> u.userid; ``` 16. How many users have more than 10 posts? ```sql! WITH users_10_posts AS ( SELECT posts.userid, COUNT(*) as c FROM posts GROUP BY posts.userid HAVING COUNT(*) > 10 ) SELECT COUNT(*) as c FROM users_10_posts; ``` 17. What is the average number of posts published at noon and by the friends of the users whose age is more than 40? Calculate the percentage of posts. ```sql! WITH post_count AS ( SELECT COUNT(posts.postid) AS c FROM users u JOIN friends f ON f.friend1 = u.userid JOIN posts ON posts.userid = f.friend1 JOIN users u2 ON f.friend1 = u2.userid WHERE EXTRACT(HOUR FROM u2._timestamp) = 12 AND u.age > 40 ) SELECT AVG(c) FROM post_count; ``` 18. What is the percentage of posts published at noon and by the friends of the users whose age is more than 40? ```sql! WITH post_count AS ( SELECT COUNT(posts.postid) AS c FROM users u JOIN friends f ON f.friend1 = u.userid JOIN posts ON posts.userid = f.friend1 JOIN users u2 ON f.friend1 = u2.userid WHERE EXTRACT(HOUR FROM u2._timestamp) = 12 AND u.age > 40) , post_total AS ( SELECT COUNT(*) as c2 FROM posts ) SELECT 100 * c/c2 || '%' FROM post_total, post_count; ``` ::: <!-- 12. How many friends of friends of friends of users whose age is less than 20? ```sql! SELECT COUNT(DISTINCT f3.friend2) as c FROM users u JOIN friends f1 ON (u.userid = f1.friend1) JOIN friends f2 ON (f1.friend2 = f2.friend1) JOIN friends f3 ON (f2.friend2 = f3.friend1) WHERE u.age < 20 AND f2.friend2 <> u.userid AND f3.friend2 <> f1.friend2; ``` ## Subqueries :::success ### Exercises 13. Which teams whose strength is greater than the average strength of all teams? 14. Which teams were having more than half of the members from Russia? ::: --> # Graph databases A graph database stores nodes and relationships instead of tables, or documents. Data is stored just like you might sketch ideas on a whiteboard. Your data is stored without restricting it to a pre-defined model, allowing a very flexible way of thinking about and using it. Graph databases address big challenges many of us tackle daily. Modern data problems often involve many-to-many relationships with heterogeneous data that sets up needs to: - Navigate deep hierarchies, - Find hidden connections between distant items, and - Discover inter-relationships between items. Whether it’s a social network, payment networks, or road network you’ll find that everything is an interconnected graph of relationships. And when we want to ask questions about the real world, many questions are about the relationships rather than about the individual data elements. <center> <img src='https://i.imgur.com/TolfC5Q.png' width="400"/> <p>Example of a simple graph database </p> </center> <!-- ![]() --> Neo4j is an open-source, NoSQL, native graph database that provides an ACID-compliant transactional backend for your applications that has been publicly available since 2007. In Neo4j, information is organized as nodes, relationships, and properties. We can retrieve data from Neo4j via its query language called Cypher which is a declarative query language similar to SQL, but optimized for graphs. <center> <img src='https://i.imgur.com/JE0Cr95.png' width="400"/> <p>Building blocks of the property graph model </p> </center> The property graph model consists of mainly nodes and relationships. Nodes are the entities in the graph. - Nodes can be tagged with labels, representing their different roles in your domain. (For example, Person). - Nodes can hold any number of key-value pairs, or properties. (For example, name). - Node labels may also attach metadata (such as index or constraint information) to certain nodes. Whereas relationships provide directed, named, connections between two node entities (e.g. Person LOVES Person). - Relationships always have a direction, a type, a start node, and an end node, and they can have properties, just like nodes. - Nodes can have any number or type of relationships without sacrificing performance. - Although relationships are always directed, they can be navigated efficiently in any direction. ## Download and Install Neo4j Neo4j can be installed as an interpreter in Zeppelin but in HDP 2.6.5, Zeppelin 0.7.3 is installed and does not support neo4j interpreter. Installing Neo4j on HDP 2.6.5 as a Zeppelin interpreter will not work. The latest version (in Feb. 2023) of Zeppelin 0.10.0 supports Neo4j but we do not need to install it. Indeed, HDP 3.0 has Zeppelin 0.8.0 installed and supports neo4j interpreter but it is not a big deal to install HDP 3.0 just for neo4j since our work on the sandbox will not be mostly on graph databases. :::danger If you are downloading Neo4j Desktop from Russia, select `other` from country list and use some proxy server. ::: The **easy** approach is to install Neo4j Desktop on your local machine. You can find here (https://neo4j.com/docs/desktop-manual/current/installation/download-installation/) some information about installing Neo4j Desktop and you can find here (https://neo4j.com/download-center/) download links. After you start downloading, the website will give a long key you need to use when you install the software as follows: ![](https://i.imgur.com/bwbGHJ5.png) The software will prepare the environment for installation as follows. <!-- ![]() --> <center> <img src='https://i.imgur.com/XvkI0rv.png' width="300"/> <p>Installing Neo4j Desktop </p> </center> When you succeed installation, it will show the following dashboard. ![](https://i.imgur.com/P3QBb6U.png) The **less easy** approach is to install Neo4j on your cluster node via `yum` OR offline via `rpm`. You can find here (https://yum.neo4j.com) some information about installing neo4j via `yum` and here (https://neo4j.com/docs/operations-manual/3.5/installation/linux/rpm/#linux-yum) some information about offline installation. After that, you can forward the port of the server to some custom port 10015 to the host machine in order to access Neo4j Browser (change this setting `server.http.listen_address` to 10015 in `neo4j.conf` file of the Neo4j DBMS instance). :::warning If you want to install Neo4j on the cluster, you need to know that the old versions of Neo4j can be installed. I suggest the version 3.5.35.1 and you can see below the cypher shell dependencies. ![](https://i.imgur.com/I6bj6a9.png) ![](https://i.imgur.com/xrZ2Eib.png) ::: :::info You can access the guest machine in VirtualBox by ssh into the `HostSSH` port specified in the `port forwarding rules` which can be reached from the network settings of the virtual machine. You can also add your rules to the list. ::: You can access the Neo4j Browser usually at http://localhost:7474. :::danger If you can not install Neo4j due to some reason, you can open a free session of Neo4j Sandbox at https://neo4j.com/sandbox/ but you need proxy. ::: # Cypher Cypher is Neo4j’s graph database query language that allows users to store and retrieve data from the graph database. It is a declarative, SQL-inspired language for describing visual patterns in graphs. The syntax provides a visual and logical way to match patterns of nodes and relationships in the graph. Cypher keywords are not case-sensitive but it is case-sensitive for variables. All Neo4j servers contain a built-in database called `system`, which behaves differently than all other databases. The `system` database stores system data and you can not perform graph queries against it. A fresh installation of Neo4j includes two databases: - **system** - the system database described above, containing meta-data on the DBMS and security configuration. - **neo4j** - the default database, named using the config option dbms.default_database=neo4j. Cypher provides first class support for a number of data types. These fall into the following three categories: property, structural, and composite. ## Property types The following data types are included in the property types category: Integer, Float, String, Boolean, Point, Date, Time, LocalTime, DateTime, LocalDateTime, and Duration. For more details, visit the [documentation](https://neo4j.com/docs/cypher-manual/4.4/syntax/values/#_property_type_details). ## Structural types The following data types are included in the structural types category: Node, Relationship, and Path. - The **Node** data type includes: Id, Label(s), and Map (of properties). Note that labels are not values, but a form of pattern syntax. - The **Relationship** data type includes: Id, Type, Map (of properties), Id of start node, and Id of end node. - The **Path** data type is an alternating sequence of nodes and relationships. Nodes, relationships, and paths are returned as a result of pattern matching. In Neo4j, all relationships have a direction. However, you can have the notion of undirected relationships at query time. ## Composite types The following data types are included in the composite types category: List and Map. ```cpp! RETURN [1,2,3,4,5], {id:1, msg: 'hello', age: 31} ``` ## How to define a variable When you reference parts of a pattern or a query, you do so by naming them. The names you give the different parts are called variables. In this example: ```cpp! MATCH (n)-->(b) RETURN b ``` The variables are n and b. ```cpp! // this is a comment ``` A comment begin with double slash (//) and continue to the end of the line. Comments do not execute, they are for humans to read. ## Operators ### **DISTINCT** (aggregation operator) ```sql! WITH ['a', 'a', 4, 'b', 4] AS ps UNWIND ps AS p RETURN DISTINCT p ``` The output will have only 'a', 4, and 'b'. <!-- <center> <img src="https://i.imgur.com/dRPT5py.png" width="50" /> </center> --> ### Property operators #### The operator `.` statically access the property of a node or relationship #### The operator `[]` used for filtering on a dynamically-computed property key ```cpp! CREATE (a:Restaurant {name: 'Hungry Jo', rating_hygiene: 10, rating_food: 7}), (b:Restaurant {name: 'Buttercup Tea Rooms', rating_hygiene: 5, rating_food: 6}), (c1:Category {name: 'hygiene'}), (c2:Category {name: 'food'}) WITH a, b, c1, c2 MATCH (restaurant:Restaurant), (category:Category) WHERE restaurant["rating_" + category.name] > 6 RETURN DISTINCT restaurant.name ``` #### The operator `=` used for replacing all properties of a node or relationship ```cpp! CREATE (a:Person {name: 'Jane', age: 20}) WITH a MATCH (p:Person {name: 'Jane'}) SET p = {name: 'Ellen', livesIn: 'London'} RETURN p.name, p.age, p.livesIn ``` ### Mathemtical operators The mathematical operators comprise: - addition: + - subtraction or unary minus: - - multiplication: * - division: / - modulo division: % - exponentiation: ^ ### Comparison operators The comparison operators comprise: - equality: = - inequality: <> - less than: < - greater than: > - less than or equal to: <= - greater than or equal to: >= - IS NULL - IS NOT NULL String-specific comparison operators comprise: - STARTS WITH: perform case-sensitive prefix searching on strings - ENDS WITH: perform case-sensitive suffix searching on strings - CONTAINS: perform case-sensitive inclusion searching in strings - =~: regular expression for matching a pattern ```cpp! WITH ['mouse', 'chair', 'door', 'house'] AS wordlist UNWIND wordlist AS word WITH word WHERE word =~ '.*ous.*' RETURN word ``` ### Boolean operators The boolean operators — also known as logical operators — comprise: `AND`, `OR`, `XOR`, `NOT`. ### String operators The string operators comprise: concatenating strings: `+` ```cpp! RETURN 'neo' + '4j' AS result ``` Check other operators from the documentation. ## Patterns Patterns and pattern-matching are at the very heart of Cypher, so being effective with Cypher requires a good understanding of patterns. Using patterns, you describe the shape of the data you are looking for. For example, in the MATCH clause you describe the shape with a pattern, and Cypher will figure out how to get that data for you. The pattern describes the data using a form that is very similar to how one typically draws the shape of property graph data on a whiteboard: usually as circles (representing nodes) and arrows between them to represent relationships. Patterns appear in multiple places in Cypher: in `MATCH`, `CREATE` and `MERGE` clauses, and in pattern expressions. ### Patterns for nodes This simple pattern describes a single node, and names that node using the variable a. ```cpp! (a) ``` ### Patterns for related nodes This pattern describes a very simple data shape: two nodes, and a single relationship from one to the other. ```cpp! (a)-->(b) ``` This manner of describing nodes and relationships can be extended to cover an arbitrary number of nodes and the relationships between them, for example: ```cpp! (a)-->(b)<--(c) ``` Such a series of connected nodes and relationships is called a **"path"**. ### Patterns for labels The most simple attribute that can be described in the pattern is a label that the node must have. For example: ```cpp! (a:User)-->(b) ``` One can also describe a node that has multiple labels: ```cpp! (a:User:Admin)-->(b) ``` ### Specifying properties Properties can be expressed in patterns using a map-construct: curly brackets surrounding a number of key-expression pairs, separated by commas. E.g. a node with two properties on it would look like: ```cpp! (a {name: 'Andy', sport: 'Brazilian Ju-Jitsu'}) ``` A relationship with expectations on it is given by: ```cpp! (a)-[{blocked: false}]->(b) ``` ### Patterns for relationships The simplest way to describe a relationship is by using the arrow between two nodes, as in the previous examples. Using this technique, you can describe that the relationship should exist and the directionality of it. If you don’t care about the direction of the relationship, the arrow head can be omitted, as exemplified by: ```cpp! (a)--(b) (a)-[r]->(b) (a)-[r:REL_TYPE]->(b) (a)-[r:TYPE1|TYPE2]->(b) (a)-[:REL_TYPE]->(b) ``` ### Variable-length pattern matching Rather than describing a long path using a sequence of many node and relationship descriptions in a pattern, many relationships (and the intermediate nodes) can be described by specifying a length in the relationship description of a pattern. For example: ```cpp! (a)-[*2]->(b) ``` This describes a graph of three nodes and two relationships, all in one path (a path of length 2). This is equivalent to: ```cpp! (a)-->()-->(b) ``` A range of lengths can also be specified: such relationship patterns are called 'variable length relationships'. For example: ```cpp! (a)-[*3..5]->(b) ``` This is a minimum length of 3, and a maximum of 5. It describes a graph of either 4 nodes and 3 relationships, 5 nodes and 4 relationships or 6 nodes and 5 relationships, all connected together in a single path. Either bound can be omitted. For example, to describe paths of length 3 or more, use: ```cpp! (a)-[*3..]->(b) ``` To describe paths of length 5 or less, use: ```cpp! (a)-[*..5]->(b) ``` Omitting both bounds is equivalent to specifying a minimum of 1, allowing paths of any positive length to be described: ```cpp! (a)-[*]->(b) ``` ### Assigning to path variables Cypher allows paths to be named using an identifer, as exemplified by: ```cpp! p = (a)-[*3..5]->(b) ``` <!-- ### CASE expressions Generic conditional expressions may be expressed using the CASE construct. Two variants of CASE exist within Cypher: the simple form, which allows an expression to be compared against multiple values, and the generic form, which allows multiple conditional statements to be expressed. :::info CASE can only be used as part of RETURN or WITH if you want to use the result in the succeeding clause or statement. ::: The expression is calculated, and compared in order with the WHEN clauses until a match is found. If no match is found, the expression in the ELSE clause is returned. However, if there is no ELSE case and no match is found, null will be returned. ```cpp! // simple CASE CASE test WHEN value THEN result [WHEN ...] [ELSE default] END // generic CASE CASE WHEN predicate THEN result [WHEN ...] [ELSE default] END ``` --> ## Build the database In Neo4j Desktop, create a new project as follows: ![](https://i.imgur.com/zHfzGM7.png) and add a local DBMs to the new project. Then, start the database as follows: ![](https://i.imgur.com/sr40QXb.png) As you can see in the following screenshot, the database is active and we can access it as a default user `neo4j` by opening a dashboard with `Neo4j Browser`. ![](https://i.imgur.com/orOE6Ch.png) In order to load csv files to the database, you need to add them to the `import` of the database (do not add them to the project directory) as follows: ![](https://i.imgur.com/71ItvTP.png) After copying/pasting the csv files, the `import` directory will be as follows: ![](https://i.imgur.com/qLTdMJO.png) 1. Create a new database and access it. ```sql! CREATE DATABASE socialmedia :use socialmedia ``` 2. Create nodes and relationships Each of the 3 CSV files needs to be loaded into Neo4j using the LOAD CSV command that reads each row of the file and then assigns the imported values to the nodes and edges of the graph. ```cpp // Create Index for users on userid field CREATE INDEX FOR (u:User) ON (u.userid) // Create Index for posts on postid field CREATE INDEX FOR (u:Post) ON (u.postid) // Load users data LOAD CSV WITH HEADERS FROM "file:///users.csv" AS row FIELDTERMINATOR ',' CREATE ( :User { userid: toInteger(row['userid']), name:row['name'], surname: row['surname'], age: toInteger(row['age']), timestamp: datetime({epochSeconds: toInteger(row['timestamp'])})} ); // Load posts data LOAD CSV WITH HEADERS FROM "file:///posts.csv" AS row FIELDTERMINATOR ',' MATCH (u:User{userid:toInteger(row['userid'])}) MERGE ( :Post { postid: toInteger(row['postid']), userid: toInteger(row['userid']), posttype:row['posttype'] })<-[:ADD{ posttimestamp: datetime({epochSeconds: toInteger(row['posttimestamp'])}) }]-(u); // Retrieve some nodes MATCH (n) RETURN n LIMIT 10; // Load Friends :auto USING PERIODIC COMMIT LOAD CSV WITH HEADERS FROM "file:///friends.csv" AS row FIELDTERMINATOR ',' MATCH (u:User{userid:toInteger(row['friend1'])}) MATCH (v:User{userid:toInteger(row['friend2'])}) MERGE (u)-[:FRIEND]->(v)-[:FRIEND]->(u) ``` # Data Retrieval with Cypher <!-- Working on Cypher language is straight-forward, and writing Cypher queries is --> Cypher is Neo4j’s graph query language that lets you retrieve data from the graph. It is like SQL for graphs, and was inspired by SQL so it lets you focus on what data you want out of the graph (not how to go get it). It is the easiest graph language to learn by far because of its similarity to other languages and intiutiveness. :::success ## Exercises 1. Who are the users who have posts? ```javascript! MATCH (u:User)-[:ADD]-() RETURN u LIMIT 10 // This is added just to not display all nodes ``` 2. What are the posts of the users? ```javascript! MATCH (p:Post) RETURN p LIMIT 10 // This is added just to not display all nodes ``` 3. Which posts published after 2012? ```javascript! MATCH (p:Post)<-[a:ADD]-() WHERE a.posttimestamp.year > 2012 RETURN p LIMIT 10 // This is added just to not display all nodes ``` 4. What is the oldest and recent dates of the posts? ```javascript! MATCH (p:Post)-[a:ADD]-() RETURN MAX(a.posttimestamp), MIN(a.posttimestamp) ``` 5. What are the top 5 most recent posts whose type is `Image`? ```javascript! MATCH (p:Post{posttype:'Image'})-[a:ADD]-() RETURN p, a ORDER BY a.posttimestamp DESC LIMIT 5 ``` 6. Which top 5 users who posted an `Image` or `Video` recently? ```javascript! MATCH (u:User)-[a:ADD]->(p:Post) WHERE p.posttype IN ['Image', 'Video'] RETURN u, p ORDER BY a.posttimestamp LIMIT 5 ``` 7. What is the age of the eldest and youngest persons? ```javascript! MATCH (u:User) RETURN MAX(u.age), MIN(u.age) // Update the age of the person MATCH (u:User) SET u.age = CASE WHEN u.age < 0 THEN -u.age ELSE u.age END ``` 8. Who is the 2nd youngest person who published `Image` posts? ```javascript! MATCH (u:User)-[:ADD]->(p:Post{posttype:'Image'}) RETURN DISTINCT u.name, u.surname, u.age, p.posttype ORDER BY u.age ASC SKIP 1 LIMIT 1 ``` 9. What are the posts that are published by people who are between 20 and 30? ```javascript! MATCH (p:Post)<-[:ADD]-(u:User) WHERE u.age IN range(20, 30) RETURN p.postid, p.posttype, u.age, u.userid ORDER BY u.age ``` 10. What are the friends of the youngest person? ```javascript! CALL{ MATCH (u:User) RETURN u.userid AS youngest ORDER BY u.age ASC LIMIT 1 } WITH youngest MATCH (u:User{userid:youngest})-[:FRIEND]-(f1:User) RETURN DISTINCT u.userid AS user, f1.userid AS friend ``` 11. How many friends for the surname `Thronton`? ```javascript! MATCH (u:User{surname:'Thronton'})-[:FRIEND]-(f1:User) RETURN COUNT(DISTINCT f1) ``` 12. How many posts for each user? ```javascript! MATCH (u:User)--(p:Post) RETURN u.userid AS user, COUNT(*) AS c ORDER BY c DESC ``` 13. Who are the friends of users who posted `Image`? ```javascript! MATCH (f:User)--(u:User)--(:Post{posttype:'Image'}) RETURN DISTINCT u.userid, u.name, u.surname, f.userid, f.name, f.surname, f.age ``` 14. How many friends of users whose age is less than 20? ```javascript! MATCH (u:User)--(f1:User) WHERE u.age < 20 RETURN COUNT(DISTINCT f1.userid) AS c ``` 15. How many friends of friends of users whose age is less than 20? ```javascript! MATCH (u:User)-[:FRIEND*2]-(f2:User) WHERE u.age < 20 AND f2.userid <> u.userid RETURN COUNT(DISTINCT f2.userid) AS c ``` 16. How many users have more than 10 posts? 17. What is the average number of posts published at noon and by the friends of the users whose age is more than 40? Calculate the percentage of posts. 18. What is the percentage of posts published at noon and by the friends of the users whose age is more than 40? ::: # Common issues * When you try to access the database by running the command `psql -U postgres` as `[root@sandbox-hdp data]# psql -U postgres`. You may get the following error: > psql: FATAL: Peer authentication failed for user "postgres" * Possible reasons: * The database is configured to allow access only through the credentials of the user. * Possible solutions: * You can change this configuration to allow all users to access the database without the need to enter credentials. Add the line `local all all trust` at the beginning of the file `/var/lib/pgsql/data/pg_hba.conf` then restart PostgreSQL service by running the command `systemctl restart postgresql` as root user. :::warning **NB:** If you have other issues, please contact the TA. ::: # References - [PostgreSQL cheatsheet](https://quickref.me/postgres) - Yasin N. Silva, Isadora Almeida, and Michell Queiroz. 2016. SQL: From Traditional Databases to Big Data. In Proceedings of the 47th ACM Technical Symposium on Computing Science Education (SIGCSE '16). Association for Computing Machinery, New York, NY, USA, 413–418. https://doi.org/10.1145/2839509.2844560 - [Comprehensive PostgreSQL Tutorial](https://www.postgresqltutorial.com) - [psql 9.2 Documentation](https://www.postgresql.org/docs/9.2/app-psql.html) - [Cypher Manual](https://neo4j.com/docs/cypher-manual/4.4/)