ArangoDB hands-on session 1

# ArangoDB hands-on session 1 ## Installation If you have Docker and experience using it, I recommend running the following code in the console: ``` > docker run -e ARANGO_NO_AUTH=1 -p 8529:8529 -d --name hands-on-arangodb arangodb ``` Then point your browser to http://localhost:8529/. Otherwise, if you don't have Docker, you can proceed with the standard installation. Go to [download page](https://www.arangodb.com/download/) and download the latest community version of ArangoDB. Follow the installation guide. You may also follow the installation instructions with a package manager if available. If you install ArangoDB using homebrew under macOS, start the server by running `/usr/local/sbin/arangod`. You can start the server by running `arangod`. You should run the command in the bin folder. The program announces the endpoint to WebUI (by default, it's http://localhost:8529/). After installation and accessing the WebUI, you should be able to see a user named `root` without a password and the default database `_system`. After connecting to the database `_system`, you should see the standard user interface for modern databases (Neo4j, OrientDB, Postgres, etc., all follow a similar layout): ![](https://i.imgur.com/yBUBbpc.png) You can connect to the database from the command line or use some programming language API. The WebUI is good in the beginning, and it also visualizes the graphs. ## Querying the database from a document/relational perspective Before executing queries, your database must contain some collections (cf. tables). Open the COLLECTIONS side tab and create a new collection named `staff`. Open then the QUERIES side tab. Next to the execute button, the explain button opens details about query processing. ###### Insert one ``` INSERT { "age": 53, "name": "Marie Mayweather", "role": "senior manager" } INTO staff ``` ###### Insert many ``` FOR member IN [{ "age": 27, "name": "Katie Foster", "role": "manager", "_key": "5640" }, { "age": 32, "name": "John Smith", "role": "developer", "_key": "5641" }, { "age": 69, "name": "James Hendrix", "role": "developer" }] INSERT member INTO staff RETURN NEW ``` ###### Return ``` RETURN document("staff", "5640") ``` ###### Update ``` UPDATE "5641" WITH { role: "manager" } IN staff RETURN NEW ``` ###### Remove ``` REMOVE { _key: "5641" } IN staff RETURN OLD ``` ##### Example queries ###### Q1 ``` FOR member IN staff SORT member._key RETURN member ``` ###### Q2 ``` FOR member IN staff FILTER member.age > 30 SORT member.age RETURN member ``` ###### Q3 ``` FOR member1 IN staff FOR member2 IN staff FILTER member1 != member2 RETURN { pair: [member1.name, member2.name], sumOfAges: member1.age + member2.age } ``` So, AQL looks like a programming language. More details, functions, and properties you can find at https://www.arangodb.com/docs/stable/tutorials.html. ### Array operators ArangoDB offers a particular [*] operator for array variable expansion: https://www.arangodb.com/docs/stable/aql/advanced-array-operators.html ## Querying the database from a graph perspective To create preinstalled graph data examples in ArangoDB, click the GRAPHS side tab, then Add graph, and finally, see the examples tab in the opened window. In the following examples, we use the City Graph example. ![](https://i.imgur.com/BE3CzbZ.png) In ArangoDB, we have two kinds of graph queries: * Graph traversals * Shortest path queries Compared to Neo4j, pattern matching is not possible in ArangoDB. Unlike Gremlin, explicit graph traversals are impossible (or at least not easy to formulate). You can find more about graph traversals in ArangoDB docs: https://www.arangodb.com/docs/devel/aql/graphs-traversals.html. ###### Graph Q1 ``` // Get a random city c LET c = ( FOR city IN frenchCity SORT rand() LIMIT 1 RETURN city ) // Find the connected cities FOR v, e, path IN 1..1 ANY c[0]._id GRAPH 'routeplanner' RETURN {c, e, v} ``` ###### Graph Q2 Find any of the shortest paths between Paris and Cologne ``` FOR v,e IN ANY SHORTEST_PATH 'frenchCity/Paris' to 'germanCity/Cologne' GRAPH 'routeplanner' RETURN {"vertex": v, "incoming or outgoing edge to the vertex in the path": e} ``` INBOUND: Find any of the shortest inbound paths between Paris and Cologne ``` FOR v,e IN INBOUND SHORTEST_PATH 'frenchCity/Paris' to 'germanCity/Cologne' GRAPH 'routeplanner' RETURN {"vertex": v, "outgoing edge to the vertex in the path": e} ``` OUTBOUND: Find any of the shortest outbound paths between Paris and Cologne ``` FOR v,e IN OUTBOUND SHORTEST_PATH 'frenchCity/Paris' to 'germanCity/Cologne' GRAPH 'routeplanner' RETURN {"vertex": v, "incoming edge to the vertex in the path": e} ``` ###### Q3 ``` // Get a random city c LET c = ( FOR city IN frenchCity SORT rand() LIMIT 1 RETURN city ) // Find the connected cities which are 2 to 3 edges away and return the path FOR v, e, path IN 2..3 ANY c[0]._id GRAPH 'routeplanner' RETURN {c, path, v} ``` ## Exercises 1 ### Insert realistic data into the database As you learned, ArangoDB has some small preinstalled tutorial datasets. Besides, the ArangoDB community provides realistic example datasets on GitHub: https://github.com/arangodb/example-datasets. Let's study one of them, the Debian dependency graph because it contains real-life data, is relatively big, and can be naturally modeled as a graph. Also, because of its size, we will face some challenges during query processing that require considering the query optimization perspective. Because the whole repository of example datasets is big and slow to download, I extracted the Debian dependency graph data to this repository: https://github.com/valterUo/debian-dependency-graph-example-arangodb. If you use ArangoDB in Docker, you can use Docker's volume feature to access the data on your host machine: ``` > docker run -e ARANGO_NO_AUTH=1 -p 8529:8529 -v <path-to-file>/debian-dependency-graph-example-arangodb:/var/lib/arangodb arangodb ``` #### Insert data with the arangorestrore tool Pull the data from the Github repository and import it to ArangoDB with the following tool. You need to change `<path-to-file>` to point to the correct path on your local computer or in Docker. **Important note: The following code creates a *new database*. This means you must switch from the default `_system` database to the `debianGraph` database before accessing the data! This switch between the databases is easy to do in the right corner of ArangoDB WebUI.** ``` > arangorestore --input-directory <path-to-file>/DebianDependencyGraph --create-collection true --include-system-collections true --create-database true --server.database debianGraph ``` If you used the previous Docker code, you can navigate to the usr/bin folder in the Docker container and run: ``` > arangorestore --input-directory /var/lib/arangodb/DebianDependencyGraph --create-collection true --include-system-collections true --create-database true --server.database debianGraph ``` In these exercises, we will meet on the dependency hell concept (https://wiki.debian.org/DependencyHell) using the Debian dependency graph. Write your ArangoDB queries (and short explanations if asked) to the following exercises. You don't need to include answers to the queries since sometimes they are large. Return your answers in Moodle in **one** PDF by **5.5.2023**. The total number of points is 7. The ArangoDB's documentation is useful: https://www.arangodb.com/docs/stable/index.html. 1. How many packages are there in total? (0.5 points) 2. Find the packages that Rob Browning maintains. Return only their names (`_key` attribute). Hint: you benefit from the operators https://www.arangodb.com/docs/stable/aql/operators.html. (0.5 points) 3. Write at least two different queries that find the maximum size of the packages. Return only the maximum value. Which methods perform better, and what is the difference in time? Use the explain feature (next to the execute button) to figure out the reasons for the difference. Hint: MAX, [*] array operator, COLLECT, AGGREGATE, SORT, etc. (1 point) 4. Write a graph query that starts from the package `libc6` and returns vertices and edges within one or two hops. Pay attention that we do not want vertices and edges that are coming to the package `libc6`. (1 point) 5. Find the shortest path between dependencies `debconf` and `gfxboot-dev`. Use EXPLAIN to find information on how the query is processed in the system. (1 point) 6. Find the shortest path from `libxrandr2` to `freecraft`, which contains vertex `chromium`. The direction of the edges does not matter. AQL has a limitation in that you cannot use filtering in shortest-path queries, but they claim that there is a workaround. Hints: https://www.arangodb.com/docs/stable/aql/graphs-shortest-path.html#conditional-shortest-path and https://www.arangodb.com/docs/stable/aql/graphs-traversals.html. (2 points) 7. You saw that ArangoDB supports certain kinds of graph queries. Describe a graph query that is difficult to write using ArangoDB's AQL query language. You can use natural language to describe the query. For example, you can compare AQL to [SPARQL](https://www.w3.org/TR/rdf-sparql-query/) and [Cypher](https://neo4j.com/developer/cypher/). SPARQL is the most popular RDF graph query language and W3 standard. Cypher is the most common property graph query language developed by the community around the Neo4j property graph database. (1 point) ## Introduction presentation 1 [Link to the presentation](https://helsinkifi-my.sharepoint.com/:b:/g/personal/vauotila_ad_helsinki_fi/EUjme9cIIDRPhGYbGckl9AYBizg3c9i77Uz4I6gQkZcfzA?e=lpVUS4) (but maybe you won't need it) ## Hands-on session 2 The next hands-on session is on Friday, 28.4.2023 → https://hackmd.io/@valterUo/BJx8H5tHO --- ## Comments * You can write questions, comments, points, suggest links here, etc. This page allows modifications (but please do not delete the content above)