4 different ways to work with Nebula Graph in Apache Spark
A common question many Nebula Graph community users have asked is how to apply our graph database to Spark-based analytics. People want to use our powerful graph processing capabilities in conjunction with Spark, which is one of the most popular engines for data analytics.
In this article, I will try to walk you through four different ways that you can make Nebula Graph and Apache Spark work together. The first three approaches will use Nebula Graph’s three libraries: Spark Connector, Nebula Exchange, and Nebula Algorithm, whereas the fourth way will leverage PySpark, an interface for Spark in Python.
I have introduced quite a few data importing methods for Nebula Graph in this video, including three that import data via Spark. In this article, I'd like to dive deeper into these Spark-related projects, hoping it will help if you want to connect Nebula Graph with Spark.
TL;DR
- Nebula Spark Connector is a Spark library for reading from and writing to Nebula Graph as dataframes.
- Nebula Exchange is a Spark library and application for reading data from multiple sources and writing it to Nebula Graph, either directly or as SST files.
- Nebula Algorithm, built on top of the Spark Connector and GraphX, runs graph algorithms (PageRank, LPA, etc.) on graph data in Nebula Graph.
- PySpark can call the Spark Connector from Python via spark.read.format.
Spark-Connector
Nebula Spark Connector is a Spark library that enables Spark applications to read from and write to Nebula Graph in the form of dataframes.
Read data from Nebula Graph
To read data from Nebula Graph, Nebula Spark Connector scans all storage instances in a Nebula Graph cluster that contain the given label (tag). You can use the `withLabel` parameter to indicate the label, for example `withLabel("player")`, and optionally specify which vertex properties to return with `withReturnCols(List("name", "age"))`. Once you have provided all required options, calling `spark.read.nebula.loadVerticesToDF` returns the dataframe of those vertices from Nebula Graph.

The writer part is similar. One big difference is that writes go through graphd, because under the hood the Spark Connector issues nGQL `INSERT` queries:
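A minimal sketch of the write path (the builder options follow the Spark Connector docs; the addresses and the `df` dataframe here are assumptions):

```scala
import com.vesoft.nebula.connector.connector.NebulaDataFrameWriter
import com.vesoft.nebula.connector.{NebulaConnectionConfig, WriteNebulaVertexConfig}

// `df` is assumed to be a dataframe with columns id, name, age
val config = NebulaConnectionConfig.builder()
  .withMetaAddress("127.0.0.1:9559")
  .withGraphAddress("127.0.0.1:9669") // writes go through graphd
  .build()
val writeVertexConfig = WriteNebulaVertexConfig.builder()
  .withSpace("basketballplayer")
  .withTag("player")
  .withVidField("id")                 // which column holds the vertex ID
  .build()
df.write.nebula(config, writeVertexConfig).writeVertices()
```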
Hands-on Spark Connector
Prerequisites: I will assume that you are running the following procedure on a Linux machine with an internet connection. Ideally, you will have Docker and Docker Compose installed.
Bootstrapping a Nebula Graph cluster
Firstly, let's deploy Nebula Graph v3.0 and Nebula Studio using Nebula-Up, which will run a script to install the two tools using Docker and Docker Compose. The script will automatically install Docker and Docker Compose for you if you don’t already have them installed. But to make sure you get the best experience, you can pre-install Docker and Docker Compose on your machine manually.
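At the time of writing, the installation is a one-liner (check the Nebula-Up repo for the current options, e.g. for pinning a version):

```bash
# installs Nebula Graph and Nebula Studio via Docker Compose
curl -fsSL nebula-up.siwei.io/install.sh | bash
```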
Once this is done, we can connect to the Nebula Graph instance with Nebula Console, the command line client for Nebula Graph.
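One way to bring up a console, assuming the default Docker network created by nebula-docker-compose (the console image tag is just an example):

```bash
docker run --rm -ti --network nebula-docker-compose_nebula-net \
    vesoft/nebula-console:v3.0.0 \
    -addr graphd -port 9669 -u root -p nebula
```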
Activate storage instances and check host status
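In Nebula Graph 3.x, storage hosts must be registered before they can serve data. Inside the console (the service names assume the default docker-compose deployment):

```ngql
ADD HOSTS "storaged0":9779,"storaged1":9779,"storaged2":9779;
SHOW HOSTS;
```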
Creating a Spark playground
It is very easy to create a Spark environment using Docker thanks to Big Data Europe, who provide ready-to-use Spark Docker images.
Using this YAML file, we will create a container named `spark-master-0` with built-in Hadoop 2.7 and Spark 2.4.5. The container is connected to the Nebula Graph cluster in a Docker network named `nebula-docker-compose_nebula-net`, and it maps the current path to `/root` of the Spark container.
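In case the file is not at hand, a minimal sketch of such a compose file (assuming the bde2020 images and the network name created by nebula-docker-compose):

```yaml
version: '3.5'
services:
  spark-master:
    image: bde2020/spark-master:2.4.5-hadoop2.7
    container_name: spark-master-0
    environment:
      - INIT_DAEMON_STEP=setup_spark   # bde2020 bootstrap step
    volumes:
      - ${PWD}:/root                   # map the current path into the container
    networks:
      - nebula-net

networks:
  nebula-net:
    external: true
    name: nebula-docker-compose_nebula-net  # join the Nebula Graph cluster network
```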
Then, we can access the Spark environment container with:
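```bash
# container name as defined in the compose file above
docker exec -it spark-master-0 bash
```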
Optionally, we can install `mvn` inside the container to enable Maven build/packaging.
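One way to do that, assuming we fetch a Maven binary release into the container (the version is just an example):

```bash
# run inside the spark-master-0 container
cd /root
curl -fsSLO https://archive.apache.org/dist/maven/maven-3/3.8.8/binaries/apache-maven-3.8.8-bin.tar.gz
tar -xzf apache-maven-3.8.8-bin.tar.gz
export PATH=/root/apache-maven-3.8.8/bin:$PATH
mvn --version
```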
Run Spark Connector
In this section I will show you how to build the Nebula Graph Spark Connector from its source code.
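First, clone the repository (here, under `/root` inside the Spark container):

```bash
git clone https://github.com/vesoft-inc/nebula-spark-connector.git
cd nebula-spark-connector
```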
Now let’s replace the example code:
In this file we will put the following code, where two functions, `readVertex` and `readEdges`, are created on the `basketballplayer` graph space:
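A sketch along the lines of the connector's bundled example; the meta address assumes the docker-compose service name, and `follow`/`degree` come from the `basketballplayer` dataset:

```scala
package com.vesoft.nebula.examples.connector

import com.vesoft.nebula.connector.connector.NebulaDataFrameReader
import com.vesoft.nebula.connector.{NebulaConnectionConfig, ReadNebulaConfig}
import org.apache.spark.sql.SparkSession

object NebulaSparkReaderExample {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("NebulaSparkReaderExample")
      .getOrCreate()

    readVertex(spark)
    readEdges(spark)

    spark.stop()
  }

  def readVertex(spark: SparkSession): Unit = {
    val config = NebulaConnectionConfig.builder()
      .withMetaAddress("metad0:9559") // meta service inside the docker network
      .build()
    val readVertexConfig = ReadNebulaConfig.builder()
      .withSpace("basketballplayer")
      .withLabel("player")
      .withReturnCols(List("name", "age"))
      .build()
    val vertices = spark.read.nebula(config, readVertexConfig).loadVerticesToDF()
    vertices.show(20)
  }

  def readEdges(spark: SparkSession): Unit = {
    val config = NebulaConnectionConfig.builder()
      .withMetaAddress("metad0:9559")
      .build()
    val readEdgeConfig = ReadNebulaConfig.builder()
      .withSpace("basketballplayer")
      .withLabel("follow")            // edge type
      .withReturnCols(List("degree"))
      .build()
    val edges = spark.read.nebula(config, readEdgeConfig).loadEdgesToDF()
    edges.show(20)
  }
}
```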
Then build it:
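The exact flags may vary with the repo's README; roughly:

```bash
# from the nebula-spark-connector repo root
mvn clean package -DskipTests -Dmaven.javadoc.skip=true
```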
Execute it on Spark:
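The class name matches the example above; the JAR file name will vary with the version you built:

```bash
/spark/bin/spark-submit --master local \
    --class com.vesoft.nebula.examples.connector.NebulaSparkReaderExample \
    example/target/example-3.0-SNAPSHOT-jar-with-dependencies.jar
```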
If everything works, the `player` vertices and `follow` edges will be printed as dataframes.
There are more examples under the Spark Connector repo, including one for GraphX. Please note that GraphX assumes the vertex ID is numeric, so you will need to convert string vertex IDs to numeric values on the fly. Please refer to the example in Nebula Algorithm on how to mitigate that.
Nebula Exchange
Nebula Exchange is a Spark library that can read data from multiple sources and write it either to Nebula Graph directly or to Nebula Graph SST files.

To use Nebula Exchange, we need to configure "where to fetch data sources" and "where to write graph data to" in a conf file, and then submit the Exchange package to Spark with that conf file specified.

Now let's do a hands-on test of Nebula Exchange in the same environment we created in the previous chapter.
Hands-on Nebula Exchange
Here, we are using Nebula Exchange to consume data from a CSV file, in which the first column is the Vertex ID, and the second and third columns are "name" and "age", respectively.
First, let's create the configuration file `exchange.conf` in the `HOCON` format, where:

- in `.nebula`, information regarding the Nebula Graph cluster is configured;
- in `.tags`, information regarding vertices, such as how the required fields map to our data source (in this case, the CSV file), is configured.

Then place the `player.csv` and `exchange.conf` files in the working directory and submit the Exchange package to Spark, as sketched below.
Please refer to the Nebula Exchange documentation and configuration examples for more data sources. For how to write Spark data into SST files, you can refer to both the documentation and the Nebula Exchange SST 2.x Hands-on Guide (link in Chinese).
Nebula Algorithm
Built on top of Nebula Spark Connector and GraphX, Nebula Algorithm is a Spark library and application to run graph algorithms (PageRank, LPA, etc.) on graph data in Nebula Graph.
Calling with spark-submit
When we call Nebula Algorithm with `spark-submit`, from a how-to-run perspective it is quite similar to Nebula Exchange: the algorithm is described in a conf file and the package is submitted to Spark. This post comes with a hands-on example, too.
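As a sketch (the class name is from the Nebula Algorithm repo; the JAR and conf file names here are placeholders):

```bash
/spark/bin/spark-submit --master local \
    --class com.vesoft.nebula.algorithm.Main \
    nebula-algorithm.jar -p pagerank.conf
```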
Calling Nebula Algorithm as a library in code
We can also call Nebula Algorithm in Spark as a library. This approach gives you more control over the output format of the algorithm, and it also makes it possible to run algorithms on non-numerical vertex ID types; see here.
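For instance, a minimal sketch of calling PageRank from the library (the `PRConfig`/`PageRankAlgo` names follow the Nebula Algorithm repo; `edgeDF` is an assumed edge dataframe):

```scala
import com.vesoft.nebula.algorithm.config.PRConfig
import com.vesoft.nebula.algorithm.lib.PageRankAlgo
import org.apache.spark.sql.{DataFrame, SparkSession}

// `edgeDF` is assumed to hold edges with numeric src/dst ID columns,
// e.g. loaded with the Nebula Spark Connector's loadEdgesToDF.
def runPageRank(spark: SparkSession, edgeDF: DataFrame): DataFrame = {
  val prConfig = new PRConfig(10, 0.85)              // max iterations, reset probability
  PageRankAlgo.apply(spark, edgeDF, prConfig, false) // false = unweighted edges
}
```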
PySpark for Nebula Graph
Finally, if you want to make Spark and Nebula Graph work together using Python, PySpark is the go-to solution. In this section, I will show you how to connect Spark and Nebula Graph using Nebula Spark Connector with the help of PySpark.
PySpark can call Java or Scala packages from Python, which makes it very easy to use the Spark Connector with Python.
Here I am doing this from the PySpark entry point in `/spark/bin/pyspark`, with the Nebula Spark Connector's JAR package specified via `--driver-class-path` and `--jars`:
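```bash
# the JAR file name depends on the connector version you built or downloaded
/spark/bin/pyspark --driver-class-path nebula-spark-connector.jar \
    --jars nebula-spark-connector.jar
```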
Then, rather than passing `NebulaConnectionConfig` and `ReadNebulaConfig` to `spark.read.nebula`, we should instead call `spark.read.format("com.vesoft.nebula.connector.NebulaDataSource")` and provide the connector's options:
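A sketch in the PySpark shell (option keys per the connector's data source; the meta address assumes the docker-compose deployment):

```python
df = (
    spark.read.format("com.vesoft.nebula.connector.NebulaDataSource")
    .option("type", "vertex")               # read vertices (vs. "edge")
    .option("spaceName", "basketballplayer")
    .option("label", "player")
    .option("returnCols", "name,age")
    .option("metaAddress", "metad0:9559")
    .option("partitionNumber", 1)
    .load()
)
df.show()
```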
Voilà!
I also made the same connection using Scala, even though I have almost zero Scala knowledge. :-P