Course: Big Data - IU S23
Author: Firas Jolha
This dataset includes results of international football matches, from the very first official match up to the present. The matches range from the FIFA World Cup to the FIFI Wild Cup to regular friendly matches. The matches are strictly men's full internationals; the data does not include Olympic Games matches or matches where at least one of the teams was the nation's B-team, an U-23 side, or a league select team.
results.csv includes the following columns:
For the datasets of scorers and shootouts, you can check this Kaggle data card.
Go to the link to do the test.
Spark SQL is a Spark module for structured data processing. It allows you to query structured data using either SQL or the DataFrame API.
The pyspark.sql module is used to perform SQL-like operations on data stored in memory. You can either use the programming API to query the data or write ANSI SQL queries similar to an RDBMS. You can also mix both, for example, by applying the DataFrame API to the result of an SQL query.
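As a quick illustration, here is a minimal sketch (the file path, view name, and column names such as home_score are assumptions based on the results dataset) of querying the data with SQL and then applying the DataFrame API to the result:

```python
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession, the entry point to Spark SQL
spark = SparkSession.builder.appName("football-results").getOrCreate()

# Read the results.csv dataset into a DataFrame (path is assumed)
results = spark.read.csv("results.csv", header=True, inferSchema=True)

# Register a temporary view so the data can be queried with SQL
results.createOrReplaceTempView("results")

# Pure SQL query
sql_df = spark.sql("SELECT home_team, home_score FROM results")

# Mixing both: apply DataFrame API calls on the result of an SQL query
mixed_df = sql_df.filter(sql_df.home_score > 3)
mixed_df.show(5)
```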
The following are the important classes of the SQL module.
pyspark.sql.SparkSession - The main entry point for DataFrame and SQL functionality.
pyspark.sql.DataFrame - A distributed collection of data organized into named columns.
pyspark.sql.Column - A column expression in a DataFrame.
pyspark.sql.Row - A row of data in a DataFrame.
pyspark.sql.GroupedData - An object type that is returned by DataFrame.groupBy().
pyspark.sql.DataFrameNaFunctions - Methods for handling missing data (null values).
pyspark.sql.DataFrameStatFunctions - Methods for statistics functionality.
pyspark.sql.functions - The standard built-in functions.
pyspark.sql.types - The SQL data types available in Spark.
pyspark.sql.Window - Used to work with window functions.
Spark SQL is one of the most used Spark modules for processing structured, columnar data. Once you have created a DataFrame, you can interact with the data using SQL syntax. In other words, Spark SQL brings native raw SQL queries to Spark, meaning you can run traditional ANSI SQL on a Spark DataFrame.
In order to use SQL, first register a temporary table/view on the DataFrame using the createOrReplaceTempView() function. Once created, this view can be accessed throughout the SparkSession using the sql() method, and it will be dropped when the SparkSession is terminated. Use the sql() method of the SparkSession object to run the query; it returns a new DataFrame.
Create SQL View
Create a temporary view on the results dataframe.
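A minimal sketch, assuming results is the DataFrame loaded from results.csv:

```python
# Register the DataFrame as a temporary view named "results"
results.createOrReplaceTempView("results")

# The view can now be queried with spark.sql() for the rest of the SparkSession
spark.sql("SELECT * FROM results").show(5)
```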
Spark SQL to Select Columns
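For example (a sketch, with column names assumed from the results dataset), the same columns can be selected with the DataFrame API or with SQL on the registered view:

```python
# DataFrame API: select specific columns
results.select("date", "home_team", "away_team").show(5)

# Equivalent SQL query on the "results" view
spark.sql("SELECT date, home_team, away_team FROM results").show(5)
```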
Filter Rows
To filter the rows of the data, you can use the filter() or where() function from the DataFrame API.
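For example (a sketch; the column names and conditions are assumptions):

```python
from pyspark.sql.functions import col

# Keep only matches where the home team scored more than 5 goals
results.filter(col("home_score") > 5).show(5)

# where() is an alias of filter() and behaves the same way
results.where(col("tournament") == "FIFA World Cup").show(5)
```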
Similarly, in SQL you can use the WHERE clause as follows.
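For instance, assuming the registered results view:

```python
spark.sql("""
    SELECT date, home_team, away_team, home_score
    FROM results
    WHERE home_score > 5
""").show(5)
```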
Sorting
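Sorting can be done with the orderBy() (or sort()) method of the DataFrame API, or with an ORDER BY clause in SQL. A minimal sketch, with column names assumed:

```python
from pyspark.sql.functions import col

# DataFrame API: sort matches by home_score in descending order
results.orderBy(col("home_score").desc()).show(5)

# Equivalent SQL on the registered view
spark.sql("SELECT * FROM results ORDER BY home_score DESC").show(5)
```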
Grouping
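Grouping uses groupBy() in the DataFrame API (it returns a GroupedData object, listed among the classes above) or a GROUP BY clause in SQL. A sketch, with column names assumed:

```python
from pyspark.sql.functions import avg, count

# DataFrame API: number of matches and average home score per tournament
results.groupBy("tournament") \
    .agg(count("*").alias("matches"), avg("home_score").alias("avg_home_score")) \
    .show(5)

# Equivalent SQL
spark.sql("""
    SELECT tournament, COUNT(*) AS matches, AVG(home_score) AS avg_home_score
    FROM results
    GROUP BY tournament
""").show(5)
```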
PySpark's join() has the syntax below and can be accessed directly from a DataFrame.
The join() operation takes the parameters shown below and returns a DataFrame.
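The method signature is join(other, on=None, how=None), where other is the right-hand DataFrame, on is a column name, a list of column names, or a join expression, and how is the join type (the default is "inner"; values such as "left", "right", "outer", "left_semi", and "left_anti" are also supported). Below is a minimal sketch; the goalscorers.csv file and its columns are assumptions for this example:

```python
# Hypothetical second DataFrame with goal scorers (file name and columns assumed)
scorers = spark.read.csv("goalscorers.csv", header=True, inferSchema=True)

# Inner join on the columns shared by both DataFrames
joined = results.join(scorers, on=["date", "home_team"], how="inner")
joined.show(5)
```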
You can also write a join expression by adding the where() and filter() methods on the DataFrame, and you can join on multiple columns, as shown below.
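A sketch of both variants, again with the scorers DataFrame from the previous example assumed:

```python
# Join expression plus where()/filter() applied to the joined result
results.join(scorers, results["date"] == scorers["date"], "inner") \
    .where(results["home_team"] == scorers["home_team"]) \
    .show(5)

# Join on multiple columns by combining conditions with &
cond = (results["date"] == scorers["date"]) & \
       (results["home_team"] == scorers["home_team"])
results.join(scorers, cond, "inner").show(5)
```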
You can read about anti-joins, semi-joins, and unions here.