To get unique values in a column of a PySpark DataFrame, we can use the `distinct()` method. How we apply it depends on whether we want the unique values of a single column or the distinct combinations across multiple columns [1][2][4][5][6].
Here is an example of how to get unique values from a single column:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# create a sample DataFrame
data = [("Alice", 25), ("Bob", 30), ("Alice", 35), ("Charlie", 40)]
df = spark.createDataFrame(data, ["Name", "Age"])

# get unique values from the "Name" column
unique_names = df.select(col("Name")).distinct().rdd.flatMap(lambda x: x).collect()
print(unique_names)  # e.g. ['Alice', 'Bob', 'Charlie'] (order not guaranteed)
```
In this example, we first create a sample DataFrame with two columns, "Name" and "Age". We use `select()` to keep only the "Name" column, then `distinct()` to drop duplicate values. Finally, `rdd.flatMap(lambda x: x).collect()` unwraps each single-field `Row` into its plain value, giving a list of strings.
Here is an example of how to get the distinct combinations of multiple columns:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# create a sample DataFrame
data = [("Alice", 25), ("Bob", 30), ("Alice", 35), ("Charlie", 40)]
df = spark.createDataFrame(data, ["Name", "Age"])

# get the distinct (Name, Age) combinations as a list of tuples
unique_values = df.select("Name", "Age").distinct().rdd.map(tuple).collect()
print(unique_values)
```
In this example, we select the columns of interest (here, all of them), then call `distinct()` to drop duplicate rows. Finally, `rdd.map(tuple).collect()` converts each `Row` to a tuple, giving a list of tuples. Note that `flatMap` would be wrong here: it would flatten every row into one interleaved list of names and ages rather than preserving the row structure.
Note that `distinct()` returns a new DataFrame containing only the distinct rows; it does not return a Python list. To materialize the values on the driver, we still need `collect()` (which yields `Row` objects) together with `flatMap` or `map` as shown above.