How to count unique ID after groupBy in pyspark

Python, PySpark, Apache Spark SQL

Python Problem Overview


I'm using the following code to aggregate students per year. The goal is to get the total number of students for each year.

from pyspark.sql.functions import col
import pyspark.sql.functions as fn
gr = Df2.groupby(['Year'])
df_grouped = gr.agg(fn.count(col('Student_ID')).alias('total_student_by_year'))

The problem is that many IDs are repeated, so the resulting counts are wrong and far too large.

I want to aggregate the students by year and count the number of students per year, counting each ID only once.

Python Solutions


Solution 1 - Python

Use the countDistinct function:

from pyspark.sql.functions import countDistinct
x = [("2001","id1"),("2002","id1"),("2002","id1"),("2001","id1"),("2001","id2"),("2001","id2"),("2002","id2")]
y = spark.createDataFrame(x,["year","id"])

gr = y.groupBy("year").agg(countDistinct("id"))
gr.show()

output

+----+------------------+
|year|count(DISTINCT id)|
+----+------------------+
|2002|                 2|
|2001|                 2|
+----+------------------+

Solution 2 - Python

You can also do:

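y.groupBy("year", "id").count().groupBy("year").count()

Here y is the DataFrame from Solution 1. The first groupBy collapses duplicate (year, id) pairs into one row each, and the second counts those rows per year, so the result (in a column named count) is the number of unique students per year.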
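To see why two chained groupings give a distinct count, the same logic can be traced in plain Python without Spark. This is only an illustration of the reasoning, using the sample rows from Solution 1; the variable names are made up for the sketch.

```python
from collections import Counter

# Same sample rows as in Solution 1: (year, id)
rows = [("2001", "id1"), ("2002", "id1"), ("2002", "id1"), ("2001", "id1"),
        ("2001", "id2"), ("2001", "id2"), ("2002", "id2")]

# Step 1, like groupBy("year", "id").count():
# one entry per distinct (year, id) pair, however often it was repeated.
pairs = Counter(rows)

# Step 2, like groupBy("year").count():
# count how many distinct pairs each year has.
students_per_year = Counter(year for (year, _student_id) in pairs)

print(dict(students_per_year))
```

Each year ends up with 2, matching the countDistinct output shown in Solution 1.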

Solution 3 - Python

Note that neither countDistinct() nor multiple aggregations are supported on streaming DataFrames.

Solution 4 - Python

If you are working with an older Spark version and don't have the countDistinct function, you can replicate it using the combination of size and collect_set functions like so:

import pyspark.sql.functions as fn
gr = y.groupBy("year").agg(fn.size(fn.collect_set("id")).alias("distinct_count"))

In case you have to count distinct over multiple columns, simply concatenate the columns into a new one using concat and perform the same as above.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Lizou | View Question on Stackoverflow
Solution 1 - Python | pauli | View Answer on Stackoverflow
Solution 2 - Python | information_interchange | View Answer on Stackoverflow
Solution 3 - Python | Arvind-MSFT | View Answer on Stackoverflow
Solution 4 - Python | dsalaj | View Answer on Stackoverflow