Spark SQL Row_number() PartitionBy Sort Desc

PythonApache SparkPysparkApache Spark-SqlWindow Functions

Python Problem Overview


I've successfully create a row_number() partitionBy by in Spark using Window, but would like to sort this by descending, instead of the default ascending. Here is my working code:

from pyspark import HiveContext
from pyspark.sql.types import *
from pyspark.sql import Row, functions as F
from pyspark.sql.window import Window

data_cooccur.select("driver", "also_item", "unit_count", 
    F.rowNumber().over(Window.partitionBy("driver").orderBy("unit_count")).alias("rowNum")).show()

That gives me this result:

 +------+---------+----------+------+
 |driver|also_item|unit_count|rowNum|
 +------+---------+----------+------+
 |   s10|      s11|         1|     1|
 |   s10|      s13|         1|     2|
 |   s10|      s17|         1|     3|

And here I add the desc() to order descending:

data_cooccur.select("driver", "also_item", "unit_count", F.rowNumber().over(Window.partitionBy("driver").orderBy("unit_count").desc()).alias("rowNum")).show()

And get this error:

> AttributeError: 'WindowSpec' object has no attribute 'desc'

What am I doing wrong here?

Python Solutions


Solution 1 - Python

desc should be applied on a column not a window definition. You can use either a method on a column:

from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window

F.row_number().over(
    Window.partitionBy("driver").orderBy(col("unit_count").desc())
)

or a standalone function:

from pyspark.sql.functions import desc
from pyspark.sql.window import Window

F.row_number().over(
    Window.partitionBy("driver").orderBy(desc("unit_count"))
)

Solution 2 - Python

Or you can use the SQL code in Spark-SQL:

from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .master('local[*]')\
    .appName('Test')\
    .getOrCreate()

spark.sql("""
    select driver
        ,also_item
        ,unit_count
        ,ROW_NUMBER() OVER (PARTITION BY driver ORDER BY unit_count DESC) AS rowNum
    from data_cooccur
""").show()

Solution 3 - Python

Update Actually, I tried looking more into this, and it appears to not work. (in fact it throws an error). The reason why it didn't work is that I had this code under a call to display() in Databricks (code after the display() call is never run). It seems like the orderBy() on a dataframe and the orderBy() on a window are not actually the same. I will keep this answer up just for negative confirmation

As of PySpark 2.4,(and probably earlier), simply adding in the keyword ascending=False into the orderBy call works for me.

Ex.

personal_recos.withColumn("row_number", F.row_number().over(Window.partitionBy("COLLECTOR_NUMBER").orderBy("count", ascending=False)))

and

personal_recos.withColumn("row_number", F.row_number().over(Window.partitionBy("COLLECTOR_NUMBER").orderBy(F.col("count").desc())))

seem to give me the same behaviour.

Recommended Window Functions Solutions

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionjKrautView Question on Stackoverflow
Solution 1 - Pythonzero323View Answer on Stackoverflow
Solution 2 - PythonkennyutView Answer on Stackoverflow
Solution 3 - Pythoninformation_interchangeView Answer on Stackoverflow