Show distinct column values in pyspark dataframe

Python, Apache Spark, Pyspark, Apache Spark-Sql

Python Problem Overview


With a pyspark dataframe, how do you do the equivalent of Pandas df['col'].unique()?

I want to list out all the unique values in a pyspark dataframe column.

Not the SQL way (registerTempTable, then a SQL query for distinct values).

Also, I don't need groupby followed by countDistinct; instead I want to check the distinct VALUES in that column.

Python Solutions


Solution 1 - Python

This should help to get distinct values of a column:

df.select('column1').distinct().collect()

Note that .collect() has no built-in limit on how many values it returns, so this might be slow on a column with many distinct values. Use .show() instead, or add .limit(20) before .collect(), to keep the result manageable.

Solution 2 - Python

Let's assume we're working with the following data (two columns, k and v, where k contains three entries, two of them unique):

+---+---+
|  k|  v|
+---+---+
|foo|  1|
|bar|  2|
|foo|  3|
+---+---+

With a Pandas dataframe:

import pandas as pd
p_df = pd.DataFrame([("foo", 1), ("bar", 2), ("foo", 3)], columns=("k", "v"))
p_df['k'].unique()

This returns an ndarray, i.e. array(['foo', 'bar'], dtype=object)

You asked for a "pyspark dataframe alternative for pandas df['col'].unique()". Now, given the following Spark dataframe:

s_df = sqlContext.createDataFrame([("foo", 1), ("bar", 2), ("foo", 3)], ('k', 'v'))

If you want the same result from Spark, i.e. an ndarray, use toPandas():

s_df.toPandas()['k'].unique()

Alternatively, if you don't need an ndarray specifically and just want a list of the unique values of column k:

s_df.select('k').distinct().rdd.map(lambda r: r[0]).collect()

Finally, you can also use a list comprehension as follows:

[i.k for i in s_df.select('k').distinct().collect()]

Solution 3 - Python

You can use df.dropDuplicates(['col1','col2']) to keep only the rows that are distinct with respect to the columns listed in the array.

Solution 4 - Python

If you want to see the distinct values of a specific column in your dataframe, the following code is enough. It shows up to 100 distinct values of the colname column in the df dataframe, without truncating long values.

df.select('colname').distinct().show(100, False)

If you want to do something more with the distinct values, you can keep them in a new DataFrame:

a = df.select('colname').distinct()

Solution 5 - Python

collect_set can help to get the unique values from a given column of a pyspark.sql.DataFrame (F here is pyspark.sql.functions):

from pyspark.sql import functions as F
df.select(F.collect_set("column").alias("column")).first()["column"]

Solution 6 - Python

You could do:

distinct_column = 'somecol' 

distinct_column_vals = df.select(distinct_column).distinct().collect()
distinct_column_vals = [v[distinct_column] for v in distinct_column_vals]

Solution 7 - Python

Besides dropDuplicates, there is also a method named the way we know it from pandas, drop_duplicates:

> drop_duplicates() is an alias for dropDuplicates().

Example

s_df = sqlContext.createDataFrame([("foo", 1),
                                   ("foo", 1),
                                   ("bar", 2),
                                   ("foo", 3)], ('k', 'v'))
s_df.show()

+---+---+
|  k|  v|
+---+---+
|foo|  1|
|foo|  1|
|bar|  2|
|foo|  3|
+---+---+

Drop by subset

s_df.drop_duplicates(subset = ['k']).show()

+---+---+
|  k|  v|
+---+---+
|bar|  2|
|foo|  1|
+---+---+

Drop fully duplicated rows

s_df.drop_duplicates().show()

+---+---+
|  k|  v|
+---+---+
|bar|  2|
|foo|  3|
|foo|  1|
+---+---+

Solution 8 - Python

Run this first

df.createOrReplaceTempView('df')

Then run

spark.sql("""
    SELECT distinct
        column_name
    FROM
        df
    """).show()

Solution 9 - Python

If you want to select the distinct rows across ALL columns of a DataFrame (df), then:

df.select('*').distinct().show(10,truncate=False)

Solution 10 - Python

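Let us suppose that your original DataFrame is called df. Then you can count how often each distinct value occurs and show the values ordered by frequency:

from pyspark.sql import functions as F
df1 = df.groupBy('column_1').agg(F.count('column_1').alias('trip_count'))
df1.sort(df1.trip_count.desc()).show()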

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Satya | View Question on Stackoverflow
Solution 1 - Python | Pabbati | View Answer on Stackoverflow
Solution 2 - Python | eddies | View Answer on Stackoverflow
Solution 3 - Python | seufagner | View Answer on Stackoverflow
Solution 4 - Python | Nidhi | View Answer on Stackoverflow
Solution 5 - Python | Hari Baskar | View Answer on Stackoverflow
Solution 6 - Python | muon | View Answer on Stackoverflow
Solution 7 - Python | ansev | View Answer on Stackoverflow
Solution 8 - Python | Joseph Jacob | View Answer on Stackoverflow
Solution 9 - Python | Kapil Sharma | View Answer on Stackoverflow
Solution 10 - Python | Marioanzas | View Answer on Stackoverflow