Pyspark dataframe operator "IS NOT IN"
Pyspark Problem Overview
I would like to rewrite this from R to Pyspark. Any nice-looking suggestions?

array <- c(1, 2, 3)
dataset <- filter(dataset, !(column %in% array))
Pyspark Solutions
Solution 1 - Pyspark
In pyspark you can do it like this:
array = [1, 2, 3]
dataframe.filter(dataframe.column.isin(array) == False)
Or using the ~ operator (bitwise NOT, which PySpark overloads as logical negation on column expressions):
dataframe.filter(~dataframe.column.isin(array))
Solution 2 - Pyspark
Use the ~ operator, which negates the condition:
df_filtered = df.filter(~df["column_name"].isin([1, 2, 3]))
Solution 3 - Pyspark
df_result = df[df.column_name.isin([1, 2, 3]) == False]
Solution 4 - Pyspark
Slightly different syntax, with a "date" data set:

toGetDates = {'2017-11-09', '2017-11-11', '2017-11-12'}
df = df.filter(df['DATE'].isin(toGetDates) == False)
Solution 5 - Pyspark
The * unpacking operator is not needed, because isin() accepts a list directly (values is used below instead of list to avoid shadowing the built-in). So:

values = [1, 2, 3]
dataframe.filter(~dataframe.column.isin(values))
Solution 6 - Pyspark
You can use .subtract(), buddy.

Example:

df1 = df.filter(df.column.isin([1, 2, 3]))
df2 = df.subtract(df1)

This way, df2 is defined as everything in df that is not in df1.
Solution 7 - Pyspark
You can also loop the array and filter:
array = [1, 2, 3]
for i in array:
    df = df.filter(df["column"] != i)