Filter df when values match part of a string in PySpark

Python, Apache Spark, Pyspark, Apache Spark-Sql

Python Problem Overview


I have a large pyspark.sql.dataframe.DataFrame and I want to keep (i.e. filter for) all rows where the URL stored in the location column contains a pre-determined string, e.g. 'google.com'.

I have tried:

import pyspark.sql.functions as sf
df.filter(sf.col('location').contains('google.com')).show(5)

but this throws a

TypeError: 'Column' object is not callable

How do I go about filtering my df properly? Many thanks in advance!

Python Solutions


Solution 1 - Python

Spark 2.2 onwards

> df.filter(df.location.contains('google.com'))

> See the Spark 2.2 documentation.


Spark 2.1 and before

> You can use plain SQL in filter:
> df.filter("location like '%google.com%'")

> or with DataFrame column methods:
> df.filter(df.location.like('%google.com%'))

> See the Spark 2.1 documentation.
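
As a quick end-to-end check, here is a minimal sketch (assuming a local SparkSession and a small made-up DataFrame with an id and a location column) showing that the contains and LIKE approaches keep the same rows:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Hypothetical sample data: only the first URL contains 'google.com'
df = spark.createDataFrame(
    [(1, "https://www.google.com/search?q=spark"),
     (2, "https://www.example.org/index.html")],
    ["id", "location"],
)

# Spark 2.2+: Column.contains
df.filter(df.location.contains('google.com')).show(truncate=False)

# Spark 2.1 and earlier: SQL LIKE as a string predicate...
df.filter("location like '%google.com%'").show(truncate=False)

# ...or via the Column.like method
df.filter(df.location.like('%google.com%')).show(truncate=False)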

Solution 2 - Python

pyspark.sql.Column.contains() is only available in PySpark 2.2 and above. (On earlier versions the attribute lookup still succeeds, because an unknown attribute on a Column is treated as a nested-field reference and returns another Column; calling that Column is what raises the TypeError from the question.)

df.where(df.location.contains('google.com'))
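
Note that where is an alias for filter, so the two calls are interchangeable; a minimal sketch, assuming the same hypothetical df with a location column as in the question:

# where() and filter() are aliases; both keep rows whose
# location contains the substring
df.where(df.location.contains('google.com')).show()
df.filter(df.location.contains('google.com')).show()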

Solution 3 - Python

When filtering a DataFrame on string values, I find that the lower and upper functions from pyspark.sql.functions come in handy if your data could have column entries like "foo" and "Foo":

import pyspark.sql.functions as sql_fun
result = source_df.filter(sql_fun.lower(source_df.col_name).contains("foo"))
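
For instance, with hypothetical mixed-case entries, lowercasing the column before the substring check keeps both spellings; a minimal sketch assuming a local SparkSession:

import pyspark.sql.functions as sql_fun
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Hypothetical sample data with inconsistent casing
source_df = spark.createDataFrame(
    [("Foo fighters",), ("foo bar",), ("baz",)],
    ["col_name"],
)

# lower() normalizes the column before the substring check,
# so "Foo fighters" and "foo bar" both match
source_df.filter(sql_fun.lower(source_df.col_name).contains("foo")).show()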

Attributions

All content for this solution is sourced from the original question on Stack Overflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type        | Original Author | Original Content on Stack Overflow
Question            | gaatjeniksaan   | View Question on Stack Overflow
Solution 1 - Python | mrsrinivas      | View Answer on Stack Overflow
Solution 2 - Python | joaofbsm        | View Answer on Stack Overflow
Solution 3 - Python | caffreyd        | View Answer on Stack Overflow