Gang Of Coders
All Apache Spark Solutions on Gang of Coders
Total of 219 Apache Spark Solutions
Spark - repartition() vs coalesce()
Apache Spark
Distributed Computing
Rdd
Difference between DataFrame, Dataset, and RDD in Spark
Dataframe
Apache Spark
Apache Spark-Sql
Rdd
Apache Spark-Dataset
What is the difference between map and flatMap and a good use case for each?
Apache Spark
How to show full column content in a Spark Dataframe?
Apache Spark
Dataframe
Spark Csv
Output Formatting
How to change dataframe column names in pyspark?
Python
Apache Spark
Pyspark
Apache Spark-Sql
What are workers, executors, cores in Spark Standalone cluster?
Apache Spark
Distributed Computing
Spark java.lang.OutOfMemoryError: Java heap space
Out of-Memory
Apache Spark
Task not serializable: java.io.NotSerializableException when calling function outside closure only on classes not objects
Scala
Apache Spark
Serialization
Apache Spark: The number of cores vs. the number of executors
Hadoop
Apache Spark
Hadoop Yarn
What is the difference between cache and persist?
Apache Spark
Distributed Computing
Rdd
How to stop INFO messages displaying on spark console?
Apache Spark
Log4j
Spark Submit
Spark performance for Scala vs Python
Scala
Performance
Apache Spark
Pyspark
Rdd
Add JAR files to a Spark job - spark-submit
Java
Scala
Apache Spark
Jar
Spark Submit
(Why) do we need to call cache or persist on a RDD
Scala
Apache Spark
Rdd
How to read multiple text files into a single RDD?
Apache Spark
How to add a constant column in a Spark DataFrame?
Python
Apache Spark
Dataframe
Pyspark
Apache Spark-Sql
How to select the first row of each group?
Sql
Scala
Apache Spark
Dataframe
Apache Spark-Sql
How can I change column types in Spark SQL's DataFrame?
Scala
Apache Spark
Apache Spark-Sql
How to turn off INFO logging in Spark?
Python
Scala
Apache Spark
Hadoop
Pyspark
How do I add a new column to a Spark DataFrame (using PySpark)?
Python
Apache Spark
Dataframe
Pyspark
Apache Spark-Sql
How to sort by column in descending order in Spark SQL?
Scala
Apache Spark
Apache Spark-Sql
Concatenate columns in Apache Spark DataFrame
Sql
Apache Spark
Dataframe
Apache Spark-Sql
How are stages split into tasks in Spark?
Apache Spark
Spark - load CSV file as DataFrame?
Scala
Apache Spark
Hadoop
Apache Spark-Sql
Hdfs
How to store custom objects in Dataset?
Scala
Apache Spark
Apache Spark-Dataset
Apache Spark-Encoders
Apache Spark: map vs mapPartitions?
Performance
Scala
Apache Spark
Rdd
How to set Apache Spark Executor memory
Memory
Apache Spark
Write single CSV file using spark-csv
Scala
Csv
Apache Spark
Spark Csv
How to convert rdd object to dataframe in spark
Scala
Apache Spark
Apache Spark-Sql
Rdd
Filter Pyspark dataframe column with None value
Python
Apache Spark
Dataframe
Pyspark
Apache Spark-Sql
Show distinct column values in pyspark dataframe
Python
Apache Spark
Pyspark
Apache Spark-Sql
Convert spark DataFrame column to python list
Python
Apache Spark
Pyspark
Spark Dataframe
How to define partitioning of DataFrame?
Scala
Apache Spark
Dataframe
Apache Spark-Sql
Partitioning
How to check if spark dataframe is empty?
Apache Spark
Pyspark
Apache Spark-Sql
How to change a dataframe column from String type to Double type in PySpark?
Python
Apache Spark
Dataframe
Pyspark
Apache Spark-Sql
How to print the contents of RDD?
Scala
Apache Spark
importing pyspark in python shell
Python
Apache Spark
Pyspark
How to overwrite the output directory in spark
Apache Spark
How to delete columns in pyspark dataframe
Apache Spark
Apache Spark-Sql
Pyspark
How to kill a running Spark application?
Apache Spark
Hadoop Yarn
Pyspark
Load CSV file with Spark
Python
Csv
Apache Spark
Pyspark
Apache Spark-Sql
Spark Dataframe distinguish columns with duplicated name
Python
Apache Spark
Dataframe
Pyspark
Apache Spark-Sql
Spark - Error "A master URL must be set in your configuration" when submitting an app
Scala
Apache Spark
How to load local file in sc.textFile, instead of HDFS
Scala
Apache Spark
Spark DataFrame groupBy and sort in the descending order (pyspark)
Python
Apache Spark
Dataframe
Pyspark
Apache Spark-Sql
What do the numbers on the progress bar mean in spark-shell?
Apache Spark
Can apache spark run without hadoop?
Hadoop
Amazon S3
Apache Spark
Mapreduce
Mesos
Best way to get the max value in a Spark dataframe column
Python
Apache Spark
Pyspark
Apache Spark-Sql
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries. spark Eclipse on windows 7
Eclipse
Scala
Apache Spark
Convert pyspark string to date format
Python
Apache Spark
Pyspark
Apache Spark-Sql
How to create an empty DataFrame with a specified schema?
Scala
Apache Spark
Dataframe
Apache Spark-Sql
Extract column values of Dataframe as List in Apache Spark
Scala
Apache Spark
Apache Spark-Sql
Why do Spark jobs fail with org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0 in speculation mode?
Apache Spark
What does "Stage Skipped" mean in Apache Spark web UI?
Apache Spark
Rdd
How to fix 'TypeError: an integer is required (got type bytes)' error when trying to run pyspark after installing spark 2.4.4
Apache Spark
Pyspark
How to export a table dataframe in PySpark to csv?
Python
Apache Spark
Dataframe
Apache Spark-Sql
Export to-Csv
How to tune spark executor number, cores and executor memory?
Apache Spark
What are the benefits of Apache Beam over Spark/Flink for batch processing?
Apache Spark
Apache Flink
Apache Beam
Concatenate two PySpark dataframes
Python
Apache Spark
Pyspark
Apache Spark-Sql
What is the difference between spark.sql.shuffle.partitions and spark.default.parallelism?
Performance
Apache Spark
Hadoop
Apache Spark-Sql
Renaming column names of a DataFrame in Spark Scala
Scala
Apache Spark
Dataframe
Apache Spark-Sql
Spark Error - Unsupported class file major version
Java
Python
Macos
Apache Spark
Pyspark
How to save DataFrame directly to Hive?
Scala
Apache Spark
Hive
Apache Spark-Sql
Apache Spark: How to use pyspark with Python 3
Python
Python 3.x
Apache Spark
In what situations can I use Dask instead of Apache Spark?
Python
Pandas
Apache Spark
Dask
Is there a way to take the first 1000 rows of a Spark Dataframe?
Scala
Apache Spark
Join two data frames, select all columns from one and some columns from the other
Dataframe
Apache Spark
Pyspark
Apache Spark-Sql
Split Spark Dataframe string column into multiple columns
Apache Spark
Pyspark
Apache Spark-Sql
How do I set the driver's python version in spark?
Python
Apache Spark
Pyspark
Overwrite specific partitions in spark dataframe write method
Apache Spark
Apache Spark-Sql
Spark Dataframe
How to set up Spark on Windows?
Windows
Apache Spark
Updating a dataframe column in spark
Python
Dataframe
Apache Spark
Pyspark
Apache Spark-Sql
Spark SQL: apply aggregate functions to a list of columns
Apache Spark
Dataframe
Apache Spark-Sql
Aggregate Functions
Mac spark-shell Error initializing SparkContext
Apache Spark
Renaming columns for PySpark DataFrame aggregates
Dataframe
Apache Spark
Pyspark
Apache Spark-Sql
Removing duplicates from rows based on specific columns in an RDD/Spark DataFrame
Apache Spark
Apache Spark-Sql
Pyspark
Get current number of partitions of a DataFrame
Python
Scala
Dataframe
Apache Spark
Apache Spark-Sql
Spark difference between reduceByKey vs. groupByKey vs. aggregateByKey vs. combineByKey
Apache Spark
Grouping
Reducing
How to write unit tests in Spark 2.0+?
Scala
Unit Testing
Apache Spark
Junit
Apache Spark-Sql
Pyspark: Exception: Java gateway process exited before sending the driver its port number
Java
Python
Macos
Apache Spark
Pyspark
How to link PyCharm with PySpark?
Python
Apache Spark
Pyspark
Pycharm
Homebrew
How to pass -D parameter or environment variable to Spark job?
Scala
Apache Spark
Which cluster type should I choose for Spark?
Apache Spark
Hadoop Yarn
Mesos
Apache Spark-Standalone
How to find count of Null and Nan values for each column in a PySpark dataframe efficiently?
Apache Spark
Pyspark
Apache Spark-Sql
How does HashPartitioner work?
Scala
Apache Spark
Rdd
Partitioning
How to make saveAsTextFile NOT split output into multiple files?
Scala
Apache Spark
How to pivot Spark DataFrame?
Dataframe
Apache Spark
Pyspark
Apache Spark-Sql
Pivot
Is it possible to get the current spark context settings in PySpark?
Apache Spark
Config
Pyspark
pyspark dataframe filter or include based on list
Apache Spark
Filter
Pyspark
Apache Spark-Sql
Pyspark: Split multiple array columns into rows
Python
Apache Spark
Dataframe
Pyspark
Apache Spark-Sql
How to prevent java.lang.OutOfMemoryError: PermGen space at Scala compilation?
Scala
Apache Spark
Memory Management
Sbt
Scalatra Sbt
How to find median and quantiles using Spark
Python
Apache Spark
Median
Rdd
Pyspark
What is the difference between spark checkpoint and persist to a disk
Apache Spark
how to filter out a null value from spark dataframe
Scala
Apache Spark
Apache Spark-Sql
Spark Dataframe
Cannot find col function in pyspark
Python
Apache Spark
Pyspark
Apache Spark-Sql
Pyspark Sql
What is the relationship between workers, worker instances, and executors?
Apache Spark
Apache Spark-Standalone
How to use JDBC source to write and read data in (Py)Spark?
Python
Scala
Apache Spark
Apache Spark-Sql
Pyspark
How to join on multiple columns in Pyspark?
Python
Apache Spark
Join
Pyspark
Apache Spark-Sql
How does createOrReplaceTempView work in Spark?
Apache Spark
Apache Spark-Sql
Spark Dataframe
How to use Column.isin with list?
Scala
Apache Spark
Apache Spark-Sql
Create Spark DataFrame. Can not infer schema for type: <type 'float'>
Python
Apache Spark
Dataframe
Pyspark
Apache Spark-Sql
How to make good reproducible Apache Spark examples
Dataframe
Apache Spark
Pyspark
Apache Spark-Sql
Removing duplicate columns after a DF join in Spark
Python
Apache Spark
Pyspark
Apache Spark-Sql
Querying Spark SQL DataFrame with complex types
Sql
Scala
Apache Spark
Dataframe
Apache Spark-Sql
How to loop through each row of dataFrame in pyspark
Apache Spark
Dataframe
For Loop
Pyspark
Apache Spark-Sql
How to check the Spark version
Apache Spark
Cloudera Cdh
Spark - SELECT WHERE or filtering?
Apache Spark
Apache Spark-Sql
How do I convert an array (i.e. list) column to Vector
Python
Apache Spark
Pyspark
Apache Spark-Sql
Apache Spark-Ml
Spark code organization and best practices
Apache Spark
Functional Programming
Code Organization
How to perform union on two DataFrames with different amounts of columns in spark?
Python
Apache Spark
Pyspark
Apache Spark-Sql
Pyspark Dataframes
Add an empty column to Spark DataFrame
Python
Apache Spark
Dataframe
Pyspark
Apache Spark-Sql
Provide schema while reading csv file as a dataframe
Scala
Apache Spark
Dataframe
Apache Spark-Sql
Spark Csv
How do I skip a header from CSV files in Spark?
Scala
Csv
Apache Spark
collect_list by preserving order based on another variable
Python
Apache Spark
Pyspark
Errors when using OFF_HEAP Storage with Spark 1.4.0 and Tachyon 0.6.4
Apache Spark
Apache Spark-Sql
Alluxio
What does setMaster `local[*]` mean in spark?
Scala
Apache Spark
reduceByKey: How does it work internally?
Scala
Apache Spark
Rdd
How to avoid duplicate columns after join?
Scala
Apache Spark
Apache Spark-Sql
Converting Pandas dataframe into Spark dataframe error
Python
Pandas
Apache Spark
Spark Dataframe
Why does join fail with "java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]"?
Scala
Apache Spark
Join
Apache Spark-Sql
Filter df when values matches part of a string in pyspark
Python
Apache Spark
Pyspark
Apache Spark-Sql
Write to multiple outputs by key Spark - one Spark job
Scala
Hadoop
Output
Hdfs
Apache Spark
Apache Spark logging within Scala
Scala
Logging
Apache Spark
How DAG works under the covers in RDD?
Apache Spark
Rdd
Directed Acyclic-Graphs
Spark : how to run spark file from spark shell
Scala
Apache Spark
Cloudera Cdh
Cloudera Manager
How to convert column with string type to int form in pyspark data frame?
Python
Dataframe
Apache Spark
Pyspark
Apache Spark-Sql
Spark Driver in Apache spark
Apache Spark
Apache Spark vs Akka
Apache Spark
Parallel Processing
Akka
Distributed Computing
PySpark: java.lang.OutofMemoryError: Java heap space
Java
Apache Spark
Out of-Memory
Heap Memory
Pyspark
How to query JSON data column using Spark DataFrames?
Scala
Apache Spark
Dataframe
Apache Spark-Sql
Spark Cassandra-Connector
What is the concept of application, job, stage and task in spark?
Apache Spark
PySpark: How to fillna values in dataframe for specific columns?
Apache Spark
Pyspark
Spark Dataframe
How to aggregate values into collection after groupBy?
Scala
Apache Spark
Apache Spark-Sql
How to split Vector into columns - using PySpark
Python
Apache Spark
Pyspark
Apache Spark-Sql
Apache Spark-Ml
Why is "Unable to find encoder for type stored in a Dataset" when creating a dataset of custom case class?
Scala
Apache Spark
Apache Spark-Dataset
Apache Spark-Encoders
How to list all cassandra tables
Scala
Apache Spark
Cassandra
Spark Cassandra-Connector
How to flatten a struct in a Spark dataframe?
Java
Apache Spark
Pyspark
Apache Spark-Sql
PySpark - rename more than one column using withColumnRenamed
Apache Spark
Pyspark
Apache Spark-Sql
Rename
Spark: subtract two DataFrames
Apache Spark
Dataframe
Rdd
How to import multiple csv files in a single load?
Apache Spark
Apache Spark-Sql
Spark Dataframe
How to get name of dataframe column in PySpark?
Apache Spark
Pyspark
Apache Spark-Sql
Columnname
Median / quantiles within PySpark groupBy
Apache Spark
Pyspark
Apache Spark-Sql
Pyspark Sql
How to convert a DataFrame back to normal RDD in pyspark?
Python
Apache Spark
Pyspark
How to check Spark Version
Apache Spark
Hadoop
Cloudera
"Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used" on an EMR cluster with 75GB of memory
Apache Spark
Emr
Amazon Emr
Bigdata
How do I log from my Python Spark script
Python
Logging
Apache Spark
How does Spark partition(ing) work on files in HDFS?
Apache Spark
Hdfs
Apache Spark -- Assign the result of UDF to multiple dataframe columns
Python
Apache Spark
Pyspark
Apache Spark-Sql
User Defined-Functions
Pyspark replace strings in Spark dataframe column
Python
Apache Spark
Pyspark
Spark functions vs UDF performance?
Performance
Apache Spark
Pyspark
Apache Spark-Sql
User Defined-Functions
Retrieve top n in each group of a DataFrame in pyspark
Python
Apache Spark
Dataframe
Pyspark
Apache Spark-Sql
PySpark: withColumn() with two conditions and three outcomes
Apache Spark
Hive
Pyspark
Apache Spark-Sql
Hiveql
How to access s3a:// files from Apache Spark?
Hadoop
Apache Spark
Amazon S3
Generate a Spark StructType / Schema from a case class
Apache Spark
Apache Spark-Sql
Importing spark.implicits._ in scala
Scala
Apache Spark
Difference in Used, Committed and Max Heap Memory
Java
Apache Spark
Memory Management
Jvm
Spark Streaming
How to melt Spark DataFrame?
Apache Spark
Pyspark
Apache Spark-Sql
Melt
aggregate function Count usage with groupBy in Spark
Java
Scala
Apache Spark
Pyspark
Apache Spark-Sql
What are the various join types in Spark?
Scala
Apache Spark
Apache Spark-Sql
Spark Dataframe
Apache Spark-2.0
Unpacking a list to select multiple columns from a spark data frame
Apache Spark
Apache Spark-Sql
Spark Dataframe
Difference between == and === in Scala, Spark
Scala
Apache Spark
Pyspark: Pass multiple columns in UDF
Apache Spark
Pyspark
Spark Dataframe
Convert a spark DataFrame to pandas DF
Pandas
Apache Spark
Apache Spark-Sql
Why does a job fail with "No space left on device", but df says otherwise?
Apache Spark
Which operations preserve RDD order?
Apache Spark
Rdd
PySpark groupByKey returning pyspark.resultiterable.ResultIterable
Python
Apache Spark
Pyspark
Why does Spark fail with java.lang.OutOfMemoryError: GC overhead limit exceeded?
Scala
Apache Spark
'PipelinedRDD' object has no attribute 'toDF' in PySpark
Python
Apache Spark
Pyspark
Apache Spark-Sql
Rdd
Automatically and Elegantly flatten DataFrame in Spark SQL
Scala
Apache Spark
Apache Spark-Sql
Spark load data and add filename as dataframe column
Apache Spark
Pyspark
Apache Spark-Sql
What is the difference between Apache Mahout and Apache Spark's MLlib?
Apache Spark
Mahout
Apache Spark-Mllib
Find maximum row per group in Spark DataFrame
Apache Spark
Pyspark
Apache Spark-Sql
Spark sql how to explode without losing null values
Java
Apache Spark
Null
Apache Spark-Sql
Convert date from String to Date format in Dataframes
Apache Spark
Apache Spark-Sql
Spark DataFrame TimestampType - how to get Year, Month, Day values from field?
Python
Timestamp
Apache Spark
Pyspark
How do I detect if a Spark DataFrame has a column
Scala
Apache Spark
Dataframe
Apache Spark-Sql
Apply StringIndexer to several columns in a PySpark Dataframe
Python
Apache Spark
Pyspark
Spark unionAll multiple dataframes
Scala
Apache Spark
Apache Spark-Sql
SparkR vs sparklyr
R
Apache Spark
Sparkr
Sparklyr
get datatype of column using pyspark
Apache Spark
Pyspark
Apache Spark-Sql
Explain the aggregate functionality in Spark (with Python and Scala)
Python
Scala
Apache Spark
Aggregate
Rdd
Append a column to Data Frame in Apache Spark 1.3
Scala
Apache Spark
Dataframe
DataFrame partitionBy to a single Parquet file (per partition)
Apache Spark
Apache Spark-Sql
PySpark: multiple conditions in when clause
Python
Apache Spark
Dataframe
Pyspark
Apache Spark-Sql
Including null values in an Apache Spark Join
Sql
Scala
Apache Spark
Join
Apache Spark-Sql
Spark dataframe: collect() vs select()
Dataframe
Apache Spark
Apache Spark-Sql
What is yarn-client mode in Spark?
Hadoop Yarn
Apache Spark
View RDD contents in Python Spark?
Python
Apache Spark
How to convert Row of a Scala DataFrame into case class most efficiently?
Scala
Apache Spark
Apache Spark-Sql
Under what conditions should cluster deploy mode be used instead of client?
Apache Spark
DataFrame equality in Apache Spark
Scala
Apache Spark
Dataframe
Apache Spark-Sql
Rdd
How do I check for equality using Spark Dataframe without SQL Query?
Scala
Apache Spark
Dataframe
Apache Spark-Sql
Derive multiple columns from a single column in a Spark DataFrame
Scala
Apache Spark
Dataframe
Apache Spark-Sql
User Defined-Functions
Spark parquet partitioning : Large number of files
Apache Spark
Spark Dataframe
Rdd
Apache Spark-2.0
Bigdata
Where are logs in Spark on YARN?
Hadoop
Logging
Apache Spark
Cloudera
Hadoop Yarn
When are accumulators truly reliable?
Apache Spark
How to prevent Spark Executors from getting Lost when using YARN client mode?
Apache Spark
Hadoop Yarn
dataframe: how to groupBy/count then filter on count in Scala
Scala
Apache Spark
Apache Spark-Sql
What is the difference between cube, rollup and groupBy operators?
Sql
Apache Spark
Apache Spark-Sql
Cube
Rollup
What's the difference between join and cogroup in Apache Spark
Scala
Apache Spark
Application report for application_ (state: ACCEPTED) never ends for Spark Submit (with Spark 1.2.0 on YARN)
Apache Spark
Hadoop Yarn
Amazon Emr
Amazon Kinesis
Spark read file from S3 using sc.textFile ("s3n://...)
Java
Scala
Apache Spark
Rdd
Hortonworks Data-Platform
Installing of SparkR
R
Apache Spark
Sparkr
Spark specify multiple column conditions for dataframe join
Apache Spark
Apache Spark-Sql
Rdd
Filtering DataFrame using the length of a column
Python
Apache Spark
Dataframe
Pyspark
Apache Spark-Sql
Spark SQL Row_number() PartitionBy Sort Desc
Python
Apache Spark
Pyspark
Apache Spark-Sql
Window Functions
Reading csv files with quoted fields containing embedded commas
Csv
Apache Spark
Pyspark
Apache Spark-Sql
Apache Spark-2.0
Pyspark: Parse a column of json strings
Python
Json
Apache Spark
Pyspark
Spark yarn cluster vs client - how to choose which one to use?
Apache Spark
Hadoop Yarn
PySpark create new column with mapping from a dict
Python
Apache Spark
Dictionary
Pyspark
Apache Spark-Sql
Python Spark Cumulative Sum by Group Using DataFrame
Apache Spark
Pyspark
Spark Dataframe
How to assign unique contiguous numbers to elements in a Spark RDD
Apache Spark
Apache Spark-Mllib
How do I convert csv file to rdd
Scala
Apache Spark
How to export data from Spark SQL to CSV
Hadoop
Apache Spark
Export to-Csv
Hiveql
Apache Spark-Sql
Spark Window Functions - rangeBetween dates
Sql
Apache Spark
Pyspark
Apache Spark-Sql
Window Functions
FetchFailedException or MetadataFetchFailedException when processing big data set
Apache Spark
Hadoop Yarn
What's the difference between Spark ML and MLLIB packages
Apache Spark
Apache Spark-Mllib
Apache Spark-Ml
Fetching distinct values on a column using Spark DataFrame
Scala
Apache Spark
Dataframe
Apache Spark-Sql
Spark Dataframe
What is the meaning of partitionColumn, lowerBound, upperBound, numPartitions parameters?
Apache Spark
Jdbc
Apache Spark-Sql