How to convert column with string type to int form in pyspark data frame?

Python Dataframe Apache Spark Pyspark Apache Spark-Sql

Python Problem Overview

I have dataframe in pyspark. Some of its numerical columns contain nan so when I am reading the data and checking for the schema of dataframe, those columns will have string type.

How I can change them to int type. I replaced the nan values with 0 and again checked the schema, but then also it's showing the string type for those columns.I am following the below code:

data_df = sqlContext.read.format("csv").load('data.csv',header=True, inferSchema="true")
data_df.printSchema()
data_df = data_df.fillna(0)
data_df.printSchema()

my data looks like this:

here columns Plays and drafts containing integer values but because of nan present in these columns, they are treated as string type.

Python Solutions

Solution 1 - Python

from pyspark.sql.types import IntegerType
data_df = data_df.withColumn("Plays", data_df["Plays"].cast(IntegerType()))
data_df = data_df.withColumn("drafts", data_df["drafts"].cast(IntegerType()))

You can run loop for each column but this is the simplest way to convert string column into integer.

Solution 2 - Python

You could use cast(as int) after replacing NaN with 0,

data_df = df.withColumn("Plays", df.call_time.cast('float'))

Solution 3 - Python

Another way to do it is using the StructField if you have multiple fields that needs to be modified.

Ex:

from pyspark.sql.types import StructField,IntegerType, StructType,StringType
newDF=[StructField('CLICK_FLG',IntegerType(),True),
       StructField('OPEN_FLG',IntegerType(),True),
       StructField('I1_GNDR_CODE',StringType(),True),
       StructField('TRW_INCOME_CD_V4',StringType(),True),
       StructField('ASIAN_CD',IntegerType(),True),
       StructField('I1_INDIV_HHLD_STATUS_CODE',IntegerType(),True)
       ]
finalStruct=StructType(fields=newDF)
df=spark.read.csv('ctor.csv',schema=finalStruct)

Output:

Before

root
 |-- CLICK_FLG: string (nullable = true)
 |-- OPEN_FLG: string (nullable = true)
 |-- I1_GNDR_CODE: string (nullable = true)
 |-- TRW_INCOME_CD_V4: string (nullable = true)
 |-- ASIAN_CD: integer (nullable = true)
 |-- I1_INDIV_HHLD_STATUS_CODE: string (nullable = true)

After:

root
 |-- CLICK_FLG: integer (nullable = true)
 |-- OPEN_FLG: integer (nullable = true)
 |-- I1_GNDR_CODE: string (nullable = true)
 |-- TRW_INCOME_CD_V4: string (nullable = true)
 |-- ASIAN_CD: integer (nullable = true)
 |-- I1_INDIV_HHLD_STATUS_CODE: integer (nullable = true)

This is slightly a long procedure to cast , but the advantage is that all the required fields can be done.

It is to be noted that if only the required fields are assigned the data type, then the resultant dataframe will contain only those fields which are changed.

Content Type	Original Author	Original Content on Stackoverflow
Question	neha	View Question on Stackoverflow
Solution 1 - Python	Sahil Desai	View Answer on Stackoverflow
Solution 2 - Python	Ani Menon	View Answer on Stackoverflow
Solution 3 - Python	Keshav Pradeep Ramanath	View Answer on Stackoverflow

How to convert column with string type to int form in pyspark data frame?

Python Problem Overview

Python Solutions

Solution 1 - Python

Solution 2 - Python

Solution 3 - Python

Firestore Getting documents id from collection

asp.net core 2.0 - multiple projects solution docker file

Attributions