Convert data.frame columns from factors to characters

R Problem Overview


I have a data frame. Let's call him bob:

> head(bob)
                 phenotype                         exclusion
GSM399350 3- 4- 8- 25- 44+ 11b- 11c- 19- NK1.1- Gr1- TER119-
GSM399351 3- 4- 8- 25- 44+ 11b- 11c- 19- NK1.1- Gr1- TER119-
GSM399352 3- 4- 8- 25- 44+ 11b- 11c- 19- NK1.1- Gr1- TER119-
GSM399353 3- 4- 8- 25+ 44+ 11b- 11c- 19- NK1.1- Gr1- TER119-
GSM399354 3- 4- 8- 25+ 44+ 11b- 11c- 19- NK1.1- Gr1- TER119-
GSM399355 3- 4- 8- 25+ 44+ 11b- 11c- 19- NK1.1- Gr1- TER119-

I'd like to concatenate the rows of this data frame (this will be another question). But look:

> class(bob$phenotype)
[1] "factor"

Bob's columns are factors. So, for example:

> as.character(head(bob))
[1] "c(3, 3, 3, 6, 6, 6)"       "c(3, 3, 3, 3, 3, 3)"      
[3] "c(29, 29, 29, 30, 30, 30)"

I don't begin to understand this, but I guess these are indices into the levels of the factors of the columns (of the court of king caractacus) of bob? Not what I need.

Strangely I can go through the columns of bob by hand, and do

bob$phenotype <- as.character(bob$phenotype)

which works fine. And, after some typing, I can get a data.frame whose columns are characters rather than factors. So my question is: how can I do this automatically? How do I convert a data.frame with factor columns into a data.frame with character columns without having to manually go through each column?

Bonus question: why does the manual approach work?

R Solutions


Solution 1 - R

Just following on Matt and Dirk. If you want to recreate your existing data frame without changing the global option, you can recreate it with an apply statement:

bob <- data.frame(lapply(bob, as.character), stringsAsFactors=FALSE)

This converts every column to class "character". If you only want to convert the factor columns, see Marek's solution below.

As @hadley points out, the following is more concise.

bob[] <- lapply(bob, as.character)

In both cases, lapply returns a list. In the second case, however, assigning into bob[] replaces the columns of the existing object rather than the object itself, so bob keeps its data.frame class (and its row names). That removes the need to convert the list back with as.data.frame and the argument stringsAsFactors = FALSE.
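As a minimal sketch of that difference, the snippet below uses a small made-up stand-in for bob (the values and the names as_list and converted are illustrative assumptions, not the asker's real data):

```r
# Toy stand-in for bob; the values are illustrative only.
bob <- data.frame(
  phenotype = factor(c("3- 4- 8- 25- 44+", "3- 4- 8- 25+ 44+")),
  exclusion = factor(c("11b- 11c- 19-", "11b- 11c- 19-")),
  row.names = c("GSM399350", "GSM399353")
)

# lapply() on its own returns a plain list of character vectors.
as_list <- lapply(bob, as.character)
class(as_list)

# Assigning into converted[] replaces the columns in place, so the
# object stays a data.frame and keeps its row names.
converted <- bob
converted[] <- lapply(converted, as.character)
class(converted)
sapply(converted, class)
rownames(converted)
```

The first class() call should report a list and the second a data.frame, which is why the [] form needs no conversion step afterwards.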

Solution 2 - R

To replace only factors:

i <- sapply(bob, is.factor)
bob[i] <- lapply(bob[i], as.character)
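To see that only the factor columns change, here is a small sketch on a hypothetical mixed data frame (df and its column names are made up for illustration):

```r
# Hypothetical mixed data frame: one factor, one numeric, one logical column.
df <- data.frame(
  id    = factor(c("a", "b", "c")),
  score = c(1.5, 2.5, 3.5),
  flag  = c(TRUE, FALSE, TRUE)
)

i <- sapply(df, is.factor)            # logical mask of the factor columns
df[i] <- lapply(df[i], as.character)  # convert only those columns

sapply(df, class)
```

The numeric and logical columns keep their classes; only id becomes character.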

Version 0.5.0 of the dplyr package introduced the function mutate_if:

library(dplyr)
bob %>% mutate_if(is.factor, as.character) -> bob

...and in version 1.0.0 it was superseded by across:

library(dplyr)
bob %>% mutate(across(where(is.factor), as.character)) -> bob

The purrr package from RStudio offers another alternative:

library(purrr)
bob %>% modify_if(is.factor, as.character) -> bob

Solution 3 - R

The global option

    stringsAsFactors: The default setting for arguments of data.frame and read.table.

may be something you want to set to FALSE in your startup files (e.g. ~/.Rprofile). Please see help(options). Note that since R 4.0.0 the default is already FALSE.

Solution 4 - R

If you understand how factors are stored, you can avoid using apply-based functions to accomplish this. That is not to say the apply solutions don't work well.

Factors are structured as numeric indices tied to a list of 'levels'. This can be seen if you convert a factor to numeric. So:

> fact <- as.factor(c("a","b","a","d"))
> fact
[1] a b a d
Levels: a b d

> as.numeric(fact)
[1] 1 2 1 3

The numbers returned in the last line are indices into the levels of the factor.

> levels(fact)
[1] "a" "b" "d"

Notice that levels() returns a character vector. You can use this fact to easily and compactly convert factors to strings or numerics like this:

> fact_character <- levels(fact)[as.numeric(fact)]
> fact_character
[1] "a" "b" "a" "d"

This also works for numeric values, provided you wrap your expression in as.numeric().

> num_fact <- factor(c(1,2,3,6,5,4))
> num_fact
[1] 1 2 3 6 5 4
Levels: 1 2 3 4 5 6
> num_num <- as.numeric(levels(num_fact)[as.numeric(num_fact)])
> num_num
[1] 1 2 3 6 5 4
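The example above happens to use values that coincide with their level indices, which can hide the pitfall. A small sketch with illustrative values (the names f, codes, and values are made up here) shows the difference between the integer codes and the original numbers:

```r
# Illustrative factor whose labels differ from their integer codes.
f <- factor(c(10, 20, 10, 50))

codes  <- as.numeric(f)             # integer codes: 1 2 1 3
values <- as.numeric(levels(f))[f]  # original values: 10 20 10 50

codes
values
as.numeric(as.character(f))         # same values via as.character()
```

Indexing the converted levels with the factor itself is the same technique as above, just converting the (few) levels before indexing rather than after.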

Solution 5 - R

If you want a new data frame bobc where every factor vector in bobf is converted to a character vector, try this:

bobc <- rapply(bobf, as.character, classes="factor", how="replace")

If you then want to convert it back, you can create a logical vector marking which columns were factors, and use it to selectively apply factor. Indexing with [f] rather than [,f] keeps a data frame even when only one column is selected:

f <- sapply(bobf, is.factor)
bobc[f] <- lapply(bobc[f], factor)

Solution 6 - R

I typically make this function a part of all my projects. Quick and easy.

unfactorize <- function(df) {
  # Convert each factor column (including ordered factors) to character.
  for (i in which(sapply(df, is.factor))) df[[i]] <- as.character(df[[i]])
  return(df)
}
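One detail worth checking in helpers like this is how factor columns are detected. Ordered factors carry two classes, so a test based on is.factor() catches them where a comparison of class() against a single string may not. A minimal sketch, using a made-up data frame (df, size, n, and is_fac are illustrative names):

```r
# Illustrative data frame with an ordered factor column.
df <- data.frame(
  size = factor(c("small", "large", "medium"),
                levels = c("small", "medium", "large"), ordered = TRUE),
  n    = 1:3
)

class(df$size)       # two classes: "ordered" "factor"
is.factor(df$size)   # TRUE

# vapply() with is.factor yields exactly one logical per column.
is_fac <- vapply(df, is.factor, logical(1))
df[is_fac] <- lapply(df[is_fac], as.character)
sapply(df, class)
```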

Solution 7 - R

Another way is to convert it using apply:

bob2 <- apply(bob,2,as.character)

A better one, since the previous result is of class 'matrix' rather than a data frame:

bob2 <- as.data.frame(as.matrix(bob), stringsAsFactors = FALSE)
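Going through a matrix converts every column, not just the factors, so it is best suited to data frames like bob where all columns should end up as character. A small sketch on a hypothetical data frame with a numeric column (df, via_matrix, and only_factors are made-up names) illustrates this:

```r
# Illustrative data frame with one factor and one numeric column.
df <- data.frame(
  label = factor(c("x", "y", "z")),
  value = c(1.5, 2.5, 3.5)
)

# The matrix route turns both columns into character.
via_matrix <- as.data.frame(as.matrix(df), stringsAsFactors = FALSE)
sapply(via_matrix, class)

# Converting only the factor columns leaves `value` numeric.
only_factors <- df
fac <- sapply(only_factors, is.factor)
only_factors[fac] <- lapply(only_factors[fac], as.character)
sapply(only_factors, class)
```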

Solution 8 - R

Update: the following does not work. I thought it would, but the stringsAsFactors argument only affects character vectors; it leaves existing factors alone.

What I originally suggested:

bob2 <- data.frame(bob, stringsAsFactors = FALSE)

Generally speaking, whenever you're having problems with factors that should be characters, there's a stringsAsFactors setting somewhere to help you (including a global setting).
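A quick way to check this behaviour is the sketch below, which uses a one-column toy version of bob (the values and the name fresh are illustrative assumptions):

```r
# Toy stand-in for bob with a single factor column.
bob <- data.frame(phenotype = factor(c("25-", "25+")))

# Re-wrapping an existing factor column does not convert it.
bob2 <- data.frame(bob, stringsAsFactors = FALSE)
class(bob2$phenotype)    # still "factor"

# The argument matters when the input is a character vector.
fresh <- data.frame(phenotype = c("25-", "25+"), stringsAsFactors = FALSE)
class(fresh$phenotype)   # "character"
```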

Solution 9 - R

Or you can try transform:

newbob <- transform(bob, phenotype = as.character(phenotype))

Just be sure to list every factor you'd like to convert to character.

Or you can do something like this and kill all the pests with one blow:

newbob_char <- as.data.frame(lapply(bob[sapply(bob, is.factor)], as.character), stringsAsFactors = FALSE)
newbob_rest <- bob[!(sapply(bob, is.factor))]
newbob <- cbind(newbob_char, newbob_rest)

It's not a good idea to cram everything into one expression like this; I could do the sapply part separately (actually, it's much easier that way), but you get the point... I haven't checked the code, 'cause I'm not at home, so I hope it works! =)

This approach, however, has a downside: the converted columns end up first, so you must reorder the columns afterwards. With transform the column order is kept, but at the cost of "pedestrian-style" code writing, since you have to name each column...

So there... =)

Solution 10 - R

When you create or read your data frame, pass stringsAsFactors = FALSE to avoid the problem in the first place.
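A minimal sketch of what that looks like, both when building a data frame directly and when reading delimited text (the sample values and the names df, csv_text, and from_csv are made up for illustration):

```r
# Building a data frame directly: strings stay character.
df <- data.frame(
  phenotype = c("3- 4- 8- 25- 44+", "3- 4- 8- 25+ 44+"),
  stringsAsFactors = FALSE
)
class(df$phenotype)

# Reading delimited text: the same argument is accepted by read.csv().
csv_text <- "sample,phenotype\nGSM399350,25-\nGSM399353,25+"
from_csv <- read.csv(text = csv_text, stringsAsFactors = FALSE)
sapply(from_csv, class)
```

This only helps at creation time; columns that are already factors still need one of the conversions shown in the other solutions.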

Solution 11 - R

If you use the data.table package for your data frame operations, the problem does not arise, because data.table does not convert strings to factors.

library(data.table)
dt = data.table(col1 = c("a","b","c"), col2 = 1:3)
sapply(dt, class)
#       col1        col2 
#"character"   "integer" 

If you already have factor columns in your dataset and you want to convert them to character, you can do the following.

library(data.table)
dt = data.table(col1 = factor(c("a","b","c")), col2 = 1:3)
sapply(dt, class)
#     col1      col2 
# "factor" "integer" 
upd.cols = sapply(dt, is.factor)
dt[, names(dt)[upd.cols] := lapply(.SD, as.character), .SDcols = upd.cols]
sapply(dt, class)
#       col1        col2 
#"character"   "integer" 

Solution 12 - R

This works for me. I finally figured out a one-liner:

df <- as.data.frame(lapply(df, function(y) if (is.factor(y)) as.character(y) else y), stringsAsFactors = FALSE)

Solution 13 - R

The new function "across" was introduced in dplyr version 1.0.0. It supersedes the scoped variants (_if, _at, _all). See the official dplyr documentation for details.

library(dplyr)
bob <- bob %>% 
       mutate(across(where(is.factor), as.character))

Solution 14 - R

This function does the trick

df <- stacomirtools::killfactor(df)

Solution 15 - R

You should use convert in hablar which gives readable syntax compatible with tidyverse pipes:

library(dplyr)
library(hablar)

df <- tibble(a = factor(c(1, 2, 3, 4)),
             b = factor(c(5, 6, 7, 8)))

df %>% convert(chr(a:b))

which gives you:

  a     b    
  <chr> <chr>
1 1     5    
2 2     6    
3 3     7    
4 4     8   

Solution 16 - R

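Maybe a newer option? Note that group_by_if also groups the result by the converted columns, so ungroup afterwards:

library("tidyverse")

bob <- bob %>% group_by_if(is.factor, as.character) %>% ungroup()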

Solution 17 - R

With the dplyr package loaded, use

bob <- bob %>% mutate_at("phenotype", as.character)

if you only want to change the phenotype column specifically.

Solution 18 - R

This works by converting every column to character and then converting the columns that hold only numbers back to numeric:

makenumcols <- function(df) {
  df <- as.data.frame(df)
  # Convert every column to character first.
  df[] <- lapply(df, as.character)
  # A column counts as numeric if all of its non-NA values parse as numbers.
  cond <- apply(df, 2, function(x) {
    x <- x[!is.na(x)]
    all(suppressWarnings(!is.na(as.numeric(x))))
  })
  numeric_cols <- names(df)[cond]
  df[numeric_cols] <- lapply(df[numeric_cols], as.numeric)
  return(df)
}

Adapted from: https://stackoverflow.com/questions/42735346/get-column-types-of-excel-sheet-automatically

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Mike Dewar | View Question on Stackoverflow
Solution 1 - R | Shane | View Answer on Stackoverflow
Solution 2 - R | Marek | View Answer on Stackoverflow
Solution 3 - R | Dirk Eddelbuettel | View Answer on Stackoverflow
Solution 4 - R | Kikapp | View Answer on Stackoverflow
Solution 5 - R | scentoni | View Answer on Stackoverflow
Solution 6 - R | Omar Wagih | View Answer on Stackoverflow
Solution 7 - R | George Dontas | View Answer on Stackoverflow
Solution 8 - R | Matt Parker | View Answer on Stackoverflow
Solution 9 - R | aL3xa | View Answer on Stackoverflow
Solution 10 - R | user5462317 | View Answer on Stackoverflow
Solution 11 - R | jangorecki | View Answer on Stackoverflow
Solution 12 - R | user1617979 | View Answer on Stackoverflow
Solution 13 - R | radhikesh93 | View Answer on Stackoverflow
Solution 14 - R | Cedric | View Answer on Stackoverflow
Solution 15 - R | davsjob | View Answer on Stackoverflow
Solution 16 - R | rachelette | View Answer on Stackoverflow
Solution 17 - R | nexonvantec | View Answer on Stackoverflow
Solution 18 - R | Ferroao | View Answer on Stackoverflow