How to calculate the number of occurrences of a given character in each row of a column of strings?

Regex, R, Dataframe

Regex Problem Overview


I have a data.frame in which certain variables contain a text string. I wish to count the number of occurrences of a given character in each individual string.

Example:

q.data<-data.frame(number=1:3, string=c("greatgreat", "magic", "not"))

I wish to create a new column for q.data with the number of occurrences of "a" in string (i.e. c(2,1,0)).

The only convoluted approach I have managed is:

string.counter <- function(strings, pattern) {
  counts <- integer(length(strings))
  for (i in seq_along(strings)) {
    # gregexpr reports a match.length of -1 when there is no match
    match.lengths <- attr(gregexpr(pattern, strings[i])[[1]], "match.length")
    counts[i] <- sum(match.lengths > 0)
  }
  counts
}

string.counter(strings=q.data$string, pattern="a")

 number     string number.of.a
1      1 greatgreat           2
2      2      magic           1
3      3        not           0

Regex Solutions


Solution 1 - Regex

The stringr package provides the str_count function which seems to do what you're interested in

# Load your example data
q.data <- data.frame(number = 1:3, string = c("greatgreat", "magic", "not"), stringsAsFactors = FALSE)
library(stringr)

# Count the number of 'a's in each element of string
q.data$number.of.a <- str_count(q.data$string, "a")
q.data
#  number     string number.of.a
#1      1 greatgreat           2
#2      2      magic           1
#3      3        not           0

Solution 2 - Regex

If you don't want to leave base R, here's a fairly succinct and expressive possibility:

x <- q.data$string
lengths(regmatches(x, gregexpr("a", x)))
# [1] 2 1 0
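
If the character to count is a regex metacharacter such as ".", the pattern above would match every character. A minimal sketch of one way around that, assuming a literal match is wanted: pass fixed = TRUE to gregexpr. The helper name count_fixed is hypothetical and not part of the original answer.

```r
# Hypothetical helper (not from the answer above): count literal
# occurrences of `pattern` in each element of `x`, base R only.
count_fixed <- function(x, pattern) {
  x <- as.character(x)  # in case the column is a factor
  # fixed = TRUE treats `pattern` as a literal string, not a regex
  lengths(regmatches(x, gregexpr(pattern, x, fixed = TRUE)))
}

count_fixed(c("a.b.c", "no dots", "..."), ".")
# [1] 2 0 3
count_fixed(c("greatgreat", "magic", "not"), "a")
# [1] 2 1 0
```

regmatches returns an empty character vector for elements with no match, which is why rows without the character come out as 0.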

Solution 3 - Regex

nchar(as.character(q.data$string)) -nchar( gsub("a", "", q.data$string))
[1] 2 1 0

Notice that I coerce the factor variable to character, before passing to nchar. The regex functions appear to do that internally.

Here are benchmark results (with the test scaled up to 3000 rows):

 q.data<-q.data[rep(1:NROW(q.data), 1000),]
 str(q.data)
'data.frame':	3000 obs. of  3 variables:
 $ number     : int  1 2 3 1 2 3 1 2 3 1 ...
 $ string     : Factor w/ 3 levels "greatgreat","magic",..: 1 2 3 1 2 3 1 2 3 1 ...
 $ number.of.a: int  2 1 0 2 1 0 2 1 0 2 ...

 benchmark( Dason = { q.data$number.of.a <- str_count(as.character(q.data$string), "a") },
 Tim = {resT <- sapply(as.character(q.data$string), function(x, letter = "a"){
                            sum(unlist(strsplit(x, split = "")) == letter) }) }, 
 
 DWin = {resW <- nchar(as.character(q.data$string)) -nchar( gsub("a", "", q.data$string))},
 Josh = {x <- sapply(regmatches(q.data$string, gregexpr("g",q.data$string )), length)}, replications=100)
#-----------------------
   test replications elapsed  relative user.self sys.self user.child sys.child
1 Dason          100   4.173  9.959427     2.985    1.204          0         0
3  DWin          100   0.419  1.000000     0.417    0.003          0         0
4  Josh          100  18.635 44.474940    17.883    0.827          0         0
2   Tim          100   3.705  8.842482     3.646    0.072          0         0

Solution 4 - Regex

Another good option, using charToRaw:

sum(charToRaw("abc.d.aa") == charToRaw('.'))
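
charToRaw accepts a single string, so the expression above counts within one value only. A rough sketch of how it might be applied to a whole column, assuming the character to count is a single byte (plain ASCII); the helper name count_char_raw is hypothetical.

```r
# Hypothetical helper: count a single-byte character in each string of `x`.
count_char_raw <- function(x, char) {
  target <- charToRaw(char)
  stopifnot(length(target) == 1L)  # the comparison is only valid for one byte
  vapply(as.character(x),
         function(s) sum(charToRaw(s) == target),
         integer(1), USE.NAMES = FALSE)
}

count_char_raw(c("greatgreat", "magic", "not"), "a")
# [1] 2 1 0
```

Multi-byte characters (for example accented letters in UTF-8) would need a different approach, since the comparison here is byte by byte.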

Solution 5 - Regex

The stringi package provides the functions stri_count and stri_count_fixed which are very fast.

stringi::stri_count(q.data$string, fixed = "a")
# [1] 2 1 0

benchmark

Compared with the fastest approach from @42-'s answer and with the equivalent function from the stringr package, for a vector with 30,000 elements:

library(microbenchmark)
library(stringr)  # provides str_count
library(ggplot2)  # provides the autoplot generic used below

benchmark <- microbenchmark(
  stringi = stringi::stri_count(test.data$string, fixed = "a"),
  baseR = nchar(test.data$string) - nchar(gsub("a", "", test.data$string, fixed = TRUE)),
  stringr = str_count(test.data$string, "a")
)

autoplot(benchmark)

data

q.data <- data.frame(number=1:3, string=c("greatgreat", "magic", "not"), stringsAsFactors = FALSE)
test.data <- q.data[rep(1:NROW(q.data), 10000),]


Solution 6 - Regex

A variation of https://stackoverflow.com/a/12430764/589165 is

> nchar(gsub("[^a]", "", q.data$string))
[1] 2 1 0

Solution 7 - Regex

I'm sure someone can do better, but this works:

sapply(as.character(q.data$string), function(x, letter = "a"){
  sum(unlist(strsplit(x, split = "")) == letter)
})
greatgreat      magic        not 
     2          1          0 

or in a function:

countLetter <- function(charvec, letter){
  sapply(charvec, function(x, letter){
    sum(unlist(strsplit(x, split = "")) == letter)
  }, letter = letter)
}
countLetter(as.character(q.data$string),"a")

Solution 8 - Regex

You could just use string division

require(roperators)
my_strings <- c('apple', 'banana', 'pear', 'melon')
my_strings %s/% 'a'

Which will give you 1, 3, 1, 0. You can also use string division with regular expressions and whole words.

Solution 9 - Regex

The question below has been redirected here, but this page doesn't seem to answer Farah El's question directly. https://stackoverflow.com/questions/55233227/how-to-find-number-1s-in-101-in-r

So, I'll write an answer here, just in case.

library(magrittr)
library(stringr)  # provides str_count
n %>% # n is a number you'd like to inspect
  as.character() %>%
  str_count(pattern = "1")
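
For comparison, a minimal base R sketch of the same idea, using the nchar/gsub difference from Solution 3. The values in n are made up for illustration.

```r
n <- c(101, 1111, 20)          # example values, assumed for illustration
digits <- as.character(n)      # "101" "1111" "20"
# Characters removed by deleting every "1" = number of 1s in each value
ones <- nchar(digits) - nchar(gsub("1", "", digits, fixed = TRUE))
ones
# [1] 2 4 0
```

Note that as.character may switch to scientific notation for large doubles (100000 becomes "1e+05"), so storing the values as integers or as strings is safer in that case.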

https://stackoverflow.com/users/8931457/farah-el

Solution 10 - Regex

Yet another base R option could be:

lengths(lapply(q.data$string, grepRaw, pattern = "a", all = TRUE, fixed = TRUE))

[1] 2 1 0

Solution 11 - Regex

Another base R answer, not as good as those by @IRTFM and @Finn (or as those using stringi/stringr), but better than the others:

sapply(strsplit(q.data$string, split=""), function(x) sum(x %in% "a"))

q.data<-data.frame(number=1:3, string=c("greatgreat", "magic", "not"))
q.data<-q.data[rep(1:NROW(q.data), 3000),]
library(rbenchmark)
library(stringr)
library(stringi)

benchmark( Dason = {str_count(q.data$string, "a") },
           Tim = {sapply(q.data$string, function(x, letter = "a"){sum(unlist(strsplit(x, split = "")) == letter) }) },
           DWin = {nchar(q.data$string) -nchar( gsub("a", "", q.data$string, fixed=TRUE))}, 
           Markus = {stringi::stri_count(q.data$string, fixed = "a")},
           Finn={nchar(gsub("[^a]", "", q.data$string))},
           tmmfmnk={lengths(lapply(q.data$string, grepRaw, pattern = "a", all = TRUE, fixed = TRUE))},
           Josh1 = {sapply(regmatches(q.data$string, gregexpr("g",q.data$string )), length)}, 
           Josh2 = {lengths(regmatches(q.data$string, gregexpr("g",q.data$string )))}, 
           Iago = {sapply(strsplit(q.data$string, split=""), function(x) sum(x %in% "a"))}, 
           replications =100, order = "elapsed")

     test replications elapsed relative user.self sys.self user.child sys.child
4  Markus          100   0.076    1.000     0.076    0.000          0         0
3    DWin          100   0.277    3.645     0.277    0.000          0         0
1   Dason          100   0.290    3.816     0.291    0.000          0         0
5    Finn          100   1.057   13.908     1.057    0.000          0         0
9    Iago          100   3.214   42.289     3.215    0.000          0         0
2     Tim          100   6.000   78.947     6.002    0.000          0         0
6 tmmfmnk          100   6.345   83.487     5.760    0.003          0         0
8   Josh2          100  12.542  165.026    12.545    0.000          0         0
7   Josh1          100  13.288  174.842    13.268    0.028          0         0

Solution 12 - Regex

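A concise base R way, IMHO, is to count the positive match positions returned by gregexpr. Taking lengths() of the gregexpr result directly is not enough: gregexpr returns -1 for a string with no match, so such rows would be counted as 1 rather than 0.

q.data$number.of.a <- sapply(gregexpr('a', q.data$string), function(m) sum(m > 0))

#  number     string number.of.a
#1      1 greatgreat           2
#2      2      magic           1
#3      3        not           0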

Solution 13 - Regex

The next expression does the job and also works for symbols, not only letters.

The expression works as follows:

1. lapply iterates over the values in column 2 of the data frame q.data ("lapply(q.data[,2],").

2. To each value it applies the function "function(x){sum('a' == strsplit(as.character(x), '')[[1]])}". The function takes the value (x), converts it to character (in case it is a factor, for example), and splits the string into individual characters ("strsplit(as.character(x), '')"). The result is a vector holding each character of the string.

3. Each element of that vector is compared with the character to be counted, in this case "a" (" 'a' == "). This returns a logical vector such as "c(TRUE,FALSE,TRUE,....)", with TRUE wherever the element matches the character.

4. The number of times 'a' appears in the row is the sum of the TRUE values in that vector ("sum(....)").

5. Finally, "unlist" flattens the list returned by "lapply" so that it can be assigned to a new column of the data frame ("q.data$number.of.a<-unlist(....").

q.data$number.of.a<-unlist(lapply(q.data[,2],function(x){sum('a' == strsplit(as.character(x), '')[[1]])}))

>q.data

#  number     string number.of.a
#1      1 greatgreat           2
#2      2      magic           1
#3      3        not           0
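
The steps described above can be traced on a single value. This is only an illustrative walk-through; the variable names chars and is_a are introduced here and do not appear in the answer.

```r
x <- "greatgreat"

# Split the string into individual characters
chars <- strsplit(as.character(x), "")[[1]]
chars
# [1] "g" "r" "e" "a" "t" "g" "r" "e" "a" "t"

# Compare every character with the one to be counted
is_a <- "a" == chars
is_a
# [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE

# TRUE counts as 1 and FALSE as 0, so the sum is the number of matches
sum(is_a)
# [1] 2
```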

Solution 14 - Regex

s <- "aababacababaaathhhhhslsls jsjsjjsaa ghhaalll"
p <- "a"
s2 <- gsub(p,"",s)
numOcc <- nchar(s) - nchar(s2)

This may not be the most efficient approach, but it serves my purpose.
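
The same length-difference idea can be wrapped so that it works on a whole column and on patterns longer than one character. A sketch under stated assumptions: the pattern is matched literally (fixed = TRUE), matches do not overlap, and the function name count_pattern is hypothetical.

```r
# Hypothetical helper: count literal, non-overlapping occurrences of `p`
# in each element of `s` from the change in string length.
count_pattern <- function(s, p) {
  s <- as.character(s)
  removed <- nchar(s) - nchar(gsub(p, "", s, fixed = TRUE))
  removed / nchar(p)  # each match removes nchar(p) characters
}

count_pattern(c("greatgreat", "magic", "not"), "a")
# [1] 2 1 0
count_pattern("banana", "an")
# [1] 2
```

Dividing by nchar(p) is what makes multi-character patterns work; for a single character the divisor is 1 and the result matches the code above.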

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Etienne Low-Décarie | View Question on Stackoverflow
Solution 1 - Regex | Dason | View Answer on Stackoverflow
Solution 2 - Regex | Josh O'Brien | View Answer on Stackoverflow
Solution 3 - Regex | IRTFM | View Answer on Stackoverflow
Solution 4 - Regex | Zhang Tao | View Answer on Stackoverflow
Solution 5 - Regex | markus | View Answer on Stackoverflow
Solution 6 - Regex | Finn Årup Nielsen | View Answer on Stackoverflow
Solution 7 - Regex | tim riffe | View Answer on Stackoverflow
Solution 8 - Regex | Benbob | View Answer on Stackoverflow
Solution 9 - Regex | Yosher | View Answer on Stackoverflow
Solution 10 - Regex | tmfmnk | View Answer on Stackoverflow
Solution 11 - Regex | iago | View Answer on Stackoverflow
Solution 12 - Regex | Giovanni Campagnoli | View Answer on Stackoverflow
Solution 13 - Regex | bacnqn | View Answer on Stackoverflow
Solution 14 - Regex | Amarjeet | View Answer on Stackoverflow