Count the number of all words in a string


R Problem Overview


Is there a function to count the number of words in a string? For example:

str1 <- "How many words are in this sentence"

to return a result of 7.

R Solutions


Solution 1 - R

Use the regular expression symbol \\W to match non-word characters, with + to indicate one or more in a row, along with gregexpr to find all matches in a string. The number of words is the number of word separators plus 1.

lengths(gregexpr("\\W+", str1)) + 1

This will fail with blank strings at the beginning or end of the character vector, or when a "word" doesn't fit \\W's notion of non-word characters. One could work with other regular expressions (\\S+, [[:alpha:]], etc.), but there will always be edge cases with a regex approach. It is likely more efficient than strsplit solutions, which allocate memory for each word. Regular expressions are described in ?regex.

Update: As noted in the comments and in a different answer by @Andri, the approach fails with zero- and one-word strings, and with trailing punctuation:

str1 = c("", "x", "x y", "x y!" , "x y! z")
lengths(gregexpr("[A-z]\\W+", str1)) + 1L
# [1] 2 2 2 3 3

Many of the other answers also fail in these or similar (e.g., multiple spaces) cases. I think the caveat about the 'notion of one word' in the original answer covers problems with punctuation (solution: choose a different regular expression, e.g., [[:space:]]+), but the zero- and one-word cases are a problem; @Andri's solution fails to distinguish between zero and one words. So, taking a 'positive' approach to finding words, one might use:

sapply(gregexpr("[[:alpha:]]+", str1), function(x) sum(x > 0))
# [1] 0 1 2 2 3

Again the regular expression might be refined for different notions of 'word'.
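
One way to make that refinement explicit is to pass the pattern as an argument. The following is a minimal base-R sketch; the helper name count_matches and the second pattern are illustrative assumptions, not part of the original answer:

```r
# Hypothetical helper: count matches of `pattern` in each element of `x`.
# gregexpr() returns -1 when there is no match, so only positions > 0 count.
count_matches <- function(x, pattern = "[[:alpha:]]+") {
  vapply(gregexpr(pattern, x), function(m) sum(m > 0), integer(1))
}

str1 <- c("", "x", "x y", "x y!", "x y! z")
count_matches(str1)
# [1] 0 1 2 2 3

# A different notion of 'word': letters, digits and apostrophes
count_matches("Don't stop 2 go", "[[:alnum:]']+")
# [1] 4
```

Keeping the pattern as a parameter means the same function can be reused when the definition of a word changes.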

I like the use of gregexpr() because it's memory efficient. An alternative using strsplit() (like @user813966, but with a regular expression to delimit words) and making use of the original notion of delimiting words is

lengths(strsplit(str1, "\\W+"))
# [1] 0 1 2 2 3

This needs to allocate new memory for each word that is created, and for the intermediate list-of-words. This could be relatively expensive when the data is 'big', but probably it's effective and understandable for most purposes.

Solution 2 - R

The simplest way would be:

require(stringr)
str_count("one,   two three 4,,,, 5 6", "\\S+")

... counting all sequences of non-space characters (\\S+).

But what about a little function that lets us also decide which kind of words we would like to count and which works on whole vectors as well?

library(stringr)
# pseudo = TRUE counts any run of non-space characters;
# pseudo = FALSE counts runs of letters only
nwords <- function(string, pseudo = FALSE) {
  pattern <- if (pseudo) "\\S+" else "[[:alpha:]]+"
  str_count(string, pattern)
}

nwords("one,   two three 4,,,, 5 6")
# 3

nwords("one,   two three 4,,,, 5 6", pseudo = TRUE)
# 6

Solution 3 - R

I use the str_count function from the stringr library with the escape sequence \w that represents:

> any ‘word’ character (letter, digit or underscore in the current locale: in UTF-8 mode only ASCII letters and digits are considered)

Example:

> str_count("How many words are in this sentence", '\\w+')
[1] 7

Of all other 9 answers that I was able to test, only two (by Vincent Zoonekynd, and by petermeissner) worked for all inputs presented here so far, but they also require stringr.

But only this solution works with all inputs presented so far, plus inputs such as "foo+bar+baz~spam+eggs" or "Combien de mots sont dans cette phrase ?".

Benchmark:

library(stringr)

questions <-
  c(
    "", "x", "x y", "x y!", "x y! z",
    "foo+bar+baz~spam+eggs",
    "one,   two three 4,,,, 5 6",
    "How many words are in this sentence",
    "How  many words    are in this   sentence",
    "Combien de mots sont dans cette phrase ?",
    "
    Day after day, day after day,
    We stuck, nor breath nor motion;
    "
  )

answers <- c(0, 1, 2, 2, 3, 5, 6, 7, 7, 7, 12)

score <- function(f) sum(unlist(lapply(questions, f)) == answers)

funs <-
  c(
    function(s) sapply(gregexpr("\\W+", s), length) + 1,
    function(s) sapply(gregexpr("[[:alpha:]]+", s), function(x) sum(x > 0)),
    function(s) vapply(strsplit(s, "\\W+"), length, integer(1)),
    function(s) length(strsplit(gsub(' {2,}', ' ', s), ' ')[[1]]),
    function(s) length(str_match_all(s, "\\S+")[[1]]),
    function(s) str_count(s, "\\S+"),
    function(s) sapply(gregexpr("\\W+", s), function(x) sum(x > 0)) + 1,
    function(s) length(unlist(strsplit(s," "))),
    function(s) sapply(strsplit(s, " "), length),
    function(s) str_count(s, '\\w+')
  )

unlist(lapply(funs, score))

Output (11 is the maximum possible score):

6 10 10  8  9  9  7  6  6 11

Solution 4 - R

You can use the strsplit and sapply functions:

sapply(strsplit(str1, " "), length)

Solution 5 - R

str2 <- gsub(' {2,}',' ',str1)
length(strsplit(str2,' ')[[1]])

The gsub(' {2,}',' ',str1) makes sure all words are separated by one space only, by replacing all occurrences of two or more spaces with one space.

The strsplit(str2,' ') splits the sentence at every space and returns the result in a list. The [[1]] grabs the vector of words out of that list. The length counts how many words there are.

> str1 <- "How many words are in this     sentence"
> str2 <- gsub(' {2,}',' ',str1)
> str2
[1] "How many words are in this sentence"
> strsplit(str2,' ')
[[1]]
[1] "How"      "many"     "words"    "are"      "in"       "this"     "sentence"
> strsplit(str2,' ')[[1]]
[1] "How"      "many"     "words"    "are"      "in"       "this"     "sentence"
> length(strsplit(str2,' ')[[1]])
[1] 7
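
The gsub/strsplit pair above still counts an empty first element when the string starts with a space, because strsplit keeps a leading "". A minimal base-R sketch that trims first, assuming words are whitespace-delimited (the helper name count_ws_words is hypothetical):

```r
# Hypothetical helper: count whitespace-delimited words in each element of x.
count_ws_words <- function(x) {
  # trimws() drops leading/trailing whitespace; "\\s+" treats any run of
  # whitespace (spaces, tabs, newlines) as a single separator
  lengths(strsplit(trimws(x), "\\s+"))
}

count_ws_words(c("How many words are in this     sentence", "  leading space", ""))
# [1] 7 2 0
```

Unlike the [[1]] version, this works on a whole character vector at once.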

Solution 6 - R

You can use str_match_all, with a regular expression that would identify your words. The following works with initial, final and duplicated spaces.

library(stringr)
s <-  "
  Day after day, day after day,
  We stuck, nor breath nor motion;
"
m <- str_match_all( s, "\\S+" )  # Sequences of non-spaces
length(m[[1]])

Solution 7 - R

Try this function from the stringi package:

> require(stringi)
> s <- c("Lorem ipsum dolor sit amet, consectetur adipisicing elit.",
+        "nibh augue, suscipit a, scelerisque sed, lacinia in, mi.",
+        "Cras vel lorem. Etiam pellentesque aliquet tellus.",
+        "")
> stri_stats_latex(s)
    CharsWord CharsCmdEnvir    CharsWhite         Words          Cmds        Envirs 
          133             0            30            24             0             0 

Solution 8 - R

You can use the wc function from the qdap package:

> str1 <- "How many words are in this sentence"
> wc(str1)
[1] 7

Solution 9 - R

You can remove double spaces and count the number of " " characters in the string, then add 1 to get the word count. This uses str_count from stringr and rm_white from qdapRegex:

str_count(rm_white(s), " ") +1

Solution 10 - R

Also from the stringi package, the straightforward function stri_count_words:

stringi::stri_count_words(str1)
#[1] 7

Solution 11 - R

Try this

length(unlist(strsplit(str1," ")))

Solution 12 - R

require(stringr)
str_count(x,"\\w+")

This handles double or triple spaces between words correctly.

All other answers have issues with more than one space between the words.

Solution 13 - R

The gregexpr solution that counts \\W+ separators does not give the correct result when there is just one word. You should not just count the elements in gregexpr's result (which is -1 if there were no matches) but count the elements > 0.

Ergo:

sapply(gregexpr("\\W+", str1), function(x) sum(x>0) ) + 1 
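
To see why counting elements is not enough, it helps to look at what gregexpr returns when nothing matches. A small base-R illustration (the variable name m is arbitrary):

```r
# A one-word string contains no run of non-word characters
m <- gregexpr("\\W+", "x")[[1]]

as.integer(m)  # -1: the "no match" marker, not a position
length(m)      # 1: so length() + 1 would report two words
sum(m > 0)     # 0: only real match positions are counted
```

With the + 1 applied, this version returns 1 for both "" and "x", which is the zero-versus-one-word ambiguity mentioned in Solution 1.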

Solution 14 - R

require(stringr)

Define a very simple function

str_words <- function(sentence) {
  
  str_count(sentence, " ") + 1

}

Check

str_words("This is a sentence with six words")

Solution 15 - R

You could use the stringr functions str_split() and boundary(), which recognize the boundaries of words while ignoring punctuation and any extra spaces:

library(stringr)
sapply(str_split("It's 12 o'clock already", boundary("word")), length)
#[1] 4
sapply(str_split("  It's  >12  o'clock already ?! ", boundary("word")), length)
#[1] 4

Solution 16 - R

Use nchar. If the vector of strings is called x:

(nchar(x) - nchar(gsub(' ','',x))) + 1

This finds the number of spaces, then adds one.
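
The space-counting idea overcounts when spaces repeat or sit at the ends of a string, and it reports 1 for an empty string. A minimal base-R sketch that normalises whitespace first, assuming words are whitespace-delimited (the helper name count_by_spaces is hypothetical):

```r
# Hypothetical helper: word count as (number of single spaces + 1),
# after trimming the ends and collapsing whitespace runs to one space.
count_by_spaces <- function(x) {
  x <- gsub("\\s+", " ", trimws(x))
  n_spaces <- nchar(x) - nchar(gsub(" ", "", x, fixed = TRUE))
  # An empty string has no words, so do not add 1 there
  ifelse(nzchar(x), n_spaces + 1L, 0L)
}

count_by_spaces(c("", "x", "a  b ", " a b c"))
# [1] 0 1 2 3
```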

Solution 17 - R

With the stringr package, one can also write a simple script that traverses a vector of strings, for example with a for loop.

Let's say df$text contains a vector of strings that we are interested in analysing. First, we add additional columns to the existing data frame df as below:

df$strings    = as.integer(NA)
df$characters = as.integer(NA)

Then we run a for-loop over the vector of strings as below:

for (i in seq_len(nrow(df)))   # seq_len() is safe when df has zero rows
{
   df$strings[i]    = str_count(df$text[i], '\\S+') # counts the space-separated strings (words)
   df$characters[i] = str_count(df$text[i])         # counts the characters, including spaces
}

The resulting columns, strings and characters, will contain the counts of words and characters, computed in one go for the whole vector of strings.
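
Because the counting functions are vectorised, the loop is not strictly required. A minimal base-R sketch of the same two columns without stringr, using a made-up data frame (the sample text values are assumptions for illustration):

```r
# Hypothetical example data; any data frame with a character column works
df <- data.frame(
  text = c("How many words are in this sentence", "one,   two three", ""),
  stringsAsFactors = FALSE
)

# regmatches() extracts every "\\S+" match; lengths() counts them per row
df$strings    <- lengths(regmatches(df$text, gregexpr("\\S+", df$text)))
# nchar() counts characters, including spaces
df$characters <- nchar(df$text)

df$strings
# [1] 7 3 0
```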

Solution 18 - R

I've found the following function and regex useful for word counts, especially in dealing with single vs. double hyphens. A single hyphen generally should not count as a word break (e.g., well-known, hi-fi), whereas a double hyphen is a punctuation delimiter that is not bounded by white space, such as for parenthetical remarks.

txt <- "Don't you think e-mail is one word--and not two!" # 10 words
words <- function(txt) {
  length(attributes(gregexpr("(\\w|\\w\\-\\w|\\w\\'\\w)+", txt)[[1]])$match.length)
}

words(txt) # 10 words

stringi is a useful package, but it over-counts words in this example because of the hyphen.

stringi::stri_count_words(txt) #11 words

Solution 19 - R

In Python (not R), there's a simple solution using split and len:

text = 'This is a test for counting words'

# split() with no arguments splits on runs of whitespace
result = len(text.split())

print("There are " + str(result) + " words.")

You can get more details at https://www.delftstack.com/howto/python/python-count-words-in-string/

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | John | View Question on Stackoverflow
Solution 1 - R | Martin Morgan | View Answer on Stackoverflow
Solution 2 - R | petermeissner | View Answer on Stackoverflow
Solution 3 - R | arekolek | View Answer on Stackoverflow
Solution 4 - R | AVSuresh | View Answer on Stackoverflow
Solution 5 - R | mathematical.coffee | View Answer on Stackoverflow
Solution 6 - R | Vincent Zoonekynd | View Answer on Stackoverflow
Solution 7 - R | bartektartanus | View Answer on Stackoverflow
Solution 8 - R | yuqian | View Answer on Stackoverflow
Solution 9 - R | Murali Menon | View Answer on Stackoverflow
Solution 10 - R | Sotos | View Answer on Stackoverflow
Solution 11 - R | Sangram | View Answer on Stackoverflow
Solution 12 - R | CJunk | View Answer on Stackoverflow
Solution 13 - R | Andri | View Answer on Stackoverflow
Solution 14 - R | JDie | View Answer on Stackoverflow
Solution 15 - R | Ver1s | View Answer on Stackoverflow
Solution 16 - R | Jonny | View Answer on Stackoverflow
Solution 17 - R | Sandy | View Answer on Stackoverflow
Solution 18 - R | Soren | View Answer on Stackoverflow
Solution 19 - R | Flavio Amorim | View Answer on Stackoverflow