Detecting syllables in a word

Nlp, Spell Checking, Hyphenation

Nlp Problem Overview

I need to find a fairly efficient way to detect syllables in a word. E.g.,

Invisible -> in-vi-sib-le

There are some syllabification rules that could be used:


*where V is a vowel and C is a consonant. E.g.,

Pronunciation (5 Pro-nun-ci-a-tion; CV-CVC-CV-V-CVC)

I've tried a few methods: regex (which helps only if you want to count syllables), hard-coded rule definitions (a brute-force approach that proved very inefficient), and finally a finite-state automaton (which did not produce anything useful).

The purpose of my application is to create a dictionary of all syllables in a given language. This dictionary will later be used for spell checking applications (using Bayesian classifiers) and text to speech synthesis.

I would appreciate it if someone could give me tips on an alternative way to solve this problem besides my previous approaches.

I work in Java, but any tip in C/C++, C#, Python, Perl... would work for me.

Nlp Solutions

Solution 1 - Nlp

Read about the TeX approach to this problem for the purposes of hyphenation. In particular, see Frank Liang's thesis Word Hy-phen-a-tion by Com-put-er. His algorithm is very accurate, and it includes a small exceptions dictionary for cases where the algorithm does not work.
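To make the pattern idea concrete, here is a minimal Python sketch of pattern-based hyphenation in the style of Liang's algorithm. Digits inside a pattern score the gap at that position, the highest score for each gap wins, and an odd score allows a break. The nine patterns below are a tiny illustrative set for one word, not a real pattern file, and the function names are my own. Note that hyphenation points are not always syllable boundaries.

```python
import re

def parse_pattern(pat):
    """Split a pattern such as 'hy3ph' into its letters and per-gap scores."""
    letters = re.sub(r"\d", "", pat)
    points = [0] * (len(letters) + 1)
    i = 0
    for ch in pat:
        if ch.isdigit():
            points[i] = int(ch)   # score for the gap before letter i
        else:
            i += 1
    return letters, points

def hyphenate(word, patterns):
    """Return the pieces of `word` split at gaps whose best score is odd."""
    parsed = dict(parse_pattern(p) for p in patterns)
    work = "." + word.lower() + "."          # dots mark the word edges
    scores = [0] * (len(work) + 1)           # scores[i] = gap before work[i]
    for start in range(len(work)):
        for end in range(start + 1, len(work) + 1):
            pts = parsed.get(work[start:end])
            if pts:
                for k, p in enumerate(pts):
                    scores[start + k] = max(scores[start + k], p)
    pieces, last = [], 0
    for i in range(1, len(word)):
        # The gap between word[i-1] and word[i] sits before work[i+1].
        if scores[i + 1] % 2 == 1:
            pieces.append(word[last:i])
            last = i
    pieces.append(word[last:])
    return pieces

# Illustrative toy pattern set (an assumption, far smaller than a real one).
PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n"]
print(hyphenate("hyphenation", PATTERNS))  # ['hy', 'phen', 'ation']
```

A production implementation would also enforce minimum fragment lengths at the word edges and consult the exceptions dictionary before applying patterns.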

Solution 2 - Nlp

I stumbled across this page looking for the same thing, and found a few implementations of the Liang paper here: or the successor:

That is, unless you're the type who enjoys reading a 60-page thesis instead of adapting freely available code for a non-unique problem. :)

Solution 3 - Nlp

Here is a solution using NLTK:

from nltk.corpus import cmudict
d = cmudict.dict()
def nsyl(word):
  return [len(list(y for y in x if y[-1].isdigit())) for x in d[word.lower()]] 
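The counting idea in this snippet can be tried without downloading the corpus. In the CMU dictionary each pronunciation is a list of ARPAbet phonemes, and vowel phonemes carry a trailing stress digit (0, 1 or 2), so counting the entries that end in a digit counts syllables. A minimal sketch, where the pronunciation is written by hand for illustration rather than read from cmudict:

```python
def count_syllables(phonemes):
    """Count vowel phonemes, which are the ones ending in a stress digit."""
    return sum(1 for p in phonemes if p[-1].isdigit())

# Hypothetical, hand-written pronunciation in ARPAbet style.
invisible = ["IH2", "N", "V", "IH1", "Z", "AH0", "B", "AH0", "L"]
print(count_syllables(invisible))  # 4
```

Keep in mind that nsyl above raises a KeyError for any word that is not in the dictionary, so real use needs a fallback for unknown words.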

Solution 4 - Nlp

I'm trying to tackle this problem for a program that will calculate the Flesch-Kincaid and Flesch reading scores of a block of text. My algorithm uses what I found on this website: and it gets reasonably close. It still has trouble with complicated words like invisible and hyphenation, but I've found it gets in the ballpark for my purposes.

It has the upside of being easy to implement. I found the "es" can be either syllabic or not. It's a gamble, but I decided to remove the es in my algorithm.

private int CountSyllables(string word)
{
    char[] vowels = { 'a', 'e', 'i', 'o', 'u', 'y' };
    string currentWord = word;
    int numVowels = 0;
    bool lastWasVowel = false;
    foreach (char wc in currentWord)
    {
        bool foundVowel = false;
        foreach (char v in vowels)
        {
            //don't count diphthongs
            if (v == wc && lastWasVowel)
            {
                foundVowel = true;
                lastWasVowel = true;
                break;
            }
            else if (v == wc && !lastWasVowel)
            {
                numVowels++;
                foundVowel = true;
                lastWasVowel = true;
                break;
            }
        }

        //if full cycle and no vowel found, set lastWasVowel to false;
        if (!foundVowel)
            lastWasVowel = false;
    }
    //remove es, it's _usually? silent
    if (currentWord.Length > 2 &&
        currentWord.Substring(currentWord.Length - 2) == "es")
        numVowels--;
    // remove silent e
    else if (currentWord.Length > 1 &&
        currentWord.Substring(currentWord.Length - 1) == "e")
        numVowels--;

    return numVowels;
}

Solution 5 - Nlp

This is a particularly difficult problem which is not completely solved by the LaTeX hyphenation algorithm. A good summary of some available methods and the challenges involved can be found in the paper Evaluating Automatic Syllabification Algorithms for English (Marchand, Adsett, and Damper 2007).

Solution 6 - Nlp

Why calculate it? Every online dictionary has this info. in·vis·i·ble

Solution 7 - Nlp

Bumping @Tihamer and @joe-basirico. Very useful function, not perfect, but good for most small-to-medium projects. Joe, I have re-written an implementation of your code in Python:

def countSyllables(word):
    vowels = "aeiouy"
    numVowels = 0
    lastWasVowel = False
    for wc in word:
        foundVowel = False
        for v in vowels:
            if v == wc:
                if not lastWasVowel: numVowels+=1   #don't count diphthongs
                foundVowel = lastWasVowel = True
                break
        if not foundVowel:  #If full cycle and no vowel found, set lastWasVowel to false
            lastWasVowel = False
    if len(word) > 2 and word[-2:] == "es": #Remove es - it's "usually" silent (?)
        numVowels-=1
    elif len(word) > 1 and word[-1:] == "e":    #remove silent e
        numVowels-=1
    return numVowels

Hope someone finds this useful!

Solution 8 - Nlp

Thanks Joe Basirico, for sharing your quick and dirty implementation in C#. I've used the big libraries, and they work, but they're usually a bit slow, and for quick projects, your method works fine.

Here is your code in Java, along with test cases:

public static int countSyllables(String word)
{
    // Lowercase only: the match is case-sensitive, so an uppercase vowel
    // such as the 'A' in "American" is not counted.
    char[] vowels = { 'a', 'e', 'i', 'o', 'u', 'y' };
    char[] currentWord = word.toCharArray();
    int numVowels = 0;
    boolean lastWasVowel = false;
    for (char wc : currentWord) {
        boolean foundVowel = false;
        for (char v : vowels)
        {
            //don't count diphthongs
            if ((v == wc) && lastWasVowel)
            {
                foundVowel = true;
                lastWasVowel = true;
                break;
            }
            else if (v == wc && !lastWasVowel)
            {
                numVowels++;
                foundVowel = true;
                lastWasVowel = true;
                break;
            }
        }
        // If full cycle and no vowel found, set lastWasVowel to false;
        if (!foundVowel)
            lastWasVowel = false;
    }
    // Remove es, it's _usually? silent
    // (endsWith compares contents; == on Strings compares references in Java)
    if (word.length() > 2 && word.endsWith("es"))
        numVowels--;
    // remove silent e
    else if (word.length() > 1 && word.endsWith("e"))
        numVowels--;
    return numVowels;
}

public static void main(String[] args) {
    String txt = "what";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "super";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "Maryland";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "American";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "disenfranchized";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "Sophia";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
}

The result was as expected (it works well enough for Flesch-Kincaid):

txt=what countSyllables=1
txt=super countSyllables=2
txt=Maryland countSyllables=3
txt=American countSyllables=3
txt=disenfranchized countSyllables=5
txt=Sophia countSyllables=2

Solution 9 - Nlp

I ran into this exact same issue a little while ago.

I ended up using the CMU Pronunciation Dictionary for quick and accurate lookups of most words. For words not in the dictionary, I fell back to a machine learning model that's ~98% accurate at predicting syllable counts.

I wrapped the whole thing up in an easy-to-use Python module here:

Install: pip install big-phoney

Count Syllables:

from big_phoney import BigPhoney
phoney = BigPhoney()
phoney.count_syllables('triceratops')  # --> 4

If you're not using Python and you want to try the ML-model-based approach, I did a pretty detailed write up on how the syllable counting model works on Kaggle.
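The dictionary-first, fallback-second structure described in this answer can be sketched in a few lines. This is not how big-phoney is implemented: the lookup table is a tiny hypothetical stand-in for the CMU dictionary, and the fallback is a plain vowel-group heuristic rather than a machine learning model.

```python
import re

# Tiny stand-in for a pronunciation dictionary (hypothetical values).
KNOWN_COUNTS = {"triceratops": 4, "invisible": 4}

def heuristic_count(word):
    """Fallback: count runs of consecutive vowel letters, with a minimum of 1."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def count_syllables(word):
    """Use the dictionary when the word is known, otherwise fall back."""
    key = word.lower()
    if key in KNOWN_COUNTS:
        return KNOWN_COUNTS[key]
    return heuristic_count(key)

print(count_syllables("Triceratops"))  # 4, from the lookup table
print(count_syllables("banana"))       # 3, from the fallback
```

The benefit of this layout is that the fast, accurate lookup handles common words, and only unknown words pay for the less reliable estimate.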

Solution 10 - Nlp

Perl has the Lingua::Phonology::Syllable module. You might try that, or look into its algorithm. I saw a few other older modules there, too.

I don't understand why a regular expression gives you only a count of syllables. You should be able to get the syllables themselves using capture parentheses. Assuming you can construct a regular expression that works, that is.
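As a rough illustration of getting the pieces themselves rather than a count, here is a Python sketch that uses a single regular expression and returns every match. The pattern is a spelling-based heuristic of my own choosing, not a linguistically accurate syllabifier, and it will mis-split many English words.

```python
import re

# One chunk = optional leading consonants, a run of vowels, and then either
# all trailing consonants at the end of the word, or a single consonant when
# another consonant follows it.
CHUNK = re.compile(r"[^aeiouy]*[aeiouy]+(?:[^aeiouy]*$|[^aeiouy](?=[^aeiouy]))?")

def split_chunks(word):
    """Return the syllable-like chunks matched in `word`."""
    return CHUNK.findall(word.lower())

print(split_chunks("invisible"))  # ['in', 'vi', 'sib', 'le']
print(len(split_chunks("invisible")))  # 4
```

The length of the returned list gives the count, so the same expression serves both purposes.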

Solution 11 - Nlp

Today I found this Java implementation of Frank Liang's hyphenation algorithm with patterns for English or German, which works quite well and is available on Maven Central.

Caveat: It is important to remove the last lines of the .tex pattern files, because otherwise those files cannot be loaded with the current version on Maven Central.

To load and use the hyphenator, you can use the following Java code snippet. texTable is the name of the .tex files containing the needed patterns. Those files are available on the project github site.

    private Hyphenator createHyphenator(String texTable) {
        Hyphenator hyphenator = new Hyphenator();
        hyphenator.setErrorHandler(new ErrorHandler() {
            public void debug(String guard, String s) {
                logger.debug("{},{}", guard, s);
            }

            public void info(String s) {
                logger.info(s);
            }

            public void warning(String s) {
                logger.warn("WARNING: " + s);
            }

            public void error(String s) {
                logger.error("ERROR: " + s);
            }

            public void exception(String s, Exception e) {
                logger.error("EXCEPTION: " + s, e);
            }

            public boolean isDebugged(String guard) {
                return false;
            }
        });

        BufferedReader table = null;

        try {
            table = new BufferedReader(new InputStreamReader(Thread.currentThread().getContextClassLoader()
                    .getResourceAsStream((texTable)), Charset.forName("UTF-8")));
            // Load the TeX patterns into the hyphenator.
            hyphenator.loadTable(table);
        } catch (Utf8TexParser.TexParserException e) {
            logger.error("error loading hyphenation table: {}", e.getLocalizedMessage(), e);
            throw new RuntimeException("Failed to load hyphenation table", e);
        } finally {
            if (table != null) {
                try {
                    table.close();
                } catch (IOException e) {
                    logger.error("Closing hyphenation table failed", e);
                }
            }
        }

        return hyphenator;
    }

Afterwards the Hyphenator is ready to use. To detect syllables, the basic idea is to split the term at the provided hyphens.

    String hyphenedTerm = hyphenator.hyphenate(term);

    String hyphens[] = hyphenedTerm.split("\u00AD");

    int syllables = hyphens.length;

You need to split on "\u00AD", since the API does not return a normal "-".
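The split step is language-neutral, so here is the same idea as a small Python sketch. The input string is hand-made for illustration and is not actual output from the Java library.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD, the invisible "soft hyphen" character

def split_on_soft_hyphens(hyphenated):
    """Split a hyphenated term at every soft hyphen."""
    return hyphenated.split(SOFT_HYPHEN)

sample = "in\u00advis\u00adi\u00adble"  # hand-made input (assumption)
parts = split_on_soft_hyphens(sample)
print(parts, len(parts))  # ['in', 'vis', 'i', 'ble'] 4
```

Because the soft hyphen is normally invisible when printed, a hyphenated string can look identical to the original word, which is easy to miss when debugging.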

This approach outperforms Joe Basirico's answer, since it supports many different languages and detects German hyphenation more accurately.

Solution 12 - Nlp

Thank you @joe-basirico and @tihamer. I have ported @tihamer's code to Lua 5.1, 5.2 and luajit 2 (most likely will run on other versions of lua as well):


function CountSyllables(word)
  local vowels = { 'a','e','i','o','u','y' }
  local numVowels = 0
  local lastWasVowel = false

  for i = 1, #word do
    local wc = string.sub(word,i,i)
    local foundVowel = false;
    for _,v in pairs(vowels) do
      if (v == string.lower(wc) and lastWasVowel) then
        foundVowel = true
        lastWasVowel = true
      elseif (v == string.lower(wc) and not lastWasVowel) then
        numVowels = numVowels + 1
        foundVowel = true
        lastWasVowel = true
      end
    end

    if not foundVowel then
      lastWasVowel = false
    end
  end

  if string.len(word) > 2 and
    string.sub(word,string.len(word) - 1) == "es" then
    numVowels = numVowels - 1
  elseif string.len(word) > 1 and
    string.sub(word,string.len(word)) == "e" then
    numVowels = numVowels - 1
  end

  return numVowels
end

And some fun tests to confirm it works (as much as it's supposed to):


require "countsyllables"

tests = {
  { word = "what", syll = 1 },
  { word = "super", syll = 2 },
  { word = "Maryland", syll = 3},
  { word = "American", syll = 4},
  { word = "disenfranchized", syll = 5},
  { word = "Sophia", syll = 2},
  { word = "End", syll = 1},
  { word = "I", syll = 1},
  { word = "release", syll = 2},
  { word = "same", syll = 1},
}

for _,test in pairs(tests) do
  local resultSyll = CountSyllables(test.word)
  assert(resultSyll == test.syll,
    "Word: "..test.word.."\n"..
    "Expected: "..test.syll.."\n"..
    "Result: "..resultSyll)
end

print("Tests passed.")

Solution 13 - Nlp

I could not find an adequate way to count syllables, so I designed a method myself.

You can view my method here:

I use a combination of a dictionary and algorithm method to count syllables.

You can view my library here:

I just tested my algorithm and had a 99.4% strike rate!

Lawrence lawrence = new Lawrence();




Solution 14 - Nlp

After doing a lot of testing and trying out hyphenation packages as well, I wrote my own based on a number of examples. I also tried the pyhyphen and pyphen packages, which interface with hyphenation dictionaries, but they produce the wrong number of syllables in many cases. The nltk package was simply too slow for this use case.

My implementation in Python is part of a class I wrote, and the syllable counting routine is pasted below. It over-estimates the number of syllables a bit, as I still haven't found a good way to account for silent word endings.

The function returns the ratio of syllables per word as it is used for a Flesch-Kincaid readability score. The number doesn't have to be exact, just close enough for an estimate.

On my 7th generation i7 CPU, this function took 1.1-1.2 milliseconds for a 759 word sample text.

def _countSyllablesEN(self, theText):

    # Keep only letters and apostrophes; everything else becomes a space.
    # Lower-casing first keeps capital letters from being replaced by spaces.
    cleanText = ""
    for ch in theText.lower():
        if ch in "abcdefghijklmnopqrstuvwxyz'’":
            cleanText += ch
        else:
            cleanText += " "

    asVow    = "aeiouy'’"
    dExep    = ("ei","ie","ua","ia","eo")
    theWords = cleanText.split()
    if not theWords:
        return 0  # avoid dividing by zero on empty input
    allSylls = 0
    for inWord in theWords:
        nChar  = len(inWord)
        nSyll  = 0
        wasVow = False
        wasY   = False
        if nChar == 0:
            continue
        if inWord[0] in asVow:
            nSyll += 1
            wasVow = True
            wasY   = inWord[0] == "y"
        for c in range(1,nChar):
            isVow  = False
            if inWord[c] in asVow:
                nSyll += 1
                isVow = True
            if isVow and wasVow:
                nSyll -= 1
            if isVow and wasY:
                nSyll -= 1
            if inWord[c:c+2] in dExep:
                nSyll += 1
            wasVow = isVow
            wasY   = inWord[c] == "y"
        if inWord.endswith(("e")):
            nSyll -= 1
        if inWord.endswith(("le","ea","io")):
            nSyll += 1
        if nSyll < 1:
            nSyll = 1
        # print("%-15s: %d" % (inWord,nSyll))
        allSylls += nSyll

    return allSylls/len(theWords)

Solution 15 - Nlp

You can try Spacy Syllables. This works on Python 3.9:


pip install spacy
pip install spacy_syllables
python -m spacy download en_core_web_md


import spacy
from spacy_syllables import SpacySyllables
nlp = spacy.load('en_core_web_md')
syllables = SpacySyllables(nlp)
nlp.add_pipe('syllables', after='tagger')

def spacy_syllablize(word):
    token = nlp(word)[0]
    return token._.syllables

for test_word in ["trampoline", "margaret", "invisible", "thought", "Pronunciation", "couldn't"]:
    print(f"{test_word} -> {spacy_syllablize(test_word)}")


trampoline -> ['tram', 'po', 'line']
margaret -> ['mar', 'garet']
invisible -> ['in', 'vis', 'i', 'ble']
thought -> ['thought']
Pronunciation -> ['pro', 'nun', 'ci', 'a', 'tion']
couldn't -> ['could']

Solution 16 - Nlp

I am including a solution that works "okay" in R. Far from perfect.

countSyllablesInWord = function(words)
  {
  #word = "super";
  n.words = length(words);
  result = list();
  for(j in seq_len(n.words))
    {
    word = words[j];
    vowels = c("a","e","i","o","u","y");
    word.vec = strsplit(word,"")[[1]];
    n.char = length(word.vec);
    is.vowel = is.element(tolower(word.vec), vowels);
    n.vowels = sum(is.vowel);
    # nontrivial problem 
    if(n.vowels <= 1)
      {
      syllables = 1;
      str = word;
      } else {
              # syllables = 0;
              previous = "C";
              # on average ? 
              str = "";
              n.hyphen = 0;
              for(i in 1:n.char)
                {
                my.char = word.vec[i];
                my.vowel = is.vowel[i];
                if(my.vowel)
                  {
                  if(previous == "C")
                    {
                    if(i == 1)
                      {
                      str = paste0(my.char, "-");
                      n.hyphen = 1 + n.hyphen;
                      } else {
                              if(i < n.char)
                                {
                                if(n.vowels > (n.hyphen + 1))
                                  {
                                  str = paste0(str, my.char, "-");
                                  n.hyphen = 1 + n.hyphen;
                                  } else {
                                           str = paste0(str, my.char);
                                          }
                                } else {
                                        str = paste0(str, my.char);
                                        }
                              }
                     # syllables = 1 + syllables;
                     previous = "V";
                    } else {  # "VV"
                          # assume what  ?  vowel team?
                          str = paste0(str, my.char);
                          }
                } else {
                            str = paste0(str, my.char);
                            previous = "C";
                            }
                }
              syllables = 1 + n.hyphen;
              }
      result[[j]] = list("syllables" = syllables, "vowels" = n.vowels, "word" = str);
    }
  if(n.words == 1) { result[[1]]; } else { result; }
  }

Here are some results:

my.count = countSyllablesInWord(c("America", "beautiful", "spacious", "skies", "amber", "waves", "grain", "purple", "mountains", "majesty"));

my.count.df = data.frame(matrix(unlist(my.count), ncol=3, byrow=TRUE));
colnames(my.count.df) = names(my.count[[1]]);


#    syllables vowels         word
# 1          4      4   A-me-ri-ca
# 2          4      5 be-auti-fu-l
# 3          3      4   spa-ci-ous
# 4          2      2       ski-es
# 5          2      2       a-mber
# 6          2      2       wa-ves
# 7          2      2       gra-in
# 8          2      2      pu-rple
# 9          3      4  mo-unta-ins
# 10         3      3    ma-je-sty

I didn't realize how big of a "rabbit hole" this is, seems so easy.

################ hackathon #######



  # start.digraphs = c("bl", "br", "ch", "cl", "cr", "dr", 
  #                   "fl", "fr", "gl", "gr", "pl", "pr",
  #                   "sc", "sh", "sk", "sl", "sm", "sn",
  #                   "sp", "st", "sw", "th", "tr", "tw",
  #                   "wh", "wr");
  # start.trigraphs = c("sch", "scr", "shr", "sph", "spl",
  #                     "spr", "squ", "str", "thr");
  # end.digraphs = c("ch","sh","th","ng","dge","tch");
  # ile
  # farmer
  # ar er
  # vowel teams ... beaver1
  # # "able"
  # #
  # blends = c("augh", "ough", "tien", "ture", "tion", "cial", "cian", 
  #             "ck", "ct", "dge", "dis", "ed", "ex", "ful", 
  #             "gh", "ng", "ous", "kn", "ment", "mis", );
  # glue = c("ld", "st", "nd", "ld", "ng", "nk", 
  #           "lk", "lm", "lp", "lt", "ly", "mp", "nce", "nch", 
  #           "nse", "nt", "ph", "psy", "pt", "re", )
  # start.graphs = c("bl, br, ch, ck, cl, cr, dr, fl, fr, gh, gl, gr, ng, ph, pl, pr, qu, sc, sh, sk, sl, sm, sn, sp, st, sw, th, tr, tw, wh, wr");
  # #
  # digraphs.start = c("ch","sh","th","wh","ph","qu");
  # digraphs.end = c("ch","sh","th","ng","dge","tch");
  # #
  # blends.start = c("pl", "gr", "gl", "pr",
  # blends.end = c("lk","nk","nt",
  # #
  # # Monte     Mon-te
  # # Sophia    So-phi-a
  # # American  A-mer-i-can
  # n.vowels = 0;
  # for(i in 1:n.char)
  #   {
  #   my.char = word.vec[i];
  # n.syll = 0;
  # str = "";
  # previous = "C"; # consonant vs "V" vowel
  # for(i in 1:n.char)
  #   {
  #   my.char = word.vec[i];
  #   my.vowel = is.element(tolower(my.char), vowels);
  #   if(my.vowel)
  #     {
  #     n.vowels = 1 + n.vowels;
  #     if(previous == "C")
  #       {
  #       if(i == 1)
  #         {
  #         str = paste0(my.char, "-");
  #         } else {
  #                 if(n.syll > 1)
  #                   {
  #                   str = paste0(str, "-", my.char);
  #                   } else {
  #                          str = paste0(str, my.char);
  #                         }
  #                 }
  #        n.syll = 1 + n.syll;
  #        previous = "V";
  #       } 
  #   } else {
  #               str = paste0(str, my.char);
  #               previous = "C";
  #               }
  #   #
  #   }
# AIDE   1
# IDEA   3
# IDEAS  2
# IDEE   2
# IDE   1
# AIDA   2
# DUE   1
# IDEAL  2
# DEE   1
# UREA  3
# VACUO  3
# MOPED  1
# AGED  1
# TOTED  2
# JADED  2
# BRED  1
# RED   1

And for good measure, here is a simple Flesch-Kincaid readability function. The syllables argument is the list of results returned from the first function.

Since my function is a bit biased towards more syllables, that will give an inflated readability score ... which for now is fine ... if the goal is to make text more readable, this is not the worst thing.

computeReadability = function(n.sentences, n.words, syllables=NULL)
  {
  n = length(syllables);
  n.syllables = 0;
  for(i in seq_len(n))
    {
    my.syllable = syllables[[i]];
    n.syllables = my.syllable$syllables + n.syllables;
    }
  # Flesch Reading Ease (FRE):
  FRE = 206.835 - 1.015 * (n.words/n.sentences) - 84.6 * (n.syllables/n.words);
  # Flesch-Kincaid Grade Level (FKGL):
  FKGL = 0.39 * (n.words/n.sentences) + 11.8 * (n.syllables/n.words) - 15.59; 
  # FKGL = -0.384236 * FRE - 20.7164 * (n.syllables/n.words) + 63.88355;
  # FKGL = -0.13948  * FRE + 0.24843 * (n.words/n.sentences) + 13.25934;
  list("FRE" = FRE, "FKGL" = FKGL); 
  }
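For readers who are not working in R, the two formulas above translate directly to Python. This is a minimal sketch that takes the three totals as plain numbers; the function name and the sample counts are my own, chosen only to show the arithmetic.

```python
def readability(n_sentences, n_words, n_syllables):
    """Return (Flesch Reading Ease, Flesch-Kincaid Grade Level)."""
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    fre = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    fkgl = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return fre, fkgl

# Made-up totals: 10 sentences, 100 words, 150 syllables.
fre, fkgl = readability(10, 100, 150)
print(round(fre, 3), round(fkgl, 2))  # 69.785 6.01
```

With 10 words per sentence and 1.5 syllables per word, FRE is 206.835 - 10.15 - 126.9 and FKGL is 3.9 + 17.7 - 15.59. Over-counting syllables lowers FRE and raises FKGL, which is worth remembering when the syllable counter is known to be biased upward.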

Solution 17 - Nlp

I used jsoup to do this once. Here's a sample syllable parser:

public String[] syllables(String text){
		String url = "" + text;
		String relHref;
		try{
			Document doc = Jsoup.connect(url).get();
			Element link = doc.getElementsByClass("word-syllables").first();
			if(link == null){return new String[]{text};}
			relHref = link.html(); 
		}catch(IOException e){
			// Fall back to the unsplit word if the page cannot be fetched.
			relHref = text;
		}
		String[] syl = relHref.split("·");
		return syl;
	}


All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | user50705 | View Question on Stackoverflow
Solution 1 - Nlp | jason | View Answer on Stackoverflow
Solution 2 - Nlp | Sean | View Answer on Stackoverflow
Solution 3 - Nlp | hoju | View Answer on Stackoverflow
Solution 4 - Nlp | Joe Basirico | View Answer on Stackoverflow
Solution 5 - Nlp | Chris | View Answer on Stackoverflow
Solution 6 - Nlp | Cerin | View Answer on Stackoverflow
Solution 7 - Nlp | Tersosauros | View Answer on Stackoverflow
Solution 8 - Nlp | Tihamer | View Answer on Stackoverflow
Solution 9 - Nlp | Ryan Epp | View Answer on Stackoverflow
Solution 10 - Nlp | skiphoppy | View Answer on Stackoverflow
Solution 11 - Nlp | rzo1 | View Answer on Stackoverflow
Solution 12 - Nlp | josefnpat | View Answer on Stackoverflow
Solution 13 - Nlp | troy | View Answer on Stackoverflow
Solution 14 - Nlp | Jadzia626 | View Answer on Stackoverflow
Solution 15 - Nlp | chris | View Answer on Stackoverflow
Solution 16 - Nlp | mshaffer | View Answer on Stackoverflow
Solution 17 - Nlp | Itamar Fiorino | View Answer on Stackoverflow