Detecting syllables in a word
Problem Overview
I need to find a fairly efficient way to detect syllables in a word. E.g.,
Invisible -> in-vi-sib-le
There are some syllabification rules that could be used:
V CV VC CVC CCV CCCV CVCC
where V is a vowel and C is a consonant. E.g.,
Pronunciation (5 syllables): Pro-nun-ci-a-tion; CV-CVC-CV-V-CVC
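For illustration, mapping a word onto this C/V skeleton is a one-liner; a minimal Python sketch (treating only a, e, i, o, u as vowels, so "y" and digraphs are glossed over):

```python
def cv_pattern(word, vowels="aeiou"):
    """Map each letter to 'V' (vowel) or 'C' (consonant)."""
    return "".join("V" if ch in vowels else "C" for ch in word.lower())

print(cv_pattern("nun"))        # CVC
print(cv_pattern("invisible"))  # VCCVCVCCV
```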
I've tried a few methods, among which were regex (which helps only if you want to count syllables), hard-coded rule definitions (a brute-force approach which proved very inefficient), and finally a finite state automaton (which did not result in anything useful).
The purpose of my application is to create a dictionary of all syllables in a given language. This dictionary will later be used for spell-checking applications (using Bayesian classifiers) and text-to-speech synthesis.
I would appreciate tips on alternative ways to solve this problem, besides my previous approaches.
I work in Java, but a tip in C/C++, C#, Python, Perl, etc. would work for me.
Solutions
Solution 1 - Nlp
Read about the TeX approach to this problem for the purposes of hyphenation. In particular, see Frank Liang's PhD dissertation Word Hy-phen-a-tion by Com-put-er. His algorithm is very accurate and includes a small exception dictionary for cases where the algorithm does not work.
Solution 2 - Nlp
I stumbled across this page looking for the same thing, and found a few implementations of the Liang paper here: https://github.com/mnater/hyphenator or the successor: https://github.com/mnater/Hyphenopoly
That is, unless you're the type that enjoys reading a 60-page thesis instead of adapting freely available code for a non-unique problem. :)
Solution 3 - Nlp
Here is a solution using NLTK, which looks up each word in the CMU Pronouncing Dictionary (vowel phonemes carry a stress digit, so counting those counts syllables):
from nltk.corpus import cmudict

# One-time setup: nltk.download('cmudict')
d = cmudict.dict()

def nsyl(word):
    # Returns one count per listed pronunciation of the word.
    return [len([y for y in x if y[-1].isdigit()]) for x in d[word.lower()]]
Solution 4 - Nlp
I'm trying to tackle this problem for a program that will calculate the Flesch-Kincaid grade level and Flesch reading ease score of a block of text. My algorithm uses what I found on this website: http://www.howmanysyllables.com/howtocountsyllables.html and it gets reasonably close. It still has trouble with complicated words like "invisible" and "hyphenation", but I've found it gets in the ballpark for my purposes.
It has the upside of being easy to implement. I found that a trailing "es" can be either syllabic or silent; it's a gamble, but I decided to remove it in my algorithm.
private int CountSyllables(string word)
{
    char[] vowels = { 'a', 'e', 'i', 'o', 'u', 'y' };
    string currentWord = word;
    int numVowels = 0;
    bool lastWasVowel = false;
    foreach (char wc in currentWord)
    {
        bool foundVowel = false;
        foreach (char v in vowels)
        {
            // don't count diphthongs
            if (v == wc && lastWasVowel)
            {
                foundVowel = true;
                lastWasVowel = true;
                break;
            }
            else if (v == wc && !lastWasVowel)
            {
                numVowels++;
                foundVowel = true;
                lastWasVowel = true;
                break;
            }
        }
        // if full cycle and no vowel found, set lastWasVowel to false
        if (!foundVowel)
            lastWasVowel = false;
    }
    // remove "es": it's usually (?) silent
    if (currentWord.Length > 2 &&
        currentWord.Substring(currentWord.Length - 2) == "es")
        numVowels--;
    // remove silent "e"
    else if (currentWord.Length > 1 &&
        currentWord.Substring(currentWord.Length - 1) == "e")
        numVowels--;
    return numVowels;
}
Solution 5 - Nlp
This is a particularly difficult problem which is not completely solved by the LaTeX hyphenation algorithm. A good summary of some available methods and the challenges involved can be found in the paper Evaluating Automatic Syllabification Algorithms for English (Marchand, Adsett, and Damper 2007).
Solution 6 - Nlp
Why calculate it? Every online dictionary has this info. http://dictionary.reference.com/browse/invisible in·vis·i·ble
Solution 7 - Nlp
Bumping @Tihamer and @joe-basirico. Very useful function, not perfect, but good for most small-to-medium projects. Joe, I have rewritten your code in Python:
def countSyllables(word):
    vowels = "aeiouy"
    numVowels = 0
    lastWasVowel = False
    for wc in word:
        foundVowel = False
        for v in vowels:
            if v == wc:
                if not lastWasVowel:
                    numVowels += 1  # don't count diphthongs
                foundVowel = lastWasVowel = True
                break
        if not foundVowel:
            # full cycle and no vowel found: set lastWasVowel to False
            lastWasVowel = False
    if len(word) > 2 and word[-2:] == "es":
        numVowels -= 1  # remove "es"; it's "usually" silent (?)
    elif len(word) > 1 and word[-1:] == "e":
        numVowels -= 1  # remove silent "e"
    return numVowels
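As an aside (not part of the original answer), the same vowel-run idea can be written more compactly with itertools.groupby, which groups consecutive characters by whether they are vowels:

```python
from itertools import groupby

def count_syllables(word):
    """Rough count: number of vowel runs, minus a trailing silent 'e'/'es'."""
    w = word.lower()
    # groupby yields one group per run of equal keys; sum the True (vowel) runs
    runs = sum(is_v for is_v, _ in groupby(w, key=lambda ch: ch in "aeiouy"))
    if len(w) > 2 and w.endswith("es"):
        runs -= 1
    elif len(w) > 1 and w.endswith("e"):
        runs -= 1
    return runs

print(count_syllables("super"))      # 2
print(count_syllables("invisible"))  # 3
```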
Hope someone finds this useful!
Solution 8 - Nlp
Thanks Joe Basirico, for sharing your quick and dirty implementation in C#. I've used the big libraries, and they work, but they're usually a bit slow, and for quick projects, your method works fine.
Here is your code in Java, along with test cases:
public static int countSyllables(String word)
{
    char[] vowels = { 'a', 'e', 'i', 'o', 'u', 'y' };
    char[] currentWord = word.toCharArray();
    int numVowels = 0;
    boolean lastWasVowel = false;
    for (char wc : currentWord) {
        boolean foundVowel = false;
        for (char v : vowels)
        {
            // don't count diphthongs
            if (v == wc && lastWasVowel)
            {
                foundVowel = true;
                lastWasVowel = true;
                break;
            }
            else if (v == wc && !lastWasVowel)
            {
                numVowels++;
                foundVowel = true;
                lastWasVowel = true;
                break;
            }
        }
        // if full cycle and no vowel found, set lastWasVowel to false
        if (!foundVowel)
            lastWasVowel = false;
    }
    // remove "es": it's usually (?) silent
    // (note: compare string contents with equals(), not ==)
    if (word.length() > 2 &&
        word.substring(word.length() - 2).equals("es"))
        numVowels--;
    // remove silent "e"
    else if (word.length() > 1 &&
        word.substring(word.length() - 1).equals("e"))
        numVowels--;
    return numVowels;
}

public static void main(String[] args) {
    String[] words = { "what", "super", "Maryland", "American", "disenfranchized", "Sophia" };
    for (String txt : words)
        System.out.println("txt=" + txt + " countSyllables=" + countSyllables(txt));
}
The results were as expected (it works well enough for Flesch-Kincaid):
txt=what countSyllables=1
txt=super countSyllables=2
txt=Maryland countSyllables=3
txt=American countSyllables=3
txt=disenfranchized countSyllables=5
txt=Sophia countSyllables=2
Solution 9 - Nlp
I ran into this exact same issue a little while ago.
I ended up using the CMU Pronunciation Dictionary for quick and accurate lookups of most words. For words not in the dictionary, I fell back to a machine learning model that's ~98% accurate at predicting syllable counts.
I wrapped the whole thing up in an easy-to-use python module here: https://github.com/repp/big-phoney
Install:
pip install big-phoney
Count Syllables:
from big_phoney import BigPhoney
phoney = BigPhoney()
phoney.count_syllables('triceratops') # --> 4
If you're not using Python and you want to try the ML-model-based approach, I did a pretty detailed write-up on Kaggle about how the syllable-counting model works.
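If you just want the shape of this approach without the dependency, the dictionary-lookup-with-fallback pattern is easy to sketch in Python. The tiny dictionary and the vowel-group fallback below are illustrative stand-ins, not the real CMU data or the ML model:

```python
import re

# Illustrative stand-in for the CMU Pronouncing Dictionary lookup.
SYLLABLE_DICT = {"invisible": 4, "triceratops": 4}

def fallback_estimate(word):
    # Crude fallback: count vowel groups (stands in for the ML model).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def count_syllables(word):
    # Fast, accurate path first; estimate only for unknown words.
    return SYLLABLE_DICT.get(word.lower(), fallback_estimate(word))

print(count_syllables("invisible"))  # 4 (dictionary hit)
print(count_syllables("banana"))     # 3 (fallback estimate)
```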
Solution 10 - Nlp
Perl has the Lingua::Phonology::Syllable module. You might try that, or look into its algorithm. I saw a few other older modules there, too.
I don't understand why a regular expression would give you only a count of syllables. You should be able to get the syllables themselves using capturing parentheses. Assuming you can construct a regular expression that works, that is.
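To illustrate the point in Python (a heuristic sketch, not the Perl module's algorithm): a pattern that attaches leading consonants to the following vowel run yields rough chunks rather than just a count.

```python
import re

def syllable_chunks(word):
    # Each chunk: optional leading consonants + a vowel run;
    # trailing consonants are folded into the final chunk.
    return re.findall(r"[^aeiouy]*[aeiouy]+(?:[^aeiouy]*$)?", word.lower())

print(syllable_chunks("invisible"))  # ['i', 'nvi', 'si', 'ble']
print(syllable_chunks("thought"))    # ['thought']
```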
Solution 11 - Nlp
Today I found this Java implementation of Frank Liang's hyphenation algorithm, with patterns for English and German, which works quite well and is available on Maven Central.
Caveat: it is important to remove the last lines of the .tex pattern files, because otherwise those files cannot be loaded with the current version on Maven Central.
To load and use the hyphenator, you can use the following Java code snippet. texTable is the name of the .tex file containing the needed patterns. Those files are available on the project's GitHub site.
private Hyphenator createHyphenator(String texTable) {
    Hyphenator hyphenator = new Hyphenator();
    hyphenator.setErrorHandler(new ErrorHandler() {
        public void debug(String guard, String s) {
            logger.debug("{},{}", guard, s);
        }

        public void info(String s) {
            logger.info(s);
        }

        public void warning(String s) {
            logger.warn("WARNING: " + s);
        }

        public void error(String s) {
            logger.error("ERROR: " + s);
        }

        public void exception(String s, Exception e) {
            logger.error("EXCEPTION: " + s, e);
        }

        public boolean isDebugged(String guard) {
            return false;
        }
    });

    BufferedReader table = null;
    try {
        table = new BufferedReader(new InputStreamReader(Thread.currentThread().getContextClassLoader()
                .getResourceAsStream(texTable), Charset.forName("UTF-8")));
        hyphenator.loadTable(table);
    } catch (Utf8TexParser.TexParserException e) {
        logger.error("error loading hyphenation table: {}", e.getLocalizedMessage(), e);
        throw new RuntimeException("Failed to load hyphenation table", e);
    } finally {
        if (table != null) {
            try {
                table.close();
            } catch (IOException e) {
                logger.error("Closing hyphenation table failed", e);
            }
        }
    }
    return hyphenator;
}
Afterwards, the Hyphenator is ready to use. To detect syllables, the basic idea is to split the term at the provided hyphens:
String hyphenedTerm = hyphenator.hyphenate(term);
String[] hyphens = hyphenedTerm.split("\u00AD");
int syllables = hyphens.length;
You need to split on "\u00AD" (the soft hyphen), since the API does not return a normal "-".
This approach outperforms Joe Basirico's answer, since it supports many different languages and detects German hyphenation more accurately.
Solution 12 - Nlp
Thank you @joe-basirico and @tihamer. I have ported @tihamer's code to Lua 5.1, 5.2 and LuaJIT 2 (it will most likely run on other Lua versions as well):
countsyllables.lua
function CountSyllables(word)
  local vowels = { 'a','e','i','o','u','y' }
  local numVowels = 0
  local lastWasVowel = false
  for i = 1, #word do
    local wc = string.sub(word, i, i)
    local foundVowel = false
    for _, v in pairs(vowels) do
      if v == string.lower(wc) and lastWasVowel then
        foundVowel = true
        lastWasVowel = true
      elseif v == string.lower(wc) and not lastWasVowel then
        numVowels = numVowels + 1
        foundVowel = true
        lastWasVowel = true
      end
    end
    if not foundVowel then
      lastWasVowel = false
    end
  end
  if string.len(word) > 2 and
      string.sub(word, string.len(word) - 1) == "es" then
    numVowels = numVowels - 1
  elseif string.len(word) > 1 and
      string.sub(word, string.len(word)) == "e" then
    numVowels = numVowels - 1
  end
  return numVowels
end
And some fun tests to confirm it works (as much as it's supposed to):
countsyllables.tests.lua
require "countsyllables"

tests = {
  { word = "what", syll = 1 },
  { word = "super", syll = 2 },
  { word = "Maryland", syll = 3 },
  { word = "American", syll = 4 },
  { word = "disenfranchized", syll = 5 },
  { word = "Sophia", syll = 2 },
  { word = "End", syll = 1 },
  { word = "I", syll = 1 },
  { word = "release", syll = 2 },
  { word = "same", syll = 1 },
}

for _, test in pairs(tests) do
  local resultSyll = CountSyllables(test.word)
  assert(resultSyll == test.syll,
    "Word: "..test.word.."\n"..
    "Expected: "..test.syll.."\n"..
    "Result: "..resultSyll)
end
print("Tests passed.")
Solution 13 - Nlp
I could not find an adequate way to count syllables, so I designed a method myself.
You can view my method here: https://stackoverflow.com/a/32784041/2734752
I use a combination of a dictionary and algorithm method to count syllables.
You can view my library here: https://github.com/troywatson/Lawrence-Style-Checker
I just tested my algorithm and had a 99.4% strike rate!
Lawrence lawrence = new Lawrence();
System.out.println(lawrence.getSyllable("hyphenation"));
System.out.println(lawrence.getSyllable("computer"));
Output:
4
3
Solution 14 - Nlp
After doing a lot of testing, and trying out hyphenation packages as well, I wrote my own based on a number of examples. I also tried the pyhyphen and pyphen packages, which interface with hyphenation dictionaries, but they produce the wrong number of syllables in many cases. The nltk package was simply too slow for this use case.
My implementation in Python is part of a class I wrote, and the syllable-counting routine is pasted below. It over-estimates the number of syllables a bit, as I still haven't found a good way to account for silent word endings.
The function returns the ratio of syllables per word as it is used for a Flesch-Kincaid readability score. The number doesn't have to be exact, just close enough for an estimate.
On my 7th generation i7 CPU, this function took 1.1-1.2 milliseconds for a 759 word sample text.
def _countSyllablesEN(self, theText):
    cleanText = ""
    for ch in theText:
        if ch in "abcdefghijklmnopqrstuvwxyz'’":
            cleanText += ch
        else:
            cleanText += " "
    asVow = "aeiouy'’"
    dExep = ("ei", "ie", "ua", "ia", "eo")
    theWords = cleanText.lower().split()
    allSylls = 0
    for inWord in theWords:
        nChar = len(inWord)
        nSyll = 0
        wasVow = False
        wasY = False
        if nChar == 0:
            continue
        if inWord[0] in asVow:
            nSyll += 1
            wasVow = True
            wasY = inWord[0] == "y"
        for c in range(1, nChar):
            isVow = False
            if inWord[c] in asVow:
                nSyll += 1
                isVow = True
            if isVow and wasVow:
                nSyll -= 1
            if isVow and wasY:
                nSyll -= 1
            if inWord[c:c+2] in dExep:
                nSyll += 1
            wasVow = isVow
            wasY = inWord[c] == "y"
        if inWord.endswith("e"):
            nSyll -= 1
        if inWord.endswith(("le", "ea", "io")):
            nSyll += 1
        if nSyll < 1:
            nSyll = 1
        # print("%-15s: %d" % (inWord, nSyll))
        allSylls += nSyll
    return allSylls / len(theWords)
Solution 15 - Nlp
You can try Spacy Syllables. This works on Python 3.9:
Setup:
pip install spacy
pip install spacy_syllables
python -m spacy download en_core_web_md
Code:
import spacy
from spacy_syllables import SpacySyllables

nlp = spacy.load('en_core_web_md')
syllables = SpacySyllables(nlp)
nlp.add_pipe('syllables', after='tagger')

def spacy_syllablize(word):
    token = nlp(word)[0]
    return token._.syllables

for test_word in ["trampoline", "margaret", "invisible", "thought", "Pronunciation", "couldn't"]:
    print(f"{test_word} -> {spacy_syllablize(test_word)}")
Output:
trampoline -> ['tram', 'po', 'line']
margaret -> ['mar', 'garet']
invisible -> ['in', 'vis', 'i', 'ble']
thought -> ['thought']
Pronunciation -> ['pro', 'nun', 'ci', 'a', 'tion']
couldn't -> ['could']
Solution 16 - Nlp
I am including a solution that works "okay" in R. Far from perfect.
countSyllablesInWord = function(words)
{
  # word = "super";
  n.words = length(words);
  result = list();
  for(j in 1:n.words)
  {
    word = words[j];
    vowels = c("a","e","i","o","u","y");
    word.vec = strsplit(word, "")[[1]];
    n.char = length(word.vec);
    is.vowel = is.element(tolower(word.vec), vowels);
    n.vowels = sum(is.vowel);
    # nontrivial problem
    if(n.vowels <= 1)
    {
      syllables = 1;
      str = word;
    } else {
      previous = "C";
      str = "";
      n.hyphen = 0;
      for(i in 1:n.char)
      {
        my.char = word.vec[i];
        my.vowel = is.vowel[i];
        if(my.vowel)
        {
          if(previous == "C")
          {
            if(i == 1)
            {
              str = paste0(my.char, "-");
              n.hyphen = 1 + n.hyphen;
            } else {
              if(i < n.char)
              {
                if(n.vowels > (n.hyphen + 1))
                {
                  str = paste0(str, my.char, "-");
                  n.hyphen = 1 + n.hyphen;
                } else {
                  str = paste0(str, my.char);
                }
              } else {
                str = paste0(str, my.char);
              }
            }
            previous = "V";
          } else { # "VV": assume a vowel team?
            str = paste0(str, my.char);
          }
        } else {
          str = paste0(str, my.char);
          previous = "C";
        }
      }
      syllables = 1 + n.hyphen;
    }
    result[[j]] = list("syllables" = syllables, "vowels" = n.vowels, "word" = str);
  }
  if(n.words == 1) { result[[1]]; } else { result; }
}
Here are some results:
my.count = countSyllablesInWord(c("America", "beautiful", "spacious", "skies", "amber", "waves", "grain", "purple", "mountains", "majesty"));
my.count.df = data.frame(matrix(unlist(my.count), ncol=3, byrow=TRUE));
colnames(my.count.df) = names(my.count[[1]]);
my.count.df;
# syllables vowels word
# 1 4 4 A-me-ri-ca
# 2 4 5 be-auti-fu-l
# 3 3 4 spa-ci-ous
# 4 2 2 ski-es
# 5 2 2 a-mber
# 6 2 2 wa-ves
# 7 2 2 gra-in
# 8 2 2 pu-rple
# 9 3 4 mo-unta-ins
# 10 3 3 ma-je-sty
I didn't realize how big of a "rabbit hole" this is; it seemed so easy.
################ hackathon #######
# https://en.wikipedia.org/wiki/Gunning_fog_index
# THIS is a CLASSIFIER PROBLEM ...
# https://stackoverflow.com/questions/405161/detecting-syllables-in-a-word
# http://www.speech.cs.cmu.edu/cgi-bin/cmudict
# http://www.syllablecount.com/syllables/
# https://enchantedlearning.com/consonantblends/index.shtml
# start.digraphs = c("bl", "br", "ch", "cl", "cr", "dr",
# "fl", "fr", "gl", "gr", "pl", "pr",
# "sc", "sh", "sk", "sl", "sm", "sn",
# "sp", "st", "sw", "th", "tr", "tw",
# "wh", "wr");
# start.trigraphs = c("sch", "scr", "shr", "sph", "spl",
# "spr", "squ", "str", "thr");
#
#
#
# end.digraphs = c("ch","sh","th","ng","dge","tch");
#
# ile
#
# farmer
# ar er
#
# vowel teams ... beaver1
#
#
# # "able"
# # http://www.abcfastphonics.com/letter-blends/blend-cial.html
# blends = c("augh", "ough", "tien", "ture", "tion", "cial", "cian",
# "ck", "ct", "dge", "dis", "ed", "ex", "ful",
# "gh", "ng", "ous", "kn", "ment", "mis", );
#
# glue = c("ld", "st", "nd", "ld", "ng", "nk",
# "lk", "lm", "lp", "lt", "ly", "mp", "nce", "nch",
# "nse", "nt", "ph", "psy", "pt", "re", )
#
#
# start.graphs = c("bl, br, ch, ck, cl, cr, dr, fl, fr, gh, gl, gr, ng, ph, pl, pr, qu, sc, sh, sk, sl, sm, sn, sp, st, sw, th, tr, tw, wh, wr");
#
# # https://mantra4changeblog.wordpress.com/2017/05/01/consonant-digraphs/
# digraphs.start = c("ch","sh","th","wh","ph","qu");
# digraphs.end = c("ch","sh","th","ng","dge","tch");
# # https://www.education.com/worksheet/article/beginning-consonant-blends/
# blends.start = c("pl", "gr", "gl", "pr",
#
# blends.end = c("lk","nk","nt",
#
#
# # https://sarahsnippets.com/wp-content/uploads/2019/07/ScreenShot2019-07-08at8.24.51PM-817x1024.png
# # Monte Mon-te
# # Sophia So-phi-a
# # American A-mer-i-can
#
# n.vowels = 0;
# for(i in 1:n.char)
# {
# my.char = word.vec[i];
#
#
#
#
#
# n.syll = 0;
# str = "";
#
# previous = "C"; # consonant vs "V" vowel
#
# for(i in 1:n.char)
# {
# my.char = word.vec[i];
#
# my.vowel = is.element(tolower(my.char), vowels);
# if(my.vowel)
# {
# n.vowels = 1 + n.vowels;
# if(previous == "C")
# {
# if(i == 1)
# {
# str = paste0(my.char, "-");
# } else {
# if(n.syll > 1)
# {
# str = paste0(str, "-", my.char);
# } else {
# str = paste0(str, my.char);
# }
# }
# n.syll = 1 + n.syll;
# previous = "V";
# }
#
# } else {
# str = paste0(str, my.char);
# previous = "C";
# }
# #
# }
#
#
#
#
## https://jzimba.blogspot.com/2017/07/an-algorithm-for-counting-syllables.html
# AIDE 1
# IDEA 3
# IDEAS 2
# IDEE 2
# IDE 1
# AIDA 2
# PROUSTIAN 3
# CHRISTIAN 3
# CLICHE 1
# HALIDE 2
# TELEPHONE 3
# TELEPHONY 4
# DUE 1
# IDEAL 2
# DEE 1
# UREA 3
# VACUO 3
# SEANCE 1
# SAILED 1
# RIBBED 1
# MOPED 1
# BLESSED 1
# AGED 1
# TOTED 2
# WARRED 1
# UNDERFED 2
# JADED 2
# INBRED 2
# BRED 1
# RED 1
# STATES 1
# TASTES 1
# TESTES 1
# UTILIZES 4
And for good measure, a simple Flesch-Kincaid readability function; syllables is the list of counts returned from the first function.
Since my function is a bit biased towards more syllables, it will give an inflated readability score, which for now is fine: if the goal is to make text more readable, this is not the worst thing.
computeReadability = function(n.sentences, n.words, syllables=NULL)
{
  n = length(syllables);
  n.syllables = 0;
  for(i in 1:n)
  {
    my.syllable = syllables[[i]];
    n.syllables = my.syllable$syllables + n.syllables;
  }
  # Flesch Reading Ease (FRE):
  FRE = 206.835 - 1.015 * (n.words/n.sentences) - 84.6 * (n.syllables/n.words);
  # Flesch-Kincaid Grade Level (FKGL):
  FKGL = 0.39 * (n.words/n.sentences) + 11.8 * (n.syllables/n.words) - 15.59;
  # FKGL = -0.384236 * FRE - 20.7164 * (n.syllables/n.words) + 63.88355;
  # FKGL = -0.13948 * FRE + 0.24843 * (n.words/n.sentences) + 13.25934;
  list("FRE" = FRE, "FKGL" = FKGL);
}
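For readers not using R, here are the same two formulas as a quick Python sketch (the coefficients are the standard published Flesch constants; the function name is mine):

```python
def compute_readability(n_sentences, n_words, n_syllables):
    """Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL)."""
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    fre = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    fkgl = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return {"FRE": fre, "FKGL": fkgl}

# Example: 20 words in 2 sentences, 30 syllables total
scores = compute_readability(2, 20, 30)
print(round(scores["FRE"], 3))   # 69.785
print(round(scores["FKGL"], 2))  # 6.01
```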
Solution 17 - Nlp
I used jsoup to do this once. Here's a sample syllable parser:
public String[] syllables(String text) {
    String url = "https://www.merriam-webster.com/dictionary/" + text;
    String relHref;
    try {
        Document doc = Jsoup.connect(url).get();
        Element link = doc.getElementsByClass("word-syllables").first();
        if (link == null) { return new String[]{ text }; }
        relHref = link.html();
    } catch (IOException e) {
        relHref = text;
    }
    String[] syl = relHref.split("·");
    return syl;
}