"bad words" filter

List, Dictionary, Profanity

List Problem Overview


Not very technical, but I have to implement a bad-words filter on a new site we are developing, so I need a "good" bad-words list to seed my database with. Any hints or pointers? Searching with Google I found this one, which is a start, but nothing more.

Yes, I know that this kind of filter is easily evaded, but the client's will is the client's will! :-)

The site will have to filter out both English and Italian words, but for Italian I can ask my colleagues to help me with a community-built list of "parolacce" :-). An email will do.

Thanks for any help.

List Solutions


Solution 1 - List

Beware of clbuttic mistakes.

> "Apple made the clbuttic mistake of forcing out their visionary - I mean, look at what NeXT has been up to!"
>
> Hmm. "clbuttic".
>
> Google "clbuttic" - thousands of hits!
>
> There's someone who call his car 'clbuttic'.
>
> There are "Clbuttic Steam Engine" message boards.
>
> Webster's dictionary - no help.
>
> Hmm. What can this be?
>
> HINT: People who make buttumptions about their regex scripts, will be embarbutted when they repeat this mbuttive mistake.

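The joke in the quote comes from blind substring replacement. As a minimal sketch (the function name `naive_censor` and its default arguments are made up for illustration), this is the kind of filter that produces "clbuttic":

```python
def naive_censor(text: str, bad: str = "ass", replacement: str = "butt") -> str:
    """Replace every occurrence of `bad`, even when it sits inside a longer word."""
    return text.replace(bad, replacement)


# Harmless words that merely contain the bad string get mangled too.
print(naive_censor("a classic mistake"))    # a clbuttic mistake
print(naive_censor("massive assumptions"))  # mbuttive buttumptions
```
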
Solution 2 - List

I didn't see a language specified, but you can use this for PHP. It generates a regex for each inserted word so that even intentional misspellings (e.g. @ss, i3itch) are also caught.

<?php

/**
 * @author unkwntech@unkwndesign.com
 **/

if (($_GET['act'] ?? '') === 'do')
 {
	// Each letter maps to a regex fragment that also matches common look-alikes.
	// Multi-character substitutes (I3, ph, vv) need alternation, not a character class,
	// and the classes contain no spaces so that a space is never accepted as a match.
	$replace = [
		'a' => '[aA@]',
		'b' => '(?:[bB]|[Ili]3)',
		'c' => '(?:[cC(]|[kK])',
		'd' => '[dD]',
		'e' => '[eE3]',
		'f' => '(?:[fF]|[pP][hH])',
		'g' => '[gG6]',
		'h' => '[hH]',
		'i' => '[iIl!1]',
		'j' => '[jJ]',
		'k' => '(?:[cC(]|[kK])',
		'l' => '[lL1!i]',
		'm' => '[mM]',
		'n' => '[nN]',
		'o' => '[oO0]',
		'p' => '[pP]',
		'q' => '[qQ9]',
		'r' => '[rR]',
		's' => '[sS$5]',
		't' => '[tT7]',
		'u' => '[uUvV]',
		'v' => '[vVuU]',
		'w' => '(?:[wW]|vv|VV)',
		'x' => '[xX]',
		'y' => '[yY]',
		'z' => '[zZ2]',
	];
	$word = str_split(strtolower($_POST['word'] ?? ''));
	foreach ($word as $i => $char)
	 {
		// Letters come from the table; anything else is escaped so it stays literal.
		$word[$i] = $replace[$char] ?? preg_quote($char, '/');
	 }
	//$word = "/" . implode('', $word) . "/";
	// Escape the output because it is echoed into an HTML page.
	echo htmlspecialchars(implode('', $word));
 }

if (($_GET['act'] ?? '') === 'list')
 {
	// The mysql_* functions were removed in PHP 7; mysqli is used instead.
 	$link = mysqli_connect('localhost', 'username', 'password', 'peoples');
 	$sql = "SELECT word FROM filters";
 	$result = mysqli_query($link, $sql);
 	while ($row = mysqli_fetch_assoc($result))
 	 {
		echo htmlspecialchars($row['word']) . "<br />";
	 }
	 echo '<hr>';
 }
?>
<html>
	<head>
		<title>RegEx Generator</title>
	</head>
	<body>
		<form action='badword.php?act=do' method='post'>
			Word: <input type='text' name='word' /><br />
			<input type='submit' value='Generate' />
		</form>
		<a href="badword.php?act=list">List Words</a>
	</body>
</html>
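
For readers not using PHP, the same idea can be sketched in Python. This is only an illustration under assumptions, not a port of the script above: the `LOOKALIKES` table is a small made-up subset, and `obfuscation_pattern` is a hypothetical helper name.

```python
import re

# Hypothetical subset of look-alike characters; extend as needed.
LOOKALIKES = {
    "a": "[a@]",
    "b": "(?:b|[il1]3)",
    "e": "[e3]",
    "i": "[il!1]",
    "l": "[l1!i]",
    "o": "[o0]",
    "s": "[s$5]",
    "t": "[t7]",
}


def obfuscation_pattern(word: str) -> re.Pattern:
    """Build a case-insensitive regex matching `word` and simple leetspeak variants."""
    parts = [LOOKALIKES.get(ch, re.escape(ch)) for ch in word.lower()]
    return re.compile("".join(parts), re.IGNORECASE)


pattern = obfuscation_pattern("test")
print(bool(pattern.search("a T3$7 here")))  # True
print(bool(pattern.search("toast")))        # False
```

A pattern built this way still matches inside longer words, so it inherits the clbuttic problem unless it is combined with word-boundary checks.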

Solution 3 - List

Shutterstock has a GitHub repo with a list of bad words used for filtering.

You can check it out here: https://github.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words
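
A word list like this can be loaded into a set for fast lookups. The sketch below assumes a local plain-text file with one entry per line; the path in the usage comment and both function names are hypothetical.

```python
import re
from pathlib import Path


def load_word_list(path):
    """Read one entry per line, ignoring blank lines and surrounding whitespace."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.strip().lower() for line in lines if line.strip()}


def contains_listed_word(text, words):
    """Return True if any whole token of `text` appears in `words`.

    Multi-word entries in the list are not handled by this token check.
    """
    tokens = re.findall(r"\w+", text.lower())
    return any(token in words for token in tokens)


# Usage, assuming a downloaded list saved at a hypothetical local path:
#   words = load_word_list("badwords_en.txt")
#   contains_listed_word("some user comment", words)
```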

Solution 4 - List

If anyone needs an API, Google currently provides a bad word indicator.

http://www.wdyl.com/profanity?q=naughtyword

{
response: "false"
}

Update: Google has now removed this service.

Solution 5 - List

I would say to just remove posts as you become aware of them, and block users who are overly explicit with their postings. You can say very offensive things without using any swear words. If you block the word ass (aka donkey), then people will just type a$$ or /\55, or whatever else they need to type to get past the filter.

Solution 6 - List

+1 on the Clbuttic mistake. I think it is important for "bad word" filters to scan for both leading and trailing spaces (e.g., " ass ") as opposed to just the exact string, so that we don't end up with words like clbuttic, clbuttes, buttert, buttess, etc.
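
One way to sketch whole-word matching is with regex word boundaries (`\b`) instead of literal spaces. That is a swapped-in variation on the suggestion above, chosen so that a word at the start of the text or next to punctuation is still caught. The function name and the mask are assumptions for illustration.

```python
import re


def censor_whole_words(text, bad_words, mask="***"):
    """Mask only whole-word, case-insensitive matches of the given words."""
    if not bad_words:
        return text
    alternatives = "|".join(re.escape(word) for word in bad_words)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(mask, text)


print(censor_whole_words("a classic ass.", ["ass"]))  # a classic ***.
print(censor_whole_words("Ass, assess", ["ass"]))     # ***, assess
```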

Solution 7 - List

Solution 8 - List

You could always convince the client to have a session of users just constantly posting expletives and make an easy solution to add them to the system. It is a lot of work but it will probably be more representative of the community.

Solution 9 - List

In researching this topic I determined that what was needed was more than just a list that does arbitrary replacements. I have built a web service that allows you to identify the level of 'cleanliness' you desire. It also makes an effort to identify false positives - i.e. where a word may be bad in one context but not in others. Take a look at http://filterlanguage.com

Attributions

All content for this solution is sourced from the original question on Stack Overflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

| Content Type | Original Author | Original Content on Stack Overflow |
| --- | --- | --- |
| Question | ila | View Question on Stack Overflow |
| Solution 1 - List | AgentConundrum | View Answer on Stack Overflow |
| Solution 2 - List | UnkwnTech | View Answer on Stack Overflow |
| Solution 3 - List | David Fraga | View Answer on Stack Overflow |
| Solution 4 - List | Tony | View Answer on Stack Overflow |
| Solution 5 - List | Kibbee | View Answer on Stack Overflow |
| Solution 6 - List | Jon Limjap | View Answer on Stack Overflow |
| Solution 7 - List | Ming-Tang | View Answer on Stack Overflow |
| Solution 8 - List | Ross | View Answer on Stack Overflow |
| Solution 9 - List | Richard | View Answer on Stack Overflow |