How do I replace accented Latin characters in Ruby?

Ruby on-Rails Problem Overview

I have an ActiveRecord model, Foo, which has a name field. I'd like users to be able to search by name, but I'd like the search to ignore case and any accents. Thus, I'm also storing a canonical_name field against which to search:

class Foo < ActiveRecord::Base
  validates_presence_of :name

  # The Rails callback is named before_validation.
  before_validation :set_canonical_name

  private

  def set_canonical_name
    self.canonical_name ||= canonicalize(self.name) if self.name
  end

  def canonicalize(x)
    x.downcase.  # something here
  end
end

I need to fill in the "something here" to replace the accented characters. Is there anything better than

x.downcase.gsub(/[àáâãäå]/,'a').gsub(/æ/,'ae').gsub(/ç/, 'c').gsub(/[èéêë]/,'e')....

And, for that matter, since I'm not on Ruby 1.9, I can't put those Unicode literals in my code. The actual regular expressions will look much uglier.

Ruby on-Rails Solutions

Solution 1 - Ruby on-Rails

ActiveSupport::Inflector.transliterate (requires Rails 2.2.1+ and Ruby 1.9 or 1.8.7)

example:

>> ActiveSupport::Inflector.transliterate("àáâãäå").to_s
=> "aaaaaa"

Solution 2 - Ruby on-Rails

Rails already has a built-in for normalizing. Normalize your string to form KD, then remove the remaining non-ASCII characters (i.e. the accent marks), like this:

>> "àáâãäå".mb_chars.normalize(:kd).gsub(/[^\x00-\x7F]/n,'').downcase.to_s
=> "aaaaaa"
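For readers without ActiveSupport, the same two steps can be sketched in plain Ruby. This assumes Ruby 2.2 or later, where String#unicode_normalize is available; newer Rails versions also point to that method in place of mb_chars.normalize. The helper name ascii_fold is made up for this example.

```ruby
# Plain-Ruby sketch of "normalize to form KD, then drop non-ASCII".
# Assumes Ruby 2.2+ for String#unicode_normalize; ascii_fold is a hypothetical name.
def ascii_fold(text)
  text.unicode_normalize(:nfkd)   # "à" becomes "a" followed by U+0300 (combining grave)
      .gsub(/[^\x00-\x7F]/, "")   # drop the combining marks and any other non-ASCII character
      .downcase
end

puts ascii_fold("àáâãäå")      # prints "aaaaaa"
puts ascii_fold("Olá Mundo!")  # prints "ola mundo!"
```

Note that anything without an ASCII base letter (for example 'æ' or 'ø') is removed entirely by the gsub step.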

Solution 3 - Ruby on-Rails

Better yet is to use I18n:

1.9.3-p392 :001 > require "i18n"
 => false
1.9.3-p392 :002 > I18n.transliterate("Olá Mundo!")
 => "Ola Mundo!"

Solution 4 - Ruby on-Rails

I have tried a lot of these approaches, but each of them failed at least one of these requirements:

Respect spaces
Respect the 'ñ' character
Respect case (I know it is not a requirement of the original question, but it is not difficult to lowercase a string)

My solution has been this:

# coding: utf-8
string.tr(
  "ÀÁÂÃÄÅàáâãäåĀāĂăĄąÇçĆćĈĉĊċČčÐðĎďĐđÈÉÊËèéêëĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħÌÍÎÏìíîïĨĩĪīĬĭĮįİıĴĵĶķĸĹĺĻļĽľĿŀŁłÑñŃńŅņŇňŉŊŋÒÓÔÕÖØòóôõöøŌōŎŏŐőŔŕŖŗŘřŚśŜŝŞşŠšſŢţŤťŦŧÙÚÛÜùúûüŨũŪūŬŭŮůŰűŲųŴŵÝýÿŶŷŸŹźŻżŽž",
  "AAAAAAaaaaaaAaAaAaCcCcCcCcCcDdDdDdEEEEeeeeEeEeEeEeEeGgGgGgGgHhHhIIIIiiiiIiIiIiIiIiJjKkkLlLlLlLlLlNnNnNnNnnNnOOOOOOooooooOoOoOoRrRrRrSsSsSsSssTtTtTtUUUUuuuuUuUuUuUuUuUuWwYyyYyYZzZzZz"
)

– http://blog.slashpoundbang.com/post/12938588984/remove-all-accents-and-diacritics-from-string-in-ruby

You have to modify the character list slightly to respect the 'ñ' character, but that is an easy job.
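A minimal sketch of that modification, using a much shorter character list than the one above and simply leaving 'ñ' and 'Ñ' out of it. The constant and method names are invented for the example. The length check is there because String#tr does not complain when the two lists differ in length.

```ruby
# Illustrative subset only, not the full list from the answer.
# 'ñ' and 'Ñ' are deliberately absent, so tr leaves them alone.
ACCENTED = "ÀÁÂÃÄÅàáâãäåÈÉÊËèéêëÌÍÎÏìíîïÒÓÔÕÖòóôõöÙÚÛÜùúûü"
PLAIN    = "AAAAAAaaaaaaEEEEeeeeIIIIiiiiOOOOOoooooUUUUuuuu"

# tr pairs characters by position, so a length mismatch would silently
# shift every later mapping. Fail early instead.
raise ArgumentError, "lists differ in length" unless ACCENTED.length == PLAIN.length

def fold_accents(string)
  string.tr(ACCENTED, PLAIN)
end

puts fold_accents("El Niño está aquí")  # prints "El Niño esta aqui"
```

Spaces and case are untouched because tr only rewrites the characters that appear in the first list.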

Solution 5 - Ruby on-Rails

My answer: the String#parameterize method:

"Le cœur de la crémiére".parameterize
=> "le-coeur-de-la-cremiere"

For non-Rails programs:

Install activesupport (gem install activesupport), then:

require 'active_support/inflector'

"a&]'s--3\014\xC2àáâã3D".parameterize
# => "a-s-3-3d"

Solution 6 - Ruby on-Rails

I think that maybe you don't really want to go down that path. If you are developing for a market that has these kinds of letters, your users will probably think you are a sort of ...pip, because 'å' isn't even close to 'a' in any meaning to a user. Take a different road and read up on searching in a non-ASCII way. This is just one of those cases for which someone invented Unicode and collation.

A very late PS:

http://www.w3.org/International/wiki/Case_folding
http://www.w3.org/TR/charmod-norm/#sec-WhyNormalization

Besides that, I have no idea why the link to collation goes to an MSDN page, but I'll leave it there. It should have been http://www.unicode.org/reports/tr10/

Solution 7 - Ruby on-Rails

Decompose the string and remove non-spacing marks from it.

irb -ractive_support/all
> "àáâãäå".mb_chars.normalize(:kd).gsub(/\p{Mn}/, '')
aaaaaa

You may also need this magic comment at the top of the file if the code is used in a .rb file.

# coding: utf-8

The normalize(:kd) part splits out diacritics where possible (e.g. the single character "n with tilde" is split into an n followed by a combining tilde character), and the gsub part then removes all the combining diacritical characters.
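To make the split visible, here is a small sketch that prints the code points after decomposition and then strips the marks. It uses String#unicode_normalize from plain Ruby (2.2 or later) rather than mb_chars, which is a substitution on my part, and strip_marks is a made-up helper name.

```ruby
# Sketch: decompose, then remove nonspacing marks (Unicode category Mn).
# Uses plain Ruby's String#unicode_normalize instead of mb_chars.normalize.
def strip_marks(text)
  text.unicode_normalize(:nfkd).gsub(/\p{Mn}/, "")
end

# "ñ" decomposes into a plain "n" plus the combining tilde U+0303.
decomposed = "ñ".unicode_normalize(:nfkd)
p decomposed.codepoints.map { |cp| format("U+%04X", cp) }  # prints ["U+006E", "U+0303"]

puts strip_marks("àáâãäå")          # prints "aaaaaa"
puts strip_marks("Señor Ångström")  # prints "Senor Angstrom"
```

Unlike a filter on the ASCII range, this keeps every character that is not a combining mark, so spaces, punctuation and non-Latin letters survive.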

Solution 8 - Ruby on-Rails

This assumes you use Rails.

"anything".parameterize.underscore.humanize.downcase

Given your requirements, this is probably what I'd do... I think it's neat, simple and will stay up to date in future versions of Rails and Ruby.

Update: dgilperez pointed out that parameterize takes a separator argument, so "anything".parameterize(" ") (deprecated) or "anything".parameterize(separator: " ") is shorter and cleaner.

Solution 9 - Ruby on-Rails

Convert the text to normalization form D, remove all code points with the Unicode category "nonspacing mark" (Mn), and convert it back to normalization form C. This will strip all diacritics, and your problem is reduced to a case-insensitive search.

See http://www.siao2.com/2005/02/19/376617.aspx and http://www.siao2.com/2007/05/14/2629747.aspx for details.
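The three steps could look roughly like this in Ruby. It is a sketch under a few assumptions: Ruby 2.4 or later (for String#casecmp?), and helper names that I made up for illustration.

```ruby
# Sketch of: form D -> drop Mn code points -> back to form C.
# strip_diacritics and same_ignoring_accents_and_case? are hypothetical names.
def strip_diacritics(text)
  text.unicode_normalize(:nfd)   # split base letters from combining marks
      .gsub(/\p{Mn}/, "")        # remove the nonspacing marks
      .unicode_normalize(:nfc)   # recompose whatever is left
end

# With diacritics gone, only a case-insensitive comparison remains.
def same_ignoring_accents_and_case?(a, b)
  strip_diacritics(a).casecmp?(strip_diacritics(b))
end

puts strip_diacritics("Visual Café")                                   # prints "Visual Cafe"
puts same_ignoring_accents_and_case?("Visual Café", "visual cafe")     # prints "true"
puts same_ignoring_accents_and_case?("Visual Café", "visual cafes")    # prints "false"
```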

Solution 10 - Ruby on-Rails

The key is to use two columns in your database: canonical_text and original_text. Use original_text for display and canonical_text for searches. That way, if a user searches for "Visual Cafe," she sees the "Visual Café" result. If she really wants a different item called "Visual Cafe," it can be saved separately.

To get the canonical_text characters in a Ruby 1.8 source file, do something like this:

register_replacement([0x008A].pack('U'), 'S')
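register_replacement is not defined in this answer, so here is one way the idea might be fleshed out: a replacement table whose keys are built from code points with pack('U'), so the source file itself stays ASCII-only. The table contents and the names REPLACEMENTS and canonicalize are assumptions for the example, and the gsub-with-a-Hash form needs Ruby 1.9 or later.

```ruby
# Hypothetical replacement table. Keys are built from code points so that
# no non-ASCII literal has to appear in the source file.
REPLACEMENTS = {
  [0x00E9].pack("U") => "e",   # U+00E9, e with acute
  [0x00E6].pack("U") => "ae",  # U+00E6, ae ligature
  [0x0160].pack("U") => "S",   # U+0160, S with caron
}.freeze

# Replace every character found in the table, then lowercase the result.
def canonicalize(text)
  pattern = Regexp.union(REPLACEMENTS.keys)
  text.gsub(pattern, REPLACEMENTS).downcase
end

original_text  = "Visual Caf\u00E9"
canonical_text = canonicalize(original_text)
puts canonical_text  # prints "visual cafe"
```

original_text is what you would display, and canonical_text is what you would store in the second column and search against.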

Solution 11 - Ruby on-Rails

You probably want Unicode decomposition ("NFD"). After decomposing the string, just filter out anything not in [A-Za-z]. ã will decompose to "a~" (approximately: the diacritic becomes a separate combining character), so the filtering leaves a reasonable approximation. Note that æ has no Unicode decomposition, so the filter drops it rather than turning it into "ae".
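A short sketch of what that filter does in practice, assuming Ruby 2.2 or later for String#unicode_normalize (letters_only is a made-up name). It also shows two side effects worth knowing about: spaces are removed along with everything else outside [A-Za-z], and letters with no canonical decomposition (æ and ø, for example) disappear instead of being approximated.

```ruby
# Sketch: decompose to NFD, then keep only A-Z and a-z.
def letters_only(text)
  text.unicode_normalize(:nfd).gsub(/[^A-Za-z]/, "")
end

puts letters_only("São Tomé")    # prints "SaoTome" (the space is filtered out too)
puts letters_only("Ærøskøbing")  # prints "rskbing" (Æ and ø do not decompose)
```

If spaces or digits matter for your search, widen the character class accordingly.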

Solution 12 - Ruby on-Rails

iconv:

http://groups.google.com/group/ruby-talk-google/browse_frm/thread/8064dcac15d688ce?

=============

A Perl module, which I can't understand:

http://www.ahinea.com/en/tech/accented-translate.html

============

Brute force (there are a lot of those critters!):

http://projects.jkraemer.net/acts_as_ferret/wiki#UTF-8support

http://snippets.dzone.com/posts/show/2384

Solution 13 - Ruby on-Rails

For anyone reading this who wants to strip all non-ASCII characters, this might be useful. I used the first example successfully.

Solution 14 - Ruby on-Rails

I had problems getting the foo.mb_chars.normalize(:kd).gsub(/[^\x00-\x7F]/n,'').downcase.to_s solution to work. I'm not using Rails and there was some conflict with my activesupport/ruby versions that I couldn't get to the bottom of.

Using the ruby-unf gem seems to be a good substitute:

require 'unf'
foo.to_nfd.gsub(/[^\x00-\x7F]/n,'').downcase

As far as I can tell this does the same thing as .mb_chars.normalize(:kd). Is this correct? Thanks!
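On the question of whether the two are the same: form D (what to_nfd produces) and form KD differ for compatibility characters, though not for ordinary accented Latin letters. The sketch below uses plain Ruby's String#unicode_normalize as a stand-in for both the unf gem and mb_chars, which is my substitution, and ascii_only is a made-up helper name.

```ruby
# Compare NFD and NFKD followed by the same "drop non-ASCII" filter.
def ascii_only(text, form)
  text.unicode_normalize(form).gsub(/[^\x00-\x7F]/, "").downcase
end

# For plain accented letters the two forms give the same result.
puts ascii_only("àáâãäå", :nfd)   # prints "aaaaaa"
puts ascii_only("àáâãäå", :nfkd)  # prints "aaaaaa"

# U+FB01 is the "fi" ligature; only the compatibility form expands it.
word = "\uFB01anc\u00E9"
puts ascii_only(word, :nfd)   # prints "ance"
puts ascii_only(word, :nfkd)  # prints "fiance"
```

So for names made of accented Latin letters the substitute behaves the same, but ligatures, full-width forms and similar compatibility characters are lost under form D.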

Solution 15 - Ruby on-Rails

If you are using PostgreSQL >= 9.4 as your database, you could enable its "unaccent" extension in a migration. I think it does what you want, like this:

def self.up
   enable_extension "unaccent" # Does not fail if it already exists
end

In order to test, in the console:

2.3.1 :045 > ActiveRecord::Base.connection.execute("SELECT unaccent('unaccent', 'àáâãäåÁÄ')").first
 => {"unaccent"=>"aaaaaaAA"}

Note that the result is still case sensitive at this point.

Then, maybe use it in a scope, like:

scope :with_canonical_name, -> (name) {
   # Bind the value instead of interpolating it, to avoid SQL injection.
   where("unaccent(foos.name) ILIKE unaccent(?)", name)
}

The ILIKE operator makes the search case insensitive. Another approach is to use the citext data type. Here is a discussion about these two approaches. Note also that using PostgreSQL's lower() function is not recommended.

This will save you some DB space, since you will no longer need the canonical_name field, and perhaps make your model simpler, at the cost of some extra processing in each query. How much depends on whether you are using ILIKE or citext, and on your dataset.

If you are using MySQL, maybe you can use this simple solution, but I have not tested it.

Solution 16 - Ruby on-Rails

lol.. I just tried this, and it is working. I'm still not quite sure why, but when I use these 4 lines of code:

str = str.gsub(/[^a-zA-Z0-9 ]/,"")
str = str.gsub(/[ ]+/," ")
str = str.gsub(/ /,"-")
str = str.downcase

it automatically removes any accent from filenames, which is what I was trying to do (remove the accents from filenames and then rename them). Hope it helped :)
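A possible explanation, offered as a guess rather than something the answer states: the first gsub deletes every character outside a-z, A-Z, 0-9 and space. If a filename is stored in decomposed form (some file systems do this), each accent is a separate combining character, so deleting it leaves the base letter behind. With precomposed characters the whole letter is deleted instead. The sketch wraps the four lines in a method (the name is made up) and runs both cases.

```ruby
# The four lines from the answer, wrapped in a method for testing.
def slugify_ascii_only(str)
  str = str.gsub(/[^a-zA-Z0-9 ]/, "")  # delete everything except ASCII letters, digits, spaces
  str = str.gsub(/[ ]+/, " ")          # collapse runs of spaces
  str = str.gsub(/ /, "-")             # spaces become hyphens
  str.downcase
end

precomposed = "Cr\u00E8me br\u00FBl\u00E9e.txt"     # accented letters as single code points
decomposed  = precomposed.unicode_normalize(:nfd)   # base letter + combining mark

puts slugify_ascii_only(precomposed)  # prints "crme-brletxt"
puts slugify_ascii_only(decomposed)   # prints "creme-bruleetxt"
```

Note that the dot before the extension is removed as well, so this is only safe for the base name of a file.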

Content Type	Original Author	Original Content on Stackoverflow
Question	James A. Rosen	View Question on Stackoverflow
Solution 1 - Ruby on-Rails	Mark Wilden	View Answer on Stackoverflow
Solution 2 - Ruby on-Rails	unexist	View Answer on Stackoverflow
Solution 3 - Ruby on-Rails	Diego Moreira	View Answer on Stackoverflow
Solution 4 - Ruby on-Rails	fguillen	View Answer on Stackoverflow
Solution 5 - Ruby on-Rails	Dorian	View Answer on Stackoverflow
Solution 6 - Ruby on-Rails	Jonke	View Answer on Stackoverflow
Solution 7 - Ruby on-Rails	Cheng	View Answer on Stackoverflow
Solution 8 - Ruby on-Rails	Sudhir Jonathan	View Answer on Stackoverflow
Solution 9 - Ruby on-Rails	CesarB	View Answer on Stackoverflow
Solution 10 - Ruby on-Rails	James A. Rosen	View Answer on Stackoverflow
Solution 11 - Ruby on-Rails	MSalters	View Answer on Stackoverflow
Solution 12 - Ruby on-Rails	Gene T	View Answer on Stackoverflow
Solution 13 - Ruby on-Rails	Kris	View Answer on Stackoverflow
Solution 14 - Ruby on-Rails	eoghan.ocarragain	View Answer on Stackoverflow
Solution 15 - Ruby on-Rails	user2553863	View Answer on Stackoverflow
Solution 16 - Ruby on-Rails	Jozef	View Answer on Stackoverflow