How to make a Ruby string safe for a filesystem?

Ruby on-Rails Problem Overview


I use user-supplied strings as filenames. Of course this is not a good idea as-is, so I want to replace everything except [a-z], [A-Z], [0-9], _ and - with an underscore.

For instance:

my§document$is°°   very&interesting___thisIs%nice445.doc.pdf

should become

my_document_is_____very_interesting___thisIs_nice445_doc.pdf

and then ideally

my_document_is_very_interesting_thisIs_nice445_doc.pdf

Is there a nice and elegant way for doing this?

Ruby on-Rails Solutions


Solution 1 - Ruby on-Rails

I'd like to suggest a solution that differs from the old one (Solution 2 below). Note that the old one uses the deprecated returning. It is also specific to Rails, and you didn't explicitly mention Rails in your question (only as a tag). In addition, the existing solution fails to turn .doc.pdf into _doc.pdf, as you requested, and it doesn't collapse runs of underscores into one.

Here's my solution:

def sanitize_filename(filename)
  # Split the name at a period that is preceded by some character and
  # followed by some character other than a period, provided there is
  # no later period that is followed by something other than a period
  # (yeah, confusing, I know)
  fn = filename.split(/(?<=.)\.(?=[^.])(?!.*\.[^.])/m)

  # We now have one or two parts (depending on whether we could find
  # a suitable period). For each of these parts, replace any unwanted
  # sequence of characters with an underscore
  fn.map! { |s| s.gsub(/[^a-z0-9\-]+/i, '_') }

  # Finally, join the parts with a period and return the result
  fn.join('.')
end

You haven't specified all the details about the conversion. Thus, I'm making the following assumptions:

  • There should be at most one filename extension, which means that there should be at most one period in the filename
  • Trailing periods do not mark the start of an extension
  • Leading periods do not mark the start of an extension
  • Any sequence of characters other than A-Z, a-z, 0-9 and - should be collapsed into a single _ (i.e. the underscore is itself regarded as a disallowed character, and the string '$%__°#' would become '_', rather than '___' from the parts '$%', '__' and '°#')

The complicated part is splitting the filename into the main part and the extension. With the help of a regular expression, I search for the last period that is followed by something other than a period, such that no later period in the string meets the same criterion. It must also be preceded by some character, which ensures that it is not the first character in the string.

My results from testing the function:

1.9.3p125 :006 > sanitize_filename 'my§document$is°°   very&interesting___thisIs%nice445.doc.pdf'
 => "my_document_is_very_interesting_thisIs_nice445_doc.pdf"

which I think is what you requested. I hope this is nice and elegant enough.
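
To see how the splitting step behaves on its own, here is a small sketch that applies the same regular expression to a few edge cases from the assumptions listed above. The constant EXTENSION_SPLIT and the helper split_extension are hypothetical names introduced only for illustration; the regular expression itself is the one used in sanitize_filename.

```ruby
# The extension-splitting regex from sanitize_filename, stored under a
# hypothetical constant name so it can be tried in isolation.
EXTENSION_SPLIT = /(?<=.)\.(?=[^.])(?!.*\.[^.])/m

# Hypothetical helper: returns one or two parts (name and extension).
def split_extension(filename)
  filename.split(EXTENSION_SPLIT)
end

split_extension('report.doc.pdf')  # => ["report.doc", "pdf"]  (only the last period splits)
split_extension('.bashrc')         # => [".bashrc"]            (leading period is ignored)
split_extension('archive.')        # => ["archive."]           (trailing period is ignored)
split_extension('README')          # => ["README"]             (no period at all)
```

Any period left inside a part is then replaced by the gsub step, which is why 'report.doc.pdf' ends up as "report_doc.pdf".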

Solution 2 - Ruby on-Rails

From http://web.archive.org/web/20110529023841/http://devblog.muziboo.com/2008/06/17/attachment-fu-sanitize-filename-regex-and-unicode-gotcha/:

def sanitize_filename(filename)
  returning filename.strip do |name|
   # NOTE: File.basename doesn't work right with Windows paths on Unix
   # get only the filename, not the whole path
   name.gsub!(/^.*(\\|\/)/, '')

   # Strip out the non-ascii character
   name.gsub!(/[^0-9A-Za-z.\-]/, '_')
  end
end
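
Since returning is deprecated and is not part of plain Ruby, the same logic can be sketched with Ruby's built-in Object#tap instead. This is only an illustration under that assumption, not the original author's code; the method name sanitize_filename_with_tap is hypothetical.

```ruby
# Hypothetical plain-Ruby variant of the snippet above, using Object#tap
# in place of the deprecated Rails helper `returning`.
def sanitize_filename_with_tap(filename)
  # strip returns a new string, so the caller's string is not modified
  filename.strip.tap do |name|
    # Keep only the filename, not the whole path (handles / and \)
    name.gsub!(/^.*(\\|\/)/, '')

    # Replace everything except ASCII letters, digits, periods and hyphens
    name.gsub!(/[^0-9A-Za-z.\-]/, '_')
  end
end

sanitize_filename_with_tap('  C:\\docs\\my file.txt ')  # => "my_file.txt"
sanitize_filename_with_tap('/tmp/a$b.pdf')              # => "a_b.pdf"
```

As in the original, periods are kept and runs of underscores are not collapsed.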

Solution 3 - Ruby on-Rails

In Rails you might also be able to use ActiveStorage::Filename#sanitized:

ActiveStorage::Filename.new("foo:bar.jpg").sanitized # => "foo-bar.jpg"
ActiveStorage::Filename.new("foo/bar.jpg").sanitized # => "foo-bar.jpg"

Solution 4 - Ruby on-Rails

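If you use Rails you can also use String#parameterize. It is not intended for filenames in particular, but the result is usable. Note that it also lowercases the string and replaces periods with the separator, so the extension is not preserved.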

"my§document$is°°   very&interesting___thisIs%nice445.doc.pdf".parameterize

Solution 5 - Ruby on-Rails

For Rails I found myself wanting to keep any file extensions but using parameterize for the remainder of the characters:

filename = "my§doc$is°° very&itng___thsIs%nie445.doc.pdf"
cleaned = filename.split(".").map(&:parameterize).join(".")

For implementation details and ideas, see the source: https://github.com/rails/rails/blob/master/activesupport/lib/active_support/inflector/transliterate.rb

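def parameterize(string, separator: "-", preserve_case: false)
  # Replace accented chars with their ASCII equivalents.
  parameterized_string = transliterate(string)

  # Turn unwanted chars into the separator.
  parameterized_string.gsub!(/[^a-z0-9\-_]+/i, separator)
  #... some more stuff
end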
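
Outside of Rails, the same split/map/join idea can be approximated in plain Ruby. The sketch below is a rough stand-in, not ActiveSupport's implementation: rough_parameterize and clean_filename are hypothetical names, and unlike the real parameterize it does not transliterate accented characters (they are simply replaced by the separator).

```ruby
# Hypothetical plain-Ruby approximation of the separator handling in
# parameterize. Assumption: no transliteration is performed.
def rough_parameterize(string, separator: "-")
  # Turn every run of unwanted characters into the separator.
  result = string.gsub(/[^a-z0-9\-_]+/i, separator)

  unless separator.empty?
    sep = Regexp.escape(separator)
    # Collapse repeated separators, then trim them from both ends.
    result = result.gsub(/(?:#{sep}){2,}/, separator)
    result = result.gsub(/\A(?:#{sep})+|(?:#{sep})+\z/, "")
  end

  result.downcase
end

# Clean each dot-separated part on its own, so the periods survive.
def clean_filename(filename)
  filename.split(".").map { |part| rough_parameterize(part) }.join(".")
end

clean_filename("My Doc (final).Draft.PDF")  # => "my-doc-final.draft.pdf"
```

Because every period is kept, a name such as "a.doc.pdf" still contains two periods afterwards, which differs from what the question asks for.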

Solution 6 - Ruby on-Rails

If your goal is just to generate a filename that is "safe" to use on all operating systems (and not to remove every non-ASCII character), then I would recommend the zaru gem. It doesn't do everything the original question specifies, but the filename it produces should be safe to use, and any filename-safe Unicode characters are left untouched:

Zaru.sanitize! "  what\ēver//wëird:user:înput:"
# => "whatēverwëirduserînput"
Zaru.sanitize! "my§docu*ment$is°°   very&interes:ting___thisIs%nice445.doc.pdf" 
# => "my§document$is°° very&interesting___thisIs%nice445.doc.pdf"

Solution 7 - Ruby on-Rails

There is a library that may be helpful, especially if you're interested in replacing unusual Unicode characters with ASCII equivalents: unidecoder.

irb(main):001:0> require 'unidecoder'
=> true
irb(main):004:0> "Grzegżółka".to_ascii
=> "Grzegzolka"

Attributions

All content for this solution is sourced from the original question on Stack Overflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type                  Original Author    Original Content on Stack Overflow
Question                      marcgg             View Question on Stack Overflow
Solution 1 - Ruby on-Rails    Anders Sjöqvist    View Answer on Stack Overflow
Solution 2 - Ruby on-Rails    miku               View Answer on Stack Overflow
Solution 3 - Ruby on-Rails    morgler            View Answer on Stack Overflow
Solution 4 - Ruby on-Rails    albandiguer        View Answer on Stack Overflow
Solution 5 - Ruby on-Rails    Blair Anderson     View Answer on Stack Overflow
Solution 6 - Ruby on-Rails    David              View Answer on Stack Overflow
Solution 7 - Ruby on-Rails    Jan Warchoł        View Answer on Stack Overflow