What's the best way to parse a tab-delimited file in Ruby?

RubyTsv

Ruby Problem Overview


What's the best (most efficient) way to parse a tab-delimited file in Ruby?

Ruby Solutions


Solution 1 - Ruby

The Ruby CSV library lets you specify the field delimiter. Ruby 1.9 uses FasterCSV. Something like this would work:

require "csv"
parsed_file = CSV.read("path-to-file.csv", col_sep: "\t")

Solution 2 - Ruby

The rules for TSV are actually a bit different from CSV. The main difference is that CSV has provisions for sticking a comma inside a field and then using quotation characters and escaping quotes inside a field. I wrote a quick example to show how the simple response fails:

require 'csv'
line = 'boogie\ttime\tis "now"'
begin
  line = CSV.parse_line(line, col_sep: "\t")
  puts "parsed correctly"
rescue CSV::MalformedCSVError
  puts "failed to parse line"
end

begin
  line = CSV.parse_line(line, col_sep: "\t", quote_char: "Ƃ")
  puts "parsed correctly with random quote char"
rescue CSV::MalformedCSVError
  puts "failed to parse line with random quote char"
end

#Output:
# failed to parse line
# parsed correctly with random quote char

If you want to use the CSV library you could used a random quote character that you don't expect to see if your file (the example shows this), but you could also use a simpler methodology like the StrictTsv class shown below to get the same effect without having to worry about field quotations.

# The main parse method is mostly borrowed from a tweet by @JEG2
class StrictTsv
  attr_reader :filepath
  def initialize(filepath)
    @filepath = filepath
  end
 
  def parse
    open(filepath) do |f|
      headers = f.gets.strip.split("\t")
      f.each do |line|
        fields = Hash[headers.zip(line.split("\t"))]
        yield fields
      end
    end
  end
end

# Example Usage
tsv = Vendor::StrictTsv.new("your_file.tsv")
tsv.parse do |row|
  puts row['named field']
end

The choice of using the CSV library or something more strict just depends on who is sending you the file and whether they are expecting to adhere to the strict TSV standard.

Details about the TSV standard can be found at http://en.wikipedia.org/wiki/Tab-separated_values

Solution 3 - Ruby

There are actually two different kinds of TSV files.

  1. TSV files that are actually CSV files with a delimiter set to Tab. This is something you'll get when you e.g. save an Excel spreadsheet as "UTF-16 Unicode Text". Such files use CSV quoting rules, which means that fields may contain tabs and newlines, as long as they are quoted, and literal double quotes are written twice. The easiest way to parse everything correctly is to use the csv gem:

     use 'csv'
     parsed = CSV.read("file.tsv", col_sep: "\t")
    
  2. TSV files conforming to the IANA standard. Tabs and newlines are not allowed as field values, and there is no quoting whatsoever. This is something you will get when you e.g. select a whole Excel spreadsheet and paste it into a text file (beware: it will get messed up if some cells do contain tabs or newlines). Such TSV files can be easily parsed line-by-line with a simple line.rstrip.split("\t", -1) (note -1, which prevents split from removing empty trailing fields). If you want to use the csv gem, simply set quote_char to nil:

     use 'csv'
     parsed = CSV.read("file.tsv", col_sep: "\t", quote_char: nil)
    

Solution 4 - Ruby

I like mmmries answer. HOWEVER, I hate the way that ruby strips off any empty values off of the end of a split. It isn't stripping off the newline at the end of the lines, either.

Also, I had a file with potential newlines within a field. So, I rewrote his 'parse' as follows:

def parse
  open(filepath) do |f|
    headers = f.gets.strip.split("\t")
    f.each do |line|
      myline=line
      while myline.scan(/\t/).count != headers.count-1
        myline+=f.gets
      end
      fields = Hash[headers.zip(myline.chomp.split("\t",headers.count))]
      yield fields
    end
  end
end

This concatenates any lines as necessary to get a full line of data, and always returns the full set of data (without potential nil entries at the end).

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionmbmView Question on Stackoverflow
Solution 1 - RubyjergasonView Answer on Stackoverflow
Solution 2 - RubymmmriesView Answer on Stackoverflow
Solution 3 - RubymichauView Answer on Stackoverflow
Solution 4 - RubyJim MacKenzieView Answer on Stackoverflow