Parsing XLS and XLSX (MS Excel) files with Ruby?

RubyExcel

Ruby Problem Overview


Are there any gems able to parse XLS and XLSX files? I've found Spreadsheet and ParseExcel, but they both don't understand XLSX format.

Ruby Solutions


Solution 1 - Ruby

I recently needed to parse some Excel files with Ruby. The abundance of libraries and options turned out to be confusing, so I wrote a blog post about it.

Here is a table of different Ruby libraries and what they support:

enter image description here

If you care about performance, here is how the xlsx libraries compare: enter image description here

I have sample code to read xlsx files with each supported library here

Here are some examples for reading xlsx files with some different libraries:

rubyXL

require 'rubyXL'

workbook = RubyXL::Parser.parse './sample_excel_files/xlsx_500_rows.xlsx'
worksheets = workbook.worksheets
puts "Found #{worksheets.count} worksheets"

worksheets.each do |worksheet|
  puts "Reading: #{worksheet.sheet_name}"
  num_rows = 0
  worksheet.each do |row|
    row_cells = row.cells.map{ |cell| cell.value }
    num_rows += 1
  end
  puts "Read #{num_rows} rows"
end

roo

require 'roo'

workbook = Roo::Spreadsheet.open './sample_excel_files/xlsx_500_rows.xlsx'
worksheets = workbook.sheets
puts "Found #{worksheets.count} worksheets"

worksheets.each do |worksheet|
  puts "Reading: #{worksheet}"
  num_rows = 0
  workbook.sheet(worksheet).each_row_streaming do |row|
    row_cells = row.map { |cell| cell.value }
    num_rows += 1
  end
  puts "Read #{num_rows} rows" 
end

creek

require 'creek'

workbook = Creek::Book.new './sample_excel_files/xlsx_500_rows.xlsx'
worksheets = workbook.sheets
puts "Found #{worksheets.count} worksheets"

worksheets.each do |worksheet|
  puts "Reading: #{worksheet.name}"
  num_rows = 0
  worksheet.rows.each do |row|
    row_cells = row.values
    num_rows += 1
  end
  puts "Read #{num_rows} rows"
end

simple_xlsx_reader

require 'simple_xlsx_reader'

workbook = SimpleXlsxReader.open './sample_excel_files/xlsx_500000_rows.xlsx'
worksheets = workbook.sheets
puts "Found #{worksheets.count} worksheets"

worksheets.each do |worksheet|
  puts "Reading: #{worksheet.name}"
  num_rows = 0
  worksheet.rows.each do |row|
    row_cells = row
    num_rows += 1
  end
  puts "Read #{num_rows} rows"
end

Here is an example of reading a legacy xls file using the spreadsheet library:

spreadsheet

require 'spreadsheet'

# Note: spreadsheet only supports .xls files (not .xlsx)
workbook = Spreadsheet.open './sample_excel_files/xls_500_rows.xls'
worksheets = workbook.worksheets
puts "Found #{worksheets.count} worksheets"

worksheets.each do |worksheet|
  puts "Reading: #{worksheet.name}"
  num_rows = 0
  worksheet.rows.each do |row|
    row_cells = row.to_a.map{ |v| v.methods.include?(:value) ? v.value : v }
    num_rows += 1
  end
  puts "Read #{num_rows} rows"
end

Solution 2 - Ruby

Just found roo, that might do the job - works for my requirements, reading a basic spreadsheet.

Solution 3 - Ruby

The roo gem works great for Excel (.xls and .xlsx) and it's being actively developed.

I agree the syntax is not great nor ruby-like. But that can be easily achieved with something like:

class Spreadsheet
  def initialize(file_path)
    @xls = Roo::Spreadsheet.open(file_path)
  end

  def each_sheet
    @xls.sheets.each do |sheet|
      @xls.default_sheet = sheet
      yield sheet
    end
  end

  def each_row
    0.upto(@xls.last_row) do |index|
      yield @xls.row(index)
    end
  end

  def each_column
    0.upto(@xls.last_column) do |index|
      yield @xls.column(index)
    end
  end
end

Solution 4 - Ruby

I'm using creek which uses nokogiri. It is fast. Used 8.3 seconds on a 21x11250 xlsx table on my Macbook Air. Got it to work on ruby 1.9.3+. The output format for each row is a hash of row and column name to cell content: {"A1"=>"a cell", "B1"=>"another cell"} The hash makes no guarantee that the keys will be in the original column order. https://github.com/pythonicrubyist/creek

dullard is another great one that uses nokogiri. It is super fast. Used 6.7 seconds on a 21x11250 xlsx table on my Macbook Air. Got it to work on ruby 2.0.0+. The output format for each row is an array: ["a cell", "another cell"] https://github.com/thirtyseven/dullard

simple_xlsx_reader which has been mentioned is great, a bit slow. Used 91 seconds on a 21x11250 xlsx table on my Macbook Air. Got it to work on ruby 1.9.3+. The output format for each row is an array: ["a cell", "another cell"] https://github.com/woahdae/simple_xlsx_reader

Another interesting one is oxcelix. It uses ox's SAX parser which supposedly faster than both nokogiri's DOM and SAX parser. It supposedly outputs a Matrix. I could not get it to work. Also, there were some dependency issues with rubyzip. Would not recommend it.

In conclusion, creek seems like a good choice. Other posts recommend simple_xlsx_parser as it has similar performance.

Removed dullard as recommended as it's outdated and people are getting errors/having problems with it.

Solution 5 - Ruby

If you're looking for more modern libraries, take a look at Spreadsheet: http://spreadsheet.rubyforge.org/GUIDE_txt.html. I can't tell if it supports XLSX files, but considering that it is actively developed, I'm guessing it does (I'm not on Windows, or with Office, so I can't test).

At this point, it looks like roo is a good option again. It supports XLSX, allows (some) iteration by just using times with cell access. I admit, it's not pretty though.

Also, RubyXL can now give you a sort of iteration using their extract_data method, which gives you a 2d array of data, which can be easily iterated over.

Alternatively, if you're trying to work with XLSX files on Windows, you can use Ruby's Win32OLE library that allows you to interface with OLE objects, like the ones provided by Word and Excel. However, as @PanagiotisKanavos mentioned in the comments, this has a few major drawbacks:

  • Excel must be installed
  • A new Excel instance is started for each document
  • Memory and other resource consumption is far more than what is necessary for simple XLSX document manipulation.

But if you choose to use it, you can choose not to display Excel, load your XLSX file, and access it through it. I'm not sure if it supports iteration, however, I don't think it would be too hard to build around the supplied methods, as it is the full Microsoft OLE API for Excel. Here's the documentation: http://support.microsoft.com/kb/222101 Here's the gem: http://www.ruby-doc.org/stdlib-1.9.3/libdoc/win32ole/rdoc/WIN32OLE.html

Again, the options don't look much better, but there isn't much else out there, I'm afraid. it's hard to parse a file format that is a black box. And those few who managed to break it didn't do it that visibly. Google Docs is closed source, and LibreOffice is thousands of lines of harry C++.

Solution 6 - Ruby

I've been working heavily with both Spreadsheet and rubyXL these past couple weeks and I must say that both are great tools. However, one area that both suffer is the lack of examples on actually implementing anything useful. Currently I'm building a crawler and using rubyXL to parse xlsx files and Spreadsheet for anything xls. I hope the code below can serve as a helpful example and show just how effective these tools can be.

require 'find'
require 'rubyXL'

count = 0

Find.find('/Users/Anconia/crawler/') do |file|             # begin iteration of each file of a specified directory
  if file =~ /\b.xlsx$\b/                                  # check if file is xlsx format
    workbook = RubyXL::Parser.parse(file).worksheets       # creates an object containing all worksheets of an excel workbook
    workbook.each do |worksheet|                           # begin iteration over each worksheet
      data = worksheet.extract_data.to_s                   # extract data of a given worksheet - must be converted to a string in order to match a regex
      if data =~ /regex/
        puts file
        count += 1
      end      
    end
  end
end

puts "#{count} files were found"

require 'find'
require 'spreadsheet'
Spreadsheet.client_encoding = 'UTF-8'

count = 0

Find.find('/Users/Anconia/crawler/') do |file|             # begin iteration of each file of a specified directory
  if file =~ /\b.xls$\b/                                   # check if a given file is xls format
    workbook =  Spreadsheet.open(file).worksheets          # creates an object containing all worksheets of an excel workbook
    workbook.each do |worksheet|                           # begin iteration over each worksheet
      worksheet.each do |row|                              # begin iteration over each row of a worksheet
        if row.to_s =~ /regex/                             # rows must be converted to strings in order to match the regex
          puts file
          count += 1
        end
      end
    end
  end
end

puts "#{count} files were found"

Solution 7 - Ruby

The rubyXL gem parses XLSX files beautifully.

Solution 8 - Ruby

I couldn't find a satisfactory xlsx parser. RubyXL doesn't do date typecasting, Roo tried to typecast a number as a date, and both are a mess both in api and code.

So, I wrote simple_xlsx_reader. You'd have to use something else for xls, though, so maybe it's not the full answer you're looking for.

Solution 9 - Ruby

Most of the online examples including the author's website for the Spreadsheet gem demonstrate reading the entire contents of an Excel file into RAM. That's fine if your spreadsheet is small.

xls = Spreadsheet.open(file_path)

For anyone working with very large files, a better way is to stream-read the contents of the file. The Spreadsheet gem supports this--albeit not well documented at this time (circa 3/2015).

Spreadsheet.open(file_path).worksheets.first.rows do |row|
  # do something with the array of CSV data
end

CITE: https://github.com/zdavatz/spreadsheet

Solution 10 - Ruby

The RemoteTable library uses roo internally. It makes it easy to read spreadsheets of different formats (XLS, XLSX, CSV, etc. possibly remote, possibly stored inside a zip, gz, etc.):

require 'remote_table'
r = RemoteTable.new 'http://www.fueleconomy.gov/FEG/epadata/02data.zip', :filename => 'guide_jan28.xls'
r.each do |row|
  puts row.inspect
end

Output:

{"Class"=>"TWO SEATERS", "Manufacturer"=>"ACURA", "carline name"=>"NSX", "displ"=>"3.0", "cyl"=>"6.0", "trans"=>"Auto(S4)", "drv"=>"R", "bidx"=>"60.0", "cty"=>"17.0", "hwy"=>"24.0", "cmb"=>"20.0", "ucty"=>"19.1342", "uhwy"=>"30.2", "ucmb"=>"22.9121", "fl"=>"P", "G"=>"", "T"=>"", "S"=>"", "2pv"=>"", "2lv"=>"", "4pv"=>"", "4lv"=>"", "hpv"=>"", "hlv"=>"", "fcost"=>"1238.0", "eng dscr"=>"DOHC-VTEC", "trans dscr"=>"2MODE", "vpc"=>"4.0", "cls"=>"1.0"}
{"Class"=>"TWO SEATERS", "Manufacturer"=>"ACURA", "carline name"=>"NSX", "displ"=>"3.2", "cyl"=>"6.0", "trans"=>"Manual(M6)", "drv"=>"R", "bidx"=>"65.0", "cty"=>"17.0", "hwy"=>"24.0", "cmb"=>"19.0", "ucty"=>"18.7", "uhwy"=>"30.4", "ucmb"=>"22.6171", "fl"=>"P", "G"=>"", "T"=>"", "S"=>"", "2pv"=>"", "2lv"=>"", "4pv"=>"", "4lv"=>"", "hpv"=>"", "hlv"=>"", "fcost"=>"1302.0", "eng dscr"=>"DOHC-VTEC", "trans dscr"=>"", "vpc"=>"4.0", "cls"=>"1.0"}
{"Class"=>"TWO SEATERS", "Manufacturer"=>"ASTON MARTIN", "carline name"=>"ASTON MARTIN VANQUISH", "displ"=>"5.9", "cyl"=>"12.0", "trans"=>"Auto(S6)", "drv"=>"R", "bidx"=>"1.0", "cty"=>"12.0", "hwy"=>"19.0", "cmb"=>"14.0", "ucty"=>"13.55", "uhwy"=>"24.7", "ucmb"=>"17.015", "fl"=>"P", "G"=>"G", "T"=>"", "S"=>"", "2pv"=>"", "2lv"=>"", "4pv"=>"", "4lv"=>"", "hpv"=>"", "hlv"=>"", "fcost"=>"1651.0", "eng dscr"=>"GUZZLER", "trans dscr"=>"CLKUP", "vpc"=>"4.0", "cls"=>"1.0"}

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionDanielView Question on Stackoverflow
Solution 1 - RubymattnedrichView Answer on Stackoverflow
Solution 2 - RubyChris KimptonView Answer on Stackoverflow
Solution 3 - RubyBruno BuccoloView Answer on Stackoverflow
Solution 4 - RubyfrediyView Answer on Stackoverflow
Solution 5 - RubyLinuxiosView Answer on Stackoverflow
Solution 6 - RubyAnconiaView Answer on Stackoverflow
Solution 7 - RubyJason GalvinView Answer on Stackoverflow
Solution 8 - RubyWoahdaeView Answer on Stackoverflow
Solution 9 - Rubyscarver2View Answer on Stackoverflow
Solution 10 - RubySeamus AbshereView Answer on Stackoverflow