Reading Excel File using Python, how do I get the values of a specific column with indicated column name?

PythonExcelXlrd

Python Problem Overview


I've an Excel File:

Arm_id	    DSPName	       DSPCode	        HubCode	         PinCode	PPTL
1	         JaVAS	          01	          AGR	          282001	1,2
2	         JaVAS	          01	          AGR	          282002	3,4
3	         JaVAS	          01	          AGR	          282003	5,6

I want to save a string in the form Arm_id,DSPCode,Pincode. This format is configurable, i.e. it might change to DSPCode,Arm_id,Pincode. I save it in a list like:

FORMAT = ['Arm_id', 'DSPName', 'Pincode']

How do I read the content of a specific column with provided name, given that the FORMAT is configurable?

This is what I tried. Currently I'm able to read all the content in the file

from xlrd import open_workbook
wb = open_workbook('sample.xls')
for s in wb.sheets():
	#print 'Sheet:',s.name
	values = []
	for row in range(s.nrows):
		col_value = []
		for col in range(s.ncols):
			value  = (s.cell(row,col).value)
			try : value = str(int(value))
			except : pass
			col_value.append(value)
		values.append(col_value)
print values

My output is :

[
    [u'Arm_id', u'DSPName', u'DSPCode', u'HubCode', u'PinCode', u'PPTL'],
    ['1', u'JaVAS', '1', u'AGR', '282001', u'1,2'], 
    ['2', u'JaVAS', '1', u'AGR', '282002', u'3,4'], 
    ['3', u'JaVAS', '1', u'AGR', '282003', u'5,6']
]

Then I loop around values[0] trying to find out the FORMAT content in values[0] and then getting the index of Arm_id, DSPname and Pincode in the values[0] and then from next loop I know the index of all the FORMAT factors , thereby getting to know which value do I need to get .

But this is such a poor solution.

How do I get the values of a specific column with name in excel file?

Python Solutions


Solution 1 - Python

A somewhat late answer, but with pandas, it is possible to get directly a column of an excel file:

import pandas

df = pandas.read_excel('sample.xls')
#print the column names
print df.columns
#get the values for a given column
values = df['Arm_id'].values
#get a data frame with selected columns
FORMAT = ['Arm_id', 'DSPName', 'Pincode']
df_selected = df[FORMAT]

Make sure you have installed xlrd and pandas:

pip install pandas xlrd

Solution 2 - Python

This is one approach:

from xlrd import open_workbook

class Arm(object):
    def __init__(self, id, dsp_name, dsp_code, hub_code, pin_code, pptl):
        self.id = id
        self.dsp_name = dsp_name
        self.dsp_code = dsp_code
        self.hub_code = hub_code
        self.pin_code = pin_code
        self.pptl = pptl
        
    def __str__(self):
        return("Arm object:\n"
               "  Arm_id = {0}\n"
               "  DSPName = {1}\n"
               "  DSPCode = {2}\n"
               "  HubCode = {3}\n"
               "  PinCode = {4} \n"
               "  PPTL = {5}"
               .format(self.id, self.dsp_name, self.dsp_code,
                       self.hub_code, self.pin_code, self.pptl))
        
wb = open_workbook('sample.xls')
for sheet in wb.sheets():
    number_of_rows = sheet.nrows
    number_of_columns = sheet.ncols
    
    items = []

    rows = []
    for row in range(1, number_of_rows):
        values = []
        for col in range(number_of_columns):
            value  = (sheet.cell(row,col).value)
            try:
                value = str(int(value))
            except ValueError:
                pass
            finally:
                values.append(value)
        item = Arm(*values)
        items.append(item)

for item in items:
    print item
    print("Accessing one single value (eg. DSPName): {0}".format(item.dsp_name))
    print

You don't have to use a custom class, you can simply take a dict(). If you use a class however, you can access all values via dot-notation, as you see above.

Here is the output of the script above:

Arm object:
  Arm_id = 1
  DSPName = JaVAS
  DSPCode = 1
  HubCode = AGR
  PinCode = 282001 
  PPTL = 1
Accessing one single value (eg. DSPName): JaVAS

Arm object:
  Arm_id = 2
  DSPName = JaVAS
  DSPCode = 1
  HubCode = AGR
  PinCode = 282002 
  PPTL = 3
Accessing one single value (eg. DSPName): JaVAS

Arm object:
  Arm_id = 3
  DSPName = JaVAS
  DSPCode = 1
  HubCode = AGR
  PinCode = 282003 
  PPTL = 5
Accessing one single value (eg. DSPName): JaVAS

Solution 3 - Python

So the key parts are to grab the header ( col_names = s.row(0) ) and when iterating through the rows, to skip the first row which isn't needed for row in range(1, s.nrows) - done by using range from 1 onwards (not the implicit 0). You then use zip to step through the rows holding 'name' as the header of the column.

from xlrd import open_workbook

wb = open_workbook('Book2.xls')
values = []
for s in wb.sheets():
    #print 'Sheet:',s.name
    for row in range(1, s.nrows):
        col_names = s.row(0)
        col_value = []
        for name, col in zip(col_names, range(s.ncols)):
            value  = (s.cell(row,col).value)
            try : value = str(int(value))
            except : pass
            col_value.append((name.value, value))
        values.append(col_value)
print values

Solution 4 - Python

By using pandas we can read excel easily.

import pandas as pd 
from pandas import ExcelWriter
from pandas import ExcelFile 

DataF=pd.read_excel("Test.xlsx",sheet_name='Sheet1')

print("Column headings:")
print(DataF.columns)

Test at :https://repl.it Reference: https://pythonspot.com/read-excel-with-pandas/

Solution 5 - Python

Here is the code to read an excel file and and print all the cells present in column 1 (except the first cell i.e the header):

import xlrd

file_location="C:\pythonprog\xxx.xlsv"
workbook=xlrd.open_workbook(file_location)
sheet=workbook.sheet_by_index(0)
print(sheet.cell_value(0,0))

for row in range(1,sheet.nrows):
     print(sheet.cell_value(row,0))

Solution 6 - Python

The approach I took reads the header information from the first row to determine the indexes of the columns of interest.

You mentioned in the question that you also want the values output to a string. I dynamically build a format string for the output from the FORMAT column list. Rows are appended to the values string separated by a new line char.

The output column order is determined by the order of the column names in the FORMAT list.

In my code below the case of the column name in the FORMAT list is important. In the question above you've got 'Pincode' in your FORMAT list, but 'PinCode' in your excel. This wouldn't work below, it would need to be 'PinCode'.

from xlrd import open_workbook
wb = open_workbook('sample.xls')

FORMAT = ['Arm_id', 'DSPName', 'PinCode']
values = ""

for s in wb.sheets():
    headerRow = s.row(0)
    columnIndex = [x for y in FORMAT for x in range(len(headerRow)) if y == firstRow[x].value]
    formatString = ("%s,"*len(columnIndex))[0:-1] + "\n"

    for row in range(1,s.nrows):
        currentRow = s.row(row)
        currentRowValues = [currentRow[x].value for x in columnIndex]
        values += formatString % tuple(currentRowValues)

print values

For the sample input you gave above this code outputs:

>>> 1.0,JaVAS,282001.0
2.0,JaVAS,282002.0
3.0,JaVAS,282003.0

And because I'm a python noob, props be to: this answer, this answer, this question, this question and this answer.

Solution 7 - Python

I have read using openpyxl library,

import openpyxl
from pathlib import Path

xlsx_file = Path('C:\\Users\\Amit\\Desktop\\ReadExcel', 'ReadData.xlsx')
wb_obj = openpyxl.load_workbook(xlsx_file) 

# Read the active sheet:
sheet = wb_obj.active

for i in range(sheet.max_column):
    print(f'i = {i}')
    for row in sheet.iter_rows():
        print(row[i].value)

Solution 8 - Python

Although I almost always just use pandas for this, my current little tool is being packaged into an executable and including pandas is overkill. So I created a version of poida's solution that resulted in a list of named tuples. His code with this change would look like this:

from xlrd import open_workbook
from collections import namedtuple
from pprint import pprint

wb = open_workbook('sample.xls')

FORMAT = ['Arm_id', 'DSPName', 'PinCode']
OneRow = namedtuple('OneRow', ' '.join(FORMAT))
all_rows = []

for s in wb.sheets():
    headerRow = s.row(0)
    columnIndex = [x for y in FORMAT for x in range(len(headerRow)) if y == headerRow[x].value]

    for row in range(1,s.nrows):
        currentRow = s.row(row)
        currentRowValues = [currentRow[x].value for x in columnIndex]
        all_rows.append(OneRow(*currentRowValues))

pprint(all_rows)

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionPythonEnthusiastView Question on Stackoverflow
Solution 1 - PythonsheinisView Answer on Stackoverflow
Solution 2 - PythontamasgalView Answer on Stackoverflow
Solution 3 - PythonNoel EvansView Answer on Stackoverflow
Solution 4 - PythonMahabubuzzamanView Answer on Stackoverflow
Solution 5 - Pythonharsha vardhanView Answer on Stackoverflow
Solution 6 - PythonpoidaView Answer on Stackoverflow
Solution 7 - PythonSiddhpura AmitView Answer on Stackoverflow
Solution 8 - PythonTSeymourView Answer on Stackoverflow