What is the most efficient way to get first and last line of a text file?

PythonFileSeek

Python Problem Overview


I have a text file which contains a time stamp on each line. My goal is to find the time range. All the times are in order so the first line will be the earliest time and the last line will be the latest time. I only need the very first and very last line. What would be the most efficient way to get these lines in python?

Note: These files are relatively large in length, about 1-2 million lines each and I have to do this for several hundred files.

Python Solutions


Solution 1 - Python

To read both the first and final line of a file you could...

  • open the file, ...
  • ... read the first line using built-in readline(), ...
  • ... seek (move the cursor) to the end of the file, ...
  • ... step backwards until you encounter EOL (line break) and ...
  • ... read the last line from there.
def readlastline(f):
    f.seek(-2, 2)              # Jump to the second last byte.
    while f.read(1) != b"\n":  # Until EOL is found ...
        f.seek(-2, 1)          # ... jump back, over the read byte plus one more.
    return f.read()            # Read all data from this point on.
    
with open(file, "rb") as f:
    first = f.readline()
    last = readlastline(f)

Jump to the second last byte directly to prevent trailing newline characters to cause empty lines to be returned*.

The current offset is pushed ahead by one every time a byte is read so the stepping backwards is done two bytes at a time, past the recently read byte and the byte to read next.

The whence parameter passed to fseek(offset, whence=0) indicates that fseek should seek to a position offset bytes relative to...

* As would be expected as the default behavior of most applications, including print and echo, is to append one to every line written and has no effect on lines missing trailing newline character.


Efficiency

> 1-2 million lines each and I have to do this for several hundred files.

I timed this method and compared it against against the top answer.

10k iterations processing a file of 6k lines totalling 200kB: 1.62s vs 6.92s.
100 iterations processing a file of 6k lines totalling 1.3GB: 8.93s vs 86.95.

Millions of lines would increase the difference a lot more.

Exakt code used for timing:

with open(file, "rb") as f:
    first = f.readline()     # Read and store the first line.
    for last in f: pass      # Read all lines, keep final value.


Amendment

A more complex, and harder to read, variation to address comments and issues raised since.

  • Return empty string when parsing empty file, raised by comment.
  • Return all content when no delimiter is found, raised by comment.
  • Avoid relative offsets to support text mode, raised by comment.
  • UTF16/UTF32 hack, noted by comment.

Also adds support for multibyte delimiters, readlast(b'X<br>Y', b'<br>', fixed=False).

Please note that this variation is really slow for large files because of the non-relative offsets needed in text mode. Modify to your need, or do not use it at all as you're probably better off using f.readlines()[-1] with files opened in text mode.

#!/bin/python3

from os import SEEK_END

def readlast(f, sep, fixed=True):
    r"""Read the last segment from a file-like object.

    :param f: File to read last line from.
    :type  f: file-like object
    :param sep: Segment separator (delimiter).
    :type  sep: bytes, str
    :param fixed: Treat data in ``f`` as a chain of fixed size blocks.
    :type  fixed: bool
    :returns: Last line of file.
    :rtype: bytes, str
    """
    bs   = len(sep)
    step = bs if fixed else 1
    if not bs:
        raise ValueError("Zero-length separator.")
    try:
        o = f.seek(0, SEEK_END)
        o = f.seek(o-bs-step)    # - Ignore trailing delimiter 'sep'.
        while f.read(bs) != sep: # - Until reaching 'sep': Read sep-sized block
            o = f.seek(o-step)   #  and then seek to the block to read next.
    except (OSError,ValueError): # - Beginning of file reached.
        f.seek(0)
    return f.read()

def test_readlast():
    from io import BytesIO, StringIO
    
    # Text mode.
    f = StringIO("first\nlast\n")
    assert readlast(f, "\n") == "last\n"
    
    # Bytes.
    f = BytesIO(b'first|last')
    assert readlast(f, b'|') == b'last'
    
    # Bytes, UTF-8.
    f = BytesIO("X\nY\n".encode("utf-8"))
    assert readlast(f, b'\n').decode() == "Y\n"
    
    # Bytes, UTF-16.
    f = BytesIO("X\nY\n".encode("utf-16"))
    assert readlast(f, b'\n\x00').decode('utf-16') == "Y\n"
  
    # Bytes, UTF-32.
    f = BytesIO("X\nY\n".encode("utf-32"))
    assert readlast(f, b'\n\x00\x00\x00').decode('utf-32') == "Y\n"
    
    # Multichar delimiter.
    f = StringIO("X<br>Y")
    assert readlast(f, "<br>", fixed=False) == "Y"
    
    # Make sure you use the correct delimiters.
    seps = { 'utf8': b'\n', 'utf16': b'\n\x00', 'utf32': b'\n\x00\x00\x00' }
    assert "\n".encode('utf8' )     == seps['utf8']
    assert "\n".encode('utf16')[2:] == seps['utf16']
    assert "\n".encode('utf32')[4:] == seps['utf32']
    
    # Edge cases.
    edges = (
        # Text , Match
        (""    , ""  ), # Empty file, empty string.
        ("X"   , "X" ), # No delimiter, full content.
        ("\n"  , "\n"),
        ("\n\n", "\n"),
        # UTF16/32 encoded U+270A (b"\n\x00\n'\n\x00"/utf16)
        (b'\n\xe2\x9c\x8a\n'.decode(), b'\xe2\x9c\x8a\n'.decode()),
    )
    for txt, match in edges:
        for enc,sep in seps.items():
            assert readlast(BytesIO(txt.encode(enc)), sep).decode(enc) == match

if __name__ == "__main__":
    import sys
    for path in sys.argv[1:]:
        with open(path) as f:
            print(f.readline()    , end="")
            print(readlast(f,"\n"), end="")

Solution 2 - Python

docs for io module

with open(fname, 'rb') as fh:
    first = next(fh).decode()

    fh.seek(-1024, 2)
    last = fh.readlines()[-1].decode()

The variable value here is 1024: it represents the average string length. I choose 1024 only for example. If you have an estimate of average line length you could just use that value times 2.

Since you have no idea whatsoever about the possible upper bound for the line length, the obvious solution would be to loop over the file:

for line in fh:
    pass
last = line

You don't need to bother with the binary flag you could just use open(fname).

ETA: Since you have many files to work on, you could create a sample of couple of dozens of files using random.sample and run this code on them to determine length of last line. With an a priori large value of the position shift (let say 1 MB). This will help you to estimate the value for the full run.

Solution 3 - Python

Here's a modified version of SilentGhost's answer that will do what you want.

with open(fname, 'rb') as fh:
    first = next(fh)
    offs = -100
    while True:
        fh.seek(offs, 2)
        lines = fh.readlines()
        if len(lines)>1:
            last = lines[-1]
            break
        offs *= 2
    print first
    print last

No need for an upper bound for line length here.

Solution 4 - Python

Can you use unix commands? I think using head -1 and tail -n 1 are probably the most efficient methods. Alternatively, you could use a simple fid.readline() to get the first line and fid.readlines()[-1], but that may take too much memory.

Solution 5 - Python

This is my solution, compatible also with Python3. It does also manage border cases, but it misses utf-16 support:

def tail(filepath):
    """
    @author Marco Sulla ([email protected])
    @date May 31, 2016
    """
    
    try:
        filepath.is_file
        fp = str(filepath)
    except AttributeError:
        fp = filepath
    
    with open(fp, "rb") as f:
        size = os.stat(fp).st_size
        start_pos = 0 if size - 1 < 0 else size - 1
        
        if start_pos != 0:
            f.seek(start_pos)
            char = f.read(1)
            
            if char == b"\n":
                start_pos -= 1
                f.seek(start_pos)
            
            if start_pos == 0:
                f.seek(start_pos)
            else:
                char = ""
                
                for pos in range(start_pos, -1, -1):
                    f.seek(pos)
                    
                    char = f.read(1)
                    
                    if char == b"\n":
                        break
        
        return f.readline()

It's ispired by Trasp's answer and AnotherParker's comment.

Solution 6 - Python

First open the file in read mode.Then use readlines() method to read line by line.All the lines stored in a list.Now you can use list slices to get first and last lines of the file.

    a=open('file.txt','rb')
    lines = a.readlines()
    if lines:
        first_line = lines[:1]
        last_line = lines[-1]

Solution 7 - Python

w=open(file.txt, 'r')
print ('first line is : ',w.readline())
for line in w:  
    x= line
print ('last line is : ',x)
w.close()

The for loop runs through the lines and x gets the last line on the final iteration.

Solution 8 - Python

with open("myfile.txt") as f:
    lines = f.readlines()
    first_row = lines[0]
    print first_row
    last_row = lines[-1]
    print last_row

Solution 9 - Python

Here is an extension of @Trasp's answer that has additional logic for handling the corner case of a file that has only one line. It may be useful to handle this case if you repeatedly want to read the last line of a file that is continuously being updated. Without this, if you try to grab the last line of a file that has just been created and has only one line, IOError: [Errno 22] Invalid argument will be raised.

def tail(filepath):
    with open(filepath, "rb") as f:
        first = f.readline()      # Read the first line.
        f.seek(-2, 2)             # Jump to the second last byte.
        while f.read(1) != b"\n": # Until EOL is found...
            try:
                f.seek(-2, 1)     # ...jump back the read byte plus one more.
            except IOError:
                f.seek(-1, 1)
                if f.tell() == 0:
                    break
        last = f.readline()       # Read last line.
    return last

Solution 10 - Python

Nobody mentioned using reversed:

f=open(file,"r")
r=reversed(f.readlines())
last_line_of_file = r.next()

Solution 11 - Python

Getting the first line is trivially easy. For the last line, presuming you know an approximate upper bound on the line length, os.lseek some amount from SEEK_END find the second to last line ending and then readline() the last line.

Solution 12 - Python

with open(filename, "rb") as f:#Needs to be in binary mode for the seek from the end to work
    first = f.readline()
    if f.read(1) == '':
        return first
    f.seek(-2, 2)  # Jump to the second last byte.
    while f.read(1) != b"\n":  # Until EOL is found...
        f.seek(-2, 1)  # ...jump back the read byte plus one more.
    last = f.readline()  # Read last line.
    return last

The above answer is a modified version of the above answers which handles the case that there is only one line in the file

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionpasbinoView Question on Stackoverflow
Solution 1 - PythonTraspView Answer on Stackoverflow
Solution 2 - PythonSilentGhostView Answer on Stackoverflow
Solution 3 - Pythonmik01ajView Answer on Stackoverflow
Solution 4 - PythonbeitarView Answer on Stackoverflow
Solution 5 - PythonMarco SullaView Answer on Stackoverflow
Solution 6 - PythonSrinivasreddy JakkireddyView Answer on Stackoverflow
Solution 7 - PythonVipeRView Answer on Stackoverflow
Solution 8 - PythonRiccardo VolpeView Answer on Stackoverflow
Solution 9 - Pythontony_tigerView Answer on Stackoverflow
Solution 10 - PythonMichael MeanswellView Answer on Stackoverflow
Solution 11 - PythonmswView Answer on Stackoverflow
Solution 12 - Pythonuser37940View Answer on Stackoverflow