Efficiently counting the number of lines of a text file. (200mb+)

Php, File, Memory, Text, Memory Leaks

Php Problem Overview


I have just found out that my script gives me a fatal error:

Fatal error: Allowed memory size of 268435456 bytes exhausted (tried to allocate 440 bytes) in C:\process_txt.php on line 109

That line is this:

$lines = count(file($path)) - 1;

So I think it is having difficulty loading the file into memory and counting the number of lines. Is there a more efficient way to do this without running into memory issues?

The text files that I need to count the number of lines for range from 2MB to 500MB. Maybe a Gig sometimes.

Thanks all for any help.

Php Solutions


Solution 1 - Php

This will use less memory, since it doesn't load the whole file into memory:

$file="largefile.txt";
$linecount = 0;
$handle = fopen($file, "r");
while(!feof($handle)){
  $line = fgets($handle);
  $linecount++;
}

fclose($handle);

echo $linecount;

fgets loads a single line into memory (if the second argument $length is omitted it will keep reading from the stream until it reaches the end of the line, which is what we want). This is still unlikely to be as quick as using something other than PHP, if you care about wall time as well as memory usage.

The only danger with this is if any lines are particularly long (what if you encounter a 2GB file without line breaks?). In that case you're better off slurping it in chunks and counting end-of-line characters:

$file="largefile.txt";
$linecount = 0;
$handle = fopen($file, "r");
while(!feof($handle)){
  $line = fgets($handle, 4096);
  // Count "\n" rather than PHP_EOL so files written on another platform are counted too
  $linecount = $linecount + substr_count($line, "\n");
}

fclose($handle);

echo $linecount;

Solution 2 - Php

Using a loop of fgets() calls is a fine solution and the most straightforward to write, however:

  1. even though internally the file is read using a buffer of 8192 bytes, your code still has to call that function for each line.

  2. it's technically possible that a single line may be bigger than the available memory if you're reading a binary file.

This code reads a file in chunks of 8kB each and then counts the number of newlines within that chunk.

function getLines($file)
{
    $f = fopen($file, 'rb');
    $lines = 0;

    while (!feof($f)) {
        $lines += substr_count(fread($f, 8192), "\n");
    }

    fclose($f);

    return $lines;
}

If the average length of each line is at most 4kB, you will already start saving on function calls, and those can add up when you process big files.

Benchmark

I ran a test with a 1GB file; here are the results:

             +-------------+------------------+---------+
             | This answer | Dominic's answer | wc -l   |
+------------+-------------+------------------+---------+
| Lines      | 3550388     | 3550389          | 3550388 |
+------------+-------------+------------------+---------+
| Runtime    | 1.055       | 4.297            | 0.587   |
+------------+-------------+------------------+---------+

Time is measured in seconds of real (wall-clock) time.
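
To reproduce a comparison like this on your own files, a minimal timing sketch might look like the following. The helper name timeLineCount() and the path in the usage comment are made up for illustration, and hrtime() assumes PHP 7.3 or later.

```php
<?php
// Hypothetical helper: times any line-counting callable on one file.
function timeLineCount(callable $counter, string $path): array
{
    $start = hrtime(true);                      // monotonic clock, in nanoseconds
    $lines = $counter($path);
    $seconds = (hrtime(true) - $start) / 1e9;   // convert to seconds

    return ['lines' => $lines, 'seconds' => $seconds];
}

// Chunked counter in the style of this answer.
$chunked = function (string $path): int {
    $f = fopen($path, 'rb');
    $lines = 0;
    while (!feof($f)) {
        $lines += substr_count(fread($f, 8192), "\n");
    }
    fclose($f);
    return $lines;
};

// Usage (placeholder path):
// $result = timeLineCount($chunked, '/path/to/big.txt');
// printf("%d lines in %.3f s\n", $result['lines'], $result['seconds']);
```

Run each candidate several times on the same file, since the first run is likely to include the cost of reading the file from disk rather than from the OS cache.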

True line count

While the above works well and returns the same result as wc -l, the count will be one short if the file does not end with a newline, because the last line is not counted. If you care about this particular scenario, you can make it more accurate by using this logic:


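function getLines($file)
{
    $f = fopen($file, 'rb');
    $lines = 0; $last = '';

    while (!feof($f)) {
        $buffer = fread($f, 8192);
        $lines += substr_count($buffer, "\n");
        // Remember the last byte actually read; an empty final read must not overwrite it
        if ($buffer !== '' && $buffer !== false) {
            $last = substr($buffer, -1);
        }
    }

    fclose($f);

    // Count a final line that has no trailing newline
    if ($last !== '' && $last !== "\n") {
        ++$lines;
    }
    return $lines;
}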

Solution 3 - Php

Simple object-oriented solution

$file = new \SplFileObject('file.extension');

while($file->valid()) $file->fgets();

var_dump($file->key());

Update

Another way to do this is to pass PHP_INT_MAX to the SplFileObject::seek method, which moves to the last line of the file:

$file = new \SplFileObject('file.extension', 'r');
$file->seek(PHP_INT_MAX);

echo $file->key(); 

Solution 4 - Php

If you're running this on a Linux/Unix host, the easiest solution would be to use exec() or similar to run the command wc -l $path. Just make sure you've sanitized $path first to be sure that it isn't something like "/path/to/file ; rm -rf /".
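
As a rough sketch of what that could look like: the helper name countLinesWithWc() is made up for illustration, and the code assumes a Unix-like host where wc is on the PATH and exec() has not been disabled. escapeshellarg() is PHP's built-in function for quoting a single shell argument.

```php
<?php
// Hypothetical helper; assumes a Unix-like host with `wc` available and exec() enabled.
function countLinesWithWc(string $path): int
{
    if (!is_file($path) || !is_readable($path)) {
        throw new InvalidArgumentException("Not a readable file: $path");
    }

    // escapeshellarg() quotes the path so the shell treats it as one argument.
    // Reading via "<" makes wc print only the number, without the file name.
    $output = exec('wc -l < ' . escapeshellarg($path), $allOutput, $exitCode);

    if ($output === false || $exitCode !== 0) {
        throw new RuntimeException('wc failed');
    }

    return (int) trim($output);
}

// Usage (placeholder path):
// echo countLinesWithWc('/path/to/file.txt'), "\n";
```

With the path quoted this way, something like "/path/to/file ; rm -rf /" is handed to wc as one odd file name instead of being run as a second command.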

Solution 5 - Php

There is a faster way I found that does not require looping through the entire file in PHP.

*Only on *nix systems; there might be a similar way on Windows.

$file = '/path/to/your.file';

//Get number of lines (escapeshellarg() keeps unusual file names from being run as shell commands)
$totalLines = intval(exec("wc -l " . escapeshellarg($file)));

Solution 6 - Php

If you're using PHP 5.5 or later you can use a generator. This will NOT work in any version of PHP before 5.5 though. From php.net:

"Generators provide an easy way to implement simple iterators without the overhead or complexity of implementing a class that implements the Iterator interface."

// This function implements a generator to load individual lines of a large file
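function getLines($file) {
    $f = fopen($file, 'r');

    // read each line of the file without loading the whole file to memory;
    // compare against false so a falsy line such as a final "0" does not end the loop early
    while (($line = fgets($f)) !== false) {
        yield $line;
    }

    fclose($f);
}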

// Since generators implement simple iterators, I can quickly count the number
// of lines using the iterator_count() function.
$file = '/path/to/file.txt';
$lineCount = iterator_count(getLines($file)); // the number of lines in the file

Solution 7 - Php

If you're on Linux you can simply do:

$number_of_lines = intval(trim(shell_exec("wc -l " . escapeshellarg($file_name) . " | awk '{print $1}'")));

You just have to find the right command if you're using another OS.

Regards

Solution 8 - Php

This is an addition to Wallace Maxters' solution (Solution 3).

It also skips empty lines while counting:

function getLines($file)
{
    $file = new \SplFileObject($file, 'r');
    $file->setFlags(SplFileObject::READ_AHEAD | SplFileObject::SKIP_EMPTY | 
SplFileObject::DROP_NEW_LINE);
    $file->seek(PHP_INT_MAX);

    return $file->key() + 1; 
}

Solution 9 - Php

The most succinct cross-platform solution that only buffers one line at a time.

$file = new \SplFileObject(__FILE__);
$file->setFlags($file::READ_AHEAD);
$lines = iterator_count($file);

Unfortunately, we have to set the READ_AHEAD flag; without it, iterator_count blocks indefinitely. If not for that, this would be a one-liner.

Solution 10 - Php

private static function lineCount($file) {
    $linecount = 0;
    $handle = fopen($file, "r");
    while (!feof($handle)) {
        if (fgets($handle) !== false) {
            $linecount++;
        }
    }
    fclose($handle);
    return $linecount;
}

I wanted to add a little fix to the function from Solution 1.

In one specific case, where I had a file containing the word 'testing', that function returned 2. So I needed to add a check for whether fgets returned false. :)

Have fun :)

Solution 11 - Php

Based on Dominic Rodger's solution, here is what I use (it uses wc if available, otherwise it falls back to Dominic Rodger's solution).

class FileTool
{

    public static function getNbLines($file)
    {
        $linecount = 0;

        $m = exec('which wc');
        if ('' !== $m) {
            // escapeshellarg() quotes the whole path, not just embedded double quotes
            $cmd = 'wc -l < ' . escapeshellarg($file);
            $n = exec($cmd);
            return (int)$n + 1;
        }


        $handle = fopen($file, "r");
        while (!feof($handle)) {
            $line = fgets($handle);
            $linecount++;
        }
        fclose($handle);
        return $linecount;
    }
}

https://github.com/lingtalfi/Bat/blob/master/FileTool.php

Solution 12 - Php

Counting the number of lines can be done with the following code:

<?php
$fp = fopen("myfile.txt", "r");
$count = 0;
// fgets() reads one line at a time (fgetss(), which also stripped HTML tags, was removed in PHP 8.0)
while (($line = fgets($fp)) !== false) {
    $count++;
}
echo "Total number of lines: " . $count;
fclose($fp);
?>

Solution 13 - Php

You have several options. The first is to increase the available memory limit, which is probably not the best way to do things given that you state the file can get very large. The other way is to use fgets to read the file line by line and increment a counter, which should not cause any memory issues at all as only the current line is in memory at any one time.
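
A minimal sketch of the second option follows. The function name countLinesStreaming() and the path are made up for illustration; the commented-out ini_set() line shows what the first option might look like, with '512M' as an arbitrary example value.

```php
<?php
// Hypothetical helper: counts lines with fgets() so only one line is held in memory at a time.
function countLinesStreaming(string $path): int
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }

    $count = 0;
    // fgets() returns false at end of file, so the loop stops after the last line
    while (fgets($handle) !== false) {
        $count++;
    }
    fclose($handle);

    return $count;
}

// Option one, raising the limit instead (example value only):
// ini_set('memory_limit', '512M');

// Usage (placeholder path):
// echo countLinesStreaming('/path/to/file.txt'), " lines\n";
// echo memory_get_peak_usage(true), " bytes peak\n";
```

Printing memory_get_peak_usage() after the call is a quick way to check that the peak stays small regardless of file size, as long as no single line is huge.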

Solution 14 - Php

There is another answer that I thought might be a good addition to this list.

If you have perl installed and are able to run things from the shell in PHP:

$lines = exec('perl -pe \'s/\r\n|\n|\r/\n/g\' ' . escapeshellarg('largetextfile.txt') . ' | wc -l');

This should handle most line breaks whether from Unix or Windows created files.

TWO downsides (at least):

  1. It is not a great idea to have your script so dependent upon the system it's running on (it may not be safe to assume Perl and wc are available)

  2. Just a small mistake in escaping and you have handed over access to a shell on your machine.

As with most things I know (or think I know) about coding, I got this info from somewhere else:

John Reeve Article
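
For anyone who wants to avoid both downsides, the same idea (treating "\r\n", "\n" and "\r" each as one line break) can be approximated in pure PHP. This is only a sketch under that assumption; the function name countLineBreaks() and its $chunkSize parameter are made up for illustration and are not part of the original answer.

```php
<?php
// Hypothetical helper: counts line terminators, treating "\r\n", "\n" and "\r"
// each as a single line break, without shelling out.
function countLineBreaks(string $path, int $chunkSize = 8192): int
{
    $f = fopen($path, 'rb');
    if ($f === false) {
        throw new RuntimeException("Cannot open $path");
    }

    $count = 0;
    $endedWithCr = false; // did the previous chunk end with "\r"?

    while (!feof($f)) {
        $chunk = fread($f, $chunkSize);
        if ($chunk === false) {
            break;
        }
        if ($chunk === '') {
            continue;
        }

        // Every "\r" and every "\n" counts once; each "\r\n" pair was then counted twice
        $count += substr_count($chunk, "\r")
                + substr_count($chunk, "\n")
                - substr_count($chunk, "\r\n");

        // A "\r\n" pair split across two chunks was also counted twice
        if ($endedWithCr && $chunk[0] === "\n") {
            $count--;
        }
        $endedWithCr = substr($chunk, -1) === "\r";
    }

    fclose($f);

    return $count;
}

// Usage (placeholder path):
// echo countLineBreaks('largetextfile.txt'), "\n";
```

Like wc -l, this counts line terminators, so a final line without a line break is not included.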

Solution 15 - Php

public function quickAndDirtyLineCounter()
{
    echo "<table>";
    // Backslashes are doubled so the closing quote is not escaped
    $folders = ['C:\\wamp\\www\\qa\\abcfolder',
    ];
    foreach ($folders as $folder) {
        $files = scandir($folder);
        foreach ($files as $file) {
            $path = $folder . DIRECTORY_SEPARATOR . $file;
            // Skip the dot entries and anything that is not a regular file
            if ($file == '.' || $file == '..' || !is_file($path)) {
                continue;
            }
            $handle = fopen($path, "r");
            if ($handle === false) {
                continue; // skip files that cannot be opened
            }
            $linecount = 0;
            while (!feof($handle)) {
                $line = fgets($handle);
                $linecount++;
            }
            fclose($handle);
            echo "<tr><td>" . $folder . "</td><td>" . $file . "</td><td>" . $linecount . "</td></tr>";
        }
    }
    echo "</table>";
}

Solution 16 - Php

I use this method purely for counting how many lines are in a file. What is the downside of doing this versus the other answers? I'm seeing many lines as opposed to my two-line solution. I'm guessing there's a reason nobody does this.

$lines = count(file('your.file'));
echo $lines;

Solution 17 - Php

this is a bit late but...

Here is my solution for a text log file I have which uses \n to separate each line.

$data = file_get_contents("myfile.txt");
$numlines = strlen($data) - strlen(str_replace("\n","",$data));

It does load the file into memory but doesn't need to cycle through an unknown number of lines. It may be unsuitable if the file is GB in size but for smaller files with short lines of data it works a treat for me.

It just removes the "\n" characters from the file and works out how many were removed by comparing the length of the data in the file to its length after removing all the line breaks ("\n" chars in my case). If your line delimiter is a different character, replace the "\n" with whatever your line delimiter is.

I know it is not the best answer for all occasions but is something I have found quick and simple for my purposes where each line of the log is only a few hundred chars and total log file is not too large.
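
The same count can also be obtained without building a second copy of the data, by letting substr_count() count the "\n" characters directly. A minimal sketch, with the same caveat that the whole file is still loaded into memory; the function name countNewlinesInMemory() is made up for illustration.

```php
<?php
// Hypothetical helper: loads the whole file, then counts "\n" characters in place.
// Only suitable for files that comfortably fit within the memory limit.
function countNewlinesInMemory(string $path): int
{
    $data = file_get_contents($path);
    if ($data === false) {
        throw new RuntimeException("Cannot read $path");
    }

    return substr_count($data, "\n");
}

// Usage (placeholder file name):
// $numlines = countNewlinesInMemory("myfile.txt");
```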

Solution 18 - Php

For just counting the lines use:

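$handle = fopen("file", "r");
$b = 0;
// Compare against false so a falsy line such as a final "0" is still counted
while (($a = fgets($handle)) !== false) {
    $b++;
}
fclose($handle);
echo $b;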

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
Question | Abs | View Question on Stackoverflow
Solution 1 - Php | Dominic Rodger | View Answer on Stackoverflow
Solution 2 - Php | Ja͢ck | View Answer on Stackoverflow
Solution 3 - Php | Wallace Maxters | View Answer on Stackoverflow
Solution 4 - Php | Dave Sherohman | View Answer on Stackoverflow
Solution 5 - Php | Andy Braham | View Answer on Stackoverflow
Solution 6 - Php | Ben Harold | View Answer on Stackoverflow
Solution 7 - Php | elkolotfi | View Answer on Stackoverflow
Solution 8 - Php | Jani | View Answer on Stackoverflow
Solution 9 - Php | Quolonel Questions | View Answer on Stackoverflow
Solution 10 - Php | ufk | View Answer on Stackoverflow
Solution 11 - Php | ling | View Answer on Stackoverflow
Solution 12 - Php | Santosh Kumar | View Answer on Stackoverflow
Solution 13 - Php | Yacoby | View Answer on Stackoverflow
Solution 14 - Php | Douglas.Sesar | View Answer on Stackoverflow
Solution 15 - Php | Yogi Sadhwani | View Answer on Stackoverflow
Solution 16 - Php | kaspirtk1 | View Answer on Stackoverflow
Solution 17 - Php | David Crawford | View Answer on Stackoverflow
Solution 18 - Php | Adeel Ahmad | View Answer on Stackoverflow