Is it faster to access data from files or a database server?

SqlPerformanceCgiFlat File

Sql Problem Overview


If I have a static database consisting of folders and files, would access and manipulation be faster than SQL server type databases, considering this would be used in a CGI script?

When working with files and folders, what are the tricks to better performance?

Sql Solutions


Solution 1 - Sql

I'll add to the it depends crowd.

This is the kind of question that has no generic answer but is heavily dependent on the situation at hand. I even recently moved some data from a SQL database to a flat file system because the overhead of the DB, combined with some DB connection reliability issues, made using flat files a better choice.

Some questions I would ask myself when making the choice include:

  1. How am I consuming the data? For example will I just be reading from the beginning to the end rows in the order entered? Or will I be searching for rows that match multiple criteria?

  2. How often will I be accessing the data during one program execution? Will I go once to get all books with Salinger as the author or will I go several times to get several different authors? Will I go more than once for several different criteria?

  3. How will I be adding data? Can I just append a row to the end and that's perfect for my retrieval or will it need to be resorted?

  4. How logical will the code look in six months? I emphasize this because I think this is too often forgotten in designing things (not just code, this hobby horse is actually from my days as a Navy mechanic cursing mechanical engineers). In six months when I have to maintain your code (or you do after working another project) which way of storing and retrieving data will make more sense. If going from flat files to a DB results in a 1% efficiency improvement but adds a week of figuring things out when you have to update the code have you really improved things.

Solution 2 - Sql

As a general rule, databases are slower than files.

If you require indexing of your files, a hard-coded access path on customised indexing structures will always have the potential to be faster if you do it correctly.

But 'performance' is not the goal when choosing a database over a file based solution.

You should ask yourself whether your system needs any of the benefits that a database would provide. If so, then the small performance overhead is quite acceptable.

So:

  1. Do you need to deal with multiple users and concurrent updates? (Well; you did say it's static.)
  2. Do you need flexibility in order to easily query the data from a variety of angles?
  3. Do you have multiple users, and could gain from making use of an existing security model?

Basically, the question is more of which would be easier to develop. The performance difference between the two is not worth wasting dev time.

Solution 3 - Sql

Depends on what your information is and what your access patterns and scale are. Two of the biggest benefits of a relational databases are:

  1. Caching. Unless you're very clever, you can't write a cache as good as that of a DB server

  2. Optimizer.

However, for certain specialized applications, neither of these 2 benefits manifest itself compared to files+folders data store - therefore the answer is a resounding "depends".

As for files/folders, the tricks are:

  • Cache the contents of frequently requested files
  • Have small directories (files in deeply nested small directories are much faster to access than in a flatter structure, due to the time it takes to read the contents of a big directory).
  • There are other, more advanced optimizations (slice across disks, placement on different places in a disk or different partition, etc..) - but if you have need of THAT level, you are better off with a database in the first place.

Solution 4 - Sql

From my little bit of experience, server-based databases (even those served on the local machine) tend to to have very slow throughput compared to local filesystems. However, this depends on some things, one of which being asymptotic complexity. Comparing scanning a big list of files against using a database with an index to look up an item, the database wins.

My little bit of experience is with PostgreSQL. I had a table with three million rows, and I went to update a mere 8,000 records. It took 8 seconds.

As for the quote "Premature optimization is the root of all evil.", I would take that with a grain of salt. If you write your application using a database, then find it to be slow, it might take a tremendous amount of time to switch to a filesystem-based approach or something else (e.g. SQLite). I would say your best bet is to create a very simple prototype of your workload, and test it with both approaches. I believe it is important to know which is faster in this case.

Solution 5 - Sql

As others have pointed out: it depends!

If you really need to find out which is going to be more performant for your purposes, you may want to generate some sample data to store in each format and then run some benchmarks. The Benchmark.pm module comes with Perl, and makes it fairly simple to do a side-by-side comparison with something like this:

use Benchmark qw(:all) ;

my $count = 1000;  # Some large-ish number of trials is recommended.

cmpthese($count, {
    'File System' => sub { ...your filesystem code... },
    'Database'    => sub { ...your database code... }
});

You can type perldoc Benchmark to get more complete documentation.

Solution 6 - Sql

It is very useful to use files instead of db when it comes to images if site structure is suitable. Create folders representing your matching data and place images inside. For example you have an article site, you store your articles in db. You don't have to place your image paths on db, name folders with your primary keys like 1,2,3.. and put images inside. E-books, music files, videos, this approach can be used in all media files. Same logic works with xml files if you won't search for something.

Solution 7 - Sql

Like other said DB is a tool, and it creates some overhead, but in case if your data is static and it's read-only data reading directory from files will be faster: Here are some tests that I've done: I had files with the name of the file as .csv In database I had indexed column as 'date' in order to find the same records in the database. Each day has 30K-50K records/rows and 100 columns of different type of data (90% floats).

DB Info: PostgreSQL 11.5, 16GB of RAM

  Table:
    335,162,867 records
    Table size: 110GB
    Index size: 7GB
    Total size: 117GB
  Files:
    Number of files: 8033
    Total Files size: 158GB
    Number of records/lines per file/date: 30K - 50K

> Reading data for a random date (1986-2019) from a file was constantly > 4-5 times faster than reading data for the same date in PostgreSQL

Solution 8 - Sql

It depends on the profile of the data and what logic you are going to be using to access it. If you simply need to save and fetch named nodes then a filesystem-based database may be faster and more efficient. (You could also have a look at Berkeley DB for that purpose.) If you need to do index-based searches, and especially if you need to join different sets of data based on keys, then an SQL database is your best bet.

I would just go with whatever solution seems the most natural for your application.

Solution 9 - Sql

As others have said, it depends: on the size and nature of the data and the operations you're planning to run on it.

Particularly for a CGI script, you're going to incur a performance hit for connecting to a database server on every page view. However if you create a naive file-based approach, you could easily create worse performance problems ;-)

As well as a Berkeley DB File solution you could also consider using SQLite. This creates a SQL interface to a database stored in a local file. You can access it with DBI and SQL but there's no server, configuration or network protocol. This could allow easier migration if a database server is necessary in the future (example: if you decide to have multiple front-end servers, but need to share state).

Without knowing any details, I'd suggest using a SQLite/DBI solution then reviewing the performance. This will give flexibility with a reasonably simple start up and decent performance.

Solution 10 - Sql

To quickly access files, depending on what you are doing, an mmap can be very handy. I just wrote about this in the Effective Perl blog as Memory-map files instead of slurping them.

However, I expect that a database server would be much faster. It's difficult to say what would be faster for you when we have no idea what you are doing, what sort of data you need to access, and so on.

Solution 11 - Sql

database certainly can be faster,

quoting SQLite test,

> SQLite reads and writes small blobs (for example, thumbnail images) 35% faster¹ than the same blobs can be read from or written to individual files on disk using fread() or fwrite(). > > Furthermore, a single SQLite database holding 10-kilobyte blobs uses about 20% less disk space than storing the blobs in individual files. > > The performance difference arises (we believe) because when working from an SQLite database, the open() and close() system calls are invoked only once, whereas open() and close() are invoked once for each blob when using blobs stored in individual files. It appears that the overhead of calling open() and close() is greater than the overhead of using the database. The size reduction arises from the fact that individual files are padded out to the next multiple of the filesystem block size, whereas the blobs are packed more tightly into an SQLite database. > > The measurements in this article were made during the week of 2017-06-05 using a version of SQLite in between 3.19.2 and 3.20.0. You may expect future versions of SQLite to perform even better.

Solution 12 - Sql

I'm going to give you the same answer everyone else gave you, It Depends

In a simple scenario with a single server that returns data (READ Only), Yes file system will be great and easy to manage.

But, when you have more than one server you'll have to manage distributed files system like glusterfs, ceph, etc..

A database is a tool to manage all of it for you, distributed files system, compression, read/write, locks etc..

hope that's helpful.

Solution 13 - Sql

I chose file/folder system for my office-managing-site (1M workers throughout USA and Brazil) which was my first project (as a job).

Actually it was the most irresponsible but luckily most suitable decision I took ever since.

To deal with this file system I made a php function set as well to simplify stuff. Things I consider to make those functions were:

  • Creating table file (like csv)
  • Creating folder containing those files,
  • Letting repeat a function for each row in a selected file,
  • Letting change / delete selected column/row data in a selected file iteratively
  • And save binary files as separate files

That's all.

But when using it I had to think about optimizing so much. For example when it comes to last seen of someone..

When someone X user want to see lastseen of a some other Y I made a folder called dataRequestingCluster and under X which contains everything that X wants to see last seen (maybe X's friendlist if X is viewing his friends page) and added that cluster in to Y's last seen table (which contains all "Y's last seen requesters separate clusterFiles" and updates every time Y interact with server)

And then added a mechanism to remove dataRequestClusters when not used for 3mins (WHEN X Leaves his friends page no need to update last seen for him)

I had to follow this approach because updating last seen in the ALL_USERS File is a complete joke when it comes to time.

Most important thing is this method is way faster than using a database system with MySql ... But you have to use the brain and 30% more time but happiness and completeness ibatined by this is awesome

This is just a one case "last seen"

So many times I used different algorithms to succeed each problem and it was always efficient than sql/ db system.

Conclusion: File systems make stuff complex but its the only way to improve the quality in a shocking way...if you are not good at thinking or have less time or you are not interested about what you do you should probably go SQL/DB system.

After 6 months my server data storage became 20GB (Without media files.. just user info ) but any individual file never exceeded 200kb.. it was amazingly successful with least time for processing.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionJeremy GwaView Question on Stackoverflow
Solution 1 - SqlHerbNView Answer on Stackoverflow
Solution 2 - SqlDisillusionedView Answer on Stackoverflow
Solution 3 - SqlDVKView Answer on Stackoverflow
Solution 4 - SqlJoey AdamsView Answer on Stackoverflow
Solution 5 - SqlJohn HylandView Answer on Stackoverflow
Solution 6 - SqlKuzgunView Answer on Stackoverflow
Solution 7 - SqlVlad BezdenView Answer on Stackoverflow
Solution 8 - SqlNate C-KView Answer on Stackoverflow
Solution 9 - SqlFalseVinylShrubView Answer on Stackoverflow
Solution 10 - Sqlbrian d foyView Answer on Stackoverflow
Solution 11 - SqlhanshenrikView Answer on Stackoverflow
Solution 12 - SqlDoron SegalView Answer on Stackoverflow
Solution 13 - SqlR3KI3GView Answer on Stackoverflow