Just what is 'A big database'?

Database

Database Problem Overview


Ok, dumb question I know but I see the nebulous comment 'a large database' as well as small and medium and I wonder just what that means. Can someone define what a small, medium and large database is for us SQL neophytes?

Database Solutions


Solution 1 - Database

There isn't a threshold where a small database becomes medium or a medium database becomes large. Generally, when I hear these terms, I think of particular orders of magnitude in terms of total records being stored.

  • Small: Fits in a spreadsheet.
  • Medium: Fits in memory on a commodity server.
  • Large: Fits in a commodity cloud offering.
  • Very large: Fits in a specialized environment; unusual storage, latency, or throughput characteristics.

As poster dkretz suggested, you could also think about it in terms of the properties each kind of database has. Categorizing it this way, I'd say:

  • Small: Performance is not a concern. Your queries run fine without making any special optimizations. You see only a marginal performance difference when using front-line enhancements like indexes.

  • Medium: Your database probably has one or more staff that are assigned part-time to its maintenance and care. These people pay attention to the database's health; their primary administrative responsibility is to prevent unacceptable performance problems and minimize downtime.

  • Large: Probably has dedicated staff member(s) whose job is to work on the database and improve performance, as well as make sure that application changes don't cause schema breakage over the lifetime of the database. Metrics about the health and status of the database are monitored closely. Significant expertise is required to understand and perform optimizations.

  • Very large: The database stores vast amounts of information that must be readily accessible. Performance optimizations are absolutely required to wring every last ounce of speed out of each queries, and without it, the database would be much less usable or even impossible to use. The database may be using sophisticated or innovative replication or clustering techniques, pushing the boundaries of current technology.

Note that these are entirely subjective, and that someone may very well have a perfectly legitimate alternate definition of "large".

Solution 2 - Database

One way to figure it is by observing your test queries.

A small database is one where indexes don't matter.

A medium database is one where queries take longer than one second if you don't have an appropriate index in place.

A big database is one where queries often take hours to optimize, using a combination of query design, index modification, and many test cycles.

Solution 3 - Database

Large database are ones that force you have to stop using relational databases.

In other words, a normalized, relational database where all the indexes in the world can't help you meet your response time requirements because of the massive JOINs.

If you've ever had to abandon relational databases for something else, you're either a poor database developer, have no expert DBA, or have a very large database.

Solution 4 - Database

“Large Database” is indeed a nebulous concept. There are already very different answers and opinions posted in the answers to this question. Some approaches to define “small”, “medium” and “large” Databases may make more sense than others BUT THEN, at some point, I consider each definition is right, true and valid.

Some definitions make more sense than others because they focus on different aspects of importance for the design, programming, use, maintenance and administration of a Database and these different aspects are what really matter for a usable Database. It just happens that all these aspects are impacted by the nebulous concept of “Database size”.

So, Does this mean that it does not matter if you are able to define if a particular Database is big or not?

Certainly not. What it mean is you will apply the concept differently while evaluating different design/operational/administrative aspects of your Database. It also means that every time this concept will be nebulous.

As an example: Database Index strategy (an aspect of Database design) is impacted by record count for each table (a measure of “size”), by record size times record count (another measure of “size”), and by Query Vs. Creation/Update/Delete operations ratio (an aspect of Database usage).

Query response times are better if indexes are used for tables with large amount of records. Depending on the nature of your WHERE, ORDER BY and record-aggregation clauses you may need several indexes for certain tables.

Creation, Update and Delete operations are impacted negatively with the increase of number of indexes on the affected table(s). More indexes for an affected table means more changes that the RDBMS must perform, spending more time and more resources to apply those changes.

Also, if your RDBMS spends more time to apply those changes, then the locks are maintained for longer times also, impacting the response times other queries being sent to the system at the same time.

So, How do you balance the quantity and design of your indexes? How do you know if you need an additional index and if by adding that index you will not be introducing a big negative impact on query response times? Answer: You test and profile your database against a target load as per your load/performance requirements and analyze the profiling data in order to discover if further optimizations/redesigns/indexes are needed.

Different Index strategies are required for different Query Vs. Creation/Update/Delete operations ratios. If your Database is under a heavy load of queries but is rarely updated, the performance for the overall application will be better if you add every index that improves query response times. On the other hand, if your Database is constantly being updated but there are not large query operations, then the performance will be better if you use less indexes.

There are other aspects of course: Database Schema design, Storage Strategy, Network design, Backup strategy, Stored Procedures/Triggers/Etc. programming, Application Programming (against the Database), Etc. All these aspects are impacted differently by distinct concepts of “size” (record size, record count, index size, index count, schema design, storage size, etc.).

I'd like to have more time as this topic is fascinating. I hope this small contribution serves as an starting point for you in this fascinating world of SQL.

Solution 5 - Database

You have to account for hardware advancement for this definition:

  1. Small database: working set fits into the physical RAM of a single commodity server (about 16GB now)

  2. Medium database: fits into a single or several (through RAID) commodity hard drives on a single machine (up to several TBs now)

  3. Large database: Data needs to distributed across multiple commodity servers in order to fit (up to several PBs now.)

Solution 6 - Database

According to wikipedia article on Very Large Database

>A very large database, or VLDB, is a database that contains an extremely high number of tuples (database rows), or occupies an extremely large physical filesystem storage space. The most common definition of VLDB is a database that occupies more than 1 terabyte or contains several billion rows, although naturally this definition changes over time.

Solution 7 - Database

If you have a database that is large enough that you can't just "back it up" to put on a development or test box, you likely have a "large database".

Solution 8 - Database

I think something like wikipedia, or the US census data is a 'big' database. My personal address lists or todos is a small database. A middle sized database is something in between.

You could try and define the sizes by how many servers you needed. A small database is a component of an application you run on your desktop, a mid-sized database would be a single mysql (whatever) server somewhere, and a large database is going to require multiple servers with some kind of replication/failover support.

Solution 9 - Database

Alternatively, consider the "size" of the database as the amount of time it takes to change the schema used to represent a domain of information. (In actual implementations, databases may contain multiple schemas and disparate domains at once.)

  1. Days = "Small database."
  2. Weeks = "Medium database."
  3. Months = "Large database."
  4. Years = "HUGE database."

With this heuristic, "size" ultimately an aspect of the information stored and the rate at which the information can be fully transformed. Such an approach based on time also maintains some semblance of how-does-this-affect-design-decisions as the sheer amount of data / number of rows increases and the performance of technology & implementations increases.

A variation of the above is to consider the “size” based on the amount of time required for management and routine maintenance. At the amount of data increases, so do the time for tasks such as backups, rebuilds, and upgrades. Without significant investment this may outpace the time available for such tasks.

Regardless, the key factor of “size” is time.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
QuestionRandinView Question on Stackoverflow
Solution 1 - DatabaseJohn FeminellaView Answer on Stackoverflow
Solution 2 - DatabasedkretzView Answer on Stackoverflow
Solution 3 - DatabasecoreView Answer on Stackoverflow
Solution 4 - DatabasevmarquezView Answer on Stackoverflow
Solution 5 - DatabaseobecalpView Answer on Stackoverflow
Solution 6 - DatabasekarlcowView Answer on Stackoverflow
Solution 7 - DatabasepearcewgView Answer on Stackoverflow
Solution 8 - DatabaseZoredacheView Answer on Stackoverflow
Solution 9 - Databaseuser2864740View Answer on Stackoverflow