Large public datasets?

Database, Performance, Dataset, Benchmarking

Database Problem Overview


I am looking for some large public datasets, in particular:

  1. Large sample web server logs that have been anonymized.

  2. Datasets used for database performance benchmarking.

Any other links to large public datasets would be appreciated. I already know about Amazon's public datasets at: http://aws.amazon.com/publicdatasets/

Database Solutions


Solution 1 - Database

> 1. Large sample web server logs that have been anonymized.

These work to start with:

There are many, many more data sets available than these (see the gamut of other answers), but this is the lowest hanging fruit that meets your original criteria. As a bonus, they have a contact link if you have specific needs they may know of.

> 2. Datasets used for database performance benchmarking.

This sounds like a misnomer, because you're asking for empirical data sets that describe well-defined algorithmic problems. Specifically, it sounds like you're trying to find sets of data that you can use to test and benchmark various database systems in real time, using well-defined, normalized relational data that can be used as a set of test cases for determining the most efficient solution that meets your needs.

I don't agree with this approach. Instead of collecting a litany of database systems and their canned implementations, it's far better to explore the algorithmic guarantees of these systems as your first port of call. Once you've determined the algorithmic constraints that meet your needs, you can home in on a set of canned solutions and benchmark them on the efficiency of, for example, indexing, sorting, searching, insertion, deletion, and retrieval.

Wikipedia provides a terse article on database testing concepts that you can use to design and write test cases for benchmarking performance. For example, you might use a vendor-agnostic data access interface like JDBC, together with JDBC Benchmark, to determine the relative timings of each operation. From here, you can home in on a suitable solution.
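As a rough illustration of timing individual operations, here is a minimal sketch in Python. It uses the standard-library `sqlite3` module rather than JDBC, purely as an assumption for the sake of a self-contained example. The table name `items`, the helper `time_operation`, and the row count are all hypothetical. Single-run wall-clock timings like these are noisy, so treat the numbers as relative indications only.

```python
import sqlite3
import time


def time_operation(label, fn, results):
    """Run fn once and record its wall-clock duration in seconds."""
    start = time.perf_counter()
    fn()
    results[label] = time.perf_counter() - start


def run_benchmark(row_count=10_000):
    """Time bulk insert, primary-key lookup, and delete on an in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, payload TEXT)")
    rows = [(i, f"payload-{i}") for i in range(row_count)]
    results = {}

    def insert_all():
        # "with conn" wraps the statements in a single committed transaction.
        with conn:
            conn.executemany("INSERT INTO items VALUES (?, ?)", rows)

    def lookup_all():
        for i in range(row_count):
            conn.execute("SELECT payload FROM items WHERE id = ?", (i,)).fetchone()

    def delete_all():
        with conn:
            conn.execute("DELETE FROM items")

    time_operation("insert", insert_all, results)
    time_operation("lookup", lookup_all, results)
    time_operation("delete", delete_all, results)

    remaining = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
    conn.close()
    return results, remaining


if __name__ == "__main__":
    timings, _ = run_benchmark()
    for name, seconds in timings.items():
        print(f"{name}: {seconds:.4f} s")
```

The same pattern should carry over to any database with a Python driver: swap the connection and keep the timed operations identical, so that the comparison stays fair.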

In short, go to the research first for determining database guarantees. Once a set of candidate solutions has been identified, you can select amongst those by testing (or otherwise determining) the constant time performance of each desired operation.

Solution 2 - Database

Based on Quora answers and collections from my own studies, the awesome-public-datasets repository was created and is actively updated on GitHub:

Below is a snapshot of this list. For the latest version, please visit GitHub:

This list of public data sources is collected and tidied from blogs, answers, and user responses. Most of the data sets listed below are free; however, some are not. This list comes from https://github.com/caesar0301/awesome-public-datasets.

Climate

Economics

Finance

Biology

Physics

Healthcare

GeoSpace

Transportation

Government

Data Challenges

Machine Learning

Natural Language

Image Processing

Time Series

Social Sciences

Complex Networks

Computer Networks

Data SEs

Public Domains

Complementary Collections

Solution 3 - Database

Solution 4 - Database

Just a thought:

Solution 5 - Database

Well, for the web server logs you could always just generate them in the format you need. If you are going to test code against them, they will have to be tailored to the fields you want to store or parse anyway.
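A minimal sketch of such a generator, assuming the Apache Common Log Format is the target. The paths, methods, status codes, and the function name `generate_log_lines` are all made up for illustration. The IP addresses come from 203.0.113.0/24, a range reserved for documentation, so nothing needs anonymizing.

```python
import random
from datetime import datetime, timedelta, timezone

# Hypothetical request vocabulary; adjust to the fields your parser expects.
PATHS = ["/", "/index.html", "/about", "/api/items", "/static/site.css"]
METHODS = ["GET", "GET", "GET", "POST"]  # weighted towards GET
STATUSES = [200, 200, 200, 301, 404, 500]  # weighted towards 200


def generate_log_lines(count, seed=0):
    """Return `count` synthetic log lines in Common Log Format."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    current = datetime(2024, 1, 1, tzinfo=timezone.utc)
    lines = []
    for _ in range(count):
        # Advance the clock by 0-5 seconds so timestamps never go backwards.
        current += timedelta(seconds=rng.randint(0, 5))
        ip = f"203.0.113.{rng.randint(1, 254)}"
        timestamp = current.strftime("%d/%b/%Y:%H:%M:%S %z")
        request = f"{rng.choice(METHODS)} {rng.choice(PATHS)} HTTP/1.1"
        status = rng.choice(STATUSES)
        size = rng.randint(200, 50_000)
        lines.append(f'{ip} - - [{timestamp}] "{request}" {status} {size}')
    return lines


if __name__ == "__main__":
    for line in generate_log_lines(5):
        print(line)
```

Because the generator is seeded, the same seed always yields the same log, which makes parser tests repeatable.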

For the datasets used for database performance benchmarking, you'll probably want to look at a tool that can generate data for you. Red Gate has a great one for not too much money.
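If a commercial tool is not an option, a small script can also produce test data. The sketch below is not how Red Gate's product works. It is a hand-rolled stand-in that writes a hypothetical `orders` table as CSV, with made-up column names and value ranges, ready for bulk loading into whichever database you are benchmarking.

```python
import csv
import io
import random

FIELDNAMES = ["order_id", "customer_id", "quantity", "unit_price_cents"]


def generate_orders(row_count, customer_count=100, seed=0):
    """Yield `row_count` synthetic order rows as dictionaries."""
    rng = random.Random(seed)  # seeded so the data set is reproducible
    for order_id in range(1, row_count + 1):
        yield {
            "order_id": order_id,
            "customer_id": rng.randint(1, customer_count),
            "quantity": rng.randint(1, 20),
            "unit_price_cents": rng.randint(100, 100_000),
        }


def write_orders_csv(stream, row_count, **kwargs):
    """Write a header plus `row_count` generated rows to a text stream."""
    writer = csv.DictWriter(stream, fieldnames=FIELDNAMES)
    writer.writeheader()
    for row in generate_orders(row_count, **kwargs):
        writer.writerow(row)


if __name__ == "__main__":
    buffer = io.StringIO()
    write_orders_csv(buffer, 5)
    print(buffer.getvalue())
```

Since rows are yielded one at a time, the row count can be scaled up to millions without holding the whole data set in memory. Pass an open file instead of `io.StringIO` to write to disk.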

Solution 6 - Database

Google Fusion Tables has a few.

http://tables.googlelabs.com/

Solution 7 - Database

Datasets available here as well.

Solution 8 - Database

Kaggle.com frequently has data mining challenges. The datasets cover a wide range of fields, from healthcare provider data to credit history information. Perhaps something there is what you're after.

Solution 9 - Database

http://Quandl.com has over 10 million data sets gleaned from all over the internet. The great thing about this resource is that it gives a single way to access all of the data. The site has a free Excel plug-in, and there are libraries for R, Python, Ruby, etc.

Solution 10 - Database

Solution 11 - Database

I am surprised no one mentioned Google N-Grams. More on N-Grams at http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html

Solution 12 - Database

Perhaps some databases used as training sets for face recognition algorithms: face-rec.org

Solution 13 - Database

Well, this one is new and there is a challenge behind it:

Million Song Dataset Challenge

Attributions

All content for this solution is sourced from the original question on Stack Overflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stack Overflow
Question | Jason | View Question on Stack Overflow
Solution 1 - Database | MrGomez | View Answer on Stack Overflow
Solution 2 - Database | caesar0301 | View Answer on Stack Overflow
Solution 3 - Database | Gene De Lisa | View Answer on Stack Overflow
Solution 4 - Database | Jason S | View Answer on Stack Overflow
Solution 5 - Database | kemiller2002 | View Answer on Stack Overflow
Solution 6 - Database | Carter Medlin | View Answer on Stack Overflow
Solution 7 - Database | viper | View Answer on Stack Overflow
Solution 8 - Database | Rishi | View Answer on Stack Overflow
Solution 9 - Database | Brian Risk | View Answer on Stack Overflow
Solution 10 - Database | alex | View Answer on Stack Overflow
Solution 11 - Database | Vishnu Pedireddi | View Answer on Stack Overflow
Solution 12 - Database | Mihai Todor | View Answer on Stack Overflow
Solution 13 - Database | zeroDivisible | View Answer on Stack Overflow