Popular Machine Learning Datasets

Machine learning models require a large amount of data in order to learn about a specific subject. This collection of data is called a dataset. When working with machine learning methods we typically need separate datasets for different purposes, such as training, validation, and testing.
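One common way to obtain datasets for different purposes is to split a single collection into training, validation, and test subsets. The following is a minimal sketch using only the Python standard library; the function name split_dataset, the 80/10/10 ratios, and the seed are illustrative assumptions rather than a recommendation from any of the sources listed here.

```python
import random


def split_dataset(items, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle a copy of `items` and cut it into train/validation/test lists.

    The fractions and the seed are illustrative defaults (assumptions).
    Whatever is left after the train and validation cuts becomes the test set.
    """
    if train_frac < 0 or val_frac < 0 or train_frac + val_frac > 1:
        raise ValueError("fractions must be non-negative and sum to at most 1")

    shuffled = list(items)
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(shuffled)

    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```

Fixing the seed means the same examples land in the same subset on every run, which makes results easier to compare.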

We can rely on open-source datasets to get started with ML. There is a vast amount of data available for machine learning. This article lists some useful sources of datasets for data science projects.

Datasets for ML

Some Data Aggregators

These sites offer datasets across multiple categories.


UCI Machine Learning Repository

Dataset Search

Registry of Open Data on AWS

List of datasets for machine-learning research – Wikipedia

VisualData Discovery

Machine Learning Datasets | Papers With Code



Dataset list

Computer Vision Datasets

I. Image Datasets

Computer Vision Datasets : Roboflow is a dataset aggregator for computer vision and has a few open-source datasets.

Deepmind open source data : The site hosts large-scale, high-quality datasets (the Kinetics datasets) of URL links to up to 650,000 video clips covering 400, 600, or 700 human action classes, depending on the dataset version.

CIFAR-10 and CIFAR-100 datasets : CIFAR is a popular pair of datasets used for object recognition. Each has 60,000 32×32 color images, in 10 and 100 classes respectively.

ImageNet : ImageNet has more than 14 million images and is widely used for object recognition.

MNIST handwritten digit database, Yann LeCun, Corinna Cortes and Chris Burges : MNIST is a large database of handwritten single digits containing 60,000 training images and 10,000 testing images. 

Stanford CARS dataset : The dataset contains 16,185 images of 196 classes of cars. The data is split into 8,144 training images and 8,041 testing images, with each class divided roughly 50-50 between the two.
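As a quick sanity check on the Stanford Cars figures quoted above, the arithmetic can be sketched in a few lines of Python. The variable names are illustrative; the numbers are the ones given in the description.

```python
# Figures quoted in the Stanford Cars description.
train_images = 8_144
test_images = 8_041
num_classes = 196

total_images = train_images + test_images
train_share = train_images / total_images
avg_images_per_class = total_images / num_classes

print(total_images)                    # total number of images
print(round(train_share, 3))           # fraction of images used for training
print(round(avg_images_per_class, 1))  # average images per car class
```

The training share comes out just above one half, which matches the "roughly 50-50" wording.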

Stanford actions dataset : The Stanford 40 Action Dataset contains images of humans performing 40 actions. There are 9,532 images in total, with 180-300 images per action class.

MS COCO : The MS COCO (Microsoft Common Objects in Context) dataset consists of 328K images. It contains annotations for object detection, keypoint detection, panoptic segmentation, stuff image segmentation, captioning, and dense human pose estimation.

LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop | Papers With Code : It contains around one million labeled images for each of 10 scene categories and 20 object categories, intended for visual recognition algorithms.

CAVE | Software: COIL-100: Columbia Object Image Library : It contains data for object recognition.

Visual Genome : It is a database of images annotated with scene graphs and region descriptions.

Google AI Blog: Introducing the Open Images Dataset : It is a dataset consisting of ~9 million URLs to images that have been annotated with labels spanning over 6000 categories.

YouTube-8M: A Large and Diverse Labeled Video Dataset for Video Understanding Research : YouTube-8M is a large-scale labeled video dataset that consists of millions of YouTube video IDs, with high-quality machine-generated annotations from a diverse vocabulary of 3,800+ visual entities.

MIT Indoor Scene Recognition : The database contains 67 indoor categories and a total of 15,620 images for indoor scene recognition.

xView dataset : xView is one of the largest publicly available datasets of overhead imagery. It contains images from complex scenes around the world, annotated using bounding boxes.

Large-scale CelebFaces Attributes (CelebA) Dataset : CelebFaces Attributes Dataset (CelebA) is a large-scale face attributes dataset with more than 200K celebrity images, each with 40 attribute annotations. The images in this dataset cover large pose variations and background clutter.

Places : The database contains 205 scene categories and 2.5 million images with a category label.

Visual Question Answering : VQA is a new dataset containing open-ended questions about images.

Cityscapes Dataset : The dataset contains a diverse set of stereo video sequences recorded in street scenes from 50 different cities, with high-quality pixel-level annotations of 5,000 frames in addition to a larger set of 20,000 weakly annotated frames.

LFW Face Database : Main : It is a large-scale database of more than 13,000 face photographs designed for facial recognition tasks.

Stanford Dogs Dataset : The Stanford Dogs dataset contains approximately 20,000 images of 120 breeds of dogs from around the world.

II. Video Datasets

BDD100K Dataset : BDD100K is one of the most diverse open driving video datasets. It is well suited to automotive applications: it is large-scale and offers data variation, temporal information, and annotated street footage.

KITTI : KITTI is one of the most popular datasets for use in mobile robotics and autonomous driving. It consists of hours of traffic scenarios recorded with a variety of sensor modalities, including high-resolution RGB, grayscale stereo cameras, and a 3D laser scanner. 

VOT2016 Challenge | Dataset : This video dataset is used for visual object tracking.

UCF101 – Action Recognition Data Set : UCF101 is an action recognition data set of realistic action videos, collected from YouTube, having 101 action categories.

Serre Lab » HMDB: a large human motion database : It is a database containing videos of human motion.

DAVIS dataset : Densely Annotated Video Segmentation is a dataset for in-depth analysis of the state-of-the-art in video object segmentation.

Government Data Aggregators

These government sites (national and local) offer datasets across multiple categories.

Data USA

European Data Portal


UK Data Service

Annual Survey of School System Finances – CKAN

National Center for Education Statistics

Open Data (bouldercolorado.gov)

PJM – Data Directory

Mendeley Data

Finance Datasets

Global Financial Development Database : The Global Financial Development Database is an extensive dataset of financial system characteristics for 214 economies.

Markets data : Data about markets

Data Sources : A huge amount of economic data is available on this site

IMF Data : The IMF publishes a range of time series data on IMF lending, exchange rates, and other economic and financial indicators.

Natural Language Processing Datasets

The NLP Index : The site lists many datasets used for NLP algorithms.

Enron Email Dataset : It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 500,000 messages.

Google Ngram Viewer : It is a vast collection of words extracted from the Google Books corpus.

The Wikipedia Corpus : This corpus contains the full text of Wikipedia, and it contains 1.9 billion words in more than 4.4 million articles.

Yelp Dataset : The Yelp dataset is a subset of Yelp's businesses, reviews, and user data, made available for personal, educational, and academic use.

200,000+ Jeopardy! Questions in a JSON file : r/datasets : It is a JSON file containing 216,930 Jeopardy questions, answers, and other data.
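A JSON file of this kind can be explored with the Python standard library alone. The sketch below parses a small inline sample rather than the real file, and counts records per category. The field names (category, question, answer) are assumptions about the file's layout, and the sample records are made-up placeholders, not actual entries from the dataset.

```python
import json
from collections import Counter

# Made-up placeholder records standing in for the real file.
# The keys "category", "question", and "answer" are assumed field names.
SAMPLE_JSON = """
[
  {"category": "HISTORY", "question": "placeholder clue A", "answer": "placeholder answer A"},
  {"category": "SCIENCE", "question": "placeholder clue B", "answer": "placeholder answer B"},
  {"category": "HISTORY", "question": "placeholder clue C", "answer": "placeholder answer C"}
]
"""


def count_by_category(records):
    """Return a Counter mapping each category to its number of records."""
    return Counter(record["category"] for record in records)


records = json.loads(SAMPLE_JSON)
counts = count_by_category(records)
print(counts.most_common(1))
```

For the real file, json.load on an open file handle would replace json.loads on the inline string, provided the file is a single JSON array as assumed here.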

Home Page for 20 Newsgroups Data Set : The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. 

Stanford Natural Language Inference (SNLI) Corpus : The Stanford Natural Language Inference (SNLI) corpus (version 1.0) is a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral. 

Large Movie Review Dataset : This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. It provides a set of 25,000 highly polar movie reviews for training and 25,000 for testing.

Audio Datasets

Mozilla Common Voice Dataset : It is an open-source, multi-language dataset of voices that anyone can use to train speech-enabled applications.

AudioSet – A sound vocabulary and dataset : AudioSet consists of an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos. The ontology is specified as a hierarchical graph of event categories, covering a wide range of human and animal sounds, musical instruments and genres, and common everyday environmental sounds.
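To get a feel for the scale of AudioSet, the clip count above can be turned into a rough total duration. This is a back-of-the-envelope sketch that assumes every clip is exactly 10 seconds long, so the result should be read as an upper bound; the variable names are illustrative.

```python
# Figures quoted in the AudioSet description.
num_clips = 2_084_320
seconds_per_clip = 10  # assumption: every clip is a full 10 seconds

total_seconds = num_clips * seconds_per_clip
total_hours = total_seconds / 3600

print(total_seconds)          # total audio duration in seconds
print(round(total_hours, 1))  # the same duration in hours
```

Under that assumption the collection amounts to several thousand hours of audio.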

LibriSpeech ASR corpus : LibriSpeech is a corpus of approximately 1,000 hours of 16 kHz read English speech used for automatic speech recognition (ASR).

The VoxCeleb1 Dataset : VoxCeleb1 contains over 100,000 utterances for 1,251 celebrities, extracted from videos uploaded to YouTube.

VoxCeleb : VoxCeleb is an audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube.

The Spoken Wikipedia Corpora : The SWC is a corpus of aligned Spoken Wikipedia articles from the English, German, and Dutch Wikipedia.
