10 awesome ML datasets that deserve your attention!


Data is everything these days. Datasets keep popping up on big platforms like Kaggle and Google Dataset Search. We’ve never had access to so much data, and because of that, a few good datasets don’t get the attention they deserve.

In this article, I’d like to bring some of these datasets to light and hopefully give them that attention!

1. Common Crawl Corpus


The Common Crawl corpus is an open collection of web pages that have been crawled from the internet since 2008. It does not hold literally every page ever published, but it covers a huge slice of the public web. Go check it out for yourself on their website.

This dataset is absolutely massive, probably one of the biggest datasets out there. And what’s more, it is completely free! Anyone with an internet connection can access the data, which is stored in AWS S3 buckets.

The corpus contains tens of billions of web pages and is updated monthly. With Common Crawl you can extract almost any kind of web data you need. It is frustrating that something this useful is not more widely known.
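You rarely need to download whole archives to get started. Here is a minimal sketch of looking up a site in Common Crawl’s URL index, which answers with one JSON record per line. The crawl ID you pass in is a placeholder you would pick from the list on their website, and the function names are my own, not part of any official client.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

INDEX_HOST = "https://index.commoncrawl.org"


def build_index_url(crawl_id: str, url_pattern: str) -> str:
    """Build the index query URL for one crawl (crawl_id is chosen by the caller)."""
    query = urlencode({"url": url_pattern, "output": "json"})
    return f"{INDEX_HOST}/{crawl_id}-index?{query}"


def parse_index_lines(body: str) -> list[dict]:
    """The index returns one JSON object per line; blank lines are skipped."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def lookup(crawl_id: str, url_pattern: str) -> list[dict]:
    """Network call: fetch and parse the index records for a URL pattern."""
    with urlopen(build_index_url(crawl_id, url_pattern), timeout=30) as resp:
        return parse_index_lines(resp.read().decode("utf-8"))
```

Each record points at the archive file and byte range that hold the captured page, so you can fetch only the pages you care about.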

2. StockNote API


This one is for all the stock market enthusiasts out there. StockNote is a service provided by an Indian stockbroker called Samco. It comes with an API that, unlike those of most other brokers, is free to use. All you need to use the API is a Demat account.

The API provides prices for different types of financial instruments like stocks, futures, derivatives, commodities, and currencies on the NSE (National Stock Exchange), BSE (Bombay Stock Exchange), MCX (Multi Commodity Exchange of India), and CDS (Currency Derivatives Segment of the National Stock Exchange).

The API is well documented, and an SDK is available on GitHub. Using the API, you can get per-minute prices for each instrument for the past 30 days, real-time prices during market hours, and daily prices. I have used it in a few projects and rate it 10/10.

3. YFinance (Yahoo Finance)


YFinance is another stock market-related tool. It is an open-source Python library that pulls its data from Yahoo Finance, and it is not an official Yahoo product.

Unlike StockNote, YFinance isn’t restricted to the Indian stock market. It provides data on instruments from many kinds of financial markets: stocks, derivatives, forex, and even cryptocurrencies.

In my experience, it is not as well documented as the StockNote API, but it does provide useful data. The Python module is available on PyPI (https://pypi.org/project/yfinance/).
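To give a feel for it, here is a small sketch that pulls daily closing prices and turns them into day-over-day returns. It assumes you have run `pip install yfinance` and have a network connection; the two helper names are mine, and the ticker symbol you pass in is up to you.

```python
def fetch_daily_closes(symbol: str, period: str = "1mo") -> list[float]:
    """Download daily closing prices (needs the yfinance package and network)."""
    import yfinance as yf  # imported lazily so daily_returns works without it

    history = yf.Ticker(symbol).history(period=period)
    return [float(price) for price in history["Close"]]


def daily_returns(closes: list[float]) -> list[float]:
    """Fractional change between consecutive closes (0.1 means +10%)."""
    return [(today - prev) / prev for prev, today in zip(closes, closes[1:])]
```

For example, `daily_returns(fetch_daily_closes("AAPL"))` would give you roughly a month of daily returns to plot or summarise.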

4. COVID-19 in India

Covid 19 India dataset

As the name suggests, this dataset records the daily Covid-19 cases, deaths, and recoveries in states and Union Territories of India.

The dataset is excellent for beginners, as it lets them practice data cleaning, data visualization, and inferential statistics. Check out the dataset for yourself.
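As a taste of the cleaning involved, here is a sketch that converts running totals into per-day increases for each state. It assumes the counts are cumulative and that the columns are named `Date`, `State/UnionTerritory`, and `Confirmed`; check the real file before relying on that. The sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in the assumed layout; the real file may differ.
SAMPLE_CSV = """Date,State/UnionTerritory,Confirmed
2020-03-01,Kerala,3
2020-03-02,Kerala,3
2020-03-03,Kerala,5
2020-03-02,Delhi,1
2020-03-03,Delhi,4
"""


def daily_new_cases(csv_text: str) -> dict[str, list[tuple[str, int]]]:
    """Turn cumulative confirmed counts into per-day increases for each state."""
    by_state = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        by_state[row["State/UnionTerritory"]].append(
            (row["Date"], int(row["Confirmed"]))
        )

    result = {}
    for state, series in by_state.items():
        series.sort()  # ISO dates sort chronologically as plain strings
        previous = 0
        increases = []
        for date, total in series:
            # The first row of a state counts its whole total as new cases.
            increases.append((date, total - previous))
            previous = total
        result[state] = increases
    return result
```

A negative increase in the output is a handy signal that a day’s total was corrected downward or mistyped, which is exactly the kind of thing worth investigating.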

I myself wrote a Kaggle notebook on this. You can check it out here.

5. Weather in Szeged City

This dataset has weather data recorded on an hourly basis in Szeged, a city in Hungary, and can be very helpful for regression practice.
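If you have never fitted a regression by hand, here is a bare-bones sketch of ordinary least squares with a single feature. I am assuming you would predict temperature from humidity; those column choices and the toy numbers below are mine and are not taken from the dataset.

```python
def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


# Made-up readings for illustration: humidity (0 to 1) and temperature (°C).
humidity = [0.9, 0.8, 0.6, 0.5]
temperature = [8.0, 11.0, 17.0, 20.0]
slope, intercept = fit_line(humidity, temperature)
```

Once this makes sense, swapping in a library such as scikit-learn and adding more features is a natural next step.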

A lot of people at different skill levels have worked on this dataset, so you will find a variety of approaches to learn from, ranging from beginner to advanced.

Check the dataset out for yourself.

6. Face Mask Detection

With the rise of Covid-19, the importance of face masks skyrocketed. Wearing masks incorrectly became a significant health hazard, and something needed to be done to prevent this. Wouldn’t it be convenient if computers could identify whether someone was wearing their mask correctly? And thus was born this dataset…or at least I think that’s how it was born.

The dataset has 853 images of people, and each image belongs to one of 3 classes:

  • With mask
  • Without mask
  • Mask worn incorrectly

This dataset would be good practice for all the CV (computer vision) enthusiasts out there. Check the dataset out for yourself.
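Before training anything, it helps to see how balanced the classes are. The sketch below counts labels in one annotation, assuming each image ships with a PASCAL VOC-style XML file and that labels look like `with_mask` and `without_mask`. Both the format and the label strings are my assumptions, and the sample XML is made up.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up annotation in the assumed PASCAL VOC-style layout.
SAMPLE_ANNOTATION = """<annotation>
  <filename>example.png</filename>
  <object><name>with_mask</name></object>
  <object><name>with_mask</name></object>
  <object><name>without_mask</name></object>
</annotation>"""


def count_labels(xml_text: str) -> Counter:
    """Count how often each class label appears in one annotation file."""
    root = ET.fromstring(xml_text)
    return Counter(obj.findtext("name") for obj in root.iter("object"))
```

Summing these counters over every annotation file gives you the overall class distribution, which tells you whether you need to rebalance before training.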

7. Captcha Images

I love automation. Ever since I got my hands on Selenium, I’ve always thought of writing code that’ll do any of my mundane but necessary tasks…but damn you captcha tests! Don’t even get me started on how the captcha tests have ruined my automation dreams.

But thankfully other people had this problem too. That’s why the Captcha Images dataset sprang into existence.

Check the dataset out for yourself.

8. COVID-19 Vaccine Articles

Covid-19 and vaccines are controversial topics in the USA. There seems to be no end to the number of people who doubt both.

So when the Covid-19 vaccine was announced, there were bound to be all sorts of emotions around it. Or should I say, all sorts of sentiments?

Yeah, that’s right. The Covid-19 Vaccine Articles dataset provides excellent practice for sentiment analysis. The dataset consists of 30 articles from 6 major news websites in the USA. Click here to check the dataset out for yourself. Also, check out a paper on the same topic on nature.com.

9. r/FloridaMan

Reddit is a pretty big social media platform. Just as Facebook has Facebook groups, Reddit has subreddits (denoted with an r/ before their name). One of these subreddits is r/FloridaMan.

If you don’t know what a Florida Man actually is, then I’d recommend getting familiar with the meme culture around this topic (or going through the posts on r/FloridaMan) before trying out the dataset.

The dataset has over 40k articles and titles posted on the r/FloridaMan subreddit since 2014, along with their scores. You can apply NLP to it to understand which types of posts do well on r/FloridaMan.
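A simple first pass at “what does well” is to average the score of every post that mentions a given word. Here is a sketch that assumes you have already loaded the data as (title, score) pairs; the function name and the example titles are made up.

```python
import re
from collections import defaultdict


def average_score_by_word(posts, min_count=2):
    """Average post score per word, keeping words seen in at least min_count titles."""
    totals = defaultdict(lambda: [0, 0])  # word -> [score sum, number of posts]
    for title, score in posts:
        # Use a set so a word repeated within one title is counted once.
        for word in set(re.findall(r"[a-z']+", title.lower())):
            totals[word][0] += score
            totals[word][1] += 1
    return {w: s / n for w, (s, n) in totals.items() if n >= min_count}


# Made-up posts for illustration.
posts = [
    ("Florida man wrestles alligator", 120),
    ("Florida man arrested at store", 40),
    ("Alligator found in pool", 80),
]
averages = average_score_by_word(posts)
```

On the real data you would want to drop stop words and raise `min_count`, since rare words with one lucky post will otherwise dominate.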

Click here to check the dataset out for yourself.

10. A Million News Headlines


As the name of the dataset suggests, this dataset has over a million news headlines sourced from ABC (Australian Broadcasting Corporation).

This includes the entire corpus of articles published by the ABC news website in the given date range. With a volume of two hundred articles per day and a good focus on international news, we can be fairly certain that every event of significance has been captured here.

This is another dataset that’ll help in your NLP practice. Click here to check the dataset out for yourself.

We at ml-concepts are also working on creating a massive compilation of publicly available datasets. You should check it out on our machine learning datasets page.

Let us know in the comments if you have come across a hidden gem in datasets and want us to cover it.
