Highlighted Datasets

These are some open datasets we think you would enjoy working on.

The Pokémon Pokédex

Pokémon started out as a Japanese video game and became a worldwide phenomenon. The link below is to a public API providing access to information about all Pokémon across all seven existing generations, including berries! Two great projects for this dataset would be to create PokéBots: (1) a bot that you could compete against and (2) a bot that could help you train your Pokémon.

A link to the API: http://pokeapi.co/docsv2/#info
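As a starting point for either bot, here is a minimal sketch of pulling one Pokémon's record and reducing it to the fields a bot would likely care about. The base URL, the `/pokemon/<name>` path, and the `types` and `stats` field layout are assumptions about the v2 API and should be checked against the API docs; the function names are our own.

```python
import json
from urllib.request import Request, urlopen

# Assumption: the v2 API serves Pokémon records under this base URL.
BASE_URL = "https://pokeapi.co/api/v2"


def pokemon_url(name):
    """Build the (assumed) endpoint URL for a single Pokémon."""
    return f"{BASE_URL}/pokemon/{name.lower()}"


def summarize(record):
    """Reduce a raw Pokémon record to its name, types and base stats.

    Assumes each entry of record["types"] nests the type name under
    ["type"]["name"], and each entry of record["stats"] holds a
    "base_stat" value next to ["stat"]["name"].
    """
    return {
        "name": record["name"],
        "types": [entry["type"]["name"] for entry in record["types"]],
        "stats": {s["stat"]["name"]: s["base_stat"] for s in record["stats"]},
    }


def fetch_pokemon(name):
    """Download one record from the API and summarize it (needs network)."""
    request = Request(pokemon_url(name), headers={"User-Agent": "pokebot-sketch"})
    with urlopen(request, timeout=10) as response:
        return summarize(json.load(response))
```

Keeping `summarize` separate from the network call means the bot logic can be developed and tested against saved responses, without hitting the API on every run.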

Some additional Pokémon resources are:

  • Pokédex Python module - The name says it all.
  • The Pokédex - A website holding all information about Pokémon. It has no public API (as far as we could tell), but you can scrape it for info.

Datasets Subreddit

Reddit is one of the biggest social bulletin boards and hosts many communities. One of them is the datasets community, which is used for both requesting and publishing open datasets.

A link to the community: https://www.reddit.com/r/datasets/

Some examples are:

  • United States Prisoner Dataset - Contains ~18M records of prisoner admissions, terms, population, etc.
  • Urbana Police Arrests - Urbana (a city in Champaign County, Illinois, USA) published a list of all police arrests since 1988.
  • Monthly Grain Prices in England - All grain prices from 1270 to 1955.
  • All Reddit Posts - A dataset containing all posts published on Reddit.
  • People also request datasets (and are provided with one), so if you browse this community, don't limit yourself to the posts tagged 'dataset'.
  • A scraper for this community was also developed, so you can easily browse all available datasets.

Open Municipality Budgets

Open local budget is a project under The Public Knowledge Workshop aimed at making local authorities' budgets accessible to the public. While the project is in beta, many budgets are already online (some in a more accessible format than others) and can be browsed, though only a single budget at a time, on the project's home page. Being in beta means that there is plenty of room for improvement, and this is where you come in! Some project ideas based on this data are:

  • How can we compare the budgets of different municipalities, or of the same municipality in different years? And how can such a comparison be visualized?
  • Given a budget, can we understand how it is invested geographically/demographically? Moreover, can we derive how a given municipality invests the money it got from taxes/arnona in proportion to the amount paid by each region in the city?
  • Generalize the above two ideas into a tool that is able to take complex queries and produce meaningful results (not necessarily visualized).

The budget data is available in this Google Drive folder.
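One possible way to approach the comparison idea above is to normalize each budget into per-category shares, so that municipalities of very different sizes become comparable. The sketch below assumes a budget has already been loaded into a plain mapping from category name to amount; the category names and figures are hypothetical and do not come from the actual budget files.

```python
def category_shares(budget):
    """Convert absolute amounts per category into shares of the total."""
    total = sum(budget.values())
    if total == 0:
        return {category: 0.0 for category in budget}
    return {category: amount / total for category, amount in budget.items()}


def compare_budgets(budget_a, budget_b):
    """Return, per category, the share in budget A minus the share in budget B.

    Categories missing from one budget are treated as a share of zero.
    """
    shares_a = category_shares(budget_a)
    shares_b = category_shares(budget_b)
    categories = sorted(set(shares_a) | set(shares_b))
    return {
        category: shares_a.get(category, 0.0) - shares_b.get(category, 0.0)
        for category in categories
    }


# Hypothetical figures, for illustration only.
town_a = {"education": 400, "sanitation": 100, "culture": 500}
town_b = {"education": 30, "sanitation": 50, "welfare": 20}

for category, difference in compare_budgets(town_a, town_b).items():
    print(f"{category}: {difference:+.1%}")
```

The same function works for one municipality across two years. In practice the harder part is probably mapping each municipality's own line items onto a shared set of categories.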

Stack Exchange

Starting with Stack Overflow, the Stack Exchange network has grown into a collection of Q&A websites, each dealing with a different topic, from programming to home improvement. These vast knowledge bases, some containing millions of answers, are available to download in XML format.

A link to the dataset: https://archive.org/details/stackexchange
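Because the dumps are large, it is usually better to stream the XML than to load it whole. The sketch below counts questions and answers in a posts file. It assumes posts are stored as `<row>` elements whose `PostTypeId` attribute is "1" for questions and "2" for answers; verify this against the dump's own documentation. The inline sample is made up for illustration.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Assumption: PostTypeId "1" marks a question and "2" marks an answer.
POST_TYPES = {"1": "question", "2": "answer"}

# Made-up sample in the assumed layout, standing in for a real posts file.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" Title="How do I parse XML?" />
  <row Id="2" PostTypeId="2" ParentId="1" />
  <row Id="3" PostTypeId="2" ParentId="1" />
</posts>"""


def count_post_types(source):
    """Stream an XML file (path or binary file object) and count rows per type."""
    counts = Counter()
    for _event, element in ET.iterparse(source, events=("end",)):
        if element.tag == "row":
            kind = POST_TYPES.get(element.get("PostTypeId"), "other")
            counts[kind] += 1
        element.clear()  # drop the element's attributes to limit memory use
    return counts


print(count_post_types(io.BytesIO(SAMPLE)))
```

To run it on a real dump, pass the path of the extracted posts file instead of the in-memory sample.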

Some projects that you could attempt using this dataset are:

  • How many questions are unique? We believe that most questions have been answered before (in some form or another) so why not develop an automated answering system?
  • Could we teach a machine to code based on answers from Stack Overflow?
  • Is there similarity between different sites relating to similar topics? For instance, do questions asked about Latin-based languages have similar answers?

Comments on Israeli MKs' Facebook Posts

We provide a unique dataset of Facebook comments on statuses published by Israeli MKs during 2015-2016. In total there are about 5 million such comments, of which 1,600 are labeled according to the sentiment of the comment's text. A great challenge is to use the 1,600 labeled comments to infer the sentiment of all the other comments. In this folder you'll find the labeled data, some information about the labels, and the unlabeled data. This dataset was collected by the team of Kikar Hamedina, and they will be more than happy to help. Contact the data team if you would like us to put you in touch.
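As one possible baseline for this challenge, the sketch below trains a small multinomial Naive Bayes classifier on labeled comments and applies it to unlabeled text. It is an illustration, not the method used by the dataset's authors. The label names, the whitespace tokenizer, and the English sample comments are assumptions; the real data uses its own label scheme and would need proper tokenization.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately naive tokenizer."""
    return text.lower().split()


def train(labeled_comments):
    """Count words per label from an iterable of (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labeled_comments:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocabulary


def predict(model, text):
    """Return the label with the highest smoothed log-probability."""
    word_counts, label_counts, vocabulary = model
    total_comments = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_comments)  # class prior
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # words never seen in training carry no evidence
            count = word_counts[label][word] + 1  # add-one (Laplace) smoothing
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labeled comments, standing in for the 1,600 labeled ones.
labeled = [
    ("great speech thank you", "positive"),
    ("well done great work", "positive"),
    ("shame on you terrible", "negative"),
    ("terrible decision resign", "negative"),
]
model = train(labeled)
print(predict(model, "great work"))
```

With only 1,600 labels against roughly 5 million comments, a natural next step would be some form of semi-supervised learning, for example adding the model's most confident predictions back into the training set and retraining.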

Additional Datasets

Here are some additional resources which you can use to find open datasets. This is really just the tip of the iceberg, so if you don't find anything interesting here, it doesn't mean that it doesn't exist at all. If you have something specific in mind and need our help, mail us at data@datahack-il.com.

  • Kaggle - A data science community that hosts some nice datasets, such as school fires in Sweden from 1998 to 2014.
  • The GDELT Project - Monitors broadcast, print, and web news from around the world and identifies people, locations, organizations, emotions and more.
  • The New York Times - One of the most widespread newspapers in the world.
  • Prize4Life - A dataset containing clinical records of ALS patients.
  • European Centre for Medium-Range Weather Forecasts - Datasets containing weather information, one of which contains atmospheric data from 1900 until 2010!
  • NASA - A dataset containing all of NASA's data from biological measurements to software usage.