I am constantly amazed by the energy and momentum around data science. Only a few years ago, I would be met with a blank stare when I told someone I planned on going to grad school for machine learning. Today, there is no need for my “it's like computer science, linear algebra, and statistics had a combined love child” analogy as most people instantly respond with “Oh, like AI!”
The conversations around machine learning and the data science industry aren’t always pleasant. In fact, machine learning and data science have gained a bit of notoriety. This conversation, however, is coming from an optimist. So today, I am highlighting a few movements around the community’s use of data that I am proud to see grow into prominence and hope will continue to forge the future of our industry.
Manifesto for Data Practices
The Manifesto for Data Practices is built upon four values and 12 principles that, when “taken together, describe the most effective, ethical, and modern approach to data teamwork.” The beauty of the manifesto, in my opinion, is how well it applies to domains ranging from algorithm design to data entry.
I see this manifesto as a solution to problematic parallels I connect between the energy and data industries. Fossil fuels have historically been incredibly profitable and have made modern goods, transportation, and industry possible. However, the industry has done a terrible job of accounting for negative externalities like the long-term impact on climate change and the environment. The data industry has also been incredibly profitable: we have seen the growth of tech giants like Google and Facebook built upon the exchange of data. It has made conveniences like “free” content viable, as publishers can monetize their content through digital marketing streams. However, just like fossil fuels and climate change, we have seen the negative externalities of the current data usage model come to the forefront with issues like filter bubbles and data privacy.
But just as there are more ethical ways to power our world, there are more ethical ways to handle our data. I fully believe the few simple guidelines in the manifesto are the key to a sustainable data model.
FAIR Data Principles
While the Manifesto for Data Practices works to establish an ethical framework for our use of data, the FAIR data principles tackle the siloed nature of data across organizations. Oftentimes, whether purposefully or not, datasets on the same subject have no way of connecting.
I have attended multiple webinars over the past couple of months on FAIR data movements within the pharmaceutical industry. Historically, clinical trial data has been treated as an asset or competitive advantage for one pharma company over another. However, all companies are harmed when they cannot see how patients who appear in multiple trials fared in a second trial, or how their outlook has changed over time, if those trials occur within different organizations or companies. The ability to tie these datasets together could greatly increase the rate of innovation across the industry, as everyone involved gains access to more robust data. While there are plenty of technical and political hurdles to overcome before this becomes standard practice within pharma, some are working to make it happen. It is a golden example of an industry that could greatly benefit from implementing FAIR data practices. More importantly, it is an example that could have a profound impact on people’s health and daily lives.
The Manifesto and FAIR Data in practice today
There is a subset of the community where these two data guidelines are alive and well: data science competitions. Crowd-sourced data science competitions are popular among data scientists of all levels. When I was first breaking into the industry, competitions provided me with examples to learn from and forums to have my questions answered. Through grad school, they provided repositories of interesting data to use for projects. As I continue to develop my skill set, they are a place to see how you stack up against some of the best data scientists in the world (and even make money if you are skilled enough).
On these competition sites (Kaggle being the most popular), organizations post a labeled training dataset, an unlabeled test dataset, a goal, scoring criteria, and an end date. My personal favorite competition site is Driven Data, which specializes in projects focused on social good. Community members leverage the training data to create models and submit predictions for the test data as many times as they wish until the end date. The team with the best results based on the scoring criteria at the close date can win everything from bragging rights and swag to large amounts of money.
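To make that workflow concrete, here is a minimal sketch of the organizer-side loop. The team names, labels, and scoring metric are all hypothetical (nothing here comes from a real Kaggle or Driven Data contest): competitors submit predictions repeatedly, the organizer scores each submission against hidden test labels, and the leaderboard keeps each team’s best score.

```python
def score(predictions, hidden_labels):
    """Organizer-side scoring: fraction of test rows predicted correctly."""
    correct = sum(p == y for p, y in zip(predictions, hidden_labels))
    return correct / len(hidden_labels)

# The organizer keeps the test labels hidden; competitors only ever see features.
hidden_labels = [0, 1, 1, 0, 1]

# Competitors may submit as many times as they like before the deadline;
# the leaderboard records each team's best score so far.
leaderboard = {}

def submit(team, predictions):
    s = score(predictions, hidden_labels)
    leaderboard[team] = max(s, leaderboard.get(team, 0.0))
    return s

submit("team_a", [0, 1, 0, 0, 1])   # first attempt: 4 of 5 correct
submit("team_a", [0, 1, 1, 0, 1])   # iterated model, better score
submit("team_b", [1, 1, 1, 0, 0])

winner = max(leaderboard, key=leaderboard.get)
print(winner, leaderboard[winner])  # → team_a 1.0
```

Real platforms add wrinkles this sketch omits, such as splitting the hidden labels into public and private leaderboards so teams cannot overfit to the visible score.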
Now let’s take a closer look at how crowd-sourced data science competitions stand up against the manifesto’s four values and the four letters that make up the FAIR acronym.
Manifesto Values
- Inclusion - “Maximize diversity, connectivity, and accessibility among data projects, collaborators, and outputs.”
- Experimentation - “Emphasize continuously iterative testing and data analysis.”
- Accountability - “Behave ethically and transparently, fix mistakes quickly, and hold ourselves and others accountable.”
- Impact - “Prioritize projects with well-defined goals, and design them to achieve measurable, substantive outcomes.”
On the manifesto front, crowd-sourced competitions like Kaggle and Driven Data are perfect examples of inclusion, as they are open to anyone willing to take on the challenge. Regardless of location, demographic, or skill set, you can take on some of the most interesting problems data science has to offer. Competitions are built on experimentation: users can submit results as many times as they wish before the deadline, which allows for continuous iteration. The competitions are not necessarily transparent, in that the winning source code may not be shown, but they often have forums with kernels where you can view examples. And some of the most popular machine learning techniques have come out of competitions, like the famous collaborative filtering approach that came out of the Netflix Prize. While projects vary in the substantiveness of their outcomes, each competition is built around solving a specific goal and has scoring criteria, so impact is foundational.
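The Netflix Prize popularized matrix factorization as a collaborative filtering technique. As an illustration only (a toy rating matrix and hand-picked hyperparameters, not the actual prize-winning system), here is a minimal sketch that learns low-rank user and item factors with stochastic gradient descent:

```python
import random

random.seed(0)

# Toy user-item rating matrix (None = unobserved); the values are illustrative.
ratings = [
    [5, 3, None, 1],
    [4, None, None, 1],
    [1, 1, None, 5],
    [None, 1, 5, 4],
]

n_users, n_items, k = len(ratings), len(ratings[0]), 2
lr, reg = 0.01, 0.02  # learning rate and L2 regularization strength

# Each user and item gets a small random k-dimensional factor vector.
P = [[random.uniform(0, 0.1) for _ in range(k)] for _ in range(n_users)]
Q = [[random.uniform(0, 0.1) for _ in range(k)] for _ in range(n_items)]

def predict(u, i):
    """Predicted rating is the dot product of user and item factors."""
    return sum(P[u][f] * Q[i][f] for f in range(k))

# SGD over the observed cells: nudge both factor vectors along the error.
for _ in range(5000):
    for u in range(n_users):
        for i in range(n_items):
            if ratings[u][i] is None:
                continue
            err = ratings[u][i] - predict(u, i)
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)

# Training error on the observed cells should now be small.
obs = [(u, i) for u in range(n_users) for i in range(n_items)
       if ratings[u][i] is not None]
rmse = (sum((ratings[u][i] - predict(u, i)) ** 2 for u, i in obs)
        / len(obs)) ** 0.5
```

Once trained, the `None` cells can be filled in with `predict(u, i)`, which is exactly the recommendation use case the Netflix Prize targeted.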
FAIR Data Values
- Findable - “The first step in (re)using data is to find them. Metadata and data should be easy to find for both humans and computers.”
- Accessible - “Once the user finds the required data, she/he needs to know how they can be accessed, possibly including authentication and authorization.”
- Interoperable - “The data usually need to be integrated with other data. In addition, the data need to interoperate with applications or workflows for analysis, storage, and processing.”
- Reusable - “The ultimate goal of FAIR is to optimize the reuse of data. To achieve this, metadata and data should be well-described so that they can be replicated and/or combined in different settings.”
In terms of FAIR data practices, competitions check the findable box, as they often include metadata to help competitors navigate sometimes confusing data. The data are accessible, as users can see descriptions of the data and then download them once they have signed up to compete. Competitions may not be the most interoperable, as they generally don’t rely on tying in additional data, but users can always leverage the competition data by tying it into other projects. And they are reusable: the data are clearly described through text and metadata, and the competition itself provides examples of how people have used the data.
Data science competitions are not perfect, and they may not completely encapsulate both the Manifesto for Data Practices and FAIR data, but they are great examples of most of the principles in practice. I’m glad that in my post-grad journey I’ve wound up on a team that gladly embraces these practices as we develop and shape the platforms we build. We hope movements like these will be a guiding force as the data science industry continues to develop.