Machine learning is driving the future of many industries, from digital advertising to traffic management in densely populated cities. Yet as the field evolves, it remains encumbered by a number of technical issues, including data bias.
Data bias, also known as algorithmic bias, refers to a phenomenon in which an algorithm produces systematically prejudiced output because of inaccurate assumptions made during data collection and processing. Here’s how data bias can affect artificial intelligence (AI) and the pace of its growth and adoption.
Garbage In, Garbage Out
To remove bias from data, AI engineers must be able to pinpoint faulty or incomplete information that is likely to result in inaccurate predictions. Since AI models rely on the data fed to them to produce output, the “garbage in, garbage out” rule aptly captures the idea that the quality of the input largely determines the quality of the output.
The 2020 documentary Coded Bias shines a spotlight on this alarming yet neglected issue. In Coded Bias, artificial intelligence expert Joy Buolamwini dives deep into how biased data can confer unfair advantages on society’s elite while making it harder for low-income individuals and people of color to rise above their circumstances.
For instance, the Netflix documentary claims that some algorithms used by banks and lenders to identify creditworthy borrowers give heavier weight to Caucasian males. Furthermore, facial recognition algorithms struggled to identify African Americans and other people with darker skin tones.
Types of Data Bias
There are different types of data bias that research analysts and AI engineers deal with on a daily basis. These include the following:
Sample Bias
Sample bias happens when a dataset fails to reflect the realities of the environment in which the model will be applied. A good example is a facial recognition system trained mainly on images of Caucasian males. When tested with images of women and people of color, a model trained on such a skewed sample performs with a low level of precision.
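The effect can be sketched with a toy example. A trivial “model” that always predicts the majority class it saw in a skewed training sample looks accurate on that sample but fails completely on the underrepresented group; the group names and counts below are made up for illustration.

```python
from collections import Counter

# Hypothetical training sample: 95 examples from group_a, only 5 from group_b
training_labels = ["group_a"] * 95 + ["group_b"] * 5

# A trivial "model" that always predicts the majority class it saw in training
majority_class = Counter(training_labels).most_common(1)[0][0]

# 95% "accuracy" on the biased training sample ...
train_acc = sum(label == majority_class for label in training_labels) / len(training_labels)

# ... but 0% accuracy on the underrepresented group
test_labels = ["group_b"] * 20
test_acc = sum(label == majority_class for label in test_labels) / len(test_labels)

print(f"train accuracy: {train_acc:.0%}, accuracy on group_b: {test_acc:.0%}")
```

High accuracy on the training sample here says nothing about fairness; it only reflects the sample’s own skew.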
Exclusion Bias
This type of bias is often introduced in the data pre-processing phase. Exclusion bias often stems from removing valuable data points that were initially thought to be useless. That said, it can also stem from the systematic exclusion of specific data. For instance, say you have datasets for product orders coming from the U.S. and Canada. If 90 percent of the data are US-based orders, the model may miss the fact that your Canada-based orders are twice as high in average value.
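Using made-up numbers for the U.S./Canada example above, a quick sanity check shows what gets lost if the minority rows are dropped during pre-processing:

```python
from collections import Counter

# Hypothetical order records: (country, order_value_usd)
orders = [("US", 50)] * 90 + [("CA", 100)] * 10

counts = Counter(country for country, _ in orders)
avg_value = {
    country: sum(v for c, v in orders if c == country) / counts[country]
    for country in counts
}

# US orders dominate the sample, yet the average Canadian order is worth twice as much
print(counts)
print(avg_value)
```

A model trained only on the 90 percent US slice would never see that higher-value segment.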
Measurement Bias
Measurement bias happens when the data gathered to train a model differs from the data gathered in real-world conditions. It may also stem from inaccurate measurements that distort the data. For instance, if you collect training data with one type of camera but production data comes from another camera with different specifications, measurement bias is bound to occur.
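A simple drift check can flag this kind of mismatch before it corrupts predictions. The sketch below compares hypothetical brightness readings from a training camera and a production camera; the readings and the three-standard-deviation threshold are illustrative assumptions, not a standard recipe.

```python
import statistics

# Hypothetical brightness readings from the camera used to collect training data
train_brightness = [100, 102, 98, 101, 99]
# Readings from a different camera used in production
prod_brightness = [130, 128, 132, 131, 129]

# Flag a shift larger than three standard deviations of the training data
shift = abs(statistics.mean(prod_brightness) - statistics.mean(train_brightness))
threshold = 3 * statistics.stdev(train_brightness)

if shift > threshold:
    print("Warning: possible measurement bias between training and production data")
```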
Observer Bias
Also referred to as confirmation bias, this is essentially the subconscious tendency to see what you expect or want to see in the information you are looking at. For instance, when a researcher goes into a project with strong opinions on the subject, those subjective views can affect their labeling habits, which leads to imprecise data.
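One practical way to catch this is to have two annotators label the same items and measure how often they agree; low agreement can signal that subjective opinions are leaking into the labels. A minimal sketch with made-up labels:

```python
# Hypothetical labels from two annotators on the same ten items
annotator_a = ["pos", "pos", "neg", "pos", "neg", "pos", "pos", "neg", "pos", "pos"]
annotator_b = ["pos", "neg", "neg", "neg", "neg", "pos", "neg", "neg", "pos", "neg"]

# Raw inter-annotator agreement: the fraction of items labeled the same way
agreement = sum(a == b for a, b in zip(annotator_a, annotator_b)) / len(annotator_a)
print(f"Raw agreement: {agreement:.0%}")
```

In practice, teams often use a chance-corrected statistic such as Cohen’s kappa, but even raw agreement is a useful first check.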
Prejudice Bias
Prejudice bias happens when the data fed to an AI model reinforces an existing prejudice. For instance, a model trained on a dataset of jobs in which all doctors are male and all nurses are female will propagate those stereotypes, even though doctors can be female and nurses can be male.
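This kind of skew can often be detected before training by cross-tabulating sensitive attributes against labels. The sketch below flags any occupation that appears with only one gender in a hypothetical dataset:

```python
from collections import Counter

# Hypothetical labeled records of (occupation, gender)
records = [("doctor", "male")] * 40 + [("nurse", "female")] * 40

pair_counts = Counter(records)
occupations = {occupation for occupation, _ in records}

# Flag any occupation whose examples all carry the same gender label
for occupation in sorted(occupations):
    genders = {g for (o, g) in pair_counts if o == occupation}
    if len(genders) == 1:
        print(f"'{occupation}' appears with only one gender label: {genders}")
```

Both occupations would be flagged here, prompting the team to rebalance or augment the data before training.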
Avoiding Data Bias in AI Projects
Unfortunately, data bias remains unresolved in many of today’s AI projects, including major projects at large tech companies. To avoid contributing further bias to the datasets used in your own machine learning projects, research your users in advance, identify general use cases as well as outliers, and ensure that your data scientists and data labelers come from diverse backgrounds and experiences.
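One concrete step in that direction is auditing the composition of a dataset before training. The sketch below flags groups that fall under an assumed minimum share; the group names, counts, and the 10 percent threshold are all illustrative and would need to be set per project.

```python
from collections import Counter

# Hypothetical demographic labels attached to a training set
groups = ["group_a"] * 70 + ["group_b"] * 25 + ["group_c"] * 5

counts = Counter(groups)
total = sum(counts.values())
MIN_SHARE = 0.10  # assumed minimum acceptable share per group

for group, n in sorted(counts.items()):
    if n / total < MIN_SHARE:
        print(f"{group} is underrepresented: {n / total:.0%} of the data")
```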
As algorithms continue to be deployed into the real world, it’s important to understand what variables impact the efficacy and fairness of these models.