Wednesday, 5 October 2016

Eliminating gender bias from algorithms

Machine learning is found everywhere in our daily lives. Every time we talk to our smartphones, search for images or ask for hotel recommendations, we are interacting with machine learning algorithms. They take large amounts of data as input, like the whole text of an encyclopedia or the complete archives of a newspaper, and analyze the information to extract patterns that might be invisible to human analysts. But when these huge data sets include social bias, the machines learn that too.

A machine learning algorithm is like a newborn baby that has been given billions of books to read without being taught the alphabet or knowing any grammar or words. The power of this type of information processing is impressive, but there is a problem. When it takes in the text data, a computer notices relationships between words based on various factors, including how often they are used together.

We can test how accurately the word relationships are identified by using analogy puzzles. Suppose I ask the system to complete the analogy "He is to Man as She is to Y." If the system comes back with "Woman," then we would say it is successful, because it returns the same answer a human would.
Our research group trained the system on Google News articles, and then asked it to complete another analogy: "Man is to Computer Programmer as Woman is to X." The answer came back: "Housewife."

Scrutinizing bias
We use the same type of machine learning algorithm to generate what are called "word embeddings." Each English word is embedded, or assigned, to a point in space. Words that are semantically related are assigned to points that are close together in space. This type of embedding makes it easy for computer programs to recognize word relationships quickly and efficiently.
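To make "close together in space" concrete, here is a minimal Python sketch. The two-dimensional coordinates are invented by hand purely for illustration. Real embeddings are learned from text and typically have hundreds of dimensions, and the helper name `nearest` is hypothetical.

```python
import math

# Hypothetical hand-picked coordinates (an assumption for illustration only);
# a real system would learn these points from a large text corpus.
EMBEDDINGS = {
    "brother": (0.9, 0.8),
    "sister": (0.8, 0.9),
    "banana": (-0.7, 0.1),
}

def nearest(word):
    """Return the other word whose point lies closest to `word`."""
    origin = EMBEDDINGS[word]
    candidates = (w for w in EMBEDDINGS if w != word)
    # math.dist gives the straight-line (Euclidean) distance between points.
    return min(candidates, key=lambda w: math.dist(origin, EMBEDDINGS[w]))

print(nearest("brother"))
```

In this toy space, "brother" sits next to "sister" and far from "banana", so a program can read off relatedness simply by measuring distances.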
After finding our computer programmer/housewife result, we asked the system to automatically generate huge numbers of "He is to A as She is to B" analogies, completing both portions itself. It returned many sensible analogies, like "He is to Brother as She is to Sister." In analogy notation, which you may remember from your school days, we can write this as "he:brother::she:sister." But it also came back with answers that reflect clear gender stereotypes, such as "he:Father::she:Mother" and "he:doctor::she:nurse."
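One common way to complete an analogy like "he:brother::she:?" with word embeddings is simple vector arithmetic: take the point for "brother", subtract "he", add "she", and look for the nearest remaining word. The sketch below assumes that approach. The text above does not spell out the exact procedure, and the coordinates and the function name `complete_analogy` are made up for illustration.

```python
import math

# Hypothetical toy vectors (not learned from data): the first coordinate
# loosely stands for gender, the second for the concept being described.
VECTORS = {
    "he": (1.0, 0.0),
    "she": (-1.0, 0.0),
    "brother": (1.0, 1.0),
    "sister": (-1.0, 1.0),
    "man": (1.0, 2.0),
    "woman": (-1.0, 2.0),
}

def complete_analogy(a, b, c):
    """Solve a:b::c:? by finding the word nearest to b - a + c."""
    target = tuple(
        vb - va + vc
        for va, vb, vc in zip(VECTORS[a], VECTORS[b], VECTORS[c])
    )
    # Exclude the three query words so the answer is a new word.
    candidates = (w for w in VECTORS if w not in (a, b, c))
    return min(candidates, key=lambda w: math.dist(target, VECTORS[w]))

print(complete_analogy("he", "brother", "she"))
```

The same arithmetic that returns sensible answers will also return stereotyped ones if the learned points encode a stereotype, because the procedure has no notion of which relationships are appropriate.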
The fact that the machine learning system starts out like a newborn baby is not just the strength that lets it learn interesting patterns, but also the weakness that makes it fall prey to these obvious gender stereotypes. The algorithm makes its decisions based on which words appear near each other frequently. If the source documents reflect gender bias – if they more often have the word "Father" near the word "he" than near "she," and the word "Mother" more commonly near "she" than "he" – then the algorithm learns those biases too.
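The "appear near each other" idea can be sketched as a plain co-occurrence count. The four sentences below are invented for this example, and the window size and function names are assumptions. A real embedding algorithm does far more than count, but counts like these are the raw signal it learns from.

```python
from collections import Counter

# Invented mini-corpus, for illustration only.
CORPUS = [
    "he is a father",
    "he was a proud father",
    "she is a mother",
    "she called her father",
]

def cooccurrence_counts(sentences, window=4):
    """Count unordered word pairs that occur within `window` words of each other."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            for neighbor in tokens[i + 1 : i + 1 + window]:
                counts[tuple(sorted((word, neighbor)))] += 1
    return counts

def pair_count(counts, a, b):
    """Look up how often two words appeared near each other."""
    return counts[tuple(sorted((a, b)))]

counts = cooccurrence_counts(CORPUS)
print(pair_count(counts, "father", "he"), pair_count(counts, "father", "she"))
```

In this tiny corpus "father" appears near "he" more often than near "she", so any method that learns from these counts will tie "father" more closely to "he".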