Andreas Weigend | Social Data Revolution | Fall 2016
School of Information | University of California at Berkeley | INFO 290A

Video: Social Data Revolution: Topic 9
Transcript: sdr2016topic09.docx
Audio: sdr2016topic09.mp3

Topic 9: People for the Data

Our guest speaker for this session was Dr. Qing Wu, a senior economist at Google.
He holds a PhD in Operations Research and Management Science from Stanford.

Since the course is about how to have data for the people and how to bring data to the people, it is worth reversing the expression and asking: who can be the “people for the data”? In other words, who can make data available for the benefit of society?

The answer to that question is a term whose popularity keeps growing, as the Google Trends screenshot below shows: the data scientist.


How can we define “data scientist”, a term that seems to encompass so many things?

Qing Wu notes that “data scientist” is a fairly recent term. The word “scientist” underlines that, with far more data available now, there is a real need to understand how to deal with it, and that working with data is seen as a scientific discipline.
There is much debate about the skills a data scientist should have: statistics, computer science, economics, etc., with a strong focus on machine learning.
Data science embraces all of these fields, despite their diversity and differences.
A metaphor for data science: “old wine in new bottles”.

For Andreas, a major requirement when dealing with data is curiosity. Data scientists should be very curious people who want to explore unfamiliar methods or datasets in order to find meaningful patterns. Engagement with the topic matters; it is not just about applying formulas from books.

How can data be used to provide both new products or features and new business insights?

There seem to be two distinct communities in the field of data science [1]:
- business analysts: marketing, sales, making data-driven decisions
- engineers: building new features and new products

The “modern data scientist” is the most sought-after cross-disciplinary professional of this century: someone who combines the technical background of mathematicians and statisticians, fluency in programming languages, the communication and aesthetic skills of designers, and the curiosity and determination of someone motivated by their job.


What do you think we need to get people involved for the data? How do we use private data sources to meet public demand?

Qing Wu: What distinguishes a good data scientist is not only the ability to use machine learning models, but also creativity about these models and about the data fed into them. It is the capacity to work out which data sources might be useful for a task, and which transformations are most likely to reveal insightful patterns.

In the previous class, we introduced the notion of data literacy; data creativity is another skill that goes one step further.

Examples of data creativity:
- Using Google Earth satellite pictures to see how bright cities are, we can estimate how green and eco-friendly a city has become.
- New York Times: What ingredients make a person famous?
- Finding patterns in Wikipedia’s lists of famous people.

A “part analyst, part artist” job
Anjul Bhambhri, vice president of big data products at IBM, says, “A data scientist is somebody who is inquisitive, who can stare at data and spot trends. It’s almost like a Renaissance individual who really wants to learn and bring change to an organization.”

A good example of such a data “artist” may be Seth Stephens-Davidowitz, a New York Times columnist and former data scientist at Google.
Personal website with all his research slides available:
Seth Stephens-Davidowitz has helped pave the way for the use of Google searches and other Big Data sources to gain new insights into human psychology. Most of his research is published in the New York Times. He has used this data to study topics such as racism, self-induced abortion, pregnancy symptoms, depression, child abuse, and religious doubt. He has also used Google searches to estimate how many men are gay, to explore why we tell jokes, and (with Evan Soltas) to learn how politicians can successfully calm an angry mob.
Using Facebook “likes”, he has measured key stages of childhood development, and he has scraped Wikipedia to find out which cities produced the most superstars.

Andreas: Given a dataset, what are the insights? Given a dataset, what are the answers?

Qing Wu: Be careful about the bias of the dataset.

Andreas: What do you want to learn in a class that’s about data for the people?

Data ethics
A question raised during the discussion with Qing Wu was how a data scientist in a company should act regarding the data they handle. Do you always put the company’s interests first? Qing Wu points out that because a company wants to retain its customers for as long as possible, Google has no interest in using its users’ data for bad purposes: doing so would drive customers away and force the company out of business.
According to the discussion, Google uses personal data for no more than geolocation purposes.

Data science and the fear of unfairness

An ill-designed model could exacerbate inequalities instead of bringing more equality. This is the claim Cathy O’Neil makes in her book Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy.
In this widely discussed book, Cathy O’Neil explains that in the age of algorithms, where a rising number of our decisions are made by mathematical models, the result is not greater fairness in society but the opposite. She describes the opacity of many models which, used as “black boxes”, could reinforce discrimination instead of fighting it. Her example is a personalized lending model with unfair parameters, which could arbitrarily deny students a loan they need for their studies. She calls for more responsibility and caution in our use of algorithms.
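One simple way to start looking inside such a “black box” is to compare how often a model’s decisions favor different groups. The sketch below is illustrative only and is not taken from O’Neil’s book: the loan decisions, the group labels, and the helper names `approval_rates` and `disparity_ratio` are all hypothetical.

```python
from collections import defaultdict


def approval_rates(decisions):
    """Return the share of approved applications for each group.

    `decisions` is an iterable of (group, approved) pairs,
    where `approved` is True or False.
    """
    totals = defaultdict(int)
    approved_counts = defaultdict(int)
    for group, approved in decisions:
        totals[group] += 1
        if approved:
            approved_counts[group] += 1
    return {g: approved_counts[g] / totals[g] for g in totals}


def disparity_ratio(rates):
    """Ratio of the lowest to the highest approval rate (1.0 means equal rates)."""
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # nobody was approved in any group, so the rates are equal
    return min(rates.values()) / highest


# Hypothetical loan decisions, invented for illustration.
sample = [
    ("group_a", True), ("group_a", True), ("group_a", True), ("group_a", False),
    ("group_b", True), ("group_b", False), ("group_b", False), ("group_b", False),
]

rates = approval_rates(sample)
ratio = disparity_ratio(rates)
print(rates, ratio)
```

A ratio far below 1.0 does not prove that a model is unfair, but it is a signal that the model and the data behind it deserve a closer look.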

Data-driven decisions: how to ensure fairness?
Fairness comes not only from careful tuning of the parameters of the models a data scientist builds, but also from the decisions made on the basis of those models’ results. Many decisions are possible. How can we make sure we take the right one? For example:
Should we encourage the top 1% to achieve even more, or should we instead encourage the bottom 1%?

This is why we, as citizens, should mobilize to get governments to set rules that make data truly for the people.

  • A common fear: the possibility of robots or machines taking over human society

Other useful references
  • Data Scientist was named “best job of the year” by Glassdoor in 2016
  • Data Scientist: The Sexiest Job of the 21st Century, in the Harvard Business Review (2012), by Thomas H. Davenport and D.J. Patil
  • Data Scientist VS Data Engineer, What’s the Difference? by Saeed Aghabozorgi from Big Data University
  • Predictably Irrational: The Hidden Forces That Shape Our Decisions, by Dan Ariely. This book is about the psychological patterns that lead to irrational behaviour, which any business analyst should keep in mind in their analyses

[1] Read about the interesting differences between Data Scientists and Data Engineers: //