Andreas Weigend | Social Data Revolution | Fall 2016
School of Information | University of California at Berkeley | INFO 290A

Video: Social Data Revolution: Topic 8
Transcript: sdr2016topic08.docx
Audio: sdr2016topic08.mp3

Contributors: Savanah Frisk, Zhitong Qiu, & Lukas Schwab

Topic 8: Predicting the Present

Introduction


In a presentation from Dr. Qing Wu, Lead Economist at Google, we learned about the application of different data tools in social science. Much of the field of data science is preoccupied with prediction and forecasting: methods that take data and generate guesses about the future state of the world. However, another useful practice is “Now Casting”, which takes aggregated data about the recent past and uses it to contextualize the present. Rather than making grand predictions about the future, we can use data and social science to “predict the present”.

Google has developed several tools for data and social scientists that leverage its extensive search term database. These tools, combined with other data sets and statistical techniques, can help frame current affairs.

Google Trends is a tool for finding trends in search history over time.
Google Correlate is a tool for combining search histories with other data (e.g. time series) to find and visualize correlations.



“Now Casting” with Economics


In 2010, Google Trends was used to track the economy and help predict the present. A paper by Hyunyoung Choi and Hal Varian of Google (Varian is also a professor emeritus at UC Berkeley) shows “how to use search engine data to forecast near-term values of economic indicators. Examples include automobile sales, unemployment claims, travel destination planning, and consumer confidence”.

Link to the paper: http://people.ischool.berkeley.edu/~hal/Papers/2011/ptp.pdf
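The general idea of nowcasting with search data can be sketched in a few lines. The example below is a minimal illustration, not the paper's actual model or data: it generates synthetic series (the `driver`, `search`, and `indicator` names and all coefficients are assumptions made up for the sketch), then compares a baseline that predicts this period's indicator from last period's value against a model that also uses the search index, which is available immediately.

```python
import random

def ols(X, y):
    """Least-squares coefficients via the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Build the augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

def mean_abs_error(X, y, beta):
    """Average absolute gap between fitted values and the true values."""
    preds = [sum(b * x for b, x in zip(beta, row)) for row in X]
    return sum(abs(p - yi) for p, yi in zip(preds, y)) / len(y)

# Synthetic data (an assumption for illustration, not real Google Trends data).
random.seed(0)
n = 120
driver = [random.gauss(0, 1) for _ in range(n)]        # unobserved demand
search = [d + random.gauss(0, 0.3) for d in driver]    # noisy search index
indicator = [0.0]                                      # official statistic
for t in range(1, n):
    indicator.append(0.5 * indicator[-1] + 2.0 * driver[t] + random.gauss(0, 0.3))

# Baseline uses only last period's value; the augmented model adds
# the current search index, which is known before the indicator is published.
rows_base = [[1.0, indicator[t - 1]] for t in range(1, n)]
rows_aug = [[1.0, indicator[t - 1], search[t]] for t in range(1, n)]
target = indicator[1:]

split = 80  # fit on the first 80 periods, evaluate on the rest
beta_base = ols(rows_base[:split], target[:split])
beta_aug = ols(rows_aug[:split], target[:split])
mae_base = mean_abs_error(rows_base[split:], target[split:], beta_base)
mae_aug = mean_abs_error(rows_aug[split:], target[split:], beta_aug)

print(f"baseline MAE:      {mae_base:.2f}")
print(f"with search index: {mae_aug:.2f}")
```

On this synthetic data the search index helps by construction, since it was generated from the same hidden driver as the indicator. With real data, whether the extra regressor improves the estimate is an empirical question.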

Unlike predicting the future, which is far less certain, predicting the present is manageable with Google Trends. Many economists and the Feds have been able to use this tool to double-check the data used in their research. This is similar to Google’s earlier health care project, Google Flu Trends, which estimated disease activity in a given area, but certain flaws led Google to shut that project down.

screenshot-beteling-THC-10012093.png



However, there have been promising results from estimating the level of unemployment in different places as well as the condition of the economy. This makes data mining easier, and the data is truly for the people, if they know how to analyze and interpret it.

Unemployment-predictor.png



Google keeps some internal APIs and tools for itself and does not release them to the public, which leads to debate about what the data is used for inside Google and how the company will use it.


Screen Shot 2016-09-22 at 10.06.32 PM.png





Combining Data Sources

Google Trends + Geo Location Data

o-HAPPIEST-STATES-facebook.jpg
Source: Huffington Post image displaying happiness index per state.

By evaluating which search terms correlate positively and negatively with the happiest states, Wu put together a general list of what the happiest states are searching for. Things like “Watercraft” (i.e. water sports), music, and dogs correlated very highly with happy states.


Because..duh!
xqkbwkexcl7udc5va7pn.jpg
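The ranking step described above, scoring each search term by how strongly its regional popularity tracks a regional index, can be sketched as follows. This is only an illustration of the technique: the scores, the placeholder names `term_a` to `term_c`, and the five unnamed regions are all invented, and the sketch does not reproduce Dr. Wu's data or Google Correlate's internals.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented happiness scores for five unnamed regions (illustration only).
happiness = [72, 68, 65, 61, 58]

# Invented relative search interest per region for three placeholder terms.
search_interest = {
    "term_a": [90, 80, 70, 55, 40],
    "term_b": [60, 62, 55, 50, 48],
    "term_c": [20, 35, 40, 60, 75],
}

# Rank terms from most positively to most negatively correlated.
ranked = sorted(
    ((pearson(values, happiness), term) for term, values in search_interest.items()),
    reverse=True,
)

for r, term in ranked:
    print(f"{term}: r = {r:+.2f}")
```

Terms at the top of such a list are searched more in regions with a high index, and terms at the bottom are searched more in regions with a low index. Neither end of the list says anything about why.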



Google Trends + Time Series Data


One meaningful application of Google’s tools is in combination with other data. This creates rich data that can be analyzed further to find patterns of public interest. By bringing in external data sources and comparing the trend data with ground truth data, you can draw out meaningful conclusions.


“Seasonal Depression” is a medical condition in which people with depression relapse during colder, cloudier times. People often generalize this concept and apply it to the public as a whole, but is this a valid assumption to make?


Dr. Wu examines this with an analysis of the search term “Depression Syndrome” combined with time series weather data. Correlating the searches with weather data for each month of the year gives a month-by-month analysis. The data was somewhat inconclusive. However, summer and winter months had higher correlation values than fall and spring, which seemed to loosely support a general “seasonal depression” trend.
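A season-by-season comparison of this kind can be sketched as below. This is a hedged illustration rather than Dr. Wu's analysis: the monthly records are synthetic, `cloud_cover` stands in for whatever weather measure is used, the coefficients are invented, and the month-to-season mapping assumes the Northern Hemisphere.

```python
import random
from math import sqrt, cos, pi

# Northern Hemisphere mapping from calendar month to season (an assumption).
SEASONS = {12: "winter", 1: "winter", 2: "winter",
           3: "spring", 4: "spring", 5: "spring",
           6: "summer", 7: "summer", 8: "summer",
           9: "fall", 10: "fall", 11: "fall"}

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Ten years of synthetic monthly records: (month, cloud_cover, search_interest).
random.seed(1)
records = []
for year in range(10):
    for month in range(1, 13):
        cloud_cover = 50 + 20 * cos(2 * pi * (month - 1) / 12) + random.gauss(0, 10)
        search_interest = 40 + 0.3 * cloud_cover + random.gauss(0, 5)
        records.append((month, cloud_cover, search_interest))

# Group the paired observations by season.
by_season = {}
for month, cloud_cover, search_interest in records:
    xs, ys = by_season.setdefault(SEASONS[month], ([], []))
    xs.append(cloud_cover)
    ys.append(search_interest)

# One correlation value per season.
season_corr = {season: pearson(xs, ys) for season, (xs, ys) in by_season.items()}

for season, r in season_corr.items():
    print(f"{season}: r = {r:+.2f}")
```

Each season here rests on only 30 observations, so differences between seasons can easily be noise. That is one reason results like these are best read as suggestive rather than conclusive.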


An interesting note is that when one searches “Depression Syndrome” on Google Trends alone, it is very hard to identify a trend. There is a basic wavelike structure, but the low points and high points vary quite a bit.
Screen Shot 2016-09-23 at 2.31.49 PM.png

This exemplifies the benefit of enriching your data with multiple data sources. In the social sciences, trends can often be hard to parse, so knowing when to examine further, and with which methods, is a learned art.




The LGBTQ++ Community and Data


Dr. Qing Wu also talks about using Google Survey, a quick way to collect data from a specific audience. “On the web, people answer questions in exchange for access to that content, an alternative to subscribing or upgrading. We infer the person’s gender, age, and geographic location based on their browsing history and IP address. On mobile, people answer questions in exchange for credits for books, music, and apps. They answer demographic questions up front. This not only means we can automatically build a representative sample of thousands of respondents, it also means you don’t have to ask those demographic questions yourself.”

Because these surveys are quick and Google has a lot of information about the person from their history and IP address, it is easier for people to analyse the results and draw quick conclusions.

Dr. Qing Wu used Google Survey to collect information about the size of the gay population and, in a short period of time, recorded a percentage far greater than currently available metrics (Facebook, census, etc.). The Google Survey reached about 150,000 people and is likely to be more representative (although obviously biased towards younger, more tech-savvy crowds) than other, more public methods of demographic identification.
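One standard way to reduce a skew towards younger respondents is post-stratification: reweight each age group's answers by that group's share of the whole population. The sketch below shows the arithmetic only. Every number in it is invented, the age brackets are arbitrary, and nothing here is claimed to be how Google Survey or Dr. Wu handled the data.

```python
def poststratify(sample, population_share):
    """Weight each group's 'yes' rate by its share of the population.

    sample: group -> (respondents, yes_answers)
    population_share: group -> fraction of the population (sums to 1)
    """
    return sum(
        population_share[group] * yes / respondents
        for group, (respondents, yes) in sample.items()
    )

# Invented numbers for illustration only: the sample over-represents
# the youngest group relative to its assumed population share.
population_share = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}
sample = {
    "18-34": (500, 50),
    "35-54": (300, 18),
    "55+": (200, 8),
}

total_respondents = sum(n for n, _ in sample.values())
raw_rate = sum(yes for _, yes in sample.values()) / total_respondents
weighted_rate = poststratify(sample, population_share)

print(f"raw rate:      {raw_rate:.3f}")
print(f"weighted rate: {weighted_rate:.3f}")
```

Because the youngest group answers "yes" most often in this made-up sample and is over-represented, the weighted rate comes out lower than the raw rate. Weighting only corrects for the variables you weight on, so a skew towards tech-savvy respondents within each age group would remain.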



Data for Social Science


Data is used not only for economics and health care; it also aids philosophers in the search for happiness, as many have started analyzing Google’s data. They were able to back up many of their claims with the data and give a good empirical estimate of the general consensus on happiness. Dacher Keltner, a professor at UC Berkeley, used the data to help him analyse the science of happiness. (He also teaches a class on Human Happiness: PSYCH 162 or Letters and Science C160V.)

Data sources such as Google Trends offer social scientists new and interesting ways to analyze current and past social interests, worries, likes, and dislikes. The possibilities are wide, and there is also a lot of low-hanging fruit (that is, studies that have been done before but can now be verified with data gathered “right from the source”). Google search data is more representative than data from traditional gathering methods, as it comes from all over the world and from people of all demographics.

There are some drawbacks for the social sciences as we swing towards a world where companies hold all the data. This was clear in today's presentations, as most of the analysis was easy to perform only because of Google’s internal API, something non-Google employees do not have access to. This raises a question for the field of social science as a whole: if the data is owned and controlled by tech companies, is it really open to the scientific community to explore to the degree necessary for useful insights?

Another implication of the upswing of data in social science concerns the mantra “Correlation does not equal causation”. The tools Google provides offer quick and easy comparisons, which might be a great first step in the scientific process. The worry is that people will stop at a Google Trends comparison and present that information as fact without conducting further research. There is still much work to be done to make sure data is accessible, useful, and informative. However, the partnership between data and social science is exciting and will likely lead to many new ways to understand the past and present.
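A small experiment shows why a quick trend comparison can mislead. In the sketch below, two synthetic series are generated independently of each other, except that both grow over time (the slopes and noise levels are arbitrary assumptions). Their raw values correlate strongly, yet their period-to-period changes do not.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def differences(series):
    """Period-to-period changes, which removes a shared linear trend."""
    return [b - a for a, b in zip(series, series[1:])]

# Two unrelated synthetic series that both happen to trend upward.
random.seed(2)
n = 200
series_a = [0.5 * t + random.gauss(0, 5) for t in range(n)]
series_b = [0.3 * t + random.gauss(0, 5) for t in range(n)]

corr_levels = pearson(series_a, series_b)
corr_changes = pearson(differences(series_a), differences(series_b))

print(f"correlation of raw values: {corr_levels:+.2f}")
print(f"correlation of changes:    {corr_changes:+.2f}")
```

The high correlation in the raw values comes entirely from the shared upward drift. Checking whether a relationship survives after removing trend (and, for search data, seasonality) is a cheap first test before reading anything into a side-by-side chart.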


Data and Critical Theory



Vast swaths of critical theory inherit their foundational underpinnings from the work of Michel Foucault. Foucault's work on power was characterized by his focus on the various methods by which power is exercised: one of his major conclusions was the definition of le savoir-pouvoir ("power-knowledge") on the observation that power relies on knowledge for its exercise (how is it most effectively exercised, and over whom?) and that it produces knowledge through its exercise (e.g. via population censuses which require power to enforce but generate specific insights about governed groups).


A frequent analogy in his work was that of the panopticon, introduced in the chapter "Panopticism" (Part Three, Chapter 3) of Discipline and Punish. The panopticon was a theoretical model of a perfect prison designed by Jeremy Bentham in the late 18th century. Because it is designed to emphasize the unwavering gaze of prison guards over prisoners (rather than to isolate and punish prisoners, as in a dungeon model), Foucault suggests that the panopticon is the archetype for disciplinary power in the modern era.

Panopticon.jpg


What does any of this have to do with social data?


Foucault died well before the development of modern data science, but critical theorists are now updating his work for modern dynamics. With the rise of huge data collection projects like those of the NSA and the persistent development of technologies to make sense of vast, pre-aggregated data sets on digital identities, the notion of the "gaze" ("regard") can be abstracted away from the literal act of looking. Knowledge is now generated and power exercised without any in-person interaction between the prisoner and guard in the panopticon; rather, Flyverbom suggests in "Disclosing and concealing: internet governance, information control and the management of visibility" that political subjects are known first and foremost through their digital doubles.


What does this mean? It is already changing how law enforcement functions, for example. Counterterrorism task forces can watch energy bills to identify apartments that are likely hideouts for terrorist cells, and DEA agents can use energy bills (or heat imaging) to find grow houses. Suspicious search histories might constitute sufficient probable cause to further investigate an individual.


Whether this constitutes a hopeful step towards more effective enforcement of national laws and norms or a treacherous move towards totalitarian surveillance is up to individual judgement. One thing is clear, though: big data is changing, at a very fundamental level, how we relate to governments in the modern era. If modernity ushered in an age of panoptic power, the right term for now might be "algorithmic."