the plan for the future
26 March 2016 - 25 May 2016
Lecturer: Scott Page
This document contains edited notes from the course Model Thinking by Professor Scott E. Page of the University of Michigan in the United States, which is available on Coursera.org. The course explains the basics of modelling that are essential for understanding human behaviour. According to Professor Page, you need models to make sense of the world and become a better thinker.
1. Why Model?
1.1. Why Model?
1.2. Intelligent citizens of the world
1.3. Thinking more clearly
1.4. Using and understanding data
1.5. Using models to decide, strategise and design
2. Segregation and Peer Effects
2.1. Sorting and peer effects introduction
2.2. Schelling's segregation model
2.3. Measuring segregation
2.4. Peer effects
2.5. The standing ovation model
2.6. The identification problem
3.2. Central limit theorem
3.3. Six sigma
3.4. Game of life
3.5. Cellular automata
3.6. Preference aggregation
4. Decision Models
4.1. Introduction to decision making
4.2. Multi-criterion decision making
4.3. Spatial choice models
4.4. Probability: the basics
4.5. Decision trees
4.6. Value of information
5. Thinking Electrons: Modelling People
5.1. Thinking electrons: modelling people
5.2. Rational actor models
5.3. Behavioural models
5.4. Rule based models
5.5. When does behaviour matter?
6. Categorical and Linear Models
6.1. Introduction to categorical, linear, and non-linear models
6.2. Categorical models
6.3. Linear models
6.4. Fitting lines to data
6.5. Reading regression output
6.6. From linear to nonlinear
6.7. The big coefficient versus the new reality
7. Tipping Points
7.1. Tipping points
7.2. Percolation models
7.3. Contagion models 1: diffusion
7.4. Contagion models 2: SIS model
7.5. Classifying tipping points
7.6. Measuring tips
8. Economic Growth
8.1. Introduction to growth
8.2. Exponential growth
8.3. Basic growth model
8.4. Solow growth model
8.5. Will China continue to grow?
8.6. Why do some countries not grow?
9. Diversity and innovation
9.1. Problem solving and innovation
9.2. Perspectives and innovation
9.4. Teams and problem solving
10. Markov Processes
10.1. Markov models
10.2. A simple Markov model
10.3. Markov model of democratisation
10.4. Markov convergence theorem
10.5. Exaptation of the Markov model
11. Lyapunov Functions
11.1. Lyapunov functions
11.2. The organisation of cities
11.3. Exchange economies and externalities
11.4. Time to convergence and optimality
11.5. Lyapunov: fun and deep
11.6. Lyapunov or Markov
12. Coordination and Culture
12.1. Coordination and culture
12.2. What is culture and why do we care?
12.3. Pure coordination game
12.4. Emergence of culture
12.5. Coordination and consistency
13. Path Dependence
13.1. Path dependence
13.2. Urn models
13.3. Mathematics on urn models
13.4. Path dependence and chaos
13.5. Path dependence and increasing returns
13.6. Path dependent or tipping point
14.2. The structure of networks
14.3. The logic of network formation
14.4. Network function
15. Randomness and Random Walks
15.1. Randomness and random walk models
15.2. Sources of randomness
15.3. Skill and luck
15.4. Random walks
15.5. Random walks and Wall Street
15.6. Finite memory random walks
16. Colonel Blotto
16.1. Colonel Blotto game
16.2. Blotto: no best strategy
16.3. Applications of Colonel Blotto
16.4. Blotto: troop advantages
16.5. Blotto and competition
17. Prisoners' Dilemma and Collective Action
17.1. Intro: the prisoners' dilemma and collective action
17.2. The prisoners' dilemma game
17.3. Seven ways to cooperation
17.4. Collective action and common resource pool problems
17.5. No panacea
18. Mechanism Design
18.1. Mechanism design
18.2. Hidden action and hidden information
18.4. Public projects
19. Replicator Dynamics
19.1. Replicator dynamics
19.2. The replicator equation
19.3. Fisher's theorem
19.4. Variation or six sigma
20. Prediction and the Many Model Thinker
20.2. Linear models
20.3. Diversity prediction theorem
20.4. The many model thinker
- (1) in order to be an intelligent citizen of the world, you have to understand models;
- (2) models make us clearer thinkers, and people who use models are better decision makers;
- (3) models help us to use and understand data;
- (4) models help us to decide, strategise and design.
There are many types of models as you can see in the picture. The course focuses on (1) the models themselves, (2) technical details and (3) applications in other fields (fertility).
Models are simplifications or abstractions. Models are wrong to some extent but they are useful. Models are a new language in academia, business and politics. People use models to be better at achieving their objectives. The Great Books Movement aimed at making a list of ideas people should be familiar with. One idea is to tie yourself to the mast, which comes from the Odyssey where Odysseus was tied to the mast so he could hear the song of the sirens. This idea reoccurred in history. Cortés burned his ships, so that his men would not retreat.
Models tie us to a mast of logic so that we are not carried away by our thoughts and can figure out which ideas are useful to us under what circumstances. Scientific disciplines like economics, sociology, political science, linguistics and biology use models. Some disciplines, like game theory, entirely rely on models.
The reason to use models is that models are better. A graph taken from the research of Professor Tetlock shows how accurately people predict (calibration) on the horizontal axis and how discriminating their predictions are on the vertical axis. Hedgehogs are people that use a single model. They are not good at predicting. Foxes have lots of loose models in their head. They do better at calibration. People that use formal models are far better than the rest.
Smart people use models but they also use personal judgement. Models are not only better, but they are also fertile. This means that they can be used in other domains. Models make us humble. After using a model, the picture of the situation can change dramatically. For example, people who used a simple linear model of house prices did not see the housing crash of 2008 coming. People who had models that predicted a crash could have made a lot of money. Multiple models are better and formal models are better. The only people that are better than random in predicting use multiple models.
- (1) name the parts. For example, if you want to figure out which people go to which restaurant, then you need to identify the people as well as the preferences and the budgets of these people. You also need to identify the restaurants as well as their menus and the price of those menus.
- (2) identify the relationships between the parts. For example, in a game theory model you can identify relationships between decisions of player 1, subsequent decisions of player 2 and the payoffs for the players.
- (3) work through the logic. For example, suppose you want to calculate the length of a rope that you want to tie around the Earth at one metre above the surface. Assume the Earth's circumference to be 40,000 kilometres. The formula for circumference C is: C = πD, where D is the diameter. Raising the rope by one metre adds two metres to the diameter, so C = π(D_Earth + 2 m) = πD_Earth + (π × 2 m) ≈ 40,000 km + 6.28 m.
- (4) inductively explore. For example, if people are often jammed near the exit of a room, you could explore the effects of putting a post before the exit to prevent people from pushing each other.
Models have different classes of outcomes, which are (1) equilibrium, (2) cycle, (3) random, and (4) complex. Models help us to predict which of these outcomes will materialise. For example, the demand for oil and the supply of oil may tend to slope up in a fairly predictable manner. The price of oil depends on all kinds of things, such as reserves, people in markets, politics, and so on, so the price of oil might be complex and hard to predict, but not random.
- (5) identify logical boundaries. Statements like "two heads are better than one" and "too many cooks spoil the broth" contradict each other unless you identify their boundaries. Models enable us to find out the conditions under which one thing holds and when it doesn't.
- (6) communicate. For instance, in politics you can make estimates of how liberal or conservative the candidate and the voter are, and place them in a model to see which candidate most closely matches the voter's preferences, so that you can explain why you think that the voter will opt for candidate B and not for candidate A.
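The rope calculation in step (3) can be checked numerically. The following is a minimal sketch; the function names are chosen here for illustration and are not part of the course. It shows that the extra length depends only on the height of the rope, not on the size of the Earth.

```python
import math


def extra_rope_length(height_m: float) -> float:
    """Extra circumference gained when a circle's radius grows by height_m."""
    # C = pi * D, and raising the rope by h adds 2h to the diameter.
    return math.pi * 2 * height_m


def rope_length(circumference_m: float, height_m: float) -> float:
    """Length of a rope held height_m above a sphere of the given circumference."""
    diameter_m = circumference_m / math.pi
    return math.pi * (diameter_m + 2 * height_m)


earth_circumference_m = 40_000_000  # 40,000 km, as assumed in the notes
extra = rope_length(earth_circumference_m, 1.0) - earth_circumference_m
print(round(extra, 2))                   # extra rope needed around the Earth
print(round(extra_rope_length(1.0), 2))  # same value for a circle of any size
```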
An important application of models is using and understanding data, and this is done in the following ways:
- (1) understand patterns. For example, fluctuations in GDP growth can be explained by a business cycle model.
- (2) predict points. For example, if the price of a house in a neighbourhood is a linear function of the number of square metres, and you know the number of square metres, then you could use the function to predict the price of the house via the point value on the function graph.
- (3) produce bounds. For example, if you use models to estimate inflation 10 years from now, there is too much uncertainty to produce an exact number, so a model will probably produce a range with a lower bound and an upper bound.
- (4) retrodict. We can use models with the data to predict the past. If you don't have data of the past, you can use models to guess the data of the past. If you have the data, you can test models and check how good they are. For example, you have the economic data from 1950 to the present, and you have a model that predicts the unemployment rate based on the economic data of previous years, so you can use the data from 1950 to 1970 in the model to predict the unemployment in 1972, and then check whether or not the prediction is close to the real unemployment figure of 1972.
- (5) predict other things. For example, you may have made a model that predicts the unemployment rate, but as a side benefit the model might also predict the inflation rate. Another example is that early models of the solar system and gravity showed that there must be an unknown planet, which turned out to be Neptune.
- (6) informed data collection. For example, if you want to improve education, and make a model that predicts school results, you will have to name the parts, such as teacher quality, education level of parents, the amount of money spent on the school, and class size. The model then determines which data should be collected.
- (7) estimate hidden parameters. We can use data to tell us more about the model and then use the model to tell us more about the world. For example, a well known model for the spread of diseases is the SIR model. SIR stands for Susceptible, Infected, Recovered. If you can see from the data how many people are getting the disease, you can predict how the disease will spread over time.
- (8) calibrate. After constructing a model, you can use data to calibrate it as close as possible to the real world.
- (1) decision aides;
- (2) comparative statics;
- (3) counterfactuals;
- (4) identify and rank levers;
- (5) experimental design;
- (6) institutional design;
- (7) to help choose among policies and institutions.
The first reason is to make better decisions. For example, you can build a model of financial institutions like Bear Stearns, AIG, CitiGroup and Morgan Stanley that represents the relationships between these companies in terms of how their economic success depends on one another.
Now imagine the federal government is faced with a financial crisis and some of these companies are starting to fail. The government has to decide whether or not to save these companies. We can use this very simple model to help make that decision. The numbers represent how correlated the success of one company is with that of another, in particular how correlated their failures are.
So, for example, if AIG fails then how likely is it that JP Morgan fails? This number 466 is big. The number 94 represents the link between Wells Fargo and Lehman Brothers. If Lehman Brothers fails, this only has a small effect on Wells Fargo and vice versa. Lehman Brothers only has three lines going in and out and the numbers associated with these lines are relatively small. For the government this can be a reason not to save Lehman Brothers. AIG has much larger numbers associated with it, which can be a reason to save AIG, because a failure of AIG may cause the whole system to fail.
The Monty Hall problem is named after Monty Hall, who was the host of a game show called Let's Make a Deal that aired during the 1970s. The problem is that there are three doors. Behind one of these doors is a large prize, but behind the other two doors there is nothing. Now you must pick one door, for example door 1. Monty knows where the prize is. He will open one of the doors without a prize but it will never be the door that you chose, for example he may open door 3.
Then Monty asks whether you want to switch to door number 2. This can be formalised in a simple decision tree model. There are three doors. If you pick door number 1, the probability that you are right is 1/3 and the probability that you are wrong is 2/3. After you pick door 1, the prize can't be moved. Monty can always open door 2 or door 3 without revealing the prize, so nothing happens to your probability of 1/3 of being right. The remaining probability of 2/3 now rests on door 2 alone, hence you should switch to door 2. Drawing circles and writing probabilities allows us to see the correct action.
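The decision tree for the Monty Hall problem can be written out as a small enumeration. This is a sketch under one assumption that the notes do not state: when Monty can choose between two empty doors, he picks either with equal probability. The function name is hypothetical.

```python
from fractions import Fraction


def win_probability(switch: bool) -> Fraction:
    """Exact probability of winning when you first pick door 1."""
    doors = [1, 2, 3]
    pick = 1
    total = Fraction(0)
    for prize in doors:  # the prize is behind each door with probability 1/3
        # Monty never opens your door and never opens the prize door.
        openable = [d for d in doors if d != pick and d != prize]
        for opened in openable:
            # Assumption: Monty chooses uniformly among the doors he may open.
            branch = Fraction(1, 3) * Fraction(1, len(openable))
            final = pick
            if switch:
                final = next(d for d in doors if d not in (pick, opened))
            if final == prize:
                total += branch
    return total


print(win_probability(switch=False))  # probability of winning if you stay
print(win_probability(switch=True))   # probability of winning if you switch
```

Using exact fractions rather than a random simulation keeps the result identical to the probabilities written on the branches of the tree.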
You can run models of the economy and look at the unemployment rate with and without the recovery plan. It doesn't mean that what a model shows would really have happened without the recovery plan, but at least the model provides some understanding of the effect of the recovery plan. Counterfactuals are not exact, but still helpful in figuring out whether a policy was good or not.
Reason number four is to identify and rank levers. A simple model of contagion of failure shows what will happen over time if Great Britain fails. Initially after Great Britain fails, we see the Netherlands, Switzerland, Ireland and Belgium fail, and after that we see Germany and Sweden fail, and after that we see France fail. This model tells us that in terms of its effect on the world financial system, London is a big lever.
One of the big things in climate change is the carbon cycle. It is one of the models that is used all the time. The total amount of carbon is fixed. It can be up in the air or down on the earth. If it is down on the earth then it is better because it doesn't contribute to global warming. If you think about intervening, you may ask where in this cycle are there big levers? For example, surface radiation is a big number. If you think about where you want to have a policy, you want to think about it in terms of where those numbers are large.
Reason number six is institutional design. The graph shows a Mount-Reiter diagram, which is named after Stan Reiter and Ken Mount. The box with Θ represents the environment, for example the set of technologies or people's preferences. X represents the desired outcomes. We want to use our technologies, labour and whatever we have at our disposal to create good outcomes.
The model assumes that we could decide collectively what kind of outcomes we'd like to have based on the environment. The f(θ) is called a social choice correspondence function. We don't get the ideal outcome because to get those outcomes, you have to use mechanisms M. A mechanism might be something like a market, a political institution, or a bureaucracy. The closer the outcome from the mechanism is to the ideal outcome, the better the mechanism is.
Suppose we allocate classes by a market and students have to bid for classes. Would that be a good thing or a bad thing? Currently there is a hierarchy. Seniors register first, then juniors, then sophomores and then freshmen. Markets may not work well because you need to graduate. Seniors need specific courses and that's why we let seniors register first, and if people could bid for courses then those who have a lot of money might bid away the courses from seniors and people might never graduate. The way to figure that out is by using models.
Reason seven is to help choose among policies and institutions. Suppose the choice is between a market for pollution permits and a carbon trading system. We can make a simple model and tell which one is going to work better. Suppose a city has to decide about creating more green spaces. Green spaces might seem a good thing, but if the city makes a green area then people could move next to it and build houses all around it, which leads to more sprawl. So what may seem a good idea may not be so good if you construct a model to think it through.
Both sorting and peer effects create groups of similar people that hang out with each other. Models can be used to understand how those processes work. Even though the phenomena seem obvious, they can be modelled, and these models can provide some unexpected insights.
Some interesting models with regard to sorting and peer effects are Schelling's segregation model that explains how segregation can happen, Granovetter's model that revolves around the willingness to participate in some collective behaviour, and the standing ovation model that deals with peer effects. Finally, it is important to distinguish between sorting and peer effects. This is the identification problem.
There are different types of models. Equation based models explain phenomena through equations. Agent based models work with individuals that could be people, countries or organisations. The agents have behaviours and rules that they follow. Sometimes they follow optimising rules. This is the subject of game theory and rational choice models. The actions of the agents then generate outcomes. These outcomes can be surprising and that is why modelling can be useful.
Thomas Schelling was interested in racial segregation and segregation by income. If you look at the map of New York, the red dots represent wealthy people, the blue dots represent poor people and the moderately blue dots represent people from the middle class. It shows that apart from racial segregation, there is also a stark income segregation. Schelling made an agent based model with people following rules (behaviours) and aggregated them.
In this case 3 out of 7 neighbours are rich like the person living at X. Schelling made a model in which he assumed that people had a threshold of similar people living in their neighbourhood that caused them to stay. If the individual living at X has a threshold of 33% rich people living in the block to stay, he or she would stay if 3 out of 7 are rich. But if one of the rich neighbours moves out, and is replaced by a poor person, the person living at X will move.
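The stay-or-move rule of a single agent can be sketched as follows. The function name is hypothetical, and treating a share exactly equal to the threshold as "stay" is an assumption, since the notes do not say what happens at equality.

```python
def stays(similar_neighbours: int, total_neighbours: int, threshold: float) -> bool:
    """Return True if the share of similar neighbours meets the threshold."""
    share = similar_neighbours / total_neighbours
    # Assumption: meeting the threshold exactly is enough to stay.
    return share >= threshold


print(stays(3, 7, 0.33))  # 3 of 7 rich neighbours: about 43% similar
print(stays(2, 7, 0.33))  # one rich neighbour replaced: about 29% similar
```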
In the initial situation, the average is 50% similar and only 16% is unhappy because their threshold of 30% is not met. 50% similar makes sense because the population is 50% poor and 50% rich. If you run the simulation based on these rules and the initial situation, it will end in equilibrium where 72% is similar and 0% is unhappy.
The insight coming from Schelling's model is that what happens on the macro level, for instance segregation, may not be the intention of the individuals at the micro level, as they may be fairly tolerant people requiring only a minor percentage of similar people living in their neighbourhood.
If you make the people slightly less tolerant, and set the threshold to 40%, 29% of the people are unhappy in the initial situation. This simulation ends in equilibrium where there is 80% similarity. If we reduce tolerance somewhat more, and set the threshold to 52%, 57% of the people are unhappy with the initial situation, and similarity ends in equilibrium at 94%, which is an extreme segregation. And a threshold of 52% isn't even intolerant. If we make people intolerant, and set the threshold to 80%, the model doesn't result in equilibrium at a high level of segregation, but in a constant movement of people.
The model suggests that what happens at the macro level may not reflect the intentions of individuals at the micro level. There is a tipping phenomenon in Schelling's model. For example, if one person moves, this may cause other persons to move. There is an exodus tip, which means that a person will leave if a similar person in the neighbourhood leaves. There is also a genesis tip, which means that a person will leave when a different person moves in. All big cities where people of different ethnicities live are strongly segregated by race, even though most people may want to live in mixed neighbourhoods.
A simple measure of segregation is the index of dissimilarity. Assume that there are rich people and poor people. Assume further that there are 24 blocks with 10 people living in each block. Assume that 12 blocks consist of only rich people, 6 blocks consist of only poor people and that 6 blocks are mixed with 50% poor people and 50% rich people.
The total number of rich people is 12*10 + 6*5 = 150 and the total number of poor people is 6*10 + 6*5 = 90. Suppose rich people are shown in blue and poor people in yellow. If b is the number of blue in a block, B is the total number of blue (150), y is the number of yellow in a block, and Y is the total number of yellow (90), then the difference between b/B and y/Y, or |b/B - y/Y|, tells us something about how distorted the distribution within a block is.
Using this measure, a perfectly mixed block in this example would have |b/B - y/Y| = | (5/150) - (3/90) | = 0. For the blue (all rich) blocks: |b/B - y/Y| = | (10/150) - (0/90) | = 1/15. For the yellow (all poor) blocks: |b/B - y/Y| = | (0/150) - (10/90) | = 1/9. For the green (mixed) blocks: |b/B - y/Y| = | (5/150) - (5/90) | = 1/45.
The index of dissimilarity is: (Σ|b/B - y/Y|)/2. The sum of the individual differences is divided by 2 in order to make the index range between 0 and 1. If the index is 0 then the group of blocks is perfectly mixed. If the index is 1 then the group of blocks is perfectly segregated. For the example above, you have ( 12*(1/15) + 6*(1/9) + 6*(1/45) ) / 2 = 72/90 = 0.8, which is strongly segregated. The racial segregation in Philadelphia is 0.8, and in Detroit it is 0.6.
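The index of dissimilarity for the 24-block example can be recomputed with a few lines of code. This is a minimal sketch; the function name and the representation of a block as a (rich, poor) pair are choices made here for illustration.

```python
from fractions import Fraction


def dissimilarity_index(blocks):
    """Index of dissimilarity for a list of (rich, poor) counts per block."""
    total_rich = sum(rich for rich, _ in blocks)
    total_poor = sum(poor for _, poor in blocks)
    # Sum |b/B - y/Y| over all blocks, then halve so the index lies in [0, 1].
    differences = sum(
        abs(Fraction(rich, total_rich) - Fraction(poor, total_poor))
        for rich, poor in blocks
    )
    return differences / 2


# 12 all-rich blocks, 6 all-poor blocks and 6 mixed blocks of 10 people each.
example = [(10, 0)] * 12 + [(0, 10)] * 6 + [(5, 5)] * 6
print(float(dissimilarity_index(example)))
```

Exact fractions are used so that the result can be compared directly with the hand calculation of 72/90.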
With contagious phenomena like peer effects it seems that the tail is wagging the dog and that people at the end of the distribution, the extremists, determine what is going to happen, for example with uprisings such as the fall of the Berlin Wall and the Orange Revolution. In those cases it is difficult to predict what will happen. Granovetter's model explains why it is very hard to predict such events.
Granovetter's model has the following elements. There are N individuals. Each individual j has a threshold for participating Tj and will join if Tj others join. If your threshold is 0, you go out and join anyway. If your threshold is 50, you join if you see 50 people participating. The outcome varies depending on the distribution of thresholds.
However, if the thresholds had been 1, 1, 1, 2, 2, then nobody would have bought a hat despite the group on average being open to such an idea. On the other hand, if the thresholds had been 0, 1, 2, 3, 4, 5, then everyone would have bought the hat after 5 turns. In this case everybody eventually bought a purple hat, even though the group on average wasn't keen on doing this. In this case the extremists determine the outcome and so the tail is able to wag the dog.
From this we can conclude that collective action is more likely going to happen with: (1) lower thresholds and (2) more variation in thresholds. The latter is rather surprising. That is why it is very difficult to predict whether or not something like an uprising is going to happen. Not only do you need to know the average level of discontent, but you also need to know the distribution of discontent, and how people are connected.
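The threshold rule can be simulated directly. The sketch below uses a hypothetical function name and assumes that people decide in rounds, each time looking at how many others have already joined. It reproduces the two threshold lists from the hat example.

```python
def final_participants(thresholds):
    """Number of people who end up joining, given each person's threshold."""
    joined = [False] * len(thresholds)
    while True:
        count = sum(joined)  # people who have joined so far
        # Someone who has not joined yet joins once enough others have.
        newcomers = [
            i for i, t in enumerate(thresholds) if not joined[i] and t <= count
        ]
        if not newcomers:
            return count
        for i in newcomers:
            joined[i] = True


print(final_participants([1, 1, 1, 2, 2]))     # low average, but nobody starts
print(final_participants([0, 1, 2, 3, 4, 5]))  # higher average, full cascade
```

The first group has the lower average threshold, yet only the second group cascades, which illustrates why the variation in thresholds matters as much as the average.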
The standing ovation model helps us to think of participation and peer effects in a more subtle way. With standing ovations, you have to decide quickly whether or not you will stand up, and after the standing ovation starts, you may have to decide again whether to follow the people that are standing up or to keep on sitting. There are different models for human behaviour. One of them is that humans are rational and optimising.
With standing ovations, people don't have time for that, and they probably follow rules that can be aggregated. A standing ovation can be seen as a peer effect. It can also be information. For example, if a person that appears to be more informed or sophisticated stands up, you may stand up too, because the standing up of this person tells you something about how good the show is.
The standing ovation model has the following elements. First there is a threshold T that is related to the quality of the show Q. For example, the quality Q may range from 0 to 100, and if the quality is above the threshold T of 70, you stand up. To make it more sophisticated, people react to a signal S, which is based on Q and some error, noise or diversity E, so S = Q + E. The initial rule is: if S > T then stand up. The subsequent rule is: stand if more than X% stands.
(4) If the quality Q is below the threshold T (Q < T), then more variation in E will cause more people to stand. The value of E can express error but also diversity. People may interpret the performance differently. The value of E can be high if the audience is diverse or unsophisticated or if the performance is complicated. Let's do an example.
Assume there are 1,000 people, T = 60 and Q = 50. Because Q < T, people are not going to stand. Assume now that E in [-15,+15], so for each individual S in [35,65]. In this case only a small number of people stand up, unless X is very small. Now increase E so that E in [-50,+50]. In this case 40% of the people stand up initially. In this case, unless X is above 40%, you will get a standing ovation.
If you are part of a group, you are more likely to stand up if someone in the group stands up. This increases the percentage of people standing up, so that there is a greater chance of a standing ovation.
This model can be useful in other situations. This is called fertility. The standing ovation model can be used for collective action problems. You can then start to think about who are celebrities and which people belong to groups. You can also use this model to improve academic performance, for urban renewal, for getting people to go to the gym or improve their health, and for getting people to follow an online course.
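The share of people who stand up initially can be computed from the rule S = Q + E with S > T. The sketch below assumes that E is spread uniformly over its range, which the notes do not state explicitly, and the function name is hypothetical.

```python
def initial_standing_fraction(quality: float, threshold: float, noise: float) -> float:
    """Share of people with S = Q + E above T, for E uniform on [-noise, +noise]."""
    if noise == 0:
        return 1.0 if quality > threshold else 0.0
    # People stand when E > T - Q; under the uniform assumption that share is
    # the part of the interval [-noise, +noise] lying above T - Q.
    share = (quality + noise - threshold) / (2 * noise)
    return min(1.0, max(0.0, share))


print(initial_standing_fraction(50, 60, 15))  # narrow spread of signals
print(initial_standing_fraction(50, 60, 50))  # wide spread of signals
```

With a different noise distribution, for example a bell-shaped one, the exact shares would change, but more variation would still put more people above the threshold when Q < T.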
The identification problem is how to tell whether people hang out with each other because of sorting or homophily (Schelling) or because of peer effects (standing ovation). In some cases it is easy to figure out. Segregation by race is caused by sorting, because people cannot change their race to match their peers. Shared language habits are caused by peer effects, because people don't move because of language preferences.
Other situations are not so clear. People that are happy hang out with each other as do people that are unhappy. Sorting as well as peer effects may have caused this. Often you can't tell whether it is sorting or peer effect because the expected outcome is the same. So, you can't figure it out by evaluating the outcome. Sorting can be identified because it involves the movement of people. You need dynamic data to figure that out.
The book The Big Sort: Why the Clustering of Like-Minded America is Tearing Us Apart, written by Bill Bishop, discusses sorting. For example, in politics more and more areas became uncontested Republican or Democratic over time. Bishop argues that this happened because of sorting. The book Connected: The Surprising Power of Our Social Networks and How They Shape Our Lives, written by Nicholas A. Christakis and James H. Fowler, discusses peer effects. Whether like-mindedness is caused by sorting or by peer effects can sometimes be established by migration data.
Aggregation can lead to unexpected results. For example, one water molecule can't be wet. Wetness is a property of multiple water molecules. Similarly, a single brain neuron can't explain consciousness, cognition and personal character traits. Another example is Schelling's model. It indicates that, even when people are fairly tolerant, this could lead to segregation. This phenomenon is sometimes referred to as more is different. You can't do reductionist science to understand the whole.
- (1) actions: how actions add up, for example using the central limit theorem;
- (2) single rule: how to use a single rule for aggregation, for example in the game of life;
- (3) family of rules: how to aggregate using one-dimensional cellular automata models;
- (4) preferences: how to add up preferences, for example to make collective choices.
We model for the following reasons:
- (1) predict points;
- (2) understand data, for example using a bell-shaped curve;
- (3) understand patterns, for example the glider pattern in the game of life;
- (4) understand class of outcome, for example using one-dimensional cellular automata models;
- (5) work through logic, for example to see the difficulties that arise with the aggregation of preferences.
A probability distribution tells us what different things could happen and what the likelihood of each of those different things is. The central limit theorem states that if we add up a series of independent events, the distribution will have a bell-shaped curve. The most likely outcome is in the middle and it is also the mean μ. There is a lot of structure in what happens, which can be used to make predictions.
To get an understanding of where these distributions come from, you can flip coins several times. For example, if you flip a coin 2 times, how many heads can there be, and what are the odds? The options are TT, TH, HT, HH (H is heads, T is tails), so the answer is 0: 25%, 1: 50%, 2: 25%. When you flip a coin 4 times, the answer is 0: 1/16, 1: 1/4, 2: 3/8, 3: 1/4, 4: 1/16. If you make graphs of these probability distributions, they look like bell curves.
If you flip a coin N times, the expected number of heads is N/2. It is at the centre of the bell curve. If the probability is not 50/50 like flipping coins, the mean is μ = pN, where p is the probability of the event happening and N is the number of occasions. This is called a binomial distribution. For example, if there are 100 people, and the probability of someone wearing a hat is 0.15, the expected number of people wearing a hat is 0.15 * 100 = 15.
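The coin-flip probabilities can be generated for any number of flips by counting the sequences with a given number of heads. A minimal sketch with a hypothetical function name:

```python
from fractions import Fraction
from math import comb


def heads_distribution(flips: int):
    """Probability of 0, 1, ..., flips heads when flipping a fair coin."""
    # comb(flips, k) sequences contain exactly k heads, out of 2**flips in total.
    return [Fraction(comb(flips, k), 2 ** flips) for k in range(flips + 1)]


for k, probability in enumerate(heads_distribution(4)):
    print(k, probability)
```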
In a simple binomial distribution where p = 1/2 like flipping coins, σ = (√N)/2, so if N = 100, the mean is 100/2 = 50, and the standard deviation is √100/2 = 5. In this case 68% of values are between 45 and 55, 95% are between 40 and 60, and 99.7% are between 35 and 65.
In general for binomial distributions σ = √(p(1-p)N), so if p = 1/2 then σ = √(p(1-p)N) = (1/2)√N. For example, a Boeing 747 has 380 seats. There is a 90% show up rate, and assume that people show up independently. There are 400 tickets sold, so N = 400 and p = 0.9. Airlines overbook, so what matters to the airline is the probability of someone not getting a seat. The mean is μ = 0.9 * 400 = 360. The standard deviation is σ = √(0.9*(1-0.9)*400) = √36 = 6. So in 99.7% of the cases, the number of people showing up will be between 342 and 378, which is below the 380 seats available.
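The airline example can be reproduced in code. The sketch below computes the mean and standard deviation from the formulas above and, as an addition not in the notes, sums the exact binomial probabilities of more than 380 people showing up. The function names are hypothetical.

```python
from math import comb, sqrt


def binomial_mean_std(n: int, p: float):
    """Mean pN and standard deviation sqrt(p(1-p)N) of a binomial distribution."""
    return n * p, sqrt(n * p * (1 - p))


def probability_more_than(n: int, p: float, seats: int) -> float:
    """Exact binomial probability that more than `seats` of n people show up."""
    return sum(
        comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(seats + 1, n + 1)
    )


mean, std = binomial_mean_std(400, 0.9)
print(mean, std)
print(mean - 3 * std, mean + 3 * std)        # the three-sigma range
print(probability_more_than(400, 0.9, 380))  # chance that someone has no seat
```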
The central limit theorem states that independent random variables of finite variance, meaning that the values are in a limited range, add up to a normal distribution. A lot of the predictability in the world stems from the fact that exceptional events, like 300 people lining up before a bathroom, do not happen. When actions are not independent, for example stock returns, the model doesn't work. In that case you may have more big events than expected or more small events than expected.
Six sigma is a business practice to assure quality so that there are fewer quality errors. In a normal distribution, values more than six standard deviations from the mean are extremely rare; six sigma practice quotes a rate of only 3.4 per million. For example, average banana sales are 500 kilogrammes, and sigma is 10 kilogrammes. How many kilogrammes of bananas need to be in stock to cope with six sigma events? Six sigma is 60 kilogrammes, so 560 kilogrammes of bananas. Another example is a production process where the required metal thickness is between 500 and 560 mm. If the outcome has a normal distribution and the mean is 530, then sigma needs to be 5 or less, so that six sigma fits within the margin of 30 on either side of the mean.
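Both six sigma examples are simple arithmetic and can be written as two small functions. This is a sketch with hypothetical names; the second function assumes that the process mean sits exactly in the middle of the allowed range, as in the metal thickness example.

```python
def stock_for_six_sigma(mean: float, sigma: float, k: int = 6) -> float:
    """Stock needed to cover demand up to k standard deviations above the mean."""
    return mean + k * sigma


def max_sigma(lower: float, upper: float, k: int = 6) -> float:
    """Largest sigma for which k sigma fits between a centred mean and each limit."""
    # Assumption: the mean is halfway between the lower and upper limits.
    return (upper - lower) / 2 / k


print(stock_for_six_sigma(500, 10))  # kilogrammes of bananas to keep in stock
print(max_sigma(500, 560))           # largest acceptable sigma for the thickness
```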
John Conway's game of life is a simple model that shows how things aggregate that leads to surprising conclusions. Like the cellular automata, this model shows how complicated aggregation can be. It shows the limits of aggregation in the real world. It is very hard to infer from what happens at the macro level what is going on at the micro level. Aggregations of simple rules can create complex patterns.
The game of life uses a rectangular grid like a go board. It works a lot like Schelling's model:
- cells can be either on (grey) or off (white);
- if a cell is off, a cell can only turn on if exactly 3 of the 8 neighbours are on;
- if a cell is on, a cell stays on if 2 or 3 of the 8 neighbours are on.
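The rules above fit in a few lines of code. The following is a minimal sketch, assuming an unbounded grid instead of a bounded board and representing the grid as a set of (row, column) pairs of cells that are on. The name life_step is made up for this illustration.

```python
from collections import Counter

def life_step(live_cells):
    """Advance a set of live (row, col) cells by one generation."""
    # Count, for every cell next to a live cell, how many live neighbours it has.
    neighbour_counts = Counter(
        (r + dr, c + dc)
        for (r, c) in live_cells
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    # Off cells turn on with exactly 3 live neighbours;
    # on cells stay on with 2 or 3 live neighbours.
    return {
        cell
        for cell, count in neighbour_counts.items()
        if count == 3 or (count == 2 and cell in live_cells)
    }

# A horizontal row of three cells (a blinker) flips to vertical and back.
blinker = {(1, 0), (1, 1), (1, 2)}
after_one = life_step(blinker)
after_two = life_step(after_one)
print(sorted(after_one))
print(after_two == blinker)
```

Cells that do not meet either condition are simply left out of the returned set, which means they are off in the next generation.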
The game of life can also produce systems that grow, as can be seen in configuration 3a and configuration 3b. Using the NetLogo programme, you can calculate the evolution of different patterns such as Beacon, Figure 8 and F-pentomino. The beacon pattern behaves like a blinker and moves back and forth between two states.
The figure 8 pattern generates a sequence of more complex patterns and after 8 cycles it returns to the initial state. The F-pentomino pattern moves out into space and generates gliders. In this case simple rules can create complex patterns.
This can be used, for instance, to argue that simple neurons following simple rules can create complex things like cognition, memory and thought. The game of life doesn't explain cognition, but it shows that cognition might be an aggregation of elements that follow simple rules. A glider pattern occurs when, after a sequence of patterns, the same pattern reappears at a different location.
The game of life teaches us the following:
- there can be self organisation because patterns appear without a designer as these patterns occur because of individual cells following simple rules;
- there can be an emergence of functionality, such as gliders, glider guns, and counters. You can even use them as computers if you interpret what the patterns mean. For example, cognition is an emergent phenomenon;
- it can help us to get the logic right, as simple rules produce incredible phenomena that you may not have known about without these simple models.
Simple cellular automata models are one-dimensional. Cells have two neighbours, except those on the borders. In this way you can write out every rule and display how the model evolves over time by letting time move along the vertical axis. In his book A New Kind of Science, Stephen Wolfram explores this subject in depth.
In this simple model a cell turns on or off based on its own state and the states of its two neighbouring cells. Those three cells have eight possible states. For each of those eight states, the cell in the centre has two possible options, which is either to go on or off, so there are 2^8 = 256 possible rules that define what can happen to the middle cell for each of the eight states of the three relevant cells.
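A one-dimensional cellular automaton of this kind can be sketched as follows. The sketch assumes Wolfram's numbering of the 256 rules, in which bit i of the rule number gives the next state for the neighbourhood whose three cells spell i in binary. It also assumes that missing neighbours at the borders count as off. The choice of rule 30 and all function names are illustrative only.

```python
def rule_table(rule_number):
    """Map each (left, centre, right) triple to the centre cell's next state."""
    table = {}
    for index in range(8):
        triple = ((index >> 2) & 1, (index >> 1) & 1, index & 1)
        table[triple] = (rule_number >> index) & 1
    return table

def ca_step(cells, table):
    """Apply the rule once; cells beyond the borders are assumed to be off."""
    padded = [0] + list(cells) + [0]
    return [
        table[(padded[i - 1], padded[i], padded[i + 1])]
        for i in range(1, len(padded) - 1)
    ]

def run(cells, rule_number, steps):
    """Return the list of rows, starting with the initial row."""
    table = rule_table(rule_number)
    history = [list(cells)]
    for _ in range(steps):
        history.append(ca_step(history[-1], table))
    return history

start = [0, 0, 0, 1, 0, 0, 0]
history = run(start, 30, 3)

# Time runs down the page, one row per period.
for row in history:
    print("".join("#" if cell else "." for cell in row))
```

Changing the rule number in the call to run gives any of the other 255 rules.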
You can also use NetLogo to depict this rule. It shows a complex pattern and the behaviour of the middle cell is random, in the sense that you can't predict what is going to happen to this cell if you don't have the state of this cell and its neighbours in the previous period.
These simple models indicate that everything may come from very simple rules. Some physicists like John Wheeler have suggested that reality can come from binary processes. This is called "it from bit" or binary choices. This is not proof that the world does come from binary processes, but it may be possible to explain anything by answering a sequence of yes or no questions, and the very deep bottom of reality could just be binary switches.
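The behaviour of a rule can be classified by λ (lambda), the fraction of the eight neighbourhood states for which the rule switches the cell on. Based on the different values for λ, you can expect different types of behaviour: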
- if λ = 0 then all cells die off;
- if λ = 1/8 then the system blinks;
- if λ = 1 then all cells switch on;
- if λ is in the range of 3/8 to 5/8, most of the complex and random patterns emerge.
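As a sketch, λ can be computed for any of the 256 rules. This assumes that λ is the fraction of the eight neighbourhood states for which the rule switches the cell on, and that rules are numbered in Wolfram's way, so that the binary digits of the rule number are the eight outcomes. The rule numbers 30 and 110 are examples chosen for this illustration and do not come from the notes.

```python
def langton_lambda(rule_number):
    """Fraction of the eight neighbourhood states that switch the cell on."""
    return bin(rule_number & 0xFF).count("1") / 8

lambda_all_off = langton_lambda(0)      # every state switches the cell off
lambda_all_on = langton_lambda(255)     # every state switches the cell on
lambda_rule_30 = langton_lambda(30)
lambda_rule_110 = langton_lambda(110)

print(lambda_all_off, lambda_all_on, lambda_rule_30, lambda_rule_110)
```

Both example rules fall in the range of 3/8 to 5/8.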
Intermediate levels of interdependence lead to complexity and randomness. For example, in markets there are intermediate levels of interdependence, so this leads to complex patterns.
Preference aggregation has a different mathematical structure than aggregating numbers like in the central limit theorem or aggregating rules like with the cellular automata. One way to write down preferences is through revealed actions. You can give people sets of choices and ask them what they prefer. So a person may prefer apples to bananas and bananas to coconuts. Preference orderings are rankings of a set of alternatives. Typically, these preference orderings concern a particular class, for instance fruit or cars.
If you have three options, there are 3 * 2 * 1 = 6 rational preference orderings, because you have three options to pick first, and for each of them two options to pick second, and for each of them, one option to pick third. The six options are A > B > C, A > C > B, B > A > C, B > C > A, C > A > B and C > B > A.
For example, person 1 has preference A > B > C, person 2 has preference B > C > A and person 3 has preference ordering C > A > B. In this case it is not obvious what the aggregate preference ordering is. A possible solution is to order the preferences pairwise. Persons 2 and 3 prefer coconuts to apples, so the aggregate preference is C > A. The other aggregate preferences are B > C and A > B. So we get C > A > B > C, which is a cycle rather than an ordering.
This may have consequences, for example with voting. Even when individuals vote rationally, the collective outcome might not be rational. Because collective outcomes may not be rational, people may vote strategically, or there may be all kinds of political games, for example misrepresenting preferences, to manipulate the outcome into the desired direction.
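The pairwise comparison for the three persons can be reproduced with a short sketch. The function name pairwise_majority is hypothetical, and the code assumes an odd number of voters so that no pair ends in a tie.

```python
from itertools import combinations

def pairwise_majority(rankings):
    """Return the majority winner for every pair of alternatives.

    Each ranking lists the alternatives from most to least preferred.
    Assumes an odd number of voters, so ties cannot occur.
    """
    alternatives = sorted(rankings[0])
    results = {}
    for a, b in combinations(alternatives, 2):
        a_votes = sum(1 for r in rankings if r.index(a) < r.index(b))
        b_votes = len(rankings) - a_votes
        results[(a, b)] = a if a_votes > b_votes else b
    return results

rankings = [
    ["A", "B", "C"],  # person 1
    ["B", "C", "A"],  # person 2
    ["C", "A", "B"],  # person 3
]
majority = pairwise_majority(rankings)
for pair, winner in majority.items():
    print(pair, "->", winner)
```

A beats B, B beats C and C beats A, so no alternative beats both of the others.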
Decision models describe how people do make decisions and how people should make decisions. Positive models predict the behaviour of others. Normative models make us better at making decisions. There are multi criteria decision making models and probabilistic models. Spatial models are multi criteria decision making models that assume that there are a couple of dimensions and an ideal point, and then measure how far the option is from the ideal point on different features. To deal with uncertainty, decision trees can be used. With decision trees, we can compute the value of information.
Multi-criterion decision making can be based on qualitative or quantitative criteria. For example, when buying a house, you can list the criteria, and check which house scores best on each of them and then choose the one that scores best on the most criteria. This is decision making on a qualitative basis. If the outcome is one house, while you prefer another house, you may have missed criteria or not weighed the criteria correctly. Adding a weight to the criteria means that the decision can be made on quantitative measures.
Spatial choice models assume that there is some ideal point, where there is not too much and not too little of something. Here the difference between the choice and the ideal point with regard to a specific feature is of importance. This can be on one dimension, for example the size of a television screen. It can be on multiple dimensions, for example when choosing between cars and political candidates, multiple issues might be relevant. You can not only use this model normatively, but also positively to gauge an individual's preferences based upon his or her choices.
Probability is the odds that something happens. There are three axioms with regard to probability that are always true:
- (1) the probability of an outcome is between 0, meaning that it certainly will not happen, and 1, meaning that it certainly will happen;
- (2) the probabilities of all possible outcomes sum to 1, for example when flipping coins there are two outcomes, heads or tails, each with probability 1/2;
- (3) if A is a subset of B, then the probability of B happening is at least as great as the probability of A happening, because if A happens then B happens, but not the other way around.
There are three types of probabilities:
- classical probabilities: you can write down mathematically in a pure sense what each probability will be, for example when rolling a die or tossing a coin;
- frequency: based upon data it is possible to make a frequency count and make an estimate of the probability, for example the probability of dying from cancer;
- subjective probabilities: based on a guess or a model. With guessing, people can make errors and the probabilities they come up with may not meet the axioms.
For example, you are planning a trip to a city. There is a 40% chance you won't be on time to get on the 3pm train. A ticket for the 3pm train costs $200, but a ticket for the 4pm train costs $400. Buying a $200 ticket means a 40% chance of throwing this money away and having a total cost of $600.
Using a decision tree, it is possible to calculate the expected cost of both options. Buying a $200 ticket has an expected cost of 0.6 * $200 + 0.4 * $600 = $360, which is cheaper than buying a $400 ticket, so it is rational to buy the $200 ticket and take the risk of wasting $200.
Now follows a more complicated example. Suppose you think about applying for a scholarship of $5,000. The number of applicants for this scholarship is limited to 200. You first have to write a two page essay, and then ten finalists are picked who have to write a ten page essay. You have to decide whether to apply. For this you need to know the probability of events happening, so the probability of making it to the final and the probability of winning. You also need to know the payoff and the cost. Suppose that it costs $20 to write a two page essay and $40 to write a ten page essay. The decision tree is solved in three steps:
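The train decision can be written as a tiny decision tree in code. This is a sketch only: the helper expected_value and the option labels are made up for this illustration.

```python
def expected_value(branches):
    """Sum the probability-weighted outcomes of a chance node."""
    return sum(probability * value for probability, value in branches)

# Buying the 3pm ticket: 60% chance of paying $200 in total,
# 40% chance of missing the train and paying $200 + $400 = $600.
cost_3pm_ticket = expected_value([(0.6, 200), (0.4, 600)])

# Buying the 4pm ticket: a certain cost of $400.
cost_4pm_ticket = 400

options = {"buy 3pm ticket": cost_3pm_ticket, "buy 4pm ticket": cost_4pm_ticket}
best_option = min(options, key=options.get)
print(options, best_option)
```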
- (1) draw the tree;
- (2) write down payoffs and probabilities;
- (3) solve backwards.
In the final there is a 10% chance of winning $5,000 - $60 = $4,940 and a 90% chance of losing $60, so the value of being in the final is 0.1 * $4,940 + 0.9 * -$60 = $440. Of the 200 applicants, 10 make it to the final, so the chance of reaching the final, which is worth $440, is 5%. This branch is worth 0.05 * $440 = $22. There is a 95% chance of losing $20, which is worth 0.95 * -$20 = -$19. Hence, the value of writing the first essay is $22 - $19 = $3, compared to $0 for not competing.
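Solving the scholarship tree backwards can be sketched as follows. All variable names are made up for this illustration, and the sketch assumes that each of the ten finalists is equally likely to win.

```python
prize = 5000
cost_first_essay = 20
cost_second_essay = 40
p_finalist = 10 / 200          # 10 of the 200 applicants reach the final
p_win_in_final = 1 / 10        # assumption: finalists are equally likely to win

# Step 1: the last chance node, reached after writing both essays.
payoff_win = prize - cost_first_essay - cost_second_essay    # 4940
payoff_lose_final = -(cost_first_essay + cost_second_essay)  # -60
value_write_second = (p_win_in_final * payoff_win
                      + (1 - p_win_in_final) * payoff_lose_final)

# A finalist could also stop and only lose the cost of the first essay.
value_final = max(value_write_second, -cost_first_essay)

# Step 2: the first chance node, reached after writing the first essay.
value_apply = (p_finalist * value_final
               + (1 - p_finalist) * -cost_first_essay)

# Step 3: the decision at the root of the tree.
value_not_apply = 0
best_choice = "apply" if value_apply > value_not_apply else "do not apply"
print(value_final, value_apply, best_choice)
```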
Assume that there is a chance of p that the investment will be successful, so there is a chance of 1-p that the investment will fail. Her decision suggests that she thinks that 50p - 2(1-p) > 0 => 52p > 2 => p > 1/26 ≈ 4%. You can infer from her decision that she thinks that the chance of success is above 4%.
You have to decide whether to go to the airport or to stay on campus and not go home for the weekend. Suppose you decide not to go. You can use a decision tree to find out exactly how much you really wanted to see your parents.
Assume the value of seeing your parents to be V and assume the cost of going to the airport to be c, then your decision means that (1/3)(V - c) -(2/3)c < 0 => (1/3)V < c => V < 3c. Hence, the value of seeing your parents is less than three times the cost of going to the airport. If you go to the airport, the value of seeing your parents is greater than three times the cost of going to the airport.
In models of decisions being made under uncertainty, you can ask how much information would be worth to you. One of the benefits of having a formal model is that you can figure it out. For example, a roulette wheel in the United States has 38 different numbers, the numbers 1 through 36 plus two other spots. If I guess a particular number, the probability of winning would be 1/38. What is the value of information here? There are two different questions we could ask.
What if you consider betting on number 17 and somebody knows whether or not the number 17 will come next? Without this information, you presume that you will lose, so you won't make a bet. If you can win $100, and the chance of winning is 1/38, the information is worth $100/38 = $2.63. Alternatively, suppose the person says that he can tell you the winning number and you can win $100 per round. The value of that information is $100.
This can also be applied to more complicated examples. Suppose you think about buying a car but you are worried that there will be a cash back programme next month in which you can get $1,000 back. You figure there's a 40 percent chance there will be a cash back programme. You could rent a car for $500 right now and wait to see if there will be a cash back programme. Suppose someone at the auto company can sell you the information on whether there will be a cash back programme or not. The value of this information is found in three steps:
- (1) calculate the value without the information based on the optimal choice;
- (2) calculate the value of the optimal choice with the information;
- (3) calculate the difference.
Without the information, buying now is worth $0, while renting and waiting is worth 0.4 * $500 + 0.6 * -$500 = -$100, so the optimal choice is to buy now and the value without the information is $0. If you have the information, then there is a 40% chance that there is a cash back and a 60% chance that there isn't. If there is a cash back, you are going to rent and gain $500. If there is not, you are going to buy a car, and gain $0. With the information, you are no longer making a choice under uncertainty. There is a 40% chance that you gain $500, and a 60% chance that you gain $0. The value of the best options is: 0.4 * $500 + 0.6 * $0 = $200. The value of this information is $200 - $0 = $200.
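The three steps for the car example can be sketched in code. The variable names are made up for this illustration, and all payoffs are measured relative to buying the car now at the full price, which is the $0 baseline used in the text.

```python
p_cash_back = 0.4
cash_back = 1000
rental_cost = 500

# Step 1: value of the optimal choice without the information.
value_buy_now = 0
value_rent_and_wait = (p_cash_back * (cash_back - rental_cost)
                       + (1 - p_cash_back) * -rental_cost)
value_without_info = max(value_buy_now, value_rent_and_wait)

# Step 2: value with the information, choosing the best action in each case.
value_if_cash_back = max(value_buy_now, cash_back - rental_cost)  # rent
value_if_no_cash_back = max(value_buy_now, -rental_cost)          # buy now
value_with_info = (p_cash_back * value_if_cash_back
                   + (1 - p_cash_back) * value_if_no_cash_back)

# Step 3: the difference is the value of the information.
value_of_information = value_with_info - value_without_info
print(value_without_info, value_with_info, value_of_information)
```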
The physicist Murray Gell-Mann once said: "Imagine how hard physics would be if electrons could think." People are more complicated than electrons because they think and are diverse. This makes people more difficult to model. Humans are modelled according to three basic frameworks:
- rational actor models: These models state that people have goals and optimise their goals.
- behavioural models: These models are based on real data about choices and actions and made as close as possible to how real people behave.
- rule based models: These models assume that people follow rules without deeper motives. These models are the simplest.
Actual models may have some rational, behavioural and rule based aspects. Rational models presume that people have an objective and a specific function that they optimise, for example happiness or utility. Rational models are often criticised based on data and neuroscience. Humans are not rational based on how they actually behave. Behaviour can be very complex but if the actual behaviour is close to some simple rule then this rule may suffice for modelling.
Even though rational actor models have come under increased criticism because they are not good at predicting actual behaviour, the assumption of rationality can be useful for constructing models. Rational actor models can be applied to decisions as well as games. Rational actor models work as follows. These models assume that people, or groups of people such as corporations, have an objective, and that they optimise their choices given that objective.
A firm might maximise profits and an individual might maximise utility. For example, if a firm wants to maximise revenue, where revenue = price * quantity and price = 50 - quantity, then the firm will produce 25 and the revenue will be 25 * 25 = 625. Rationality doesn't have to be selfishness. People may have altruistic preferences. For example, if you want to maximise your utility from spending your income of 40k on consumption C and donations D, the objective you want to maximise is: √C√D = √C√(40-C) = √(C(40-C)). This objective is maximised with C = 20 and D = 20.
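Both optimisation examples can be checked by simply trying every candidate value. This is a sketch under the assumption that only whole numbers are considered. The helper best_choice and the variable names are made up for this illustration.

```python
import math

def best_choice(candidates, objective):
    """Return the candidate with the highest value of the objective."""
    return max(candidates, key=objective)

# Firm: revenue = price * quantity with price = 50 - quantity.
best_quantity = best_choice(range(0, 51), lambda q: q * (50 - q))
best_revenue = best_quantity * (50 - best_quantity)

# Individual: utility = sqrt(C) * sqrt(D) with C + D = 40.
best_consumption = best_choice(
    range(0, 41), lambda c: math.sqrt(c) * math.sqrt(40 - c)
)
best_donation = 40 - best_consumption

print(best_quantity, best_revenue, best_consumption, best_donation)
```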
There are normal form games and extensive form games or game trees. The next example is a normal form game. Assume that there are two people, person 1 (options in black) and person 2 (options in green) that are going to decide about going to the city. Person 1 gets a payoff of 1 if he stays at home and 2 if he goes to the city. His payoff doesn't depend on what person 2 does.
The following example is an extensive form game or game tree. Assume that there is a green person, who makes the first decision, and a blue person. If the green person chooses option 1, both will get a payoff of 0. If the green person chooses option 2, the blue person can choose between option 2.1 and option 2.2.
If the blue person chooses 2.1 he will get a payoff of 3 but the green person will get a payoff of -3. If the blue person chooses option 2.2, both will get a payoff of 2, which is the best choice for both. However, if the green person assumes that the blue person is rational, she will choose option 1, and both will get a payoff of 0.
We are likely to see rational behaviour in the following cases:
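The reasoning in this game tree is backward induction: first work out what the blue person does at the end, then what the green person does given that. A minimal sketch, with payoffs written as (green, blue) and variable names made up for this illustration:

```python
# Payoffs are (green, blue).
payoff_option_1 = (0, 0)
blue_options = {"2.1": (-3, 3), "2.2": (2, 2)}

# The blue person moves last and picks the option with the highest blue payoff.
blue_choice = max(blue_options, key=lambda option: blue_options[option][1])
payoff_option_2 = blue_options[blue_choice]

# The green person anticipates this and compares her own payoffs.
if payoff_option_1[0] >= payoff_option_2[0]:
    green_choice, outcome = "1", payoff_option_1
else:
    green_choice, outcome = "2", payoff_option_2

print(blue_choice, green_choice, outcome)
```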
- when the stakes are large so that people are more likely to take a lot of time evaluating the options and their benefits and drawbacks;
- with repeated actions so that people can learn from experience;
- group decisions because typically when more people are involved, decisions tend to be more rational;
- easy problems because they can be solved with little effort.
The assumption of rationality has the following merits:
- benchmark: it can be used as a benchmark to evaluate actual behaviour;
- unique: it results in a specific answer, whereas there can be an unlimited number of ways of being irrational;
- easiest to solve: you can often use mathematics to find the optimal point;
- people learn: by experience people become closer to rational;
- mistakes often cancel out: if there is no bias in the mistakes, the average may be close to rational.
Behavioural models are critical of the rational actor assumption based on evidence of actual behaviour and on neuroscience and psychology. Data from laboratories and the real world shows that people systematically deviate from the optimal choices. Evidence from neurology indicates that the way our brain is structured, the way we encode and represent information, and the way we think cause us to systematically deviate from what a rational actor model would suggest.
Daniel Kahneman argues in his book Thinking, Fast and Slow that there are fast thought processes that are based on emotion and quick clues as well as slow thought processes that process information and are more rational. Fast thought processes make us biased in ways the rational actor model assumes that we are not. In their book Nudge, Cass Sunstein and Richard Thaler argue that, because people make systematic mistakes, this has implications for policies.
There are four types of well documented biases that cause behaviour to deviate from rational behaviour:
- prospect theory indicates that we look at gains and losses differently;
- hyperbolic discounting is about how much we discount the future and how that changes depending on how much in the future that is;
- status quo bias is a tendency to stick with what we are currently doing and not make changes;
- base rate bias means that we are influenced by what we are currently thinking.
Prospect theory states that people tend to be risk averse over gains and risk loving over losses, which explains why people take gambles when they shouldn't. Kahneman came up with the following example. Suppose you have two options. You can get $400 for sure or you get a 50% chance of winning $1,000 and a 50% chance of gaining $0. A lot of people would choose the $400 for sure. If the amounts get larger, people become more risk averse in gains. However, when there is a choice between a loss of $400 for sure and a gamble with a 50% chance of a loss of $1,000 and a 50% chance of a loss of $0, people are more willing to take the gamble. Neither behaviour is rational.
Hyperbolic discounting means that we discount the near future more than we discount the future that is further away. For example, we tend to prefer $1,000 now to $1,005 tomorrow, but we tend to prefer $1,005 in a year and a day to $1,000 in a year. Immediate gratification matters a lot to humans. This has what is often called the chocolate cake implication. People want to be healthy, so if you are offered a chocolate cake a week from now, you are more likely to decline the offer, but if the chocolate cake is put in front of you, you are more likely to eat it. This is because fast thinking prevails.
The status quo bias means that people tend to keep things the way they are. For example, if people have to check a box to contribute to the pension fund, most of them won't check the box, but if they have to check a box to not contribute to the pension fund, most of them still won't check the box. This probably is because checking the box seems to imply a change. In England people have to check a box to donate organs, and 25% checked it. In the rest of Europe people have to check a box to not donate organs, and 10% checked it.
The base rate bias means that people are influenced by what they are currently thinking. For example, when people are asked when a box was made, and how much it costs, the answers are often close to each other. For example, you may think that the box was made in 1950 (50), and then you probably estimate a price close to that number, for instance 52. This is because you were already thinking of a number, so if you have to think of another number, this number probably is close to the first number.
There are lots of biases, and they are well documented. There are also criticisms. For instance, most of those biases are found in Western, Educated, Industrialised, Rich and Democratic (WEIRD) countries. So the question is how many of them apply to other countries as well. Furthermore, people learn so that they may overcome their biases. Finally, it can be computationally difficult to account for all kinds of biases, so many models assume people are rational. One way to deal with this is to use simple rules. A more sophisticated way is to start by assuming that people are rational, and then look for the biases that are relevant, and include them in the model.
Rule based models assume that people simply follow some rules. For example, the Schelling model presumes that people will move as soon as the percentage of similar people falls below a certain threshold. There are four types of rules based behaviours in two dimensions. There are fixed and adaptive rules in a decision context or a game context where the payoff depends on what other people do. In a game context a rule is often called a strategy.
- (1) fixed decision rules. An example of a fixed decision rule is random choice. Random choice, like optimal choice, can be used as a benchmark. You can compare optimal choice and random choice and see how the model behaves under those assumptions to get more understanding about what might happen and what could happen. Another example of a fixed rule is taking the most direct route, which is the route closest to the right direction. This may not be the shortest or the fastest route, so this rule may not be optimal.
- (2) fixed strategies. Grim trigger is a fixed strategy that can be encoded in a Moore machine. The initial state is being nice. As soon as the other person starts acting mean, the state changes to mean, and it remains mean even when the other person starts acting nice again.
- (3) adaptive decision rules. The gradient-based method is an adaptive rule that means that you keep trying things in directions that are working. For example, suppose that you are baking cookies, and start by adding one spoonful of honey, and it turns out that the cookies are very good. The next time you might try adding two spoonfuls of honey. If the cookies taste even better, you might add another spoonful of honey. You might go on until the cookies start to taste too sweet. Another adaptive rule is random behaviour or changing what you are doing until you find something better. With regard to cookies, you might try adding raisins, chocolate or walnuts.
- (4) adaptive strategies. Adaptive behaviour makes the most sense in strategies because other persons might try to take advantage of me, so that I will change my behaviour to take this into account. One adaptive strategy is called best response. Assume that there is some strategic situation, and the other person is taking some action, you could act like a rational choice person, and give the best possible response. Another option is mimicry, which is copying the behaviour of other people around you, or more specifically, people that are doing well. This option may be chosen if you do not understand the situation well enough, for example with stock market investing.
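The grim trigger strategy mentioned in the list above can be sketched as a small state machine. In a Moore machine the action depends only on the current state, and the state changes based on what the other person does. The class name GrimTrigger and the helper play are made up for this illustration.

```python
class GrimTrigger:
    """Moore machine with two states: 'nice' and 'mean'."""

    def __init__(self):
        self.state = "nice"  # the initial state is being nice

    def action(self):
        # The output depends only on the current state.
        return self.state

    def observe(self, other_action):
        # Once the other person acts mean, switch to mean and never go back.
        if other_action == "mean":
            self.state = "mean"

def play(opponent_actions):
    """Return the actions grim trigger plays against a sequence of opponent actions."""
    machine = GrimTrigger()
    played = []
    for other_action in opponent_actions:
        played.append(machine.action())
        machine.observe(other_action)
    return played

responses = play(["nice", "nice", "mean", "nice", "nice"])
print(responses)
```

The machine stays nice in the round in which the other person first acts mean, because it only reacts to what it has already observed.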
There are a few observations:
- sometimes optimal rules are simple. For example, your happiness might depend on chocolate and movies only, so that it is easy to maximise happiness by following a simple optimising rule, so that following a simple rule is a good thing to do.
- simple rules can often be exploited. For example, in a bargaining situation you might start by accepting only if you get 60%, and if you fail you demand 1% less every round. If the other party is aware of this, it is possible that you will end up with nothing.
There are reasons to model people using rules:
- they are simple and therefore easy to model and to compute;
- it is possible to capture the main effects;
- models are ad hoc because people are different and no model is able to explain everything;
- rules can be exploited in a strategic situation, so if people follow rules, you can take advantage of that.
An important question is whether it matters which rules we write down. This depends on the situation. One reason we model is to figure out how much it matters how accurately we model. This can be demonstrated using two examples, which are a two sided market and a race to the bottom. In a two sided market it doesn't matter much how we model behaviour, but in a race to the bottom it matters a lot.
The first example is a two sided market of buyers and sellers. Assume that buyers are willing to pay prices ranging from $0 to $100. Assume that sellers are willing to sell at prices ranging from $50 to $150. What would rational people do? Rational buyers would bid somewhat less than the price at which they will make zero profit and sellers ask a little bit more than the price at which they will make zero profit.
The relevant buyers and sellers are willing to make deals between $50 and $100. If the prices are evenly distributed, the price may end up being around $75, and those who are willing to buy for $75 or more and those who are willing to sell for $75 or less will make a deal. This happens when everyone is acting rationally.
Assume now that buyers and sellers are not so informed, and are biased to bid at rounded prices like $40, $60, $75 or $100. A rational buyer might bid $72, but a less sophisticated buyer might then go for $75. However, the outcome probably will not differ much, and the market price is likely to settle around $75.
Assume now that buyers and sellers are completely uninformed and show zero intelligence behaviour, which means that they pick some random price that is below their zero profit value for buyers and above their zero profit value for sellers. Even in this case the price ends up being close to $75.
In markets there is little difference between completely rational and informed behaviour and zero intelligence behaviour, so in models of markets behaviour is largely irrelevant. This is completely different in games, like for instance, a race to the bottom.
An example of a race to the bottom is that people pick a number between 0 and 100, and that the person who is the closest to 2/3 of the mean wins. Assume that everyone is rational and knows that all the others are rational too. The result is that they will come up with 0. This can be explained as follows. Suppose everybody picks 6. Then 2/3 of the mean is 4, so everyone should pick 4, but then 2/3 of the mean would be 8/3, so everyone should pick 8/3. This goes on until 0.
In many cases people are biased. And in this race to the bottom game, a significant number of people will start out with the number 50. How might a rule based model work in this case? Some may think that other people will guess 50, so they will come up with the number 33, which is about 2/3 of 50. A lot of people guess 33. But some other people think that people will guess 33, so they will come up with 2/3 of 33, which is 22. In real world experiments, you often see 50, 33 and 22, and sometimes even 14, which is about 2/3 of 22.
These rules are a mix of a bias and rational behaviour assumptions. If the game is repeated many times, the answers come closer to 0. This is because people go for the adaptive strategy of best responding after the previous results.
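The chain of guesses 50, 33, 22 and 14 can be generated with a short sketch. It assumes that every level of reasoning takes 2/3 of the previous guess and rounds down to a whole number, which is a choice made here to reproduce the numbers in the text. The function name is made up for this illustration.

```python
def successive_guesses(starting_guess, levels):
    """Repeatedly take 2/3 of the previous guess, rounded down."""
    guesses = [starting_guess]
    for _ in range(levels):
        guesses.append(guesses[-1] * 2 // 3)
    return guesses

first_levels = successive_guesses(50, 3)
many_levels = successive_guesses(50, 10)
print(first_levels)
print(many_levels)
```

With enough levels of reasoning, or enough repetitions of the game, the guess reaches 0.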
Assume now that you have two rational people in this game and one irrational person. The rational persons know nothing about the irrational person, only that he is new to the game. What is going to happen? The two rational persons are going to pick some number R. Suppose that both rational persons assume that the irrational person is going to pick number X. Then: R = (2/3)(R + R + X)/3 => 9R = 4R + 2X => 5R = 2X => R = 2X/5. If the other person is expected to choose 50, then R = 20.
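The result R = 2X/5 can be verified numerically: if both rational persons pick R and the third person picks X, then R should be exactly 2/3 of the mean of the three picks. The function names below are made up for this illustration.

```python
def two_thirds_of_mean(picks):
    """The winning target: 2/3 of the mean of all picks."""
    return (2 / 3) * sum(picks) / len(picks)

def rational_pick(expected_x):
    """Pick R = 2X/5 for two rational players facing one player who picks X."""
    return 2 * expected_x / 5

expected_x = 50
r = rational_pick(expected_x)
target = two_thirds_of_mean([r, r, expected_x])
print(r, target)
```

The target may differ from R by a tiny floating point rounding error.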
The lesson from these examples is that rational behaviour is a good benchmark, but that it is also important to include biases in a model. It is also important to consider simple rules. Then we have to consider how much the outcome of each model differs from the other models. If the differences are small, then the result might not depend much on behaviour. If the differences are big, then we may have to consider which class of model is the most appropriate.
An important use of models is understanding data. Categorical models place all the data in different boxes. For example, in trying to understand why some people live longer than others, you might divide them into people that exercise and people that don't exercise. By looking at the variation and the mean values in those boxes, you might find out whether or not this categorisation tells something about how long people live.
Non-linear models could be represented by functions that do not produce a straight line, for example exponential growth functions. The diversity of non-linear functions is enormous.
The big coefficient means that decisions can best be based on the biggest coefficient. For example, a function of school quality might be Y = a1X1 + a2X2, where X1 might be class size and X2 might be teacher quality. Here a1 and a2 are the coefficients. These coefficients tell how important the variable is, so the bigger the coefficient, the more important the variable.
Linear models and big coefficient thinking are better than just using intuition and experience. There is also criticism of this way of thinking. A problem with big coefficient thinking is that it only works in areas in which there is data available. Big breakthroughs are often made by shifting to areas where there is no data. This means that there will be a new reality where big coefficient thinking doesn't work very well.
Categorical models bin the data into different categories in order to explain some of the variation in the data. For example, at the time of the IPO of Amazon, one analyst labelled it a delivery company like DHL or Fedex, and because delivery companies had low profit margins, he thought that Amazon would be a bad investment. Another analyst considered Amazon a good investment because he thought Amazon was part of the new information economy that has high profit margins. In other words, both analysts put Amazon in a different box. Amazon did very well, so which box you chose matters a lot for your investment returns.
Lump to live means that we create categories to make sense of the world. For example, if I see a car, I don't say there goes a 2003 Volkswagen 1.6 GTI, but I just say car. We model to decide, strategise and design. One reason we lump is to make faster decisions. For example, a child may have a rule not to eat green items. This helps the child to avoid eating grasshoppers, which is something it doesn't like to eat. But the rule is not optimal. The child might forego a juicy pear in this way.
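One way of determining the variation is to take the difference between every value and the mean. You take absolute values because otherwise positive and negative differences cancel each other out. Hence, the total difference from the mean is 80 + 70 + 90 + 70 + 170 = 480. The total variation used below is instead based on squared differences, which gives 6,400 + 4,900 + 8,100 + 4,900 + 28,900 = 53,200.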
The obvious categorisation is that pears, apples and bananas are fruit, while cakes and pies are desserts. The fruit values are 90, 100, and 110, with a mean of 100 and a total variation of 200, while the dessert values are 250 and 350, with a mean of 300 and a total variation of 5,000.
Total variation went down from 53,200 to 5,200. Hence, the categories substantially reduce the amount of variation. Variation can be seen as unexplained, so the categorisation explained a lot. The total variation explained is (1 - 5,200/53,200)*100% = 90.2%. This is called the R-squared. If the R-squared is near 1, the model explains a lot, if it is near 0, it explains very little.
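The arithmetic above can be checked with a short sketch. It assumes that variation is measured as the sum of squared differences from the mean, which is what the figures 53,200 and 5,200 imply; the function names are made up for this illustration.

```python
def total_variation(values):
    """Sum of squared differences between each value and the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def r_squared(categories):
    """Share of the variation that disappears once the data is binned."""
    all_values = [v for group in categories.values() for v in group]
    before = total_variation(all_values)
    after = sum(total_variation(group) for group in categories.values())
    return 1 - after / before


# Values taken from the fruit and dessert example in the text.
menu = {"fruit": [90, 100, 110], "dessert": [250, 350]}
print(round(r_squared(menu), 3))
```

Trying other groupings of the same five values shows how much the choice of boxes matters for the share of variation explained.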
Sometimes there is so much variation in the data that great models only have an R-squared of 5% to 10%. Sometimes the situation is clear-cut and the R-squared can be near 90%. There is no standard measure for the R-squared of a good model, but in a class of models, a higher R-squared means that the model is better.
Experts tend to have many boxes that are useful. The previous example of fruits and desserts could be enhanced with categories like vegetables and grains. If you want to be good at understanding how the world works, you need to have a lot of categories, which must also be useful and explain a lot of variation.
Even if the model explains a lot of variation, it doesn't mean that the model is good. If you are trying to figure out what determines good school performance, and you make a distinction between schools that have an equestrian team and those that don't, you might find that schools with equestrian teams do better. However, that doesn't mean that the equestrian team made the school good. Correlation is not causation.
- (1) sign of the coefficient: does Y increase or decrease with X?
- (2) magnitude of the coefficient: how much does Y increase for each unit increase in X?
Models are used to predict and to understand data. Assume you have a linear model for the price of a television that is: cost = $15 * length + $100. So suppose you want to buy a 30 inch TV, then you can predict that it will cost $550. In this way you can even predict the price of things that don't exist, like a 100 inch TV.
One of his examples was 43 bank loan officers predicting which 30 of 60 companies would go bankrupt based on their financial statements. The bankers were 75% accurate, but a simple linear model based on the ratio of assets to liabilities was 80% accurate. In similar studies experts did no better than simple linear models.
An important question is how much variation can be explained using the model. If you take just the mean, and you calculate the variation, then there would be a lot of variation. If you draw the right line, it is possible to explain 87.2% of the variation. The question is how to draw the best line?
The goal is to make a linear model that explains as much of the variation as possible. You could try Y = 2X. The variation would be (1-2)² + (4-5)² + (9-8)² = 3. In this way (1 - 3/32) = 29/32 of the variation would be explained, which is over 90%. But this is just a guess.
These values are b = -1 and m = 8/3 so that Y = -1 + (8/3)X. In this case the total variation would be (2/3)² + (2/3)² + (2/3)² = (4/9)*3 = 4/3. This would explain (1 - (4/3)/32) = 23/24 of the variation, which is over 95%. The line is now even closer to the data.
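The comparison of candidate lines can be sketched as follows. The three data points are not listed in the notes; they are inferred here from the arithmetic above (an assumption), and the function names are illustrative. Exact fractions are used so that the results can be compared with the figures in the text.

```python
from fractions import Fraction as F

# Data points inferred from the arithmetic in the text (an assumption).
points = [(1, 1), (2, 5), (4, 9)]


def squared_error(m, b, data):
    """Variation left over after fitting Y = m*X + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in data)


def least_squares(data):
    """Closed-form slope and intercept that minimise squared_error."""
    n = len(data)
    mean_x = F(sum(x for x, _ in data), n)
    mean_y = F(sum(y for _, y in data), n)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    m = sxy / sxx
    return m, mean_y - m * mean_x


total = squared_error(0, 5, points)  # variation around the mean of Y, which is 5
for m, b in [(2, 0), (F(8, 3), -1), least_squares(points)]:
    left = squared_error(m, b, points)
    print(f"m = {m}, b = {b}: unexplained {left}, explained {float(1 - left / total):.3f}")
```

On these inferred points the closed-form fit returns a slope of 18/7 with intercept -1, which leaves 8/7 unexplained, so the line with m = 8/3 is close to the minimiser but not identical to it.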
Suppose there are multiple variables. For example, the test score Y of kids might depend on their IQ Q, teacher quality T and class size Z in the following way: Y = a + bQ + cT + dZ. You expect the coefficient d on class size to be negative, and b as well as c to be positive.
Of the 78 studies on class size, 4 showed a positive coefficient, 13 a negative coefficient and 61 showed no effect. Even though we think that class size matters, it appears not to matter much, at least within the range that has been studied. Research has shown that teacher quality matters far more.
You can fit data in linear models. These models can often explain some percentage of the variation. The models also show us the sign and the magnitude of the coefficients. This shows whether a variable has a positive or a negative effect and how big that effect is. This allows us to make policy choices.
Somebody may give you some regression output and you may have to make sense of this. When we see regression output, then we deal with a linear model based on multiple variables and functions like Y = m1X1 + m2X2 + ... + mnXn + b. With regression output there is more than one X.
For example, Y could be a test score depending on teacher quality T and class size Z so that the model function could be Y = cT + dZ + b. You could expect that better teachers lead to better results so that c > 0 and bigger classes lead to poorer results so that d < 0.
In the regression output, the standard error is 24.21, which tells us how far, on average, the values differ from the mean. The value of R² = 0.72, which means that 72% of the variation is explained by the linear model. There were 50 data points. The coefficient column tells us that the model is Y = 20X1 + 10X2 + 25.
If X1 is teacher quality and X2 is class size then the positive value for the class size coefficient raises some questions. Maybe the data is wrong or maybe our intuition is wrong. So we have to dig a little deeper. The first issue is that we have only 50 observations, so the coefficient may not be correct. The standard error (SE) column gives the error in the coefficients.
For the intercept, the coefficient is 25 and the standard error is 2, so with a probability of about 68% the value lies between 23 and 27, which is one standard error either side. Taking three standard errors either side, we could be really sure that the coefficient is between 19 and 31. Similarly, we could be really sure that the coefficient of X1 is between 17 and 23, and the coefficient of X2 is between -2 and 22. The p-value gives the probability that the sign of the coefficient is wrong. Because the range for X2 includes negative values, we shouldn't be so sure about the positive impact of class size on school performance.
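A minimal sketch of this reading of the standard error column. The coefficients come from the text; the standard errors of 1 for X1 and 4 for X2 are not stated but inferred from the quoted ranges (an assumption), and the function names are made up for this illustration.

```python
# (coefficient, standard error); the errors for X1 and X2 are inferred
# from the ranges quoted in the text, only the intercept's is given.
rows = {"intercept": (25, 2), "X1": (20, 1), "X2": (10, 4)}


def plausible_range(coefficient, std_error, widths=3):
    """Range of `widths` standard errors either side of the estimate."""
    return coefficient - widths * std_error, coefficient + widths * std_error


def sign_is_settled(coefficient, std_error, widths=3):
    """True when the whole range lies on one side of zero."""
    low, high = plausible_range(coefficient, std_error, widths)
    return low > 0 or high < 0


for name, (coefficient, std_error) in rows.items():
    print(name, plausible_range(coefficient, std_error), sign_is_settled(coefficient, std_error))
```

Only the range for X2 straddles zero, which is why its positive sign deserves the most doubt.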
The important issues to deal with are the following:
- (1) how good the model is: how much of the variation does it explain?
- (2) sign of the coefficient: does Y increase or decrease with X?
- (3) magnitude of the coefficient: how much does Y increase for each unit increase in X?
- (4) what is the probability that the coefficient is wrong?
There is a problem with linear regression. Phenomena in the real world are often nonlinear. The number of nonlinear functions is enormous compared to linear functions. There are all kinds of graphs that could be based on functions that are exponential, logarithmic, something else, or mixed. So how can we best fit lines in those messy situations if we only have techniques that can be used with linear functions?
There are three approaches to get around this, which are:
- (1) approximate nonlinear models with linear functions: you could use multiple linear functions to approximate a nonlinear model, hence you get different linear models for different ranges.
- (2) break up the data into different quadrants: you can draw a line in each quadrant that matches the data as closely as possible, and to make the line continuous, you may have to forgo drawing an optimal line for each quadrant separately.
- (3) include nonlinear terms, for example Y = m√X + b.
Suppose we have some model like Y = m1X1 + m2X2 + b where Y might be sales, X1 might be advertising on the internet, and X2 might be advertising in magazines. If the coefficient m1 is bigger than m2 then we would invest in advertising on the internet and not in advertising in magazines. The coefficient is an evidence based rank, and it is applied, for instance, in medicine, philanthropy, education and management. Linear models based on evidence are better than guessing.
Evidence based thinking with models works as follows:
- construct a model;
- gather data;
- identify important variables;
- change those variables;
Thinking with big data alone, without a model, works as follows:
- gather data;
- find patterns;
- identify important variables;
- change those variables.
Big data does not make models obsolete. Models help to understand how the world works. Identifying the patterns is not the same as understanding where they came from, because:
- (1) Correlation is not causation;
- (2) Linear models tell the sign and magnitude of changes in dependent variables within the data range, but we need some understanding to tell whether the model will hold outside the data range.
A bigger problem is multiple peaks. For example, you might have some data in a small range that suggest a peak. And so, based on this data, the peak seems the optimal point, and you miss a better point.
The new reality means that you may want to try something big and new, so you consider non-marginal changes. If you have only data in a small range, this may not help you very much. The multiple peak problem is just one instance of this issue.
Sometimes big coefficient thinking may be helpful, for instance when taxing cigarettes to reduce health care cost and to raise money for financing healthcare. A new reality would be implementing universal healthcare in the United States.
The American Jobs Act is also an example of big coefficient thinking. A new reality was the US interstate highway system. In 1956, the US government allocated $25 billion, which is about $410 billion in today's prices, for 41,000 miles of roads. This was creating something completely new, so it was difficult to calculate the employment effects. Big coefficient thinking can be good for minor changes, but it ignores new realities.
Often when people think about tipping points, they think of kinks in curves. Sometimes a kink in a curve reflects a tipping point, but not always. In many cases these kinks are just exponential growth, because with exponential growth you have a curve that takes off at some point. A book about tipping points is The Tipping Point: How Little Things Can Make a Big Difference, written by Malcolm Gladwell.
It is important to see what is a tip and what is not, and also what kind of models produce tips. Two famous models are the percolation model from physics and the SIS model from epidemiology. Percolation refers to the question whether or not water can make its way through a certain layer, for example the ground or a coffee filter. SIS stands for susceptible, infected and susceptible again, which is a simple model for diseases.
There is a distinction between types of tips. There are direct tips and contextual tips. With a direct tip or active tip, a variable itself changes, which causes it to tip. For example, a battle might tip a war. With a contextual tip something changes in the environment that makes it possible for the system to move from one state to another. For example, the density of the trees could tip the spread of a forest fire.
A system can be stable (in equilibrium), periodic, random or complex. A system can tip from one state to another but there are also tips within classes. For example, a system might tip from one equilibrium to another equilibrium.
Percolation models come from physics. The idea is that water comes down in the form of rain and the question is does the water percolate through the soil or not? To make a model, you can simplify the situation by using a checkerboard. The idea is that water can only percolate from one filled box to another, including diagonal directions. In the example, the water can't make it to the bottom.
What is causing the tip? Let p be the share of boxes that are filled in. For p less than 59.2%, there just aren't enough boxes filled in. However, when the value passes 59.2%, it suddenly becomes much more likely that the water percolates to the bottom.
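The tip can be explored with a small simulation. This is a sketch under stated assumptions rather than the model used in the lectures: boxes are filled independently with probability p, the grid is square, and the function names, grid size and number of trials are all made up for illustration. The neighbourhood is left as a parameter because the location of the threshold depends on it: the value near 59.2% is the one usually quoted for large square grids where water moves only to the four adjacent boxes.

```python
import random
from collections import deque


def random_grid(size, p, rng):
    """Fill each box independently with probability p."""
    return [[rng.random() < p for _ in range(size)] for _ in range(size)]


def percolates(grid, diagonal=False):
    """Search from the filled boxes in the top row for a path to the bottom row."""
    size = len(grid)
    steps = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    if diagonal:
        steps += [(1, 1), (1, -1), (-1, 1), (-1, -1)]
    queue = deque((0, c) for c in range(size) if grid[0][c])
    seen = set(queue)
    while queue:
        r, c = queue.popleft()
        if r == size - 1:
            return True
        for dr, dc in steps:
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < size and 0 <= nc < size
            if inside and grid[nr][nc] and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append((nr, nc))
    return False


def percolation_rate(size, p, trials=200, diagonal=False, seed=0):
    """Share of random grids in which the water reaches the bottom."""
    rng = random.Random(seed)
    hits = sum(percolates(random_grid(size, p, rng), diagonal) for _ in range(trials))
    return hits / trials


for p in (0.50, 0.55, 0.60, 0.65, 0.70):
    print(p, percolation_rate(size=30, p=p))
```

On a small grid the jump is smeared out over a range of p; it becomes sharper as the grid grows.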
As you can see, the fire doesn't make it to the other side, even after repeated attempts. If you push up forest density to 61%, nearly all attempts will show that the forest fire makes it to the other side. This is an example of the fertility of models. A model that was used to explain percolation, can also be used for forest fires.
You can also apply this to banks. You can have a checkerboard of banks. If one bank fails then all banks that have loaned a lot of money to this bank may also fail. In this way bank failures can cascade.
This model is also based on the idea of percolation. In this model you can also ask the question whether there is a tipping point, where suddenly there are many bank failures? You could do the same with country failures.
Intuitively, you might think that there is a linear relationship between the value of information and the number of people hearing of it. But if you use a network model with a probability of people telling the information across links, then nothing happens if the information is not very valuable. Once the value gets above some critical threshold, you might see a tipping point, and almost everybody will hear about it.
Why does the percolation model apply here? Often finding a solution for a particular type of problem, for example producing a car or getting a mathematical proof, requires getting from A to B. There are many parts that have to work together, and you need a significant number of partial solutions to get through the whole problem. For instance, for a car you need wheels, brakes, an engine and a steering wheel.
As information accumulates, we can fill in more squares, and eventually someone can find a path from A to B. Suddenly there are multiple paths so that others can find other paths from A to B. It is therefore plausible that the percolation model explains those bursts in scientific activity.
In a diffusion model everybody receives something, which could be information or a disease. The diffusion model works as follows. Suppose that there is some new disease called Wobblies, and W_t is the number of people who have the Wobblies at time t. If the total population is N then the number of people that don't have the Wobblies would be N - W_t. Assume that τ is the transmission rate, which is the likelihood that someone with the Wobblies gives the Wobblies to someone who doesn't have the Wobblies.
The spreading also depends on the contact rate c, which is how often people meet, so that cN is the number of meetings. A meeting only spreads the disease when one person has the Wobblies and the other doesn't, which happens with probability (W_t/N)((N - W_t)/N). Hence, W_{t+1} = W_t + cNτ(W_t/N)((N - W_t)/N). The formula says that the spread will start slow because only a few people are infected, then speed up as more people are infected, and then, when most people are infected, slow down again because there are fewer people left to infect. The model has no tipping point.
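The slow start, acceleration and slowdown can be seen by iterating the diffusion equation. The sketch below uses hypothetical parameter values chosen only to make the S-shaped curve visible; they do not come from the course.

```python
def diffuse(n, infected, contact_rate, transmission_rate, periods):
    """Iterate W_{t+1} = W_t + c*N*tau*(W_t/N)*((N - W_t)/N)."""
    path = [infected]
    for _ in range(periods):
        w = path[-1]
        new_cases = contact_rate * n * transmission_rate * (w / n) * ((n - w) / n)
        path.append(w + new_cases)
    return path


# Hypothetical values: 1,000 people, one initial case, c = 2, tau = 0.2.
path = diffuse(n=1000, infected=1, contact_rate=2.0, transmission_rate=0.2, periods=40)
increments = [later - earlier for earlier, later in zip(path, path[1:])]

for t in range(0, 41, 5):
    print(t, round(path[t]))
```

The number of new cases per period peaks when about half the population is infected, and whatever the parameter values, as long as c and τ are positive everyone ends up infected, which is why this model has no tipping point.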
The SIS model is much like the diffusion model but there is a difference. After people have been infected, they can recover and move back to the susceptible state. Hence, W_{t+1} = W_t + cNτ(W_t/N)((N - W_t)/N) - aW_t, where a is the recovery rate. This can be simplified to W_{t+1} = W_t + W_t(cτ(N - W_t)/N - a). This model is interesting, because if the recovery rate a is higher than cτ, the contact rate times the transmission rate, then the disease is not going to spread. This model has a tipping point.
If W_t is very small, then (N - W_t)/N is close to 1 so W_{t+1} = W_t + W_t(cτ - a). The disease is going to spread if cτ - a > 0 or cτ > a or cτ/a > 1. cτ/a is the basic reproduction number R0. So if R0 > 1 then the disease will spread. If R0 < 1 then the disease dies off. The tipping point is R0 = 1. Diseases like measles (15), mumps (5) and flu (3) have R0 > 1. If you have had the measles or the mumps, you don't become susceptible again, so here the SIR model applies.
If a share V of the people is vaccinated, the disease stops spreading once R0(1 - V) < 1, so V must exceed 1 - 1/R0. For example, if we want to keep the measles from spreading, and measles has an R0 = 15, then we need 1 - 1/15 = 14/15 of the people to be vaccinated. There is also a tipping point with regard to vaccines. Below the tipping point only the people that are vaccinated are protected. Above the tipping point everyone is protected.
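The tipping point at R0 = 1 and the vaccination share can be illustrated with a few lines of code. This is a sketch of the simplified SIS equation with hypothetical parameter values (N = 1,000, c = 2, τ = 0.2); only the measles figure of R0 = 15 comes from the text, and the function names are made up.

```python
def sis_step(w, n, c, tau, a):
    """One period of W_{t+1} = W_t + W_t*(c*tau*(N - W_t)/N - a)."""
    return w + w * (c * tau * (n - w) / n - a)


def simulate(w, n, c, tau, a, periods):
    """Number of infected people after the given number of periods."""
    for _ in range(periods):
        w = sis_step(w, n, c, tau, a)
    return w


def basic_reproduction_number(c, tau, a):
    return c * tau / a


def vaccination_threshold(r0):
    """Share of people to vaccinate so that R0*(1 - share) drops to 1."""
    return 1 - 1 / r0


# Hypothetical values: same contact and transmission rates, two recovery rates.
for recovery in (0.5, 0.2):
    r0 = basic_reproduction_number(c=2.0, tau=0.2, a=recovery)
    final = simulate(w=10, n=1000, c=2.0, tau=0.2, a=recovery, periods=200)
    print(f"R0 = {r0:.1f}: infected after 200 periods = {final:.1f}")

print(vaccination_threshold(15))
```

With R0 below 1 the ten initial cases fade away, while with R0 above 1 the same starting point grows into a lasting share of the population.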
The basic idea of a dynamical system is in the graph to the left. What does the graph say? It says that x is going to change over time. If y is positive, then x is going to change in the direction of the arrows on the left. However, if y is negative, then x is going to change in the direction of the arrows on the right. This system moves to a stable equilibrium where x = y/2.
If more and more preconditions are met, then the percolation model suggests that it is going to happen anyway. For example, on the eve of World War I, the European powers were already forming alliances and preparing for war. So, often what causes a direct tip is a change in the context. This is a contextual tip.
A system can be stable (in equilibrium), periodic, random or complex. A system can tip from one class to another but there are also tips within classes. For example, a system might tip from one equilibrium to another equilibrium. This can be represented in a graph that depicts the movement of a variable x depending on some other variable r.
When measuring tips, we try to find out how likely it was that the tip would happen. With an active or direct tip, the variable itself can cause the system to tip. If the system is at a tipping point, it is uncertain how it will behave. Once a tip has occurred, we know for certain how the system is going to behave. One way of measuring tippiness is by reductions in uncertainty.
A measure of uncertainty can be changes in the likelihood of different outcomes that could occur. Initially, there may be a large number of possible outcomes. After the system tips, there might be an equilibrium or one possible outcome, or a number of other things could occur. We measure changes in the likelihood of different outcomes using the diversity index, which is used in social sciences, and entropy, which comes from physics and information theory.
The diversity index for i possible outcomes is 1/Σ(P_i)². For example, if we have three possible outcomes A, B and C, with probabilities P_A = 1/2, P_B = 1/3 and P_C = 1/6, then the diversity index is 1 / (1/4 + 1/9 + 1/36) = 36/14 ≈ 2.57. For i possible outcomes, the maximum diversity is i. For example, if we have four possible outcomes A, B, C and D, which all have a probability of 1/4, so P_A = 1/4, P_B = 1/4, P_C = 1/4 and P_D = 1/4, then the diversity index is 1 / ((1/4)² + (1/4)² + (1/4)² + (1/4)²) = 4.
How can we use the diversity index to measure tips? If changes in a variable cause changes in the diversity index, this is indicative of a tip. For example, if initially there were three possible values, and the diversity index was 2.57, and then it flips to 1, the change in value of the diversity index is the measure of the tip.
Entropy also measures the degree of uncertainty. The formula for entropy is -Σ P_i log2(P_i). Here log2(y) is the power to which 2 must be raised to get y, so log2(2^x) = x. For example, log2(1/4) = log2(2^-2) = -2. For example, if we have four possible outcomes A, B, C and D, which all have a probability of 1/4, then the entropy is - ((1/4)*log2(1/4) + (1/4)*log2(1/4) + (1/4)*log2(1/4) + (1/4)*log2(1/4)) = 2.
Entropy tells us the number of bits of information we need to know to identify the outcome. If we have four possible outcomes A, B, C and D, which all have a probability of 1/4, you can split up the possibilities in half. It is either in A, B or in C, D. This is one bit of information. If it is in C, D then you need to know whether it is C or D. This is the second bit of information. You can always find the answer by asking two questions. Hence, the entropy is 2.
The diversity index shows the number of types. The entropy is the amount of information you need to identify the type. For example, if you have options A and B that each have a probability of 1/2, then the diversity index is 2 and the entropy is 1. After the system tips, the diversity index goes to 1 and the entropy becomes 0, and you don't need to ask any question to know what the state of the system is. Tips are changes in the likelihood of outcomes.
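Both measures are easy to compute, which makes it simple to compare a system before and after a tip. The sketch below follows the two formulas given above; the function names are made up for this illustration.

```python
from math import log2


def diversity_index(probabilities):
    """1 divided by the sum of squared probabilities."""
    return 1 / sum(p ** 2 for p in probabilities)


def entropy(probabilities):
    """Minus the sum of p*log2(p); outcomes with probability 0 add nothing."""
    return -sum(p * log2(p) for p in probabilities if p > 0)


before = [1 / 2, 1 / 3, 1 / 6]  # three possible outcomes, as in the text
after = [1.0]                   # the system has tipped to a single outcome

print(diversity_index(before), entropy(before))
print(diversity_index(after), entropy(after))
```

The size of the drop in either measure can then serve as the measure of the tip.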
Exponential growth is accumulating over time like interest. It is possible to make primitive models for economic growth. These models show that without innovation, growth stops. Solow's growth model allows for innovation and shows how innovation has a multiplier effect on our collective well-being. Extensions to these models can be used to explain why some countries are successful while others aren't, and what enables economic growth.
Economists often talk about real changes in GDP. Real means that inflation is taken into account. The economy is measured in currency, for example dollars. If the economy grows by 5% in dollars but the inflation is 3%, then this increased amount of dollars is worth less, so real growth is only about 2%.
An important question is whether or not super high levels of economic growth are sustainable. China has had growth rates of around 10% for 15 years. Japan had similar growth rates in the 1960s, and they remained high in the 1970s and 1980s. Japan's growth rates fell after the country caught up with the rest of the world.
Economic growth is focused on material things. Does material wealth really matter, in the sense that it makes people happier? That is a complicated question. GDP can be established accurately, but life satisfaction is a soft variable that is difficult to measure. Given these limitations it is possible to construct a graph that shows a few things.
For high income countries, where average incomes are above $20,000, more wealth doesn't really matter. It does matter for low income countries. Getting from 0 to $10,000 is huge, but getting from $30,000 to $60,000 doesn't matter a lot. Lifting people out of poverty makes them happier, but becoming richer after that isn't that important.
Economic growth models are complex with variables like labour, physical capital, depreciation rates and savings rates, so it is better to start with simple exponential growth like the compounding of interest. The GDP of countries can grow in a similar fashion. This is why different growth rates are so important. The rule of 72 explains how quickly an exponentially growing variable will double.
So, why are growth rates so important? That is because differences in the growth rate can have dramatic consequences in the long run. This is because this growth is exponential.
The Rule of 72 means that dividing 72 by the growth rate approximately gives you the number of years in which the variable will double. For a growth rate of 2%, the Rule of 72 gives a number of years of 36. The exact number is 35. For 6% growth, the Rule of 72 gives a number of years of 12. So, in 36 years, 6% growth means doubling 3 times, or growing to 8 times the original GDP.
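The quality of the approximation can be checked against the exact doubling time, which follows from solving (1 + r)^t = 2 for t. A minimal sketch, with function names made up for this illustration:

```python
from math import log


def rule_of_72(rate_percent):
    """Approximate number of years needed to double."""
    return 72 / rate_percent


def exact_doubling_time(rate_percent):
    """Solve (1 + r)^t = 2 for t, with yearly compounding."""
    return log(2) / log(1 + rate_percent / 100)


for rate in (1, 2, 6, 10):
    print(rate, rule_of_72(rate), round(exact_doubling_time(rate), 1))
```

The rule works well for the growth rates that are typical for economies and becomes less accurate as rates get very high.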
Continuous compounding means that growth happens all the time and not just at intervals. The formula x(1 + r)^t is a simplification, as if growth only happens once a year. It is possible to calculate interest every day using a formula like x(1 + r/365)^365 for one year, but this still isn't continuous compounding.
For continuous compounding, the interval length approaches zero and the number of intervals approaches infinity. In that case the interest can be calculated using lim_{n→∞} (1 + r/n)^(nt) = e^(rt), where e ≈ 2.71828. This is a much simpler formula for growth.
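The limit can be seen numerically by compounding more and more often. The sketch below uses a hypothetical starting amount of 100, a rate of 5% and a period of 10 years; none of these values come from the course.

```python
from math import exp


def compound(x, r, t, n):
    """Value of x after t years at rate r, compounded n times per year."""
    return x * (1 + r / n) ** (n * t)


def compound_continuously(x, r, t):
    """Value of x after t years at rate r with continuous compounding."""
    return x * exp(r * t)


# Hypothetical example: 100 at 5% for 10 years.
for n in (1, 12, 365, 100_000):
    print(n, compound(100, 0.05, 10, n))
print("continuous", compound_continuously(100, 0.05, 10))
```

Going from yearly to daily compounding closes most of the gap to the continuous value, which is why e^(rt) is a convenient stand-in.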
Let's make a simple growth model. Assume there is a group of workers and a field of coconut trees. When workers pick coconuts they can do two things: eat them, or use them to build coconut picking machines. The machines pick coconuts faster, but they wear out over time and have to be replaced with new machines. This can be used to make a model that explains the role of capital in growth and the limits to that.
Now assume that L_t is the number of workers at time t, M_t is the number of machines at time t, O_t is the output of coconuts at time t, E_t is the number of coconuts consumed at time t, I_t is the number of coconuts invested in machines at time t, s is the savings rate and d is the depreciation rate.
The model has some assumptions. First, the output is increasing and concave in labour and machines, so O_t = √L_t √M_t. Concave means that the first machine is worth more than the second, the second is worth more than the third, and so on. Economists call this diminishing returns. Second, the output is consumed or invested, so O_t = E_t + I_t, where I_t = sO_t. Third, machines can be built, but they depreciate, so M_{t+1} = M_t + I_t - dM_t.
The question is whether or not this growth can continue. That is easier to see if we consider a large number of machines, for instance 400. Output will be 10 * √400 = 200. Investment will be 60. Depreciation will be 100. Hence, 40 machines will be lost. From this one can conclude that the economy cannot grow this big by itself. Consequently, economic growth must flatten out and there must be an equilibrium level of GDP.
The long run equilibrium occurs when investment equals depreciation. Now, O = 10√M, investment is I = 3√M and depreciation is M/4. Setting 3√M = M/4 gives 12√M = M, so √M = 12, so M = 144. So, when the number of machines is 144, then depreciation equals investment. In this case output will be 10√144 = 120. In this case investment as well as depreciation will be 36 machines.
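The move towards this equilibrium can be simulated. The sketch below assumes 100 workers, a savings rate of 0.3 and a depreciation rate of 0.25; these values are not stated in this passage but are inferred from O = 10√M, I = 3√M and depreciation of M/4. The function names are made up.

```python
from math import sqrt


def next_machines(machines, labour=100, savings=0.3, depreciation=0.25):
    """M_{t+1} = M_t + s*O_t - d*M_t with O_t = sqrt(L_t)*sqrt(M_t)."""
    output = sqrt(labour) * sqrt(machines)
    return machines + savings * output - depreciation * machines


def run(machines, periods):
    """Number of machines after the given number of periods."""
    for _ in range(periods):
        machines = next_machines(machines)
    return machines


for start in (4, 400):
    print(start, [round(run(start, t), 1) for t in (0, 1, 5, 20, 100)])
```

Whether the economy starts with few machines or with too many, the number of machines drifts towards the same level, where investment just covers depreciation.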
The irony of the growth model is that it isn't really a growth model. Eventually there is no growth. This is because depreciation is linear while output is concave. In this case innovation may bring more growth. That is why innovation is so important.
Without innovation, growth will stop assuming that the amount of labour is fixed. In reality economies continue to grow. The Solow growth model adds another variable to include innovation. In this model L_t is labour at time t, K_t is capital at time t, A_t is technology at time t, and O_t is output at time t. The model then states that O_t = A_t K_t^β L_t^(1-β). The variable A just states how good the technology is. If β = 1/2 then this is just a square root function. If β > 1/2 then capital matters more. If β < 1/2 then capital matters less.
In the previous example O = 10√M, investment was 3√M, depreciation was M/4 and in equilibrium M = 144. If we now introduce an innovation, and the variable for technology A = 2, then O = 20√M, investment is 6√M, depreciation is M/4 and in equilibrium 6√M = M/4, so √M = 24, M = 576 and O = 2 * 10 * √576 = 480. The model shows that when productivity doubles, long run GDP becomes 4 times as big. This is because two things happen. First, the process is becoming more productive. Second, because the process is more productive, it makes sense to invest in more machines.
The innovation multiplier means that if labour and capital become more productive, it makes sense to invest in more capital. If A = 3, then O = 30√M, investment is 9√M, depreciation is M/4 and in equilibrium 9√M = M/4, so M = 36² = 1,296 and O = 3 * 10 * √36² = 1,080. If we are three times as productive then the total output goes up to nine times the initial output. Hence, the effect of innovation is multiplicative.
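The multiplier can be written in closed form. Setting investment sA√L√M equal to depreciation dM gives √M = sA√L/d, so long run output is A√L√M = A²sL/d, which grows with the square of A. The sketch below assumes the same inferred values as the example (100 workers, s = 0.3, d = 0.25); the function name is made up.

```python
from math import sqrt


def long_run(technology, labour=100, savings=0.3, depreciation=0.25):
    """Equilibrium machines and output when investment equals depreciation."""
    root_m = savings * technology * sqrt(labour) / depreciation
    machines = root_m ** 2
    output = technology * sqrt(labour) * root_m
    return machines, output


for a in (1, 2, 3):
    machines, output = long_run(a)
    print(a, round(machines), round(output))
```

Doubling or tripling A multiplies long run output by four or nine, matching the figures worked out by hand.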
According to the Solow model, if we continue to innovate and increase our productivity, then growth can continue. This raises the question of where growth comes from. In endogenous growth models, labour can go to picking coconuts to increase capital, but also into investing in new technology in order to increase the parameter A. A can be seen as a choice variable, as a corporation or a country can choose to invest in research and development.
In the past Japan had very high growth rates like China has nowadays. The question is can China sustain these levels of economic growth? The growth models show that it is dubious that China can continue to grow this fast unless they have massive increases in technological improvement.
As long as China is catching up with other countries, it is relatively easy to have such high growth levels, but once China has caught up with the rest of the world, it becomes much harder to sustain high levels of economic growth.
When there are 10,000 machines, output will be 100 * 100 = 10,000, investment 2,000, depreciation 1,000, so there will be 11,000 machines in the next period. Next year, output will be 100√11,000 ≈ 10,500, a growth of 5%. If the number of machines rises, growth falls. If the number of machines is 22,500, output will go from 15,000 to 15,250, and growth will be 1.7%.
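The falling growth rate can be traced with a few lines of code. The sketch assumes 10,000 workers, a savings rate of 0.2 and a depreciation rate of 0.1, values that are not stated but inferred from the numbers in this example; the function names are made up.

```python
from math import sqrt


def output(machines, labour=10_000):
    """O = sqrt(L)*sqrt(M)."""
    return sqrt(labour) * sqrt(machines)


def growth_rate(machines, savings=0.2, depreciation=0.1):
    """Growth of output over one period, starting from the given machines."""
    now = output(machines)
    next_machines = machines + savings * now - depreciation * machines
    return output(next_machines) / now - 1


for machines in (2_500, 10_000, 22_500, 40_000):
    print(machines, f"{growth_rate(machines):.1%}")
```

Under these assumed values growth drops to zero at 40,000 machines, where investment equals depreciation.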
China has had 8%-10% growth rates for over a decade. China is in the early part of the curve where there is relatively little capital relative to labour. When China moves further along the curve, it probably will not grow so fast any more. The growth function without innovation is concave. To sustain high growth, China can't keep on putting more money into capital, and it must innovate. But improvements in technology probably will not be enough to sustain such high levels of growth, and the picture will look like Japan.
The Solow growth model has labour, capital and technology that are combined to produce output. In this model things like equality and culture are left out. In the book Why Nations Fail the economists Daron Acemoglu and James Robinson look over hundreds of years to investigate why some countries are successful and why others are not. For example, Botswana did very well while Zimbabwe didn't. Why is that?
Acemoglu and Robinson think that growth requires a strong central government that protects capital and investment but that can't be controlled by a select few. If there is no strong central government to protect property rights, whether physical or intellectual, there is less incentive for people to invest and innovate. And if you have less investment and innovation, you will have lower growth.
On the other hand, the central government shouldn't be controlled by a select few like in Zimbabwe. If that is the case then the people in control will extract resources from the economy, often in the form of bribery and corruption. Extraction limits growth by lowering investment in innovation and capital. Extraction causes less money to be available for investment as well as less incentive to invest. This has a similar effect as reducing the variable A, and if this variable is reduced to 1/3, output is only 1/9. Syphoning off money from the economy therefore has a multiplier effect.
What we learn from the model is that growth requires creative destruction, which is a term invented by Joseph Schumpeter. When A increases, whole industries may be wiped out. For example, when the tractor was invented, blacksmiths were not needed any more.
The American newspaper industry was hurt by people placing ads online. On Craigslist you can post things for sale. This company has really hurt the newspaper industry. The revenue for Craigslist went up while the revenue for the newspaper industry went down as people moved ads from newspapers to Craigslist. Craigslist has more options and it is cheaper, so it is an innovation.
The number of people employed by the newspaper industry dropped from 450,000 in 1988 to 275,000 in 2009. Craigslist only has 23 employees. So 23 employees wiped out 175,000 jobs. The reason why Craigslist has so few employees is that people place the ads themselves. That is called creative destruction. However, the internet also created a lot of jobs. Other internet corporations like Yahoo, Google, Time Warner, Disney and Amazon hired tens of thousands of employees.
Suppose now that the country is controlled by a few and that the newspaper industry has a lot of influence on the government. And so it might happen that the government bans advertising on the web because this saves many jobs in the newspaper industry and because the government is captured by the newspaper industry. This might be a good thing because it saves jobs but this might also be a bad thing because it lowers productivity.
This model may also be applied in other fields such as our personal production and income. We can work hard, but if we don't invest in new skills and technologies, our income may level out, or even go down when our skills become obsolete. Successful people continue to learn so that they improve their personal variable A.
We can model problem solving in a formal fashion. Assume that you take some action a and there is a payout function F, that gives the value of that action F(a). For example, a could be a health care policy, and F could be how efficient this health care policy is. So a is the solution that you propose and F(a) is how good the solution is.
We want to understand how people come up with better solutions, hence where innovation comes from. For this, we use the metaphor of a landscape as a lens through which to interpret our models. Each solution has a value, and that value can be seen as its altitude. Here B is the best possible solution. Assume that I have some idea X, which is represented by the black dot. It may be a good idea.
I might be looking for better ideas by going up and down the slope. In this way I will arrive at C and here I will get stuck. What we want to see is how people come up with these ideas, how teams of people come up with better ideas, and how we can avoid getting stuck on C, and possibly get to B. How do we model this?
The first part of the model is perspectives. A perspective is how you represent or encode a problem. So if someone poses some problem to you, whether it is software code, designing a bicycle, or a health care policy, you have some way of representing that problem in your head. Once you have encoded the problem, you can create a metaphorical landscape with a value for each possible solution. Different perspectives give different landscapes.
The second part of the model is heuristics. Heuristics define how you move on the landscape. Hill climbing is one heuristic. Random search would be another heuristic. Different perspectives and different heuristics allow people to find better solutions to problems. Individuals have perspectives and heuristics.
Teams of people are better at solving problems than the individuals in them because they have more tools, and those tools tend to be diverse. They have different perspectives and different heuristics, and all that diversity makes them better at coming up with new solutions and better solutions to problems.
Recombination means that solutions for different problems from different people can be combined to produce even better solutions for those problems. Sophisticated products like houses, automobiles and computers consist of all sorts of solutions to sub-problems. By recombining solutions to sub-problems we get ever better solutions, and that is really a big driver of innovation, and growth depends on sustained innovation.
When you think about a problem, a perspective is how you represent it. A landscape is a way to represent the solutions along the horizontal axis and the value of these solutions as the height. The landscape metaphor can be formalised into a model. In this model a perspective is a representation or encoding of the set of all possible solutions. Then we can create our landscape by assigning a value to each one of those solutions. The right perspective depends on the problem.
Perspectives help us find solutions to problems and to be innovative. In the history of science a lot of great breakthroughs, such as Newton's theory of gravity, are new perspectives on old problems.
For example, Mendeleev came up with the periodic table, where he represented the elements by atomic weight. In doing so, he found all sorts of structures, for example all the metals being lined up in certain columns. He could have organised them alphabetically, but that wouldn't have made much sense. Atomic weight representation gives a lot of structure. When Mendeleev wrote down all the elements that were known at the time according to atomic weight, there were gaps in his representation. The missing elements were eventually found years later. The perspective of atomic weight was useful because it made people look for missing elements.
We use different perspectives all the time. For example, when evaluating applicants for a job, you might look at competence or achievement in the form of grade point average (GPA), work ethic in the form of thickness of the resume, or creativity as indicated by the colourfulness of the personality. Depending on what you’re hiring for, any one of these might be fine. All these ways of organising applicants are perspectives.
For example, I'm tasked with inventing a new candy bar. There are many different options and I want to find the very best one. One way to represent those candy bars might be by the number of calories. In this case there may be three local optima. Alternatively, I might represent those candy bars by masticity, which is chew time. Chew time probably isn't the best way to look at candy bars, and as a result it produces a landscape with many more peaks.
Suppose we're shovelling coal and I want to figure out how many pounds of coal one can shovel in a day as a function of the size of the pan. As the shovel gets larger, workers can shovel more coal, until the shovel gets too big and too heavy to lift. The shovel landscape is therefore single peaked and easy to solve. Hill climbing is only certain to find the best solution if the landscape is single peaked. If there are many peaks, you can easily get stuck on some local peak.
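The difference between a single peaked and a rugged landscape can be illustrated with a minimal sketch of hill climbing. The landscape values below are made up for illustration and do not come from the lectures, and the function name hill_climb is hypothetical.

```python
def hill_climb(values, start):
    """Move to the better neighbour until no neighbour improves; return the final index."""
    position = start
    while True:
        # Neighbours are the points directly left and right on the landscape.
        neighbours = [i for i in (position - 1, position + 1) if 0 <= i < len(values)]
        best = max(neighbours, key=lambda i: values[i])
        if values[best] <= values[position]:
            return position  # no neighbour is higher: we are on a peak
        position = best

# Assumed single peaked "shovel" landscape: coal moved per day by pan size.
shovel = [10, 25, 40, 50, 45, 30, 12]
# Assumed rugged landscape with local peaks at indices 1, 4 and 7.
rugged = [3, 6, 4, 5, 10, 7, 2, 8, 1]

# Try every starting point and record where the climber ends up.
shovel_ends = {hill_climb(shovel, s) for s in range(len(shovel))}
rugged_ends = {hill_climb(rugged, s) for s in range(len(rugged))}
print(shovel_ends)
print(rugged_ends)
```

On the single peaked landscape every starting point ends at the same peak, while on the rugged landscape the end point depends on where the climber starts.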
Consider the game sum to fifteen: two players take turns picking cards numbered one to nine, and the first player to hold three cards that add up to fifteen wins. This game can be seen from a different perspective using the magic square, where every row, column and diagonal adds up to fifteen. Peter goes first, and takes the four. David goes next, and takes the five. Peter takes the six, which is an odd choice, because now he can't win with the four and the six. David then takes the eight. Peter blocks him with the two. But now it turns out that either the nine or the seven will let Peter win, and David cannot block both. What game is this? This is tic-tac-toe.
Sum to fifteen is just tic-tac-toe, but from a different perspective. If you move the cards into the magic square, you create a Mount Fuji landscape. You make the problem really simple. So a lot of great breakthroughs, like the periodic table and Newton's theory of gravity, are perspectives on problems that turned something really difficult to figure out into something that suddenly made a lot of sense.
In his book The Difference, Professor Page discusses the Savant Existence Theorem, which states that for any problem, there exists some way to represent it so that it can be turned into a Mount Fuji problem. All you have to do is put the very best solution in the middle, put the worst ones at the ends, and line up the solutions in such a way that you turn it into a Mount Fuji. In order to make the Mount Fuji, you would have to know the solution already. This isn't a good way to solve problems. But there is such a perspective, and if you change your perspective, you might find it, and in this way you might find the solution.
There are a large number of bad perspectives. With N alternatives, you have N! ways to create one dimensional landscapes. Suppose I have just ten alternatives and I want to order them. There are ten things I could put first, nine things I could put second, eight things I could put third and so on. So there are 10 × 9 × 8 × 7 × 6 × 5 × 4 × 3 × 2 × 1 = 3,628,800 perspectives. Most of those are not very good because they are not going to organise this set of solutions in any useful way. Only a few of them are going to create Mount Fujis. If we just think in random ways, we're likely to get a landscape that's so rugged that we're going to get stuck just about everywhere.
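The equivalence between sum to fifteen and tic-tac-toe can be checked with a short sketch. The layout of the magic square below is an assumption (the standard 3 × 3 magic square), chosen because it is consistent with the moves described above; the variable names are illustrative.

```python
from itertools import combinations

# Assumed layout: the standard 3x3 magic square.
square = [[8, 1, 6],
          [3, 5, 7],
          [4, 9, 2]]

# All tic-tac-toe lines on the square: rows, columns and both diagonals.
rows = [set(r) for r in square]
cols = [set(c) for c in zip(*square)]
diags = [{square[i][i] for i in range(3)}, {square[i][2 - i] for i in range(3)}]
lines = {frozenset(s) for s in rows + cols + diags}

# All ways to pick three different cards from 1..9 that sum to fifteen.
triples = {frozenset(t) for t in combinations(range(1, 10), 3) if sum(t) == 15}

# If the two sets coincide, winning one game is the same as winning the other.
same_game = lines == triples
print(same_game)

# Peter holds 4, 6 and 2: which of the cards 7 and 9 would complete a line?
peter = {4, 6, 2}
winning_cards = [c for c in (7, 9) if any(line <= peter | {c} for line in lines)]
print(winning_cards)
```

Every winning triple of cards is a row, column or diagonal of the square, so a player who can see the square only has to play tic-tac-toe.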
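To get a feeling for how rare Mount Fuji perspectives are, the sketch below enumerates every ordering of a small set of alternatives and counts those with exactly one peak. The five values and the helper name count_peaks are assumptions made for illustration, and a peak is taken to be a point higher than each of its immediate neighbours.

```python
from itertools import permutations
from math import factorial

def count_peaks(landscape):
    """Count points that are higher than every immediate neighbour."""
    peaks = 0
    for i, value in enumerate(landscape):
        left_ok = i == 0 or landscape[i - 1] < value
        right_ok = i == len(landscape) - 1 or landscape[i + 1] < value
        if left_ok and right_ok:
            peaks += 1
    return peaks

# Five alternatives with distinct (assumed) values; each ordering is one perspective.
values = [1, 2, 3, 4, 5]
orderings = list(permutations(values))
mount_fujis = sum(1 for o in orderings if count_peaks(o) == 1)

print(len(orderings), mount_fujis)
print(factorial(10))  # number of perspectives for ten alternatives
```

With five alternatives only 16 of the 120 orderings are single peaked, and the share shrinks quickly as the number of alternatives grows.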
Heuristics are about finding solutions to problems once they have been represented in a perspective. A heuristic is a technique with which you look for new solutions. An example is hill climbing, which is moving locally to better points. It is one of many possible heuristics. Heuristics are defined relative to the problem to be solved. For example, do the opposite is one famous heuristic that's in a lot of books on how to innovate. It means to do the exact opposite of the existing solution.
For example, when you go to buy something, the seller tells you the price. Do the opposite would be that the buyer tells the price. A lot of companies have been starting to do exactly this. Priceline lets buyers go to hotels and tell them how much they would like to pay to stay at the hotel or to use the airline. Alternatively, companies can go for lower costs or do the opposite, and charge a high price to signal quality. Doing the opposite can sometimes lead to interesting innovations.
Another heuristic is big rocks first. If you have a bucket and a bunch of rocks of various sizes, it is better to put in the big rocks first, because then you have a bigger chance of putting all the rocks in the bucket. The little rocks can fill in the gaps. In his book The 7 Habits of Highly Effective People, Stephen Covey argues that this is one heuristic successful people use. The big rocks represent the important things. If you solve the important issues first, and then arrange the solutions of the less important things around them, you will find better solutions.
But there is a drawback. There's a famous theorem in computer science called the no free lunch theorem, proved by Wolpert and Macready. All algorithms that search the same number of points with the goal of locating a maximum value of a function defined on a finite set perform exactly the same when averaged over all possible functions. Some of these problems are incredibly hard and some are really easy, but averaged across all of them no heuristic is any better than any other.
This doesn't mean that Covey is wrong. The no free lunch theorem states that if you look across all problems, no heuristic is better than the other. Covey spent a lot of time in management and he thinks that management problems lend themselves to the big rocks first heuristic. Another way of looking at the no free lunch theorem is that, unless you know something about the problem being solved, no algorithm or heuristic performs better than any other. Once you know something about the problem, you might decide that big rocks first does a good job in solving it. When you're digging a hole in the ground, little rocks first may be a better solution.
Perspectives and heuristics can be used to show why teams of people often can find solutions to problems that individuals can't. That's why teams are better. The term teams is used in a very loose sense. For example, some person invented the toaster. Then somebody else improved it. Then somebody came up with the crumb tray. Then somebody else came up with the automatic shut off. Others came up with further improvements.
Why are groups of people better than individuals? Think about the candy bar example. One perspective was based on calories. It had three local peaks. Let's call them A, B and C. Another landscape based on masticity had five peaks. Let's call these A, B, D, E and F. These peaks are different from the peaks of the caloric landscape, with the exception of A and B. A is the best possible point. The best possible point has to be a peak in every landscape. Professor Page doesn't explain why this is so, but it follows because no solution, and hence no neighbour under any perspective, has a higher value than the best one.
The caloric landscape is better than the masticity landscape because it has fewer local optima. The heuristic is just hill climbing. The peaks where people get stuck are A, B, C, D, E, and F. We can assign a value to each of those peaks. Suppose A is the global optimum, and some of these other peaks aren't so good. We can ask what's the average value of a peak for the caloric problem solver? It is the average of A, B and C. A = 10, B = 8 and C = 6, so the average is 8, which is the ability of the caloric problem solver. You can do the same for the masticity problem solver. If A = 10, B = 8, D = 6, E = 4, F = 2 then this ability is 6.
The caloric problem solver has fewer local optima, but also a higher average, so this is another reason why the caloric problem solver is better. The caloric problem solver may get stuck at B, and then pass the problem on to the masticity problem solver. He will then say that B looks good. If the caloric problem solver gets stuck at C, this point doesn't look good to the masticity problem solver. This person can get from C to some other local optimum. If that is D, E, or F, then it doesn't look good to the caloric problem solver. Consequently, the team will get stuck at A or B, which is a better outcome. The average of A and B is 9, so the ability of the team is 9.
The ability of the team is higher than the ability of either person. This is because the team's local optima are the intersection of the local optima of the individuals. This is why over time products get better, and why teams are innovative. The reason why a lot of science is done by teams is that the only place a team can get stuck is where everybody on the team can get stuck. This simple model of perspectives and heuristics can explain why teams are better than individuals and why, over time, we keep finding better and better solutions to problems.
The big claim is that the team can only get stuck at a point that is a local optimum for everyone on the team. That means the team is better than the people in it. Therefore it is better to have people with different local optima, diverse perspectives and diverse heuristics. That diversity produces different local optima, and because the team only gets stuck on the intersection of those local optima, we end up with better points.
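The candy bar calculation can be reproduced in a few lines. This is a minimal sketch that uses the peak values given in the example; the names peak_values and ability are chosen here for illustration.

```python
# Values of the peaks as given in the candy bar example.
peak_values = {"A": 10, "B": 8, "C": 6, "D": 6, "E": 4, "F": 2}

caloric_peaks = {"A", "B", "C"}
masticity_peaks = {"A", "B", "D", "E", "F"}

def ability(peaks):
    """Average value of the peaks on which a problem solver can get stuck."""
    return sum(peak_values[p] for p in peaks) / len(peaks)

# The team only gets stuck where both members get stuck: the intersection.
team_peaks = caloric_peaks & masticity_peaks

print(ability(caloric_peaks))    # caloric problem solver
print(ability(masticity_peaks))  # masticity problem solver
print(ability(team_peaks))       # team of both
```

Taking the intersection removes the low peaks that only one of the two perspectives gets stuck on, which is why the team average is higher than either individual average.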
What's missing? This model is highly stylised. Two things are left out. First, there is communication. The model assumes that team members communicate their solutions to one another right away. That is not always the case. There are a lot of misunderstandings and people might not listen. If you make a better product, for instance a better toaster, the product itself could be the way of communicating. Second, there might be an error in interpreting the value of a solution. If a good proposal is made, others can think that it is a bad idea. The model assumes that there is no error in assessing the value of a solution.
In a more advanced model, there could be room for communication error and errant evaluation. That is going to hurt the case for using teams. Even so, this model has shown us something fairly powerful, which is that diverse representations of problems and diverse ways of coming up with solutions can make teams of people better at coming up with solutions than individuals. And it gave an indication of where innovation is coming from. Innovation is coming from different ways of seeing problems and different ways of finding solutions.
Until now we focused on individual problems and individual solutions. Recombination is combining solutions or heuristics to come up with even more solutions or more heuristics. Recombination is an incredibly powerful tool for science, innovation and economic growth. If we have a few solutions or a few heuristics, then we can combine those to create more. The real driving force behind innovation in the economy is that when we come up with a solution, we can recombine it with all sorts of other solutions.
For example, fill in the missing number. 1 2 3 5 _ 13, the missing number is 8, because you add the two previous numbers, or subtract them if you go back. 1 4 _ 16 25 36, the missing number is 9, because each number is the square of the next whole number and the missing one is the square of 3. 1 2 6 _ 1806, the missing number here is 42. The solution is harder to find because you have to combine the first two techniques. 2 - 1 = 1 = 1², 6 - 2 = 4 = 2², 42 - 6 = 36 = 6², 1806 - 42 = 1764 = 42².
Recombining is a driver of economic growth and also of science, because a new solution can be combined with other solutions. This produces a geometric explosion in the number of possibilities. This may be the reason why it was possible to sustain economic growth by increasing the technology parameter A.
To show how this works, let's start with finding out how many ways there are to pick three objects from ten. There are 10 I can pick first, 9 I can pick second and 8 I can pick third. 10 × 9 × 8 is too much because picking A, B and C is the same as picking B, A and C. Among the three picked objects, there are 3 I can put first, 2 I can put second, and 1 I can put last. Hence, the answer is (10 × 9 × 8) / (3 × 2 × 1) = 120. With far more than 10 solutions, the number is very big. For example, if you have 52 cards, and you want to combine 20 of them, then the number of combinations becomes (52 × 51 × ... × 33) / (20 × 19 × ... × 1) ≈ 126,000,000,000,000.
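The rule behind the third sequence combines adding and squaring: each number is the previous number plus its square. A minimal sketch, with the function name next_term chosen here for illustration:

```python
def next_term(x):
    """Combine the two techniques: add the square of the current number to it."""
    return x + x * x

# Start at 1 and apply the rule until the sequence has five numbers.
sequence = [1]
while len(sequence) < 5:
    sequence.append(next_term(sequence[-1]))

print(sequence)
```

The fourth entry of the generated sequence is the missing number.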
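The counts can be checked with the standard library. This sketch assumes Python 3.8 or later, where math.comb is available; the variable names are illustrative.

```python
from math import comb

# Ways to pick 3 objects out of 10 when the order does not matter.
three_from_ten = comb(10, 3)

# Ways to combine 20 cards out of a deck of 52.
twenty_from_deck = comb(52, 20)

print(three_from_ten)
print(f"{twenty_from_deck:,}")
```

The second number is in the order of a hundred trillion, which shows how fast the number of possible recombinations grows with the number of available solutions.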
This idea of ideas building on ideas is the foundation of the theory of recombinant growth of Martin Weitzman. This theory states that ideas get generated all the time. For example, the steam engine gets invented and developed, the gasoline engine gets developed, the microprocessor gets developed. And all these things get recombined into interesting combinations. And those combinations, in turn, get recombined to create ever more growth. For example, many parts in the steam engine were solutions to previous problems. The same applies to the desktop computer.
All those parts of the steam engine weren't developed with the steam engine in mind. They were developed for other purposes. This is an idea from biology called exaptation. The classic example of exaptation is the feather. Birds developed feathers primarily to keep them warm, but eventually those same feathers allowed them to fly. Exaptation means that some innovation made for one reason gets used in another context. Another example is the laser. The laser was not invented with the idea of laser printers in mind. So once something is developed, it gets used for all sorts of unexpected things through the power of recombination.
This also applies to perspectives, for example the masticity perspective of a candy bar. Masticity can be a useful perspective for other problems, for example pasta or breakfast cereal. So even failed solutions for one problem may work well as a solution to other problems. For example, the glue in the Post-it note was originally a failure because the glue didn't stick very well. But it turned out to be useful for other sorts of problems, mainly making sticky notes.
There is more to it than this. It's not just the recombination of ideas, because for hundreds and thousands of years people had ideas. There had to be some way to communicate those ideas. Joel Mokyr argues in his book, The Gifts of Athena: Historical Origins of the Knowledge Economy, that the rise of modern universities, the printing press, and scientific communication allowed ideas to be transferred from one location and one person to another. The technological revolution was driven by the fact that people could share ideas and then recombine them.
Markov models consist of entities that can be in a set of states and there are transition probabilities between those states. For example, there is a set of students. Those students could be in either one of two states, alert or bored. There is some probability P that they move from alert to bored and some probability Q that they move from bored to alert. Over time these students are moving back and forth from the alert state to the bored state. The Markov process gives us a framework to understand how those dynamics take place.
The Markov convergence theorem states that, as long as a couple of assumptions hold, namely that there is a finite number of states, the transition probabilities stay fixed, and you can get from any state to any other state, the system goes to an equilibrium. This is a really powerful finding that has all sorts of implications.
To understand Markov processes, matrices can be used. Matrices are grids of numbers. Those numbers will be the transition probabilities. Multiplication by matrices is used to understand Markov processes and the Markov convergence theorem. These matrices can be used to explain why these systems go to equilibria.
Markov processes are interesting for two reasons. First, Markov processes are a useful way to think about how the world works, and this gives powerful results: the Markov convergence theorem says that these systems go to equilibria. Second, there is the idea of exaptation: the Markov model is incredibly fertile and can be applied in a whole range of different settings. Transition probabilities and matrices can be used in a lot of settings as well.
Let's use the simple example of alert and bored students. Some percentage of the students are alert and some percentage of the students are bored. An alert student can switch and become bored and a bored student can switch and become alert. We need to assume something about the transition probabilities. Assume that in any given period, 20% of the alert students become bored, but 25% of the bored students become alert. Here the matrices will be useful.
We can calculate this by hand. Assume we start with 100 alert and 0 bored, so (A, B) = (100, 0). After one period, you have 80 alert and 20 bored, so (A, B) = (100 - 20 + 0, 0 + 20 - 0) = (80, 20). After the next period, (A, B) = (80 - 16 + 5, 20 + 16 - 5) = (69, 31). This is rather complicated so there must be a better way to keep track of this.
Where does this process stop? It is possible to do the same calculation starting with all students being bored. After 6 turns about 53% are alert and 47% are bored. It looks like there is an equilibrium.
This is a stochastic equilibrium. The thing that doesn't change is the probability. The population is still moving from alert to bored and from bored to alert.
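The hand calculation for the alert and bored students can be automated. The sketch below uses the transition probabilities from the example (20% of alert students become bored, 25% of bored students become alert); the function names step and run and the choice of 50 periods are assumptions made for illustration.

```python
def step(alert, bored, p_bored=0.20, p_alert=0.25):
    """Advance the student population by one period."""
    new_alert = alert * (1 - p_bored) + bored * p_alert
    new_bored = alert * p_bored + bored * (1 - p_alert)
    return new_alert, new_bored

def run(alert, bored, periods):
    """Apply the transition the given number of times."""
    for _ in range(periods):
        alert, bored = step(alert, bored)
    return alert, bored

first = run(100, 0, 1)    # one period, starting with everybody alert
second = run(100, 0, 2)   # two periods
from_alert = run(100, 0, 50)
from_bored = run(0, 100, 50)

print(first, second)
print(from_alert, from_bored)
```

Both starting points end up at the same split between alert and bored students, which is what makes this a stochastic equilibrium: individual students keep switching while the shares stay the same.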
The same approach can be applied to countries that are either democracies or dictatorships. Assume that in each decade 5% of democracies become dictatorships and 20% of dictatorships become democracies. Let's start off by assuming 30% are democracies and 70% are dictatorships. After a decade, (0.95 × 0.3 + 0.2 × 0.7) × 100% = 42.5% are going to be democracies while 57.5% will be dictatorships. One decade later, about 52% will be democracies and 48% will be dictatorships. The equilibrium can also be calculated. If p is the share of democracies, then 0.95p + 0.2(1 - p) = p, so 1/5 - p/5 = p/20, so p/4 = 1/5 and p = 4/5. The surprising finding is that we only end up with 80% democracies even though 95% of democracies stay democracies and 20% of dictatorships become democracies in each decade.
Countries can also be classified as free, partly free or not free. Using five year increments and some crude estimates of the transition probabilities results in the following. In each period, 5% of free and 15% of not free countries become partly free, 5% of not free and 10% of partly free countries become free, and 10% of partly free countries become not free.
This is a three by three matrix. With computers it is possible to make huge matrices and solve them for the equilibrium. So what does that equilibrium look like? All we do is take each one of these rows, and multiply by the columns. The algebra results in 62.5% of countries being free, 25% being partly free, and 12.5% being not free, presuming that the transition probabilities stay fixed.
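The equilibrium share can also be written as a small function. Solving stay × p + enter × (1 - p) = p for p gives p = enter / (1 - stay + enter). The sketch below uses the numbers from the example; the function and parameter names are chosen here for illustration.

```python
def next_share(p, stay=0.95, enter=0.20):
    """Share of democracies after one decade, given the current share p."""
    return stay * p + enter * (1 - p)

def equilibrium_share(stay=0.95, enter=0.20):
    """Solve stay * p + enter * (1 - p) = p for p."""
    return enter / (1 - stay + enter)

after_one = next_share(0.30)
after_two = next_share(after_one)
equilibrium = equilibrium_share()

print(after_one, after_two, equilibrium)
```

The equilibrium does not depend on the starting share of 30%. It depends only on the two transition probabilities.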
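The three state equilibrium can be found by repeatedly multiplying the distribution by the transition matrix. In the sketch below the probabilities of staying in a state are not given in the notes and are inferred as the remainder of each row, and the transition from free to not free is assumed to be zero because it is not mentioned. The uniform starting distribution and the number of iterations are also assumptions.

```python
# Rows are the current state, columns the next state.
# Order of states: free, partly free, not free.
transition = [
    [0.95, 0.05, 0.00],  # free (free -> not free assumed to be zero)
    [0.10, 0.80, 0.10],  # partly free
    [0.05, 0.15, 0.80],  # not free
]

def step(distribution, matrix):
    """Multiply the row vector of shares by the transition matrix."""
    size = len(matrix)
    return [sum(distribution[i] * matrix[i][j] for i in range(size))
            for j in range(size)]

# Start from an arbitrary distribution; the equilibrium does not depend on it.
shares = [1 / 3, 1 / 3, 1 / 3]
for _ in range(2000):
    shares = step(shares, transition)

print([round(s, 4) for s in shares])
```

Starting from any other distribution, for example all countries not free, leads to the same shares, as the Markov convergence theorem predicts.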
The model shows some general trends. The graph generated from the model doesn't look exactly the same as the real picture, but it doesn't look bad either. The model comes up at the end of the 40 year period with values that are close to those in the real world but that is because the estimated transition probabilities were based on the actual data. What's more interesting is that the patterns look fairly similar as well.
The Markov convergence theorem tells us that, provided a few fairly mild assumptions are met, Markov processes converge to a stochastic equilibrium. There are movements, but the probability of being in each state stays fixed. The conditions that must hold for that to be true are the following:
- (1) a finite number of states;
- (2) fixed transition probabilities;
- (3) eventually you can get from any state to any other state;
- (4) not a simple cycle.
If these conditions hold then a Markov process converges to an equilibrium distribution that is unique, which means that there is only one equilibrium distribution and it is independent of the initial state. It is determined entirely by the transition probabilities. In Markov processes, the initial state, history, and interventions to change the state have no effect on the system in the long run.
Interventions can still have an effect. First, it could take a long time before the system is back to equilibrium so that an intervention may have a significant benefit. Second, some of the conditions of the theorem may not hold in the real world, most notably the fixed transition probabilities. The transition probabilities may change over time as a function of the state of the system. Changing the state has a temporary effect, but changing the transition probabilities has permanent consequences. Useful interventions change the transition probabilities. This may happen at tipping points.
The Markov model can be used in contexts and problems we never would have thought of. The first way is taking the entire process and applying it to other things. The second way is using a part of the Markov model, the transition probability matrix, to understand some things that are surprising and interesting. A Markov process is a fixed set of states with fixed transition probabilities between those states. If it is possible to get from any state to any other through a sequence of transitions, then the Markov convergence theorem states that the process goes to a unique equilibrium.
This can be applied to voter turnout. Assume that there is a set of voters at time t, and there is a set of non-voters at time t. We can make a transition matrix of that to find out how many are going to vote at time t+1, and how many are not going to vote at time t+1. If the transition probabilities stay fixed then there is a unique equilibrium that tells the number of people that is expected to vote in any election. This can also be applied to school attendance as each day there are children that go to school and children that don't go to school. These two applications are very standard.
It is also possible to use only a part of the Markov model, the Markov transition matrix, to identify writers. The transition matrix can be used to figure out who wrote a book. So suppose some anonymous person wrote a book and you are trying to figure out whether Bob or Elisa wrote it.
You can figure out transition probabilities by taking some key words, and then create transition matrices. For example, you can calculate what percentage of the time "the record", "example" or "the sake of" follows the word "for". With the use of computers, you can compare this with books that Bob and Elisa wrote.
This can also be applied to medical diagnosis and treatment. Typically, there's a sequence of reactions to a treatment. You can write down transition probabilities that can be multi-stage. For example, if the treatment is going to be successful, the patient goes through the following transitions: first pain, then feeling slightly depressed, then some more pain, but then the patient gets better. Alternatively, if the treatment is not successful, it could be that, initially the patient is depressed, then there is mild pain, then there's no pain, and then the treatment fails. This can be used to figure out early on whether or not a treatment will be successful.
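Counting which words follow a key word is straightforward. The sketch below is a simplified illustration that looks at single following words only; the sample sentence, the function name following_word_shares and the splitting on whitespace are assumptions, not the method used in practice.

```python
from collections import Counter

def following_word_shares(text, key_word):
    """Return, for each word that follows key_word, the share of times it does so."""
    words = text.lower().split()
    followers = Counter(
        nxt for cur, nxt in zip(words, words[1:]) if cur == key_word
    )
    total = sum(followers.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in followers.items()}

# Made-up sample text standing in for a book by one author.
sample = "for example we vote for the record and for the sake of clarity for example"
shares = following_word_shares(sample, "for")

print(shares)
```

Computing these shares for the anonymous book and for the books by Bob and Elisa gives one row of each transition matrix, and the author whose row is closest to that of the anonymous book is the more likely writer.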
Another example is the road to war. Suppose there are two countries and there is some tension. The political process goes through the following transitions. First there are some political statements on each side, then that leads to trade embargoes, followed by military buildup. Based on this sequence of three events, you estimate the likelihood of war based on historical data of previous times when those three transitions happened.
In these cases we are not using the full power of the Markov model and we do not assume that the transition probabilities necessarily stay fixed. We are not interested in solving for the equilibrium. All we're trying to do is just use this probability matrix to organise the data in such a way that we can think more clearly about what's likely to happen.