<- 80
number_people <- 1000 number_tests
Paradoxes are named as such because they are counterintuitive. And this gives us, data-curious people, an excellent chance to play around with them.
How many people are needed so the probability that at least, two of them share a birthday is 50%?
I suggest the reader stop for a couple of minutes and think about it. When I learned the answer, I was so surprised that I had to test it to believe it.
Go to Results if you just want to know what is the probability.
Plan
The idea is simple. I’ll create multiple groups from two to n number of people. For each group, I’ll count how many times at least two people share a birthday. Then repeat this process multiple times (y), to obtain reproducible results
So if n = 4, means a group of four people, and y = 5, means to check the probabilities three times. The simulation will be like this:
- Choose two random people (n - 2), do they share a birthday? –> Count the result. Repeat y times. Calculate the probability for a group of two people.
- Choose three random people (n - 1), do at least two of them share a birthday? –> Count the result. Repeat y times. Calculate the probability for a group of three people.
- Choose four random people (n), do at least two of them share a birthday?? –> Count the result. Repeat y times. Calculate the probability for a group of four people.
For each group, calculating the probability is simple. For example, if in the group of four people, only there was a simulation were people shared birthdays but no shared birthdays in the other four, the probability would be 1/5 = 20%.
Simulation
Let’s select bigger groups of up to 80 people, and let’s check them 1000 times each.
We need a function that creates random birthdays for a group of n
people.
<- function(n_people_group) {
generate_birthdays round(runif(n = n_people_group, min = 1, max = 365), digits = 0 )
}generate_birthdays(3)
[1] 94 39 354
generate_birthdays(10)
[1] 323 5 128 169 110 1 100 347 125 68
Note, how we are assuming that birthdays are randomly and uniformly distributed across the population. In reality, more people are born in September and October.
Let’s create a function that will tell us if at least two people share a birthday: Since there are 365 per year we can just numbers from 1 to 365.
<- c(10, 46, 209) # noone share birthday, should be false
bday_group_3_people <- c(10, 46, 209, 46, 265) # two people share birthdays, should be true
bday_group_5_people <- c(10, 46, 209, 46, 265, 46, 2) # Three people share birthday, should be true
bday_group_7_people
<- function(data, min_num_shared_birthdays = 2) {
check_share_birthday max(table(data)) >= min_num_shared_birthdays
}# table is a nice function that does the count for us:
table(bday_group_5_people)
bday_group_5_people
10 46 209 265
1 2 1 1
# Checking that the function works as expected
check_share_birthday(bday_group_3_people)
[1] FALSE
check_share_birthday(bday_group_5_people)
[1] TRUE
check_share_birthday(bday_group_7_people)
[1] TRUE
This is where the fun happens: Now we just iterate over the groups. We create a vector containing the birthdays of those people and check if those people share a birthday, if they do, we keep track of it. Then we repeat this process y times. Once we have completed y tests for a single group, we move to the next one and repeat the process.
set.seed(123)
<- data.frame()
results
for(individual_group in 2:number_people) {
<- 0
shared_birthdays_counter
for (test in seq_len(number_tests)) {
<- generate_birthdays(individual_group)
random_birthdays_vector
if (check_share_birthday(random_birthdays_vector)) {
<- shared_birthdays_counter + 1
shared_birthdays_counter
}
}<- 100 * (shared_birthdays_counter/number_tests)
probability_duplicates
<- rbind(results, c(individual_group, probability_duplicates))
results
}colnames(results) <- c("people_in_group", "probability_shared_birthday")
<- round(results, 1)
results head(results)
people_in_group probability_shared_birthday
1 2 0.1
2 3 0.7
3 4 1.2
4 5 2.7
5 6 4.2
6 7 5.6
Results
Let’s check how many people are needed in a group to get a probability of ~50%.
|>
results ::filter(
dplyr> 45,
probability_shared_birthday < 55
probability_shared_birthday )
people_in_group probability_shared_birthday
1 21 45.3
2 22 45.8
3 23 51.9
4 24 53.0
According to our simulation, only 23 people are needed to have a probability of 50% to share a birthday!
I’m also curious about what is the minimum amount of people where having a shared birthday is almost certain:
<- results |>
groups_prob_99_perc ::filter(
dplyr> 99
probability_shared_birthday
)head(groups_prob_99_perc)
people_in_group probability_shared_birthday
1 55 99.1
2 57 99.4
3 59 99.3
4 60 99.2
5 61 99.6
6 62 99.7
It seems that with 55 people, the chances to have at least one shared birthday are 99%!
Visualising the results
Let’s plot those results and see what we can observe.
Code
library(highcharter)
hchart(
results,type = "line",
hcaes(x = people_in_group, y = probability_shared_birthday)
|>
) hc_title(text = "<b>Probability of at least two people having the same birthday</b>") |>
hc_subtitle(text = "<i>A total of 1000 simulations per group were computed</i>") |>
hc_credits(enabled = TRUE, text = "https://svalvaro.github.io/") |>
hc_xAxis(title = list(text = "Number of people in the group")) |>
hc_yAxis(title = list(text = "Probability (%)")) |>
hc_yAxis(max = 100) |>
hc_tooltip(
pointFormat = "
<b># People in the group: </b>{point.x:,.f}<br>
<b>Probability: </b>{point.y:,.1f}%<br>"
|>
) hc_add_theme(hc_theme_economist())
Citation
@online{sánchez2023,
author = {Sánchez, Álvaro},
title = {Simulating the {Birthday} {Paradox}},
date = {2023-03-10},
url = {https://svalvaro.github.io/posts/03-10-2023/},
langid = {en}
}