Simulating the Birthday Paradox

Paradoxes are named as such because they are counterintuitive. And this gives us, data-curious people, an excellent chance to play around with them.

How many people are needed so the probability that at least, two of them share a birthday is 50%?

I suggest the reader stop for a couple of minutes and think about it. When I learned the answer, I was so surprised that I had to test it to believe it.

Go to Results if you just want to know what is the probability.

Plan

The idea is simple. I’ll create multiple groups from two to n number of people. For each group, I’ll count how many times at least two people share a birthday. Then repeat this process multiple times (y), to obtain reproducible results

So if n = 4, means a group of four people, and y = 5, means to check the probabilities three times. The simulation will be like this:

Choose two random people (n - 2), do they share a birthday? –> Count the result. Repeat y times. Calculate the probability for a group of two people.
Choose three random people (n - 1), do at least two of them share a birthday? –> Count the result. Repeat y times. Calculate the probability for a group of three people.
Choose four random people (n), do at least two of them share a birthday?? –> Count the result. Repeat y times. Calculate the probability for a group of four people.

For each group, calculating the probability is simple. For example, if in the group of four people, only there was a simulation were people shared birthdays but no shared birthdays in the other four, the probability would be 1/5 = 20%.

Simulation

Let’s select bigger groups of up to 80 people, and let’s check them 1000 times each.

number_people <- 80
number_tests <- 1000

We need a function that creates random birthdays for a group of n people.

generate_birthdays <- function(n_people_group) {
  round(runif(n = n_people_group, min = 1, max = 365), digits = 0 )
}
generate_birthdays(3)

[1]  94  39 354

generate_birthdays(10)

 [1] 323   5 128 169 110   1 100 347 125  68

Note, how we are assuming that birthdays are randomly and uniformly distributed across the population. In reality, more people are born in September and October.

Let’s create a function that will tell us if at least two people share a birthday: Since there are 365 per year we can just numbers from 1 to 365.

bday_group_3_people <- c(10, 46, 209) # noone share birthday, should be false
bday_group_5_people <- c(10, 46, 209, 46, 265) # two people share birthdays, should be true
bday_group_7_people<- c(10, 46, 209, 46, 265, 46, 2) # Three people share birthday, should be true

check_share_birthday <- function(data, min_num_shared_birthdays = 2) {
  max(table(data)) >= min_num_shared_birthdays
}
# table is a nice function that does the count for us:
table(bday_group_5_people)

bday_group_5_people
 10  46 209 265 
  1   2   1   1

# Checking that the function works as expected
check_share_birthday(bday_group_3_people)

[1] FALSE

check_share_birthday(bday_group_5_people)

[1] TRUE

check_share_birthday(bday_group_7_people)

[1] TRUE

This is where the fun happens: Now we just iterate over the groups. We create a vector containing the birthdays of those people and check if those people share a birthday, if they do, we keep track of it. Then we repeat this process y times. Once we have completed y tests for a single group, we move to the next one and repeat the process.

set.seed(123)
results <- data.frame()

for(individual_group in 2:number_people) { 
  shared_birthdays_counter <- 0

  for (test in seq_len(number_tests)) { 
    random_birthdays_vector <- generate_birthdays(individual_group) 
    
    if (check_share_birthday(random_birthdays_vector)) {
      shared_birthdays_counter <- shared_birthdays_counter + 1
    }

  }
  probability_duplicates <- 100 * (shared_birthdays_counter/number_tests)

  results <- rbind(results, c(individual_group, probability_duplicates))
}
colnames(results) <- c("people_in_group", "probability_shared_birthday")
results <- round(results, 1)
head(results)

  people_in_group probability_shared_birthday
1               2                         0.1
2               3                         0.7
3               4                         1.2
4               5                         2.7
5               6                         4.2
6               7                         5.6

Results

Let’s check how many people are needed in a group to get a probability of ~50%.

results |> 
  dplyr::filter(
    probability_shared_birthday > 45,
    probability_shared_birthday < 55
  )

  people_in_group probability_shared_birthday
1              21                        45.3
2              22                        45.8
3              23                        51.9
4              24                        53.0

According to our simulation, only 23 people are needed to have a probability of 50% to share a birthday!

I’m also curious about what is the minimum amount of people where having a shared birthday is almost certain:

groups_prob_99_perc <- results |> 
  dplyr::filter(
    probability_shared_birthday > 99
  )
head(groups_prob_99_perc)

  people_in_group probability_shared_birthday
1              55                        99.1
2              57                        99.4
3              59                        99.3
4              60                        99.2
5              61                        99.6
6              62                        99.7

It seems that with 55 people, the chances to have at least one shared birthday are 99%!

Visualising the results

Let’s plot those results and see what we can observe.

Code

library(highcharter)

hchart(
    results,
    type = "line",
    hcaes(x = people_in_group, y = probability_shared_birthday)
  ) |>
  hc_title(text = "<b>Probability of at least two people having the same birthday</b>") |>
  hc_subtitle(text = "<i>A total of 1000 simulations per group were computed</i>") |>
  hc_credits(enabled = TRUE, text = "https://svalvaro.github.io/") |>
  hc_xAxis(title = list(text = "Number of people in the group")) |> 
  hc_yAxis(title = list(text = "Probability (%)")) |> 
  hc_yAxis(max = 100) |> 
  hc_tooltip(
    pointFormat = "
    <b># People in the group: </b>{point.x:,.f}<br>
    <b>Probability: </b>{point.y:,.1f}%<br>"
  ) |>
  hc_add_theme(hc_theme_economist())

Citation

BibTeX citation:

@online{sánchez2023,
  author = {Sánchez, Álvaro},
  title = {Simulating the {Birthday} {Paradox}},
  date = {2023-03-10},
  url = {https://svalvaro.github.io/posts/03-10-2023/},
  langid = {en}
}

For attribution, please cite this work as:

Sánchez, Álvaro. 2023. “Simulating the Birthday Paradox.” March 10, 2023. https://svalvaro.github.io/posts/03-10-2023/.