This document was first conceptualized as a companion tutorial to a presentation given by Katie Von Holzen, titled “Using cluster-based permutation tests to analyze eye-tracking data”, at the CLaS Eye-tracking Workshop held at Macquarie University On September 8th, 2022. This document starts by explaining how to load open infant eye-tracking data from Peekbank and read it into R using the eyetrackingR package. This data is then analyzed using a cluster-based permutation test.

Any questions should be directed to Katie Von Holzen

Prepare Data

Load Packages

We’ll use the peekbankr package to load the open data and the eyetrackingr package to analyze it. In addition, we’ll need the tidyverse package for data manipulation and pbapply is nice to have a progress bar during the cluster-based permutation test (it can sometimes take a few minutes and humans like to have progress bars).


# install.packages("remotes")



Load Peekbank Data

Peekbank is an excellent resource in itself, independent of this tutorial. Its a database of data from infant eye-tracking studies and has been built so that one can explore how different choices made during data analysis can impact how the results appear. We’ll be analyzing data from Garrison et al. (2020). Their original hypothesis was different than what we’ll be looking at: whether word recognition differs in younger (12-13 months) and older (17-18 months) infants.

Garrison H, Baudet G, Breitfeld E, Aberman A, & Bergelson E. (2020). Familiarity plays a small role in noun comprehension at 12-18 months. Infancy. 25(4), 458-477. doi: 10.1111/infa.12333

# get peekbank data into eyetrackingR
# using example from one study, "garrison_bergelson_2020"
# but could use any example!

aoi_timepoints <- get_aoi_timepoints(dataset_name = "garrison_bergelson_2020")
administrations <- get_administrations(dataset_name = "garrison_bergelson_2020")
stimuli <- get_stimuli(dataset_name = "garrison_bergelson_2020")

ps_data <- aoi_timepoints %>%
  left_join(administrations)%>% # bring these two data sets together
  mutate(age_binned = ifelse(age < 14, "Younger", 
                             ifelse(age > 16, "Older", "sort_out")))%>% # create age bins
  filter(age_binned != "sort_out")%>% # get rid of infants we don't want
  mutate(Target = ifelse(aoi == "target", TRUE, FALSE))%>% # add column for target looks, logical, needed for eyetrackingR
  mutate(Distractor = ifelse(aoi == "distractor", TRUE, FALSE))%>% # add column for distractor looks, logical, needed for eyetrackingR
  mutate(Trackloss = ifelse(aoi == "missing", TRUE, FALSE)) # add column for looks away from the screen, logical, needed for eyetrackingR

Convert Data to eyetrackingR

Now that we have the data loaded into R, we want to make it into an “eyetrackingr object”. We prepared our data for this in the last step by creating the Target, Distractor, and Trackloss columns. Data sets will differ widely by eyetracker and software, so when it is your turn to analyze your own data, I would recommend checking out the EyetrackingR Vignette Preparing Your Data.

data <- make_eyetrackingr_data(ps_data, 
                       participant_column = "subject_id",
                       trial_column = "trial_id",
                       time_column = "t_norm",
                       trackloss_column = "Trackloss",
                       aoi_columns = c('Target','Distractor'),
                       treat_non_aoi_looks_as_missing = TRUE

Clean Data

This step cleans up the data a little bit. Choices are made here that are beyond the scope of this tutorial, but we are subsetting the trial down to a time window we’re interested in, as well as removing trials where more than 25% of the trial was not trackable.

# subset to response window post word-onset
response_window <- subset_by_window(data, 
                                    window_start_time = 0, 
                                    window_end_time = 4000, 
                                    rezero = FALSE)

# analyze amount of trackloss by subjects and trials
#(trackloss <- trackloss_analysis(data = response_window))

# remove trials with > 25% of trackloss
response_window_clean <- clean_by_trackloss(data = response_window,
                                            trial_prop_thresh = .25)

Two Ways of Looking at Eye-Tracking Data

The first step to any data analysis is inspecting the data. We can look at this data set in two ways:

Aggregated Across a Specific Time Period

Here, we use the “make_time_window_data” function to create an aggregate of the looking data over the entire response window after target word onset. This will give us a few different dependent variables to look at. These are explained in the EyetrackingR Vignette Window Analysis.

For the purpose of this tutorial, we’ll focus on the column “Prop”, which for us represents the Proportion of Target Looks. Its calculated by taking the total looks to the Target divided by the summed total looks to the Target and Distractor. But don’t worry, EyetrackingR does this for you.

# aggregate by subject across the response window
response_window_agg_by_sub <- make_time_window_data(response_window_clean, 
                                             summarize_by = "subject_id")

#plot(response_window_agg_by_sub, predictor_columns="age_binned", dv = "Prop")

We can then plot PTL for both the Younger and Older infants. As we can see, PTL is greater for Older compared with Younger infants.

# take a quick peek at data
ggplot(response_window_agg_by_sub, aes(x=age_binned, y = Prop, color = age_binned))+
      stat_summary( = mean_se, geom="pointrange")+
      coord_cartesian(ylim = c(0, 1))+
  labs(y = "Prop. of Target Looks")+
    geom_hline(yintercept = .5, color = "black")+
  theme_bw() +
  theme(legend.position = "bottom",
        legend.title = element_blank(),
        axis.text.x = element_blank(),
        axis.title.x = element_blank(),
        axis.ticks.x = element_blank())

Small Bins Aggregated Across Entire Time Course

Here, we use the “make_time_sequence_data” function to create an aggregate of the looking data over small time bins (here every 25 ms) of the the response window after target word onset. This will give us the same dependent variables as in the Window Analysis, but for each value of Time.

# make into time series data
response_time <- make_time_sequence_data(response_window_clean,
                                  time_bin_size = 25, 
                                  predictor_columns = c("age_binned"),
                                  aois = "Target",
                                  summarize_by = "subject_id" )

We can then plot PTL over the course of the trial for both the Younger and Older infants As we can see, PTL is greater for Older compared with Younger infants, just as we saw in the previous plot, but here we can see that this difference is mostly present early in the trial.

# visualize timecourse
ggplot(response_time, aes(x = Time, y = Prop, color = age_binned))+
  stat_smooth(method="loess", span=0.1, se=TRUE, aes(fill=age_binned), alpha=0.3) +
  theme_bw() +
  labs(y = "Prop. of Target Looks")+
  geom_hline(yintercept = .5, color = "black")+
  geom_vline(xintercept = 0, color = "black")+
  coord_cartesian(ylim = c(0,1))+
  theme(legend.position = "bottom",
        legend.title = element_blank())

HowTo: Cluster-Based Permutation Tests

We’d like to use a cluster-based permutation test to determine when in time PTL differed for Older and Younger infants. The approach that we’ll use is based on the eyetrackingR vignette Estimating Divergences, under “Bootstrapped cluster-based permutation analysis”.

Step 1: Run statistical test on each time-bin

In this step, for every time bin, we’ll compare PTL between the two groups (older versus younger) by means of a t-test. Any type of statistical test can be used (e.g. F-statistic from an ANOVA, z-value from a GLM). That means, in our trial of 4000 ms, which has time bins of 25 ms, we’ll compute 160 t-tests.

We’re interested in the time points that are above the threshold of significance (\(\alpha\) = .05, two-tailed). For the number of subjects tested in this experiment (n = 16), the threshold any t-statistic needs to reach significance is either 2.1314495 or -2.1314495.

We compute these t-tests using the function “make_time_cluster_data”.

df_timeclust <- make_time_cluster_data(response_time, 
                                      test= "t.test", paired=FALSE,
                                      predictor_column = "age_binned", 
                                      threshold = threshold_t) 

Step 2: Identify significant clusters

We can then plot these results. The black line in the plot below represents the t-statistic for every t-test that was conducted at each time bin. The dotted lines represent the threshold any t-statistic needs to reach significance (2.1314495 or -2.1314495). When the black line crosses either of these dotted lines, a time-cluster is identified (and subsequently highlighted in the plot, neat!). A time-cluster is any time-bin or series of adjacent time-bins that reached our critical t-statistic threshold of either 2.1314495 or -2.1314495. As we can see, there are several clusters in this data.

plot(df_timeclust) +  ylab("T-Statistic") + theme_light()

We can also look at a summary of this output, describing the results of these t-tests.

  • Cluster: There are 3 different clusters identified
  • Direction: All of the clusters are “Positive”, which refers to which t-statistic threshold was reached (2.1314495 in this case). This helps us determine which condition was greater than the other. In this case, our original plot helps us, as we know that when there was a difference, it showed that Older infants had a greater PTL than Younger infants, so this is how we can interpret a Positive cluster.
  • SumStatistic: When there was a significant t-statistic, or a run or cluster of significant t-statistics, what is the sum of these values? The bigger the number here the more consecutive time bins were significant. If there were only two significant time bins next to one another, reaching only the bare minimum of our threshold, for example 2 x 2.1314495, the SumStatistic value would be 4.2628991
  • StartTime: The start time of the cluster
  • EndTime: The end time of the cluster
## Test Type:    t.test 
## Predictor:    age_binned 
## Formula:  Prop ~ age_binned 
## Summary of Clusters ======
##   Cluster Direction SumStatistic StartTime EndTime
## 1       1  Positive    74.723687       850    1550
## 2       2  Positive     6.905622      2725    2800
## 3       3  Positive    46.294644      3500    3925

Step 3: Randomly shuffle the data

We next randomize the data set and compute the same analysis as before, but this time on the randomized data. We do this many times, in this case 1000. By doing this, we’re asking ourselves how likely it is that the original clusters that we identified in Step 2 represent random chance or a true effect. By randomizing the data in this step, we create a bunch of random chance against which to check our original data.

# num_shuffles was set earlier in the document
clust_analysis <- analyze_time_clusters(df_timeclust, within_subj=FALSE, paired=FALSE,

Since we created and then analyzed 1000 randomized datasets, its not worth it to look at each one individually. Instead, the results of this randomization and analysis are represented in the plot below. The summed t-statistic (x-axis, SumStatistic) for the clusters identified in the randomized data are plotted here in a histogram. As you can see, most of the randomized clusters have small summed t-statistics.

The values for our original 3 clusters are indicated by dashed line. We’re interested in whether the summed t-statistics of our original clusters beat the randomized ones. If we look at cluster 1, the biggest cluster identified in our original data, we see that very few clusters identified in the randomized data had a summed t-statistic as large as this cluster.

plot(clust_analysis) + theme_light()

Step 4: Calculate Cluster Probability

To interpret the results, we calculate a Monte Carlo p-value for each of our original clusters (this is what we see below under Probability). This is the proportion of randomized sets of data that produce a cluster with a summed t-statistic that is larger than that of the cluster in our original data. In other words, the Monte Carlo p-value represents the probability that this cluster would occur by chance. For our other clusters, this p-value does not reach significance, indicating that divergences this large, given the original data, are likely expected between 41% and 11% of the time.

## Test Type:    t.test 
## Predictor:    age_binned 
## Formula:  Prop ~ age_binned 
## Null Distribution   ====== 
##  Mean:        -0.0424 
##  2.5%:        -79.2637 
## 97.5%:        70.0291 
## Summary of Clusters ======
##   Cluster Direction SumStatistic StartTime EndTime Probability
## 1       1  Positive    74.723687       850    1550       0.051
## 2       2  Positive     6.905622      2725    2800       0.405
## 3       3  Positive    46.294644      3500    3925       0.111

Step 5: Summarize Results

If we were to summarize the results in a paper, we could write it like this:

Comparisons using cluster-based permutation analysis revealed that target looking behavior for Older and Younger infants significantly deviated from each other between 850 and 1550 ms (cluster t statistic = 74.72, Monte Carlo p = .05), with reduced target looking for Younger infants.