Sample and study design
This cross-sectional study is part of the EMOTIKON project. The study was mandated and approved by the Ministry of Education, Youth and Sport of the Federal State of Brandenburg, Germany. The Brandenburg School Law [41] requires that parents be comprehensively informed prior to the start of the study. Consent is not needed because the tests are obligatory for both children and schools. Physical fitness tests were administered to all third-graders in the state annually between September and November from 2011 to 2019. Physical fitness tests were also administered to the 2009 and 2010 cohorts, but later in the school year, that is, between March and April. Because of the seasonal variation in physical fitness, these data were not included. Research was conducted according to the latest Declaration of Helsinki.
We started with data from 144,045 children. Of those, we included only healthy children who had been enrolled within the legal key date of the Federal State of Brandenburg, that is, in a given year of school enrolment they were at least 6.00 and at most 6.99 years old on September 30th and, therefore, varied between 8.00 and 8.99 years in the third grade (n = 110,669). In addition to early-entry children (n = 2,664), late-entry children (n = 30,457), and children without information about birthdate (n = 255), we did not include children with signs of emotional (e.g., autism) and/or physical disorders (e.g., disabilities like infantile cerebral palsy), as evaluated by the responsible and experienced physical education teacher based on a medical clearance (n = 28). After the first iteration, the LMM-based conditional means of the random effects identified one school as an extreme outlier on several tests across the years, and the 171 children of that school were excluded as well. Finally, we applied a ± 3 SD criterion to individual test scores, which led to the exclusion of another 2,175 children (2%). This left us with 108,295 children (i.e., 75% of those tested) from 515 different schools.
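The exclusion counts above can be checked with simple arithmetic. The following Python sketch is an illustration only (the study itself used R and Julia); the variable names and labels are ours, and the counts are copied from the text.

```python
# Exclusion flow as reported in the text; labels and names are illustrative.
tested = 144_045
exclusions = [
    ("early entry", 2_664),
    ("late entry", 30_457),
    ("birthdate missing", 255),
    ("emotional/physical disorder", 28),
    ("outlier school", 171),
    ("outside +/- 3 SD", 2_175),
]

remaining = tested
after_each_step = []
for reason, n in exclusions:
    remaining -= n
    after_each_step.append((reason, remaining))
    print(f"after excluding {reason:<28} {remaining:>8,}")

share_retained = remaining / tested
print(f"retained: {remaining:,} ({share_retained:.0%} of those tested)")
```

The count after the first three steps corresponds to the children enrolled within the key date (n = 110,669), and the final count to the analysed sample (n = 108,295).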
Physical fitness tests
Physical fitness was assessed with the EMOTIKON test battery (see www.uni-potsdam.de/en/emotikon/projekt/methodik for further information on the test protocols). The five tests measured cardiorespiratory endurance (i.e., 6 min run test), coordination (i.e., star run test), speed (i.e., 20-m linear sprint test), power of lower limbs (powerLOW [i.e., standing long jump test]), and power of upper limbs (powerUP [i.e., ball push test]). The EMOTIKON test battery officially includes six tests. Up to 2015, the sixth test was the stand and reach test (flexibility), which was then replaced by the single-leg balance test (balance). Because of the much smaller number of scores and their confounding with cohort, these two tests were not included in the analyses. The five tests yielded 525,126 scores from the 108,295 children (i.e., 3% missing test scores).
Qualified physical education teachers of each school administered the tests according to standardized test protocols during the regular physical education classes in the participating schools (see www.uni-potsdam.de/en/emotikon/projekt/methodik for further information on the test protocols). Teachers were instructed in standardized assessment through advanced training. Tests were always conducted in the morning between 8 and 12 o’clock. Encouragement to achieve the best performance was permitted. Prior to testing, all third-graders performed a standardized warm-up program consisting of different running exercises (e.g., side-steps) and small games (e.g., playing tag).
Cardiorespiratory endurance
Cardiorespiratory endurance was assessed with the 6 min run test. Children had to run as far as they could within six minutes around an official volleyball field (9 × 18 m; a pylon/marker was set beside the running court every 9 m [i.e., six pylons around the field]) at a self-paced velocity. Split time was given every minute. The maximal distance achieved during the six minutes, in meters to the nearest nine-meter marker, was used as the dependent variable in the analysis. The 6 min run test was reliable (test–retest) in children aged 7–11 years with an intraclass correlation coefficient (ICC) of 0.92 [42]. The 6 min run test correlated at r = 0.69 (p < 0.01) with V̇O2max assessed via gas analysis during a progressive treadmill test in children aged 9–11 years [43].
Coordination
Coordination under time pressure was tested with the star run test (see Fig. 5). Children had to complete a course with different movement directions and movement forms (i.e., running forward, running backward, side-steps to the left side, side-steps to the right side). The course had to be performed in a given order over a 9 × 9 m star-shaped area in which each of the four spikes is marked by a pylon. After starting in the centre of the star, children had to complete the course as fast as they could by running in every movement form two times within the given sequence. They had to touch each pylon with the hand. The total distance covered is 50.912 m. Children performed two test trials; the faster one was used in the analysis. The time for completing the course was measured with a stopwatch in seconds to the nearest 1/10 s and was used as the dependent variable in the analysis. The star run test was reliable (test–retest) in 8–10 year old children with an ICC of 0.68 [44].
Figure 5. Schematic description of the star run test (adapted from Golle et al. [6]).
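The reported course length of 50.912 m can be reproduced under one plausible reading of the layout, which is our assumption and not stated in the text: the four pylons stand at the corners of the 9 × 9 m square, and each of the eight legs (four movement forms, each performed twice) covers the distance between the centre and a corner. A minimal Python sketch:

```python
import math

# Assumed layout (not stated in the text): pylons at the corners of a 9 x 9 m
# square, every leg runs between the centre and one corner pylon.
side_m = 9.0
half_diagonal_m = side_m * math.sqrt(2) / 2   # centre to corner, about 6.364 m
legs = 4 * 2                                  # four movement forms, each run twice
total_m = legs * half_diagonal_m

print(round(total_m, 3))  # 50.912
```

Under this reading, the computed total agrees with the distance used later to convert star run times into pace scores.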
Speed
Speed was assessed with the 20-m linear sprint test. After an acoustic signal, the children had to sprint from a frontal erect posture as fast as they could over a distance of 20 m. Each child sprinted twice; the faster trial was used in the analysis. The time for sprinting the 20 m was measured with a stopwatch in seconds to the nearest 1/10 s and was used as the dependent variable in the analysis. The 20-m linear sprint test was reliable (test–retest) in children aged 7–11 years with an ICC of 0.90 [42].
Power of lower limbs (PowerLOW)
PowerLOW was tested using the standing long jump test. From a standing frontal posture, the children had to jump as far as they could. The participants had to land with both feet together. They were allowed to swing their arms prior to and during the jump, but after landing the hands were not allowed to touch the floor. The distance between the toes at take-off and the heels at landing was determined with a measuring tape in meters to the nearest centimeter; the better of two test trials was used in the analysis. The standing long jump test was reliable (test–retest) in children aged 6–12 years with an ICC of 0.94 [45].
Power of upper limbs (PowerUP)
PowerUP was assessed with the ball push test. From a standing position, the children had to push a 1 kg medicine ball, held in front of the chest with both hands, as far as they could. Each child pushed twice; the longer of the two pushes was used in the analysis. The maximal ball push distance was determined with a measuring tape in meters to the nearest ten centimeters and used as the dependent variable in the analysis. The ball push test was reliable (test–retest) in children aged 8–10 years with an ICC of 0.81 [44].
Statistics
Pre- and post-processing of data were carried out in the R environment for statistical computing [46] using the tidyverse package [47]. For measures of cardiorespiratory endurance (i.e., 6 min run test), powerLOW (i.e., standing long jump test), and powerUP (i.e., ball push test), higher scores indicated better physical fitness. For measures of coordination (i.e., star run test) and speed (i.e., 20-m linear sprint test), Box-Cox distributional analyses indicated that a reciprocal transformation brought scores in line with the assumption of a normal distribution [48]. Therefore, we converted scores from seconds to meters per second (i.e., pace scores; star run test = 50.912 [m] / time [s]; 20-m linear sprint test = 20 [m] / time [s]). These transformations also had the advantage that a large value was indicative of good physical fitness for all five measures.
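The reciprocal transformation can be illustrated with a short Python sketch. It is not the authors' R code; the function name, the handling of missing values as None, and the example times are our assumptions.

```python
STAR_RUN_DISTANCE_M = 50.912   # course length given in the text
SPRINT_DISTANCE_M = 20.0       # sprint distance given in the text


def pace_score(distance_m, time_s):
    """Convert a timed score (s) into meters per second; missing stays missing."""
    if time_s is None:
        return None
    if time_s <= 0:
        raise ValueError("time must be positive")
    return distance_m / time_s


# Made-up example times in seconds (None marks a missing score).
star_times_s = [25.4, 22.1, None]
sprint_times_s = [4.6, 4.1, 5.0]

star_pace = [pace_score(STAR_RUN_DISTANCE_M, t) for t in star_times_s]
sprint_pace = [pace_score(SPRINT_DISTANCE_M, t) for t in sprint_times_s]
print(star_pace)
print(sprint_pace)
```

Because the time is in the denominator, a shorter time yields a larger pace score, so larger values indicate better fitness on all five measures.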
For each test, we determined the ± 3 SD boundary separately for boys and girls. Measurements outside these boundaries were usually implausible (i.e., recording errors) or extreme outliers. They were treated as missing values (3%). Finally, we converted scores within tests (aggregated over boys and girls) to z-scores to facilitate comparison of test, age, and sex effects.
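The two steps, sex-specific trimming followed by z-transformation pooled over boys and girls, can be sketched in Python as follows. This is an illustration under assumptions of ours: a single trimming pass, the sample standard deviation (n − 1), made-up jump distances, and hypothetical function names.

```python
from statistics import mean, stdev


def trim_by_group(scores, groups, k=3.0):
    """Set scores outside mean +/- k * SD of their own group to None."""
    bounds = {}
    for g in set(groups):
        vals = [s for s, gg in zip(scores, groups) if gg == g and s is not None]
        m, sd = mean(vals), stdev(vals)
        bounds[g] = (m - k * sd, m + k * sd)
    return [
        s if s is not None and bounds[g][0] <= s <= bounds[g][1] else None
        for s, g in zip(scores, groups)
    ]


def z_scores(scores):
    """Standardize the non-missing scores, pooled over all groups."""
    vals = [s for s in scores if s is not None]
    m, sd = mean(vals), stdev(vals)
    return [None if s is None else (s - m) / sd for s in scores]


# Made-up scores for one test: 20 plausible boys plus one recording error,
# and 20 plausible girls.
boys = [1.30 + 0.02 * i for i in range(20)] + [9.90]
girls = [1.20 + 0.02 * i for i in range(20)]
scores = boys + girls
groups = ["boy"] * len(boys) + ["girl"] * len(girls)

trimmed = trim_by_group(scores, groups)   # boundaries computed within sex
z = z_scores(trimmed)                     # z-scores computed across sexes
print(trimmed.count(None), "score(s) set to missing")
```

Computing the boundaries within sex but the z-scores across sexes keeps the mean difference between boys and girls in the standardized scores, so that the sex effect can still be estimated.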
Statistical inference was based on a linear mixed model (LMM) estimated with the MixedModels package [49] in the Julia programming language [50]. The LMM included child (N = 108,295), school (N = 515), and cohort (N = 9) as three random factors; the total number of observations (i.e., max = 5 per child) was 525,126. Fitted model objects were processed with random-effects principal component analysis to obtain loadings of the variance–covariance matrix of the random effects and facilitate its interpretation.
As fixed effects, we specified four sequential-difference contrasts for the five tests: (H1) coordination vs. cardiorespiratory endurance, (H2) speed vs. coordination, (H3) powerLOW vs. speed, and (H4) powerUP vs. powerLOW. Also included were the effect of age (centered at 8.5 years) as a second-order polynomial trend, the effect of sex (boys–girls), and all interactions between contrasts, age, and sex. Given the large number of observations, children, and schools, we adopted an absolute z-value > 3.0 (two-sided) as the significance criterion for the interpretation of fixed effects.
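To make the meaning of the four sequential-difference contrasts concrete, the following Python sketch builds the corresponding coding matrix for five levels and checks that each coefficient equals the difference between neighbouring test means, with the intercept equal to the grand mean. The matrix has the same layout as the one R's MASS package produces with contr.sdif; the cell means are made up, and the sketch does not reproduce the authors' Julia model specification.

```python
from fractions import Fraction


def sdif_contrasts(k):
    """k x (k-1) sequential-difference coding matrix.

    Column j contrasts level j+1 with level j: levels up to j get -(k-j)/k,
    levels above j get j/k.
    """
    return [
        [Fraction(j, k) if i > j else Fraction(-(k - j), k) for j in range(1, k)]
        for i in range(1, k + 1)
    ]


tests = ["endurance", "coordination", "speed", "powerLOW", "powerUP"]
C = sdif_contrasts(len(tests))

# Made-up cell means (z-score units) for the five tests.
mu = [Fraction(0), Fraction(3, 10), Fraction(-1, 10), Fraction(1, 2), Fraction(1, 5)]
grand_mean = sum(mu) / len(mu)
beta = [mu[j + 1] - mu[j] for j in range(len(mu) - 1)]   # H1..H4

# Intercept plus coding matrix times contrasts reproduces the cell means.
fitted = [
    grand_mean + sum(c * b for c, b in zip(row, beta))
    for row in C
]
print(fitted == mu)
```

With this coding, each fixed-effect estimate for the factor test is directly interpretable as the difference between two neighbouring tests in the order listed above.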
Child, school, and cohort were included as random factors. With three random factors there was a need for selecting a random-effect structure that included theoretically relevant and reliable variance components (VCs) and correlation parameters (CPs), but was also still supported by the data (i.e., was not overparameterized). Tests varied within children, schools, and cohorts; age and sex varied between children, but within schools and within cohorts. Therefore, in principle, VCs and CPs of linear effects of age and sex could be estimated for schools and cohorts, but not for children.
Parsimonious model selection occurred in two major steps without knowledge or consideration of fixed-effect estimates [51]; details are provided in Supplement A. We started with a model including the Grand Mean (varying intercepts) for all three random factors and, given the large numbers of 108,295 children and 515 schools and the small number of nine cohorts, also included test-related VCs and CPs for child and school and age-related and sex-related VCs and CPs for school, but not for cohort. This LMM m1 was well supported by the data. In the second major step, we increased the complexity of the random-effect structure for cohort by adding test-related VCs (LMM m2), then test-related CPs (LMM m3), and finally age- and sex-related VCs and CPs (LMM m4).
LMM m4 was not supported by the data (i.e., the fit was singular) and did not significantly improve the goodness of fit over LMM m3, Δχ2(13) = 14.11, p = 0.37. LMM m3 improved the goodness of fit over LMM m2 according to the likelihood ratio test, χ2(10) = 48.45, p < 0.001, but not when the increase in model complexity was penalized according to an information criterion (LMM m2 = 0,00012345 and LMM m3 = 0,00012345). As we had no directed hypotheses relating to test-related CPs for the factor cohort, we stayed with LMM m2, which represented a very large improvement in goodness of fit relative to LMM m1, χ2(4) = 1489.57, p < 0.001. We also estimated LMM m2 with two alternative parameterizations that did not change the goodness of fit, but yielded information about CPs between test scores instead of test effects (i.e., contrasts). Finally, we fitted two control LMMs to test the significance of quadratic age trends for fixed effects and the absence of evidence for sex × age interactions separately for each fitness component (i.e., nested within the five levels of the factor test).
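The p-values of the three likelihood ratio tests follow from the reported χ2 statistics and their degrees of freedom. The Python sketch below recomputes them with the closed-form upper-tail probabilities of the χ2 distribution for integer degrees of freedom; the function and variable names are ours, and the statistics are copied from the text rather than derived from fitted models.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variate, integer df."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= (x / 2) / i
            total += term
        return math.exp(-x / 2) * total
    # Odd df: erfc(sqrt(x/2)) + 2 * phi(sqrt(x)) * sum_r x^(r - 1/2) / (1*3*...*(2r-1))
    root = math.sqrt(x)
    phi = math.exp(-x / 2) / math.sqrt(2 * math.pi)
    term, total = root, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return math.erfc(root / math.sqrt(2)) + 2 * phi * total


p_m4_vs_m3 = chi2_sf(14.11, 13)     # reported as p = 0.37
p_m3_vs_m2 = chi2_sf(48.45, 10)     # reported as p < 0.001
p_m2_vs_m1 = chi2_sf(1489.57, 4)    # reported as p < 0.001

print(f"m4 vs. m3: p = {p_m4_vs_m3:.2f}")
print(f"m3 vs. m2: p = {p_m3_vs_m2:.2g}")
print(f"m2 vs. m1: p = {p_m2_vs_m1:.2g}")
```

The degrees of freedom equal the number of variance components and correlation parameters added at each step, for example four test-related VCs for cohort when moving from LMM m1 to LMM m2.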

