Publications
Topic Group 2 (TG2) Publications
These publications were written on behalf of STRATOS-TG2. They center around the topic group’s core topic, the selection of variables and functional forms in multivariable models.
A Systematic Categorization of Performance Measures for Estimated Non-Linear Associations Between an Outcome and Continuous Predictors
Ullmann T, Heinze G, Abrahamowicz M, Perperoglou A, Sauerbrei W, Schmid M, Dunkler D, for TG2 of the STRATOS initiative, 2025. A Systematic Categorization of Performance Measures for Estimated Non-Linear Associations Between an Outcome and Continuous Predictors. WIREs Computational Statistics 17(3), e70042. https://doi.org/10.1002/wics.70042
Abstract
In regression analysis, associations between continuous predictors and the outcome are often assumed to be linear. However, modeling the associations as non-linear can improve model fit. Many flexible modeling techniques, like (fractional) polynomials and spline-based approaches, are available. Such methods can be systematically compared in simulation studies, which require suitable performance measures to evaluate the accuracy of the estimated curves against the true data-generating functions. Although various measures have been proposed in the literature, no systematic overview exists so far. To fill this gap, we introduce a categorization of performance measures for evaluating estimated non-linear associations between an outcome and continuous predictors. This categorization includes many commonly used measures. The measures can not only be used in simulation studies, but also in application studies to compare different estimates to each other. We further illustrate and compare the behavior of different performance measures through some examples and a Shiny app.
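As a rough illustration of what such a performance measure can look like, the sketch below computes a root mean squared error between an estimated curve and the true function over a grid of predictor values. The function name `curve_rmse`, the logarithmic true function, the grid, and the two polynomial estimates are assumptions made for this example; they are not taken from the paper.

```python
import numpy as np

def curve_rmse(f_true, f_hat, grid):
    """Root mean squared difference between two curves evaluated on a grid."""
    diff = f_hat(grid) - f_true(grid)
    return float(np.sqrt(np.mean(diff ** 2)))

# Simulated data with a known non-linear (logarithmic) association
rng = np.random.default_rng(1)
x = rng.uniform(0.5, 5.0, size=200)
f_true = np.log
y = f_true(x) + rng.normal(scale=0.2, size=x.size)

# Two competing estimates: a straight line and a cubic polynomial
linear_fit = np.poly1d(np.polyfit(x, y, deg=1))
cubic_fit = np.poly1d(np.polyfit(x, y, deg=3))

# Compare both estimated curves against the true function on a common grid
grid = np.linspace(0.5, 5.0, 101)
rmse_linear = curve_rmse(f_true, linear_fit, grid)
rmse_cubic = curve_rmse(f_true, cubic_fit, grid)
print(f"linear: {rmse_linear:.3f}, cubic: {rmse_cubic:.3f}")
```

Evaluating on a fixed grid rather than at the observed predictor values is one of several possible design choices; it weights all regions of the predictor range equally.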
Evaluating variable selection methods for multivariable regression models: A simulation study protocol
Ullmann, T., Heinze, G., Hafermann, L., Schilhart-Wallisch, C., Dunkler, D., for TG2 of the STRATOS initiative, 2024. Evaluating variable selection methods for multivariable regression models: A simulation study protocol. PLoS ONE 19, e0308543. https://doi.org/10.1371/journal.pone.0308543
Abstract
Researchers often perform data-driven variable selection when modeling the associations between an outcome and multiple independent variables in regression analysis. Variable selection may improve the interpretability, parsimony and/or predictive accuracy of a model. Yet variable selection can also have negative consequences, such as false exclusion of important variables or inclusion of noise variables, biased estimation of regression coefficients, underestimated standard errors and invalid confidence intervals, as well as model instability. While the potential advantages and disadvantages of variable selection have been discussed in the literature for decades, few large-scale simulation studies have neutrally compared data-driven variable selection methods with respect to their consequences for the resulting models. We present the protocol for a simulation study that will evaluate different variable selection methods: forward selection, stepwise forward selection, backward elimination, augmented backward elimination, univariable selection, univariable selection followed by backward elimination, and penalized likelihood approaches (Lasso, relaxed Lasso, adaptive Lasso). These methods will be compared with respect to false inclusion and/or exclusion of variables, consequences on bias and variance of the estimated regression coefficients, the validity of the confidence intervals for the coefficients, the accuracy of the estimated variable importance ranking, and the predictive performance of the selected models. We consider both linear and logistic regression in a low-dimensional setting (20 independent variables with 10 true predictors and 10 noise variables). The simulation will be based on real-world data from the National Health and Nutrition Examination Survey (NHANES). Publishing this study protocol ahead of performing the simulation increases transparency and allows integrating the perspective of other experts into the study design.
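To make one of the listed methods concrete, the following is a minimal sketch of backward elimination for linear regression on toy data with 20 independent variables, of which 10 are true predictors. It is not the protocol's implementation: the AIC stopping rule, the independent normal predictors (instead of NHANES data), the coefficient values, and the helper names `ols_aic` and `backward_elimination` are all assumptions chosen for illustration.

```python
import numpy as np

def ols_aic(X, y):
    """AIC, up to an additive constant, of a Gaussian linear model with intercept."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    return n * np.log(rss / n) + 2 * design.shape[1]

def backward_elimination(X, y):
    """Remove one variable at a time for as long as removal lowers the AIC."""
    selected = list(range(X.shape[1]))
    best = ols_aic(X[:, selected], y)
    while selected:
        trials = []
        for k in selected:
            reduced = [j for j in selected if j != k]
            trials.append((ols_aic(X[:, reduced], y), k))
        aic, drop = min(trials)
        if aic >= best:
            break  # no single removal improves the AIC any further
        best = aic
        selected.remove(drop)
    return selected

# Toy data (assumed): the first 10 columns are true predictors, the rest are noise
rng = np.random.default_rng(2024)
n, p, p_true = 500, 20, 10
X = rng.normal(size=(n, p))
beta = np.r_[np.ones(p_true), np.zeros(p - p_true)]
y = X @ beta + rng.normal(size=n)

selected = backward_elimination(X, y)
false_exclusions = sum(j not in selected for j in range(p_true))
false_inclusions = sum(j >= p_true for j in selected)
print(selected, false_exclusions, false_inclusions)
```

Counting false exclusions and false inclusions against the known data-generating model corresponds to the first comparison criterion named in the abstract.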
Regression without regrets –initial data analysis is a prerequisite for multivariable regression
Heinze, G., Baillie, M., Lusa, L., Sauerbrei, W., Schmidt, C.O., Harrell, F.E., Huebner, M., on behalf of TG2 and TG3 of the STRATOS initiative, 2024. Regression without regrets –initial data analysis is a prerequisite for multivariable regression. BMC Med Res Methodol 24, 178. https://doi.org/10.1186/s12874-024-02294-3
Abstract
Statistical regression models are used for predicting outcomes based on the values of some predictor variables or for describing the association of an outcome with predictors. With a data set at hand, a regression model can be easily fit with standard software packages. This bears the risk that data analysts may rush to perform sophisticated analyses without sufficient knowledge of basic properties, associations in and errors of their data, leading to wrong interpretation and presentation of the modeling results that lacks clarity. Ignorance about special features of the data such as redundancies or particular distributions may even invalidate the chosen analysis strategy. Initial data analysis (IDA) is prerequisite to regression analyses as it provides knowledge about the data needed to confirm the appropriateness of or to refine a chosen model building strategy, to interpret the modeling results correctly, and to guide the presentation of modeling results. In order to facilitate reproducibility, IDA needs to be preplanned, an IDA plan should be included in the general statistical analysis plan of a research project, and results should be well documented. Biased statistical inference of the final regression model can be minimized if IDA abstains from evaluating associations of outcome and predictors, a key principle of IDA. We give advice on which aspects to consider in an IDA plan for data screening in the context of regression modeling to supplement the statistical analysis plan. We illustrate this IDA plan for data screening in an example of a typical diagnostic modeling project and give recommendations for data visualizations.
Review of guidance papers on regression modeling in statistical series of medical journals
Wallisch, C., Bach, P., Hafermann, L., Klein, N., Sauerbrei, W., Steyerberg, E.W., Heinze, G., Rauch, G., on behalf of topic group 2 of the STRATOS initiative, 2022. Review of guidance papers on regression modeling in statistical series of medical journals. PLoS ONE 17, e0262918. https://doi.org/10.1371/journal.pone.0262918
Abstract
Although regression models play a central role in the analysis of medical research projects, there still exist many misconceptions on various aspects of modeling leading to faulty analyses. Indeed, the rapidly developing statistical methodology and its recent advances in regression modeling do not seem to be adequately reflected in many medical publications. This problem of knowledge transfer from statistical research to application was identified by some medical journals, which have published series of statistical tutorials and (shorter) papers mainly addressing medical researchers. The aim of this review was to assess the current level of knowledge with regard to regression modeling contained in such statistical papers. We searched for target series by a request to international statistical experts. We identified 23 series including 57 topic-relevant articles. Within each article, two independent raters analyzed the content by investigating 44 predefined aspects on regression modeling. We assessed to what extent the aspects were explained and if examples, software advices, and recommendations for or against specific methods were given. Most series (21/23) included at least one article on multivariable regression. Logistic regression was the most frequently described regression type (19/23), followed by linear regression (18/23), Cox regression and survival models (12/23) and Poisson regression (3/23). Most general aspects on regression modeling, e.g. model assumptions, reporting and interpretation of regression results, were covered. We did not find many misconceptions or misleading recommendations, but we identified relevant gaps, in particular with respect to addressing nonlinear effects of continuous predictors, model specification and variable selection. Specific recommendations on software were rarely given. 
Statistical guidance should be developed for nonlinear effects, model specification and variable selection to better support medical researchers who perform or interpret regression analyses.
Systematic review of education and practical guidance on regression modeling for medical researchers who lack a strong statistical background: Study protocol
Bach, P., Wallisch, C., Klein, N., Hafermann, L., Sauerbrei, W., Steyerberg, E.W., Heinze, G., Rauch, G., for topic group 2 of the STRATOS initiative, 2020. Systematic review of education and practical guidance on regression modeling for medical researchers who lack a strong statistical background: Study protocol. PLoS ONE 15, e0241427. https://doi.org/10.1371/journal.pone.0241427
Abstract
In the last decades, statistical methodology has developed rapidly, in particular in the field of regression modeling. Multivariable regression models are applied in almost all medical research projects. Therefore, the potential impact of statistical misconceptions within this field can be enormous. Indeed, the current theoretical statistical knowledge is not always adequately transferred to the current practice in medical statistics. Some medical journals have identified this problem and published isolated statistical articles and even whole series thereof. In this systematic review, we aim to assess the current level of education on regression modeling that is provided to medical researchers via series of statistical articles published in medical journals. The present manuscript is a protocol for a systematic review that aims to assess which aspects of regression modeling are covered by statistical series published in medical journals that intend to train and guide applied medical researchers with limited statistical knowledge. Statistical paper series cannot easily be summarized and identified by common keywords in an electronic search engine like Scopus. We therefore identified series by a systematic request to statistical experts who are part of or related to the STRATOS Initiative (STRengthening Analytical Thinking for Observational Studies). Within each identified article, two raters will independently check the content of the articles with respect to a predefined list of key aspects related to regression modeling. The content analysis of the topic-relevant articles will be performed using a predefined report form to assess the content as objectively as possible. Any disputes will be resolved by a third reviewer. Summary analyses will identify potential methodological gaps and misconceptions that may have an important impact on the quality of analyses in medical research.
This review will thus provide a basis for future guidance papers and tutorials in the field of regression modeling which will enable medical researchers 1) to interpret publications in a correct way, 2) to perform basic statistical analyses in a correct way and 3) to identify situations when the help of a statistical expert is required.
State of the art in selection of variables and functional forms in multivariable analysis—outstanding issues
Sauerbrei, W., Perperoglou, A., Schmid, M., Abrahamowicz, M., Becher, H., Binder, H., Dunkler, D., Harrell, F.E., Royston, P., Heinze, G., for TG2 of the STRATOS initiative, 2020. State of the art in selection of variables and functional forms in multivariable analysis—outstanding issues. Diagn Progn Res 4, 3. https://doi.org/10.1186/s41512-020-00074-3
Abstract
Background
How to select variables and identify functional forms for continuous variables is a key concern when creating a multivariable model. Ad hoc ‘traditional’ approaches to variable selection have been in use for at least 50 years. Similarly, methods for determining functional forms for continuous variables were first suggested many years ago. More recently, many alternative approaches to address these two challenges have been proposed, but knowledge of their properties and meaningful comparisons between them are scarce. To define a state of the art and to provide evidence-supported guidance to researchers who have only a basic level of statistical knowledge, many outstanding issues in multivariable modelling remain. Our main aims are to identify and illustrate such gaps in the literature and present them at a moderate technical level to the wide community of practitioners, researchers and students of statistics.
Methods
We briefly discuss general issues in building descriptive regression models, strategies for variable selection, different ways of choosing functional forms for continuous variables and methods for combining the selection of variables and functions. We discuss two examples, taken from the medical literature, to illustrate problems in the practice of modelling.
Results
Our overview revealed that there is not yet enough evidence on which to base recommendations for the selection of variables and functional forms in multivariable analysis. Such evidence may come from comparisons between alternative methods. In particular, we highlight seven important topics that require further investigation and make suggestions for the direction of further research.
Conclusions
Selection of variables and of functional forms are important topics in multivariable analysis. To define a state of the art and to provide evidence-supported guidance to researchers who have only a basic level of statistical knowledge, further comparative research is required.
A review of spline function procedures in R
Perperoglou, A., Sauerbrei, W., Abrahamowicz, M., Schmid, M., 2019. A review of spline function procedures in R. BMC Med Res Methodol 19, 46. https://doi.org/10.1186/s12874-019-0666-3
Abstract
Background
With progress on both the theoretical and the computational fronts the use of spline modelling has become an established tool in statistical regression analysis. An important issue in spline modelling is the availability of user friendly, well documented software packages. Following the idea of the STRengthening Analytical Thinking for Observational Studies initiative to provide users with guidance documents on the application of statistical methods in observational research, the aim of this article is to provide an overview of the most widely used spline-based techniques and their implementation in R.
Methods
In this work, we focus on the R Language for Statistical Computing which has become a hugely popular statistics software. We identified a set of packages that include functions for spline modelling within a regression framework. Using simulated and real data we provide an introduction to spline modelling and an overview of the most popular spline functions.
Results
We present a series of simple scenarios of univariate data, where different basis functions are used to identify the correct functional form of an independent variable. Even in simple data, using routines from different packages would lead to different results.
Conclusions
This work illustrates challenges that an analyst faces when working with data. Most differences can be attributed to the choice of hyper-parameters rather than the basis used. In fact, an experienced user will know how to obtain a reasonable outcome, regardless of the type of spline used. However, many analysts do not have sufficient knowledge to use these powerful tools adequately and will need more guidance.
Recent activities of the Topic Group on Selection of Variables and Functional Forms in Multivariable Analysis (TG2)
Georg Heinze, Aris Perperoglou, Willi Sauerbrei on behalf of STRATOS TG2
Biometric Bulletin 2021; 38(2):7-8
Abstract
This Biometric Bulletin article provides an update of the activities of TG2 by July 2021.
Introducing the Topic Group on Selection of Variables and Functional Forms in Multivariable Analysis (TG2)
Aris Perperoglou, Georg Heinze, Willi Sauerbrei on behalf of STRATOS TG2
Biometric Bulletin 2018; 35(3):18-19
Abstract
The Biometric Bulletin has recently introduced its readership to the STRATOS initiative and described the activities of the Topic Groups on Missing Data (TG1), Measurement Error (TG4) and on Initial Data Analysis (TG3). This series now continues with an introduction to TG2, dealing with selection of variables and functional forms in multivariable analysis.
Further Suggested Publications
Here we list a number of textbooks and papers that were not written on behalf of STRATOS-TG2. These materials are endorsed here because they fit the aim of guidance on selection of variables and functional forms in multivariable modeling.
Books
Harrell FE. Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. Springer: New York, 2015. https://doi.org/10.1007/978-3-319-19425-7
Integrates statistical rigor with practical guidance on variable selection, model specification, and model validation across a range of regression contexts. Focuses on restricted cubic splines as a tool for modeling functional forms.
Royston P, Sauerbrei W. Multivariable model-building: a pragmatic approach to regression analysis based on fractional polynomials for modelling continuous variables. John Wiley & Sons; 2008.
Presents a pragmatic, structured approach to multivariable model building, with emphasis on modeling continuous variables using fractional polynomials. Focuses on interpretability and clinical relevance of statistical models.
Wood S. Generalized Additive Models. Chapman & Hall/CRC: New York, 2006. https://doi.org/10.1201/9781315370279
Offers an accessible yet thorough treatment of additive modeling techniques for capturing nonlinear relationships in multivariable settings.
Miller A. Subset Selection in Regression. Taylor & Francis: Boca Raton, Florida, 2002. https://doi.org/10.1201/9781420035933
Provides a critical examination of subset selection methods, discussing their statistical properties, limitations, and applications.
de Boor C. A Practical Guide to Splines, revised edn. Springer: New York, 2001.
Covers spline-based methods in detail, offering practical advice for incorporating flexible functional forms into regression models.
Hastie T, Tibshirani R. Generalized Additive Models. Chapman & Hall/CRC: New York, 1990.
Introduces foundational concepts in additive modeling, balancing statistical theory with illustrative applied examples.
Papers
Lopez-Ayala, P., Riley, R.D., Collins, G.S., Zimmermann, T. (2025). Dealing with continuous variables and modelling non-linear associations in healthcare data: practical guide. BMJ 390, e082440. https://doi.org/10.1136/bmj-2024-082440
Proper handling of continuous variables is crucial in healthcare research, for example, within regression modelling for descriptive, explanatory, or predictive purposes. However, inadequate methods are commonly used. This article highlights the importance of appropriately handling continuous variables, and illustrates the consequences of categorisation. This article also explains why assuming a linear relationship between the independent and dependent variable might be inappropriate, and describes how to use splines or fractional polynomials to model non-linear relationships.
Kipruto, E., & Sauerbrei, W. (2025). Evaluating Prediction Performance: A Simulation Study Comparing Penalized and Classical Variable Selection Methods in Low-Dimensional Data. Applied Sciences, 15(13), 7443. https://doi.org/10.3390/app15137443
Highlights how the relative performance of penalized and classical selection methods depends on data conditions, guiding informed method choice.
Sauerbrei, W., Royston, P., & Kipruto, E. (2025). Multivariable Fractional Polynomial Models and Extensions. In International Encyclopedia of Statistical Science (pp. 1609-1616). Berlin, Heidelberg: Springer. https://doi.org/10.1007/978-3-662-69359-9_397
Summarizes fractional polynomial modeling as a structured and practical approach to functional form selection.
Kipruto, E., & Sauerbrei, W. (2024). Post‐Estimation Shrinkage in Full and Selected Linear Regression Models in Low‐Dimensional Data Revisited. Biometrical Journal, 66(7), e202300368. https://doi.org/10.1002/bimj.202300368
Demonstrates how post-estimation shrinkage can reduce overfitting and bias, offering an alternative to penalized regression after variable selection.
Lu Z, Lou W. Bayesian approaches to variable selection: a comparative study from practical perspectives. Int J Biostat 2021; https://doi.org/10.1515/ijb-2020-0130
Provides practical comparisons of Bayesian variable selection approaches, supporting their application in real-world data analysis.
Heinze G, Wallisch C, Dunkler D. Variable selection – a review and recommendations for the practicing statistician. Biometrical J. 2018; 60:431–49. https://doi.org/10.1002/bimj.201700067
Offers pragmatic recommendations on variable selection methods, emphasizing model stability, validity, and transparency.
Binder H, Sauerbrei W, Royston P. Comparison between splines and fractional polynomials for multivariable model building with continuous covariates: a simulation study with continuous response. Statistics in Medicine 2013; 32(13): 2262–2277. https://doi.org/10.1002/sim.5639
Compares splines and fractional polynomials, clarifying their relative strengths for modeling continuous covariates.
Sauerbrei W, Royston P, Binder H. Selection of important variables and determination of functional form for continuous predictors in multivariable model building. Statistics in Medicine 2007; 26: 5512–5528. https://doi.org/10.1002/sim.3148
Integrates strategies for variable selection and functional form choice, presenting fractional polynomials as a coherent solution.
Abrahamowicz M, MacKenzie TA. Joint estimation of time‐dependent and non‐linear effects of continuous covariates on survival. Statistics in Medicine 2007; 26(2): 392–408. https://doi.org/10.1002/sim.2519
Illustrates how joint modeling of time-varying and nonlinear effects improves accuracy and avoids bias in survival analysis.
Abrahamowicz M, Du Berger R, Grover SA. Flexible modeling of the effects of serum cholesterol on coronary heart disease mortality. American Journal of Epidemiology 1997; 145(8): 714–729. https://doi.org/10.1093/aje/145.8.714
Shows how flexible modeling of continuous exposures can reveal nonlinear risk relationships that parametric models may obscure.
Greenland S. Avoiding power loss associated with categorization and ordinal scores in dose‐response and trend analysis. Epidemiology (Cambridge, Mass.) 1995; 6(4): 450–454. https://doi.org/10.1097/00001648-199507000-00025
Demonstrates the loss of power and information from categorizing continuous exposures, reinforcing the need for flexible functional form modeling.
Royston P, Altman DG. Regression using fractional polynomials of continuous covariates: parsimonious parametric modelling. Applied Statistics 1994; 43(3): 429–467. https://doi.org/10.2307/2986270
Introduces fractional polynomials as a parsimonious and interpretable alternative to categorization or high-order polynomials.
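For readers unfamiliar with fractional polynomials, the sketch below shows the idea behind a first-degree fractional polynomial (FP1): the predictor is transformed with a power from the set {-2, -1, -0.5, 0, 0.5, 1, 2, 3}, where power 0 denotes the natural logarithm, and the power giving the best fit is kept. This is a simplified illustration and not the full selection procedure of the papers above: choosing by smallest residual sum of squares, the simulated data, and the names `fp_transform` and `best_fp1` are assumptions made for this example.

```python
import numpy as np

# Conventional FP power set; by convention, power 0 stands for log(x)
FP_POWERS = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)

def fp_transform(x, p):
    """Apply a single fractional polynomial power to a positive predictor."""
    x = np.asarray(x, dtype=float)
    return np.log(x) if p == 0 else x ** p

def best_fp1(x, y):
    """Return the FP1 power whose linear fit has the smallest residual sum of squares."""
    def rss(p):
        design = np.column_stack([np.ones(len(y)), fp_transform(x, p)])
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        return float(np.sum((y - design @ beta) ** 2))
    return min(FP_POWERS, key=rss)

# Simulated example (assumed): the true association is logarithmic
rng = np.random.default_rng(0)
x = rng.uniform(0.5, 5.0, size=300)
y = 2.0 + 1.5 * np.log(x) + rng.normal(scale=0.1, size=x.size)

chosen = best_fp1(x, y)
print("selected FP1 power:", chosen)
```

Fractional polynomials require a strictly positive predictor; in practice the predictor is often shifted or scaled first, a step omitted here for brevity.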