Glossary
On this page you will find descriptions of the keywords used in our papers.
Keywords used in Perperoglou 2019
Multivariable modeling
Multivariable modeling in statistics is the process of using several independent variables to predict a single dependent variable. It accounts for the relationships between multiple variables and identifies which factors are most strongly associated with the outcome of interest. This method is commonly used in regression analysis and can help to improve the accuracy of predictions.
Functional form of continuous covariates
The functional form of continuous covariates in statistics refers to the mathematical relationship between the independent variable and the dependent variable. This can be linear, nonlinear, or a combination of both. The choice of functional form can have a significant impact on the results of statistical analysis and should be carefully considered when selecting a model.
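As a minimal illustration, the following R sketch (using simulated data and the hypothetical variables x and y) compares a linear functional form with a logarithmic one for the same covariate:

    # simulated example: the true relationship is logarithmic
    set.seed(1)
    x <- runif(200, 1, 10)
    y <- 2 * log(x) + rnorm(200, sd = 0.5)

    fit_linear <- lm(y ~ x)        # assumes a linear functional form
    fit_log    <- lm(y ~ log(x))   # assumes a logarithmic functional form

    AIC(fit_linear, fit_log)       # compare the fit of the two functional forms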
Spline function
A spline function is a type of mathematical function used in statistics and numerical analysis to approximate complex curves or surfaces. It consists of multiple polynomial segments that join smoothly at specific points called knots. Spline functions are flexible and can be used to fit data or smooth curves.
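A minimal sketch in R using the splines package (simulated data; the choice of 4 degrees of freedom is arbitrary):

    library(splines)

    set.seed(2)
    x <- runif(300, 0, 10)
    y <- sin(x) + rnorm(300, sd = 0.3)

    # natural cubic spline with 4 degrees of freedom;
    # knots are placed automatically at quantiles of x
    fit <- lm(y ~ ns(x, df = 4))

    plot(x, y)
    lines(sort(x), fitted(fit)[order(x)], col = "red", lwd = 2)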
R (software)
R is a free and open-source software environment for statistical computing and graphics. It includes functions for data analysis, modeling, and visualization, and is extendable through packages. R is widely used by statisticians, data scientists, and researchers.
Keywords used in Bach 2020
Regression
Regression analysis is a statistical method used to investigate the relationship between a dependent variable and one or more independent variables. It can be used to model and predict outcomes based on input variables, and to identify which variables have the strongest impact on the outcome of interest.
Regression, Linear
Linear regression is a statistical method used to model the relationship between a continuous dependent variable and one or more independent variables, which can be of any type. A linear function of the independent variables is used to model the expected value of the dependent variable. Estimating the model means estimating the unknown coefficients of the independent variables. A constant (intercept) term is included so that the errors (differences between predicted and observed values of the dependent variable, also called residuals) average to zero. Usually, but not necessarily, the coefficients are estimated by minimizing the sum of squared errors. Linear regression is commonly used for prediction and inference and is a fundamental tool in statistical analysis.
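A minimal R sketch with simulated data and hypothetical variable names (age, weight, y):

    # simulated example data
    set.seed(3)
    dat <- data.frame(age = rnorm(100, 50, 10), weight = rnorm(100, 75, 12))
    dat$y <- 1 + 0.5 * dat$age - 0.2 * dat$weight + rnorm(100, sd = 2)

    # least-squares fit of a multivariable linear model (intercept included by default)
    fit <- lm(y ~ age + weight, data = dat)
    summary(fit)   # coefficients, standard errors, residual summary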
Regression, Logistic
Logistic regression is a statistical method used to model the relationship between a binary dependent variable and one or more independent variables, which can be of any type. Unlike linear regression, the outcome variable in logistic regression is not continuous, but rather represents a binary outcome (e.g., success or failure). A linear function of the independent variables is used to model the log odds of one of the levels of the dependent variable. Estimating the model means estimating the unknown coefficients of the independent variables. A constant (intercept) term is included so that, on average, the predicted probabilities match the observed frequency of the outcome. The coefficients are estimated by the maximum likelihood principle, which maximizes the probability of the observed data under the model. Logistic regression is commonly used in classification problems, such as predicting whether a patient has a disease based on their symptoms.
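A minimal R sketch with simulated data (variable names are hypothetical):

    # simulated example: binary outcome depending on one continuous covariate
    set.seed(4)
    x <- rnorm(200)
    p <- plogis(-0.5 + 1.2 * x)       # true event probabilities
    y <- rbinom(200, size = 1, prob = p)

    # maximum likelihood fit; coefficients are on the log-odds scale
    fit <- glm(y ~ x, family = binomial)
    summary(fit)
    exp(coef(fit))                    # odds ratios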
Regression, Cox
Cox regression, also known as proportional hazards regression, is a statistical method used to model the relationship between the time to an event and one or more independent variables, which can be of any type. The outcome variable is a time-to-event variable that can be censored in some observations (meaning the event of interest has not yet occurred). A linear function of the independent variables is used to model the log relative hazard of an event. The model assumes that the ratio of hazards between any two groups of individuals is constant over time (the proportional hazards assumption), but this assumption can be relaxed by specifying time-dependent coefficients. The coefficients are estimated by maximizing a partial likelihood, which accounts for the censoring of the data. Cox regression is commonly used in survival analysis and is a fundamental tool in medical research and epidemiology.
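A minimal R sketch with simulated survival data (the survival package is assumed to be available; variable names are hypothetical):

    library(survival)

    # simulated example: exponential survival times with independent censoring
    set.seed(5)
    n      <- 200
    x      <- rnorm(n)
    time   <- rexp(n, rate = exp(0.7 * x))  # hazard increases with x
    cens   <- rexp(n, rate = 0.2)           # censoring times
    status <- as.numeric(time <= cens)      # 1 = event observed, 0 = censored
    obs    <- pmin(time, cens)

    # partial-likelihood fit of the proportional hazards model
    fit <- coxph(Surv(obs, status) ~ x)
    summary(fit)   # log hazard ratio and hazard ratio for x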
Regression, Poisson
Poisson regression is a statistical method used to model the relationship between a count-type dependent variable and one or more independent variables, which can be of any type. A linear function of the independent variables is used to model the log of the expected count of the dependent variable, so the expected count changes multiplicatively with the independent variables. The model assumes that, given the values of the independent variables, the counts are independent Poisson random variables. The coefficients of the model represent the change in the log expected count of the dependent variable associated with a one-unit change in the corresponding independent variable, holding all other variables constant. The coefficients are estimated by maximizing the likelihood of the model. Poisson regression is commonly used in fields such as epidemiology and finance to model count data.
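A minimal R sketch with simulated count data:

    # simulated example: counts whose log mean depends linearly on x
    set.seed(6)
    x  <- runif(300, 0, 2)
    mu <- exp(0.3 + 0.8 * x)          # expected count
    y  <- rpois(300, lambda = mu)

    # maximum likelihood fit with a log link
    fit <- glm(y ~ x, family = poisson)
    summary(fit)
    exp(coef(fit))                    # multiplicative effects (rate ratios)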
Multivariable
In statistical modeling, “multivariable” typically refers to models that include more than one independent variable (also known as predictor or explanatory variables) to predict a dependent variable. In contrast, “multivariate” refers to models that analyze multiple dependent variables at the same time, while “univariable” or “univariate” refers to models that only analyze one independent variable to predict a dependent variable.
Multivariate
In statistical modeling, “multivariate” typically refers to models that analyze multiple dependent variables at the same time. This is in contrast to “multivariable,” which refers to models that include more than one independent variable, and “univariate” or “univariable,” which refers to models that only analyze one independent variable to predict a dependent variable.
Regression modeling
See Regression
Keywords used in Sauerbrei 2020
Descriptive modeling
Descriptive modeling refers to a type of statistical modeling used to summarize a dataset and identify patterns, relationships, and trends. The purpose of descriptive models is to capture the main associations between a dependent variable and the independent variables in the data set. Typical tools are measures such as means, medians, and standard deviations of target variables conditional on the values of other variables, as well as associational quantities such as regression coefficients or nonlinear (smoothing) response functions. Descriptive models are useful for gaining insight into the underlying structure of the data, but they are not primarily intended for predicting future outcomes or for drawing causal conclusions.
Methods for variable selection
Variable selection is the process of identifying the most relevant independent variables for a given dependent variable in a statistical model. One commonly used method is stepwise regression, which involves sequentially adding or removing variables based on their statistical significance. Another method is regularization, which shrinks the coefficients of less important variables towards zero and can be implemented through techniques such as Lasso or the Elastic Net. Information criteria, such as AIC or BIC, can also be used to compare different models and select the best subset of variables. Finally, some machine learning algorithms, such as decision trees and random forests, automatically perform variable selection by choosing the most informative variables for predicting the outcome.
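A minimal R sketch of two of these approaches, backward stepwise selection by AIC and the lasso (assuming the glmnet package is installed; data and variable names are simulated and hypothetical):

    library(glmnet)

    # simulated data: 10 candidate covariates, only the first 3 are truly relevant
    set.seed(7)
    n <- 200; p <- 10
    X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
    y <- X[, 1] + 0.5 * X[, 2] - 0.5 * X[, 3] + rnorm(n)
    dat <- data.frame(y, X)

    # backward stepwise selection based on AIC
    full <- lm(y ~ ., data = dat)
    step_fit <- step(full, direction = "backward", trace = 0)

    # lasso: cross-validation chooses the penalty, some coefficients shrink to zero
    cv_fit <- cv.glmnet(X, y, alpha = 1)
    coef(cv_fit, s = "lambda.1se")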
Spline procedures
See Spline function
Fractional polynomials
Fractional polynomials are a method for modeling nonlinear relationships between a continuous independent variable and a dependent variable in regression analysis. They involve transforming the independent variable by raising it to powers chosen from a small predefined set that also contains negative and fractional values, allowing for more flexible modeling of nonlinear relationships. Specifically, fractional polynomials use a series of power transformations (e.g., x^-2, x^-1, x^0.5, x^1, x^2, with the power 0 conventionally denoting log(x)) to estimate the functional form of the relationship between the variables. The best fitting model is selected by comparing the fits of different fractional polynomial models, for example by likelihood-based tests or information criteria such as the Akaike Information Criterion or the Bayesian Information Criterion. Fractional polynomials are useful for modeling complex nonlinear relationships between variables in a data-driven way.
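A minimal R sketch, assuming the mfp package (an implementation of the fractional polynomial procedure) is installed; the data are simulated and the variable names hypothetical:

    library(mfp)

    # simulated example: the true relationship involves a reciprocal term
    set.seed(8)
    x <- runif(300, 0.5, 5)
    y <- 3 / x + rnorm(300, sd = 0.5)
    dat <- data.frame(x, y)

    # fp() marks x as a candidate for fractional polynomial transformation;
    # the procedure selects powers from a small predefined set
    fit <- mfp(y ~ fp(x), family = gaussian, data = dat)
    summary(fit)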
Categorisation
Categorization, in the context of statistical modeling, is the process of grouping continuous independent variables into discrete categories based on a pre-specified set of cut-off points or thresholds. This is done in order to simplify the modeling process and to allow for the use of categorical variables in the model. Categorization can be used for a variety of reasons, including to reduce the effect of outliers or to capture nonlinear relationships between the independent and dependent variables. However, categorization can also lead to loss of information and can make the model less flexible. As such, the decision to categorize continuous variables should be based on careful consideration of the research question, the data, and the assumptions of the model.
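A minimal R sketch showing categorization of a simulated age variable with illustrative cut-off points:

    # simulated example: age in years
    set.seed(9)
    age <- round(runif(100, 18, 90))

    # categorize into pre-specified groups with cut(); cut-offs and labels are illustrative
    age_group <- cut(age,
                     breaks = c(18, 40, 65, Inf),
                     labels = c("18-39", "40-64", "65+"),
                     right  = FALSE)
    table(age_group)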
Bias
In statistical modeling, bias refers to the systematic error or deviation of a model’s predicted values from the true values. Bias can arise for a variety of reasons, such as using an incorrect functional form for the model, selecting the wrong variables, or omitting important variables from the model. Bias can lead to underestimation or overestimation of the true relationship between the variables, and can result in inaccurate or misleading predictions. It is important to minimize bias in statistical modeling in order to obtain valid and reliable results. This can be achieved by using appropriate modeling techniques, careful variable selection, and checking the assumptions of the model. Collecting better quality data and applying statistical methods that correct for bias when it is detected can also help; unlike variance, bias caused by model misspecification does not vanish with larger sample sizes.
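As a small illustration, the following R sketch simulates the bias that arises when a relevant variable is omitted from a linear regression model (all numbers are illustrative):

    # illustration of bias from omitting a relevant variable (omitted-variable bias)
    set.seed(10)
    est_full <- est_omit <- numeric(1000)
    for (i in 1:1000) {
      z <- rnorm(200)
      x <- 0.7 * z + rnorm(200)          # x and z are correlated
      y <- x + z + rnorm(200)            # true coefficient of x is 1
      est_full[i] <- coef(lm(y ~ x + z))["x"]   # correctly specified model
      est_omit[i] <- coef(lm(y ~ x))["x"]       # z omitted
    }
    mean(est_full)   # close to 1 (approximately unbiased)
    mean(est_omit)   # systematically larger than 1 (biased)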
Shrinkage
In statistical modeling, shrinkage refers to the process of reducing the variability of model coefficients by constraining them towards a prior expectation or estimate. This is often done to avoid overfitting, which occurs when a model is too complex and fits the noise in the data rather than the underlying patterns or relationships. Shrinkage can be achieved through various techniques, such as ridge regression, lasso regression, and Bayesian modeling. These methods penalize large coefficients and encourage them to take on smaller values, thereby reducing the variance of the estimates. Shrinkage can improve the accuracy and generalizability of the model by reducing the effects of noise and improving the model’s ability to make predictions on new data.
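A minimal R sketch of ridge regression as one shrinkage technique (assuming the glmnet package is installed; data are simulated):

    library(glmnet)

    # simulated data with many covariates relative to the sample size
    set.seed(11)
    n <- 50; p <- 20
    X <- matrix(rnorm(n * p), n, p)
    y <- X[, 1] + 0.5 * X[, 2] + rnorm(n)

    # ridge regression (alpha = 0): the penalty shrinks all coefficients towards zero;
    # cross-validation selects the amount of shrinkage (lambda)
    ridge <- cv.glmnet(X, y, alpha = 0)

    # compare unpenalized least-squares coefficients with the shrunken ones
    cbind(ols   = coef(lm(y ~ X)),
          ridge = as.numeric(coef(ridge, s = "lambda.min")))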
Empirical evidence
In the context of the evaluation of statistical methods, empirical evidence refers to the results of experiments or analyses that provide insight into the performance of a given statistical method. Empirical evidence is based on the observed behavior of a method when applied to real-world data, and is used to determine whether the method is suitable for a particular task or problem.
Empirical evidence can be obtained through various means, such as simulation studies, cross-validation, or experiments using real-world data. These studies are designed to assess the performance of a statistical method with respect to its accuracy, robustness, speed, and scalability.
Empirical evidence is crucial in evaluating statistical methods because it provides a basis for determining their strengths and weaknesses, and for comparing their performance with that of other methods. The quality of the evidence depends on the design of the study, the choice of evaluation metrics, and the representativeness and relevance of the data.
Keywords used in Wallisch 2022
Statistical regression modeling
See Regression
Keywords used in Ullmann 2024
Model performance
Model performance describes how accurate the predictions of a model are; the term may also cover other aspects, such as model interpretability.
Simulation study
A simulation study is a research approach in which artificial datasets are generated under known conditions (using predefined statistical models and parameters) to systematically evaluate the performance of statistical methods. This allows researchers to assess how well different methods perform across a range of scenarios—such as varying sample sizes, noise levels, or model assumptions—when the “true” data-generating process is known. It’s especially useful for comparing methods in terms of accuracy, bias, variance, and robustness.
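A minimal R sketch of such a study, assessing the empirical bias and confidence interval coverage of the least-squares estimate under a known data-generating process:

    set.seed(12)
    n_sim <- 1000
    beta  <- 0.5                      # true coefficient
    est   <- cover <- numeric(n_sim)

    for (i in 1:n_sim) {
      x   <- rnorm(100)
      y   <- beta * x + rnorm(100)
      fit <- lm(y ~ x)
      est[i]   <- coef(fit)["x"]
      ci       <- confint(fit)["x", ]
      cover[i] <- ci[1] <= beta && beta <= ci[2]
    }

    mean(est) - beta   # empirical bias (should be close to 0)
    mean(cover)        # empirical coverage of the 95% CI (should be close to 0.95)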
Neutral comparison study
A neutral comparison study is a type of research specifically designed to compare statistical methods in an unbiased and objective manner. In such studies, the primary goal is not to promote a new method but to fairly evaluate the relative performance of existing methods using consistent criteria and transparent protocols. Neutrality is maintained by ensuring that the study is not conducted by the developers of the methods being evaluated (or at least includes independent collaborators), and by using standardized, realistic datasets or simulations, along with clear, pre-specified performance metrics. These studies play a key role in evidence-based method selection and promote reproducibility and methodological rigor.
Keywords used in Heinze 2024
Initial data analysis
The main aim of initial data analysis (IDA) is to provide reliable knowledge about the data to enable responsible statistical analyses and interpretation. IDA has the following phases: (1) metadata setup; (2) data cleaning; (3) data screening; (4) initial data reporting; (5) refining and updating the research analysis plan; and (6) documenting and reporting IDA. IDA is aligned with the research aims and the statistical analysis plan and does not include hypothesis-generating activities.
Source: Schmidt CO, Vach W, le Cessie S, Huebner M. STRATOS: Introducing the Initial Data Analysis Topic Group (TG3). Biometric Bulletin 2018; 35 (2): 10-11.
IDA framework
A structured, principled approach to preparing and understanding data before formal statistical modeling begins. The IDA framework emphasizes systematic exploration and documentation of key aspects of the data, including data quality, distributions, missing values, outliers, and the plausibility of variable values. It also supports informed decisions about data cleaning, transformations, and the selection of analysis populations. In observational studies, the IDA framework helps ensure that subsequent analyses are based on a well-understood and appropriately prepared dataset, enhancing the credibility and transparency of research findings.
Source: Huebner M, le Cessie S, Schmidt CO, Vach W. A contemporary conceptual framework for initial data analysis. Observational Studies 2018; 4: 171-192.
Data screening
Data screening consists of reviewing and documenting the properties and quality of the data that may affect future analysis and interpretation.
Source: Huebner M, le Cessie S, Schmidt CO, Vach W. A contemporary conceptual framework for initial data analysis. Observational Studies 2018; 4: 171-192.
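A minimal R sketch of some typical data screening steps, applied to a small simulated data frame with hypothetical variables:

    # simulated data frame with some missing values
    set.seed(13)
    dat <- data.frame(age    = c(rnorm(95, 50, 10), rep(NA, 5)),
                      weight = rnorm(100, 75, 12),
                      sex    = sample(c("f", "m"), 100, replace = TRUE))

    str(dat)              # variable types and dimensions
    summary(dat)          # distributions, ranges, implausible values
    colSums(is.na(dat))   # number of missing values per variable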
Reporting
The process of transparently and comprehensively documenting how an observational study was designed, conducted, analyzed, and interpreted. Good reporting ensures that all relevant details—such as data sources, inclusion criteria, variable definitions, statistical methods, handling of missing data, and sensitivity analyses—are clearly communicated. This enables readers to assess the validity, reproducibility, and generalizability of the findings. Adherence to established reporting guidelines (e.g., STROBE) is key to promoting clarity, transparency, and trust in observational research.
Functional form
See Functional form of continuous covariates
Variable transformation
The process of applying mathematical functions to variables before including them in a regression model, with the aim of improving model fit, meeting statistical assumptions (e.g., linearity, normality), or enhancing interpretability. Common transformations include logarithmic, square root, and standardization (e.g., z-scores). In observational studies, variable transformation can help address issues such as skewed distributions, outliers, or non-linear relationships between predictors and outcomes. Careful documentation and justification of transformations are essential for transparency and reproducibility.
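A minimal R sketch illustrating a logarithmic transformation and standardization of a simulated, right-skewed covariate (variable names are hypothetical):

    # simulated example: a right-skewed covariate
    set.seed(14)
    income <- rlnorm(200, meanlog = 10, sdlog = 0.8)
    y      <- 0.5 * log(income) + rnorm(200)
    dat    <- data.frame(y, income)

    dat$log_income <- log(dat$income)             # logarithmic transformation
    dat$income_z   <- as.numeric(scale(dat$income))  # standardization (z-score)

    fit <- lm(y ~ log_income, data = dat)         # model using the transformed variable
    summary(fit)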