
RELEX: An Excel-based software tool for sampling split-half reliability coefficients

Alexander Steinke*, Bruno Kopp

Department of Neurology, Hannover Medical School, Carl-Neuberg-Straße 1, 30625, Hannover, Germany

ARTICLE INFO

Keywords: Reliability; Split-half reliability; Consistency; Sampling; Excel

ABSTRACT

Split-half reliability provides a method for estimating inter-item reliability of measures obtained from repeatedly administered items (or trials). We introduce RELEX, a freely available, Microsoft Excel-based software tool that randomly assembles test halves, yielding a sampling-based distribution of split-half reliability coefficients. Estimates of parallel, tau-equivalent, and congeneric split-half reliability are reported. RELEX offers a graphical user interface, a freely selectable number of sampled splits, and easy handling of missing values and variable trial or item numbers. Results are summarized in the form of a histogram, along with measures of its central tendency and uncertainty.

1. Introduction

Reliability is an essential psychometric characteristic of all behavioral measures (Cho, 2016; Nunnally and Bernstein, 1994; Parsons et al., 2018; Revelle and Condon, 2018). Indices of reliability indicate the consistency between rank orders of individuals when, for example, tests are cut into halves (Hedge et al., 2018). According to the most recent standards for psychological testing (AERA, APA and NCME, 2014), indices of reliability refer to measures obtained in the particular samples under study, rather than to the measures in general. Hence, reliability should not be considered a property of a measure that is invariant across samples. These psychometric standards, however, are still widely unrecognized. An integral part of all behavioral sciences therefore includes reporting reliability estimates for the considered measures, whenever possible, from the practitioner's or researcher's sample or samples (Appelbaum et al., 2018).

Cronbach (1957) commented on the historically evolved divide of scientific psychology into correlational and experimental psychology. Safeguarding appropriate reliability is relatively well-established in correlational psychology, particularly in psychological testing, which is inherently concerned with the assessment of individual differences (Hedge et al., 2018; Rouder and Haaf, 2019). Experimental psychology, on the other hand, is traditionally much less interested in the assessment of individual differences. Experimental psychology often focuses on sample-size requirements for minimizing sampling error, with much lower weight devoted to the reliability of its measures (Kolossa and Kopp, 2018).

There is increasing interest in bridging correlational and experimental psychology in terms of utilizing experimental tasks for studies of individual differences (Hedge et al., 2018). Well-established experimental tasks provide well-replicable measures of cognitive processes (Rouder and Haaf, 2019). However, the robust detection of the effects of experimentally manipulated variables is often achieved by minimizing unwanted sources of variance, to which inter-individual variance may contribute. As a consequence, many measures gained from typical experimental studies show relatively low reliability, a fact that often remains unrecognized. Consequently, conclusions from correlational relationships are less consistently replicable (Hedge et al., 2018). Taken together, the success of experimental tasks for studying individual differences depends on the reliability of the measures of interest, rendering routine reliability estimations inevitable (Hedge et al., 2018; Miller and Ulrich, 2013; Parsons et al., 2018; Rouder and Haaf, 2019).

* Corresponding author.
E-mail addresses: steinke.alexander@mh-hannover.de (A. Steinke), kopp.bruno@mh-hannover.de (B. Kopp).

Methods in Psychology 2 (2020) 100023. Contents lists available at ScienceDirect. Journal homepage: www.journals.elsevier.com/methods-in-psychology

https://doi.org/10.1016/j.metip.2020.100023

Received 17 October 2019; Received in revised form 19 February 2020; Accepted 27 April 2020; Available online 29 April 2020

2590-2601/© 2020 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

A fundamental distinction concerning reliability can be made between the inter-item reliability and the stability of a measure (Kopp et al., 2019). The latter aspect of reliability (stability) is typically assessed by test-retest reliability (Kopp et al., 2019), which is applicable when measures were repeatedly taken over time. A measure of inter-item reliability may be more adequate for other studies, such as typical experimental studies. Split-half reliability¹ represents a classical approach to inter-item reliability (Cho, 2016; Green et al., 2016; Kowalczyk and Grange, 2017; Parsons et al., 2018; Warrens, 2016), because split-half reliability estimates are applicable whenever a set of items or trials² (e.g., all items in a questionnaire or all trials on an experimental task) can be partitioned into two equally sized subsets of items or trials (Parsons et al., 2018). The sole requirement for compiling indices of split-half reliability is that the behavioral measure of interest was repeatedly assessed. For example, response times and/or response accuracy are assessed in multi-trial tasks, and psychological self-report measures typically include multi-item questionnaires. Commonly utilized subsets of trials (or items) are odd/even splits (i.e., trials are assigned to subsets by odd and even trial numbering) or first/second half splits (i.e., subsets containing the first half and the second half of all administered trials). Depending on the assumed measurement model (i.e., whether one assumes parallel, tau-equivalent, or congeneric test splits; a definition is given in the Method section; see also Cho, 2016; Warrens, 2016), an appropriate estimate of split-half reliability can be derived. For example, if the presence of parallel test halves can be assumed, an estimate of the task's split-half reliability can be computed by correlating individual scores that were obtained on each of the two subsets of trials, corrected for task length by the well-known Spearman-Brown ‘prophecy’ formula (Brown, 1910; Spearman, 1910). Other indices of split-half reliability might be obtained by the Flanagan-Rulon formula for assumed tau-equivalent test halves (Flanagan, 1937; Guttman, 1945; Mosier, 1941; Rulon, 1939; it is equal to coefficient alpha for test splits into halves), or the Angoff-Feldt coefficient for assumed congeneric test halves (Angoff, 1953; Feldt, 1975).

In most cases, these commonly utilized subsets (i.e., the odd/even or the first/second half split) for split-half reliability estimation represent just one possibility out of many applicable ways to split a set into two equally sized subsets. In fact, the number of possible splits increases strongly with the number of administered trials. For a task with n trials and subsets of length n/2, the number of applicable ways to split is given as a function of the binomial coefficient,

$$\frac{1}{2}\binom{n}{n/2} = \frac{n!}{2\left[(n/2)!\right]^{2}} \qquad (1)$$

For example, a set of n = 10 trials can be split in 126 ways into two equally sized subsets of n/2 = 5 trials, corresponding to 126 obtainable split-half reliability coefficients in this situation. However, a set of n = 100 trials results in more than 10^28 ways to split the set into equally sized halves, corresponding to more than 10^28 obtainable split-half reliability coefficients in this situation. Thus, split-half reliability estimates that are based on single splits (such as the commonly utilized odd/even and/or the first/second half split) are not exhaustively conclusive because their results represent point estimates of an (unknown) underlying distribution of (usually many) obtainable reliability coefficients (Kopp et al., 2019; Kowalczyk and Grange, 2017; Parsons et al., 2018).
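The counts quoted for Eq. (1) can be checked with a few lines of Python. This is an illustrative sketch only; the function name `number_of_splits` is made up here and is not part of RELEX.

```python
from math import comb


def number_of_splits(n_trials: int) -> int:
    """Number of distinct ways to split n_trials into two equally sized halves."""
    if n_trials <= 0 or n_trials % 2 != 0:
        raise ValueError("n_trials must be a positive even integer")
    # comb(n, n/2) counts every split twice (halves A/B and B/A), hence the division by 2.
    return comb(n_trials, n_trials // 2) // 2


print(number_of_splits(10))             # 126
print(number_of_splits(100) > 10**28)   # True
```

Python integers have arbitrary precision, so the count for n = 100 is exact rather than a floating-point approximation.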

Split-half reliability coefficients can be strongly variable, depending on the applied method for splitting trials into halves. For example, consider the distinction between odd/even and first/second half splits. Task performance may be subject to long-term trends, such as various forms of learning or fatigue (Kopp et al., 2019; Kowalczyk and Grange, 2017). Individuals may differ in their susceptibility to these long-term trends, introducing inter-individual variability, which exerts its effects mainly on those splits that compare performance on temporally separated splits (such as first/second half splits), whereas splits in close temporal proximity (such as odd/even splits) would be less vulnerable to these long-term effects. The Appendix provides an illustrative example of how split-half reliability estimation is affected by the presence of long-term trends that differ between individuals. With these considerations in mind, it comes as no surprise that the split-half reliability estimates (i.e., the Angoff-Feldt coefficient in this example) that we obtained from the simulated data were lower when based on the first/second half split (ρSC = 0.82) compared to the odd/even split (ρSC = 0.92). The example illustrates potential effects exerted by arbitrary split choices on reliability estimates. The question then is how representative estimates of split-half reliability can be gained.

One promising approach toward an answer to this question is to randomly sample splits from the potentially huge set of all possible splits (we refer to this method as reliability sampling throughout our article). The resulting sample of split-half reliability coefficients approximates the distribution of all obtainable coefficients (Cooper et al., 2017; Enock et al., 2014; Kopp et al., 2019; MacLeod et al., 2010; Meule et al., 2019; Parsons et al., 2018; Revelle and Condon, 2018). Representative reliability estimates are provided by measures of central tendency of the sampling distribution (e.g., its mean or median), whereas the width of the sampling distribution indicates the remaining uncertainty. For example, reliability sampling of the simulated data (Appendix) shows that split-half reliabilities of the odd/even and the first/second half splits were over- and underestimating the representative split-half reliability, respectively. 95% of sampled reliability coefficients lay between ρSC = 0.78 and ρSC = 0.90, with a median of ρSC = 0.85 (see Fig. 1).

Until now, reliability sampling has not been easily applicable. Reliability sampling is available in the form of freely available R packages (e.g., splithalf: Parsons, 2017; psych: Revelle, 2018; and multicon: Sherman, 2015). However, many practitioners and researchers may not be sufficiently experienced in utilizing R, nor may they have the time and/or knowledge to develop their own software solutions. Thus, there is no widely applicable software that enables practitioners and researchers to calculate and report reliability estimates that were obtained from reliability sampling. In addition, these R packages exclusively allow computing split-half reliability coefficients for parallel test halves, which represents an assumption that will rarely be met. We close this gap by introducing RELEX, an easy-to-use and freely available Microsoft-Excel-based program for reliability sampling, which we describe in detail below.

Fig. 1. A showcase comparison of split-half reliability coefficients obtained from commonly utilized splits (left vertical blue line: first/second test half splitting; right vertical blue line: odd/even splitting) and from reliability sampling. Representative reliability estimates from reliability sampling are provided by measures of central tendency (i.e., the median; vertical red line) of the sampling distribution (histogram). The width of the sampling distribution indicates the uncertainty (i.e., the horizontal red line indicating the interval that contains 95% of the sampled reliability coefficients). Reliability sampling was done using 10,000 iterations; ρSC: split-half congeneric reliability (i.e., Angoff-Feldt coefficient). (For interpretation of the references to colour in this figure legend, the reader is referred to the Web version of this article.)

¹ We follow the nomenclature of reliability as suggested by Cho (2016). Following from that, we use the term split-half reliability for any measure of inter-item reliability that is based on splitting a test into halves, which includes, but is not limited to, coefficients obtained by the Spearman-Brown formula.

² Note that the terms item, which is frequently used to refer to an element in a questionnaire, and trial, which is frequently used to refer to an element on an experimental paradigm, are used interchangeably.


2. Description of RELEX

2.1. Implementation

RELEX can be downloaded free of charge from https://osf.io/qu9jg/. The software tool is implemented in Microsoft Excel using Visual Basic for Applications. The program runs on all compatible versions of Microsoft Excel that provide support for Visual Basic for Applications. In line with other authors of freely available Microsoft Excel-based programs (e.g., Barnette, 2005; Cho, 2016; Houghton and Grange, 2011), we chose Microsoft Excel in order to facilitate the accessibility of RELEX. Most practitioners and researchers have access to Microsoft Excel and are experienced with its use. Thus, RELEX requires minimal training and can be used immediately after its download. In addition, the program can be easily adapted for subsequent analysis using Microsoft Excel's spreadsheet interface.

2.2. Starting RELEX

RELEX is started like any other Microsoft Excel file by double-clicking on the file's icon. As the program incorporates macros based on Visual Basic for Applications, a message will appear asking to enable its content when the file is started. To use RELEX, make sure to enable the program's content (i.e., macros). Note that the message can differ between Microsoft Excel versions and operating systems.

The program starts with three worksheets named “Master”, “Data” and “Output”. The Master worksheet contains the settings and shows the results (Fig. 2). Reliability sampling is started on this worksheet. The Data worksheet contains the data that is processed by reliability sampling (Fig. 3). The use of the Output worksheet is optional (Fig. 5). It contains the final parallel, tau-equivalent, and congeneric split-half reliability samples, which can be analyzed further.

2.3. Master worksheet

The Master worksheet appears with two sections named “Settings” and “Results” (see Fig. 2). In the Settings section (located on the left side), the following preferences must be specified prior to the start of reliability sampling:

Number of subjects. The number of subjects has to be provided as a positive numerical value. Note that the number of subjects corresponds to the number of rows that contain non-empty cells in the Data worksheet.

Maximum number of trials. The program can process equal numbers of trials between subjects as well as different numbers of trials between subjects. For both cases, the maximum number of trials has to be defined manually as a positive numerical value. For example, in Fig. 3b, there is a variable trial number between subjects, but the maximum number of trials is 6. Note that the maximum number of trials corresponds to the maximum number of columns that contain non-empty cells in the Data worksheet.

Number of iterations. The number of iterations defines the number of split-half reliability coefficients that are generated by reliability sampling. The number of iterations must be provided as a positive numerical value. If there is a small number of trials, the number of iterations can exceed the number of possible splits, which results in a warning message.

Missing values. If single-trial data are missing, the respective cell in the Data worksheet can be filled with the missing value. The default missing value is “NA”, but any numerical value or string is valid as a missing value. Note that the missing value has to be defined even if there are no missing values in the data. For a detailed discussion of the handling of missing data, see the section Data worksheet.

If any information in the Settings section is missing, an error message appears that gives details about the missing information. Pressing the “Reset Data” button deletes all entries in the Data worksheet. Pressing the “Reset Results” button deletes all entries in the Results section in the Master worksheet as well as in the Output worksheet. The Results section in the Master worksheet (located on the right side) displays summary statistics and a histogram of one of the sampled split-half reliability estimates. The presented split-half reliability estimate can be selected by option buttons located on the right side. Further details of the Results section are discussed later.

Fig. 2. Interface of the Master worksheet. In the Settings section (located on the left side; shown in light blue), users must provide additional information on the data; i.e., the number of subjects, the maximum number of trials, the number of iterations, and the definition of missing values. In this example, reliability sampling was run on 30 subjects, 16 trials, and 10,000 iterations. Missing values were defined as “NA”. Results are displayed on the right side (shown in light green) of the interface including summary statistics of the three computed split-half reliability estimates and a histogram of relative frequencies of a selected sample of split-half reliability coefficients. Note that when opened for the first time, the number of subjects and the maximum number of trials must be filled in order for reliability sampling to begin. (For interpretation of the references to colour in this figure legend, the reader is referred to the Web version of this article.)


2.4. Data worksheet

The data to be processed by reliability sampling must be entered in the Data worksheet, which can be done by copy and paste, e.g. from Microsoft Excel or SPSS, or by entering data manually. Valid entries are all numerical values as well as the missing value defined in the Master worksheet. The data must be entered with row-wise arranged subjects and column-wise arranged trials, i.e., a single cell in the Data worksheet contains the value for a trial of a specific subject (for an example, see Fig. 3).

In many studies, the number of trials to be analyzed differs between subjects. Reasons might be that the number of trials was not constant between subjects a priori, or that single trials were excluded prior to analysis because they followed an error or because response times were too fast or too slow. RELEX provides two ways of handling varying trial numbers. First, trials with missing data can be entered as missing values (Fig. 3a). Thereby, all subjects contribute a constant number of trials, allowing the reliability-sampling algorithm on any iteration to apply the same random test split to all subjects. Second, for each subject, only as many cells in the Data worksheet are filled as there are trials containing data (Fig. 3b). In this case, for any subject and iteration, a unique test split will be applied. For handling of missing data, we recommend applying the latter procedure whenever possible, as it ensures that reliability estimation will not be distorted by unbalanced task splits. See Method for further details.
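The difference between the two data layouts can be sketched in Python as follows. This is an illustration only, not RELEX code: the data values are made up, `"NA"` is the default missing-value marker named in the text, and the helper name is hypothetical.

```python
MISSING = "NA"  # default missing-value marker mentioned in the text


def drop_missing_and_shift_left(rows, missing=MISSING):
    """Convert NA-filled rows (first layout) into left-aligned rows (second layout)."""
    return [[value for value in row if value != missing] for row in rows]


# Two hypothetical subjects with six trial slots each (values are invented).
layout_a = [
    [512, "NA", 498, 530, "NA", 471],
    [455, 467, "NA", 490, 502, 488],
]
layout_b = drop_missing_and_shift_left(layout_a)

print([len(row) for row in layout_b])  # [4, 5]
```

In the second layout each subject keeps only valid trials, so a split drawn per subject always yields halves of equal length.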

2.5. Running the analysis

Reliability sampling is started by pressing the “Go!” button in the Master worksheet. After starting reliability sampling, the progress is presented in the status bar (located at the bottom of the worksheet). When reliability sampling is finished, summary statistics and a histogram are presented in the Master worksheet. The duration of the computation varies with the number of subjects, trials, and iterations. Additionally, if there are different numbers of trials between subjects, the computation time is prolonged, as for any iteration and subject a random split needs to be generated.

2.6. Method

RELEX repeatedly samples parallel, tau-equivalent, and congeneric split-half reliability coefficients from random splits. On any iteration i with i = (1, ..., I), trials are randomly assigned to test halves A and B. For any iteration i, test half A and B, and subject j with j = (1, ..., J), a sum score is computed. For split-half parallel reliability, which is appropriate under the assumption of parallel test halves (Cho, 2016), the Pearson correlation coefficient r(i) of subjects' sum scores between test halves Ai and Bi is calculated. In order to account for the reduced task length, r(i) is corrected by the Spearman-Brown formula (Brown, 1910; Spearman, 1910):

$$\rho_{SP}(i) = \frac{2\,r(i)}{1 + r(i)} \qquad (2)$$
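As a hedged illustration of the Spearman-Brown step, the following Python sketch correlates the sum scores of two test halves and applies the correction. The sum scores are invented, and the function names are illustrative rather than taken from RELEX.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of sum scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r):
    """Correct a half-test correlation for full test length."""
    return 2 * r / (1 + r)


# Invented sum scores of five subjects on test halves A and B.
half_a = [3, 5, 2, 6, 4]
half_b = [4, 5, 1, 6, 3]

r = pearson_r(half_a, half_b)
rho_sp = spearman_brown(r)
print(round(r, 3), round(rho_sp, 3))
```

For any positive r below 1, the corrected value is larger than r, reflecting that the full test is twice as long as each half.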

For split-half tau-equivalent reliability, which is appropriate under the assumption of tau-equivalent test halves, RELEX utilizes the Flanagan-Rulon formula (Flanagan, 1937; Guttman, 1945; Mosier, 1941; Rulon, 1939):

$$\rho_{ST}(i) = \frac{4\,\sigma_{AB}}{\sigma_X^2} \qquad (3)$$

where σAB denotes the covariance between subjects' sum scores on test halves A and B, and σ²X denotes the overall variance of subjects' sum scores on the task. Note that the Flanagan-Rulon formula is equivalent to coefficient alpha for task splits into two parts (Warrens, 2016).
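A minimal Python sketch of the Flanagan-Rulon computation is given below, using invented sum scores. It assumes sample (n − 1) variances and covariances; because the formula is a ratio, the same result is obtained with population (n) denominators as long as one convention is used throughout.

```python
def covariance(x, y):
    """Sample covariance; covariance(x, x) is the sample variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def flanagan_rulon(half_a, half_b):
    """Tau-equivalent split-half reliability: 4 * cov(A, B) / var(A + B)."""
    total = [a + b for a, b in zip(half_a, half_b)]
    return 4 * covariance(half_a, half_b) / covariance(total, total)


# Invented sum scores of five subjects on test halves A and B.
half_a = [3, 5, 2, 6, 4]
half_b = [4, 5, 1, 6, 3]
rho_st = flanagan_rulon(half_a, half_b)
print(round(rho_st, 3))
```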

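Lastly, if a congeneric measurement model is assumed, split-half congeneric reliability is computed using the Angoff-Feldt coefficient (Angoff, 1953; Feldt, 1975) as:

$$\rho_{SC}(i) = \frac{4\,\sigma_{AB}}{\sigma_X^2 - \dfrac{\left(\sigma_A^2 - \sigma_B^2\right)^2}{\sigma_X^2}} \qquad (4)$$

where σ²A and σ²B denote the variances of subjects' sum scores on test halves A and B, respectively.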

We implemented the Angoff-Feldt coefficient, as it is appropriate whenever the number of trials may not be a good indicator of the relative importance of test halves (Warrens, 2016).
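The Angoff-Feldt coefficient can be sketched in Python along the same lines. The sum scores are invented and the function names are illustrative, not RELEX internals. When both halves have equal variance, the correction term vanishes and the value coincides with the Flanagan-Rulon estimate.

```python
def covariance(x, y):
    """Sample covariance; covariance(x, x) is the sample variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def angoff_feldt(half_a, half_b):
    """Congeneric split-half reliability (Angoff-Feldt coefficient)."""
    total = [a + b for a, b in zip(half_a, half_b)]
    var_a = covariance(half_a, half_a)
    var_b = covariance(half_b, half_b)
    var_x = covariance(total, total)
    cov_ab = covariance(half_a, half_b)
    # The denominator shrinks as the variances of the two halves diverge.
    return 4 * cov_ab / (var_x - (var_a - var_b) ** 2 / var_x)


# Invented sum scores of five subjects on test halves A and B.
half_a = [3, 5, 2, 6, 4]
half_b = [4, 5, 1, 6, 3]
rho_sc = angoff_feldt(half_a, half_b)
print(round(rho_sc, 3))
```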

The obtained split-half reliability coefficients of iteration i, ρSP(i), ρST(i), and ρSC(i), are written to the Output worksheet and the procedure is repeated for the next iteration until i = I. Finally, summary statistics are computed and presented in the Master worksheet. For an outline of the reliability sampling procedure, see Fig. 4.
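The loop described above can be sketched in Python for the case of a constant, even number of trials. This is an illustrative re-implementation under stated assumptions, not the VBA code of RELEX: the data are simulated, the seed and function names are arbitrary, and the warning conditions (e.g., zero variance) are not handled.

```python
import random
from math import sqrt


def cov(x, y):
    """Sample covariance; cov(x, x) is the sample variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def coefficients(sum_a, sum_b):
    """Return (rho_SP, rho_ST, rho_SC) for one split."""
    total = [a + b for a, b in zip(sum_a, sum_b)]
    var_a, var_b, var_x = cov(sum_a, sum_a), cov(sum_b, sum_b), cov(total, total)
    cov_ab = cov(sum_a, sum_b)
    r = cov_ab / sqrt(var_a * var_b)
    rho_sp = 2 * r / (1 + r)                                      # Spearman-Brown
    rho_st = 4 * cov_ab / var_x                                   # Flanagan-Rulon
    rho_sc = 4 * cov_ab / (var_x - (var_a - var_b) ** 2 / var_x)  # Angoff-Feldt
    return rho_sp, rho_st, rho_sc


def sample_reliability(data, iterations, seed=0):
    """data: one list of trial scores per subject, all of the same even length."""
    rng = random.Random(seed)
    n_trials = len(data[0])
    samples = []
    for _ in range(iterations):
        order = list(range(n_trials))
        rng.shuffle(order)  # one random split, applied to all subjects
        idx_a, idx_b = order[: n_trials // 2], order[n_trials // 2:]
        sum_a = [sum(row[t] for t in idx_a) for row in data]
        sum_b = [sum(row[t] for t in idx_b) for row in data]
        samples.append(coefficients(sum_a, sum_b))
    return samples


# Simulated data: 20 subjects x 8 trials, a subject-specific level plus noise.
data_rng = random.Random(1)
data = []
for _ in range(20):
    level = data_rng.gauss(0, 1)
    data.append([level + data_rng.gauss(0, 0.5) for _ in range(8)])

samples = sample_reliability(data, iterations=200)
print(len(samples), samples[0])
```

Each element of `samples` corresponds to one row of coefficients, analogous to one line in the Output worksheet.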

The assumed measurement model determines which split-half reliability estimate is appropriate for the data that are under consideration (Cho, 2016). The least restrictive measurement model, i.e., the congeneric model, is based on the assumption that manifest variables (i.e., subjects' sum scores on any test half) have a common latent variable and errors that are random and independent of each other. The tau-equivalent measurement model additionally assumes that all factor loadings of manifest variables are equal. The most restrictive measurement model, i.e., the parallel model, is the tau-equivalent model with the additional assumption of equal error variances. See Cho (2016) and Warrens (2016) for a more detailed discussion.

Fig. 3. The Data worksheet. Data must be entered as a matrix with columns containing trials and rows containing subjects. Valid entries are any numerical values as well as a string (or numerical value) for missing data. In this example, data from 10 subjects is provided with missing data for some trials. Thus, the number of trials containing data varies between subjects (e.g., four trials for subject one and five trials for subject two). Varying numbers of trials can be handled in two ways: a. Trials with missing data can be entered by filling in missing values (here, “NA” serves as indicator of missing values). Thereby, each subject provides entries for a constant number of columns (here, six columns are filled for any subject), allowing the reliability sampling algorithm on any iteration to apply the same random test split to all subjects. b. Alternatively, trials with missing values can be excluded from the Data worksheet by entering only trials that contain (valid) data, resulting in variable numbers of filled columns between subjects. For example, in 3b, the same data as in 3a is presented. However, it is rearranged by excluding trials with missing data and shifting subsequent trials to the left. Here, for any subject and iteration, a unique test split will be applied. Note that this variant is also appropriate if trial numbers differed a priori between subjects. See Method for details.

Prior to reliability sampling, the individual number of trials is computed. For each subject, the program selects the cell in the Data worksheet that corresponds to the maximum number of trials as defined in the Master worksheet. If this cell is empty, the algorithm selects the cell to its left. This procedure is repeated until a non-empty cell is selected. The first non-empty cell defines the number of trials for that subject. For example, as shown in Fig. 3b, the maximum number of trials is 6. For subject one, trials 6 and 5 are empty but trial 4 is not. Therefore, the number of trials for subject one is identified as 4. If the number of trials is odd, the task cannot be split into halves of equal length. In this case, the program randomly excludes one trial on any iteration i. The remaining trials, which are of even number, are randomly assigned to test halves.
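The right-to-left scan and the handling of odd trial numbers can be sketched in Python as follows. The sketch assumes that empty cells are represented by `None`; the data values and function names are illustrative and not taken from RELEX.

```python
import random


def count_trials(row, max_trials):
    """Scan leftwards from the max_trials-th cell until a non-empty cell is found."""
    for position in range(min(max_trials, len(row)), 0, -1):
        if row[position - 1] is not None:
            return position
    return 0


def random_halves(n_trials, rng):
    """Randomly assign trial indices to two halves of equal length."""
    trials = list(range(n_trials))
    rng.shuffle(trials)
    if n_trials % 2 == 1:
        trials.pop()  # odd number: drop one trial (random, since the list is shuffled)
    half = len(trials) // 2
    return trials[:half], trials[half:]


# Hypothetical subject with four valid trials and two empty cells.
subject_one = [512, 498, 530, 471, None, None]
print(count_trials(subject_one, max_trials=6))  # 4

half_a, half_b = random_halves(5, random.Random(0))
print(len(half_a), len(half_b))  # 2 2
```

Because the shuffle is repeated on every call, a different trial is excluded from iteration to iteration when the trial number is odd.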

Reliability sampling can follow two variations depending on constant or variable trial numbers between subjects. First, if all subjects provide data for the same number of trials, on any iteration i, the same random split is applied to all subjects. Second, if the number of trials differs between subjects, on any iteration i, a random split is generated for each subject. We implemented processing of constant and variable trial numbers in RELEX, as the application of the same random split to subjects with differing numbers of trials would result in unbalanced test halves. For example, if test halves in Fig. 3b are generated by splitting the maximum of 6 trials into a first (trials 1 to 3) and second half (trials 4 to 6), subject one would provide 3 trials for the first task half but only 1 trial for the second task half. In contrast, subject-level splitting generates subsets of equal length (for subject one, trials 1–2 and trials 3–4). See Fig. 4 for an outline of variations of reliability sampling.
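The imbalance described in this example can be made explicit with a small Python sketch that counts how many trials a subject contributes to each half under a common first/second split versus a subject-level split. The function names are hypothetical.

```python
def common_split_counts(n_trials_subject, max_trials):
    """Trials per half when the split is defined on the maximum trial number."""
    half = max_trials // 2
    first = min(n_trials_subject, half)
    second = max(n_trials_subject - half, 0)
    return first, second


def subject_level_counts(n_trials_subject):
    """Trials per half when the split is defined on the subject's own trial number."""
    half = n_trials_subject // 2
    return half, half


print(common_split_counts(4, max_trials=6))  # (3, 1)
print(subject_level_counts(4))               # (2, 2)
```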

There are three conditions in which it is impossible or not appropriate to calculate split-half reliability coefficients. If one of these conditions is met, RELEX presents a warning with further information. 1) If at least one task half does not contain trials, no split-half reliability coefficient can be computed. This might be the case if, for at least one subject, the number of missing values is equal to or higher than the length of a task half. 2) If sum scores are equal for all subjects on a task half or the complete task, no split-half reliability coefficient can be computed, as there is no variance between subjects. For example, sum scores might be equal between individuals on very easy or very hard tasks. 3) Split-half reliability coefficients might be negative, in which case they are not interpretable. If the final sample contains negative split-half reliability coefficients, try inverting trials that are negatively correlated with the sum score.

2.7. Results

Results are presented in the Results section of the Master worksheet. The obtained samples of parallel, tau-equivalent, and congeneric split-half reliability are summarized by the mean, the standard deviation (SD), the median, the minimum and maximum, and the 95% highest density interval (HDI). The 95% HDI gives the interval that contains 95% of the final sample of split-half reliability coefficients. The 95% HDI quantifies the uncertainty associated with split-half reliability estimation and allows for evaluation of the effect of various splits on split-half reliability coefficients. The wider the 95% HDI, the stronger the impact of splits on split-half reliability estimates. The Output worksheet contains all obtained split-half reliability coefficients from reliability sampling (Fig. 5). The availability of these reliability coefficients enables the computation of further summary statistics.

Fig. 4. An outline of the procedure for reliability sampling. Depending on constant or variable numbers of trials between subjects, reliability sampling follows one of two variations. a. Constant trial number: If all subjects provide the same number of trials, a single random split is generated on any iteration i (cf. resulting test halves Ai and Bi). b. Variable trial number: If the number of trials differs between subjects, for any iteration i and subject j, a separate random split is generated (cf. resulting test halves Aij and Bij). For further details, see Method. In both variations, on any iteration i, the overall set of trials is split into two halves of equal length according to the generated random split. For each subject j with j = (1, ..., J), the sum score is calculated separately for the two test halves, yielding two scores, Aij and Bij, for each subject. Three split-half reliability coefficients are calculated across all Aij and Bij pairings, which are parallel split-half reliability ρSP(i) (i.e., Spearman-Brown corrected Pearson correlation coefficients), tau-equivalent split-half reliability ρST(i) (i.e., Flanagan-Rulon formula; coefficient alpha), and congeneric split-half reliability ρSC(i) (i.e., Angoff-Feldt coefficient). The emergent coefficients are added to their respective sample ρSP, ρST, or ρSC. The procedure is repeated until all iterations are finished (i.e., i = I). Summary statistics are based on the final parallel, tau-equivalent, and congeneric split-half reliability samples.

Fig. 5. The Output worksheet. The raw parallel, tau-equivalent, and congeneric split-half reliability samples are saved in the Output worksheet. In this example, only split-half reliability coefficients of the first 10 iterations are presented.
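The summary statistics listed for the Results section could be recomputed from exported coefficients along the following lines. This Python sketch is an assumption-laden illustration: the text does not state how RELEX determines the 95% HDI, so the sketch uses one common definition (the shortest interval that contains 95% of the sampled coefficients), and the input values are invented.

```python
import statistics
from math import ceil


def hdi(samples, mass=0.95):
    """Shortest interval containing the given share of samples (assumed definition)."""
    ordered = sorted(samples)
    n = len(ordered)
    window = max(1, ceil(mass * n - 1e-9))  # number of samples inside the interval
    start = min(range(n - window + 1),
                key=lambda s: ordered[s + window - 1] - ordered[s])
    return ordered[start], ordered[start + window - 1]


def summarize(samples):
    """Mean, SD, median, minimum, maximum, and 95% HDI of a coefficient sample."""
    return {
        "mean": statistics.mean(samples),
        "sd": statistics.stdev(samples),
        "median": statistics.median(samples),
        "min": min(samples),
        "max": max(samples),
        "hdi95": hdi(samples),
    }


# Invented coefficient sample: one outlier at 0.0 and 99 values near 0.85.
coefficient_sample = [0.0] + [0.8 + i * 0.001 for i in range(99)]
summary = summarize(coefficient_sample)
print(summary["median"], summary["hdi95"])
```

In this invented sample the outlier at 0.0 falls outside the interval, which illustrates why the interval is less sensitive to extreme splits than the minimum and maximum.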

Any obtained sample of split-half reliability coefficients can be graphically summarized in a histogram (Fig. 2). The histogram can be used for quick diagnostics of the reliability coefficient sample. The x-axis shows the split-half reliability coefficient and the y-axis shows the relative frequency of the obtained reliability coefficients. The bin size for the histogram was set to 0.01, resulting in 100 bins that are 0–0.01; 0.01–0.02; …; 0.99–1. Note that negative reliability coefficients are excluded from the histogram (see Method for handling of negative reliability coefficients). Relative bin frequencies are calculated by dividing the absolute frequency of sampled reliability coefficients within that bin by the number of all iterations. The appearance of the histogram and further settings can be manipulated manually using Microsoft Excel's implemented graphic features. The height of a bar might exceed the default limit of 0.5, which prompts the display of a warning. In order to show the full histogram, the range of the y-axis can be adjusted manually.
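The binning rule can be sketched in Python as follows. The treatment of values that fall exactly on a bin edge is an assumption (bins are taken as left-closed, with 1.0 placed in the last bin); the text does not specify this detail, and the input values are invented.

```python
def relative_histogram(samples, bin_width=0.01):
    """Relative frequencies for bins 0-0.01, ..., 0.99-1; negative values are skipped."""
    n_bins = round(1 / bin_width)
    counts = [0] * n_bins
    for value in samples:
        if value < 0:
            continue  # negative coefficients are not shown in the histogram
        index = min(int(value / bin_width), n_bins - 1)  # 1.0 goes into the last bin
        counts[index] += 1
    # Divide by the number of all iterations, including the excluded negative ones.
    return [count / len(samples) for count in counts]


# Invented coefficient sample containing one negative value.
frequencies = relative_histogram([-0.2, 0.005, 0.855, 0.851, 1.0])
print(frequencies[0], frequencies[85], frequencies[99])
```

Because the divisor is the number of all iterations, the bars sum to less than 1 whenever negative coefficients were sampled.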

3. What to report

We recommend reporting the median of the reliability coefficient sample and the HDI as a measure of its uncertainty. For example, the results of the showcase split-half reliability sampling as given in the Introduction (see Fig. 1) might be reported as “Split-half reliability sampling using 10,000 iterations revealed a median reliability coefficient of ρSC = 0.85. 95% of the sampled reliability coefficients lay between ρSC = 0.78 and ρSC = 0.90”. However, researchers and practitioners might choose a descriptive statistic that is most appropriate for the studied research question. For example, if the reliability coefficient sample appears to be normally distributed, a 95% confidence interval around the mean can be reported, which is computed as 1.96 standard errors of the mean (i.e., the standard deviation divided by the square root of the number of iterations) around the mean. We further suggest that researchers and practitioners carefully consider which measurement model might be most appropriate for the data under consideration, and that they assume the least restrictive (i.e., congeneric) measurement model in case of uncertainty.
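The confidence interval around the mean described in this section can be computed as in the following Python sketch. The coefficient values are invented and the function name is illustrative.

```python
import statistics
from math import sqrt


def mean_ci95(samples):
    """Mean plus/minus 1.96 standard errors of the mean."""
    mean = statistics.mean(samples)
    standard_error = statistics.stdev(samples) / sqrt(len(samples))
    return mean - 1.96 * standard_error, mean + 1.96 * standard_error


# Invented sample of four reliability coefficients.
low, high = mean_ci95([0.80, 0.85, 0.90, 0.85])
print(round(low, 3), round(high, 3))
```

Note that the standard error shrinks with the square root of the number of iterations, so this interval becomes very narrow for large samples and describes the precision of the mean rather than the spread of the coefficients.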

4. Limitations

RELEX samples parallel, tau-equivalent, and congeneric split-half reliability. The assumed underlying measurement model determines which one is appropriate for the data under consideration. Procedures for testing measurement models were described by Cho (2016). However, these procedures can only be conducted when tests are split into three or more parts, hindering their utilization for split-half reliability estimation.

Please also note that all considered split-half reliability estimates are based on the assumption of a uni-dimensional latent structure of the data. For reliability estimation in case of a multi-dimensional latent structure, see Cho (2016).

5. Conclusion

The assessment of individual differences depends on reliability, rendering the estimation of reliability in corresponding studies essential (Hedge et al., 2018; Miller and Ulrich, 2013; Parsons et al., 2018; Rouder and Haaf, 2019). Split-half reliability estimates are appropriate for all measures obtained from repeatedly administered trials or items (Brown, 1910; Kowalczyk and Grange, 2017; Parsons et al., 2018; Spearman, 1910). However, the application of split-half reliability estimates comes with caveats. Indices of split-half reliability that are based on single test splits are merely point estimates drawn from a distribution of obtainable reliability coefficients. Conclusions from them remain questionable as long as the distribution of potential reliability coefficients remains unknown.

Precise estimates of split-half reliability can be computed by repeatedly sampling random test splits, which approximates the underlying distribution of all obtainable reliability coefficients (Cooper et al., 2017; Enock et al., 2014; Kopp et al., 2019; MacLeod et al., 2010; Parsons et al., 2018; Revelle and Condon, 2018). The width of the distribution quantifies the uncertainty associated with reliability estimation. Hence, reliability sampling allows practitioners and researchers to improve their confidence in a measure's reliability. Previously implemented reliability sampling procedures were not easily applicable, as their implementation was restricted to statistical programming languages, such as R (Parsons, 2017; Revelle, 2018; Sherman, 2015). In this study, we present a Microsoft Excel-based, easy-to-use software tool for reliability sampling. We hope that RELEX enables practitioners and researchers to compute precise indices of reliability routinely, which is recommended for all behavioral sciences (Parsons et al., 2018).
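The general idea of reliability sampling can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a port of RELEX: the function names are hypothetical, the data contain no missing values, test halves are scored as sums of trials, and only the Spearman-Brown coefficient (which corresponds to the parallel measurement model) is computed.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation of two equally long lists (non-zero variance assumed)."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def sample_split_half(data, n_iterations=1000, seed=0):
    """data: one list of trial scores per subject, all of equal length."""
    rng = random.Random(seed)            # fixed seed only for reproducibility
    n_trials = len(data[0])
    trials = list(range(n_trials))
    coefficients = []
    for _ in range(n_iterations):
        rng.shuffle(trials)              # random assignment of trials to halves
        first, second = trials[: n_trials // 2], trials[n_trials // 2:]
        half_a = [sum(row[t] for t in first) for row in data]
        half_b = [sum(row[t] for t in second) for row in data]
        r = pearson(half_a, half_b)
        coefficients.append(2 * r / (1 + r))   # Spearman-Brown step-up
    return coefficients
```

The returned list is the sample of reliability coefficients whose centre and width can then be summarized, for example by a median and an interval.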

Manifold split-half reliability indices have been proposed, such as the largest obtainable split-half reliability coefficient or the lowest obtainable split-half reliability coefficient (Cho, 2016; Hunt and Bentler, 2015; Revelle, 1979). Indisputably, each of these split-half reliability indices has advantages and disadvantages depending on the research question asked. However, a recurring problem is how to estimate these particular reliability indices (e.g., Hunt and Bentler, 2015). RELEX provides a straightforward approximation of the distribution of all obtainable reliability coefficients, allowing researchers and practitioners to study this split-half reliability sample in detail and, most importantly, to derive the split-half reliability index that is most appropriate for the research question.

Funding

This work was supported by a grant to BK from the Petermax-Müller-Foundation, Hannover, Germany.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

CRediT authorship contribution statement

Alexander Steinke: Conceptualization, Methodology, Software, Validation, Formal analysis, Data curation, Writing - original draft, Writing - review & editing, Visualization. Bruno Kopp: Conceptualization, Methodology, Resources, Writing - original draft, Writing - review & editing, Supervision, Project administration, Funding acquisition.

Appendix

We simulated single-trial data including individual long-term trends, i.e., individual scores increased or decreased systematically over trials. For $S = 30$ subjects and $T = 16$ trials, the single-trial data $y_{s,t}$ of subject $s$ on trial $t$ were simulated by applying the following equation,

$$y_{s,t} = a_s \cdot t + b_s + \varepsilon_{s,t} \tag{A.1}$$

Inter-individual variability was introduced via the intercept parameter $b_s$ and the slope parameter $a_s$. For each subject $s$, these parameters were randomly determined by drawing from normal distributions,

$$a_s \sim N(\mu = 0, \sigma = 1/T) \tag{A.2}$$

$$b_s \sim N(\mu = 0, \sigma = 1) \tag{A.3}$$

In addition, normally distributed error variance $\varepsilon_{s,t}$ was added to each trial,

$$\varepsilon_{s,t} \sim N(\mu = 0, \sigma = 2) \tag{A.4}$$

The simulated data can also be downloaded as the illustrative showcase for the application of the RELEX software from https://osf.io/qu9jg/.
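For readers who wish to generate comparable data themselves, Eqs. A.1 to A.4 can be sketched as follows. The function name `simulate_trials` is hypothetical, trials are assumed to be indexed from 1 to T, and the random seed is arbitrary, so the output will not reproduce the showcase data set provided online.

```python
import random


def simulate_trials(n_subjects=30, n_trials=16, seed=1):
    """Simulate single-trial data with subject-specific linear trends."""
    rng = random.Random(seed)                       # arbitrary seed (assumption)
    data = []
    for _ in range(n_subjects):
        slope = rng.gauss(0.0, 1.0 / n_trials)      # a_s, Eq. A.2
        intercept = rng.gauss(0.0, 1.0)             # b_s, Eq. A.3
        row = [
            slope * t + intercept + rng.gauss(0.0, 2.0)   # Eq. A.1 with error from Eq. A.4
            for t in range(1, n_trials + 1)         # assumption: t runs from 1 to T
        ]
        data.append(row)
    return data


data = simulate_trials()   # 30 subjects with 16 trials each
```

Each inner list holds the trial scores of one subject, which is a layout that can be passed directly to a split-half sampling routine.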

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.metip.2020.100023.

References

American Educational Research Association (AERA), American Psychological Association (APA), National Council on Measurement in Education (NCME), 2014. In: Standards for Educational and Psychological Testing. AERA, Washington, DC.

Angoff, W.H., 1953. Test reliability and effective test length. Psychometrika 18 (1), 1–14. https://doi.org/10.1007/BF02289023.

Appelbaum, M., Cooper, H., Kline, R.B., Mayo-Wilson, E., Nezu, A.M., Rao, S.M., 2018. Journal article reporting standards for quantitative research in psychology: the APA Publications and Communications Board task force report. Am. Psychol. 73 (1), 3–25. https://doi.org/10.1037/amp0000389.

Barnette, J.J., 2005. ScoreRel CI: an Excel program for computing confidence intervals for commonly used score reliability coefficients. Educ. Psychol. Meas. 65 (6), 980–983. https://doi.org/10.1177/0013164405278577.

Brown, W., 1910. Some experimental results in the correlation of mental abilities. Br. J. Psychol. 3 (3), 296–322. https://doi.org/10.1111/j.2044-8295.1910.tb00207.x.

Cho, E., 2016. Making reliability reliable: a systematic approach to reliability coefficients. Organ. Res. Methods 19 (4), 651–682. https://doi.org/10.1177/1094428116656239.

Cooper, S.R., Gonthier, C., Barch, D.M., Braver, T.S., 2017. The role of psychometrics in individual differences research in cognition: a case study of the AX-CPT. Front. Psychol. 8, 1482. https://doi.org/10.3389/fpsyg.2017.01482.

Cronbach, L.J., 1957. The two disciplines of scientific psychology. Am. Psychol. 12 (11), 671–684. https://doi.org/10.1037/h0043943.

Enock, P.M., Hofmann, S.G., McNally, R.J., 2014. Attention bias modification training via smartphone to reduce social anxiety: a randomized, controlled multi-session experiment. Cognit. Ther. Res. 38 (2), 200–216. https://doi.org/10.1007/s10608-014-9606-z.

Feldt, L.S., 1975. Estimation of the reliability of a test divided into two parts of unequal length. Psychometrika 40 (4), 557–561. https://doi.org/10.1007/BF02291556.

Flanagan, J.C., 1937. A proposed procedure for increasing the efficiency of objective tests. J. Educ. Psychol. 28 (1), 17–21. https://doi.org/10.1037/h0057430.

Green, S.B., Yang, Y., Alt, M., Brinkley, S., Gray, S., Hogan, T., Cowan, N., 2016. Use of internal consistency coefficients for estimating reliability of experimental task scores. Psychonomic Bull. Rev. 23 (3), 750–763. https://doi.org/10.3758/s13423-015-0968-3.

Guttman, L., 1945. A basis for analyzing test-retest reliability. Psychometrika 10 (4), 255–282. https://doi.org/10.1007/BF02288892.

Hedge, C., Powell, G., Sumner, P., 2018. The reliability paradox: why robust cognitive tasks do not produce reliable individual differences. Behav. Res. Methods 50 (3), 1166–1186. https://doi.org/10.3758/s13428-017-0935-1.

Houghton, G., Grange, J.A., 2011. CDF-XL: computing cumulative distribution functions of reaction time data in Excel. Behav. Res. Methods 43 (4), 1023–1032. https://doi.org/10.3758/s13428-011-0119-3.

Hunt, T.D., Bentler, P.M., 2015. Quantile lower bounds to reliability based on locally optimal splits. Psychometrika 80 (1), 182–195. https://doi.org/10.1007/s11336-013-9393-6.

Kolossa, A., Kopp, B., 2018. Data quality over data quantity in computational cognitive neuroscience. Neuroimage 172, 775–785. https://doi.org/10.1016/j.neuroimage.2018.01.005.

Kopp, B., Lange, F., Steinke, A., 2019. The Reliability of the Wisconsin Card Sorting Test in Clinical Practice. Assessment, Advance Online Publication. https://doi.org/10.1177/1073191119866257.

Kowalczyk, A.W., Grange, J.A., 2017. Inhibition in task switching: the reliability of the N-2 repetition cost. Q. J. Exp. Psychol. 70 (12), 2419–2433. https://doi.org/10.1080/17470218.2016.1239750.

MacLeod, J.W., Lawrence, M.A., McConnell, M.M., Eskes, G.A., Klein, R.M., Shore, D.I., 2010. Appraising the ANT: psychometric and theoretical considerations of the attention network test. Neuropsychology 24 (5), 637–651. https://doi.org/10.1037/a0019803.

Meule, A., Lender, A., Richard, A., Dinic, R., Blechert, J., 2019. Approach–avoidance tendencies towards food: measurement on a touchscreen and the role of attention and food craving. Appetite 137, 145–151. https://doi.org/10.1016/j.appet.2019.03.002.

Miller, J., Ulrich, R., 2013. Mental chronometry and individual differences: modeling reliabilities and correlations of reaction time means and effect sizes. Psychon. Bull. Rev. 20 (5), 819–858. https://doi.org/10.3758/s13423-013-0404-5.

Mosier, C.I., 1941. A short cut in the estimation of split-halves coefficients. Educ. Psychol. Meas. 1 (1), 407–427. https://doi.org/10.1177/001316444100100133.

Nunnally, J.C., Bernstein, I.H., 1994. Psychometric Theory, third ed. McGraw-Hill.

Parsons, S., 2017. splithalf: calculate task split half reliability estimates. R package version 0.3.1. https://cran.r-project.org/package=splithalf.

Parsons, S., Kruijt, A.-W., Fox, E., 2018. Psychological Science Needs a Standard Practice of Reporting the Reliability of Cognitive Behavioural Measurements. https://doi.org/10.31234/osf.io/6ka9z.

Revelle, W., 1979. Hierarchical cluster analysis and the internal structure of tests. Multivariate Behav. Res. 14 (1), 57–74. https://doi.org/10.1207/s15327906mbr1401_4.

Revelle, W., 2018. Psych: Procedures for Psychological, Psychometric, and Personality Research. R package version 1.8.12. https://cran.r-project.org/package=psych.

Revelle, W., Condon, D., 2018. Reliability from alpha to omega: a tutorial. https://doi.org/10.31234/osf.io/2Y3W9.

Rouder, J.N., Haaf, J.M., 2019. A psychometrics of individual differences in experimental tasks. Psychonomic Bull. Rev. 26, 452–467. https://doi.org/10.3758/s13423-018-1558-y.

Rulon, P.J., 1939. A simplified procedure for determining the reliability of a test by split-halves. Harv. Educ. Rev. 9 (1), 99–103.

Sherman, R.A., 2015. Multicon: Multivariate Constructs. R package version 1.6. https://cran.r-project.org/package=multicon.

Spearman, C., 1910. Correlation calculated from faulty data. Br. J. Psychol. 3 (3), 271–295. https://doi.org/10.1111/j.2044-8295.1910.tb00206.x.

Warrens, M.J., 2016. A comparison of reliability coefficients for psychometric tests that consist of two parts. Adv Data Anal Classif 10 (1), 71–84. https://doi.org/10.1007/s11634-015-0198-6.
