Upper Gastrointestinal Tract Bleeding:
Assessing the Diagnostic Contributions of the History and Clinical Findings
CHRISTIAN OHMANN, PHD, KLAUS THON, MD, HARTMUT STÖLTZING, MD, QIN YANG, WILFRIED LORENZ, MD
Various strategies can be used in the diagnosis of upper gastrointestinal tract bleeding. This study investigates the relevance of anamnestic and clinical findings for the diagnosis of the bleeding source. The authors introduced a computer-aided diagnostic system using Bayes’ theorem and compared it with clinicians’ predictions using anamnestic and clinical findings only. There was no difference in the overall accuracy rates, but a difference was observed in the diagnostic behaviors of the two "systems." In addition, the discriminatory ability of the computer-aided system, the sharpness of the predictions obtained, and the reliability of the posterior probabilities were analyzed. It is concluded that the clinician and the computer-aided system are not able to discriminate well between the disease categories. Derived classification matrices and probability-based measures show the reasons for the inadequacy of diagnostic information obtainable from the clinical history and physical findings.
Key words: computer-aided diagnosis; Bayes’ theorem; probabilistic diagnosis; discriminatory ability; reliability; clinical accuracy; upper gastrointestinal tract bleeding.
(Med Decis Making 6:208-215, 1986)
Patients admitted to the hospital with acute upper gastrointestinal tract hemorrhage present many problems, one of which is the need for early diagnosis of the source of hemorrhage. Different diagnostic strategies can be used.21 The diagnosis may be based on the history and clinical findings, on upper gastrointestinal radiography, or on endoscopic findings. Several prospective trials showed that endoscopy is more accurate than radiography. However, there is a higher potential risk in using endoscopy compared with radiography. Clinical history and examination are thought to be inferior to both in diagnostic accuracy, but carry no risk.
These results raise the question whether the history and clinical findings are necessary in the diagnostic decision making process. The answer depends primarily on the amount of diagnostic information provided by these data. If, as has been suggested, little diagnostic information is obtained, this process of careful questioning has little clinical relevance. However, if useful diagnostic information could thus be obtained, the patient could be spared the risk and discomfort of endoscopy or radiography.
Studies performed to measure the diagnostic relevance of the history and clinical findings have been rare, and cover only some aspects of the problem.4, 19, 24 We investigated the diagnostic predictions of experienced clinicians and of a successful computer-aided model.3 The analysis of these predictions, which were compared with proven final diagnoses, was not restricted to the common but inadequate concept of discriminatory ability, e.g., measured by accuracy or predictive value. Additional criteria such as sharpness of the diagnostic predictions and reliability of the probabilities were considered.13, 14
Patients and Methods
PATIENTS
We investigated 457 consecutive patients admitted on an emergency basis for acute upper gastrointestinal tract bleeding to the Marburg Surgical Clinic between January 1978 and February 1983. The criterion for acute upper gastrointestinal tract bleeding was either hematemesis or melena as defined in the O.M.G.E. International Upper Gastro-Intestinal Bleeding Survey.18
As soon as each patient was admitted to the hospital, a detailed history was taken and a careful physical examination was performed. All data were documented on a computer questionnaire especially designed for the purpose. The protocol contained 35 history variables and nine clinical investigations which were expected to discriminate well between the possible diseases (table 1).19 In order to estimate the conditional probabilities adequately, four disease categories were formed: gastric ulcer, duodenal ulcer, esophageal varices, and a group containing all other possible bleeding sources.
Received June 25, 1985, from the Department of Theoretical Surgery and Surgery Clinic, Centre for Operative Medicine I, University of Marburg, Marburg, West Germany. Accepted for publication after revision October 15, 1985. Supported by a grant from Deutsche Forschungsgemeinschaft (Oh 39/2-1). Presented in part at the Royal College of Physicians Computer Workshop, Paris, France, 1983, and at the annual meeting of the German Society for Medical Documentation and Statistics (GMDS), Heidelberg, Germany, 1983.
Address correspondence and reprint requests to Dr. Ohmann: Department of Theoretical Surgery, Centre for Operative Medicine I, University of Marburg, Baldingerstraße, D-3550 Marburg, West Germany.
FINAL DIAGNOSIS
Endoscopy was performed on each patient, almost always within four hours of admission. About 50% of the patients had a second or third endoscopic examination during the first ten days after admission, and 15% of the patients were operated on. The final diagnosis of the bleeding source was based on the findings at the emergency endoscopy and on histologic and x-ray findings, findings at operation, and all further endoscopic findings.
When the data did not yield a clear diagnosis, two clinicians from the endoscopy unit were called to agree upon the final diagnosis. In 82% of the patients a unique bleeding source could be identified, but there were problems in diagnosing the bleeding sources in patients who had multiple lesions (18%). Patients having one lesion with signs of bleeding and another lesion without signs of bleeding were assigned to the former diagnostic category.6 In the remaining cases the two clinicians were asked to define the major bleeding source and the patients were assigned to the appropriate diagnostic categories.
COMPUTER-AIDED DIAGNOSIS
The computer-aided diagnosis was performed with the "Independence Bayes" model, which assumes the conditional independence of the symptoms within every disease category and uses Bayes’ theorem to calculate the posterior probabilities.10, 17 An a priori probability of P(D) = 0.25 for every disease category D was chosen, which agrees approximately with our admission rates. The conditional probabilities P(S/D) were estimated by dividing the number of patients with disease D and symptom S by the number of patients with disease D. For each patient the disease D with the highest posterior probability was taken as the computer prediction.
To achieve an unbiased estimate of the actual error rates of the computer-aided diagnostic system, the patients were divided into two groups, a training set and a test set.27 The training set included all patients admitted to the hospital between January 1978 and December 1981 (n = 362) and was used to estimate the conditional probabilities P(S/D). The performance of the computer-aided system was tested in a separate validation sample (test set) of all patients admitted to the hospital between January 1982 and February 1983 (n = 95). All calculations were done on a Hewlett-Packard desk-top computer (HP 9815A).
CLINICIANS’ PREDICTIONS
In addition to the computer-aided prediction, a diagnostic prediction from the clinician, using the history and physical findings only, was noted prospectively on the computer questionnaire for every patient in the test set. The same clinician took the history, performed the physical examination, and filled in the questionnaire for any given patient. In a six-month pilot period from July 1981 to December 1981 the four participating clinicians from the endoscopy unit were able to familiarize themselves with this type of prediction. For five patients in the test set no diagnostic prediction was made by the clinician, hence 90 diagnostic predictions by the clinicians could be analyzed.
Table 1. Features of the History and Physical Examination Used in the Diagnosis of Upper Gastrointestinal Tract Bleeding
Table 2. The Forced Classification Matrix for the Diagnostic Predictions of the Clinician in the Test Set (n = 95)*
*Five of the clinicians’ predictions were missing.
Results
CLINICIANS’ PREDICTIONS VERSUS FINAL DIAGNOSES
Table 2 shows the forced classification matrix for the diagnostic prediction of the clinician. The predictions were accurate in 55 of 90 patients (61%). Accuracies in the different disease categories were 14 of 18 (78%) in the duodenal ulcer group, 78% in the varices group, 56% in the gastric ulcer group, and 42% in the diagnostic category "other." Of 21 predictions of varices as the bleeding source 18 were correct, which gives a predictive value of 86%.9 The predictive value for the diagnostic category "other" was 72%; for duodenal ulcer, 48%; and for gastric ulcer, 45%.
COMPUTER PREDICTION: CLASSIFICATION MATRIX
The forced classification matrix in table 3 shows accurate predictions for 57 of 95 patients (60%). The computer prediction was accurate in 19 of 24 cases (79%) in the varices group, 65% in the disease category "other," 48% in the gastric ulcer group, and 42% in the duodenal ulcer group. Predictive values ranged from 19 of 23 cases (83%) in the varices group, to 63% for "other," 56% for gastric ulcer, and 36% for duodenal ulcer.
CLINICIAN VERSUS COMPUTER
Although there was very little difference between the overall accuracies of the clinicians’ predictions (61%) and the computer’s predictions (60%), there were marked differences with regard to two disease categories (tables 2 and 3). For duodenal ulcer the clinician was 36% more accurate than the computer. In the diagnostic category "other" the opposite was true, with a difference of 33% in the accuracy rates. The predictive values showed only moderate differences of up to 12% between the clinicians and the computer.
Since our two systems were tested on the same cases, paired-comparison techniques are appropriate to test for differences in performance. Table 4 shows that in addition to 40 patients correctly diagnosed by both systems, 15 cases were correctly diagnosed by the clinician and not by the computer and 16 the other way around. This gives a nonsignificant result in the McNemar test, which means that the null hypothesis of equal nonerror rates cannot be rejected. On the other hand there is a difference in the diagnostic behaviors of the two systems, which can be documented by the high frequency of 31 of 90 cases (34%) in the heteronomous (discordant) cells of table 4. The null hypothesis that the non-agreement coefficient between the clinician and the computer equals zero is tested by an inversion of Pearson’s phi-coefficient Φ (table 4).16 Using the chi-square distribution with 1 degree of freedom, a significant result (p < 0.001) is obtained. Thus, the alternative hypothesis of non-agreement between the systems has to be accepted.
Table 3. The Forced Classification Matrix for the Diagnostic Predictions of the Computer in the Test Set (n = 95)
Table 4. Paired Comparison of the Clinicians’ Predictions and the Computer Predictions in the Test Set (n = 95)
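The McNemar comparison of the two systems uses only the discordant counts reported in the text (15 and 16). The short Python sketch below recomputes it under the assumption that the uncorrected chi-square form of the test was used; the text does not say whether a continuity correction or an exact version was applied, and the function name is hypothetical.

```python
from math import erfc, sqrt


def mcnemar(b, c):
    """McNemar chi-square (1 df, no continuity correction) from discordant counts."""
    stat = (b - c) ** 2 / (b + c)
    p_value = erfc(sqrt(stat / 2))  # upper tail of the chi-square distribution, 1 df
    return stat, p_value


# Discordant counts reported in the text for table 4.
clinician_only, computer_only = 15, 16
stat, p_value = mcnemar(clinician_only, computer_only)
print(round(stat, 3), round(p_value, 2))
```

The concordant cells (both systems correct, both wrong) do not enter the statistic, which is why near-equal discordant counts give a nonsignificant result however large their sum is.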
COMPUTER: DERIVED CLASSIFICATION MATRICES
All previous measurements of performance were based on the forced classification matrix, in which all patients are allocated to a disease.9, 20 However, when studying discriminatory ability, it is also interesting to look at the assigned probabilities. This can be done only for the computer-aided system.
For further consideration of the data, diseases with low probabilities could be omitted. This is illustrated in table 5, where those diseases D with a posterior probability P(D/S) < 0.10 were excluded. The exclusion matrix shows that the diagnosis "varices" can be well distinguished from the other diagnostic categories. In 18 of 21 cases of gastric ulcer (86%), 89% of cases of duodenal ulcer, and 84% of cases in the disease category "other," the diagnosis "varices" could be excluded. For the 24 patients who had varices, the bleeding source "gastric ulcer" could be excluded 16 times (67%), duodenal ulcer could be excluded 18 times (75%), and "other" could be excluded 15 times (63%). The discrimination of the computer-aided system between patients who had ulcers and all patients with "other" sources of hemorrhage was moderate. The discriminatory ability to separate duodenal ulcer patients from gastric ulcer patients was poor. This can be seen in the low exclusion rates of 7 of 21 (33%) duodenal ulcers in gastric ulcer patients and of 7 of 19 (37%) gastric ulcers in duodenal ulcer patients.
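The construction of such an exclusion matrix can be sketched as follows. This is a minimal Python illustration assuming each case is available as a pair of final diagnosis and posterior probabilities; the function name and the three toy cases are hypothetical and are not data from the study.

```python
from collections import Counter

CUTOFF = 0.10  # exclusion cutoff stated in the text


def exclusion_counts(cases, cutoff=CUTOFF):
    """For each actual disease, count how often each disease had a posterior below the cutoff."""
    counts = {}
    for actual, post in cases:
        row = counts.setdefault(actual, Counter())
        row.update(d for d, p in post.items() if p < cutoff)
    return counts


# Hypothetical toy cases: (final diagnosis, posterior probabilities).
cases = [
    ("gastric ulcer", {"gastric ulcer": 0.50, "duodenal ulcer": 0.40, "varices": 0.02, "other": 0.08}),
    ("gastric ulcer", {"gastric ulcer": 0.30, "duodenal ulcer": 0.05, "varices": 0.15, "other": 0.50}),
    ("varices", {"gastric ulcer": 0.03, "duodenal ulcer": 0.02, "varices": 0.90, "other": 0.05}),
]
counts = exclusion_counts(cases)
print(counts["gastric ulcer"]["varices"], counts["varices"]["other"])
```

Dividing each count by the number of patients with that final diagnosis gives exclusion rates of the kind quoted above (for example 18 of 21).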
In table 6 the patients for whom a confident diagnosis was made are separated from patients for whom the diagnosis was not conclusive. In 60 of 95 (63%) computer-aided predictions the largest posterior probability P(D/S) did not exceed 0.8. Defining sharpness of a diagnostic system as the ability to assign high probability values to one disease, our system could not be described as sharp in the presence of so many doubtful cases.14 On examination of the sharp diagnoses only, it is interesting that the diagnostic accuracy was 24 of 35 (69%), which is hardly different from the overall accuracy of 60%.
Table 5. Exclusion Matrix of the Computer-aided System in the Test Set (n = 95)*
*Diseases D with p(D/S) < 0.1 are excluded.
Table 6. Classification Matrix with Doubt of the Computer-aided System in the Test Set (n = 95)*
*For patients with the largest probability p(D/S) not exceeding 0.80 the computer-aided prediction was classified as doubt.
FIGURE 1. Dot diagrams of the probabilities assigned to the actual disease categories in the test set (n = 95). Each dot represents a patient.
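The "doubt" rule behind table 6 amounts to a threshold on the largest posterior probability. A minimal Python sketch, with a hypothetical function name and hypothetical toy posteriors:

```python
DOUBT_THRESHOLD = 0.80  # threshold given in the footnote to table 6


def classify_with_doubt(post, threshold=DOUBT_THRESHOLD):
    """Return the most probable disease, or "doubt" if its probability does not exceed the threshold."""
    disease = max(post, key=post.get)
    return disease if post[disease] > threshold else "doubt"


# Hypothetical posteriors, not data from the study.
sharp = {"gastric ulcer": 0.03, "duodenal ulcer": 0.02, "varices": 0.90, "other": 0.05}
flat = {"gastric ulcer": 0.45, "duodenal ulcer": 0.40, "varices": 0.05, "other": 0.10}
print(classify_with_doubt(sharp), classify_with_doubt(flat))
```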
COMPUTER: PROBABILITY-BASED MEASURES
In addition to the classification matrices used to measure the performance of a diagnostic system, several other measures which are continuous functions of the assigned probabilities should be used.13, 14, 20 The dot diagram in figure 1 provides a first impression of the distributions of the probabilities assigned to the actual diseases. The overall average probability for the actual diseases was 0.52 in the test set (table 7), with marked differences between the four diagnostic categories (fig. 1). The varices group especially had a different distribution, with a small peak near 0 and a high peak near 1, compared with the approximately uniform distributions in the other three diagnostic categories.
Two other criteria reflect other aspects of the degrees of discrimination between the diagnostic categories (table 7). These criteria are based on scores that describe the discrepancy between the actual disease D and the posterior probabilities assigned to the four disease categories. One of the most popular scoring methods in nonmedical applications is the quadratic score or Brier score:
$$Q = \frac{1}{N} \sum_{i=1}^{N} \Big[ \big(1 - P_{i\,d(i)}\big)^2 + \sum_{j \neq d(i)} P_{ij}^2 \Big]$$
where N is the number of patients, P_ij the posterior probability for D_j in patient i, and d(i) the index of the actual disease of patient i.13 If the assigned probability to the actual disease is 1.00, then patient i clearly contributes nothing to the quadratic score. On the other hand, if some other disease is assigned a probability of 1, the term of the ith patient becomes 2. Hence the lower limit is 0 and the upper limit is 2. In our case the quadratic score was 0.59 in the test set (table 7). Utilizing the quadratic score, there is little difference between using our system and using an uninformative indifferent system, where each disease is assigned a probability of 0.25 throughout, which leads to a quadratic score of 0.75.14
Table 7. Discriminatory Ability and Reliability of the Computer-aided System
*Criteria are defined in the text.
†Calculated under the null hypothesis of perfect reliability of the probabilities.
‡ε = 0.01.
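The limits of the quadratic score and the value of 0.75 for the indifferent system can be checked numerically. The following Python sketch implements the score as described in the text (squared discrepancy summed over the four categories and averaged over patients); the function name and the list-based data layout are assumptions made for illustration.

```python
def quadratic_score(posteriors, actual):
    """Mean over patients of the summed squared gaps between the posteriors and the 0/1 outcome."""
    total = 0.0
    for post, d in zip(posteriors, actual):
        total += sum((p - (1.0 if j == d else 0.0)) ** 2 for j, p in enumerate(post))
    return total / len(posteriors)


# One patient whose actual disease has index 0, scored under three kinds of prediction.
perfect = quadratic_score([[1.0, 0.0, 0.0, 0.0]], [0])
worst = quadratic_score([[0.0, 1.0, 0.0, 0.0]], [0])
indifferent = quadratic_score([[0.25, 0.25, 0.25, 0.25]], [0])
print(perfect, worst, indifferent)
```

For the indifferent system each patient contributes 0.75² + 3 × 0.25² = 0.75, whatever the actual disease is.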
The ε-modified logarithmic score:
where N is the number of patients, P_ij the posterior probability for D_j in patient i, d(i) the index of the actual disease of patient i, ε > 0 and W(P_ij) = (1 − ε)·P_ij + ε, penalizes especially low probabilities for the actual disease.14 The ε-modified logarithmic score is approximately equal to:
where N is the number of patients, P_i,act the posterior probability for the actual disease and ε > 0. Using an ε = 0.01 produces a theoretical minimum of −4.56 and a derived maximum of 0. Our computer-aided diagnostic system produces an ε-modified logarithmic score of −1.00, which is again not very different from the score of −1.26 of the indifferent system, where each disease is assigned a probability of 0.25 (table 7).
A comparison between the two samples in table 7 shows that the criteria calculated in the training set are superior to the same criteria calculated in the test set.
COMPUTER: RELIABILITY* OF THE PROBABILITIES
One important aspect of a good performance in probabilistic diagnosis is the reliability of the posterior probabilities, which is quite distinct from the question of discrimination.11, 13, 14 The posterior probability P that a patient has disease D given a symptom vector S is called reliable when in a sample of adequate size of patients all having the same symptom vector S, about P% do actually have the disease D. Usually it is not possible to collect enough cases with identical symptoms and verify that, within sampling fluctuations, the assigned diagnostic probabilities can be trusted. One method of overcoming these difficulties is to consider the test set as a whole and hypothesize that whenever an event is assigned a probability P it will occur with frequency P. Using perfect reliability as the null hypothesis, departures from this perfect state of affairs can be measured and tested.13, 14
In table 7 the expected values of the diagnostic scores are calculated under the null hypothesis of perfect reliability. If we use the difference between the observed and the expected values as a reliability measure, we can see that the observed non-error rate is 13% lower than the expected rate, which has to be calculated as the average maximum probability.13 The observed average probability for the actual disease is only 52% and therefore 11% smaller than expected. Regarding these two reliability measures as normally distributed, the null hypothesis of perfect reliability must be rejected (p < 0.01, p < 0.001).13, 14, 20 In addition, the expected values of the quadratic score and the ε-modified logarithmic score do suggest better results than could be observed in the study. The training set shows the same trend for all reliability measures as the test set.
There are many ways in which a system may deviate from reliable performance. In order to measure whether a system favors a particular disease (size bias), a comparison of the observed and expected frequencies for every disease is necessary. The expected frequency in a disease category D is calculated as the sum of the posterior probabilities for the disease D.13 Table 8 shows that there is an overassignment in the duodenal ulcer group, with 23.7 expected instead of 19 observed cases. In the varices group and in the "other disease" class there were small underassignments, with 21.2 and 28.5 expected cases compared with 24 and 31 observed cases, respectively. This gives a nonsignificant test result using approximate standard normal test statistics.13 Another possibility for the measurement of the reliability of the posterior probabilities is to divide the probabilities into intervals and compare the expected and observed frequencies in each subgroup, using a chi-square goodness-of-fit test for every disease. In table 8 this is done, using four equidistant probability intervals. The common trend in all four disease categories is a higher expected than observed value in the interval 0.76 to 1.00 and a smaller expected than observed value in the interval 0.00 to 0.25. Only the results in the varices group and those in the "other disease" category are significant (p < 0.05).
* "Reliability" as used in the European literature cited here corresponds broadly to "calibration" in recent North American literature.-Ed.
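Both reliability checks described above, the size-bias comparison and the comparison within probability intervals, reduce to sums of posterior probabilities set against counts of actual cases. The Python sketch below shows one way to compute them. The function names and toy data are hypothetical, the assignment of boundary values such as 0.25 to the lower interval is an assumption, and the chi-square test itself is left out.

```python
from math import ceil


def size_bias(posteriors, actual, disease):
    """Expected frequency (sum of posteriors for the disease) and observed frequency."""
    expected = sum(post[disease] for post in posteriors)
    observed = sum(1 for d in actual if d == disease)
    return expected, observed


def binned_frequencies(posteriors, actual, disease, n_bins=4):
    """Expected and observed frequencies of the disease within equidistant probability intervals."""
    exp = [0.0] * n_bins
    obs = [0] * n_bins
    for post, d in zip(posteriors, actual):
        p = post[disease]
        b = max(ceil(p * n_bins) - 1, 0)  # assumed: a boundary value falls in the lower interval
        exp[b] += p
        obs[b] += int(d == disease)
    return exp, obs


# Hypothetical posteriors for four patients (not study data); `actual` holds disease indices.
posteriors = [
    [0.70, 0.10, 0.10, 0.10],
    [0.40, 0.30, 0.20, 0.10],
    [0.10, 0.80, 0.05, 0.05],
    [0.25, 0.25, 0.30, 0.20],
]
actual = [0, 1, 1, 3]
expected, observed = size_bias(posteriors, actual, disease=1)
exp_bins, obs_bins = binned_frequencies(posteriors, actual, disease=1)
print(expected, observed, exp_bins, obs_bins)
```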
Discussion
The clinicians and the computer-aided system were not able to discriminate adequately between the four given disease categories, as could be seen in the accuracy rates of 61% and 60%. The results of our computer-aided diagnostic system are comparable to the results in a multicenter trial with an accuracy of 59% and to our earlier results with accuracy rates of 65% to 69%.4, 24 We could not achieve the excellent results of computer-aided diagnostic systems used for other diagnostic problems such as the acute abdomen.3, 25 These results suggest that there is little relevant diagnostic information in the history and physical findings; nevertheless, some points must be further discussed before any definite conclusions can be reached.
Regarding the poor performance of the clinicians, it is important to note that no inexperienced doctor took part in this study. All doctors were experienced members of the endoscopic unit and had had a minimum of two years of regular training in the diagnosis of upper gastrointestinal tract bleeding. It may be argued that neither experienced doctors nor successful computer-aided models can produce good results if the correct questions are not posed and the wrong physical examinations are performed. The variables collected in our study contained all clinical attributes which were thought to be important in diagnostic terms. The computer questionnaire was based on the protocol of the O.M.G.E. International Upper Gastro-Intestinal Bleeding Survey, expanded and clarified to a detailed protocol by our senior clinician.4, 18 Therefore it is unlikely that any important diagnostic variables have been omitted.
The quality of the data is thought to be high, for two reasons. Before starting our trial in 1978, we discussed terminology in detail; all terms used in describing upper gastrointestinal tract bleeding were carefully defined.4, 18 In addition, the data were collected prospectively by experienced clinicians using a computer questionnaire. Nevertheless, for 19% of the patients more than 20% of the data was missing. The main part of this data loss probably relates to the poor condition of some patients at the time of admission, so that neither detailed histories nor careful physical examinations could be obtained. A comparison of the computer-aided system’s performance for patients with and without missing data shows that the accuracy rate is about 9% lower for diagnostic predictions based on incomplete data.
Table 8. Comparison of the Observed and Expected Frequencies (Goodness of Fit) for Every Disease in Four Intervals of Probabilities in the Test Set (n = 95)
*Obs = observed frequency.
†Exp = expected frequency = sum of probabilities for the actual disease. The calculation was done separately for every combination of the disease categories and the intervals of probabilities.
In about 20% of our emergency cases the patients have multiple lesions in the upper gastrointestinal tract. Most of these patients have only one bleeding source and one or two accompanying lesions. A bias is introduced if these patients are assigned to one of the four disease categories. The accuracy of the computer prediction is about 10% higher for patients with a single lesion compared with patients with multiple lesions, which underlines the problems of using one-disease models. The contributions of missing data and multiple diseases to the error rate are moderate and only partly explain the poor results.
Computer-aided diagnostic systems using Bayes’ theorem are very popular.25 Nevertheless, the question arises whether the appropriate model was used in our study. The simplifying assumption of independence of symptoms is a matter of great controversy.22 Comparisons of different diagnostic techniques showed, however, that the independence model is a good discriminator even when the assumptions are strictly unjustified.2, 23 This does not imply that the independence model, using all the data from the history and physical examination, is the best choice of all possible statistical models. However, the results in the literature suggest that differences in diagnostic accuracies due to the choice of the model are often small compared with the influences of other factors such as the type, the quality and the completeness of the data collected.2, 22, 23 If medical decision making methods are to stand any chance of success, they must be simple to use and comprehensible to the clinician, conditions that are well satisfied by the "independent Bayes" model.
For better understanding of the underlying structure of the diagnostic problem from the statistical viewpoint, it would be interesting to use only a few important diagnostic variables instead of looking at all signs, symptoms, and diagnostic tests. This point is currently under investigation by the application of a stepwise linear logistic model22 and the independence model together with different variable-selection procedures.8
In this
we were not restricted to thesimple
determination of
diagnostic
accuracy but tried to ana-lyze
the reasons for thedisappointing
results. Thediagnostic predictions
of the clinicians were different from thecomputer predictions
(table 4). This meansthat
computer-aided diagnostic systems,
which havebeen used since 1978 in our
Surgical Clinic,
haveprob- ably
had no substantial influence on the clinicians’views of the
diagnostic
process. Since the clinicianswere not forced to
assign probabilities
to the different diseasecategories,
a definite answer to thisquestion
cannot be
given.
Theimpression
that clinicians are nowcoming
toregard
clinicaldiagnosis
as a process of statistical orprobabilistic
nature seems to be ratheroverly optimistic.5, 11, 26
One mainproblem
that pre- vents achange
from the traditional view of thediag-
nostic process as an intuitive art, based upon
personal experience
and textbookknowledge,
to aprobabilistic
and statistical
diagnosis
is that calculatedposterior probabilities
ofcomputer-aided
models cannot betrusted. In our
study,
theindependence
model pro- ducesfigures
that are not realprobabilities
and thuscannot
help
the clinicians to estimateprobabilities.
Atthe worst, it may
engender
a false sense ofcertainty
and mislead the clinician in his decision
making
pro- cess. 9, 11, 13, 14Assuming perfect reliability
of theprob-
abilities of the
independence model, departures
fromthis
perfect
state of affairs have been measured in ourstudy. Significant
differences between observed andexpected
values for the non-error rate, theaverage probability
for the actualdisease,
thequadratic
cri-terion, and the E-modified
logarithmic
criterion indi- cate that thediscriminatory performance
is less thanwould be
expected
from thepredictions
themselves(table
7).12-14
Theprobabilistic predictions
are over-confident,
which may be related to the fact that in theindependence
model related information is consid- ered as unconnected evidence. 13, 22 The overconfidentpredictions
aresymmetrically
distributedthroughout
the
diagnostic categories,
which means that no par- ticular disease is fovouredby
thecomputer-aided
sys- tem (table 8).The
nonreliability of the probabilities produced by the computer-aided system leads to difficulties in interpreting derived classification matrices and probability-based measures of performance (tables 5-7).13, 14 Even if the probabilities of the independence model could be trusted, the different performance measures (expected values) give disappointing results concerning the discriminatory ability (table 7). The main reason for this is that the computer-aided model is not able to assign high probability values to one disease, as could be seen in the average maximum probability of 73% (= expected non-error rate) and in the other performance measures (table 7).13, 14 Computer-aided systems that have good discriminatory ability must necessarily produce sharp predictions, i.e., predictions that assign nearly 100% to one disease.14 The many non-sharp predictions in our study indicate that little diagnostic information is provided by the clinical history and physical examination. Only the bleeding source esophageal varices could be well discriminated from other sources. The separation of duodenal ulcer patients from gastric ulcer patients was poor using this model (table 5).
One reason for the disappointing results is that in upper gastrointestinal tract bleeding clinical signs and symptoms that normally could point to a particular diagnosis may be dominated by the effects of the blood loss, especially in dramatic cases with severe hemorrhage. On the other hand, the history and physical findings occasionally suggest a diagnosis that is not the bleeding source. Jaundice and ascites, for example, indicate esophageal varices, but this may be misleading since bleeding in a patient who has liver disease with esophageal varices may be the result of peptic ulceration or gastric erosions.4 The various interactions between elements of the history and the clinical findings, the effects of the bleeding, and the underlying lesion limit the ability of both the clinician and the computer-aided system to correctly identify the source of hemorrhage. It appears that the initial clinical features are more helpful in determining prognosis than diagnosis. Several studies have shown that the short-term prognosis, i.e., whether the bleeding would continue or subside, could be predicted with sufficient accuracy using clinical signs and symptoms on admission and computer-aided prognostic systems.4, 18
In summary, it is concluded that at present there seems to be no combination of symptoms and signs that reliably points to a particular diagnosis, even when sophisticated computer-aided systems are used. If an accurate diagnosis of the source of bleeding is required at an early stage, high-technology investigations such as endoscopy must be employed.18
The authors thank Dr. Madeleine Ennis and Marlene Verfürth for assistance in the preparation of this report.
References
1. Cox DR: The analysis of binary data. London, Methuen, 1970, pp 90-95
2. Croft JD: Mathematical models in medical diagnosis. Ann Biomed Engineering 2:69-89, 1974
3. De Dombal FT, Leaper DJ, Staniland JR, et al: Computer-aided diagnosis of acute abdominal pain. Br Med J 2:9-13, 1972
4. De Dombal FT, Morgan AG, Staniland JR, et al: Clinical features—computer analysis, in: Dykes PW, Keighley MRB (eds): Gastrointestinal Hemorrhage. Bristol, John Wright, 1981, pp 155-165
5. Diamond GA: Computer diagnosis: revolution or revelation. Int J Cardiol 2:219-220, 1982
6. Forrest JAH, Finlayson NDC, Shearman DJC: Endoscopy in gastrointestinal bleeding. Lancet II:394-397, 1974
7. Gilbert DA, Silverstein FE, Tedesco FJ, et al: National ASGE survey on upper gastrointestinal bleeding. Complications on endoscopy. Dig Dis Sci 26:55-59, 1981
8. Habbema JDF, Hermanns J: Selection of variables in discriminant