NISTIR 8271

Face Recognition Vendor Test (FRVT)

Part 2: Identification

Patrick Grother, Mei Ngan, Kayee Hanaoka
Information Access Division, Information Technology Laboratory

This publication is available free of charge from:
https://doi.org/10.6028/NIST.IR.8271

September 2019

U.S. Department of Commerce
Wilbur L. Ross, Jr., Secretary

National Institute of Standards and Technology
Walter Copan, NIST Director and Undersecretary of Commerce for Standards and Technology


Certain commercial entities, equipment, or materials may be identified in this document in order to describe an experimental procedure or concept adequately.

Such identification is not intended to imply recommendation or endorsement by the National Institute of Standards and Technology, nor is it intended to imply that the entities, materials, or equipment are necessarily the best available for the purpose.

National Institute of Standards and Technology Interagency or Internal Report 8271
Natl. Inst. Stand. Technol. Interag. Intern. Rep. 8271, 186 pages (September 2019)



ACKNOWLEDGMENTS

The authors are grateful to Wayne Salamon and Greg Fiumara at NIST for designing robust software infrastructure for image and template storage and parallel execution of algorithms across our computers. Thanks also to Brian Cochran at NIST for providing highly available computers and network-attached storage.

DISCLAIMER

Specific hardware and software products identified in this report were used in order to perform the evaluations described in this document. In no case does identification of any commercial product, trade name, or vendor imply recommendation or endorsement by the National Institute of Standards and Technology, nor does it imply that the products and equipment identified are necessarily the best available for the purpose.

Notation used throughout this report: FNIR(N, R, T) = false negative identification rate; FPIR(N, T) = false positive identification rate; N = number of enrolled subjects; R = number of candidates examined; T = threshold. T = 0 corresponds to investigation; T > 0 to identification.


Executive Summary

This report updates and extends NIST Interagency Report 8238, documenting the evaluation of automated face recognition algorithms submitted to NIST in November 2018. The algorithms, which implement one-to-many identification of faces appearing in two-dimensional images, are prototypes from the research and development laboratories of mostly commercial suppliers, and are submitted to NIST as compiled black-box libraries implementing a NIST-specified C++ test interface. The report therefore does not describe how algorithms operate.

The evaluation used four datasets - frontal mugshots, profile views, webcam photos and wild images - and the report lists accuracy results alongside developer names. It will therefore be useful for comparison of face recognition algorithms and assessment of absolute capability. The primary dataset comprises 26.6 million reasonably well-controlled live portrait photos of 12.3 million individuals. The three smaller datasets contain more unconstrained photos: 3.2 million webcam images; 200 thousand side-view images; and 2.5 million photojournalism and amateur photographer photos. These datasets are sequestered at NIST, meaning that developers do not have access to them for training or testing. The last dataset, however, consists of images drawn from the internet for testing purposes, so while it is not truly sequestered, its composition is unknown to the developers.

The evaluation was run in three phases, starting February, June, and October 2018 respectively, with developers receiving technical feedback between phases. Results for 127 algorithms from 41 developers were published in November 2018 as NIST Interagency Report 8238. This update adds results for 76 additional algorithms from 42 developers, submitted in October 2018. At that time seven developers ceased participation, and nine developers started. The developer totals constitute a substantial majority of the face recognition industry.

The major result given in NIST IR 8238 was that massive gains in accuracy have been achieved in the last five years (2013-2018) and these far exceed improvements made in the prior period (2010-2013). While the industry gains were broad - at least 30 developers’ algorithms outperformed the most accurate algorithm from late 2013 - there remains a wide range of capability. While this report shows accuracy gains only over the course of 2018, the most accurate algorithm reported here is substantially more accurate than anything reported in NIST IR 8238. This is evidence that face recognition development continues apace, and that FRVT reports are but a snapshot of contemporary capability.

From discussion with developers, the accuracy gains stem from the adoption of deep convolutional neural networks.

As such, face recognition has undergone an industrial revolution, with algorithms increasingly tolerant of poorly illuminated and other low-quality images, and of poorly posed subjects. One related result is that a few algorithms correctly match side-view photographs to galleries of frontal photos, with search accuracy approaching that of the best c. 2010 algorithms executing frontal-to-frontal search. The capability to recognize under a 90-degree change in viewpoint - pose invariance - has been a long-sought milestone in face recognition research.

With good quality portrait photos, the most accurate algorithms will find matching entries, when present, in galleries containing 12 million individuals, with rank-one miss rates approaching 0.1%. The remaining errors are in large part attributable to long-run ageing, facial injury and poor image quality. In at least 5% of images, identification succeeds (i.e. the mate is returned at rank 1) but recognition similarity scores are weak, such that true and false matches become indistinguishable and human adjudication becomes necessary.

From Fall 2019, this report will be updated continuously as new algorithms are submitted to FRVT and run on new datasets. Participation in the one-to-many identification track requires a developer to first demonstrate high accuracy in the one-to-one verification track of FRVT.


Scope and Context

Audience: This report is intended for developers, integrators, end users, policy makers and others who have some familiarity with biometrics applications. The methods and metrics documented here will be of interest to organizations engaged in tests of face recognition algorithms. Some of these have been incorporated in the ISO/IEC 19795 Part 1 Biometric Testing and Reporting Framework standard, now under revision.

Prior benchmarks: Automated face recognition accuracy has improved massively in the two decades since initial commercialization of the various technologies. NIST has tracked that improvement through its conduct of regular independent, free, open, and public evaluations. These have fostered improvements in the state of the art. This report serves as an update to NIST Interagency Report 8238 on the performance of face identification algorithms, published in November 2018.

Scope: As with NIST IR 8238, this report documents recognition results for four databases containing in excess of 30.2 million still photographs of 14.4 million individuals. This constitutes the largest public and independent evaluation of face recognition ever conducted. It includes results for accuracy, speed, investigative vs. identification applications, scalability to large populations, use of multiple images per person, and images of cooperative and non-cooperative subjects.

The report also includes results for ageing, recognition of twins, and recognition of profile-view images against frontal galleries. It otherwise does not address causes of recognition failure, neither image-specific problems nor subject-specific factors including demographics. Separate reports on demographic dependencies in face recognition will be published in the future. Additionally out of scope are: performance of live human-in-the-loop transactional systems like automated border control gates; human recognition accuracy as used in forensic applications; and recognition of persons in video sequences (which NIST evaluated separately [9]). Some of those applications share core matching technologies that are tested in this report.

Images: Four kinds of images are employed. The primary dataset is a set of law enforcement mugshot images (Fig. 3), which are enrolled and then searched with three kinds of images: 1) other mugshots (i.e. within-domain); 2) profile-view photographs (90-degree cross-view); and 3) lower quality webcam images (Fig. 4) collected in similar detention operations (cross-domain). Additionally, wild images (Fig. 6) are searched against other wild images.

Participation and industry coverage: The report includes performance figures for 203 prototype algorithms from the research laboratories of 51 commercial developers and one university. This represents a substantial majority of the face recognition industry, but only a tiny minority of the academic community. Participation was open worldwide.

While there is no charge for participation, developers incur some software engineering expense in implementing their algorithms behind the NIST application programming interface (API). The test is a black-box test where the function of the algorithm, and the intellectual property associated with it, is hidden inside pre-compiled libraries.

Recent technology development: Most face recognition research with deep convolutional neural networks (CNNs) has been aimed at achieving invariance to the pose, illumination and expression variations that characterize photojournalism and social media images. The initial research [18,24] employed large numbers of images of relatively few (∼10⁴) individuals to learn invariance. Inevitably, much larger populations (∼10⁷) were employed for training [11,20], but the benchmark, Labeled Faces in the Wild, with (essentially) an equal error rate metric [12], represents an easy task: one-to-one verification at very high false match rates. While a larger scale identification benchmark duly followed, Megaface [15], its primary metric, rank-one hit rate, contrasts with the high-threshold discrimination task required in most large-population applications of face recognition, namely credential de-duplication and background checks.

There, identification in galleries containing up to 10⁸ individuals must be performed using a) very few images per individual and b) stringent thresholds, to afford very low false positive identification rates. FRVT 2018 was launched to measure the capability of the new technologies, including in these two respects. FRVT has included open-set identification tests since 2002, reporting both false negative and false positive identification rates [7].


Performance metrics for applications: This report documents the performance of one-to-many face recognition algorithms. The word "performance" here refers to recognition accuracy and computational resource usage, as measured by executing those algorithms on massive sequestered datasets.

This report includes extensive tabulation of recognition error rates germane to the main use-cases for face search technology. The figure and table below, inspired by Figure 1 in [25], distinguish the different applications of the technology. The last row directs readers to the main tables relevant to those applications, respectively threshold-based and rank-based metrics that are special cases of the metrics given in section 3. The terms negative identification and positive identification are taken from the ISO/IEC 2382-37:2017 standardized biometrics vocabulary.

[Figure: Schematic of an automated face recognition engine, evaluated here as a black box; the grey box is the scope of NIST's evaluation. A search photo passes through detection and localization, then feature extraction (e.g. a CNN model), then a search algorithm (e.g. N comparisons) against the enrolled database, which is an array, tree, index or other data structure built by feature extraction from the enrolled images. The enrollment database consists of images and any biographic data; the algorithm is given the images and a pointer to each record. The output is a candidate list, e.g. Pat (rank 1, score 3.142), Bob (2, 2.998), Ben (3, 1.602), Zeke (4, 0.707), ... Its length is determined by a preset configuration of rank and threshold, and these are set to implement application objectives. In negative identification, the correct response is usually an empty list, as most searches are non-mated; in positive identification, the correct response is usually a single entry, as most searches will be mated.]

Positive identification
- Example: Access to a gym or cruise ship.
- Claim of identity: Implicit claim to be enrolled.
- Threshold: High, to implement security objective.
- Num. candidates: 1.
- Human role: Review candidate to assist user in resolution of false negatives, or to detect impostor.
- Intended human involvement frequency: Rare - approx. the false negative identification rate plus the prior probability of an impostor.
- Performance metric of interest: FNIR at low FPIR. See sec. 3.1, 3.2 and Tables 10, 19.

Negative identification
- Example: Watchlist, e.g. detection of deportee, or duplicate drivers license applications.
- Claim of identity: Implicit claim to not be enrolled.
- Threshold: High, to limit false positives.
- Num. candidates: 0.
- Human role: Review candidate to determine false positive or correct hit.
- Intended human involvement frequency: Rare - approx. the false positive identification rate plus the prior probability of an actual mate.
- Performance metric of interest: FNIR at low FPIR. See sec. 3.1, 3.2 and Tables 10, 19.

Post-event investigation
- Example: Crime scene photos, or photos of a detainee without ID documents.
- Claim of identity: No claim: inquiry.
- Threshold: Zero.
- Num. candidates: L, set by request to the algorithm.
- Human role: Review multiple candidates, refer possible hits to examiners; see [26].
- Intended human involvement frequency: Always.
- Performance metric of interest: FNIR at ranks 1...50, say; FPIR = 1. See sec. 3.2 and Tables 12, 14, 16.

The algorithms are specifically configured for these applications by setting thresholds and candidate list lengths.
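To make that configuration concrete, here is a minimal, hypothetical sketch (Python rather than the NIST C++ test interface) of how a returned candidate list might be shaped by rank and threshold settings; the names and scores are those of the example list in the figure above.

    # Hypothetical sketch (not the NIST C++ test interface): shaping one
    # search's candidate list for different applications via rank and threshold.
    from dataclasses import dataclass

    @dataclass
    class Candidate:
        subject_id: str
        score: float

    def configure(candidates, max_rank, threshold):
        """Sort by descending score, truncate to max_rank, drop sub-threshold entries."""
        ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
        return [c for c in ranked[:max_rank] if c.score >= threshold]

    # Example candidates from the figure above.
    raw = [Candidate("Pat", 3.142), Candidate("Bob", 2.998),
           Candidate("Ben", 1.602), Candidate("Zeke", 0.707)]

    investigation = configure(raw, max_rank=50, threshold=0.0)  # all L candidates reviewed
    identification = configure(raw, max_rank=1, threshold=3.0)  # alarms only on strong hits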


Both rank-based metrics and threshold-based metrics include tradeoffs. In investigation, overall accuracy will be reduced if labor is only available to review a few candidates from the automated system. Note that when a fixed number of candidates is returned, the false positive identification rate of the automated face recognition engine will be 100%, because a probe image of anyone not enrolled will still return candidates. In identification applications where false positives must be limited, to satisfy reviewer labor availability or a security objective, higher false negative rates are implied. This report includes extensive quantification of this threshold-based tradeoff. See Sec. 3.

Template diversity: The FRVT is designed to evaluate black-box technologies, with the consequence that the templates that hold features extracted from face images are entirely proprietary: opaque binary data that embed considerable intellectual property of the developer. Despite the migration to CNN-based technologies, there is no consensus on the optimal feature vector dimension. This is evidenced by template sizes ranging from below 100 bytes to more than four kilobytes. This diversity of approaches suggests there is no prospect of a standard template, something that would require a common feature set to be extracted from faces. Interoperability in automated face recognition remains solidly based on images and on documentary standards for those, in particular the ICAO portrait specification [29] deriving from the ISO/IEC 19794-5 Token frontal standard [26], which are similar to certain ANSI/NIST Type 10 formats [28].

Training: The algorithms submitted to NIST have been developed using image datasets that developers do not disclose.

The development will often include application of machine learning techniques and will additionally involve iterative training and testing cycles. NIST itself does not perform any training and does not refine or alter the algorithm in any way. Thus the model, data files, and libraries that define an algorithm are fixed for the duration of the tests. This reflects typical operational reality where recognition software, once installed, is fixed and constant until upgraded.

This situation persists because on-site training of algorithms on customer data is atypical, essentially because training is not a turnkey process.

Automated search and human review: Virtually all applications using automated face search require human review of the outputs at some frequency: always for investigational applications; rarely in positive identification applications, after a rejection (false or otherwise); and rarely in negative identification applications, after an alarm (false or otherwise).

The human role is usually to compare a reference image with the query image, or the live subject if present, to render either a definitive decision of "exclusion" (different subjects) or "identification" (same subject), or a declaration that one or both images have "no value" and that no decision can be made. Note that automated face recognition algorithms are not built to do exclusion - low scores from a face comparison arise from both different faces and poor quality images of the same face.

Human reviewers make recognition errors [5,19,27] and are sensitive to image acquisition and quality. Accurate human review is supported by high resolution - as specified in the Type 50 and 51 acquisition profiles of the ANSI/NIST Type 10 record [28] - and by multiple non-frontal views, as specified in the same standard. These often afford views of the ear. Organizations involved in image collection should consider supporting human adjudication by collecting high-resolution frontal and non-frontal views, preparing low resolution versions for automated face recognition [26], and retaining both for any subsequent resolution of candidate matches. Along these lines, the ISO/IEC Joint Technical Committee 1 subcommittee 37 on biometrics has just initiated projects on image quality assessment and face-aware capture.

Next steps: NIST expects to publish a first report on demographic dependencies in face recognition in 2019. This will include the effects of age, sex and race.


Technical Summary

. Rank-based accuracy: The table below shows the false negative "miss rates" realized when searching a 12 million person gallery populated with FRVT 2018 mugshots. The two most accurate algorithms fail to return the correct mate somewhere within the top 50 ranks in fewer than 0.1% of searches (Table 1, rows 1-2). This is achieved for galleries populated with multiple images per person. In the case where only the most recent image is present, the miss rate is modestly higher (rows 3-4). The mates are almost always at rank 1, so in cases where only very short candidate lists can be used, the rank-1 miss rate is barely higher, 0.12% (row 5), which again rises modestly when persons are enrolled with a single image (row 7). All the miss rates are measured over a fixed set of 154 549 searches.

Row  Investigation  Enrolled subjects  Enrolled image  Num. images  Algorithm    FNIR (raw)  FNIR (corrected)
1    Rank-50        12M                Lifetime        26.1M        NEC-2        0.09%       0.09%
2    Rank-50        12M                Lifetime        26.1M        Microsoft-5  0.06%       0.06%
3    Rank-50        12M                Recent          12M          NEC-2        0.25%       0.08%
4    Rank-50        12M                Recent          12M          Microsoft-5  0.21%       0.09%
5    Rank-1         12M                Lifetime        26.1M        NEC-2        0.14%       0.12%
6    Rank-1         12M                Lifetime        26.1M        Microsoft-5  0.25%       0.24%
7    Rank-1         12M                Recent          12M          NEC-2        0.31%       0.13%
8    Rank-1         12M                Recent          12M          Microsoft-5  0.52%       0.37%
9    Rank-50        640K               Lifetime        1.25M        NEC-2        0.08%       0.08%
10   Rank-50        640K               Lifetime        1.25M        Microsoft-5  0.04%       0.04%

Table 1: Rank-based accuracy floor, 2018.

The lowest false negative error rate recorded in this report (0.038%, row 10) corresponds to just 58 misses. Given such low error rates, what misses remain? By inspection they arise in six categories, those due to: a) ageing, i.e. long-term time lapse between images; b) images of injured individuals, e.g. bruised or bandaged faces; c) the presence of a second face, e.g. printed on a T-shirt; d) images of some object that is not a face; e) profile-view images; and f) actual clerical ID label errors. As discussed in section 3.8.2, the first three categories are legitimately part of a test designed to measure accuracy on portrait images collected in law-enforcement settings. The latter three categories, however, should not be included in a test that is attempting to measure accuracy on only frontal images. Thus, by removing all known images in those categories, the rightmost column shows error rates that would be attainable in an application where exclusively frontal portrait images were collected without identity labeling errors.
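For concreteness, the rank-based miss rate reported in Table 1 can be sketched as follows, assuming the rank of each mated search's enrolled mate has been recorded (a hypothetical data layout, not the NIST test harness):

    def rank_miss_rate(mate_ranks, R):
        """FNIR at rank R with T = 0: mate_ranks holds, for each mated search,
        the rank of the enrolled mate (None if the mate was not returned)."""
        misses = sum(1 for r in mate_ranks if r is None or r > R)
        return misses / len(mate_ranks)

    # e.g. the report's lowest figure: 58 misses over 154 549 searches.
    assert abs(rank_miss_rate([1] * 154491 + [None] * 58, R=50) - 0.00038) < 1e-5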

Error rates today are two orders of magnitude below what they were in 2010, a massive reduction that stems from wholesale replacement of the old algorithms with those based on (deep) convolutional neural networks (CNNs). This constitutes a revolution rather than the evolution that defined the period 2010-2013. The rapid innovations around CNN architectures and loss functions, both proprietary and published in the academic literature¹, may yet produce further gains. Even without that possibility, the results imply that prospective end-users should establish whether installed algorithms pre-date the development of the prototypes evaluated here, and inquire with suppliers on the availability of the latest versions. The gains mean that searches that had previously failed to yield candidates may now do so, such that unsolved cases could be revisited.

Given this impressive achievement - close to perfect recognition - an advocate might claim that frontal face recognition is a solved problem, a statement that should be refuted with the following context and caveats:

. Algorithm accuracy spectrum: Many algorithms do not achieve the low error rates tabulated above, and while many of those may still be useful and valuable to end-users, only the most accurate excel on poor quality images and those collected long after the initial enrollment sample.

. Versioning: While results for up to seven algorithms from each developer are reported here, the intra-provider accuracy variations are usually smaller than the inter-provider variations. That said, successive versions can give an order of magnitude fewer misses. Some developers demonstrate speed-accuracy tradeoffs². See Figs. 17, 18.

¹ For example, ResNets [11], Inception [23], very deep networks [18,21] and spatial transformers.

² NEC-0 prepares templates much faster than NEC-2 but gives twenty times more misses. Dermalog-5 executes a template search much more quickly than Dermalog-6 but is also much less accurate.


. Quality: The low error rates here are attained using mostly excellent, cooperative, live-capture mugshot images collected with an attendant present. Recognition in other circumstances, particularly those without a dedicated photographic environment and human or automated quality control checks, will lead to declines in accuracy. This is documented here for poorer quality webcam images and unconstrained "wild" images.

. Low similarity scores: In thousands of cases the correct gallery image is returned at rank 1, but its similarity score is nevertheless low, below some operationally required score threshold. This does not matter when face recognition is used for "lead generation" in investigational applications, because human reviewers are specifically required to review potentially long candidate lists and the threshold is effectively 0. In applications where search volumes are higher and labor is not available to review the results of searches, a higher threshold can be applied. This reduces the length of candidate lists and false positive identification rates at the expense of increased false negative miss rates. The tradeoff between the two error rates is reported extensively later.
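One common way to set such a threshold, sketched here under assumed data (the function name and the stand-in score distribution are illustrative only), is to take a quantile of the non-mate top-score distribution so that a target FPIR is met:

    import numpy as np

    def threshold_for_fpir(nonmate_top_scores, target_fpir):
        """Pick T so that roughly target_fpir of non-mated searches return any
        candidate: T is the (1 - target_fpir) quantile of non-mate top scores."""
        return float(np.quantile(np.asarray(nonmate_top_scores), 1.0 - target_fpir))

    # Stand-in scores only; real non-mate score distributions are algorithm-specific.
    scores = np.random.default_rng(0).normal(10.0, 1.0, 331_254)
    T = threshold_for_fpir(scores, target_fpir=0.001)  # 1 in 1000 searches alarm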

. Population size: As the number of enrolled subjects grows, some mates are displaced from rank one, decreasing accuracy. As tabulated later for N up to 12 million, false negative rates generally rise slowly with population size.

. Database integrity: An operational error rate should be added to all false negative rates in this report, reflecting the proportion of images in a real database that are un-matchable. Such anomalies arise from images that: do not contain a face; include multiple persons; cannot be decoded; are rotated by 90 or 180 degrees; depict a face on clothing; and others introduced by a long tail of various clerical errors. While the mugshot trials in this report have been constructed to minimize such effects, they are a real problem in actual operations.

. Threshold-based accuracy: Recognition accuracy is very strongly dependent on the algorithm and, more generally, on the developer of the algorithm. False negative error rates in a particular scenario range from a few tenths of one percent to beyond fifty percent. This is tabulated exhaustively later: for example, Table 22 shows accuracy across datasets.

[Figure 1: Miss rates across the false positive range. Error-tradeoff curves plotting false negative identification rate, FNIR(T), against false positive identification rate, FPIR(T), for the algorithms cogent_2, cognitec_3, idemia_4, microsoft_4, nec_3, neurotechnology_5, rankone_5, visionlabs_7 and yitu_4. Annotations mark score regimes for the same photo under two IDs, the same person under two IDs, twins, siblings, and lookalikes. Investigation always uses human review; identification seldom uses human review.]

The figure compares algorithms on mugshot searches in a consolidated gallery of 12 million subjects and 26.1 million photos. In positive or negative identification applications, a score threshold is set to limit the rate at which non-mate searches produce false positives. This has the consequence that some mated searches will report the mate below threshold, i.e. a miss, even if it is at rank 1. The utility of this is that many non-mated searches will usually not return any candidate identities at all.


As the error-tradeoff characteristic shows, investigational miss rates on the right side are very low, but then rise steadily (in the center region) as the threshold is increased to support "lights-out" applications, and ultimately rise quickly (left side) as discussed below. Thus, if we demand that just one in one thousand non-mate searches produce any false positives, the most accurate algorithm there (NEC-3) would fail on 7.9% of mated searches. Even though the graph shows results for the most accurate algorithms, all but two would fail to find the mate in more than 10% of mated searches. The NEC algorithm produces a relatively flat error tradeoff until the threshold is raised to limit false positives to about 1 in 400 non-mated searches³. Thereafter, as the threshold is raised to further reduce false positives, miss rates rise rapidly. This means that low false positive identification rates are inaccessible with these algorithms, a result that does not apply to ten-finger identification algorithms. The rapid rise occurs because the lower mate scores are mixed with very high non-mate scores: the low mate scores arise from poor image quality and ageing; the high non-mate scores from the presence of lookalikes (doppelgangers), twins (discussed next) and, ultimately, a few unconsolidated subjects, i.e. persons present under multiple IDs.

. False positives from twins: Enrolling 640 000 mugshots, adding photos of one twin, and then searching with photos of those subjects and their twins shows, for one typical algorithm, that similarity is generally greater when searching twins against themselves (A) than when searching twins against their siblings (B), but very often still above even stringent thresholds, i.e. those corresponding to one in one thousand searches producing a false positive. Thus twins will very often produce a high-scoring non-match on a candidate list, and a false alarm in an online identification system.

[Figure 2: Intra- and inter-twin scores. Distributions of similarity scores for gallery Twin A searched with probe Twin A or B, by type of twin (AA fraternal same-sex, AA identical same-sex, AB fraternal different-sex, AB fraternal same-sex, AB identical same-sex), with threshold values marked at FPIR = 0.001, 0.003, 0.010 and 0.030.]

The plot shows that some fraternal twins are correctly rejected at those thresholds - these are largely from different-sex twins (at center). Figure 21 shows substantially similar behavior for all algorithms tested. In an investigative search, a twin would typically appear at rank 1, or at rank 2 if their sibling happened to also be in the gallery. Twins (and triplets, etc.) constituted 3.3% of all live births [17] in recent years⁴, and because that number is higher today than when the individuals in current adult databases were born, the false positives that arise from twins are now, and will increasingly be, an operational problem. Twin birth rates also vary considerably by region: relative to the United States, twins are much less common in East Asia and much more common in Sub-Saharan Africa [22]. The presence of twins in the mugshot database is inevitable given its size, around 12.3 million people. As this is not an insignificant sample of the domestic United States population, people with other familial ties will be present also. The data was collected over an extended period, and because location information is not available, we are unable to estimate the proportion of the domestic population that is present in the dataset. However, if we assume twins are neither more nor less disposed to arrest than the general population, we can estimate that hundreds of thousands of individuals in the dataset are twins. This will affect false positive rates, because we randomly set aside 331 254 individuals for non-mate searches, and some proportion of those will be twins with siblings in the gallery.

³ The gallery size here is 12 million people, 26.1 million images. Given 331 254 non-mated searches, an exhaustive implementation of one-to-many search would execute 8.6 trillion comparisons. At a false positive identification rate of 0.0025, the number of false positives is, to first order, 828, corresponding to a single-comparison false match rate of 828 / 8.6 trillion = 9.6 × 10⁻¹¹, i.e. about 1 in 10 billion. Strictly, this FMR computation is meaningful only for algorithms that implement 1:N search as N 1:1 comparisons, which is not always the case.

⁴ See the CDC's National Vital Statistics Report for 2017: https://www.cdc.gov/nchs/data/nvsr/nvsr67/nvsr67_08-508.pdf
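The arithmetic in footnote 3 can be verified directly; this snippet merely reproduces the numbers quoted there:

    comparisons = 331_254 * 26.1e6       # ~8.6 trillion 1:1 comparisons
    false_positives = 0.0025 * 331_254   # FPIR = 0.0025 -> ~828 false positives
    fmr = false_positives / comparisons  # ~9.6e-11, about 1 in 10 billion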


. False negatives from ageing: A large source of error in long-run applications where subjects are not re-enrolled on a schedule is ageing. This is a function of the time elapsed between photographs. Change in facial appearance causes recognition similarity scores to decline, such that over the longer term accuracy will decline. All faces age, and while this usually proceeds in a graceful and progressive manner, drug use can accelerate it [30]. Elective surgery may be effective in delaying it, although this has not been formally quantified with face recognition. As ageing is essentially unavoidable, it can only be mitigated by scheduled re-capture, as in passport re-issuance. To quantify ageing effects, we used the more accurate algorithms to enroll the earliest image of 3.1 million adults and then search with 10.3 million newer photos taken up to 18 years after the initial enrollment photo. In the table below, accuracy is seen to degrade progressively with time, as mate scores decline and non-mates displace mates from the rank 1 position. More accurate algorithms tend to be less sensitive to ageing: the more accurate algorithms give fewer errors after 18 years of ageing than middle-tier algorithms give after four. Note also that we do not quantify an ageing rate - more formal methods [2], borrowed from the longitudinal analysis literature, have been published for doing so (given suitable repeated-measures data). See Figures 62, 72 and 77.

Algorithm    Metric, FNIR at   (0,2]  (2,4]  (4,6]  (6,8]  (8,10]  (10,12]  (12,14]  (14,18]
nec-2        Rank = 1           0.3    0.4    0.4    0.4    0.4     0.5      0.6      0.4
microsoft-4  Rank = 1           0.3    0.5    0.6    0.7    0.9     1.0      1.3      1.6
yitu-4       Rank = 1           0.6    0.8    0.8    0.8    0.9     1.1      1.5      2.1
everai-3     Rank = 1           0.5    0.7    0.9    1.1    1.3     1.5      1.8      2.2
idemia-4     Rank = 1           1.1    1.5    1.9    2.3    2.8     3.1      3.7      5.1
cogent-3     Rank = 1           0.8    1.1    1.3    1.5    1.7     1.9      2.4      3.1
cognitec-2   Rank = 1           1.0    1.4    1.7    2.0    2.4     2.6      3.1      3.9
nec-2        FPIR = 0.001       0.7    0.9    1.1    1.3    1.5     1.7      2.1      2.7
microsoft-4  FPIR = 0.001       2.7    4.7    7.2   10.1   12.9    16.1     20.5     25.9
yitu-4       FPIR = 0.001       1.2    2.0    3.1    4.7    6.7     9.6     14.2     20.1
everai-3     FPIR = 0.001       3.5    6.2    9.3   12.9   16.2    19.6     24.1     29.2
idemia-4     FPIR = 0.001       3.7    5.9    8.3   11.0   13.4    15.8     19.1     24.8
cogent-3     FPIR = 0.001       5.8    9.7   14.2   19.2   23.8    28.4     34.4     42.1
cognitec-2   FPIR = 0.001       5.2    8.8   12.7   17.1   21.0    24.6     29.2     35.3

Table 2: Impact of ageing on accuracy. FNIR (%) by years elapsed between enrollment and search images.
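The computation behind Table 2 can be sketched as follows, assuming each mated search carries the elapsed years between its enrollment and search photos (a hypothetical data layout, shown for the rank-1 metric):

    from collections import defaultdict

    # (lo, hi] year bins matching Table 2.
    BINS = [(0, 2), (2, 4), (4, 6), (6, 8), (8, 10), (10, 12), (12, 14), (14, 18)]

    def fnir_by_elapsed_years(searches):
        """searches: iterable of (elapsed_years, mate_at_rank1) pairs, where
        mate_at_rank1 is True when the mate was returned at rank 1."""
        hits, totals = defaultdict(int), defaultdict(int)
        for years, hit in searches:
            for lo, hi in BINS:
                if lo < years <= hi:
                    totals[(lo, hi)] += 1
                    hits[(lo, hi)] += bool(hit)
                    break
        return {b: 1.0 - hits[b] / totals[b] for b in totals}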

. Image quality matters: Poor quality photographs undermine recognition, either because the imaging system is poor (lighting, camera, etc.) or because the subject mis-presents to the camera (head orientation, facial expression, occlusion, etc.). Imaging problems can be mitigated by design, i.e. by ensuring adherence to long-standing face image capture standards. Presentation problems, however, must be detected at capture time, either by the photographer or by an automated system, and re-capture performed. The most accurate algorithms in FRVT are highly tolerant of image quality problems. This derives from the invariances afforded by CNN-based algorithms, and is the fundamental reason why accuracy has improved since 2013. For example, the Microsoft algorithms can match many profile-view images to frontal mugshots - see Figures 100 and 102. As the table below shows, rank-1 false negative identification rates are much higher with wild images than with webcam images and, in turn, mugshots. Further, even with the most capable algorithms, comparison scores are lower with unconstrained images, so that when (high) thresholds are necessary to limit false positives, here to 1 in 100 searches, error rates are very high. Such figures should guide prospective users of face recognition to consider whether face recognition can meet a formal written accuracy requirement.

Algorithm     Metric         Wild   Mugshot   Webcam
cognitec-3    Rank = 1        5.1     0.9       2.5
everai-3      Rank = 1        3.8     0.5       1.9
idemia-5      Rank = 1        4.4     1.1       3.9
microsoft-5   Rank = 1        3.3     0.3       1.1
nec-3         Rank = 1        8.8     0.3       1.0
ntechlab-6    Rank = 1        3.8     0.6       1.7
visionlabs-5  Rank = 1        4.3     0.4       1.9
yitu-4        Rank = 1        4.4     0.4       0.8
cognitec-3    FPIR = 0.01    32.5     2.8      10.0
everai-3      FPIR = 0.01    35.7     1.8       6.0
idemia-5      FPIR = 0.01    34.0     2.8      10.2
microsoft-5   FPIR = 0.01    34.4     1.2       4.1
nec-3         FPIR = 0.01    38.0     0.4       1.3
ntechlab-6    FPIR = 0.01    38.1     2.1       5.9
visionlabs-5  FPIR = 0.01    34.4     2.2       8.7
yitu-4        FPIR = 0.01    30.6     0.7       1.7

Table 3: Impact of image quality on accuracy. FNIR (%) by image type.

. Accuracy in large populations: This report documents identification accuracy in galleries containing up to 12 million people and 26.1 million images. False negative rates climb very slowly as population size increases. For the most accurate algorithm, NEC-2, when searching a database of 640 000 people, about 0.26% of searches fail to produce the


correct mate as its best hypothesized identity. In a database of 12 000 000, this rises to just 0.31%. This benign growth in miss rates is fundamentally the reason for the utility of face recognition in large-scale one-to-many search applications. See Table 12 and Figure 22.

The reason for this is that as more identities are enrolled into a database, the possibility of a false positive increases, due to lookalike faces that yield extreme values from the right tail of the non-mate score distribution. However, these scores are lower than most mate scores, such that when an identification algorithm is configured with a threshold of zero (so human adjudication is always necessary), rank-one identification miss rates scale very favorably with population size N, growing slowly, approximately as a power law, FNIR(N) ≈ a·N^b with b ≪ 1. This dependency was first noted in 2010.

Depending on the algorithm, the exponent b for mugshot searches is low, around 0.06 for some of the more accurate algorithms with up to 12 million identities. See Table 12.
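A worked example of this power-law model, using the two NEC-2 operating points quoted above; the extrapolation at the end is hypothetical and assumes the power law continues to hold:

    import math

    # Two NEC-2 operating points from the text: FNIR 0.26% at N = 640 000
    # and 0.31% at N = 12 000 000.
    n1, f1, n2, f2 = 640_000, 0.0026, 12_000_000, 0.0031
    b = math.log(f2 / f1) / math.log(n2 / n1)   # ~0.06, matching Table 12
    a = f1 / n1 ** b
    # Hypothetical extrapolation, assuming the power law continues to hold:
    fnir_100m = a * 100e6 ** b                  # ~0.35% at N = 100 million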

In any case, variations in accuracy with increasing population size are small relative to both ageing and algorithm choice. See Figure 20.

. Utility of adjudicating long candidate lists: In the regime where a system is configured with a threshold of zero, and where human adjudication is always necessary, the reviewer will find some mates quite far down candidate lists. This usually occurs because either the probe image or its corresponding enrolled mate image has poor quality or a large time lapse. The accuracy benefit of traversing, say, 50 candidates versus just the first one is broadly a reduction in error by up to a factor of two. See Figure 30, and compare Tables 12 and 13.

However, accuracy from the leading algorithm is now so high - mates that in 2013 were placed at rank > 1 are now at rank 1 - that reviewers can expect to review substantially fewer candidates. Note, however, that for the proportion of searches where there is no mate, reviewers might still examine all candidates, fruitlessly. This report does not address the issue of human error in adjudicating candidates produced in one-to-many searches.

. Utility of enrolling multiple images per subject: We ran three kinds of enrollment: first, enrolling just the most recent image; second, creating a single template from a person's full lifetime history of images; and third, enrolling multiple images of a person separately, as though under different identities. The overall effect is that enrollment of multiple images yields as much as a factor of two lower miss rates. This occurs due to higher information content, and because the most recent image may sometimes be of poorer quality than historical images. See Table 12. Gains depend on the number of available images: FNIR drops steadily. Some algorithms reduce FPIR or maintain it - the desirable behaviors - but others give higher false positive rates. See the Figures leading up to Figure 87.

. Reduced template sizes: There has been a trend toward reduced template sizes, i.e. a smaller feature representation of an image. In 2014, the most accurate algorithm used a template of 2.5 KB; the figure in 2018 is around 1600 bytes. Close competitors produce templates of 256, 364, 512, and about 2000 bytes. In 2014, the leading competitors had templates of 4 KB to 8 KB. Some algorithms, when enrolling more than one image of a person, produce a template whose size is independent of the number of images given to the algorithm. This can be achieved by selecting a "best" image, or by integrating (fusing) information from the images. See Table 16.
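One plausible way a template can remain fixed-size regardless of the number of enrolled images is to fuse per-image feature vectors, e.g. by averaging L2-normalized embeddings and re-normalizing. Actual FRVT templates are proprietary, so the sketch below illustrates the idea, not any developer's method:

    import numpy as np

    def fuse_templates(embeddings):
        """embeddings: (k, d) array of L2-normalized per-image feature vectors.
        Returns a single d-dimensional template regardless of k."""
        v = embeddings.mean(axis=0)
        return v / np.linalg.norm(v)

    # Five hypothetical 512-dimensional lifetime images -> one 512-float template.
    e = np.random.default_rng(1).normal(size=(5, 512))
    e /= np.linalg.norm(e, axis=1, keepdims=True)
    template = fuse_templates(e)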

. Template generation times: Template generation times, as measured on a single circa-2016 server processor core (an Intel Xeon CPU E5-2630 v4 running at 2.20 GHz), vary from below 20 milliseconds up to nearly 1 second. This wide variation across developers may be relevant to end-users who have high-volume workflows. There has not been a broad downward trend since 2014. Note that speeds may exceed the figures reported here by exploiting new vector instructions on recent chips. Note also that GPUs were not used; while indispensable for training CNNs, they are not necessary for feeding an image forward through a network. See Table 16.

. Search duration and scalability: Template search times, as measured on the same circa-2016 Intel server processor cores, vary massively across the industry. For a database of 1 million subjects, and the more accurate implementations, durations range from below 1 millisecond to 500 milliseconds, with other, less accurate, algorithms being much slower still. Several algorithms exhibit sublinear search time, i.e. the duration does not double with a doubling of the enrolled population size, N. This was noted in 2014 also, but has improved in 2018, such that close-to-logarithmic growth and extremely fast search are evident for several developers' algorithms. The consequence is that as N increases, even the fastest linear algorithm (NEC-3) will quickly become much slower than the strongly sublinear algorithms. For the Dermalog-5 algorithm, search of a template against a database of N = 12 million images takes 850 microseconds on a single core of a contemporary CPU. That is faster than any other algorithm achieves even with the smallest gallery we tested (N = 640 000). See Table 6 and Figure 111.
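The linear-versus-sublinear distinction can be made concrete with a toy cost model; the constants below are illustrative only and do not correspond to any measured algorithm:

    import math

    def crossover_population(c_linear, c_sublinear):
        """Smallest N (by doubling) at which a c_linear*N exhaustive scan becomes
        slower than an indexed search costing roughly c_sublinear*log2(N)."""
        n = 2
        while c_linear * n <= c_sublinear * math.log2(n):
            n *= 2
        return n

    # Illustrative constants only: 100 ns per 1:1 comparison vs. 0.5 ms per-probe
    # index overhead. Real costs vary widely (Table 6, Figure 111).
    print(crossover_population(1e-7, 5e-4))  # -> 131072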

. Accuracy gains June to October 2018: NIST Interagency Report 8238 documented massive gains from 2013 to 2018. This report shows most developers also achieved gains over the four-month interval between June and October 2018. For a set of 12 million subjects enrolled with their most recent mugshot image, the table below shows, for selected algorithms, the proportion of searches in which mates are not returned against the given criteria (column 2). Substantial reductions in false negatives - by a factor of two or more - were realized by algorithms submitted by Cogent, Cognitec, Dermalog, Hikvision, Innovatrics, NEC, Rank One, Shaman, Tiger-IT, and Vigilant Solutions. In particular, in this same period one developer, NEC, which had produced broadly the most accurate algorithms in 2010 and 2013, submitted algorithms that are substantially more accurate than their June 2018 versions and on many measures are now the most accurate. A number of other developers produced slightly less accurate implementations. See Tables 16 and 19, and Figure 19.

Application      Metric          Date      Algorithm    FNIR (miss rate)
Investigation    Rank = 1        2018-JUN  NEC-0         3.20%
Investigation    Rank = 1        2018-OCT  NEC-2         0.31%
Investigation    Rank = 1        2018-JUN  Microsoft-4   0.45%
Investigation    Rank = 1        2018-OCT  Microsoft-5   0.52%
Investigation    Rank = 1        2018-JUN  Yitu-2        0.55%
Investigation    Rank = 1        2018-OCT  Yitu-5        0.55%
Identification   FPIR = 0.001    2018-JUN  NEC-0        20.0%
Identification   FPIR = 0.001    2018-OCT  NEC-3         5.8%
Identification   FPIR = 0.001    2018-JUN  Microsoft-4  15.8%
Identification   FPIR = 0.001    2018-OCT  Microsoft-6  15.6%
Identification   FPIR = 0.001    2018-JUN  Yitu-2       12.4%
Identification   FPIR = 0.001    2018-OCT  Yitu-5       11.1%

Table 4: Accuracy gains, June to October 2018 (mugshot mode).

. Non-technical considerations: Recognition accuracy is likely the most important technical indicator for an algorithm. But even among the more accurate developers, accuracy, template size, and resource consumption vary widely. This, incidentally, implies that technological diversity remains, that there is no consensus on approach, and that algorithms are not commoditized. But beyond the performance statements in this report, face recognition outcomes in complete systems will be influenced by code and model size, software maturity, extensibility, reliability, ease of integration and maintenance, cost, availability of monitoring tools, and support for human review of true and false matches using, for example, capable graphical user interfaces.

. Conclusions: As with other biometrics, the accuracy of face recognition implementations varies greatly across the industry. Absent other performance or economic parameters, users should prefer the most accurate algorithm. Note that accuracy, and algorithm rankings, vary somewhat with the kinds of images used and the mode of operation: investigation with zero threshold, or identification with a high threshold.

. Supplementary Data: This document is accompanied by a supplement that includes a three-page report for each of the algorithms evaluated. Each report includes various performance plots pertinent to the particular algorithm under test. The supplement, which currently runs to more than 600 pages, is available from the same webpage as this report.


Release Notes

FRVT Activities: NIST restarted FRVT's one-to-many track in February 2018, inviting participants to send up to seven prototype algorithms. Since February 2017, NIST has been evaluating one-to-one verification algorithms on an ongoing basis. Developers may submit updated algorithms to NIST at any time, but no more frequently than once every four calendar months. This more closely aligns development and evaluation schedules. Results are posted to the web within a few weeks of submission. Details and the full report are linked from the Ongoing FRVT site.

FRVT Reports: The results of the FRVT appear in the series of NIST Interagency Reports tabulated below. The reports were developed separately and released on different schedules. In prior years NIST mostly reported FRVT results as a single report; this had the disadvantage that results from completed sub-studies were not published until all other studies were complete.

Date        Link  Title                                                                                          No.
2014-03-20  PDF   FRVT Performance of Automated Age Estimation Algorithms                                       7995
2015-04-20  PDF   Face Recognition Vendor Test (FRVT) Performance of Automated Gender Classification Algorithms 8052
2014-05-21  PDF   FRVT Performance of Face Identification Algorithms                                            8009
2017-03-07  PDF   Face In Video Evaluation (FIVE) Face Recognition of Non-Cooperative Subjects                  8173
2017-11-23  PDF   The 2017 IARPA Face Recognition Prize Challenge (FRPC)                                        8197
2018-04-13  WWW   Ongoing Face Recognition Vendor Test (FRVT)                                                   Draft

Details appear on pages linked from https://www.nist.gov/programs-projects/face-projects.

Appendices: This report is accompanied by appendices which present exhaustive results on a per-algorithm basis. These are machine-generated and are included because the authors believe that visualization of such data is broadly informative and vital to understanding the context of the report.

Typesetting: Virtually all of the tabulated content in this report was produced automatically, using scripting tools to generate directly typesettable LaTeX content. This improves timeliness, flexibility and maintainability, and reduces transcription errors.

Graphics: Many of the figures in this report were produced using the ggplot2 package running under R, the capabilities of which extend beyond those evident in this document.


Contents

Acknowledgments
Disclaimer
Executive Summary
Scope and Context
Technical Summary
Release Notes
1 Introduction
2 Evaluation datasets
3 Performance metrics
4 Results
Appendices
A Accuracy on large-population FRVT 2018 mugshots
B Effect of time-lapse: Accuracy after face ageing
C Effect of enrolling multiple images
D Accuracy with poor quality webcam images
E Accuracy for profile-view to frontal recognition
F Accuracy when identifying wild images
G Search duration
H Gallery Insertion Timing


1 Introduction

One-to-many identification represents the largest market for face recognition technology. Algorithms are used across the world in a diverse range of biometric applications: detection of duplicates in databases, detection of fraudulent applications for credentials such as passports and driving licenses, token-less access control, surveillance, social media tagging, lookalike discovery, criminal investigation, and forensic clustering.

This report contains a breadth of performance measurements relevant to many applications. Performance here refers to accuracy and resource consumption. In most applications, the core accuracy of a face recognition algorithm is the most important performance variable. Resource consumption is also important, as it drives the amount of hardware, power, and cooling necessary to accommodate high-volume workflows. Algorithms consume processing time, require computer memory, and their static template data requires storage space. This report documents these variables.

1.1 Open-set searches

FRVT tested open-set identification algorithms. Real-world applications are almost always "open-set", meaning that some searches have an enrolled mate but some do not. For example, some subjects have truly not been issued a visa or driver's license before; some law enforcement searches are of first-time arrestees⁶. In an open-set application, algorithms make no prior assumption about whether or not to return a high-scoring result. For a mated search, the ideal behavior is that the search produces the correct mate at a high score and first rank; for a non-mated search, the ideal behavior is that the search produces zero high-scoring candidates.

Too many academic benchmarks execute only closed-set searches, with the proportion of mates found at the rank-one position as the default accuracy metric. This hit-rate metric ignores the score with which a mate is found: weak hits count as much as strong hits. It thereby ignores the real-world imperative that, in many applications, it is necessary to elevate a threshold to reduce the number of false positives.
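Under an assumed data layout (not the NIST test harness), the two open-set metrics used throughout this report can be sketched as:

    def fnir(mated, R, T):
        """FNIR(N, R, T): mated is a list of (mate_rank, mate_score) per mated
        search, with mate_rank None when the mate was not returned at all."""
        misses = sum(1 for rank, score in mated
                     if rank is None or rank > R or score < T)
        return misses / len(mated)

    def fpir(nonmate_top_scores, T):
        """FPIR(N, T): fraction of non-mated searches returning any candidate
        at or above threshold T."""
        return sum(1 for s in nonmate_top_scores if s >= T) / len(nonmate_top_scores)

    # T = 0 recovers investigation (FPIR = 1 for nonnegative scores, rank-based
    # FNIR); T > 0 implements identification with bounded false positives.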

2 Evaluation datasets

This report documents accuracy for four kinds of images - mugshots, webcam images, profile views and wild images - as described in the following sections.

2.1 Mugshot images

The main mugshot dataset used is referred to as the FRVT 2018 set. It was collected over the period 2002 to 2017 in routine United States law enforcement operations, and has been extracted from a larger operational parent set by excluding non-face images and setting aside webcam and profile-view images for use in separate tests.

NIST Interagency Report 8238 includes a comparison of this set of mugshots with the smaller and easier sets of mugshots used in tests run in 2010 and 2014.

⁶ Operationally, closed-set applications are rare because it is usually not the case that all searches have an enrolled mate. One counter-example is a cruise ship in which all passengers are enrolled and all searches should produce exactly one identity. Another example is forensic identification of dental records from an aircraft crash.
