Automated screening using AI

Reviewed: 20-01-2023

Clinical question

What is the role of automatic grading systems in screening for diabetic retinopathy in patients with diabetes mellitus (type 1 and 2)?

Recommendation

Consider using an automated DR grading system as one of the tools to support programmatic screening for diabetic retinopathy in patients with diabetes mellitus:

Deploy an automatic grading system within a ‘triage’ strategy: have all positive results (referable DR) assessed by a human grader before referral to an ophthalmologist.
Use a system certified under the Medical Device Regulation (MDR) (class IIa, CE-registered) and prefer a system that has additionally been authorised by the FDA.
Be alert to whether the system reports its result per eye or for both eyes together. Some grading systems only report a single result for both eyes, which makes it impossible to act in accordance with the current guideline, i.e., repeat screening when R1 is found in one eye, but refer to an ophthalmologist when R1 is found in both eyes.
Be aware that the performance of automatic grading systems in distinguishing mild (R1) from no (R0) diabetic retinopathy has not been reported in the literature. As a result, the screening interval can possibly be extended from one year to two or three years in fewer patients.
Prefer a system that gives direct feedback on the quality of the photograph and provides insight into how the diagnosis was reached.

Considerations

Advantages and disadvantages of the intervention and the quality of the evidence

In total, 17 studies (i.e., one SR and 16 individual studies) were described that address the role of automatic grading systems in fundus screening for DR in patients with diabetes mellitus (type 1 and 2). Various automatic grading systems are used. The overall level of evidence is low, but it is moderate to good for the systems IDx-DR X2.1, EyeArtSystem v2.0 and, to a lesser extent, Google Inc. Based on the conclusions regarding the crucial outcome measures (sensitivity and specificity), we cannot properly assess whether there is a clinically relevant difference. However, the triage strategy of Heydon (2020) does show a clinically relevant advantage over the strategy based on human graders alone.

Not all class IIa CE-certified automatic grading systems were included in this analysis. For example, the RetCAD® algorithm achieved a sensitivity/specificity of 90.1%/90.6% for detecting referable DR in a retrospective study with routine photographs. This algorithm was not included in the analysis because it detects referable DR and macular degeneration simultaneously, rather than DR alone as specified in the PICO (Gonzalez-Gonzalo, 2020).

Multiple algorithms are used for the automatic grading systems, in different settings and populations. Often, proprietary algorithms are developed (using large existing datasets) and validated on data collected from hospital cohorts. In addition, no strategies are described for how an automatic grading system should be deployed in clinical practice, with the exception of one study (Heydon, 2020). This study states that it intends to deploy the system as a ‘triage’ strategy (i.e., as a filter before a patient is passed on to a human grader).

Many studies develop their algorithms on existing datasets that contain only high-quality photographs. ‘Poor’ photographs are then also removed from the datasets used to validate the algorithms. This affects the diagnostic values and is at odds with clinical practice, where poor photographs must also be assessed. For this reason, these values will be lower (worse) in practice. As a consequence, the number of false-positive test results may be higher in clinical practice. In the registration studies (FDA) of EyeArt and IDx-DR, the ‘diagnosability’ was investigated. For IDx-DR the diagnosability was 96.1%, meaning that the system was able to grade 96.1% of the photographs; 23% of the patients needed mydriasis for this. The diagnosability of EyeArt was 96.5% in a prospective multicentre clinical study that has not yet been published (Ipp, 2021). When these systems are used, the quality of the photograph is assessed and fed back immediately.

Strategies

When comparing the different strategies (replacement, i.e., automatic grading instead of human graders, or triage of the positive results of automatic grading), the number of false-positive test results can have different consequences. With the ‘replacement’ strategy, the costs of human graders are saved, but the costs of referral to the ophthalmologist may rise if automatic grading produces more false positives than routine screening by human graders. If, on the other hand, a triage strategy is chosen instead of direct referral to the ophthalmologist, less is saved on the costs of human graders, but the costs of referral to the ophthalmologist will almost certainly also fall, because fewer patients are referred to the ophthalmologist by human graders than under a replacement strategy. A human grader can correct many false-positive test results before the patient is referred to the ophthalmologist. An example of these proportions (TP, FP, TN, FN), based on the article by Heydon, is shown in the Appendix. A high percentage (84 to 93%) of the photographs with mild to moderate non-proliferative DR and no maculopathy (R1M0) were graded positive by the EyeArt software. Of the R0M0 group, 32.1% were reported as positive by EyeArt.

Without a triage strategy, this group would be referred to the ophthalmologist unnecessarily, which places a heavy burden on both the healthcare system and the patient. In the English system, all positive results are assessed by a human grader.
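The difference between the two strategies can be made concrete with a small calculation. The sketch below is a hypothetical illustration and not the analysis of Heydon (2020). It uses the EyeArt sensitivity (95.7%), specificity (54.0%) and prevalence of referable DR (7.2%) reported in this module. The cohort size, the human-grader values (taken from within the ranges reported by van der Heijden, 2018) and the assumption that human errors are independent of the errors of the automatic system are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    tp: float  # true positives (referred, disease present)
    fp: float  # false positives (referred, disease absent)
    fn: float  # false negatives (not referred, disease present)
    tn: float  # true negatives (not referred, disease absent)

    @property
    def referred(self) -> float:
        """Number of patients sent on to the next step."""
        return self.tp + self.fp


def grade(n_diseased: float, n_healthy: float,
          sensitivity: float, specificity: float) -> Counts:
    """Expected counts when one grader assesses a group of patients."""
    tp = n_diseased * sensitivity
    tn = n_healthy * specificity
    return Counts(tp=tp, fp=n_healthy - tn, fn=n_diseased - tp, tn=tn)


def replacement(n: float, prevalence: float,
                ai_sens: float, ai_spec: float) -> Counts:
    """Replacement: every positive automatic result goes to the ophthalmologist."""
    return grade(n * prevalence, n * (1 - prevalence), ai_sens, ai_spec)


def triage(n: float, prevalence: float, ai_sens: float, ai_spec: float,
           human_sens: float, human_spec: float) -> Counts:
    """Triage: a human grader re-assesses only the positive automatic results."""
    ai = replacement(n, prevalence, ai_sens, ai_spec)
    human = grade(ai.tp, ai.fp, human_sens, human_spec)
    # Patients missed by the automatic system never reach the human grader.
    return Counts(tp=human.tp, fp=human.fp,
                  fn=ai.fn + human.fn, tn=ai.tn + human.tn)


if __name__ == "__main__":
    n = 100_000  # hypothetical cohort size
    rep = replacement(n, 0.072, 0.957, 0.540)
    tri = triage(n, 0.072, 0.957, 0.540, human_sens=0.92, human_spec=0.99)
    print(f"replacement: {rep.referred:.0f} referrals, {rep.fp:.0f} false positive")
    print(f"triage:      {tri.referred:.0f} referrals, {tri.fp:.0f} false positive")
```

Under these assumptions triage sharply reduces the number of false-positive referrals, at the price of a somewhat larger number of missed cases, because a patient must now be flagged by both the automatic system and the human grader.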

In the study by Abramoff, the false positives in the R0M0 and R1M0 groups are not explicitly mentioned; only a specificity of 90.7% in the whole population is reported.

The study by Abramoff used an ‘enrichment’ strategy, so that the proportion of mtmDR was 23.8%. The implementation study by Heydon used a “regular” screening population, in which the proportion of referable DR was 7.2%, including 2.4% ungradable DR. Under both strategies, patients with a false-negative test result after automatic grading are not seen by the ophthalmologist. Both strategies therefore place high demands on the sensitivity of automatic grading, which should be at least comparable to the sensitivity of routine screening by human graders. In the study by van der Heijden (2018), the sensitivity of human graders for referable DR was 63%-92% and the specificity 99%-100%. In short: the sensitivity of human graders was lower and the specificity higher compared with automatic grading.
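The difference in prevalence between an enriched study population and a regular screening population matters because the proportion of positive results that are true positives (the positive predictive value) depends on prevalence. The sketch below illustrates this with Bayes' rule. The specificity (90.7%) and the two prevalences (23.8% and 7.2%) are taken from this module; the sensitivity of 87% is the upper end of the mtmDR range reported here. Applying the same test characteristics to both populations is an assumption made purely for illustration, since the two studies used different systems and disease definitions.

```python
def positive_predictive_value(sensitivity: float, specificity: float,
                              prevalence: float) -> float:
    """Fraction of positive test results that are true positives."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


if __name__ == "__main__":
    # Same (illustrative) test characteristics, two different prevalences.
    for label, prevalence in [("enriched", 0.238), ("screening", 0.072)]:
        ppv = positive_predictive_value(0.87, 0.907, prevalence)
        print(f"{label}: prevalence {prevalence:.1%}, PPV {ppv:.1%}")
```

With identical sensitivity and specificity, the lower prevalence of a regular screening population yields a clearly lower positive predictive value, so a larger share of the positive results are false positives.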

Liability

The European regulation, namely the Medical Device Regulation, has applied in the Netherlands since 2020 (https://www.rijksoverheid.nl/onderwerpen/medische-hulpmiddelen/nieuwe-wetgeving-medische-hulpmiddelen). It applies to all medical devices (including screening devices). The manufacturer of a medical device is responsible for its quality, performance and safety.

Under the rules on product liability, the manufacturer is liable to a patient if the patient suffers damage as a result of a defective product.

Within liability law, there is debate about whether a liable party can exist at all in the case of systems that act almost autonomously. In that situation, it is not reasonable for the damage to be borne by the victim. The literature discusses introducing a form of strict liability, which looks less at whether the damage was caused by a defect in the product and more at the actor responsible for preventing and minimising the danger and damage of the autonomous system. In most cases that will be the manufacturer. After all, the manufacturer determines which elements the system takes into account in its learning and decision-making process.

The Wet kwaliteit klachten en geschillen in de zorg (Wkkgz; the Dutch Healthcare Quality, Complaints and Disputes Act) contains provisions on the quality of care provided by healthcare professionals and healthcare institutions. The act imposes the obligation to provide ‘good care’.
The WGBO (the Dutch Medical Treatment Contracts Act) states that the care provider, in this case the medical specialist or general practitioner, enters into the medical treatment contract with the patient. On this basis, the patient can hold the physician and also the hospital liable for a shortcoming in the performance of the treatment contract if the care provider, in using the device, did not exercise the care of a good care provider. This provision allows the patient to also hold the general practitioner or medical specialist liable for the use of the device. There must, however, be a causal link between the damage and the defective product. Incidentally, a medical device is not defective if the medical specialist can demonstrate that the product was ‘state of the art’ at the time of use (juridische handreiking, 2020).

The General Data Protection Regulation (GDPR; in Dutch: Algemene verordening gegevensbescherming, AVG) has applied since 25 May 2018. When deploying AI, the organisation is subject to the GDPR's obligations for processing personal data. The GDPR places the responsibility on the organisation to demonstrate compliance with the law, by means of documents showing that the appropriate organisational and technical measures have been taken. There is a knowledge gap regarding how AI should be set up under the GDPR.

At the time of writing, the Federatie van Medisch Specialisten is participating at office level in the development of a field standard (Veldnorm) on the responsible development of high-quality AI in healthcare. With regard to the GDPR, the draft version contains the following:

The privacy of persons from whom data have been obtained must be respected and safeguarded by the developer.
Traceability of data to individuals must be prevented as far as possible, and as little data as possible must be used (data minimisation). The applicable regulations (currently the GDPR) are leading here.
In addition, where applicable, the developer must explicitly record in the data management plan how any incidental findings (findings that come to light during an investigation serving a different purpose) will be handled, as well as the right to destruction of data of the persons from whom data have been obtained.

During the development of this module, a guideline on the application of AI in healthcare was published (see: https://www.datavoorgezondheid.nl/wegwijzer-ai-in-de-zorg/documenten/publicaties/2021/12/17/leidraad-kwaliteit-ai-in-de-zorg). This guideline offers, among other things, support in assessing the quality of an AI system. It also provides insight into criteria and associated requirements for testing the quality of AI and establishing trust in AI.

Extending the screening interval when screening with automatic grading systems

According to the Dutch guideline ‘diabetische retinopathie’, the screening interval can be extended to 2 or 3 years when R0 (no DR) is found at one or two consecutive screenings, respectively. The version of IDx-DR available in the United States does not distinguish between no DR (R0) and mild DR (R1). The version for the European market can distinguish between no DR (R0) and mild DR (R1) in one or both eyes.
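The interval rule described above can be sketched as a small function. This is a minimal illustration, assuming that a patient's screening history is available as a list of per-screening grades with the most recent result last; the function names and the string labels are hypothetical. The function returns the longest interval the rule permits. A return value of 1 only means that no extension applies; it does not encode any referral decision.

```python
def consecutive_r0(history: list[str]) -> int:
    """Count how many of the most recent screenings in a row were graded R0."""
    count = 0
    for grade in reversed(history):
        if grade != "R0":
            break
        count += 1
    return count


def max_screening_interval_years(history: list[str]) -> int:
    """Longest interval permitted by the rule: R0 once -> 2 years, twice -> 3 years."""
    n = consecutive_r0(history)
    if n >= 2:
        return 3
    if n == 1:
        return 2
    return 1  # no R0 at the latest screening: no extension


if __name__ == "__main__":
    for history in (["R1", "R0"], ["R0", "R0"], ["R0", "R1"]):
        print(history, "->", max_screening_interval_years(history), "year(s)")
```

Because the rule hinges entirely on the R0 grade, it can only be applied when the grading system actually reports R0 separately from R1.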

EyeArt can also determine the grade of DR per eye, as well as the presence or absence of diabetic macular oedema. This per-eye grading performance has not been reported in the literature for either EyeArt or IDx-DR.

IDx-DR and EyeArt can identify patients without DR, which allows the screening interval to be extended to 2 or 3 years. The studies described earlier did not investigate the sensitivity and specificity for diagnosing R1 (mild NPDR) versus R0 (no DR). It is therefore possible that more patients are incorrectly classified for annual screening rather than screening every 2 or 3 years, or that patients are incorrectly classified for screening every 2 or 3 years instead of annual screening.

Mild DR in one or two eyes

In accordance with the guideline ‘diabetische retinopathie’, screening should be repeated after one year when mild DR is found in one eye, whereas referral to the ophthalmologist is indicated when mild DR is found in both eyes. The Dutch guideline uses the Scanlon classification (NHS DESP) for this, in which mild DR (R1) corresponds to ETDRS level 20-35.
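The per-eye rule can only be applied when a grade is available for each eye separately. The sketch below is a minimal illustration of that rule, assuming per-eye grades are given as the strings "R0" and "R1"; the function name and the wording of the returned actions are hypothetical. Grades beyond R1 fall outside the rule described here, so the function rejects them rather than guessing.

```python
def action_for_mild_dr(left_eye: str, right_eye: str) -> str:
    """Apply the mild-DR rule to per-eye grades (R0 = no DR, R1 = mild DR)."""
    grades = [left_eye, right_eye]
    if any(grade not in ("R0", "R1") for grade in grades):
        # Anything other than R0/R1 is outside the scope of this rule.
        raise ValueError("only R0 and R1 are covered by this sketch")
    eyes_with_mild_dr = grades.count("R1")
    if eyes_with_mild_dr == 2:
        return "refer to ophthalmologist"
    if eyes_with_mild_dr == 1:
        return "repeat screening after one year"
    return "no mild DR"


if __name__ == "__main__":
    print(action_for_mild_dr("R1", "R0"))
    print(action_for_mild_dr("R1", "R1"))
```

A system that reports only one combined result for both eyes cannot supply the two arguments this function needs, which is exactly the limitation noted in the recommendation.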

The IDx-DR result does not indicate whether mild DR was found in one or both eyes. However, IDx-DR uses the ICDR classification, in which ‘more than mild DR’ corresponds to ETDRS level ≥35.[1] Because IDx-DR uses a ‘stricter’ grading scale (ICDR), its threshold for rfDR is therefore slightly lower than that of the Dutch guideline.

Patient values and preferences (and, where applicable, those of their caregivers)

No studies have compared patients' values and preferences regarding an automatic grading system versus a human grader in screening for DR. One study does report that 78% of patients prefer an automatic grading system over a human assessor (Keel, 2018). At the same time, patients may trust such an automatic grading system less than a human grader. The trust of patients and healthcare providers could be greater if the algorithm makes its result transparent. A triage strategy could lead to more trust than a replacement strategy, because in a triage strategy patients identified by AI as having referable DR are only referred to the ophthalmologist after confirmation by a human grader. When the ‘triage’ strategy is used, fewer patients are referred to the ophthalmologist. As a result, fewer patients will worry about a possibly serious eye condition, and the burden on ophthalmic care is also reduced.

Costs (resource use)

No studies have directly compared the cost-effectiveness of an automatic grading system versus a human grader in screening for DR. One study did compare multiple systems and strategies (indirectly). Two systems (including EyeArt) saved costs compared with human graders, under both ‘replacement’ and ‘triage’. However, the ‘triage’ strategy was less cost-effective. In this study, photographs already graded by human graders (including low-quality photographs) were collected and re-graded by the automatic grading systems (Tufail, 2017). Note: IDx-DR withdrew from the study by Tufail (2017) for commercial reasons.

The article by Heydon estimates that DR screening with the EyeArt algorithm could save £0.5 million per 100,000 screening episodes. The British cost analysis cannot be translated directly to Dutch practice, because the screening strategies using human graders differ between the two countries.[2]

Also relevant for costs (and cost-effectiveness) is that automatic grading systems are deployed in Dutch clinical practice through integration into work processes, IT systems and electronic health records (EHRs). The extent to which this is possible differs per automatic grading system.

Acceptability, feasibility and implementation

No literature is available indicating how automatic grading systems can be deployed most effectively. If an automatic grading system is deployed to replace human graders, including under the ‘triage’ strategy, the deployment of these graders in this area decreases, which can save screening costs. These graders can then be deployed elsewhere in ophthalmic care.

Automatic grading systems may place higher demands on photo quality than human graders, so that photographs have to be retaken more often. An advantage is that several automatic grading systems give immediate feedback on photo quality, so that photographs can be retaken immediately in mydriasis.

It is possible that an automatic grading system misses other ophthalmic pathology that a human grader would have recognised.

These points may cause problems for the feasibility, acceptability and implementation of an automatic grading system for diabetic retinopathy in daily (clinical) practice.

Rationale for the recommendation: weighing the arguments for and against the interventions

Based on the literature analysis, we can conclude that automated screening for DR with class IIa CE-certified medical devices (guideline, 2016; wettenbank, 2021) is as good as or better than screening by human graders in terms of sensitivity. It is an acceptable alternative in programmatic screening for DR.

Implementation in programmatic screening programmes for diabetic retinopathy may improve the quality and efficiency of the screening process. This is important because the number of patients with diabetes will continue to rise in the coming years, and with it the required deployment of human graders and ophthalmologists.

In the only real-life implementation study, the specificity of the automatic grading system (EyeArt) was low (54%). The specificity of human graders was generally higher. In contrast, the specificity of the IDx grading system was high in the study by Abramoff (2018). The study by Heydon (2020) is the only study that investigated the diagnostic values of the system when deployed according to a triage strategy in daily practice. With the ‘replacement’ strategy, many patients with a false-positive result might be referred to the ophthalmologist, which is not cost-effective. With the triage strategy, all positive test results are first assessed by human graders before referral to the ophthalmologist. A human grader can correct many false-positive test results, so that fewer patients are referred to an ophthalmologist unnecessarily.

When choosing a system, it is important that the system is a medical device CE-registered as class IIa under the MDR and that it is used in accordance with the intended use defined by the manufacturer. This class carries strict requirements for the safety and performance of the product and for the manufacturer's quality management system. At present, preference is given to IDx-DR or EyeArt. For both systems this preference stems from the fact that an extensive De Novo review by the FDA took place for admission to the American market. For EyeArt, the preference is based on the fact that it is the only solution evaluated in a real-life study. However, both solutions are not (yet) CE-certified under the MDR, but under its predecessor, the Medical Device Directive (MDD).

In accordance with the DR guideline, screening should be repeated after one year when mild DR is found in one eye, whereas referral to the ophthalmologist is indicated when mild DR is found in both eyes.

The IDx-DR result does not indicate whether mild DR was found in one or both eyes, but it uses a ‘stricter’ grading scale than the guideline. EyeArt can determine the grade of DR per eye, but this per-eye performance has not been reported in the literature. It should be noted that the screening module is currently being revised.

Evidence base

Background

In the Netherlands, periodic screening for diabetic retinopathy (DR) is performed by ophthalmologists, optometrists and other graders; see also the module Screening op diabetische retinopathie. Because the number of patients with diabetes mellitus is increasing, optimisation of screening is necessary. Recently, several class IIa CE-certified automatic grading systems (deep learning-based algorithms) have become available, two of which have currently been admitted to the market by the FDA (de Jong-Hesse, 2021). These automatic grading systems screen for DR “autonomously”. “Autonomously” means that the system can independently make a diagnosis and give a referral recommendation without review by the healthcare provider.

Deployment of automatic grading systems may save time and costs and improve diagnostics (higher intra- and inter-rater reliability). This module therefore focuses on the value of automatic DR grading systems compared with routine DR screening by human graders. Based on the diagnostic value of automatic grading systems and a weighing of advantages and disadvantages, it can be determined whether these systems can partly or fully replace human graders. This module replaces the position statement “Geautomatiseerde screening op DR met klasse IIa CE-geregistreerde medical devices 04012021” of the NOG.

Conclusions / Summary of Findings

IDx-DR X2.1

Moderate GRADE

The sensitivity of automatic screening on referable diabetic retinopathy in patients with diabetes mellitus ranges from 91% to 100%, using screening by human graders or a reading center (Abramoff) as reference.

Sources: Abramoff, 2018; Nielsen, 2019; Shah, 2020a; van der Heijden, 2018; Verbraak, 2019.

Moderate GRADE

The specificity of automatic screening on referable diabetic retinopathy in patients with diabetes mellitus ranges from 82% to 87%, using screening by human graders or a reading center (Abramoff) as reference.

Sources: Abramoff, 2018; Nielsen, 2019; Shah, 2020a; van der Heijden, 2018; Verbraak, 2019.

Moderate GRADE

The area under the curve of automatic screening on referable diabetic retinopathy in patients with diabetes mellitus ranges from 0.94 to 0.98, using screening by human graders or a reading center (Abramoff) as reference.

Sources: Abramoff, 2018; Nielsen, 2019; Shah, 2020a; van der Heijden, 2018; Verbraak, 2019.

Google Inc.

Moderate GRADE

The sensitivity of automatic screening on referable diabetic retinopathy in patients with diabetes mellitus ranges from 97% to 98%, using screening by human graders as reference.

Sources: Gulshan, 2019; Nielsen, 2019 (2 studies); Ruamviboonsuk, 2019.

Moderate GRADE

The specificity of automatic screening on referable diabetic retinopathy in patients with diabetes mellitus ranges from 84% to 96%, using screening by human graders as reference.

Sources: Gulshan, 2019; Nielsen, 2019 (2 studies); Ruamviboonsuk, 2019.

EyeArtSystem v2.0

High GRADE

The sensitivity of automatic screening on referable diabetic retinopathy in patients with diabetes mellitus ranges from 91% to 96%, using screening by human graders as reference. The sensitivity was 95.7% in a real-life implementation setting.

Sources: Bhaskaranand, 2019; Heydon, 2020; Olvera-Barrios, 2020.

Moderate GRADE

The specificity of automatic screening on referable diabetic retinopathy in patients with diabetes mellitus ranges from 54% to 91%, using screening by human graders as reference. The specificity was 54% in a real-life implementation setting.

Sources: Bhaskaranand, 2019; Heydon, 2020.

High GRADE

The area under the curve of automatic screening on referable diabetic retinopathy in patients with diabetes mellitus ranges from 0.89 to 0.97, using screening by human graders as reference.

Sources: Bhaskaranand, 2019; Heydon, 2020; Olvera-Barrios, 2020.

Remaining algorithms

Low GRADE

The sensitivity of automatic screening on referable diabetic retinopathy in patients with diabetes mellitus ranges from 89% to 99%, using screening by human graders as reference.

Sources: He, 2019; Hsieh, 2020; Li, 2018; Nielsen, 2019; Olvera-Barrios, 2020.

Low GRADE

The specificity of automatic screening on referable diabetic retinopathy in patients with diabetes mellitus ranges from 90% to 98%, using screening by human graders as reference.

Sources: He, 2019; Hsieh, 2020; Li, 2018; Nielsen, 2019; Olvera-Barrios, 2020.

Low GRADE

The area under the curve of automatic screening on referable diabetic retinopathy in patients with diabetes mellitus ranges from 0.90 to 0.99, using screening by human graders as reference.

Sources: He, 2019; Hsieh, 2020; Li, 2018; Nielsen, 2019; Olvera-Barrios, 2020.

Summary of literature

Description of SR

Nielsen (2019) performed an SR to investigate the diagnostic performance of deep learning-based algorithms in screening patients with diabetes mellitus for DR, compared to classification by human graders (reference standard). Studies meeting these criteria were eligible for inclusion. The search was performed on April 5, 2018. In total 11 studies were included. Eight studies reported sensitivity and specificity of 80% to 100% and 84% to 99%, respectively (point-estimates) for detecting DR. Two studies also calculated test accuracy, and report values of 79% and 81%. One study provides an area under the receiver operating curve of 0.955. In addition to diagnostic performance, one study also reported on patient satisfaction, showing that 78% of patients preferred an automated deep learning model over manual human grading (Keel, 2018).
Limitations of included studies: models were mostly validated on high-quality images (not representative for real-world screening programs), and real-world performance was not adequately determined (Nielsen, 2019). In addition, most of the included studies did not include a control (human grader) and reference standard (expert or preferably a panel of expert human graders).

Description of primary studies

Sixteen studies were published after April 2018. All studies investigated the diagnostic performance of deep learning-based algorithms in screening patients with diabetes mellitus for DR, compared to classification by human graders or reading centres (reference standard). Diagnostic values were reported in all studies. Outcomes related to costs and burden for patients were lacking. One study reported a potential strategy: “implementing AI into the screening pathway by replacing the primary grader” (Heydon, 2020). In the current summary of literature this study is described as performed in a “real-life study” setting, since only this study reported outcomes of the diagnostic values of automatic screening devices in daily clinical practice. Importantly, the study of Abramoff (2018) used the highest quality reference standard as determined by the Wisconsin Fundus Photograph Reading Center (FPRC). The FPRC has adopted the use of a widefield stereoscopic retinal imaging protocol (4W-D) that includes four stereoscopic pairs of digital images per eye, each pair covering 45–60°, equivalent to the area of the retina covered by the older, modified 7-field stereo film protocol, in combination with OCT. The other studies used at least two digital image fields.
Note: In current clinical practice, 2-field images are sufficient for screening. The 7-field images are used for scientific research.

Due to heterogeneity between studies with respect to, e.g., design, deep-learning algorithm, population, and prevalence of DR, it is not justified to calculate overall (pooled) effect estimates. A description of each study, and outcomes per study, are provided in the evidence tables; a short summary is provided below. Most studies were performed in Europe, Asia and/or Australia. Seven of the sixteen studies also included images of low quality in their analysis, which are difficult (or even impossible) to assess for an automatic screening system (Gulshan, 2019; He, 2019; Hsieh, 2020; Heydon, 2020; Kanagasingam, 2018; Olvera-Barrios, 2020; Ruamviboonsuk, 2019). The remaining nine of the sixteen studies were less representative of a typical clinical setting, including only images of high quality. Eight studies applied their model in a clinical setting, i.e., using prospectively collected data (Abramoff, 2018; Bellemo, 2019; Heydon, 2020; Kanagasingam, 2018; Ruamviboonsuk, 2019; Shah, 2020a; van der Heijden, 2018; Verbraak, 2019). The remaining studies used retrospectively collected data and/or data from large databases. Eleven of the sixteen studies used an existing algorithm, meaning an algorithm that is already validated and/or patented (Abramoff, 2018; Bhaskaranand, 2019; He, 2019; Heydon, 2020; Kanagasingam, 2018; Olvera-Barrios, 2020; Ruamviboonsuk, 2019; Shah, 2020a; Shah, 2020b; van der Heijden, 2018; Verbraak, 2019). The remaining studies developed (trained and tested) the algorithm themselves, and the resulting systems are not (yet) commercially available.

Outcomes for diagnostic values are summarized per algorithm and/or setting below.

Results per algorithm

Five studies used the IDx-DR X2.1 algorithm. Diagnostic values per severity of DR are reported below;

rDR; sensitivity range 91%-100%, specificity range 82%-87%, AUC range 0.940-0.980

vtDR; sensitivity range 64%**-100%, specificity range 95%-98%, AUC range 0.910-0.998

mtmDR (i.e., more than mild DR); sensitivity range 79%-87%, specificity range 91%-94%

(Abramoff, 2018; Nielsen, 2019; Shah, 2020a; van der Heijden, 2018; Verbraak, 2019).

** Sensitivity of 64% is possibly explained by a small number of cases (van der Heijden, 2018).

Four studies used the Google Inc. algorithm. Diagnostic values per severity of DR are reported below;

rDR; sensitivity range 97%-98%, specificity range 84%**-96%

DR; sensitivity range 89%-92%, specificity range 92%-93%

DME; sensitivity range 94%-97%, specificity range 91%-98%

(Gulshan, 2019; Nielsen, 2019 (2 studies); Ruamviboonsuk, 2019).

**Specificity of 84% is possibly explained by the high sensitivity (which was of interest) (Nielsen, 2019).

Three studies used the EyeArtSystem v2.0 algorithm. Diagnostic values per severity of DR are reported below;

rDR; sensitivity range 91%-96%, specificity range 54%**-91%, AUC range 0.89-0.97

DR; sensitivity 92%, AUC range 0.95-0.97

(Bhaskaranand, 2019; Heydon, 2020; Olvera-Barrios, 2020).
** Specificity of 54% is possibly explained by the real-world setting in which the study of Heydon (2020) was performed.

Fifteen studies used other algorithms (mostly self-developed). Diagnostic values per severity of DR are reported below;

rDR; sensitivity range 89%-99%, specificity range 90%-98%, AUC range 0.900-0.989

DR; sensitivity range 80%-100%, specificity range 87%-99%, AUC range 0.787-0.991

vtDR; sensitivity 99%*, AUC 0.934*

DME; sensitivity 95%*, specificity 93%*, AUC 0.986*

*only reported in 1 study.

Algorithm in clinical practice vs human graders or reading center

Eight studies applied their algorithm in a clinical setting. Diagnostic values per severity of DR are reported below;

rDR; sensitivity range 91%-100%, specificity range 82%-96%, AUC range 0.940-0.987

DR; sensitivity 92%, specificity 89%, AUC 0.973

vtDR; sensitivity range 64%**-100%, specificity range 95%-98%, AUC range 0.910-0.998

DME; sensitivity 95%, specificity 98%, AUC 0.993

mtmDR; sensitivity 79%-87%, specificity 90%-94%

(Abramoff, 2018; Bellemo, 2019; Heydon, 2020; Kanagasingam, 2018; Ruamviboonsuk, 2019; Shah, 2020a; van der Heijden, 2018; Verbraak, 2019).

**Sensitivity of 64% is possibly explained by a small number of cases (van der Heijden, 2018).

Of these studies, three applied the IDx-DR X2.1 algorithm in a clinical setting. Diagnostic values per severity of DR are reported below:

rDR; sensitivity 91% (95%CI 69.0-98.0), specificity 84% (95%CI 81.0-86.0), AUC 0.94 (95%CI 0.88-0.93)

vtDR; sensitivity range 64%-100%, specificity range 95%-98%, AUC 0.91 (95%CI 0.83-0.98)*

mtmDR; sensitivity range 79%-87%, specificity range 91%-94%

*only reported in 1 study.

(Abramoff, 2018; van der Heijden, 2018; Verbraak, 2019)

One study applied the Google Inc. algorithm in a clinical setting. Diagnostic values per severity of DR are reported below:

rDR; sensitivity 98.0% (95%CI 93.9-100), specificity 95.6% (95%CI 98.3-98.7), AUC 0.987 (95%CI 0.977-0.995)

DME; sensitivity 95.3% (95%CI 85.9-100), specificity 98.2% (95%CI 94.4-99.1), AUC 0.993 (95%CI 0.993-0.994)

(Ruamviboonsuk, 2019).

One study applied the EyeArtSystem v2.0 algorithm in a clinical setting. This study reported a sensitivity of 95.7% (95%CI 94.8-96.5) and a specificity of 54.0% (95%CI 53.4-54.5%) for rDR (Heydon, 2020). The specificity of 54% is possibly explained by the real-life study setting in which the study of Heydon (2020) was performed.

Human graders

One study reported the diagnostic values of human graders in a clinical setting. Van der Heijden (2018) investigated the sensitivity and specificity of human graders against the adjudicated reference standard for ICDR. For referable DR, sensitivity ranged from 63% to 92% and specificity from 99% to 100%.

Level of evidence of the literature

The level of evidence (GRADE method) is determined per comparison and diagnostic outcome measure and is based on results from diagnostic accuracy studies and therefore starts at level “high”. Subsequently, the level of evidence was downgraded if there were relevant shortcomings in one of the several GRADE domains: risk of bias, inconsistency, indirectness, imprecision, and publication bias.
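The downgrading described above amounts to simple bookkeeping: start at "high" and step down one level per downgrade. The following is a minimal sketch of that bookkeeping only. The four-level scale is the standard GRADE scale; the function name `grade_level` and its signature are hypothetical, and the sketch does not capture the judgement involved in deciding whether a domain warrants a downgrade.

```python
# GRADE certainty levels, ordered from lowest to highest.
GRADE_LEVELS = ["very low", "low", "moderate", "high"]


def grade_level(downgrades: int, start: str = "high") -> str:
    """Return the level of evidence after stepping down `downgrades` levels.

    Hypothetical helper for illustration; diagnostic accuracy studies
    start at "high" in this guideline.
    """
    if downgrades < 0:
        raise ValueError("downgrades must be non-negative")
    index = GRADE_LEVELS.index(start) - downgrades
    # The level cannot drop below "very low".
    return GRADE_LEVELS[max(index, 0)]


print(grade_level(0))  # no downgrade
print(grade_level(1))  # one level, e.g. for indirectness
print(grade_level(2))  # two levels
```

Under this sketch, an outcome downgraded by one level ends at "moderate" and one downgraded by two levels ends at "low".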

IDx-DR X2.1 (Abramoff, 2018; Nielsen, 2019; Shah, 2020a; van der Heijden, 2018; Verbraak, 2019)

The level of evidence regarding the outcome measures sensitivity, specificity, and AUC was downgraded by 1 level because of indirectness, as the automatic screening system was not tested in a real-life study setting with a specific predefined strategy (e.g., replacement test or triage test).

Google Inc. (Gulshan, 2019; Nielsen, 2019 (2 studies); Ruamviboonsuk, 2019)

EyeArtSystem v2.0 (Bhaskaranand, 2019; Heydon, 2020; Olvera-Barrios, 2020)

The level of evidence regarding the outcome measures sensitivity and AUC was not downgraded.

The level of evidence regarding the outcome measure specificity was downgraded by 1 level because of inconsistency, as the specificity of one study (Heydon, 2020, i.e., real-life study implementation setting) was not in line with the other study which reported this outcome.

Remaining algorithms

The level of evidence regarding the outcome measures sensitivity, specificity, and AUC was downgraded by 2 levels because of indirectness: first, the automatic screening systems were not tested in a real-life study setting with a specific predefined strategy (e.g., replacement test or triage test); second, several different algorithms were used.

The level of evidence regarding the outcome measures costs, burden for patients and implementation characteristics could not be assessed since literature data on these outcomes were lacking.

Search and selection

A systematic review of the literature was performed to answer the following question:

What are the diagnostic values, and additional benefits and harms, of automatic screening devices for diabetic retinopathy in patients with diabetes mellitus, compared with screening by human graders?

P: patients (adults or children) with diabetes mellitus invited for routine (periodic) screening of diabetic retinopathy

I: automatic DR screening devices of retinal fundus images using artificial intelligence (deep learning or hybrid approaches) *

C: human grader(s) in clinical setting

R: reference test, human graders (expert graders, preferably expert panel, reading center)

O: diagnostic values (sensitivity, specificity, positive predictive value, negative predictive value, area under the curve (AUC), technical failure/non-gradable; focus on referable retinopathy), costs (cost-effectiveness), burden for patients, implementation characteristics

*systems not involving deep learning algorithms but only handcrafted feature engineering for recognition of specific lesions are excluded.

Relevant outcome measures

The guideline development group considered diagnostic values a critical outcome measure for decision making, and patient burden, implementation characteristics, and costs important outcome measures. Depending on the goal and position of the automatic screening device in the diagnostic pathway, either high sensitivity and high specificity are equally important (see Table 1, replacement strategy: automatic screening replaces the human grader) or high sensitivity is the dominant diagnostic criterion (Table 1, triage strategy: automatic screening precedes the human grader).

Since in both diagnostic strategies (replacement and triage) patients are ultimately referred to specialist eye care, the most important diagnostic characteristics of the automatic screening devices are those defined in relation to referable diabetic retinopathy (rDR). Generally rDR is defined as more than mild non-proliferative diabetic retinopathy and/or features of diabetic macular edema using the ICDR classification or the NHS DESP Scanlon classification.
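The difference between the two strategies can be illustrated as a routing rule for a single automatic grading result. This is a minimal sketch, not a description of any specific system: the names `Strategy` and `next_step` and the returned labels are hypothetical, and it is assumed here that a negative automatic result leads to routine rescreening in both strategies.

```python
from enum import Enum


class Strategy(Enum):
    REPLACEMENT = "replacement"  # automatic screening replaces the human grader
    TRIAGE = "triage"            # automatic screening precedes the human grader


def next_step(strategy: Strategy, ai_positive: bool) -> str:
    """Return the next step in the pathway for one automatic grading result.

    Hypothetical labels; a negative result is assumed to lead to
    routine rescreening under both strategies.
    """
    if not ai_positive:
        return "routine rescreening"
    if strategy is Strategy.TRIAGE:
        # A positive result is first reviewed by a human grader.
        return "human grader review"
    # Replacement: a positive result leads directly to referral.
    return "referral to specialist eye care"


for strategy in Strategy:
    print(strategy.value, "->", next_step(strategy, ai_positive=True))
```

Because false positives in the triage strategy are filtered by the human grader, low specificity mainly costs grading workload there, whereas in the replacement strategy it translates directly into unnecessary referrals.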

Other important diagnostic characteristics relate to referable diabetic macular edema (rDME), vision-threatening diabetic retinopathy (vtDR), defined as ETDRS level >53 and/or DME, and proliferative diabetic retinopathy (PDR). The outcome "implementation characteristics" refers to user-friendliness, logistic demands, and ease of implementation.

The working group did not define minimal clinically (patient) important differences for diagnostic values or costs (cost-effectiveness).
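For reference, the diagnostic values named as outcome measures follow directly from the 2x2 table of index test result against reference standard. The sketch below uses the standard definitions; the function name `diagnostic_values` and the example counts are hypothetical and are not taken from any included study.

```python
def diagnostic_values(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute diagnostic values from 2x2 counts.

    tp/fn: reference-positive cases graded positive/negative by the index test
    fp/tn: reference-negative cases graded positive/negative by the index test
    """
    return {
        "sensitivity": tp / (tp + fn),  # positives correctly detected
        "specificity": tn / (tn + fp),  # negatives correctly cleared
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }


# Hypothetical counts for 1000 screened patients, 100 of whom have rDR.
values = diagnostic_values(tp=90, fp=50, fn=10, tn=850)
for name, value in values.items():
    print(f"{name}: {value:.1%}")
```

Sensitivity and specificity are properties of the test, whereas the predictive values also depend on the prevalence of rDR in the screened population, which is why they differ between study settings.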

Search and select (Methods)

The databases Medline (via OVID) and Embase (via Embase.com) were searched with relevant search terms from 2015 until September 9, 2020. The detailed search strategy is depicted under the tab Methods. The systematic literature search resulted in 561 hits. Studies were selected based on the following criteria:
- systematic review or primary diagnostic study

- patients with diabetes mellitus

- automatic screening for diabetic retinopathy using deep learning algorithms

- diagnostic values (at least sensitivity or specificity in relation to referable DR) using screening by human grader(s) (preferably an expert panel) as the reference test.

Twenty-two studies were initially selected based on title and abstract screening. After reading the full text, 6 studies were excluded (see the table with reasons for exclusion under the tab Methods), and 17 studies were included.

Results

One SR (containing 11 primary studies) and fifteen (new) individual studies were included in the analysis of the literature. Important study characteristics and results are summarized in the evidence tables. The assessment of the risk of bias is summarized in the risk of bias tables.

References

Abràmoff MD, Lavin PT, Birch M, Shah N, Folk JC. Pivotal trial of an autonomous AI-based diagnostic system for detection of diabetic retinopathy in primary care offices. NPJ Digit Med. 2018 Aug 28;1:39. Doi: 10.1038/s41746-018-0040-6. PMID: 31304320; PMCID: PMC6550188.
Bellemo V, Lim ZW, Lim G, Nguyen QD, Xie Y, Yip MYT, Hamzah H, Ho J, Lee XQ, Hsu W, Lee ML, Musonda L, Chandran M, Chipalo-Mutati G, Muma M, Tan GSW, Sivaprasad S, Menon G, Wong TY, Ting DSW. Artificial intelligence using deep learning to screen for referable and vision-threatening diabetic retinopathy in Africa: a clinical validation study. Lancet Digit Health. 2019 May;1(1):e35-e44. Doi: 10.1016/S2589-7500(19)30004-4. Epub 2019 May 2. PMID: 33323239.
Bhaskaranand M, Ramachandra C, Bhat S, Cuadros J, Nittala MG, Sadda SR, Solanki K. The Value of Automated Diabetic Retinopathy Screening with the EyeArt System: A Study of More Than 100,000 Consecutive Encounters from People with Diabetes. Diabetes Technol Ther. 2019 Nov;21(11):635-643. Doi: 10.1089/dia.2019.0164. Epub 2019 Aug 7. PMID: 31335200; PMCID: PMC6812728.
González‐Gonzalo C, Sánchez‐Gutiérrez V, Hernández‐Martínez P, Contreras I, Lechanteur YT, Domanian A, ... & Sánchez CI. Evaluation of a deep learning system for the joint automated detection of diabetic retinopathy and age‐related macular degeneration. Acta ophthalmologica. 2020;98(4),368-377.
Guidelines on the qualification and classification of standalone software used in healthcare within the regulatory framework of medical devices, MEDDEV 2.1/6, July 2016, available via https://ec.europa.eu/health/sites/health/files/md_topics-interest/docs/md_meddev-guidance-216_en.pdf.
Gulshan V, Rajan RP, Widner K, Wu D, Wubbels P, Rhodes T, Whitehouse K, Coram M, Corrado G, Ramasamy K, Raman R, Peng L, Webster DR. Performance of a Deep-Learning Algorithm vs Manual Grading for Detecting Diabetic Retinopathy in India. JAMA Ophthalmol. 2019 Jun 13;137(9):987–93. Doi: 10.1001/jamaophthalmol.2019.2004. Epub ahead of print. PMID: 31194246; PMCID: PMC6567842.
Handreiking AI in de zorg, juridische handreiking, Melita van der Mersch, January 2020, available via https://cloud.blogbird.nl/sites/velinkdedie/pdf/Handreiking-AI-in-de-zorg.pdf
He J, Cao T, Xu F, Wang S, Tao H, Wu T, Sun L, Chen J. Artificial intelligence-based screening for diabetic retinopathy at community hospital. Eye (Lond). 2020 Mar;34(3):572-576. Doi: 10.1038/s41433-019-0562-4. Epub 2019 Aug 27. PMID: 31455902; PMCID: PMC7042314.
Heydon P, Egan C, Bolter L, Chambers R, Anderson J, Aldington S, Stratton IM, Scanlon PH, Webster L, Mann S, du Chemin A, Owen CG, Tufail A, Rudnicka AR. Prospective evaluation of an artificial intelligence-enabled algorithm for automated diabetic retinopathy screening of 30 000 patients. Br J Ophthalmol. 2020 Jun 30:bjophthalmol-2020-316594. Doi: 10.1136/bjophthalmol-2020-316594. Epub ahead of print. PMID: 32606081.
Hsieh YT, Chuang LM, Jiang YD, Chang TJ, Yang CM, Yang CH, Chan LW, Kao TY, Chen TC, Lin HC, Tsai CH, Chen M. Application of deep learning image assessment software VeriSee™ for diabetic retinopathy screening. J Formos Med Assoc. 2021 Jan;120(1 Pt 1):165-171. Doi: 10.1016/j.jfma.2020.03.024. Epub 2020 Apr 16. PMID: 32307321.
Ipp E, Liljenquist D, Bode B, Shah VN, Silverstein S, Regillo CD, Lim JI, Sadda S, Domalpally A, Gray G, Bhaskaranand M, Ramachandra C, Solanki K; EyeArt Study Group. Pivotal Evaluation of an Artificial Intelligence System for Autonomous Detection of Referrable and Vision-Threatening Diabetic Retinopathy. JAMA Netw Open. 2021 Nov 1;4(11):e2134254. doi: 10.1001/jamanetworkopen.2021.34254. Erratum in: JAMA Netw Open. 2021 Dec 1;4(12):e2144317. PMID: 34779843; PMCID: PMC8593763.
Kanagasingam Y, Xiao D, Vignarajan J, Preetham A, Tay-Kearney ML, Mehrotra A. Evaluation of Artificial Intelligence-Based Grading of Diabetic Retinopathy in Primary Care. JAMA Netw Open. 2018 Sep 7;1(5):e182665. Doi: 10.1001/jamanetworkopen.2018.2665. PMID: 30646178; PMCID: PMC6324474.
Keel S, Lee PY, Scheetz J, Li Z, Kotowicz MA, MacIsaac RJ, He M. Feasibility and patient acceptability of a novel artificial intelligence-based screening model for diabetic retinopathy at endocrinology outpatient services: a pilot study. Sci Rep. 2018 Mar 12;8(1):4330. Doi: 10.1038/s41598-018-22612-2. PMID: 29531299; PMCID: PMC5847544.
Li Z, Keel S, Liu C, He Y, Meng W, Scheetz J, Lee PY, Shaw J, Ting D, Wong TY, Taylor H, Chang R, He M. An Automated Grading System for Detection of Vision-Threatening Referable Diabetic Retinopathy on the Basis of Color Fundus Photographs. Diabetes Care. 2018 Dec;41(12):2509-2516. Doi: 10.2337/dc18-0147. Epub 2018 Oct 1. PMID: 30275284.
Nielsen KB, Lautrup ML, Andersen JKH, Savarimuthu TR, Grauslund J. Deep Learning-Based Algorithms in Screening of Diabetic Retinopathy: A Systematic Review of Diagnostic Performance. Ophthalmol Retina. 2019 Apr;3(4):294-304. Doi: 10.1016/j.oret.2018.10.014. Epub 2018 Nov 3. PMID: 31014679.
Olvera-Barrios A, Heeren TF, Balaskas K, Chambers R, Bolter L, Egan C, Tufail A, Anderson J. Diagnostic accuracy of diabetic retinopathy grading by an artificial intelligence-enabled algorithm compared with a human standard for wide-field true-colour confocal scanning and standard digital retinal images. Br J Ophthalmol. 2021 Feb;105(2):265-270. Doi: 10.1136/bjophthalmol-2019-315394. Epub 2020 May 6. PMID: 32376611.
Ruamviboonsuk P, Krause J, Chotcomwongse P, Sayres R, Raman R, Widner K, Campana BJL, Phene S, Hemarat K, Tadarati M, Silpa-Archa S, Limwattanayingyong J, Rao C, Kuruvilla O, Jung J, Tan J, Orprayoon S, Kangwanwongpaisan C, Sukumalpaiboon R, Luengchaichawang C, Fuangkaew J, Kongsap P, Chualinpha L, Saree S, Kawinpanitan S, Mitvongsa K, Lawanasakol S, Thepchatri C, Wongpichedchai L, Corrado GS, Peng L, Webster DR. Deep learning versus human graders for classifying diabetic retinopathy severity in a nationwide screening program. NPJ Digit Med. 2019 Apr 10;2:25. Doi: 10.1038/s41746-019-0099-8. Erratum in: NPJ Digit Med. 2019 Jul 23;2:68. PMID: 31304372; PMCID: PMC6550283.
Shah A, Clarida W, Amelon R, Hernaez-Ortega MC, Navea A, Morales-Olivas J, Dolz-Marco R, Verbraak F, Jorda PP, van der Heijden AA, Peris Martinez C. Validation of Automated Screening for Referable Diabetic Retinopathy With an Autonomous Diagnostic Artificial Intelligence System in a Spanish Population. J Diabetes Sci Technol. 2020 Mar 16:1932296820906212. Doi: 10.1177/1932296820906212. Epub ahead of print. PMID: 32174153.
Shah P, Mishra DK, Shanmugam MP, Doshi B, Jayaraj H, Ramanjulu R. Validation of Deep Convolutional Neural Network-based algorithm for detection of diabetic retinopathy – Artificial intelligence versus clinician for screening. Indian J Ophthalmol. 2020 Feb;68(2):398-405. Doi: 10.4103/ijo.IJO_966_19. PMID: 31957737; PMCID: PMC7003578.
Sharma, Sunil, Maheshwari, Saumil and Shukla, Anupam. “An intelligible deep convolution neural network based approach for classification of diabetic retinopathy” Bio-Algorithms and Med-Systems, vol. 14, no. 2, 2018. https://doi.org/10.1515/bams-2018-0011
Tufail A, Rudisill C, Egan C, Kapetanakis VV, Salas-Vega S, Owen CG, Lee A, Louw V, Anderson J, Liew G, Bolter L, Srinivas S, Nittala M, Sadda S, Taylor P, Rudnicka AR. Automated Diabetic Retinopathy Image Assessment Software: Diagnostic Accuracy and Cost-Effectiveness Compared with Human Graders. Ophthalmology. 2017 Mar;124(3):343-351. Doi: 10.1016/j.ophtha.2016.11.014. Epub 2016 Dec 23. PMID: 28024825
U.S. Food and Drug Administration (FDA). 510(k) Premarket Notification Database. EyeArt. Eyenuk, Inc., Los Angeles, CA. Summary of Safety and Effectiveness. No. K200667. Rockville, MD: FDA. August 03, 2020. Available at: https://www.accessdata.fda.gov/cdrh_docs/pdf20/K200667.pdf. Accessed on October 14, 2021
Verbraak FD, Abramoff MD, Bausch GCF, Klaver C, Nijpels G, Schlingemann RO, van der Heijden AA. Diagnostic Accuracy of a Device for the Automated Detection of Diabetic Retinopathy in a Primary Care Setting. Diabetes Care. 2019 Apr;42(4):651-656. Doi: 10.2337/dc18-0148. Epub 2019 Feb 14. PMID: 30765436.
Van der Heijden AA, Abramoff MD, Verbraak F, van Hecke MV, Liem A, Nijpels G. Validation of automated screening for referable diabetic retinopathy with the Idx-DR device in the Hoorn Diabetes Care System. Acta Ophthalmol. 2018 Feb;96(1):63-68. Doi: 10.1111/aos.13613. Epub 2017 Nov 27. PMID: 29178249; PMCID: PMC5814834.
Wet medische hulpmiddelen, wettenbank, overheid.nl, July 2021, available via https://wetten.overheid.nl/BWBR0042755/2021-07-17#Hoofdstuk2

Evidence tables

Evidence tables of Diagnostic Test Accuracy studies

Research question: What is the value of automated diabetic retinopathy screening of retinal fundus images based on artificial intelligence and deep learning-based algorithms?

Study (Year)	Method	Target Condition	Reference Standard	Grading Scale	Training Dataset (No. of Images)	Validation Dataset (No. of Images)	Performance Score(s)* (95% CI)	Remarks	Funding	COI

Nielsen 2019: systematic review (search date April 5, 2018; includes 11 primary studies)

Abramoff 2016

Idx-DR X2.1

Hybrid model with

multiple CNNs and

random forest classifier

rDR‡

vtDR

DME

Consensus among 3 US board certified retinal specialists

DR: ICDR

DME: modified definition

of DME (0-2)

EyeCheck project and

Iowa University

(n=10 000 – 1250000,

samples from images)

Messidor-2 (n=1748) (images duplicated for the system to be able to run)

rDR

Se 96.8% (93.3-98.8)

Spe 87.0% (84.2-89.4)

AUC 0.980 (0.968-0.992)

Gulshan

2016

Google Inc.

CNN

rDR‡

Majority vote of at least 7

US board certified

ophthalmologists

DR: ICDR

rDME: 0-1

EyePacs (n=94281)**

3 eye hospitals in India (n=33894)

Messidor-2 (n=1748)**

EyePacs-1 (n=9963)**

Se 96.7%† (95.7-97.5)

Spe 84.0%† (83.1-85.0)

AUC 0.974 (0.971-0.978)

Gargeya

2017

CNN with decision tree classifier

DR‡

Medical experts

(Messidor-2 label)

DR: R0-R3

DME: 0-2

EyePacs (n=75125)

Messidor-2 (n=1748)

Se 93.0%

Spe 87.0%

AUC 0.940

Quellec

2017

Ensemble CNN with

random forest classifier

rDR‡

Licensed clinician

(Kaggle label)

DR: 0-4 (DME not included)

Kaggle DR competition

dataset (n=35126)

Kaggle DR competition

dataset (n=53576)

AUC 0.955

Takahashi

2017

GoogLeNet

Third-party CNN

DR 4-class

classification‡

Human grader

DR: modified Davis grading (0-3) (DME

not included)

Patients from a Japanese hospital (n=9443)

Patients from a Japanese

hospital (n=496)

PABAK 0.74

Acc 81.0%

Ramachandran

2017

Visiona

Third-party DNN

rDR‡

Otago: ophthalmic

medical photographer

with ophthalmologist

Messidor: medical

specialists (Messidor

label)

Otago:

DR: R0-R5

DME: M0-M6

Messidor:

DR: R0-R3

DME: 0-2

Third-party training data (n>100000)

Otago (n ≥1764)**

Messidor (n=1200)

Se 96.0%

Spe 90.0%

AUC 0.980 (0.973-0.986)

Mansour 2017

AlexNet DNN

CNN with support vector machine classifier

DR‡

Licensed clinician

(Kaggle label)

DR: 0-4 (DME not included)

ImageNet††

(n=14197122)

Kaggle DR competition

dataset (n=35126)

Se 100.0%

Spe 99.0%

Acc 97.28%

Ting 2017

CNN

rDR‡

vtDR

Retinal specialist or

trained professional

graders, retinal

specialists, optometrists

or ophthalmologists

DR: ICDR

DME: 0-1

SIDR 2010-2013

(n=76 370)

SIDR 2014-2015

(n= 71896)

10 external datasets

(n= 40752)

Se 98.9% (97.5-99.6)

Spe 92.2% (89.5-94.3)

AUC 0.983 (0.972-0.991)

Raju 2017

CNN

DR‡

Licensed clinician

(Kaggle label)

DR: 0-4 (DME not included)

Kaggle DR competition

dataset (n= 35126)

Kaggle DR competition

dataset (n= 53576)

Se 80.28%

Spe 92.29%

Keel 2018

Third-party DL algorithm

rDR‡

Human grader at a

centralized retinal

grading center

DR: R0-R3

DME: M0, M1, P, U

Online dataset from

China (n= 66790)

96 patients from 2

outpatient clinics in

Australia (n= 192)

Se 92.3%

Spe 93.7%

AUC 0.933

Chang 2018

ResNet-34

CNN with a sigmoid

classifier

DR‡

Licensed clinician

(Kaggle label)

DR: 0-4 (DME not included)

Kaggle DR competition

dataset (n= 35126)**

Kaggle DR competition

dataset (n= 35126)**

Acc 78.7%

Update (relevant primary studies published after April 1, 2018: 15 studies)

Hsieh 2020

VeriSee™

CNN

rDR‡

PDR

Two ophthalmologists with > 1 training, non-concordant judgements resolved by ophthalmologist with > 10 years of experience

Diagnostic results EYEPACS dataset also treated as reference standard

mild NPDR, moderate NPDR, severe NPDR, PDR according to ICDR

rDR = moderate NPDR or worse

EyePacs (n=31612)

NTUH (n=5649)

NTUH (n=1875)

Image quality:

Clear/gradable 585

Not clear/gradable 1290

‘images judged as ungradable by ophthalmologists were excluded’ [i.e. not included in NTUH dataset]

rDR

Se 89.2% (85.2-93.3)

Spe 90.1% (88.7-91.6)

Acc 90.0% (88.6-91.3)

PDR

90.9% (81.1-100.0)

99.3% (98.9-99.7)

99.1% (98.7-99.6)

Based on point on ROC curve with optimal Se and Spe (‘optimal operating point’)

rDR by individual ophthalmologists:

Se 71.1% (66.8-75.1)

Spe 98.9% (98.5-99.2)

Acc 92.9% (91.9-94.0)

DTA for any DR: see publication

Own intellectual property VeriSee together with Acer Inc; co-funded by Acer Inc

Note: Taiwan (Asian population), 1 style of fundus camera, algorithm not trained to detect other retinal diseases, single-field fundus photography

Shah 2020a

Idx-DR v2

Hybrid model with

multiple CNNs and

random forest classifier (includes image quality analysis)

class IIa CE mark; FDA-approved

rDR‡

vtDR

Consensus among 3 Spanish ophthalmologists (masked); non-concordant judgements resolved by independent fellowship

trained retinal specialist (masked)

Exclusion: laser scarring (grader indicated photocoagulation scars), poor exam quality (subjective expert assessment that the exam was not of sufficient quality to grade safely)

vtDR (includes DME), MDR, ‘negative for more than mild DR’; according to ICDR

Not specified

regional primary care-based DR screening program in Spain (n=3531 patient exams; 2 images/patient)

rDR

Se 100% (97-100)

Spe 82% (80-83)

Acc 82.57%

AUC 0.984 (0.97-0.99)

vtDR

Se 100% (95-100)

Spe 95% (94-95)

Acc 94.78%

AUC 0.998 (0.997-0.999)

Screening efficiency: 111 TP and 467 FP out of 2680 exams, i.e. Efficiency gain (EG) 78.43%

3531 patient exams: 2680 included (76%), 851 excluded (250 for laser scarring, 195 for insufficient image quality according to graders, 404 for insufficient image quality according to AI system, and 2 rejected by both graders and AI system)

Authors are employed by and shareholders of manufacturer Idx Technol Inc; funded by Idx

Note: retrospective analysis of prospectively collected data;

Heydon 2020

EyeArt v2.1

CNN

rDR‡

English Diabetic Eye Screening Programme (DESP) standard national protocol using human grading: primary grader (triage), and any patients with mild or worse retinopathy or maculopathy (in addition to 10% graded ‘no retinopathy’) are reviewed by a secondary grader, with discrepancies between the primary and secondary grader reviewed by an arbitration grader (tertiary grader)

NHS DESP grades / ETDRS⁺

DR: R0-R3

DME: M0, M1, U (human graded ungradable)

Not specified

Consecutive screening episodes from three English DESPs (NEL, GS and SEL), London, UK

Screening episode:

≥ 4 retinal images per episode (on average 5 images per patient)

rDR

Se 95.7% (94.8-96.5)

R0 (no retinopathy)

Spe 68% (67-69)

R0, R1, M0 (non-refer)

Spe 54.0% (53.4-54.5)

rDR comprises:

R1M1

Se 98.3% (97.3-98.9)

Se 100% (98.7-100)

Se 100% (97.9-100)

Triage by EyeArt:

15091/30405 = 49.6% of all screening episodes would require further human grading (47% to 51% across the three centres), i.e. a reduction of about 50% in workload for human graders

Among test-positive results: 55% had any DR, and 14% rDR

Among test-negative results: 94.5% had no DR, and 99% had no rDR

Test-positive = rDR + technical failure

rDR = human

graded ungradable, referable maculopathy, moderate-to-severe

non-proliferative or proliferative DR (U+M1+R2+R3)

Costs: potentially save £0.5 million per 100 000 screening

episodes (England)

Note: prospective study of actual screening programmes (DESP) at 3 sites (different populations, grading teams, cameras); to test possibility of replacing primary grader (triage) by EyeArt; CE marking (FDA approval in final stage)

He 2019

TensorFlow

Framework

Deep neural network

rDR

Human grader (ophthalmologists)

DR: ICDR
rDR was defined as more than mild NPDR and/or macular oedema

ImageNet dataset (1.2 million images for

1000 categories classification)

3556 retinal images

DR:

Sen. 90.79% (86.4-94.1)

Spec. 98.5% (97.8-99.0)

AUC 0.946 (0.935-0.956)

rDR:

Sen. 91.18% (86.4-94.7)

Spec. 97.79% (98.1-99.3)

AUC 0.950 (0.939-0.960)

Shah 2020b

Four stage CNN classifiers

Human grader

Two retina specialists; disagreements resolved by a third grader.

DR: ICDR

N=112489 deidentified fundus images sourced from various hospitals, 103578 of good quality

Internal:

1533 independent images

External:
MESSIDOR 1

1200 images

Internal validation:

Sen. 99.7% (99.3-99.9)

Spec 98.5% (94.7-99.8)

AUC 0.991 (0.985-0.995)

External validation:

Sen. 90.4% (87.8-92.5)

Spec. 91.0% (88.3-93.3)

AUC 0.907 (0.889-0.923)

Olvera-Barrios 2020

EyeArt software V2.1.0, multiple CNNs

Human grader

(NDESP: level 1 or 2, disagreement; level 3)
(EIDON: a level 3 grader)

DR: R0-R3

DME: M0, M1, U

Not specified

External validation

1257 images

Outcomes for any retinopathy

EIDON:

Sens. 92.27% (88.4-94.7)

AUC 0.95 (0.88-0.98)

NDESP:

Sens. 92.26% (88.4-94.7)

AUC 0.97 (0.91-0.99)

Bellemo 2019a

Deep learning system:

Combination of 2 CNNs
- VGGNet
- ResNet

rDR

Human graders

DR: 0-3

vtDR: 0-2

76370 images from n=13099

4504 images from n=1574

DR:

Sens. 92.3% (90.1-94.1)

Spec. 89.0% (87.9-90.3)

AUC 0.973 (0.969-0.978)

vtDR:

Sens. 99.4% (99.2-99.7)

Spec.

AUC 0.934 (0.924-0.944)

Some authors are co-inventors of a patent on the deep learning system.

Gulshan 2019

Google Inc.

CNN

rDR‡

Human graders (retina specialist, trained grader)

DR: ICDR

rDME: 0-1

103634 images of 54149 patients

Tune: 40790 images of 20860 patients

Aravind: 1983 images of 997 patients.

Sankara: 3779 images of 2052 patients

Aravind:

DR:

Sens 88.9% (85.8-91.5)
spec 92.2% (90.3-93.8)

DME
sens 97.4% (92.5-98.8)
spec 90.7% (88.9-92.3)

Sankara:
DR:

Sens 92.1% (90.1-93.8)
spec 95.2% (94.2-96.1)

DME
sens 93.6% (97.0-98.3)
spec 92.5% (91.3-93.5)

This study received funding

from Google LLC.

Bhaskaranand 2019

EyeArtSystem v2.0
CNN

rDR

Human grader (single)

DR: ICDR

Not specified

EyePACS (n=101710)

Sens 91.3% (90.9-91.7)

Spec 91.1% (90.9-91.3)

AUC 0.965 (0.963-0.966)

Some authors are employees of EyeArt system.

Ruamviboonsuk 2019a

Google Inc.

(Inception-v4)

CNN

rDR

rDME

Human graders (retinal specialist)

DR & DME: ICDR

Krause, J. (2018) Grader variability and the importance of reference standards for

evaluating machine learning models for diabetic retinopathy

→ external validation of model Krause, J. (2018)

N=29943 images

rDR

Sens 98.0% (93.9-100)

Spec 95.6% (98.3-98.7)

AUC 0.987 (0.977-0.995)

DME

Sens 95.3% (85.9-100)

Spec 98.2% (94.4-99.1)

AUC 0.993 (0.993-0.994)

Some authors are Google employees

Verbraak 2019

Idx-DR-EU-2.1

Hybrid model with

multiple CNNs and

random forest classifier

vtDR

mtmDR

Human graders (two experienced readers)

DR: ICDR

See paper Abramoff 2016

vtDR

Sens 100% (77.1-100)

Spec 97.8% (96.8-98.5)

mtmDR
Sens 79.4% (66.5-87.9)
Spec 93.8% (92.1-94.9)

funded by

Idx.

Some authors have conflicts of interest with Idx.

Van der Heijden 2018

Idx-DR X2.1

Hybrid model with

multiple CNNs and

random forest classifier

rDR

vtDR

Consensus among 3 US board certified retinal specialists

DR: ICDR

DR: EURODIAB

See paper Abramoff 2016

ICDR:

rDR
sens 68% (56.0-79.0)

spec 86% (84.0-88.0)

AUC 0.87 (0.83-0.92)

vtDR
sens 62% (32.0-85.0)
Spec 95% (93.0-96.0)

AUC 0.90 (0.82-0.98)

EURODIAB:

rDR

Sens 91% (69.0-98.0)

Spec 84% (81.0-86.0)

AUC 0.94 (0.88-0.93)

vtDR

Sens 64.0% (36.0-86.0)

Spec 95.0% (93.0-96.0)

AUC 0.91 (0.83-0.98)

funded by

Idx.

One author has a conflict of interest with Idx.

Li 2018

CNN
4 DL models:

1) classification for vision-threatening referable DR,

2) classification of DME,

3) evaluation of image quality for DR, and

4) assessment of image quality and of the availability of the macular region for DME.

rDR

DME

Human graders

(ophthalmologists)

NHS diabetic screening guidelines

R: 0-3
M: 0-1

106244 nonstereoscopic retinal images.

Development & internal validation; n=71043 [figure 1]

External validation

35201 images of 14520 eyes (including 3 cohorts [Australia, Singapore])

Internal:

rDR

Sens 97%

Spec 91.4%

AUC 0.989

DME

Sens 95.0%

Spec 92.9%

AUC 0.986

External:

rDR

Sens 92.5%

Spec 98.5%

AUC 0.955

Two authors report a conflict of interest; China patent application

number ZL201510758675.5; patent filing date 31

May 2017.

Abramoff 2018

Idx-DR

2 algorithms (image quality AI-based and diagnostic)

detection of DR and

DME, termed mtmDR

FPRC for mtmDR:

Read by three experienced and validated readers at the FPRC according to the well established ETDRS SS.

OCT for DME:

Read by experienced readers at the FPRC according to the DRCR grading paradigm.

DRCR grading

mtmDR

See paper Abramoff 2016

Identify mtmDR:

Sens 87.2% (81.8-91.2)

Spec 90.7% (88.3-92.7)

Competing interests declared.

Kanagasingam 2018

CNN

Deep learning model Inception-v3

ophthalmologist

30000 (DiaretDB1, Kaggle (EyePACS), Tele-eye care DR)

386 images of 193 patients

Sens x ** only 2 by ophthalmologist, 17 by AI

Spec 92% (87-96)

Limited information on the camera

Abbreviations: Se = Sensitivity; Spe = Specificity; Acc = Accuracy; AUC = area under the receiver operating curve; CI = confidence interval; CNN = convolutional neural network; DL = deep learning; DME = diabetic macular edema; DNN = deep neural network; DR = diabetic retinopathy; ICDR = International Clinical Diabetic Retinopathy; PABAK = prevalence-adjusted and bias-adjusted kappa; rDME= referable diabetic macular edema; rDR = referable diabetic retinopathy; SIDR = Singapore National Diabetes Retinopathy Screening Program; vtDR = vision-threatening diabetic retinopathy; PDR = proliferative diabetic retinopathy.

Note: some studies use multiple validation datasets; however, only datasets with full-scale DR severity grading, clear grading scale, and reference standard are included in this table. Where CIs are not listed, none were provided in the given study.

*If multiple performance scores were provided, the best performance score on a fully graded dataset is specified.

†The high-sensitivity set point of the all-cause referable analysis is provided, because high sensitivity is essential in a screening program to attain the highest possible patient safety.

‡The target condition that the specified performance scores represent.

**Only a subset of the full dataset was used.

††ImageNet contains solely generic photographs and no fundus images.

⁺No retinopathy (R0); mild-to-moderate non-proliferative retinopathy (R1); non-referable maculopathy (M0); ungradable images (U); referable maculopathy (M1), moderate-to-severe non-proliferative retinopathy (R2) and proliferative retinopathy (R3).
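The workload figures in the Remarks column of the table above (Heydon 2020; Shah 2020a) can be reproduced from the reported counts. The sketch below is illustrative only: the function names are hypothetical, and the efficiency gain formula (share of exams not flagged positive, so not needing human grading) is an assumption that happens to reproduce the reported 78.43%.

```python
def fraction_flagged(flagged: int, total: int) -> float:
    """Share of screening episodes that still need human grading."""
    return flagged / total


def efficiency_gain(tp: int, fp: int, total: int) -> float:
    """Share of exams not flagged positive (assumed definition of EG)."""
    return 1 - (tp + fp) / total


# Heydon 2020: 15091 of 30405 episodes were test-positive.
heydon = fraction_flagged(15091, 30405)
# Shah 2020a: 111 true positives and 467 false positives out of 2680 exams.
shah = efficiency_gain(tp=111, fp=467, total=2680)

print(f"Heydon 2020, episodes needing human grading: {heydon:.1%}")
print(f"Shah 2020a, efficiency gain: {shah:.2%}")
```

Both figures express the same idea from opposite sides: the first is the share of work that remains for human graders, the second the share that is removed.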

Evidence table of Diagnostic Test Accuracy studies: characteristics of Validation Datasets and Participants

Research question: What is the value of automated diabetic retinopathy screening of retinal fundus images based on artificial intelligence and deep learning-based algorithms?

Dataset	No. Of Patients	No. of images	Clinical setting	Camera and Field of View	Images Taken with Pupil Dilation	Validation Dataset Mean Age, yrs (SD)	Sex (% male)	Studies using Dataset: Prevalence DR
DiaretDB1	N/A	89	University Hospital in Finland	Digital fundus camera with a 50-degree field of view	N/A	N/A	N/A	Quellec 2017
EyePacs-1	4997	9963	Randomly sampled images from EyePacs screening sites in the United States	Wide variety of digital retinal camera models all using 45-degree field of view centered at macula	~40.0%	54.4 (11.3)	37.8% male	Gulshan 2016
E-Ophtha	25702	107799	OPHDIAT telemedical DR screening network in Paris, France	Digital color fundus camera	N/A	N/A	N/A	Quellec 2017
Kaggle DR competition dataset	44351	88702*	Primary care sites in California and elsewhere in the United States. Images are uploaded to the EyePacs DR screening platform.	Wide variety of digital retinal camera models	N/A	N/A	N/A	Quellec 2017 Mansour 2017 Raju 2017 Chang 2018
Messidor	671	1200	Consecutive ob eent at 3 ophthalmological departments in France	Nonmydriatic fundus camera with 45-degree field of view centered at fovea	66.6%	N/A	N/A	Ramachandran 2017
Messidor-2	874	1748	Consecutive patients at 3 ophthalmological departments in France	Nonmydriatic fundus camera with 45-degree field of view centered at fovea	44.0%	57.6 (15.9)	57.0% male	Abramoff 2016 Gulshan 2016 Gargeya 2017
Otago	294	≥1764†	Local diabetic retinal screening program in New Zealand	Digital nonmydriatic retinal camera with 45-degree field of view	75%	N/A	N/A	Ramachandran 2017
SIDR 2014-2015	14880‡	71896	Singapore National Diabetic Retinopathy Screening Program from Singapore, Malaysia	Digital retinal camera centered at optic disc and fovea	N/A	60.16 (12.19)	51.02% male	Ting 2017
Royal Victoria Eye and Ear Hospital	588	2302	Clinic-based Australian Hospital	Digital retinal camera	N/A	N/A	N/A	Ting 2017
Australian patients from outpatient clinic	96	192	Australian endocrinologic outpatient clinics	Digital non-stereoscopic color retinal camera with 45-degree field of view pointed at a central-nasal field	N/A	44.26 (16.56)	57.0% male	Keel 2018
Patients from a Japanese hospital	2740	9939	Medical university in Japan	Nonmydriatic color fundus camera with 45-degree field of view (images are taken of a total of 4 fields per eye)	N/A	N/A	N/A	Takahashi 2017

Update (relevant primary studies published after April 1, 2018: 15 studies)
Patients from National Taiwan University Hospital; EyePACS dataset used for training VeriSee.	N/A	7524	National Taiwan University Hospital	Single-field, 45-degree color fundus photography with a nonmydriatic fundus camera	None	N/A	N/A	Hsieh 2020: rDR 11.9%, PDR 1.8%. Used all images. Performed in Asia.
Spanish patients from regional primary care-based DR screening program (random sample)	3551	2680 x 4 (2 images per eye)	43 Health Centers from Health Departments in Valencia	45° retinal color images of two fields (1 disc and 1 fovea centered) for each eye with nonmydriatic retinal camera	100%	74 (median)	45% male	Shah 2020a: rDR 4.14%, vtDR 2.57%. Used existing model. Excluded images with low quality. Performed in Spain.
UK patients attending DR screening programme (DESP)	30405	152000 (≥2 per eye; on average 5 per patient)	3 English Diabetic Eye Screening Programmes (DESPs) conducted in North East London (NEL), South East London (SEL) and Gloucestershire (GS)	≥2 digital image fields per eye, one centred on the optic disc and the other on the macula, in accordance with NHS DESP protocol	?	N/A	55% male	Heydon 2020: any DR 30% used existing model Used all images. Performed in UK.
Diabetes patients who attended PengPu Town Community Hospital of Jing’an district between May 30, 2018 and July 18, 2018 trained on ImageNet dataset	889	3356	Community Hospital PengPu, China	Automatic nonmydriatic camera. All retinal images captured two fields, macula-centred and disc-centred, according to EURODIAB protocol.	N/A	68.5 (7.2)	47% male	He 2019, 16.3% in AI vs 16.1% human graders used existing model Used all images. Performed in China.
Internal: Hospital data of macula-centered fundus images External: MESSIDOR-1	Internal: ? External: ?	Internal: 1553 External: 1200	Internal: Sankara Eye Hospital, Bengaluru, India External: Consecutive patients at 3 ophthalmological departments in France	Internal: Fundus camera with 50-degree field, posterior pole-centered images. External: (Non)mydriatic fundus camera with 45-degree field of view centered at fovea	Internal: Dilated and undilated images External: 66.6%	N/A	N/A	Shah 2020b: 90% some form of DR (internal), 55% some form of DR (external). Used existing model. Excluded images with low quality → ? Multiethnic
Diabetic patients aged ≥18 years attending their DR screening visit in the NDESP	1257	11796	Homerton University Hospital	EIDON: Mydriatic fundus camera with 50-degree field of view. A combination of macula-centred and disc-centred images covers a field of 75 degrees horizontally and 50 degrees vertically. NDESP: Mydriatic fundus camera with 45-degree field of view. A combination of macula-centred and disc-centred images covers a field of 60 degrees horizontally and 45 degrees vertically.	N/A	N/A	N/A	Olvera-Barrios 2020: EIDON 42%; NDESP 35%. Used existing model. Used all images. Performed in UK.
Two diabetic retinopathy screening cohorts	Train: 13099 Validation: 1574	Train: 76370 Validation: 4504	Train: Singapore Integrated Diabetic Retinopathy Program Validation: diabetic retinopathy screening programme in Zambia	45-degree field of view centered at macula	N/A	Train: 62.8 (11.3) Validation: 55.0 (11.1)	Train: 50% male Validation: 44% male	Bellemo 2019a, 12% in trainset, 25% in validation set. Excluded images with low quality. Validated in population from Zambia.
Patients older than 40 and previously diagnosed with diabetes *trained on EyePacs-1*	Aravind: 1983 Sankara: 3779	Aravind: 997 Sankara: 2052	Tertiary eye care centers in South India	45-degree field of view centered at macula	N/A	Aravind: 56.6 (9.0) Sankara: 56.0 (10.0)	Aravind: 58% Sankara: 67%	Gulshan 2019, 36% in Aravind; 33% in Sankara. Used all images. Excluded images with low quality. Validated in population from India.
EyePACS: unselected consecutive diabetes patient visits/encounters	107001	850908	404 primary care clinics in the EyePACS DR telescreening program	Wide variety of digital retinal camera models	Nonmydriatic (53.6%), mydriatic (45.8%), unknown dilation status (0.6%)	N/A	N/A	Bhaskaranand 2019: 32% in 107001 encounters. Used existing model. Excluded images with low quality. Population unknown.
Diabetic retinopathy screening cohort	7517	29943	Community-based nationwide screening program of DR in Thailand	A variety of cameras were used. Images were single-field, 45-degree field of view, and contained the optic disc and macula, centered on the macula	N/A	61.1 (11.0)	32.5% male	Ruamviboonsuk 2019a, 12% in 7517 patients. used existing model Used all images. Performed in Thailand.
People with type 2 diabetes	1616 (1293 for analysis)	?	Diagnostic center in the Netherlands (Star-SHL)	Standard procedure: one macula-centered and one disc-centered image (45° field of view)	Dilation was applied if necessary for grading	63 (11.3)	53% male	Verbraak 2019: 17% in 1425 patients. Used existing model. Excluded images with low quality. Performed in the Netherlands.
Persons with type 2 diabetes	1415 (898 for analysis)	?	The Hoorn DCS centre	Standard procedure: one macula-centered and one disc-centered image (45° field of view)	Not as routine	65.0 (11.9)	56.1% male	Van der Heijden 2018: 2.5% rDR (EURODIAB), 6% rDR (ICDR). Used existing model. Excluded images with low quality. Performed in the Netherlands.
Diabetic patients, population-based cohorts; development model → LabelMe, Guangzhou, China	External: 14520 eyes (7260 patients?)	Internal: 19900 External: 35201	The Zhongshan Ophthalmic Center, Guangzhou, China; the Royal Victorian Eye and Ear Hospital, East Melbourne, Australia	Standard procedure: one macula-centered and one disc-centered image (45° field of view)	Only if necessary	External validation: range 25-90	External validation: 49%	Li 2018: vtDR 18.5%, DME 27.5%. Excluded images with low quality. Performed in Singapore and Australia (Indigenous and Caucasian).
People with diabetes, no history of DR	900 (819 for analysis)	?	10 primary care practice sites throughout the United States	Standardized imaging protocol with one disc-centered and one fovea-centered 45° image per eye	23.6%	Median 59 (range 22-84)	47.5% male	Abramoff 2018: mtmDR 23.8%, of these DME 5.1%. Used existing model. Excluded images with low quality. Performed in US (16% Hispanic or Latino).
Patients with diabetes development model DiaRetDB1, EyePACS, and Australian Tele-eye care DR.	193	386	primary care	Macula-centered images were acquired and 1 to 3 images per eye depending on image quality	?	55 (17)	52% male	Kanagasingam 2018, 1% (2 of 193 patients) used existing model Used all images. Performed in Australia. real world setting, but small sample size

Abbreviations: DR = diabetic retinopathy; N/A = not available; SD = standard deviation; SIDR = Singapore National Diabetic Retinopathy Screening Program

*These images are typically split into 2 datasets of 53 576 and 35 126 images to be able to use the joint dataset for both training and testing without any images overlapping.

†At least 3 images were taken per eye. With 588 eyes, this adds up to a minimum of 1764 images.

‡A total of 8589 patients were unique, whereas 6291 patients were reappearances from the training dataset, because these were follow-up patients.
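
The counts given in the three footnotes above can be cross-checked against the dataset table with a few lines of Python. This is only an illustrative sketch: the variable names are hypothetical, the figures are copied from the table and its footnotes, and two eyes per patient is assumed for the Otago dataset (294 patients, 588 eyes).

```python
# Figures copied from the dataset table and its footnotes; names are illustrative.

# * Kaggle DR competition dataset: two non-overlapping subsets.
kaggle_subsets = (53_576, 35_126)
kaggle_total = sum(kaggle_subsets)

# † Otago: 294 patients, assumed 2 eyes each, at least 3 images per eye.
otago_patients = 294
otago_eyes = otago_patients * 2
otago_min_images = otago_eyes * 3

# ‡ SIDR 2014-2015: unique patients plus follow-up patients
# reappearing from the training dataset.
sidr_unique, sidr_reappearing = 8_589, 6_291
sidr_total = sidr_unique + sidr_reappearing

print(kaggle_total, otago_eyes, otago_min_images, sidr_total)
```

The computed totals (88702 images, 588 eyes, 1764 images and 14880 patients) agree with the values reported in the table.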

Risk of Bias

Research question: What is the value of automated diabetic retinopathy screening of retinal fundus images based on artificial intelligence and deep learning-based algorithms?

Study reference	Patient selection	Index test	Reference standard	Flow and timing	Comments with respect to applicability
Hsieh 2020	Was a consecutive or random sample of patients enrolled? Unclear Was a case-control design avoided? Yes Did the study avoid inappropriate exclusions? Yes	Were the index test results interpreted without knowledge of the results of the reference standard? Yes If a threshold was used, was it pre-specified? Yes	Is the reference standard likely to correctly classify the target condition? Yes Were the reference standard results interpreted without knowledge of the results of the index test? Yes	Was there an appropriate interval between index test(s) and reference standard? Yes Did all patients receive a reference standard? Yes Did patients receive the same reference standard? Yes Were all patients included in the analysis? Yes	Are there concerns that the included patients do not match the review question? No, but note that only Asian patients were included. Are there concerns that the index test, its conduct, or interpretation differ from the review question? Yes, Taiwan study with all images taken with the same style of fundus camera; may not be representative of the Netherlands. Are there concerns that the target condition as defined by the reference standard does not match the review question? No
	CONCLUSION: Could the selection of patients have introduced bias? RISK: UNCLEAR	CONCLUSION: Could the conduct or interpretation of the index test have introduced bias? RISK: LOW	CONCLUSION: Could the reference standard, its conduct, or its interpretation have introduced bias? RISK: LOW	CONCLUSION: Could the patient flow have introduced bias? RISK: LOW
Shah 2020	Was a consecutive or random sample of patients enrolled? Yes Was a case-control design avoided? Yes Did the study avoid inappropriate exclusions? Unclear	Were the index test results interpreted without knowledge of the results of the reference standard? Yes If a threshold was used, was it pre-specified? Yes	Is the reference standard likely to correctly classify the target condition? Yes Were the reference standard results interpreted without knowledge of the results of the index test? Yes	Was there an appropriate interval between index test(s) and reference standard? Yes Did all patients receive a reference standard? Yes Did patients receive the same reference standard? Yes Were all patients included in the analysis? No	Are there concerns that the included patients do not match the review question? No Are there concerns that the index test, its conduct, or interpretation differ from the review question? No Are there concerns that the target condition as defined by the reference standard does not match the review question? No
	CONCLUSION: Could the selection of patients have introduced bias? RISK: LOW	CONCLUSION: Could the conduct or interpretation of the index test have introduced bias? RISK: LOW	CONCLUSION: Could the reference standard, its conduct, or its interpretation have introduced bias? RISK: LOW	CONCLUSION: Could the patient flow have introduced bias? RISK: LOW
Heydon 2020	Was a consecutive or random sample of patients enrolled? Yes Was a case-control design avoided? Yes Did the study avoid inappropriate exclusions? Yes	Were the index test results interpreted without knowledge of the results of the reference standard? Yes If a threshold was used, was it pre-specified? Yes (commercial software)	Is the reference standard likely to correctly classify the target condition? Yes Were the reference standard results interpreted without knowledge of the results of the index test? Yes	Was there an appropriate interval between index test(s) and reference standard? Yes Did all patients receive a reference standard? Yes Did patients receive the same reference standard? Yes Were all patients included in the analysis? Yes	Are there concerns that the included patients do not match the review question? No Are there concerns that the index test, its conduct, or interpretation differ from the review question? No Are there concerns that the target condition as defined by the reference standard does not match the review question? No
	CONCLUSION: Could the selection of patients have introduced bias? RISK: LOW	CONCLUSION: Could the conduct or interpretation of the index test have introduced bias? RISK: LOW	CONCLUSION: Could the reference standard, its conduct, or its interpretation have introduced bias? RISK: LOW	CONCLUSION: Could the patient flow have introduced bias? RISK: LOW	Real world situation (actual screening programme UK)

He 2019	Was a consecutive or random sample of patients enrolled? Yes Was a case-control design avoided? Yes Did the study avoid inappropriate exclusions? Yes, but participants who had unclear fundus photographs due to small pupils, cataracts or vitreous opacity were removed.	Were the index test results interpreted without knowledge of the results of the reference standard? Yes If a threshold was used, was it pre-specified? Yes (Airdoc, Beijing, China, software)	Is the reference standard likely to correctly classify the target condition? Yes Were the reference standard results interpreted without knowledge of the results of the index test? Yes	Was there an appropriate interval between index test(s) and reference standard? Yes Did all patients receive a reference standard? Yes Did patients receive the same reference standard? Yes Were all patients included in the analysis?	Are there concerns that the included patients do not match the review question? Are there concerns that the index test, its conduct, or interpretation differ from the review question? Are there concerns that the target condition as defined by the reference standard does not match the review question?
	CONCLUSION: Could the selection of patients have introduced bias? RISK: LOW	CONCLUSION: Could the conduct or interpretation of the index test have introduced bias? RISK: LOW	CONCLUSION: Could the reference standard, its conduct, or its interpretation have introduced bias? RISK: LOW	CONCLUSION: Could the patient flow have introduced bias? RISK: MODERATE
Shah 2020b	Was a consecutive or random sample of patients enrolled? Yes Was a case-control design avoided? Unclear. Data in internal validation was retrospectively collected. Did the study avoid inappropriate exclusions? Unclear	Were the index test results interpreted without knowledge of the results of the reference standard? Yes If a threshold was used, was it pre-specified? Yes	Is the reference standard likely to correctly classify the target condition? Yes Were the reference standard results interpreted without knowledge of the results of the index test? Yes	Was there an appropriate interval between index test(s) and reference standard? Yes Did all patients receive a reference standard? Yes Did patients receive the same reference standard? Yes Were all patients included in the analysis? Yes	Are there concerns that the included patients do not match the review question? Are there concerns that the index test, its conduct, or interpretation differ from the review question? Are there concerns that the target condition as defined by the reference standard does not match the review question?
	CONCLUSION: Could the selection of patients have introduced bias? RISK: UNCLEAR	CONCLUSION: Could the conduct or interpretation of the index test have introduced bias? RISK: LOW	CONCLUSION: Could the reference standard, its conduct, or its interpretation have introduced bias? RISK: LOW	CONCLUSION: Could the patient flow have introduced bias? RISK: LOW	Prevalence of DR high in internal validation.
Olvera-Barrios 2020	Was a consecutive or random sample of patients enrolled? Yes Was a case-control design avoided? Yes Did the study avoid inappropriate exclusions? Unclear, no information provided.	Were the index test results interpreted without knowledge of the results of the reference standard? Yes If a threshold was used, was it pre-specified? Yes (commercial software)	Is the reference standard likely to correctly classify the target condition? Yes Were the reference standard results interpreted without knowledge of the results of the index test? Yes	Was there an appropriate interval between index test(s) and reference standard? Yes Did all patients receive a reference standard? Yes Did patients receive the same reference standard? Yes Were all patients included in the analysis? Yes	Are there concerns that the included patients do not match the review question? Are there concerns that the index test, its conduct, or interpretation differ from the review question? Are there concerns that the target condition as defined by the reference standard does not match the review question? No
	CONCLUSION: Could the selection of patients have introduced bias? RISK: UNCLEAR	CONCLUSION: Could the conduct or interpretation of the index test have introduced bias? RISK: LOW	CONCLUSION: Could the reference standard, its conduct, or its interpretation have introduced bias? RISK: LOW	CONCLUSION: Could the patient flow have introduced bias? RISK: LOW
Bellemo 2019a	Was a consecutive or random sample of patients enrolled? Yes Was a case-control design avoided? Yes Did the study avoid inappropriate exclusions? Yes	Were the index test results interpreted without knowledge of the results of the reference standard? Yes If a threshold was used, was it pre-specified? Yes	Is the reference standard likely to correctly classify the target condition? Yes Were the reference standard results interpreted without knowledge of the results of the index test? Yes	Was there an appropriate interval between index test(s) and reference standard? Yes Did all patients receive a reference standard? Yes Did patients receive the same reference standard? Yes Were all patients included in the analysis? Yes	Are there concerns that the included patients do not match the review question? Are there concerns that the index test, its conduct, or interpretation differ from the review question? Are there concerns that the target condition as defined by the reference standard does not match the review question?
	CONCLUSION: Could the selection of patients have introduced bias? RISK: LOW	CONCLUSION: Could the conduct or interpretation of the index test have introduced bias? RISK: LOW	CONCLUSION: Could the reference standard, its conduct, or its interpretation have introduced bias? RISK: LOW	CONCLUSION: Could the patient flow have introduced bias? RISK: LOW	Clinical validation study, The Lancet
Gulshan 2019	Was a consecutive or random sample of patients enrolled? Yes Was a case-control design avoided? Yes Did the study avoid inappropriate exclusions? Unclear, some images not used for analysis.	Were the index test results interpreted without knowledge of the results of the reference standard? Yes If a threshold was used, was it pre-specified? Yes, built own software	Is the reference standard likely to correctly classify the target condition? Yes Were the reference standard results interpreted without knowledge of the results of the index test? Yes	Was there an appropriate interval between index test(s) and reference standard? Yes Did all patients receive a reference standard? Yes Did patients receive the same reference standard? Yes Were all patients included in the analysis? Yes	Are there concerns that the included patients do not match the review question? Are there concerns that the index test, its conduct, or interpretation differ from the review question? Are there concerns that the target condition as defined by the reference standard does not match the review question? Real world situation (model trained and developed, validated in 2 cohorts [India])
	CONCLUSION: Could the selection of patients have introduced bias? RISK: UNCLEAR	CONCLUSION: Could the conduct or interpretation of the index test have introduced bias? RISK: LOW	CONCLUSION: Could the reference standard, its conduct, or its interpretation have introduced bias? RISK: LOW	CONCLUSION: Could the patient flow have introduced bias? RISK: LOW
Bhaskaranand 2019	Was a consecutive or random sample of patients enrolled? Yes Was a case-control design avoided? Yes Did the study avoid inappropriate exclusions? Yes, but some images are not used due to low quality.	Were the index test results interpreted without knowledge of the results of the reference standard? Yes If a threshold was used, was it pre-specified? Yes, commercial software	Is the reference standard likely to correctly classify the target condition? Yes Were the reference standard results interpreted without knowledge of the results of the index test? Yes	Was there an appropriate interval between index test(s) and reference standard? Yes Did all patients receive a reference standard? Yes Did patients receive the same reference standard? Yes Were all patients included in the analysis?	Are there concerns that the included patients do not match the review question? Are there concerns that the index test, its conduct, or interpretation differ from the review question? Are there concerns that the target condition as defined by the reference standard does not match the review question?
	CONCLUSION: Could the selection of patients have introduced bias? RISK: LOW	CONCLUSION: Could the conduct or interpretation of the index test have introduced bias? RISK: LOW	CONCLUSION: Could the reference standard, its conduct, or its interpretation have introduced bias? RISK: LOW	CONCLUSION: Could the patient flow have introduced bias? RISK: MODERATE
Ruamviboonsuk 2019a	Was a consecutive or random sample of patients enrolled? Yes Was a case-control design avoided? Yes Did the study avoid inappropriate exclusions? Yes	Were the index test results interpreted without knowledge of the results of the reference standard? Yes If a threshold was used, was it pre-specified? Yes, commercial software	Is the reference standard likely to correctly classify the target condition? Yes Were the reference standard results interpreted without knowledge of the results of the index test? Yes	Was there an appropriate interval between index test(s) and reference standard? Yes Did all patients receive a reference standard? Yes Did patients receive the same reference standard? Yes Were all patients included in the analysis? Yes	Are there concerns that the included patients do not match the review question? Are there concerns that the index test, its conduct, or interpretation differ from the review question? Are there concerns that the target condition as defined by the reference standard does not match the review question?
	CONCLUSION: Could the selection of patients have introduced bias? RISK: LOW	CONCLUSION: Could the conduct or interpretation of the index test have introduced bias? RISK: LOW	CONCLUSION: Could the reference standard, its conduct, or its interpretation have introduced bias? RISK: LOW	CONCLUSION: Could the patient flow have introduced bias? RISK: LOW
Verbraak 2019	Was a consecutive or random sample of patients enrolled? Yes Was a case-control design avoided? Yes Did the study avoid inappropriate exclusions? Yes, some patients are excluded but with a valid/transparent reason.	Were the index test results interpreted without knowledge of the results of the reference standard? Yes If a threshold was used, was it pre-specified? Yes, IDx-DR-EU-2.1 software.	Is the reference standard likely to correctly classify the target condition? Yes Were the reference standard results interpreted without knowledge of the results of the index test? Yes	Was there an appropriate interval between index test(s) and reference standard? Yes Did all patients receive a reference standard? Yes Did patients receive the same reference standard? Yes Were all patients included in the analysis?	Are there concerns that the included patients do not match the review question? Are there concerns that the index test, its conduct, or interpretation differ from the review question? Are there concerns that the target condition as defined by the reference standard does not match the review question?
	CONCLUSION: Could the selection of patients have introduced bias? RISK: LOW	CONCLUSION: Could the conduct or interpretation of the index test have introduced bias? RISK: LOW	CONCLUSION: Could the reference standard, its conduct, or its interpretation have introduced bias? RISK: LOW	CONCLUSION: Could the patient flow have introduced bias? RISK: MODERATE	Dutch setting; received funding
Van der Heijden 2018	Was a consecutive or random sample of patients enrolled? Yes Was a case-control design avoided? Yes Did the study avoid inappropriate exclusions? Yes, some patients are excluded but with a valid/transparent reason.	Were the index test results interpreted without knowledge of the results of the reference standard? Yes If a threshold was used, was it pre-specified? Yes, IDx-DR-EU-2.1 software.	Is the reference standard likely to correctly classify the target condition? Yes Were the reference standard results interpreted without knowledge of the results of the index test? Yes	Was there an appropriate interval between index test(s) and reference standard? Yes Did all patients receive a reference standard? Yes Did patients receive the same reference standard? Yes Were all patients included in the analysis?	Are there concerns that the included patients do not match the review question? Are there concerns that the index test, its conduct, or interpretation differ from the review question? Are there concerns that the target condition as defined by the reference standard does not match the review question?
	CONCLUSION: Could the selection of patients have introduced bias? RISK: LOW	CONCLUSION: Could the conduct or interpretation of the index test have introduced bias? RISK: LOW	CONCLUSION: Could the reference standard, its conduct, or its interpretation have introduced bias? RISK: LOW	CONCLUSION: Could the patient flow have introduced bias? RISK: MODERATE	Dutch setting; received funding
Li 2018

Was a consecutive or random sample of patients enrolled?

Yes

Was a case-control design avoided?

Yes

Did the study avoid inappropriate exclusions?

Yes, some patients are excluded but with a valid/transparent reason.

Were the index test results interpreted without knowledge of the results of the reference standard?

Yes

If a threshold was used, was it pre-specified?

Yes

Is the reference standard likely to correctly classify the target condition?

Yes

Were the reference standard results interpreted without knowledge of the results of the index test?

Yes

Was there an appropriate interval between index test(s) and reference standard?

Yes

Did all patients receive a reference standard?

Yes

Did patients receive the same reference standard?

Yes

Were all patients included in the analysis?

Are there concerns that the included patients do not match the review question?

Are there concerns that the index test, its conduct, or interpretation differ from the review question?

Are there concerns that the target condition as defined by the reference standard does not match the review question?

CONCLUSION:

Could the selection of patients have introduced bias?

RISK: LOW

CONCLUSION:

Could the conduct or interpretation of the index test have introduced bias?

RISK: LOW

CONCLUSION:

Could the reference standard, its conduct, or its interpretation have introduced bias?

RISK: LOW

CONCLUSION

Could the patient flow have introduced bias?

RISK: MODERATE

Model development, validation, and external validation.

Conflict of interest.

Abramoff 2018

Was a consecutive or random sample of patients enrolled?

Yes

Was a case-control design avoided?

Yes

Did the study avoid inappropriate exclusions?

Yes, some patients are excluded but with a valid/transparent reason.

Were the index test results interpreted without knowledge of the results of the reference standard?

Yes

If a threshold was used, was it pre-specified?

Yes

Is the reference standard likely to correctly classify the target condition?

Yes

Were the reference standard results interpreted without knowledge of the results of the index test?

Yes

Was there an appropriate interval between index test(s) and reference standard?

Yes

Did all patients receive a reference standard?

Yes

Did patients receive the same reference standard?

Yes

Were all patients included in the analysis?

Are there concerns that the included patients do not match the review question?

Are there concerns that the index test, its conduct, or interpretation differ from the review question?

Are there concerns that the target condition as defined by the reference standard does not match the review question?

CONCLUSION:

Could the selection of patients have introduced bias?

RISK: LOW

CONCLUSION:

Could the conduct or interpretation of the index test have introduced bias?

RISK: LOW

CONCLUSION:

Could the reference standard, its conduct, or its interpretation have introduced bias?

RISK: LOW

CONCLUSION

Could the patient flow have introduced bias?

RISK: MODERATE

Application of an AI model in an RCT.

Competing interests

Kanagasingam 2018

Was a consecutive or random sample of patients enrolled?

Yes

Was a case-control design avoided?

Yes

Did the study avoid inappropriate exclusions?

Yes

Were the index test results interpreted without knowledge of the results of the reference standard?

Yes

If a threshold was used, was it pre-specified?

Yes

Is the reference standard likely to correctly classify the target condition?

Yes

Were the reference standard results interpreted without knowledge of the results of the index test?

Yes

Was there an appropriate interval between index test(s) and reference standard?

Yes

Did all patients receive a reference standard?

Yes

Did patients receive the same reference standard?

Yes

Were all patients included in the analysis?

Yes

Are there concerns that the included patients do not match the review question?

Are there concerns that the index test, its conduct, or interpretation differ from the review question?

Are there concerns that the target condition as defined by the reference standard does not match the review question?

CONCLUSION:

Could the selection of patients have introduced bias?

RISK: LOW

CONCLUSION:

Could the conduct or interpretation of the index test have introduced bias?

RISK: LOW

CONCLUSION:

Could the reference standard, its conduct, or its interpretation have introduced bias?

RISK: LOW

CONCLUSION

Could the patient flow have introduced bias?

RISK: LOW

Sample size of validation cohort too small; not generalisable

Additional images were taken to reduce the risk of low-quality images.

Judgments on risk of bias are dependent on the research question: some items are more likely to introduce bias than others, and may be given more weight in the final conclusion on the overall risk of bias per domain:

Patient selection:

A consecutive or random sample carries a low risk of introducing bias.
A case control design is very likely to overestimate accuracy and thus introduce bias.
Inappropriate exclusion is likely to introduce bias.

Index test:

This item is similar to “blinding” in intervention studies. The potential for bias is related to the subjectivity of index test interpretation and the order of testing.
Selecting the test threshold to optimise sensitivity and/or specificity may lead to overoptimistic estimates of test performance and introduce bias.

Reference standard:

When the reference standard is not 100% sensitive and 100% specific, disagreements between the index test and reference standard may be incorrect, which increases the risk of bias.
This item is similar to “blinding” in intervention studies. The potential for bias is related to the subjectivity of reference standard interpretation and the order of testing.

Flow and timing:

If there is a delay or if treatment is started between index test and reference standard, misclassification may occur due to recovery or deterioration of the condition, which increases the risk of bias.
If the results of the index test influence the decision on whether to perform the reference standard or which reference standard is used, estimated diagnostic accuracy may be biased.
All patients who were recruited into the study should be included in the analysis, if not, the risk of bias is increased.

Judgement on applicability:

Patient selection: there may be concerns regarding applicability if patients included in the study differ from those targeted by the review question, in terms of severity of the target condition, demographic features, presence of differential diagnosis or co-morbidity, setting of the study and previous testing protocols.

Index test: if index tests methods differ from those specified in the review question there may be concerns regarding applicability.

Reference standard: the reference standard may be free of bias but the target condition that it defines may differ from the target condition specified in the review question.

Table of excluded studies

Author and year	Reason for exclusion
Tufail, 2017	Not according to PICO; for considerations: cost-effectiveness of automatic screening systems.
Kaur, 2018	Not according to PICO, about assessment of images.
Voets, 2019	Not according to PICO, assessing reproducibility of published algorithms.
Shaban, 2020	Not according to PICO, cross validation.
Zago, 2020	Not according to PICO, no human grader.
Sharma, 2018	Not according to PICO, no human grader.

Accountability

Review date and validity

Last reviewed: 20-01-2023

Initiative and authorisation

Initiative:

Nederlandse Internisten Vereniging

Authorised by:

Nederlands Oogheelkundig Gezelschap
Nederlands Huisartsen Genootschap
Nederlandse Internisten Vereniging
Nederlandse Vereniging voor Kindergeneeskunde
Optometristen Vereniging Nederland

General information

Guideline development was supported by the Kennisinstituut van Medisch Specialisten (www.kennisinstituut.nl) and was funded from the Kwaliteitsgelden Medisch Specialisten (SKMS). The funder had no influence whatsoever on the content of the guideline module.

Aim and target audience

Who is the guideline intended for?

This guideline is intended for all healthcare providers involved in the care of patients with DR.

For patients

Another term for diabetes mellitus is 'suikerziekte' (literally 'sugar disease'). In someone with diabetes mellitus, the body cannot regulate blood sugar properly on its own. DR is damage to the retina of the eyes and is a common cause of blindness and visual impairment in adults between 30 and 65 years of age.

Composition of the working group

Working group

Dr. M.V. (Manon) van Hecke, ophthalmologist, Elisabeth-Tweesteden ziekenhuis, Tilburg; Nederlands Oogheelkundig Gezelschap (NOG)
S.M.J. (Sandra) Smeets, (paediatric) ophthalmologist, VieCuri medisch centrum, Venlo; Nederlands Oogheelkundig Gezelschap (NOG)
Dr. Y. (Yvonne) de Jong-Hesse, ophthalmologist, Leids Universitair Medisch Centrum (LUMC), Leiden; Nederlands Oogheelkundig Gezelschap (NOG)
Dr. P.A. (Peter) Grootenhuis, general practitioner with a special interest in diabetes (kaderhuisarts diabetes), Huisartsenpraktijk Grootenhuis, Hoorn; Nederlands Huisartsen Genootschap (NHG, DiHAG)
M.P. (Marianne) den Breejen, paediatrician, Diabeter, Nederlandse Vereniging voor Kindergeneeskunde (NVK)
G. (Gabriëlle) Janssen, optometrist, Oogklasse, Amsterdam; Optometristen Vereniging Nederland (OVN)
H.J. (Anneke) Jansen Molenaar, eye care adviser (patient representative), Oogvereniging, Utrecht

Sounding board group

P.J.H.M (Pauline) Stouthart, Nederlandse Vereniging voor Kindergeneeskunde (NVK)
T.M. (Ties) Obers, Diabetesvereniging Nederland (DVN)
J.M. (Marieke) van Haren-Loezmans, Koninklijke Nederlandse Maatschappij ter bevordering der Pharmacie (KNMP)
R.M. (Richard) Posthuma, Nederlandse Vereniging van ZiekenhuisApothekers (NVZA)
A.M. (Aisha) Salarbaks, Nederlandse Vereniging voor Klinische Geriatrie (NVKG)
S.C. (Sue) Holleman, Verpleegkundigen & Verzorgenden Nederland (V&VN), afdeling Diabeteszorg
M.W.F. (Martin) van Leen, Verenso
Dr. ir. C.H.L. (Chris) Peters, Nederlandse Vereniging voor Klinische Fysica (NVKF)
Dr. E.H. (Erik) Serné, Nederlandse Internisten Vereniging (NIV)

With support from:

Dr. K.N.J. (Koert) Burger, epidemiologist, senior adviser, Kennisinstituut van de Federatie Medisch Specialisten
Dr. A.N. (Anh Nhi) Nguyen, adviser, Kennisinstituut van de Federatie Medisch Specialisten
Dr. M.M.A. (Maxime) Verhoeven, adviser, Kennisinstituut van de Federatie Medisch Specialisten
Dr. A.J. (Bart) Versteeg, adviser, Kennisinstituut van Medisch Specialisten

Declarations of interest

The Code for the prevention of improper influence due to conflicts of interest (Code ter voorkoming van oneigenlijke beïnvloeding door belangenverstrengeling) was followed. All working group members declared in writing whether, in the past three years (i.e. before the start of the project), they had had direct financial interests (employment with a commercial company, personal financial interests, research funding) or indirect interests (personal relationships, reputation management). During the development or revision of a module, changes in interests are reported to the chair. The declaration of interest is reconfirmed during the comment phase.

An overview of the interests of working group members, and the assessment of how any interests were handled, can be found in the table below. The signed declarations of interest can be requested from the secretariat of the Kennisinstituut van de Federatie Medisch Specialisten.

Working group member (surname)	Main position	Additional positions	Declared interests	Action taken
Van Hecke (chair)	Ophthalmologist, ETZ	FMS medicines working group, unpaid; ZiNL medicines horizon scan working group, unpaid; ETZ intramural medicines working group, unpaid; external adviser on ophthalmic medicines, CBG, paid; Jan 2019: talk at a medical retina symposium on laser for DR, one-off payment by Bayer; Dec 2019: produced an e-learning on fluids in AMD, one-off payment by Novartis	None	None
Smeets	Ophthalmologist, VieCuri Venlo	Paediatric ophthalmology working group, unpaid; strabology working group, unpaid	None	None
De Jong-Hesse	Ophthalmologist, Amsterdam UMC, location VUMC	Since Jan 2019: chair of the macula advisory committee of the Oogfonds (unpaid); until 2019: board member of the maculafonds (unpaid); board member of the Medische Retina working group (unpaid)	2017/2018: temporary service agreements with Novartis Pharma for giving presentations/editing texts	Does not take the lead on questions concerning anti-VEGF
Grootenhuis	General practitioner, practice owner; GP trainer, VUMC; kaderhuisarts diabetes, THOON	Member of the scientific advisory board of the Geneesmiddelen Bulletin	None	None
Den Breejen	Paediatrician at Diabeter Nederland	None	None	None
Janssen	Optometrist and business administrator. Self-employed (ZZP); activities consist of: organising conferences and training courses for optometrists (0.18 FTE); optometrist (0.2 FTE); adviser on task reallocation in optometry; editor-in-chief of an optometry trade journal (0.1 FTE)	Chair of the Optometristen Vereniging Nederland, paid (0.2 FTE)	None	None
Jansen Molenaar	Eye care adviser, Oogvereniging	None	None	None

Patient perspective input

Attention was paid to the patient perspective through a representative of the patient association (Oogvereniging) in the working group. The draft guideline was also submitted to the Diabetes Vereniging Nederland for comment.

Qualitative estimate of possible financial consequences under the Wkkgz

In accordance with the Wet kwaliteit, klachten en geschillen zorg (Wkkgz), a qualitative estimate was made for this guideline of whether the recommendations might lead to substantial financial consequences. In this assessment, guideline modules were tested on several domains (see the flowchart on the Richtlijnendatabase).

The qualitative estimate shows that substantial financial consequences are unlikely; see the table below.

Module	Outcome of estimate	Explanation
Module Screening in children	no financial consequences	Although the assessment shows that the recommendation(s) are broadly applicable (5,000-40,000 patients), it also shows that the vast majority (±90%) of healthcare organisations and care providers already meet the standard. No financial consequences are therefore expected.
Module Automated screening using AI	no financial consequences	Although the assessment shows that the recommendation(s) are broadly applicable (>40,000 patients), it also shows that no increase in the number of full-time equivalents of care providers to be deployed is involved. No financial consequences are therefore expected.
Module Treatment of diabetic macular oedema	no financial consequences	Although the assessment shows that the recommendation(s) are broadly applicable (>40,000 patients), it also shows that the vast majority (±90%) of healthcare organisations and care providers already meet the standard. No financial consequences are therefore expected.

Methods

AGREE

This guideline module was drawn up in accordance with the requirements stated in the report Medisch Specialistische Richtlijnen 2.0 of the advisory committee on guidelines (adviescommissie Richtlijnen) of the Raad Kwaliteit. This report is based on the AGREE II instrument (Appraisal of Guidelines for Research & Evaluation II; Brouwers, 2010).

Outcome measures

After drawing up the search question belonging to the clinical question, the working group inventoried which outcome measures are relevant to the patient, considering both desirable and undesirable effects. A maximum of eight outcome measures was used. The working group rated these outcome measures according to their relative importance for decision-making about recommendations, as critical (crucial for decision-making), important (but not critical), or unimportant. The working group also defined, at least for the critical outcome measures, which differences it considered clinically (patient) relevant.

Method of the literature summary

A detailed description of the strategy for searching and selecting literature can be found under 'Search and select' (Zoeken en selecteren) in the evidence base (Onderbouwing). Where possible, data from different studies were pooled in a random-effects model. Review Manager 5.4 was used for the statistical analyses. The assessment of the strength of the scientific evidence is explained below.

Assessing the strength of the scientific evidence

The strength of the scientific evidence was determined according to the GRADE method. GRADE stands for 'Grading of Recommendations Assessment, Development and Evaluation' (see http://www.gradeworkinggroup.org/). The basic principles of the GRADE methodology are: identifying and prioritising the clinically (patient) relevant outcome measures, a systematic review per outcome measure, and an assessment of the certainty of the evidence per outcome measure based on the eight GRADE domains (domains for downgrading: risk of bias, inconsistency, indirectness, imprecision, and publication bias; domains for upgrading: dose-response relationship, large effect, and residual plausible confounding).

GRADE distinguishes four levels of quality of the scientific evidence: high, moderate, low and very low. These levels refer to the degree of certainty about the literature conclusion, in particular the degree of certainty that the literature conclusion adequately supports the recommendation (Schünemann, 2013; Hultcrantz, 2017).

GRADE	Definition
High	there is high certainty that the true effect of treatment lies close to the estimated effect of treatment; it is very unlikely that the literature conclusion will change in a clinically relevant way when results of new large-scale research are added to the literature analysis.
Moderate	there is moderate certainty that the true effect of treatment lies close to the estimated effect of treatment; it is possible that the conclusion will change in a clinically relevant way when results of new large-scale research are added to the literature analysis.
Low	there is low certainty that the true effect of treatment lies close to the estimated effect of treatment; there is a real chance that the conclusion will change in a clinically relevant way when results of new large-scale research are added to the literature analysis.
Very low	there is very low certainty that the true effect of treatment lies close to the estimated effect of treatment; the literature conclusion is very uncertain.

When assessing (grading) the strength of the scientific evidence in guidelines according to the GRADE methodology, thresholds for clinical decision-making play an important role (Hultcrantz, 2017). These are the thresholds that, when crossed, would give rise to a change in the recommendation. To determine the thresholds for clinical decision-making, all relevant outcome measures and considerations must be weighed. The thresholds for clinical decision-making are therefore not directly comparable with the minimal clinically important difference (MCID). Particularly in situations where an intervention has no important disadvantages and the costs are relatively low, the threshold for clinical decision-making regarding the effectiveness of the intervention may lie at a lower value (closer to the null effect) than the MCID (Hultcrantz, 2017).

Considerations (from evidence to recommendation)

To arrive at a recommendation, other aspects besides (the quality of) the scientific evidence are important and are taken into account, such as additional arguments from, for example, biomechanics or physiology, patient values and preferences, costs (resource use), acceptability, feasibility and implementation. These aspects are systematically listed and assessed (weighed) under the heading 'Considerations' and may be (partly) based on expert opinion. A structured format was used, based on the evidence-to-decision framework of the international GRADE Working Group (Alonso-Coello, 2016a; Alonso-Coello 2016b). This evidence-to-decision framework is an integral part of the GRADE methodology.

Formulating recommendations

The recommendations answer the clinical question and are based on the available scientific evidence, the most important considerations, and a weighing of the favourable and unfavourable effects of the relevant interventions. The strength of the scientific evidence and the weight assigned by the working group to the considerations together determine the strength of the recommendation. In accordance with the GRADE methodology, a low certainty of evidence for conclusions in the systematic literature analysis does not a priori rule out a strong recommendation, and weak recommendations are also possible with a high certainty of evidence (Agoritsas, 2017; Neumann, 2016). The strength of the recommendation is always determined by weighing all relevant arguments together. For each recommendation, the working group has recorded how it arrived at the direction and strength of the recommendation.

The GRADE methodology distinguishes between strong and weak (or conditional) recommendations. The strength of a recommendation refers to the degree of certainty that the benefits of the intervention outweigh the harms (or vice versa), considered across the whole spectrum of patients for whom the recommendation is intended. The strength of a recommendation has clear implications for patients, clinicians and policymakers (see the table below). A recommendation is not a dictate; even a strong recommendation based on high-quality evidence (GRADE rating HIGH) will not always apply under all possible circumstances and for every individual patient.

Implications of strong and weak recommendations for different guideline users

	Strong recommendation	Weak (conditional) recommendation
For patients	Most patients would choose the recommended intervention or approach and only a small number would not.	A considerable proportion of patients would choose the recommended intervention or approach, but many patients would not.
For clinicians	Most patients should receive the recommended intervention or approach.	There are several suitable interventions or approaches. The patient must be supported in choosing the intervention or approach that best fits his or her values and preferences.
For policymakers	The recommended intervention or approach can be regarded as standard policy.	Policy-making requires extensive discussion with the involvement of many stakeholders. There is a greater chance of local differences in policy.

Organisation of care

In the bottleneck analysis and during the development of the guideline module, explicit attention was paid to the organisation of care: all aspects that are preconditions for providing care (such as coordination, communication, (financial) resources, staffing and infrastructure). Preconditions relevant to answering this specific clinical question are mentioned in the considerations. More general, overarching, or additional aspects of the organisation of care are addressed in the module Organisation of care (Organisatie van zorg).

Comment and authorisation phase

The draft guideline module was submitted for comment to the (scientific) societies and (patient) organisations involved. The comments were collected and discussed with the working group. In response to the comments, the draft guideline module was revised and finalised by the working group. The final guideline module was submitted to the participating (scientific) societies and (patient) organisations for authorisation and was authorised or approved by them.

Search documentation

Literature search strategy

Ovid/Medline

1 exp Diabetes Mellitus/ or diabete*.ti,ab,kf. or diabetic*.ti,ab,kf. or dm2.ti,ab,kf. or d2m.ti,ab,kf. or niddm.ti,ab,kf. or 'dm 2'.ti,ab,kf. or t2d*.ti,ab,kf. or 'dm type 2'.ti,ab,kf. or 'dm type ii'.ti,ab,kf. or dm1.ti,ab,kf. or d1m.ti,ab,kf. or iddm.ti,ab,kf. or 'dm 1'.ti,ab,kf. or t1d*.ti,ab,kf. or 'dm type 1'.ti,ab,kf. or 'dm type i'.ti,ab,kf. or 'type* 1 diabet*'.ti,ab,kf. or 'type* i diabet*'.ti,ab,kf. or 'type* one diabet*'.ti,ab,kf. (696742)

2 Algorithms/ or exp Artificial intelligence/ or (artificial intelligen* or machine intelligen* or machine learn* or neural network* or deep learn* or algorithm*).ti,ab,kf. (499344)

3 exp Diabetic retinopathy/ or (retinopath* or retinitis or retinosis).ti,ab,kf. (64728)

4 1 and 2 and 3 (999)

5 4 not ((exp animals/ or exp models, animal/) not humans/) not (letter/ or comment/ or editorial/) (968)

6 limit 5 to yr="2015 -Current" (570)

7 (meta-analysis/ or meta-analysis as topic/ or (meta adj analy$).tw. or (systematic*or literature adj2 review$1).tw. or (systematic adj overview$1).tw. or exp "Review Literature as Topic"/ or cochrane.ab. or cochrane.jw. or embase.ab. or medline.ab. or (psychlit or psyclit).ab. or (cinahl or cinhal).ab. or cancerlit.ab. or ((selection criteria or data extraction).ab. and "review"/)) not (Comment/ or Editorial/ or Letter/ or (animals/ not humans/)) (294638)

8 6 and 7 (12)

9 exp "Sensitivity and Specificity"/ or (Sensitiv* or Specific*).ti,ab. or (predict* or ROC-curve or receiver-operator*).ti,ab. or (likelihood or LR*).ti,ab. or exp Diagnostic Errors/ or (inter-observer or intra-observer or interobserver or intraobserver or validity or kappa or reliability).ti,ab. or reproducibility.ti,ab. or (test adj2 (re-test or retest)).ti,ab. or "Reproducibility of Results"/ or accuracy.ti,ab. or Diagnosis, Differential/ or Validation Studies.pt. (6539234)

10 6 and 9 (369)

Embase

No.	Query	Results
#10	#6 AND #9	446
#9	'sensitivity and specificity'/de OR sensitiv:ab,ti OR specific:ab,ti OR predict*:ab,ti OR 'roc curve':ab,ti OR 'receiver operator':ab,ti OR 'receiver operators':ab,ti OR likelihood:ab,ti OR 'diagnostic error'/exp OR 'diagnostic accuracy'/exp OR 'diagnostic test accuracy study'/exp OR 'inter observer':ab,ti OR 'intra observer':ab,ti OR interobserver:ab,ti OR intraobserver:ab,ti OR validity:ab,ti OR kappa:ab,ti OR reliability:ab,ti OR reproducibility:ab,ti OR ((test NEAR/2 're-test'):ab,ti) OR ((test NEAR/2 'retest'):ab,ti) OR 'reproducibility'/exp OR accuracy:ab,ti OR 'differential diagnosis'/exp OR 'validation study'/de OR 'measurement precision'/exp OR 'diagnostic value'/exp OR 'reliability'/exp	8070749
#8	#6 AND #7	18
#7	('meta analysis'/de OR 'meta analysis (topic)'/exp OR cochrane:ab OR embase:ab OR psycinfo:ab OR cinahl:ab OR medline:ab OR ((systematic NEAR/1 (review OR overview)):ab,ti) OR ((meta NEAR/1 analy):ab,ti) OR metaanalys:ab,ti OR 'data extraction':ab OR cochrane:jt OR 'systematic review'/de) NOT (('animal experiment'/exp OR 'animal model'/exp OR 'nonhuman'/exp) NOT 'human'/exp)	526170
#6	#4 AND [1-1-2015]/sd NOT ('conference abstract'/it OR 'editorial'/it OR 'letter'/it OR 'note'/it) NOT (('animal experiment'/exp OR 'animal model'/exp OR 'nonhuman'/exp) NOT 'human'/exp)	669
#5	#4 NOT ('conference abstract'/it OR 'editorial'/it OR 'letter'/it OR 'note'/it) NOT (('animal experiment'/exp OR 'animal model'/exp OR 'nonhuman'/exp) NOT 'human'/exp)	978
#4	#1 AND #2 AND #3	1438
#3	'artificial intelligence'/exp OR 'artificial intelligence':ti,ab,kw OR 'machine intelligence':ti,ab,kw OR 'machine learn':ti,ab,kw OR 'neural network':ti,ab,kw OR 'deep learn':ti,ab,kw OR algorithm:ti,ab,kw	411154
#2	'diabetes mellitus'/exp OR diabete:ti,ab,kw OR diabetic:ti,ab,kw OR dm2:ti,ab,kw OR d2m:ti,ab,kw OR niddm:ti,ab,kw OR 'dm 2':ti,ab,kw OR t2d:ti,ab,kw OR 'dm type 2':ti,ab,kw OR 'dm type ii':ti,ab,kw OR dm1:ti,ab,kw OR d1m:ti,ab,kw OR iddm:ti,ab,kw OR 'dm 1':ti,ab,kw OR t1d:ti,ab,kw OR 'dm type 1':ti,ab,kw OR 'dm type i':ti,ab,kw OR 'type* 1 diabet':ti,ab,kw OR 'type i diabet':ti,ab,kw OR 'type one diabet*':ti,ab,kw	1170128
#1	'diabetic retinopathy'/exp OR retinopath*:ti,ab,kw OR retinitis:ti,ab,kw OR retinosis:ti,ab,kw	92234

Appendix

Table 1. Overview of the consequences of each test outcome for the patient.

Outcome	Consequence	Consequence relevant for patient	Importance
Replacement: AI instead of human grader. If the test result is positive, a referral is made to the ophthalmologist. The aim of the AI system is to replace human graders. The AI system must not miss referable retinopathy (high sensitivity; no false negatives), and the number of patients incorrectly referred to the ophthalmologist must be low (high specificity; few false positives) because of the high costs and the burden on the ophthalmologist.
TP	The patient is rightly diagnosed with referable retinopathy. The patient is referred to an ophthalmologist.	The patient is rightly referred to the ophthalmologist and receives appropriate care.	8
TN	The patient is rightly diagnosed with no identifiable retinopathy. The patient is not referred to an ophthalmologist.	The patient is rightly not referred to an ophthalmologist.	8
FP	The patient is misdiagnosed with referable retinopathy. The patient is referred to an ophthalmologist.	The patient is incorrectly referred to the ophthalmologist for a potentially serious eye abnormality. This extra visit is not necessary. Can be stressful for the patient.	9
FN	The patient is misdiagnosed with no identifiable retinopathy. The patient is not referred to an ophthalmologist.	The patient is incorrectly not referred to the ophthalmologist and may therefore not receive appropriate medical care. This can cause serious health damage.	9
Inconclusive results	Quality of images is insufficient.	Delay in final diagnosis.	7
Burden of test	Images: these are taken anyway, so this burden is negligible. AI: a computer assesses whether the patient should be referred to an ophthalmologist. This can be burdensome and risky for the patient. Keep FNs as low as possible.	Possibly less burden because of faster results and/or faster referral to an ophthalmologist?	6
Resource use (costs)	Images: are commonly taken. It may be that extra images have to be taken to guarantee the quality. The software must be purchased, but it also replaces a human grader.		8
Triage: AI as filter for human grader. If the test result is positive, a referral is made to a human grader, and the human grader determines whether the patient is referred to an ophthalmologist. The aim of the AI system is to reduce the burden on the human graders. The AI system must not miss referable retinopathy (high sensitivity; no false negatives), but the number of patients incorrectly referred to the human grader may be higher than in a diagnostic strategy involving direct referral to an ophthalmologist (high specificity is less important; the number of false positives may be higher).
TP	The patient is rightly diagnosed with referable retinopathy. The patient is referred to a human grader.	The patient may not receive the right care directly, because the images are also assessed by a human grader.	8
TN	The patient is rightly diagnosed with no identifiable retinopathy. The patient is not referred to a human grader.	The patient is rightly not referred to a human grader.	7
FP	The patient is misdiagnosed with referable retinopathy. The patient is referred to a human grader.	The patient is referred to the human grader. This extra visit is not necessary and might be a burden for the patient.	7
FN	The patient is misdiagnosed with no identifiable retinopathy. The patient is not referred to a human grader.	The patient is not referred to the human grader, although this visit is essential. These patients should be referred to an ophthalmologist as soon as possible so that they receive the right care.	9
Inconclusive results	Quality of images is insufficient.	Patient is automatically referred to human grader	7
Burden of test	Images: are commonly taken. It may be that extra images have to be taken to guarantee the quality. The software must be purchased, but the burden on human graders is reduced.		8
Resource use (costs)	Images: are commonly taken. It may be that extra images have to be taken to guarantee the quality. The software must be purchased, but it also replaces a human grader. The extra costs of a human grader are unnecessary for FPs; keep this group as small as possible.		8
TP= true positives, TN= true negatives, FP= false positives, FN= false negatives.
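
The two strategies in Table 1 differ only in where a given AI result sends the patient next. The following Python sketch illustrates that routing. It is not part of the guideline or of any grading system: the names `AIResult`, `Strategy` and `next_step` are hypothetical, and the returned labels simply paraphrase the consequences listed in the table.

```python
from enum import Enum


class AIResult(Enum):
    POSITIVE = "referable retinopathy"
    NEGATIVE = "no identifiable retinopathy"
    INCONCLUSIVE = "insufficient image quality"


class Strategy(Enum):
    REPLACEMENT = "replacement"  # AI instead of the human grader
    TRIAGE = "triage"            # AI as a filter before the human grader


def next_step(result: AIResult, strategy: Strategy) -> str:
    """Return the next step for a patient, paraphrasing Table 1 (hypothetical helper)."""
    if result is AIResult.NEGATIVE:
        # Under both strategies a negative AI result ends the pathway for this round.
        return "no referral"
    if strategy is Strategy.TRIAGE:
        # Positive and inconclusive results are both passed on to a human grader,
        # who decides about referral to the ophthalmologist.
        return "human grader"
    if result is AIResult.POSITIVE:
        # Replacement strategy: a positive result leads directly to the ophthalmologist.
        return "ophthalmologist"
    # Replacement strategy, inconclusive images: the table only lists a delay.
    return "final diagnosis delayed"


for strategy in Strategy:
    for result in AIResult:
        print(f"{strategy.value:12s} {result.name:13s} -> {next_step(result, strategy)}")
```

The sketch makes the trade-off in the table visible: under triage a false positive costs a human grading, whereas under replacement it costs an ophthalmologist visit.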

Worked-out model, Heydon, 2020

Screening for rDR (prevalence = 5.5%)

	rDR according to human grader	No rDR according to human grader	Total
rDR according to AI	1585	13506	15091
No rDR according to AI	78	15236	15314
Total	1663	28742	30405

* the fewer false positives, the higher the specificity.
* the fewer false negatives, the higher the sensitivity.

Sensitivity = 95.3%

Specificity = 53.0%**

True positives = 1585 (5.2%)

True negatives = 15236 (50.1%)

False positives = 13506 (44.4%)

False negatives = 78 (0.3%)
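
The percentages in the Heydon (2020) model follow directly from the four cells of the 2×2 table. The following Python sketch reproduces that arithmetic. It is illustrative only: the class `TwoByTwo` and its method names are hypothetical helpers introduced here, and only the four cell counts are taken from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Cell counts of a 2x2 table: AI result versus human grader (reference)."""
    tp: int  # AI positive, human grader positive
    fp: int  # AI positive, human grader negative
    fn: int  # AI negative, human grader positive
    tn: int  # AI negative, human grader negative

    @property
    def total(self) -> int:
        return self.tp + self.fp + self.fn + self.tn

    def sensitivity(self) -> float:
        # Share of reference-positive patients that the AI flags.
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        # Share of reference-negative patients that the AI clears.
        return self.tn / (self.tn + self.fp)

    def prevalence(self) -> float:
        # Share of all screened patients who are reference-positive.
        return (self.tp + self.fn) / self.total

    def ai_positive_share(self) -> float:
        # Share of all screened patients with a positive AI result.
        return (self.tp + self.fp) / self.total


# Cell counts from the Heydon (2020) table.
heydon = TwoByTwo(tp=1585, fp=13506, fn=78, tn=15236)

print(f"Sensitivity:       {heydon.sensitivity():.1%}")        # 95.3%
print(f"Specificity:       {heydon.specificity():.1%}")        # 53.0%
print(f"Prevalence:        {heydon.prevalence():.1%}")         # 5.5%
print(f"AI-positive share: {heydon.ai_positive_share():.1%}")  # 49.6%
```

The AI-positive share is not reported in the table; it is derived here from the row total (15091 of 30405) and, under a triage strategy, corresponds to the share of screened patients whose images would still be passed to a human grader.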
