Guide for reviewing AI papers
Does the manuscript report on AI/Radiomics?
Please let us know if the manuscript’s main focus is on AI / radiomics, for example:
– Development of AI models
– Application of self-developed AI models or commercially available AI-powered products
– Development and validation of radiomics models
Keep in mind:
- Radiomics here refers to the extraction of complex mathematical features describing gray-value distributions, such as gray-level run-length matrix (GLRLM) or gray-level size-zone matrix (GLSZM) features.
- Simpler imaging biomarkers, such as ADC values or basic morphological measurements, should not be referred to as radiomics, to avoid confusion.
- Basic statistical models such as linear regression are not considered artificial intelligence here, whereas more complex methods such as support vector machines, random forests, etc. do fall under this category for the purposes of the review.
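To illustrate the distinction between radiomics texture features and simple measurements, the following toy function (purely illustrative, not taken from any radiomics library) computes a simplified Short Run Emphasis over horizontal gray-level runs — the kind of quantity that GLRLM-based features summarize:

```python
def short_run_emphasis(image):
    """Toy GLRLM-style texture feature over horizontal runs only.

    A 'run' is a maximal stretch of identical gray values within a row.
    Short Run Emphasis weights each run by 1/length^2, so fine/noisy
    textures score high and homogeneous regions score low.
    """
    runs = []
    for row in image:
        length = 1
        for prev, cur in zip(row, row[1:]):
            if cur == prev:
                length += 1
            else:
                runs.append(length)
                length = 1
        runs.append(length)  # close the final run in this row
    return sum(1 / L ** 2 for L in runs) / len(runs)
```

A homogeneous patch such as `[[1, 1, 1, 1]]` yields a low value, while an alternating patch such as `[[1, 2, 1, 2]]` yields the maximum of 1.0 — unlike a single mean intensity or diameter, the feature encodes spatial arrangement, which is what makes it "radiomics" in the sense above.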
Is the presented AI/Radiomics approach relevant for clinical practice?
In some cases, proofs of concept might be interesting for our readership. However, as a clinical radiological journal, we would like to encourage authors to discuss the impact of the proposed AI / radiomics model in detail.
– How would patient management be affected by the proposed model? Are different treatment options available?
– Is the model sufficiently accurate and reliable so that alternative methods could be safely omitted (for example, could biopsy/surgery be avoided)?
Keep in mind:
- Models that, for example, predict a specific gene mutation in non-small-cell lung cancer might not affect clinical routine much, as histopathological confirmation of NSCLC would be required before the model could be applied — at which point a specimen would already be available for genetic testing.
- Similarly, models with limited capabilities (for example, differentiation of HCC vs. liver hemangioma) might be of limited use, as important differential diagnoses (cholangiocarcinoma, hepatocellular adenoma, etc.) are not assessed.
- Statements like “could be used to guide more personalized treatment” may not be helpful to our readership if no options for individualized care are presented / discussed.
Is the reference standard used to train/validate the AI/radiomics reliable?
The reference standard used to train and validate AI (or radiomics) models should be as reliable as possible. Good reference standards come from independent, objective tests with structured results, for example:
– Laboratory tests
– Outcome data
Keep in mind:
- In some cases, such objective tests might not be available or feasible. However, the reference standard needs to be clearly defined and reported.
- Automatically extracted labels from unstructured data (for example using NLP to extract data from radiological reports) might introduce relevant errors due to imperfections of the NLP algorithm used.
- Single-reader visual assessment is not a reliable reference standard, whereas a consensus of multiple readers may be valid. Ideally, the consensus would be based on a different, more sensitive modality (for example, if an AI is meant to detect subtle fractures on X-ray, consensus reading of CT scans of the same cases can serve as a valid reference standard).
Was AI/Radiomics performance compared to a human reader or clinical model?
To be clinically relevant, any AI / radiomics approach should be compared against a human reader’s performance, a model based on clinical information alone, or another clinically meaningful benchmark, for example:
- A model predicting survival based on imaging features could be compared to a model including clinical data like age, pTNM, etc.
- A model differentiating HCC from liver hemangioma in MRI could be compared to a human reader’s assessment.
- Any comparison of AI / radiomics performance with an alternative should be supported by appropriate statistical tests (e.g., DeLong’s test for comparing AUCs, McNemar’s test for paired classification decisions).
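As a concrete illustration of such a statistical test, the paired binary decisions of an AI model and a human reader on the same cases can be compared with McNemar’s exact test. A minimal sketch (the function name and counts are illustrative, not a reference implementation):

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact two-sided McNemar test on the discordant pairs:
    b = cases the AI classified correctly but the reader did not,
    c = cases the reader classified correctly but the AI did not.
    Under the null hypothesis of equal performance, the discordant
    cases split 50/50, so the p-value follows a Binomial(b+c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no disagreements: no evidence either way
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)  # two-sided, capped at 1
```

For example, if the AI and the reader disagree on 13 cases and the AI is correct in only 1 of them, the test yields a small p-value, supporting a genuine performance difference; an even 5-vs-5 split yields p = 1.0. For comparing continuous outputs such as ROC curves, DeLong’s test would be the analogous choice.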