# Convolutional neural network for automatic maxillary sinus segmentation on cone-beam computed tomographic images

This study was conducted in accordance with the standards of the Declaration of Helsinki on medical research. Institutional ethical committee approval was obtained from the Ethical Review Board of the University Hospitals Leuven (reference number: S57587). Informed consent was not required as patient-specific information was anonymized. The study plan and report followed the recommendations of Schwendicke et al.23 for reporting on artificial intelligence in dental research.

### Dataset

A sample of 132 CBCT scans (264 sinuses; 75 females and 57 males; mean age 40 years) acquired between 2013 and 2021 with different scanning parameters was collected (Table 1). Inclusion criteria were patients with permanent dentition and maxillary sinuses with or without mucosal thickening (shallow > 2 mm, moderate > 4 mm) and/or with a semi-spherical membrane on one of the walls24. Scans with dental restorations, orthodontic brackets and implants were also included. Exclusion criteria were a history of trauma, sinus surgery or the presence of pathologies affecting the sinus contour.

The Digital Imaging and Communications in Medicine (DICOM) files of the CBCT images were exported anonymously. The dataset was then randomly divided into three subsets: (1) a training set (n = 83 scans) for training the CNN model on the ground truth; (2) a validation set (n = 19 scans) for evaluating and selecting the best model; and (3) a testing set (n = 30 scans) for testing model performance by comparison with the ground truth.
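A random split of this kind can be sketched as follows. This is only an illustration: the scan identifiers, the fixed seed and the function name `split_scans` are assumptions and are not taken from the study.

```python
import random


def split_scans(scan_ids, n_train=83, n_val=19, n_test=30, seed=0):
    """Shuffle scan identifiers and cut them into three disjoint subsets."""
    if n_train + n_val + n_test != len(scan_ids):
        raise ValueError("subset sizes must sum to the number of scans")
    shuffled = list(scan_ids)
    # The seed is an assumption, used here only to make the split repeatable.
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Hypothetical identifiers for the 132 scans.
scan_ids = [f"scan_{i:03d}" for i in range(132)]
train, val, test = split_scans(scan_ids)
print(len(train), len(val), len(test))
```

Splitting at the scan level, rather than at the sinus level, keeps both sinuses of a patient in the same subset.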

### Ground truth labeling

The ground truth datasets for training and testing of the CNN model were labeled by semi-automatic segmentation of the sinus using Mimics Innovation Suite (version 23.0, Materialise NV, Leuven, Belgium). Initially, a custom threshold was set between −1024 and −200 Hounsfield units (HU) to create a mask of the air (Fig. 1a). Subsequently, the region of interest (ROI) was isolated from the surrounding structures. The bony contours were delineated manually using the eclipse and livewire functions, and all contours were checked in the coronal, axial, and sagittal orthogonal planes (Fig. 1b). To avoid inconsistencies in the ROI between images, the segmentation region was limited to the start of the sinus ostium on the sinus side, before its continuation into the infundibulum (Fig. 1b). Finally, the edited mask of each sinus was exported separately as a standard tessellation language (STL) file. The segmentation was performed by a dentomaxillofacial radiologist (NM) with seven years of experience and subsequently re-assessed by two other radiologists (KFV and RJ) with 15 and 25 years of experience, respectively.

### CNN model architecture and training

Two 3D U-Net architectures were used25, both of which consisted of 4 encoder and 3 decoder blocks, with 2 convolutions of kernel size 3 × 3 × 3 followed by a rectified linear unit (ReLU) activation and group normalization with 8 feature maps26. Thereafter, max pooling with a kernel size of 2 × 2 × 2 and a stride of 2 was applied after each encoder, reducing the resolution by a factor of 2 in all dimensions. Both networks were trained as binary classifiers (0 or 1) with a weighted binary cross-entropy loss:

$$L_{BCE} = y_{n} \log\left(p_{n}\right) + \left(1 - y_{n}\right) \log\left(1 - p_{n}\right)$$

for each voxel \(n\) with ground truth value \(y_{n}\) = 0 or 1 and predicted probability \(p_{n}\) from the network.
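As a rough illustration of how such a loss could be evaluated over a set of voxels, the sketch below averages the per-voxel terms. Several points are assumptions: the text does not state how the weighting is applied, so the `pos_weight` factor on the foreground term is hypothetical; the sum is negated, as is conventional when the quantity is minimized; and `eps` is a numerical safeguard not mentioned in the study.

```python
import math


def weighted_bce(y_true, p_pred, pos_weight=1.0, eps=1e-7):
    """Mean weighted binary cross-entropy over a flat list of voxels.

    pos_weight (hypothetical) scales the foreground term; pos_weight=1.0
    gives the unweighted loss.
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clamp probabilities so that log(0) is never evaluated.
        p = min(max(p, eps), 1.0 - eps)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Two voxels predicted at p = 0.5: each term equals log(2).
print(weighted_bce([1, 0], [0.5, 0.5]))
print(weighted_bce([1, 0], [0.5, 0.5], pos_weight=2.0))
```

Raising `pos_weight` above 1 penalizes missed foreground voxels more heavily, which is one common way to handle a structure that occupies a small part of the volume.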

A two-step pre-processing of the training dataset was applied. First, all scans were resampled to the same voxel size. Thereafter, to overcome graphics processing unit (GPU) memory limitations, the full-size scan was downsampled to a fixed size.

The first 3D U-Net produced a rough low-resolution segmentation, which was used to propose 3D patches; only the patches belonging to the sinus were cropped. These relevant patches were then passed to the second 3D U-Net, where they were individually segmented and combined to create the full-resolution segmentation map. Finally, binarization was applied and only the largest connected component was kept, followed by application of a marching cubes algorithm to the binary image. The resulting mesh was smoothed to generate a 3D model (Fig. 2).
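The binarization and largest-component step can be sketched in plain Python as below. The 0.5 threshold, the 6-connectivity and the function name are assumptions for illustration; the study does not specify them, and the marching cubes and smoothing steps are not shown.

```python
from collections import deque

# Face-adjacent neighbours (6-connectivity); the connectivity is an assumption.
NEIGHBORS_6 = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def largest_component(prob_volume, threshold=0.5):
    """Binarize a nested-list volume [z][y][x] and keep the largest connected part.

    Returns the set of (z, y, x) voxel coordinates of that component.
    """
    foreground = {
        (z, y, x)
        for z, plane in enumerate(prob_volume)
        for y, row in enumerate(plane)
        for x, p in enumerate(row)
        if p >= threshold
    }
    best = set()
    unvisited = set(foreground)
    while unvisited:
        seed = unvisited.pop()
        component = {seed}
        queue = deque([seed])
        # Breadth-first flood fill from the seed voxel.
        while queue:
            z, y, x = queue.popleft()
            for dz, dy, dx in NEIGHBORS_6:
                nb = (z + dz, y + dy, x + dx)
                if nb in unvisited:
                    unvisited.remove(nb)
                    component.add(nb)
                    queue.append(nb)
        if len(component) > len(best):
            best = component
    return best


# Toy 1 x 1 x 5 volume with two separate foreground runs.
toy = [[[0.9, 0.8, 0.1, 0.7, 0.2]]]
kept = largest_component(toy)
print(sorted(kept))
```

Keeping only the largest component discards small isolated false positives, at the cost of also dropping any genuine but disconnected part of the structure.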

The model parameters were optimized with ADAM27 (an optimization algorithm for training deep learning models) with an initial learning rate of 1.25e−4. During training, random spatial augmentations (rotation, scaling, and elastic deformation) were applied. The validation set was used to define early stopping, which marks the saturation point of the model beyond which no further improvement is observed and continued training would lead to overfitting. The CNN model was deployed to an online cloud-based platform called Virtual Patient Creator (creator.relu.eu, Relu BV, version October 2021), where users could upload a DICOM dataset and obtain an automatic segmentation of the desired structure.
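Early stopping on a validation set is commonly implemented with a patience counter. The sketch below shows one such rule; the `patience` and `min_delta` parameters and the loss values are hypothetical, since the study does not report its stopping criterion.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the index of the epoch whose weights would be kept.

    Training stops once the validation loss has failed to improve by more
    than min_delta for `patience` consecutive epochs.
    """
    best = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # saturation reached: stop training
    return best_epoch


# Made-up validation losses: improvement stalls after epoch 2.
losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.5]
print(early_stopping_epoch(losses, patience=3))
```

With a patience of 3 the run stops before the late improvement at the last epoch is seen, which illustrates the trade-off between training time and the risk of stopping too early.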

### Testing of AI pipeline

The CNN model was tested by uploading the DICOM files of the test set to the Virtual Patient Creator platform. The resulting automatic segmentation (Fig. 3) could subsequently be downloaded in DICOM or STL file format. For clinical evaluation of the automatic segmentation, the authors developed the following classification criteria: A, perfect segmentation (no refinement needed); B, very good segmentation (refinements without clinical relevance, slight over- or under-segmentation in regions other than the maxillary sinus floor); C, good segmentation (refinements with some clinical relevance, slight over- or under-segmentation in the maxillary sinus floor region); D, deficient segmentation (considerable over- or under-segmentation, independent of the sinus region, requiring repetition); and E, negative (the CNN model could not predict anything). Two observers (NM and KFV) evaluated all cases, followed by an expert consensus (RJ). Where refinements were required, the STL file was imported into Mimics and edited using the 3D tools tab. The resulting segmentation was denoted the refined segmentation.

### Evaluation metrics

The evaluation metrics28,29 are outlined in Table 2. The ground truth, automatic and refined segmentations were compared by the main observer on the whole testing set. A pilot of 10 scans was tested first, which showed a Dice similarity coefficient (DSC) of 0.985 ± 0.004, an intersection over union (IoU) of 0.969 ± 0.007 and a 95% Hausdorff distance (HD) of 0.204 ± 0.018 mm. Based on these findings, the sample size of the testing set was increased to 30 scans according to the central limit theorem (CLT)30.
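For readers unfamiliar with the 95% Hausdorff distance, a brute-force sketch on two small sets of surface points is given below. The definition used here (the larger of the two directed 95th-percentile distances, with a nearest-rank percentile) is one common convention and is an assumption; the exact formula used in the study is the one given in Table 2.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def directed_distances(src, dst):
    """Distance from every point in src to its nearest point in dst."""
    return [min(math.dist(a, b) for b in dst) for a in src]


def hd95(points_a, points_b):
    """Symmetric 95% Hausdorff distance between two point sets."""
    d_ab = directed_distances(points_a, points_b)
    d_ba = directed_distances(points_b, points_a)
    return max(percentile(d_ab, 95), percentile(d_ba, 95))


# Hypothetical surface points in millimetres.
surface_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
surface_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
print(hd95(surface_a, surface_b))
```

Using the 95th percentile instead of the maximum makes the measure less sensitive to a few outlying surface points. The brute-force search is quadratic in the number of points, so real meshes would need a spatial index.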

#### Time efficiency

The time required for semi-automatic segmentation was measured from opening the DICOM files in Mimics until export of the STL file. For automatic segmentation, the algorithm automatically recorded the time required to produce a full-resolution segmentation. The time for the refined segmentation was measured in the same way as for semi-automatic segmentation and then added to the initial automatic segmentation time. The average time for each method was calculated over the testing set.

#### Accuracy

A voxel-wise comparison between the ground truth, automatic and refined segmentations of the testing set was performed by applying a confusion matrix with four variables: true positive (TP), true negative (TN), false positive (FP) and false negative (FN) voxels. Based on these variables, the accuracy of the CNN model was assessed according to the metrics listed in Table 2.
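The voxel-wise counts and two of the overlap metrics named earlier (DSC and IoU) can be sketched as follows. The formulas are the standard definitions and may differ in detail from Table 2; the toy masks are made up for illustration.

```python
def confusion_counts(truth, pred):
    """Count TP, TN, FP and FN over two flat binary masks of equal length."""
    tp = tn = fp = fn = 0
    for t, p in zip(truth, pred):
        if t and p:
            tp += 1
        elif t and not p:
            fn += 1
        elif not t and p:
            fp += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def dice(tp, fp, fn):
    """Dice similarity coefficient: 2TP / (2TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)


def iou(tp, fp, fn):
    """Intersection over union: TP / (TP + FP + FN)."""
    return tp / (tp + fp + fn)


# Made-up six-voxel masks; both functions assume at least one positive voxel.
truth = [1, 1, 1, 0, 0, 0]
pred = [1, 1, 0, 1, 0, 0]
tp, tn, fp, fn = confusion_counts(truth, pred)
print(tp, tn, fp, fn, dice(tp, fp, fn), iou(tp, fp, fn))
```

The two metrics are linked by IoU = DSC / (2 − DSC), so they always rank segmentations in the same order.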

#### Consistency

Once trained, the CNN model is deterministic; hence, it was not evaluated for consistency. For illustration, one scan was uploaded twice to the platform and the resulting STL files were compared. Intra- and inter-observer consistency were calculated for the semi-automatic and refined segmentations. The intra-observer reliability of the main observer was calculated by re-segmenting 10 scans from the testing set with different protocols. For the inter-observer reliability, two observers (NM and KFV) performed the needed refinements, and the resulting STL files were compared with each other.

### Statistical analysis

Data were analyzed with RStudio: Integrated Development Environment for R, version 1.3.1093 (RStudio, PBC, Boston, MA). The mean and standard deviation were calculated for all evaluation metrics. A paired-sample t-test was performed at a significance level of p < 0.05 to compare the time required for semi-automatic and automatic segmentation of the testing set.
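The study ran its analysis in R; for consistency with the other sketches here, the test statistic of a paired-sample t-test is illustrated in Python. The timing values are made up and are not study data, and the function name is hypothetical. Converting the statistic to a p-value requires the t distribution, which is left to a statistics package.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation of the differences
    t = mean_d / (sd_d / math.sqrt(n))
    return t, n - 1


# Made-up segmentation times in minutes for three scans (not study data).
semi_automatic = [10.0, 12.0, 14.0]
automatic = [1.0, 2.0, 4.0]
t_value, dof = paired_t_statistic(semi_automatic, automatic)
print(t_value, dof)
```

Pairing is appropriate because both methods are timed on the same scans, so the test works on the per-scan differences rather than on two independent groups.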