Virtual tests for COVID-19 based on machine learning are at hand


In this post, we review a machine-learning-based model that was trained to predict the results of COVID-19 testing based on clinical symptoms, vital signs, and background data only. The model was trained on a high-quality clinical data repository that was collected by Carbon Health and Braid Health in California and published as an open dataset. The dataset is relatively small, including about 1,600 SARS-CoV-2 RNA RT-PCR tests. Nevertheless, the results are encouraging. The trained model reaches a ROC AUC of roughly 0.7 on held-out data: given a randomly chosen positive test and a randomly chosen negative test, it ranks the positive one higher about 70% of the time. Our work suggests that machine learning models could be included as part of routine screening for COVID-19 and can assist in prioritizing RT-PCR testing.

We estimate that training such models on larger datasets will improve the prediction performance and enable “virtual testing” to be carried out. Such virtual tests will allow rapid prioritization of RT-PCR testing at scale and will increase the effective number of people tested for COVID-19. We encourage additional test centers to publish de-identified clinical data of patients who have taken a COVID-19 test.

The dataset

Some important facts about the dataset that we used for the experiment described in this blog post:

  1. It is maintained by Braid Health and Carbon Health, two providers of health services from the San Francisco Bay Area.
  2. It is relatively small and includes, as of this moment, 1,610 records—1,610 RT-PCR tests performed for COVID-19. Of these, 1,509 tests have a negative result, and 101 tests have a positive result.
  3. The dataset is very rich. Each record includes background data on the patient, epidemiological factors, comorbidities, vital signs, patient self-reported symptoms, clinician-assessed symptoms, and more.
  4. Some of the records also include links to the corresponding chest radiology images and findings—data that we did not use in the present analysis.
  5. The data seems to be of high quality. We did not detect any obvious mistakes or biases.
  6. It maintains the privacy of the subjects and is compliant with HIPAA Privacy Rule’s De-Identification Standard.
  7. It is an open dataset, published under a Creative Commons license.
  8. Braid Health and Carbon Health intend to publish updates to the dataset.

Purpose and general approach

The purpose of our analysis is to examine the feasibility of our idea—whether machine learning models can be trained to predict some of the results of the COVID-19 RT-PCR tests at sufficient sensitivity and specificity based on reported and assessed clinical symptoms, vital signs, and epidemiological factors only.

Furthermore, it is important to mention what this analysis is not about:

  1. We’re not competing in a Kaggle competition. We do not have a public or private test set, and we are not monitoring our position on a leaderboard. Our goal is not to squeeze out the last bit of accuracy. We don’t intend to train a giant ensemble of ensembles of models (those tend to win most Kaggle competitions). Our goal is essentially descriptive and exploratory: we want to explore whether this problem is predictable. The dataset is relatively small, so we don’t have enough data for a thorough search for the optimal model and feature set. We allow ourselves to use all the data to explore several hypotheses. It is quite clear that we will not leave here, after training a model on 1,610 records, with a real-world, production-ready model. If we can train a model that demonstrates reasonable predictability on unseen data, on this small dataset, that would be very encouraging. What counts as reasonable predictability? According to this article, a binary classification model with a ROC AUC score greater than 0.7 is considered acceptable (0.7–0.8: acceptable, 0.8–0.9: excellent, above 0.9: outstanding).
  2. This is not a content marketing post. “If you torture the data long enough, it will confess to anything,” taught us the economist Ronald Coase. “You juke the stats, and majors become colonels,” declared The Wire detective Roland “Prez” Pryzbylewski. I’ve been here before. It is very easy to train too strong a model, to snoop on the data, to cherry-pick the results, and to create wonderful content (that is worthless). That’s not this.

So our strategy is simple. We have a relatively small dataset of high quality. We will try to train a simple predictive model, or more precisely, an ensemble of simple models. We will try to “keep” the models simple enough to prevent excessive overfitting. We will avoid extensive feature engineering and complete hyperparameter search, as we don’t have enough data for this. We will explore the results and assess whether they demonstrate a predictive relation between the input features and the target variable.

Shall we?

Dataset exploration

The dataset includes 1,610 records, of which 1,509 are negative and 101 positive for COVID-19. The first warning sign—this is an imbalanced dataset; we will have to pay careful attention to that.

The attributes of each test can be roughly divided into seven categories: background data, vital signs, clinician-assessed symptoms, patient-reported symptoms, comorbidities, data about the test itself, and the test results. In the table below, we have summarized the properties that we have used (full details of the properties appear in the dictionary attached to the original data repository).

Category | Properties
Background data and epidemiological risk factors | Patient age; whether the patient is in a profession with a high risk of exposure; whether the patient might have been exposed through a high-risk contact.
Vitals | Temperature, pulse, blood pressure (systolic and diastolic), respiratory rate, oxygen saturation; rapid test results for flu and for strep (if taken).
Clinician-assessed symptoms | Whether the lung exam is normal (clear to auscultation bilaterally); whether the patient is experiencing labored respiration, rhonchi (coarse rattling sound like snoring), or wheezing (high-pitched whistling sound).
Patient-reported symptoms | Number of days from symptom onset; whether the patient reports cough and shortness of breath (including their severity on a mild–moderate–severe scale); fever, diarrhea, fatigue, headache, loss of smell, loss of taste, runny nose, muscle pain, sore throat.
Comorbidities | Whether the patient has diabetes, coronary heart disease, hypertension, cancer, asthma, chronic obstructive pulmonary disease, or autoimmune disease.
Test data | Test name: Rapid COVID-19 Test, SARS COV 2 RNA RTPCR, SARS COV2 NAAT, SARS CoV w/ CoV 2 RNA, SARS-CoV-2. Sample collection area: nasal, nasopharyngeal, oropharyngeal.
COVID-19 test result | Positive/Negative.

The age distribution of the patients, as expected, is bell-shaped, with the left portion of the distribution appearing truncated, showing very few patients in the 0–20 age range, probably because children and adolescents seem to be less affected by the coronavirus. The distributions of positive and negative cases differ visibly, which may indicate that age is relevant as a predictive feature.

Figure 1: Age distribution of patients—all patients (left), negative-tested patients (center), and positive-tested patients (right).

Vital signs, such as body temperature and pulse, also show an approximate normal distribution and some differences between the positive and negative subjects. Because the data is imbalanced, the negative test group is the dominant group and dictates the distribution of the entire dataset:

Figure 2: Body temperature distribution (top) and pulse distribution (bottom). Left to right: all subjects, negative subjects, and positive subjects.

The dataset includes quite a few missing values, probably due to some latent aspects of the data collection mechanism. In figure 3 below, we show the distribution of the missing values (features that didn’t have any missing value are omitted from the figure).

Figure 3: Missing values. Red: values that are missing for positive subjects. Blue: values that are missing for negative subjects. Left: absolute numbers (the number of tests that include missing values for each feature). Right: relative numbers (the ratio between the missing reports among the positive subjects and the total missing reports for each feature). The orange vertical line on the right represents the percentage of positive tests out of all tests (6.3%). Features that had no missing values are not shown in the figure.

Two major conclusions can be drawn from the missing value figure above:

  1. It reinforces our assessment that the dataset is of high quality. We didn’t find any evidence that the dataset contains a non-reporting bias in favor of the positive or negative patients (“missing not at random” bias). Contrary to a similar dataset published by the Israeli Ministry of Health, here we have no evidence that the positive patients suffer from over-reporting or under-reporting of any feature such as symptoms. For each feature, the share of positive patients among all patients with a missing value is approximately constant and approximately equal to the ratio of positive patients to all tested patients (6.3%). This can be clearly seen in the chart on the right: the length of each red bar is close to the vertical orange line representing a value of 6.3%. The only notable exception is the pulse rate, where positive patients have disproportionately few missing values. However, from the left graph, it seems like a “small numbers” fluctuation: the dataset contains only a small number of tests for which the pulse is not reported (out of the 1,610 patients, only 71 didn’t have a pulse value reported, of which only 1 is positive).
  2. The features that represent patient-reported symptoms suffer from a very large portion of missing values. For symptoms such as sore throat, muscle pain, runny nose, loss of taste and smell, headache, fatigue, and diarrhea, most of the values are missing. Such a large portion of missing values can significantly decrease the diagnostic value of those features. We will, therefore, consider combining them into aggregated features (in the “feature engineering” phase, which will be described later) to try to capture whatever signal they still carry.
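
The comparison behind figure 3 can be sketched in a few lines of pandas. This is a minimal illustration, not the code used for the figure: the column names and the toy data are hypothetical, and the function name `missing_share_by_class` is ours.

```python
import numpy as np
import pandas as pd


def missing_share_by_class(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Count missing values per feature and the share that belongs to positive tests."""
    is_missing = df.drop(columns=[target]).isna()
    positive = df[target] == 1
    summary = pd.DataFrame({
        "missing_total": is_missing.sum(),
        "missing_positive": is_missing[positive].sum(),
    })
    # Features without missing values are omitted, as in the figure.
    summary = summary[summary["missing_total"] > 0].copy()
    summary["positive_share"] = summary["missing_positive"] / summary["missing_total"]
    return summary


# Toy data with hypothetical column names (not the real dataset schema).
toy = pd.DataFrame({
    "covid19_test_results": [1, 0, 0, 0, 1, 0],
    "temperature": [38.1, np.nan, 36.8, 37.0, np.nan, 36.6],
    "pulse": [90, 72, np.nan, 80, 85, 77],
    "age": [34, 51, 29, 62, 45, 38],
})
report = missing_share_by_class(toy, "covid19_test_results")
baseline = (toy["covid19_test_results"] == 1).mean()
print(report)
print(f"share of positive tests overall: {baseline:.3f}")
```

If `positive_share` stays close to the overall share of positive tests for every feature, there is no obvious sign that missingness depends on the test result.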

In figure 4 below, we show the correlation between the different features.

Figure 4: Heatmap of correlation between each pair of features. Each row is a feature (the first row is the test result). Each column is also a feature. The value in each cell represents the Pearson correlation coefficient between the two features: dark cells represent a strong positive correlation, light cells represent a strong negative correlation.

Most features are not correlated with each other, which is a desirable trait, as it increases the predictive potential of each feature. In addition, it can be seen that none of the features correlates strongly with the target variable (test result). That is, our model will have to “work hard” and learn a non-trivial relation between the input and the output variables. If, for example, most people who are infected with COVID-19 suffered from high fever, and most of those who suffer from high fever and are tested for COVID-19 did indeed test positive, we would have seen a strong correlation between the two (body temperature and test result). The absence of such a direct correlation suggests that the relationship between the input variables and the target variable, if present, is not straightforward. The lack of a strong correlation between any feature and the target variable also reduces the concern that information from the target variable has “leaked” into one of the other features.

Three “chunks” of features demonstrate a relatively strong correlation:

  1. The chunk in the center includes respiratory features. CTAB, rhonchi, wheezes, cough—are all respiratory features—so a relatively strong correlation between them, positive or negative, is somewhat expected.
  2. The darker right lower chunk is of symptoms such as fatigue, headache, diarrhea, loss of taste and smell, runny nose, muscle pain, and sore throat. Even though their reporting is relatively rare in the dataset (we saw in figure 3 that the reports for those features were missing for most patients), it still makes sense that those symptoms correlate, as they represent general illness symptoms (not only COVID-19 related).
  3. The bright chunk in the lower right represents an elementary correlation due to the one-hot encoding we used to code the categorical features of the test type and swab type as well as an expected relationship between the type of test and the swab type.
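
A correlation matrix like the one in figure 4 can be computed directly with pandas. The sketch below uses synthetic data with hypothetical column names; the `wheezes` column is deliberately built to agree with `cough` most of the time, to mimic a correlated “chunk”.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
cough = rng.integers(0, 2, n)

# Synthetic, already-numeric features (hypothetical names).
toy = pd.DataFrame({
    "test_result": rng.integers(0, 2, n),
    "cough": cough,
    # Agrees with cough about 80% of the time, so the two should correlate.
    "wheezes": np.where(rng.random(n) < 0.8, cough, 1 - cough),
    "temperature": rng.normal(37.0, 0.5, n),
})

# Pearson correlation between every pair of columns.
corr = toy.corr(method="pearson")

# Absolute correlation of each feature with the target, strongest first.
target_corr = corr["test_result"].drop("test_result").abs().sort_values(ascending=False)
print(corr.round(2))
print(target_corr.round(2))
```

Scanning `target_corr` is a quick way to check that no single feature is suspiciously close to the target, which is the leakage concern mentioned above.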


Data cleaning and feature engineering

The dataset was published in CSV format and included various types of variables (numeric, categorical, Boolean, etc.). We performed a simple and basic “data cleaning” step, which included:

  1. Omitting the records for which the test results are unknown (there was one such record).
  2. Transforming all the variables to numeric types (categorical variables were one-hot encoded).
  3. Completing missing values. As mentioned, some of the data is missing. We have used a simple and naive imputation technique. (We simply substituted each missing value with the corresponding column average.)
    Note: The imputation was performed using all the data, which produces a certain “leakage” from the test set to the training set. We performed a sensitivity test and found that the effect of this leakage is negligible. The sensitivity test was done by substituting all the missing values with an arbitrary constant and comparing the ROC AUC (less than 0.01 difference in ROC AUC was found, cross-validated).
    Looking forward: more sophisticated methods for missing value data imputation may slightly improve results. In addition, due to the fact that some of the features that are mostly missing (specifically patient-reported symptoms such as anosmia) were found in other studies to be strongly correlated to COVID-19, we recommend that the data repository maintainers consider, if possible, filling missing values (backward and forward-looking). 
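
The three cleaning steps above can be sketched as follows. This is an illustration under assumptions, not the notebook code: the column names, the `"Positive"` label value, and the helper name `clean` are hypothetical.

```python
import numpy as np
import pandas as pd


def clean(df: pd.DataFrame, target: str):
    """Drop unlabeled rows, one-hot encode categoricals, impute column means."""
    # 1. Omit records whose test result is unknown.
    df = df.dropna(subset=[target])
    y = (df[target] == "Positive").astype(int)
    # 2. Turn every variable into a numeric one (categoricals are one-hot encoded).
    X = pd.get_dummies(df.drop(columns=[target]), dtype=float)
    # 3. Naive imputation: replace each missing value with the column average.
    X = X.fillna(X.mean())
    return X, y


# Toy data with hypothetical column names and values.
toy = pd.DataFrame({
    "covid19_test_results": ["Negative", "Positive", None, "Negative"],
    "temperature": [36.6, np.nan, 37.2, 37.4],
    "swab_type": ["Nasal", "Nasopharyngeal", "Nasal", "Nasal"],
})
X, y = clean(toy, "covid19_test_results")
print(X)
print(y.tolist())
```

As in the note above, computing the column averages over all records lets a little test-set information into training. Computing them on the training portion only and reusing them for the test portion avoids that.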

After we arranged and cleaned up the data, we went on to feature engineering. In addition to the original features, we chose to engineer a small number of aggregated features: sums and products of features that are logically grouped into the same category. For example, we added a feature that counts the number of symptoms the patient has reported and a feature that multiplies the different vital signs (pulse, blood pressure, etc.) for each patient. Looking forward, it is possible that engineering more complex features will improve the results, especially features that incorporate domain knowledge.
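
Aggregated features of this kind take only a few lines. In the sketch below, the input column names and the grouping into `SYMPTOMS` and `VITALS` are hypothetical; only the idea (a symptom count, a sum and a product of vitals) comes from the text.

```python
import pandas as pd

# Hypothetical groupings; the real dataset has more columns in each category.
SYMPTOMS = ["cough", "fever", "fatigue"]
VITALS = ["temperature", "pulse", "systolic", "diastolic"]


def add_aggregates(X: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of X with simple sum/product features per category."""
    out = X.copy()
    out["symptoms_count"] = out[SYMPTOMS].sum(axis=1)
    out["vitals_sum"] = out[VITALS].sum(axis=1)
    out["vitals_multiplication"] = out[VITALS].prod(axis=1)
    return out


toy = pd.DataFrame({
    "cough": [1, 0], "fever": [0, 0], "fatigue": [1, 0],
    "temperature": [37.0, 36.5], "pulse": [80, 70],
    "systolic": [120, 110], "diastolic": [80, 70],
})
engineered = add_aggregates(toy)
print(engineered[["symptoms_count", "vitals_sum", "vitals_multiplication"]])
```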

Basic choice of prediction model

We chose to train a random forest prediction model, an ensemble of decision trees that makes a binary classification decision, predicting a positive or negative outcome of COVID-19 RT-PCR tests. This choice rests on the fact that random forest models generalize relatively well and are relatively easy to regularize against overfitting, both because of their basic ensemble architecture and because of the bagging-based training process. We also experimented briefly with other types of models (feedforward neural networks and logistic regression), and their performance was comparable.

We have chosen to set the model’s hyperparameters in advance and run with them throughout the experiment. A more accurate choice of hyperparameters may improve the prediction results, but at this point we chose not to invest energy in it. This is for two main reasons:

  1. Our dataset is relatively small. We don’t have enough data to conduct a thorough search for an optimal model in the hyperparameter space without using most of the data and then risk overfitting again.
  2. Our goal is not to “squeeze” every possible milligram of performance from this dataset. We are not trying to win a Kaggle competition but only to test the degree of predictability of this challenge.

The key hyperparameters we defined for the model:

  1. Relatively large number of trees in the ensemble (100) to prevent overfitting. Random forest models follow Condorcet’s jury theorem: if certain conditions are met (each voter is right more often than not, and the voters err independently), the probability of reaching a correct binary decision by a majority vote of the jury is higher than the probability of a correct decision by an individual voter, and the more jurors the jury includes (in our case, the more decision trees in the ensemble), the higher the probability of making the right decision.
  2. Relatively shallow maximum depth of each tree (3 levels), which keeps the trees from growing deep enough to “memorize” the data.
  3. Weigh samples in the training phase (class_weight: balanced) to deal with the imbalanced nature of our data (6.3% of positive tests).
  4. Prevent overly fine partitioning by setting two split thresholds (min_weight_fraction_leaf and min_impurity_decrease), again, to reduce overfitting.
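
In scikit-learn terms, the configuration listed above might look like the sketch below. The number of trees, the depth, and the class weighting come from the text. The values for `min_weight_fraction_leaf` and `min_impurity_decrease` are not stated in the post, so the ones used here are placeholders, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def make_model(seed: int = 0) -> RandomForestClassifier:
    return RandomForestClassifier(
        n_estimators=100,               # many trees, majority vote
        max_depth=3,                    # shallow trees cannot memorize the data
        class_weight="balanced",        # compensate for ~6% positives
        min_weight_fraction_leaf=0.01,  # placeholder value (assumption)
        min_impurity_decrease=0.001,    # placeholder value (assumption)
        random_state=seed,
    )


# Synthetic imbalanced data standing in for the real records.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5,
    weights=[0.94, 0.06], random_state=0,
)
model = make_model().fit(X, y)
proba = model.predict_proba(X)[:, 1]  # predicted probability of a positive test
print(proba[:5].round(3))
```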

Model training: cross-validation

In the first step, we trained the model using the cross-validation approach. We divided the dataset into 12 pieces and performed 12 training sessions. In each run, we trained the random forest model on 11 pieces of the data and evaluated the model’s performance on the remaining piece (which was not used for training). We averaged the results to create an average ROC curve, as seen in figure 5 below:

Figure 5: ROC (Receiver Operating Characteristic) curve of the trained random forest classifier. The curve shows the true positive rate (y-axis, also called recall or sensitivity) versus the false positive rate (x-axis, equal to one minus specificity) at different binary-decision thresholds. Each thin line represents one cross-validation run, evaluated on its 11 training pieces (purple) and on its one test piece (orange). The thick lines represent a theoretical model obtained by averaging the 12 runs. The gray stripe represents plus or minus one standard deviation around the mean on the test set. The light blue dashed line represents the ROC of a random guessing classifier (coin toss).
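
The 12-fold procedure described above can be sketched as follows, on synthetic data. Two details are assumptions of this sketch rather than facts from the post: the folds are stratified (so that every test piece contains some positives), and the two split-threshold hyperparameters are left at their defaults.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced data standing in for the real records.
X, y = make_classification(
    n_samples=1200, n_features=20, n_informative=5,
    weights=[0.94, 0.06], random_state=0,
)

cv = StratifiedKFold(n_splits=12, shuffle=True, random_state=0)
train_auc, test_auc = [], []
for train_idx, test_idx in cv.split(X, y):
    model = RandomForestClassifier(
        n_estimators=100, max_depth=3, class_weight="balanced", random_state=0
    )
    model.fit(X[train_idx], y[train_idx])
    # Score the 11 training pieces and the held-out piece separately.
    train_auc.append(roc_auc_score(y[train_idx], model.predict_proba(X[train_idx])[:, 1]))
    test_auc.append(roc_auc_score(y[test_idx], model.predict_proba(X[test_idx])[:, 1]))

print(f"train ROC AUC: {np.mean(train_auc):.2f} +/- {np.std(train_auc):.2f}")
print(f"test  ROC AUC: {np.mean(test_auc):.2f} +/- {np.std(test_auc):.2f}")
```

The gap between the two averages is a rough indicator of overfitting, and the standard deviation on the test pieces corresponds to the gray stripe in the figure.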

What can you learn from the figure?

  1. The average ROC curve on both the training runs (purple line) and the test runs (orange line) is better than a random guess (light blue dashed line). That is to say, despite our small dataset, a person who comes to perform a COVID-19 test at Carbon Health or Braid Health would do better to trust the prediction of our model than to toss a coin. Given that people are not randomly tested, and that certain eligibility criteria must be met in order to get tested for COVID-19, this is not a trivial result at all.
  2. The area under the average ROC curve (ROC AUC, Area Under the Curve) is 0.71 on test runs and 0.90 on training runs. ROC AUC is a common metric for evaluating classification models because it describes the strength of the model and is not sensitive to the selection of a specific decision threshold. The intuitive interpretation of the ROC AUC = 0.71 we obtained is as follows: there is a probability of 71% that a randomly chosen actual positive patient will be ranked by our model above a randomly chosen negative patient (that is, will be assigned a higher probability of being positive). As mentioned, a ROC AUC above 0.7 is considered acceptable, above 0.8 excellent, and above 0.9 outstanding. That is, the model shows exactly what we wanted to test: the output seems predictable, at least partially, from the input.
  3. Weighing the known bias–variance trade-off, it seems that our model suffers from relatively high variance and relatively low bias. In other words, our model suffers from some overfitting to the training data. A finer calibration of the model’s hyperparameters is likely to lead to a better balance between the two, which means reducing the ROC AUC on the training runs but increasing the ROC AUC on the test runs. As mentioned before, this is not quite our goal in this experiment, so we will leave this delicate calibration for future work.
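
The ranking interpretation of ROC AUC given above can be checked by brute force: compare every positive test with every negative test and count how often the positive one receives the higher score (ties count as half). The labels and scores below are made up for illustration.

```python
def pairwise_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up example: two positive tests, four negative tests.
labels = [0, 0, 1, 0, 1, 0]
scores = [0.10, 0.40, 0.35, 0.20, 0.80, 0.35]
auc = pairwise_auc(labels, scores)
print(auc)  # 6.5 correctly ranked pairs out of 8 -> 0.8125
```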

Figure 6 below shows the most important features that the model has learned (feature importance). It can be seen that the model uses features from different categories (vital signs, symptoms, epidemiological factors), which is evidence that the information collected indeed carries diagnostic value for our prediction task.

Figure 6: Feature importances of the trained model.

Figure 6 shows that the body temperature has a high importance in the learned model. We performed a rough sensitivity test for this temperature feature. We omitted it (and the engineered features that included it) from the dataset and performed another cross-validation model training round when the other parameters were kept the same. We indeed saw a non-negligible reduction in the model performance (ROC AUC decrease of 0.03 on the training pieces and 0.04 on the test pieces). Conclusion: body temperature is indeed a valuable feature in predicting the test result. The residual effects of each of the key features of the model (“one-at-a-time” sensitivity analysis) can be further examined.
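
A drop-one-feature sensitivity test of this kind can be sketched as below. Everything here is synthetic: the data is generated so that `temperature` drives the label, the column names are hypothetical, and the helper name `auc_drop_without` is ours. The `columns` argument takes a list so that engineered features containing the dropped feature can be removed along with it.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score


def auc_drop_without(model, X, y, columns, cv):
    """Cross-validated ROC AUC lost when the given columns are removed."""
    full = cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
    reduced = cross_val_score(
        model, X.drop(columns=columns), y, cv=cv, scoring="roc_auc"
    ).mean()
    return full - reduced


# Synthetic data in which temperature (plus noise) determines the label.
rng = np.random.default_rng(0)
n = 600
X = pd.DataFrame({
    "temperature": rng.normal(37.0, 0.6, n),
    "pulse": rng.normal(80, 10, n),
    "age": rng.integers(18, 80, n).astype(float),
})
y = (X["temperature"] + rng.normal(0, 0.3, n) > 37.8).astype(int)

model = RandomForestClassifier(
    n_estimators=100, max_depth=3, class_weight="balanced", random_state=0
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
delta_temperature = auc_drop_without(model, X, y, ["temperature"], cv)
delta_age = auc_drop_without(model, X, y, ["age"], cv)
print(f"AUC lost without temperature: {delta_temperature:.3f}")
print(f"AUC lost without age:         {delta_age:.3f}")
```

Repeating the call for each key feature gives the “one-at-a-time” analysis mentioned above.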

Another interesting observation from the feature importances figure is that the aggregated features we engineered were indeed used by the learned model (vitals_sum_with_age, vitals_multiplication, etc.). Because our dataset is relatively small, it makes sense, as weak signals of separate features are amplified by aggregating them together. This is a point for future exploration. The aggregated features we engineered are simple and do not exploit clinical domain knowledge. A better feature engineering is likely to improve the model performance.

Model training—train/test split

In the second stage, we performed a “traditional” split of the dataset into training and test sets (80%-20%). The aim is to get one model and test its performance on data that was not used at all during the training phase. Such an approach makes the results easier to analyze, although it is clearly just a special case of the cross-validation approach presented earlier.

We trained a model with exactly the same hyperparameters from the previous step, and the obtained results (visualized with the ROC curve) are shown in the following figure:

Figure 7: ROC curve of random forest trained classifier. The curve shows the true positive rate (y-axis) versus the false positive rate (x-axis) given different binary-decision thresholds versus the training set (purple) and the test set (orange). The light blue dashed line represents the ROC of a random guessing classifier (coin toss).

Figure 7 shows that a good model was obtained on the training set (ROC AUC = 0.90) and on the test set (ROC AUC = 0.74). We should not be overly excited: there is probably some random luck here (as we saw in the cross-validation runs, different random train-test splits yield different results; we may have landed on a relatively easy test piece).
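
The “random luck” effect is easy to see by repeating the 80%-20% split with different random seeds. The sketch below uses synthetic data; stratifying the split by label is an assumption of this sketch, not something stated in the post.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the real records.
X, y = make_classification(
    n_samples=1600, n_features=20, n_informative=5,
    weights=[0.94, 0.06], random_state=0,
)


def holdout_auc(split_seed: int) -> float:
    """Train on 80% of the data and return ROC AUC on the remaining 20%."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=split_seed
    )
    model = RandomForestClassifier(
        n_estimators=100, max_depth=3, class_weight="balanced", random_state=0
    )
    model.fit(X_tr, y_tr)
    return roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])


# Same model, same data, five different splits.
aucs = [holdout_auc(seed) for seed in range(5)]
print([round(a, 3) for a in aucs])
print(f"spread: {max(aucs) - min(aucs):.3f}")
```

With only a few dozen positives in the test set, the spread between splits can easily be a few points of ROC AUC, which is why a single split should be read with caution.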

In figure 8 below, we show a scatterplot of the model predictions:

Figure 8: Scatterplot of the probabilities predicted by the model. Every point is a test. The color represents the true test result (red: positive, blue: negative). The x-axis describes the probability score that the trained model assigns to the test to be positive—for example, the model estimates that the rightmost sample in the scatter is positive at about 0.77. The y-axis represents nothing; differences in this axis are for display purposes only. Above: scatterplot of the model’s predictions on the training set. Below: Scatterplot of the model’s predictions on the test set.

Figure 8 shows that both on the training set and the test set, there is a tendency for red dots to concentrate on the right and a tendency for blue dots to concentrate on the left—that is, the model, in general, separates positive and negative tests. Moreover, this trend is more pronounced in the training set than in the test set (high variance). In addition, the figure shows that a binary decision threshold can be set to separate distinct from indistinct results according to the usage scenario. For example, if we take only predictions that score higher than 0.7 and classify them as positive for COVID-19, and all other tests are classified as “unknown,” no false positives will be classified in the training set and test set (however, the recall will be very low). Alternatively, if we classify all predictions that score below 0.3 as negative and classify all other tests as “unknown,” 10% of the tests can be spared (because they will be correctly classified as negative) without any actual positive falsely predicted negative!
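
The two examples above can be combined into a single three-way triage rule. The sketch below merges the 0.3 and 0.7 cut-offs from the text into one function, which is our own illustration; the scores and labels are made up.

```python
def triage(score: float, low: float = 0.3, high: float = 0.7) -> str:
    """Map a predicted probability to a triage decision."""
    if score > high:
        return "positive"
    if score < low:
        return "negative"
    return "unknown"  # still needs an RT-PCR test


# Made-up model scores and true test results (1 = positive).
scores = [0.05, 0.12, 0.25, 0.33, 0.48, 0.52, 0.61, 0.74, 0.77, 0.28]
labels = [0, 0, 0, 0, 0, 1, 0, 1, 1, 0]

decisions = [triage(s) for s in scores]
spared = decisions.count("negative") / len(decisions)
missed_positives = sum(
    1 for d, y in zip(decisions, labels) if d == "negative" and y == 1
)
false_positives = sum(
    1 for d, y in zip(decisions, labels) if d == "positive" and y == 0
)
print(decisions)
print(f"tests spared: {spared:.0%}, missed positives: {missed_positives}, "
      f"false positives: {false_positives}")
```

Moving `low` and `high` trades the share of tests that can be decided without RT-PCR against the risk of a wrong call, so the cut-offs should be chosen per usage scenario.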

This scatterplot gives a better intuition for the quality of the model and for the meaning of the ROC curves presented earlier. For each decision threshold we set (assuming that we do not have the privilege to classify tests as “unknown” and that we must classify each test as positive or negative), four sets of predictions are obtained: true positives, true negatives, false positives, and false negatives. The ROC curve shown in figure 7 is built by sweeping the decision threshold, counting the FP and TP at each threshold, and plotting the resulting rates.
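
The threshold sweep behind a ROC curve can be written out by hand. The sketch below uses made-up labels and scores, treats a score at or above the threshold as a positive prediction, and integrates the curve with the trapezoid rule.

```python
def roc_points(labels, scores):
    """Sweep the threshold from high to low; return (FPR, TPR) points."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under(points):
    """Trapezoid-rule area under a piecewise-linear curve."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Made-up example: two positive tests, four negative tests.
labels = [0, 0, 1, 0, 1, 0]
scores = [0.10, 0.40, 0.35, 0.20, 0.80, 0.35]
points = roc_points(labels, scores)
print(points)
print(area_under(points))  # 0.8125
```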

We will look at the same data, this time in a binned fashion. Figure 9 shows a histogram of the model’s predictions (weighted to balance the imbalance between the positive and negative tests):

Figure 9: Model predictions histogram. The x-axis expresses the probability score the model gave for tests to be positive for COVID-19. The color represents the actual result of the test (red: positive, blue: negative). Top chart: training set, bottom chart: test set. Because the dataset is imbalanced, the actual positive test histogram was given a larger weight so that it represents the same area as the actual negative test histogram. 

The figure clearly shows that our model has learned to separate the distinct tests, either positive or negative (the left end of the blue distribution and the right end of the red distribution). These are the histogram areas where you can only find blue columns (left) or only red columns (right) without the two overlapping. However, the chart also clearly shows that there is a large proportion of tests, larger in the test set than in the training set, where the distributions overlap. That is, our model could not consistently separate them correctly, and all we have left is to choose whether we prefer false positive errors (and then set a lower threshold) or false negative errors (and then set a higher threshold). If our model were a perfect classification model, the red and blue distributions would not overlap at all, and a decision threshold would separate them completely.
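
A class-balanced histogram like figure 9 can be computed with per-sample weights. The sketch below normalizes each class to a total area of one, which is one simple way to give both histograms the same area; the exact weighting used for the figure may differ, and the scores are made up.

```python
import numpy as np


def balanced_histograms(scores, labels, bins=10):
    """Histogram the scores of each class, each normalized to a total of 1."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    edges = np.linspace(0.0, 1.0, bins + 1)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # Each sample weighs 1 / (class size), so the rare class is not dwarfed.
    pos_hist, _ = np.histogram(pos, bins=edges, weights=np.full(len(pos), 1.0 / len(pos)))
    neg_hist, _ = np.histogram(neg, bins=edges, weights=np.full(len(neg), 1.0 / len(neg)))
    return edges, neg_hist, pos_hist


# Made-up model scores and true test results (1 = positive).
scores = [0.05, 0.12, 0.25, 0.33, 0.48, 0.52, 0.61, 0.74, 0.77, 0.28]
labels = [0, 0, 0, 0, 0, 1, 0, 1, 1, 0]
edges, neg_hist, pos_hist = balanced_histograms(scores, labels, bins=5)
print(neg_hist.round(2))
print(pos_hist.round(2))
```

Bins where only one of the two histograms is non-zero correspond to the distinct regions described above; bins where both are non-zero are the overlap.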


Conclusions

  1. We observed a significant diagnostic value of the features recorded in the Carbon Health and Braid Health dataset for the task of predicting the actual results of RT-PCR tests for COVID-19. The relationship between the data on symptoms, vital signs, and epidemiological factors and whether the patient is positive or negative for COVID-19 does not appear to be incidental.
  2. We have seen that the relatively simple random forest model, trained on the dataset, enables us to construct a classifier that systematically predicts the test result better than a coin toss. Taking into account the fact that people are eligible to perform an RT-PCR test only when there is a genuine concern that they have been infected with the virus, this result is not trivial.
  3. We have seen that the trained model is a good predictor of the results “at the extremes of the distribution”: the distinct positive tests and especially the distinct negative tests. This means that with very little effort, it is likely that such a model can be used to prioritize candidates for testing and to make decisions about a subset of the prospective group (the distinct group). For example, if a patient comes out as a “distinct positive” for COVID-19 in our virtual test, she can be presumed positive and instructed to self-quarantine even before a positive RT-PCR test result is obtained. Patients who are “distinctly negative” can be spared the need for RT-PCR testing.
  4. We have seen that the trained model is incomplete and does not predict the results of most of the tests well. At the center of the distribution, the positive and negative distributions overlap, and there is not a sufficiently strong separation between the two.
  5. Even without additional data, model performance can probably be improved by more subtle and accurate calibration of model hyperparameters by using advanced data imputation techniques and by engineering additional features from the existing data. In our estimation, the potential for improvement of these types of moves, although existing, is relatively low (will add a few points to the ROC AUC).
  6. Enlarging the dataset by at least two orders of magnitude (to about 100,000 tests and more) is expected to greatly improve the model performance and generalization strength. A larger dataset will probably allow us to train bigger and stronger models that can capture more sophisticated relations.


Summary

In this blog post, we have explored the significant potential of performing machine-learning-based virtual testing for COVID-19. Turning the capability demonstrated here into a real-world-ready model is, in our estimation, likely to be a real force multiplier in the worldwide effort of establishing a new routine in the presence of the coronavirus and controlling additional outbreaks of the virus, mainly because it will allow:

  1. Smart screening and prioritization of the RT-PCR test queue. RT-PCR tests constitute a bottleneck in the effort to monitor and control the pandemic spread. Using a prediction model will allow patients who really need the test to be tested first.
  2. Making informed and differentiated decisions about suspected infections before performing an RT-PCR test. Patients whom the model predicts to be positive with high confidence can be guided to quarantine even before RT-PCR test results are obtained. Patients whom the model predicts to be negative with high confidence can be treated as such, saving the need for expensive and complex RT-PCR tests.
  3. Extending the testing and monitoring system and providing a virtual testing capability for COVID-19 based on self-reported symptoms and vital signs. People would be able to make educated decisions about their own personal behavior on the basis of informed risk management and would know, at least at the extremes, the true likelihood that they are infected with the virus.
  4. Accelerating the pace at which chains of transmission are broken. Rapid and scalable virtual tests will allow rapid and focused RT-PCR tests, thus improving the pace at which positive patients are found and chains of transmission are broken.
  5. Accelerating the pace at which countries and states release the lockdowns. Virtual tests are scalable and rapid. One of the significant challenges in releasing economies from lockdown is the long “feedback cycle,” two weeks or more, needed to examine the impact of lifting intervention measures on transmission rates. Part of this feedback time is due to the long time it takes to coordinate, execute, and decode the RT-PCR tests. Even if the accuracy of the virtual tests is much lower than that of the RT-PCR tests, their scalability and speed can make them a good measure of transmission and infection and can shorten this long feedback loop, especially in the face of secondary outbreak waves of the virus.

Carbon Health and Braid Health have shown that high-quality clinical data can be published while maintaining the privacy of the patients. We encourage more health providers to join and publish their data as well. We hope that such open data initiatives will convince other entities around the world to publish additional data. We hope that a feedback loop with a positive amplification factor will be established, as more open data will yield better models that will encourage more and better open data to be published, and so on. We hope that such a feedback loop will improve the performance of predictive models to the point where they significantly help manage the new routine in the presence of the COVID-19 virus, monitoring and controlling the transmission rates and reducing the financial and collateral costs of the pandemic.


Links

  1. Link to the site where Carbon Health and Braid Health have published the dataset.
  2. Link to a python notebook (Colab) with the code used for this analysis.
  3. Link to an open dataset published by the Israel Ministry of Health.

The post was written by Omer Koren, CEO of Webiks. Webiks specializes in developing data analysis applications and models based on open source and open data.