Forecasting cases, hospitalization and deaths with CDC’s wastewater data. Uses pandas windowing and correlation.
Summary
More than 50,000 wastewater tests across the United States were matched with COVID-19 outcomes (cases, hospitalization, death) from the same location and timeframe. The resulting dataset was analyzed for correlation between SARS-CoV-2 RNA level and the outcomes, showing a positive relationship with all three measures.
Background
In a previous data engineering project, I enhanced the wastewater tests from the CDC’s National Wastewater Surveillance System. The goal of that project was to add COVID-19 vaccination and disease outcome information to each wastewater test, so that a particular measured SARS-CoV-2 RNA level could be tied to disease results in that location in that timeframe.
For example, consider a water test on January 19, 2021 at a treatment plant in Scott County, Missouri, that found 390.0 copies per milliliter of the N2 gene from SARS-CoV-2. (This is basically the information that the full NWSS dataset contains, repeated many times over for every day and every water treatment plant that is submitting data.)
I enhanced the NWSS dataset by adding to that row the vaccination rate in that county on January 9 (ten days before the water test, since vaccines need at least ten days to work), the known case rates on January 26 (seven days after the water test), the hospital and ICU utilization for COVID-19 patients on February 2 (14 days after the water sample), and the number of new COVID-19 deaths on February 9 (21 days after).
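The offsets are easy to check with a few lines of pandas. This is only a sketch of the date arithmetic, not the code from the enhancement project; the dictionary name `OFFSETS_DAYS` and its keys are made up for illustration.

```python
import pandas as pd

# Hypothetical names: offsets in days relative to the water sample date.
OFFSETS_DAYS = {
    "vaccination": -10,  # vaccines need at least ten days to work
    "cases": 7,
    "hospital": 14,
    "deaths": 21,
}

sample_date = pd.Timestamp("2021-01-19")

# Date at which each outcome is looked up for this water test.
lookup_dates = {
    name: sample_date + pd.Timedelta(days=days)
    for name, days in OFFSETS_DAYS.items()
}

for name, date in lookup_dates.items():
    print(f"{name:12s} {date.date()}")
```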
That data engineering project was successful, and the next step was to use the enhanced NWSS data for analysis. For example, did the level of wastewater SARS-CoV-2 RNA predict COVID-19 hospital admissions 14 days later? If so, how strong was that correlation? My current project, reported here, begins that analysis.
Note: This analysis, like my previous project that enhanced the NWSS dataset, is built on detailed wastewater test results that are available from CDC by signing a data-use agreement. To obtain that restricted dataset, contact the data owner listed on the public data page.
Missing Hospital Values
The first step of this effort was to fix a problem with the hospital information. I acquired the hospital data from CovidActNow.org, but the same issue is common in many hospital data sources. Hospitals generally report admissions, beds filled, and ICU usage just once per week. On the other six days, those values are empty. Of course, on the days with missing data there are still people in the hospital, so a null/NaN cannot be interpreted as zero. The solution is to use the available data to estimate the hospitalization data between weekly reports.
The pandas window function does the trick:
# Sort by county then date.
CovidDF = CovidDF.sort_values(["fips", "covid_facts_date"], ascending=[True, True])
# Create new columns for ICU, beds and admits with a centered rolling mean over 10 rows
# (10 days when each county has one row per day). Missing values are skipped by mean().
# Group by county so that a window never mixes the last days of one county with the first days of the next.
for col in ["metrics.icuCapacityRatio", "metrics.bedsWithCovidPatientsRatio", "metrics.weeklyCovidAdmissionsPer100k"]:
    CovidDF[col + "Rolling10"] = CovidDF.groupby("fips")[col].transform(lambda s: s.rolling(10, min_periods=1, center=True, closed='both').mean())
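To see what the centered rolling mean does with once-a-week reports, here is a toy sketch. The data are invented, the FIPS code is a placeholder, and the column name `admits` is hypothetical; only the windowing idea is taken from the code above.

```python
import numpy as np
import pandas as pd

# Toy data: one county (placeholder FIPS), 15 daily rows,
# with a value reported only every 7th day.
toy = pd.DataFrame({
    "fips": ["99999"] * 15,
    "covid_facts_date": pd.date_range("2021-01-01", periods=15, freq="D"),
    "admits": [10.0] + [np.nan] * 6 + [20.0] + [np.nan] * 6 + [30.0],
})

# Centered 10-row window within each county. mean() skips NaN rows, and
# min_periods=1 accepts any window that holds at least one real report.
toy["admits_rolling10"] = (
    toy.groupby("fips")["admits"]
    .transform(lambda s: s.rolling(10, min_periods=1, center=True).mean())
)

print(toy[["covid_facts_date", "admits", "admits_rolling10"]])
```

Because reports arrive every 7 days and the window spans 10 rows, every day ends up with an estimate built from one or two nearby reports.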
Early Vaccination Values
The CovidActNow data has another item that needs to be cleaned up. (Most data sources need some cleanup and fixes.) Many vaccination fields for early dates are empty. But in this case, unlike hospitalization, the missing values are known: they are zero because the vaccines did not exist on those dates.
The pandas.DataFrame.loc indexer solves this problem by explicitly setting all vaccine counts to zero for all days before vaccines appeared in the US.
# First vaccines on 14 December 2020.
CovidDF.loc[CovidDF["covid_facts_date"] <= "2020-12-13", "actuals.vaccinationsInitiated"] = 0.0
CovidDF.loc[CovidDF["covid_facts_date"] <= "2020-12-13", "actuals.vaccinationsCompleted"] = 0.0
CovidDF.loc[CovidDF["covid_facts_date"] <= "2020-12-13", "metrics.vaccinationsInitiatedRatio"] = 0.0
CovidDF.loc[CovidDF["covid_facts_date"] <= "2020-12-13", "metrics.vaccinationsCompletedRatio"] = 0.0
# First boosters on 16 August 2021.
CovidDF.loc[CovidDF["covid_facts_date"] <= "2021-08-15", "actuals.vaccinationsAdditionalDose"] = 0.0
CovidDF.loc[CovidDF["covid_facts_date"] <= "2021-08-15", "metrics.vaccinationsAdditionalDoseRatio"] = 0.0
Consistent Test Type
The NWSS dataset contains water test reports as submitted by various wastewater treatment plants. Most, but not all, look for the N1 or N2 gene of SARS-CoV-2 in raw wastewater. I threw out rows that do not hold this type of test.
RawDF = RawDF.query("pcr_target == 'sars-cov-2' ")
RawDF = RawDF.query("pcr_gene_target == 'n1' or pcr_gene_target == 'n2' or pcr_gene_target == 'n1 and n2 combined' ")
RawDF = RawDF.query("sample_matrix == 'raw wastewater' ")
Consistent RNA Detection Units
A slightly trickier problem is that the number of gene copies detected is reported in three different ways: copies per liter, copies per milliliter, and the log base 10 of copies per liter. I created two new columns: one that states the normalized consistent units (copies per milliliter), and the other with normalized gene counts converted as needed.
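RawDF["pcr_target_units_norm"] = "copies/ml wastewater"
# Start from the reported value, which is already correct for rows in copies per milliliter.
RawDF["pcr_target_avg_conc_norm"] = RawDF["pcr_target_avg_conc"]
# Convert copies per liter, and log10 of copies per liter, to copies per milliliter.
RawDF.loc[RawDF["pcr_target_units"] == "copies/l wastewater", "pcr_target_avg_conc_norm"] = (RawDF["pcr_target_avg_conc"] / 1000)
RawDF.loc[RawDF["pcr_target_units"] == "log10 copies/l wastewater", "pcr_target_avg_conc_norm"] = ((10 ** RawDF["pcr_target_avg_conc"]) / 1000)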
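A small worked example makes the three conversions concrete. This is a sketch on invented rows; the helper `to_copies_per_ml` is hypothetical, and the exact spelling of the copies-per-milliliter unit string is an assumption.

```python
import pandas as pd

# Invented rows, one per reporting style.
units_demo = pd.DataFrame({
    "pcr_target_units": [
        "copies/ml wastewater",       # assumed spelling of the per-ml unit
        "copies/l wastewater",
        "log10 copies/l wastewater",
    ],
    "pcr_target_avg_conc": [390.0, 390000.0, 6.0],
})

def to_copies_per_ml(units: str, conc: float) -> float:
    """Hypothetical helper: convert one reported value to copies per milliliter."""
    if units == "copies/l wastewater":
        return conc / 1000            # 1 liter = 1000 milliliters
    if units == "log10 copies/l wastewater":
        return (10 ** conc) / 1000    # undo the log, then liters to milliliters
    return conc                       # assumed to be copies/ml already

units_demo["pcr_target_avg_conc_norm"] = [
    to_copies_per_ml(u, c)
    for u, c in zip(units_demo["pcr_target_units"], units_demo["pcr_target_avg_conc"])
]
print(units_demo)
```

The first two rows describe the same concentration in different units, so both normalize to 390 copies per milliliter; a log10 value of 6.0 is one million copies per liter, or 1,000 per milliliter.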
Finished Dataset
There are 61,526 rows in the initial NWSS dataset (after expanding county FIPS to one per row). There are 53,521 rows after selecting only tests for the N1 or N2 genes in raw wastewater, and 53,503 after selecting only rows with valid gene counts (not NaN, not negative). About 87% of the initial rows therefore hold usable data.
RNA concentrations in wastewater samples vary for many reasons: the number of people using the sewer system, dilution of the wastewater by rain around that time, the sampling method used, and how many people in that area have COVID-19. The fact that one test shows X copies of an RNA gene and another test shows 2X copies does not necessarily mean there is more disease found by the second test. The NWSS dataset, however, contains many water tests over many months in many locations, so these variations tend to smooth out over time and geography.
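The validity filter (not NaN, not negative) can be sketched as follows. The rows are invented and this is not necessarily how the original code expresses the filter; the last two lines simply compute the usable share from the row counts quoted above.

```python
import numpy as np
import pandas as pd

# Invented rows: two are invalid (one NaN, one negative); zero is kept.
demo = pd.DataFrame({"pcr_target_avg_conc_norm": [390.0, np.nan, -1.0, 0.0, 12.5]})

conc = demo["pcr_target_avg_conc_norm"]
valid = demo[conc.notna() & (conc >= 0)]
print(len(valid), "of", len(demo), "toy rows kept")

# Share of usable rows, from the counts reported in the text.
usable_share = 53_503 / 61_526
print(f"{usable_share:.1%} of the initial rows are usable")
```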
Results
I used simple Spearman rank correlation for this analysis. Two variables have a correlation of 1.0 if ranking the observations by one variable gives exactly the same order as ranking them by the other. In this case, a correlation of 1.0 would mean that the ordered list of RNA levels in all water samples perfectly predicts the ordered list of the other variable, such as hospitalization.
Here are the Spearman correlations:
- SARS-CoV-2 gene copies and COVID-19 test positivity ratio (7 days later) = 0.491.
- Gene copies and COVID-19 case density per capita (7 days later) = 0.541.
- Gene copies and COVID-19 hospital admissions per capita (14 days later) = 0.377.
- Gene copies and ratio of hospital beds with COVID-19 patients (14 days later) = 0.490.
- Gene copies and COVID-19 deaths per capita (21 days later) = 0.282.
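For readers who want to reproduce this kind of number, Spearman correlation is the ordinary Pearson correlation applied to the ranks of the two variables. The sketch below uses invented values and a hypothetical helper named `spearman`; it shows the technique, not the exact call used for the results above.

```python
import pandas as pd

def spearman(x: pd.Series, y: pd.Series) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    pair = pd.concat([x, y], axis=1).dropna()   # keep rows where both exist
    return pair.iloc[:, 0].rank().corr(pair.iloc[:, 1].rank())

# Invented RNA levels and two invented outcomes.
rna = pd.Series([5.0, 40.0, 120.0, 390.0, 900.0])
monotone = rna ** 2                              # nonlinear, but same ordering
shuffled = pd.Series([3.0, 1.0, 4.0, 2.0, 5.0])  # only partly the same ordering

rho_monotone = spearman(rna, monotone)
rho_shuffled = spearman(rna, shuffled)
print(rho_monotone, rho_shuffled)
```

The first outcome is a nonlinear function of the RNA level but preserves the ordering exactly, so its rank correlation is 1.0; the second agrees only partly and scores lower.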
There is consistent positive correlation between SARS-CoV-2 detected in wastewater and multiple measures of COVID-19 disease prevalence in that geographic area.
For all of the correlation numbers, the 95% confidence intervals are very tight because of the large sample size. For example, with the hospital admissions correlation of 0.377, the 95% confidence interval is 0.3695 to 0.3845.
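One common way to get such an interval is the Fisher z-transformation. The sketch below is an assumption about method, not necessarily how the figures above were computed: it uses the 1.06 / (n - 3) variance adjustment often applied to Spearman correlations, and it assumes n = 53,503 (the full row count, which may differ from the number of rows with hospital data). The function name `spearman_ci` is hypothetical.

```python
import math
from statistics import NormalDist

def spearman_ci(rho: float, n: int, level: float = 0.95) -> tuple:
    """Approximate confidence interval for a Spearman correlation (Fisher z)."""
    z = math.atanh(rho)                            # transform to the z scale
    se = math.sqrt(1.06 / (n - 3))                 # assumed variance adjustment
    crit = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for 95%
    return math.tanh(z - crit * se), math.tanh(z + crit * se)

low, high = spearman_ci(0.377, 53_503)             # n is an assumption
print(f"{low:.4f} to {high:.4f}")
```

Under these assumptions the interval lands close to the one quoted above, and its width shrinks roughly with the square root of the sample size.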
Future Work
The latest available full NWSS dataset is from early February 2022. At this time (early May), there are three months of data missing from the dataset and therefore from the analysis. As soon as NWSS releases a new dataset, the software shown here should be rerun and the results verified — perhaps strengthening or weakening the correlations.
I computed standard rank correlation between the variables described above. The enhanced dataset could also be subjected to more advanced statistical analysis, perhaps Approximate Bayesian Computation or tests that take into account the large size (50,000+) of the sample set and multiple outcome measures.
My pandas source code allows for easy adjustment of the “look-ahead” values. The same analysis can be rerun with hospitalization 21 days later and deaths 28 days later, for example. It would be interesting to experiment with different look-ahead intervals to see which create the strongest correlations between wastewater RNA and COVID-19 outcomes.
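Such an experiment could be sketched as a scan over candidate look-ahead values. Everything below is synthetic: the data are generated so that the outcome trails the signal by a known 14 days, and the names `TRUE_LAG`, `spearman` and `results` are hypothetical, not taken from my source code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
days = 200
TRUE_LAG = 14  # the synthetic outcome is built to trail the signal by 14 days

# Synthetic wastewater signal: a smooth wave plus noise.
rna = pd.Series(np.sin(np.arange(days) / 15.0) + rng.normal(0, 0.1, days))
# Synthetic outcome: the same signal TRUE_LAG days later, plus more noise.
outcome = rna.shift(TRUE_LAG) + rng.normal(0, 0.1, days)

def spearman(x: pd.Series, y: pd.Series) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    pair = pd.concat([x, y], axis=1).dropna()
    return pair.iloc[:, 0].rank().corr(pair.iloc[:, 1].rank())

# For each look-ahead, pair each day's RNA with the outcome `lag` days later.
results = {lag: spearman(rna, outcome.shift(-lag)) for lag in range(0, 29, 7)}
best_lag = max(results, key=results.get)

for lag, rho in results.items():
    print(f"look-ahead {lag:2d} days: rho = {rho:.3f}")
print("strongest look-ahead:", best_lag)
```

On real data the peak would be much less sharp, and comparing many look-ahead values on the same dataset invites overfitting, so any winner should be checked on a later data release.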
For More Information
https://covid.cdc.gov/covid-data-tracker/#wastewater-surveillance (NWSS dashboard)
https://data.cdc.gov/browse (master page for all CDC datasets)
https://en.wikipedia.org/wiki/Correlation (Spearman and Pearson correlation)
COVID-19 Outcome Predictions from Wastewater was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.