Our goal in this paper is to assess possible connections between voice data and the diagnosis of Parkinson’s Disease (PD). We were given a CSV file containing voice measurements from 31 people, 23 of whom have Parkinson’s. We were also given further information about each patient in an XLSX file. We intend to combine both files into one table so that we can better analyze the connections between the voice measurements and the patients' personal information.
To join the tables and run appropriate statistics on the data, we average the voice measurements for each variable within each patient and then merge the two tables column-wise. We then explore the data and run the statistics of interest. One of these is whether the mean and variance of the age at PD diagnosis differ between men and women. Finally, we look for voice measurement variables that are statistically significant in identifying patients who have PD. To do so, we test whether the means, variances, and medians differ between those who have PD and those who do not, across all voice measurements.
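The averaging and merging step can be sketched as follows. This is a minimal illustration in Python using only the standard library; the authors worked in R, and the column names (`patient_id`, `Fo`, `NHR`, `Sex`, `Age`) and the toy values are assumptions made for the example, not the actual file layout or data.

```python
from collections import defaultdict
from statistics import mean


def average_by_patient(recordings, id_key="patient_id"):
    """Average every measurement column over each patient's recordings."""
    grouped = defaultdict(list)
    for row in recordings:
        grouped[row[id_key]].append(row)

    averaged = {}
    for pid, rows in grouped.items():
        columns = [k for k in rows[0] if k != id_key]
        averaged[pid] = {col: mean(r[col] for r in rows) for col in columns}
    return averaged


def merge_tables(averaged, demographics, id_key="patient_id"):
    """Inner join: keep only patients that appear in both tables."""
    merged = []
    for person in demographics:
        pid = person[id_key]
        if pid in averaged:
            merged.append({**person, **averaged[pid]})
    return merged


# Hypothetical toy rows, several recordings per patient.
recordings = [
    {"patient_id": "S01", "Fo": 120.0, "NHR": 0.02},
    {"patient_id": "S01", "Fo": 124.0, "NHR": 0.04},
    {"patient_id": "S02", "Fo": 200.0, "NHR": 0.01},
]
demographics = [
    {"patient_id": "S01", "Sex": "M", "Age": 70},
    {"patient_id": "S02", "Sex": "F", "Age": 61},
]

merged = merge_tables(average_by_patient(recordings), demographics)
```

After this step each patient contributes exactly one row, which keeps the observations independent for the tests that follow.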

Table 1: Variables explained, with summary statistics.

| Term | Definition | Ratio | Min | Mean | Max | SD |
|---|---|---|---|---|---|---|
| Sex | Biological sex (M/F) | 60.85% Male | | | | |
| Status | Presence of PD in patient (Y/N) | 74.6% Positive | | | | |
| Age | Age in years | | 46 | 66.05 | 85 | 9.83 |
| Stage | Severity of Parkinson's disease (1-4) | | 1 | 2.23 | 4 | 0.75 |
| Diagnosis | Years since first positive diagnosis (if applicable) | | 0 | 6.69 | 28 | 7.08 |
| Fo | Average vocal fundamental frequency (Hz) | | 88.33 | 154.22 | 260.1 | 41.95 |
| Fhi | Maximum vocal fundamental frequency (Hz) | | 102.14 | 197.07 | 592.03 | 92.94 |
| Flo | Minimum vocal fundamental frequency (Hz) | | 65.48 | 117.46 | 239.17 | 43.73 |
| Jitter | Measure of variation in fundamental frequency | | 0.0000 | 0.0032 | 0.0175 | 0.0027 |
| Shimmer | Measure of variation in amplitude | | 0.02 | 0.07 | 0.31 | 0.05 |
| NHR | Noise-to-harmonics ratio | | 0 | 0.03 | 0.31 | 0.04 |
| HNR | Harmonics-to-noise ratio | | 8.44 | 21.99 | 33.05 | 4.46 |

The data used in this study come from 31 patients, 23 of whom have tested positive for PD; the remaining 8 are at risk. Each of these patients has taken various tests to determine their scores for the variables defined in Table 1. Throughout this study, we use these terms to search for connections between the various measures of voice and the presence of PD in patients.
Figure 1: Data compared to normal distribution.
As shown in Figure 1, none of the four variables appears to be normally distributed. This is likely due to the small sample size and to the fact that the patients in this study either have or are at risk of developing PD, and are therefore not representative of the general population. As a result, many common statistical methods that assume normality cannot be used.
All analyses were conducted using R 4.2.2.
To confirm our observations in Figure 1 and generalize them to the entire dataset, we ran the Shapiro-Wilk test for normality on each of the 19 numeric variables. At an \(\alpha\) level of 0.05, the tests confirmed our prior observations: none of the variables is normally distributed, so we use nonparametric approaches for all analyses.
Next, we wanted to determine whether our categorical variables, namely Sex and Age Group, had an effect on the other variables in our study. To find out, we conducted a series of permutation tests with 1,000 permutations per variable. The categories we tested were Male vs. Female and Young (45-65) vs. Old (66-85).

Table 2: Significant variables when grouped by categorical variables. Each variable is listed under the group with the higher mean.

| Sex: Male | Sex: Female | Age: Young | Age: Old |
|---|---|---|---|
| Jitter | Fo | Flo | Fhi |
| Status | Fhi | HNR | Jitter |
| | Flo | | Shimmer |
| | | | NHR |
| | | | Status |

As shown in Table 2, at an \(\alpha\) value of .05, we have significant evidence that the mean values of Jitter, Status, and the three vocal frequency variables differ between sexes. Furthermore, if we conduct the same tests with the one-sided alternative hypothesis that males have the higher mean, we can conclude that males have higher mean values of Jitter and Status, while females have higher mean values of the vocal frequency measurements.
For age, we can conclude that there is a significant difference between means for every variable except average vocal frequency (Fo). With one-sided tests, we can say that the older group has higher mean values for every variable except HNR and minimum vocal frequency (Flo), for which the younger group has higher mean values.
Lastly, we are interested in testing which voice measurement variables are significant in differentiating patients who are diagnosed with PD from those who are not. For this, we use a two-sample simulated permutation test on each variable. For each test, the null hypothesis is that the statistic in question (mean, variance, or median) is equal between patients with PD and those without, and the alternative is that it differs. For each test we use an \(\alpha\) of .05.
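A two-sample simulated permutation test of this kind can be sketched as follows. This is an illustrative Python version, not the authors' R code: the function name, the fixed seed, and the add-one correction in the p-value are assumptions made for the example, and the two small samples are made-up numbers rather than study data.

```python
import random
from statistics import mean, median, variance


def permutation_test(x, y, stat=mean, n_perm=1000, seed=0):
    """Two-sided two-sample permutation test for a difference in `stat`.

    Returns the observed difference stat(x) - stat(y) and a simulated p-value.
    """
    rng = random.Random(seed)
    d_obs = stat(x) - stat(y)
    pooled = list(x) + list(y)
    n_x = len(x)

    extreme = 0
    for _ in range(n_perm):
        # Reassign group labels at random by shuffling the pooled sample.
        rng.shuffle(pooled)
        d = stat(pooled[:n_x]) - stat(pooled[n_x:])
        if abs(d) >= abs(d_obs):
            extreme += 1

    # Add-one correction (an assumption here) keeps the p-value above zero.
    p_value = (extreme + 1) / (n_perm + 1)
    return d_obs, p_value


# Made-up values standing in for one voice measurement in each group.
with_pd = [0.041, 0.055, 0.038, 0.062, 0.047, 0.071, 0.052, 0.049]
without_pd = [0.021, 0.030, 0.018, 0.027, 0.025, 0.033]

results = {
    name: permutation_test(with_pd, without_pd, stat=fn)
    for name, fn in (("mean", mean), ("variance", variance), ("median", median))
}
```

Passing the statistic as a function lets the same routine produce the mean, variance, and median columns of a table like Table 3 without duplicating the resampling loop.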

Table 3: P-values of two-sample permutation tests.

| Variable | Pval_mean | Pval_var | Pval_median |
|---|---|---|---|
| MDVP.Fhi.Hz. | 0.151 | 0.980 | 0.012 |
| MDVP.Flo.Hz. | 0.013 | 0.022 | 0.021 |
| MDVP.Jitter... | 0.051 | 0.502 | 0.096 |
| MDVP.Jitter.Abs. | 0.024 | 0.402 | 0.073 |
| MDVP.RAP | 0.073 | 0.484 | 0.062 |
| MDVP.PPQ | 0.056 | 0.476 | 0.059 |
| Jitter.DDP | 0.074 | 0.463 | 0.066 |
| MDVP.Shimmer | 0.017 | 0.128 | 0.081 |
| MDVP.Shimmer.dB. | 0.025 | 0.285 | 0.077 |
| Shimmer.APQ3 | 0.027 | 0.078 | 0.077 |
| Shimmer.APQ5 | 0.026 | 0.176 | 0.070 |
| MDVP.APQ | 0.022 | 0.316 | 0.061 |
| Shimmer.DDA | 0.021 | 0.089 | 0.070 |
| NHR | 0.207 | 0.711 | 0.136 |
| HNR | 0.045 | 0.611 | 0.137 |
As shown in Table 3, there is a statistically significant difference in means between people with PD and those without for Flo, absolute jitter (MDVP.Jitter.Abs.), all types of Shimmer, MDVP.APQ, and HNR. There is a statistically significant difference in variances only for Flo. Lastly, there is a statistically significant difference in medians only for Fhi and Flo.
Next, we looked for correlations between the numeric variables. Since the data are not normally distributed, Spearman’s \(\rho\) coefficient was used. To obtain clearer correlations, we averaged all measures of shimmer into a single shimmer variable and all measures of jitter into a single jitter variable.
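For readers unfamiliar with Spearman's \(\rho\), it is the Pearson correlation computed on ranks rather than raw values, which is why it does not require normality. The following standard-library Python sketch illustrates the computation; the function names are chosen for the example and it is not the authors' R code.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    sd_a = sqrt(sum((p - mean_a) ** 2 for p in a))
    sd_b = sqrt(sum((q - mean_b) ** 2 for q in b))
    return cov / (sd_a * sd_b)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))
```

Because only the ordering matters, any monotone relationship (for example `y = x ** 2` on positive values) yields \(\rho = 1\) even when the raw relationship is far from linear.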
Figure 2: Noise and harmonics compared with shimmer.
We found that jitter, shimmer, NHR, and HNR were all strongly correlated with each other, with \(\rho\) values ranging from 0.75 to 0.87. To illustrate this, Figure 2 shows the correlation of both NHR and HNR with shimmer. It is also notable that those in the older age group tend to have higher shimmer and NHR values but lower HNR values than the younger group, which illustrates a few of our observations in Table 2.
Next, we created a logistic regression model to predict the probability of a patient being positive for PD given their test results. Because logistic regression does not assume normally distributed predictors, the model was fit in the usual way. The resulting model uses the patient’s average vocal frequency (Fo), average jitter, average shimmer, and noise-to-harmonics ratio (NHR), and is shown below:
\[P(\text{Positive}) = \frac{e^{-0.016\text{Fo}+1198\text{Jitter}+45.4\text{Shimmer}-69.37\text{NHR}}}{1+ e^{-0.016\text{Fo}+1198\text{Jitter}+45.4\text{Shimmer}-69.37\text{NHR}}}\]
This equation could be useful to doctors during the screening process for PD, as it could reduce the number of tests required for a diagnosis.
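As a worked example, the fitted equation can be evaluated directly. The Python sketch below uses the coefficients reported above; the function name is hypothetical, and the input values are the Table 1 means, used here only as illustrative inputs (whether they are on the same scale as the averaged jitter and shimmer used in the model is an assumption).

```python
from math import exp

# Coefficients as reported in the fitted model (no intercept is reported).
COEF = {"Fo": -0.016, "Jitter": 1198.0, "Shimmer": 45.4, "NHR": -69.37}


def predict_pd_probability(fo, jitter, shimmer, nhr):
    """Evaluate P(Positive) = e^z / (1 + e^z) for the reported model."""
    z = (
        COEF["Fo"] * fo
        + COEF["Jitter"] * jitter
        + COEF["Shimmer"] * shimmer
        + COEF["NHR"] * nhr
    )
    # 1 / (1 + e^-z) is algebraically the same as e^z / (1 + e^z).
    return 1.0 / (1.0 + exp(-z))


# Illustrative inputs only: the Table 1 means for each predictor.
p_example = predict_pd_probability(fo=154.22, jitter=0.0032, shimmer=0.07, nhr=0.03)
```

With no intercept term, a patient whose four predictors are all zero would be assigned a probability of exactly 0.5, so the model is only meaningful within the range of the observed measurements.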
Figure 3: Logistic regression graphs.
The plots in Figure 3 suggest that the model is fairly accurate: the blue dots tend to cluster toward the top of the graph, meaning that the cases the model predicts to have a higher chance of being positive are, in fact, positive.
We found earlier that the proportion of patients diagnosed with PD differs significantly between men and women. We are now interested in whether there is a statistical difference between the ages at which men and women are diagnosed with PD, and whether the variability in age at diagnosis differs between them.
First, we run a right-tailed two-sample simulated permutation test to see whether the mean age at which women are diagnosed with PD is statistically greater than that for men. The null hypothesis is that the mean age at diagnosis is the same for men and women, and the alternative is that the mean age for women is greater. For this we use an \(\alpha\) of .05.
Figure 4: Histogram of the permutation test comparing mean age at PD diagnosis between sexes.
The observed difference \(D_{obs}\) from the two-sample simulated permutation test is 3.6, displayed as the blue line in Figure 4, and the test gives a corresponding p-value of about 0.084. Therefore we fail to reject \(H_0\) at the .05 level. We do not have enough evidence to say that women are diagnosed with PD at a later age than men.
Next, we are interested in testing whether the variability in age at diagnosis differs between men and women. For this we use a two-sided Siegel-Tukey test with an \(\alpha\) level of .05, and then check the result with another two-sample simulated permutation test on the variances. For the Siegel-Tukey test, the null hypothesis is that the variance in age at diagnosis is the same for men and women, and the alternative is that the variances differ.
The Siegel-Tukey test returned a test statistic of 32 and a corresponding p-value of 0.00695. We therefore reject the null hypothesis: there is significant evidence of a difference in variance between the sexes.
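For context, the Siegel-Tukey test replaces ordinary ranks with ranks that alternate between the two ends of the sorted pooled sample, so that a more spread-out group collects the lower ranks; a Wilcoxon rank-sum test is then applied to those ranks. The Python sketch below shows only the ranking step and the resulting rank sums. It ignores ties and does not compute a p-value, the function names are hypothetical, and it is not claimed to reproduce the statistic reported above.

```python
def siegel_tukey_ranks(n):
    """Siegel-Tukey ranks for sorted positions 0..n-1.

    Rank 1 goes to the smallest value, ranks 2 and 3 to the two largest,
    ranks 4 and 5 to the next two smallest, and so on, alternating ends.
    """
    ranks = [0] * n
    if n == 0:
        return ranks
    lo, hi = 0, n - 1
    ranks[lo] = 1
    lo += 1
    rank = 2
    take_high = True
    while lo <= hi:
        for _ in range(2):  # assign two ranks from the current end
            if lo > hi:
                break
            if take_high:
                ranks[hi] = rank
                hi -= 1
            else:
                ranks[lo] = rank
                lo += 1
            rank += 1
        take_high = not take_high
    return ranks


def siegel_tukey_rank_sums(x, y):
    """Sum of Siegel-Tukey ranks in each group (ties are not averaged)."""
    pooled = sorted([(v, "x") for v in x] + [(v, "y") for v in y],
                    key=lambda pair: pair[0])
    ranks = siegel_tukey_ranks(len(pooled))
    w_x = sum(r for r, (_, g) in zip(ranks, pooled) if g == "x")
    w_y = sum(r for r, (_, g) in zip(ranks, pooled) if g == "y")
    return w_x, w_y
```

The test is sensitive to differences in location as well as spread, so it is usually applied only when the two groups have similar medians.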
We then run a second test, using the simulated permutation approach, to see whether the variability in the age at which women are diagnosed with PD is greater than that for men. We use the two-sample simulated permutation test here because it has greater power than the Siegel-Tukey test. The null hypothesis is that the variances in age at PD diagnosis are equal for men and women, and the alternative is that the variance for women is greater. For this test we use an \(\alpha\) of .05.
Figure 5: Permutation test for the variance of age at PD diagnosis between sexes.
The two-sample simulated permutation test gives a \(D_{obs}\) of 85.54, shown as the blue line in Figure 5, and a corresponding p-value of about 0.003. We therefore reject \(H_0\) at the .05 level and conclude that there is significant evidence that the variance in age at PD diagnosis is greater for women than for men.
We were largely interested in testing whether there are differences between the ages at which women and men are diagnosed with PD. Our testing did not provide enough evidence of a difference in mean age at diagnosis. We then asked whether the variance in age at diagnosis differs between the sexes. A two-sided Siegel-Tukey test at an \(\alpha\) of .05 showed that the variances differ, and a one-sided two-sample permutation test then indicated that the variance in age at diagnosis is greater for women than for men.
Our last interest was testing which variables differ statistically between patients with PD and those without. Flo was the only variable found to differ significantly in mean, variance, and median, which suggests that Flo is the best single variable for determining whether a patient has PD. Further supporting this, no other variable was statistically significant in more than one of the three tests. We also found that logistic regression was a fairly effective way to predict PD using Fo, average Jitter, average Shimmer, and NHR. In the end, we gained some useful background knowledge on the diagnosis of patients with PD and concluded that Flo is the best of the variables studied for identifying patients with PD.