When the European Food Safety Authority (EFSA) reviews the science about a chemical, it reviews all of the science, right? Like, it looks at the statistical analyses and the study design, right? Because those two factors determine the all-important p-value (which, as we’ve already discussed, shouldn’t be treated like a gatekeeper, although regulators continue to do so in violation of the guidance from the American Statistical Association). And study size, a part of the study design, is also critical to consider.

Nope — EFSA does not look at the statistical analysis or the study design. See for yourself, here. To be clear, EFSA does mention study design several times in that Appendix. But they’re not talking about aspects of the study design that will impact or bias a p-value. They are talking about things like dye interference with nanoparticle counting approaches, the type of study (e.g., cell counting), and relevance of the test system. When EFSA says study design they don’t mean “study design” the way a statistician means it.

So when I say “study design”, to be clear, I mean are the animals housed in the same cage (e.g., nesting), are the animals all siblings (e.g., nesting), were assays run on separate days (e.g., day effects), were the cells in the same incubator (e.g., incubator effects), were the same lots of media and constituents used (e.g., media and constituent effects). In other words, when I say study design, I mean identifying and taking appropriate steps to minimize the bias/variance associated with specific non-experimental sources of variability. You know — the kinds of things that impact a p-value.

## EFSA and Risk of Bias

What this all means is that EFSA likely has a false positive problem on its hands. In other words, there is a high risk of bias towards results that say titanium dioxide is likely genotoxic when in fact it’s probably not.

So let’s take a look at some of the studies EFSA is relying upon that suggest titanium dioxide is genotoxic.

## What’s Wrong with Kang et al.?

EFSA used Kang et al. (2008) to argue that there was positive evidence of genotoxicity.

So what’s wrong with Kang et al.? For starters, the authors used cells from just 1 human volunteer. That’s a bit of a non-starter. We cannot possibly make any inferences about how the human population might respond to a chemical exposure based on the results in a single person. The fact that this even got published is striking to me.

No one can perform statistics on cells from a single person. The cells, because they are from a single person, are not independent samples — they are too highly correlated.

Beyond that, the statistical analysis was not done right. The authors start off with an ANOVA and follow it up with a Mann-Whitney U test as a post hoc test. The purpose of a post hoc test is to identify the specific group-wise comparisons that are significant, which means the authors need to control the Family-Wise Error Rate (FWER). The Mann-Whitney U test does not provide any type of FWER control. Thus, the false positive rate is automatically higher with this approach. In other words, any significant findings are likely false positives.
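To see why uncorrected pairwise tests inflate the false positive rate, here is a minimal sketch. It assumes the comparisons are independent and each is tested at alpha = 0.05. The numbers of comparisons shown are illustrative only, not the counts from Kang et al., and `familywise_error_rate` is a name made up for this example.

```python
def familywise_error_rate(alpha, n_comparisons):
    """Chance of at least one false positive across n independent tests,
    each run at significance level alpha with no correction."""
    return 1.0 - (1.0 - alpha) ** n_comparisons


# Illustrative numbers of pairwise comparisons (hypothetical, not from the paper).
for m in (1, 3, 5, 10):
    print(f"{m} comparisons: FWER = {familywise_error_rate(0.05, m):.3f}")
```

Under these assumptions, three uncorrected comparisons already push the chance of at least one false positive to roughly 14%, and ten push it to about 40%. Comparisons that share a control group are not strictly independent, so treat these figures as a rough guide rather than exact values.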

Translation: any conclusion of genotoxicity from this study is likely false. In my opinion, this paper never should have been published and should be retracted as it is misleading.

## What’s Wrong with Kurzawa-Zegota et al.?

Kurzawa-Zegota et al. (2017) has a lot of issues in my opinion. The first is so obvious that EFSA’s panelists should have noticed it without even being statistical experts: the confounding.

The authors state clearly that they have evidence of confounding in their study. That’s not a good thing. They say it on page 9282:

It means they cannot draw any causal conclusions. Yet, the authors still do. This is a basic study design issue that any Ph.D. should be able to pick out in a heartbeat: confounding (if not properly controlled) is bad. The authors don’t even try to control it, mostly because the way they structured their participant recruitment means they can’t.

So what should the authors have done? Well, there isn’t a way to fix this. I actually had to find the dissertation that this paper arose from. When you look at the dissertation it becomes clear as day there is no way for the authors to control the confounding. Here’s the table from the dissertation (it’s Appendix 3):

If you look carefully at the Appendix 3 from the dissertation (above) you’ll notice that there is no way to match any of these people. There are far too few people to begin with. But what’s more, they are all over the place with respect to smoking and drinking status, their preferred diet, and the reported family history of cancers. These people simply aren’t able to be matched. As a result, there are too many nuisance variables meaning we cannot perform an apples-to-apples comparison. What that means is that this is not a true experiment, where the subjects differ only with respect to the experimental intervention — there are too many confounding variables.

This alone is sufficient grounds for the paper to be retracted, in my opinion. This paper never should have been published in the first place. Yet, here we are.

Furthermore, why wasn’t this table made part of the paper? Why wasn’t this a supplemental table at the very least? Why did I have to dig to find a dissertation to get this information?

Next issue: the results are all likely false positives. Why?

Let’s look at Figure 2:

Consider all of the pairwise comparisons being made here. Dunnett’s test will only control the false positive rate within a grouping. For instance, in OTM (Figure 2A), Dunnett’s will only control the FWER for the 10 ug/mL group, and then only for the 40 ug/mL group, and then only for the 80 ug/mL group. But Dunnett’s isn’t controlling the FWER across all of these groups simultaneously. So, the authors need to do more work to control the FWER globally within OTM. That would mean 5 different comparisons (control, 10, 40, 80 ug/mL TiO2, and 100uM H2O2). So, to maintain alpha = 0.05, the p-value threshold now becomes 0.05/5 = 0.01.

Some statisticians would argue that the additional testing with % Tail DNA also needs to be counted. In that case, 0.05 needs to be divided by 10, and 0.005 becomes the threshold.
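The threshold arithmetic above is a Bonferroni-style correction, and it can be sketched in a few lines. The function names and the example p-values are hypothetical, chosen only to illustrate the calculation. They are not values reported by the authors.

```python
def bonferroni_threshold(alpha, n_comparisons):
    """Per-comparison p-value threshold that keeps the FWER at or below alpha."""
    return alpha / n_comparisons


def significant_after_bonferroni(p_values, alpha=0.05):
    """Flag which p-values survive a Bonferroni correction."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p <= threshold for p in p_values]


print(bonferroni_threshold(0.05, 5))   # OTM only: 5 comparisons
print(bonferroni_threshold(0.05, 10))  # OTM plus % Tail DNA: 10 comparisons

# Hypothetical p-values for illustration; 0.03 passes 0.05 but fails 0.05/5.
example_p_values = [0.03, 0.008, 0.20, 0.012, 0.001]
print(significant_after_bonferroni(example_p_values))
```

Bonferroni is the simplest correction and also the most conservative. It is used here only because it matches the division the text describes.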

It also appears that the authors did not run their ANOVA correctly. The normal protocol is that post hoc tests are only run when the ANOVA indicates there is a difference. That means that the magnitude is large enough to be biologically meaningful, and the p-value is < 0.05. Well, in the case of the OTM 80 ug/mL group, I observed a p-value of 0.296. So, the authors should not have run Dunnett’s test on those. The same is true for the tail DNA at the 80 ug/mL group, where I observed a p-value of 0.34.

What the authors should have done is a more complicated general linear mixed model. The authors clearly have 2 independent variables in addition to their confounding variables — health status and exposure levels.

The authors also have issues with their statistical assumptions. The control exposure group in the OTM study either 1) are not normally distributed or 2) do not represent the population response — potentially both. When I model the data using the reported mean of 3.60, and the standard error of 0.93, which is converted to a standard deviation of 4.16 (0.93 * sqrt(20)), I end up with 19.3% of the population being less than 0 (i.e., negative micronucleus counts). That’s simply not possible. This could result from the sample under-representing the number of micronuclei in the population — which would lead to false positives. It could also be that the data are not actually normally distributed. Either way, this will lead to flawed inferences about the population, and likely will result in false positives.
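The calculation behind that percentage can be reproduced with Python's standard library. This is a sketch under the same assumption the text makes, namely that the control group follows a normal distribution with the reported mean and a standard deviation recovered from the standard error with n = 20. The variable names are mine.

```python
import math
from statistics import NormalDist

reported_mean = 3.60   # reported control mean
reported_sem = 0.93    # reported standard error of the mean
n = 20                 # sample size used in the SE-to-SD conversion

# Standard deviation recovered from the standard error: SD = SE * sqrt(n)
sd = reported_sem * math.sqrt(n)

# Fraction of a normal population with this mean and SD that falls below zero
p_negative = NormalDist(mu=reported_mean, sigma=sd).cdf(0.0)

print(f"SD = {sd:.2f}")
print(f"Share of the modeled population below 0: {100 * p_negative:.1f}%")
```

Roughly one in five values in the modeled population would be negative, which is why the normality assumption is hard to defend for a quantity that cannot go below zero.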

The authors may try to raise the defense that the Kolmogorov-Smirnov and Shapiro-Wilk tests were used to assess that the data were normally distributed. Those tests are well-known to not be very informative for testing the normality of a small sample of data. Also, just because the samples may appear to be normal does not mean that the normal distribution is the most accurate distribution. If the authors want to accept that their data are normally distributed, with mean 2.48 and standard deviation 1.36, then the authors must also accept that negative numbers are plausible values. We know that negative numbers are not plausible values. Therefore, the statistical analyses conducted are inappropriate given the data.

## What’s Wrong with Proquin et al.?

The biggest issue here is that the authors report having 16 samples per group, when they actually only have 4. How do I know it’s 4 samples instead of 16? Well, the authors tell us this on page 141:

The biological replicates are what actually count here. The authors say there are 4 — and those are what received the treatment. If each biological replicate is run in duplicate that doesn’t make the sample size 8 — because the duplicate is not independent of its clone. So the duplication does not make it 8 samples — it still is 4. And running samples on two slides doesn’t double the samples either. Again — those duplicate slides are all correlated with each other because they are duplicates. So what the authors have is a hierarchical design, where the number of samples is only 4, not 16.

The number of samples matters in calculating the p-value. All else being equal, more samples means lower p-values. So by inflating the sample size, the authors are artificially lowering their p-values and automatically increasing their false positive rate.
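A small simulation illustrates the problem. This is a sketch, not a reanalysis of the paper: it assumes 4 biological replicates per group, each measured 4 times (standing in for duplicates on two slides), with made-up standard deviations and no true treatment effect. It compares a t-test that treats all 16 measurements as independent against one run on the 4 replicate means. The critical values are the standard two-sided 5% t values for 30 and 6 degrees of freedom.

```python
import random
import statistics


def pooled_t(a, b):
    """Two-sample Student t statistic using a pooled variance."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    se = (pooled_var * (1 / na + 1 / nb)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se


def simulate_group(rng, n_bio=4, n_tech=4, sd_bio=1.0, sd_tech=0.5):
    """Nested data: each biological replicate has its own mean, and the
    technical measurements scatter around it (all values are assumptions)."""
    group = []
    for _ in range(n_bio):
        bio_mean = rng.gauss(0.0, sd_bio)
        group.append([rng.gauss(bio_mean, sd_tech) for _ in range(n_tech)])
    return group


def false_positive_rates(n_sims=2000, seed=1):
    """Both groups come from the same population, so every 'hit' is false."""
    rng = random.Random(seed)
    t_crit_df30 = 2.042  # two-sided 5% critical value, df = 16 + 16 - 2
    t_crit_df6 = 2.447   # two-sided 5% critical value, df = 4 + 4 - 2
    naive_hits = 0
    correct_hits = 0
    for _ in range(n_sims):
        control, treated = simulate_group(rng), simulate_group(rng)
        # Naive analysis: pretend all 16 measurements are independent.
        flat_c = [x for rep in control for x in rep]
        flat_t = [x for rep in treated for x in rep]
        if abs(pooled_t(flat_c, flat_t)) > t_crit_df30:
            naive_hits += 1
        # Analysis on the true experimental unit: one mean per replicate.
        means_c = [statistics.mean(rep) for rep in control]
        means_t = [statistics.mean(rep) for rep in treated]
        if abs(pooled_t(means_c, means_t)) > t_crit_df6:
            correct_hits += 1
    return naive_hits / n_sims, correct_hits / n_sims


naive_rate, correct_rate = false_positive_rates()
print(f"False positive rate treating n as 16: {naive_rate:.3f}")
print(f"False positive rate using n = 4:      {correct_rate:.3f}")
```

With these assumed variances, the analysis that counts technical replicates as samples flags a "significant" difference far more often than the nominal 5%, while the analysis on replicate means stays close to it. Averaging within each biological replicate is the simplest fix. A mixed model with random effects for replicate and slide would be the fuller treatment.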

It is clear the authors have a complicated nested study design. That means the authors need to fit a more complicated general linear mixed model to account for random effects, such as date of performance and slide effects. But they didn’t do that.

Toxicologically speaking, I’m really concerned that they had such a high degree of cell death at 24hrs and 48hrs at the same concentrations of nano TiO2. That means we cannot rely on the genotoxicity analyses.

## And There Are Problems with Srivastava et al., too?

Yes, yes there are. The study has an exceedingly small sample size — 3 per group. That means it is highly likely that the study will suffer from sampling bias. In fact, the standard deviations are exceedingly small (0.30 for the control micronucleus group), which may result in false positive results.

Micronucleus assays also tend to be performed with duplicate slides. That’s not happening here.

Given the type of data in the micronucleus assay, we normally wouldn’t analyze this data using an ANOVA. It’s count data, and count data typically are not normally distributed. So this should be analyzed a different way.

## Stoccoro et al. Is Also Problematic

The authors chose to compare the nanoparticle exposed groups to the untreated (negative) control. Their justification is that, “no difference was observed between negative and solvent control…” The problem is that the absence of statistical significance is not the same as no difference. In addition, the authors do not show any data to substantiate the claim that the untreated (negative) control is not different from the solvent/vehicle control. Finally, for the micronucleus assay, the authors reported a sample size of 3 (or 4 based on the figure legend). That is simply too few, and it raises the possibility of sampling bias.

Note that the authors are inconsistent about the number of replicates. Was it 3 or was it 4 for the micronucleus assay? It matters. And the fact that EFSA would use a paper that cannot get its story straight, and the fact that the peer reviewers didn’t catch it, suggests that this peer review for this paper was rather sloppy.