v1.2.3 Improvements

In v1.2.3 of codoff, we replaced the empirical P-value with a metric called the "Discordance Percentile". The two metrics are closely related: both rely on simulations to gauge how different the codon usage profile of the focal region is from that of the background genome, and the "Discordance Percentile" is effectively just the empirical P-value multiplied by 100. The rationale is that a percentile is simpler to interpret and makes it easier to systematically investigate multiple focal regions (especially when independence between regions can't be assumed).
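To make the relationship between the two metrics concrete, here is a minimal sketch. The function names and the exact counting rule (the fraction of simulated values at least as large as the observed one, with no pseudocount) are assumptions for illustration and may not match how codoff computes the value internally.

```python
def empirical_p_value(observed, simulated):
    """Fraction of simulated distances at least as large as the observed one.

    Illustrative only: the precise counting rule used by codoff is an assumption.
    """
    if not simulated:
        raise ValueError("need at least one simulated value")
    as_extreme = sum(1 for value in simulated if value >= observed)
    return as_extreme / len(simulated)


def discordance_percentile(observed, simulated):
    """The same quantity as the empirical P-value, on a 0-100 scale."""
    return 100.0 * empirical_p_value(observed, simulated)


# Hypothetical distances from four simulated regions and one focal region.
simulated_distances = [0.1, 0.2, 0.3, 0.4]
focal_distance = 0.35
print(empirical_p_value(focal_distance, simulated_distances))
print(discordance_percentile(focal_distance, simulated_distances))
```

Under this sketch, a low value means that few simulated regions are as different from the background genome as the focal region is.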

However, you will notice differences between v1.2.3 and earlier versions (v1.2.2 and prior) that go beyond the switch to "Discordance Percentile". This is because the main change in v1.2.3 was actually to how the simulations are carried out.

In v1.2.2 and earlier versions, each simulation created a hypothetical gene cluster, similar in size to the focal gene cluster, composed of genes drawn from across the genome. While the genes were real, they could come from very different genomic regions, which made the simulation less realistic. In v1.2.3, for each simulation, we instead select a random point in the genome and then extract codon usage information for a neighborhood of equivalent size to the focal region. This new approach thus preserves information on genomic structure/organization that the previous simulation did not.
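The difference between the two sampling strategies can be sketched as follows. This is a simplified illustration and not the codoff implementation: the function names are hypothetical, genes are assumed to be stored in genomic order on a single linear sequence, and region size is measured in number of genes rather than in codons.

```python
import random
from collections import Counter


def pool_counts(gene_codon_counts, indices):
    """Sum per-gene codon counts over the selected genes."""
    return sum((gene_codon_counts[i] for i in indices), Counter())


def sample_scattered(gene_codon_counts, n_genes, rng):
    """Older-style sketch: pool genes drawn independently from anywhere."""
    picked = rng.sample(range(len(gene_codon_counts)), n_genes)
    return picked, pool_counts(gene_codon_counts, picked)


def sample_neighborhood(gene_codon_counts, n_genes, rng):
    """Newer-style sketch: pool consecutive genes from a random start point."""
    total = len(gene_codon_counts)
    if n_genes > total:
        raise ValueError("region is larger than the genome")
    start = rng.randrange(total - n_genes + 1)
    picked = list(range(start, start + n_genes))
    return picked, pool_counts(gene_codon_counts, picked)


# Toy genome: ten genes in genomic order, each with made-up codon counts.
genome = [Counter({"AAA": i + 1, "GGC": 2}) for i in range(10)]
rng = random.Random(0)
print(sample_scattered(genome, 3, rng)[0])     # indices may be far apart
print(sample_neighborhood(genome, 3, rng)[0])  # indices are always adjacent
```

Because the neighborhood sampler only ever returns adjacent genes, any local structure in codon usage carries through into the simulated regions.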

Checking the distributions:

The motivation for switching simulation approaches came from checking how the distribution of the Discordance Percentile (previously the empirical P-value) looks for randomly selected regions of a certain size across an input genome. If the simulation is working properly, we would expect the distribution of such values to be uniform. This testing was performed using the run_simulations.py script included in the main folder of codoff.
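One simple way to check a set of percentiles for uniformity is to bin them into deciles and compute a chi-square statistic against equal expected counts. The sketch below is a generic illustration with hypothetical function names. It does not reproduce what run_simulations.py does.

```python
def decile_counts(percentiles):
    """Count how many values fall into each of ten equal-width bins on 0-100."""
    counts = [0] * 10
    for p in percentiles:
        if not 0 <= p <= 100:
            raise ValueError(f"percentile out of range: {p}")
        # A value of exactly 100 is placed in the last bin.
        counts[min(int(p // 10), 9)] += 1
    return counts


def chi_square_uniform(counts):
    """Chi-square statistic against equal expected counts in every bin."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


# Hypothetical results: one evenly spread set and one piled up near zero.
even = [5, 15, 25, 35, 45, 55, 65, 75, 85, 95]
skewed = [1] * 10
print(chi_square_uniform(decile_counts(even)))
print(chi_square_uniform(decile_counts(skewed)))
```

With ten bins there are 9 degrees of freedom, so a statistic above roughly 16.9 would suggest a departure from uniformity at the 5% level, provided there are enough regions for the chi-square approximation to hold.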