Preparing v0.5.0 release #62

leoisl · 2023-07-19T21:46:02Z

Main changes:

scikit-learn v0.24.2 now requires building, and I am not sure why... CI failed with python 3.8, 3.9, 3.10, 3.11. It was also failing for the same reason to build the docker image with python:3.10-slim, but somehow worked for python:3.8-slim. I tried several ways to fix this, but I didn't manage. This version of scikit-learn is not even that old, from April 2021. Anyway, due to these issues, I thought it would be a good time to upgrade it to the stable versions >=1.0.0, so this is one of the main changes in this PR. We then upgraded numpy to "^1.24.4" and scikit-learn to "^1.3.0". The issue is then that unit tests were getting stuck, and we could not run them. Fixing this required to upgrade pytest and run it with --forked, but then it was too slow to run the tests. Thus I've added pytest-xdist = "^3.3.1" to the dependencies, and we now run pytest also with -n 4 (i.e. 4 threads) so that is fast again;
Simplified force_overwrite tests, 3 tests were redundant;
I selected elkan as the kmeans algorithm in the updated scikit-learn version as tests were failing with the other algorithm. Weirdly the elkan kmeans changes the PRG/result for a single test case, the amira one. So I am a bit reluctant and wondering if this should be a prerelease until we test it more on real data...

This release properly handles Ns in the MSA, and in the denovo sequences. If fixes breaking issues reported by @Danderson123 in amira.

codecov · 2023-07-19T21:50:10Z

Codecov Report

Merging #62 (0faccb1) into dev (7fbbc8c) will increase coverage by 0.00%.
The diff coverage is 100.00%.

❗ Current head 0faccb1 differs from pull request most recent head dae2c1d. Consider uploading reports for the commit dae2c1d to get more accurate results

@@           Coverage Diff           @@
##              dev      #62   +/-   ##
=======================================
  Coverage   99.73%   99.73%           
=======================================
  Files          18       18           
  Lines        1525     1528    +3     
=======================================
+ Hits         1521     1524    +3     
  Misses          4        4

Impacted Files	Coverage Δ
make_prg/from_msa/cluster_sequences.py	`100.00% <100.00%> (ø)`
make_prg/update/denovo_variants.py	`100.00% <100.00%> (ø)`
make_prg/utils/seq_utils.py	`100.00% <100.00%> (ø)`

iqbal-lab

ok by me. Elkan algorithm is a bit different way of calculating kmeans (eg see https://github.com/jjcordano/elkans_kmeans) so a slight change is acceptable IMO. I would vote go ahead, and if an issue appears in the future, we deal with it.

leoisl · 2023-07-19T22:11:16Z

The options for the kmeans algorithm in new scikit-learn version are lloyd or elkan. For lloyd, this unit test fails and the amira integration test. For elkan, just the amira integration test failed. So I went for the elkan, but I did not really know which option is actually the best nor checked in details why lloyd was failing in that unit test... If you think is better to switch to lloyd (the default) or if I should investigate this in more details, please tell me

iqbal-lab · 2023-07-19T22:36:27Z

Elkan vs Lloyd - the choice is about speed (elkan faster) and ram (elkan uses a bit more i think). i do not think we need to be perfectly identical, is ok to switch to elkan. But we still need to understand the amira integration fail, right?

leoisl · 2023-07-20T09:23:14Z

In the amira integration test, we try to build 3 PRGs: alsB, glpG and group_18516. glpG is built exactly as before, but this is not true for the other two. group_18516 has small differences that happens in 2 sites, which are just the order of alleles and alleles more nested. One example follows, the other is similar:

<before
7  9 CCCCCGC 10 TCCCCGC 9  8 CCCCCGA 8 TCCCCGA 7 GTGGCGCAGGC
>after
7 CCCCCGC 8 CCCCCGA 8 TCCCCGA 8 TCCCCGC 7 GTGGCGCAGGC

alsB is a bit more complicated. The first difference is that the site has one more allele:

>alsB
<before
 17 C 18 A 17 ACTGGGCGTCAGCGTTGATATTTTTGCCTC
>after
 17 CA 18 CG 18 AA 17 CTGGGCGTCAGCGTTGATATTTTTGCCTC

if we compare before and after, both sites can spell CACTGGG... and AACTGGG... but only the second one can spell CGCTGGG... and I can find CGCTGGG... in the MSAs... so it seems the new version is actually doing the right thing.

The second difference is the second version having two more SNPs in a region than the first one, representing 4 alleles instead of 1:

<before
ATGTTGTCAACCAGCGCTTTTGCTGCCGCCGAATATGC
>after
ATGTTGTCAACCA 27 G 28 A 27 CGCTTTTGCT 29 G 30 A 29 CCGCCGAATATGC

The 1st allele, common to both versions, is the most common in the MSA (genotypes G and G in both SNPs, n=774). Two other alleles, only present in the second version, are also in the MSA but with very low frequency (genotyping G and A, n=2; genotyping A and G, n=1). A fourth allele is not present in the MSA at all, genotyping A and A, but this is expected as this SNP is required to make G and A an option.

I can continue this analysis, if you want, going to the 3rd, 4th, and so on differences on alsB, but the first evidences seems to show the new version is actually producing more correct PRGs than before...

iqbal-lab · 2023-07-20T09:46:21Z

looks good to me, am ok to merge

leoisl added 4 commits July 19, 2023 22:43

Preparing 0.5.0 release

54f4ee2

Upgrading numpy and scikit-learn

30506b6

Upgrading docker base OS to user python 3.10 again

b7b4682

Simplifying force overwrite tests

0faccb1

leoisl requested review from mbhall88, iqbal-lab and Danderson123 July 19, 2023 21:46

iqbal-lab approved these changes Jul 19, 2023

View reviewed changes

mbhall88 approved these changes Jul 20, 2023

View reviewed changes

Updating CHANGELOG.md

dae2c1d

leoisl merged commit 55ca0ec into iqbal-lab-org:dev Jul 20, 2023
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Preparing v0.5.0 release #62

Preparing v0.5.0 release #62

leoisl commented Jul 19, 2023

codecov bot commented Jul 19, 2023 •

edited

Loading

iqbal-lab left a comment

leoisl commented Jul 19, 2023

iqbal-lab commented Jul 19, 2023

leoisl commented Jul 20, 2023

iqbal-lab commented Jul 20, 2023

Preparing v0.5.0 release #62

Preparing v0.5.0 release #62

Conversation

leoisl commented Jul 19, 2023

codecov bot commented Jul 19, 2023 • edited Loading

Codecov Report

iqbal-lab left a comment

Choose a reason for hiding this comment

leoisl commented Jul 19, 2023

iqbal-lab commented Jul 19, 2023

leoisl commented Jul 20, 2023

iqbal-lab commented Jul 20, 2023

codecov bot commented Jul 19, 2023 •

edited

Loading