Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preparing v0.5.0 release #62

Merged
merged 5 commits into from
Jul 20, 2023
Merged

Preparing v0.5.0 release #62

merged 5 commits into from
Jul 20, 2023

Conversation

leoisl
Copy link
Collaborator

@leoisl leoisl commented Jul 19, 2023

Main changes:

  1. scikit-learn v0.24.2 now requires building, and I am not sure why... CI failed with python 3.8, 3.9, 3.10, 3.11. It was also failing for the same reason to build the docker image with python:3.10-slim, but somehow worked for python:3.8-slim. I tried several ways to fix this, but I didn't manage. This version of scikit-learn is not even that old, from April 2021. Anyway, due to these issues, I thought it would be a good time to upgrade it to the stable versions >=1.0.0, so this is one of the main changes in this PR. We then upgraded numpy to "^1.24.4" and scikit-learn to "^1.3.0". The issue is then that unit tests were getting stuck, and we could not run them. Fixing this required to upgrade pytest and run it with --forked, but then it was too slow to run the tests. Thus I've added pytest-xdist = "^3.3.1" to the dependencies, and we now run pytest also with -n 4 (i.e. 4 threads) so that is fast again;
  2. Simplified force_overwrite tests, 3 tests were redundant;
  3. I selected elkan as the kmeans algorithm in the updated scikit-learn version as tests were failing with the other algorithm. Weirdly the elkan kmeans changes the PRG/result for a single test case, the amira one. So I am a bit reluctant and wondering if this should be a prerelease until we test it more on real data...

This release properly handles Ns in the MSA, and in the denovo sequences. If fixes breaking issues reported by @Danderson123 in amira.

@codecov
Copy link

codecov bot commented Jul 19, 2023

Codecov Report

Merging #62 (0faccb1) into dev (7fbbc8c) will increase coverage by 0.00%.
The diff coverage is 100.00%.

❗ Current head 0faccb1 differs from pull request most recent head dae2c1d. Consider uploading reports for the commit dae2c1d to get more accurate results

@@           Coverage Diff           @@
##              dev      #62   +/-   ##
=======================================
  Coverage   99.73%   99.73%           
=======================================
  Files          18       18           
  Lines        1525     1528    +3     
=======================================
+ Hits         1521     1524    +3     
  Misses          4        4           
Impacted Files Coverage Δ
make_prg/from_msa/cluster_sequences.py 100.00% <100.00%> (ø)
make_prg/update/denovo_variants.py 100.00% <100.00%> (ø)
make_prg/utils/seq_utils.py 100.00% <100.00%> (ø)

Copy link
Contributor

@iqbal-lab iqbal-lab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok by me. Elkan algorithm is a bit different way of calculating kmeans (eg see https://github.com/jjcordano/elkans_kmeans) so a slight change is acceptable IMO. I would vote go ahead, and if an issue appears in the future, we deal with it.

@leoisl
Copy link
Collaborator Author

leoisl commented Jul 19, 2023

The options for the kmeans algorithm in new scikit-learn version are lloyd or elkan. For lloyd, this unit test fails and the amira integration test. For elkan, just the amira integration test failed. So I went for the elkan, but I did not really know which option is actually the best nor checked in details why lloyd was failing in that unit test... If you think is better to switch to lloyd (the default) or if I should investigate this in more details, please tell me

@iqbal-lab
Copy link
Contributor

Elkan vs Lloyd - the choice is about speed (elkan faster) and ram (elkan uses a bit more i think). i do not think we need to be perfectly identical, is ok to switch to elkan. But we still need to understand the amira integration fail, right?

@leoisl
Copy link
Collaborator Author

leoisl commented Jul 20, 2023

In the amira integration test, we try to build 3 PRGs: alsB, glpG and group_18516. glpG is built exactly as before, but this is not true for the other two. group_18516 has small differences that happens in 2 sites, which are just the order of alleles and alleles more nested. One example follows, the other is similar:

<before
7  9 CCCCCGC 10 TCCCCGC 9  8 CCCCCGA 8 TCCCCGA 7 GTGGCGCAGGC
>after
7 CCCCCGC 8 CCCCCGA 8 TCCCCGA 8 TCCCCGC 7 GTGGCGCAGGC

alsB is a bit more complicated. The first difference is that the site has one more allele:

>alsB
<before
 17 C 18 A 17 ACTGGGCGTCAGCGTTGATATTTTTGCCTC
>after
 17 CA 18 CG 18 AA 17 CTGGGCGTCAGCGTTGATATTTTTGCCTC 

if we compare before and after, both sites can spell CACTGGG... and AACTGGG... but only the second one can spell CGCTGGG... and I can find CGCTGGG... in the MSAs... so it seems the new version is actually doing the right thing.

The second difference is the second version having two more SNPs in a region than the first one, representing 4 alleles instead of 1:

<before
ATGTTGTCAACCAGCGCTTTTGCTGCCGCCGAATATGC
>after
ATGTTGTCAACCA 27 G 28 A 27 CGCTTTTGCT 29 G 30 A 29 CCGCCGAATATGC

The 1st allele, common to both versions, is the most common in the MSA (genotypes G and G in both SNPs, n=774). Two other alleles, only present in the second version, are also in the MSA but with very low frequency (genotyping G and A, n=2; genotyping A and G, n=1). A fourth allele is not present in the MSA at all, genotyping A and A, but this is expected as this SNP is required to make G and A an option.

I can continue this analysis, if you want, going to the 3rd, 4th, and so on differences on alsB, but the first evidences seems to show the new version is actually producing more correct PRGs than before...

@iqbal-lab
Copy link
Contributor

looks good to me, am ok to merge

@leoisl leoisl merged commit 55ca0ec into iqbal-lab-org:dev Jul 20, 2023
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants