Merge branch 'swarm3'

torognes · Oct 24, 2019 · c8f18cf
2 parents 02ad79a + 9aa56c5
commit c8f18cf
Showing 39 changed files with 5,168 additions and 4,006 deletions.
diff --git a/.travis.yml b/.travis.yml
@@ -0,0 +1,15 @@
+language: c++
+
+os: linux
+
+dist: bionic
+
+compiler: gcc
+
+before_install:
+- sudo apt-get install -y valgrind
+
+script:
+- make
+- export PATH=$PWD/bin:$PATH
+- git clone https://github.com/frederic-mahe/swarm-tests.git && cd swarm-tests && bash ./run_all_tests.sh | tee tests.log && ! grep -q FAIL tests.log
diff --git a/README.md b/README.md
@@ -1,3 +1,5 @@
+[![Build Status](https://travis-ci.org/torognes/swarm.svg?branch=swarm3)](https://travis-ci.org/torognes/swarm)
+
 # swarm
 
 A robust and fast clustering method for amplicon-based studies.
@@ -16,21 +18,32 @@ To help users, we describe
 starting from raw fastq files, clustering with **swarm** and producing
 a filtered OTU table.
 
-swarm 2.0 introduces several novelties and improvements over swarm
+swarm 3.0 introduces:
+* a much faster default algorithm,
+* a reduced memory footprint,
+* binaries for Windows x86-64, GNU/Linux ARM 64, and GNU/Linux POWER8,
+* an updated, hardened, and thoroughly tested code.
+
+Please note that:
+* strict dereplication of input sequences is now mandatory,
+* \-\-seeds option (\-w) now outputs results sorted by decreasing
+  abundance, and then by alphabetical order of sequence labels.
+
+swarm 2.0 introduced several novelties and improvements over swarm
 1.0:
 * built-in breaking phase now performed automatically,
 * possibility to output OTU representatives in fasta format (option
   `-w`),
 * fast algorithm now used by default for *d* = 1 (linear time
   complexity),
 * a new option called *fastidious* that refines *d* = 1 results and
-  reduces the number of small OTUs,
+  reduces the number of small OTUs.
 
 ## Common misconceptions
 
 **swarm** is a single-linkage clustering method, with some superficial
-  similarities with other clustering methods (e.g.,
-  [Huse et al, 2010](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2909393/)). **swarm**'s
+  similarities with other clustering methods (e.g., [Huse et al,
+  2010](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2909393/)). **swarm**'s
   novelty is its iterative growth process and the use of sequence
   abundance values to delineate OTUs. **swarm** properly delineates
   large OTUs (high recall), and can distinguish OTUs with as little as
@@ -76,8 +89,8 @@ cgtcgtcgtcgtcgt
 
 where sequence identifiers are unique and end with a value indicating
 the number of occurrences of the sequence (e.g., `_1000`). Alternative
-format is possible with the option `-z`, please see the
-[user manual](https://github.com/torognes/swarm/blob/master/man/swarm_manual.pdf). Swarm
+format is possible with the option `-z`, please see the [user
+manual](https://github.com/torognes/swarm/blob/master/man/swarm_manual.pdf). Swarm
 **requires** each fasta entry to present a number of occurrences to
 work properly. That crucial information can be produced during the
 [dereplication](#dereplication-mandatory) step.
@@ -87,7 +100,7 @@ Use `swarm -h` to get a short help, or see the
   for a complete description of input/output formats and command line
   options.
 
-The memory footprint of **swarm** is roughly 1.6 times the size of the
+The memory footprint of **swarm** is roughly 0.6 times the size of the
 input fasta file. When using the fastidious option, memory footprint
 can increase significantly. See options `-c` and `-y` to control and
 cap swarm's memory consumption.
@@ -210,15 +223,10 @@ from two different sets have the same hash code, it means that the
 sequences they represent are identical.
 
 If for some reason your fasta entries don't have abundance values, and
-you still want to run swarm, you can easily add fake abundance values:
-
-```sh
-sed '/^>/ s/$/_1/' amplicons.fasta > amplicons_with_abundances.fasta
-```
-
-Alternatively, you may specify a default abundance value with
-**swarm**'s `--append-abundance` (`-a`) option to be used when
-abundance information is missing from a sequence.
+you still want to run swarm (not recommended), you can specify a
+default abundance value with **swarm**'s `--append-abundance` (`-a`)
+option to be used when abundance information is missing from a
+sequence.
 
 
 ### Launch swarm ###
@@ -305,15 +313,6 @@ rm "${AMPLICONS}"
 ```
 
 
-## Troubleshooting ##
-
-If **swarm** exits with an error message saying `This program
-requires a processor with SSE2`, your computer is too old to run
-**swarm** (or based on a non x86-64 architecture). **swarm** only runs
-on CPUs with the SSE2 instructions, i.e. most Intel and AMD CPUs
-released since 2004.
-
-
 ## Citation ##
 
 To cite **swarm**, please refer to:
@@ -333,7 +332,7 @@ You are welcome to:
 
 * submit suggestions and bug-reports at: https://github.com/torognes/swarm/issues
 * send a pull request on: https://github.com/torognes/swarm/
-* compose a friendly e-mail to: Frédéric Mahé <mahe@rhrk.uni-kl.de> and Torbjørn Rognes <torognes@ifi.uio.no>
+* compose a friendly e-mail to: Frédéric Mahé <frederic.mahe@cirad.fr> and Torbjørn Rognes <torognes@ifi.uio.no>
 
 
 ## Third-party pipelines ##
@@ -356,7 +355,7 @@ You are welcome to:
 If you want to try alternative free and open-source clustering
 methods, here are some links:
 
-* [VSEARCH](https://github.com/torognes/vsearch)
+* [vsearch](https://github.com/torognes/vsearch)
 * [Oligotyping](http://merenlab.org/projects/oligotyping/)
 * [DNAclust](http://dnaclust.sourceforge.net/)
 * [Sumaclust](http://metabarcoding.org/sumatra)
@@ -365,6 +364,11 @@ methods, here are some links:
 
 ## Version history ##
 
+### version 3.0 ###
+
+**swarm** 3.0 is much faster when _d_ = 1, and consumes less memory.
+Strict dereplication is now mandatory.
+
 ### version 2.2.2 ###
 
 **swarm** 2.2.2 fixes a bug causing Swarm to wait forever in very rare
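
Note on the strict dereplication requirement introduced in the README hunks above: it means that each distinct sequence appears exactly once in the input, with its number of occurrences carried in the header. The following Python sketch illustrates the idea only. It assumes headers of the form `label_abundance`, compares sequences case-insensitively with `U` read as `T` (as the manual describes for swarm's input), and keeps the first label seen for each sequence. The function name `dereplicate` and the sample records are hypothetical; this is not how swarm or any dereplication tool is implemented.

```python
def dereplicate(records):
    """Merge identical sequences and sum their abundances.

    `records` is an iterable of (header, sequence) pairs, where each
    header is assumed to look like 'label_abundance' (hypothetical
    input format for this sketch).
    """
    merged = {}  # normalized sequence -> [first label seen, total abundance]
    for header, sequence in records:
        label, _, abundance = header.rpartition("_")
        # Normalize case and treat U as T before comparing sequences
        key = sequence.upper().replace("U", "T")
        if key in merged:
            merged[key][1] += int(abundance)
        else:
            merged[key] = [label, int(abundance)]
    # Dictionaries preserve insertion order (Python 3.7+)
    return [(f"{label}_{total}", seq) for seq, (label, total) in merged.items()]


records = [("a_3", "acgt"), ("b_2", "ACGT"), ("c_1", "ACGU"), ("d_5", "TTTT")]
unique_records = dereplicate(records)
print(unique_records)
```

In this sample, the first three records collapse into a single entry because they differ only by case or by `U` versus `T`.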

diff --git a/man/swarm.1 b/man/swarm.1
@@ -1,5 +1,5 @@
 .\" ============================================================================
-.TH swarm 1 "December 12, 2017" "version 2.2.2" "USER COMMANDS"
+.TH swarm 1 "October 24, 2019" "version 3.0.0" "USER COMMANDS"
 .\" ============================================================================
 .SH NAME
 swarm \(em find clusters of nearly-identical nucleotide amplicons
@@ -110,8 +110,9 @@ results obtained during the clustering process allows \fBswarm\fR to
 avoid most of the amplicon comparisons needed in a naïve approach. To
 speed up the remaining amplicon comparisons, \fBswarm\fR implements an
 extremely fast Needleman-Wunsch algorithm making use of the Streaming
-SIMD Extensions (SSE2) of modern x86-64 CPUs. If SSE2 instructions are
-not available, \fBswarm\fR exits with an error message.
+SIMD Extensions (SSE2) of modern x86-64 CPUs, or NEON instructions of
+ARM-64 CPUs. If SSE2 instructions are not available, \fBswarm\fR exits
+with an error message.
 .PP
 \fBswarm\fR can read nucleotide amplicons in fasta format from a
 normal file or from the standard input (using a pipe or a
@@ -138,7 +139,19 @@ defined as a string of [ACGT] or [ACGU] symbols (case insensitive, 'U'
 is replaced with 'T' internally), starting after the end of the header
 line and ending before the next header line or the file end;
 \fBswarm\fR silently removes newline symbols ('\\n' or '\\r') and
-exits with an error message if any other symbol is present.
+exits with an error message if any other symbol is present. Lastly, if
+sequences are not all unique, i.e. were not properly dereplicated,
+swarm will exit with an error message.
+.PP
+Clusters are written to output files (specified with \-i, \-o, \-s and
+\-u) by decreasing abundance of their seed sequences, and then by
+alphabetical order of seed sequence labels. An exception to that is
+the \-w (\-\-seeds) output, which is sorted by decreasing \fIcluster
+abundance\fR (sum of abundances of all sequences in the cluster), and
+then by alphabetical order of seed sequence labels. This is
+particularly useful for post-clustering steps, such as \fIde novo\fR
+chimera detection, that require clusters to be sorted by decreasing
+abundances.
 .\" ----------------------------------------------------------------------------
 .SS General options
 .TP 9
@@ -286,7 +299,7 @@ in situations where writing to \fIstandard error\fR is problematic
 output clustering results to \fIfilename\fR. Results consist of a list
 of OTUs, one OTU per line. An OTU is a list of amplicon headers
 separated by spaces. That output format can be modified by the option
-\-\-mothur (\-r). Default is to write to standard output.
+\-\-mothur (\-r). Default is to write to \fIstandard output\fR.
 .TP
 .B \-r\fP,\fB\ \-\-mothur
 output clustering results in a format compatible with Mothur. That
@@ -305,7 +318,7 @@ total abundance of amplicons in the OTU,
 .IP \n+[step].
 label of the initial seed (header without abundance annotations),
 .IP \n+[step].
-initial seed abundance,
+abundance of the initial seed,
 .IP \n+[step].
 number of amplicons with an abundance of 1 in the OTU,
 .IP \n+[step].
@@ -363,13 +376,15 @@ output OTU representative sequences to \fIfilename\fR in fasta
 format. The abundance value of each OTU representative is the sum of
 the abundances of all the amplicons in the OTU. Fasta headers are
 formated as follows: '>label_\fIinteger\fR',
-or '>label;size=\fIinteger\fR;' if the \-z option is used.
+or '>label;size=\fIinteger\fR;' if the \-z option is used, and
+sequences are uppercased. Sequences are sorted by decreasing
+abundance, and then by alphabetical order of sequence labels.
 .TP
 .B \-z\fP,\fB\ \-\-usearch\-abundance
 accept amplicon abundance values in usearch/vsearch's style
 (>label;size=\fIinteger\fR[;]). That option influences the abundance
-annotation style used in swarm's standard output (\-o), as well as the
-ouput of options \-r, \-u and \-w.
+annotation style used in swarm's \fIstandard output\fR (\-o), as well
+as the output of options \-r, \-u and \-w.
 .LP
 .\" ----------------------------------------------------------------------------
 .SS Pairwise alignment advanced options
@@ -410,7 +425,7 @@ zcat myfile.fasta.gz | \\
         \-t 4 \\
         \-f \\
         \-w myfile.representatives.fasta \\
-        \-o myfile.swarms
+        \-o /dev/null
 .RE
 .EE
 .\" ============================================================================
@@ -475,7 +490,7 @@ License along with this program.  If not, see
 .\" ============================================================================
 .SH SEE ALSO
 \fBswipe\fR, an extremely fast Smith-Waterman database search tool by
-Torbjørn Rognes (available from
+Torbjørn Rognes (available at
 .UR https://github.com/torognes/swipe
 .UE ).
 .PP
@@ -492,8 +507,17 @@ New features and important modifications of \fBswarm\fR (short lived
 or minor bug releases are not mentioned):
 .RS
 .TP
+.BR v3.0.0\~ "released October 24, 2019"
+Version 3.0.0 introduces a faster algorithm for \fId\fR = 1, and a
+reduced memory footprint. Swarm has been ported to Windows x86-64,
+GNU/Linux ARM 64, and GNU/Linux POWER8. Internal code has been
+modernized, hardened, and thoroughly tested. Strict dereplication of
+input sequences is now mandatory. The \-\-seeds option (\-w) now
+outputs results sorted by decreasing abundance, and then by
+alphabetical order of sequence labels.
+.TP
 .BR v2.2.2\~ "released December 12, 2017"
-Version 2.2.2 fixes a bug that would cause Swarm to wait forever in
+Version 2.2.2 fixes a bug that would cause swarm to wait forever in
 very rare cases when multiple threads were used.
 .TP
 .BR v2.2.1\~ "released October 27, 2017"
@@ -527,7 +551,7 @@ bug only applies when \fId\fR > 1.
 .BR v2.1.10\~ "released December 22, 2016"
 Version 2.1.10 fixes two bugs related to gap penalties of alignments.
 The first bug may lead to wrong aligments and similarity percentages
-reported in UCLUST (.uc) files. The second bug makes Swarm use a
+reported in UCLUST (.uc) files. The second bug makes swarm use a
 slightly higher gap extension penalty than specified. The default gap
 extension penalty used have actually been 4.5 instead of 4.
 .TP
@@ -679,10 +703,10 @@ not. Only basic SSE2 instructions are now required to run \fBswarm\fR.
 .TP
 .BR v1.2.4\~ "released January 30, 2014"
 Version 1.2.4 introduces an option \-\-break\-swarms to output all
-pairs of amplicons with \fId\fR differences to standard error. That
-option is used by the companion script `swarm_breaker.py` to refine
-\fBswarm\fR results. The syntax of the inline assembly code is changed
-for compatibility with more compilers.
+pairs of amplicons with \fId\fR differences to \fIstandard
+error\fR. That option is used by the companion script
+`swarm_breaker.py` to refine \fBswarm\fR results. The syntax of the
+inline assembly code is changed for compatibility with more compilers.
 .TP
 .BR v1.2\~ "released May 16, 2013"
 Version 1.2 greatly improves speed by using alignment-free comparisons
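
Note on the header formats documented in the manual hunks above (`>label_integer` by default, `>label;size=integer[;]` with `-z`): a short Python sketch of how a downstream script might read both styles. The regular expressions, the function name `parse_header`, and its `usearch` parameter are assumptions made for illustration; swarm's own parser is written in C++ and may accept or reject inputs differently.

```python
import re

# Default style: everything before the last underscore is the label
SWARM_STYLE = re.compile(r"^(?P<label>.+)_(?P<abundance>[0-9]+)$")
# usearch/vsearch style: the trailing semicolon is optional
USEARCH_STYLE = re.compile(r"^(?P<label>.+?);size=(?P<abundance>[0-9]+);?$")


def parse_header(header, usearch=False):
    """Return (label, abundance) from a fasta header line."""
    pattern = USEARCH_STYLE if usearch else SWARM_STYLE
    match = pattern.match(header.strip().lstrip(">"))
    if match is None:
        raise ValueError(f"no abundance annotation found in: {header!r}")
    return match.group("label"), int(match.group("abundance"))


default_style = parse_header(">seq1_1000")
usearch_style = parse_header(">seq1;size=1000;", usearch=True)
print(default_style, usearch_style)
```

Raising an error when no abundance is found mirrors the README's point that swarm requires each fasta entry to carry a number of occurrences.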

diff --git a/man/swarm_manual.pdf b/man/swarm_manual.pdf
diff --git a/scripts/amplicon_contingency_table.py b/scripts/amplicon_contingency_table.py
@@ -1,15 +1,13 @@
-#!/usr/bin/env python
+#!/usr/bin/env python3
 # -*- coding: utf-8 -*-
 """
     Read all fasta files and build a sorted amplicon contingency
-    table. Usage: python amplicon_contingency_table.py samples_*.fas
+    table. Usage: python3 amplicon_contingency_table.py samples_*.fas
 """
 
-from __future__ import print_function
-
-__author__ = "Frédéric Mahé <mahe@rhrk.uni-kl.fr>"
-__date__ = "2016/03/12"
-__version__ = "$Revision: 2.1"
+__author__ = "Frédéric Mahé <frederic.mahe@cirad.fr>"
+__date__ = "2019/09/24"
+__version__ = "$Revision: 3.0"
 
 import os
 import sys
@@ -35,7 +33,7 @@ def fasta_parse():
         sample = os.path.basename(fasta_file)
         sample = os.path.splitext(sample)[0]
         samples[sample] = samples.get(sample, 0) + 1
-        with open(fasta_file, "rU") as fasta_file:
+        with open(fasta_file, "r") as fasta_file:
             for line in fasta_file:
                 if line.startswith(">"):
                     amplicon, abundance = line.strip(">;\n").split(separator)
@@ -65,7 +63,7 @@ def main():
     all_amplicons, amplicons2samples, samples = fasta_parse()
 
     # Sort amplicons by decreasing abundance (and by amplicon name)
-    sorted_all_amplicons = sorted(all_amplicons.iteritems(),
+    sorted_all_amplicons = sorted(iter(all_amplicons.items()),
                                   key=operator.itemgetter(1, 0))
     sorted_all_amplicons.reverse()
 

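Note on the ported sort in `amplicon_contingency_table.py`: `sorted()` accepts any iterable, so the `iter()` wrapper around `.items()` is harmless but not required. Sorting on `(abundance, name)` and then reversing keeps the Python 2 behaviour, which means that amplicons with equal abundance come out in reverse alphabetical order. The sketch below compares that with a key that sorts by decreasing abundance and then by ascending name, in case that ordering is preferred. The sample dictionary is hypothetical.

```python
import operator

# Hypothetical amplicon -> abundance mapping
all_amplicons = {"b": 10, "a": 10, "c": 25}

# Ordering used in the ported script: ascending on (abundance, name),
# then reversed as a whole
ported = sorted(all_amplicons.items(), key=operator.itemgetter(1, 0))
ported.reverse()

# Alternative: decreasing abundance, ties broken by ascending name
by_name = sorted(all_amplicons.items(), key=lambda item: (-item[1], item[0]))

print(ported)   # ties ("a", "b") appear in reverse alphabetical order
print(by_name)  # ties ("a", "b") appear in alphabetical order
```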
diff --git a/scripts/graph_plot.py b/scripts/graph_plot.py
@@ -1,29 +1,24 @@
-#!/usr/bin/env python
+#!/usr/bin/env python3
 # -*- coding: utf-8 -*-
 """
     Visualize the internal structure of a swarm (color vertices by
-    abundance). Requires the module igraph and python 2.7+.
-
-    Limitations: amplicons grafted with the fastidious option will be
-    discarded and will not be visualized.
+    abundance). Requires the module igraph and python 3.
 """
 
-from __future__ import print_function
-
-__author__ = "Frédéric Mahé <mahe@rhrk.uni-kl.fr>"
-__date__ = "2016/11/09"
-__version__ = "$Revision: 3.1"
+__author__ = "Frédéric Mahé <frederic.mahe@cirad.fr>"
+__date__ = "2019/09/24"
+__version__ = "$Revision: 4.0"
 
 import sys
 import os.path
 from igraph import Graph, plot
 from optparse import OptionParser
 
-#*****************************************************************************#
+# *************************************************************************** #
 #                                                                             #
 #                                  Functions                                  #
 #                                                                             #
-#*****************************************************************************#
+# *************************************************************************** #
 
 
 def option_parse():
@@ -76,7 +71,7 @@ def parse_files(swarms, internal_structure, OTU, drop):
     """
     # List amplicon ids and abundances
     amplicons = list()
-    with open(swarms, "rU") as swarms:
+    with open(swarms, "r") as swarms:
         for i, swarm in enumerate(swarms):
             if i == OTU - 1:
                 # Deal with ";size=" in a rather clumsy way... but it works
@@ -100,7 +95,7 @@ def parse_files(swarms, internal_structure, OTU, drop):
 
     # List pairwise relations
     relations = list()
-    with open(internal_structure, "rU") as internal_structure:
+    with open(internal_structure, "r") as internal_structure:
         print("Parsing amplicon relationships", file=sys.stdout)
         for line in internal_structure:
             # Get the first four elements of the line
@@ -138,7 +133,7 @@ def build_graph(amplicons, relations):
 
     amplicon_ids = [amplicon[0] for amplicon in amplicons]
     abundances = [int(amplicon[1]) for amplicon in amplicons]
-    minimum, maximum = min(abundances), max(abundances)
+    maximum = max(abundances)
 
     # Determine canvas size
     if len(abundances) < 500:
@@ -214,11 +209,11 @@ def main():
     return
 
 
-#*****************************************************************************#
+# *************************************************************************** #
 #                                                                             #
 #                                     Body                                    #
 #                                                                             #
-#*****************************************************************************#
+# *************************************************************************** #
 
 if __name__ == '__main__':