Why replace "-" with "N" in the tutorial script?

Data uploaded to BOLD and NCBI may be part of a bigger study, and are often uploaded as an alignment containing gaps. Would it not be more biologically relevant to simply remove the gaps?

1 Answer
Accepted answer

Hi Peter,

there are several things to consider.

Some data in GenBank include N patches, for example many of the older ITS sequences. They were stitched together later from separate ITS1 and ITS2 sequences, and the missing central part (usually the entire 5.8S rDNA) is represented in these sequences as an N patch. On the other hand, there are actual (large) deletions (in plant ITS these can be 100 or more nucleotides in ITS1). So one should never replace gaps with Ns before uploading. Ns should be reserved for uncertain base pairs. (A related note: we should also reserve ambiguity codes for actual single-site variation, i.e. heterozygosity, and use N for any uncertain data.)

Biologically speaking, yes. The gaps are alignment artefacts, not data.

But this can be impractical. Aligned data has two practical advantages.

Nr. 1. When I uploaded my more than 900 5S-IGS sequences as one batch, or hundreds of ITS sequences, I was very happy that the EMBL staff gave me the option to submit them aligned, because it facilitates the annotation of the gene region. However, they removed all gaps (de-aligned my data) for the final storage. So all my data are stored gap-free.

Nr. 2. It is a service to others to be able to download a large, already aligned dataset and work with it, rather than having to re-align it first. Also, the standard software packages for sequence handling all have options to remove the gaps and de-align, which is much easier to do than to align large data sets.
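To show how little is involved in de-aligning, here is a minimal Python sketch. The function name `dealign_fasta` and the two toy sequences are made up for this illustration, and it assumes plain FASTA text in which "-" is the only gap character. It is not the tutorial script, just one way the gap removal could be done.

```python
def dealign_fasta(text, gap_chars="-"):
    """Return FASTA text with gap characters stripped from the sequence lines.

    Hypothetical helper for illustration: assumes plain FASTA input and
    that every character in gap_chars is an alignment gap, not data.
    """
    drop_gaps = str.maketrans("", "", gap_chars)
    out = []
    for line in text.splitlines():
        if line.startswith(">"):
            out.append(line)  # header lines are kept unchanged
            continue
        # Remove gaps only; Ns and ambiguity codes are left untouched.
        stripped = line.translate(drop_gaps)
        if stripped:  # skip lines that consisted of gaps only
            out.append(stripped)
    return "\n".join(out) + "\n"


# Toy alignment (invented data): seq1 carries a real N patch.
aligned = ">seq1\nACG--TNNA\n>seq2\nAC-GTT--A\n"
print(dealign_fasta(aligned), end="")
```

Note that the N patch in seq1 survives while the gaps disappear, which is the distinction made above: Ns stand for uncertain bases, gaps are only an artefact of the alignment.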

Hope that answers your question,

Cheers, Guido
