Data Scientist - Benjamin Tovar

Handling large FASTA sequence datasets in R: Shuffle and retrieve “n” number of sequences of fixed length from the whole FASTA file and export them in a new FASTA file

15 Mar

Introduction

When you work with large FASTA datasets, you will likely find that the sequences are of mixed quality.

For example, imagine that you retrieve the whole collection of exons of a given organism. Suppose the FASTA file is 50 MB and contains ~50,000 DNA sequences. If you look at them, you may find that some sequences are much longer than others, while some are probably only 50 bases long.

That’s what I mean by mixed quality. A filter can therefore be very helpful, because it may be more informative to set a threshold and say “hey, from the pool of ~50K sequences I only want 5K sequences chosen at random (from the entire FASTA dataset), and every sequence must have a fixed length, say 1,000 bases”.
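Before picking a threshold, it can help to check how many records are actually long enough. The snippet below is a minimal sketch of that check, not part of the downloadable source: the helper name summariseFastaLengths and its arguments are hypothetical, and it assumes the seqinr package is installed.

```r
library(seqinr)

# Hypothetical helper (not in the original source file): report how many
# records could supply a window of at least `minLength` bases.
summariseFastaLengths <- function(fastaFile, minLength) {
  # Read each record as a vector of single characters
  seqs <- read.fasta(fastaFile, seqtype = "DNA", as.string = FALSE)
  # Length of every record, in bases
  lens <- vapply(seqs, length, integer(1))
  list(
    total = length(lens),               # number of records in the file
    lengthSummary = summary(lens),      # min / quartiles / max of the lengths
    longEnough = sum(lens >= minLength) # records that pass the threshold
  )
}
```

For instance, summariseFastaLengths("example.fasta", 1000)$longEnough would tell you whether asking for 5K sequences of 1,000 bases is even possible.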

That’s why I wrote some code in R: a function called “shuffleAndExtract”. You can download the example set and the source here.

Here is the description of the function:

shuffleAndExtract: This R function opens a FASTA dataset, shuffles the sequences, and extracts the number of sequences requested by the user, generating a new dataset of fixed size (number of required sequences) in which every sequence has the same length.

After you download the example set and the R source file, you can run a very simple example:

NOTE: my implementation depends on the package “seqinr”. If it is not installed, install it before the magic begins with a simple install.packages("seqinr").


# run example:
source("handleFastaDatasets.R")
shuffleAndExtract("example.fasta",1000,200)

The arguments of the function are:

  • inputFastaFile: name of the input fasta file
  • numberOfoutputSeqs: number of desired sequences in the output file
  • lengthOutputSeqs: fixed length of every sequence in the output file
  • initialPos: position in each sequence where the extracted window begins, default = 1
  • outputFileName: name of the output file; by default (e.g.) “inputFastaFile.fasta.output.fasta”
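To make the arguments concrete, here is a minimal sketch of how such a function could be put together with seqinr. It is not the code from the download: the name shuffleAndExtractSketch is hypothetical, and two behaviours are my assumptions rather than anything stated above, namely that records too short to supply the full window are discarded, and that the default output name is the input name with “.output.fasta” appended.

```r
library(seqinr)

# Hypothetical re-implementation for illustration only; the real
# shuffleAndExtract in handleFastaDatasets.R may work differently.
shuffleAndExtractSketch <- function(inputFastaFile,
                                    numberOfoutputSeqs,
                                    lengthOutputSeqs,
                                    initialPos = 1,
                                    outputFileName = paste0(inputFastaFile,
                                                            ".output.fasta")) {
  # Read every record as a vector of single characters
  seqs <- read.fasta(inputFastaFile, seqtype = "DNA", as.string = FALSE)

  # Assumption: records that cannot supply the full window are dropped
  lastPos <- initialPos + lengthOutputSeqs - 1
  longEnough <- seqs[vapply(seqs, length, integer(1)) >= lastPos]
  if (length(longEnough) < numberOfoutputSeqs) {
    stop("only ", length(longEnough), " sequences are long enough")
  }

  # Shuffle: sample record indices at random, without replacement
  picked <- longEnough[sample.int(length(longEnough), numberOfoutputSeqs)]

  # Trim every sampled record to the same fixed window
  trimmed <- lapply(picked, function(s) s[initialPos:lastPos])

  # Write the new dataset, keeping the original record names
  write.fasta(sequences = trimmed, names = names(picked),
              file.out = outputFileName)
  invisible(outputFileName)
}
```

Calling shuffleAndExtractSketch("example.fasta", 1000, 200) would mirror the example above; call set.seed() first if you need the same random subset on every run.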

Code

You can download the code by clicking here
