Data Scientist - Benjamin Tovar

Simple Bash command line to reduce the length of the FASTA header lines

15 Mar

Introduction

Hi there. How many times have we ended up with a FASTA file whose header lines are huge, like this one:


>gi|600513|gb|M21306.1|DROTRPC Drosophila melanogaster photoreceptor ...
AAAAGTTTAAATTGGATAAATTGCAAAAGGACAATTAAGGATACGGAATATATGCGTAGTTTGTGTAAAA
TGCGCTTATAGAAACACAGAAAAAAAAAATAAAAACGGATAAATCTTTAGAAACAATAAACACTAGCTTA
AAAATTAAAAGCAAAACAAACAATAAAACATGGCAACTGATCCGGAAAAAGGGAAAAATGAGGAAGAAAA

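To clean up the headers, print only the first whitespace-separated field of every line. Sequence lines contain no spaces, so they pass through unchanged. Just use this simple command line: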

$ cat <input_file> | awk '{print $1}' > <output_file>

EXAMPLE (option 1):

cat sequence.fasta | awk '{print $1}' > sequence_filtered_op1.fasta

EXAMPLE (option 2):

awk '{print $1}' < sequence.fasta > sequence_filtered_op2.fasta
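Both options apply the same rule to every line of the file, not only to the headers. If the sequence lines might contain spaces or tabs, a slightly stricter variant restricts the trimming to lines that start with ">". This is only a sketch: the file names demo.fasta and demo_short.fasta are hypothetical, and the demo records are shortened copies of the example above.

```shell
# Build a small demo file (hypothetical name, shortened demo data).
printf '%s\n' \
  '>gi|600513|gb|M21306.1|DROTRPC Drosophila melanogaster photoreceptor' \
  'AAAAGTTTAAATTGGATAAATTGC' \
  'TGCGCTTATAGAAACACAGAAAAA' > demo.fasta

# On header lines (those starting with ">") print only the first field;
# print every other line exactly as it is.
awk '/^>/ {print $1; next} {print}' demo.fasta > demo_short.fasta

# Show the result.
cat demo_short.fasta
```

For a regular FASTA file without whitespace in the sequence lines, this should give the same result as the two options above.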

And the output will be:
>gi|600513|gb|M21306.1|DROTRPC
AGCCACATTGGGCACTAATGTAATTAGTGGAATATAGCGACCCGTGGCTGCCACTTTTCAGCAGTGCAAC
GCGGCTAATTGGAGGCGGAACATCGCCACGATGGAACACTAAAGGATACAGTGCGCGAAAGGATTAGCCC
AAGGCTCCCCGAGGAGCAGGGATAAATGCCCATAGTGTTTGTGAGATGTGAAGTGACCAAGTGATCCGAT
CCTGATTATCGCGTTCGCATAGACCAGTAAATCAGTGCAGATATGGGCAGCAATACGGAATCCGATGCCG
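
A quick sanity check after filtering is to list the new headers and confirm that the sequence lines did not change. The sketch below is one possible way to do that with grep and cmp; all file names (check_in.fasta, check_out.fasta and the .seq files) and the second record are made up for illustration.

```shell
# Build a two-record demo file (hypothetical names and demo data).
printf '%s\n' \
  '>gi|600513|gb|M21306.1|DROTRPC Drosophila melanogaster photoreceptor' \
  'AAAAGTTTAAATTGGATAAATTGC' \
  '>seq2 a second made-up record with a long description' \
  'TGCGCTTATAGAAACACAGAAAAA' > check_in.fasta

# Apply the same command as in the examples above.
awk '{print $1}' check_in.fasta > check_out.fasta

# List the shortened headers.
grep '^>' check_out.fasta

# Extract the non-header lines of both files and compare them.
grep -v '^>' check_in.fasta > check_in.seq
grep -v '^>' check_out.fasta > check_out.seq

# cmp -s prints nothing and exits with status 0 when the files are identical.
if cmp -s check_in.seq check_out.seq; then
  echo "sequence lines unchanged"
else
  echo "sequence lines differ"
fi
```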

Code

The code and the FASTA file used in this example are available here
