ProsmORF-pred, NII Delhi

ProsmORF-pred is a novel bioinformatics resource for the identification and analysis of smORFs (<= 100 amino acids) in bacterial genomes. The prediction pipeline takes as input a genome FASTA file and an annotation file containing the longer ORFs (> 100 amino acids). It first uses the longer genes to build a Translation Initiation Site (TIS) identification model, and then filters the predictions with a machine learning (ML) model trained on amino acid sequences of E. coli small ORFs (20 – 150 aa) present in RefSeq. The identified small ORFs are returned in FASTA or CSV format.
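To make the starting point of such a pipeline concrete, here is a minimal Python sketch of how candidate smORFs might be enumerated from a genome sequence before any filtering. It is not the ProsmORF-pred implementation: the function name candidate_smorfs, the start codon set, and the lower length bound of 10 aa are assumptions chosen for illustration, and the TIS model and ML filter described above are not reproduced. Only the 100 amino acid upper bound comes from the text.

```python
# Illustrative sketch only: a naive six-frame scan for candidate smORFs.
# NOT the ProsmORF-pred pipeline; the TIS model and ML filter are omitted.

START_CODONS = {"ATG", "GTG", "TTG"}  # assumption: common bacterial start codons
STOP_CODONS = {"TAA", "TAG", "TGA"}
MAX_AA = 100  # smORF upper bound stated in the text
MIN_AA = 10   # hypothetical lower bound, chosen only for this example

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def candidate_smorfs(genome, min_aa=MIN_AA, max_aa=MAX_AA):
    """Yield (strand, start, end, aa_len) for every start-to-stop ORF whose
    length in amino acids (stop codon excluded) lies in [min_aa, max_aa].

    start and end are 0-based, end-exclusive coordinates on the forward
    strand, and the span includes the stop codon.
    """
    genome = genome.upper()
    n = len(genome)
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            starts = []  # in-frame start codons seen since the last stop
            for i in range(frame, n - 2, 3):
                codon = seq[i:i + 3]
                if codon in START_CODONS:
                    starts.append(i)
                elif codon in STOP_CODONS:
                    for s in starts:
                        aa_len = (i - s) // 3
                        if min_aa <= aa_len <= max_aa:
                            if strand == "+":
                                yield (strand, s, i + 3, aa_len)
                            else:
                                # map reverse-strand offsets back to forward coordinates
                                yield (strand, n - (i + 3), n - s, aa_len)
                    starts = []


if __name__ == "__main__":
    toy_genome = "CC" + "ATG" + "AAA" * 12 + "TAA" + "GG"
    for hit in candidate_smorfs(toy_genome):
        print(hit)
```

Note that the scan reports every in-frame start codon upstream of a stop, so a single stop can produce several overlapping candidates. Choosing among such alternative starts is the kind of ambiguity a TIS identification model is meant to resolve.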

ProsmORFDB : The putative functions of identified small ORFs can also be predicted based on matches in ProsmORFDB. It is a curated database of small ORFs collected from sources such as SwissProt and literature mining. The literature-curated set has information compiled from both low-throughput and high-throughput studies. The database contains 3851 entries from SwissProt, 80 entries from low-throughput smORF studies and 3763 smORFs discovered in several high-throughput studies. An effort has been made to provide a function for these ORFs wherever possible.

Tools : The smORFs predicted in a given genome can be analysed in several ways using a set of utilities provided for this purpose. These include a sequence similarity search against a database of prokaryotic smORFs (ProsmORFDB), a conservation search across a set of 3153 prokaryotic whole genome assemblies (NCBI Reference and Representative Genomes), and a genome neighborhood conservation search. Small ORFs identified by other studies can also be analysed using these tools.
