Posts

Showing posts from November, 2024

Modelling genomic language using NLP and LLMs

Author: Navya Tyagi

Genomic data consists of DNA, RNA, and protein sequences that can be represented as strings of unstructured text. These sequences can be very large. For instance, human DNA is made of about 3 billion A, G, C, and T letters. Hidden patterns within these sequences can be considered equivalent to "words" in a natural language. But not all of these words are known and, more importantly, the grammar that genomic language follows is not well understood. The biological words with critical functions are of interest for studying disease and developmental processes. Sometimes a mutation in one of these "words" results in a disease condition. Determining which "words" carry biological function is a computational challenge.

Figure 1: a) The code of DNA can be written using the letters A, C, G, and T. A pairs with T and G pairs with C, forming a double-stranded (helical) structure out of two DNA strands. b) The message from DNA is transcribed to RNA (also known as messenger RNA). The ...
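To make the "sequence as text" idea concrete, here is a minimal Python sketch. It applies the pairing rule from Figure 1a (A with T, G with C) to derive the partner strand, then slices a sequence into overlapping fixed-length windows (k-mers) as one simple stand-in for candidate "words". The post does not say which tokenization it uses, so the k-mer choice, the function names `complement` and `kmers`, and the toy sequence are all assumptions made for illustration.

```python
from collections import Counter

# Pairing rule from Figure 1a: A pairs with T, G pairs with C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(dna: str) -> str:
    """Return the base-by-base partner strand of a DNA string."""
    return "".join(PAIRS[base] for base in dna)


def kmers(dna: str, k: int) -> list[str]:
    """Split a sequence into overlapping windows of length k.

    Assumption: fixed-length k-mers are used here only as a simple
    proxy for genomic "words"; real functional elements vary in length.
    """
    return [dna[i:i + k] for i in range(len(dna) - k + 1)]


seq = "ATGCGATATGC"  # made-up toy sequence, not real genomic data

print(complement(seq))                         # partner strand
print(Counter(kmers(seq, 3)).most_common(2))   # most frequent 3-letter "words"
```

Counting k-mers this way only surfaces frequent substrings. Deciding which of them carry biological function is the harder problem the post describes.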