Biological Language Model - Qiwen Dong

Preface

Since the end of the 20th century, the successful completion of the Human Genome Project and advances in sequencing technology for biological macromolecules have given life science researchers a huge amount of biological data, and the number of known nucleic acid and protein sequences has grown explosively. How to extract valuable information from these data has become a research hotspot for revealing the laws of life activities, and has contributed to the birth of a new discipline: bioinformatics.

Bioinformatics is an interdisciplinary subject that integrates biology, information science and applied mathematics. Different researchers define it differently. In a broad sense, bioinformatics is a discipline that deals with the collection, management and analysis of large volumes of biological data, with a current emphasis on nucleic acids and proteins. In a narrow sense, bioinformatics is a subject that uses the tools and methods of biology, computer science and mathematics to obtain, process, manage, analyze and interpret information on biological macromolecules, and thereby reveal its biological significance. At present, research in bioinformatics is concentrated mainly on genomics and proteomics. Generally, starting from an initial nucleotide or amino acid sequence, the structural and functional information of the biological macromolecule encoded in that sequence is analyzed using the theories and methods of computer science, mathematics and statistics.

Proteins play a key role in basic biological processes. As the material basis of life activities, they catalyze almost all chemical reactions in biological cells, regulate gene activity and participate in the formation of most cell structures. Because of this key role, the study of protein structure and function has always been a focus of life science research.

Protein sequences are similar to sentences in natural language, as both are linear arrangements of basic units. The mapping of protein sequences to structures and functions is conceptually similar to the mapping of words to meanings. A growing body of research has studied this analogy, but open questions remain: are there linguistic features in protein sequences, and what are the basic units of the protein sequence language? Large amounts of genomic protein sequence data for Homo sapiens and other organisms have recently become available, together with a growing body of protein structure and function data. The expected exponential increase in the amount of data in the coming decade creates an opportunity to attack the sequence–structure–function mapping problem with sophisticated data-driven methods. Such methods have proven immensely successful in the domain of natural language.

The purpose of this book is to introduce the relevant techniques of biological language modeling into bioinformatics and to promote the development of protein sequence–structure–function mapping. To that end, the linguistic features of protein sequences are analyzed and several amino acid encoding schemes are explored. Then, several research topics, including remote homology detection, protein structure prediction and protein function prediction, are investigated using biological language model approaches. Finally, a brief summary and future perspectives are presented. We hope that this book will be helpful for research in the field of bioinformatics, especially the mapping of protein sequences to their structure and function.

Qiwen Dong

Xiuzhen Hu

Xiaoyang Jing

Aoying Zhou

