Biological Language Model. Qiwen Dong.

Author: Qiwen Dong
Publisher: Ingram
Series: East China Normal University Scientific Reports
Genre: Medicine
ISBN: 9789811212963

       Associate Chief Editor

      Shanping Wang

      Senior Editor

      Journal of East China Normal University (Natural Sciences), China

      Email: [email protected]

       Associate Editors

      Liqiao Chen (Professor, School of Life Sciences, East China Normal University)

      Liangjun Da (Professor, Department of Environmental Science, East China Normal University)

      Pingxing Ding (Zijiang Chair Professor, State Key Laboratory of Estuarine and Coastal Research, East China Normal University)

      Chungang Duan (Zijiang Chair Professor, Director of the Key Laboratory of Polar Materials and Devices, East China Normal University)

      Fangliang He (Changjiang Chair Professor, College of Resources and Environmental Science, East China Normal University; Canada Research Chair Professor, Department of Renewable Resources, University of Alberta, Edmonton)

      Mingyuan He (Academician of Chinese Academy of Sciences, Professor, Department of Chemistry, East China Normal University)

      Minsheng Huang (Professor, Department of Environmental Science, East China Normal University)

      Mingyao Liu (Professor and Director of Institute of Biomedical Sciences, East China Normal University)

      Mingkang Ni (Foreign Academician of Russian Academy of Natural Sciences; Professor, Department of Mathematics, East China Normal University)

      Zhongming Qian (Zijiang Chair Professor, School of Finance and Statistics, East China Normal University; Lecturer in the Mathematical Institute and Fellow at Exeter College, University of Oxford)

      Jiong Shu (Professor, Department of Geography, East China Normal University)

      Shengli Tan (Changjiang Chair Professor, Department of Mathematics, East China Normal University)

      Peng Wu (Changjiang Scholar Chair Professor, Department of Chemistry, East China Normal University)

      Jianpan Wang (Professor, Department of Mathematics, East China Normal University)

      Rongming Wang (Professor, School of Finance and Statistics, East China Normal University)

      Wei-Ning Xiang (Zijiang Chair Professor, Department of Environmental Science, East China Normal University; Professor, Department of Geography and Earth Science, University of North Carolina at Charlotte)

      Danping Yang (Professor, Department of Mathematics, East China Normal University)

      Kai Yang (Professor, Department of Environmental Science, East China Normal University)

      Shuyi Zhang (Zijiang Chair Professor, School of Life Sciences, East China Normal University)

      Weiping Zhang (Changjiang Chair Professor, Department of Physics, East China Normal University)

      Xiangming Zheng (Professor, Department of Geography, East China Normal University)

      Aoying Zhou (Changjiang Chair Professor, School of Data Science and Engineering, East China Normal University)

       Subseries on Data Science and Engineering

       Chief Editor

      Aoying Zhou

      Changjiang Chair Professor

      School of Data Science and Engineering

      East China Normal University, China

      Email: [email protected]

       Associate Editors

      Rakesh Agrawal (Technical Fellow, Microsoft Research in Silicon Valley)

      Michael Franklin (University of California at Berkeley)

      H. V. Jagadish (University of Michigan in Ann Arbor)

      Christian S. Jensen (University of Aalborg)

      Masaru Kitsuregawa (University of Tokyo, National Institute of Informatics (NII))

      Volker Markl (Technische Universität Berlin (TU Berlin))

      Gerhard Weikum (Max Planck Institute for Informatics)

      Ruqian Lu (Academy of Mathematics and Systems Science, Chinese Academy of Sciences)

       Preface

      Since the end of the 20th century, the launch and successful completion of the Human Genome Project, together with advances in sequencing technology for biological macromolecules, have given life science researchers an enormous amount of biological data, and the number of known nucleic acid and protein sequences has grown explosively. How can valuable information be extracted from these data? Answering this question in order to reveal the laws governing life activities has become a research hotspot and has contributed to the birth of a new discipline: bioinformatics.

      Bioinformatics is an interdisciplinary subject that integrates biology, information science and applied mathematics. Researchers define it in different ways. In a broad sense, bioinformatics is the discipline that deals with the collection, management and analysis of large volumes of biological data. In a narrow sense, it uses the tools and methods of biology, computer science and mathematics to obtain, process, manage, analyze and interpret information on biological macromolecules, and thereby reveal its biological significance. At present, bioinformatics focuses mainly on nucleic acids and proteins, and its research is concentrated on genomics and proteomics. Typically, starting from a nucleotide or amino acid sequence, the structural and functional information that the sequence encodes is analyzed using the theories and methods of computer science, mathematics and statistics.

      Proteins play a key role in basic biological processes. As the material basis of life activities, they participate in a wide range of life processes, such as catalyzing almost all chemical reactions in cells, regulating gene activity and forming most cellular structures. Given this key role, the study of protein structure and function has always been a focus of life science research.

      Protein sequences resemble sentences in natural language: both are linear arrangements of basic units. The mapping from protein sequence to structure and function is conceptually similar to the mapping from words to meaning. A growing body of research has studied this analogy, but open questions remain. Do protein sequences have linguistic features? What are the basic units of the protein sequence language? Large amounts of genomic protein sequence data for Homo sapiens and other organisms have recently become available, together with a growing body of protein structure and function data. The expected exponential increase in such data over the coming decade creates an opportunity to attack the sequence–structure–function mapping problem with sophisticated data-driven methods. Such methods have proven immensely successful in the domain of natural language.

      The purpose of this book is to introduce the techniques of biological language modeling into bioinformatics and to promote progress on protein sequence–structure–function mapping. To this end, the linguistic features of protein sequences are analyzed and several amino acid encoding schemes are explored. Several research topics, including remote homology detection, protein structure prediction and protein function prediction, are then investigated using biological language model approaches. Finally, a brief summary and future perspectives are presented. We hope that this book will be helpful for research in the field of bioinformatics, especially the mapping of protein sequences to their structure