# Enabling technology and core theory of synthetic biology

Synthetic biology provides a new paradigm for life science research (“build to learn”) and opens the future journey of biotechnology (“build to use”). Here, we discuss advances in the principles and technologies that make up the mainstream of enabling technology for synthetic biology, including genome synthesis and assembly, DNA storage, gene editing, molecular evolution and de novo design of functional proteins, cell and gene circuit engineering, cell-free synthetic biology, artificial intelligence (AI)-aided synthetic biology, and biofoundries. We also introduce the concept of quantitative synthetic biology, which is guiding synthetic biology towards increased accuracy and predictability, that is, towards truly rational design. We conclude that synthetic biology will establish its disciplinary system through the iterative development of enabling technologies and the maturation of its core theory.

Keywords: synthetic biology, quantitative synthetic biology, genome synthesis and assembly, DNA storage, molecular evolution, de novo design, computer-aided design, cell engineering, gene circuit, chassis cell, artificial intelligence (AI), biofoundry

Citation: Zhang, X.E., Liu, C., Dai, J., Yuan, Y., Gao, C., Feng, Y., Wu, B., Wei, P., You, C., Wang, X., and Si, T. (2023). Enabling technology and core theory of synthetic biology. Sci China Life Sci 66, 1742–1785. https://doi.org/10.1007/s11427-022-2214-2

# Introduction

Synthetic biology, also referred to as engineering biology, is a newly emerging interdisciplinary subject. It integrates biological sciences, chemistry, physics, material science, computer and information science, as well as engineering concepts to redesign or de novo design and construct biological systems, creating a new paradigm of biological research known as “build to learn” and giving current biotechnology a new driving force, called “build to use” (Deng, 2019; Elowitz and Lim, 2010; Liu et al., 2013; Zhang, 2018; Zhang, 2019; Zhao, 2018). Synthetic biology consists of three interrelated aspects: theory, enabling technology, and applied research.

More than half a century ago, after understanding the primary structure of nucleic acids and proteins, scientists achieved a historic leap from the chemical synthesis of small molecules to the chemical synthesis of biological macromolecules. Pioneering studies included the total synthesis of crystalline bovine insulin (Kung et al., 1965), polynucleotide synthesis of the genetic code (Khorana et al., 1966), and the synthesis of amino acid transfer RNAs (Lapidot et al., 1969; Wang et al., 1983; Weber and Khorana, 1972). At the turn of the century, with the completion of the Human Genome Project and the advancement of DNA sequencing and synthesis technologies, synthetic biology achieved a new leap from nucleic acid synthesis to genome synthesis. From simple to complex, scientists have synthesized viral genomes (Cello et al., 2002), bacterial genomes (Gibson et al., 2010a; Ostrov et al., 2016), and yeast chromosomes (Annaluru et al., 2014; Dymond et al., 2011; Shen et al., 2017; Xie et al., 2017; Zhang et al., 2017a). These landmark achievements provided the foundation for the synthesis of biological systems by demonstrating that synthetic genomes can fully perform natural biological functions.

During the same period, the concepts of engineering were introduced into the design and creation of biological systems, such as gene circuits, biological devices and modules, minimal genomes and chassis cells. These concepts were well explained by many artificial logical biological devices and genomes, for example, gene toggle switches (Gardner et al., 2000), synthetic oscillatory networks of transcriptional regulators (Elowitz and Leibler, 2000), quorum sensing based intercellular communication circuits (Miller and Bassler, 2001), synthetic multicellular systems for programmed pattern formation (Basu et al., 2005), RNA devices for performing logical operations (Win and Smolke, 2008), and programmable microbial kill switches (Callura et al., 2010). These pioneering explorations seek to achieve the standardization of biological components, the generalization of chassis cells, and the predictability of designing biological systems based on engineering principles.

However, biological systems are extremely complex owing to their sustained genetic variation, metabolic diversity and dynamics, and biomass flexibility. The knowledge we have acquired for biological design is far from sufficient to meet engineering standards. High-throughput, multi-cycle and automated “trial and error” biofoundries therefore came into being, and their efficiency is being greatly advanced by the embedding of artificial intelligence (AI). In particular, advanced algorithms represented by DeepMind’s AlphaFold (Jumper et al., 2021; Senior et al., 2020) and the Baker lab’s RoseTTAFold (Anishchenko et al., 2021a; Anishchenko et al., 2021b) have revolutionized the prediction of protein 3D structures. AI-assisted de novo protein design is therefore expected to flourish.

As a result, we can distinguish two modi operandi: a data-driven “black box” and a knowledge-driven “white box”. The vast amount of knowledge input and data output allows us to build computer models that bring accuracy and predictability to biological design. Such modeling is termed quantitative synthetic biology, which is becoming the core of the theory of synthetic biology. Its validity and reliability have been demonstrated in investigations of the ability of bacteria to colonize newly available habitats (Liu et al., 2019b) and in elucidating the quantitative relationship between growth and the cell cycle of Escherichia coli (Zheng et al., 2020).

Between 2009 and 2012, the National Science and Engineering Academies of China, the USA, and the UK held a series of joint symposia on synthetic biology (http://www.nap.edu/catalog.php?record_id=13316). The major topic of the event held in Shanghai was “Enabling technology of synthetic biology” (http://www.sippe.ac.cn/yjdy/hcswx/hczxxx/201812/t20181214_5212243.html). Enabling technology of synthetic biology is a collection of novel or iteratively improved technologies, such as genome synthesis and assembly, gene editing and genome reprogramming, evolution and de novo design of proteins, chassis cells, computer-aided design and modeling of new biomacromolecules and biosystems, DNA information storage, production automation of biological systems, genetic code expansion and semisynthetic organisms, etc. With the creation and development of these enabling technologies, synthetic biology will broadly realize its goals, establish a new paradigm for biological science research, and have a revolutionary impact on biotechnology and its applications.

On December 24, 2021, we held the Forum on “Enabling technologies and core theory of synthetic biology”. Here, we summarize the discussion held at this Forum. However, owing to the large number of topics and limited space, this paper does not cover some critical topics, such as orthogenetic systems based on genetic code expansion, as well as bioethics, biosafety and biosecurity, which we hope to discuss on other occasions. For the Forum itself, see https://blog.sciencenet.cn/video.php?mod=vinfo&pid=1530.

# Quantitative synthetic biology

# Hierarchy of synthetic biology research

Living systems are considerably complex, as a result of their hierarchical organization and layers of interaction (Channon et al., 2008) (Figure 1A). Properties displayed by complex systems are often referred to as “emerging properties”, which are the result of interactions between components that cannot be predicted, even with a full knowledge of all parts (Aderem, 2005). For example, identifying the genes and proteins in an organism is analogous to listing all parts of an airplane, which is not enough to understand how an airplane flies. Therefore, understanding how the properties of life emerge in hierarchical organization remains one of the most fundamental questions in life sciences.

Figure 1 Understanding function emergence with quantitative synthetic biology. A, Function emergence in living systems across their hierarchical organization, which cannot be understood by knowing individual parts alone. B, Hierarchy of the methodology. C, Principle and rational design in quantitative synthetic biological research promote each other. The proposed research paradigm of quantitative synthetic biology integrates rational design, building, and testing.

Using biological and engineering principles, synthetic biology aims to design biological modules/reactions/systems to achieve desired functions/products (Kai and Schwille, 2019). From biomolecular engineering and genome engineering to artificial cell design, synthetic biologists can access hierarchical biological systems at any level. At every level, synthetic biologists aim to create a more synthetic and complex entity; thus, advances in synthetic biology often result in the most complex and unnatural systems (Channon et al., 2008). In general, synthetic biology studies how properties of life emerge in hierarchical living systems using a bottom-up strategy in which we “understand by construction”. As Richard P. Feynman famously stated, “What we cannot create, we cannot understand.”

Synthetic biology has proven to be a valuable tool for exploring all three dimensions of the properties of life, namely, functions that previously existed, functions that exist, and functions that do not yet exist. It allows the creation of artificial systems, such as silicon-based life (Kan et al., 2016) and single-chromosome yeast (Shao et al., 2018), to explore the boundaries of life and address the fundamental question of “what is life”. In addition, these artificial systems can surpass the capabilities of natural cells, which will considerably facilitate the development of biotechnology to transform our daily lives (Voigt, 2020), i.e., to explore functions that do not yet exist. By engineering/rebuilding biological systems, synthetic biology enables us to extract information that is inaccessible to conventional biology and helps us understand the principles of life (existing functions) (Elowitz and Leibler, 2000; Liu et al., 2019b). Finally, synthetic biologists study complex biological systems through simplification and modeling. These efforts have advanced our understanding of the underlying physics, which will help elucidate possible evolutionary routes in early Earth environments (previously existing functions) (Budin and Szostak, 2011).

Despite these considerable advances in understanding the life system, a well-established synthetic biology approach remains missing. To address this issue, we first summarize the general paradigm of previous systems. An integrative biology research practice usually consists of three levels, termed partial, topological, and functional. Functionality comes from parts, but with complex systems it is difficult to obtain an end-to-end relationship between them. Therefore, a topological intermediate layer is added. How topology emerges from the part is called the mechanism, and how the function emerges from the topology is called the principle (Figure 1B). The principle is more general, while the mechanism is specific to the part used. For example, Alan Turing proposed the reaction-diffusion principle, which broadly regulates pattern formation in nature (Turing, 1952). Based on this general principle, researchers constructed chemical Belousov-Zhabotinsky reactions (Yoshida, 2010), biological circadian oscillations (Nakajima et al., 2005), and engineered DNA/enzyme reaction networks (Senoussi et al., 2021) of artificial reaction-diffusion systems to follow their respective mechanisms. Therefore, understanding these principles lays the foundation for rational design, which in turn accelerates synthetic biology towards increased complexity and efficiency, advancing our understanding of the fundamental principles of living systems (Figure 1C). An ideal synthetic biology practice, which we call rational design, is to design basic components based on the principle of interest to achieve the intended function.
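As a hedged illustration of the reaction-diffusion principle mentioned above, a generic two-component system can be written as follows. The symbols $u$, $v$, $f$, $g$, $D_u$ and $D_v$ are notation assumed here for illustration and are not taken from the cited works.

$$
\frac{\partial u}{\partial t} = f(u, v) + D_u \nabla^2 u, \qquad
\frac{\partial v}{\partial t} = g(u, v) + D_v \nabla^2 v
$$

Here $f$ and $g$ describe the local reaction kinetics, which correspond to the part-specific mechanism, while the diffusion terms couple neighboring positions. Written this way, the same principle can be instantiated with different chemical or biological parts by changing $f$, $g$ and the diffusion coefficients.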

# Quantitative synthetic biology: a research paradigm to investigate function emergence

Successful rational design is limited to a few well-characterized systems, such as toggle switches and oscillators in earlier studies (Elowitz and Leibler, 2000; Gardner et al., 2000). Recent examples include morphogen-mediated artificial cell differentiation (Tian et al., 2019), and artificial photosynthetic systems that enable photodynamic carbon dioxide fixation (Miller et al., 2020a). However, the principles governing biological functions of interest are often elusive. In this case, researchers can devise novel principles based on quantitative analysis of the function of interest and then test them; or they can rely on well-defined components and fine-tune them to explore their potential functions. Thus, when the rationale for a function is unknown, routine work in synthetic biology research involves tedious trial and error and a degree of luck in fine-tuning. While this trial-and-error research paradigm has been successful in generating critical information and expanding our understanding of complex biological systems, the rational design of synthetic biology will be in high demand in the coming decades to effectively explore the fundamental principles of living systems.

How can we achieve rational design without the principle of interest? To design a system rationally is to design for predictability, that is, the ability to predict outcomes based on input components and parameters. This requires us to quantify natural phenomena, freeing us from the ambiguity and subjectivity of qualitative descriptions and allowing us to develop theories and make predictions. Two kinds of models guide us in quantifying natural systems: knowledge-driven “white-box” models and data-driven “black-box” models. A white-box model is established based on macroscopic experimental observations. Through synthetic analysis, we can describe these complex observations in mathematical formulations, thereby extracting general theoretical frameworks and underlying principles. For example, Liu et al. developed a reaction-diffusion model consisting of multiple partial differential equations describing cell growth, cell movement, etc., in a range expansion system of a bacterial colony, reproducing the spatial patterns that spontaneously emerge in this system (Liu et al., 2011); Zheng et al. discovered a linear relationship between the cell mass and the rate of chromosome replication-segregation, which provided a quantitative basis for understanding cell cycle regulation and programming cell size (Zheng et al., 2020). Terence Hwa’s group formulated a number of bacterial growth laws and established a principle of proteomic resource allocation (Erickson et al., 2017; Scott and Hwa, 2011), providing a predictive model for understanding the response of bacteria to physiological perturbations and for designing synthetic metabolic pathways. In contrast, black-box modeling focuses on the direct correlation between input and output. A large amount of known input and output data is used to train and improve the algorithm, which can then be used to predict the outcome of the associated system. For example, AlphaFold, developed by DeepMind and trained on known relationships between amino acid sequences and protein structures, successfully predicted 98.5% of human protein structures (Jumper et al., 2021).
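To make the white-box idea concrete, the sketch below integrates a dimensionless two-repressor toggle model of the kind associated with the genetic toggle switch. It is a minimal illustration and not the model of any study cited above: the parameter values (`alpha1`, `alpha2`, `beta`, `gamma`), the forward Euler scheme, and the function name `toggle_switch` are all assumptions chosen for readability.

```python
def toggle_switch(u0, v0, alpha1=10.0, alpha2=10.0, beta=2.0, gamma=2.0,
                  dt=0.01, steps=5000):
    """Integrate a two-repressor toggle model with forward Euler.

    u and v are the (dimensionless) concentrations of two repressors that
    inhibit each other's synthesis. All parameter values are illustrative.
    """
    u, v = u0, v0
    for _ in range(steps):
        # Each repressor is produced at a rate reduced by the other one
        # and is removed by first-order dilution/degradation.
        du = alpha1 / (1.0 + v ** beta) - u
        dv = alpha2 / (1.0 + u ** gamma) - v
        u += dt * du
        v += dt * dv
    return u, v


if __name__ == "__main__":
    # Two initial conditions on either side of the u = v diagonal.
    print(toggle_switch(2.0, 1.0))
    print(toggle_switch(1.0, 2.0))
```

With these assumed parameters the two starting points settle into opposite states (one repressor high, the other low), which is the kind of input-to-outcome prediction a white-box model is meant to provide.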

Both white-box and black-box modeling have been demonstrated to be valuable tools in biodesign. By developing mechanistic models and systematically analyzing them, Chau et al. identified network topologies that can achieve spontaneous cell polarization. Based on the models’ predictions, they successfully built synthetic gene circuits that generated polarized protein distribution in yeast (Chau et al., 2012). Lu et al. developed a machine learning algorithm to predict how PET (a major type of plastic) hydrolases could be mutated to improve their efficiency and robustness. Guided by this algorithm, the team engineered the wild-type enzyme and obtained a mutant with much superior PET-degrading activity (Lu et al., 2022). Although taking different routes, both approaches aim to describe natural phenomena through quantitative relationships, enabling predictive design. Therefore, we propose the concept of quantitative synthetic biology as an urgently needed research paradigm to address the current bottleneck of rational design.

Quantitative synthetic biology is the intersection of quantitative biology and synthetic biology. It studies synthetic biology systems from the bottom up and uses simplified quantitative relationships to describe complex biological phenomena. Guided by both white-box and black-box modeling, a quantitative understanding of living systems can be obtained, from which we can develop theories of complex biological systems and explore their fundamental principles. An understanding of the underlying principles will facilitate rational design, thereby accelerating the realization of true synthetic biology engineering. Both white-box and black-box modeling deal with large amounts of data, which can be obtained by building automated and high-throughput experimental facilities and standardized protocols, algorithms, and workflows (tests). Finally, enabling technologies should be further developed to precisely control/rebuild biological systems, such as efficient and precise DNA synthesis techniques, genome editing, gene circuit design, and protein directed evolution (construction). Therefore, we propose a future research paradigm for synthetic biology, including design, construction, and testing (Figure 1C). We envision that this research cycle will transform the current qualitative, descriptive, and limited synthetic biology research into a new phase with quantitative, theoretical, and systematic features. This revolution will push the frontiers of biology by answering fundamental questions about how life functions, which in turn will help us design synthetic systems with improved predictive power.

# Synthesis and assembly of a genome

# Brief history of genome synthesis

The history of synthetic genomics dates back to 1970, following a 5-year effort to synthesize the 77 bp yeast tRNA gene (Agarwal et al., 1970). In 2002, a 7.5 kb poliovirus complementary DNA was chemically constructed (Cello et al., 2002). One year later, the 5.5 kb genome of phage ΦX174 was created in just two weeks (Smith et al., 2003). Encouraged by this success, a number of groups set out to construct the 583 kb Mycoplasma genitalium JCVI-1.0 genome, which was achieved in 2008 (Gibson et al., 2008a). Subsequently, the 1.1 Mbp Mycoplasma mycoides JCVI-syn1.0 genome was synthesized and demonstrated to be functional (Gibson et al., 2010b). To date, synthetic genomes have mostly mimicked natural template DNA. In 2016, scientists minimized the 1.1 Mbp JCVI-syn1.0 genome to a functional 531 kbp JCVI-syn3.0, using three design-build-test cycles (Hutchison III et al., 2016).

In addition to assessing the plasticity of gene content, genome synthesis allows for reprogramming of the genetic code. In 2016, a 3.97-megabase, 57-codon Escherichia coli genome was designed, and 63% of the synthetic genome was experimentally validated (Ostrov et al., 2016). Three years later, an E. coli variant with a synthetic genome using 61 codons was created, achieving the first genome-wide compression of the sense codons (Fredens et al., 2019) (Figure 2).
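The idea of codon compression can be sketched in a few lines: every occurrence of a target codon in an open reading frame is replaced by a synonymous one, so the target codons disappear from the coding sequence. The mapping below (two serine codons and the amber stop codon swapped for synonyms) is an assumption used for illustration, since the text above does not list the codons involved, and the helper names `recode_orf` and `codons_used` are hypothetical.

```python
# Illustrative synonymous replacements (assumed for this sketch):
# two serine codons and the amber stop codon are swapped for synonyms.
RECODE = {"TCG": "AGC", "TCA": "AGT", "TAG": "TAA"}


def codons_used(orf):
    """Return the set of in-frame codons present in an ORF."""
    return {orf[i:i + 3] for i in range(0, len(orf), 3)}


def recode_orf(orf, table=RECODE):
    """Replace target codons in-frame with their synonymous substitutes."""
    orf = orf.upper()
    if len(orf) % 3 != 0:
        raise ValueError("ORF length must be a multiple of 3")
    codons = [orf[i:i + 3] for i in range(0, len(orf), 3)]
    return "".join(table.get(codon, codon) for codon in codons)


if __name__ == "__main__":
    original = "ATGTCGTCATAG"  # Met-Ser-Ser-stop
    recoded = recode_orf(original)
    print(recoded)
    print(codons_used(recoded) & set(RECODE))  # target codons left in frame
```

This toy version only handles a single reading frame; genome-scale recoding also has to deal with overlapping genes and regulatory sequences, which the sketch deliberately ignores.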

In parallel with the synthesis of viral and bacterial genomes, an attempt to synthesize a eukaryotic genome, called the Synthetic Yeast Genome Project (Sc2.0), was initiated, an effort that is only now nearing completion (Jiang et al., 2020a; Luo et al., 2020; Luo et al., 2018; Zhang et al., 2020b). Sc2.0 aims to synthesize the entire genome of Saccharomyces cerevisiae (~12 Mb, divided into 16 chromosomes), with numerous changes to explore fundamental biological questions about genome function. In 2016, a more ambitious project, the Genome Project Write (GPW), was proposed to rewrite complex genomes of gigabase size (Boeke et al., 2016). However, the current capacity and cost of DNA synthesis are the major limiting factors in the construction of such large genomes; thus, there is an urgent need for breakthroughs in DNA synthesis and genome assembly.

# Technology development for gene synthesis and genome assembly

# Oligonucleotide synthesis

Currently, the most commonly used technique for oligonucleotide synthesis is the solid-phase phosphoramidite chemical synthesis method developed in the 1980s (Beaucage and Caruthers, 1981). In this method, the addition of each nucleotide monomer proceeds through a cycle of four steps: deprotection, coupling, capping, and oxidation (Hughes and Ellington, 2017; Hughes et al., 2011). The cycle is then repeated for the next base by removing the protecting group. The robustness and fidelity of this approach enable it to be automated and industrialized. Since the 1990s, DNA synthesizers based on this method have been developed to synthesize 96–1536 different oligonucleotides simultaneously (Cheng, 2002; Rayner et al., 1998). In comparison, array-based oligonucleotide synthesis technology could, in theory, significantly reduce costs and increase yields (Hughes and Ellington, 2017). The authors, however, pointed out that synthesis quality typically decreases with increasing oligonucleotide length due to inevitable synthesis-related errors in stepwise multiple reaction systems. Despite continuous efforts to optimize the synthesis process, synthetic oligonucleotides are typically no longer than 200 nt in length (Hughes and Ellington, 2017; Kosuri and Church, 2014).
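The length limit of stepwise synthesis follows from simple arithmetic: if each coupling step succeeds with probability c, an oligonucleotide of n nucleotides needs n − 1 successful couplings, so the expected full-length fraction is c^(n − 1). The sketch below evaluates this relation. The per-step efficiencies used are assumed round numbers for illustration, not values reported in the works cited above, and the model ignores other error sources such as depurination.

```python
import math


def full_length_fraction(length_nt, coupling_efficiency):
    """Expected fraction of full-length product after stepwise synthesis.

    Assumes every coupling succeeds independently with the same probability
    and that a failed coupling terminates the chain (an idealization).
    """
    if length_nt < 1:
        raise ValueError("length_nt must be at least 1")
    return coupling_efficiency ** (length_nt - 1)


def max_length_for_fraction(target_fraction, coupling_efficiency):
    """Longest oligo whose expected full-length fraction stays >= target."""
    couplings = math.floor(math.log(target_fraction) / math.log(coupling_efficiency))
    return couplings + 1


if __name__ == "__main__":
    for efficiency in (0.99, 0.995, 0.999):  # assumed per-step efficiencies
        fraction = full_length_fraction(200, efficiency)
        print(f"c = {efficiency}: full-length fraction at 200 nt = {fraction:.3f}")
```

Under this idealized model, even small improvements in per-step efficiency translate into large gains in the yield of long oligonucleotides, which is why synthesis quality falls off quickly with length.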

Figure 2 Milestones in synthetic genomics. Brown represents progress in viral and bacterial genome composition. Blue indicates milestones in eukaryotic genome composition.

Enzymatic de novo synthesis of oligonucleotides, proposed as early as the 1960s (Bollum, 1962), has emerged as a promising alternative due to the length limitations and hazardous waste of chemical synthesis. Currently, terminal deoxynucleotidyl transferase (TdT) is the best option (Barthel et al., 2020; Eisenstein, 2020; Lee et al., 2020; Lee et al., 2019; Palluk et al., 2018). After years of efforts, it has been reported that enzymatic synthesis can generate about 300 mers, outperforming chemical synthesis (Eisenstein, 2020). To date, several companies have been established to advance the commercialization of enzymatic DNA synthesis. Achieving rapid, high-throughput on-demand synthesis of long DNA molecules in the future will considerably accelerate the design-build-test cycle in systems biology.

# Gene synthesis

The term “gene” in gene synthesis refers to a long double-stranded DNA sequence rather than the classical definition of a gene (Kosuri and Church, 2014). Commercially synthesized genes are usually between 200 and 3000 bp in length. Single-stranded oligonucleotides with complementary overlapping sequences are the raw materials for the assembly of this double-stranded synthetic DNA. Earlier methods were ligation-based, in which adjacent oligonucleotides were enzymatically joined by DNA ligase (Agarwal et al., 1970; Au et al., 1998; Bang and Church, 2008; Sekiya et al., 1979). Since the invention of the polymerase chain reaction (PCR) in the 1980s, PCR-mediated methods have been widely used to assemble desired DNA sequences from oligonucleotides (Hughes and Ellington, 2017; Hughes et al., 2011; Stemmer et al., 1995). In addition, Gibson and colleagues developed in vitro (Dormitzer et al., 2013; Gibson et al., 2010b) and in vivo (Gibson, 2009a) one-step methods for the direct assembly and cloning of oligomers into plasmids. These methods have since been iteratively improved and are used in most commercial and academic applications (Kosuri and Church, 2014). Furthermore, due to the need for inexpensive synthetic DNA, methods for gene synthesis using microarray-based oligonucleotide pools have also been developed (Borovkov et al., 2010; Kosuri et al., 2010; Quan et al., 2011; Tian et al., 2004).
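As a rough sketch of how overlapping single-stranded oligonucleotides can tile a double-stranded target, the function below cuts a sequence into windows that share a fixed overlap and alternates between the top and bottom strands. The oligo length, overlap length, and function names are assumptions for illustration; practical design tools also balance melting temperatures and avoid secondary structure, which is omitted here.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def design_oligos(seq, oligo_len=60, overlap=20):
    """Tile `seq` with overlapping oligos on alternating strands.

    Even-numbered oligos are taken from the top strand; odd-numbered ones
    are the reverse complement of their window, so neighbors can anneal
    through the shared overlap.
    """
    if not 0 < overlap < oligo_len:
        raise ValueError("overlap must be positive and shorter than oligo_len")
    step = oligo_len - overlap
    oligos = []
    start = 0
    index = 0
    while True:
        window = seq[start:start + oligo_len]
        oligos.append(window if index % 2 == 0 else reverse_complement(window))
        if start + oligo_len >= len(seq):
            break  # the current window already reaches the end of the target
        start += step
        index += 1
    return oligos


if __name__ == "__main__":
    target = "ACGT" * 35  # 140-nt toy sequence
    for number, oligo in enumerate(design_oligos(target), start=1):
        print(number, len(oligo), oligo)
```

The same tiling can feed either a ligation-based or a PCR-mediated assembly; only the downstream enzymatic step differs.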

In addition to genes, various applications require longer DNA molecules over 10 kb or even 100 kb in length, which has led to the development of a range of methods to assemble short DNA, such as BioBrick (Shetty et al., 2008), BglBrick (Anderson et al., 2010), iBrick (Liu et al., 2014), and HVAS (Li et al., 2014). However, the “scar” sequences produced by these endonuclease-based techniques may affect the function of the final constructs or introduce undesired variations. Type IIS restriction enzymes are characterized by a cleavage site that is only a few bases away from the recognition site, making them an ideal solution for “scar-free” assembly. Based on this principle, the Golden Gate method and toolkit were developed and gained significant popularity (Cermak et al., 2011; Engler et al., 2008; Guo et al., 2015). Furthermore, to obviate the need for restriction enzymes, several seamless assembly methods such as Gibson assembly (Gibson et al., 2009b), ligase cycling reactions (de Kok et al., 2014), sequence and ligation independent cloning (Li and Elledge, 2007), circular polymerase extension cloning (Quan and Tian, 2009; Quan and Tian, 2011) and yeast assembly (Gibson et al., 2008b; Jiang et al., 2022; Shao et al., 2009), have been established. Currently, which assembly technique to use is a matter of preference. Importantly, most of the above methods can be automated to increase the throughput of constructing long synthetic DNA (Hughes and Ellington, 2017).
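Because a Type IIS enzyme cuts outside its recognition site, the sequence of the resulting overhang can be chosen freely by the designer. The sketch below illustrates this with BsaI, assuming its commonly described geometry (recognition sequence GGTCTC, top-strand cut 1 nt downstream, 4-nt overhang), and adds a simple check that a set of junction overhangs is unambiguous. The function names and the example overhangs are hypothetical, and only a forward-oriented site is handled.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
BSAI_SITE = "GGTCTC"  # assumed geometry: cut 1 nt downstream, 4-nt overhang


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def overhang_after_site(seq, site=BSAI_SITE, offset=1, overhang_len=4):
    """Return the overhang generated downstream of the first forward site."""
    position = seq.find(site)
    if position < 0:
        raise ValueError("recognition site not found")
    start = position + len(site) + offset
    overhang = seq[start:start + overhang_len]
    if len(overhang) < overhang_len:
        raise ValueError("sequence too short to contain the full overhang")
    return overhang


def junctions_compatible(overhangs):
    """Check that junction overhangs cannot pair with themselves or each other."""
    seen = set()
    for overhang in overhangs:
        partner = reverse_complement(overhang)
        if overhang == partner:            # palindromic: could self-ligate
            return False
        if overhang in seen or partner in seen:
            return False                   # two junctions would be ambiguous
        seen.add(overhang)
    return True


if __name__ == "__main__":
    print(overhang_after_site("AAGGTCTCAGGAGTTTT"))
    print(junctions_compatible(["GGAG", "TACT", "AATG", "GCTT"]))
```

Since the recognition site lies outside the overhang and is removed from the ligated product, the junction sequence can be made part of the intended design, which is what allows a scar-free assembly.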

# Genome assembly

To synthesize small genomes, restriction cloning or polymerase cycling assembly (PCA) methods are usually sufficient (Cello et al., 2002; Smith et al., 2003), while combinations of different tools are employed to construct larger-scale (more than 100 kilobases) synthetic chromosomes or genomes (Annaluru et al., 2014; Fredens et al., 2019; Gibson et al., 2008a; Gibson et al., 2010a; Hutchison III et al., 2016; Richardson et al., 2017; Zhang et al., 2020b; Zhang et al., 2017a).

Although Gibson assembly has been reported to assemble DNA molecules up to several hundred kilobases (Gibson et al., 2009b), the efficiency of in vitro procedures declines as DNA length increases, making it mainly a commercialized tool for constructing synthetic DNA of tens of kilobases (Gibson et al., 2008a; Gibson et al., 2010b; Zhang et al., 2020b). By comparison, the upper limit of yeast assembly appears to be much higher. Beyond the high efficiency of DNA assembly within or around 100 kb (Fredens et al., 2019; Gibson et al., 2008a; Jiang et al., 2022), one-pot yeast transformation has generated several synthetic genomes of hundreds of kilobases or even over 1 Mb, such as the two Phaeodactylum tricornutum chromosomes (497 kb and 441 kb) from 5 fragments (Karas et al., 2013), the 583 kb bacterial genome generated with 25-25 kb fragments (Gibson et al., 2008b), the 786 kb Caulobacter ethensis-2.0 using 16 mega-segments (38–65 kb in size) (Venetz et al., 2019) and the 1.08 Mb JCVI-syn1.0 genome via 11 pieces of 100 kb overlapping intermediates (Gibson et al., 2010a). Yeast homologous recombination is also critical for the assembly of the synthetic chromosomes in Sc2.0, which share high similarity with the native yeast genome (Annaluru et al., 2014; Luo et al., 2018; Richardson et al., 2017; Zhang et al., 2020b; Zhang et al., 2017a). Combinations of regular cloning, Golden Gate, Gibson assembly or yeast assembly were used to generate the “megachunks” (30–60 kb), which were sequentially introduced into yeast to replace their native counterparts through a strategy called switching auxotrophies progressively for integration (SwAP-In). Together, these results highlight the considerable capacity of the yeast host for DNA uptake and assembly. The fact that all 16 yeast chromosomes can be reorganized into a single linear or circular chromosome suggests that budding yeast may be capable of constructing DNA molecules over 10 Mb (Jiang et al., 2022; Shao et al., 2019; Shao et al., 2018).

In addition to Saccharomyces cerevisiae, Bacillus subtilis, Salmonella typhimurium, and E. coli are three alternative hosts for in vivo genome assembly (Fredens et al., 2019; Itaya et al., 2008; Itaya et al., 2005; Lau et al., 2017). A 3.5 Mb genome has been assembled into a B. subtilis genome by the “inchworm” method (Itaya et al., 2005). Using a B. subtilis genome (BGM) vector, a 16.3-kb mouse mitochondrial genome and a 134.5-kb rice chloroplast genome were successfully integrated by homologous recombination into a B. subtilis genome (Itaya et al., 2008). A 200 kb S. typhimurium segment was replaced by synthetic DNA through a process called stepwise integration of rolling circle amplified fragments (SIRCAS) (Lau et al., 2017). In E. coli, a conjugation-based strategy, coupled with replicon excision for enhanced genome engineering through programmed recombination (REXER), enabled the synthesis of a ~4 Mb recoded genome (Fredens et al., 2019; Wang et al., 2016).

# Perspective on synthetic genomics

Bottom-up genome synthesis enables the simultaneous integration of dense and complex genome-wide changes. Synthetic genomics not only yields valuable discoveries in the life sciences but may also lead to a new industrial revolution in food, medical and chemical production (Jiang et al., 2020b; Venter et al., 2022). For example, synthetic viruses have altered vaccine design and production, and synthetic genomes are being used to save lives through humanized pig organs for transplantation (Venter et al., 2022).

Currently, however, the cost of gene synthesis remains prohibitive for genomes of megabase size or larger. In addition, a number of technical hurdles remain to be addressed. First, designed genomes are currently usually assembled in microbial hosts such as E. coli or S. cerevisiae; however, toxicity of certain DNA sequences to the host often leads to assembly failure. Second, the stepwise assembly of long DNA fragments is limited by repetitive sequences, such as those in the centromeres and telomeres of higher eukaryotes. Third, the transfer of assembled DNA fragments from the host to the target organism remains challenging. Currently, only the Mycoplasma circular genome has been successfully transplanted.

Recent developments in high-throughput DNA synthesis and assembly technologies should considerably accelerate the construction of synthetic genomes. The emergence of new DNA synthesis technologies using high-density microchips, enzymatic DNA synthesis, and automated gene assembly using microfluidics will continue to drive down the price of gene synthesis. In addition to assembling large DNA fragments in vivo, new techniques for synthesizing and amplifying large DNA in vitro will emerge. Within the next five years, the cost of gene synthesis is expected to reach 0.001/base for DNA lengths over 1 Mb. Meanwhile, chromosomes of about 1 Mb in size could then be fully synthesized in vitro, transferred into the target organism and used to reboot the host, opening a new era of synthetic genomics.

One research direction related to genome assembly is genome simplification, which aims to identify a minimal genome of a living organism. For example, J. Craig Venter’s team removed nearly half of the mycoplasma genome to study the genome composition necessary for cell survival (Hutchison III et al., 2016). A genome compaction strategy based on synthetic chromosomal rearrangements and modifications by LoxPsym-mediated evolution (SCRaMbLE) revealed that at least 60% of the genes on the synthetic chromosomal arm (synXIIL) of budding yeast are dispensable for cell viability (Luo et al., 2021). These studies considerably improved the feasibility of constructing a microbial chassis with a minimal genome and desirable properties. Future exploration of genomic minimization of multiple synthetic chromosomes or entire genomes will considerably expand our knowledge of the core functions of eukaryotes (Dai et al., 2020).

# DNA data storage technology: a paradigm of BT-IT integration

# The emerging DNA data storage

Information storage has always been a driving force of human civilization and a necessary condition for the accumulation of knowledge and the transmission of culture and technology from generation to generation. The techniques used to preserve information can be traced back to knot-tying in ancient China and, later in history, to papermaking and printing. Over roughly the past half century, storage technologies based on magnetic, optical, and silicon media, such as hard disks, solid-state disks, and magnetic tapes, have continually changed the means of information storage.

Modern data storage and processing technologies have brought humanity into the digital age, and the total amount of digital data on the planet has grown exponentially. However, current storage media face a number of challenges: theoretical limits on density, short lifetimes, high energy consumption, and environmental pollution. A new generation of information archiving technologies therefore needs to be developed (Ceze et al., 2019; Zhirnov et al., 2016). DNA, the natural medium for preserving genetic information, has been found to be a potential medium for artificial data storage with high density, long-term durability, and low maintenance costs (Bancroft et al., 2001; Church et al., 2012; Dong et al., 2020; Grass et al., 2015). The use of synthetic DNA for high-density and long-term data storage has become a highly promising area of research, attracting considerable interest worldwide from both governments and industrial investors. In the United States, the Semiconductor Industry Association, Defense Advanced Research Projects Agency, National Science Foundation, Semiconductor Research Corporation, and Intelligence Advanced Research Projects Activity all support DNA data storage technologies and related semiconductor research. The European Commission has also specifically funded DNA data storage through the Horizon 2020 program. The China Association for Science and Technology listed DNA data storage as one of 60 major scientific and engineering technology issues in 2018. China’s 14th Five-Year Plan explicitly promotes the development of cutting-edge technologies such as DNA storage. Microsoft and Western Digital, along with Twist Bioscience and Illumina, announced in 2020 the creation of the “DNA Data Storage Alliance”. Its “common goal is to enable the full potential of DNA Data Storage as a new storage medium across existing and emerging archival storage use cases”. To date, more than 50 companies and academic institutes have joined the Alliance (https://dnastoragealliance.org/).

# Concept and working modes of DNA data storage

As shown in Figure 3A, DNA data storage consists of three basic components: (i) a coding system that can encode binary strings into DNA strings and perform the reverse process, decoding DNA strings back into binary strings; (ii) a writing device that can make actual DNA molecules with a specific sequence or structure; (iii) a reading device capable of reading the sequence of the DNA molecules (Han et al., 2021; Meiser et al., 2020). So far, two different strategies have been used for digital information storage in DNA. In the first strategy, DNA molecules with specific sequences are used to record information. These molecules are generated by de novo DNA synthesis or assembly for data writing (Church et al., 2012; Goldman et al., 2013; Grass et al., 2015; Organick et al., 2018). The second strategy uses pre-existing DNA molecules as the backbone for data recording. The information is stored at preset locations on a double-stranded DNA (dsDNA) or single-stranded DNA backbone by using gene editing or DNA hybridization to generate precise sequence or structural modifications on the backbone (Chen et al., 2019a; Chen et al., 2020a; Shipman et al., 2017; Tabatabaei et al., 2020). The first strategy offers higher storage densities but is rather expensive to write because it requires DNA synthesis. The second strategy is expected to be less expensive to write because it bypasses the costly stage of DNA synthesis. However, its lower storage density may limit future applications (Chen et al., 2019a; Chen et al., 2020a). DNA data storage technology can also be divided, according to the technical details of writing, copying, storage, and reading, into an in vitro “hard disk mode” and an in vivo “CD-ROM mode” (Figure 3B and C).
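The coding-system component described above can be illustrated with a minimal Python sketch. The two-bits-per-nucleotide mapping and the function names `encode` and `decode` are assumptions chosen for illustration; they are not the scheme of any study cited here, and practical codecs add error correction and sequence constraints.

```python
# Illustrative mapping (an assumption, not the scheme of any cited study):
# each pair of bits is written as one nucleotide.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Encode bytes as a DNA string, two bits per nucleotide."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(dna: str) -> bytes:
    """Decode a DNA string produced by encode() back into bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in dna)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


if __name__ == "__main__":
    strand = encode(b"hello")
    print(strand)          # 20 nucleotides for 5 bytes
    print(decode(strand))  # round trip back to the original bytes
```

With this mapping each byte occupies four nucleotides, so `encode(b"A")` gives `CAAC`. The writing and reading devices would then correspond to DNA synthesis and sequencing of the resulting string.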

Figure 3B shows the in vitro “hard disk mode”, which uses high-throughput DNA synthesis to write data and has the potential for high-density data storage, similar to ordinary hard disks. Data writing and reading in this mode are relatively simple because there is no cell membrane barrier (Church et al., 2012). However, in vitro storage has been shown to suffer from DNA strand loss during replication, high replication costs, and DNA degradation throughout storage (Gao et al., 2020; Heckel et al., 2019). The in vivo “CD-ROM mode”, depicted in Figure 3C, uses artificial chromosomes to store and distribute large amounts of data (Chen et al., 2021b). The in vivo mode provides a protective environment in which efficient replication and repair enzyme systems are naturally present, offering significant advantages in durability, fidelity, and replication cost. Its main advantage over the in vitro “hard disk mode” is that chromosomal DNA is replicated reliably and at low cost as part of cellular replication, which can be used for fast, low-cost data replication and dissemination. In addition, in vivo storage makes it easier to realize more advanced storage functions by constructing complex intracellular biological circuits, such as random reading and writing through gene editing, encryption and decryption, and communication with the flow of biological information. These additional options open up more possibilities for the in vivo mode, allowing a wider range of potential application scenarios such as cellular event recording, environmental toxicant detection, and disease marker monitoring.

Figure 3 The basic concept and storage modes of DNA data storage. A, The basic concept of DNA data storage. To achieve basic data writing and reading operations, DNA data storage requires three basic components: a coding system, a writing device, and a reading device. B, “Hard disk mode” based on in vitro synthesis and sequencing of massive DNA fragments. C, “CD-ROM mode” based on the manipulation of in vivo chromosomal DNA (redrawn with reference to Chen et al., 2021b). The details of the two storage modes are described in the main text.

# Major progress and the current state of DNA data storage

The feasibility of the in vitro “hard disk mode” has been demonstrated on the lab scale. Researchers at Columbia University introduced Fountain codes into DNA data storage to improve coding efficiency and to avoid GC-rich, complex DNA sequences that are difficult to construct and sequence. At a data scale of 2 MB (megabytes, $10^{6}$ bytes), a high storage density of 215 PB/g of DNA (PB, petabytes, $10^{15}$ bytes) was achieved (Erlich and Zielinski, 2017). In late 2018, researchers at the University of Washington achieved reliable random data access at a scale of 200 MB (Organick et al., 2018). Furthermore, a fully integrated DNA storage system was built, enabling the automatic writing, storage, and reading of the single word “hello” (Takahashi et al., 2019). The startup CATALOG has taken a different approach, using “DNA movable type” for high-speed data writing. It announced in 2019 that all 16 GB (gigabytes, $10^{9}$ bytes) of Wikipedia text can be written in DNA within 12 h (Versai, 2019), which is nearly 1000 times faster than any other currently used technology. Researchers at the Technion (Israel Institute of Technology) devised the concept of composite DNA letters, which lower the cost of writing by exploiting base composition information to increase the amount of data written per synthesis cycle (Anavy et al., 2019). Gao et al. accomplished low-bias DNA strand amplification using isothermal amplification (Gao et al., 2020). Chen et al. of Tianjin University used Low Density Parity Check (LDPC) and Reed-Solomon (RS) codes to encode video clips and text with a total size of 3 MB in DNA (Chen et al., 2020b). To address the sequence compatibility issue for DNA storage, Ping et al. devised a “yin-yang” encoding system (Ping et al., 2022).

Early proof-of-concept studies of the E. coli DNA CD-ROM paradigm used plasmids to store data (Bancroft et al., 2001; Davis, 1996). Later research focused on the implementation of genetic circuits, such as toggle switches, for data storage (Gardner et al., 2000). However, the storage capacity of such systems is clearly limited. Shipman et al. used CRISPR-Cas technology to store digital movies in bacterial cells and decoded them using high-throughput sequencing (Shipman et al., 2017). Later, Tang and Liu recorded a large number of cellular activities in cell populations by using two CRISPR-mediated analog multi-event recording apparatus (CAMERA) systems (Tang and Liu, 2018).
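Encoding schemes for the in vitro mode typically reject candidate strands that would be hard to synthesize or sequence, such as the GC-rich sequences mentioned above. The following Python sketch shows what such a screen might look like. The thresholds (40% to 60% GC, homopolymer runs of at most 3) and the homopolymer constraint itself are illustrative assumptions, not values taken from the cited studies.

```python
from itertools import groupby


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a non-empty sequence."""
    return sum(base in "GC" for base in seq) / len(seq)


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of one repeated base in a non-empty sequence."""
    return max(len(list(run)) for _, run in groupby(seq))


def passes_screen(seq: str, gc_min: float = 0.4, gc_max: float = 0.6,
                  max_run: int = 3) -> bool:
    """Return True if the candidate strand satisfies both constraints.

    All three thresholds are hypothetical defaults chosen for illustration.
    """
    return (gc_min <= gc_fraction(seq) <= gc_max
            and longest_homopolymer(seq) <= max_run)


if __name__ == "__main__":
    for candidate in ("ACGTACGTAC", "GGGGCCCCGG", "AAAACGCGCT"):
        print(candidate, passes_screen(candidate))
```

In a rejection-based encoder, a strand that fails the screen would simply be discarded and another candidate generated in its place.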

Recently, Chen et al. of Tianjin University designed and synthesized from scratch an artificial chromosome with a length of 254,886 bp for data storage. This study showed for the first time that assembled artificial chromosomes can be used for large-scale data distribution through reliable and low-cost cellular replication (Chen et al., 2021b). New concepts and ideas such as “DNA-of-things”, “bio-orthogonal information storage”, “true random number generation”, and “data encryption in DNA” have also been proposed (Banal et al., 2021; Bee et al., 2021; Fan et al., 2021; Koch et al., 2020), opening the way to a wide range of potential applications for DNA storage and computation. A recent review (Meiser et al., 2022) provides an excellent summary of these topics.

# Future perspective

DNA data storage involves a range of key technologies, including DNA synthesis, sequencing, microfluidics, and micro/nano fabrication, and requires multidisciplinary efforts to achieve the ultimate goal of turning DNA storage into practical applications. Although previous research has made significant progress in data volume, stability, and random access, cost, especially write cost, has become a major obstacle to the practical application of DNA data storage (Ceze et al., 2019; Dong et al., 2020; Ping et al., 2019). It is estimated that write costs will need to fall by 7–8 orders of magnitude for DNA data storage to compete with currently used tape-based storage technologies (Meiser et al., 2022). Despite several attempts, such as non-terminated TdT (Lee et al., 2019), DNA punch cards (Tabatabaei et al., 2020), DNA movable type (Versai, 2019), composite DNA letters (Anavy et al., 2019), and low-quality synthesis (Antkowiak et al., 2020), the competitive route to cost reduction remains unclear. That said, every information storage medium has faced the same challenge of high production costs in its early stages.

Modern storage technologies have advanced in accordance with Moore’s Law for decades. Notably, DNA synthesis and sequencing, the two key technologies in DNA data storage, are developing faster than Moore’s Law predicts. In the brief history of DNA storage since the first publication of chip-based DNA data storage by Church et al. in 2012 (Church et al., 2012), the data size has expanded more than 300-fold, showing a rapid upward trend.

In conclusion, we expect that, with the continuous development of enzymatic DNA synthesis and of data writing and reading methodologies, practical DNA data storage technologies will become available in the near future. As an environmentally friendly, high-capacity, and long-term storage medium, DNA is expected to compensate for the shortcomings of current storage media.

# Gene editing

A long-standing goal in the life sciences has been the ability to edit the DNA sequence of any living cell programmably, specifically, and efficiently, which would be of great value in gene research, gene therapy, genetic breeding, and synthetic biology. Previous approaches, such as meganucleases, zinc finger nucleases (ZFNs), and transcription activator-like effector nucleases (TALENs), rely on complex and specific protein-DNA interactions to target protein effectors to desired DNA sequences. Although these approaches are effective for targeting specific loci, their protein domains are difficult to reprogram rapidly and simply toward new genomic loci of interest. The discovery and engineering of the clustered regularly interspaced short palindromic repeats (CRISPR) system has sparked a new and exciting renaissance in the field of genome editing.

# Strictly protein-based genome editing systems

Meganucleases, ZFNs, and TALENs are powerful biological tools that can be used for genome editing (Figure 4A). Meganucleases (also called homing endonucleases) are large protein complexes that recognize specific DNA sequences through a complex network of interactions between the protein and the target DNA. Although previous efforts successfully retargeted meganucleases to new, user-defined genome sequences (Epinat et al., 2003; Silva et al., 2011), the process was extremely laborious, time-consuming, and technically challenging. The application of meganucleases fundamentally relies on the large-scale reprogramming of entire protein complexes so that they recognize new DNA regions of interest. Identifying variants against each newly defined target sequence therefore requires high-throughput screening of variant libraries, and more readily programmable methods were needed for robust DNA targeting.

Zinc finger proteins are small protein modules, each capable of recognizing a specific sequence of three DNA bases. These proteins are common in nature, and previous studies have identified the key components of individual zinc finger modules that determine their specific 3-base-pair DNA-binding sequences. Zinc finger modules can be fused into a modular array to target longer, specific DNA sequences. In addition, researchers fused these zinc finger arrays with the FokI protein, which can cut DNA. To minimize undesired random DNA cleavage in living cells, the researchers split the FokI protein into two halves, each recruited to the target region of DNA by its own zinc finger array (Kim et al., 1996; Urnov et al., 2005). Consequently, only the combination of two zinc finger nucleases (ZFNs) cuts DNA, and it does so specifically and precisely. ZFNs have been shown to function in human, animal, and plant cells, and thus play an important role in programmable genome editing.

Figure 4 Overview of genome editing technologies. A, Nuclease-based genome editing technologies that target DNA, including meganucleases, ZFNs, TALENs, CRISPR-Cas9, CRISPR-Cas12 and new small Cas variants. B, Precision DNA genome editing technologies including the cytosine base editor, adenine base editor, and prime editor. C, RNA editing technologies including CRISPR-Cas13, CRISPR-Cas7-11, and RNA base editing approaches like REPAIR, RESCUE, and other Cas-free RNA editing approaches.

Following the discovery of zinc finger proteins, researchers identified transcription activator-like (TAL) effectors from plant pathogens. Unlike zinc fingers, each repeat of a TAL effector (TALE) binds to a single DNA base, so TALEs can be programmed to bind to specific DNA sequences. TALEs are fused to the dimer-forming FokI nuclease to generate TALE nucleases (TALENs), a fully protein-based programmable genome editing technology (Christian et al., 2010; Li et al., 2011b). Compared with ZFNs, TALENs are easier to program because each DNA base is recognized by a single unit rather than through the triplet-based recognition of zinc fingers. However, TALENs are larger than ZFNs, so delivery remains challenging. In addition, a new protein must be constructed for every target, which is not easy when seeking to edit the genome of living cells extensively.
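The difference in programmability can be made concrete with a short Python sketch that lists the recognition units needed for a given target. The repeat-variable diresidue (RVD) preferences used here (NI for A, HD for C, NN for G, NG for T) are the commonly reported ones and are not taken from the text above; the function names are hypothetical, and real designs must also account for context effects that this sketch ignores.

```python
# Commonly reported RVD-to-base preferences; an assumption added for
# illustration, not a table from this article.
BASE_TO_RVD = {"A": "NI", "C": "HD", "G": "NN", "T": "NG"}


def tale_repeats(target: str) -> list[str]:
    """One TALE repeat (named by its RVD) per target base."""
    return [BASE_TO_RVD[base] for base in target.upper()]


def zinc_finger_subsites(target: str) -> list[str]:
    """Split a target into the 3-bp subsites read by individual zinc fingers."""
    if len(target) % 3 != 0:
        raise ValueError("target length must be a multiple of 3")
    target = target.upper()
    return [target[i:i + 3] for i in range(0, len(target), 3)]


if __name__ == "__main__":
    site = "GCGTGGGCG"
    print(tale_repeats(site))          # nine repeats, one per base
    print(zinc_finger_subsites(site))  # three fingers, one per triplet
```

A TALE array is assembled by looking up one repeat per base, whereas each zinc finger must be selected or engineered for a whole triplet.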

# CRISPR-Cas systems

While studying bacterial genomes, researchers identified a repeating stretch of DNA that was named a clustered regularly interspaced short palindromic repeats (CRISPR) array (Ishino et al., 1987; Jansen et al., 2002). Subsequent studies demonstrated that CRISPR arrays and the proteins encoded nearby, the CRISPR-associated (Cas) proteins, function as a bacterial immune system against invading foreign nucleic acids (Barrangou et al., 2007; Marraffini and Sontheimer, 2008). When bacteria are exposed to foreign DNA fragments, the immune system captures a small portion of the foreign DNA and integrates that sequence into the CRISPR array in the bacterial genome itself. This discovery was critical to the development of CRISPR-Cas as a revolutionary genome editing technology.

CRISPR arrays were found to encode RNAs that associate with Cas proteins and direct them to a matching protospacer sequence in the target DNA. Subsequent engineering demonstrated that the targeting sequences in CRISPR RNAs can be easily replaced with user-defined sequences (Cong et al., 2013; Gasiunas et al., 2012; Jinek et al., 2012; Mali et al., 2013) (Figure 4A), which completely reprograms the sequence targeted by the Cas protein. This discovery plays an important role in the field of genome editing, because for the first time genome editing reagents could be reprogrammed simply by replacing a nucleic acid sequence, unlike previous approaches that required complex and high-throughput protein engineering. Once bound to a target DNA sequence, the Cas protein cleaves the double-stranded DNA, creating damage in the genome of living cells.

# Genetic knockout

Meganucleases, ZFNs, TALENs, and DNA-targeting CRISPR-Cas systems all operate by cleaving double-stranded DNA. Following the generation of DNA double-strand breaks (DSBs), a cell’s endogenous repair machinery rapidly repairs the lesions. A perfectly repaired site can serve as a substrate for additional cleavage, and this cycle continues until non-homologous end joining (NHEJ) or microhomology-mediated end joining (MMEJ) repair results in random small DNA insertions or deletions (INDELs) around the target site (Moore and Haber, 1996). INDELs can result in gene knockout, which is useful in certain cases but lacks precision. Homology-directed repair (HDR) is a competing repair process in which a nucleic acid donor template is used to repair DNA (Liang et al., 1998). Although programmable, HDR is extremely inefficient when compared to NHEJ/MMEJ repair. Therefore, new genome editing technologies are needed to edit DNA sequences efficiently and precisely.

# New CRISPR proteins

The Cas protein is a key component in the CRISPR genome editing technology. Streptococcus pyogenes (Sp) Cas9 was the first Cas protein engineered for genome editing applications and remains widely used when developing new editing techniques. DNA-targeting Cas proteins are known to require a protospacer-adjacent motif (PAM), a small stretch of DNA located directly adjacent to the target genomic locus. This limitation on the targeting range of Cas proteins remains a challenge when trying to edit at other positions in a cell’s genome. Researchers have discovered a large number of novel Cas proteins with diverse PAM requirements (Ran et al., 2015; Zetsche et al., 2015), thereby expanding the targeting range of CRISPR genome editing technology. In addition, protein engineering and directed evolution efforts have successfully altered the PAM requirements of Cas proteins (Miller et al., 2020b; Nishimasu et al., 2018; Walton et al., 2020), which has further broadened the range of genomic sites that can be targeted using CRISPR-Cas.

Recently, many new small Cas proteins have been discovered. SpCas9 is 1368 amino acids long, and the fusion of effector proteins in base editors and prime editors extends this length further. Increased length negatively affects the stability and delivery of genome editing proteins. New CRISPR-Cas proteins, such as Cas12f (Kim et al., 2022; Wu et al., 2021; Xu et al., 2021a), formerly called Cas14 (Harrington et al., 2018), CasΦ (Pausch et al., 2020), and CasX (Liu et al., 2019a), are smaller than many previously discovered Cas proteins (Figure 4A). However, further engineering, discovery, and evolution efforts are required to improve the editing efficiency of these new Cas proteins.
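To illustrate how a PAM requirement restricts the targeting range, the following Python sketch scans one DNA strand for candidate SpCas9 sites. It assumes the widely reported SpCas9 PAM of NGG located immediately 3′ of a 20-nt protospacer, which is not stated in the text above. The function name `find_spcas9_sites` is hypothetical, and the reverse strand, off-target scoring, and guide efficiency are all ignored.

```python
import re


def find_spcas9_sites(seq: str, protospacer_len: int = 20) -> list[tuple[int, str, str]]:
    """Return (start, protospacer, PAM) for every NGG PAM on the given strand.

    Only the supplied strand is scanned; a full search would also scan the
    reverse complement.
    """
    seq = seq.upper()
    # A lookahead is used so that overlapping candidate sites are all reported.
    pattern = rf"(?=([ACGT]{{{protospacer_len}}})([ACGT]GG))"
    return [(m.start(), m.group(1), m.group(2)) for m in re.finditer(pattern, seq)]


if __name__ == "__main__":
    example = "A" * 20 + "TGG" + "CC"
    for start, protospacer, pam in find_spcas9_sites(example):
        print(start, protospacer, pam)
```

Positions that lack a suitable PAM are simply never returned, which is the limitation that Cas variants with different or relaxed PAM requirements aim to relieve.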

# Base editing

Base editing is a programmable, efficient, and precise genome editing technology built on the ability to localize DNA-binding proteins to sequences of interest. The first class of base editors designed, called cytosine base editors (CBEs), exploited the ability of Cas proteins to bind and unwind target regions of DNA into a single-stranded state (Figure 4B). CBEs consist of a single-strand-specific cytidine deaminase fused to a Cas protein, and they deaminate cytosines in the region of endogenous DNA targeted by the Cas protein (Komor et al., 2016). Deamination of cytosine bases in DNA produces uracil, which is converted to thymine through replication and repair by endogenous cellular processes. To improve editing efficiency, CBEs also include a uracil glycosylase inhibitor (UGI) to inhibit endogenous uracil N-glycosylase (UNG), which specifically recognizes uracil bases in the cell genome (Acharya et al., 2003; Komor et al., 2017; Krokan et al., 2002). The presence of localized UGI prolongs the lifespan of the uracil intermediate, thereby promoting the permanent incorporation of thymine after repair.

To further facilitate editing, the Cas protein is converted into a nickase that cleaves the strand opposite the edited strand. This prompts the cellular repair machinery to replace the nicked strand, using the strand that contains the newly edited uracil as the repair template. An edit on one DNA strand is thereby converted into a permanent edit on both strands, significantly increasing the efficiency of base editing.

Adenine base editors (ABEs) were the second class of base editors developed. ABEs contain a laboratory-evolved adenosine deaminase that converts adenine bases in DNA to inosine (Gaudelli et al., 2017) (Figure 4B). Inosine is subsequently read as guanine by endogenous cellular polymerases. Advanced directed evolution methods have further improved editing efficiency and expanded the utility of adenine base editing (Gaudelli et al., 2020; Richter et al., 2020).

The original base editing methods used Cas proteins to unwind DNA and expose single-stranded DNA sequences as substrates for deamination. A new class of base editors called DddA-derived cytosine base editors (DdCBEs) uses a naturally occurring double-stranded DNA cytidine deaminase called DddA to perform base editing without unwinding the DNA (Mok et al., 2020). DNA-binding proteins, such as TALEs or ZFs, can be fused to split DddA halves and UGIs to direct cytosine base editing to target DNA sequences in the absence of Cas proteins. Further, by fusing catalytically impaired DddA variants with the adenine deaminase TadA8e, targeted A-to-G editing was achieved in human mitochondria (Cho et al., 2022).

CBEs, ABEs, and DdCBEs can all edit DNA precisely and efficiently to create CG-to-TA (CBE and DdCBE) or AT-to-GC (ABE) base conversions. However, many other types of genome edits, such as other base conversions and programmable insertions and deletions, require newer precision editing techniques.
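The two conversion types can be sketched as a simple string transformation in Python. This is an idealized model: the editing window of protospacer positions 4 to 8 is an illustrative assumption rather than a value from the text, the function name `base_edit` is hypothetical, and every target base inside the window is converted, which real editors do not guarantee.

```python
def base_edit(protospacer: str, editor: str, window: tuple[int, int] = (4, 8)) -> str:
    """Apply an idealized base edit to the protospacer strand.

    Positions are 1-based and counted from the PAM-distal end. The default
    window is a hypothetical value used only for illustration.
    """
    conversions = {"CBE": ("C", "T"), "ABE": ("A", "G")}
    source, product = conversions[editor]
    start, end = window
    edited = []
    for position, base in enumerate(protospacer.upper(), start=1):
        if start <= position <= end and base == source:
            edited.append(product)  # target base inside the window is converted
        else:
            edited.append(base)     # everything else is left unchanged
    return "".join(edited)


if __name__ == "__main__":
    target = "GACCATCAGGTACGTTAGCA"
    print(base_edit(target, "CBE"))  # C-to-T at positions 4 and 7
    print(base_edit(target, "ABE"))  # A-to-G at positions 5 and 8
```

The model also makes the bystander problem visible: any additional C or A inside the window is edited along with the intended base.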

# Prime editing

Prime editing is a precision genome editing technology that uses a Cas protein’s ability to bind DNA and nick one strand of DNA (Anzalone et al., 2019) (Figure 4B). In contrast to base editors, prime editors nick the unwound single-stranded R-loop DNA following Cas binding. Following the specific nicking of this strand, the released DNA can serve as a primer to perform subsequent DNA polymerization.

Another critical component of prime editors is the prime editing guide RNA (pegRNA), which encodes a primer binding site that is complementary to the single-stranded R-loop DNA released by the Cas protein nick, and a template region encoding the desired DNA edit. Following RNA-DNA hybridization, a reverse transcriptase fused to the Cas protein uses the pegRNA as a template to extend the genomic DNA. An orthogonal Cas protein guide RNA that targets a region 3′ of the prime edit can be used to further enhance editing. Following subsequent DNA replication and repair, the newly synthesized DNA sequence can be permanently integrated into the genome, resulting in a programmable and versatile edit dictated by the pegRNA sequence. Initial demonstrations of prime editing showed relatively low editing efficiency (Lin et al., 2020). However, subsequent modifications of the procedure, such as optimized primer binding site melting temperatures (Lin et al., 2021), the use of two pegRNAs (Anzalone et al., 2021; Choi et al., 2022; Lin et al., 2021; Wang et al., 2022), DNA repair manipulations (Chen et al., 2021a), RNA stability motifs (Nelson et al., 2022), and modification of the reverse transcriptase enzyme (Zong et al., 2022), have greatly improved prime editing efficiencies.
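The relationship between the nicked strand, the primer binding site (PBS), and the reverse transcription template can be sketched in Python as follows. The function `peg_extension`, the default PBS length of 13 nt, and the use of DNA letters in place of RNA letters are assumptions made for illustration; this is not a pegRNA design tool.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence written 5'->3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def peg_extension(upstream_of_nick: str, edited_downstream: str,
                  pbs_len: int = 13) -> tuple[str, str]:
    """Return (rt_template, pbs), each written 5'->3' in DNA letters.

    upstream_of_nick: nicked-strand sequence ending at the nick (5'->3').
    edited_downstream: desired sequence that should follow the nick (5'->3').
    pbs_len: hypothetical default; real designs tune this per target.
    """
    if not 1 <= pbs_len <= len(upstream_of_nick):
        raise ValueError("pbs_len must be between 1 and len(upstream_of_nick)")
    # The PBS pairs with the 3' end of the nicked strand, which acts as primer.
    pbs = reverse_complement(upstream_of_nick[-pbs_len:])
    # The template is copied by the reverse transcriptase, so it is the
    # reverse complement of the sequence to be written after the nick.
    rt_template = reverse_complement(edited_downstream)
    return rt_template, pbs


if __name__ == "__main__":
    rtt, pbs = peg_extension("GATTACAGGC", "TTGA", pbs_len=4)
    print(rtt + pbs)  # 3' extension of the pegRNA: template followed by PBS
```

In this sketch the 3′ extension is the template followed by the PBS, so reverse-complementing it gives back the primer region plus the edited sequence.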

# RNA editing

Editing at the RNA level avoids permanent changes to the genome, thereby reducing the risk of off-target DNA editing. A class of RNA-targeting Cas proteins, such as Cas13a (Abudayyeh et al., 2017; Cox et al., 2017; Xu et al., 2021b) and Cas7-11 (Özcan et al., 2021), programmably target RNA sequences determined by CRISPR guide sequences (Figure 4C). Similar to the development of DNA base editors, researchers developed RNA base editors by fusing RNA-specific adenosine deaminases to RNA-targeting Cas proteins (Figure 4C). In the “RNA editing for programmable A-to-I replacement (REPAIR)” technique, the adenosine deaminase RNA-specific (ADAR) protein is fused to a Cas13 protein to convert adenosine to inosine in RNA (analogous to ABE for DNA) (Cox et al., 2017). Similarly, engineered ADAR proteins that can deaminate cytosines in RNA were used to develop the “RNA editing for specific C-to-U exchange (RESCUE)” technology, which converts cytosine bases to uracils in RNA (analogous to CBE for DNA) (Abudayyeh et al., 2019).

New CRISPR-free RNA editing systems have been developed to perform site-specific RNA editing by exploiting the ability of RNA nucleic acids to recruit endogenous proteins to chemically react with RNA (Merkle et al., 2019; Reautschnig et al., 2022) (Figure 4C). Furthermore, a parallel technique demonstrated that longer RNAs can naturally recruit ADARs for A-to-I editing of RNAs (Qu et al., 2019). Recently, aggregation design and loop design of RNA have considerably improved the editing efficiency and specificity of RNA editing techniques (Katrekar et al., 2022; Reautschnig et al., 2022; Yi et al., 2022).

The past decade has been marked by the rapid development of new genome editing technologies. From initial protein-based approaches to precise genome editing techniques such as prime editing, the ability to manipulate the genomes of living cells and organisms is increasingly exciting. RNA editing techniques are also starting to become more precise and efficient. The continued development of smaller, precise, accurate, and efficient genome editing tools is urgently required, especially for application in areas such as therapeutics, agriculture, and biological research.

# Applications of genome editing

The development of genome editing technologies has enabled great advances throughout biomedicine and agriculture. There has been a flurry of advances from biotechnology companies to generate new genome editing medicines. Recently, researchers have advanced in vivo genome editing technologies such as CRISPR-Cas9 and base editing to treat genetic disorders like sickle cell anemia (Newby et al., 2021), progeria (Koblan et al., 2021), and transthyretin amyloidosis (Gillmore et al., 2021), or genetic conditions like hypercholesterolemia (Verve Therapeutics, 2022. https://ir.vervetx.com/news-releases/news-release-details/verve-therapeutics-doses-first-human-investigational-vivobase).

The application of genome editing in agriculture has sparked new excitement for future biological crop breeding (Gao, 2021). Disease resistance and herbicide resistance are two of the most developed traits across many crop species. Recently, researchers demonstrated that four simultaneous multiplex editing events enabled the creation of wheat plants with disease resistance and increased yield (Li et al., 2022). Furthermore, many edits of endogenous genes have been shown to generate robust herbicide resistance (Zhang et al., 2019a; Zhang et al., 2021c). Genome editing will continue to enable the creation of valuable agricultural crop varieties.

# Molecular evolution of proteins

In vitro molecular evolution of proteins accelerates the natural evolution of proteins in a test tube, creating vast opportunities for protein science and its applications. A pioneer of the method, Frances H. Arnold, shared the 2018 Nobel Prize in Chemistry. In recent years, great efforts have been made to build more efficient methods for the directed evolution of proteins, which not only contribute to a deeper understanding of the fundamental science of proteins but can also create enzymes and antibodies that are superior to natural ones or that do not exist in nature, thereby promoting the application of synthetic biology.

# Structure-based evolution

A strategy based on a deep understanding of protein structure-function relationships, called rational design, can generate desired mutants in a short period of time. The precise definition of mutational “hot spots” is the key to achieving the desired results. Furthermore, building smaller but intelligent mutation libraries can considerably speed up the evolutionary process (Chica et al., 2005).

With the rapid development of bioinformatics, “hot spot” prediction has become popular, because mutations at certain residue positions can have a significant impact on the specific function of enzymes (Lutz, 2010). Computational tools have been developed to identify and assess favorable hot spots (Ofran and Rost, 2007). For example, the ConSurf web server (Ashkenazy et al., 2010) can analyze evolutionary conservation patterns mapped onto protein structures, the LigPlot+ program (Laskowski and Swindells, 2011) can generate schematic diagrams of protein-ligand interactions, and CAVER 3.0 (Chovancova et al., 2012) can identify and visualize tunnels and channels in protein structures. The web server PoPMuSiC (Dehouck et al., 2011) can estimate protein stability changes upon mutation, and the algorithms ASRA and Innov’SAR are well suited as guides for saturation mutagenesis at sites within the binding pocket to enhance stereoselectivity and activity (Cadet et al., 2018; Li et al., 2019a).

Various robust strategies focusing on active site engineering have subsequently been developed and applied to lipases (Liu et al., 2015), glucanases (Niu et al., 2016), xylanases (Wang et al., 2021a), and other enzymes, yielding major achievements in enzyme evolution. Structure-directed mutational screening of multiple residues in the substrate-binding pocket of the thioesterase TesA strongly alters its substrate selectivity (Deng et al., 2020). The active site stabilization (ACS) strategy effectively enhances the kinetic stability of the lipase CalB by increasing the rigidity of the targeted active site (Xie et al., 2014). Non-canonical amino acid (ncAA) technology has significantly expanded the functional scope of synthetic polypeptide materials by incorporating new chemical functions that may facilitate material fabrication (Wu et al., 2013). Click chemistry has been applied to protein modification to increase the molecular weight of a protease, with dextran as the modifier, successfully optimizing the application of the protease in wool biofelting processing (Shao et al., 2019).

Through structural and phylogenetic analysis, loop remodeling reconstituted a phosphotriesterase (PTE) with PTE-like lactonase activity within several mutational steps, demonstrating the potential role of loop remodeling in the rapid divergence of new enzyme functions (Afriat-Jurnou et al., 2012). The stepwise loop insertion strategy (StLois) identifies target regions through structural and functional analysis of the corresponding enzymes and then expands the residues in the loop regions, providing new active-site structures with new catalytic properties (Hoque et al., 2020). Domain swapping helps reveal structural and functional information on important regulators, such as $\beta$-repressors (Ghosh et al., 2019) and decay accelerators (Panwar et al., 2016).

Semi-rational design introduces random mutations at selected residues: saturation mutagenesis creates a small library of mutants containing all possible substitutions at the selected residues, a small number of which may be beneficial. Notably, with the help of codon degeneracy, extended versions of the combinatorial active-site saturation test (CAST) (Reetz et al., 2005) and iterative saturation mutagenesis (ISM) (Reetz and Carballeira, 2007) allow efficient library construction. Considerable progress has been made with the creation of “smart libraries” (Qu et al., 2020). These approaches have been reported to successfully improve enzyme properties such as thermostability, catalytic activity, and enantioselectivity (Fan et al., 2022; Tan et al., 2022; Tong et al., 2022). The combination of enzyme engineering and systems metabolic engineering has also significantly increased the metabolic flux towards target products (Qian et al., 2019; Yang et al., 2020).

# Random mutagenesis

Directed evolution, which relies on the sequence information of enzymes rather than on their structural information, offers a promising way to obtain desired mutants in the laboratory over months rather than millions of years. The sequence space for variation is very large; for example, mutating four residues may yield $160{,}000$ ($20^{4}$) sequences. One of the key issues in directed evolution is how to efficiently generate mutant libraries.
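The size of this sequence space can be checked with a few lines of arithmetic. The sketch below is only illustrative: the NNK degenerate codon (N = A/C/G/T, K = G/T, 32 codons per site) is an assumption introduced here as a common way to build saturation libraries and is not taken from the text, and the function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical residues


def protein_space(n_sites: int) -> int:
    """Number of distinct protein variants when n_sites are fully randomized."""
    return len(AMINO_ACIDS) ** n_sites


def nnk_codon_space(n_sites: int) -> int:
    """Number of DNA variants if each site uses an NNK codon (assumed scheme).

    N is any of 4 bases and K is G or T, so one site has 4 * 4 * 2 = 32 codons.
    """
    return (4 * 4 * 2) ** n_sites


if __name__ == "__main__":
    print(protein_space(4))    # protein-level variants for four sites
    print(nnk_codon_space(4))  # DNA-level variants that must be covered
```

Under this assumed codon scheme the DNA library is several times larger than the protein library, which is one reason library generation and screening capacity have to be considered together.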

A commonly used method is error-prone PCR (epPCR), which introduces random changes into genes. The mutation rate can be increased considerably by changing the PCR reaction conditions. Zaccolo and Gherardi raised the mutation rate to one mutation every five base pairs by changing the PCR conditions and the number of mutagenic PCR cycles (Zaccolo and Gherardi, 1999). So far, epPCR has achieved many successes, such as improving the activity, substrate affinity and stability of enzymes (Brands et al., 2020; Ruan et al., 2022; Zhou et al., 2021; Zhu et al., 2019). Most importantly, epPCR also represents a powerful method for studying molecular evolution by analyzing large-scale sequence diversity (Konno et al., 2022).

DNA shuffling mimics natural homologous recombination, another mechanism of natural evolution. During DNA shuffling, two or more related starting genes are recombined, resulting in a pool of variant genes with new combinations of random sequences. Compared to epPCR, DNA shuffling combines fragments of related functional proteins, resulting in novel sequences with a relatively high probability of being compatible with the desired protein structure and function. An example is the generation of active halogenase variants from catalase with altered selectivity to expand the enzymatic halogenation capacity of unactivated C–H bonds (Neugebauer et al., 2021). Similarly, motif shuffling based on BRC repeat modularity was used to generate stronger chimeras that bind to RAD51 (Lindenburg et al., 2021).

Recently, a number of promising techniques have been developed for in vivo protein evolution (Figure 5). These methods generate multiple random mutations directly within the host organism by targeting mutagenic enzymes or nucleases to DNA (Kim et al., 2019). CRISPR/Cas9 ushered in a new era of genome editing and has also been applied to protein engineering. The EvolvR system consists of a Cas9 nickase and an error-prone DNA polymerase I (Pol I) that continuously generates mutations in tunable windows under the guidance of gRNAs (Halperin et al., 2018). Another in vivo mutagenesis method, CRISPR-Enabled Trackable Genome Engineering (CREATE), utilizes the CRISPR/Cas9 system and barcoded tracking cassettes to mutate multiple sites and track them (Garst et al., 2017). It can form single-base libraries for entire protein sequences to construct a saturated library in which every amino acid residue is substituted (Liang et al., 2017; Reynolds et al., 2017). Phage-assisted continuous evolution (PACE) is another strategy for in vivo evolution. It couples the survival of the M13 phage to the activity of the gene being mutated in E. coli. In general, PACE is able to evolve any protein whose activity can be linked to the expression of an essential phage gene (Miller et al., 2020c). Owing to the short generation time of the phage life cycle, dozens of rounds of evolution can occur in a day without human intervention. In addition, T7 RNA polymerase is used in several in vivo protein evolution systems because of its binding affinity for DNA. MutaT7 is a chimeric protein containing T7 RNA polymerase and cytidine deaminase that can edit or mutate specific genes downstream of the T7 promoter (Moore et al., 2018a). More recently, targeted in vivo diversification enabled by T7 RNAP (TRIDENT) has been developed as an evolution platform based on T7 RNA polymerase, offering increased mutational diversity and higher in vivo mutation rates (Cravens et al., 2021).

# High-throughput screening

High-throughput screening is a technique used to obtain desired mutants from large variant libraries. Microtiter plate-based screening methods are the most commonly used methods in the directed evolution of enzymes. These systems have the advantages of simple setup, convenient operation and strong versatility. However, the screening capacity is relatively low, usually limited to $10^{3}$–$10^{4}$ colonies per day. To increase the speed of screening, automated equipment such as robotic liquid handling units and colony picking systems have been developed (Nirantar, 2021). In high-throughput screening methods, colorimetric or fluorogenic substrates are often used to measure enzymatic activity (Giger et al., 2013). Such assays can also be combined with pH indicators (Bornscheuer et al., 1999; Morís-Varas et al., 1999) or with enzymatic cascades that generate absorbance or fluorescence signals (Malhotra et al., 1996).

Figure 5 Overview of emerging in vivo mutagenesis methods. A, Schematic of PACE. PACE utilizes the survival of M13 phage to mutate genes in E. coli. MP, mutagenic plasmid; AP, accessory plasmid; SP, selection plasmid. B, Schematic of EvolvR. EvolvR utilizes a chimeric protein of error-prone Pol I and a nicking variant of Cas9 to specifically mutate genes targeted by gRNA. nCas9, nicking variant of Cas9. C, Schematic of CREATE. CREATE utilizes CRISPR-Cas9 with barcode-tracking of mutations for multiplex genome engineering.

Growth-complementary selection is a powerful screening method whenever the target enzyme is critical for host cell survival. This approach has been widely applied to enzymes associated with major metabolic pathways, including tRNA synthetases (Zhao et al., 2021), proteases (Verhoeven et al., 2012), amino acid synthesis isomerases (Jürgens et al., 2000), and more. Similarly, enzymes with desired functions, such as base editors, can be screened for by rescuing defective antibiotic resistance genes containing point mutations at key positions (Gaudelli et al., 2017).

Fluorescence-activated cell sorting (FACS) (Aharoni et al., 2006) and fluorescence-activated droplet sorting (FADS) (Agresti et al., 2010) have screening throughputs larger than $10^{6}\ \mathrm{h}^{-1}$, making them technical benchmarks for ultra-high-throughput screening. In pioneering studies, fluorogenic substrates for glycosyltransferases were designed that can move freely in and out of cells, while the fluorescent products are trapped in cells and screened by flow cytometry (Tan et al., 2019; Yang and Withers, 2009). For cells that cannot absorb desired substrates or retain fluorescent signals, FADS uses droplets as enzymatic microreactors to separate individual cells. Microfluidic chip systems allow for multiple operations such as droplet production, cell lysis, reagent addition, incubation, fluorescence detection, and dual-channel screening (Ma et al., 2018b). Recently, a method combining FACS and FADS was invented in which intact double-emulsion droplets can be selected using a commercial FACS instrument (Brower et al., 2020).
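To put the two throughput figures side by side, the following back-of-the-envelope sketch estimates how long a single pass over the four-residue library of $20^{4}$ variants mentioned earlier would take. It assumes one measurement per variant with no oversampling, which is a simplification; the function names are hypothetical.

```python
def days_to_screen(library_size: float, variants_per_day: float) -> float:
    """Days needed for a single pass over the library at a per-day throughput."""
    return library_size / variants_per_day


def hours_to_screen(library_size: float, variants_per_hour: float) -> float:
    """Hours needed for a single pass over the library at a per-hour throughput."""
    return library_size / variants_per_hour


LIBRARY = 20 ** 4  # four fully randomized residues

if __name__ == "__main__":
    # Microtiter plates: 10^3 to 10^4 colonies per day (figures from the text).
    print(days_to_screen(LIBRARY, 1e4), "to", days_to_screen(LIBRARY, 1e3), "days")
    # FACS/FADS: at least 10^6 events per hour (figure from the text).
    print(hours_to_screen(LIBRARY, 1e6) * 60, "minutes at most")
```

Real campaigns usually oversample the library several-fold, so these numbers are lower bounds on the effort rather than predictions.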

Protein display technology is an important platform for screening protein or peptide binding activity. Phage surface display was first used to study antigen-antibody binding (Smith, 1985). Various cell display methods were subsequently invented, such as bacterial display (Charbit et al., 1986) and yeast surface display (Schreuder et al., 1993). Cell display methods are also widely used for directed evolution, such as improving the stability of $\beta$-lactamases (Kather et al., 2008) and expanding the substrate spectrum of DNA polymerases (Chen et al., 2016). Likewise, cell-free display methods such as ribosome display (Hanes and Plückthun, 1997) and mRNA display (Amaral et al., 2017; Wilson et al., 2001) have accelerated the directed evolution of enzymes. Compared to phage display, cell surface display provides a larger display surface and can also be screened by FACS/FADS if relevant fluorescence assays are available. In addition, cell-free display overcomes the limits that transformation efficiency imposes on cell-based display methods, as it can handle libraries of up to $10^{14}$ members, and it is also suitable for producing toxic or unstable proteins.

# Computer-aided design of functional proteins

Proteins are the main cellular macromolecules with a plethora of biological functions and constitute the basic building blocks of biological systems. However, since the sequence-structure-function space of protein systems is vast, it is extremely challenging to solve protein-related problems mathematically. The efficient design of proteins, one of the core tasks of synthetic biology, significantly compresses the search space at an acceptable cost in accuracy. The goal of computational protein design is to employ algorithms to create proteins that can fold into specific structures and have desired functions (Brini et al., 2020). With breakthroughs in the computational prediction of protein structures and the continuous emergence of sequence design algorithms, it has become possible to develop computational protein design platforms that support synthetic biology (Huang et al., 2016).

# Algorithms for designing protein structures

Currently, protein sequences are usually designed on a fixed backbone taken from existing protein structures. Compared with the narrow structural space this defines, the corresponding sequence space is vast, and negative epistasis can significantly weaken the foldability of the designed protein. Therefore, sequence design requires the development of targeted algorithms and strategies. Commonly used methods for computational protein design can be divided into the following categories (Richter and Baker, 2013). (i) Backbone generation: construct the backbone conformation model according to the requirements for sequence design. (ii) Side-chain placement: for a given protein backbone, select a set of suitable amino acid side-chain conformations that meet the requirements of the backbone structure. This is where the sequence is actually designed, so the step is also known as protein sequence design. (iii) Rigid-body placement: fix the relative spatial position and orientation between proteins or between a protein and a small molecule. (iv) Negative design: raise the energy of non-target states to achieve effective folding, which can be considered an optimization of and supplement to the side-chain placement algorithm.

Computational design of proteins typically involves three steps. First, discrete side-chain conformations are placed on the main chain. Next, the energies between the inserted side chain and the native side chains, and between the side chains and the backbone, are calculated. Finally, the combination of sequence and conformation is optimized by a search algorithm (Figure 6). The fixed backbone is given in advance (e.g., derived from native protein structures), whereas the type of amino acid residue at each backbone position and its side-chain conformation are unknown and must be calculated. The possible residue choices and structural states at the different positions make up the amino acid sequence and side-chain conformation space. Energy functions defined on this space are used to evaluate specific sequence and conformation combinations, and search algorithms automatically explore the sequence space and the side-chain conformations to find the lowest-energy solution. To model the conformation of a mutated side chain correctly, the existing structure must be repacked. This step is usually performed using the backbone-dependent rotamer library of the software, while the optimization of the side chains is energy-dependent.
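The three steps can be illustrated on a toy problem that is small enough to solve by enumeration. Everything in the sketch below is hypothetical: the three positions, the candidate (residue, rotamer) states and all energy values are made up for illustration and do not come from any real rotamer library or force field.

```python
from itertools import product

# Step 1: discrete (residue, rotamer index) states per backbone position.
STATES = [
    [("ALA", 0), ("LEU", 0), ("LEU", 1)],
    [("SER", 0), ("VAL", 0)],
    [("ASP", 0), ("LYS", 0), ("LYS", 1)],
]

# Step 2a: side chain vs. backbone energies, indexed [position][state].
SELF_ENERGY = [
    [0.0, -1.0, -0.5],
    [-0.2, -0.8],
    [-0.3, -0.6, -0.4],
]

# Step 2b: side chain vs. side chain energies for selected state pairs.
# Keys are ((pos_i, state_i), (pos_j, state_j)) with pos_i < pos_j;
# pairs that are not listed are treated as non-interacting.
PAIR_ENERGY = {
    ((0, 1), (1, 1)): -1.5,  # favourable packing
    ((0, 2), (1, 1)): 2.0,   # steric clash
    ((1, 0), (2, 2)): -0.7,  # hydrogen bond
    ((0, 1), (2, 1)): 0.9,   # mild clash
}


def total_energy(assignment):
    """Sum of self and pairwise energies for one state index per position."""
    energy = sum(SELF_ENERGY[pos][s] for pos, s in enumerate(assignment))
    for i in range(len(assignment)):
        for j in range(i + 1, len(assignment)):
            key = ((i, assignment[i]), (j, assignment[j]))
            energy += PAIR_ENERGY.get(key, 0.0)
    return energy


def exhaustive_design():
    """Step 3: search all combinations and return the lowest-energy one."""
    choices = [range(len(states)) for states in STATES]
    best = min(product(*choices), key=total_energy)
    return best, total_energy(best)


if __name__ == "__main__":
    best, energy = exhaustive_design()
    print([STATES[pos][s] for pos, s in enumerate(best)], round(energy, 2))
```

Enumeration works here only because there are 18 combinations; for real proteins the number of combinations grows exponentially with the number of positions, which is why stochastic search is needed.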

The energy function is the basis for characterizing the different conformational structures of each sequence combination. Different algorithms use different energy functions, mainly comprising physical energy terms (non-covalent van der Waals interactions, electrostatic energy, hydrogen bond energy, solvation free energy) and statistical energy terms (main-chain dihedral angles, side-chain torsion angles). The most widely used energy functions are the Rosetta energy function (mainly determined by physical energy terms) (Leman et al., 2020) and the energy functions of ABACUS (a backbone-based amino acid usage survey) and SCUBA (side chain-unknown backbone arrangement), which are mainly determined by statistical energy terms (Xiong et al., 2014; Huang et al., 2022).

In protein design with a fixed backbone, the lengths and angles of covalent bonds are usually held constant, and the main interactions to be considered are non-covalent. In the Rosetta energy function, the Lennard-Jones potential is used to calculate the van der Waals interaction energy. The electrostatic energies are calculated using the partial atomic charges of the CHARMM molecular force field, adjusted by group optimization. The energies of hydrogen bonds are calculated using the electrostatic model together with a dedicated hydrogen bond model, and the hydrogen bonds are classified into four types: long-range main-chain hydrogen bonds, short-range main-chain hydrogen bonds, hydrogen bonds between main-chain and side-chain atoms, and hydrogen bonds between side chains, which are calculated separately. The Lazaridis-Karplus implicit Gaussian exclusion model, which can include isotropic and anisotropic solvation free energies, is used to describe solvation effects. The statistical energy terms represent energies obtained by transforming the probability distributions observed in structural databases. From the point of view of statistical thermodynamics, in the equilibrium state, the energies and probabilities of the different microstates of the system obey the Boltzmann distribution.
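Two of the ingredients named above have simple textbook forms, sketched below. This is not Rosetta's implementation, which modifies and reweights these terms; the 12-6 Lennard-Jones form, the reference-state normalization in the Boltzmann inversion, and all parameter names and values are assumptions chosen for illustration.

```python
import math


def lennard_jones(r: float, epsilon: float, r_min: float) -> float:
    """Textbook 12-6 Lennard-Jones energy.

    epsilon is the well depth and r_min the separation at the energy minimum,
    so the function returns -epsilon at r == r_min.
    """
    ratio = r_min / r
    return epsilon * (ratio ** 12 - 2.0 * ratio ** 6)


def statistical_energy(p_observed: float, p_reference: float, kT: float = 1.0) -> float:
    """Boltzmann inversion: convert an observed frequency into an energy.

    E = -kT * ln(p_observed / p_reference); features seen more often than in
    the reference state receive a negative (favourable) energy.
    """
    return -kT * math.log(p_observed / p_reference)


if __name__ == "__main__":
    print(lennard_jones(4.0, 0.2, 4.0))   # at the minimum
    print(lennard_jones(3.0, 0.2, 4.0))   # repulsive at short range
    print(statistical_energy(0.2, 0.1))   # twice as frequent as the reference
```

The second function is the sense in which a database-derived probability distribution is "transformed" into an energy: it simply inverts the Boltzmann relation between probability and energy.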

Figure 6 Principles and examples of computer-aided protein design.

An alternative view is purely statistical: if the distribution of amino acid sequences for a given backbone structure is written as a conditional probability, the problem to be solved by sequence design is to find the sequence with the largest conditional probability. Thus, ABACUS incorporates different structural features (the structural type of amino acid positions, backbone dihedral angles, solvent accessibility, relative positions, and statistical information between residues) to obtain side-chain rotamers (rotational isomers) and atomic packing energies. In addition, SCUBA utilizes neural networks to learn explicit energy terms from the backbone-centered structural variable energy landscape. Together, SCUBA and ABACUS provide comprehensive solutions for the design of artificial proteins.
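The statistical view can be written compactly. In the sketch below, $s = (s_1, \dots, s_L)$ denotes a sequence of length $L$ and $B$ the given backbone; these symbols are introduced here for illustration. The decomposition into single-residue and residue-pair terms on the second line is one common approximation and is not claimed to be the exact functional form used by ABACUS.

$$
s^{*} = \arg\max_{s} P(s \mid B) = \arg\min_{s} E(s \mid B), \qquad E(s \mid B) = -\ln P(s \mid B)
$$

$$
E(s \mid B) \approx \sum_{i=1}^{L} e_{i}(s_{i} \mid B) + \sum_{i<j} e_{ij}(s_{i}, s_{j} \mid B)
$$

Because the logarithm is monotonic, maximizing the conditional probability and minimizing the statistical energy select the same sequence, which is why the same search algorithms can serve both physical and statistical energy functions.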

Search algorithms are also critical for protein sequence design, because traversing all combinations in the vast sequence space and the even larger conformational space is infeasible. Rosetta is therefore a stochastic program built on the Monte Carlo method: it statistically analyzes the conformations generated by many simulations and thereby obtains numerical solutions. Rosetta first uses a random number generator to produce a random starting conformation. A random perturbation is then applied and the new conformation is scored; all conformations that score better are accepted, and those that score worse are accepted with a certain probability, until the best-scoring conformation found within a given number of cycles is selected. However, such iterative algorithms are often trapped in local minima. To approach the global energy minimum conformation, in addition to molecular dynamics simulations, the physical concept of momentum has been used. Imagine a small ball rolling down from a high point of the energy function. When its momentum is high enough, the ball is not trapped in a small pit but rushes on towards the final canyon. The iterations thus take into account not only the current energy but also previous energy changes.
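The accept/reject rule described above is the Metropolis criterion. The sketch below applies it to a made-up discrete landscape and is not Rosetta code: `toy_energy`, the temperature `kT`, the step count and the seed are all assumptions, and the momentum idea mentioned in the text is deliberately left out.

```python
import math
import random


def metropolis_search(energy, n_positions, n_states, steps=5000, kT=0.5, seed=0):
    """Minimal Metropolis Monte Carlo over discrete states (lower energy is better)."""
    rng = random.Random(seed)
    # Random starting conformation.
    current = [rng.randrange(n_states) for _ in range(n_positions)]
    e_current = energy(current)
    best, e_best = list(current), e_current
    for _ in range(steps):
        # Random perturbation: change the state at one position.
        trial = list(current)
        trial[rng.randrange(n_positions)] = rng.randrange(n_states)
        delta = energy(trial) - e_current
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / kT):
            current, e_current = trial, e_current + delta
            if e_current < e_best:
                best, e_best = list(current), e_current
    return best, e_best


def toy_energy(conf):
    """Hypothetical landscape: each position prefers state 2, and neighbouring
    positions pay a small penalty when their states differ."""
    single = sum((s - 2) ** 2 for s in conf)
    coupling = 0.5 * sum(1 for a, b in zip(conf, conf[1:]) if a != b)
    return single + coupling


if __name__ == "__main__":
    print(metropolis_search(toy_energy, n_positions=6, n_states=5))
```

Accepting some unfavourable moves is what lets the search climb out of shallow local minima; lowering `kT` over time (simulated annealing) is a common refinement that is not shown here.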

Several algorithms based on statistics and machine learning have been proposed. Inspired by the success of the trRosetta algorithm for structure prediction, Baker et al. further developed the Hallucination method for de novo protein design (Anishchenko et al., 2021b). First, a random sequence is fed into trRosetta to predict the inter-residue contact map. Then, the amino acid sequence space is sampled using Monte Carlo methods, and the Kullback-Leibler (KL) divergence of the predictions from a background distribution is calculated to obtain foldable sequences and predicted structures. The Hallucination method borrows from the DeepDream algorithm, which is based on convolutional neural networks and transforms the input toward the training data space, producing a dream-like illusion. Thus, the Hallucination method can be used to rapidly design protein sequences that are similar to the input sequence and conform to the sequence-structure relationships learned by trRosetta, yet differ considerably from natural sequences.
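The KL term in this procedure is the Kullback-Leibler divergence between two probability distributions. The sketch below shows the quantity itself on made-up four-bin distributions; the bins, the values and the interpretation of a "sharp" prediction as a confidently predicted contact are illustrative assumptions, not data from the cited work.

```python
import math


def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i * ln(p_i / q_i) for discrete distributions.

    Bins with p_i == 0 contribute nothing; q_i must be positive wherever p_i is.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical distributions over four distance bins for one residue pair.
background = [0.25, 0.25, 0.25, 0.25]  # featureless reference
diffuse = [0.30, 0.25, 0.25, 0.20]     # prediction close to the reference
sharp = [0.90, 0.05, 0.03, 0.02]       # confident prediction

if __name__ == "__main__":
    print(kl_divergence(diffuse, background))
    print(kl_divergence(sharp, background))
```

A sequence whose predicted maps are sharply peaked, and therefore far from a featureless background, scores a high divergence, which is the sense in which the divergence can serve as a proxy for foldability during sampling.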

# Protein design in synthetic biology

Sequences designed from protein structures alone cannot directly meet the needs of synthetic biology for proteins with desired functions. The computational design of proteins mainly includes the design of protein scaffolds, protein-macromolecule interactions and protein-small molecule interactions. These can be engineered to optimize the function of native proteins as components for synthetic biology and to create biosensors, biocatalysts, and vaccines with desired functions.

Protein scaffolds are designed to enhance the robustness of native proteins, stabilize vaccine epitopes, and modify protein stability under specific conditions. To develop inhibitors of the novel coronavirus, Baker et al. started from the structure of the complex between the viral S protein and human angiotensin-converting enzyme 2 (ACE2), taking the helical fragment of ACE2 that binds the receptor-binding domain of the S protein as a starting point and adding two extra helices to stabilize the structure. In addition, using protein docking and protein interface design on a miniprotein library, small proteins capable of inhibiting 2019-nCoV at picomolar concentrations were designed (Cao et al., 2020). Correia et al. developed the TopoBuilder system for the de novo design of proteins capable of stabilizing complex pre-defined structural motifs. For different epitopes, the authors enumerated suitable two-dimensional protein topologies and constructed tertiary structure models using ideal secondary structures. This method was used to design proteins that present three antigens simultaneously (Sesterhenn et al., 2020). Combining physical energy terms, statistical energy terms, and bioinformatics analysis, Wu et al. developed a greedy accumulated strategy for protein engineering (GRAPE) that fuses single-point prediction algorithms with a “greedy” algorithm and used it to computationally redesign a PET plastic hydrolase; accumulating beneficial single point mutations increased the melting temperature of the final variant by $31\,^{\circ}\mathrm{C}$ (Cui et al., 2021a).
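The idea of greedily accumulating single point mutations can be sketched generically as follows. This is not the published GRAPE workflow, which also involves experimental characterization between rounds; the mutation labels, the predicted effects and the non-additive penalty below are entirely hypothetical.

```python
def greedy_accumulate(candidates, score):
    """Add, one at a time, the mutation that most improves the score.

    Stops when no remaining candidate improves the current combination.
    """
    chosen, remaining = [], list(candidates)
    current = score(chosen)
    while remaining:
        gain, mutation = max((score(chosen + [m]) - current, m) for m in remaining)
        if gain <= 0:
            break
        chosen.append(mutation)
        remaining.remove(mutation)
        current += gain
    return chosen, current


# Hypothetical single-mutation effects on melting temperature (degrees C).
SINGLE_EFFECT = {"M1": 3.0, "M2": 2.0, "M3": 1.5, "M4": -1.0}
# Hypothetical non-additive (epistatic) correction for one incompatible pair.
PAIR_EFFECT = {frozenset({"M1", "M3"}): -4.0}


def toy_score(mutations):
    """Predicted stabilization of a combination: additive part plus pair corrections."""
    total = sum(SINGLE_EFFECT[m] for m in mutations)
    for i, a in enumerate(mutations):
        for b in mutations[i + 1:]:
            total += PAIR_EFFECT.get(frozenset({a, b}), 0.0)
    return total


if __name__ == "__main__":
    print(greedy_accumulate(SINGLE_EFFECT, toy_score))
```

In this toy case the greedy path skips M3 because it conflicts with the already accepted M1, which illustrates why mutations are re-evaluated in the context of the current variant rather than simply summed.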

Designed protein-macromolecule interactions can be used for signal transduction and regulation in synthetic cells. Biosensors computationally designed by Baker et al. can take advantage of naturally occurring interacting proteins in signaling pathways. In the absence of the detection target, the latch domain of the sensor’s lucCage protein binds to the cage domain. In contrast, in the presence of the detection target, the terminal region of the lucCage protein binds to the target, and the lucCage protein opens and binds to the sensor’s lucKey protein, reconstituting luciferase activity and producing luminescence (Quijano-Rubio et al., 2021). The same group also designed logic gates that regulate protein binding by constructing de novo helical backbone scaffolds and hydrogen-bonding networks to optimize sequences. Multiple heterodimeric protein pairs with specific binding were designed, using monomers or linked monomers as inputs. The gating unit accepts different inputs through a designed hydrogen bond network that encodes specific binding (Chen et al., 2020d).

Designing interactions between proteins and small molecules can yield new enzymatic catalytic components, transcription factors and small molecule sensors. By designing enzymes with substrate selectivity, new biochemical reactions can be generated for direct use in bioindustrial catalysis as well as in new pathway design. In this context, Kortemme et al. screened four residue-binding modules for the binding of farnesyl pyrophosphate (FPP) against the structures of native proteins. They then designed biosensors that respond to FPP by interfacing these modules with various scaffolds, followed by further optimization (Glasgow et al., 2019). Ranganathan et al. used direct coupling analysis to extract the statistical constraints on sequence-structure-function space that are implicit in multiple sequence alignments (MSAs), and designed a chorismate mutase with activity comparable to that of the native enzyme (Russ et al., 2020). Wu et al. used a fixed backbone design combined with multiple parallel short molecular dynamics simulations to compensate for uneven sampling of the fixed backbone and side chains. In this way, an aspartate lyase that catalyzes hydroamination to produce unnatural amino acids was obtained (Cui et al., 2021b).

# Short summary

Over the past decade, impressive progress has been made in computationally creating functional proteins with tailored activities and specificities. The astonishing rate of algorithm development continues to improve researchers’ ability to manipulate protein structure and function. Looking ahead, a number of key trends are expected to accelerate the discovery, design and application of functional proteins. Advances in AI-based computational methods for predicting protein structure have raised the confidence of the biomolecular community, and subsequent function design that combines model-based and data-based methods may meet the demands of target reactions. As protein structure databases and standardized experimental data continue to grow, more advanced computational methods will create further research opportunities for interpreting the underlying catalytic mechanisms, eventually leading to a clearer understanding of the structure-function relationships of functional proteins. Based on the considerable success of computational protein design, the future is expected to witness the generation of more efficient, customized proteins for synthetic biology.

# Cell and gene circuit engineering

Whether using traditional bioengineering or current synthetic biology, designing cells for beneficial functions has presented a considerable challenge. In the era of synthetic biology, a hallmark of engineered cells is the emphasis on designing and recreating unnatural cellular behaviors at the systems and quantitative levels, which often requires more than one component to form interactive networks with specific topology and function. These designable biological networks, called gene circuits, consist of macromolecules such as proteins, DNA, RNA, or any genetic part within each cell. It is worth noting that such a network can logically go beyond the single-cell level: an interactive multicellular system formed through direct or indirect intercellular contact or communication is called a cell circuit. Engineering cell and gene circuits faces two fundamental challenges: (i) the availability of genetic components that are orthogonal and modular, and (ii) knowledge of modular circuit design principles that provide theoretical guidance for predictable circuit behavior. Furthermore, the design process is highly dependent on sophisticated computational modeling capabilities to analyze and predict circuit behavior in larger circuits and parameter spaces. Therefore, computer-aided design of synthetic cells will further facilitate automation and artificial intelligence in future cell engineering, which we will discuss next.

# Synthetic gene circuits and quantitative cellular behavior

Gene circuits conceptually originate from electronic circuits, but differ substantially from them because of the enormous complexity arising from the biochemical or biophysical interactions of a large number of components and the nonlinear connections between these components. Similar to gene circuits in natural cells, synthetic gene circuits comprise two basic types: (i) protein-based signaling circuits (or protein circuits) (Chen and Elowitz, 2021; Gao et al., 2018) and (ii) transcriptional gene regulatory circuits (or genetic circuits) (Gardner et al., 2000; Hasty et al., 2002). However, the two types are not strictly separate and work in coordination to control cellular function. Specifically, protein circuits process environmental signals on faster time scales (from seconds to minutes) through membrane receptor proteins (or sensors), and then transmit the signals to downstream gene regulatory circuits, which operate on longer time scales (from minutes to hours) (Kiel et al., 2010).

In the past few decades, extensive research on synthetic circuits has resulted in the successful construction of genetic circuits with integrated functions such as logic gates, bandpass filtering, oscillation, adaptation, and polarization (Anderson et al., 2007; Basu et al., 2005; Chau et al., 2012; Elowitz and Leibler, 2000; Shen-Orr et al., 2002). While many of these studies are still at the proof-of-concept stage, the increasing complexity and scale of these synthetic circuits have considerably advanced our ability to design and build complex genetic circuits with increased efficiency and accuracy. A major development in the direction of synthetic circuits is to take full advantage of computer-aided design and automation. To do this, extensive research work is necessary, including well-characterized and standardized genetic components, experimentally validated algorithms and software for building and simulating circuits in silico, and custom-developed automated experimental equipment (Chen et al., 2020c; Jones et al., 2022; Lim et al., 2013; Ma et al., 2009; Nielsen et al., 2016). Notably, circuit engineering in mammalian cells is underdeveloped because of the difficulties of genetic manipulation and the limitations of various protein or nucleic acid tools. For example, the number of promoters used in mammalian circuit engineering is often in the single digits. With the existing promoter toolboxes, the transcriptional strength of target genes is difficult to tune continuously, which is a major obstacle to the experimental verification of circuit parameter conditions in circuit design. In addition, the dynamic range of many inducible mammalian promoters is extremely low, which is not conducive to the construction of circuits that require low basal but highly inducible gene transcription. As in bacterial cells, it is difficult to predict whether promoter strength and induction will be consistent across different mammalian cell lines.

Protein engineering is more challenging than promoter engineering. Protein function is determined by a three-dimensional structure built from 20 amino acids, which is considerably more complex than a one-dimensional sequence built from four nucleotides. As for sensors, receptor engineering has emerged as an important area for establishing orthogonal cell-to-cell signaling, either by sensing a given extracellular signal, such as synthetic cytokines and growth factors, or by redirecting cells to specific disease signals (Engelowski et al., 2018; Schwarz et al., 2017; Sockolosky et al., 2018; Williams et al., 2020). T cells engineered with chimeric antigen receptors (CARs) are an important example in anticancer therapy (Jackson et al., 2016). Many types of proteins in mammalian cells establish signaling pathways at different levels, including protein kinases/phosphatases, proteases, adaptor/scaffold proteins, transcription factors and epigenetic regulatory proteins. Several protease tools adopted from viruses have been repurposed to control many levels of cellular function (Cella et al., 2018; Fernandez-Rodriguez and Voigt, 2016; Gray et al., 2010). Recent studies have also shown that protein circuits with complex logical functions can be constructed mainly from these engineered proteases (Chen and Elowitz, 2021; Gao et al., 2018). Finally, de novo protein design is becoming increasingly powerful, especially as a tool for engineering programmable protein-protein interactions (Chen et al., 2019b; Silva et al., 2019). Notably, recent developments in AI algorithms should play a vital role in future protein engineering (Baek et al., 2021). Undoubtedly, the development of protein tools remains a difficult but essential task for mammalian synthetic biology.

Another challenge for mammalian synthetic biology is the complex behavior controlled by the “black box” of natural evolution. These complex behaviors exhibit quantitative properties, the rationale for which remains unclear. These principles govern nearly all important cellular processes, including the cell cycle, control of size and number, robustness and heterogeneity, homeostasis and growth, cell differentiation and death, and more. To date, few synthetic biology studies have been able to cover these enigmatic questions of life. Encouragingly, the bottom-up approach of synthetic biology has shown new avenues to understand the construction of complex biological systems in far greater detail than previously thought. A striking example is the oscillatory circuit that controls many fundamental biological processes (e.g., cell cycle, circadian rhythms, signaling responses, somitogenesis) (Elowitz and Leibler, 2000; Stricker et al., 2008; Zhang et al., 2017c). As a next step, this oscillatory circuit is expected to act as a “central processing unit,” intelligently controlling the function of engineered cells (Figure 7).
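As a concrete example of a synthetic oscillatory circuit, the sketch below integrates the dimensionless three-gene ring of repressors (the repressilator model of Elowitz and Leibler, 2000) with a simple forward Euler scheme. The parameter values, initial conditions and step size are illustrative assumptions rather than fitted values, and the model describes a bacterial circuit, not a mammalian one.

```python
def simulate_repressilator(alpha=216.0, alpha0=0.216, beta=5.0, n=2.0,
                           t_end=100.0, dt=0.01):
    """Forward Euler integration of a three-gene ring of repressors.

    For gene i, repressed by the protein of gene i-1:
        dm_i/dt = -m_i + alpha / (1 + p_{i-1}^n) + alpha0
        dp_i/dt = -beta * (p_i - m_i)
    Returns the time points and the level of the first protein.
    """
    m = [1.0, 0.0, 0.0]  # mRNA levels (asymmetric start, assumed)
    p = [2.0, 1.0, 3.0]  # protein levels (asymmetric start, assumed)
    times, trace = [], []
    for k in range(round(t_end / dt)):
        dm = [-m[i] + alpha / (1.0 + p[(i - 1) % 3] ** n) + alpha0
              for i in range(3)]
        dp = [-beta * (p[i] - m[i]) for i in range(3)]
        m = [m[i] + dt * dm[i] for i in range(3)]
        p = [p[i] + dt * dp[i] for i in range(3)]
        times.append((k + 1) * dt)
        trace.append(p[0])
    return times, trace


def count_peaks(values):
    """Count local maxima in a sampled trajectory."""
    return sum(1 for a, b, c in zip(values, values[1:], values[2:]) if a < b >= c)


if __name__ == "__main__":
    times, trace = simulate_repressilator()
    late = trace[len(trace) // 2:]  # discard the initial transient
    print("peaks in second half:", count_peaks(late))
    print("protein range:", round(min(late), 2), "to", round(max(late), 2))
```

Whether the ring oscillates or settles to a steady state depends on the promoter strength `alpha`, the cooperativity `n` and the ratio of protein to mRNA decay `beta`, which is the kind of quantitative design rule that circuit engineering aims to establish.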

We envision that mammalian cell engineering, along with new tools and techniques, will be one of the next key steps in synthetic biology.

# Cell-cell communication-based cell circuits

A newly emerging field for mammalian synthetic biology is to engineer multicellular systems. This will form interactions based on cell-to-cell communication with specific circuit structures and functions. For bacterial cells, a clear direction is to reconstruct microbial communities ranging from diverse natural environments and disease-associated guts to agriculturally important soils (Bano et al., 2021; Khan et al., 2021). As for mammalian cells, they naturally exist in the context of multicellular interactions, even in well-structured organs. Therefore, cell engineering at the multicellular level represents another major avenue for synthetic biology (Figure 7).

Communication between cells in natural systems occurs in three ways: (i) proteins or small molecules produced and secreted by sender cells diffuse and activate surface receptor proteins or intracellular sensors in recipient cells (Altan-Bonnet and Mukherjee, 2019); (ii) signaling molecules (usually small second messenger molecules) are transported via channel proteins to neighboring recipient cells in direct contact (Hervé and Derangeon, 2013); (iii) membrane ligands on sender cells interact directly with membrane receptors on recipient cells (Kotsias et al., 2019). In most cases, a signal from the sender triggers a transcriptional event in the receiving cell. Regardless of the route, these cell-level circuits can lead to highly complex population behaviors that cannot arise at the single-cell level. Spatial organization patterns in bacterial and mammalian cells can be formed by typical bandpass circuits or logic gates. Recently, synthetic quorum sensing circuits have been successfully deployed to control cell population size in bacterial and mammalian cells (Ma et al., 2022b).
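The bandpass behavior mentioned above can be illustrated with a minimal sketch. It assumes that the receiver cell's output is the product of an activating Hill term and a repressing Hill term acting on the same diffusible signal; the function names, thresholds (`k_low`, `k_high`) and Hill coefficient are hypothetical and are not taken from any of the cited studies.

```python
def hill_activation(signal, k, n):
    """Fraction of maximal output switched ON by the signal."""
    return signal ** n / (k ** n + signal ** n)


def hill_repression(signal, k, n):
    """Fraction of maximal output remaining after repression by the signal."""
    return k ** n / (k ** n + signal ** n)


def band_pass(signal, k_low=1.0, k_high=100.0, n=2.0):
    """Output is high only when k_low << signal << k_high (hypothetical thresholds)."""
    return hill_activation(signal, k_low, n) * hill_repression(signal, k_high, n)


# Signal concentration falls with distance from the sender cells, so only
# receivers at an intermediate distance respond strongly.
for concentration in (0.01, 10.0, 10000.0):
    print(f"signal={concentration:>8}: output={band_pass(concentration):.4f}")
```

Under these assumptions, receivers exposed to very low or very high signal stay off, which is one simple way a ring or stripe of responding cells could form around a sender population.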

However, this cellular circuit remains at an early developmental stage. Two major challenges need to be overcome in the future. First, too few signaling molecules are used in the current studies. In contrast, hundreds of cytokines and growth factors exist in humans that are involved in multiple regulations in a large number of cell types. Therefore, it is attractive to engineer synthetic cytokines or other factors to construct future cellular circuits. Second, orthogonal pairs of receptors and ligands for direct contact communication are difficult to design. The most successful example is synthetic Notch (synNotch) signaling, which enables the binding of any ligand through the extracellular recognition domain and triggers programmable downstream gene transcription (Morsut et al., 2016). Several demonstrative studies have applied the synNotch system to spatially organized multicellular structures (Toda et al., 2020).

Figure 7 Synthetic cell and gene circuits for therapeutics. At the single-cell level, engineering efforts will focus on three major aspects: (i) sensors that recognize disease or environmental signals and convert them into biochemical reactions; (ii) gene circuits that function as the “central processor” to process various input signals; and (iii) quantitatively defined output functions that control cell function. At the multicellular level, synthetic cytokine secretion or direct ligand-receptor interactions enable various cell-cell communications to form topologically organized cell circuits or spatially organized organ-like patterns. These single or multiple engineered living cells may act as a powerful drug platform to cure complex diseases, such as cancer and metabolic diseases.

In addition to technical challenges similar to those of gene circuits, the principles by which structure determines circuit function are difficult to understand, especially given the increasing complexity of the spatiotemporal regulation of cell populations. For example, how can circuits with precisely controlled bistability or multistability be designed to drive synthetic cell differentiation? Which circuit topologies enable efficient signal amplification with high fidelity and robustness in the treatment of disease? How do cellular circuits control the size and type of cell populations in the steady state? We envision that these issues require quantitative and comprehensive consideration of fundamental design principles at the level of cellular circuitry.
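To make the notion of bistability concrete, the following minimal sketch simulates a symmetric mutual-repression motif (two genes that repress each other) with simple forward-Euler integration. The model form, the parameter values (`alpha`, `n`) and the function name are assumptions chosen for illustration, not a circuit described in this review.

```python
def simulate_toggle(u0, v0, alpha=10.0, n=2.0, dt=0.01, steps=5000):
    """Integrate du/dt = alpha/(1+v^n) - u and dv/dt = alpha/(1+u^n) - v."""
    u, v = u0, v0
    for _ in range(steps):
        du = alpha / (1.0 + v ** n) - u
        dv = alpha / (1.0 + u ** n) - v
        u, v = u + dt * du, v + dt * dv
    return u, v


# Two slightly different starting points end in two different stable states,
# which is the defining property of a bistable circuit.
state_a = simulate_toggle(2.0, 1.0)  # u starts ahead, so u wins
state_b = simulate_toggle(1.0, 2.0)  # v starts ahead, so v wins
print("start u>v ->", state_a)
print("start v>u ->", state_b)
```

In this toy model the same circuit "remembers" which gene was initially ahead, which is the kind of property a synthetic differentiation circuit would need to control quantitatively.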

# Engineered living cell therapeutics

Another major trend in cellular and genetic circuit engineering is the extension of proofs of concept, currently in “toy” systems, to disease-related clinical applications. Compared with traditional molecular drug forms, live cell drugs have significant advantages as integrated platforms for deploying payload drugs or performing complex functions (e.g., cell lysis, wound healing) that can be intelligently controlled by integrated gene or cellular circuits. In this way, cellular drugs can show significantly improved disease-fighting efficacy with minimal side effects. For example, reconfigured cytokine signaling pathways can act as cytokine switches to sense and eliminate pro-tumor cytokines and create a pro-immune cytokine microenvironment (Zhang et al., 2021b). In CAR-T cells, protein circuits with logic gates or hypersensitivity functions are deployed to generate more specific recognition of tumor antigens (Hernandez-Lopez et al., 2021; Williams et al., 2020). CAR-T immunotherapy has demonstrated the power of living cells as a drug form, i.e., cell therapy. In another case, optogenetically controlled gene circuits successfully achieved intelligent control of steady-state blood glucose levels in animals through a closed-loop control strategy (Yu et al., 2022). These striking examples show that synthetic live-cell medicines have ushered in a new revolution in the treatment of intractable diseases.

Owing to their importance to human health, the current success of cell therapy is mainly based on the use of immune cells, especially T cells, as the therapeutic chassis cells. More recently, other immune cells, such as natural killer cells and macrophages, have shown considerable potential not only in cancer treatment but also in treating infectious diseases (Fisicaro and Boni, 2022; Klichinsky et al., 2020). While many cells being engineered as drugs are still at the proof-of-concept stage, we envision that once we can design more accurate and functional gene and cellular circuits, engineering at the primary cell level will be easier. Some recent studies have shown substantial improvement at the clinical level. Notably, multicellular systems will provide additional advantages for cell therapy, as they can significantly reduce engineering costs by distributing functional circuit modules into different cell subtypes. A group of cells with well-programmed interacting circuits will work as a whole to enable more efficient, safer and less expensive therapeutic functions.

# Cell-free synthetic biology

A cell-free synthetic system represents another technical route of synthetic biology in parallel with cell engineering. The target of cell-free synthetic biology is an open system without cell structure that is focused on the desired metabolic network and uses the corresponding active components, such as enzymes and coenzymes, to carry out complicated biochemical reactions. Cell-free synthetic biology originated with Eduard Buchner’s paradigm-shifting discovery of “cell-free ethanol fermentation by non-living yeast lysate” (Nobel Prize in Chemistry, 1907). Another milestone was the discovery of the genetic code and its function in protein synthesis by Nirenberg and Matthaei (Nobel Prize in Physiology or Medicine, 1968). For the development of cell-free synthetic biology, two types of cell-free systems have been proposed: a cell-extract-based system and a purified enzyme-based system (Figure 8). The cell-extract-based system is typically used for cell-free protein synthesis (CFPS), to realize the fundamental processes of the central dogma (DNA to RNA, RNA to protein) outside of cells. The purified enzyme-based system consists of numerous purified or partially purified enzymes to implement complicated cascade enzyme reactions, mainly for the biomanufacturing of functional biomolecules and biochemicals. Compared to systems that operate inside cells, a cell-free synthetic biosystem features many advantages such as high product yield, fast reaction rate, high engineering flexibility, an accelerated design-build-test-learn cycle, high tolerance to toxic environments, and easy scale-up. These features make cell-free synthetic biology an important enabling technology for many applications.

# Cell-free biosystem for protein synthesis and applications

Cell-free biological systems for protein synthesis consist of crude cell extracts, DNA templates, ATP regeneration systems, amino acids, nucleotides, cofactors, and buffers (Silverman et al., 2020). Cell extracts from E. coli, S. cerevisiae, wheat germ, rabbit reticulocytes, insect cells, and Chinese hamster ovary cells can be selected according to requirements. The system can be used for the synthesis of toxic or membrane proteins, prototyping of biological functions, protein modification, and biosensors.

It is difficult to overexpress toxic proteins at high yields in vivo because toxic proteins may interfere with cellular metabolic pathways, and membrane proteins are often expressed in the form of inclusion bodies. Cell-free biological systems for protein synthesis can be used to synthesize toxic proteins such as restriction endonucleases (Goodsell, 2002), cytolethal distending toxins (Ceelen et al., 2006), and human microtubule-binding proteins (Betton, 2003) because the in vitro system is tolerant of toxic environments. Membrane proteins can be expressed in cell-free biological systems by adding surfactants, liposomes, or nanodiscs (Junge et al., 2011; Matthies et al., 2011; Panganiban et al., 2018; Shelby et al., 2020). Many membrane proteins, such as G protein-coupled receptors (Kaiser et al., 2008; Wang et al., 2011), tetracycline pumps (Wuu and Swartz, 2008), ATP synthases (Matthies et al., 2011), and Hepatitis C virus membrane proteins (Fogeron et al., 2015), have been produced by cell-extract-based cell-free biological systems.

Figure 8 Applications of cell-free synthetic biology systems, which comprise cell-extract-based systems for cell-free protein synthesis (CFPS) and purified enzyme-based systems.

For the prototyping of biological functions, such as genetic components, genetic circuits, and metabolic pathways, cell-free biological systems provide an important in vitro platform before implementation in cells (Silverman et al., 2020). For a single genetic component (promoter, ribosome binding site, or terminator), a library of variants of linear expression templates can be generated by PCR mutagenesis, and then, with the help of microfluidics, cell extracts containing single gene variants can be encapsulated in picoliter droplets (Fallah-Araghi et al., 2012; Zhang et al., 2019b).

In addition to probing individual genetic components, cell-free biological systems can be used to determine how these components work together in synthetic genetic control networks or “circuits” (Siegal-Gaskins et al., 2014). Numerous cell-free genetic circuits have been assembled and prototyped, including cascades driven by the sequential expression of orthogonal polymerases or sigma factors (Garamella et al., 2016; Noireaux et al., 2003; Shin and Noireaux, 2012), as well as feedforward loops and negative autoregulators (Hori et al., 2017; Hu et al., 2018; Takahashi et al., 2015a; Takahashi et al., 2015b). With the goal of engineering cellular metabolism, cell-free biological systems offer enormous possibilities for elucidating metabolic pathways. A considerable advantage of using cell extracts for protein synthesis is that the expression of enzyme-encoding DNA templates can lead to the self-assembly of pathways in a single reaction. To date, several reports have confirmed this approach. For example, two pathways containing three and six enzymes were reconstituted from linear DNA expression templates in cell-free biological systems to produce N-acetylglucosamine and peptidoglycan precursors, respectively (Sheng et al., 2014; Zhou et al., 2010). A five-enzyme pathway that converts tryptophan to violacein has also been demonstrated (Garamella et al., 2016; Pardee et al., 2016b). Furthermore, a combinatorial strategy was recently used to construct a 17-step enzymatic pathway for n-butanol (Karim and Jewett, 2016). Combined with data-driven design, cell-free biosystems can be used to rapidly evaluate hundreds of pathway combinations in E. coli extracts to enhance butanol and 3-hydroxybutyrate production in Gram-positive anaerobic bacteria, demonstrating cell-free and in vivo pathway performance (Karim et al., 2020).

For a broad range of protein modifications including glycosylation (Jaroentomeechai et al., 2018), phosphorylation (Oza et al., 2015), PEGylation (Shozen et al., 2009) and insertion of unnatural amino acids (uAAs) (Perez et al., 2016), cell-free biological systems offer robust control and versatility, bypassing limitations related to cell-based toxicity and permeability (Jaroentomeechai et al., 2018). Studying the modification of intracellular proteins is often challenging because it is difficult to obtain proteins with homogenous modifications within the cell. Cell-free biological systems have been shown to have highly homogeneous protein modification functions. A classic example is the glycosylation at specific sites on proteins. Many therapeutic proteins are highly dependent on efficient and homogeneous glycosylation (Li and d’Anjou, 2009).

Cell-free biological systems using E. coli cell extracts represent an ideal test bed for glycosylation, as E. coli does not possess a native glycosylation function. Thus, the ability to accelerate glycosylation screening in prokaryotic extracts using cell-free techniques could have a transformative impact on the design of glycosylated therapeutics and vaccines (Valderrama-Rincon et al., 2012; Wacker et al., 2002). Open cell-free biological systems are particularly suitable for the use of orthogonal translation systems consisting of non-native tRNA and aminoacyl-tRNA synthetases, which add uAAs at the UAG amber stop codon of mRNAs (Albayrak and Swartz, 2013; Hong et al., 2014; Martin et al., 2018). The incorporation of uAAs into proteins greatly expands the possibilities for the use of modified proteins as therapeutics. Once uAAs are incorporated at precise locations on target proteins, they act as bioorthogonal chemical handles that react with functionalized small molecules to generate therapeutic conjugates, such as antibody-drug conjugates (ADCs) (Agarwal and Bertozzi, 2015; Ratner, 2014; Yin et al., 2012).

As biosensors, cell-free biosystems provide several practical advantages over whole-cell biosensors. Cell wall-impermeable or cytotoxic analytes can be detected in cell-free biological systems, and cell-free sensors are more reliable because whole-cell sensors are prone to mutation and plasmid loss (Silverman et al., 2020). Cell-free biological systems for protein synthesis can host gene circuit-based sensors that detect nucleic acids and small molecules with extreme sensitivity and specificity (Tinafar et al., 2019). To detect nucleic acids, primarily from disease-causing viruses and bacteria, RNA extracted from pathogen-containing samples is added to a cell-free biological system programmed to produce reporter proteins only in the presence of target nucleic acid sequences through a designed toehold switch riboregulator. This approach could replace reverse transcription PCR (RT-PCR) for more rapid diagnostic testing. Using this strategy, many viruses can be detected quickly, including Ebola (Pardee et al., 2014), Zika (Pardee et al., 2016a), Norwalk virus (Ma et al., 2018a), Cucumber Mosaic virus (Verosloff et al., 2019), and SARS-CoV-2 (Hunt et al., 2022; Ma et al., 2022a), as well as certain gut-colonizing bacteria (Takahashi et al., 2018). This cell-free system for virus detection can be freeze-dried onto paper to improve its portability and stability (Hunt et al., 2022), providing an alternative to meet the urgent diagnostic requirements of COVID-19 and future viral pandemics. Progress in cell-free detection of small molecules (e.g., environmental toxins or cellular metabolites) has been slower than in the detection of nucleic acids, because there are no analogs of synthetic riboregulators for the construction of sensors for arbitrary small molecules. 
Most reported cell-free small-molecule sensors detect environmental toxins, such as mercury (Salehi et al., 2017) and fluoride (Thavarajah et al., 2019), drugs, such as gamma-hydroxybutyrate (Gräwe et al., 2019), or bacterial quorum-sensing signals such as N-butyl-L-homoserine lactone (Wen et al., 2017). Studies have shown that cell-free sensors can be freeze-dried and remain active for months even when dried on paper substrates (Pardee et al., 2014), offering an alternative means to address the unmet need for easy distribution and low-cost sensing by cell-free systems (Tinafar et al., 2019).

# Cell-free biosystems based on purified enzymes for biomanufacturing

Cell-free biological systems based on purified enzymes are biocatalytic systems that comprise multiple purified or partially purified enzymes for converting certain substrates to desired compounds through engineered reaction pathways (Wei et al., 2020). Here, we focus on cell-free biological systems for biomanufacturing using sustainable substrates such as starch, glucose, cellulose, and carbon dioxide.

Myo-inositol (hereafter referred to as inositol) and hydrogen are two typical products produced directly from starch by cell-free biological systems. Inositol is widely used in the cosmetic, pharmaceutical and food industries. It is conventionally obtained from phytic acid by acid hydrolysis, a method that uses expensive raw materials and produces serious phosphorus pollution. Zhang et al. and Atomi et al. both constructed a cell-free biological system containing four enzymatic reactions that can convert starch to inositol with a theoretical product yield of 100% (Fujisawa et al., 2017; You et al., 2017). All enzymes in this biological system are thermophilic, so they can be easily purified by heat treatment, and the high reaction temperature helps to avoid microbial contamination. Compared with traditional chemical methods, this new method of producing inositol from starch has great potential for green inositol production. Currently, Bohaoda Biological (China) is building an industrial facility that is scaling up this novel method to produce inositol (You et al., 2017). Many other value-added chemicals, such as glucosamine (Meng et al., 2020), allulose (Li et al., 2021), and (-)-vibo quercitol (Bai et al., 2019), can be synthesized by similar enzymatic treatment of starch. Hydrogen is regarded as a transportation fuel of the future, and improved energy efficiency through fuel cells has the potential to reduce greenhouse gas emissions and provide end users with zero pollutants (Armaroli and Balzani, 2010). Natural cellular metabolic pathways can only produce up to 4 moles of $\mathrm{H_2}$ per mole of glucose (Chou et al., 2008; Veit et al., 2008; Wu et al., 2017). Zhang and colleagues conducted a proof-of-concept experiment that produced 12 moles of $\mathrm{H_2}$ per mole of glucose through a cell-free biological system containing 13 purified enzymes. 
The biological system converts starch into $\mathrm{H_2}$ and $\mathrm{CO_2}$ almost quantitatively with the following overall stoichiometry: $\mathrm{C_6H_{10}O_5 + 7\,H_2O = 12\,H_2 + 6\,CO_2}$. This biological system can be slightly modified to develop sugar biobatteries with an energy density that is an order of magnitude higher than that of lithium-ion batteries (Zhu et al., 2014). This cell-free biological system for hydrogen production lays the foundation for future sugar-hydrogen vehicles.
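The overall stoichiometry quoted above can be checked with a short element balance. The sketch below only does bookkeeping on the formula given in the text (one anhydroglucose unit of starch plus seven water molecules); the helper name `atoms` and the variable names are illustrative.

```python
from collections import Counter


def atoms(side):
    """Sum element counts over (coefficient, composition) pairs."""
    total = Counter()
    for coefficient, composition in side:
        for element, count in composition.items():
            total[element] += coefficient * count
    return total


ANHYDROGLUCOSE = {"C": 6, "H": 10, "O": 5}  # one glucose unit of starch
WATER = {"H": 2, "O": 1}
HYDROGEN = {"H": 2}
CARBON_DIOXIDE = {"C": 1, "O": 2}

reactants = atoms([(1, ANHYDROGLUCOSE), (7, WATER)])
products = atoms([(12, HYDROGEN), (6, CARBON_DIOXIDE)])
balanced = reactants == products

# Ratio of the cell-free yield (12 H2 per glucose unit) to the
# natural-pathway limit of 4 H2 per glucose quoted in the text.
fold_over_natural = 12 / 4

print(dict(reactants), dict(products), balanced, fold_over_natural)
```

Both sides contain 6 C, 24 H and 12 O, so the equation is balanced, and the reported yield corresponds to a threefold gain over the natural limit stated above.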

Here we describe the production of ethanol, isobutanol, and prenylated natural compounds from glucose in cell-free biological systems. Ethanol is the most important gasoline additive, and isobutanol is a four-carbon liquid alcohol compatible with current internal combustion engines and transportation pipelines (Atsumi et al., 2008; Li et al., 2011a; Welch and Scopes, 1985). Sieber and colleagues designed a cell-free biological system that can produce ethanol and isobutanol from glucose via pyruvate (Guterl et al., 2012). Compared with the 10 enzymes used in the natural glycolytic pathway, this biological system uses only four enzymes to convert glucose to pyruvate, which can then be converted to ethanol and isobutanol. This cell-free biosystem produces large amounts of isobutanol even in the presence of 4% (v/v) isobutanol, whereas even low concentrations (e.g., 1%-2% v/v) prevent microbial production of isobutanol (Atsumi et al., 2008). This progress demonstrates that cell-free biological systems are highly tolerant of toxic environments. To produce prenylated natural compounds, Bowie and colleagues designed a cell-free biological system consisting of more than 20 enzymes (Valliere et al., 2019). These enzymes can be divided into four main reaction modules: a glycolysis module for the production of pyruvate from glucose; an acetyl-CoA module for the production of acetyl-CoA from pyruvate; a mevalonate module for the production of geranyl pyrophosphate (GPP) from acetyl-CoA; and a prenylation module for the production of the desired prenylated product. The prenylation module can also be modulated by using alternative enzymes and substrates to produce various prenylated compounds, such as isoprenoids and cannabinoids. After system optimization, this cell-free biosystem produced a cannabinoid titer of $1.25\ \mathrm{g\ L^{-1}}$, which was at least two orders of magnitude higher than published results using live cells (Luo et al., 2019).

When cellulose is used as a substrate, a typical example is the production of starch from cellulose by cell-free systems. This biological system contains endoglucanase, cellobiohydrolase, cellobiose phosphorylase, and alpha-glucan phosphorylase for a one-pot enzymatic conversion of pretreated biomass to starch. Up to 30% of the anhydroglucose units in cellulose are converted to starch (You et al., 2013). Because the annual supply of cellulose feedstock is ~40 times greater in mass than the starch used for food and feed, this cost-effective transformation of non-food cellulose to starch can reshape the bioeconomy and solve the triple dilemma of food, biofuels, and the environment (Zhang, 2013).

Researchers from the Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences constructed an artificial starch anabolic pathway (ASAP) to utilize $\mathrm{CO_2}$ and hydrogen in the synthesis of starch (Cai et al., 2021). ASAP is a chemical-biological hybrid system, which includes a chemical system and a cell-free biological system. The chemical system converts $\mathrm{CO_2}$ and hydrogen to methanol (Wang et al., 2017). The cell-free biological system contains 11 core enzymes and three auxiliary enzymes to convert methanol to starch. After optimization, including modular assembly and substitution and protein engineering of three rate-limiting enzymes, this chemical-biological hybrid system converts carbon dioxide to starch at a rate of $22\ \mathrm{nmol\ min^{-1}\ mg^{-1}}$ of total catalyst, which is 8.5 times higher than the rate of starch synthesis in maize. This approach offers a potential strategy to feed the world and, more importantly, provides a potential solution to the problem of food sources when exploring other planets (Wu and Bornscheuer, 2022).

In conclusion, cell-free synthetic biology offers a game-changing tool to circumvent the limitations inherent in living cells. With a plethora of research across different fields, including gene expression, genetic networks, protein modification, on-demand biosensing, and biomanufacturing using cell-free biological systems, the prospects of cell-free synthetic biology are evident. However, to realize the true potential of cell-free biological systems, several challenges need to be overcome, including the limited longevity of such biological systems and the regeneration of unstable natural cofactors. Once these shortcomings are addressed, cell-free synthetic biology will bring biology and biotechnology into a new era. It is exciting to see cell-free synthetic biology combined with other cutting-edge disciplines such as material science, electronics, computer science, and artificial intelligence.

# AI and synthetic biology

Obtaining ideal biological components is the basis for building synthetic biological systems. With the recently witnessed increase in computing power, artificial intelligence (AI) has been shown to excel in a variety of challenging tasks such as image generation (Bau et al., 2020), natural language processing (LeCun et al., 2015), and synthetic biology applications (Wang et al., 2020). This section will briefly describe AI-dependent approaches that have shown increasing success in mining complex biological properties and designing optimized synthetic biological components (bioparts), especially gene regulatory sequences. Excellent reviews are available for in-depth discussion on specific applications such as metabolic engineering (Lawson et al., 2021), gene therapy (Huang et al., 2021) and drug discovery (Sanchez-Lengeling and Aspuru-Guzik, 2018).

# A general framework of AI guided bioparts inverse design

Bioparts design is an important yet complex task, which aims to reverse engineer new biomolecules based on specific target properties. It is experimentally difficult to exhaustively search the potential sequence space to discover new bioparts (e.g., a 100 bp DNA sequence spans a potential sequence space of $4^{100}$ variants). Therefore, virtual screening offers a promising alternative for exploring this vast space. Armed with a computational model that can estimate the fitness landscape of a sequence space, it is now possible to select candidate designs with high fitness and employ an iterative process to achieve an efficient virtual screen of bioparts (Figure 9A).
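The scale of this sequence space can be made concrete with a few lines of arithmetic. The screening throughput used below is a purely hypothetical figure chosen for illustration, not a number from the cited literature.

```python
import math

ALPHABET_SIZE = 4   # A, C, G, T
LENGTH_BP = 100     # sequence length from the example in the text

n_sequences = ALPHABET_SIZE ** LENGTH_BP
order_of_magnitude = math.floor(math.log10(n_sequences))

# Hypothetical experimental throughput, for illustration only.
ASSUMED_VARIANTS_PER_DAY = 10 ** 9
days_needed = n_sequences // ASSUMED_VARIANTS_PER_DAY

print(f"4^{LENGTH_BP} is on the order of 10^{order_of_magnitude} sequences")
print(f"exhaustive screening would take about 10^{len(str(days_needed)) - 1} days")
```

Even under such an optimistic assumed throughput, exhaustive experimental screening is out of reach, which is why model-guided virtual screening is attractive.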

From a machine learning perspective, the inverse design problem of bioparts can be abstracted as the mathematical problem of estimating the joint distribution of biopart sequences and functions and sampling from it a biopart $x$ with target function $y$. Given the target function $y$, the biopart design problem can be formulated in probabilistic terms as finding mutually compatible sequence-function pairs that maximize the joint probability $p(x, y)$ (Anishchenko et al., 2021b). Using the probability chain rule, we obtain

$$
p(x, y) = p(y \mid x)\, p(x),
$$

where the first term represents the conditional probability of the function $y$ given the sequence $x$, and the second term represents the biocompatibility of the sequence $x$, which is constrained by chemical and biophysical properties. The development of machine learning methods, especially deep learning, has enabled increasingly accurate estimates of fitness landscapes (Angermueller et al., 2016) and has considerably improved the efficiency of generating candidate designs that satisfy biological constraints. Combining virtual screening and high-throughput experimental screening in a closed loop facilitates iterative optimization of virtual screening and further accelerates design progress (Figure 9A).
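The factorization of the joint probability into a predictive term and a sequence prior suggests a simple ranking rule for candidates, sketched below in log space. The candidate names and probability values are invented for illustration; in practice the conditional term would come from a predictive model and the prior from a generative model.

```python
import math


def log_joint(log_p_y_given_x, log_p_x):
    """log p(x, y) = log p(y | x) + log p(x), by the chain rule."""
    return log_p_y_given_x + log_p_x


# Hypothetical candidates: name -> (p(y | x), p(x)).
candidates = {
    "seq_A": (0.90, 0.001),
    "seq_B": (0.60, 0.010),
    "seq_C": (0.95, 0.0001),
}

scores = {
    name: log_joint(math.log(p_y_given_x), math.log(p_x))
    for name, (p_y_given_x, p_x) in candidates.items()
}
best = max(scores, key=scores.get)

for name in sorted(scores, key=scores.get, reverse=True):
    print(f"{name}: joint probability = {math.exp(scores[name]):.2e}")
print("selected:", best)
```

In this toy example the candidate with the highest predicted function (`seq_C`) is not selected, because its low prior marks it as an implausible sequence; the joint score balances both terms.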

Figure 9 A, Schematic of the Design-Build-Test-Learn (DBTL) process for bioparts design. Bioparts design integrates AI-guided virtual screening, bioparts synthesis, biological measurements and learning of functional features into a closed-loop framework. B, Framework of the generative and predictive models.

# Deep learning models for synthetic biology

With the increase in computational power and high-throughput omics data, the use of deep learning has emerged as a promising approach to learn complex patterns and efficiently estimate data distributions implicitly or explicitly. We briefly introduce two main classes of deep learning models widely used in biological component design: predictive models and generative models.

Predictive model: To estimate the term $p(y \mid x)$, predictive models are constructed to evaluate the property $y$ conditioned on the input biopart $x$. The input to the predictive model is the sequence of a biopart and the output is the predicted property of that sequence. For example, the model can be given a promoter sequence and asked to predict the expression level of the downstream gene. One of the most widely used model types is based on convolutional neural networks (CNNs), which can efficiently learn local patterns in sequences, such as transcription factor binding sites and their combinations. For example, Zrimec et al. used a CNN model to predict gene expression from promoter, 5′UTR, 3′UTR and terminator sequences and achieved an R-squared value of 0.822 in predicting S. cerevisiae mRNA abundance (Zrimec et al., 2020). Another well-known application of CNN-based models comes from AlphaFold (Senior et al., 2020). In the first stage of protein structure prediction, they used 64 residual convolutional blocks to predict distance and torsion distributions and achieved considerable improvements over comparable systems in CASP13 (AlQuraishi, 2019).
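How a convolutional filter picks up a local sequence pattern can be shown with a minimal sketch. Here a single filter is set by hand to match a hypothetical "TATA" motif; in a trained CNN the filter weights would be learned from data, and the sequence and motif below are illustrative only.

```python
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows (A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]


def conv1d(encoded, kernel):
    """Valid 1D cross-correlation of an (L x 4) input with a (k x 4) filter."""
    width = len(kernel)
    activations = []
    for start in range(len(encoded) - width + 1):
        total = 0.0
        for i in range(width):
            for j in range(len(BASES)):
                total += encoded[start + i][j] * kernel[i][j]
        activations.append(total)
    return activations


# Hand-set filter that scores the number of positions matching "TATA".
kernel = one_hot("TATA")

sequence = "GGCTATAAGC"  # hypothetical promoter fragment
activations = conv1d(one_hot(sequence), kernel)
best_position = max(range(len(activations)), key=activations.__getitem__)

print(activations)
print("strongest match starts at index", best_position)
```

The activation peaks where the motif occurs, regardless of its position in the sequence, which is the property that makes convolutional filters suited to detecting binding-site-like patterns.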

Predictive models based on recurrent neural network (RNN) models and attention-based neural networks are also widely used to capture long-term interactions between different regulatory elements. RNN is an artificial neural network that uses sequence information to extract long-term correlations (Greener et al., 2022). Quang et al. (Quang and Xie, 2016) proposed a hybrid convolutional and recurrent deep neural network to quantify the functionality of DNA sequences. Attention is a technique that mimics cognitive attention, which may be associated with different locations in a single sequence and has achieved impressive results in natural language understanding tasks. The advantage of the attention mechanism is that it can notice regions or patterns of interest regardless of the distance, thus facilitating the understanding of long-range syntax in DNA or protein sequences. The most prominent application of the attention model is AlphaFold2 (Jumper et al., 2021), which significantly outperforms other methods including AlphaFold in protein structure prediction. AlphaFold2 uses an attention mechanism in the predictive model instead of the convolutional layers in AlphaFold. This improvement shows that the attention mechanism has considerable potential for future applications in structure and fitness prediction.
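The distance-independence of attention can be illustrated with a minimal scaled dot-product self-attention computation. This sketch omits the learned query, key and value projections of a real attention layer (the raw embeddings are used for all three), and the toy embeddings are invented for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention; returns (outputs, attention weights)."""
    dim = len(keys[0])
    weights = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(len(values[0]))]
        for row in weights
    ]
    return outputs, weights


# Toy embeddings: positions 0 and 5 share a feature; positions 1-4 do not.
embeddings = [
    [4.0, 0.0],
    [0.0, 1.0],
    [0.0, 1.0],
    [0.0, 1.0],
    [0.0, 1.0],
    [4.0, 0.0],
]
outputs, weights = attention(embeddings, embeddings, embeddings)
print([round(w, 3) for w in weights[0]])
```

Position 0 attends to the distant position 5 as strongly as to itself and almost ignores its immediate neighbors, because the weights depend on content similarity rather than on distance along the sequence.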

Generative models: Generative models aim to learn the underlying distribution of the data $p(x)$. Deep generative models, such as generative adversarial networks (GANs) and variational autoencoders (VAEs), use deep neural networks to implicitly estimate the distribution in the sample space by mapping the sample space to a low-dimensional representation space and generating new samples from it. State-of-the-art generative models have achieved considerable success in generating vivid images in computer vision tasks (Bau et al., 2020). The rationale behind this state-of-the-art performance is that generative models can estimate sample distributions constrained by complex properties, which helps researchers achieve efficient model-guided navigation in the sample space and generate completely new samples that have never been seen before.

Similar methods can also be used to estimate the distribution of functional biomolecular sequences that occupy only a small fraction of the overall sequence space limited by biophysical properties and long-range interactions. Deep generative models can help researchers more efficiently explore candidate sequences that are more likely to be functional.
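The idea of estimating $p(x)$ from functional sequences and then sampling new candidates can be sketched with a far simpler model than a GAN or VAE. The example below fits an independent per-position base distribution (a position probability matrix) and samples from it. This is a deliberately swapped-in technique for illustration: unlike deep generative models, it cannot represent long-range interactions between positions. The training sequences, the pseudocount, and all function names are hypothetical.

```python
import random

BASES = "ACGT"

def fit_position_model(seqs, pseudocount=1.0):
    """Estimate per-position base probabilities from equal-length sequences."""
    length = len(seqs[0])
    model = []
    for i in range(length):
        counts = {b: pseudocount for b in BASES}
        for s in seqs:
            counts[s[i]] += 1.0
        total = sum(counts.values())
        model.append({b: counts[b] / total for b in BASES})
    return model

def sample(model, rng):
    """Draw one new sequence from the fitted distribution."""
    return "".join(
        rng.choices(BASES, weights=[col[b] for b in BASES])[0] for col in model
    )

def probability(model, seq):
    """p(x) under the independent-position model."""
    p = 1.0
    for col, b in zip(model, seq):
        p *= col[b]
    return p

# Hypothetical toy training set (not measured promoter data).
training = ["TATAAT", "TATGAT", "TACAAT", "TATAAT", "GATAAT"]
model = fit_position_model(training)
rng = random.Random(0)
new_seqs = [sample(model, rng) for _ in range(5)]
```

Sequences resembling the training set receive a higher probability than unrelated ones, so sampling concentrates on the small region of sequence space that the training data occupy.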

# Engineering bioparts using AI

AI algorithms have been applied to the design of synthetic biological components, including the design of cis-regulatory sequences (Van Brempt et al., 2020), small-molecule drugs (Xu et al., 2019), and small peptides (Cherkasov et al., 2009).

With sufficient numbers of sequence samples $(x)$ with functional annotation $(\nu)$, as well as a well-designed AI model, it is now possible to computationally design biological components rationally from scratch. One approach is to combine predictive models with stochastic screening, genetic algorithms or gradient search to virtually screen the fitness landscape to find possible functional biological components. For example, Van Brempt et al. used a training set of over 250,000 synthetic sequences to predict expression, then applied random screening to select E. coli sigma 70 promoters with differential expression, and experimentally verified a Spearman’s rank correlation coefficient of 0.909 (Van Brempt et al., 2020). Kotopka et al. implemented a genetic algorithm and gradient search optimization to design constitutive and inducible yeast promoters for high expression (Kotopka and Smolke, 2020). Bogard et al. designed surrogate polyadenylation sites through gradient ascent optimization based on a predictive model that maps the position weight matrix (PWM) inputs of surrogate polyadenylation sites to isoform numbers (Bogard et al., 2019). Bryant et al. designed highly diverse adeno-associated virus type 2 (AAV2) capsid protein variants by gradually moving away from natural AAV serotype sequences guided by predictive models (Bryant et al., 2021).
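The combination of a predictive model with a genetic algorithm can be sketched as follows. This is a minimal illustration, not the method of any cited study: the `predict` function is a hypothetical stand-in for a trained model (it simply scores matches to the consensus sigma 70 -35 and -10 boxes at fixed positions), and the population size, mutation rate, and other parameters are arbitrary assumptions.

```python
import random

BASES = "ACGT"
LENGTH = 30
BOX_35, BOX_10 = "TTGACA", "TATAAT"  # consensus boxes used as a toy target

def predict(seq):
    """Hypothetical surrogate for a trained predictor: fraction of box matches."""
    hits = sum(a == b for a, b in zip(seq[0:6], BOX_35))
    hits += sum(a == b for a, b in zip(seq[23:29], BOX_10))
    return hits / 12.0

def mutate(seq, rng, rate=0.05):
    """Replace each base with a random base with probability `rate`."""
    return "".join(rng.choice(BASES) if rng.random() < rate else b for b in seq)

def crossover(a, b, rng):
    """Single-point crossover between two parent sequences."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def genetic_search(rng, pop_size=50, generations=40, elite=10):
    """Virtual screening: evolve sequences toward a high predicted score."""
    population = [
        "".join(rng.choice(BASES) for _ in range(LENGTH)) for _ in range(pop_size)
    ]
    history = []
    for _ in range(generations):
        ranked = sorted(population, key=predict, reverse=True)
        history.append(predict(ranked[0]))
        parents = ranked[:elite]
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - elite)
        ]
        population = parents + children  # elitism keeps the best candidates
    best = max(population, key=predict)
    return best, history

best, history = genetic_search(random.Random(1))
```

Because the top candidates are carried over unchanged, the best predicted score never decreases between generations. In practice, the final candidates would still require experimental validation, since the search can exploit errors in the predictive model.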

In addition to obtaining better predictive models, another important direction is to obtain better estimates of sample distribution in order to more efficiently generate new biological component candidates. Taking high-expression promoter sequence design as an example (Figure 9B), Wang et al. trained generative adversarial networks on more than 10,000 natural sequences and applied them to de novo promoter design in E. coli. The generative model first learns the promoter distribution from the natural promoter dataset. The generator detects key regulatory patterns, such as transcription factor binding sites (TFBSs), and can generate new samples that match sequence signatures. A predictive model is then trained to estimate the properties of the samples. Combining generative and predictive models, a synthetic high-expression promoter candidate set can be obtained by virtual screening. Finally, an iterative process including artificial intelligence-based virtual screening and experimental validation (Figure 9A) helps the model to efficiently learn the functions of the biological components. As a result, up to $70.8\%$ of AI-designed promoters were experimentally demonstrated to be functional, few of which exhibited significant sequence similarity to the E. coli genome (Wang et al., 2020). In another example, Repecka et al. proposed a variant of self-attention-based generative adversarial network to learn natural protein sequence diversity, with $24\%$ of the generated sequences being functional, including a highly mutated variant with 106 amino-acid substitutions (Repecka et al., 2021). Shin et al. also introduced a deep generative model, with which they successfully designed and tested a diverse library of $10^{5}$ nanobodies by exploring new sequence spaces (Shin et al., 2021). Biswas et al. applied a deep generative model with long short-term memory to capture uniform protein sequence distributions and computationally explore protein landscapes in the range of $10^{7}$ to $10^{8}$ variants. They successfully engineered the fluorescent protein avGFP and the enzyme TEM-1 $\beta$-lactamase starting from wild-type sequences, using only 24 or 96 characterized sequence variants as training data (Biswas et al., 2021).

With the computational power to process complex data, AI has been successfully applied to a variety of synthetic biology problems and has shown unprecedented efficiency, accelerating inverse design tasks by several orders of magnitude compared with traditional experimental methods. AI-based approaches, such as deep predictive models, deep generative models, and reinforcement learning methods, have brought significant improvements in synthetic regulatory sequence design and drug discovery (Sanchez-Lengeling and Aspuru-Guzik, 2018; Faulon and Faure, 2021). Integrating AI models into a closed-loop bioprocess optimization framework will considerably speed up the comprehensive design process.

Despite these recent advances, the application of artificial intelligence in synthetic biology is still in its infancy. One major reason is that the capabilities of existing AI methods are limited by the size of training samples. Compared to computer vision and natural language processing tasks, which often have millions or even billions of training samples, the sample size of biological data is too small to fully unleash the power of these deep learning models. Therefore, it is important for the community to provide standardized samples that pair each sequence with its functional annotation to better train AI models. Another point is that the state-of-the-art AI frameworks, such as convolutional neural networks and attention-based models, originate from non-biological domains, such as computer vision or natural language processing tasks. It is now critical to develop new AI frameworks that better incorporate knowledge from the biological domain. For example, the interplay between synthetic gene circuits and complex multilevel regulation within cells remains to be studied. In addition, functional bioparts in cells usually show dynamic changes over time, so how to measure and capture the dynamic distribution of bioparts is also an important problem to be solved. Overcoming these challenges will bring enormous potential opportunities for synthetic biology and artificial intelligence in the near future.

# Biofoundries—process automation in synthetic biology

# DBTL automation using biofoundries

As discussed in the first two sections, due to the lack of predictive modeling, a trial-and-error process is often employed to create biological systems with desirable properties. Through physical and information automation (Figure 10), biofoundries promise to implement and accelerate the design-build-test-learn (DBTL) cycle as an engineering framework for synthetic biology (Chao et al., 2017b; Hillson et al., 2019). The integration of computer-aided design (CAD), robotics, and high-throughput instrumentation allows efficient exploration of genetic and process variables and rapid data generation to recommend experimental plans for the next DBTL iteration using active learning algorithms (HamediRad et al., 2019; Zhang et al., 2020a; Zhang et al., 2021a). Additionally, standardization of materials, hardware, protocols, and data reporting eliminates idiosyncratic biases and errors (Beal et al., 2020). The enhanced reproducibility thus enables the collective aggregation of big data across batches, projects and institutions to obtain mechanistic and statistical models of engineering biology (Farzaneh and Freemont, 2021; Hillson et al., 2019). Currently, public institutions and private companies around the world are building biofoundries (Table 1). Numerous robotic workflows have been developed for the automated construction and testing of biologically based synthetic genetic constructs and organisms, with a primary focus on microbial cell factories for chemical/biochemical production (Hillson et al., 2019; Zhang et al., 2021a).
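The iterative structure of an automated DBTL cycle can be sketched in a few lines. The example below is a simplified illustration under stated assumptions: the Build and Test steps are replaced by a simulated assay on a toy landscape, and the Design step uses a plain exploit-and-explore heuristic rather than the active learning algorithms used in the cited studies. The design space, the `run_assay` function, and all parameters are hypothetical.

```python
import itertools
import random

LEVELS = range(5)  # hypothetical expression levels for each of three parts
DESIGN_SPACE = list(itertools.product(LEVELS, repeat=3))  # 125 candidate designs

def run_assay(design):
    """Simulated Build and Test steps; a biofoundry would return measurements."""
    a, b, c = design
    return -(a - 3) ** 2 - (b - 1) ** 2 - (c - 4) ** 2  # toy optimum at (3, 1, 4)

def propose(measured, batch_size, rng):
    """Design step: half the batch near the current best, half chosen at random."""
    untested = [d for d in DESIGN_SPACE if d not in measured]
    if not measured:
        return rng.sample(untested, batch_size)
    best = max(measured, key=measured.get)

    def distance(d):
        return sum(abs(x - y) for x, y in zip(d, best))

    near = sorted(untested, key=distance)[: batch_size // 2]
    rest = [d for d in untested if d not in near]
    return near + rng.sample(rest, batch_size - len(near))

def dbtl(cycles=6, batch_size=8, seed=0):
    """Run several DBTL iterations and track the best measurement so far."""
    rng = random.Random(seed)
    measured = {}  # Learn step: accumulated design -> measurement records
    best_per_cycle = []
    for _ in range(cycles):
        batch = propose(measured, batch_size, rng)
        for design in batch:
            measured[design] = run_assay(design)
        best_per_cycle.append(max(measured.values()))
    return measured, best_per_cycle

measured, best_per_cycle = dbtl()
```

Each cycle tests only a small batch, so the loop explores 48 of the 125 designs in this configuration. The accumulated records play the role of the standardized data that a learning algorithm would consume.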

Figure 10 Biofoundries provide an integrated infrastructure to automate DBTL loops of synthetic biology via physical and informatic automation. Build: programmed by robotic scripts, automation instruments construct gene cassettes, pathways/circuits, and synthetic genomes, followed by cell culture, genetic transformation, and clonal selection. Test: test automation permits rapid and large-scale genotype-phenotype mapping across scales with findable, accessible, interoperable, and reusable (FAIR) metadata for DBTL iterations using adaptive learning algorithms.

# Build automation: DNA construction

In biofoundries, robotic protocols can be applied to large-scale fabrication of expression cassettes, metabolic pathways, genetic circuits, and even entire genomes, using synthetic or cloned DNA fragments (Zhang et al., 2021a). Type IIS restriction and homology-directed methods are widely used for automation-compatible DNA assembly, primarily because they allow one-step, scar-free assembly of many fragments using standard procedures (Chao et al., 2017a; Dharmadi et al., 2014; Kanigowska et al., 2016; Walsh III et al., 2019). For the Type IIS restriction method, Golden Gate assembly joins up to 15 fragments in one step on an academic biofoundry called iBioFAB, which can create 400 constructs per day (Chao et al., 2017a). For the homology-directed method, up to 12 DNA fragments can be assembled in S. cerevisiae using transformation-associated recombination (TAR), achieving a throughput of over 1500 constructs per (Ip et al., 2020). However, DNA assembly is not error-free, and robotic screening is necessary for the rapid and large-scale identification of correctly assembled constructs. For example, qPCR analysis can be automated to detect assembled junctions (Shapland et al., 2015), and capillary electrophoresis (CE) analysis can be automated to match restriction patterns (Dharmadi et al., 2014). Furthermore, with the help of multiplex DNA barcodes introduced during robotic library preparation, next-generation sequencing (NGS) validation can be designed to analyze hundreds to thousands of assembled constructs in a single run (Shapland et al., 2015; Suckling et al., 2019).
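Barcode-based pooling works because each read can be traced back to its construct computationally. The following is a minimal demultiplexing sketch, assuming that every read starts with a fixed-length barcode and that one mismatch is tolerated. The barcode table, construct names, and reads are hypothetical, and the cited workflows may use different barcode layouts and error models.

```python
from collections import defaultdict

# Hypothetical barcode-to-construct table; all barcodes share one length.
BARCODES = {
    "ACGTAC": "construct_01",
    "TGCATG": "construct_02",
    "GATCGA": "construct_03",
}
BARCODE_LENGTH = 6

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def assign(read, barcodes, max_mismatch=1):
    """Return the construct whose barcode uniquely matches the read prefix."""
    prefix = read[:BARCODE_LENGTH]
    if len(prefix) < BARCODE_LENGTH:
        return None  # read too short to carry a full barcode
    hits = [
        name for bc, name in barcodes.items() if hamming(prefix, bc) <= max_mismatch
    ]
    return hits[0] if len(hits) == 1 else None

def demultiplex(reads, barcodes):
    """Group reads by construct, trimming the barcode from assigned reads."""
    bins = defaultdict(list)
    for read in reads:
        name = assign(read, barcodes)
        if name is None:
            bins["unassigned"].append(read)
        else:
            bins[name].append(read[BARCODE_LENGTH:])
    return dict(bins)

reads = ["ACGTACTTGACA", "ACGAACTATAAT", "TGCATGCCCGGG", "NNNNNNAAAA", "GAT"]
bins = demultiplex(reads, BARCODES)
```

The second read carries a single-base error in its barcode and is still assigned to `construct_01`, which is only safe because the barcodes in this toy table differ from one another at every position.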

# Build automation: engineered organisms

Basic procedures for organism engineering using robotic systems include cell culture, genetic transformation, and clonal selection (HamediRad et al., 2019; Ip et al., 2020; Rajakumar et al., 2019; Si et al., 2017a). For these procedures, commercial instrumentation is available and can be easily integrated to automate the handling of aerobic model microorganisms, as described above, when E. coli and S. cerevisiae are used as recombinant hosts for robotic DNA assembly (Chao et al., 2017a; Ip et al., 2020; Kanigowska et al., 2016). For non-model organisms, specially designed equipment and laboratories are required to support automated workflows. For example, a casting platform is completely enclosed in an environmental control room for robotic manipulation of strictly anaerobic bacteria. Furthermore, although bulk electroporation can be performed in a 96-well format (Park et al., 2011; Wang and Church, 2011), genetic transformation of certain microalgae requires single-cell electroporation in microfluidics (Im et al., 2015). In addition, customization is necessary to modify commercial colony pickers to purify homokaryotic transformants and pick mycelia of multinucleated filamentous fungi (SunSpiral et al., 2022). For organisms that are challenging for robotic protocols, the automation of corresponding cell-free systems offers a viable alternative, as demonstrated by rapid prototyping of the Bacillus megaterium promoter at the London Biofoundry (Moore et al., 2018b).

Table 1 A list of selected biofoundries and automation capacities