
Dr. Serafim Batzoglou, Chief Data Officer at Seer – Interview Series


Serafim Batzoglou is Chief Data Officer at Seer. Prior to joining Seer, Serafim served as Chief Data Officer at insitro, leading machine learning and data science in their approach to drug discovery. Prior to insitro, he served as VP of Applied and Computational Biology at Illumina, leading research and technology development of AI and molecular assays for making genomic data more interpretable in human health.

What initially attracted you to the field of genomics?

I became interested in the field of computational biology at the start of my PhD in computer science at MIT, when I took a class on the topic taught by Bonnie Berger, who became my PhD advisor, and David Gifford. The human genome project was picking up pace during my PhD. Eric Lander, who was heading the Genome Center at MIT, became my PhD co-advisor and involved me in the project. Motivated by the human genome project, I worked on whole-genome assembly and comparative genomics of human and mouse DNA.

I then moved to Stanford University as faculty in the Computer Science department, where I spent 15 years and was privileged to have advised about 30 incredibly talented PhD students and many postdoctoral researchers and undergraduates. My team's focus has been the application of algorithms, machine learning and software tool building for the analysis of large-scale genomic and biomolecular data. I left Stanford in 2016 to lead a research and technology development team at Illumina. Since then, I have enjoyed leading R&D teams in industry. I find that teamwork, the business aspect, and a more direct impact on society are characteristic of industry compared to academia. I have worked at innovative companies throughout my career: DNAnexus, which I co-founded in 2009, Illumina, insitro and now Seer. Computation and machine learning are essential across the technology chain in biotech, from technology development, to data acquisition, to biological data interpretation and translation to human health.

Over the last 20 years, sequencing the human genome has become vastly cheaper and faster. This has led to dramatic growth in the genome sequencing market and broader adoption across the life sciences industry. We are now at the cusp of having population-scale genomic, multi-omic and phenotypic data of sufficient size to meaningfully revolutionize healthcare, including prevention, diagnosis, treatment and drug discovery. We can increasingly discover the molecular underpinnings of disease for individuals through computational analysis of genomic data, and patients have the chance to receive treatments that are personalized and targeted, especially in the areas of cancer and rare genetic disease. Beyond the obvious uses in medicine, machine learning coupled with genomic information allows us to gain insights into other areas of our lives, such as our genealogy and nutrition. The next several years will see adoption of personalized, data-driven healthcare, first for select groups of people, such as rare disease patients, and increasingly for the broad public.

Prior to your current role you were Chief Data Officer at insitro, leading machine learning and data science in their approach to drug discovery. What were some of your key takeaways from that period on how machine learning can be used to accelerate drug discovery?

The conventional drug discovery and development “trial-and-error” paradigm is plagued with inefficiencies and extremely lengthy timelines. For one drug to get to market, it can take upwards of $1 billion and over a decade. By incorporating machine learning into these efforts, we can dramatically reduce costs and timeframes at several steps along the way. One step is target identification, where a gene or set of genes that modulate a disease phenotype or revert a diseased cellular state to a healthier state can be identified through large-scale genetic and chemical perturbations, and phenotypic readouts such as imaging and functional genomics. Another step is compound identification and optimization, where a small molecule or other modality can be designed by machine learning-driven in silico prediction as well as in vitro screening, and moreover desired properties of a drug such as solubility, permeability, specificity and non-toxicity can be optimized. The hardest as well as most important aspect is perhaps translation to humans. Here, the choice of the right model (induced pluripotent stem cell-derived lines, versus primary patient cell lines and tissue samples, versus animal models) for the right disease poses an incredibly important set of tradeoffs that ultimately determine the ability of the resulting data plus machine learning to translate to patients.

Seer Bio is pioneering new ways to decode the secrets of the proteome to improve human health. For readers who are unfamiliar with the term, what is the proteome?

The proteome is the changing set of proteins produced or modified by an organism over time and in response to environment, nutrition and health state. Proteomics is the study of the proteome within a given cell type or tissue sample. The genome of a human or other organism is static: with the important exception of somatic mutations, the genome at birth is the genome one has their entire life, copied exactly in each cell of the body. The proteome is dynamic and changes on time spans of years, days and even minutes. As such, proteomes are vastly closer to phenotype, and ultimately to health status, than genomes are, and consequently far more informative for monitoring health and understanding disease.

At Seer, we have developed a new way to access the proteome that provides deeper insights into proteins and proteoforms in complex samples such as plasma, which is a highly accessible sample that unfortunately to date has posed a great challenge for conventional mass spectrometry proteomics.

What is Seer’s Proteograph™ platform and how does it offer a new view of the proteome?

Seer’s Proteograph platform leverages a library of proprietary engineered nanoparticles, powered by a simple, rapid and automated workflow, enabling deep and scalable interrogation of the proteome.

The Proteograph platform shines in interrogating plasma and other complex samples that exhibit large dynamic range (many orders of magnitude difference in the abundance of various proteins within the sample), where conventional mass spectrometry methods are unable to detect the low-abundance part of the proteome. Seer’s nanoparticles are engineered with tunable physicochemical properties that gather proteins across the dynamic range in an unbiased manner. In typical plasma samples, our technology enables detection of 5x to 8x more proteins than when processing neat plasma without the Proteograph. As a result, from sample prep to instrumentation to data analysis, our Proteograph Product Suite helps scientists find proteome disease signatures that might otherwise be undetectable. We like to say that at Seer, we are opening up a new gateway to the proteome.

Furthermore, we are allowing scientists to easily perform large-scale proteogenomic studies. Proteogenomics is the combining of genomic data with proteomic data to identify and quantify protein variants, link genomic variants with protein abundance levels, and ultimately link the genome and the proteome to phenotype and disease, and start disentangling the causal and downstream genetic pathways associated with disease.

Can you discuss some of the machine learning technology currently used at Seer Bio?

Seer is leveraging machine learning at all steps from technology development to downstream data analysis. These steps include: (1) design of our proprietary nanoparticles, where machine learning helps us determine which physicochemical properties and combinations of nanoparticles will work with specific product lines and assays; (2) detection and quantification of peptides, proteins, variants and proteoforms from the readout data produced by the MS instruments; (3) downstream proteomic and proteogenomic analyses in large-scale population cohorts.

Last year, we published a paper in Advanced Materials combining proteomics methods, nanoengineering and machine learning to improve our understanding of the mechanisms of protein corona formation. This paper uncovered nano-bio interactions and is informing Seer in the creation of improved future nanoparticles and products.

Beyond nanoparticle development, we have been developing novel algorithms to identify variant peptides and post-translational modifications (PTMs). We recently developed a method for the detection of protein quantitative trait loci (pQTLs) that is robust to protein variants, which are a known confounder for affinity-based proteomics. We are extending this work to directly identify these peptides from the raw spectra using deep learning-based de novo sequencing methods, to allow searching without inflating the size of spectral libraries.
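For intuition, the core pQTL association test can be pictured as a regression of protein abundance on genotype dosage. Below is a minimal illustrative sketch in Python of that generic textbook-style test, not Seer's variant-robust method; the function name and simulated data are purely for illustration:

```python
import numpy as np
from scipy import stats

def pqtl_test(genotypes, abundances):
    """Test association between a genetic variant and a protein's abundance.

    genotypes:  array of allele dosages (0, 1, or 2), one per individual
    abundances: array of (log-transformed) protein abundances per individual
    Returns the regression slope (effect size) and its p-value.
    """
    result = stats.linregress(genotypes, abundances)
    return result.slope, result.pvalue

# Toy example: 100 individuals, one variant, one protein
rng = np.random.default_rng(0)
g = rng.integers(0, 3, size=100)        # simulated genotype dosages
y = 0.5 * g + rng.normal(size=100)      # abundance with a true genetic effect
beta, p = pqtl_test(g, y)
print(f"effect size = {beta:.2f}, p = {p:.2e}")
```

A real analysis would additionally control for covariates and multiple testing across millions of variant-protein pairs; robustness to variant peptides, as described above, addresses the case where the variant itself distorts the abundance measurement.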

Our team is also developing methods to enable scientists without deep expertise in machine learning to optimally tune and utilize machine learning models in their discovery work. This is accomplished via a Seer ML framework based on the AutoML tool, which allows efficient hyperparameter tuning via Bayesian optimization.
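To make the idea concrete, here is a minimal sketch of Bayesian hyperparameter tuning using the open-source Optuna library on toy data; this is a generic illustration of the technique, not Seer's internal framework:

```python
import optuna
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy dataset standing in for a proteomic feature matrix
X, y = make_classification(n_samples=300, n_features=50, random_state=0)

def objective(trial):
    # The sampler proposes hyperparameters; Bayesian methods use the
    # results of past trials to focus on promising regions of the space.
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 500),
        "max_depth": trial.suggest_int("max_depth", 2, 16),
        "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 10),
    }
    model = RandomForestClassifier(**params, random_state=0)
    return cross_val_score(model, X, y, cv=3).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)
```

The appeal for non-experts is that the search loop, rather than the scientist, decides which configurations to try next.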

Finally, we are developing methods to reduce batch effects and improve the quantitative accuracy of the mass spec readout, by modeling the measured quantitative values to maximize expected metrics such as the correlation of intensity values across peptides within a protein group.
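As a rough illustration of such a metric (one plausible formulation, not Seer's implementation), one can score a normalization by the average pairwise correlation of peptide intensity profiles within each protein group; since peptides from the same protein should rise and fall together across samples, higher coherence suggests less residual technical noise:

```python
import numpy as np

def protein_group_coherence(intensity, groups):
    """Average pairwise correlation of peptide intensity profiles
    within each protein group, computed across samples.

    intensity: (n_peptides, n_samples) matrix of log intensities
    groups:    list of index arrays, one per protein group
    """
    scores = []
    for idx in groups:
        if len(idx) < 2:
            continue
        corr = np.corrcoef(intensity[idx])            # peptide-by-peptide correlations
        upper = corr[np.triu_indices(len(idx), k=1)]  # off-diagonal entries only
        scores.append(upper.mean())
    return float(np.mean(scores))

# Toy example: 6 peptides, 8 samples, two protein groups of 3 peptides each
rng = np.random.default_rng(1)
signal = rng.normal(size=(2, 8))
X = np.repeat(signal, 3, axis=0) + 0.3 * rng.normal(size=(6, 8))
print(protein_group_coherence(X, [np.arange(3), np.arange(3, 6)]))
```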

Hallucinations are a common issue with LLMs. What are some of the solutions to prevent or mitigate them?

LLMs are generative methods that are given a large corpus and trained to generate similar text. They capture the underlying statistical properties of the text they are trained on, from simple local properties such as how often certain combinations of words (or tokens) are found together, to higher-level properties that emulate understanding of context and meaning.

However, LLMs are not primarily trained to be correct. Reinforcement learning with human feedback (RLHF) and other techniques help train them for desirable properties including correctness, but are not fully successful. Given a prompt, LLMs will generate text that most closely resembles the statistical properties of the training data. Often, this text will be correct. For example, if asked “when was Alexander the Great born,” the correct answer is 356 BC (or BCE), and an LLM is likely to give that answer because within the training data Alexander the Great’s birth frequently appears with this value. However, when asked “when was Empress Reginella born,” a fictional character not present in the training corpus, the LLM is likely to hallucinate and create a story of her birth. Similarly, when asked a question for which the LLM cannot retrieve a correct answer (either because the correct answer does not exist, or for other statistical reasons), it is likely to hallucinate and answer as if it knows. The resulting hallucinations are an obvious problem for serious applications, such as “how can such and such cancer be treated.”

There are no perfect solutions yet for hallucinations. They are endemic to the design of LLMs. One partial solution is proper prompting, such as asking the LLM to “think carefully, step by step,” and so on. This increases the LLM’s likelihood of not concocting stories. A more sophisticated approach under development is the use of knowledge graphs. Knowledge graphs provide structured data: entities in a knowledge graph are connected to other entities in a predefined, logical manner. Constructing a knowledge graph for a given domain is of course a challenging task, but doable with a combination of automated and statistical methods and curation. With a built-in knowledge graph, LLMs can cross-check the statements they generate against the structured set of known facts, and can be constrained to not generate a statement that contradicts or is not supported by the knowledge graph.
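A toy sketch of that cross-checking step is below. Real systems extract claims with dedicated models and query full graph databases, so every name and triple here is purely illustrative:

```python
# Toy knowledge graph as a set of (subject, relation, object) triples
KNOWLEDGE_GRAPH = {
    ("alexander the great", "born_in", "356 BC"),
    ("alexander the great", "born_in", "356 BCE"),
}

def check_claim(subject: str, relation: str, obj: str) -> str:
    """Classify a generated claim against the knowledge graph."""
    if (subject, relation, obj) in KNOWLEDGE_GRAPH:
        return "supported"
    # Same subject and relation, but a different object: contradiction
    if any(s == subject and r == relation for s, r, _ in KNOWLEDGE_GRAPH):
        return "contradicted"
    # The graph has nothing to say: flag rather than assert
    return "unsupported"

print(check_claim("alexander the great", "born_in", "356 BC"))  # supported
print(check_claim("alexander the great", "born_in", "320 BC"))  # contradicted
print(check_claim("empress reginella", "born_in", "100 AD"))    # unsupported
```

The "unsupported" outcome is the key one: rather than inventing an answer, a constrained system can decline or defer when the graph has no matching fact.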

Because of the fundamental issue of hallucinations, and arguably because of their lack of sufficient reasoning and judgment abilities, LLMs are today powerful for retrieving, connecting and distilling information, but cannot replace human experts in serious applications such as medical diagnosis or legal advice. Still, they can tremendously enhance the efficiency and capability of human experts in these domains.

Can you share your vision for a future where biology is steered by data rather than hypotheses?

The traditional hypothesis-driven approach, in which researchers find patterns, develop hypotheses, perform experiments or studies to test them, and then refine theories based on the data, is being supplanted by a new paradigm based on data-driven modeling.

In this emerging paradigm, researchers start with hypothesis-free, large-scale data generation. Then, they train a machine learning model such as an LLM with the objective of accurate reconstruction of occluded data, or strong regression or classification performance in a variety of downstream tasks. Once the machine learning model can accurately predict the data, achieving fidelity comparable to the similarity between experimental replicates, researchers can interrogate the model to extract insight about the biological system and discern the underlying biological principles.
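As a minimal sketch of the “reconstruct occluded data” objective, here is a generic masked-reconstruction training loop in PyTorch on synthetic data; it stands in, under simple assumptions, for the much larger biomolecular foundation models described here:

```python
import torch
import torch.nn as nn

# Toy stand-in for a biomolecular profile matrix (samples x features)
torch.manual_seed(0)
data = torch.randn(256, 64)

model = nn.Sequential(        # small autoencoder: compress, then reconstruct
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 64),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(200):
    mask = torch.rand_like(data) < 0.25      # occlude ~25% of entries at random
    corrupted = data.masked_fill(mask, 0.0)
    recon = model(corrupted)
    # Penalize reconstruction error only on the occluded values, so the
    # model must infer them from the visible context
    loss = ((recon - data)[mask] ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final masked-reconstruction loss: {loss.item():.4f}")
```

When such a model reaches replicate-level fidelity on held-out data, its internal representations, rather than a prior hypothesis, become the object of scientific interrogation.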

LLMs are proving to be especially good at modeling biomolecular data, and are poised to fuel a shift from hypothesis-driven to data-driven biological discovery. This shift will become increasingly pronounced over the next 10 years and will allow accurate modeling of biomolecular systems at a granularity that goes well beyond human capacity.

What is the potential impact for disease diagnosis and drug discovery?

I believe LLMs and generative AI will lead to significant changes in the life sciences industry. One area that will benefit greatly from LLMs is clinical diagnosis, especially for rare, difficult-to-diagnose diseases and cancer subtypes. There are immense amounts of comprehensive patient information that we can tap into (genomic profiles, treatment responses, medical records and family history) to drive accurate and timely diagnosis. If we can find a way to compile all this data such that it is easily accessible, and not siloed by individual health organizations, we can dramatically improve diagnostic precision. This is not to imply that machine learning models, including LLMs, will be able to operate autonomously in diagnosis. Because of their technical limitations, in the foreseeable future they will not be autonomous, but will instead augment human experts. They will be powerful tools that help the physician provide well-informed assessments and diagnoses in a fraction of the time needed to date, and properly document and communicate those diagnoses to the patient as well as to the entire network of health providers connected through the machine learning system.

The industry is already leveraging machine learning for drug discovery and development, touting its potential to reduce costs and timelines compared to the traditional paradigm. LLMs further add to the available toolbox, providing excellent frameworks for modeling large-scale biomolecular data including genomes, proteomes, functional genomic and epigenomic data, single-cell data, and more. In the foreseeable future, foundation LLMs will undoubtedly connect across all these data modalities and across large cohorts of individuals whose genomic, proteomic and health information is collected. Such LLMs will aid in the generation of promising drug targets, identify likely pockets of activity of proteins associated with biological function and disease, or suggest pathways and more complex cellular functions that can be modulated in a specific way with small molecules or other drug modalities. We can also tap into LLMs to identify drug responders and non-responders based on genetic susceptibility, or to repurpose drugs for other disease indications. Many of the existing innovative AI-based drug discovery companies are undoubtedly already starting to think and develop in this direction, and we should expect to see the formation of additional companies as well as public efforts aimed at the deployment of LLMs in human health and drug discovery.

Thank you for the detailed interview. Readers who wish to learn more should visit Seer.
