Hi, my name is

James Baker

PhD Candidate & Researcher

I am a PhD student in the Human Genetics Program at Vanderbilt University. Welcome to my portfolio.

About Me

I am a PhD student in the Human Genetics Program at Vanderbilt University. My research focuses on how distant relatedness in large biobanks linked to electronic health data can be leveraged to explore the genetic causes of disease. To investigate this, I design pipelines and software, mostly in Python and R, for both local HPC and cloud environments, either to implement novel statistical methods or to efficiently scale existing methods to modern large biobanks (BioVU, All of Us, and UK Biobank). I am also interested in how modern high-performance data science packages (such as Polars and DuckDB) and software best practices can be used to design reliable, scalable, and maintainable research software.

Skills

Here are some of the technologies I work with:
  • Python
  • R
  • Go
  • Pandas
  • Polars
  • DuckDB
  • Terra.bio
  • All of Us
  • ggplot
  • SQL
  • Docker
  • Git & GitHub
  • REGENIE
  • SAIGE
  • PLINK
  • BCFtools
  • Tractor

Software Projects

DRIVE
Distant Relatedness for Identification and Variant Evaluation. A tool leveraging graph theory and identity-by-descent (IBD) in biobanks to identify networks of related individuals and evaluate genetic variants.
Contributions:
  • Conceptual design and programmatic implementation
  • Optimized memory usage and performance for large-scale biobank data with packages like DuckDB and Pandas
  • Wrote comprehensive documentation, available on both GitHub and Read the Docs
  • Added integration tests and continuous integration through GitHub Actions
  • Performance testing and benchmarks using pytest
  • Simulated test IBD data for integration tests and examples
IBDMap
A computational workflow for identity-by-descent (IBD) mapping. IBDMap identifies genomic regions associated with disease by detecting excess IBD segment sharing among affected individuals in biobank-scale datasets.
Contributions:
  • Restructured & refactored Python IBDreduce code to use modern Python features and idioms
  • Containerized application using a multi-stage Docker build
  • Tested the build and install pipelines for the C++ code and the Python code to ensure performance on different CPU architectures
COMPADRE
Combined pedigree-aware distant relatedness estimation for improved pedigree reconstruction. This tool leverages genome-wide IBD sharing to accurately reconstruct families even with sparse or ungenotyped individuals.
Contributions:
  • Tested the conda and Docker installations
  • Refactored the logging interface between the Perl frontend and the Python server to improve error handling and user experience
  • Designed and implemented a unit and integration testing framework using the Perl 'prove' tool and implemented CI/CD to automatically run all tests
  • Reduced the program's memory footprint to improve runtime efficiency
PubMed RAG
A terminal user interface (TUI) application that lets the user search for abstracts related to a topic of interest. Abstracts are retrieved from PubMed, embedded, and stored in a vector database. A vector search then finds the relevant abstracts, and an LLM summarizes them and returns the results. The project uses FastAPI, LangChain, Hugging Face (LLMs), Qdrant (vector store), and Bubble Tea (TUI).

Experience

Aug 2019 - Present
Graduate Student Researcher
College of Basic Sciences - Vanderbilt University
Mentors: David Samuels, PhD & Jennifer Below, PhD
Projects:
  • Executed scalable cloud-based pipelines (WDL/Terra.bio) for quality control, pedigree reconstruction, and relatedness analysis of whole-genome sequencing data from 35,024 predominantly African ancestry individuals using PLINK and BCFtools. Also contributed local-ancestry-informed GWAS of white and red blood cell counts in this cohort.
  • Orchestrated large-scale local ancestry inference, IBD segment detection, and comprehensive variant annotation (using VEP) for 250,000 WGS BioVU participants.
  • Developed scalable software to optimize IBD mapping among the 93,000 genotyped BioVU participants, enabling large-scale phenome-wide analysis in electronic health records.
  • Identified the background haplotype harboring a rare variant in RHO that is causal for retinitis pigmentosa by leveraging multi-individual IBD clustering with local ancestry data.
  • Conducted GWAS analysis using SAIGE for von Willebrand factor in 70,000 BioVU participants, contributing to a large multi-institution meta-analysis.
Aug 2018 - Aug 2019
Undergraduate Student Researcher
Department of Chemistry - North Carolina State University
Mentor: Christian Melander, PhD
Projects:
  • Meridianin D Analogues Display Antibiofilm Activity against MRSA and Increase Colistin Efficacy in Gram-Negative Bacteria

Education

Aug 2019 - Present
PhD in Human Genetics
Vanderbilt University
Aug 2014 - May 2018
B.S. in Chemistry
North Carolina State University
Aug 2016 - May 2019
B.S. in Biochemistry
North Carolina State University

Publications

DRIVE v3: command line application for identity-by-descent haplotype clustering in large biobank-scale data
James T Baker, Hung-Hsin Chen, Grahame F Evans, Alyssa C Scartozzi, Ryan J Bohlender, Chad D Huff, Quinn S Wells, David C Samuels, Jennifer E Below
Genetic Epidemiology (2026)
Revise & Resubmit
IBDMap: biobank-scale identity-by-descent mapping software for dichotomous traits
Ryan J. Bohlender*, Grahame F Evans*, James T Baker*, Joshua M Landman, Elizabeth G Frankel, Autumn R Morrow, Lauren E Petty, Alexander S Petty, Lisa A Bastarache, Hung-Hsin Chen, Matthew S Zawistowski, David C Samuels, Quinn S Wells, Jennifer E Below, Chad D Huff
Revise & Resubmit
Out of sight, out of mind: Characterizing the variant landscape of ALPL and associated medical records in a clinical cohort enriched for African ancestry highlights the diagnostic failure of HPP in this population
Kimberlyn A. Ellis, Wanying Zhu, James T. Baker, Lauren E. Petty, Dayi Bian, Elena Evans, Hannah G. Polikowsky, Margo Black, Emily Shardelow, E. Mason Campbell, La Toya Hannah, Alexander Petty, Jennifer E. Below, Kathryn M. Dahir
medRxiv (2025)
Under Review
Common variant approaches to study Mendelian disease gene function identify novel phenome and pathways associated with PLOD3
Alexandra Scalici, James T Baker, Freida Blostein, Megan Shuey, Dharmendra Choudhary, Ela W Knapik, David C Samuels, Jennifer E Below, Lisa Bastarache, Tyne W Miller-Fleming, Nancy J Cox
medRxiv (2025)
Revise & Resubmit
Genome sequencing of 35,024 predominantly African ancestry persons addresses gaps in genomics and healthcare
Cecile Avery, Mojgan Babanejad, James Baker, Xavier Bledsoe, Freida Blostein, Robert W Corty, Kimberlyn Ellis, Adriana M Hung, Allison Lake, John Shelley, Quanhu Sheng, Vanderbilt University Medical Center and Alliance for Genomic Discovery Investigators, Melinda Aldrich, Melissa Basford, Lisa Bastarache, Jennifer Below, Alexander G Bick, Peter Embi, QiPing Feng, Eric Gamazon, Lide Han, Jibril Hirbo, Kayla Marginean, Jonathan Mosley, Jill Pulley, Dan M. Roden, Douglas M Ruderfer, Megan Shuey, Yu Shyr, C Michael Stein, Colin Walsh, Consuelo Wilkins
medRxiv (2025)
Revise & Resubmit
COMPADRE: Combined pedigree-aware distant relatedness estimation for improved pedigree reconstruction
Evans GF, Baker JT, Petty LE, Petty AS, Polikowsky HG, Bohlender RJ, Chen HH, Chou CY, Viljoen KZ, Beilby JM, Kraft SJ, Zhu W, Landman JM, Morrow AR, Bian D, Scartozzi AC, Huff CD, Below JE
American Journal of Human Genetics (2025)
Genetic study of von Willebrand factor antigen levels ≤ 50 IU/dL identifies variants associated with increased risk of VWD and bleeding
Friedman RK, Heath AS, Huffman JE, Baker JT, Hasbani NR, Gagliano Taliun SA, Chen MH, Howard TE, Lewis JP, Pankratz N, Patil S, Reiner AP, Thibord F, Yanek LR, Yao J, Chen HH, Curran JE, Faraday N, Guo X, Wheeler MM, Ryan KA, Zhou X, Cho K, Almasy L, Auer PL, Becker LC, Wilson PWF, Boerwinkle E, O'Connell JR, Rich SS, Samuels DC; NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium; TOPMed Hematology & Hemostasis Working Group; VA Million Veteran Program; Blangero J, Fornage M, Kooperberg C, Mathias RA, Mitchell BD, Rotter JI, Johnson AD, Smith NL, Coban-Akdemir ZH, Below JE, Morrison AC, Johnsen JM, de Vries PS
Journal of Thrombosis and Haemostasis
Gene and phenome-based analysis of the shared genetic architecture of eye diseases
Scalici A, Miller-Fleming TW, Shuey MM, Baker JT, Betti M, Hirbo J, Knapik EW, Cox NJ
American Journal of Human Genetics (2025)
Detection of distant relatedness in biobanks to identify undiagnosed cases of Mendelian disease as applied to Long QT syndrome
Lancaster MC, Chen HH, Shoemaker MB, Fleming MR, Strickland TL, Baker JT, Evans GF, Polikowsky HG, Samuels DC, Huff CD, Roden DM, Below JE
Nature Communications (2025)
Machine-learning based classification of Frontotemporal dementia in electronic health records for genetic discovery
Below, J., Shaw, D., Evans, G., Baker, J., Bohlender, R., Petty, A., Petty, L., Roshani, R., Lifferth, J., Bastarache, L., Naj, A., Bush, W., Darby, R., McMillan, C., Samuels, D., Huff, C
Alzheimer’s & Dementia (2023)
2-aminobenzimidazoles as antibiofilm agents against Salmonella enterica serovar Typhimurium
Huggins, W. M., Vu Nguyen, T., Hahn, N. A., Baker, J.T., Kuo, L. G., Kaur, D., Melander, R. J., Gunn, J. S., & Melander, C
MedChemComm (2018)
Meridianin D analogues display antibiofilm activity against MRSA and increase colistin efficacy in gram-negative bacteria
Huggins, W. M., Barker, W. T., Baker, J.T., Hahn, N. A., Melander, R. J., & Melander, C
ACS Medicinal Chemistry Letters

Get in Touch

Feel free to reach out directly via email: