Jun-Peng Zhu is a Ph.D. student at the School of Data Science and Engineering, East China Normal University, advised by Prof. Peng Cai.
He is fortunate to work closely with Prof. Xuan Zhou and Prof. Aoying Zhou.
He is currently working in the TiDB Cloud Platform Group at PingCAP, under the supervision of Dr. Kai Xu, Liu Tang, and Qi Liu (2023–now).
His research interests span various database topics; he currently focuses on large language models (LLMs) for database performance optimization (LLM4DB).
Previously, he worked as a kernel R&D engineer on the VMware Greenplum database team, focusing on nested transactions and query optimization.
He has published several papers at top-tier international database conferences, such as SIGMOD, PVLDB, and ICDE.
High-reliability distributed coordination services have become an indispensable part of modern large-scale distributed systems. Popular coordination services (e.g., ZooKeeper) adopt a single-writer design to provide a centralized service for managing system metadata, including various configuration information and data catalogs, and to provide distributed synchronization functions. With the continuous growth of metadata size and the scale of distributed systems, these coordination services gradually become performance bottlenecks due to their limitations in capacity, read and write performance, and scalability. To bridge this gap, we propose FDBKeeper, a novel solution that enables scalable coordination services on distributed ACID key-value database systems. Our motivation is that transactional key-value stores (e.g., FoundationDB) meet the performance and scalability demands that large-scale distributed systems place on coordination services. To leverage these advantages, coordination services can be implemented as an upper layer on top of distributed ACID key-value databases. Our experimental results demonstrate that FDBKeeper significantly outperforms ZooKeeper across key metrics. Additionally, FDBKeeper reduces hardware resource costs by 33% on average in the production environment, resulting in substantial monetary cost savings. We have successfully replaced ZooKeeper with FDBKeeper in production-grade ClickHouse cluster deployments.
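The core idea — mapping a ZooKeeper-style znode tree onto an ordered, transactional key space so that listing children becomes a prefix range scan — can be sketched roughly as follows. This is a toy illustration only: the class and method names are hypothetical, and an in-memory sorted list stands in for a transactional KV store such as FoundationDB.

```python
from bisect import bisect_left, insort

class KVZnodeStore:
    """Toy znode tree over a sorted key space (a stand-in for a
    transactional key-value store such as FoundationDB)."""

    def __init__(self):
        self._keys = []   # sorted list of znode paths (the key space)
        self._data = {}   # path -> value

    def create(self, path, value=b""):
        if path in self._data:
            raise KeyError(f"znode exists: {path}")
        insort(self._keys, path)
        self._data[path] = value

    def get(self, path):
        return self._data[path]

    def children(self, path):
        # Child listing = range scan over the parent's key prefix.
        prefix = path.rstrip("/") + "/"
        i = bisect_left(self._keys, prefix)
        out = []
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            rest = self._keys[i][len(prefix):]
            if "/" not in rest:   # keep direct children only
                out.append(rest)
            i += 1
        return out

store = KVZnodeStore()
store.create("/app")
store.create("/app/conf", b"v1")
store.create("/app/lock")
store.create("/app/conf/sub")
print(store.children("/app"))   # → ['conf', 'lock']
```

In a real deployment, `create` and `children` would each run inside one ACID transaction, which is what lets the KV layer provide the atomicity that ZooKeeper's single-writer design otherwise enforces.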
Exploratory data analysis (EDA), coupled with SQL, is essential for data analysts involved in data exploration and analysis. However, data analysts often encounter two primary challenges: (1) the need to craft SQL queries skillfully, and (2) the requirement to generate suitable visualization types that enhance the interpretation of query results. Due to its significance, substantial research effort has been devoted to exploring different approaches to address these challenges, including leveraging large language models (LLMs). However, existing methods fail to meet real-world data exploration requirements, primarily due to (1) complex database schemas; (2) unclear user intent; (3) limited cross-domain generalization capability; and (4) insufficient end-to-end text-to-visualization capability. This paper presents TiInsight, an automated SQL-based cross-domain exploratory data analysis system. First, we propose the hierarchical data context (HDC), which leverages LLMs to summarize the contexts related to the database schema; this is crucial for open-world EDA systems to generalize across data domains. Second, the EDA system is divided into four components (i.e., stages): HDC generation, question clarification and decomposition, text-to-SQL generation (i.e., TiSQL), and data visualization (i.e., TiChart). Finally, we implemented an end-to-end EDA system with a user-friendly GUI in the production environment at PingCAP. We have also open-sourced all APIs of TiInsight to facilitate research within the EDA community. Through an extensive real-world user study, we demonstrate that TiInsight offers remarkable performance compared to human experts. Additionally, TiSQL achieves an execution accuracy of 86.3% on the Spider dataset when using GPT-4, and 60.98% on the Bird dataset.
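The four stages above chain naturally into a pipeline. The following is a minimal, hypothetical sketch of that staging; the prompts and the `llm` callable are illustrative stand-ins, not TiInsight's actual interfaces.

```python
def eda_pipeline(question, db_schema, llm):
    """Chain the four stages described above. `llm` is any callable
    (task, payload) -> result; all stage prompts are hypothetical."""
    hdc = llm("summarize schema", db_schema)            # stage 1: HDC generation
    clarified = llm("clarify question", (question, hdc))  # stage 2: clarification
    sql = llm("text-to-SQL", (clarified, hdc))          # stage 3: TiSQL-like step
    chart = llm("pick chart", sql)                      # stage 4: TiChart-like step
    return {"sql": sql, "chart": chart}

# Toy LLM stand-in that just tags each stage's output.
llm = lambda task, payload: f"{task}-done"
out = eda_pipeline("top sales?", "schema", llm)
print(out)
```

The point of the staging is that each step consumes the HDC summary rather than the raw schema, which is what lets the pipeline generalize across unfamiliar databases.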
@article{zhu2024towards,
title={Towards Automated Cross-domain Exploratory Data Analysis through Large Language Models},
author={Zhu, Jun-Peng and Niu, Boyan and Cai, Peng and Ni, Zheming and Wan, Jianwei and Xu, Kai and Huang, Jiajun and Ma, Shengbo and Wang, Bing and Zhou, Xuan and others},
journal={arXiv preprint arXiv:2412.07214},
year={2024}
}
Distributed databases are widely used in various fields, such as financial services and e-commerce. These businesses generally exhibit characteristics of large scale and rapid growth. However, these business systems often suffer from deadlocks that prevent them from operating normally for extended periods. Traditional deadlock detection methods face challenges in scalability and efficiency, especially as the number of nodes increases. Therefore, deadlock detection has long been an active research area in distributed databases. In this paper, we introduce an efficient deadlock detection algorithm called HAWK, which leverages a Hierarchical Approach based on Workload modeling. Our algorithm addresses these issues by constructing a dynamic hierarchical detection tree that adapts to transaction patterns, significantly reducing time complexity and communication overhead. HAWK first models the workload and generates a predicted access graph (PAG), transforming the task-partitioning problem of basic hierarchical detection into partitioning the PAG into detection zones (DZs) via a graph-cutting algorithm. Then, leveraging the properties of strongly connected components (SCCs) and deadlock cycles, the SCC-cut algorithm naturally partitions system-wide deadlock detection into multiple non-intersecting detection zones, thereby enhancing detection efficiency. We use a greedy SCC-cut algorithm to perform more fine-grained partitioning of complex PAGs. Finally, by periodically sampling and updating the hierarchical structure, the algorithm remains responsive to dynamic workload variations, ensuring efficient detection. Our approach outperforms both centralized and distributed methods, offering a more efficient and adaptive solution. Extensive experimental results demonstrate the effectiveness of the HAWK algorithm, showing significant reductions in deadlock duration and improved system throughput.
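The SCC property the abstract relies on — every deadlock cycle in a wait-for graph lies entirely inside one strongly connected component, so detection only needs to inspect non-trivial SCCs — can be illustrated with a standard Tarjan decomposition. This is a generic illustration of that property, not the paper's SCC-cut algorithm.

```python
def tarjan_scc(graph):
    """Tarjan's algorithm: return the strongly connected components
    of a directed wait-for graph given as {node: [successors]}."""
    index, low, on_stack = {}, {}, set()
    stack, sccs, counter = [], [], [0]

    def strongconnect(v):
        index[v] = low[v] = counter[0]; counter[0] += 1
        stack.append(v); on_stack.add(v)
        for w in graph.get(v, []):
            if w not in index:
                strongconnect(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v is the root of an SCC
            comp = []
            while True:
                w = stack.pop(); on_stack.discard(w); comp.append(w)
                if w == v:
                    break
            sccs.append(comp)

    for v in graph:
        if v not in index:
            strongconnect(v)
    return sccs

def deadlock_zones(graph):
    # Every deadlock cycle lies inside a single SCC, so only SCCs with
    # more than one transaction (or a self-wait) need to be checked.
    return [c for c in tarjan_scc(graph)
            if len(c) > 1 or c[0] in graph.get(c[0], [])]

wait_for = {"T1": ["T2"], "T2": ["T3"], "T3": ["T1"], "T4": ["T1"]}
print(deadlock_zones(wait_for))   # only the T1->T2->T3->T1 cycle survives
```

Note how T4 waits on T1 but is not part of any cycle, so its singleton SCC is discarded; this is exactly why SCC-based partitioning can shrink the search space for system-wide detection.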
Update-intensive workloads are prevalent in contemporary OLTP and AI/ML scenarios. An update operation typically involves deleting the old version of the target record and then inserting a new version. In this work, we demonstrate that an LSM-tree faces two issues when dealing with update-intensive workloads. First, deleted old versions are not promptly garbage collected until they merge with their new versions during compaction, which may lead to space waste and write amplification. Second, it is common for an update operation to modify only a small fraction of a data record, such as one of a hundred attributes. However, state-of-the-art LSM-trees fail to effectively utilize the incremental storage strategy, which stores only the updated fraction rather than the entire new version to enhance efficiency. In this paper, we propose two techniques, active and fast garbage collection and adaptive incremental updating, to address these issues, respectively. Active and fast garbage collection probes the distribution of invalid data versions in an LSM-tree and performs garbage collection more promptly. Adaptive incremental updating applies different storage modes to update operations to balance write and read amplification ratios as much as possible. Based on these techniques, we introduce SylphDB, implemented on the RocksDB codebase and optimized for update-intensive workloads. Experimental results demonstrate that, compared to traditional LSM-tree based systems, SylphDB can improve the efficiency of garbage collection by 2× and reduce write amplification by 20%.
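The incremental-updating trade-off — store only a delta when the changed fraction of a record is small, and reconstruct the record on read by replaying versions — can be sketched as follows. The 30% threshold and all names here are illustrative assumptions, not SylphDB's actual policy.

```python
FULL, DELTA = "full", "delta"
DELTA_THRESHOLD = 0.3   # illustrative: store a delta when fewer than
                        # 30% of the record's attributes change

def choose_mode(old, new):
    """Pick full-record vs incremental (delta) storage for an update."""
    changed = {k: v for k, v in new.items() if old.get(k) != v}
    frac = len(changed) / max(len(new), 1)
    return (DELTA, changed) if frac < DELTA_THRESHOLD else (FULL, new)

def reconstruct(versions):
    """Rebuild the current record by replaying versions in chronological
    order: a full version resets the state, a delta patches it."""
    record = {}
    for mode, payload in versions:
        if mode == FULL:
            record = dict(payload)
        else:
            record.update(payload)
    return record

old = {"id": 1, "name": "a", "score": 10, "city": "SH", "age": 20}
new = dict(old, score=11)                 # 1 of 5 attributes changed
mode, payload = choose_mode(old, new)     # small change -> delta mode
versions = [(FULL, old), (mode, payload)]
print(mode, reconstruct(versions))
```

Delta storage shrinks writes but lengthens reads (more versions to replay), which is why the mode has to be chosen adaptively per update rather than fixed globally.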
@inproceedings{zhu2025sylphdb,
title={SylphDB: An Active and Adaptive LSM Engine for Update-Intensive Workloads},
author={Zhu, Jun-Peng and Ye, Zhiwei and He, Xiaolong and Cai, Peng and Zhou, Xuan and Zhou, Aoying and Cai, Dunbo and Qian, Ling and Xu, Kai and Tang, Liu and others},
booktitle={2025 IEEE 41st International Conference on Data Engineering (ICDE)},
pages={4360--4372},
year={2025},
organization={IEEE Computer Society}
}
Automated tabular question answering (TQA) has attracted significant attention in the data analysis and natural language processing communities due to its powerful capabilities. The emergence of large language models (LLMs) has initiated a paradigm shift in this field. However, existing state-of-the-art approaches generally cannot operate on multiple tables from multiple heterogeneous systems, and their answer accuracy is insufficient to meet the demands of industrial applications. This paper presents UNITQA, a unified automated tabular question-answering system built on multi-agent LLMs. First, UNITQA offers a user-friendly GUI that enables users to execute TQA tasks with natural language questions. Second, UNITQA consists of five agents that collaborate to complete user-specified tasks. To efficiently orchestrate the agents, UNITQA utilizes a dynamic agent scheduling algorithm based on a finite-state machine. Third, UNITQA integrates a series of data connectors that allow it to access various tables from multiple heterogeneous systems. We have implemented and deployed UNITQA in numerous production environments and have demonstrated its usability and efficiency in representative real-world scenarios.
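The finite-state-machine scheduling pattern mentioned above can be sketched minimally as a transition table driving per-state handlers. The states, events, and transitions here are hypothetical, not UNITQA's actual machine.

```python
# Hypothetical states/events illustrating FSM-driven agent scheduling:
# each state is owned by one agent, and the (state, event) pair picks
# which agent runs next.
TRANSITIONS = {
    ("PLAN", "ok"): "EXECUTE",
    ("EXECUTE", "ok"): "REVIEW",
    ("EXECUTE", "error"): "PLAN",     # re-plan on failure
    ("REVIEW", "ok"): "DONE",
    ("REVIEW", "retry"): "EXECUTE",
}

def run_fsm(handlers, start="PLAN", max_steps=10):
    """Drive agents by state: each state's handler returns an event,
    and the (state, event) pair selects the next state."""
    state, trace = start, []
    for _ in range(max_steps):
        if state == "DONE":
            break
        event = handlers[state]()
        trace.append((state, event))
        state = TRANSITIONS[(state, event)]
    return state, trace

# Toy agents replaying a scripted sequence of events, including one
# failure that forces a re-plan and one review that forces a retry.
events = iter(["ok", "error", "ok", "ok", "retry", "ok", "ok"])
handlers = {s: (lambda: next(events)) for s in ["PLAN", "EXECUTE", "REVIEW"]}
final, trace = run_fsm(handlers)
print(final, trace)
```

Encoding the schedule as a transition table keeps agent orchestration declarative: adding an agent or a recovery path means adding rows, not rewriting control flow.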
@inproceedings{zhu2025unitqa,
title={UNITQA: A Unified Automated Tabular Question Answering System with Multi-Agent Large Language Models},
author={Zhu, Jun-Peng and Cai, Peng and Xu, Kai and Li, Li and Sun, Yishen and Zhou, Shuai and Su, Haihuang and Tang, Liu and Liu, Qi},
booktitle={Companion of the 2025 International Conference on Management of Data},
pages={279--282},
year={2025}
}
With the explosive growth of daily active users, the social graph data of Xiaohongshu has scaled to trillions of edges, imposing high pressure on our storage system. Current state-of-the-art systems struggle to address this issue, primarily because: (1) traditional relational databases as the back-end storage require frequent scaling, incurring high cost and stability risks; (2) most graph databases focus on complex multi-hop queries, and their redundant components make them inefficient for our workloads, which are dominated by one-hop queries; and (3) cache systems like Redis or Memcached often struggle to ensure consistency between the cache and storage. In this paper, we propose RedTAO, which has a scalable and efficient graph cache layer optimized for social scenarios. Over 90.7% of queries are served directly by the cache, enabling us to focus on scaling it as traffic increases. RedTAO employs a cross-cloud, multi-active deployment, synchronizing replicas through the storage layer. The cache layer directly accesses local storage, avoiding costly cross-region requests. Additionally, the data transmission service (DTS) component asynchronously corrects cache data, ensuring cache consistency. RedTAO has been successfully deployed at Xiaohongshu, achieving a 1.8X throughput improvement and at least a 21.3% reduction in resource usage compared to the previously used MySQL architecture.
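The asynchronous-correction idea — patching cached one-hop adjacency lists from the storage change log rather than invalidating them — might look like this toy sketch. All class and field names are hypothetical; a dict stands in for the storage layer.

```python
class GraphCache:
    """Toy one-hop adjacency cache in front of a storage layer, with an
    asynchronous corrector (standing in for a DTS-style component) that
    replays the storage change log to repair stale cached entries."""

    def __init__(self, storage):
        self.storage = storage          # node -> set of neighbor ids
        self.cache = {}
        self.hits = self.misses = 0

    def one_hop(self, node):
        if node in self.cache:
            self.hits += 1
        else:
            self.misses += 1            # read-through on a miss
            self.cache[node] = set(self.storage.get(node, set()))
        return self.cache[node]

    def apply_change_log(self, log):
        # Asynchronous correction: patch cached entries from the change
        # log instead of evicting them, so hot keys stay cached.
        for op, src, dst in log:
            if src in self.cache:
                (self.cache[src].add if op == "add" else
                 self.cache[src].discard)(dst)

storage = {"u1": {"u2", "u3"}}
cache = GraphCache(storage)
print(sorted(cache.one_hop("u1")))      # miss: loaded from storage
storage["u1"].add("u4")                 # write lands in storage first
cache.apply_change_log([("add", "u1", "u4")])
print(sorted(cache.one_hop("u1")))      # hit: entry already corrected
```

Correcting instead of invalidating keeps the hit rate high for hot social-graph keys, at the cost of needing a reliable, ordered change log from storage.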
@inproceedings{zhou2025redtao,
title={RedTAO: A Trillion-edge High-throughput Graph Store},
author={Zhou, Shihao and Mao, Qi and Cheng, Yi and Qi, Hongcheng and Huang, Yilun and Cai, Peng and Zhu, Jun-Peng},
booktitle={Companion of the 2025 International Conference on Management of Data},
pages={716--728},
year={2025}
}
With the growing significance of data analysis, several studies aim to provide precise answers to users' natural language questions over tables, a task referred to as tabular question answering (TQA). State-of-the-art TQA approaches are limited to handling only single-table questions. However, real-world TQA problems are inherently complex and frequently involve multiple tables, which poses challenges in directly extending single-table TQA designs to multiple tables, primarily due to the limited extensibility of most single-table TQA methods. This paper proposes AutoTQA, a novel Autonomous Tabular Question Answering framework that employs multi-agent large language models (LLMs) across multiple tables from various systems (e.g., TiDB, BigQuery). AutoTQA comprises five agents: the User, responsible for receiving the user's natural language inquiry; the Planner, tasked with creating an execution plan for the inquiry; the Engineer, responsible for executing the plan step by step; the Executor, which provides various execution environments (e.g., text-to-SQL) to fulfill specific tasks assigned by the Engineer; and the Critic, responsible for judging whether the user's inquiry has been completed and identifying gaps between the current results and the initial tasks. To facilitate the interaction between different agents, we have also devised agent scheduling algorithms. Furthermore, we have developed LinguFlow, an open-source, low-code visual programming tool, to quickly build and debug LLM-based applications and to accelerate the creation of various external tools and execution environments. We also implemented a series of data connectors, which allow AutoTQA to access various tables from multiple systems. Extensive experiments show that AutoTQA delivers outstanding performance on four representative datasets.
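The Planner/Engineer/Executor/Critic interaction described above is essentially a plan-execute-critique loop. A minimal skeleton of that loop follows; every function signature here is a hypothetical stand-in, not AutoTQA's API.

```python
def plan_execute_critique(plan_fn, execute_fn, critic_fn, question,
                          max_rounds=3):
    """Loop until the critic judges the inquiry complete or the round
    budget runs out. plan_fn = Planner, execute_fn = Engineer handing
    steps to an Executor, critic_fn = Critic."""
    results = []
    for _ in range(max_rounds):
        plan = plan_fn(question, results)        # Planner: (re)plan
        for step in plan:                        # Engineer: walk the plan
            results.append(execute_fn(step))     # Executor: run each step
        if critic_fn(question, results) == "complete":
            return results                       # Critic: done
    return results                               # budget exhausted

# Toy agents: the critic is satisfied only after three lookups, so the
# second round has to plan one more step to close the gap.
plan_fn = lambda q, r: ["lookup"] if r else ["lookup", "lookup"]
execute_fn = lambda step: f"result-{step}"
critic_fn = lambda q, r: "complete" if len(r) >= 3 else "incomplete"
out = plan_execute_critique(plan_fn, execute_fn, critic_fn, "q")
print(len(out))
```

The key design point is that the Critic's gap analysis feeds back into the next planning round, so partial results are refined rather than discarded.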
@article{zhu2024autotqa,
title={AutoTQA: Towards Autonomous Tabular Question Answering through Multi-Agent Large Language Models},
author={Zhu, Jun-Peng and Cai, Peng and Xu, Kai and Li, Li and Sun, Yishen and Zhou, Shuai and Su, Haihuang and Tang, Liu and Liu, Qi},
journal={Proceedings of the VLDB Endowment},
volume={17},
number={12},
pages={3920--3933},
year={2024},
publisher={VLDB Endowment}
}
Data analysts often encounter two primary challenges while conducting exploratory data analysis with SQL: (1) the need to skillfully craft SQL queries, and (2) the requirement to generate suitable visualizations that enhance the interpretation of query results. The emergence of large language models (LLMs) has inaugurated a paradigm shift in text-to-SQL and data-to-chart. This paper presents Chat2Query, an LLM-empowered zero-shot automatic exploratory data analysis system. First, Chat2Query provides a user-friendly interface that allows users to interact with the database directly in natural language. Second, Chat2Query offers an LLM-empowered text-to-SQL generator, SQL rewriter, SQL formatter, and data-to-chart generator. Third, Chat2Query is uniquely distinguished by its underlying integration with TiDB Serverless, fostering superior elasticity and scalability. This strategic integration empowers Chat2Query to seamlessly adapt to changing workloads, aligning with the evolving demands of users. We have implemented and deployed Chat2Query in the production environment and demonstrate its usability and efficiency in three representative real-world scenarios.
@inproceedings{zhu2024chat2query,
title={Chat2Query: A Zero-Shot Automatic Exploratory Data Analysis System with Large Language Models},
author={Zhu, Jun-Peng and Cai, Peng and Niu, Boyan and Ni, Zheming and Xu, Kai and Huang, Jiajun and Wan, Jianwei and Ma, Shengbo and Wang, Bing and Zhang, Donghui and others},
booktitle={2024 IEEE 40th International Conference on Data Engineering (ICDE)},
pages={5429--5432},
year={2024},
organization={IEEE}
}
As real-time analytics becomes increasingly important, more organizations are deploying Hybrid Transactional/Analytical Processing (HTAP) systems. HTAP systems based on a primary/backup replication architecture usually support real-time read-only queries on backup nodes over data recently generated by OLTP applications on the primary node. This work is based on the observation that real-time analytical applications often require access to only a fraction of the latest modifications from OLTP applications. However, state-of-the-art parallel log replay approaches treat all replicated transaction logs equally and replay them with the same priority, without taking the OLAP query access pattern into consideration. This design can result in increased response latency for real-time applications. This paper presents AETS, an Adaptive Epoch-based Two-Stage log replay framework that implements epoch-based log replay and table-group transaction visibility, while also taking full account of table access priority when replaying logs for real-time HTAP workloads. It aims to make the data required by analytical queries visible more quickly. Furthermore, AETS includes a two-phase parallel log replay algorithm called TPLR, which achieves lower overhead than state-of-the-art algorithms through careful design. We also offer an adaptive fine-grained thread resource allocation method that considers changes in table access patterns over time under thread resource constraints. Our experimental results show that AETS significantly reduces visibility delay for real-time queries and achieves significant replay throughput improvements.
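The two-stage priority idea — within an epoch, replay log records for tables that analytical queries currently read before the rest — can be sketched as follows. This is a simplified illustration of the prioritization only, not the paper's TPLR algorithm.

```python
from collections import defaultdict

def replay_epoch(log_records, hot_tables):
    """Replay one epoch in two stages: records touching 'hot' tables
    (those read by current analytical queries) first, so their data
    becomes visible sooner; everything else in a second stage."""
    stages = defaultdict(list)
    for rec in log_records:
        stage = 0 if rec["table"] in hot_tables else 1
        stages[stage].append(rec)
    order = []
    for stage in (0, 1):
        for rec in stages[stage]:
            order.append(rec["lsn"])   # a real system applies rec here
    return order

epoch = [
    {"lsn": 1, "table": "orders"},
    {"lsn": 2, "table": "audit_log"},
    {"lsn": 3, "table": "orders"},
    {"lsn": 4, "table": "sessions"},
]
print(replay_epoch(epoch, hot_tables={"orders"}))   # → [1, 3, 2, 4]
```

Epoch boundaries are what make this reordering safe: visibility advances a whole epoch at a time, so reordering replay within one epoch cannot expose a partially replayed transaction.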
@inproceedings{zhu2024log,
title={Log Replaying for Real-Time HTAP: An Adaptive Epoch-Based Two-Stage Framework},
author={Zhu, Jun-Peng and Ye, Zhiwei and Cai, Peng and Wang, Donghui and Zhang, Fengyan and Cai, Dunbo and Qian, Ling},
booktitle={2024 IEEE 40th International Conference on Data Engineering (ICDE)},
pages={2096--2108},
year={2024},
organization={IEEE}
}
CCF-A
Services
Conference Reviewers
International Conference on Database Systems for Advanced Applications (DASFAA 2024/2025)
ACM International Conference on Information and Knowledge Management (CIKM 2024)
Journal Reviewers
IEEE Transactions on Knowledge and Data Engineering (TKDE)
Experience
Database R&D Engineer
PingCAP, TiDB Cloud Platform Team
Database Kernel R&D Engineer
VMware, Greenplum Database Team
Honors and Awards
Ph.D. National Scholarship (国家奖学金), 2024
"Hack Split Insert for Greenplum" won the bronze medal at the VMware Global Hackathon, 2022