USTC Advanced Computer Networks Course Homepage
Course Information
- Course: Advanced Computer Networks (COMP6103P.01)
- Time and location: weeks 2-20, GT-B212, slot 2(11,12,13) (Tuesday, periods 11-13)
- Instructor: 赵功名 (gmzhao at ustc dot edu dot cn)
- Teaching assistants (email: the username in parentheses at mail dot ustc dot edu dot cn)
  - 朱家成 (zhu_jc)
  - 邓立鑫 (denglx)
  - 田佳林 (jltian)
  - 李紫惠 (lizihui2002)
Final Exam Format
The final exam will be held on January 13, 2026, 19:00-21:00, in G3-110. It is fully open-book.
- Allowed: paper materials, electronic devices, search engines, AIGC tools
- Not allowed: communication software, taking photos
Update: Online Paper Presentation and Course Essay / Paper Reading Notes Submission
Please read the requirements below and submit on time.
Online Paper Presentation Submission
Since in-person presentation slots are limited, teams (1-3 students) who signed up to share a paper via a recorded video should send the recording and the corresponding paper PDF to netclass2025@163.com. Requirements:
- The recording must be an mp4 file (Tencent Meeting's recording feature is recommended), ideally 30-40 minutes long, named "paper title-student ID+name-student ID+name";
- Attach the paper PDF, named the same way: "paper title-student ID+name-student ID+name";
- The email subject should be "高网论文分享-student ID+name-student ID+name" (e.g., 高网论文分享-SA25011001张三-SA25011002李四);
- Deadline: 24:00 on January 13, 2026.
As stated in the first lecture, this is optional extra credit, not a required deliverable; you may choose not to submit it.
Course Essay / Paper Reading Notes Submission
By 24:00 on January 15, please email either a course essay or reading notes for at least 5 papers (choose one of the two). Submit a single PDF named "SA25011001-张三-高网课程作业" (student ID-name-course assignment) to acncoursework@163.com.
After submitting, you should receive an automatic reply confirming that your email was received. If sending fails or no automatic reply arrives, please contact us as soon as possible; the automatic reply is your key proof of successful submission.
If you choose to submit paper reading notes, we recommend structuring each note around the following framework (every paper's notes should answer all six questions):
- What problem currently exists?
- Why can't existing approaches solve it?
- What idea does this paper use to address the problem?
- What challenges does that idea face?
- How does the paper overcome those challenges?
- How do the experiments demonstrate the advantages of the proposed approach?
In addition, we suggest appending an appendix to each note: summarize each paragraph of the paper's first chapter (Introduction) in 1-2 sentences, to deepen your understanding of the Introduction's writing logic and structure. Each paper's notes should run roughly 600-1000 characters, and every paper must be network-related.
If you choose to submit an essay, it must be network-related; it may be either a research paper or a survey, written in standard paper format, in Chinese or English. Plagiarism is strictly prohibited.
The course essay / reading notes account for 40% of the final grade.
Course Schedule
Reference List of Papers for Presentation
| No. | Paper Title | Venue | Year |
|---|---|---|---|
| 1 (taken) | Cassini: Network-Aware Job Scheduling in Machine Learning Clusters | NSDI | 2024 |
| 2 | Better Together: Jointly Optimizing ML Collective Scheduling and Execution Planning using Syndicate | NSDI | 2023 |
| 3 (taken) | RDMA over Ethernet for Distributed AI Training at Meta Scale | SIGCOMM | 2024 |
| 4 (taken) | Crux: GPU-Efficient Communication Scheduling for Deep Learning Training | SIGCOMM | 2024 |
| 5 (taken) | Alibaba HPN: A Data Center Network for Large Language Model Training | SIGCOMM | 2024 |
| 6 (taken) | Mooncake: Trading More Storage for Less Computation — A KVCache-centric Architecture for Serving LLM Chatbot | FAST | 2025 |
| 7 | SimAI: Unifying Architecture Design and Performance Tuning for Large-Scale Large Language Model Training with Scalability and Precision | NSDI | 2025 |
| 8 (taken) | Efficient Memory Management for Large Language Model Serving with PagedAttention | SOSP | 2023 |
| 9 (taken) | ReCycle: Resilient Training of Large DNNs using Pipeline Adaptation | SOSP | 2024 |
| 10 | Themis: A Network Bandwidth-Aware Collective Scheduling Policy for Distributed Training of DL Models | ISCA | 2022 |
| 11 (taken) | Gemini: Fast Failure Recovery in Distributed Training with In-Memory Checkpoint | SOSP | 2023 |
| 12 (taken) | Janus: A Unified Distributed Training Framework for Sparse Mixture-of-Experts Models | SIGCOMM | 2023 |
| 13 (taken) | Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention | ATC | 2024 |
| 14 | Towards Domain-Specific Network Transport for Distributed DNN Training | NSDI | 2024 |
| 15 (taken) | Differential Network Analysis (DNA) | NSDI | 2022 |
| 16 | Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs | NSDI | 2023 |
| 17 | One-Size-Fits-None: Understanding and Enhancing Slow-Fault Tolerance in Modern Distributed Systems | NSDI | 2025 |
| 18 | Starvation in End-to-End Congestion Control | SIGCOMM | 2022 |
| 19 | DUNE: Distributed Inference in the User Plane | INFOCOM | 2025 |
| 20 | Nezha: SmartNIC-based Virtual Switch Load Sharing | SIGCOMM | 2025 |
| 21 (taken) | Fast Algorithms for Loop-Free Network Updates using Linear Programming and Local Search | INFOCOM | 2024 |
| 22 | Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning | ASPLOS | 2024 |
| 23 | DREAM: A Dynamic Scheduler for Dynamic Real-time Multi-model ML Workloads | ASPLOS | 2023 |
| 24 (taken) | MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs | NSDI | 2024 |
| 25 | TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs | NSDI | 2023 |
| 26 | White-Boxing RDMA with Packet-Granular Software Control | NSDI | 2025 |
| 27 | Unlocking ECMP Programmability for Precise Traffic Control | NSDI | 2025 |
| 28 (taken) | Autellix: An Efficient Serving Engine for LLM Agents as General Programs | NSDI | 2026 |
| 29 | Load Balancing With Multi-Level Signals for Lossless Datacenter Networks | ToN | 2024 |
| 30 (taken) | Swing: Short-cutting Rings for Higher Bandwidth Allreduce | NSDI | 2024 |
| 31 | Rethinking Machine Learning Collective Communication as a Multi-Commodity Flow Problem | SIGCOMM | 2024 |
| 32 | MCCS: A Service-based Approach to Collective Communication for Multi-Tenant Cloud | SIGCOMM | 2024 |
| 33 | Whale: Efficient Giant Model Training over Heterogeneous GPUs | ATC | 2022 |
| 34 (taken) | Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve | OSDI | 2024 |
| 35 (taken) | ServerlessLLM: Low-Latency Serverless Inference for Large Language Models | OSDI | 2024 |
| 36 | Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances | NSDI | 2024 |
| 37 (taken) | CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving | SIGCOMM | 2024 |
Student Presentation Schedule
| Time | Presenters | Paper |
|---|---|---|
| Week 3, 2025.9.23 | 唐梓皓 魏清扬 | Crux: GPU-Efficient Communication Scheduling for Deep Learning Training |
| Week 4, 2025.9.30 | 沈嘉玮 李宇航 | Alibaba HPN: A Data Center Network for Large Language Model Training |
| Week 6, 2025.10.14 | 李宇哲 刘睿博 周瓯翔 | AutoCCL: Automated Collective Communication Tuning for Accelerating Distributed and Parallel DNN Training |
| Week 7, 2025.10.21 | 周晖林 刘国柱 杨敏 | Efficient Memory Management for Large Language Model Serving with PagedAttention |
| Week 8, 2025.10.28 | 胡潇逸 赵英豪 朱炜荣 | Swing: Short-cutting Rings for Higher Bandwidth Allreduce |
| Week 9, 2025.11.4 | 陈润佳 刘盈睿 刘珈辰 | Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention |
| Week 10, 2025.11.11 | 隋翊 张艾媛 边锋 | Mooncake: Trading More Storage for Less Computation — A KVCache-centric Architecture for Serving LLM Chatbot |
| Week 11, 2025.11.18 | 梅陶然 史弘佐 黄万超 | Janus: A Unified Distributed Training Framework for Sparse Mixture-of-Experts Models |
| Week 12, 2025.11.25 | 刘亚国 张桦坚 王道宇 | Autellix: An Efficient Serving Engine for LLM Agents as General Programs |
| Week 13, 2025.12.02 | 吴晓春 汪延 时锐 | RDMA over Ethernet for Distributed AI Training at Meta Scale |
| Week 14, 2025.12.09 | 马嘉慧 靳琪 | Differential Network Analysis (DNA) |
| Week 15, 2025.12.16 | 范冰 胡恒瑞 陈智慧 | CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving |
| | 张宇洋 宋子阳 沈楠 | How to Disturb Network Reconnaissance: A Moving Target Defense Approach Based on Deep Reinforcement Learning |
| | 陈世初 张恒基 方驰正 | Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve |
| Week 16, 2025.12.23 | 杜柏言 吴风帆 林文浩 | MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs |
| | 黄佳依 赵涵 苗明鑫 | ReCycle: Resilient Training of Large DNNs using Pipeline Adaptation |
| | 李岱霖 刘翔宇 鉏博洋 | Gemini: Fast Failure Recovery in Distributed Training with In-Memory Checkpoint |
| Week 17, 2025.12.30 | 陈子阳 丁则文 侯世卓 | ServerlessLLM: Low-Latency Serverless Inference for Large Language Models |
| | 江伟怡 胡梦婷 蒋雨含 | CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters |
| | 王若言 刘丰毅 | Fast Algorithms for Loop-Free Network Updates using Linear Programming and Local Search |