
USTC Advanced Computer Networks Course Homepage

Course Information

  • Course title: Advanced Computer Networks (COMP6103P.01)
  • Time & location: Weeks 2-20, GT-B212: 2(11,12,13)
  • Instructor: 赵功名 (gmzhao at ustc dot edu dot cn)
  • Teaching assistants (email: the username in parentheses at mail dot ustc dot edu dot cn)
    • 朱家成 (zhu_jc)
    • 邓立鑫 (denglx)
    • 田佳林 (jltian)
    • 李紫惠 (lizihui2002)

Final Exam Format

The final exam will be held on January 13, 2026, 19:00-21:00, in G3-110. The exam is fully open-book.

  • Allowed: printed materials, electronic devices, search engines, AIGC tools
  • Not allowed: messaging/communication software, taking photos

Update: Submission of Online Paper Presentations and the Course Paper / Paper Reading Notes

Please read the requirements below and submit on time.

Online Paper Presentation Submission

Since in-class presentation slots are limited, teams (1-3 members) that have signed up to present a paper via a recorded video should send the recording and the corresponding paper PDF to netclass2025@163.com. The requirements are as follows:

  1. The recording must be in mp4 format (Tencent Meeting screen recording is recommended), ideally 30-40 minutes long, and named PaperTitle-StudentIDName-StudentIDName;
  2. Attach the paper PDF as well, also named PaperTitle-StudentIDName-StudentIDName;
  3. The email subject should be 高网论文分享-StudentIDName-StudentIDName (e.g. 高网论文分享-SA25011001张三-SA25011002李四);
  4. The deadline is 24:00 on January 13, 2026.

As explained in the first lecture, this is optional extra credit rather than a required item; you may choose not to submit it.

Course Paper / Paper Reading Notes Submission

Before 24:00 on January 15, please submit by email either a course paper or reading notes for at least 5 papers (choose one of the two). Submit a single PDF named "SA25011001-张三-高网课程作业" (StudentID-Name-高网课程作业). Send it to: acncoursework@163.com

After submitting, you should receive an automatic reply confirming that your email was received. If the email fails to send or you do not receive the auto-reply, please contact us as soon as possible. The auto-reply is your key proof that the submission succeeded.
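
If you prefer to script the submission, here is a minimal sketch (Python, standard library only) that attaches the PDF under the required file name and mails it to the address above. It is only an illustration under assumptions: the SMTP host, port, sender address, and password are placeholders for your own mail provider's settings, and since this announcement does not prescribe a subject line, the sketch simply reuses the file name.

    # Minimal sketch: email the coursework PDF to the TA mailbox (placeholders marked below).
    import smtplib
    from email.message import EmailMessage
    from pathlib import Path

    SMTP_HOST = "smtp.example.edu"            # placeholder: your mail provider's SMTP server
    SMTP_PORT = 465                           # placeholder: SSL port, adjust for your provider
    SENDER = "your_id@mail.ustc.edu.cn"       # placeholder: your own sender address
    PASSWORD = "your-password-or-app-token"   # placeholder credential

    # Required file-name pattern from the announcement, e.g. "SA25011001-张三-高网课程作业.pdf".
    pdf_path = Path("SA25011001-张三-高网课程作业.pdf")

    msg = EmailMessage()
    msg["From"] = SENDER
    msg["To"] = "acncoursework@163.com"       # submission address from the announcement
    msg["Subject"] = pdf_path.stem            # assumption: subject not prescribed, so reuse the file name
    msg.set_content("Coursework submission; the PDF is attached.")

    # Attach the PDF under the required file name.
    msg.add_attachment(pdf_path.read_bytes(),
                       maintype="application", subtype="pdf",
                       filename=pdf_path.name)

    with smtplib.SMTP_SSL(SMTP_HOST, SMTP_PORT) as server:
        server.login(SENDER, PASSWORD)
        server.send_message(msg)

    print("Sent. Keep the auto-reply as proof of successful submission.")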

  • If you choose to submit paper reading notes, we suggest following the logical framework below (the notes for each paper should answer all 6 questions):

    1. What problem currently exists?
    2. Why can existing solutions not solve it?
    3. What approach does this paper take to solve it?
    4. What challenges does that approach face?
    5. How does the paper overcome those challenges?
    6. How do the experiments demonstrate the advantages of the proposed solution?

    In addition, we suggest attaching a separate "appendix" at the end of each paper's notes: summarize each paragraph of the paper's first chapter (Introduction) in 1-2 sentences, to deepen your understanding of the chapter's writing logic and structure. The reading notes for each paper should be roughly 600-1000 characters, and the papers must be network-related.

  • If you choose to submit a course paper, it must be network-related; it may be either a research paper or a survey, written in standard paper format, in Chinese or English. Plagiarism is strictly prohibited.

The course paper / paper reading notes account for 40% of the final grade.

Course Schedule

Session Topic Paper Slides
1 DCN A Scalable, Commodity Data Center Network Architecture FatTree
2 DCN VL2: A scalable and flexible data center network VL2
3 SDN OpenFlow: Enabling Innovation in Campus Networks OpenFlow
4 SDN B4: Experience with a Globally-Deployed Software Defined WAN B4
5 DCN Hedera: Dynamic Flow Scheduling for Data Center Networks Hedera
6 SDN Dynamic Scheduling of Network Updates Dionysus
7 SDN SIMPLE-fying Middlebox Policy Enforcement Using SDN SIMPLE
8 SDN ClickOS and the Art of Network Function Virtualization ClickOS
9 SDN P4: Programming Protocol-Independent Packet Processors P4
10 SDN The Design and Implementation of Open vSwitch OVS
11 Protocol Design, implementation and evaluation of congestion control for multipath TCP MPTCP
12 Protocol Congestion Control for Large-Scale RDMA Deployments DCQCN
13 Protocol The QUIC Transport Protocol: Design and Internet-Scale Deployment QUIC

Reference List of Papers for Presentation

No. Paper Title Venue Year
1 (taken) Cassini: Network-Aware Job Scheduling in Machine Learning Clusters NSDI 2024
2 Better Together: Jointly Optimizing ML Collective Scheduling and Execution Planning using Syndicate NSDI 2023
3 (taken) RDMA over Ethernet for Distributed AI Training at Meta Scale NSDI 2024
4 (taken) Crux: GPU-Efficient Communication Scheduling for Deep Learning Training SIGCOMM 2024
5 (taken) Alibaba HPN: A Data Center Network for Large Language Model Training SIGCOMM 2024
6 (taken) Mooncake: Trading More Storage for Less Computation — A KVCache-centric Architecture for Serving LLM Chatbot FAST 2025
7 SimAI: Unifying Architecture Design and Performance Tuning for Large-Scale Large Language Model Training with Scalability and Precision NSDI 2025
8 (taken) Efficient Memory Management for Large Language Model Serving with PagedAttention SOSP 2023
9 (taken) ReCycle: Resilient Training of Large DNNs using Pipeline Adaptation SOSP 2023
10 Themis: A Network Bandwidth-Aware Collective Scheduling Policy for Distributed Training of DL Models ISCA 2022
11 (taken) Gemini: Fast Failure Recovery in Distributed Training with In-Memory Checkpoint SOSP 2023
12 (taken) Janus: A Unified Distributed Training Framework for Sparse Mixture-of-Experts Models SIGCOMM 2023
13 (taken) Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention ATC 2024
14 Towards Domain-Specific Network Transport for Distributed DNN Training NSDI 2024
15 (taken) Differential Network Analysis (DNA) NSDI 2022
16 Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs NSDI 2023
17 One-Size-Fits-None: Understanding and Enhancing Slow-Fault Tolerance in Modern Distributed Systems NSDI 2025
18 Starvation in End-to-End Congestion Control SIGCOMM 2022
19 DUNE: Distributed Inference in the User Plane INFOCOM 2025
20 Nezha: SmartNIC-based Virtual Switch Load Sharing SIGCOMM 2025
21 (taken) Fast Algorithms for Loop-Free Network Updates using Linear Programming and Local Search INFOCOM 2024
22 Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning ASPLOS 2024
23 DREAM: A Dynamic Scheduler for Dynamic Real-time Multi-model ML Workloads ASPLOS 2023
24 (taken) MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs NSDI 2024
25 TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs NSDI 2023
26 White-Boxing RDMA with Packet-Granular Software Control NSDI 2025
27 Unlocking ECMP Programmability for Precise Traffic Control NSDI 2025
28 (taken) Autellix: An Efficient Serving Engine for LLM Agents as General Programs NSDI 2026
29 Load Balancing With Multi-Level Signals for Lossless Datacenter Networks ToN 2024
30 (taken) Swing: Short-cutting Rings for Higher Bandwidth Allreduce NSDI 2024
31 Rethinking Machine Learning Collective Communication as a Multi-Commodity Flow Problem SIGCOMM 2024
32 MCCS: A Service-based Approach to Collective Communication for Multi-Tenant Cloud SIGCOMM 2024
33 Whale: Efficient Giant Model Training over Heterogeneous GPUs ATC 2023
34 (taken) Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve OSDI 2024
35 (taken) ServerlessLLM: Low-Latency Serverless Inference for Large Language Models OSDI 2024
36 Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances NSDI 2024
37 (taken) CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving SIGCOMM 2024

Student Presentation Schedule

Date Presenters Paper
Week 3 2025.9.23 唐梓皓 魏清扬 Crux: GPU-Efficient Communication Scheduling for Deep Learning Training
Week 4 2025.9.30 沈嘉玮 李宇航 Alibaba HPN: A Data Center Network for Large Language Model Training
Week 6 2025.10.14 李宇哲 刘睿博 周瓯翔 AutoCCL: Automated Collective Communication Tuning for Accelerating Distributed and Parallel DNN Training
Week 7 2025.10.21 周晖林 刘国柱 杨敏 Efficient Memory Management for Large Language Model Serving with PagedAttention
Week 8 2025.10.28 胡潇逸 赵英豪 朱炜荣 Swing: Short-cutting Rings for Higher Bandwidth Allreduce
Week 9 2025.11.4 陈润佳 刘盈睿 刘珈辰 Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention
Week 10 2025.11.11 隋翊 张艾媛 边锋 Mooncake: Trading More Storage for Less Computation — A KVCache-centric Architecture for Serving LLM Chatbot
Week 11 2025.11.18 梅陶然 史弘佐 黄万超 Janus: A Unified Distributed Training Framework for Sparse Mixture-of-Experts Models
Week 12 2025.11.25 刘亚国 张桦坚 王道宇 Autellix: An Efficient Serving Engine for LLM Agents as General Programs
Week 13 2025.12.02 吴晓春 汪延 时锐 RDMA over Ethernet for Distributed AI Training at Meta Scale
Week 14 2025.12.09 马嘉慧 靳琪 Differential Network Analysis (DNA)
Week 15 2025.12.16 范冰 胡恒瑞 陈智慧 CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving
张宇洋 宋子阳 沈楠 How to Disturb Network Reconnaissance: A Moving Target Defense Approach Based on Deep Reinforcement Learning
陈世初 张恒基 方驰正 Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
Week 16 2025.12.23 杜柏言 吴风帆 林文浩 MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
黄佳依 赵涵 苗明鑫 ReCycle: Resilient Training of Large DNNs using Pipeline Adaptation
李岱霖 刘翔宇 鉏博洋 Gemini: Fast Failure Recovery in Distributed Training with In-Memory Checkpoint
Week 17 2025.12.30 陈子阳 丁则文 侯世卓 ServerlessLLM: Low-Latency Serverless Inference for Large Language Models
江伟怡 胡梦婷 蒋雨含 CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters
王若言 刘丰毅 Fast Algorithms for Loop-Free Network Updates using Linear Programming and Local Search