[QUESTION] May I ask what tool was used to plot Figure 6 in the paper? How can I profile bubble time in pipeline parallelism? #18
Comments
We need to write some scripts for that manually. I will put our plotting scripts in this repo later. What we want here is to find all the CUDA kernels for each F/B/W/Optimizer so that we can get its GPU start/end time. The main workflow is:
The 2nd is the most tricky part. I will break them down one by one.

(1) Profiling Using nsys

Technically, there are 3 kinds of events involved:

In our code, when profiling of Megatron is turned on, So first thing first, launch Megatron to do training and get the profiling data using

    nsys profile -s none -t nvtx,cuda <Command to run megatron like "torch run">

You may want to add more options to nsys like the following depending on your usage:

Now you get the

    nsys export --type sqlite --output <sqlite file> <nsys-rep file>

(2) Event Reconstruction from Sqlite

Now the heavy lifting part. There are 3 tables corresponding to the 3 kinds of events:

You can use What we are basically going to do is to use SQL and script codes to reconstruct the mapping between:
Linking CPU CUDA Calls to GPU kernels

This is pretty straightforward, there’s a But there are still some details to take note of.

    SELECT runtime.correlationId, kernel.start, kernel.end, runtime.start, runtime.end,
           kernel.deviceId, kernel.shortName, runtime.nameId, runtime.globalTid
    FROM CUPTI_ACTIVITY_KIND_KERNEL AS kernel, CUPTI_ACTIVITY_KIND_RUNTIME AS runtime
    WHERE runtime.correlationId = kernel.correlationId
      AND kernel.globalPid / 0x1000000 % 0x1000000 = runtime.globalTid / 0x1000000 % 0x1000000
      AND runtime.globalTid IN (
        SELECT DISTINCT globalTid FROM NVTX_EVENTS
        WHERE text LIKE 'F%' OR text LIKE 'B%' OR text LIKE 'W%' OR text = 'Optimizer'
      );

Firstly, different processes may generate the same

Secondly, we are only interested in the threads where our NVTX events happen, so we need to filter them by globalTid.

Thirdly, kernel.shortName and runtime.nameId are just ids, not strings. We need to look up the strings in table StringIds by these ids. I prefer to do this in the script, not in SQL.

Linking NVTX Events to CPU CUDA Calls

Now comes the hard part. There’s no direct connection between NVTX events and CUDA calls. The only thing we can rely on is the start/end time of NVTX events and CUDA calls. A CUDA call belongs to an NVTX event only if the call happens inside the event in the same thread. More accurately, since NVTX events can be stacked, we normally only care about the stack-top NVTX event. For example, if we have 3 ranges A, B, and C in the same thread:
AAAAAAAAAAAAAAAAAAAA
BBBBBBBBBBBB CCC
CUDA call
Time --->

Range B is likely the F/B/W NVTX event we’re looking for. The SQL to pull out all the related NVTX events would be:

    SELECT start, end, globalTid, text FROM NVTX_EVENTS
    WHERE eventType = 59 AND
      (text LIKE 'F%' OR text LIKE 'B%' OR text LIKE 'W%' OR text = 'Optimizer')

The event type 59 here is

Then we need to write a script to match the NVTX events and CUDA API calls by checking the duration range and thread id. Since the data is large in many cases, you may need to optimize the matching algorithm a bit using binary search or other methods.

Now you know, for each NVTX event (F/B/W/Optimizer in our project), which CUDA kernels belong to it. We can easily get the start/end time of this event from the start time of its first kernel and the end time of its last kernel. We can also get the deviceId from table CUPTI_ACTIVITY_KIND_KERNEL.

I would like to thank Jay and Holly from Nvidia Support for helping us figure all this out.

(3) Plotting

We use Python and lib If you are running multiple servers, you will need to combine the sqlites from each server together.
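The matching step described under "Linking NVTX Events to CPU CUDA Calls" could be sketched in Python roughly as follows. This is not the authors' script, only a minimal illustration of the binary-search idea. The function name `match_kernels_to_ranges` and the tuple layouts are assumptions made for this example: rows are assumed to come from the two SQL queries above, reshaped into `(start, end, globalTid, text)` for NVTX ranges and `(call_start, call_end, globalTid, kernel_start, kernel_end)` for linked CUDA calls. It also assumes the NVTX ranges in one thread are properly nested.

```python
from bisect import bisect_right
from collections import defaultdict


def match_kernels_to_ranges(nvtx_ranges, linked_calls):
    """Return {range_index: (first_kernel_start, last_kernel_end)}.

    nvtx_ranges:  list of (start, end, global_tid, text)            (assumed layout)
    linked_calls: list of (call_start, call_end, global_tid,
                           kernel_start, kernel_end)                (assumed layout)
    """
    # Group ranges per thread and sort them by start time.
    by_tid = defaultdict(list)
    for idx, (start, end, tid, _text) in enumerate(nvtx_ranges):
        by_tid[tid].append((start, end, idx))
    starts = {}
    for tid, items in by_tid.items():
        items.sort()
        starts[tid] = [s for s, _e, _i in items]

    spans = {}
    for call_start, call_end, tid, k_start, k_end in linked_calls:
        items = by_tid.get(tid)
        if not items:
            continue  # no NVTX range of interest in this thread
        # Last range that starts at or before the call.
        pos = bisect_right(starts[tid], call_start) - 1
        # Walk back to the innermost (stack-top) range that still covers the call.
        while pos >= 0:
            _r_start, r_end, idx = items[pos]
            if r_end >= call_end:
                lo, hi = spans.get(idx, (k_start, k_end))
                spans[idx] = (min(lo, k_start), max(hi, k_end))
                break
            pos -= 1
    return spans
```

With the A/B/C example above, a call issued inside B is attributed to B rather than A, because B is the latest-starting range that still encloses the call. The resulting span per range is the GPU start/end time of that F/B/W/Optimizer event.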
I ran it with the following command
However, the following error occurs
Is there a problem with my execution?
Seems like the events that the script depends on are missing.
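One way to check this before running any plotting script is to confirm that the exported sqlite file contains the three tables used in the queries above. The sketch below is only an illustration: the helper name `missing_tables` is hypothetical, and it checks for the presence of the tables, not for whether the F/B/W/Optimizer NVTX ranges were actually recorded.

```python
import sqlite3

# Table names taken from the SQL queries earlier in this thread.
REQUIRED_TABLES = (
    "NVTX_EVENTS",
    "CUPTI_ACTIVITY_KIND_KERNEL",
    "CUPTI_ACTIVITY_KIND_RUNTIME",
)


def missing_tables(conn):
    """Return the required tables that are absent from the opened sqlite file."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    present = {name for (name,) in rows}
    return [t for t in REQUIRED_TABLES if t not in present]
```

For example, `missing_tables(sqlite3.connect("<sqlite file>"))` returning a non-empty list would suggest the profile was captured without the corresponding trace option (for instance without `-t nvtx,cuda`).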
Your question
How can I profile bubble time in pipeline parallelism?