Add analysis tool for nsight reports #3428

Open
wants to merge 1 commit into
base: main

charleskawczynski
Member

This PR adds an analysis script for our Nsight reports. Example output:

[ Info: Statistics across 710021 total kernels
                                     Kernel duration percentage         
                             ┌                                        ┐ 
                 CUDA memcpy ┤ 9.36495e⁻⁵                               
              RRTMGP_col_gas ┤ 0.00117462                               
                        fill ┤ 0.0126449                                
                 CUDA memset ┤ 0.37228                                  
                         dss ┤■■ 1.64283                                
   single_field_solve_kernel ┤■■■■■■ 4.89241                            
        multiple_field_solve ┤■■■■■■■ 4.9641                            
             CuKernelContext ┤■■■■■■■■■■ 7.4615                         
                    spectral ┤■■■■■■■■■■■ 8.65283                       
                      copyto ┤■■■■■■■■■■■■■■■ 11.6266                   
               RRTMGP_RTE_sw ┤■■■■■■■■■■■■■■■■■■■■■ 15.7644             
               RRTMGP_RTE_lw ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 21.0091      
                     stencil ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 23.5999   
                             └                                        ┘ 
                                          Number of kernels             
                             ┌                                        ┐ 
                 CUDA memcpy ┤ 2                                        
              RRTMGP_col_gas ┤ 18                                       
               RRTMGP_RTE_lw ┤ 36                                       
               RRTMGP_RTE_sw ┤ 38                                       
                        fill ┤■ 6 516                                   
        multiple_field_solve ┤■ 8 640                                   
   single_field_solve_kernel ┤■ 17 280                                  
                 CUDA memset ┤■■ 18 720                                 
                         dss ┤■■■ 32 400                                
             CuKernelContext ┤■■■ 39 517                                
                    spectral ┤■■■■■ 62 640                              
                     stencil ┤■■■■■■■■■■■■ 146 970                      
                      copyto ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 377 244   
                             └                                        ┘ 
                                    Average kernel duration (μs)        
                             ┌                                        ┐ 
                        fill ┤ 4.53625                                  
                 CUDA memset ┤ 46.4866                                  
                      copyto ┤ 72.0436                                  
                 CUDA memcpy ┤ 109.456                                  
                         dss ┤ 118.525                                  
              RRTMGP_col_gas ┤ 152.542                                  
                    spectral ┤ 322.902                                  
                     stencil ┤ 375.359                                  
             CuKernelContext ┤ 441.373                                  
   single_field_solve_kernel ┤ 661.825                                  
        multiple_field_solve ┤ 1343.05                                  
               RRTMGP_RTE_sw ┤■■■■■■■■■■■■■■■■■■■■■ 969748.0            
               RRTMGP_RTE_lw ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 1.36417e⁶   
                             └                                        ┘ 
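
For readers who want to see how the three charts relate, here is a minimal Python sketch of the aggregation they imply: kernels are bucketed by name, and each bucket reports its share of total duration, its kernel count, and its mean duration. This is not the script added in this PR. The substring-based grouping rule, the `GROUPS` labels, the function names, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical group labels, matched as substrings of the kernel name.
GROUPS = ("stencil", "copyto", "spectral", "dss")


def group_of(kernel_name, groups=GROUPS):
    """Return the first group label contained in the kernel name, else 'other'."""
    for g in groups:
        if g in kernel_name:
            return g
    return "other"


def summarize(kernels, groups=GROUPS):
    """Aggregate (name, duration_us) pairs into per-group statistics.

    Returns {group: {"count": int, "percent": float, "mean_us": float}}.
    """
    total = defaultdict(float)  # summed duration per group, in microseconds
    count = defaultdict(int)    # number of kernel launches per group
    for name, duration_us in kernels:
        g = group_of(name, groups)
        total[g] += duration_us
        count[g] += 1
    grand_total = sum(total.values())
    return {
        g: {
            "count": count[g],
            "percent": 100.0 * total[g] / grand_total if grand_total else 0.0,
            "mean_us": total[g] / count[g],
        }
        for g in total
    }


if __name__ == "__main__":
    # Made-up sample records, not taken from any real report.
    sample = [
        ("copyto_kernel_a", 50.0),
        ("copyto_kernel_b", 100.0),
        ("stencil_kernel", 350.0),
    ]
    stats = summarize(sample)
    # Sort ascending by share of total duration, like the first chart.
    for g, s in sorted(stats.items(), key=lambda kv: kv[1]["percent"]):
        print(f"{g:>10}  {s['percent']:6.2f}%  n={s['count']}  mean={s['mean_us']:.1f} us")
```

The three quantities are tied together: a group's duration percentage is proportional to its kernel count times its average duration. This is why `RRTMGP_RTE_lw` reaches about 21% with only 36 kernels, while `copyto` needs 377 244 kernels to reach about 12%.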

charleskawczynski force-pushed the ck/nsight_post_process branch 7 times, most recently from c0dc87c to ec79427 (November 8, 2024 17:11)
@charleskawczynski
Member Author

Perhaps unsurprisingly, the diagnostic EDMF pointwise kernels are the dominant cost:

                                     Kernel duration percentage
                             ┌                                        ┐
                        fill ┤ 0.0991058                               
                 CUDA memset ┤ 0.109637                                
             bycolumn_reduce ┤ 0.180484                                
                   dss_local ┤ 0.262486                                
               dss_transform ┤ 0.274207                                
             dss_untransform ┤ 0.276479                                
        multiple_field_solve ┤■ 1.34337                                
   single_field_solve_kernel ┤■ 1.93316                                
             CuKernelContext ┤■ 2.79137                                
                    spectral ┤■ 3.51933                                
                     stencil ┤■■■■■■ 14.0858                           
                      copyto ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 75.1246  
                             └                                        ┘

charleskawczynski force-pushed the ck/nsight_post_process branch 4 times, most recently from 8410c59 to 6796eaa (November 12, 2024 20:52)
Try fixes

Try multiline

Add analysis to more jobs
@charleskawczynski
Member Author

Still need to fix the 139 errors.
