Optimising Firedrake Performance

“Premature optimisation is the root of all evil”

—Donald Knuth

Performance of a Firedrake script is rarely optimal from the outset. The choice of solver options, discretisation and variational form all affect how long your script takes to run. More general programming considerations, such as not repeating unnecessary work inside a loop, can also be significant.

Attempting to optimise your code without a solid understanding of where the bottlenecks are is always a bad idea: you could spend vast amounts of developer time for little to no improvement in performance. The best strategy is therefore to start at the highest level possible, with an overview of the entire problem, before drilling down into specific hotspots. To get this high-level understanding, we strongly recommend that you first profile your script using a flame graph (see below).

Automatic flame graph generation with PETSc

Flame graphs are a very useful entry point when trying to optimise your application since they make hotspots easy to find. PETSc can generate a flame graph input file using its logging infrastructure, which Firedrake has extended by annotating many of its own functions with PETSc events. This lets users easily generate informative flame graphs that give a lot of insight into the internals of Firedrake and PETSc.

As an example, here is a flame graph showing the performance of the scalar wave equation with higher-order mass lumping demo. It is interactive and you can zoom in on functions by clicking.

[Interactive flame graph of the demo. Total: 6,362,100 us. Largest frames (frames are nested, so percentages overlap): firedrake.output.File.write (3,109,187 us, 48.87%), firedrake.assemble.assemble (1,851,875 us, 29.11%), ParLoopExecute (1,675,699 us, 26.34%), firedrake.function.Function.interpolate (1,415,484 us, 22.25%), firedrake.__init__ (804,596 us, 12.65%).]

One can immediately see that the dominant hotspots for this code are assembly and writing to output, so any optimisation effort should be spent on those. Some time is also spent in firedrake.__init__, but this corresponds to the time spent importing Firedrake and would be amortised for longer-running problems.

Flame graphs can also be generated for codes run in parallel, in which case the times reported in the graph are the maximum value across all ranks.

Generating the flame graph

To generate a flame graph from your Firedrake script you need to:

  1. Run your code with the extra flag -log_view :foo.txt:ascii_flamegraph. For example:

    $ python myscript.py -log_view :foo.txt:ascii_flamegraph
    

    This will run your program as usual but output an additional file called foo.txt containing the profiling information.

  2. Visualise the results. This can be done in one of two ways:

    • Generate an SVG file using the flamegraph.pl script from this repository with the command:

      $ ./flamegraph.pl foo.txt > foo.svg
      

      You can then view foo.svg in your browser.

    • Upload the file to speedscope and view it there.
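
If you only want a quick text ranking of the most expensive events, the profiling file can also be summarised directly. The following is a minimal sketch, not part of Firedrake or PETSc: it assumes that each line of foo.txt follows the collapsed-stack convention read by flamegraph.pl (semicolon-separated frames followed by a value), and the function names inclusive_times and top_frames, as well as the sample numbers, are made up for illustration.

```python
from collections import defaultdict


def inclusive_times(lines):
    """Sum the value of every stack that passes through each frame."""
    totals = defaultdict(float)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Frame names may contain spaces, so split on the last whitespace only.
        stack, value = line.rsplit(None, 1)
        # Count each frame once per stack, even if it appears more than once.
        for frame in set(stack.split(";")):
            totals[frame] += float(value)
    return dict(totals)


def top_frames(lines, n=5):
    """Return the n frames with the largest inclusive value, largest first."""
    totals = inclusive_times(lines)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]


# Illustrative input only; these are not measurements from a real run.
sample = [
    "firedrake;firedrake.assemble.assemble;ParLoopExecute 700",
    "firedrake;firedrake.assemble.assemble 100",
    "firedrake;Mesh: numbering 50",
    "firedrake 150",
]

for frame, value in top_frames(sample, n=3):
    print(f"{value:10.0f}  {frame}")
```

To apply this to a real run, pass an open file instead of the sample list, for example top_frames(open("foo.txt")).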

Adding your own events

It is very easy to add your own events to the flame graph and there are a few different ways of doing it. The simplest methods are:

  • With a context manager:

    from firedrake.petsc import PETSc
    
    with PETSc.Log.Event("foo"):
        do_something_expensive()
    
  • With a decorator:

    from firedrake.petsc import PETSc
    
    @PETSc.Log.EventDecorator("foo")
    def do_something_expensive():
        ...
    

    If no arguments are passed to PETSc.Log.EventDecorator then the event name will be the same as the name of the function.

Caveats

  • The flamegraph.pl script assumes by default that the values in the stack traces are sample counts. This means that if you hover over functions in the SVG it will report the count in terms of ‘samples’ rather than the correct unit of microseconds. A simple fix to this is to include the command line option --countname us when you generate the SVG. For example:

    $ ./flamegraph.pl --countname us foo.txt > foo.svg
    
  • If you use PETSc stages in your code these will be ignored in the flame graph.

  • If you call PETSc.Log.begin() as part of your script/package then profiling will not work as expected. This is because this function starts PETSc’s default (flat) logging while we need to use nested logging instead.

    This issue can be avoided with the simple guard:

    from firedrake.petsc import OptionsManager, PETSc
    
    # If the -log_view flag is passed you don't need to call
    # PETSc.Log.begin because it is done automatically.
    if "log_view" not in OptionsManager.commandline_options:
        PETSc.Log.begin()
    

Common performance issues

Calling solve repeatedly

When solving PDEs, Firedrake uses a PETSc SNES (nonlinear solver) under the hood. Every time the user calls solve() a new SNES is created and used to solve the problem. This is a convenient shorthand for scripts that only need to solve a problem once, but it is fairly expensive to set up a new SNES and so repeated calls to solve() will introduce some overhead.

To get around this problem, users should instead instantiate a variational problem (e.g. NonlinearVariationalProblem) and solver (e.g. NonlinearVariationalSolver) outside of the loop body. An example showing how this is done can be found in this demo.
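
As a rough illustration of this pattern, the sketch below builds the problem and solver once and then reuses them at every step of a time loop. The heat-equation form, the mesh size, the step count and the function name run_time_loop are assumptions chosen only for illustration; the linked demo remains the authoritative example. Running it requires a working Firedrake installation.

```python
def run_time_loop(num_steps=10, dt=0.01):
    """Take num_steps implicit Euler steps of a heat equation with one solver."""
    # Imported inside the function so the file can be loaded without Firedrake.
    from firedrake import (
        Function, FunctionSpace, NonlinearVariationalProblem,
        NonlinearVariationalSolver, TestFunction, UnitSquareMesh,
        dx, grad, inner,
    )

    mesh = UnitSquareMesh(16, 16)
    V = FunctionSpace(mesh, "CG", 1)
    v = TestFunction(V)
    # Both start at zero here; a real script would set an initial condition.
    u_old = Function(V)  # solution at the previous step
    u_new = Function(V)  # solution being solved for

    # Residual of one implicit Euler step (illustrative problem only).
    F = inner(u_new - u_old, v) * dx + dt * inner(grad(u_new), grad(v)) * dx

    # Built once, outside the loop, so no new solver is created per step.
    problem = NonlinearVariationalProblem(F, u_new)
    solver = NonlinearVariationalSolver(problem)

    for _ in range(num_steps):
        solver.solve()       # reuses the existing solver
        u_old.assign(u_new)  # the form sees the updated u_old on the next solve
    return u_new
```

Because the form refers to the Function u_old rather than to a copy of its values, updating u_old with assign is enough for the next call to solver.solve() to see the new data.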

Other useful tools

Here we present a handful of performance analysis tools that users may find useful to run with their codes.

py-spy

py-spy is a great sampling profiler that outputs directly to SVG flame graphs. It allows users to see the entire stack trace of the program rather than just the annotated PETSc events and, unlike most Python profilers, it can also profile native code.

A flame graph for your Firedrake script can be generated from py-spy with:

$ py-spy record -o foo.svg --native -- python myscript.py

Beyond the inherent uncertainty that comes from using a sampling profiler, one substantial limitation of py-spy is that it does not work when run in parallel.

pyinstrument

pyinstrument is a great sample-based profiling tool that you can use to easily identify hotspots in your code. To use the profiler simply run:

$ pyinstrument myscript.py

This will print out a timed call stack to the terminal. To instead generate an interactive graphic that you can view in your browser, pass the -r html flag.

Unfortunately, pyinstrument cannot profile native code. This means that information about the code’s execution inside of PETSc is largely lost.

memory_profiler

memory_profiler is a useful tool that you can use to monitor the memory usage of your script. After installing it you can simply run:

$ mprof run python myscript.py
$ mprof plot

The former command will run your script and generate a file containing the profiling information. The latter then displays a plot of the memory usage against execution time for the whole script.

memory_profiler also works in parallel. You can pass either the --include-children or the --multiprocess flag to mprof, depending on whether you want to accumulate the memory usage across ranks or plot each rank separately. For example:

$ mprof run --include-children mpiexec -n 4 python myscript.py

Score-P

Score-P is a tool aimed at HPC users. We found it to provide some useful insight into MPI considerations such as load balancing and communication overhead.

To use it with Firedrake, users will also need to install Score-P’s Python bindings.