Python profiling data

Python comes with a built-in way of collecting profile information, documented at https://docs.python.org/3/library/profile.html.

In a nutshell, it boils down to:

python3 -m cProfile -o profile.out command.py arguments…

This post is an attempt to document what's in profile.out, since the python documentation does not cover it. I have tried to figure this out by looking at the sources of the pstats module, the sources of gprof2dot, and trying to make sense of the decoded structure, and navigate it.


Decoding Python profile data

The data collected by python's profile/cProfile is a data structure encoding using the marshal module1.

Loading the data is simple:

>>> import marshal
>>> with open("profile.out", "rb") as fd:
>>>     data = marshal.load(fd)

Structure of Python profile data

Decoded profile data is a dictionary indexed by tuples representing functions or scopes:

# File name, Line number, Function or scope name
Func = Tuple[str, int, str]

For example:

("./staticsite/page.py", 1, "<module>")
("./staticsite/features/markdown.py", 138, "load_dir")
('~', 0, "<method 'items' of 'collections.OrderedDict' objects>")
('/usr/lib/python3.7/pathlib.py', 1156, 'stat')

('~', 0, …) represents a built-in function.

The signature of the decoded dict seems to be:

Dict[
    # Function of scope
    Func,
    Tuple[
        # Aggregated statistics
        int, int, float, float,
        # Callers
        Dict[
            # Caller function or scope
            Func,
            # Call statistics collected for this caller → callee
            Tuple[int, int, float, float]
        ]
    ]
]

The four int, int, float, float statistics are normally unpacked as cc, nc, tt, ct, and stand for:

Phrases in italic are quotations from profile's documentation.

Building a call graph

This is some example code to turn the collected statistics into a call graph structure:

class Node:
    """
    Node in the caller → callee graph
    """
    def __init__(self, func: Func):
        self.fname, self.lineno, self.scope = func
        self.callees: List["Call"] = []
        self.callers: List["Call"] = []
        self.cc = 0
        self.nc = 0
        self.tt = 0.0
        self.ct = 0.0

    def __str__(self):
        # Builtin function
        if self.fname == "~" and self.lineno == 0:
            return f"[builtin]:{self.scope}"

        # Shorten file names from system libraries
        self.shortname = self.fname
        for path in sorted(sys.path, key=lambda x: -len(x)):
            if not path:
                continue
            if self.fname.startswith(path):
                return f"[sys]{self.fname[len(path):]}:{self.lineno}:{self.scope}"

        # File in the local project
        return f"{self.fname}:{self.lineno}:{self.scope}"


class Call:
    """
    Arc in the caller → callee graph
    """
    def __init__(self, caller: Node, callee: Node, cc: int, nc: int, tt: float, ct: float):
        self.caller = caller
        self.callee = callee
        self.cc = cc
        self.nc = nc
        self.tt = tt
        self.ct = ct


class Graph:
    """
    Graph of callers and callees.

    Each node in the graph represents a function, with its aggregated
    statistics.

    Each arc in the graph represents each collected caller → callee statistics
    """
    def __init__(self, stats: Stats):
        # Index of all nodes in the graph
        self.nodes: Dict[Func, Node] = {}
        # Total execution time
        self.total_time = 0.0

        # Build the graph
        for callee, (cc, nc, tt, ct, callers) in stats.items():
            self.total_time += tt

            # Get the callee and fill its aggregated stats
            ncallee = self.node(callee)
            ncallee.cc = cc
            ncallee.nc = nc
            ncallee.tt = tt
            ncallee.ct = ct

            # Create caller → callee arcs
            for caller, (cc, nc, tt, ct) in callers.items():
                ncaller = self.node(caller)
                call = Call(ncaller, ncallee, cc, nc, tt, ct)
                ncallee.callers.append(call)
                ncaller.callees.append(call)

    def node(self, fun: Func) -> Node:
        """
        Lookup or create a node
        """
        res = self.nodes.get(fun)
        if res is None:
            res = Node(fun)
            self.nodes[fun] = res
        return res

    @classmethod
    def load(cls, pathname: str) -> "Graph":
        """
        Builds a Graph from profile statistics saved on a file
        """
        with open(pathname, "rb") as fd:
            return cls(marshal.load(fd))

    @classmethod
    def from_pstats(cls, stats: pstats.Stats) -> "Graph":
        """
        Builds a Graph from an existing pstats.Stats structure
        """
        return cls(stats.stats)

Other python profiling links


  1. marshal does not guarantee stability of the serialized format across python versions. This means that you can't, for example, use a python2 tool to examine profiling results from a python3 script, or vice versa. See for example Debian bug #946894