Observing Application Behavior and Performance

13 Code Optimization

Code optimization is the process of making an application work more efficiently, usually without modifying its functionality and accuracy. Code optimization is usually concerned with the speed of processing (CPU time optimization), but can also be used to minimize the usage of different resources, such as memory, disk space, or network bandwidth.

In the previous chapter, we learned about various observability techniques that can be used to identify performance hotspots in applications. Logging, metric monitoring, and tracing can be used to create a general performance overview and prioritize optimization efforts. Moreover, operations teams often use custom alerts of performance and resource usage metrics or log messages related to operation timeouts to determine components that require immediate action.

But even the best logging, metrics, and tracing systems will give you only a rough overview of the performance problem. If you decide to fix it, you will have to perform a careful profiling process that will uncover detailed resource usage patterns.

There are plenty of optimization techniques. Some concentrate on algorithmic complexity and the use of optimized data structures. Others can trade the accuracy or consistency of results for quick and easy performance gains. Only when you know how your application works and how it uses resources will you be able to apply the proper optimization technique.

In this chapter, we will discuss the following topics:

Common culprits for bad performance
Code profiling
Reducing complexity by choosing appropriate data structures
Leveraging architectural trade-offs

Python comes with built-in profiling capabilities and useful optimization utilities, so theoretically, we could use nothing other than Python and its standard library. But extra tools are like meta-optimization—they increase your own efficiency in finding and solving performance issues. That's why we will rely heavily on some extra utilities. Let's consider them.

Technical requirements

The following are the Python packages that are mentioned in this chapter that you can download from PyPI:

gprof2dot
objgraph
pymemcache

Information on how to install packages is included in Chapter 2, Modern Python Development Environments.

The following are some extra applications that you may need to install:

Graphviz: An open source diagram visualization solution available at https://graphviz.org
Memcached: An open source caching service available at https://memcached.org

The code files for this chapter can be found at https://github.com/PacktPublishing/Expert-Python-Programming-Fourth-Edition/tree/main/Chapter%2013.

Common culprits for bad performance

Before we go into the details of profiling, let's first discuss the typical reasons for bad performance in applications.

In practice, many performance problems are pretty common and reoccurring. Knowledge of potential culprits is essential in choosing the right profiling strategy. Also, some specific culprits may manifest in recognizable patterns. If you know what to look for, you will be able to fix obvious performance issues without a detailed analysis and save a lot of time on further optimization attempts.

The most common reasons for experiencing bad application performance are the following:

Excessive complexity
Excessive resource allocation and resource leaks
Excessive I/O and blocking operations

The most dreaded culprit of application inefficiency is excessive complexity, so we will discuss this first.

Code complexity

The first and most obvious thing to look for when trying to improve application performance is complexity. There are many definitions of what makes a program complex, and there are many ways to express it. Some measurable complexity metrics can provide objective information about how code behaves, and such information can often be extrapolated into performance expectations. An experienced programmer can even reliably guess how two different implementations will perform in practice, as long as they're aware of their complexities and the execution context.

The two most popular ways to define application complexity are as follows:

Cyclomatic complexity, which is very often correlated with application performance.
The Landau notation, also known as big O notation, is an algorithm classification method that is useful in objectively judging code performance.

The optimization process may therefore be sometimes understood as a process of reducing complexity. In the following sections, we will take a closer look at the definitions of these two types of code complexity.

Cyclomatic complexity

Cyclomatic complexity is a metric that was developed by Thomas J. McCabe in 1976; because of its author, it's also known as McCabe's complexity. Cyclomatic complexity measures the number of linear paths through a piece of code. In short, all branching points (if statements) and loops (for and while statements) increase code complexity.

Depending on the value of measured cyclomatic complexity, code can be classified into various complexity classes. The following is a table of commonly used McCabe complexity classes:

Cyclomatic complexity value	Complexity class
1 to 10	Not complex
11 to 20	Moderately complex
21 to 50	Really complex
Above 50	Too complex

Cyclomatic complexity usually correlates inversely with application performance (higher complexity = lower performance). Still, it is more of a code quality score than a performance metric. It does not replace the need for code profiling when looking at performance bottlenecks. Code that has high cyclomatic complexity often tends to utilize rather complex algorithms that may not perform well with larger inputs.

Although cyclomatic complexity is not a reliable way to judge application performance, it has an important advantage: it is a source code metric that can be measured with proper tools. This cannot be said of other canonical ways of expressing complexity, including big O notation. Thanks to its measurability, cyclomatic complexity may be a useful addition to profiling, as it gives you more information about problematic parts of your software. Complex parts of code are the first things you should review when considering radical code architecture redesigns.

Measuring McCabe's complexity is relatively simple in Python because it can be deduced from its abstract syntax tree. Of course, you don't need to do that by yourself. mccabe is a popular Python package from PyPI that can perform automated complexity analysis over Python sources. It is also available as a Pytest plugin named pytest-mccabe, so it is easy to include in your automated quality and testing processes (see Chapter 10, Testing and Quality Automation).

You can learn more about the mccabe package at https://pypi.org/project/mccabe/.

The pytest-mccabe package is available at https://pypi.org/project/pytest-mccabe/.

The big O notation

The canonical method of defining function complexity is the big O notation. This metric defines how an algorithm is affected by the size of the input. For instance, does an algorithm's complexity scale linearly with the size of the input, or quadratically?

Manually assessing the big O notation for an algorithm is the best approach when trying to achieve an overview of how its performance is related to the size of the input. Knowing the complexity of your application's components gives you the ability to detect and focus on aspects that will significantly slow down the code.

To measure the big O notation, all constants and low-order terms are removed in order to focus on the portion that really matters when the size of the input data grows very large. The idea is to categorize the algorithm in one of the known complexity categories (even if it is an approximation). The following are the most common complexity classes, where n is equal to the number of input elements of a problem:

Notation	Type
O(1)	Constant; does not depend on the input size
O(n)	Linear; will grow at a constant rate as n grows
O(n log n)	Quasi-Linear
O(n²)	Quadratic
O(n³)	Cubic
O(n!)	Factorial

We can easily understand complexity classes by comparing basic lookup operations in some Python standard data types:

Python list lookup by index has a complexity of O(1): Python lists are pre-allocated resizable arrays akin to a Vector container from the C++ standard library. The in-memory position of the list item of a given index is known in advance, so accessing it is a constant-time operation.
Finding the element index in a list by value has a complexity of O(n): If you use the list.index(value) method, Python will have to iterate over items until it finds the element that matches the value expression. In the worst case, it will have to iterate over all the elements of the list, so it will perform n operations, where n is the list length. On average, however, finding random elements will take n/2 operations, so the complexity is O(n).
Key lookup in a dictionary has a complexity of O(1): In Python, you can use any immutable value as keys in a dictionary. Key values do not translate directly into exact positions in memory, but Python uses an advanced hashing technique that ensures that the average complexity of the key lookup operation is constant.

To understand how function complexity can be determined, let's take a look at the following example:

def function(n):
    for i in range(n):
        print(i)

In the preceding function, the print() function will be executed n times. Loop length depends linearly on n, so the complexity of the whole function is O(n).

If the function has conditions, the correct notation is that of the worst-case scenario. Consider the following example:

def function(n, print_count=False):
    if print_count:
        print(f'count: {n}')
    else:
        for i in range(n):
            print(i)

In this example, the function could be O(1) or O(n), depending on the value of the print_count argument. The worst case is O(n), so the whole function complexity is O(n).

We don't always have to consider the worst-case scenario when determining complexity. Many algorithms change runtime performance, depending on the statistical characteristic of input data, or amortize the cost of worst-case operations by performing clever tricks. This is why, in many cases, it may be better to review your implementation in terms of average complexity or amortized complexity.

For example, take a look at the operation of appending a single element to Python's list type instance. We know that a list in CPython uses an array with overallocation for the internal storage instead of linked lists. If an array is already full, appending a new element requires the allocation of a new array and copying all existing elements (references) to a new area in the memory. If we look at this from the point of view of worst-case complexity, it is clear that the list.append() method has O(n) complexity, which is a bit expensive compared to a typical implementation of the linked list structure. We also know, however, that the CPython list type implementation uses the mechanism of overallocation (it allocates more space than is required at a given time) to mitigate the complexity of occasional reallocation. If we evaluate the complexity over a sequence of operations, we will notice that the average complexity of list.append() is O(1), and this is actually a great result.

Always be mindful of big O notation, but don't be too dogmatic about it. Big O notation is asymptotic notation, meaning it is intended to analyze the limiting behavior of a function when the size of the input trends toward infinity. It therefore may not offer a reliable performance approximation for real-life data. Asymptotic notation is a great tool for defining the growth rate of a function, but it won't give a direct answer to the simple question of "Which implementation will take the least time?" Worst-case complexity ignores all the details about timings of individual operations to show you how your program will behave asymptotically. It works for arbitrarily large inputs that you may not always have considered.

For instance, let's assume that you have a problem that requires the processing of n independent elements. Let's say we have Program A and Program B. You know that Program A requires 100n² operations to complete the task, while Program B requires 5n³ operations to provide a solution. Which one would you choose?

When speaking about very large inputs, Program A is the better choice because it behaves better asymptotically. It has O(n²) complexity compared to Program B's O(n³) complexity. However, by solving a simple 100n² > 5n³ inequality, we can find that Program B will take fewer operations when n is less than 20. Therefore, if we know a bit more about our input bounds, we can make slightly better decisions. Moreover, we can't assume that individual operations in both programs take the same amount of time. If we want to know which one runs faster for a given input size, we will have to profile both applications.

Excessive resource allocation and leaks

With increased complexity often comes the problem of excessive resource allocation. Code that is complex (either algorithmically or asymptotically) is more likely to include inefficient data structures or allocate too many resources without freeing them afterward.

Although resource usage inefficiency often goes hand in hand with complexity, excessive resource usage needs to be treated as a separate performance issue. And this is for two reasons:

Allocating (and freeing) resources takes time: This includes RAM, which is supposed to be fast memory (when compared to disks). Sometimes, it is better to maintain a pool of resources instead of allocating and freeing them constantly.
Your application is unlikely to run alone on the same host: If it allocates too many resources, it may starve other programs. In extreme cases, excessive usage of resources by a single application may render the whole system unavailable or lead to abrupt program termination.

Whether it is network bandwidth, disk space, memory, or any other resource, if an application appropriates all available resources to itself, this will have a detrimental impact on other programs running in the same environment.

Moreover, many operating systems or environments allow more resources to be requested than are technically available. A common case is memory overcommitting, which allows an operating system to assign processes more memory than hosts physically have available. It works by assigning processes virtual memory. In the event of a physical memory shortage, an operating system will usually temporarily swap unused memory pages to disk to create more free space.

This swapping may go unnoticed at healthy usage levels. However, if virtual memory is overused, normal swapping can transform into 'thrashing,' where the operating system is constantly swapping pages in and out of the disk. This can lead to a dramatic performance collapse.

Excessive resource usage may stem from ineffective data structures, overly large resource pool allocations, or be caused by unintentional resource leaks. Resource leaks happen when an application is allocated some specific resources but never frees them, even after it is no longer necessary. Resource leaks are most common for memory allocations, but can happen for other resources (such as file descriptors or open connections) too.

Excessive I/O and blocking operations

When writing applications, we often forget that every operation requires writing to and reading from a disk or via a network connection, which takes time. These Input/Output (I/O) operations always have a noticeable performance impact on applications, and this is what many software architects ignore when designing elaborate networked systems.

With the advent of fast SSDs (Solid State Drives), disk I/O is faster than ever but is still not as fast as RAM (and RAM is not as fast as a CPU cache). Also, some drives have fast throughput for big files, but perform surprisingly poorly in random access mode.

The same goes for network connections. It cannot go faster than the speed of light. If processes are running on two hosts separated by a great distance, every communication exchange will add a noticeable round-trip overhead. This is a great concern in distributed systems that are often characterized by the fine granulation of many networked services.

Excessive I/O operations will also have an impact on the ability to serve multiple users concurrently. We've already learned about various concurrency models in Chapter 6, Concurrency. Even with the asynchronous concurrency model, which should theoretically deal best with I/O, it is possible to accidentally include synchronous I/O operations that can potentially ruin the benefits of concurrency. In such a situation, a performance drop may be observed as a drop in application maximum processing throughput.

As you can see, there are many possible reasons for performance problems. In order to fix them, you will need to identify their culprit. We usually do that through a process known as profiling.

Code profiling

Knowing what can potentially go wrong allows you to make hypotheses and bets on what are the performance issue culprits and how can you fix them. But profiling is the only way to verify these hypotheses. You should usually avoid optimization attempts without profiling your application first.

Experience helps, so there is of course nothing wrong with doing a small code overview and experiments before profiling. Also, some profiling techniques require the incorporation of additional code instrumentation or the writing of performance tests. It means that often you will have to read it thoroughly anyway. If you perform some small experiments along the way (for instance, in the form of debugging sessions), you may spot something obvious.

Low-hanging fruit happens, but don't rely on it. A good ratio between freeform experiments and classic profiling is around 1:9. My favorite way of organizing the profiling and optimization process is as follows:

Decide how much time you can spend: Not every optimization is possible, and you will not always be able to fix it. Accept the possibility that you won't succeed on your first attempt. Decide when you will stop in advance. It is better to withdraw and successfully retry later than lose a long-standing battle.
Split that time into sessions: Optimization is like debugging. It requires focus, organization, and a clear mind. Pick a length of the session that will allow you to run a whole cycle, consisting of putting forward a hypothesis, profiling, and experimentation. This should be no more than a few hours.
Make a schedule: Implement your sessions when you are least likely to be interrupted and will be most effective. It is often better to plan a few days ahead if the problem is big so you can start every day with fresh ideas.
Don't make other development plans: If you don't succeed in your first session, you will probably be constantly thinking about your problem. It will be hard to focus on other bigger tasks. If you work in a team, make it transparent because you won't contribute fully until you close the topic. Have a handful of small bite-sized tasks to keep you distracted between sessions.

Optimization will likely require code changes in your application. To reduce the risk of breaking things, it is a good practice to ensure that code is well covered in tests. We discussed common testing techniques in Chapter 10, Testing and Quality Automation.

The key to every profiling session is a hypothesis in terms of what could be wrong. It allows you to decide what profiling technique to use and what tools will work best. A great source of hypotheses are metrics collection and distributed tracing systems, which we discussed in detail in Chapter 12, Observing Application Behavior and Performance.

If you have no clue as to what could be wrong, the best way is to start traditional CPU profiling because this has the best chance of leading you to the next viable hypothesis. That's because many common resource and network usage anti-patterns can manifest themselves in CPU profiles as well.

Profiling CPU usage

Typically, performance problems arise in a narrow part of code that has a significant impact on application performance as a whole. We call such places bottlenecks. Optimization is a process of identifying and fixing those bottlenecks.

The first source of bottlenecks is your code. The standard library provides all the tools that are required to perform code profiling. They are based on a deterministic approach. A deterministic profiler measures the time spent in each function by adding a timer at the lowest level. This introduces a noticeable overhead but provides a good idea of where the time is consumed. A statistical profiler, on the other hand, samples instruction pointer usage and does not instrument the code. The latter is less accurate, but allows you to run the target program at full speed.

There are two ways to profile the code:

Macro-profiling: This profiles the whole program while it is being used and generates statistics.
Micro-profiling: This measures a precise part of the program by instrumenting it manually.

Where we don't know the exact fragment of the code that runs slowly, we usually start with macro-profiling of the whole application to get a general performance profile and determine components that act as performance bottlenecks.

Macro-profiling

Macro-profiling is done by running the application in a special mode, where the interpreter is instrumented to collect statistics regarding code usage. The Python standard library provides several tools for this, including the following:

profile: This is a pure Python implementation.
cProfile: This is a C implementation that provides the same interface as that of the profile tool, but has less overhead.

The recommended choice for most Python programmers is cProfile due to its reduced overhead. In any case, if you need to extend the profiler in some way, then profile will be a better choice because it doesn't use C extensions and so is easier to extend.

Both tools have the same interface and usage, so we will use only one of them here. The following is a myapp.py module with a main function that we are going to profile with the cProfile module:

import time 
 
def medium(): 
    time.sleep(0.01) 
 
def light(): 
    time.sleep(0.001) 
 
def heavy(): 
    for i in range(100): 
        light() 
        medium() 
        medium() 
    time.sleep(2) 
 
def main(): 
    for i in range(2): 
        heavy() 
 
if __name__ == '__main__': 
    main()

This module can be called directly from the prompt, and the results are summarized here:

$ python3 -m cProfile myapp.py

Example profiling output for our myapp.py script can be as follows:

         1208 function calls in 8.243 seconds
   Ordered by: standard name
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.001    0.000    8.243    4.121 myapp.py:13(heavy)
        1    0.000    0.000    8.243    8.243 myapp.py:2(<module>)
        1    0.000    0.000    8.243    8.243 myapp.py:21(main)
      400    0.001    0.000    4.026    0.010 myapp.py:5(medium)
      200    0.000    0.000    0.212    0.001 myapp.py:9(light)
        1    0.000    0.000    8.243    8.243 {built-in method exec}
      602    8.241    0.014    8.241    0.014 {built-in method sleep}

The meaning of each column is as follows:

ncalls: The total number of calls
tottime: The total time spent in the function, excluding time spent in the calls of sub-functions
cumtime: The total time spent in the function, including time spent in the calls of sub-functions

The percall column to the left of tottime equals tottime / ncalls, and the percall column to the left of cumtime equals the cumtime / ncalls.

These statistics are a print view of the statistic object that was created by the profiler. You can also create and review this object within the interactive Python session, as follows:

>>> import cProfile
>>> from myapp import main
>>> profiler = cProfile.Profile()
>>> profiler.runcall(main)
>>> profiler.print_stats()
         1206 function calls in 8.243 seconds
   Ordered by: standard name
   ncalls  tottime  percall  cumtime  percall file:lineno(function)
        2    0.001    0.000    8.243    4.121 myapp.py:13(heavy)
        1    0.000    0.000    8.243    8.243 myapp.py:21(main)
      400    0.001    0.000    4.026    0.010 myapp.py:5(medium)
      200    0.000    0.000    0.212    0.001 myapp.py:9(light)
      602    8.241    0.014    8.241    0.014 {built-in method sleep}

The statistics can also be saved in a file and then read by the pstats module. This module provides a class that knows how to handle profile files, and gives a few helpers to more easily review the profiling results. The following transcript shows how to access the total number of calls and how to display the first three calls, sorted by the time metric:

>>> import pstats
>>> import cProfile
>>> from myapp import main
>>> cProfile.run('main()', 'myapp.stats')
>>> stats = pstats.Stats('myapp.stats')
>>> stats.total_calls
1208
>>> stats.sort_stats('time').print_stats(3)
Mon Apr  4 21:44:36 2016    myapp.stats
         1208 function calls in 8.243 seconds
 
   Ordered by: internal time
   List reduced from 8 to 3 due to restriction <3>
   ncalls  tottime  percall  cumtime  percall file:lineno(function)
      602    8.241    0.014    8.241    0.014 {built-in method sleep}
      400    0.001    0.000    4.025    0.010 myapp.py:5(medium)
        2    0.001    0.000    8.243    4.121 myapp.py:13(heavy)

From there, you can browse the code by printing out the callers and callees for each function, as follows:

>>> stats.print_callees('medium')
   Ordered by: internal time
   List reduced from 8 to 1 due to restriction <'medium'>
Function           called...
                   ncalls  tottime  cumtime
myapp.py:5(medium) ->  400    4.025    4.025  {built-in method sleep}
>>> stats.print_callees('light')
   Ordered by: internal time
   List reduced from 8 to 1 due to restriction <'light'>
Function           called...
                    ncalls  tottime  cumtime
myapp.py:9(light)  ->  200    0.212    0.212  {built-in method sleep}

Being able to sort the output allows you to work on different views to find the bottlenecks. For instance, consider the following scenarios:

When the number of small calls (a low value of percall for the tottime column) is really high (high value of ncalls) and takes up most of the global time, the function or method is probably running in a very long loop. Often, optimization can be done by moving this call to a different scope to reduce the number of operations.
When a single function call is taking a very long time, a cache might be a good option, if possible.

Another great way to visualize bottlenecks from profiling data is to transform them into diagrams (see Figure 13.1). The gprof2dot.py script can be used to turn profiler data into a dot graph:

$ gprof2dot.py -f pstats myapp.stats | dot -Tpng -o output.png

The gprof2dot.py script is part of the gprof2dot package available on PyPI. You can download it using pip. Working with it requires the installation of Graphviz software. You can download this for free from http://www.graphviz.org/.

The following is an example output of the preceding gprof2dot.py invocation in a Linux shell that turns the myapp.stats profile file into a PNG diagram:

Figure 13.1: An example of a profiling overview diagram that was generated with gprof2dot

The example diagram shows different code paths that were executed by the program and the relative time spent in each path. Each box represents a single function. Linking the functions, you have the number of times a given code path was executed and the percentage of total execution time spent on those paths. Using diagrams is a great way to explore the performance patterns of large applications.

The advantage of gprof2dot is that it tries to be language-agnostic. It is not limited to a Python profile or cProfile output and can read from multiple other profiles, such as Linux perf, xperf, gprof, Java HPROF, and many others.

Macro-profiling is a good way to detect the function that has a problem, or at least its neighborhood. When you have found it, you can proceed to micro-profiling.

Micro-profiling

When the slow function is found, we usually proceed with micro-profiling to generate a profile focused on the smallest possible amount of code. This is done by manually instrumenting a part of the code in a specifically created speed test.

For instance, the cProfile module can be used in the form of a decorator, as in the following example:

import time
import tempfile
import cProfile
import pstats
def profile(column='time', list=3):
    def parametrized_decorator(function):
        def decorated(*args, **kw):
            s = tempfile.mktemp()
            profiler = cProfile.Profile()
            profiler.runcall(function, *args, **kw)
            profiler.dump_stats(s)
            p = pstats.Stats(s)
            print("=" * 5, f"{function.__name__}() profile", "=" * 5)
            p.sort_stats(column).print_stats(list)
        return decorated
    return parametrized_decorator
def medium():
    time.sleep(0.01)
@profile('time')
def heavy():
    for i in range(100):
        medium()
        medium()
    time.sleep(2)
def main():
    for i in range(2):
        heavy()
if __name__ == '__main__':
    main()

This approach allows for testing only selected parts of the application (here, the heavy() function) and sharpens the statistics' output. This way, you can collect many isolated and precisely targeted profiles on a single application run, as follows. The following is example output of running the preceding code in a Python interpreter:

===== heavy() profile =====
Wed Apr 10 03:11:53 2019 /var/folders/jy/wy13kx0s7sb1dx2rfsqdvzdw0000gq/T/tmpyi2wejm5
         403 function calls in 4.330 seconds
 Ordered by: internal time
 List reduced from 4 to 3 due to restriction <3>
 ncalls tottime percall cumtime percall filename:lineno(function)
    201 4.327   0.022   4.327   0.022   {built-in method time.sleep}
    200 0.002   0.000   2.326   0.012   cprofile_decorator.py:24(medium)
      1 0.001   0.001   4.330   4.330   cprofile_decorator.py:28(heavy)
===== heavy() profile =====
Wed Apr 10 03:11:57 2019 /var/folders/jy/wy13kx0s7sb1dx2rfsqdvzdw0000gq/T/tmp8mubgwjw
         403 function calls in 4.328 seconds
 Ordered by: internal time
 List reduced from 4 to 3 due to restriction <3>
 ncalls tottime percall cumtime percall filename:lineno(function)
    201 4.324   0.022   4.324   0.022   {built-in method time.sleep}
    200 0.002   0.000   2.325   0.012   cprofile_decorator.py:24(medium)
      1 0.001   0.001   4.328   4.328.  cprofile_decorator.py:28(heavy)

In the preceding output, we see that the heavy() function was called exactly twice and both times the profile was very similar. In the list of calls, we made 201 calls to the time.sleep() function, which cumulatively takes around 4.3 seconds to execute.

Sometimes at this stage, having just a profile with a list of callees is not enough to understand the problem. A common practice is to create alternative implementations of specific parts of the code and measure how long it takes to execute them. If the heavy() function was to do anything useful, we could try, for instance, to solve the same problem with code of lower complexity.

timeit is a useful module that provides a simple way to measure the execution time of a small code snippet, with the best underlying timer the host system provides (the time.perf_counter() object), as shown in the following example:

>>> from myapp import medium
>>> import timeit
>>> timeit.timeit(light, number=1000)
1.2880568829999675

The timeit.timeit() function will run the specified function 1,000 times (specified by the number parameter) and return the total amount of time spent during all executions. If you expect a larger variance in terms of results, you can use timeit.repeat() to repeat the whole test a specified number of times:

>>> timeit.repeat(light, repeat=5, number=1000)
[1.283251813999982, 1.2764048459999913, 1.2787090969999326, 1.279601415000002, 1.2796783549999873]

The timeit module allows you to repeat the call multiple times, and can be easily used to try out isolated code snippets. This is very useful outside the application context, in a prompt, for instance, but is not really ideal for use within an existing application.

The timeit module can be used to create speed tests in your automated testing framework, but that approach should be used with caution. Timing results may vary each time. Repeating the same test multiple times and taking the average provides more accurate results. Furthermore, some computers have special CPU features, which might change the processor clock time depending on the load or temperature. So you can see different time results when the computer is idling and when it is really busy. Other programs running concurrently will also very likely affect overall timings. So, continually repeating the test is good practice for small code snippets. That's why, when measuring time in automated testing flows, we usually try to observe patterns rather than put concrete timing thresholds as assertions.

When doing CPU profiling, we often spot various patterns related to acquiring and releasing resources as these often happen through function calls. Traditional Python profiling with profile and cProfile modules can provide a general overview of resource usage patterns, but when it comes to memory usage, we prefer dedicated memory profiling solutions.

Profiling memory usage

Before we start to hunt down memory issues in Python, you should know that the nature of memory leaks in Python is quite special. In some compiled languages such as C and C++, the memory leaks are almost exclusively caused by allocated memory blocks that are no longer referenced by any pointer. If you don't have a reference to memory, you cannot release it. This very situation is called a memory leak.

In Python, there is no low-level memory management available for the user, so we instead deal with leaking references—references to objects that are no longer needed but were not removed. This stops the interpreter from releasing resources, but is not the same situation as a classic memory leak in C.

There is always the exceptional case of memory leaking through pointers in Python C extensions, but they are a different kind of beast that needs completely different tools to diagnose. These cannot be easily inspected from Python code. A common tool for inspecting memory leaks in C is Valgrind. You can learn more about Valgrind at https://valgrind.org/.

So, memory issues in Python are mostly caused by unexpected or unplanned resource acquisition patterns. It happens very rarely that this is the effect of real bugs caused by the mishandling of memory allocation and deallocation routines. Such routines are available to the developer only in CPython when writing C extensions with Python/C APIs. You will deal with these very rarely, if ever. Thus, most memory leaks in Python are mainly caused by the overblown complexity of the software and subtle interactions between its components that are really hard to track. To spot and locate such deficiencies in your software, you need to know how actual memory usage looks in the program.

Getting information about how many objects are controlled by the Python interpreter and inspecting their real size is a bit tricky. For instance, knowing how much memory a given object takes in bytes would involve crawling down all its attributes, dealing with cross-references, and then summing up everything. This is a pretty complex problem if you consider the way objects tend to refer to each other. The built-in gc module, which is the interface of Python's garbage collector, does not provide high-level functions for this, and it would require Python to be compiled in debug mode to have a full set of information.

Often, programmers just ask the system about the memory usage of their application before and after a given operation has been performed. But this measure is an approximation and depends a lot on how the memory is managed at the system level. Using the top command under Linux, or the task manager under Windows, for instance, makes it possible to detect memory problems when they are obvious. However, this approach is laborious and makes it really hard to track down the faulty code block.

Fortunately, there are a few tools available to make memory snapshots and calculate the number and size of loaded objects. But let's keep in mind that Python does not release memory easily and prefers to hold on to it in case it is needed again.

For some time, one of the most popular tools to use when debugging memory issues and usage in Python was Guppy-PE and its Heapy component. Unfortunately, it no longer appears to be maintained and it lacks Python 3 support. Luckily, the following are some of the other alternatives that are Python 3-compatible to some extent:

Memprof (http://jmdana.github.io/memprof/): This is declared to work on some POSIX-compliant systems (macOS and Linux). Last updated in December 2019.
memory_profiler (https://pypi.org/project/memory-profiler): Declared to be OS-independent. Actively maintained.
Pympler (https://pypi.org/project/Pympler/): Declared to be OS-independent. Actively maintained.
objgraph (https://mg.pov.lt/objgraph/): Declared to be OS-independent. Actively maintained.

Note that the preceding information regarding compatibility is based purely on trove classifiers that are used by the latest distributions of featured packages, declarations from the documentation, and the inspection of projects' build pipeline definitions at the time of writing the book. These values may be different at the time you read this.

As you can see, there are a lot of memory profiling tools available to Python developers. Each one has some constraints and limitations. In this chapter, we will focus on one profiler that is known to work well with the latest release of Python (that is, Python 3.9) on different operating systems. This tool is objgraph.

The API of objgraph may seem a bit clumsy, and it has a very limited set of functionalities. But it works, does what it needs to do very well, and is really simple to use. Memory instrumentation is not a thing that is added to the production code permanently, so this tool does not need to be pretty.

Using the objgraph module

objgraph is a simple module for creating diagrams of object references within applications. These are extremely useful when hunting memory leaks in Python.

objgraph is available on PyPI, but it is not a completely standalone tool and requires Graphviz, which was installed in the Macro-profiling section, in order to create memory usage diagrams.

objgraph provides multiple utilities that allow you to list and print various statistics regarding memory usage and object counts. An example of such utilities in use is shown in the following transcript of interpreter sessions:

>>> import objgraph
>>> objgraph.show_most_common_types()
function                   1910
dict                       1003
wrapper_descriptor         989
tuple                      837
weakref                    742
method_descriptor          683
builtin_function_or_method 666
getset_descriptor          338
set                        323
member_descriptor          305
>>> objgraph.count('list')
266
>>> objgraph.typestats(objgraph.get_leaking_objects())
{'Gt': 1, 'AugLoad': 1, 'GtE': 1, 'Pow': 1, 'tuple': 2, 'AugStore': 1, 'Store': 1, 'Or': 1, 'IsNot': 1, 'RecursionError': 1, 'Div': 1, 'LShift': 1, 'Mod': 1, 'Add': 1, 'Invert': 1, 'weakref': 1, 'Not': 1, 'Sub': 1, 'In': 1, 'NotIn': 1, 'Load': 1, 'NotEq': 1, 'BitAnd': 1, 'FloorDiv': 1, 'Is': 1, 'RShift': 1, 'MatMult': 1, 'Eq': 1, 'Lt': 1, 'dict': 341, 'list': 7, 'Param': 1, 'USub': 1, 'BitOr': 1, 'BitXor': 1, 'And': 1, 'Del': 1, 'UAdd': 1, 'Mult': 1, 'LtE': 1}

Note that the preceding numbers of allocated objects displayed by objgraph are already high due to the fact that a lot of Python built-in functions and types are ordinary Python objects that live in the same process memory. Also, objgraph itself creates some objects that are included in this summary.

As mentioned earlier, objgraph allows you to create diagrams of memory usage patterns and cross-references that link all the objects in the given namespace. The most useful functions of the objgraph module are objgraph.show_refs() and objgraph.show_backrefs(). They both accept a reference to the object being inspected and save a diagram image to file using the Graphviz package. Examples of such graphs are presented in Figure 13.2 and Figure 13.3. Here is the code that was used to create these diagrams:

from collections import Counter
import objgraph
def graph_references(*objects):
    objgraph.show_refs(
        objects,
        filename='show_refs.png',
        refcounts=True,
        # additional filtering for the sake of brevity
        too_many=5,
        filter=lambda x: not isinstance(x, dict),
    )
    objgraph.show_backrefs(
        objects,
        filename='show_backrefs.png',
        refcounts=True
    )
if __name__ == "__main__":
    quote = """
    People who think they know everything are a
    great annoyance to those of us who do.
    """
    words = quote.lower().strip().split()
    counts = Counter(words)
    graph_references(words, quote, counts)

Without Graphviz installed, objgraph will output diagrams in DOT format, which is a special graph description language.

The following diagram shows the diagram of all references held by words, quote, and counts objects:

Figure 13.2: An example result of the show_refs() diagram from the graph_references() function

As you can see, the words object (denoted as list [3]) holds references to 16 objects. The counts object (denoted as Counter [3]) holds references to 15 objects. It is one object less than the words object because the word "who" appears twice. The quote object (denoted as str [5]) is a plain string, so it doesn't hold any extra references.

The following diagram shows the back references leading to words, quote, and counts objects:

Figure 13.3: An example result of the show_backrefs() diagram from the graph_references() function

In the preceding diagram, we see that the quote, words, and counts objects are references in a dictionary (denoted as dict[6]) of global variables of the __main__ module named __globals__. Moreover, the quote object is referenced in a special list object (denoted as list [1]) due to CPython's string interning mechanism.

String interning is the memory optimization mechanism of CPython. Most of the string literals are pre-allocated by CPython when a module is loaded. Strings in Python are immutable, so duplicate occurrences of the same string literal will refer to the same address in memory.

To show how objgraph may be used in practice, let's review a code example that may create memory issues under certain versions of Python. As we already noted in Chapter 9, Bridging Python with C and C++, CPython has its own garbage collector that exists independently from its reference counting mechanism. It is not used for general-purpose memory management, and its sole purpose is to solve the problem of cyclic references. In many situations, objects may reference each other in a way that would make it impossible to remove them using simple techniques based on reference counting. Here is the simplest example:

x = [] 
y = [x] 
x.append(y)

Such a situation is presented visually in Figure 13.4:

Figure 13.4: An example diagram generated by objgraph of cyclic references between two objects

In the preceding case, even if all external references to x and y objects are removed (for instance, by returning from the local scope of a function), these two objects cannot be removed through reference counting because there will always be two cross-references owned by these two objects. This is a situation where the Python garbage collector steps in. It can detect cyclic references to objects and trigger their deallocation if there are no other valid references to these objects outside of the cycle.

The real problem starts when at least one of the objects in such a cycle has the custom __del__() method defined. This is a custom deallocation handler that will be called when the object's reference count finally goes to zero. It can execute any arbitrary Python code and thus can also create new references to featured objects. This is the reason why the garbage collector prior to Python 3.4 could not break reference cycles if at least one of the objects provided the custom __del__() method implementation. PEP 442 introduced safe object finalization to Python and became part of the language standard, starting from Python 3.4. Anyway, this may still be a problem for packages that worry about backward compatibility and target a wide spectrum of Python interpreter versions. The following snippet of code allows you to show the difference in behavior between the cyclic garbage collector in different Python versions:

import gc 
import platform 
import objgraph 
 
 
class WithDel(list): 
    """ list subclass with custom __del__ implementation """ 
    def __del__(self): 
        pass 
 
 
def main(): 
    x = WithDel() 
    y = [] 
    z = [] 
 
    x.append(y) 
    y.append(z) 
    z.append(x) 
 
    del x, y, z 
 
    print("unreachable prior collection: %s" % gc.collect()) 
    print("unreachable after collection: %s" % len(gc.garbage)) 
    print("WithDel objects count:        %s" % 
          objgraph.count('WithDel')) 
 
 
if __name__ == "__main__": 
    print("Python version: %s" % platform.python_version()) 
    print() 
    main()

The following output of the preceding code, when executed under Python 3.3, shows that the cyclic garbage collector in the older versions of Python cannot collect objects that have the __del__() method defined:

$ python3.3 with_del.py 
Python version: 3.3.5
    
unreachable prior collection: 3
unreachable after collection: 1
WithDel objects count:        1

With a newer version of Python, the garbage collector can safely deal with the finalization of objects, even if they have the __del__() method defined, as follows:

$ python3.5 with_del.py 
Python version: 3.5.1
unreachable prior collection: 3
unreachable after collection: 0
WithDel objects count:        0

Although custom finalization is no longer a memory threat in the latest Python releases, it still poses a problem for applications that need to work in different environments. As we mentioned earlier, the objgraph.show_refs() and objgraph.show_backrefs() functions allow you to easily spot problematic objects that take part in unbreakable reference cycles. For instance, we can easily modify the main() function to show all back references to the WithDel instances to see whether we have leaking resources, as follows:

def main(): 
    x = WithDel() 
    y = [] 
    z = [] 
 
    x.append(y) 
    y.append(z) 
    z.append(x) 
 
    del x, y, z 
 
    print("unreachable prior collection: %s" % gc.collect()) 
    print("unreachable after collection: %s" % len(gc.garbage)) 
    print("WithDel objects count:        %s" % 
          objgraph.count('WithDel')) 
 
    objgraph.show_backrefs( 
        objgraph.by_type('WithDel'), 
        filename='after-gc.png' 
    )

Running the preceding example under Python 3.3 will result in the diagram presented in Figure 13.5. It shows that gc.collect() could not succeed in removing x, y, and z object instances. Additionally, objgraph highlights all the objects that have the custom __del__() method in red to make spotting such issues easier:

Figure 13.5: A diagram showing an example of cyclic references that can't be picked by the Python garbage collector prior to Python 3.4

When memory leaks happen in the C extensions (for instance, in Python/C extensions), it is usually much harder to diagnose and profile them. However, harder does not mean impossible, as we will discover in the next section.

C code memory leaks

If the Python code seems perfectly fine and memory still increases when you loop through the isolated function, the leak might be located on the C side. This happens, for instance, when a Py_DECREF macro is missing in the critical part of an imported C extension.

The C code of CPython interpreter is pretty robust and tested for the existence of memory leaks, so it is the last place to look for memory problems. But if you use packages that have custom C extensions, they might be a good place to look first. Because you will be dealing with code operating on a much lower level of abstraction than Python, you need to use completely different tools to resolve such memory issues.

Memory debugging is not easy in C, so before diving into extension internals, make sure that you properly diagnose the source of your problem. It is a very popular approach to isolate a suspicious package with code similar in nature to unit tests. To diagnose the source of your problem, you should consider the following actions:

Write a separate test for each API unit or functionality of an extension you suspect of leaking memory.
Perform the test in a loop for an arbitrarily long time in isolation (one test function per run).
Observe from outside which of the tested functionalities increases memory usage over time.

By using such an approach, you will eventually isolate the faulty part of the extension and this will reduce the time required later to inspect and fix its code. This process may seem burdensome because it requires a lot of additional time and coding, but it does pay off in the long run. You can always make your work easier by reusing some of the testing tools that were introduced in Chapter 10, Testing and Quality Automation. Utilities such as Pytest were perhaps not designed specifically for this case, but can at least reduce the time required to run multiple tests in isolated environments.

If you have successfully isolated the part of the extension that is leaking the memory, you can finally start actual debugging. If you're lucky, a simple manual inspection of the isolated source code section may provide the desired results. In many cases, the problem is as simple as adding the missing Py_DECREF call. Nevertheless, in most cases, your work won't be that simple. In such situations, you need to bring out some bigger guns. One of the notable generic tools for combating memory leaks in compiled code that should be in every programmer's toolbelt is Valgrind. This is a whole instrumentation framework for building dynamic analysis tools. Because of this, it may not be easy to learn and master, but you should definitely acquaint yourself with the basics of its usage.

You can learn more about Valgrind at https://valgrind.org.

After profiling, when you know what's wrong with your code performance, it's time to apply actual code optimizations. The most common reason for bad performance is code complexity. Very often, code complexity can be reduced just by applying appropriate data structures. Let's now look at some examples of optimizations with built-in Python data types.

Reducing complexity by choosing appropriate data structures

To reduce code complexity, it's important to consider how data is stored. You should pick your data structure carefully. The following section will provide you with a few examples of how the performance of simple code snippets can be improved by the correct data types.

Searching in a list

Due to the implementation details of the list type in Python, searching for a specific value in a list isn't a cheap operation. The complexity of the list.index() method is O(n), where n is the number of list elements. Such linear complexity won't be an issue if you only need to perform a few element index lookups, but it can have a negative performance impact in some critical code sections, especially if done over very large lists.

If you need to search through a list quickly and often, you can try the bisect module from Python's standard library. The functions in this module are mainly designed for inserting or finding insertion indexes for given values in a way that will preserve the order of the already sorted sequence. This module is used to efficiently find an element index using a bisection algorithm. The following recipe, from the official documentation of the function, finds an element index using a binary search:

def index(a, x): 
    'Locate the leftmost value exactly equal to x' 
    i = bisect_left(a, x) 
    if i != len(a) and a[i] == x: 
        return i 
    raise ValueError

Note that every function from the bisect module requires a sorted sequence in order to work. If your list is not in the correct order, then sorting it is a task with a worst-case complexity of O(n log n). This is a worse class than O(n), so sorting the whole list to then perform only a single search will not pay off. However, if you need to perform a number of index searches across a large list that rarely changes, using a single sort() operation for bisect may prove to be the best trade-off.

If you already have a sorted list, you can also insert new items into that list using bisect without needing to re-sort it. The bisect_left() and bisect_right() functions already return insertion points in the left-to-right or right-to-left sorted lists accordingly. The following is an example of inserting a new value in a left-to-right sorted list using the bisect_left() function:

>>> from bisect import bisect_left
>>> items = [1, 5, 6, 19, 20]
>>> items.insert(bisect_left(items, 15), 15)
>>> items
[1, 5, 6, 15, 19, 20]

There are also insort_left() and insort_right() functions that are shorthand for inserting elements in sorted lists:

>>> from bisect import insort_left
>>> items = [1, 5, 6, 19, 20]
>>> insort_left(items, 15)
>>> items
[1, 5, 6, 15, 19, 20]

In the next section, we will see how to use a set instead of a list when element uniqueness is required.

Using sets

When you need to build a sequence of distinct values from a given sequence, the first algorithm that might come to mind is as follows:

>>> sequence = ['a', 'a', 'b', 'c', 'c', 'd']
>>> result = []
>>> for element in sequence:
...     if element not in result:
...         result.append(element)
... 
>>> result
['a', 'b', 'c', 'd']

In the preceding example, the complexity is introduced by the lookup in the result list. The in operator has a time complexity of O(n). It is then used in the loop, which costs O(n). So, the overall complexity is quadratic, that is, O(n²).

Using a set type for the same work will be faster because the stored values are looked up using hashes (the same as in the dict type). The set type also ensures the uniqueness of elements, so we don't need to do anything more than create a new set from the sequence object. In other words, for each value in sequence, the time taken to see whether it is already in the set will be constant, as follows:

>>> sequence = ['a', 'a', 'b', 'c', 'c', 'd']
>>> unique = set(sequence)
>>> unique
set(['a', 'c', 'b', 'd'])

This lowers the complexity to O(n), which is the complexity of the set object creation. The additional advantage of using the set type for element uniqueness is shorter and more explicit code.

Sometimes the built-in data types are not enough to handle your data structures efficiently. Python comes with a great set of additional performant data types in the collections module.

Using the collections module

The collections module provides specialized alternatives to general-purpose, built-in container types. The main types from this module that we will focus on in this chapter are as follows:

deque: A list-like type with extra features
defaultdict: A dict-like type with a built-in default factory feature
namedtuple: A tuple-like type that assigns keys for members

We've already discussed other types from the collections module in other chapters: ChainMap in Chapter 3, New Things in Python; UserList and UserDict in Chapter 4, Python in Comparison with Other Languages; and Counter in Chapter 5, Interfaces, Patterns, and Modularity.

We'll discuss these types in the following sections.

deque

A deque is an alternative implementation for lists. While the built-in list type is based on ordinary arrays, a deque is based on a doubly-linked list. Hence, a deque is much faster when you need to insert something into its tail or head, but much slower when you need to access an arbitrary index.

Of course, thanks to the overallocation of an internal array in the Python list type, not every list.append() call requires memory reallocation, and the average complexity of this method is O(1). The situation changes dramatically when the element needs to be added at the first index of the list. Because all elements to the right of the new one need to be shifted in an array, the complexity of list.insert() is O(n). If you need to perform a lot of pops, appends, and inserts, a deque in place of a list may provide a substantial performance improvement.

Remember to always profile your code before switching from a list to a deque because a few things that are fast in arrays (such as accessing an arbitrary index) are extremely inefficient in linked lists.

For example, if we measure the time it takes to append one element and remove it from the sequence with timeit, the difference between a list and a deque may not even be noticeable.

The following is a sample timeit run for the list type:

$ python3 -m timeit \
  -s 'sequence=list(range(10))' \
  'sequence.append(0); sequence.pop();'
1000000 loops, best of 3: 0.168 usec per loop

And the following is a sample timeit run for the deque type:

$ python3 -m timeit \ 
  -s 'from collections import deque; sequence=deque(range(10))' \
  'sequence.append(0); sequence.pop();'
1000000 loops, best of 3: 0.168 usec per loop

However, if we perform a similar comparison for situations where we want to add and remove the first element of the sequence, the performance difference is impressive.

The following is a sample timeit run for the list type:

$ python3 -m timeit \
  -s 'sequence=list(range(10))' \
  'sequence.insert(0, 0); sequence.pop(0)'  
1000000 loops, best of 3: 0.392 usec per loop

And the following is a similar timeit run for the deque type:

$ python3 -m timeit \
  -s 'from collections import deque; sequence=deque(range(10))' \
  'sequence.appendleft(0); sequence.popleft()'
10000000 loops, best of 3: 0.172 usec per loop

As you may expect, the difference gets bigger as the size of the sequence grows. The following is an example of running the same test with timeit for a list that contains 10,000 elements:

$ python3 -m timeit \
  -s 'sequence=list(range(10000))' \
  'sequence.insert(0, 0); sequence.pop(0)'
100000 loops, best of 3: 14 usec per loop

If we do the same for deque, we will see that the timing of the operation does not change:

$ python3 -m timeit \
  -s 'from collections import deque; sequence=deque(range(10000))' \ 
  'sequence.appendleft(0); sequence.popleft()'
10000000 loops, best of 3: 0.168 usec per loop

Thanks to the efficient append() and pop() methods, which work at the same speed from both ends of the sequence, deque makes a perfect example of implementing queues. For example, a First-In First-Out (FIFO) queue will be much more efficient if implemented with deque instead of list.

deque works well when implementing queues, but there is also a separate queue module in Python's standard library that provides a basic implementation for FIFO, Last-In First-Out (LIFO), and priority queues. If you want to utilize queues as a mechanism of inter-thread communication, you should use classes from the queue module instead of collections.deque. This is because these classes provide all the necessary locking semantics. If you don't use threading and choose not to utilize queues as a communication mechanism, deque should be enough to provide queue implementation basics.

defaultdict

The defaultdict type is similar to the dict type, except it adds a default factory for new keys. This avoids the need to write an extra test to initialize the mapping entry and is also a bit more efficient than the dict.setdefault method.

defaultdict may seem like simple syntactic sugar over dict that allows us to write shorter code. However, the fallback to a pre-defined value on a failed key lookup is slightly faster than the dict.setdefault() method.

The following is a timeit run for the dict.setdefault() method:

$ python3 -m timeit \
  -s 'd = {}' \
  'd.setdefault("x", None)'
10000000 loops, best of 3: 0.153 usec per loop

And the following is a timeit run for the equivalent defaultdict usage:

$ python3 -m timeit \
-s 'from collections import defaultdict; d = defaultdict(lambda: None)' \
'd["x"]'
10000000 loops, best of 3: 0.0447 usec per loop

The difference in the preceding example may seem more substantial, but the computational complexity hasn't changed. The dict.setdefault() method consists of two steps (key lookup and key set), both of which have a complexity of O(1). There is no way to have a complexity class lower than O(1), but sometimes it is worth looking for faster alternatives in the same O(1) class. Every small speed improvement counts when optimizing critical code sections.

The defaultdict type takes a factory as a parameter and can therefore be used with built-in types or classes whose constructors do not take arguments. The following code snippet is an example from the official documentation that demonstrates how to use defaultdict for counting:

>>> from collections import defaultdict 
>>> s = 'mississippi'
>>> d = defaultdict(int)
>>> for k in s:
...     d[k] += 1
...
>>> list(d.items())
[('i', 4), ('p', 2), ('s', 4), ('m', 1)]

For this particular example (counting unique elements), the collections module also offers a special Counter class. It can be used to query efficiently for a number of top elements.

namedtuple

namedtuple is a class factory that takes a name of a type with a list of attributes and creates a class out of it. The new class can then be used to instantiate tuple-like objects and also provides accessors for their elements, as follows:

>>> from collections import namedtuple 
>>> Customer = namedtuple(
...     'Customer',
...     'firstname lastname'
... )
>>> c = Customer('Tarek', 'Ziadé')
>>> c.firstname
'Tarek'

As shown in the preceding example, it can be used to create records that are easier to write compared to a custom class that may require boilerplate code to initialize values. On the other hand, it is based on a tuple, so gaining access to its elements by means of an index is a quick process. The generated class can also be sub-classed to add more operations.

The advantage of using namedtuple over other data types may not be obvious at first glance. The main advantage is that it is easier to use, understand, and interpret than ordinary tuples. Tuple indexes don't carry any semantics, so it is great to be able to access tuple elements by attributes. Note that you could also get the same benefits from dictionaries that have an O(1) average complexity of get and set operations.

The main advantage in terms of performance is that namedtuple is still a flavor of tuple. This means that it is immutable, so the underlying array storage is allocated for the necessary size. Dictionaries, on the other hand, need to use overallocation in the internal hash table to ensure the low-average complexity of get/set operations. So, namedtuple wins over dict in terms of memory efficiency.

Similar micro memory optimization can be done on user-defined classes using slots. Slots were discussed in Chapter 4, Python in Comparison with Other Languages.

The fact that namedtuple is based on a tuple may also be beneficial for performance. Its elements may be accessed by an integer index, as in other simple sequence objects—lists and tuples. This operation is both simple and fast. In the case of dict or custom class instances (that use dictionaries for storing attributes), the element access requires a hash table lookup. Dictionaries are highly optimized to ensure good performance independently from collection size, but as mentioned, O(1) complexity is actually only regarded as average complexity. The actual, amortized worst-case complexity for set/get operations in dict is O(n). The real amount of work required to perform such an operation is dependent on both collection size and history. In sections of code that are critical for performance, it may be wise to use lists or tuples over dictionaries, as they are more predictable. In such a situation, namedtuple is a great type that combines the following advantages of dictionaries and tuples:

In sections where readability is more important, the attribute access may be preferred.
In performance-critical sections, elements may be accessed by their indexes.

Named tuples can be a useful optimization tool, but when readability matters, usually, data classes are a better choice for storing struct-like data. We've discussed data classes and their advantages in Chapter 4, Python in Comparison with Other Languages.

Reduced complexity can be achieved by storing data in an efficient data structure that works well with the way the algorithm will use it. That said, when the solution is not obvious, you should consider dropping and rewriting the incriminating part of the code instead of sacrificing the readability for the sake of performance. Often, Python code can be both readable and fast, so try to find a good way to perform the work instead of trying to work around a flawed design.

But sometimes, the problem we are trying to solve doesn't have an efficient solution or we don't have a good and performant structure at hand. In such situations, it is worth considering some architectural trade-offs. We will discuss examples of such trade-offs in the next section.

Leveraging architectural trade-offs

When your code can no longer be improved by reducing the complexity or choosing a proper data structure, a good approach may be to consider a trade-off. If we review users' problems and define what is really important to them, we can often relax some of the application's requirements. Performance can often be improved by doing the following:

Replacing exact solution algorithms with heuristics and approximation algorithms
Deferring some work to delayed task queues
Using probabilistic data structures

Let's move on and take a look at these improvement methods.

Using heuristics and approximation algorithms

Some algorithmic problems simply don't have good state-of-the-art solutions that could run within a time span that would be acceptable to the user.

For example, consider a program that deals with complex optimization problems, such as the Traveling Salesman Problem (TSP) or the Vehicle Routing Problem (VRP). Both problems are NP-hard problems in combinatorial optimization. Exact low-complexity algorithms for these problems are not known; this means that the size of a problem that can be solved practically is greatly limited. For larger inputs, it is unlikely that you'll be able to provide the correct solution in time.

Fortunately, a user will likely not be interested in the best possible solution, but one that is good enough and can be obtained in a timely manner. In these cases, it makes sense to use heuristics or approximation algorithms whenever they provide acceptable results:

Heuristics solve given problems by trading optimality, completeness, accuracy, or precision for speed. Thus, it may be hard to prove the quality of their solutions compared to the result of exact algorithms.
Approximation algorithms are similar in idea to heuristics, but, unlike heuristics, have provable solution quality and runtime bounds.

There are many known good heuristics and approximation algorithms that can solve extremely large TSP or VRP problems within a reasonable amount of time. They also have a high probability of producing good results, just 2-5% from the optimal solution.

Another good thing about heuristics is that they don't always need to be constructed from scratch for every new problem that arises. Their higher-level versions, called metaheuristics, provide strategies for solving mathematical optimization problems that are not problem-specific and can thus be applied in many situations. Popular metaheuristic algorithms include the following:

Simulated annealing, which imitates the physical processes happening during annealing in metallurgy (controlled heating and cooling of materials)
Evolutionary computation, which is inspired by biological evolutions and uses evolutionary mechanisms, such as mutation, reproduction, recombination, and selection, to efficiently search a large area of solutions in complex problems
Genetic algorithms, which are a specialized form of evolutionary computation that represent possible problem solutions as sets of genotypes and perform genetic transformations, such as crossover and mutations, to produce better results
Tabu search, which is a general problem search technique that introduces prohibited search paths (taboos) to reduce the probability of algorithm settling in local optimums
Ant colony optimization, which imitates the behavior of ants in a colony when searching through the possible space of problem solutions

Heuristic and approximation algorithms are viable optimization techniques when the majority of the performance happens in a single algorithmic task of the application. But often, performance issues are caused by general system architecture and communication links between different system components.

A common architectural trade-off that improves the perceived performance of complex applications involves the use of task queues and delayed processing.

Using task queues and delayed processing

Sometimes, it's not about doing too much, but about doing things at the right time. A common example that's often mentioned in literature is sending emails within a web application. In this case, increased response times for HTTP requests may not necessarily translate to your code implementation. The response time may instead be dominated by a third-party service, such as a remote email server. So, can you ever successfully optimize your application if it spends most of its time waiting on other services to reply?

The answer is both yes and no. If you don't have any control over a service that is the main contributor to your processing time, and there is no faster solution you can use, you cannot speed up the service any further. A simple example of processing an HTTP request that results in sending an email is presented in Figure 13.6:

Figure 13.6: An example of synchronous email delivery in a web application

If your code relies on third-party services, you often cannot reduce the waiting time. But you can change the way users will perceive it.

The usual solution to this is kind of problem is to use message or task queues (see Figure 13.7). When you need to do something that could take an indefinite amount of time, add it to the queue of work that needs to be done and immediately tell the user that their request was accepted. This is why sending emails is such a great example: emails are already task queues! If you submit a new message to an email server using the SMTP protocol, a successful response does not mean your email was delivered to the addressee—it means that the email was delivered to the email server. If a response from the server does not guarantee that an email was delivered, you don't need to wait for it in order to generate an HTTP response for the user:

Figure 13.7: An example of asynchronous email delivery in a web application

Even if your email server is responding at blazing speed, you may need some more time to generate the message that needs to be sent. Are you generating yearly reports in an XLS format? Or are you delivering invoices via PDF files? If you use an email transport system that is already asynchronous, then you can put the whole message generation task to the message processing system. If you cannot guarantee the exact time of delivery, then you should not bother generating your deliverables synchronously.

The correct usage of task and message queues in critical sections of an application can also give you other benefits, including the following:

Web workers that serve HTTP requests will be relieved from additional work and will be able to process requests faster. This means that you will be able to process more requests with the same resources and thereby handle greater loads.
Message queues are generally more immune to transient failures of external services. For instance, if your database or email server times out from time to time, you can always re-queue the currently processed task and retry it later.
With a good message queue implementation, you can easily distribute work on multiple machines. This approach may improve the scalability of some of your application components.

As you can see in Figure 13.7, adding an asynchronous task process to your application inevitably increases the complexity of the whole system's architecture. You will need to set up some new backing services (a message queue such as RabbitMQ) and create workers that will be able to process asynchronous jobs. Fortunately, there are some popular tools available for building distributed task queues.

A popular Python tool for asynchronous job handling is Celery. Celery is a fully fledged task queue framework with the support of multiple message brokers that also allows for the scheduled execution of tasks. It can even replace your cron jobs. You can read more about Celery at http://www.celeryproject.org/.

If you need something simpler, then RQ might be a good alternative. RQ is a lot simpler than Celery and uses Redis key-value storage as its message broker Redis Queue (RQ). You can read more about RQ at http://python-rq.org/.

Although there are some good and battle-hardened tools available, you should always carefully consider your approach to task queues. Not every kind of task should be processed in queues. While queues are good at solving a number of issues, they can also introduce the following problems:

The increased complexity of system architecture
The possibility of processing a single message more than once
More services to maintain and monitor
Larger processing delays
More difficult logging

A completely different approach to architectural tradeoffs involves the use of non-deterministic/probabilistic data structures.

Using probabilistic data structures

Probabilistic data structures are structures that are designed to store collections of values in a way that allows you to answer specific questions within time or resource constraints that would otherwise be impossible. A common example would be efficiently storing unique view counts in a large video streaming platform such as YouTube that has billions of videos and billions of users. Having naïve implementation that stores exact information on who watched what would take a tremendous amount of memory and would probably be hard to operate efficiently. When the problem is that big, it may be necessary to consider the use of probabilistic data structures.

The most important feature of probabilistic data structures is that the answers they give are only probable to be true; in other words, they are just approximations of real values. The probability of a correct answer can be easily estimated, however. Despite not always giving the correct answer, probabilistic data structures can still be useful if there is some room for potential error.

There are a lot of data structures with such probabilistic properties. Each one of them solves specific problems, but due to their stochastic nature, they cannot be used in every situation. As a practical example, we'll talk about one of the more popular structures, HLL (HyperLogLog).

HLL is an algorithm that approximates the number of distinct elements in a multiset. With ordinary sets, if you want to know the number of unique elements, you need to store all of them. This is obviously impractical for very large datasets. HLL is distinct from the classical way of implementing sets as programming data structures.

Without digging into implementation details, let's say that it only concentrates on providing an approximation of set cardinality; real values are never stored. They cannot be retrieved, iterated, or tested for membership. HLL trades accuracy and correctness for time complexity and size in memory. For instance, the Redis implementation of HLL takes only 12k bytes with a standard error of 0.81%, with no practical limit on collection size.

Using probabilistic data structures is an interesting way of solving performance problems. In most cases, it is about trading off some accuracy for faster processing or more efficient resource usage. It does not always need to do so, however. Probabilistic data structures are often used in key/value storage systems to speed up key lookups. One of the most popular techniques that's used in such systems is called an Approximate Member Query (AMQ). An interesting probabilistic data structure that is often used for this purpose is the Bloom filter.

In the next section, we'll take a look at caching.

Caching

When some of your application functions take too long to compute, a useful technique to consider is caching. Caching saves the return values of function calls, database queries, HTTP requests, and so on for future reference. The result of a function or method that is expensive to run can be cached as long as one of the following conditions is met:

The function is deterministic and the results have the same value every time, given the same input.
The return value of the function is non-deterministic, but continues to be useful and valid for certain periods of time.

In other words, a deterministic function always returns the same result for the same set of arguments, whereas a non-deterministic function returns results that may vary in time. The caching of both types of results usually greatly reduces the time of computation and allows you to save a lot of computing resources.

The most important requirement for any caching solution is a storage system that allows you to retrieve saved values significantly faster than it takes to calculate them. Good candidates for caching are usually the following:

Results from callables that query databases
Results from callables that render static values, such as file content, web requests, or PDF rendering
Results from deterministic callables that perform complex calculations
Global mappings that keep track of values with expiration times, such as web session objects
Results that need to be accessed often and quickly

Another important use case for caching is when saving results from third-party APIs served over the web. This may greatly improve application performance by cutting off network latencies, but it also allows you to save money if you are billed for every request to an API.

Depending on your application architecture, the cache can be implemented in many ways and with various levels of complexity. There are many ways of providing a cache, and complex applications can use different approaches on different levels of the application architecture stack. Sometimes, a cache may be as simple as a single global data structure (usually a dictionary) that is kept in the process memory. In other situations, you may want to set up a dedicated caching service that will run on carefully tailored hardware. The following sections will provide you with basic information on the most popular caching approaches, guiding you through some common use cases as well as the common pitfalls.

So, let's move on and see what deterministic caching is.

Deterministic caching

Deterministic functions are the easiest and safest use case for caching. Deterministic functions always return the same value if given the same input, so caching can generally store their results indefinitely. The only limitation to this approach is storage capacity. The simplest way to cache results is to put them into process memory, as this is usually the fastest place to retrieve data from. Such a technique is often called memoization.

Memoization is very useful when optimizing recursive functions that may need to evaluate the same input multiple times. We already discussed recursive implementations for the Fibonacci sequence in Chapter 9, Bridging Python with C and C++. Earlier on in this book, we tried to improve the performance of our program with C and Cython. Now, we will try to achieve the same goal by simpler means—through caching. Before we do that, let's first recall the code for the fibonacci() function, as follows:

def fibonacci(n): 
    if n < 2: 
        return 1 
    else: 
        return fibonacci(n - 1) + fibonacci(n - 2)

As we can see, fibonacci() is a recursive function that calls itself twice if the input value is more than two. This makes it highly inefficient. The runtime complexity is O(2ⁿ) and its execution creates a very deep and vast call tree. For a large input value, this function will take a long time to execute. There is also is a high chance that it will exceed the maximum recursion limit of the Python interpreter.

If you take a closer look at the following Figure 13.8, which presents an example call tree of the fibonacci() function, you will see that it evaluates many of the intermediate results multiple times. A lot of time and resources can be saved if we reuse some of these values:

Figure 13.8: A call tree for the fibonacci(5) execution

An example of a simple memoization attempt would be to store the results of previous runs in a dictionary and to retrieve them if they are available.

Both the recursive calls in the fibonacci() function are contained in a single line of code, as follows:

return fibonacci(n - 1) + fibonacci(n - 2)

We know that Python evaluates instructions from left to right. This means that, in this situation, the call to the function with a higher argument value will be executed before the call to the function with a lower argument value. Thanks to this, we can provide memoization by constructing a very simple decorator, as follows:

def memoize(function): 
    call_cache = {} 
 
    def memoized(argument): 
        try: 
            return call_cache[argument] 
        except KeyError: 
            return call_cache.setdefault(
                argument, function(argument)
            ) 
 
    return memoized

We can then apply it to the fibonacci() function as follows:

@memoize 
def fibonacci(n): 
    if n < 2: 
        return 1 
    else: 
        return fibonacci(n - 1) + fibonacci(n - 2)

We've used the dictionary on the closure of the memoize() decorator as a simple storage solution from cached values. Saving and retrieving values in the preceding data structure has an average complexity of O(1). This should greatly reduce the overall complexity of the memoized function. Every unique function call will be evaluated only once. The call tree of such an updated function is presented in Figure 13.9. Even without mathematical proof, we can visually deduce that we have reduced the complexity of the fibonacci() function from the very expensive O(2ⁿ) to the linear O(n):

Figure 13.9: A call tree for the fibonacci(5) execution with memorization

The implementation of our memoize() decorator is, of course, not perfect. It worked well for the preceding example, but it isn't a reusable piece of software. If you need to memoize functions with multiple arguments, or want to control the size of your cache, you will require something more generic.

Luckily, the Python standard library provides a very simple and reusable utility that can be used in most cases of deterministic caching. This utility is the lru_cache() decorator from the functools module. The name comes from the LRU (Last Recently Used) algorithm. The following additional parameters allow for finer control of memoization behavior:

maxsize: This sets the maximum size of the cache. The None value means no limit at all.
typed: This defines whether values of different types that compare as equal should be mapped to the same result.

The usage of lru_cache() in our Fibonacci sequence example would be as follows:

from functools import lru_cache
@lru_cache(None) 
def fibonacci(n): 
    if n < 2: 
        return 1 
    else: 
        return fibonacci(n - 1) + fibonacci(n - 2)

In the next section, we will take a look at non-deterministic caching.

Non-deterministic caching

Caching non-deterministic functions is trickier than memoization. Since every execution of such a function may provide different results, it is usually impossible to use previous values for an arbitrarily long amount of time. What you need to do instead is to decide for how long a cached value can be considered valid. After a defined period of time passes, the stored results are considered stale, and the cache will need to be refreshed with a new value.

So, in other words, non-deterministic caching is performed in any situation where pre-computed results are used temporarily. Cached non-deterministic functions often depend on some external state that is hard to track inside your application code. Typical examples of components include the following:

Relational databases, or generally any type of structured data storage engine
Third-party services accessible through a network connection (web APIs)
Filesystems

Note that such an implementation is a trade-off. If you resign from running part of your code whenever necessary and instead use historical results, you are risking using data that is stale or represents an inconsistent state of your system. In this case, you are trading accuracy and/or completeness for speed and performance.

Of course, such caching is only efficient as long as the time taken to interact with the cache is less than the time the cached function takes to execute. If it's faster to simply recalculate the value, by all means do so! That's why setting up a cache has to be done only if it's worth it; setting it up properly has a cost.

Things that can actually be cached are usually the results of interactions with other components of your system. For instance, if you want to save time and resources when communicating with the database, it is worth caching frequent and expensive queries. If you want to reduce the number of I/O operations, you may want to cache the content of files that are most frequently accessed or responses from external APIs.

Techniques for caching non-deterministic functions are actually very similar to those used in caching deterministic ones. The most notable difference is that they usually require the option of invalidating cached values by their age. This means that the lru_cache() decorator from the functools module has limited use; however, it should not be too difficult to extend this function to provide the expiration feature. As this is a very common problem that has been solved numerous times by many developers, you should be able to find multiple libraries on PyPI that have been designed for caching non-deterministic values.

Using local process memory is a fast way to cache values, but every process maintains its own cache. If you have a lot of processes, independent caches can take a substantial amount of your memory. In distributed systems, it is common to use dedicated cache services.

Cache services

Although non-deterministic caching can be implemented using local process memory, it is actually rarely done that way in a distributed system. That's because a cache needs to be duplicated for every service and this is often a waste of resources. Moreover, multiple process instances can have different cache values, and this may lead to data inconsistencies.

If you run into a situation where non-deterministic caching is your preferred solution to performance problems, you may well need something more. Usually, non-deterministic caching is your must-have solution when you need to serve data or a service to multiple users at the same time.

Sooner or later, you may also need to ensure that users can be served concurrently. While local memory provides a way of sharing data between multiple threads, threading may not be the best concurrency model for every application. It does not scale well, so you will eventually need to run your application as multiple processes.

If you are lucky enough, you may be able to run your application on hundreds or thousands of machines. If you would like to store cached values in local memory in this scenario, your cache will need to be duplicated on every process that requires it. This is not just a total waste of resources—if every process has its own cache, which is already a trade-off between speed and consistency, how can you guarantee that all caches are consistent with each other?

Consistency across subsequent requests is a serious concern, especially for web applications with distributed backends. In complex distributed systems, it is extremely hard to ensure that the user will always be served by the same process hosted on the same machine. It is, of course, doable to some extent, but once you solve that problem, 10 others will pop up.

If you are making an application that needs to serve multiple concurrent users, the best way to handle a non-deterministic cache is to use a dedicated service. With tools such as Redis or Memcached, you allow all of your application processes to share the same cached results. This both reduces the use of precious computing resources and saves you from any problems caused by having too many independent and inconsistent caches.

Caching services such as Memcached are useful for implementing memoization-like caches with states that can be easily shared across multiple processes, and even multiple servers. There is also another way of caching that can be implemented on a system architecture-level, and such an approach is extremely common in applications working over the HTTP protocol. Many elements of a typical HTTP application stack provide elastic caching capabilities that often use mechanisms that are well standardized by the HTTP protocol. This kind of caching can, for instance, take the form of the following:

Caching reverse-proxy (for example, nginx or Apache): Where a proxy caches full responses from multiple web workers working on the same host
Caching load balancer (for example, HAProxy): Where a load balancer not only distributes the load over multiple hosts, but also caches their responses
Content distribution network: Where resources from your servers are cached by a system that also tries to keep them in close geographical proximity to users, thus reducing network roundtrip times

In the next section, we will take a look at Memcached.

Using Memcached

If you want to be serious about caching, Memcached is a very popular and battle-hardened solution. This cache server is used by big applications, including Facebook and Wikipedia, to scale their websites. Among simple caching features, it has clustering capabilities that make it possible for you to set up an efficiently distributed cache system in no time.

Memcached is a multi-platform service, and there are a handful of libraries for communicating with it available in multiple programming languages. Many Python clients differ slightly from one another, but the basic usage is usually the same. The simplest interaction with Memcached almost always consists of the following three methods:

set(key, value): This saves the value for the given key.
get(key): This gets the value for the given key if it exists.
delete(key): This deletes the value under the given key if it exists.

The following code snippet is an example of integration with Memcached using one popular Python package available on PyPI, pymemcache:

from pymemcache.client.base import Client 
 
# setup Memcached client running under 11211 port on localhost 
client = Client(('localhost', 11211)) 
 
# cache some value under some key and expire it after 10 seconds 
client.set('some_key', 'some_value', expire=10) 
 
# retrieve value for the same key 
result = client.get('some_key')

One of the downsides of Memcached is that it is designed to store values as binary blobs. This means that more complex types need to be serialized in order to be successfully stored in Memcached. A common serialization choice for simple data structures is JSON. An example of how to use JSON serialization with pymemcached is as follows:

import json 
from pymemcache.client.base import Client 
def json_serializer(key, value): 
     if type(value) == str: 
         return value, 1 
     return json.dumps(value), 2 
def json_deserializer(key, value, flags): 
    if flags == 1: 
        return value 
    if flags == 2: 
        return json.loads(value) 
    raise Exception("Unknown serialization format") 
client = Client(('localhost', 11211), serializer=json_serializer, 
                deserializer=json_deserializer) 
client.set('key', {'a':'b', 'c':'d'}) 
result = client.get('key')

Another problem that is very common when working with a caching service that works on the key/value storage principle is how to choose key names.

For cases when you are caching simple function invocations that have basic parameters, the solution is usually simple. Here, you can convert the function name and its arguments into strings and then concatenate them together. The only thing you need to worry about is making sure that there are no collisions between keys that have been created for different functions if you are caching in different places within an application.

A more problematic case is when cached functions have complex arguments that consist of dictionaries or custom classes. In that case, you will need to find a way to convert invocation signatures into cache keys in a consistent manner.

Many caching services (including Memcached) store their cache in RAM to provide the best cache lookup performance. Often, old cache keys can be removed when the working dataset gets too large. The whole cache can be also cleared when the service gets restarted. It is important to take this into account and use caching services to store data that should remain persistent. Sometimes, it may also be necessary to provide a proper cache warmup procedure that will populate the cache with the most common cache entries (for instance, in the case of a service upgrade or new application release).

The last problem is that Memcached, like many other caching services, does not respond well to very long key strings, which may either reduce performance or will simply not fit the hardcoded service limits. For instance, if you cache whole SQL queries, the query strings themselves are generally suitable unique identifiers that can be used as keys. On the other hand, complex queries are generally too long to be stored in a caching service such as Memcached. A common practice is to instead calculate the MD5, SHA, or any other hash function and use that as a cache key instead. The Python standard library has a hashlib module that provides implementation for a few popular hash algorithms. One important thing to note when using hashing functions is hash collisions. No hash function guarantees that collisions will never occur, so always be sure to know and mitigate any potential risks.

Summary

In this chapter, we've learned about the optimization process, from identifying possible bottlenecks, through common profiling techniques, to useful optimization techniques that can be applied to a wide variety of performance problems.

This chapter concludes the book, the same way that optimization usually concludes application development cycles. We perform an optimization process on applications that are known to work well. That's why it is important to have proper methodologies and processes set in place that will ensure that our application continues to work properly.

While optimization often concentrates on reducing algorithmic and computational complexity, it can increase different kinds of complexity. Optimized applications are often harder to read and understand, and so are more complex in terms of readability and maintainability. Architectural trade-offs often rely on introducing dedicated services or using solutions that sacrifice part of application correctness or accuracy. Applications that leverage such architectural trade-offs almost always have more complex architectures.

Code optimization, like every other development practice, requires skill and expertise. Part of that expertise is knowing what the balance between various development processes and activities is. Sometimes, minor optimizations are not worth doing at all. Sometimes, it is worth breaking a few rules to satisfy business needs. That is why, in this book, we tried to capture a holistic view of the entire development process of the application. You may not always be doing everything on your own, but knowing how applications are built, maintained, tested, released, observed, and optimized will allow you to know the right balance between each of those activities.

Share your experience

Thank you for taking the time to read this book. If you enjoyed this book, help others to find it. Leave a review at https://www.amazon.com/dp/1801071101

Previous Chapter