SOME PYTHON CODE AND SOME THOUGHTS

Nov 04 2011
25 notes

Optimization of the Monty Python parody code

In this post, we’ll see how the program proposed in my previous post can be optimized in terms of execution time. The code can be found in the project GitHub repository. If you haven’t read the first article but are interested in code optimization, I’d advise you to read it first, to understand where the problem comes from.

What is the problem?

My code is slow. It takes too much time to run: around 45 seconds.

Profiling : Let’s find the bottlenecks !

To identify the bottlenecks in our code, we can take advantage of the excellent profile/cProfile Python modules. (More information here: http://docs.python.org/library/profile.html)

Let’s run our code using cProfile:
$ python -m cProfile -o proba.prof probabilities.py ../data/data.txt

This command creates a file proba.prof containing the profiling results. We can now read these results using the pstats module.

>>> import pstats
>>> p = pstats.Stats("proba.prof")
>>> p.sort_stats("cumulative").print_stats(10)

These commands output a listing of the 10 most time-consuming functions.

   5931126 function calls (5801827 primitive calls) in 48.691 CPU seconds

   Ordered by: cumulative time
   List reduced from 80 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   48.691   48.691 <string>:1(<module>)
        1    0.001    0.001   48.691   48.691 {execfile}
        1    0.005    0.005   48.690   48.690 probabilities.py:13(<module>)
        1    1.177    1.177   48.657   48.657 probabilities.py:33(empirical_entropy_rate)
  1617584    1.818    0.000   45.773    0.000 probabilities.py:88(conditional_empirical_proba)
  1617584    1.518    0.000   43.955    0.000 probabilities.py:112(n_b_ak)
  1638868   43.075    0.000   43.075    0.000 {method 'count' of 'str' objects}
        1    0.009    0.009    1.050    1.050 /usr/lib/python2.6/pickle.py:1361(dump)
        1    0.000    0.000    1.041    1.041 /usr/lib/python2.6/pickle.py:220(dump)
  86729/1    0.275    0.000    1.041    1.041 /usr/lib/python2.6/pickle.py:269(save)

It seems that 90% of the execution time is spent in the str.count() method. How come? I thought this method was written in C and batshit optimized? Well, yes, it is: str.count() takes 0.000 s per call. But the thing is, we call it 1638868 times!

We can visualize these profiling results using a nice tool written in Python (wxPython): RunSnakeRun.
To use it, type $ runsnake proba.prof to generate a nice output like this (sorry for the flashy green >_<):

We thus get a (rather) nice graphical view of the profiling results previously obtained with the profile module.

Line profiling

This kind of profiling only gives us the most time-consuming functions. We can use line profiling to locate the most time-consuming lines. For that, we use the kernprof.py line profiler.
We need to add the @profile decorator on top of each function we want to profile.
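A handy idiom here (not from the original repo, just a common pattern): kernprof injects a `profile` built-in at runtime, so a script decorated with @profile crashes when run normally. A no-op fallback lets the same script run both ways. The n_b_ak body below is a simplified sketch of the post’s function, not its exact code.

```python
# kernprof injects a `profile` built-in; plain Python does not.
# This no-op fallback lets the script run with or without kernprof.
try:
    profile  # noqa: F821 -- defined by kernprof when line-profiling
except NameError:
    def profile(func):
        return func

@profile
def n_b_ak(chain, ak, symbol):
    # simplified sketch: non-overlapping occurrences of ak
    # immediately followed by symbol, like str.count()
    return chain.count(ak + symbol)

n_b_ak("xaxbxa", "x", "a")  # -> 2
```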

We then type $ python kernprof.py -l -v probabilities_line_profile.py ../data/data.txt

This gives us the following output, for the n_b_ak function:

File: probabilities_line_profile.py
Function: n_b_ak at line 113
Total time: 47.6636 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   113                                           @profile
   114                                           def n_b_ak(chain, ak, symbol): # n_(b|a^k)
   115                                               """
   116                                               Given a string chain, returns the number of
   117                                               times that a given symbol is found
   118                                               right after the string ak inside the chain
   119                                               """
   120                                           
   121   1617584      1654535      1.0      3.5      if k != 0:
   122   1617584     44478265     27.5     93.3          res = chain.count(ak+symbol)
   123                                               else:
   124                                                   # if k=0, the past has no influence
   125                                                   res = chain.count(symbol)
   126   1617584      1530820      0.9      3.2      return res

The instruction res = chain.count(ak+symbol) takes 93% of the 47 s runtime! We’ve just found our bottleneck line. Hurray!

What about the algorithm ?

Let’s ask ourselves why we use the str.count() method so intensively. For each distinct k-tuple (with k = 10) in the text, we count its number of occurrences in the text; then, for each symbol of the alphabet, we count the number of occurrences of the concatenation of the k-tuple and the symbol. That makes a serious load of str.count() calls. Well, 1638868 exactly…

This algorithm does not scale well: each additional distinct k-tuple adds 1+n count operations, where n is the number of symbols in the alphabet.
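To make the cost concrete, here is a naive sketch of the approach just described (reconstructed from this description, not the actual repo code), with a counter for the number of str.count() calls:

```python
def conditional_counts(chain, k, alphabet):
    """Naive sketch: one str.count() call per distinct k-tuple,
    plus one per (k-tuple, symbol) pair."""
    ktuples = sorted({chain[i:i + k] for i in range(len(chain) - k + 1)})
    counts, calls = {}, 0
    for ak in ktuples:
        counts[ak] = chain.count(ak)              # 1 call per k-tuple
        calls += 1
        for b in alphabet:
            counts[ak + b] = chain.count(ak + b)  # 1 call per pair
            calls += 1
    return counts, calls

counts, calls = conditional_counts("abab", 2, "ab")
# 2 distinct 2-tuples ("ab", "ba") and a 2-symbol alphabet
# -> calls == 2 * (1 + 2) == 6
```

With thousands of distinct 10-tuples and a full alphabet, `calls` explodes exactly as the profiler showed.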

We now have two choices :

  • hack / change the algorithm
  • try to optimize the code without changing the algorithm

We will focus on the latter, and suggest some ideas for hacks at the end of the article.

Problem : str.count() is written in C !

You know how people always brag about C being freakishly fast? They are right. But our problem is that str.count() is a Python built-in method… already written in C. If we rewrote the entire program in C, we would only be optimizing the remaining 10%; it would have no effect on the main cost. Shoot!

Still, let me name some of the solutions available for the cases where rewriting in C would actually optimize the code:

  • Brute-force C rewriting (very bright side of the force, but it can be painful for some people and take some time. D. Ritchie would be proud of you, though).
  • ctypes: ctypes is a foreign function library for Python. It provides C-compatible data types, and allows calling functions in DLLs or shared libraries. It can be used to wrap these libraries in pure Python. More info here and here.
  • Shed Skin: Shed Skin is an experimental compiler that can translate pure, but implicitly statically typed, Python (2.4-2.6) programs into optimized C++. It can generate stand-alone programs or extension modules that can be imported and used in larger Python programs. Warning: only a restricted set of modules is supported. More info here.
  • Cython: (my personal favorite) Cython is a language that makes writing C extensions for Python as easy as Python itself. Cython is based on the well-known Pyrex, but supports more cutting-edge functionality and optimizations. The Cython language is very close to Python, but additionally supports calling C functions and declaring C types on variables and class attributes. This allows the compiler to generate very efficient C code from Cython code. More info here.

Parallel algorithm

If you have a multi-core processor, you can take advantage of this architecture to dispatch workloads on every core. There is a lovely python module for that : multiprocessing.

We could split the list of distinct k-tuples into several parts, and have each core process its part with our statistical approach. When each core is done, we gather the sub-results and store the final result.

The code can be read here but these are the important lines:

nb_cores = multiprocessing.cpu_count()
chunk_size = len(ak_chunks) / nb_cores  # integer division in Python 2

# split the list of chunks into nb_cores sublists
ak_chunks_split = [ak_chunks[i*chunk_size:(i+1)*chunk_size] for i in xrange(nb_cores-1)]

# handle the case where the last chunk is not the same size
ak_chunks_split.append(ak_chunks[(nb_cores-1)*chunk_size:])

# create a pool of worker processes
p = multiprocessing.Pool()

# send the split list of chunks to the pool
po = p.map_async(multiprocess_proba_wrapper, ak_chunks_split)

# gather the results
results = po.get()

All calculations are done in the multiprocess_proba_wrapper function.

On my quad-core laptop, I get a nice 2.7x speedup: the program now only takes 17 s. We could expect a 4x speedup, but multiprocessing implies communication between the processes, and thus some overhead.
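For readers who want a self-contained toy of the same pattern, here is a minimal Python 3 sketch; count_letter stands in for multiprocess_proba_wrapper, and the workload is invented for illustration:

```python
import multiprocessing

def count_letter(chunk):
    # stand-in for multiprocess_proba_wrapper: each worker
    # processes its own sublist independently
    return sum(word.count("a") for word in chunk)

if __name__ == "__main__":
    words = ["banana", "apple", "pear", "kiwi"] * 1000
    nb_cores = multiprocessing.cpu_count()
    chunk_size = max(1, len(words) // nb_cores)
    chunks = [words[i * chunk_size:(i + 1) * chunk_size]
              for i in range(nb_cores - 1)]
    chunks.append(words[(nb_cores - 1) * chunk_size:])  # last, uneven chunk
    with multiprocessing.Pool() as pool:
        results = pool.map(count_letter, chunks)
    print(sum(results))  # same total as a sequential loop would give
```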

Can we do better ?

I hope so. I am absolutely not satisfied with this result, because the algorithm doesn’t scale well. I have an idea to strongly decrease the number of count() calls: instead of calling count(ak) and count(ak+symbol) for every symbol of the alphabet, we could just extract the symbol following each occurrence of the ak string, playing with the index() method.

i = chain.index(ak)  # position of the first occurrence of ak in chain
symbol = chain[i+k]  # symbol right after ak
# increment this symbol's counter in a dictionary

We would repeat these operations until there are no ak occurrences left in chain.
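A fleshed-out sketch of that idea (my reconstruction, not code from the repo) could look like this; I use str.find(), which returns -1 instead of raising like index(), to keep the loop simple:

```python
from collections import Counter

def next_symbol_counts(chain, ak):
    """Scan every occurrence of ak in a single pass and tally
    the symbol that follows each one."""
    k = len(ak)
    counts = Counter()
    i = chain.find(ak)              # find() returns -1 instead of raising
    while i != -1:
        if i + k < len(chain):      # there is a symbol right after ak
            counts[chain[i + k]] += 1
        i = chain.find(ak, i + 1)   # next occurrence (overlaps included)
    return counts

counts = next_symbol_counts("abcabd", "ab")  # Counter({'c': 1, 'd': 1})
```

One pass per distinct k-tuple replaces the 1+n count() calls; note that unlike str.count(), this scan also sees overlapping occurrences.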

My point is that sometimes you can optimize your code, sometimes you must change the method if you want your program to be scalable.

Conclusions

In this post, I have named some solutions that can be useful to optimize Python code: ctypes, Shed Skin, Cython and multiprocessing; and we have seen some useful tools: profile, the kernprof.py line profiler and RunSnakeRun.

All problems are unique, and these techniques will not always be the solution. In our example, the bottleneck is the algorithm: even though the str.count() method runs fast, we just call it too often… We proposed a rather nice parallel solution, but if we double the size of the input, the calculations take more than a minute! We have to change/hack the algorithm for the program to scale to larger texts.

Thanks

If all that was of any interest to you, you should (must!) check out the outstanding paper written by @ianozsvald: High Performance Python. (All sources are available on his GitHub.) All techniques named in this article are covered there in more depth. (This is how I first got into Python optimization, and it turns out I will soon work with him =)

Thanks Ian for all your advice on this project, and for all your time !

PS : Do not hesitate to fork the git repo to propose improvements :)

Comments

Jul 11 2011
5 notes

List sorting

In this post, we’ll focus on a very useful and common operation : list sorting.

You’ll see that in Python, list sorting can be both simple to write and very powerful ! Let’s now begin with the absolute basics.

Sorting basics

If you want to sort the following list in ascending or descending order: [1,4,6,2,8], you can use the built-in sorted() function.

sorted([1,4,6,2,8])
# [1,2,4,6,8]
sorted([1,4,6,2,8],reverse=True)
# [8,6,4,2,1]

This function returns a new sorted list, but doesn’t modify the original one!

li = [8,9,5,7,1]
sorted(li)
# [1,5,7,8,9]
li
# [8,9,5,7,1]

There is another built-in way of sorting lists, which is older and less convenient to use, but still worth mentioning: the sort() method. It modifies the original list in place and returns None, which can make it confusing to use.

li = [8,9,5,7,1]
li.sort()
li
# [1,5,7,8,9]

We might be tempted to write li = li.sort() instead, but that is a mistake: the sort() method returns None, as we’ve stated before.

li = li.sort()
li
# li is now None: we've lost our list >_<

Even though this instruction is valid Python, it silently rebinds li to None, with no error message =) That can be a real pain in the ass to debug…

Another fundamental difference between sorted() and sort() is that sorted() accepts any iterable, like dictionaries, while sort() only exists on lists.
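For instance, with a dictionary (the names and ages here are just made up for illustration):

```python
ages = {"Dweezil": 23, "RobertTheShrubber": 56, "Eugene": 33}

sorted(ages)
# ['Dweezil', 'Eugene', 'RobertTheShrubber'] : iterating a dict yields its keys
sorted(ages.values())
# [23, 33, 56]
```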

Custom sorting

Ascending and descending order are far from the only ways of sorting a list. We’ll now see how to define a custom sorting criterion.

If you have a list of strings, the sorted() function will sort it in ascending/descending alphabetical order.

li = ["zzz","bb","i"]
sorted(li)
# ["bb","i","zzz"]

But now, let’s say that you’d like to sort the list by the length of the strings. To accomplish this mighty miracle, we’ll use the key parameter, which takes a function defining the sorting rule(s). It is very important to fully understand how it works.

The key function is applied to each element of the to-be-sorted list and returns a value. An intermediate list is then created, filled with the values returned by the key function. This intermediate list is sorted "traditionally", and the original elements are reordered accordingly.

To sort by the length of the strings, we have to use the following instructions :

li = ["zzz","bb","i"]
sorted(li, key=len)
# ["i","bb","zzz"]

The len() key function is applied to each element of the li list and returns [3,2,1]. This intermediate list is sorted into [1,2,3], and the elements of li are reordered accordingly: ["i","bb","zzz"].
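To make that mechanism concrete, here is the same sort spelled out by hand; this is a conceptual sketch of what key=len does, not how CPython actually implements it:

```python
li = ["zzz", "bb", "i"]

# 1. "decorate": pair each element with its key value
decorated = [(len(s), s) for s in li]   # [(3, 'zzz'), (2, 'bb'), (1, 'i')]
# 2. sort on the computed key values
decorated.sort()
# 3. "undecorate": drop the keys, keep the reordered elements
result = [s for (_key, s) in decorated]

result == sorted(li, key=len)  # True: ['i', 'bb', 'zzz']
```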

Now, let’s say you store people’s names and ages in 2-tuples. To sort a list of such tuples by age, you have to provide a key function. Instead of defining it among the other functions, we’ll use a lambda function, to avoid obscuring the code:

li = [("Eugene",33),("Dweezil",23),("RobertTheShrubber",56)]
sorted(li, key = lambda person : person[1])
# the lambda function returns the second element of each tuple
# [('Dweezil', 23), ('Eugene', 33), ('RobertTheShrubber', 56)]

This instruction is very powerful, and yet surprisingly simple to write! (And it works with non-ridiculous names too.)

Conclusion

Try to use the sorted() function instead of sort(); that could save you some debugging time.
You can define custom sorting rules with the key parameter, and when doing so, lambda functions help keep your code clear.

Comments

Jun 26 2011
19 notes

Lambda functions

Passing functions as arguments is a common Python feature, used quite often throughout a program. However, defining these functions among the other “normal” ones might obscure the code structure.

That’s why Python allows you to define anonymous (lambda) functions inside an instruction. These functions are created at runtime and are not bound to any name.

Let’s say you’d like to obtain a list containing the results of the function x -> 2x² - 1 for x < 10. You could declare a function computing 2x² - 1 and apply it to range(10).

def math_operation(x):
    return 2*x**2 - 1

L = []
for i in range(10):
    L.append(math_operation(i))
# [-1, 1, 7, 17, 31, 49, 71, 97, 127, 161]

Another way would be to use the map() function:

L = map(math_operation,range(10))
# [-1, 1, 7, 17, 31, 49, 71, 97, 127, 161]

This instruction applies the math_operation function to each item of the range(10) list. As we said before, declaring math_operation() among more essential functions could obscure your code. To avoid that, let’s use a lambda function.

L = map(lambda x : 2*x*x-1,range(10))
# [-1, 1, 7, 17, 31, 49, 71, 97, 127, 161]

This function has no name, and only exists within the assignment to L. No prior declaration needed!

NB: I know what you’re thinking: “a list comprehension could have worked in this example.”

L = [2*x*x-1 for x in range(10)]
# [-1, 1, 7, 17, 31, 49, 71, 97, 127, 161]

Indeed. However, lambda functions can also be used for other things, such as list sorting and list filtering… They are often used in combination with the map(), reduce() and filter() functions.

Example 1 : List filtering

We’d like to filter the range(20) list to obtain a sub-list containing only the even numbers.
For that, we’ll use the filter() function.

L_even = filter(lambda x : x%2 == 0, range(20))
# [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

Example 2 : List-reducing

In this example, we would like to calculate the sum of the numbers from 0 to 49. We can use the reduce() function, combined with a lambda function.

L_sum = reduce(lambda x, y: x + y, range(50))
# 1225
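(A note for readers on newer versions: in Python 3, reduce() moved into the functools module, and for a plain sum the built-in sum() is simpler.)

```python
from functools import reduce  # Python 3: reduce is no longer a built-in

L_sum = reduce(lambda x, y: x + y, range(50))
L_sum == sum(range(50)) == 1225  # True: both sum the integers 0..49
```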

In the following article, we’ll introduce how to use lambda functions for customized list sorting, one very important use of lambda functions.

Conclusion

Lambda functions are a good tool to avoid overloading your code with declarations of simple functions, and they can even make it easier to read.

Some more examples here !

Comments