SOME PYTHON CODE AND SOME THOUGHTS

Nov 04 2011
25 notes

Optimization of the Monty Python parody code

In this post, we’ll see how the program proposed in my previous post can be optimized in terms of execution time. The code can be found in the project GitHub repository. For those who haven’t read the first article but are interested in code optimization, I’d advise you to read it first, to understand the origin of the problem.

What is the problem?

My code is slow. It takes too much time to run : around 45 seconds.

Profiling : Let’s find the bottlenecks !

To identify the bottlenecks in our code, we can take advantage of the excellent profile Python module. (More information here : http://docs.python.org/library/profile.html)

Let’s run our code using profile :
$ python -m cProfile -o proba.prof probabilities.py ../data/data.txt

This command will create a file proba.prof containing the profiling results. We can now read these results using the pstats module.

>>> import pstats
>>> p = pstats.Stats("proba.prof")
>>> p.sort_stats("cumulative").print_stats(10)

These commands will result into an output listing the 10 most time-consuming functions.

   5931126 function calls (5801827 primitive calls) in 48.691 CPU seconds

   Ordered by: cumulative time
   List reduced from 80 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   48.691   48.691 <string>:1(<module>)
        1    0.001    0.001   48.691   48.691 {execfile}
        1    0.005    0.005   48.690   48.690 probabilities.py:13(<module>)
        1    1.177    1.177   48.657   48.657 probabilities.py:33(empirical_entropy_rate)
  1617584    1.818    0.000   45.773    0.000 probabilities.py:88(conditional_empirical_proba)
  1617584    1.518    0.000   43.955    0.000 probabilities.py:112(n_b_ak)
  1638868   43.075    0.000   43.075    0.000 {method 'count' of 'str' objects}
        1    0.009    0.009    1.050    1.050 /usr/lib/python2.6/pickle.py:1361(dump)
        1    0.000    0.000    1.041    1.041 /usr/lib/python2.6/pickle.py:220(dump)
  86729/1    0.275    0.000    1.041    1.041 /usr/lib/python2.6/pickle.py:269(save)

It seems that 90% of the execution time is spent in the str.count() method. How come? I thought that this method was written in C and batshit optimized? Well, it is. We can see that str.count() takes 0.000s / call. But the thing is, we call it 1638868 times!

We can visualize these profiling results using a nice tool written in Python (wxPython): RunSnakeRun.
To use it, type : $ runsnake proba.prof to generate a nice output like this (sorry for the flashy green >_<) :

We thus get a (rather) nice graphical view of the profiling results previously obtained with the profile module.

Line profiling

This kind of profiling only gives us the most time-consuming functions. We can use line profiling to locate the most time-consuming lines. To do that, we use the kernprof.py line profiler.
We need to add the @profile decorator “on top” of the functions we want to profile.

We then type $ python kernprof.py -l -v probabilities_line_profile.py ../data/data.txt

This gives us the following output, for the n_b_ak function:

File: probabilities_line_profile.py
Function: n_b_ak at line 113
Total time: 47.6636 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   113                                           @profile
   114                                           def n_b_ak(chain, ak, symbol): # n_(b|a^k)
   115                                               """
   116                                               Given a string chain, returns the number of
   117                                               times that a given symbol is found
   118                                               right after a string ak inside the chain
   119                                               """
   120                                           
   121   1617584      1654535      1.0      3.5      if k != 0:
   122   1617584     44478265     27.5     93.3          res = chain.count(ak+symbol)
   123                                               else:
   124                                                   # if k = 0, the past has no influence
   125                                                   res = chain.count(symbol)
   126   1617584      1530820      0.9      3.2      return res

The instruction res = chain.count(ak+symbol) takes 93% of the 47s runtime ! We’ve just found our bottleneck line. Hurray !

What about the algorithm ?

Let’s ask ourselves why we use the str.count() method so intensively. For each distinct k-tuple (with k = 10) in the text, we count its number of occurrences in the text, and then, for each symbol of the alphabet, we count the number of occurrences of the concatenation of the k-tuple and the symbol. That makes a serious load of str.count() calls. Well, 1638868 exactly…

This algorithm does not scale well : if we increase the number of distinct k-tuples by one, we increase the number of count operations by 1+n, with n the number of symbols in the alphabet.
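As a sanity check of that formula: assuming the text contains about 21284 distinct 10-tuples and a 76-symbol alphabet (hypothetical figures, chosen here because they make the arithmetic match the profile), we recover exactly the call count reported by cProfile:

```python
# hypothetical figures, inferred so that they match the profile above
d = 21284            # distinct k-tuples found in the text
n = 76               # symbols in the alphabet

calls = d * (1 + n)  # one count() per tuple + one per (tuple, symbol) pair
# calls == 1638868, the number of str.count() calls seen in the profile
```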

We now have two choices :

  • hack / change the algorithm
  • try to optimize the code without changing the algorithm

We will focus on the latter, and try to suggest some ideas of hacks at the end of the article.

Problem : str.count() is written in C !

You know how people always brag about C being freakishly fast? They are right. But our problem is that str.count() is a Python built-in method… already written in C. If we rewrote the exact same program in C, we would only be optimizing the remaining 10%: it would have no effect on the main algorithm. Shoot !

It doesn’t matter, I can still name some of the solutions available for the cases where a C rewrite would actually optimize the code:

  • Brute force C rewriting (very bright side of the force, but it can be painful for some people, and can take some time. D. Ritchie would be proud of you though).
  • ctypes : ctypes is a foreign function library for Python. It provides C compatible data types, and allows calling functions in DLLs or shared libraries. It can be used to wrap these libraries in pure Python. More info here and here.
  • Shed Skin : Shed Skin is an experimental compiler that can translate pure, but implicitly statically typed, Python (2.4-2.6) programs into optimized C++. It can generate stand-alone programs or extension modules that can be imported and used in larger Python programs. Warning: only a restricted set of modules is supported. More info here.
  • Cython : (my personal favorite) Cython is a language that makes writing C extensions for the Python language as easy as Python itself. Cython is based on the well-known Pyrex, but supports more cutting edge functionality and optimizations. The Cython language is very close to Python, but it additionally supports calling C functions and declaring C types on variables and class attributes. This allows the compiler to generate very efficient C code from Cython code. More info here.

Parallel algorithm

If you have a multi-core processor, you can take advantage of this architecture to dispatch workloads on every core. There is a lovely python module for that : multiprocessing.

We could split the list of distinct k-tuples into several parts, and have each core process one of them with our statistical approach. When each core is done, we would gather the sub-results and store the final result.

The code can be read here but these are the important lines:

nb_cores = multiprocessing.cpu_count()
chunk_size = len(ak_chunks) / nb_cores

# split the list of chunks into nb_cores sublists
ak_chunks_split = [ak_chunks[i*chunk_size:(i+1)*chunk_size] for i in xrange(nb_cores-1)]

ak_chunks_split.append(ak_chunks[(nb_cores-1)*chunk_size:]) # handle the case where last chunk is not of the same size    
 
# we create a pool of processes
p = multiprocessing.Pool()

# we send the split list of chunks to the pool

po = p.map_async(multiprocess_proba_wrapper, ak_chunks_split)

# Gather the results
results = po.get()

All calculations are done in the multiprocess_proba_wrapper function.

On my quad-core laptop, I get a nice 2.7x speedup : the program now takes only 17s. We could expect a 4x speedup, but multiprocessing implies communication between the cores, and thus a fair amount of overhead.
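For a self-contained illustration of the split / dispatch / gather dance, here is a minimal sketch. The worker is a dummy standing in for multiprocess_proba_wrapper, and I use the thread-backed multiprocessing.dummy Pool (same API as multiprocessing.Pool) so the sketch runs anywhere:

```python
from multiprocessing.dummy import Pool  # thread-backed, same API as multiprocessing.Pool

def dummy_worker(ak_list):
    # stand-in for multiprocess_proba_wrapper: handles one sublist of k-tuples
    return [len(ak) for ak in ak_list]

ak_chunks = ["spam", "eggs", "ni", "shrubbery"]
nb_cores = 2
chunk_size = len(ak_chunks) // nb_cores

# split the list into nb_cores sublists, the last one absorbing any remainder
split = [ak_chunks[i*chunk_size:(i+1)*chunk_size] for i in range(nb_cores - 1)]
split.append(ak_chunks[(nb_cores - 1)*chunk_size:])

pool = Pool(nb_cores)
results = pool.map(dummy_worker, split)   # dispatch one sublist per worker
pool.close()
pool.join()

# gather: flatten the per-core sub-results
flat = [r for sub in results for r in sub]
```

Swap in multiprocessing.Pool (and a `if __name__ == '__main__':` guard) to get real processes instead of threads.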

Can we do better ?

I hope so. I am absolutely not satisfied with this result, because the algorithm still doesn’t scale well. I have an idea to strongly decrease the number of count() calls. Instead of calling count(ak) and count(ak+symbol) for all symbols in the alphabet, we could just extract the next symbol for each occurrence of the ak string, playing with the index() method.

i = chain.find(ak)               # position of the first occurrence of ak in chain
while i != -1 and i + k < len(chain):
    symbol = chain[i + k]        # symbol right after this occurrence of ak
    # increment this symbol's counter in a dictionary
    i = chain.find(ak, i + 1)    # position of the next occurrence

(I use find() rather than index() here : it returns -1 instead of raising an exception when there is no ak occurrence left, which naturally stops the loop.)
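Pushing the idea further, a single pass over the text can collect every transition at once. A minimal sketch on a toy string, with a transition_counts helper of my own devising (not from the repo):

```python
from collections import Counter

def transition_counts(chain, k):
    # one pass over the text: for every k-tuple, count which symbol follows it
    counts = Counter()   # (ak, symbol) -> n_(b|a^k)
    totals = Counter()   # ak -> n_(a^k), counting only tuples that have a successor
    for i in range(len(chain) - k):
        ak = chain[i:i+k]
        counts[(ak, chain[i+k])] += 1
        totals[ak] += 1
    return counts, totals

counts, totals = transition_counts("ABCABDABC", 2)
# p(C|AB) = counts[("AB", "C")] / totals["AB"] = 2/3
```

No str.count() call at all : the number of operations no longer depends on the number of distinct k-tuples times the alphabet size, only on the text length.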

My point is that sometimes you can optimize your code, sometimes you must change the method if you want your program to be scalable.

Conclusions

In this post, I have named some solutions that can be useful to optimize Python code : ctypes, Shed Skin, Cython, multiprocessing, and we have seen some useful tools : profile, the kernprof.py line profiler and RunSnakeRun.

All problems are unique, and these techniques will not always be the solution. In our example, the bottleneck is the algorithm : even though the str.count() method runs fast, we just call it too many times… We proposed a rather nice parallel solution, but if we multiply the size of the input by 2, the calculations then take more than a minute ! We have to change/hack the algorithm for the program to scale to larger texts.

Thanks

If all that was of any interest to you, you should (must !) then check out the outstanding paper written by @ianozsvald : High Performance Python. (All sources available on his GitHub.) All techniques named in this article are more deeply covered there. (This is how I first got into Python optimization, and it turns out I will soon work with him =)

Thanks Ian for all your advice on this project, and for all your time !

PS : Do not hesitate to fork the git repo to propose improvements :)

Comments

Oct 26 2011
6 notes

Generate a Monty Python parody

It’s been a freaking while since my last post, and I’m sorry about that. I just felt that Python basics were already covered by people so much more gifted than me, so why copy them? That’s why I am now covering personal experiences with Python.

If you always wanted to write texts in the way of Monty Python, I have what you need ! In this post, I am going to show you mathematical techniques to analyse a text, in order to randomly generate look-alike texts.

Introduction to basic concepts

First essential question :

What is a text?

From a mathematical point of view, a text of length n simply is the concatenation of n symbols, all taken from a finite alphabet A. In our context, the alphabet is generally composed of all lowercase and uppercase letters, punctuation signs, etc.

In a real-life situation, the succession of symbols is not random, but depends on the previous symbols. Indeed, if the 3 last symbols are ” “, “t” and “h”, it is highly probable that the next one will be “e”, because the word “the” is fairly common.

The whole problem can thus be reduced to obtaining a transition probability matrix between strings of fixed length and all symbols of the alphabet.

Example : Let’s assume that the three last symbols are ” “, “t”, and “h”, and that the probability of the next symbol being an “e” (written p(“e” | ” th”)) is 0.6, an “a” 0.3 and a “u” 0.1.
We would then obtain a line of the matrix of transition probability between ” th” and all alphabet symbols:

” th” —> a: 0.3, b: 0, c: 0, …, e: 0.6, …, u: 0.1, …

The probability p(“e” | ” th”) is called a conditional probability.
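In Python, such a matrix row maps naturally onto a dictionary of dictionaries (the structure we will actually use later on). A minimal sketch of the ” th” row above:

```python
# one row of the transition probability matrix, as a dictionary of dictionaries
transitions = {" th": {"a": 0.3, "e": 0.6, "u": 0.1}}

p_e = transitions[" th"]["e"]   # p("e" | " th") = 0.6
```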

Markov chain of order k

We are going to model our data text (here, the “Monty Python and the Holy Grail” script) with a Markov chain of order k. This barbarian name refers to :

a mathematical system that undergoes transitions from one state to another (from a finite or countable number of possible states) in a chainlike manner

Source : Wikipedia

That means that the following state is conditioned by the k previous ones.

If we deal with a Markov chain of order 3, the probability of occurrence of the next symbol will only depend on the 3 previous symbols. From previous tests, I can say that k=10 is a good place to start. (More on that later)

Text Alphabet

We’ve just fixed the value of k, which was the first step of the process. Now, we need to create a list of all encountered symbols (i.e. the alphabet).

First, we read the data file, and join all the lines in a single string.

f = open('../data/monty.txt', 'r')
f_lines = ' '.join(f.readlines())

Then, we create the alphabet list:

def alphabet(datafile_lines):
    """
    Returns all used characters in a given text
    """
    alph = []
    for letter in datafile_lines:
        if letter not in alph:
            alph.append(letter)
    return sorted(alph)

Finding all existing k-tuples in the source text

Now, we need to identify all distinct strings of length k=10 in the text.

This can seem a bit tedious, but list comprehensions and sets will do a lovely job.

# -- split text in ak chunks of length k (sliding window)
# stopping at len - k + 1 guarantees that every chunk has exactly k symbols
ak_chunks = [datafile_lines[i:i+k] for i in xrange(len(datafile_lines) - k + 1)]

# -- Extract unique values from list
ak_chunks = list(set(ak_chunks)) # set: reduce to unique values
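On a toy string with k = 2, the sliding window gives (using range here; the script itself uses xrange):

```python
k = 2
datafile_lines = "ABCABD"

# sliding window of length k over the text: every chunk has exactly k symbols
ak_chunks = [datafile_lines[i:i+k] for i in range(len(datafile_lines) - k + 1)]
# ['AB', 'BC', 'CA', 'AB', 'BD']

unique_chunks = sorted(set(ak_chunks))
# ['AB', 'BC', 'BD', 'CA']
```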

Empirical probabilities of transition

Now comes the hard work. So far, we have

  • a text,
  • its alphabet,
  • a HUGE list of all distinct strings of length k=10 contained in the text

What we then need is a way to calculate the empirical probability of transition between each string of length 10 and symbols of the alphabet (“empirical” in the way that these probabilities will only apply to the text we study).

Let’s formalize the problem a bit:

  • a^k : string of length k (here, 10)
  • b : symbol situated after a^k
  • n_(a^k) : number of times that the string a^k is encountered in the text
  • n_(b|a^k) = number of times that the string a^k is followed by the symbol b

We can now express the empirical probability p(b|a^k) = n_(b|a^k) / n_(a^k)
(number of times that the string a^k is followed by the symbol b / number of times that the string a^k is encountered in the text)

Example : if our text is ABCABDABC, a^k = AB and b = C:

  • n_(AB) = 3
  • n_(C|AB) = 2
  • p(C|AB) = 2/3 = 0.667

Let’s write all that in Python:

def conditional_empirical_proba(chain, ak, symbol, n_ak): # p(b|a^k)
    """
    Returns the proportion of symbols after the ak string (contained
    in chain string and of length k) which are equal to the value
    of given parameter 'symbol'
    Ex: conditional_empirical_proba('ABCABD', 'AB', 'C', 2) -> 0.5
    """
    nb_ak = n_b_ak(chain, ak, symbol)
    if n_ak != 0:
        return float(nb_ak)/n_ak
    else:
        return 0

def n_b_ak(chain, ak, symbol): # n_(b|a^k)
    """
    Given a string chain, returns the number of
    times that a given symbol is found
    right after a string ak inside the chain
    """
    return chain.count(ak+symbol)

def n_ak(chain, ak): # n_(a^k)
    """
    Given a string chain and a string ak, returns
    the number of times ak is found in chain
    """
    return chain.count(ak)

Now, the only remaining thing to do is to calculate the empirical conditional probability for each k-tuple and for each symbol.

A few remarks are necessary:

  • We will only store empirical conditional probabilities > 0 (more on that later)
  • We will store accumulative empirical conditional probabilities (more on that later)
  • The matrix will be created with a dictionary of dictionaries

# Initialization of matrix
prob = {}
for ak in ak_chunks:
    # New matrix line
    prob[ak] = {}

    # number of occurrences of ak in the text
    nak = n_ak(datafile_lines, ak)

    # -- calculate p(b|a^k) for each symbol of alphabet
    pbak_cumul = 0
    for symb in alpha:
        pbak = conditional_empirical_proba(datafile_lines, ak, symb, nak)

        # cumulative probabilities
        pbak_cumul += pbak

        # if succession ak+symb is encountered in text, add probability to matrix
        if pbak != 0.0: # Very important: if pbak = 0.0, the combination ak+symb will not be randomly generated
            prob[ak][symb] = pbak_cumul

proba_file = open('../results/distribs/distrib_k%d.txt'%(k), 'w')
pickle.dump(prob, proba_file)
proba_file.close()

Random text generation

Close your eyes for a second, and think about what we just did. We calculated empirical transition probabilities between all existing strings of length 10 and all symbols of the alphabet, and stored the non-nil cumulative probabilities in a matrix. (The non-nil part has two main advantages : it implies less storage cost, and we only store combinations that occurred in the text. This way, random generation becomes really easy !)

It is now extremely easy to generate a text using these accumulative probabilities! Let’s consider a quick example.

Example: a^k = AB, p(A|AB)=0.2, p(B|AB)=0.5, p(C|AB)=0.3. We then store these cumulative values in the matrix:

  • p(A|AB)=0.2
  • p(B|AB)=0.7
  • p(C|AB)=1

That way, we only have to pick a random float between 0 and 1 using a uniform distribution, and match this float with a symbol. random(0,1)=0.678 → symbol = B
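The matching step can be sketched with a tiny helper (pick_symbol is hypothetical, not from the repo; the row must be walked in increasing cumulative order):

```python
import random

# cumulative probabilities for a^k = "AB", in increasing order
row = [("A", 0.2), ("B", 0.7), ("C", 1.0)]

def pick_symbol(row, r):
    # return the first symbol whose cumulative probability exceeds the draw r
    for symbol, cumul in row:
        if cumul > r:
            return symbol

choice = pick_symbol(row, 0.678)          # 'B', as in the example above
draw = pick_symbol(row, random.random())  # a random draw
```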

For this technique to work, the first k=10 symbols of the generated text must directly come from the original text (and hence will be contained in the matrix). This will give us a valid initial condition.

Let’s now generate the text :

def random_text(size, k):
    """
    Given a result size and an integer k,
    returns a randomly generated text using
    probability distributions of Markov chains
    of order k dumped in ../results/distribs/distrib_kX.txt
    files
    """
    # -- Initial string : the first k symbols of the original text
    f = open('../data/monty.txt','r')
    initial_string = ' '.join(f.readlines())[:k]
    out = initial_string
    f.close()

    # -- Import probability distribution
    try:
        proba_file = open('../results/distribs/distrib_k%d.txt'%(k),'r')
    except IOError as err:
        print err
        exit(2)

    distrib_matrix = pickle.load(proba_file)
    proba_file.close()

    # -- Generate text following probability distribution
    kuple = initial_string
    for x in xrange(size):
        p = random.uniform(0,1)

        # read distribution specific to k-tuple string
        dist = distrib_matrix[kuple]

        # walk the cumulative probabilities in increasing order and keep
        # the first symbol whose cumulative value exceeds p
        for symbol, cumul in sorted(dist.items(), key=lambda item: item[1]):
            if cumul > p:
                break

        out += symbol
        kuple = kuple[1:]+symbol # update k-tuple

    return out

Done ! Now, you only have to call the function random_text(len_text, 10) and BOOM !

Example of generated text with k = 10

KING ARTHUR: Will you ask your master that we have been charged by God with a sacred quest. If he will give us food and shelter for the week.
ARTHUR: Will you ask your master if he wants to join my court at Camelot?!
SOLDIER #1: You’re using coconuts!
ARTHUR: Ohh.
BEDEVERE: Uh, but you are wounded!
GALAHAD: What are you doing in England?
FRENCH GUARDS: [whispering] Forgive me that’ and ‘I’m not worth

What if we change k ?

k can be interpreted as the quantity of context you take into account to calculate a symbol’s occurrence probability. We chose k = 10, because a context of 10 symbols allows the program to generate a text with apparent sense (limited by the randomness of the process, and by the fact that THIS IS MONTY FREAKING PYTHON).

The more context you add, the more alike the generated and original texts will be, up to a point where they will be identical.

If you decrease k, you reach an interesting regime where the program still generates real words, but the context is senseless.

Example, for k=5:

KING ARTHUR: Yes!
VILLAGER #3: A bit.
VILLAGER #1: You saw saw saw it, did you could
separate, and master that!
ARTHUR: Will you on Thursday.
CUSTOMER: What do you can you think kill your every
good people. It’s one.)
OTHER FRENCH GUARDS: [whispering]

If you decrease k even more, you will only generate rubbish.

Conclusion

We have seen a pretty simple text analysis technique which allows us to randomly generate a text, based on statistical analysis of the data text. This technique is based on the fact that the probability of occurrence of a letter depends on its local “past”.

Playing with the value of the “past length”, you can generate text more or less alike to the original, and with more or less “sense”.

This simple technique does not use the nltk Python module, or a set of texts to generate “theoretical” rules on a language. It is purely empirical.

It now would be pretty easy to use this tool to implement a Youtube comment/insults generator : they always say the same thing “Fuck Justin Bieber”. They’re not fundamentally wrong though :)

All source code available on GitHub.

EDIT : A nice comment from reddit:

This approach was first proposed by Claude Shannon in his landmark paper “A Mathematical Theory of Communication”… in 1948.

Gotta love how people keep reinventing the same things over and over again. But this time, in Python!

Comments

Jul 11 2011
5 notes

List sorting

In this post, we’ll focus on a very useful and common operation : list sorting.

You’ll see that in Python, list sorting can be both simple to write and very powerful ! Let’s now begin with the absolute basics.

Sorting basics

If you want to sort the following list by ascending/descending order : [1,4,6,2,8], you can use the built-in sorted() function.

sorted([1,4,6,2,8])
# [1,2,4,6,8]
sorted([1,4,6,2,8],reverse=True)
# [8,6,4,2,1]

This function returns a new sorted list, but doesn’t modify the original one !

li = [8,9,5,7,1]
sorted(li)
# [1,5,7,8,9]
li
# [8,9,5,7,1]

There is another built-in list-sorting function, which is older and less convenient to use, but still worth mentioning : the sort() method. This one modifies the original list in place and doesn’t return anything, which can make it confusing to use.

li = [8,9,5,7,1]
li.sort()
li
# [1,5,7,8,9]

We might be tempted to write li = li.sort() instead, but that is a mistake : the sort() function doesn’t return anything, as we’ve stated before.

li = li.sort()
li
# None : we've overwritten our only reference to the list >_<

Even though this instruction is valid Python, it will replace your list with None, with no error message =) That means it can be a real pain in the ass to debug…

Another fundamental difference between sorted() and sort() is that sorted() can be applied to any iterable, like dictionaries.
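For instance, sorted() happily accepts a dictionary (it sorts the keys, since iterating a dict yields its keys), a string or a tuple, and always returns a list:

```python
ages = {"Eugene": 33, "Dweezil": 23}
keys_sorted = sorted(ages)      # dict: the keys, sorted -> ['Dweezil', 'Eugene']

letters = sorted("spam")        # string -> ['a', 'm', 'p', 's']

numbers = sorted((3, 1, 2))     # tuple -> [1, 2, 3]
```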

Custom sorting

Ascending and descending sorting are far from being the only way of sorting a list. We’ll now see how to define a sorting criteria.

If you have a list of strings, the sorted() function will sort it considering ascending/descending alphabetical order.

li = ["zzz","bb","i"]
sorted(li)
# ["bb","i","zzz"]

But now, let’s say that you’d like to sort the list by the length of the strings. To accomplish this mighty miracle, we’ll use the key parameter. This parameter takes a function defining the sorting rule(s). It is very important to fully understand how it works.


The key function will be applied to each element of the to-be-sorted list, and will return a value. An intermediate list will then be created, filled with the values returned by the key function. This intermediate list will then be sorted “traditionally”.

To sort by the length of the strings, we have to use the following instructions :

li = ["zzz","bb","i"]
sorted(li, key=len)
# ["i","bb","zzz"]

The len() key-function will be applied to each element of the li list, and will return [3,2,1]. This intermediate list will then be sorted to return [1,2,3] and consequently ["i","bb","zzz"].

Now, let’s say you store the name and age of people in len-2 tuples. To sort this list of tuples by age, you have to provide a key function. Instead of defining it among the other “normal” functions, we’ll use a lambda function, to avoid obscuring the code :

li = [("Eugene",33),("Dweezil",23),("RobertTheShrubber",56)]
sorted(li, key = lambda person : person[1])
# the lambda function returns the second element of the list
# [('Dweezil', 23), ('Eugene', 33), ('RobertTheShrubber', 56)]

This instruction is very powerful and weirdly very simple to write ! (And it seems to work with non-ridiculous names too)
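And since tuples compare element by element, the lambda can return a tuple to sort on several criteria at once. A quick sketch with made-up names:

```python
li = [("Eugene", 33), ("Dweezil", 23), ("Gumby", 23)]

# sort by age first, then break ties alphabetically on the name
by_age_then_name = sorted(li, key=lambda person: (person[1], person[0]))
# [('Dweezil', 23), ('Gumby', 23), ('Eugene', 33)]
```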

Conclusion

Try to use the sorted() function instead of sort() : it could save you some debugging time.
You can define custom sorting rules with the key parameter, and when doing so, use lambda functions to keep your code clear.

Comments

Jun 26 2011
19 notes

Lambda functions

Passing functions as arguments is a common Python feature, which can be used quite often throughout a program. However, defining these functions among the other “normal” ones might obscure the code structure.

That’s why Python allows you to define anonymous/lambda functions inside an instruction. These functions are created at runtime, and are not linked to any name.

Let’s say you’d like to obtain a list containing the results of the function x —> 2x² - 1 for x < 10. You could declare a function to calculate the 2x² - 1 operation and apply it to range(10).

def math_operation(x):
    return 2*x**2 - 1

L = []
for i in range(10):
    L.append(math_operation(i))
# [-1, 1, 7, 17, 31, 49, 71, 97, 127, 161]

Another way would be to use the map() function:

L = map(math_operation,range(10))
# [-1, 1, 7, 17, 31, 49, 71, 97, 127, 161]

This instruction applies the math_operation function to each item of the range(10) list. Like we said before, declaring math_operation() among more essential functions could obscure your code. To avoid that, let’s use a lambda function.

L = map(lambda x : 2*x*x-1,range(10))
# [-1, 1, 7, 17, 31, 49, 71, 97, 127, 161]

This function does not have a name, and only exists within the L assignment instruction. No prior declaration needed !

NB : I know what you’re thinking : “list comprehensions could have worked in this example.”

L = [2*x*x-1 for x in range(10)]
# [-1, 1, 7, 17, 31, 49, 71, 97, 127, 161]

Indeed. However, lambda functions can also be used for other things, such as list sorting, list filtering… Lambda functions are often used in combination with the map(), reduce() and filter() functions.

Example 1 : List filtering

We’d like to filter the range(20) list, to obtain a sub-list only containing the even numbers.
For that, we’ll use the filter() function.

L_even = filter(lambda x : x%2 == 0, range(20))
# [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

Example 2 : List-reducing

In this example, we would like to calculate the sum of the numbers from 0 to 49. We can use the reduce() function, combined with a lambda function.

L_sum = reduce(lambda x,y :x+y, range(50))
# 1225
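A side note: in Python 3, reduce() moves into the functools module, and it is importable from there in Python 2.6+ as well, so the import keeps your code portable. For instance, a factorial via reduce:

```python
from functools import reduce  # built-in in Python 2, but importable from functools there too

product = reduce(lambda x, y: x * y, range(1, 6))  # 5! = 120
```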

In the following article, we’ll introduce how to use lambda functions for customized list sorting, one very important use of lambda functions.

Conclusion

Lambda functions are a good tool to avoid “overweighting” your code with simple function declarations, and can even make it simpler to read.

Some more examples here !

Comments

Jun 24 2011
1 note

Copying a list

Python variables don’t work like variables in other languages, and it’s very important to understand why, especially when dealing with lists.

First, we shouldn’t think in terms of variables, but in terms of names, or labels. If we declare a = 1, we assign the name a to the value 1. Then, if we declare b = a, we just give another name to the value 1, which can now be accessed by the name a or b. Both a and b names point to the same value stored in memory. 

Now, let’s say that we are dealing with lists that we’d like to copy. At first, we’d think of that :

a = ["SPAM","eggs","Ni"]
b = a
a
# ["SPAM","eggs","Ni"]
b
# ["SPAM","eggs","Ni"]

Doing that, we might feel that we’ve copied the list a into a variable b, which isn’t true ! If we add a value to a, this value will also be visible through b. Like we’ve said before, a and b are two names pointing to the same object in memory.

a.append("shrubery")
a
# ["SPAM","eggs","Ni","shrubery"]
b
# ["SPAM","eggs","Ni","shrubery"]

When writing b = a, we just gave another name/label to the list.

If you’re still not convinced, let’s have a look at the memory addresses (via the id() built-in) of the objects a and b point to :

hex(id(a))
# '0x8c38d90'
hex(id(b))
# '0x8c38d90'

See ?

Now, if you want to copy a list into an independent “variable”, you have to write the following instruction :

b = a[:]

That will copy all of a’s values into a new list b.

a.append("Coconut")
a
# ["SPAM","eggs","Ni","shrubery","Coconut"]
b
# ["SPAM","eggs","Ni","shrubery"]
hex(id(a))
# '0x8c38d90'
hex(id(b))
# '0xb717492'

This little tip may seem obvious, but many experienced developers have fallen into this kind of trap, which can be hard to detect and debug !

Conclusion

Be careful when you want to copy a list. The instruction list2 = list1 will only create a new name for the same list. You’ll have to write list2 = list1[:].

Comments

Jun 23 2011

CodeEval most popular languages

Python appears to be the most popular programming language for challenge solving amongst 7000 interviewed developers !

Source: codeeval’s posterous

Comments

Jun 22 2011
8 notes

List comprehensions & generator expressions

List comprehensions and generator expressions are powerful, useful and elegant python features.
They work quite the same way, but are not meant to be used in the same situations.

List comprehensions

List comprehensions are a feature that allows you to compute and create a list of elements validating a condition. Let’s say you’d like to compute the sum of the squares of the even numbers less than 20.
You could do it this way:

s = 0
for i in range(20):
    if i%2 == 0:
        s += i**2

Or, you could spare some tests by increasing i with a step of 2:

s = 0
for i in range(0, 20, 2):
    s += i**2

List comprehensions allow you to do the same thing in a single line.

result = sum([i**2 for i in range(20) if i%2 == 0])

Remember the python philosophy :

Flat is better than nested.

Indeed, even if the list is quite complicated to compute, with a lot of tests and conditions, the list comprehension will still fit on a single line. Remember : less code, less mistakes.

Let’s talk about performance now, with a little benchmark of these two methods.

import time

total = 0   # don't call it "sum": that would shadow the built-in we use below
t1 = time.time()
for i in range(10000000):
    if i%2 == 0:
        total += i**2
t2 = time.time()
print "Dummy method : {0:.4}s".format(t2-t1)
# Dummy method : 8.050s

t3 = time.time()
list_comp = sum([i**2 for i in range(10000000) if i%2 == 0])
t4 = time.time()
print "Smart method : {0:.4}s".format(t4-t3)
# Smart method : 8.126s

These methods are equally fast. Actually no. They are equally slow, but hey, you wanted to deal with a 10000000-element list ! And that directly brings us to our second topic.

Generator expressions

If the list of square even numbers less than 10000000 is what you want to obtain, use list comprehensions. But if this list is just an intermediate step (like in the previous example), use generator expressions.

This feature uses generators (Duh !). Explanations about generators are worth an entire article, but you’ll find a good tutorial here. What you have to remember is that instead of storing the entire list, a generator only returns one item of the list at a time. When dealing with huge lists, that will save you some memory space.

The syntax for creating a generator expression is exactly the same as for list comprehensions, but with parentheses instead of square brackets.

gen_exp = sum(i**2 for i in range(10000000) if i%2 == 0)
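You can see the memory difference directly with sys.getsizeof() (the exact sizes are CPython implementation details, only the orders of magnitude matter):

```python
import sys

list_comp = [i**2 for i in range(1000)]
gen_exp = (i**2 for i in range(1000))

size_list = sys.getsizeof(list_comp)  # several kilobytes: the whole list is stored
size_gen = sys.getsizeof(gen_exp)     # around a hundred bytes, whatever the range size
```

Both produce exactly the same sum; the generator just never materializes the list.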

Now let’s compare the speed and memory consumption (read with the top command) of these two methods when dealing with large lists :

t1 = time.time()
listcomp = sum([i**2 for i in range(10000000) if i%2 == 0])
t2 = time.time()
print "List comprehension: {0:.4}s".format(t2-t1)
#List comprehension: 8.121s
#Memory allocated : 175Mb

t3 = time.time()
S = sum(i**2 for i in range(10000000) if i%2 == 0)
t4 = time.time()
print "Generator expression: {0:.4}s".format(t4-t3)
#Generator expression: 6.1673s
#Memory allocated : 159Mb

Better. The second method was faster and consumed less memory, because it never stored the entire list.

The next question is: “Could we consume even less memory?” The answer is “Oh yes”. The problem with the range() function is that it builds and stores the entire result list in memory. We could instead use a generator-based range function, which yields one value at a time. In Python 2, this function is called xrange().

t1 = time.time()
gen_exp = sum(i**2 for i in xrange(10000000) if i%2 == 0)
t2 = time.time()
print "Generator expression / xrange: {0:.4}s".format(t2-t1)
#Generator expression / xrange: 6.305s
#Memory allocated: 1.75Mb

Fuck yeah.

Conclusion

If the list is your objective, use list comprehensions. If the list is only an intermediate step, use generator expressions. Try to use xrange() instead of range(), especially when dealing with large ranges.


Jun 21 2011
9 notes

Decorators

Decorators are a useful trick for standardizing function output. They let you control how a function’s result(s) will be printed on screen for the user to see.

Let’s define three simple functions: sum(x,y), square(x) and power(x,y). Instead of inserting a print statement inside each of them, and doing it again every time you define a new function, use a decorator function.

I’d like to display my result this way : “Call of function(args). Result = X”. 

To do that, let’s define our decorator:

def display_result(func):
    def decorator(*args, **kwargs):
        result = func(*args, **kwargs)
        # *args : arguments of parameter-function 'func'
        print "Call of {0}{1}. Result = {2}".format(func.__name__, args, result)
        return result 
    return decorator

Interesting fact : the func argument is a function. In Python, it’s possible to pass functions and classes as arguments.

Anyway, let’s break down this code: we define a function display_result, which takes a function as an argument. A nested function, decorator, is then defined. It takes the arguments of func, and returns func’s result after printing it.

We’ll now apply this decorator to the 3 previous functions :

@display_result
def sum(x,y):
    return x+y

@display_result
def square(x):
    return x*x

@display_result
def power(x,y):
    return x**y

The @display_result instruction will pass the function as an argument of the display_result decorator function.
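In other words, the @ syntax is just shorthand for calling the decorator by hand. A self-contained sketch (the name sum2 is hypothetical, chosen to avoid shadowing the builtin sum()):

```python
def display_result(func):
    # same decorator as above, repeated here so the snippet runs on its own
    def decorator(*args, **kwargs):
        result = func(*args, **kwargs)
        print("Call of {0}{1}. Result = {2}".format(func.__name__, args, result))
        return result
    return decorator

def sum2(x, y):
    return x + y

# Exactly equivalent to decorating sum2 with @display_result
sum2 = display_result(sum2)
assert sum2(2, 3) == 5   # prints "Call of sum2(2, 3). Result = 5"
```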

The output is finally:

if __name__=="__main__":
    a = 2
    b = 3
    sum(a,b)
    #Call of sum(2, 3). Result = 5
    square(a)
    #Call of square(2,). Result = 4
    power(a,b)
    #Call of power(2, 3). Result = 8

It’s interesting to notice that the function output can be of any (standard) type! (I’m not quite sure about class or object outputs, though.)

I personally use decorators when I want to compare function speeds. The decorator could then be:

import time

def display_result(func):
    def decorator(*args, **kwargs):
        t1 = time.time()
        result = func(*args, **kwargs)
        t2 = time.time()
        print "{0}{1} executed in {2:.3}s".format(func.__name__, args, t2-t1)
        return result
    return decorator
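As a self-contained sketch of such a timing decorator in use (the names time_it and slow_square are hypothetical):

```python
import time

def time_it(func):
    # a working variant of the timing decorator
    def decorator(*args, **kwargs):
        t1 = time.time()
        result = func(*args, **kwargs)
        t2 = time.time()
        print("{0}{1} executed in {2:.3}s".format(func.__name__, args, t2 - t1))
        return result
    return decorator

@time_it
def slow_square(x):
    time.sleep(0.1)   # make the function measurably slow
    return x * x

assert slow_square(4) == 16   # also prints the elapsed time
```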

Conclusion: decorators are useful for displaying standardized result strings from a bunch of functions. You’ll save some time by not having to type a print command every time you want to display a function’s output.


Jun 20 2011
8 notes

Module :: Pickle

This module is well known, but for those of you who’ve actually never heard of it, I hope you’ll like it as much as I do !

Storing data in a file is easy, but what if your data is an object? How can you store data like strings, lists, tuples, complex numbers, integers… and ensure that another program will be able to read it back and recognize its type? Well, you could always try to craft something yourself, but with complicated, dynamic and nested structures, things can get messy.

That’s what the pickle module is for. Let’s take an example.

import pickle

data = [42,1.337,"SPAM",complex(1,2),[1,2,str(2)]]

with open("PickleTest.txt","w") as f:
    pickle.dump(data,f)
    # no need for f.close() : the with statement closes the file for us

with open("PickleTest.txt","r") as g:
    data2 = pickle.load(g)

assert data == data2
#True !


type(data2[2])
# <type 'str'>
type(data2[3])
# <type 'complex'>
type(data2[4])
# <type 'list'>
type(data2[4][2])
# <type 'str'>

The two important functions here are pickle.dump(data,f) and pickle.load(g). The first one writes the nested list data into the file f, and the second one reads from this file and rebuilds the original data in memory. We then check that the two variables are still exactly the same, with the assert statement.

The complexity of the data we have stored here is quite low, however pickle can dump/load almost anything :

  • None, True, and False
  • integers, long integers, floating point numbers, complex numbers
  • normal and Unicode strings
  • tuples, lists, sets, and dictionaries containing only picklable objects
  • functions defined at the top level of a module
  • built-in functions defined at the top level of a module
  • classes that are defined at the top level of a module
  • instances of such classes whose __dict__ or __setstate__() is picklable (see section The pickle protocol for details)

(quoted from Python official documentation)

Note that you don’t have to store your data in a file if you don’t want to. The pickle.dumps(obj) function does exactly the same thing as pickle.dump(obj,file), but returns the pickled data as a string instead of writing it to a file; its counterpart pickle.loads(string) reads it back.
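A minimal sketch of the in-memory round trip:

```python
import pickle

data = [42, 1.337, "SPAM", complex(1, 2)]

# dumps() returns the pickled data instead of writing it to a file
blob = pickle.dumps(data)

# loads() is the in-memory counterpart of load()
assert pickle.loads(blob) == data
```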

Conclusion: pickle is a very powerful tool for typed data storage, able to store and read back almost any type of data. Pickle mustn’t be combined with the plain read/write functions on the same file: that would defeat the purpose… If you are curious, open a pickled file in a text editor to see why.

