Fuller Web Development JavaScript, PHP, and Python Web Development by Braydon Fuller

Python Performance Part 3: Python 3000 and Transforming Large Lists into Seperate Smaller Lists

Preface

This is a redux of Python Performance Part 1, where the fastest method was using the reduce builtin function in Python2.5. December 3rd, Python 3000 final was released so I have downloaded it and gone over some of these scripts again. In Python 3000 the reduce function is no longer a builtin, and has moved to the module functools. When doing some general comparisons between Python2.5 and Python 3000, the later seemed to always run slightly slower. This is due to the new IO system and unicode indentifiers, as I was told in #python channel by Crys_. It was also recommended that I also compare my tests with List Comprehension, of which is new to me.

List Comprehension

from oids import oids as a
c = 3
res = [a[x:x+c] for x in [c*x for x in range(int(round(len(a)/c)))]]

Python 3k Times

real  0m0.218s
user  0m0.180s
sys   0m0.016s

real  0m0.262s
user  0m0.204s
sys   0m0.032s

real  0m0.287s
user  0m0.244s
sys   0m0.008s

Python 2.5 Times

real  0m0.244s
user  0m0.220s
sys   0m0.016s

real  0m0.229s
user  0m0.208s
sys   0m0.020s

real  0m0.251s
user  0m0.236s
sys   0m0.020s

This makes it the fastest method, beating the previous fastest time of 0.34s. Even better, there is little difference between Python2.5 and Python3000 here.

Free Art Exhibition Website

INTOINFINITY.ORG

Administration of artworks via a JSON text file using Emacs through SSH

Python Performance Part 2 Redux: Split & Reduce Large Strings for 'A Href' Hypertext

split_2.py

def get_value(a):
    return a[1:a.find(">")-1]
hrefs = map(get_value,open("hypertext.html","r").read().split("<a href="))

Timing Comparison: ~ 300% Performance Improvement

Note: hypertext.html is 48MB.

braydon@bgf:~/python_tests/extract$ time python split.py 

real    0m1.263s
user    0m1.112s
sys     0m0.156s

braydon@bgf:~/python_tests/extract$ time python split_2.py 

real    0m0.392s
user    0m0.268s
sys     0m0.120s

split.py

Previously, I had found the best solution to my problem was to split() the large string up by the “>” character, and then reduce to a list of hyperlinks.

def is_ahref(a,b):
    y = b.find("<a href=")
    if y != -1: a.append(b[y+9:-1])
    return a

def preduce(fn,ls,a):
    ls.insert(0,a)
    return reduce(fn,ls)

hrefs = preduce(is_ahref,open("hypertext.html","r").read().split(">"),[])

There is a better solution. Splitting the text up by the “>” character is wasteful; there are many “>”s in html, and most of them that will not have hyperlinks. We don’t need to even check if the item in the list is an href if we split the string into a list that all will have an href, and then reduce it as before.

def is_ahref(a,b):
    z = b.find(">")
    a.append(b[1:z-1])
    return a

def preduce(fn,ls,a):
    ls.insert(0,a)
    return reduce(fn,ls)

fc = open("hypertext.html","r").read().split("<a href=")
hrefs = preduce(is_ahref,fc,[])

However because the size of the list will be exactly the same as it started, we shouldn’t need to use reduce(), but rather we can just map() a fuction to run through the entire list, ‘reducing’ it to a list of just the hyperlinks.

split_2.py

def get_value(a):
    return a[1:a.find(">")-1]
hrefs = map(get_value,open("hypertext.html","r").read().split("<a href="))

Python Performance Part 2: Parsing Large Strings for 'A Href' Hypertext

Goal

Write a fast Python script that will take a large string and reduce it to a list of all of the hyperlinks in the html string; such as [”http://world.org”,”/tree”].

Attempt 1: Self-Recursion

f = open('hypertext_sm.html','r')
ahrefs = []
count = []
def find_ahref(h):
    a = h.find("<a href=")
    if a != -1:
        a = a+9
        b = a + h[a:-1].find(">")-1
        ahrefs.append(h[a:b])
        find_ahref(h[b:-1])

find_ahref(f.read())
f.close()

Summary

Fairly fast with small strings, however with large strings it causes a memory overload from the large string being stored multiple times from the self-recursion, causing the script to fail.

Attempt 2: Reduce

f = open('hypertext_sm.html','r')
def find_ahref(a,b):
    try:
        c = a[1] + str(b)
        x = c.find("<a href=")
        y = c[x:-1].find(">")
        if x != -1 and y != -1:
            a[0].append(c[x+9:x+y-1])
            return (a[0],"")
        else:
            return (a[0],c)
    except:
        return ([],str(a)+str(b))

hrefs = reduce(find_ahref,f.read())[0]
f.close()

Summary

Not as fast as the previous with smaller strings, however it does not overload the memory and it atually completed parsing the larger string (43Mb). Because it took nearly 7min to run though it is difficult for in to be a solution.

Attempt: 3: While Readline

f = open('hypertext.html','r')

ahrefs = []
def find_ahref(h):
    a = h.find("<a href=")
    if a != -1:
        a = a+9
        b = a + h[a:-1].find(">")-1
        ahrefs.append(h[a:b])
        find_ahref(h[b:-1])

while True:
    line = f.readline()
    if line:
        find_ahref(line)
    else:
        break

f.close()

Summary

This is pretty fast with both small and large strings (with many lines). However if it was handed a very large single line it would crumble as Attempt 1: Self-Recursion.

Attempt 4: Map/Reducer Readlines

f = open('hypertext.html','r')
def ahref_reducer(a,b):
    try:
        c = a[1] + str(b)
        x = c.find("<a href=")
        y = c[x:-1].find(">")
        if x != -1 and y != -1:
            a[0].append(c[x+9:x+y-1])
            return (a[0],"")
        else:
            return (a[0],c)
    except:
        return ([],str(a)+str(b))

def get_ahrefs(line):
    return reduce(ahref_reducer,line)[0]

lines = f.readlines()
hrefs = map(get_ahrefs,lines)
f.close()

Summary

Slower but doesn’t destroy memory. However, it doesn’t really meet the goal either as it returns a nasty list with empty parts.

Attempt 5: Reduce Recurse Readlines

f = open('hypertext.html','r')

def find_ahref(z,htxt):
    a = htxt.find("<a href=")
    if a != -1:
        a = a+9
        b = a+htxt[a:-1].find(">")-1
        href = htxt[a:b]
        z.append(href)
        return find_ahref(z,htxt[b:-1])
    else:
        return z

def preduce(fn,ls,a):
    ls.insert(0,a)
    return reduce(fn,ls)

lines = f.readlines()
hrefs = preduce(find_ahref,lines,[])

f.close()

Summary

Fast, although it relies on their being many multiple lines.

Attempt 6: Something Completly Different

Instead of looking for the “a href” first we will look for the end “>” and search for the “a href” to that point.

hrefs = []

f = open("hypertext.html","r")
f_str = f.read()
f.close()

while True:
    b = f_str.find(">")
    if b == -1:
        break
    a = f_str[0:b].find("<a href=")
    if a != -1:
        hrefs.append(f_str[a+9:b-1])
    f_str = f_str[b+1:-1]

Summary

While good in theory, it does not handle large strings well; and by well I mean not at all. However it did get the correct hrefs from a smaller list.

Attempt 7: Split & (P)Reduce

Rather than breaking the large string up by line, we will break it up by a character “>”.

def is_ahref(a,b):
    y = b.find("<a href=")
    if y != -1: a.append(b[y+9:-1])
    return a

def preduce(fn,ls,a):
    ls.insert(0,a)
    return reduce(fn,ls)

hrefs = preduce(is_ahref,open("hypertext.html","r").read().split(">"),[])

Summary

This is the fastest of them all, slightly faster than Reduce Recurse Readlines, and can even handle large single string lines quickly.

Conclusion

  • Self-recursion is not good when passing around the same string to itself.
  • Concatenating a long list of charaters from a string and checking for “a href” does not destroy memory but is very slow.
  • Breaking a large string into smaller ones is faster and deosn’t destroy memory.
  • Reading by line is only one way to make a large string (or file) into smaller strings.

Python Performance Part 1: Transforming Large Lists into Seperate Smaller Lists

Goal

Write a fast Python script that will take a large list and break it up into smaller sub-lists based on a set size; such as transforming [a,b,c,d,e,f] into [[a,b],[c,d],[e,f]].

Attempt 1: Map/Reduce (0.93s)

#import a list of 247,213 integers
from oids import oids

def pre(a):
    return (list(), a, 0)

def make_sets(a,b):
    set_size = 8
    if a[2] == set_size - 1 or a[0] == list():
        a[0].append([a[1]])
        return (a[0],b[1],0)
    else:
        a[0][-1].append(a[1])
        return (a[0],b[1],a[2]+1)

reduce(make_sets,map(pre,oids))

Times

real    0m0.935s
user    0m0.896s
sys     0m0.032s

real    0m0.948s
user    0m0.920s
sys     0m0.028s

real    0m0.929s
user    0m0.900s
sys     0m0.024s

Attempt 2: For-loop (0.41s)

from oids import oids

output = list()
count = 0
set_size = 8
for oid in oids:
    if count == set_size or output == list():
        output.append([oid])
        count = 0
    else:
        output[-1].append(oid)
        count = count + 1

Times

real    0m0.429s
user    0m0.404s
sys     0m0.024s

real    0m0.396s
user    0m0.384s
sys     0m0.012s

real    0m0.410s
user    0m0.396s
sys     0m0.012s

Attempt 3: Map (0.48s)

from oids import oids

output = list()
set_size = 8
count = [0]

def break_apart(a):
    if count[-1] == set_size or output == list():
        output.append([a])
        count.append(0)
    else:
        output[-1].append(a)
        count.append(count[-1] + 1)

map(break_apart,oids)

Timing

real    0m0.484s
user    0m0.476s
sys     0m0.012s

real    0m0.483s
user    0m0.464s
sys     0m0.016s

real    0m0.482s
user    0m0.452s
sys     0m0.028s

Attempt 4: Reduce (0.34s)

from oids import oids

def seperate(a,b,length=8):
    try:
        if len(a[-1]) == length:
            a.append([b])
            return a
        else:
            a[-1].append(b)
            return a
    except:
        return [[a,b]]

oids = reduce(seperate,oids)

Timing

real    0m0.323s
user    0m0.308s
sys     0m0.016s

real    0m0.329s
user    0m0.300s
sys     0m0.028s

real    0m0.353s
user    0m0.332s
sys     0m0.020s

The Importance in Rewriting Code

I needed a way of translating “644″, an object permission setting, into something more readable.

I wanted to be able to do the following:

def determinePermissions(obj):
    obj_usr= {"owner":obj.owner, "group":obj.group, "everyone":True}

    usr = {
        "owner":cherrypy.session.get('user'),

        "group":cherrypy.session.get('group'),
        "everyone":True

    }
    perms = self.translatePermissions(obj)
    r, w = False, False

    for x in ["owner", "group", "everyone"]:

        r = True if perms[x]["read"] and obj_user[x] == usr[x] else r

        w = True if perms[x]["write"] and obj_user[x] == usr[x] else w

    if usr["owner"] == "root":
        r, w = True, True

    return {"read":r, "write":w}

So I first wrote this [unexecuted]:

    def translatePermissions(self, obj):

        obj = self.get(oid=object_id)
        user, group, everyone = struct.unpack("3c",str(obj.permissions)):</p><p>        if owner == 6:

            owner["read"] = True
            owner["write"] = True

        if owner == 4:
            owner["read"] = True

            owner["write"] = False
        if owner == 0:

            owner["read"] = False
            owner["write"] = False</p><p>        if group == 6:

            group["read"] = True
            group["write"] = True

        if group == 4:
            group["read"] = True

            group["write"] = False
        if group == 0:

            group["read"] = False
            group["write"] = False</p><p>        if everyone == 6:

            everyone["read"] = True
            everyone["write"] = True

        if everyone == 4:
            everyone["read"] = True

            everyone["write"] = False
        if everyone == 0:

            everyone["read"] = False
            everyone["write"] = False</p><p>        return {"owner": owner, "group": group, "everyone": everyone}

And then I rewrote it to the following:

    def translatePermissions(self, object_id):
        rw = []

        obj = self.get(oid=object_id)
        for value in struct.unpack("3c",str(obj.permissions)):

            if value == 6:
                r = True

                w = True
            elif value == 4:

                r = True
                w = False
            elif value == 2:

                r = False
                w = True
            else:

                r = False
                w = False
            rw.append({"read":r, "write":w})

        return {"owner": rw[0], "group": rw[1], "everyone": rw[2]}

And then checking for read and write rather than going through each possibility of v, as well as passing an object prior to reduce multiple seeks:

    def checkPermissions(self, obj):
        rw = []

        for v in struct.unpack("3c",str(obj.permissions)):

            if v == 6 or v == 4:

                r = True
            else:
                r = False

            if v == 6 or v == 2:

                w = True
            else:
                w = False

            rw.append({"read":r, "write":w})
        return {"owner": rw[0], "group": rw[1], "everyone": rw[2]}

But with Python2.5 you can go even further and reduce it to:

    def translatePermissions(self, obj):
        rw = []

        for v in struct.unpack("3c",str(obj.permissions)):

            r = True if v == 6 or v == 4 else False

            w = True if v == 6 or v == 2 else False

            rw.append({"read":r, "write":w})
        return {"owner": rw[0], "group": rw[1], "everyone": rw[2]}

From 32 lines at the start to 8 lines in the end! :) And both together they look like this:

def determinePermissions(obj):
    obj_usr= {"owner":obj.owner, "group":obj.group, "everyone":True}

    usr = {
        "owner":cherrypy.session.get('user'),

        "group":cherrypy.session.get('group'),
        "everyone":True

    }
    perms = self.translatePermissions(object_id)
    r, w = False, False

    for x in ["owner", "group", "everyone"]:

        r = True if perms[x]["read"] and obj_user[x] == usr[x] else r

        w = True if perms[x]["write"] and obj_user[x] == usr[x] else w

    if usr["owner"] == "root":
        r, w = True, True

    return {"read":r, "write":w}</p><p>def translatePermissions(self, obj):

    rw = []
    for v in struct.unpack("3c",str(o.permissions)):

        r = True if v == 6 or v == 4 else False

        w = True if v == 6 or v == 2 else False

        rw.append({"read":r, "write":w})
    return {"owner": rw[0], "group": rw[1], "everyone": rw[2]}

Poetry!