Fuller Web Development

  BLOG | CONTACT | CLIENTS

Python Performance Part 2: Parsing Large Strings for 'A Href' Hypertext

Goal

Write a fast Python script that will take a large string and reduce it to a list of all of the hyperlinks in the html string; such as [”http://world.org”,”/tree”].

Attempt 1: Self-Recursion

f = open('hypertext_sm.html','r')
ahrefs = []
count = []
def find_ahref(h):
    a = h.find("<a href=")
    if a != -1:
        a = a+9
        b = a + h[a:-1].find(">")-1
        ahrefs.append(h[a:b])
        find_ahref(h[b:-1])

find_ahref(f.read())
f.close()

Summary

Fairly fast with small strings, however with large strings it causes a memory overload from the large string being stored multiple times from the self-recursion, causing the script to fail.

Attempt 2: Reduce

f = open('hypertext_sm.html','r')
def find_ahref(a,b):
    try:
        c = a[1] + str(b)
        x = c.find("<a href=")
        y = c[x:-1].find(">")
        if x != -1 and y != -1:
            a[0].append(c[x+9:x+y-1])
            return (a[0],"")
        else:
            return (a[0],c)
    except:
        return ([],str(a)+str(b))

hrefs = reduce(find_ahref,f.read())[0]
f.close()

Summary

Not as fast as the previous with smaller strings, however it does not overload the memory and it atually completed parsing the larger string (43Mb). Because it took nearly 7min to run though it is difficult for in to be a solution.

Attempt: 3: While Readline

f = open('hypertext.html','r')

ahrefs = []
def find_ahref(h):
    a = h.find("<a href=")
    if a != -1:
        a = a+9
        b = a + h[a:-1].find(">")-1
        ahrefs.append(h[a:b])
        find_ahref(h[b:-1])

while True:
    line = f.readline()
    if line:
        find_ahref(line)
    else:
        break

f.close()

Summary

This is pretty fast with both small and large strings (with many lines). However if it was handed a very large single line it would crumble as Attempt 1: Self-Recursion.

Attempt 4: Map/Reducer Readlines

f = open('hypertext.html','r')
def ahref_reducer(a,b):
    try:
        c = a[1] + str(b)
        x = c.find("<a href=")
        y = c[x:-1].find(">")
        if x != -1 and y != -1:
            a[0].append(c[x+9:x+y-1])
            return (a[0],"")
        else:
            return (a[0],c)
    except:
        return ([],str(a)+str(b))

def get_ahrefs(line):
    return reduce(ahref_reducer,line)[0]

lines = f.readlines()
hrefs = map(get_ahrefs,lines)
f.close()

Summary

Slower but doesn’t destroy memory. However, it doesn’t really meet the goal either as it returns a nasty list with empty parts.

Attempt 5: Reduce Recurse Readlines

f = open('hypertext.html','r')

def find_ahref(z,htxt):
    a = htxt.find("<a href=")
    if a != -1:
        a = a+9
        b = a+htxt[a:-1].find(">")-1
        href = htxt[a:b]
        z.append(href)
        return find_ahref(z,htxt[b:-1])
    else:
        return z

def preduce(fn,ls,a):
    ls.insert(0,a)
    return reduce(fn,ls)

lines = f.readlines()
hrefs = preduce(find_ahref,lines,[])

f.close()

Summary

Fast, although it relies on their being many multiple lines.

Attempt 6: Something Completly Different

Instead of looking for the “a href” first we will look for the end “>” and search for the “a href” to that point.

hrefs = []

f = open("hypertext.html","r")
f_str = f.read()
f.close()

while True:
    b = f_str.find(">")
    if b == -1:
        break
    a = f_str[0:b].find("<a href=")
    if a != -1:
        hrefs.append(f_str[a+9:b-1])
    f_str = f_str[b+1:-1]

Summary

While good in theory, it does not handle large strings well; and by well I mean not at all. However it did get the correct hrefs from a smaller list.

Attempt 7: Split & (P)Reduce

Rather than breaking the large string up by line, we will break it up by a character “>”.

def is_ahref(a,b):
    y = b.find("<a href=")
    if y != -1: a.append(b[y+9:-1])
    return a

def preduce(fn,ls,a):
    ls.insert(0,a)
    return reduce(fn,ls)

hrefs = preduce(is_ahref,open("hypertext.html","r").read().split(">"),[])

Summary

This is the fastest of them all, slightly faster than Reduce Recurse Readlines, and can even handle large single string lines quickly.

Conclusion

  • Self-recursion is not good when passing around the same string to itself.
  • Concatenating a long list of charaters from a string and checking for “a href” does not destroy memory but is very slow.
  • Breaking a large string into smaller ones is faster and deosn’t destroy memory.
  • Reading by line is only one way to make a large string (or file) into smaller strings.