Goal
Write a fast Python script that will take a large string and reduce it to a list of all of the hyperlinks in the html string; such as [”http://world.org”,”/tree”].
Attempt 1: Self-Recursion
f = open('hypertext_sm.html','r')
ahrefs = []
count = []
def find_ahref(h):
a = h.find("<a href=")
if a != -1:
a = a+9
b = a + h[a:-1].find(">")-1
ahrefs.append(h[a:b])
find_ahref(h[b:-1])
find_ahref(f.read())
f.close()
Summary
Fairly fast with small strings, however with large strings it causes a memory overload from the large string being stored multiple times from the self-recursion, causing the script to fail.
Attempt 2: Reduce
f = open('hypertext_sm.html','r')
def find_ahref(a,b):
try:
c = a[1] + str(b)
x = c.find("<a href=")
y = c[x:-1].find(">")
if x != -1 and y != -1:
a[0].append(c[x+9:x+y-1])
return (a[0],"")
else:
return (a[0],c)
except:
return ([],str(a)+str(b))
hrefs = reduce(find_ahref,f.read())[0]
f.close()
Summary
Not as fast as the previous with smaller strings, however it does not overload the memory and it atually completed parsing the larger string (43Mb). Because it took nearly 7min to run though it is difficult for in to be a solution.
Attempt: 3: While Readline
f = open('hypertext.html','r')
ahrefs = []
def find_ahref(h):
a = h.find("<a href=")
if a != -1:
a = a+9
b = a + h[a:-1].find(">")-1
ahrefs.append(h[a:b])
find_ahref(h[b:-1])
while True:
line = f.readline()
if line:
find_ahref(line)
else:
break
f.close()
Summary
This is pretty fast with both small and large strings (with many lines). However if it was handed a very large single line it would crumble as Attempt 1: Self-Recursion.
Attempt 4: Map/Reducer Readlines
f = open('hypertext.html','r')
def ahref_reducer(a,b):
try:
c = a[1] + str(b)
x = c.find("<a href=")
y = c[x:-1].find(">")
if x != -1 and y != -1:
a[0].append(c[x+9:x+y-1])
return (a[0],"")
else:
return (a[0],c)
except:
return ([],str(a)+str(b))
def get_ahrefs(line):
return reduce(ahref_reducer,line)[0]
lines = f.readlines()
hrefs = map(get_ahrefs,lines)
f.close()
Summary
Slower but doesn’t destroy memory. However, it doesn’t really meet the goal either as it returns a nasty list with empty parts.
Attempt 5: Reduce Recurse Readlines
f = open('hypertext.html','r')
def find_ahref(z,htxt):
a = htxt.find("<a href=")
if a != -1:
a = a+9
b = a+htxt[a:-1].find(">")-1
href = htxt[a:b]
z.append(href)
return find_ahref(z,htxt[b:-1])
else:
return z
def preduce(fn,ls,a):
ls.insert(0,a)
return reduce(fn,ls)
lines = f.readlines()
hrefs = preduce(find_ahref,lines,[])
f.close()
Summary
Fast, although it relies on their being many multiple lines.
Attempt 6: Something Completly Different
Instead of looking for the “a href” first we will look for the end “>” and search for the “a href” to that point.
hrefs = []
f = open("hypertext.html","r")
f_str = f.read()
f.close()
while True:
b = f_str.find(">")
if b == -1:
break
a = f_str[0:b].find("<a href=")
if a != -1:
hrefs.append(f_str[a+9:b-1])
f_str = f_str[b+1:-1]
Summary
While good in theory, it does not handle large strings well; and by well I mean not at all. However it did get the correct hrefs from a smaller list.
Attempt 7: Split & (P)Reduce
Rather than breaking the large string up by line, we will break it up by a character “>”.
def is_ahref(a,b):
y = b.find("<a href=")
if y != -1: a.append(b[y+9:-1])
return a
def preduce(fn,ls,a):
ls.insert(0,a)
return reduce(fn,ls)
hrefs = preduce(is_ahref,open("hypertext.html","r").read().split(">"),[])
Summary
This is the fastest of them all, slightly faster than Reduce Recurse Readlines, and can even handle large single string lines quickly.
Conclusion
- Self-recursion is not good when passing around the same string to itself.
- Concatenating a long list of charaters from a string and checking for “a href” does not destroy memory but is very slow.
- Breaking a large string into smaller ones is faster and deosn’t destroy memory.
- Reading by line is only one way to make a large string (or file) into smaller strings.
In Code on
14 June 2008 tagged Python with no comments
Goal
Write a fast Python script that will take a large list and break it up into smaller sub-lists based on a set size; such as transforming [a,b,c,d,e,f] into [[a,b],[c,d],[e,f]].
Attempt 1: Map/Reduce (0.93s)
#import a list of 247,213 integers
from oids import oids
def pre(a):
return (list(), a, 0)
def make_sets(a,b):
set_size = 8
if a[2] == set_size - 1 or a[0] == list():
a[0].append([a[1]])
return (a[0],b[1],0)
else:
a[0][-1].append(a[1])
return (a[0],b[1],a[2]+1)
reduce(make_sets,map(pre,oids))
Times
real 0m0.935s
user 0m0.896s
sys 0m0.032s
real 0m0.948s
user 0m0.920s
sys 0m0.028s
real 0m0.929s
user 0m0.900s
sys 0m0.024s
Attempt 2: For-loop (0.41s)
from oids import oids
output = list()
count = 0
set_size = 8
for oid in oids:
if count == set_size or output == list():
output.append([oid])
count = 0
else:
output[-1].append(oid)
count = count + 1
Times
real 0m0.429s
user 0m0.404s
sys 0m0.024s
real 0m0.396s
user 0m0.384s
sys 0m0.012s
real 0m0.410s
user 0m0.396s
sys 0m0.012s
Attempt 3: Map (0.48s)
from oids import oids
output = list()
set_size = 8
count = [0]
def break_apart(a):
if count[-1] == set_size or output == list():
output.append([a])
count.append(0)
else:
output[-1].append(a)
count.append(count[-1] + 1)
map(break_apart,oids)
Timing
real 0m0.484s
user 0m0.476s
sys 0m0.012s
real 0m0.483s
user 0m0.464s
sys 0m0.016s
real 0m0.482s
user 0m0.452s
sys 0m0.028s
Attempt 4: Reduce (0.34s)
from oids import oids
def seperate(a,b,length=8):
try:
if len(a[-1]) == length:
a.append([b])
return a
else:
a[-1].append(b)
return a
except:
return [[a,b]]
oids = reduce(seperate,oids)
Timing
real 0m0.323s
user 0m0.308s
sys 0m0.016s
real 0m0.329s
user 0m0.300s
sys 0m0.028s
real 0m0.353s
user 0m0.332s
sys 0m0.020s
In Code on
3 January 2008 with 1 comment
I needed a way of translating “644″, an object permission setting, into something more readable.
I wanted to be able to do the following:
def determinePermissions(obj):
obj_usr= {"owner":obj.owner, "group":obj.group, "everyone":True}
usr = {
"owner":cherrypy.session.get('user'),
"group":cherrypy.session.get('group'),
"everyone":True
}
perms = self.translatePermissions(obj)
r, w = False, False
for x in ["owner", "group", "everyone"]:
r = True if perms[x]["read"] and obj_user[x] == usr[x] else r
w = True if perms[x]["write"] and obj_user[x] == usr[x] else w
if usr["owner"] == "root":
r, w = True, True
return {"read":r, "write":w}
So I first wrote this [unexecuted]:
def translatePermissions(self, obj):
obj = self.get(oid=object_id)
user, group, everyone = struct.unpack("3c",str(obj.permissions)):</p><p> if owner == 6:
owner["read"] = True
owner["write"] = True
if owner == 4:
owner["read"] = True
owner["write"] = False
if owner == 0:
owner["read"] = False
owner["write"] = False</p><p> if group == 6:
group["read"] = True
group["write"] = True
if group == 4:
group["read"] = True
group["write"] = False
if group == 0:
group["read"] = False
group["write"] = False</p><p> if everyone == 6:
everyone["read"] = True
everyone["write"] = True
if everyone == 4:
everyone["read"] = True
everyone["write"] = False
if everyone == 0:
everyone["read"] = False
everyone["write"] = False</p><p> return {"owner": owner, "group": group, "everyone": everyone}
And then I rewrote it to the following:
def translatePermissions(self, object_id):
rw = []
obj = self.get(oid=object_id)
for value in struct.unpack("3c",str(obj.permissions)):
if value == 6:
r = True
w = True
elif value == 4:
r = True
w = False
elif value == 2:
r = False
w = True
else:
r = False
w = False
rw.append({"read":r, "write":w})
return {"owner": rw[0], "group": rw[1], "everyone": rw[2]}
And then checking for read and write rather than going through each possibility of v, as well as passing an object prior to reduce multiple seeks:
def checkPermissions(self, obj):
rw = []
for v in struct.unpack("3c",str(obj.permissions)):
if v == 6 or v == 4:
r = True
else:
r = False
if v == 6 or v == 2:
w = True
else:
w = False
rw.append({"read":r, "write":w})
return {"owner": rw[0], "group": rw[1], "everyone": rw[2]}
But with Python2.5 you can go even further and reduce it to:
def translatePermissions(self, obj):
rw = []
for v in struct.unpack("3c",str(obj.permissions)):
r = True if v == 6 or v == 4 else False
w = True if v == 6 or v == 2 else False
rw.append({"read":r, "write":w})
return {"owner": rw[0], "group": rw[1], "everyone": rw[2]}
From 32 lines at the start to 8 lines in the end!
And both together they look like this:
def determinePermissions(obj):
obj_usr= {"owner":obj.owner, "group":obj.group, "everyone":True}
usr = {
"owner":cherrypy.session.get('user'),
"group":cherrypy.session.get('group'),
"everyone":True
}
perms = self.translatePermissions(object_id)
r, w = False, False
for x in ["owner", "group", "everyone"]:
r = True if perms[x]["read"] and obj_user[x] == usr[x] else r
w = True if perms[x]["write"] and obj_user[x] == usr[x] else w
if usr["owner"] == "root":
r, w = True, True
return {"read":r, "write":w}</p><p>def translatePermissions(self, obj):
rw = []
for v in struct.unpack("3c",str(o.permissions)):
r = True if v == 6 or v == 4 else False
w = True if v == 6 or v == 2 else False
rw.append({"read":r, "write":w})
return {"owner": rw[0], "group": rw[1], "everyone": rw[2]}
Poetry!