Friday, July 22, 2011

Python for non programmers, part 4

During the last lesson, we learned how to filter out rows with invalid input. However, it is not always desirable to delete an entire row. The following variation will intercept problems on a per-value basis instead and append the word Invalid instead:


# Import libraries
import csv
import operator

# Open files and csv readers/writers
infile = open('data.txt', 'rb')
in_csv = csv.reader(infile, delimiter=',')
outfile = open('data2.txt', 'wb')
out_csv = csv.writer(outfile, delimiter=',')

# Process values from a row and intercept errors
def process(row, indices, converter, function):
try:
values = [converter(row[i]) for i in indices]
return reduce(function, values)
except(ValueError, IndexError):
return 'Invalid'

# Go through the rows
is_header = True
for row in in_csv:
if is_header:
row = row + ['Sum', 'Max', 'Description']
is_header = False
else:
row.append(process(row, [1,2], float, operator.add))
row.append(process(row, [1,2], int, max))
row.append(process(row, [1,2], str,
lambda x,y: 'Value pair of %s and %s' % (x, y)))
out_csv.writerow(row)


Your mission, should you choose to accept:


  • Try out the program. Is the output as you would have expected?

  • Add two more columns to each row. The first one should show the difference between the first and second column (a-b), the second one should show the maximum of a/10 and b/11=5.

  • Modify the program to filter out any incoming row that has fewer than 3 columns. Hint: use len(row) to get to the amount of columns.

Tuesday, July 19, 2011

Python for non programmers, part 3

Today, we will look at data format errors. Replace your old input file with the following file and re-run your program:


Row A,Row B,Row C
Not a number,a,2
Missing a value,1

After newline,3,4


The first time you run the program, you will see an error message:


Traceback (most recent call last):
File "csvcopy.py", line 17, in
row.append(float(row[1]) + float(row[2]))
ValueError: invalid literal for float(): a


If you deleted the line that causes the issue and reran the program, you would see

Traceback (most recent call last):
File "csvcopy.py", line 17, in
row.append(float(row[1]) + float(row[2]))
IndexError: list index out of range


Would you also delete that line, you would get the same error also for the blank line in your file. How do we fix this?

Replace your current program without the following code:


# Import libraries
import csv

# Open files and csv readers/writers
infile = open('data.txt', 'rb')
in_csv = csv.reader(infile, delimiter=',')
outfile = open('data2.txt', 'wb')
out_csv = csv.writer(outfile, delimiter=',')

# Go through the rows
is_header = True
row_number = 0
for row in in_csv:
row_number = row_number + 1
is_valid = True
if is_header:
row = row + ['Sum', 'Max', 'Description']
is_header = False
else:
try:
row.append(float(row[1]) + float(row[2]))
row.append(max(int(row[1]), int(row[2])))
row.append('Value pair of %s and %s' % (row[1], row[2]))
except(ValueError, IndexError):
print 'Row %s is invalid, skipping...' % row_number
is_valid = False
if is_valid:
out_csv.writerow(row)


This code will use a try/except statement to intercept any errors in parsing numbers (ValueError) or rows with insufficient amount of columns (IndexError) and not write them into the file. Your mission, should you choose toy accept: extend the program to collect all these incorrect lines in a second output-file called "invalid.txt". Try doing so by adding 6 lines or fewer.

Thursday, July 14, 2011

Python for non programmers, part 2


Welcome back, grasshopper. Ready for another round?



Note:Before you begin, assumed you use Notepad++, go to Settings/Preferences and navigate to the "Language menu and tab settings" tab. Change the "Tab size" from 4 to 2, and make sure the "replace by space" option is selected.



Lesson 2: Simple data manipulation.



Today, we are going to take our original csv data file and add a couple of rows programatically. Our program will


  • Extend the table header (first row) with three new labels ("Sum", "Max", and "Description").

  • For each data row, compute the sum as a floating point number (like 3.0).

  • For each data row, compute the maximum as an integer point number (like 3).

  • For each data row, add a quick textual description that contains both numbers.




Open up a program file, e.g. csvcopy.py in your editor and create the following program (remember, leading whitespaces are important in python):


# Import libraries
import csv

# Open files and csv readers/writers
infile = open('data.txt', 'rb')
in_csv = csv.reader(infile, delimiter=',')
outfile = open('data2.txt', 'wb')
out_csv = csv.writer(outfile, delimiter=',')

# Go through the rows
is_header = True
for row in in_csv:
if is_header:
row = row + ['Sum', 'Max', 'Description']
is_header = False
else:
row.append(float(row[1]) + float(row[2]))
row.append(max(int(row[1]), int(row[2])))
row.append('Value pair of %s and %s' % (row[1], row[2]))
out_csv.writerow(row)




Just like last time, run the program and look at the output. It should look something like this:


Row A,Row B,Row C,Sum,Max,Description
Hello,1,2,3.0,2,Value pair of 1 and 2
World,3,4,7.0,4,Value pair of 3 and 4




Experiment with the program to do other data manipulation. Try to answer the following questions from looking at the file:


  • How do I access a single element in a list of values such as row

  • How do I create a comment?

  • How do I add two values?

  • How do I find the maximum of two values?

  • How do I create a floating point number from a text?

  • How do I create an integer number from a text?

  • How do I append a single value to a list (row)?

  • How do I append multiple values to a list (row) at once?



For bonus points, try to understand how the last row.append in the program works.
You can find more information about that here.

Tuesday, July 12, 2011

Python for non programmers, part 1

A good friend of mine (not an engineer) recently faced a problem. Customers sent in data samples (usually as comma-separated files) that needed to be processed and/or validated for conformance to a specification. My friend spent hours importing the files into spreadsheets and going through them manually. The data was large for a human (a few thousand rows), but small for a computer. My friend had some basic programming skills from college and early jobs, but over the years, those had become a bit rusty. I wondered if I could help, so here goes...

The objective:
  • Getting my friend all the information needed with as little theory as possible
  • Focus on the specific use case
  • Keep it simple (no list comprehensions, or any of that stuff)
  • Must work on windows machines with minimal system modifcations.
The final result will hopefully be a chain of simple exercises to get this person up to speed.

Lesson 1: Install python, read and write a csv file:

  • Install python, version 2, from http://python.org/download/. Currently, the latest installer can be found here. Remember the folder of where python was installed.
  • Optionally, install an editor like Notepad++ (otherwise, use regular notepad)
  • Create a folder, for example c:\example
  • Add a file called data.txt with the following three lines of content:
Row A,Row B,Row C
Hello,1,2
World,3,4
  • Add a file called csvcopy.py with the following content (the leading whitespace in the last line is important):
import csv
infile = open('data.txt', 'rb')
in_csv = csv.reader(infile, delimiter=',')
outfile = open('data2.txt', 'wb')
out_csv = csv.writer(outfile, delimiter=',')
for row in in_csv:
out_csv.writerow(row)
  • Open a command prompt and switch to the folder that contains data.txt and csvcopy.py.
  • Run "C:\Python\python.exe csvcopy.py" (path needs to be adjusted depending on where you installed python).
  • Open the newly created file "data2.txt" and look at its content.
  • Exercise: rename data.txt to data.csv and modify the program to write the result into data2.csv

Sunday, April 25, 2010

Patterns of doom: chain of fetches

Let's look at another commonly seen pattern of doom that can be identified and eliminated using app stats. This specimen is very similar to chain of gets, but it is more commonly found in mashups that need to fetch a lot of data from the web. The situation is something like this:



The application gets from somewhere a list of data it needs to correlate. Each piece of data is somewhere on the web and needs to be fetched, e.g. from an atom feed or by scraping a html page. The developer then will get the data, e.g. like this:

pages = [urlfetch.fetch(url) for url in urls]


and then do the operations necessary on the data to get the information. The problem is very similar to chain of gets: each request is slow, and performing them sequentially takes time.

So, what is the solution to the problem? In the case of gets, we had an api that would let us perform all the datastore operations in one batch. For urlfetch, a similar effect can be achieved using asynchronous requests:

def asynchronous_get(list_of_urls):
rpcs = []
for url in list_of_urls:
rpc = urlfetch.create_rpc()
urlfetch.make_fetch_call(rpc, url)
rpcs.append(rpc)
return [rpc.get_result() for rpc in rpcs]


With this little helper method, we can parallelize the fetches in our previous example by simply making a small adjustment:

pages = asynchronous_get(urls)





The improvement can be significant, as shown in the second screenshot. Note that app stats shows that the time spent in RPCs is about the same as in the original example, but because they all happened at the same time, the actual time spent (Grand total) is significantly less.

As always, you can try out the pattern yourself at my demo site. Please don't overdo it though, since this is running on free quota and run out of it pretty soon otherwise. There is a reason I am calling those patterns of doom after all ;-)

PS: for those wondering why the first get in the screenshot is not batched: that's an independent call I make to find a list of youtube urls I'd like to fetch.

Sunday, April 18, 2010

behind the scenes

Some people may wonder how I got my "patterns of doom" app to automatically redirect to the app stats page after a request was executed. The secret is lies in applying a little hack to the app stats framework. On the main documentation site, an appengine_config.py is mentioned that installs the recording mechanism for app stats. I took that script and applied a small modification:

   1: def webapp_add_wsgi_middleware(app):
2: import os
3: from google.appengine.ext.appstats import recording
4: old_init = recording.Recorder.__init__
5: def new_init(self, env):
6: old_init(self, env)
7: link = 'http://%s%s/details?time=%s' % (
8: self.env.get('HTTP_HOST', ''),
9: '/stats',
10: int(self.start_timestamp * 1000))
11: os.environ['STATS_LINK'] = link
12: recording.Recorder.__init__ = new_init
13: app = recording.appstats_wsgi_middleware(app)
14: return app

Basically, I am monkeypatching the constructor of the Recorder class to dump the link it will eventually generate into an environment variable. My request logic can then read that value and use it for redirects. Other applications would also be possible (like creating an asynchronous task that reads the value out and dumps it into another, more persistent state for longer-term analysis), so I thought I'd share the code with anyone interested :-)

Saturday, April 10, 2010

Patterns of Doom: the staircase of gets

Edit: I just realized that the first post on this blog went live on April 10th 2008. Happy birthday, blog.appenginefan.com :-). Any posts from the past you enjoyed? Let me know through comments on this article.



I have received good feedback for my last post, so I thought I'd follow up with a second pattern :-). This pattern is originally from post by Nick Johnson, where you can find much more details than here. The staircase of gets usually looks like a long rpc plateau, followed by a subsequent chain of little get requests. The reason behind is often running a loop on the results of a datastore query, and then doing something that causes additional data to be loaded (e.g. additional models via reference properties):

posts = Post.all().fetch(20)
for post in posts:
self.response.out.write(post.title)
#post.author is REFERENCE PROPERTY!!!
self.response.out.write(post.author.name)


Each call to get the name out of the reference property "author" will trigger an additional datastore get. Thus, the solution is very similar as in Chain of Gets: find a way to collect all the references needed in advance and fetch them in one big datastore get. The code on this page uses the prefetch_refprops method suggested by Nick. I have made only a small addition (the if statement) that helped provide certain failure scenarios when the referenced property was not set or had been deleted from the store:

def prefetch_refprops(entities, *props):
fields = [(entity, prop) for entity in entities for prop in props]
ref_keys = [prop.get_value_for_datastore(x) for x, prop in fields]
ref_entities = dict((x.key(), x) for x in db.get(set(ref_keys)))
for (entity, prop), ref_key in zip(fields, ref_keys):
if ref_entities[ref_key]:
prop.__set__(entity, ref_entities[ref_key])
return entities


With this method, it is possible to avoid the staircase of gets by just adding a simple method call:

posts = Post.all().fetch(20)
prefetch_refprops(posts, Post.author)
for post in posts:
self.response.out.write(post.title)
self.response.out.write(post.author.name)


You can check out the code that produced this sampe from the app stats details page. If you want to re-run it and look at a freshly generated graph, click here (please be gentle; this is running on free quota ;-). And again: if you know good examples you'd like to add to the site, please let me know.