Sunday, April 25, 2010

Patterns of doom: chain of fetches

Let's look at another commonly seen pattern of doom that can be identified and eliminated using app stats. This specimen is very similar to chain of gets, but it is more commonly found in mashups that need to fetch a lot of data from the web. The situation is something like this:



The application gets from somewhere a list of data it needs to correlate. Each piece of data is somewhere on the web and needs to be fetched, e.g. from an atom feed or by scraping a html page. The developer then will get the data, e.g. like this:

pages = [urlfetch.fetch(url) for url in urls]


and then do the operations necessary on the data to get the information. The problem is very similar to chain of gets: each request is slow, and performing them sequentially takes time.

So, what is the solution to the problem? In the case of gets, we had an api that would let us perform all the datastore operations in one batch. For urlfetch, a similar effect can be achieved using asynchronous requests:

def asynchronous_get(list_of_urls):
rpcs = []
for url in list_of_urls:
rpc = urlfetch.create_rpc()
urlfetch.make_fetch_call(rpc, url)
rpcs.append(rpc)
return [rpc.get_result() for rpc in rpcs]


With this little helper method, we can parallelize the fetches in our previous example by simply making a small adjustment:

pages = asynchronous_get(urls)





The improvement can be significant, as shown in the second screenshot. Note that app stats shows that the time spent in RPCs is about the same as in the original example, but because they all happened at the same time, the actual time spent (Grand total) is significantly less.

As always, you can try out the pattern yourself at my demo site. Please don't overdo it though, since this is running on free quota and run out of it pretty soon otherwise. There is a reason I am calling those patterns of doom after all ;-)

PS: for those wondering why the first get in the screenshot is not batched: that's an independent call I make to find a list of youtube urls I'd like to fetch.

6 comments:

GAEfan said...

Great contribution! I have a question about how elegantly this would fail.

Let's say you batch 3 urlfetches, and one of them fails (takes too long, cannot find url, DNS error, returns non-200 status_code, or any number of ways a urlfetch can fail), would all 3 fetches fail?

In other words, under what scenarios would this return an error (not return an object for each fetch)?

I did a test, sending one urlfetch to a url at "0.com", with the others to good addresses, and the whole thing failed. Seems "asynchronous_get(urls)" needs to be put in a try/except.

What happens if one url takes too long to respond?

I'd like to test that. How can I create a web page that takes 40 seconds to respond?

agualis said...

Hi! I don't know if mine is a common improvable scenario so I would like to ask about it.
We are developing a Point Of Sale application for Android. And we are synching every operation performed by a client's shop up to an app engine application in order to have all the business information in the cloud.

We have been taking care about performance by doing parallel db.gets and db.puts, using keys to access our entities, using a group of entities for each shop, using memcache...however each synched operation has to perform several datastore writes in order to update the status of the shop and we cannot avoid this as we need the info.
For example, if the operation is a sale we have to perform many writes in different entities:
update the stock of each sold product, update the cash of the shop, update the credit of the clients, update the balance sheet...so the writes are consuming most of our cpu-time, even if we perform them altogether in a single db.put.
Furthermore, for some of the operations, we are launching several queued-tasks in order to perform aggregate calculations which are stored in datastore entities too, thus, more cpu-time consuming.

We do understand that our data model is complex, we are storing many data and that writes are the most critical RPC in terms of cpu-time, but, we are experiencing that Google App Engine is charging us quite more money than we expected.
Once we have executed appstats to understand it, we have seen that api ms charged for the db.puts are huge.

For instance, in a sale operation we perform a db.put writing to about 8 entities using its keys and it's taking, in Grand Total, 762 real ms but we are being charged 1227 cpu+api ms.

We do use some index and our entities are not small but not so big.

Do you consider there is anything we can do in order to make our business cheaper?

What do you thing about aggregates? For example, we update daily, weekly and monthly aggregates in order to count the products per category sold in each sale. This is again a problem in terms of cpu time because each sale must write lots of entities (one for each period and category) each time a sale is synched. Do you think it's not a good approach? What could we do? Avoiding aggregate calculation for each operation and perform, for example a daily one using Mapper Api?

I hope this is the right place to write it as this entry is about appstats patterns and maybe you know about some "trick" in order to solve our problem.

Thanks in advance.

Xenode_Blogger said...

Hi, Need Some Help Please:

¿Is there Any Way to build a service like MegaStreaming (To bypass the Megavideo Limit) Using Appengine + Python or maybe Java?

http://megastreaming.org/

The idea is to build a service that let users to bypass the megavideo limit ANYWAY, but I was thinking about something like: "The user inputs the MegaVideo URL, click "Go" and Voilá, like magic, He/she is watching the Video Without time restriction (I mean, by Bypassing it, of course)

¿Can You help me with this project?A Thanks!

The App Engine Fan said...

Hi,

> Let's say you batch 3 urlfetches, and one of them fails [...] would all 3 fetches fail?

The way I implemented it in this post, the answer would be yes. The problem is in the line

return [rpc.get_result() for rpc in rpcs]

get_result can raise a urlfetch.DownloadError. An alternative would be to catch such an error and return None in that part of the array. This would assume however that the particular code is able to handle a partial result.

The App Engine Fan said...

> Avoiding aggregate calculation for each operation and perform, for example a daily one using Mapper Api?

That might be an option. The answer probably depends a bit on your overall situation. How high is the percentage of entities modified each day? How often does an entity get recomputed? If a lot of entities that used to get updated 10 times a day will only need to get updated once, that can be a good amount of savings. But it really depends.

When I was once in a similar situation, it was enough to "collect" changes to an entity in memcache for a few minutes and then write them in bulk. However, that kind of a fix only works if the entities in question change very much, and if it does not matter if a few of the changes could get lost.

The App Engine Fan said...

> Can You help me with this project

No. Sorry.