The New Google App Engine Blobstore API – First Thoughts

Cloud Computing, Programming December 15th, 2009

Google’s App Engine 1.3.0 was released yesterday along with a brand new Blobstore API allowing the storage and serving of files up to 50MB.

Store and Serve – Files can be uploaded and stored as blobs, to be served later in response to user requests. Developers can build their own organizational structures and access controls on top of blobs.

The way this API works is pretty simple. To upload files, you call an API that manufactures a POST URL to which web form requests containing file data are submitted. App Engine processes the POST request and creates the blobs in its storage (along with BlobInfo objects – read-only datastore entities containing the metadata of each blob). It then rewrites the request, removing the uploaded file data and replacing it with a Blobstore key pointing to the stored blob in the App Engine Blobstore, and calls your handler with this data.
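
To make this concrete, here’s a minimal sketch of the upload flow using the Python SDK’s webapp framework (the handler names and URL paths are my own, not part of the API):

    from google.appengine.ext import blobstore, webapp
    from google.appengine.ext.webapp import blobstore_handlers

    class UploadFormHandler(webapp.RequestHandler):
        def get(self):
            # Manufacture the one-time POST URL; App Engine will call
            # /upload_done only after the blob has already been stored.
            upload_url = blobstore.create_upload_url('/upload_done')
            self.response.out.write(
                '<form action="%s" method="POST" enctype="multipart/form-data">'
                '<input type="file" name="file"><input type="submit">'
                '</form>' % upload_url)

    class UploadDoneHandler(blobstore_handlers.BlobstoreUploadHandler):
        def post(self):
            # The file data is gone from the request by now; all we get
            # is the BlobInfo metadata pointing at the stored blob.
            blob_info = self.get_uploads('file')[0]
            self.redirect('/serve/%s' % blob_info.key())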

To serve an existing blob in your app, you put a special header in the response containing the blob key. App Engine replaces the body of the response with the content of the blob.
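
As a rough sketch, serving looks something like this (send_blob is the SDK helper that sets that special header for you):

    from google.appengine.ext.webapp import blobstore_handlers

    class ServeHandler(blobstore_handlers.BlobstoreDownloadHandler):
        def get(self, blob_key):
            # send_blob sets the special X-AppEngine-BlobKey header; App
            # Engine swaps the response body for the blob's content.
            self.send_blob(blob_key)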

Now this is pretty straightforward, but there are a few concerns with this approach:

1. What about request validation (authentication/authorization etc.)?

When uploading files, the request reaches your code only after the blobs have already been processed and stored. This means you can only handle authentication/authorization, or even form validation, after the data has been stored.

This means you’ll have to write code to clean up the relevant blob entries in case of failed authentication/authorization/validation – more datastore API calls, more CPU…
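
A sketch of what that cleanup might look like, using the users API as a stand-in for whatever authorization check your app actually needs:

    from google.appengine.api import users
    from google.appengine.ext import blobstore
    from google.appengine.ext.webapp import blobstore_handlers

    class GuardedUploadHandler(blobstore_handlers.BlobstoreUploadHandler):
        def post(self):
            blob_info = self.get_uploads('file')[0]
            if users.get_current_user() is None:
                # Too late to reject the upload itself - the blob is
                # already stored, so we have to delete it ourselves.
                blobstore.delete(blob_info.key())
                self.error(403)
                return
            self.redirect('/serve/%s' % blob_info.key())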

It also means that without taking care of these special cases, any newbie hacker with a simple sniffer (or FireBug) can start uploading (and potentially serving) files off your service (see update).

2. No way to preprocess data

As the file data is already stored before your handler is called, there’s no way to preprocess submitted data other than reading it from the store, processing it, and storing it again.

There’s also no straightforward API to access or store blob data in code, so the above process has to be implemented using URL fetching (fetch the image via an HTTP call, process it, store it again using an HTTP POST call).
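
A hedged sketch of that round-trip (the app URL and the transform step are placeholders, and the multipart encoder is hand-rolled just for illustration):

    from google.appengine.api import urlfetch
    from google.appengine.ext import blobstore

    def encode_multipart(field, filename, data):
        # Minimal multipart/form-data encoder, just enough for the sketch.
        boundary = 'BLOB_SKETCH_BOUNDARY'
        body = ('--%s\r\n'
                'Content-Disposition: form-data; name="%s"; filename="%s"\r\n'
                'Content-Type: application/octet-stream\r\n\r\n'
                '%s\r\n--%s--\r\n' % (boundary, field, filename, data, boundary))
        return body, 'multipart/form-data; boundary=%s' % boundary

    def reprocess(blob_key, transform):
        # 1. Fetch the blob's bytes over HTTP (there's no direct read API).
        url = 'http://myapp.appspot.com/serve/%s' % blob_key  # placeholder
        data = transform(urlfetch.fetch(url).content)
        # 2. Store the result by POSTing to a fresh one-time upload URL.
        body, content_type = encode_multipart('file', 'processed.bin', data)
        urlfetch.fetch(blobstore.create_upload_url('/upload_done'),
                       payload=body, method=urlfetch.POST,
                       headers={'Content-Type': content_type})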

There must be a way for the Google App Engine team to wrap this up nicely and provide a clean API for doing this efficiently (along with solving the validation problem described above).


As the Blobstore API is still in its experimental phase, I guess we’ll see some quick progress made on its development, and hopefully the Google team will solve the issues above.

At least now there’s the beginning of an alternative to Amazon S3 for AppEngine applications.


Update:

Brett Slatkin notes that when the API manufactures the POST URL to be used for uploading the files, it creates a unique one-time URL, which mitigates any potential sniffing.
This fits perfectly the scenario where you’re rendering a web form to be submitted by the user. But it makes things harder if you’re trying to provide a REST API that allows uploading files (think of something like TwitPic, for example). In this case you’ll have to write your own client code that simulates what a web form would do (get the files, create a random POST URL, call it, …)


iPhone vs. Droid

Humor, Technology December 14th, 2009


I found the following comparison between the iPhone and the Droid ads hilarious.

Especially, the following Droid bullets:

  • It is fast and it despises aesthetics.
  • It is packaged inside missiles launched by stealth jets. (*)
  • It is a robot and should mostly be handled by other robots.
  • Droid is to be used with robotic hands in a low-lit hi-tech laboratory or warehouse.

Actually, in one of the ads they say the Droid is like a Scud, a Soviet missile that’s not known for its accuracy…

In any case, I’ve used an Android device before (the Samsung Galaxy) and I can’t really find anything good to say about the device or the Android OS (and its Apps Market).
I can’t believe there are people in the blogosphere calling the Galaxy and the Droid iPhone alternatives…


Insight: Hiring Programmers

Programming November 30th, 2009

There’s a very interesting blog post over at Raw Thought on the topic of hiring programmers. It offers the following insight on hiring:

There are three questions you have when you’re hiring a programmer (or anyone, for that matter):

  • Are they smart?
  • Can they get stuff done?
  • Can you work with them?

Someone who’s smart but doesn’t get stuff done should be your friend, not your employee. You can talk your problems over with them while they procrastinate on their actual job.

Someone who gets stuff done but isn’t smart is inefficient: non-smart people get stuff done by doing it the hard way and working with them is slow and frustrating.

Someone you can’t work with, you can’t work with.

I think it’s a much better and more effective approach than the traditional method of asking cheesy, annoying riddles and problems…


Dov Moran’s Latest Invention: A Miniature Company

Technology November 15th, 2009

This couldn’t be more ironic. The title of the presentation by Dov Moran, CEO and Chairman of Modu, at the upcoming TheMarker convention (translated from Hebrew): “How to go from a huge company to a large one, from large to medium, from medium to small and from small to miniature, or the opposite”

(via Ido Keinan)


Building an iPhone Application

Programming October 29th, 2009

Over the past few weeks I’ve been working on a new venture centered around the iPhone. The process of building our app has been quite an adventure, and we’ve experimented with several technologies that were new to us before reaching our current technology stack.
As we’ve finally got our stuff together and made an initial release to a group of testers, I thought I’d share some of the technology choices we’ve made and the reasons behind them.

First some information about the team

…because technology choices are affected by the team’s technical skillset.

  • We’re 3 developers (Yosi, Udi and myself) and one designer (the awesome Naor Suki).
  • We’ve allocated two developers for the iPhone and one for the backend APIs & website.
  • We’re all veteran developers, with experience mostly on Microsoft’s development stack. This project meant going out of our comfort zone to a whole new set of technologies. Experience does make a difference in easing the learning curve…

iPhone Development

  • iPhone SDK: This one is obvious, right? We looked for alternatives to writing Objective-C. Unfortunately, Flash isn’t available for the iPhone (yet?) and MonoTouch looked promising but isn’t quite there…
    Besides, it’s always better to be developing on the platform most developers are using, which means there’s a big community that can help when you get stuck. Being on Apple’s official stack also means we get the latest features without having to wait for a 3rd party to convert them…
    To be perfectly honest, I’m not on the iPhone side of the development and did not actually write a single line of Objective-C code, but I noticed it took my teammates 1-2 weeks to get the hang of it.
    To me, the fact that the iPhone App Store is so successful and has so many apps with Objective-C as the development language (which is definitely harder than modern languages – Java etc.) makes Apple’s achievement even more amazing…
  • Three20: A handful of UI components extracted from the Facebook iPhone app and open-sourced by its developer, joehewitt. His announcement blog post details the libraries it contains, shows some demos, etc.
    The source code is also a great learning tool for how stuff is done on the iPhone.
  • json-framework: This is a pretty slick JSON parser for the iPhone. Hand-parsing JSON in Obj-C would not have been fun; this made it easy. I’m pretty sure I followed this tutorial to get it up and running.
  • ASIHttpRequest: A nice HTTP framework that makes handling asynchronous HTTP requests easy.
  • Stackoverflow is an invaluable resource for asking questions and solving all sorts of problems. As Yosi, who’s been concentrating on the iPhone side of our development, puts it: “I don’t think any iPhone development could be done without StackOverflow”
  • MGTwitterEngine: An awesome Objective-C wrapper for the Twitter API, which we based our API on.
  • Google Analytics SDK: A library enabling sending information to Google Analytics from the iPhone. This is important for measuring the way users interact with certain flows in our program. For example, it helps us measure the conversion of our signup flow – how many users go through the signup flow and finish, and if they don’t, which steps make them go away?
    This kind of functionality is essential to measuring and improving UX flows…
  • Google Toolbox for Mac: A library for working with the different services exposed by Google.

Backend Development

After playing around with ASP.NET MVC (which we all had a background in, having come from the Microsoft ecosystem) and Ruby on Rails (because it’s cheaper to host than ASP.NET and way simpler, faster, and more fun to use, IMHO), we finally settled on Google AppEngine and django (Python).

We made the decision to base our development on django rather than on Google’s own webapp framework for the following reasons:

  • Lots of out-of-the-box features. django has been out there for quite a while and is bundled with lots of features (like an easy-to-build admin interface, an authentication system, a validation system, etc.)
  • Big community. There are lots of people doing django out there, and lots of open-source libraries and samples available. As a rule of thumb, it’s always better to be on the majority side…
  • Not specific to Google AppEngine. django is a standalone Python web development platform. While some parts of our code have to be AppEngine-specific, it would still be considerably easier to move away from AppEngine (if we ever decide to do so) than if we were entirely Google-specific.

Google AppEngine also has Java support. But using Python with django is way easier and has a lot more support when it comes to both AppEngine and web development. Seriously, if you’re thinking of using Java, don’t! Take the leap and go with Python…

Libraries we’ve used:

  • app-engine-patch: This library is absolutely amazing and a must if you’re using django on AppEngine. Since the AppEngine datastore API is not compatible with django’s API, a lot of the really cool time-saving features of django simply won’t run on AppEngine (such as the admin UI, authentication, and basically anything that requires data access). app-engine-patch loads django and patches it so it is compatible with the AppEngine API, making all those cool django features work. This one is a must! You just download their project template and start developing your application on top of it.
  • PyDev: An Eclipse plugin for editing and debugging Python and AppEngine applications. It might sound obvious, but I was actually using Notepad++ (on Windows) for development until I found out there’s a decent IDE I could use…
  • Piston: A django library for developing REST APIs. While it’s not entirely compatible with AppEngine, a simple fork was enough to edit those parts out… (see the sketch after this list)
  • GeoModel: provides basic indexing and querying of geospatial data on Google AppEngine.
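
To give a taste of the Piston side, here’s a minimal sketch of a read-only handler (the resource name and data are invented for illustration; this isn’t our actual API):

    # handlers.py - a read-only REST resource
    from piston.handler import BaseHandler

    class PlaceHandler(BaseHandler):
        allowed_methods = ('GET',)

        def read(self, request):
            # Piston serializes the return value for us.
            return [{'name': 'Example Cafe', 'city': 'Tel Aviv'}]

    # urls.py - wire the handler to a URL
    from django.conf.urls.defaults import patterns, url
    from piston.resource import Resource

    urlpatterns = patterns('',
        url(r'^api/places/$', Resource(handler=PlaceHandler)),
    )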

Also, I would recommend taking the time to learn and understand how the AppEngine datastore works, so you’ll understand how to build your data model to run efficiently on Google’s platform.
The following two presentations from Google I/O are invaluable:

So what do you think? If you’re developing an iPhone app, I’m very interested to know what your technology choices were and the reasoning behind them…

Oh, and if you have an iPhone and you live in Israel (it’s a local app, so we’re limiting our efforts to Israel at the moment) please head over to our beta signup form and sign up :)


High Performance at Massive Scale – Lessons learned at Facebook

Cloud Computing, Programming October 27th, 2009

Jeff Rothschild, Vice President of Technology at Facebook, gave a great presentation at UC San Diego on "High Performance at Massive Scale – Lessons learned at Facebook". The presentation’s abstract:

Facebook has grown into one of the largest sites on the Internet today serving over 200 billion pages per month. The nature of social data makes engineering a site for this level of scale a particularly challenging proposition. In this presentation, I will discuss the aspects of social data that present challenges for scalability and will describe the core architectural components and design principles that Facebook has used to address these challenges. In addition, I will discuss emerging technologies that offer new opportunities for building cost-effective high performance web architectures.

I’m halfway through watching it and there are already several interesting points worth a detailed post later on.
If you want to learn how Facebook manages 30K+ machines, 300 million active users, 20 billion photos, and 25TB/day of logging data, you should go and watch the talk’s webcast.


What would Twitter do with $100 million?

Technology September 29th, 2009

Last week the NY Times reported that Twitter has raised about $100 million in new funding, valuing the company at $1 billion. Just to put things in perspective, they also provide an example:

For context, that is almost double the market capitalization of Domino’s Pizza, which has 10,500 employees and had $1.4 billion in sales last year. Twitter has some 60 employees, and although it is experimenting with running advertisements on its Web site, Biz Stone, a Twitter founder, said this week at an industry conference that the company had no plans to begin widely running ads until 2010.

Twitter previously raised $55 million and has said it still has $25 million of that in the bank. So the question is, what will it do with this $100 million? Or, as I see it, who will it acquire now?

As part of its efforts to find a business model, Twitter will most likely acquire companies that’ll help it form that model. I’m thinking/betting on two major trends for such a business model:

Managing Companies’ Presence on Twitter

Twitter’s most obvious business model is helping companies manage their presence on Twitter and monitor how their brands are being discussed.

This makes companies that provide all sorts of analytics information, CRM integration, and even URL shorteners potential acquisitions for a future Twitter business package…

Local Markets, Local Social Network

The minute I read about Twitter’s $100 million round I thought of companies like Foursquare. I wasn’t really surprised when I read today’s Techmeme and noticed that Twitter’s co-founder Jack Dorsey invested in Foursquare.
Also, the Twitter team has been working very hard lately to make Twitter location-aware, allowing users to share their location via their tweets and browse stuff that is happening around them.

Twitter is great at forming local communities (just check out the local tweetups everywhere), gathering news, and providing all sorts of local information. Foursquare, as well as other location-based social networks, can really help Twitter tap into the long-tail local businesses market and take on companies such as Yelp.

So, what do you think about Twitter’s latest valuation? What will it do with its newly raised $100 million?


Data Mining – Handling Missing Values in the Database

Programming August 14th, 2009

I’ve recently answered Predicting missing data values in a database on StackOverflow and thought it deserved a mention on DeveloperZen.

One of the important stages of data mining is preprocessing, where we prepare the data for mining. Real-world data tends to be incomplete, noisy, and inconsistent and an important task when preprocessing the data is to fill in missing values, smooth out noise and correct inconsistencies.

If we specifically look at dealing with missing data, there are several techniques that can be used. Choosing the right technique depends on the problem domain – the data’s domain (sales data? CRM data? …) and our goal for the data mining process.

So how can you handle missing values in your database?

1. Ignore the data row

This is usually done when the class label is missing (assuming your data mining goal is classification), or when many attributes are missing from the row (not just one). However, you’ll obviously get poor performance if the percentage of such rows is high.

For example, let’s say we have a database of student enrollment data (age, SAT score, state of residence, etc.) and a column classifying their success in college as “Low”, “Medium” or “High”. Let’s say our goal is to build a model predicting a student’s success in college. Data rows that are missing the success column are not useful in predicting success, so they could very well be ignored and removed before running the algorithm.
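
A quick sketch of this technique with pandas (the column names are made up for illustration):

    import pandas as pd

    df = pd.DataFrame({'sat': [1200, 1350, 1280],
                       'success': ['High', None, 'Low']})
    # Technique 1: drop rows whose class label is missing.
    df = df.dropna(subset=['success'])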

2. Use a global constant to fill in for missing values

Decide on a new global constant value, like “unknown”, “N/A” or minus infinity, that will be used to fill in all the missing values.
This technique is used because sometimes it just doesn’t make sense to try and predict the missing value.

For example, let’s look at the student enrollment database again. Assume the state-of-residence attribute is missing for some students. Filling it in with some arbitrary state doesn’t really make sense, as opposed to using something like “N/A”.
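
Continuing the pandas sketch:

    import pandas as pd

    df = pd.DataFrame({'state': ['NY', None, 'CA']})
    # Technique 2: fill missing states with an explicit "N/A" marker.
    df['state'] = df['state'].fillna('N/A')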

3. Use attribute mean

Replace missing values of an attribute with the mean (or median, if it’s discrete) value for that attribute in the database.

For example, in a database of US family incomes, if the average income of a US family is X you can use that value to replace missing income values.
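
In pandas this is a one-liner:

    import pandas as pd

    incomes = pd.Series([52000, 61000, None, 48000])
    # Technique 3: replace missing values with the attribute mean.
    incomes = incomes.fillna(incomes.mean())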

4. Use attribute mean for all samples belonging to the same class

Instead of using the mean (or median) of a certain attribute calculated by looking at all the rows in a database, we can limit the calculations to the relevant class to make the value more relevant to the row we’re looking at.

Let’s say you have a car pricing database that, among other things, classifies cars as “Luxury” and “Low budget”, and you’re dealing with missing values in the cost field. Replacing the missing cost of a luxury car with the average cost of all luxury cars is probably more accurate than the value you’d get if you factor in the low-budget cars.
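
A pandas sketch of the per-class mean (column names invented):

    import pandas as pd

    cars = pd.DataFrame({'kind': ['Luxury', 'Luxury', 'Low budget', 'Low budget'],
                         'cost': [90000, None, 12000, None]})
    # Technique 4: fill each missing cost with its own class's mean cost.
    cars['cost'] = cars['cost'].fillna(
        cars.groupby('kind')['cost'].transform('mean'))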

5. Use a data mining algorithm to predict the most probable value

The value can be determined using regression, inference-based tools using Bayesian formalism, decision trees, or clustering algorithms (K-means, K-medians, etc.).

For example, we could use a clustering algorithm to create clusters of rows, which will then be used for calculating an attribute mean or median as specified in technique #3.
Another example would be using a decision tree to try and predict the probable value of the missing attribute, according to other attributes in the data.

I’d suggest looking into regression and decision trees first (ID3 tree generation) as they’re relatively easy, and there are plenty of examples on the net…
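
For instance, a decision-tree imputation sketch with scikit-learn (a tiny, made-up feature set):

    import pandas as pd
    from sklearn.tree import DecisionTreeRegressor

    df = pd.DataFrame({'age': [18, 19, 22, 20, 21],
                       'sat': [1100, 1150, None, 1300, None]})
    known = df[df['sat'].notna()]
    tree = DecisionTreeRegressor().fit(known[['age']], known['sat'])
    # Technique 5: predict the missing values from the other attributes.
    df.loc[df['sat'].isna(), 'sat'] = tree.predict(df.loc[df['sat'].isna(), ['age']])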

Additional Notes
  • Note that methods 2-5 bias the data, as the filled-in value may not be correct.
  • Method 5 uses the most information available in the present data to predict the missing value, so it has a better chance of generating less bias.
  • A missing value may not necessarily imply an error in the data! Forms may contain optional fields, and certain attributes may be in the database for future use.


Moving Your Application to Amazon’s Cloud

Cloud Computing July 25th, 2009

I’ve been dealing a lot with Amazon’s AWS platform lately – mostly doing offline data processing using Hadoop – but the latest load-balancing features finally opened the door for frontend applications to take advantage of Amazon’s cloud computing platform, making it easier for developers to make applications more cost-efficient and scalable.

Keeping in mind that there are a lot of applications out there that can benefit from moving to the cloud (including my own), I’ve made a list of tasks/considerations for preparing such a move:

Step One: Move Static Content to S3

The first and easiest step is to move all your static content – images, CSS, JavaScript files, etc. – to Amazon S3. Let Amazon worry about storage, backups and availability for you.

Things to consider:

  • GZIP content. S3 does not support serving GZIPed content natively, so you’ll have to upload both GZIPed and plaintext versions of each file and figure out in code which one to use according to the headers sent by the user’s browser (see the sketch after this list).
    The following post describes how it’s done: Using Amazon S3 as a CDN
  • Amazon’s S3 service is “eventually consistent” which means that files uploaded to S3 may not be immediately available to read.
  • Use separate sub-domains for content. Browsers limit the number of concurrent connections per hostname, so spreading content across sub-domains lets pages load more resources in parallel.
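
Here’s what the GZIP double-upload might look like with today’s boto3 (which obviously postdates this post; the bucket name is a placeholder):

    import gzip
    import boto3

    s3 = boto3.client('s3')
    body = open('site.css', 'rb').read()

    # Plain version for clients that don't accept gzip...
    s3.put_object(Bucket='my-static-bucket', Key='site.css',
                  Body=body, ContentType='text/css')
    # ...and a gzipped twin; Content-Encoding tells the browser to inflate it.
    s3.put_object(Bucket='my-static-bucket', Key='site.css.gz',
                  Body=gzip.compress(body), ContentType='text/css',
                  ContentEncoding='gzip')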

Once your content is on S3 you can also use CloudFront, Amazon’s CDN (Content Delivery Network), to serve the files and improve your application’s performance.

Step Two: Move Web Servers and Backend Servers to EC2

Move your web server code and backend services – database, memcached, etc. – to run on Amazon EC2 instances.

Consider using Amazon’s availability zones to set up servers in different availability zones. This can help you serve customers in different parts of the world better, while making your infrastructure tolerant to the unlikely event of a datacenter failure at Amazon.

The Web Servers

Moving your web servers to EC2 should be fairly simple. You can set up EC2 images that are configured exactly the same way your current web servers are.

If you require a queuing service as part of your architecture, consider switching to Amazon’s SQS to make administration easier.

The Database

Moving your database to EC2 is probably the hardest part of the move to AWS. If you plan on keeping your database (as opposed to migrating to a cloud solution like SimpleDB), you should use EBS (Elastic Block Storage) so that your storage persists independently of the life of your EC2 instance.

Backup. Figure out how to take scheduled snapshots of your EBS volumes and store them on S3.
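
With boto3 (again, a modern stand-in for the era’s tooling; the volume ID is a placeholder), a scheduled snapshot boils down to a single call:

    import boto3

    ec2 = boto3.client('ec2')
    # EBS snapshots are stored durably on S3 by AWS.
    snap = ec2.create_snapshot(VolumeId='vol-0123456789abcdef0',
                               Description='nightly backup')
    print(snap['SnapshotId'])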

Consider replication and sharding. If you’re using multiple availability zones, you should consider sharding your data. For example, store European accounts’ data in Europe only. You should also consider replication between the different availability zones to keep your site available even when one of the datacenters is unavailable.

Related Links:

Step Three: Scale – Take Advantage of the Cloud Services

Now that your application is entirely running on Amazon’s platform, it’s time to take full advantage of the platform and make it scale.

Set up monitoring to keep up with what’s going on in your system. Amazon provides a service called CloudWatch that allows you to monitor your machines and applications.

Based on the monitoring metrics you should start using Amazon’s auto-scaling and load balancing capabilities to be able to consume and release computing resources according to demand.

At this point you should also investigate reducing your dependency on relational databases (RDBMS) as much as possible (as it’s the most complex, and hardest to scale, component in the system) and try to move as much functionality as possible to S3 and SimpleDB.
S3 is suitable for storing large objects, while SimpleDB is ideal for small stubs of data.

Important notes:

  • Amazon’s load balancer doesn’t support SSL. This can be a showstopper for some applications…
  • SimpleDB has a max row size limitation. If your data exceeds that limit, you should consider using SimpleDB as a metadata store that references the full data stored on S3 (see the sketch below).
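
A sketch of that metadata pattern with boto3’s S3 and SimpleDB clients (the names are placeholders, and the exact SimpleDB calls are an assumption to verify):

    import boto3

    s3 = boto3.client('s3')
    sdb = boto3.client('sdb')  # SimpleDB

    # Store the large object on S3...
    s3.put_object(Bucket='my-data-bucket', Key='items/42/payload.bin',
                  Body=b'large binary payload')
    # ...and keep only small attributes in SimpleDB, pointing back at it.
    sdb.put_attributes(
        DomainName='items', ItemName='42',
        Attributes=[
            {'Name': 'owner', 'Value': 'alice', 'Replace': True},
            {'Name': 's3_key', 'Value': 'items/42/payload.bin', 'Replace': True},
        ])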

Related Links:


Facebook, Hadoop and Hive

Cloud Computing, Programming June 16th, 2009

Facebook has the second largest installation of Hadoop (a software platform that lets one easily write and run distributed applications that process vast amounts of data), with Yahoo being the first. It is also the creator of Hive, a data warehouse infrastructure built on top of Hadoop.

The following two posts shed some more light on why Facebook chose the Hadoop/Hive path, how they’re doing it, and the challenges they’re facing:

Facebook, Hadoop, and Hive on DBMS2 by Curt Monash discusses Facebook’s architecture and motivation.

Facebook decided in 2007 to move what was then a 15 terabyte big-DBMS-vendor data warehouse to Hadoop — augmented by Hive — rather than to an MPP data warehouse DBMS…

The daily pipeline took more than 24 hours to process. Although aware that its big-DBMS-vendor warehouse could probably be tuned much better, Facebook didn’t see that as a path to growing its warehouse more than 100-fold.

Hive – A Petabyte Scale Data Warehouse using Hadoop by Ashish Thusoo from the Data Infrastructure team at Facebook discusses Facebook’s Hive implementation in detail.

… using Hadoop was not easy for end users, specially for the ones who were not familiar with map/reduce. End users had to write map/reduce programs for simple tasks like getting raw counts or averages. Hadoop lacked the expressibility of popular query languages like SQL and as a result users ended up spending hours (if not days) to write programs for typical analysis. It was very clear to us that in order to really empower the company to analyze this data more productively, we had to improve the query capabilities of Hadoop. Bringing this data closer to users is what inspired us to build Hive. Our vision was to bring the familiar concepts of tables, columns, partitions and a subset of SQL to the unstructured world of Hadoop, while still maintaining the extensibility and flexibility that Hadoop enjoyed.
