The New Google App Engine Blobstore API – First Thoughts

Cloud Computing, Software Development December 15th, 2009

Google’s App Engine 1.3.0 was released yesterday along with a brand new Blobstore API allowing the storage and serving of files up to 50MB.

Store and Serve – Files can be uploaded and stored as blobs, to be served later in response to user requests. Developers can build their own organizational structures and access controls on top of blobs.

The way this API works is pretty simple. To upload files, you call an API that manufactures a POST URL to which web forms containing file data are submitted. App Engine processes the POST request and creates the blobs in its storage (along with BlobInfo objects – read-only datastore entities containing the metadata on each blob). It then rewrites the request, removing the uploaded file data and replacing it with a Blobstore key pointing to the stored blob in the App Engine Blobstore, and calls your handler with this data.

To serve an existing blob in your app, you put a special header in the response containing the blob key. App Engine replaces the body of the response with the content of the blob.

Now this is pretty straightforward, but there are a few concerns with this approach:

1. What about request validation (authentication\authorization etc.)?

When uploading files, the request reaches your code only after blobs have already been processed and stored. This means that you can only handle authentication\authorization or even form validation after data has been stored.

This means you’ll have to write code to clean the relevant blob entries in case of failed authentication\authorization\validation – more datastore API calls, more CPU…
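
As a rough illustration of that cleanup step, here is a minimal sketch. It does not use the App Engine SDK: `FakeBlobstore`, `handle_upload` and the user names are hypothetical stand-ins that only model the order of events (the blob is stored first, the handler validates afterwards).

```python
class FakeBlobstore:
    """Hypothetical in-memory stand-in for a blob store (not the App Engine API)."""

    def __init__(self):
        self._blobs = {}
        self._next_id = 0

    def store(self, data):
        # Models the platform storing the upload BEFORE the handler is called.
        self._next_id += 1
        key = "blob-%d" % self._next_id
        self._blobs[key] = data
        return key

    def delete(self, key):
        self._blobs.pop(key, None)

    def exists(self, key):
        return key in self._blobs


def handle_upload(store, blob_key, user, allowed_users):
    """Upload handler: it only ever sees the key of an already stored blob."""
    if user not in allowed_users:
        # Validation failed after the fact, so the handler has to clean up.
        store.delete(blob_key)
        return None
    return blob_key
```

Every rejected request costs one extra delete call in this model, which is the overhead described above.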

It also means that without taking care of these special cases any newbie hacker with a simple sniffer (or Firebug) can start uploading (and potentially serving) files off your service (see update).

2. No way to preprocess data

As the file data is already stored prior to the program’s handler being called, there’s no way to preprocess submitted data other than reading it from the store, processing it and storing it again.

There’s also no straightforward API to access or store blob data in code, so the above process has to be implemented using URL fetching (fetch the image via an HTTP call, process it, store it again using an HTTP POST call).

There must be a way for the Google App Engine team to wrap this up nicely and provide a clean API for this to be done efficiently (along with solving the validation problem described before).

 

As the Blobstore API is still in an experimental phase I guess we’ll see some quick progress made on its development, and hopefully the Google team will solve the issues above.

At least now there’s the beginning of an alternative to Amazon S3 for App Engine applications.

 

Update:

Brett Slatkin notes that when the API manufactures the POST URL to be used for uploading the files, it creates a unique one-time URL, which mitigates any potential sniffing.
This fits perfectly for the scenario where you’re rendering a web form to be submitted by the user. But it makes things harder if you’re trying to provide a REST API that allows uploading files (think of something like TwitPic, for example). In this case you’ll have to write your own code that simulates what a web form would do (get the files, create a random POST URL, call it, …).
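
To make the one-time URL idea concrete, here is a small sketch of how such URLs could be issued and redeemed. This is not how App Engine implements it; the class name, the `/_upload/` prefix and the method names are assumptions made up for illustration.

```python
import secrets


class UploadUrlIssuer:
    """Hypothetical issuer of single-use upload URLs."""

    def __init__(self, base="/_upload/"):
        self._base = base
        self._pending = {}  # token -> path of the handler to call after upload

    def create_upload_url(self, success_path):
        # A random, unguessable token makes every URL unique.
        token = secrets.token_urlsafe(16)
        self._pending[token] = success_path
        return self._base + token

    def redeem(self, upload_url):
        """Return the success path once; any replay of the URL gets None."""
        if not upload_url.startswith(self._base):
            return None
        token = upload_url[len(self._base):]
        return self._pending.pop(token, None)
```

A REST-style client would first ask the service for a fresh URL and then POST the file to it, which is the extra round trip described above.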


Moving Your Application to Amazon’s Cloud

Cloud Computing July 25th, 2009

I’ve been dealing a lot with Amazon’s AWS platform lately, mostly doing offline data processing using Hadoop, but the latest load balancing features finally opened the door for frontend applications to take advantage of Amazon’s cloud computing platform – making it easier for developers to make applications more cost efficient and scalable.

Keeping in mind that there are a lot of applications out there that can benefit from moving to the cloud (including my own), I’ve made a list of tasks and considerations for preparing for such a move:

Step One: Move Static Content to S3

The first and easiest step is to move all your static content – images, CSS, JavaScript files, etc. – to Amazon S3. Let Amazon worry about storage, backups and availability for you.

Things to consider:

  • GZIP content. S3 does not support serving GZIPed content natively so you’ll have to upload both GZIPed and plaintext versions of each file and figure out in code which one to use according to the headers sent by the user’s browser.
    The following post describes how it’s done: Using Amazon S3 as a CDN
  • Amazon’s S3 service is “eventually consistent” which means that files uploaded to S3 may not be immediately available to read.
  • Use separate sub-domains for content.
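
The GZIP consideration above comes down to inspecting the browser’s Accept-Encoding header and picking one of the two uploaded copies. A minimal sketch, assuming the compressed copy is stored under the same key with a `.gz` suffix (that naming convention and the function names are my own, not something S3 prescribes):

```python
def accepts_gzip(accept_encoding):
    """Return True if the Accept-Encoding header value allows gzip."""
    for part in (accept_encoding or "").split(","):
        fields = part.strip().split(";")
        if fields[0].strip().lower() != "gzip":
            continue
        quality = 1.0  # no explicit q-value means fully acceptable
        for param in fields[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        return quality > 0
    return False


def pick_s3_key(path, accept_encoding, gz_suffix=".gz"):
    """Choose which of the two uploaded copies to link to."""
    return path + gz_suffix if accepts_gzip(accept_encoding) else path
```

For brevity the sketch ignores the `*` wildcard; a client sending only `*` would get the plaintext copy.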

Once your content is on S3 you can also use CloudFront, Amazon’s CDN (Content Delivery Network), to serve the files and improve your application’s performance.

Step Two: Move Web Servers and Backend Servers to EC2

Move your web server code and backend services – database, memcached, etc. – to run on Amazon EC2 instances.

Consider setting up servers in different availability zones (and regions). This can help you serve customers in different parts of the world better, while making your infrastructure tolerant of the unlikely event of a datacenter failure at Amazon.

The Web Servers

Moving your web servers to EC2 should be fairly simple. You can set up EC2 images that are configured exactly the same way your current web servers are.

If you require a queuing service as part of your architecture, consider switching to Amazon’s SQS to make administration easier.

The Database

Moving your database to EC2 is probably the hardest part of the move to AWS. If you plan on keeping your database (as opposed to migrating to a cloud solution like SimpleDB) you should use EBS (Elastic Block Store) so that your storage persists independently of the life of your EC2 instance.

Backup. Figure out how to take scheduled snapshots of your EBS volumes and store them on S3.

Consider replication and sharding. If you’re using multiple availability zones you should consider sharding your data. For example, store European accounts’ data in Europe only. You should also consider replication between the different availability zones to keep your site available even when one of the datacenters is unavailable.
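
The European-accounts example can be sketched as a simple routing function. Everything here is illustrative: the host names, the country table and the default region are assumptions, not AWS identifiers.

```python
# Hypothetical shard hosts, one per geography.
SHARDS = {
    "eu": "db-eu.example.internal",
    "us": "db-us.example.internal",
}

# Hypothetical mapping from an account's country to its home geography.
REGION_BY_COUNTRY = {"DE": "eu", "FR": "eu", "GB": "eu", "US": "us", "CA": "us"}


def shard_for_account(country_code, default_region="us"):
    """Return the database host that should hold this account's data."""
    region = REGION_BY_COUNTRY.get(country_code.upper(), default_region)
    return SHARDS[region]
```

Replication between zones would sit on top of this: each shard host would have a replica elsewhere that takes over when its datacenter is unavailable.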


Step Three: Scale. Take advantage of the cloud services

Now that your application is entirely running on Amazon’s platform it’s time to take full advantage of the platform and make it scale.

Set up monitoring to keep up with what’s going on in your system. Amazon provides a service called CloudWatch that allows you to monitor your machines and applications.

Based on the monitoring metrics you should start using Amazon’s auto-scaling and load balancing capabilities to be able to consume and release computing resources according to demand.
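
The kind of rule such a setup encodes can be sketched in a few lines. This is plain Python, not the Auto Scaling API, and the thresholds and instance limits are arbitrary assumptions chosen for the example.

```python
def desired_instances(current, avg_cpu,
                      scale_up_at=70.0, scale_down_at=30.0,
                      min_instances=2, max_instances=10):
    """Decide how many instances to run given the average CPU utilization (%)."""
    if avg_cpu > scale_up_at:
        target = current + 1   # consume more resources under load
    elif avg_cpu < scale_down_at:
        target = current - 1   # release resources when demand drops
    else:
        target = current
    # Never go below the floor or above the ceiling.
    return max(min_instances, min(max_instances, target))
```

Keeping a gap between the two thresholds avoids flapping, where the system adds and removes an instance on every measurement.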

At this point you should also investigate reducing your dependency on relational databases (RDBMS) as much as possible (as it’s the most complex, and hardest to scale, component in the system) and try to move as much functionality as possible to S3 and SimpleDB.
S3 is suitable for storing large objects while SimpleDB is ideal for small stubs of data.

Important notes:

  • Amazon’s load balancer doesn’t support SSL. This can be a showstopper for some applications…
  • SimpleDB has a max row size limitation. If your data exceeds that limit you should consider using SimpleDB as a metadata store that references the full data stored on S3.
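
The metadata-plus-reference pattern from the last note can be sketched with two dictionaries standing in for SimpleDB and S3. The 1024-byte threshold, the key layout and the function names are assumptions for illustration; check the service’s documented limits before relying on a number.

```python
MAX_INLINE_BYTES = 1024  # assumed limit, for illustration only


def save_record(item_id, body, metadata_store, object_store,
                max_inline=MAX_INLINE_BYTES):
    """Store small bodies inline; store large ones in the object store."""
    data = body.encode("utf-8")
    if len(data) <= max_inline:
        metadata_store[item_id] = {"body": body}
    else:
        object_key = "records/%s" % item_id
        object_store[object_key] = data
        # The metadata item only keeps a pointer to the full data.
        metadata_store[item_id] = {"body_ref": object_key, "size": str(len(data))}


def load_record(item_id, metadata_store, object_store):
    """Read a record back, following the reference when there is one."""
    item = metadata_store[item_id]
    if "body" in item:
        return item["body"]
    return object_store[item["body_ref"]].decode("utf-8")
```

Readers never need to know where the body lives; only the write path decides, based on size.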



Facebook, Hadoop and Hive

Cloud Computing, Software Architecture, Software Development June 16th, 2009

Facebook has the second largest installation of Hadoop (a software platform that lets one easily write and run distributed applications that process vast amounts of data), Yahoo being the first. It is also the creator of Hive, a data warehouse infrastructure built on top of Hadoop.

The following two posts shed some more light on why Facebook chose the Hadoop\Hive path, how they’re doing it and the challenges they’re facing:

Facebook, Hadoop, and Hive on DBMS2 by Curt Monash discusses Facebook’s architecture and motivation.

Facebook decided in 2007 to move what was then a 15 terabyte big-DBMS-vendor data warehouse to Hadoop — augmented by Hive — rather than to an MPP data warehouse DBMS…

The daily pipeline took more than 24 hours to process. Although aware that its big-DBMS-vendor warehouse could probably be tuned much better, Facebook didn’t see that as a path to growing its warehouse more than 100-fold.

Hive – A Petabyte Scale Data Warehouse using Hadoop by Ashish Thusoo from the Data Infrastructure team at Facebook discusses Facebook’s Hive implementation in detail.

… using Hadoop was not easy for end users, specially for the ones who were not familiar with map/reduce. End users had to write map/reduce programs for simple tasks like getting raw counts or averages. Hadoop lacked the expressibility of popular query languages like SQL and as a result users ended up spending hours (if not days) to write programs for typical analysis. It was very clear to us that in order to really empower the company to analyze this data more productively, we had to improve the query capabilities of Hadoop. Bringing this data closer to users is what inspired us to build Hive. Our vision was to bring the familiar concepts of tables, columns, partitions and a subset of SQL to the unstructured world of Hadoop, while still maintaining the extensibility and flexibility that Hadoop enjoyed.
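
To illustrate the gap the quote describes, here is a sketch in plain Python (not actual Hadoop code) of what a raw count per key looks like in map/reduce style. The record layout, the `country` field and the `page_views` table in the comment are made up for the example.

```python
from collections import defaultdict


def map_phase(records):
    """Emit a (key, 1) pair for every input record."""
    for record in records:
        yield record["country"], 1


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def count_by_country(records):
    return reduce_phase(shuffle(map_phase(records)))


# The same question in SQL-like form is a single statement:
#   SELECT country, COUNT(*) FROM page_views GROUP BY country;
```

Three functions against one line of query text is the difference in effort that, according to the quote, motivated Hive.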


Yahoo Releases Its Own Hadoop Distribution

Cloud Computing, Software Industry June 11th, 2009

Yahoo! is releasing its own distribution of Hadoop:

Hadoop is a distributed file system and parallel execution environment that enables its users to process massive amounts of data.
In response to frequent requests from the Hadoop community, Yahoo! is opening up its investment in Hadoop quality engineering to benefit the larger ecosystem and to increase the pace of innovation around open and collaborative research and development.
The Yahoo! Distribution of Hadoop has been tested and deployed at Yahoo! on the largest Hadoop clusters in the world.

Hadoop is free Java software framework born out of an open-source implementation of Google’s published computing infrastructure and fostered within the Apache Software Foundation.
Yahoo! has been the primary developer and contributor to Apache’s Hadoop.
In 2006, Hadoop founder Doug Cutting joined Yahoo, which provided a dedicated team and resources, to lead the project of developing the open-source software and turn Hadoop into a system that ran at web scale. Today, Yahoo! is running the largest Hadoop cluster in the world, which includes more than 25,000 servers and provides the framework for many Yahoo properties including Yahoo Search, Yahoo Mail, and several content and ad services.

Yahoo says it’s opening up the source code to Hadoop to “benefit the larger ecosystem and to increase the pace of innovation around open and collaborative research and development.”
As Nigel Daley, Quality and Release Engineering Manager at Yahoo! Grid Technologies, summarizes:

Hadoop is helping us solve key science and research problems in hours or days instead of months. It provides us a platform to solve extreme problems requiring massive amounts of data processing. It underpins major revenue-generating systems. Opening our distribution enables a faster pace of innovation for the entire Hadoop ecosystem and broadens the use — and ultimately the quality — of this key platform across the industry.


Microsoft Updates Its Windows Live Services

Cloud Computing, Software Industry November 14th, 2008

(Cross posted from CloudAve)

Microsoft announced today its rollout plans for the 3rd wave of Windows Live services.

The goal of this latest release wave, according to company officials, is to simplify the use of the offered services and unify the user’s entire online experience into the Windows Live interface.
Microsoft is planning to roll out the new services, currently in beta, to the public within the next one to two months.

Windows Live Goes Social

One of Microsoft’s main emphases with the current wave of services is on social networking between the users of its services.

Microsoft finally figured out that Live Messenger, which with about 268 million users worldwide is by far the most popular instant messaging software in the world, is actually a social network. With the new release, your Live Messenger contacts are now your Friends and you can see aggregated information about their activities on the net.

Very much like Plaxo, FriendFeed, etc., Microsoft allows users to bring into their profile content they create in all sorts of services on the web (Live Services, Flickr, LinkedIn, blogs and RSS feeds, …) and share it with their friends and colleagues.
When users add photos, write reviews, and update their profiles directly on Live.com, that content will be put into their activity stream as well.
This activity stream is exposed in all sorts of ways throughout Microsoft’s services interface.

For example, Microsoft’s new Live Home portal shows the latest events in your social network. When emailing a friend or chatting on Messenger you’re also able to interact with that friend’s activity stream and more…

Not just for private consumers…

I’ve been told that all these new service updates will not skip Windows Live Domains used by universities and organizations to create a personalized version of Microsoft’s services.
If that’s really the case, having all these new social capabilities as part of its domain offering can be amazing for collaboration and communication inside the organization.
While Google doesn’t seem to care about its Google Apps for Your Domain customers, it’s good to see that Microsoft is going forward with Live Domains.
This latest update may just be the final straw I need to make the switch to Live Domains…

Where’s Live Mesh?!

It will be really interesting to see where Live Mesh comes into the picture with regard to all of these Live services.
Live Mesh should be the glue bridging between Microsoft’s online services and its offline applications and devices (S+S), allowing users to sync all their content – contacts, photos, events, favorites, etc. – across devices and services.
Unfortunately, there’s no clear answer for that…

During the launch we’ve only heard about Live Sync, which allows users to sync photos across computers. Some sources say it’s an incarnation of FolderShare, and in any case it doesn’t seem to be based on Live Mesh technology.
With Live Mesh being one of Microsoft’s core platform offerings, it’s really hard to understand why we need to have Live Sync too…

Other notes…

  • All the services are released simultaneously in all countries and in 48 (!) languages.
  • Windows Live SkyDrive size limit has changed from 5GB to 25GB.
  • Windows Live Hotmail looks and feels a lot better to use.
  • I’ve uploaded all the screenshots of the new services to my SkyDrive:


Office Web Applications

Cloud Computing, Software Industry October 30th, 2008

(Originally posted at Cloud Avenue)

 

This year’s Microsoft Professional Developers Conference is full of announcements and surprises. The next big announcement besides Windows Azure (and Windows 7?) is the new “Office Web Applications” live service. The Office team will be delivering lightweight, browser-based versions of its most popular Office applications: Word, Excel, PowerPoint, and OneNote.


The applications will be offered in both a simple HTML/AJAX version and a rich-client Silverlight version.
Office Web Applications are not planned to replace Microsoft’s traditional desktop offering but rather to complement it, together with Mobile Office for mobile devices, allowing users to seamlessly work on their documents across all environments.

Providing such a rich collaboration environment isn’t a simple task, as you can see in the following interview of Antoine Leblond, Senior VP of Office Productivity Apps and Chris Bryant, General Program Manager:

 

Although it’s not meant to replace its desktop Office offering, one of Microsoft’s biggest cash cows, one has got to wonder about the risk of these new services cannibalizing their big desktop brother’s profits. Windows and Office, which are Microsoft’s core business, are likely to stay its core moneymakers for at least the next 2-3 years, maybe even longer.
This move clearly shows that Microsoft is starting to think beyond that, and along with its other platform announcements (Azure, Live Mesh…) we can clearly see a trend away from desktop software to rich clients installed from the web…

Office Web Applications will be released to a limited set of partners and customers at the end of this year. The release date will closely align with Office 14 and Windows 7 which will be sometime in late 2009 or early 2010.
Microsoft plans to make Office Web Applications available as a service through its Live platform, supporting both ad-funded and paid-subscription models.
Business users that require an on-premise deployment will be able to get one through SharePoint via its traditional volume licensing program.


Microsoft calls OpenID a De Facto Login Standard

Cloud Computing, Software Industry October 30th, 2008

(Originally posted at Cloud Avenue)

Microsoft’s Windows Live ID team just announced their support for OpenID, calling it a “de facto standard Web protocol for user authentication.”

Beginning today, Windows Live™ ID is publicly committing to support the OpenID digital identity framework with the announcement of the public availability of a Community Technology Preview (CTP) of the Windows Live ID OpenID Provider.

You will soon be able to use your Windows Live ID account to sign in to any OpenID Web site!

What does it mean for users?

OpenID allows users to maintain their identity information (name, e-mail, address, etc.) with a single provider and use that information to register and log in to any website that supports OpenID. This relieves the user from having to fill out registration forms and maintain multiple user names, passwords and profiles on different sites, and provides a simplified online experience while increasing security.

Over 400 million LiveID users will soon be able to use their LiveID to do just that – log in and provide identity information to any site supporting OpenID without the hassle of filling out registration forms and saving user\password information, and with the user experience common to all OpenID sites (or maybe even their familiar LiveID user interface?).

The wide adoption of OpenID led by Yahoo and Microsoft provides the required push for site owners to support OpenID providing the same simple and familiar login interface everywhere…

What does it mean for web developers?

With a simple integration effort that shouldn’t take more than a couple of minutes, site owners can relieve themselves from taking care of the authentication and registration processes while providing their users with a simple, familiar interface for signing up and using their services.
OpenID provides an easy and secure mechanism for authenticating and registering users, and with additional online services (like JanRain’s RPX) site owners can hand over the entire handling of their user information to the cloud – cheaper, faster, more secure.

For now, the LiveID team is testing their system’s OpenID Provider, which is at a CTP (Community Technology Preview) stage. Widespread support is planned for “sometime in 2009”.

[Update: Screencast Overview]


Microsoft’s Next Killer OS is… SharePoint?

Cloud Computing, Software Industry October 9th, 2008

Reading Mary Jo Foley’s Microsoft 2.0 it suddenly struck me: Could Microsoft’s next killer OS be SharePoint?

Instead of being quite so blatant, Microsoft has taken a quieter back route to achieving the same ends via two related technologies:

  1. Baking SharePoint reliance into more and more of its products
  2. Requiring users to buy pricey client-access licenses (CALs) in order to use Microsoft’s servers

Microsoft has been basing a growing number of its products on SharePoint technologies to provide basic common services like storage, pub/sub, identity/security infrastructure, communications and collaboration functionalities.

With SharePoint’s BDC catalog and search server it is apparent that Microsoft is targeting SharePoint to serve as an integration layer on top of services and LOB applications in the organization.

With “Oslo” it’s much more…

In 2007, the company began to roll out Microsoft-hosted versions of three of its servers—Exchange, SharePoint Server, and Communications Server—with more planned. The next stage is a set of online services for application developers that offer OS-like functions, such as application-based data storage, and data synchronization among multiple connected devices.

Microsoft’s “Oslo” vision and roadmap to “Simplify SOA, Bridge Software Plus Services, and Take Composite Applications Mainstream” is largely built with SharePoint backing its platform.

As Microsoft expands its reach into cloud computing it’ll have to adjust its SharePoint infrastructure services, which its other server products rely on, to support this kind of environment.

We can already see signs of this transformation in Microsoft’s announcement last year that it is switching SharePoint to claims-based security.
Performing authentication and authorization using claims allows SharePoint to support federated identities across different services and applications – from integration with common identity services like Active Directory, LiveID and OpenID to service\application specific identity models.
This means SharePoint is no longer limited to using the Active Directory on premise but can integrate with remote external authentication providers enabling SharePoint hosted scenarios.

Office, OBAs and LOB Integration

On February 27th last year Microsoft shared some details on LOBi (Line of Business Interoperability), the next version of its SharePoint BDC, postponing it to be released as part of its Office 14 technology stack:

“Consequently, LOBi technologies will now be delivered as a set of capabilities within the Office SharePoint Server as part of the next major set of Microsoft Office product releases (the Office 14 wave).”

LOBi, now known as OBAF (OBA Framework), allows developers to integrate LOB applications (SAP, Oracle, etc.) into SharePoint.

It provides developers with all the necessary services required to develop a composite LOB web application on top of existing systems while also supporting offline synchronization to Office clients (S+S strategy).

With the extensive support for application modeling planned as part of the “Oslo” roadmap, Microsoft is positioning SharePoint as an integration platform that will run composite applications.

Dynamics and its other server products will probably leverage this platform, as well as its partners and ISVs.
An example of such an application is Duet 3.0, the next version of the SAP integration productivity product jointly developed with SAP, which will be developed on top of the new SharePoint technology stack.

Conclusion

By providing developers with a rich set of platform services – security services, data indexing, search, synchronization services and offline capabilities – coupled with a development environment (Visual Studio 10) and modeling support, Microsoft is trying to provide all the essential capabilities required to build and run applications in a hosted environment – the beginning of an OS for cloud applications?


Originally published on Cloud Avenue.


Amazon S3 Storing 29 Billion Objects

Cloud Computing, Software Industry October 9th, 2008

(Originally posted on Cloud Avenue)

Jeff Barr from Amazon Web Services reports that Amazon’s Simple Storage Service (S3) is now storing more than 29 billion objects, an increase of 7 billion from the previous quarter:

As one of the S3 engineers told me last week, that’s over 4 objects for every person now on Earth!

Our customers are keeping S3 pretty busy too. To give you an example of what this means in practice, the peak S3 usage for October 1st was over 70,000 storage, retrieval, and deletion requests per second.

Amazon is also lowering the prices on S3 storage, with a new four-tier pricing plan that takes effect on Nov. 1st.  Customers storing more than 500 terabytes will get a rate of 12 cents per gigabyte.

With such a huge amount of data, low prices, and abundance of success stories, it really seems like Amazon has got a revolutionary service on its hands…


Cloud Envy

Cloud Computing, Software Industry September 29th, 2008

Cloud Computing is the latest, hottest new buzzword in today’s information technology world. However, and much like other buzzwords such as Web x.0, it seems to be losing whatever meaning it once had as an increasing number of companies, not wanting to miss out on the latest hype, are starting to use it for their product’s PR campaigns…

Read the complete post at Cloud Avenue.
