Facebook, Hadoop and Hive

Cloud Computing, Software Architecture, Software Development June 16th, 2009

Facebook has the second largest installation of Hadoop (a software platform that lets one easily write and run distributed applications that process vast amounts of data), Yahoo being the first. It is also the creator of Hive, a data warehouse infrastructure built on top of Hadoop.

The following two posts shed some more light on why Facebook chose the Hadoop\Hive path, how they’re doing it and the challenges they’re facing:

Facebook, Hadoop, and Hive on DBMS2 by Curt Monash discusses Facebook’s architecture and motivation.

Facebook decided in 2007 to move what was then a 15 terabyte big-DBMS-vendor data warehouse to Hadoop — augmented by Hive — rather than to an MPP data warehouse DBMS…

The daily pipeline took more than 24 hours to process. Although aware that its big-DBMS-vendor warehouse could probably be tuned much better, Facebook didn’t see that as a path to growing its warehouse more than 100-fold.

Hive – A Petabyte Scale Data Warehouse using Hadoop by Ashish Thusoo from the Data Infrastructure team at Facebook discusses Facebook’s Hive implementation in detail.

… using Hadoop was not easy for end users, specially for the ones who were not familiar with map/reduce. End users had to write map/reduce programs for simple tasks like getting raw counts or averages. Hadoop lacked the expressibility of popular query languages like SQL and as a result users ended up spending hours (if not days) to write programs for typical analysis. It was very clear to us that in order to really empower the company to analyze this data more productively, we had to improve the query capabilities of Hadoop. Bringing this data closer to users is what inspired us to build Hive. Our vision was to bring the familiar concepts of tables, columns, partitions and a subset of SQL to the unstructured world of Hadoop, while still maintaining the extensibility and flexibility that Hadoop enjoyed.


Introduction to MapReduce for .NET Developers

.NET, Software Development May 6th, 2009

The basic model for MapReduce derives from the map and reduce concept in functional languages like Lisp.
In Lisp, a map takes as input a function and a sequence of values and applies the function to each value in the sequence.
A reduce takes as input a sequence of elements and combines all the elements using a binary operation (for example, it can use “+” to sum all the elements in the sequence).
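For readers more at home in C# than in Lisp, the same two ideas can be sketched with LINQ, where Select plays the role of map and Aggregate plays the role of reduce. This is only an illustration of the concept; the class and method names (FunctionalBasics, SquareAll, Sum) are made up for this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FunctionalBasics
{
    // "map": apply a function (here, squaring) to every value in the sequence.
    public static IEnumerable<int> SquareAll(IEnumerable<int> values)
    {
        return values.Select(v => v * v);
    }

    // "reduce": combine all elements using a binary operation ("+" here),
    // starting from the seed value 0.
    public static int Sum(IEnumerable<int> values)
    {
        return values.Aggregate(0, (acc, v) => acc + v);
    }
}
```

For example, FunctionalBasics.Sum(FunctionalBasics.SquareAll(new[] { 1, 2, 3 })) maps the input to 1, 4, 9 and then reduces it to 14.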

MapReduce, inspired by these concepts, was developed as a method for writing processing algorithms for large amounts of raw data. The amount of data is so large that it can’t be stored on a single machine and must be distributed across many machines in order to be processed in a reasonable time.
In systems with such data distribution, the traditional central processing algorithms are useless as just getting the data to the centralized CPU running the algorithm implies huge network costs and months (!) spent on transferring data from the distributed machines.
Therefore, processing such massive scales of distributed data implies the need for parallel computing allowing us to run the required computation “close” to where the data is located.
MapReduce is an abstraction that allows engineers to write such processing algorithms in a way that is easy to parallelize while hiding the complexities of parallelization, data distribution, fault tolerance etc.

This value proposition for MapReduce is outlined in a Google research paper on the topic:

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google’s clusters every day.

The MapReduce Programming Model

As explained earlier, the purpose of MapReduce is to abstract parallel algorithms into map and reduce functions that can then be executed on a large-scale distributed system.
In order to understand this concept better, let’s look at a concrete MapReduce example – consider the problem of counting the number of occurrences of each word in a large collection of documents:

map(String key, String value):
// key: document name
// value: document contents
for each word w in value:
  EmitIntermediate(w, "1"); 

reduce(String key, Iterator values):
// key: a word
// values: a list of counts
int result = 0;
for each v in values:
  result += ParseInt(v);
Emit(AsString(result));

The map function goes over the document text and emits each word with an associated value of “1”.

The reduce function sums together all the values for each word, producing the number of occurrences for that word as a result.

First we go through the mapping phase where we go over the input data and create intermediate values as follows:

  • Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as <key,value> pairs. For example: <filename, file content>
  • The map function produces one or more intermediate values, along with an output key, from the input.

After the mapping phase is over, we go through the reduce phase to process the intermediate values:

  • After the map phase is over, all the intermediate values for a given output key are combined together into a list and fed to the reduce function.
  • The reduce function combines those intermediate values into one or more final values for that same output key

Notice that both the map and the reduce functions run on independent sets of input data. Each run of the map function processes its own data source and each run of the reduce function processes the values of a different intermediate key.

Therefore both phases can be parallelized with the only bottleneck being the fact that the map phase has to finish for the reduce phase to start.
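To make the parallelization point a bit more tangible, here is a rough single-machine sketch using PLINQ (assuming .NET 4 or later, where AsParallel is available). It spreads the work across threads rather than across machines, so it only illustrates why independent map and reduce runs parallelize well; it is not a distributed implementation, and ParallelWordCount is a name invented for this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ParallelWordCount
{
    public static Dictionary<string, int> Count(IEnumerable<string> documents)
    {
        return documents
            .AsParallel()
            // Map phase: every document is split into words independently of the others.
            .SelectMany(doc => doc.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
            // Grouping by key is the barrier: it needs the mapped values
            // before any group can be handed to the reduce step.
            .GroupBy(word => word)
            // Reduce phase: each word's group is counted independently of the others.
            .ToDictionary(g => g.Key, g => g.Count());
    }
}
```

Calling ParallelWordCount.Count(new[] { "a b a", "b c" }) produces a dictionary with a → 2, b → 2 and c → 1.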

The underlying system running these methods takes care of:

  • Initializing a set of workers that can run tasks – map or reduce functions.
  • Taking the input data (in our case, lots of document filenames) and sending it to the workers to map.
  • Streaming the values emitted by the map function to the worker (or workers) doing the reduce. Note that we don’t have to wait for a map run to finish going over the entire file before sending its emitted values to the reducer, so the system can prepare the data for the reducer while the map function is still running
    (in Hadoop – sending the map values to the reducer node and handling the grouping by key).
  • Handling errors – supporting a reliable, fault-tolerant process, as workers may fail, the network can crash and prevent workers from communicating results, etc.
  • Providing status and monitoring tools.

A Naive Implementation in C#

Let’s see how we can build a naive MapReduce implementation in C#.

First, we define a generic class to manage our Map-Reduce process:

public class NaiveMapReduceProgram<K1, V1, K2, V2, V3>

The generic types are used the following way:

  • (K1, V1) – key-value types for the input data
  • (K2, V2) – key-value types for the intermediate results (results of our Map function)
  • V3 – The type of the result for the entire Map-Reduce process

Next, we’ll define the delegates of our Map and Reduce functions:

public delegate IEnumerable<KeyValuePair<K2, V2>>   MapFunction(K1 key, V1 value);
public delegate IEnumerable<V3>                     ReduceFunction(K2 key, IEnumerable<V2> values);
private MapFunction _map;
private ReduceFunction _reduce;
public NaiveMapReduceProgram(MapFunction mapFunction, ReduceFunction reduceFunction)
{
    _map = mapFunction;
    _reduce = reduceFunction;
}

(Yes, I realize I could use .NET’s Func<T1,T2,TResult> instead but that would just result in horribly long ugly code…)

Now for the actual program execution. The execution flow is as follows: we take the input values, pass them through the map function to get intermediate values, group those values by key, and pass each group to the reduce function to get the result values.

So first, let’s look at the mapping step:

private IEnumerable<KeyValuePair<K2, V2>> Map(IEnumerable<KeyValuePair<K1, V1>> input)
{
    var q = from pair in input
            from mapped in _map(pair.Key, pair.Value)
            select mapped;

    return q;
}

Now that we have the mapped intermediate values, we want to reduce them. The Reduce function expects a key and all of its mapped values as input, so to do that efficiently we first group the intermediate values by key and then call the Reduce function once for each key.

The output of this process is a V3 value for each of the intermediate K2 keys:

private IEnumerable<KeyValuePair<K2, V3>> Reduce(IEnumerable<KeyValuePair<K2, V2>> intermediateValues)
{
    // First, group intermediate values by key
    var groups = from pair in intermediateValues
                 group pair.Value by pair.Key into g
                 select g;

    // Reduce on each group
    var reduced = from g in groups
                  let k2 = g.Key
                  from reducedValue in _reduce(k2, g)
                  select new KeyValuePair<K2, V3>(k2, reducedValue);

    return reduced;
}

Now that we have the code for both steps, the execution itself is simply defined as Reduce(Map(input)):

public IEnumerable<KeyValuePair<K2, V3>> Execute(IEnumerable<KeyValuePair<K1, V1>> input)
{
    return Reduce(Map(input));
}

The full source code and tests can be downloaded from here:

Map-Reduce Word Counting Sample – Revisited

Let’s go back to the word-counting pseudocode and write it in C#.

The following Map function gets a key and a text value and emits a <word, 1> pair for each word in the text:

public IList<KeyValuePair<string, int>> MapFromMem(string key, string value)
{
    List<KeyValuePair<string, int>> result = new List<KeyValuePair<string, int>>();
    foreach (var word in value.Split(' '))
    {
        result.Add(new KeyValuePair<string, int>(word, 1));
    }
    return result;
}

Having emitted a <word, 1> pair for each word in each input source, we can group the results by word and then our Reduce function can sum the values (which are all 1 in this case) for each word:

public IEnumerable<int> Reduce(string key, IEnumerable<int> values)
{
    int sum = 0;
    foreach (int value in values)
    {
        sum += value;
    }

    return new int[1] { sum };
}

Our program code looks like this:

NaiveMapReduceProgram<string, string, string, int, int> master = new NaiveMapReduceProgram<string, string, string, int, int>(MapFromMem, Reduce);
var result = master.Execute(inputData).ToDictionary(key => key.Key, v => v.Value);

The result dictionary contains <word, number-of-occurrences> pairs.

Other Examples

Distributed LINQ Queries. One of the POCs I’m working on, using the above naive, LINQ-based implementation, is running a distributed LINQ query. Imagine you have a system where raw data is distributed across several SQL Servers. We can have our map function run a LINQ-to-SQL query on multiple DataContexts in parallel (the value input for the map function – V1 – can be a DataContext) and then reduce it to a single result set. This is probably a naive\simplified implementation of what the guys at Microsoft’s Dryad team are doing.

Count URL Visits. Suppose you have several web servers and you want to produce the number of visits for each page on your site. You can do this pretty much the same way the word-counting example works. The map function parses a log file and produces a <URL, 1> intermediate value for each visit. The reduce function then sums the values for each URL and emits <URL, number of visits>.
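A minimal sketch of the URL counting idea might look like the following. The log line format ("<client-ip> <url>", one visit per line) is purely an assumption made for this example, as are the UrlVisitCount class and its Run driver, which stands in for the framework by mapping every log, grouping by URL and reducing each group.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class UrlVisitCount
{
    // Map: emit a <URL, 1> pair for every well-formed log line.
    // Assumed (hypothetical) line format: "<client-ip> <url>".
    public static IEnumerable<KeyValuePair<string, int>> Map(string logName, string logContents)
    {
        foreach (var line in logContents.Split(new[] { '\n' }, StringSplitOptions.RemoveEmptyEntries))
        {
            var fields = line.Trim().Split(' ');
            if (fields.Length == 2)
                yield return new KeyValuePair<string, int>(fields[1], 1);
        }
    }

    // Reduce: sum the 1s emitted for a single URL.
    public static int Reduce(string url, IEnumerable<int> counts)
    {
        return counts.Sum();
    }

    // Driver standing in for the framework: map, group by URL, reduce.
    public static Dictionary<string, int> Run(IDictionary<string, string> logs)
    {
        return logs
            .SelectMany(log => Map(log.Key, log.Value))
            .GroupBy(pair => pair.Key, pair => pair.Value)
            .ToDictionary(g => g.Key, g => Reduce(g.Key, g));
    }
}
```

Lines that do not have exactly two fields are skipped silently here; a real log parser would need to deal with the actual format of your web server's logs.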

Distributed Grep. You can run a grep search on a large number of files by having the map function emit a line if it matches a given pattern. The reduce function in this case is just an identity function that copies the supplied intermediate data to the output.
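The grep case can be sketched along the same lines. In this illustration the map function keys each matching line by its file name, which is one possible choice and not something the description above prescribes; GrepSketch and its Run driver are hypothetical names, and the driver again just stands in for the framework.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class GrepSketch
{
    // Map: emit <file name, line> for every line that matches the pattern.
    public static IEnumerable<KeyValuePair<string, string>> Map(string fileName, string contents, Regex pattern)
    {
        return contents
            .Split('\n')
            .Select(line => line.TrimEnd('\r'))
            .Where(line => pattern.IsMatch(line))
            .Select(line => new KeyValuePair<string, string>(fileName, line));
    }

    // Reduce: identity, the matching lines are passed through unchanged.
    public static IEnumerable<string> Reduce(string fileName, IEnumerable<string> lines)
    {
        return lines;
    }

    // Driver standing in for the framework: map, group by file name, reduce.
    public static Dictionary<string, List<string>> Run(IDictionary<string, string> files, string pattern)
    {
        var regex = new Regex(pattern);
        return files
            .SelectMany(file => Map(file.Key, file.Value, regex))
            .GroupBy(pair => pair.Key, pair => pair.Value)
            .ToDictionary(g => g.Key, g => Reduce(g.Key, g).ToList());
    }
}
```

Files without any matching line simply do not appear in the result, since the map function emits nothing for them.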

Map-Reduce in the Real World

The real complexity and sophistication in MapReduce is in the underlying system that takes care of running and managing the execution of MapReduce jobs. Real world MapReduce implementations, like Google’s system, Hadoop or Dryad, have to go beyond the naive implementation shown here and take care of things like resource monitoring, reliability and fault tolerance (for example, handling cases where nodes running map\reduce jobs crash or go offline due to network problems).

The following resources are worth checking out:


Developing a Robust Data Driven UI Using WPF – An Overdue Summary (and full source code)

.NET, Software Development, WPF April 15th, 2009

I wrote the stocky application more than a year ago as a research project aimed at proving that using WPF we can separate presentation metadata (XAML) from program logic. The goal was to provide the Duet team at SAP with a document reference sample for using M-V-VM to achieve this separation.

I started documenting the proof-of-concept in a series of posts but unfortunately after leaving SAP my interests (and work) shifted away from WPF and I didn’t find the time to finish the series.

I’ve received numerous requests to release the source code but I couldn’t do so because it was part of a larger infrastructure code I wrote at SAP which basically adds a lot of noise to the sample (and probably adds legal issues for me sharing it).
Anyway, I took some time off this afternoon to re-write the sample independently so that I could share it:

This, I guess, is the long overdue ending for the series:

  • Introduction – introduces the concept of M-V-VM and the reasoning behind it.
  • The DataModel – describes how to write the Model part of our application.
  • Stock DataModel Sample – provides a concrete implementation of a Stock model and its view.

However, if you’re interested in M-V-VM in WPF, there are numerous topics that I didn’t get to cover and that are definitely worth checking out:

Unit Testing

As I said in the introduction post, one of the most important benefits of separating the logic code from the presentation (XAML) is that it’s straightforward to unit test. In fact, my next post following the Stock DataModel Sample was going to be about unit testing – specifically, how to test the DataModel and its provider which, because of the use of threading, is a bit tricky.

This post is actually 99% done in the comments of the unit test code that’s in DefaultStockQuoteProviderTest.cs in the  provided source code. So do yourself a favor and go over the code. It’s not long and very well documented…

Using Lambda Expression for DataBinding

Data-binding is pretty much at the heart of the M-V-VM concept and it makes us write Value Converters which is pretty tedious and annoying.
Wouldn’t it be great if we could replace writing lots of IValueConverter classes like this:

<TextBlock Foreground="{Binding Change, Converter={StaticResource StockForegroundConverter}}" … />

[ValueConversion(typeof(double), typeof(Brush))]
public class StockChangeToBrushConverter : IValueConverter
{
    public object Convert(object value, Type targetType, object parameter, CultureInfo culture)
    {
        double change = (double)value;
        if (change == 0) return Brushes.Black;
        return (change < 0) ? Brushes.DarkRed : Brushes.Green;
    }

    public object ConvertBack(object value, Type targetType, object parameter, CultureInfo culture)
    {
        return double.NaN;
    }
}

To just the following XAML statement that embeds the conversion logic:

<TextBlock Foreground="{Binding Change,
    Converter={ change => if (change == 0) return Brushes.Black; return (change < 0) ? Brushes.DarkRed : Brushes.Green; }}" … />

M. Orçun Topdağı wrote an excellent series on using Lambda Expressions for data-binding in WPF to achieve just that:

Reference Applications and Guidance

I haven’t seen a lot of sample WPF LOB reference applications out there but here are some interesting links for further learning:


ASP.NET MVC RSS Feed Action Result

Software Development January 11th, 2009

Guy wrote a post about rendering an RSS feed on ASP.NET MVC using custom feed model classes and a view that renders the feed XML.

There’s a better (shorter) way of achieving the same result by leveraging the syndication mechanism built into .NET’s WCF.
WCF exposes the SyndicationFeed, SyndicationItem and SyndicationPerson classes, which represent our data model.
In order to render this model WCF also exposes the Atom10FeedFormatter and Rss20FeedFormatter classes that can render the feed to a stream, so all we need to do is integrate that into the ASP.NET MVC pipeline.

The ASP.NET MVC framework introduces a concept of returning an ActionResult instance as the result of Controller Actions.
This ActionResult object indicates the result from an action (a view to render, a URL to redirect to, another action/route to execute, etc).

ASP.NET MVC ships with several Action Results:

  • ContentResult – Simply writes the returned data to the response.
  • EmptyResult – Returns an empty response.
  • HttpUnauthorizedResult – Returns Http 401 code for non authorized access.
  • JsonResult – Serializes the response to Json.
  • RedirectResult – Redirects to another Url.
  • RedirectToRouteResult – Redirects to another controller action.
  • ViewResultBase (abstract) – Renders an HTML content as a result.
    • PartialViewResult (inherits from ViewResultBase) – Renders a partial HTML response.
  • BinaryResult (abstract) – Returns a binary response.
    • BinaryStreamResult (inherits from BinaryResult) – Writes a binary stream as a result.

So basically, to return a feed result all we need to do is define our own ActionResult implementation by deriving from ActionResult:

public abstract class ActionResult
{
    protected ActionResult();

    public abstract void ExecuteResult(ControllerContext context);
}

We then override the ExecuteResult method and write our data model to the output HTTP stream using Rss20FeedFormatter:

public class RssActionResult : ActionResult
{
    public SyndicationFeed Feed { get; set; }

    public override void ExecuteResult(ControllerContext context)
    {
        context.HttpContext.Response.ContentType = "application/rss+xml";

        Rss20FeedFormatter rssFormatter = new Rss20FeedFormatter(Feed);
        using (XmlWriter writer = XmlWriter.Create(context.HttpContext.Response.Output))
        {
            rssFormatter.WriteTo(writer);
        }
    }
}

Now we can simply return RssActionResult as a result of our controller’s action.

Here’s a simple example:

public ActionResult Feed()
{
    SyndicationFeed feed =
        new SyndicationFeed("Test Feed",
                            "This is a test feed",
                            new Uri("http://Contoso/testfeed"),
                            "TestFeedID",
                            DateTime.Now);

    SyndicationItem item =
        new SyndicationItem("Test Item",
                            "This is the content for Test Item",
                            new Uri("http://Contoso/ItemOne"),
                            "TestItemID",
                            DateTime.Now);

    List<SyndicationItem> items = new List<SyndicationItem>();
    items.Add(item);
    feed.Items = items;

    return new RssActionResult() { Feed = feed };
}

… and that’s it!

A more elegant solution that leverages existing framework capabilities.


99 Ways to Become a Better Developer

Software Development, Tips December 5th, 2008

I encountered this post in my weekend reading. 91 Surefire Ways to Become an Even Greater Developer contains a comprehensive guide linking to all sorts of blog posts providing insights on improving your skills as a developer.

While the list is very long and sometimes debatable it does have some interesting pointers. If you do nothing else, delve into item #8: Learn Programming by Not Programming referring to the following post by Jeff Atwood.

The topic in question is why some developers outperform their peers regardless of their accumulated experience:

But the dirty little secret of the software development industry is that this is also true even for people who can program: there’s a vast divide between good developers and mediocre developers.

A mediocre developer can program his or her heart out for four years, but that won’t magically transform them into a good developer. And the good developers always seem to have a natural knack for the stuff from the very beginning.

The answer lies in the following quotes, taken from Bill Gates’ remarks:

“The older I get, the more I believe that the only way to become a better programmer is by not programming. You have to come up for air, put down the compiler for a moment, and take stock of what you’re really doing. Code is important, but it’s a small part of the overall process.”

“To truly become a better programmer, you have to cultivate passion for everything else that goes on around the programming.”

“The nature of these jobs is not just closing your door and doing coding, and it’s easy to get that fact out. The greatest missing skill is somebody who’s both good at understanding the engineering and who has good relationships with the hard-core engineers, and bridges that to working with the customers and the marketing and things like that.”

Eric Sink makes the distinction even clearer in You Need Developers, Not Programmers drawing a distinction between Programmers who are only excited about writing code and basically only care about doing that, and Developers who contribute to the software product in many ways.

The Great Programmer\Hacker Stereotype

You all know that guy (hell, most of us were that guy when we just started out, I know I was) – he has great technical skills, likes writing code and can spend hours within his IDE writing code that’ll make most of us scratch our head. Yet, he views the world only in one dimension – code. Business? that’s for the managers to figure out. Sales\Marketing? annoyances for others to take care of. Documentation? but the code is so obvious…Builds? Deployment? Configuration? …

Passion for code is a great quality. But as a specialist it’s all too easy to dig yourself deeper and deeper into a skill you’ve already proven yourself to be capable at, when you’d be better off using the time to cultivate other skills that are part of the process of making software – rendering yourself obsolete over time…

The great hacker is a one trick pony – he writes great code but that’s about it…
Most of these guys end up working alone as consultants or freelancers where they don’t have to care about that other stuff, or they end up as programmers at some big firms where there’s more room for specialists doing specific jobs (Architects do architecture, PMs do project management, Programmers code…).
On the other hand, those who truly like making software, open up to the other aspects of software development.
When that change in mindset happens, that’s when you can truly grow exponentially…

So what do I do?

Ok, I guess you got the point… But how do you get started?  Here are my own 5 cents on the topic…

Read, Read and Read Some More…

We’re in an industry that is moving forward at a fast pace. Technology becomes obsolete every year and a half or so and as developers we have to constantly struggle to keep up. Books are not only great to help you keep up but also to expand your knowledge to other fields.
There are plenty of interesting books and blogs about, well, pretty much everything.
Here are some recommendations to get you started:

The Inmates Are Running the Asylum

The Pragmatic Programmer

Made to Stick

Crossing the Chasm

The Innovator’s Dilemma

Eric Sink on the Business of Software (most of it is available online here)

Oh and one word about programming books: the best ones are timeless, transcending choice of language, IDE and platform.
I try to stay away from them thick, heavy, language\platform specific references – most of them go out of date after a year or so anyway and most of the information there could be easily obtained elsewhere (online – Google, the product’s docs, blogs…)

Most big programming books are just a waste of your time (and money…)

Contribute to an Open Source Project

Back in the days of Delphi I was involved in Project JEDI dedicated to exposing different APIs (especially the Win32 API) to Delphi developers.
I learned a lot working with the JEDI code base, documentation, samples and other team members.
Later, when it was time to get drafted to the Israeli Army (we all have to do it at 18 here), the experience, credit and code samples helped me land a (very) exclusive position as a programmer. Who knows where I’d be today if I didn’t qualify and had to serve as a combatant…

Contributing to an open source project is a great way to gain experience, learn and get better.
There are no job interviews to pass, degree requirements or commitment to working hours or schedule required – you can just join in and start submitting patches or contribute in ways other than code (submit bugs, docs, support, …).

You can learn a lot just from studying the code and interacting with your peers…

Contributing to open source shows dedication and passion – it’s a walking, talking resume.

Get a mentor

Find yourself a mentor or mentors who can teach you about different aspects of the business. I’ve had several at SAP and talking with them proved to be an invaluable asset (If you’re reading, thanks! :) )

It doesn’t have to be official mentoring which is part of the person’s goals or job description. Many of your peers are experts in their field and they’ll be happy to show you around if you just show some interest…

Become a Mentor

Great developers are eager to learn… and teach. Can you pass your passion and knowledge to others?

You can also…

  • Open a blog about your experience, opinions, etc.
  • Start answering questions at stackoverflow.com and collect achievements

Land an Internship

Try getting an internship in a different role. When I was in SAP they had a special program allowing employees to apply for a ~6 month position somewhere within the company. The reason behind it was to get employees familiar with different aspects of the company. Maybe product management, marketing or sales is not really your first choice of profession, but why not try it for a couple of months without the risk of going through a career change? How cool is that? I’m sure many large corporations have something similar and even if not, it can’t hurt if you come up with such an interesting offer to your boss…

Own a Product Area

Get ownership of some part of the product your team is working on. Whether it’s a specific component or a vertical (like Security), you should be in charge of getting it done – from getting the definition done with the product\sales\business team, through UX, development, QA, etc…
There’s nothing better than learning about the process of software development through experiencing the entire cycle…

Innovate

Start something new. When working on Duet we had many issues getting the thing deployed. So I made a tool for (myself mainly) our QA and RIG (regional implementation group – the guys who work with customers) to help diagnose problems. This later became the official Duet Support Tool and got its own dedicated development time. Is your product or development environment perfect? I’m sure not… find a need and fill the gap…

Why? If by owning a product area you learned about the entire development cycle, here you’ll learn about defining and “selling” to the team…

Bonus Reading…

Another link worth visiting is the one about the Metrosexual Developer. Funny and true… ;)


The Dark Side of LINQ

.NET August 5th, 2008

I’ve been having mixed feelings for quite some time now regarding LINQ.
Sure, it can make working with data sources a lot easier and it can definitely save a lot of code…
But what happens when the following C# foreach statement

List<KeyValuePair<string, string>> resultList = new List<KeyValuePair<string, string>>();
string[] paramsArray = parameters.Split(new char[] { '&' }, StringSplitOptions.RemoveEmptyEntries);
foreach (string p in paramsArray)
{
    int index = p.IndexOf('=');
    if (index > 0)
    {
        string key = p.Substring(0, index);
        string value = p.Substring(index + 1);
        resultList.Add(new KeyValuePair<string, string>(key, value));
    }
}

IEnumerable<KeyValuePair<string, string>> result =
    resultList.Distinct((p1, p2) => p1.Key == p2.Key);

Turns into this query:

var distinctPairs = (from keyValuePair in parameters.Split(new char[] { '&' }, StringSplitOptions.RemoveEmptyEntries)
                     let index = keyValuePair.IndexOf('=')
                     where index != -1
                     let key = keyValuePair.Substring(0, index)
                     where !string.IsNullOrEmpty(key)
                     let valueText = keyValuePair.Substring(index + 1)
                     select new { Key = key, ValueText = valueText })
                             .Distinct( (p1, p2) => (p1.Key == p2.Key) )
                             .ToArray();

I don’t know about you but I find the first version a lot more approachable, readable and quicker to understand. The same code in LINQ is not shorter and simply looks evil.

LINQ is like the Force… It can be used to write wonderful code that is simple and functional, but it also has the potential of producing cryptic code that’s hard to maintain.

Use it wisely and don’t be tempted by its dark side…
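A side note on the Distinct call used in both versions: the built-in Enumerable.Distinct only has a parameterless overload and one that takes an IEqualityComparer<T>, so a Distinct that accepts a lambda has to come from a custom extension method that is not shown here. Below is a minimal sketch of one way such a helper could be written. It takes a key selector instead of a two-argument comparison, and the name DistinctByKey is invented for this example; it is not the helper actually used above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DistinctExtensions
{
    // Hypothetical helper: yields the first element seen for each key
    // and skips later elements whose key has already been seen.
    public static IEnumerable<T> DistinctByKey<T, TKey>(
        this IEnumerable<T> source, Func<T, TKey> keySelector)
    {
        var seen = new HashSet<TKey>();
        foreach (var item in source)
        {
            // HashSet.Add returns false when the key is already present.
            if (seen.Add(keySelector(item)))
                yield return item;
        }
    }
}
```

With a helper like this, the final step of either version could be written as resultList.DistinctByKey(p => p.Key), keeping the first pair for each key.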


Are You Designing for Bigfoot?

Design, Software Development July 25th, 2008

Consider the following (imaginary) conversation:

Programmer: What if a user wants the ability to sort the values in the report grid by columns?

Manager: We don’t need a dynamic grid for version one.

Programmer: But someone might want to sort the values! Users will expect to be able to sort values by clicking on the column headers…

Manager: We don’t have time to add this feature to our schedule. Can’t we consider it for a future release?

Programmer #2: I don’t think sorting makes sense here.

….

Sounds familiar? If you’ve been part of a product development team you’ve probably encountered this kind of feature debate before… maybe even quite frequently…

Programmers are trained to think in terms of possibilities and logic; that’s why the logic of what “might” happen is irresistible to the programmer.
Alan Cooper describes this behavior in The Inmates are Running the Asylum:

Programmers call these one-in-a-million possibilities edge cases. Although these oddball situations are unlikely to occur, the program will crash whenever they do if preparations are not made.
Although the likelihood of edge cases is small, the cost for lack of preparedness is immense. Therefore, these remote possibilities are very real to the programmer. The fact that an edge case will crop up only once every 79 years of daily use is no consolation to the programmer. What if this one time is tomorrow?

The manager can’t advance the argument with the force of reason as he has no way to logically contradict this programmer and so he’s left with the choice of giving up or using his authority, which is usually how these kinds of arguments end up.

The programmer’s concern with the possible can easily obscure the probable by loading the program’s interface with controls and functions that will rarely be used.

The problem with this entire argument lies in the fact that arguing about what users might expect is the same as asking “What does Bigfoot like for breakfast?”


Users come in many forms and shapes. Some are proficient with computers, some don’t even like them. Some are used to Microsoft UI and some work mainly on the Mac. Some require advanced control at the expense of simplicity, and some would rather give advanced control away in exchange for a quick and simple way to achieve their goals.

To avoid these dead-end feature debates, it’s useful to change the team’s terminology and work in terms of personas and their goals. Rather than thinking of users as an abstract, difficult-to-describe, amorphous group of people, personas instruct us to talk about specific users who have names, personalities, needs, and goals.
Understanding exactly who the users are and what they do with the software is essential to determine if a certain feature is actually required.

Do we really need to allow sorting by column? In the abstract, there’s no way of telling… but if you know that we’re designing for Jeff, the CFO of a large enterprise, who would like to get his report in a fixed, predetermined format, then it’s probably a requirement not to do so…

A very good example of the use of personas can be found in Nikhil Kothari’s post, where he describes three personas that were used by the development division at Microsoft while working on Visual Studio 2005:

We have three primary personas across the developer division: Mort, Elvis and Einstein.

Mort, the opportunistic developer, likes to create quick-working solutions for immediate problems and focuses on productivity and learn as needed.
Elvis, the pragmatic programmer, likes to create long-lasting solutions addressing the problem domain, and learn while working on the solution.
Einstein, the paranoid programmer, likes to create the most efficient solution to a given problem, and typically learn in advance before working on the solution.

The description above is only rough summarization of several characteristics collected and documented by our usability folks.
During the meeting a program manager on our team applied these personas in the context of server controls rather well:

  • Mort would be a developer most comfortable and satisfied if the control could be used as-is and it just worked.
  • Elvis would like to able to customize the control to get the desired behavior through properties and code, or be willing to wire up multiple controls together.
  • Einstein would love to be able to deeply understand the control implementation, and want to be able to extend it to give it different behavior, or go so far as to re-implement it.

Using the personas described above, Nikhil’s team was able to design a set of .NET controls that each persona would find usable for his own purposes:

All of these controls just work out of the box in implementing the end-to-end scenario of managing user sign-on. Coupled with themes, these controls can look pretty good as well. They go on to provide a whole set of properties to tweak their behavior, and appearance.

Furthermore, they provide the ability to flip into template mode for more significant changes to their content and layout.

Finally, they’re built on the provider-model in ASP.NET so an advanced developer could come along and swap out the built-in membership provider going against say the default SQL or Active Directory user database and replace it with one that goes against say an Oracle user database or some other custom store while keeping the UI functionality intact.

So, the next time you’re in a feature debate, stop and ask yourself “Who exactly am I designing this feature for? Why does he need it?”.
Describe a set of personas that your software will target and use them as reference in any design discussion or feature debate instead of referring to an amorphous group of “users” that might not even exist.

Don’t design your software for Bigfoot…


How Do You Define “Good Code”?

Software Development June 26th, 2008

I was on a phone interview the other day where I was asked for my definition of “Good Code”.

The first thought that came to mind was maintainability – if it can’t be understood, maintained and extended by other developers then it’s definitely not good.
Then, other things came to mind: efficiency, elegance (simple, proper use of language constructs and environment capabilities), modularity, proper object-oriented design, …
Of course, and we tend to take that for granted, it also has to work… without errors, security holes, etc.

In his book, Code Complete, Steve McConnell supports my definition of good code as maintainable code:

Another theme that runs throughout this book is an emphasis on code readability. Communication with other people is the motivation behind the quest for the Holy Grail of self-documenting code.

The computer doesn’t care whether your code is readable. It’s better at reading binary machine instructions than it is at reading high-level-language statements. You write readable code because it helps other people to read your code. Readability has a positive effect on all these aspects of a program:

  • Comprehensibility
  • Reviewability
  • Error rate
  • Debugging
  • Modifiability
  • Development time—a consequence of all of the above
  • External quality—a consequence of all of the above

Readable code doesn’t take any longer to write than confusing code does, at least not in the long run. It’s easier to be sure your code works if you can easily read what you wrote. That should be a sufficient reason to write readable code. But code is also read during reviews. It’s read when you or someone else fixes an error. It’s read when the code is modified. It’s read when someone tries to use part of your code in a similar program.

Making code readable is not an optional part of the development process, and favoring write-time convenience over read-time convenience is a false economy. You should go to the effort of writing good code, which you can do once, rather than the effort of reading bad code, which you’d have to do again and again.

On the other hand, Paul DiLascia, in MSDN Magazine’s { End Bracket } column, provides a list of traits that good code should have:

Whether you code in C/C++, C#, Java, Basic, Perl, COBOL, or ASM, all good programming exhibits the same time-honored qualities: simplicity, readability, modularity, layering, design, efficiency, elegance, and clarity.

Simplicity means you don’t do in ten lines what you can do in five. It means you make extra effort to be concise, but not to the point of obfuscation. It means you abhor open coding and functions that span pages. Simplicity—of organization, implementation, design—makes your code more reliable and bug free. There’s less to go wrong.

Readability means what it says: that others can read your code. Readability means you bother to write comments, to follow conventions, and pause to name your variables wisely. Like choosing “taxrate” instead of “tr”.

Modularity means your program is built like the universe. The world is made of molecules, which are made of atoms, electrons, nucleons, quarks, and (if you believe in them) strings. Likewise, good programs erect large systems from smaller ones, which are built from even smaller building blocks. You can write a text editor with three primitives: move, insert, and delete. And just as atoms combine in novel ways, software components should be reusable.

Layering means that internally, your program resembles a layer cake. The app sits on the framework sits on the OS sits on the hardware. Even within your app, you need layers, like file-document-view-frame. Higher layers call ones below, which raise events back up. (Calls go down; events go up.) Lower layers should never know what higher ones are up to. The essence of an event/callback is to provide blind upward notification. If your doc calls the frame directly, something stinks. Modules and layers are defined by APIs, which delineate their boundaries. Thus, design is critical.

Design means you take time to plan your program before you build it. Thoughts are cheaper than debugging. A good rule of thumb is to spend half your time on design. You need a functional spec (what the programs does) and an internal blueprint. APIs should be codified in writing.

Efficiency means your program is fast and economical. It doesn’t hog files, data connections, or anything else. It does what it should, but no more. It loads and departs without fuss. At the function level, you can always optimize later, during testing. But at high levels, you must plan for performance. If the design requires a million trips to the server, expect a dog.

Elegance is like beauty: hard to describe but easy to recognize. Elegance combines simplicity, efficiency, and brilliance, and produces a feeling of pride. Elegance is when you replace a procedure with a table, or realize that you can use recursion—which is almost always elegant:

int factorial(int n) {   return n==0 ? 1 : n * factorial(n-1); }

Clarity is the granddaddy of good programming, the platinum quality all the others serve. Computers make it possible to create systems that are vastly more complex than physical machines.
The fundamental challenge of programming is managing complexity. Simplicity, readability, modularity, layering, design, efficiency, and elegance are all time-honored ways to achieve clarity, which is the antidote to complexity.

Clarity of code. Clarity of design. Clarity of purpose. You must understand—really understand—what you’re doing at every level. Otherwise you’re lost. Bad programs are less often a failure of coding skill than of having a clear goal. That’s why design is key. It keeps you honest. If you can’t write it down, if you can’t explain it to others, you don’t really know what you’re doing.

So what is the most important trait of “Good Code”?
Later on, it struck me – like anything when it comes to engineering, it’s about balance.
When we write code we strive to find a balance between complexity and simplicity by constantly evaluating the tradeoffs we have to make in order to get there.
Therefore, good code is code that strikes the right balance between all of the qualities mentioned above.

Think about it the next time you’re writing or reading someone else’s code…


Comments (2) imported from www.ekampf.com/blog/:

Tuesday, July 01, 2008 1:01:39 PM (GMT Daylight Time, UTC+01:00)

“On the other hand, I was using Google software – a lot of it – in the last year, and slick as it is, there’s just too much of it that is regularly broken. It seems like every week 10% of all the features are broken in one or the other browser. And it’s a different 10% every week – the old bugs are getting fixed, the new ones introduced. This across Blogger, Gmail, Google Docs, Maps, and more. “

As much as I really like the simple but powerful UI of services like Gmail etc., sometimes it’s really annoying to see bugs coming and going! The software never seems to get stable. And Google seems to be aware of this, at least they mark nearly all their apps with a BETA tag :-)
It would be interesting to hear something about the software development process. Do they practice things like unit-tests, continuous integration, code reviews, etc.?

Florian Potschka

Tuesday, July 01, 2008 1:25:41 PM (GMT Daylight Time, UTC+01:00)

Hey Florian,
According to this and this, it seems like one big community where each project is managed like an open-source project.
From experience, having developers divide their time between several projects (that can be unrelated, as the post says) doesn’t work well…
Not much public information on their internal practices though… not sure if it’s a good sign :S

Regards,
Eran

Eran Kampf


Microsoft Research launches WorldWide Telescope

Software Development, Software Industry May 13th, 2008

Microsoft Research’s WorldWide Telescope, otherwise known as “the thing that made Robert Scoble cry” has been publicly launched today.

WorldWide Telescope is a desktop application that essentially turns your computer into a virtual telescope, allowing you to browse the universe. You can roam the universe freely or choose from a growing number of guided tours by astronomers and educators. You can also join communities of stargazers, connect your own telescope to your computer and control it using the application.

Another cool option allows you to gain a different perspective on what you’re seeing by switching between imagery sources.


The interface is pretty complex right now, but everything works quite smoothly once you get the hang of it. I guess Microsoft will have to simplify it to allow wide adoption.

I don’t know about you but I’m going to take some time and travel the universe…


Developing a Robust Data Driven UI Using WPF – Stock DataModel Sample

.NET, Software Development, WPF March 30th, 2008

In the previous post in this series we looked at the DataModel component of our architecture in detail and defined an abstract DataModel base class to derive our models from. In this post we’ll implement a concrete data model that represents a stock’s value. Why a stock? It’s an object with a changing value, which requires our DataModel to constantly refresh and keep its data “alive”, and it’s simple to implement, which makes it a perfect example for our first DataModel.

The first thing we’ll do when defining our Stock DataModel is abstract the data source. This way we can easily implement several data sources for fetching a stock’s data and instantiate the DataModel with the right one (for example, read from Yahoo at runtime, or read from a fake data source during unit testing):

/// <summary>
/// Defines the interface allowing <see cref="StockDataModel"/> to read quotes from various providers.
/// </summary>
public interface IStockDataProvider
{
    /// <summary>
    /// Gets a given stock symbol's (given by <paramref name="symbol"/>) data.
    /// </summary>
    /// <param name="symbol">The stock's symbol.</param>
    /// <param name="name">The stock's company name.</param>
    /// <param name="quote">The last stock's quote.</param>
    /// <param name="change">The stock's change value.</param>
    /// <param name="open">The stock's open value.</param>
    /// <returns><b>True</b> if data was retrieved successfully; otherwise, <b>False</b>.</returns>
    bool TryGetData(string symbol, out string name, out double quote, out double change, out double open);
}
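To make the "fake data source during unit testing" idea concrete, here is a minimal sketch of what such a provider might look like. FakeStockDataProvider, its Add method and the demo class are hypothetical names that are not part of the article's download, and the quote values are made up. The interface is repeated only so the sketch compiles on its own.

```csharp
using System;
using System.Collections.Generic;

// Copy of the interface above, repeated so the sketch compiles on its own.
public interface IStockDataProvider
{
    bool TryGetData(string symbol, out string name, out double quote, out double change, out double open);
}

// Hypothetical in-memory provider for unit tests (an assumption, not the article's code).
public class FakeStockDataProvider : IStockDataProvider
{
    private class Entry
    {
        public string Name;
        public double Quote;
        public double Change;
        public double Open;
    }

    private readonly Dictionary<string, Entry> _entries = new Dictionary<string, Entry>();

    // Registers canned data to be returned for a symbol.
    public void Add(string symbol, string name, double quote, double change, double open)
    {
        _entries[symbol] = new Entry { Name = name, Quote = quote, Change = change, Open = open };
    }

    public bool TryGetData(string symbol, out string name, out double quote, out double change, out double open)
    {
        Entry entry;
        if (symbol != null && _entries.TryGetValue(symbol, out entry))
        {
            name = entry.Name;
            quote = entry.Quote;
            change = entry.Change;
            open = entry.Open;
            return true;
        }

        // Unknown symbol: report failure and hand back neutral values.
        name = null;
        quote = double.NaN;
        change = double.NaN;
        open = double.NaN;
        return false;
    }
}

public static class FakeProviderDemo
{
    public static void Main()
    {
        var provider = new FakeStockDataProvider();
        provider.Add("AAPL", "Apple Inc.", 143.5, 1.5, 142.0); // made-up numbers

        string name;
        double quote, change, open;
        bool found = provider.TryGetData("AAPL", out name, out quote, out change, out open);
        Console.WriteLine("{0}: {1} {2}", found, name, quote);
    }
}
```

Because the fake never touches the network, a test can control exactly which symbols succeed and which fail, which is the point of abstracting the data source.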

Now that we have our data source defined, we can implement different stock data providers for our DataModel to consume. Next, let’s go over the StockDataModel class:

public class StockDataModel : DataModel
{
    private string _symbol;
    private IStockDataProvider _quoteProvider;
    public StockDataModel(string symbol, IStockDataProvider provider)
    {
        _symbol = symbol;
        _quoteProvider = provider;
        this.State = DataModelState.Fetching; 

        // Queue a work item to fetch the symbol's data
        if (!ThreadPool.QueueUserWorkItem(new WaitCallback(FetchDataCallback)))
        {
            this.State = DataModelState.Invalid;
        }
    } 

    public string Symbol
    {
        get { return _symbol; }
    }

Our StockDataModel constructor takes the stock symbol that the model represents and an IStockDataProvider to fetch the stock’s data from. We set the initial DataModel state to Fetching and queue a work item for a background thread to update our model with the stock’s data – company name, quote, change value and open value. If we fail to queue the work item then we put the model in an invalid state. Next, we need to define the properties exposed by StockDataModel for data binding.

public string Name
{
    get
    {
        VerifyCalledOnUIThread();
        return _name;
    }
    private set
    {
        VerifyCalledOnUIThread();
        if (_name != value)
        {
            _name = value;
            OnPropertyChanged("Name");
        }
    }
}

public double Quote
{
    get
    {
        VerifyCalledOnUIThread();
        return _quote;
    }
    private set
    {
        VerifyCalledOnUIThread();
        if (_quote != value)
        {
            _quote = value;
            OnPropertyChanged("Quote");
        }
    }
}
...


We’re using a private setter to update the property values and trigger a PropertyChanged event if required. You can also add calculated properties. For example:

public double ChangePercent
{
    get
    {
        if (double.IsNaN(Change))
            return double.NaN;

        if (double.IsNaN(Open))
            return double.NaN;

        // Floating-point division never throws in C#; dividing by a zero
        // Open yields an infinity or NaN, so no try/catch is needed.
        return (Change / Open) * 100;
    }
}
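A calculated property has no setter of its own, so bindings on it only refresh if the setters of its inputs raise the change event on its behalf. The following is a minimal standalone sketch of that idea. QuoteModel is a hypothetical class that implements INotifyPropertyChanged directly instead of deriving from the article's DataModel, and it omits the UI-thread checks to stay short.

```csharp
using System;
using System.ComponentModel;

// Hypothetical standalone model (an assumption); the article's DataModel base
// class supplies OnPropertyChanged itself.
public class QuoteModel : INotifyPropertyChanged
{
    private double _change = double.NaN;
    private double _open = double.NaN;

    public event PropertyChangedEventHandler PropertyChanged;

    public double Change
    {
        get { return _change; }
        set
        {
            // Equals treats NaN as equal to NaN, unlike the != operator.
            if (!_change.Equals(value))
            {
                _change = value;
                OnPropertyChanged("Change");
                OnPropertyChanged("ChangePercent"); // derived from Change
            }
        }
    }

    public double Open
    {
        get { return _open; }
        set
        {
            if (!_open.Equals(value))
            {
                _open = value;
                OnPropertyChanged("Open");
                OnPropertyChanged("ChangePercent"); // derived from Open
            }
        }
    }

    // Calculated property: NaN inputs simply propagate to the result.
    public double ChangePercent
    {
        get { return (Change / Open) * 100; }
    }

    protected void OnPropertyChanged(string propertyName)
    {
        PropertyChangedEventHandler handler = PropertyChanged;
        if (handler != null)
            handler(this, new PropertyChangedEventArgs(propertyName));
    }
}

public static class QuoteModelDemo
{
    public static void Main()
    {
        var model = new QuoteModel();
        model.PropertyChanged += delegate(object sender, PropertyChangedEventArgs e)
        {
            Console.WriteLine("Changed: " + e.PropertyName);
        };

        model.Open = 4.0;
        model.Change = 1.0;
        Console.WriteLine("ChangePercent = " + model.ChangePercent);
    }
}
```

The sketch compares with Equals rather than != so that assigning NaN to a property that is already NaN does not raise a spurious notification.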

In this case, it is important to remember to raise the property change event for ChangePercent too whenever the values it depends on (Change and Open) change.

Now for the implementation of FetchDataCallback. This method is called by a background thread to update the stock data, so we’re free to perform expensive operations in it, such as calling a web service to fetch the stock’s data from an online provider (like Yahoo).

private void FetchDataCallback(object state)
{
    string fetchedName;
    double fetchedQuote;
    double fetchedChange;
    double fetchedOpen;

    if (_quoteProvider.TryGetData(_symbol, out fetchedName, out fetchedQuote, out fetchedChange, out fetchedOpen))
    {
        this.Dispatcher.BeginInvoke(
            DispatcherPriority.ApplicationIdle,
            new ThreadStart(
                delegate
                {
                    this.Name = fetchedName;
                    this.Quote = fetchedQuote;
                    this.Change = fetchedChange;
                    this.Open = fetchedOpen;
                    this.State = DataModelState.Active;
                }));
    }
    else
    {
        this.Dispatcher.BeginInvoke(
            DispatcherPriority.ApplicationIdle,
            new ThreadStart(
                delegate
                {
                    this.State = DataModelState.Invalid;
                }));
    }
}

In the previous post, in the overview of the WPF threading model, we noted the following:

If only the creator of a DispatcherObject can access it, how can a background thread interact with the user? The background thread does not access the UI directly, but it can ask the UI thread to perform a task on its behalf by registering work items with its Dispatcher, using either its Invoke method (a synchronous call that returns when the UI thread has finished executing the delegate) or its BeginInvoke method (which runs asynchronously).

In the above code, after fetching the data with _quoteProvider.TryGetData, we need to communicate the changes back to the UI thread. We use the Dispatcher to set the new values of the DataModel properties, which ensures that our property change events are raised on the UI thread.

Keeping the Data Alive

So far, our code only fetches the stock data once. Let’s see what it takes to make our DataModel keep its data alive.

protected override void OnEnabled()
{
    _timer = new DispatcherTimer(DispatcherPriority.Background);
    _timer.Interval = TimeSpan.FromMinutes(5);
    _timer.Tick += delegate { ScheduleUpdate(); };
    _timer.Start(); 

    ScheduleUpdate();
}
protected override void OnDisabled()
{
    _timer.Stop();
    _timer = null;
}
private void ScheduleUpdate()
{
    VerifyCalledOnUIThread();
    // Queue a work item to fetch the quote
    if (ThreadPool.QueueUserWorkItem(new WaitCallback(FetchDataCallback)))
    {
        this.State = DataModelState.Fetching;
    }
}

The above code defines a timer that is active while the DataModel is enabled. Every 5 minutes the timer calls ScheduleUpdate, which performs the same background-thread data update we started in our constructor. We’re using a DispatcherTimer so that the calls to ScheduleUpdate are made on the Dispatcher’s thread (the UI thread), which lets us update the DataModel’s state without any hassle. If we had used System.Threading.Timer, ScheduleUpdate would be called on a thread-pool thread rather than the UI thread, requiring the use of Dispatcher.BeginInvoke to update the state…
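The thread a timer fires on is easy to check outside WPF. This console sketch, which is separate from the article's code and uses the hypothetical names TimerThreadDemo and CallbackRunsOnPoolThread, starts a one-shot System.Threading.Timer and records whether its callback ran on a thread-pool thread.

```csharp
using System;
using System.Threading;

public static class TimerThreadDemo
{
    // Starts a one-shot System.Threading.Timer and reports whether its
    // callback ran on a thread-pool thread rather than on the caller's thread.
    public static bool CallbackRunsOnPoolThread()
    {
        bool onPoolThread = false;

        using (var done = new ManualResetEvent(false))
        using (var timer = new Timer(
            delegate(object state)
            {
                onPoolThread = Thread.CurrentThread.IsThreadPoolThread;
                done.Set();
            },
            null,               // no state object
            10,                 // fire once after 10 ms
            Timeout.Infinite))  // do not repeat
        {
            done.WaitOne();     // block until the callback has run
        }

        return onPoolThread;
    }

    public static void Main()
    {
        Console.WriteLine("Timer callback on pool thread: " + CallbackRunsOnPoolThread());
    }
}
```

In a WPF model, a callback like this would have to marshal any state change back through Dispatcher.BeginInvoke, which is the extra step the DispatcherTimer avoids.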

That’s it…

We’ve got the basic DataModel implemented. You can use it in your XAML window to see it working… To get a basic XAML running you’ll need to define a content control:

<ContentControl x:Name="_content" />

And set its content to a StockDataModel instance in your code-behind:

_content.Content = new StockDataModel("AAPL", someProvider);

Then all you need to do is define a data template for the StockDataModel type to control its appearance. Here’s a simple template, for example:

<DataTemplate x:Name="StockTemplate" DataType="{x:Type local:StockDataModel}">
   <StackPanel Orientation="Horizontal" mdb:EnableModel.DataModel="{Binding}" Height="30px" Width="Auto" ClipToBounds="True">
     <TextBlock Text="{Binding Name}" Foreground="#737271" Width="120" Padding="3,0,0,3" Style="{StaticResource StockText}" />
     <TextBlock Text="{Binding Quote}" Foreground="#737271" Width="55" Padding="0,0,0,3" Style="{StaticResource StockText}" />
   </StackPanel>
</DataTemplate>

You can find the code discussed in this article, plus my own implementation of an IStockDataProvider that reads stock data from Yahoo, here:

In the next post we’ll discuss DataModel unit testing and see how the StockDataModel tests are implemented.


Comments (5) imported from www.ekampf.com/blog/:

Sunday, March 30, 2008 10:45:52 PM (GMT Daylight Time, UTC+01:00)

Thanks for the series! Looking forward to the following parts. However, there’s a bug in the shown code, as you cannot check if a value is NaN by comparing it to double.NaN. You have to use double.IsNaN(…).

Use IsNaN to determine whether a value is not a number. It is not possible to determine whether a value is not a number by comparing it to another value equal to NaN.

Simon

Monday, March 31, 2008 4:39:37 AM (GMT Daylight Time, UTC+01:00)

Hey Simon, Thanks.

Fixing the code and the post…

Regards,
Eran

Eran Kampf

Friday, April 04, 2008 3:47:12 AM (GMT Daylight Time, UTC+01:00)

Very nice article series. Keep up the good work!

Kevin Kerr

Wednesday, May 28, 2008 2:55:06 PM (GMT Daylight Time, UTC+01:00)

Really great series, very nicely done.

Question: Why call VerifyCalledOnUIThread() in the ScheduleUpdate method? Since you’re calling BeginInvoke on the dispatcher inside FetchDataCallback all should be well, right?

Mike

Thursday, May 29, 2008 11:41:31 AM (GMT Daylight Time, UTC+01:00)

Hi Mike,

Good question. Notice that besides queuing a work item that calls FetchDataCallback, the ScheduleUpdate method also updates the model’s State to DataModelState.Fetching when that work item is queued. Since we’re changing the actual model, we need to make sure we’re doing it on the UI thread. Alternatively, we could have used a System.Threading.Timer to do the updates, so that ScheduleUpdate() would be called on a background thread directly, but then we couldn’t set the model state to Fetching there. We’d have to send that back to the UI thread.

Regards,
Eran Kampf

Eran Kampf
