Dec 21, 2008

SOA & Routers

Why is a Cisco router $10,000 and a Linksys router $100? Or perhaps the better question is why are your network guys willing to pay $10,000 for a Cisco router when a Linksys router can be purchased for $100?

The answer is that Cisco routers have multiple layers of reliability, manageability, and security that Linksys routers don't.

But a 100x difference? Is it really worth 100 times the price?

The network is the backbone of your IT infrastructure. If your network goes down, all the IT systems stop working and so does much (if not all) of your company's operations. Keeping the network up, being alerted early to problems, and being able to reroute around and resolve them is indeed worth 100 times the price.

A traditional big-box application is typically built from these components:

  • The Application You Wrote

  • The Container or Runtime

  • The Application Server

  • The Database

  • The Database Server

  • (When we move to an application with a web GUI, you can add a web server in there.)

    The whole big-box application depends on those few components, all of which tend to be within the application team's control or at least dedicated to the application. When there's a problem, get the server guy, the database guy, and the app guy together and figure it out. It's all boxed together.

    Let's say we take our big-box application and add some new functionality that depends on 3 web services, each supplied by a different application. Two of those services route and transform through our ESB (Enterprise Service Bus). Because these are web services, we add a web server to the list above, so each of those environments brings 6 dependent components. That is 6 components times 3 services, plus another 6 for the ESB, plus the original 5 running our own environment...

    Our application is now dependent on 29 components to operate.
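The tally above can be checked with a few lines of Python. The counts come straight from the text; the function and constant names are only illustrative.

```python
# Counts from the text above; names are illustrative, not from any real tool.
BIG_BOX_COMPONENTS = 5       # app, container/runtime, app server, database, database server
WEB_SERVICE_COMPONENTS = 6   # the same 5 plus a web server


def total_dependencies(services: int, esb_count: int = 1) -> int:
    """Count every component our application now depends on."""
    service_side = services * WEB_SERVICE_COMPONENTS
    esb_side = esb_count * WEB_SERVICE_COMPONENTS
    return service_side + esb_side + BIG_BOX_COMPONENTS


print(total_dependencies(services=3))  # 29
```

Every additional service adds another 6 components to the total, so the dependency count grows quickly as more services are consumed.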

    Compare this with old coaxial, or 10BASE2, networking. It was fast, simple, and didn't have the repeater overhead of a hub. However, every node was directly connected to the whole network, meaning a problem with ANY node impacted the WHOLE network. A loose wire or a bad network card and the whole network was down, with no easy way to isolate the problem!

    Clearly this wasn't a reasonable way of operating, and 10BASE-T and the network hub quickly took over.

    Formerly, when our application had a problem, our department was impacted: say 20-100 people stopped working while IT scrambled to fix it. Now that our application provides a service to 5 other applications, a problem impacts 100-500 people. And if the problem is in our ESB, which has 50 services routed through it, anywhere from half to all of the IT infrastructure could effectively be down.

    As we shift to SOA, our primary SOA components become as important as the network itself. Our ESB must be as reliable as the highest SLA of any application that uses an ESB service, if not higher. And every service our application uses must have an SLA equal to or higher than our application's own, OR we must build our service usage paradigm to compensate.
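One rough way to see why the SLA requirement holds: if we assume failures are independent and that our application needs every dependency to be up (both are simplifying assumptions, and the availability figures below are made up for illustration), the best availability we can offer is the product of the individual availabilities. It can never exceed the weakest link.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a chain in which every link must be up.

    Assumes independent failures, which real systems only approximate.
    """
    return prod(availabilities)


# Hypothetical figures: our application, the ESB, and one service we call.
chain = {"our_app": 0.999, "esb": 0.995, "service": 0.999}
combined = serial_availability(chain.values())

print(f"combined availability: {combined:.4f}")
print(f"weakest link:          {min(chain.values()):.4f}")
```

With these made-up numbers the combined figure lands below the ESB's own availability, so an application promising 99.9% could not keep that promise however well it was written.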

    Services and/or the ESB can be designed and configured to degrade gracefully, use alternate routes or services, and handle service failure. Let me restate that: not can be, must be. Many an IT organization using SOA will come to this realization and make these adjustments, usually 2-3 years into the maturity cycle, after they've spent many a long night tracing service failures and listening to business managers express their frustration with yet another IT outage (while still trying to explain to the CIO why this SOA thing was a good idea).
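As a minimal sketch of what "alternate routes" and "degrade gracefully" might look like in application code: try each route in order, and if every one fails, return a reduced but usable result instead of an error. The service names and return values here are hypothetical, and a production version would also need timeouts, logging, and alerting.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def call_with_fallback(routes: Sequence[Callable[[], T]], degraded: T) -> T:
    """Try each route in order; return a degraded result if all of them fail."""
    for route in routes:
        try:
            return route()
        except Exception:
            # A real system would log the failure and alert support here.
            continue
    return degraded


# Hypothetical services, for illustration only.
def primary_credit_check() -> str:
    raise ConnectionError("primary service unreachable")


def alternate_credit_check() -> str:
    return "approved (alternate route)"


result = call_with_fallback(
    [primary_credit_check, alternate_credit_check],
    degraded="pending manual review",
)
print(result)  # approved (alternate route)
```

The design choice is that the caller always gets an answer it can act on: the application keeps working in a reduced mode while support chases the underlying failure.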

    The ESBs will NOT do this for you. SOA Runtime Governance tools are beginning to probe deeply enough to catch certain types of service problems early, and they may alert support to begin troubleshooting, but they aren't going to save the application from the negative impact.

    Necessary SOA Paradigm shift: Program for fault tolerance.