TL;DR: the team at IBM just made getting access to a world-leading Data Science platform, with integrated tools and collaboration, as simple as web-mail!
Day one at World of Watson 2016 in Las Vegas just wrapped up, and I had the honour and privilege of being on stage with Derek Schoettle, General Manager of Analytics Platforms & Cloud Data Services at IBM, to share a conversation on the topic “a day in the life of a Data Scientist”. We discussed some of the fundamental challenges faced by any organization or individual looking to embark on any form of Data Science journey.
I want to share a couple of key things that came out of our conversation. Derek made a number of very exciting announcements, and we also touched on what I believe are the three most fundamental pillars of Data Science – I thought I would share them with you, as I’m certain you can put them to good use.
But before I dive into any of that, I’d first like to provide two quick quotes I took away from today’s keynote, and then some context based on my own experience over the years, so that you might fully appreciate the gravity of today’s announcements.
A COUPLE OF FUN QUOTES
“IBM’s Data Science Experience & Watson Data Platform were built with teams in mind, we believe Data Science is a team sport.”, Derek Schoettle ( IBM WoW 2016 )
“The Data Science Experience platform from IBM makes it possible for me to go from Zero to Hero in an instant.”, Dez Blanchfield ( IBM WoW 2016 )
FIRST A BIT OF CONTEXT
I’d like to set the scene for you of just what various elements of “a day in the life of a Data Scientist” have, until now, been like in my experience. I believe it will give you a far better appreciation for just how great a paradigm shift IBM’s latest offerings in Data Science and Analytics really are, and what they mean to anyone in this space.
THE GOOD OLD DAYS ( WELL LAST WEEK ACTUALLY )
For years now, a regular challenge I’ve been given by organizations of all sizes – from small teams of two or three in a startup, through to large enterprises, multi-nationals and federal government agencies – has been to “stand up” a full-stack “platform” to support their desire, or more often their need, to apply Data Science in some form to a core business issue their existing Information Management and Business Intelligence systems cannot address. Often they have an idea or initiative they wish to play out on how they might transform the way they do business, or simply want to explore how they might offer better products, services and support to existing customers, and of course entice new ones.
This basic challenge is all too often a “non-trivial undertaking”. In many of these organizations there are gaps in the knowledge, experience and skills of existing staff and teams – even something as straightforward as developing a requirements document to capture what the customer wants and expects to see delivered can be a hurdle, let alone developing a business case, a supporting cost model, project management, design, development and implementation. The result is that a simple idea can quickly go from “great idea” to “nightmare” as organizations try to weave their way through the seemingly endless options, and work out how to implement even a basic proof of concept ( PoC ) environment or a “sandbox” to start playing on.
NOT ALL ASPECTS OF DATA SCIENCE ARE SEXY
Assuming we do in time reach a set of decisions around exactly what is required in the Data Science “stack” to provide a safe, secure, easy-to-use, consistent platform to a client’s organization, I would usually set upon the challenge of standing up a complex ecosystem as quickly and cost-effectively as possible. For the most part, this is about as far removed from the sexy part of Data Science as you can get – the engineering elements, most of which are just hard work, albeit a necessary evil to get to the end goal.
To stand up a big data ecosystem of any scale, I’d require network, storage and compute infrastructure in some form – either in a public cloud, on premises, or in a 3rd party data centre, be it physical or virtual – and, depending on the volume of data being moved around, ideally as close to the customer network as possible.
FIRST THE NETWORK
To connect any of this ecosystem to the world and/or the customer’s own network, I’d need telecommunications providers, networks, IP address space, IP routing, IP subnetting, domain names, bandwidth, routers, switches, firewalls, firewall policies, firewall rules, a list of TCP and UDP ports to open and close, multi-factor authentication with physical or software security tokens, intrusion detection, intrusion inspection, and usually some form of monitoring for this network stack.
This, mind you, is merely what’s required to connect the environment to the internet or the customer network – I won’t torture you with the complexities of the network topologies employed in-rack or inter-rack to connect the storage and compute nodes; that would be cruel and unusual punishment. IBM has an entire library of Redbooks you can search to get up to speed on that, if you have the time and inclination.
NOW STORAGE & COMPUTE
Next I’d need to procure and build servers, storage and uninterruptible power supplies, rack and mount the servers, implement a mix of physical and logical access control, install and configure operating systems, create system, application and user accounts, passwords and usually pre-shared keys, then install and configure a myriad of software: programming languages, tools, libraries, modules and plug-ins, the Apache Hadoop Distributed File System ( HDFS ), various parts of the Apache Hadoop MapReduce-based ecosystem with tools like Hive, Pig and Impala, and my personal favourite, the Swiss Army knife of such an ecosystem, the Hadoop User Experience, aka HUE.
And more often than not these days, that stack would also include Apache Spark, which in turn needs a Java run-time environment or Java development kit to run on. I’d also need operating system services, application services, user activity logging and monitoring tools, admin tools, and a web server platform or two – and that’s before I even get to the point of being able to ingest some data, cleanse it, normalize it in various ways, wave my magic ETL & ELT wand over it a few times, and so the list goes on. And that’s just the tip of the proverbial large block of solid liquid in a sea of potential troubles.
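To make that last mile concrete – and purely as an illustration, with made-up file, column and value names – the “ingest, cleanse, normalize” pass I’m describing often boils down to a few lines of Python once the stack is finally in place:

```python
# A minimal, hypothetical sketch of an "ingest, cleanse, normalize" pass
# using pandas; the data and column names are invented for illustration.
import io
import pandas as pd

# Ingest: raw CSV with messy values ( inlined here instead of a real file )
raw = io.StringIO(
    "customer,region,spend\n"
    " Alice ,NORTH,100\n"
    "bob,north,\n"
    "Carol,South,250\n"
)
df = pd.read_csv(raw)

# Cleanse: trim stray whitespace, drop rows missing the measure we need
df["customer"] = df["customer"].str.strip()
df = df.dropna(subset=["spend"])

# Normalize: consistent casing for the region dimension
df["region"] = df["region"].str.lower()

print(df.to_dict("records"))
# → [{'customer': 'Alice', 'region': 'north', 'spend': 100.0},
#    {'customer': 'Carol', 'region': 'south', 'spend': 250.0}]
```

The point, of course, is not these few lines of pandas – it’s the mountain of infrastructure you have to climb before you’re even allowed to write them.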
CAN’T YOU USE A FREE EVAL SANDBOX FOR THAT
Some might argue, “oh, I can do that with any of the Hadoop distros’ evaluation sandbox offerings in a public cloud in a few hours”, and yes, that’s true, and for many PoCs that may provide what you need. But for most projects I’ve found that free cloud-hosted sandbox offerings generally don’t meet the core requirements of what most customers need once they go beyond one user on a laptop or a few GB of data.
Often, to gain rapid deployment using a pre-built demo sandbox, I would end up sacrificing so many basic things, which in time turn into brick walls I have to work around in some way – all too often resulting in a kludge that would never make its way into a production-scale implementation.
So yes, it’s entirely possible a free cloud-hosted sandbox could, for some, offer a simple quick starting point, but for most solutions I’ve designed and built, I’ve found myself working around their limits often enough to want to build “from scratch” as it were.
BUT THERE IS A BETTER WAY
Over the past few months, I’ve had the honour of being granted access to a number of early adopter programmes within the IBM family, each of them giving me pre-release access to a range of IBM’s new tools and platforms such as the Bluemix Platform as a Service ( PaaS ) offering, the Data Science Experience platform, and now the Watson Data Platform, and I look forward to getting my hands on the new Watson Machine Learning service soon.
Of course, each time I’ve had such access, it means having to keep a number of exciting secrets as it were, about the respective platforms, until they were publicly announced.
But now, as of the 2016 New York City #DataFirst Data Science Experience ( aka DSX ) launch and the 2016 World of Watson event in Las Vegas, these new services I’ve had the privilege of playing with are officially announced, launched, and very much public knowledge, so I feel compelled to share information about them. Each of these platforms has in turn given us so many more options to provide value to organizations in far shorter time periods, at far lower cost, with almost none of the overhead previously experienced in the basic challenge of “standing up a platform” before being able to engage in the exciting new endeavor of Data Science.
AN ENTIRE DATA SCIENCE PLATFORM AT THE END OF A URL
When you picture the engineering quagmire I outlined earlier – which all too often feels like a Herculean challenge, and invariably is – and the time, effort, cost and drama surrounding merely building a platform before we can even get to the business of fundamental Data Science itself, you quickly gain some appreciation for just how exciting it is for folk like myself and my peers, working in the brave new world of Data Science, to remove that whole painful issue of building and configuring a Data Science “stack”, and to leapfrog directly to the business of getting on with the actual Science part of Data Science without the Engineering part tripping us up each time. It’s a game changer, by no small measure.
“IBM just put an entire Data Science platform at the end of a URL. My big data platform is now a bookmark.”, Dez Blanchfield ( IBM WoW 2016 )
As I mentioned a moment ago, IBM formally announced and launched their Data Science Experience ( DSX ) platform a couple of weeks ago in New York City ( I had the honour of being part of that amazing event as well ), hosted in what some call the centre of the big data business universe, the heart and soul of the heady world of high finance and high frequency trading.
Today my lucky stars were again aligned, as I was privileged to be on stage with Derek Schoettle when he formally announced the availability of the Watson Data Platform, the Watson Data Service, and the Watson Machine Learning service – groundbreaking offerings in their own right.
By making these services available through something as accessible as the ubiquitous web browser, IBM has dramatically shortened and simplified the route by which individuals and organisations can gain access to the tools required to begin applying Data Science and Machine Learning to their own data analysis and decision-making challenges, by leveraging the natively integrated Data Science Experience, Watson Data Platform, Watson Machine Learning service and the Bluemix cloud Platform as a Service.
In effect, what IBM has successfully done is deliver on the long overdue and much desired promise of cloud-based big data, analytics and machine learning services, all in a single, easy-to-use, affordable “single pane of glass” via the now ubiquitous web browser.
“With their Data Science Experience and Watson Data platforms, IBM has made Data Science as accessible as web-mail.”, Dez Blanchfield ( IBM WoW 2016 )
They have taken Data Science, Big Data, Analytics and Machine Learning and made them as simple and affordable as web-mail. We’ve all seen and experienced the powerful paradigm shift web-mail brought to the challenge of gaining access to email; now, in turn, we have the same easy access and simplicity of use via a browser-based platform for Data Science – and it’s a WoW moment ( pun intended ).
THREE KEY PILLARS OF DATA SCIENCE
OK, so I promised not just to excite you with what I believe are some of the biggest announcements in Data Science and Analytics to come from World of Watson 2016; I also promised I’d touch briefly on the three key pillars of Data Science that I had the pleasure of discussing on stage with Derek, so here they are.
1. LEARN
Built-in learning to get started or go the distance. A native feature of the IBM DSX is something they refer to as Community Cards. These are a standard template by which DSX users can share articles, data-sets, models, links, videos – almost any form of content – aimed at sharing information, knowledge and data, either privately and securely within their own teams and organization, or with the broader DSX user community, and even beyond the DSX platform.
It’s ridiculously easy to publish a Community Card and share it even with folk outside the DSX platform, through simple mechanisms such as a tweet on Twitter or a post on your LinkedIn profile.
This may sound like a simple idea, and in many ways it is, but it is a very powerful feature that could easily be overlooked. I consider it one of the three most important pillars of Data Science as a whole, and in particular of the IBM Data Science Experience, for learning, and in turn sharing our learning, is surely one of the core tenets of both the Data Science community and the broader open source community. I invite you to keep this core ideology in your top three key pillars of any Data Science journey.
2. CREATE
To allow us to create with ease, IBM offers through the DSX & WDP the best of open source and IBM products. Once upon a time, when the name IBM came to mind, the last thing you’d associate with it was open source, but those days are long gone. Yes, IBM builds some of the biggest proprietary software platforms on the planet, but it is now also one of the largest contributors to open source, in particular to the Apache Spark project.
And with that transition has come a significant shift in culture and behavior – a shift we should congratulate IBM for, as it has come about in record-breaking time, and its impact and positive repercussions are almost immeasurable. One area where we can measure that positive impact is the power to create through a common, single, integrated Data Science & Analytics platform.
When you remove every possible barrier to your teams jumping directly to creating things – content, data, code, models, or collaboration opportunities – you can in turn easily place a value on the time saved, the productivity enabled, and the dramatically reduced “time to value” your organization is gaining. So with that in mind, I invite you to ensure that the ability to “create” remains in your top three key pillars of any Data Science initiative, and that you consider putting a value on the benefits gained and time saved as a result of the power to create quickly, securely and collaboratively.
3. COLLABORATE
Community and social features that enable collaboration are paramount to the success of any Data Science initiative. Until the full force of what’s often referred to as Web 2.0 ( pronounced “two dot oh” ) came into effect, the true power of collaboration was in so many ways constrained to old-school, in-person or small-team efforts through email, conference calls, and the likes of intranets.
With recent developments in web technology, we have seen search engines do keyword prediction and search-term completion in real time, social media sites enable real-time voice, video, chat and file exchange, and the likes of Google Hangouts and WebRTC take what were once very expensive and cumbersome video and voice conferencing models, put them directly into near-zero-cost web browser interfaces, and make them available to the great unwashed masses around the world.
When all of that is bundled natively into a Data Science platform, it adds up to immeasurable power to collaborate like never before: the ability to create a collaboration workspace and a web-based notebook, code in R or Python, and seamlessly use connectors to access and import data – locally on your laptop or server, remotely across your network and your own business systems, or across the public internet, be it private data you have relevant access to or public data-sets, of which there are now millions across every imaginable industry and market segment.
Add to that drag-and-drop capabilities for your own resources, or resources shared by your own team or teams, from within your own organisation or from external sources outside your own firewalls – be it private data you have been given access to, or public data – and then the ability to share your own work, code, notebooks and models at the click of a mouse button ( or the touch of a finger on a tablet ), through social media or private invites, driving safe, secure collaboration in ways we’ve until recently only dreamed of. Well, you get the picture.
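As a hedged sketch of the notebook workflow described above – the file name and figures below are invented purely for illustration – one of the quiet joys of working in a Python notebook is that the same one-line pandas call reads a local file today and a public data-set URL tomorrow:

```python
# Illustrative only: a tiny stand-in data-set is written locally, then read
# back the same way a remote public data-set would be. Nothing here is a
# real DSX asset; the file name and numbers are made up.
import pandas as pd

# Create a tiny local data-set to stand in for an uploaded file.
pd.DataFrame(
    {"city": ["Las Vegas", "New York"], "attendees": [17000, 900]}
).to_csv("sample_event_data.csv", index=False)

# The very same read_csv call also accepts "https://..." URLs,
# which is how public data-sets drop straight into a notebook.
events = pd.read_csv("sample_event_data.csv")
print(events["attendees"].sum())  # → 17900
```

That symmetry between local and remote sources is a big part of why the connector-driven notebook model feels so frictionless compared to the hand-built stacks I described earlier.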
The power of this type of collaboration is just mind-boggling, and so many of us now take it for granted. But again, I invite you to recognize just how powerful collaboration actually is, to encourage it, nurture it, and support it within your teams and across your organisation, and to keep collaboration in the top three phrases you use when referring to Data Science in any form. The rewards for doing so are such a game changer that I get shivers when I think about how clunky collaboration was before we had the likes of the tools integrated natively into IBM’s DSX & WDP.
SO WHERE TO FROM HERE
Well, I’m glad you asked. I’d like to leave you with one last thought – an invitation, in fact: if you have not already signed up for an account on the IBM Data Science Experience and had a taste of what is possible today on the platform, then please do put time in your calendar, block out an hour or two, sign up, and try it out.
Once you have signed up, have a good look around and play with it: run some of the pre-built demos, check out the examples, the community-shared articles and the free data-sets, and give yourself the chance to make a fully informed decision about your own Data Science journey. You will be unlikely ever to want to build your own “stack” again – you’ll know that there’s a better way.
In short, don’t take my word for it – go try it out and prove it to yourself by getting hands-on. Find out for yourself whether I “drank the Kool-Aid”, or whether I am indeed correct in thinking that these exciting innovations from IBM are a bona fide game changer, if not in fact a complete paradigm shift from the old to the new. Prove me wrong if you will, but I suspect that in your first hour on the platform, you may just find yourself staring at the screen as I have many times, thinking ( possibly out loud ) “can it really be this easy?”.
I look forward to hearing what you think – let me know in the comments section below, as I’d love to hear your feedback, and I’d dearly love to see a healthy debate ensue, as I’m sure it will. Go forth and Learn, Create, and Collaborate. Your time starts… now!