I offer a half-hour a week of mentoring to fellow IBMers through an internal program called "CoachMe". Anyone in IBM can log in, search for a coach who possesses skills they'd like to acquire, and book a time to speak to them. I have found my own mentors to be invaluable in navigating corporate structures, finding new interests and driving results in my own career, and so it's great to give back. I've met interesting people from all over the world as an added bonus. One of the most common questions I get is "I'd like to become a data scientist, but where do I start?" I thought I'd share my answer to this question more broadly - and see if others have a different or similar perspective.
Data science is about driving business results using data and analysis. Something to understand is that despite this straightforward definition, data science isn't a single skill. To be a data scientist, you need a kit-bag of skills. Everyone builds their own kit-bag based on their experiences and aptitudes. For me, though, it's important to think about the four zones in the kit-bag and to ensure that no one zone gets all the attention. When one zone vastly outweighs the others for too long, you run the risk of no longer being a data scientist.
The four zones, or skill areas, in my kit-bag are:
- programming
- data modeling & management
- statistics & mathematics
- business knowledge & subject matter expertise
Any data scientist needs a good foundational skill set in all four areas. However, each one has almost infinite depth and detail. So, we also need a pillar to build upon. This pillar is our core strength: the one area where we are an expert. Choosing it deserves careful thought, but for most of us it really comes down to our pre-data-scientist background. It can also evolve slowly over the course of a career.
Let me clarify what each of the four skill areas means to me.
Programming
Almost any technical undertaking today requires advanced abilities to commune with the computer. While AI is approaching and GUIs have made many tasks easier, data scientists need solid, basic programming skills. Without them, we end up lost or dependent on others for too many tasks. Notice though, that I didn't label this as software engineering. Basic programming comes down to knowing a single language well enough to use some basic logic within it - loops, if-then-else, functions and possibly objects. Other handy things to know usually come down to common tasks - file I/O, data manipulation, calling mathematical libraries and the like. These skills can easily grow over time and with some practice. The easiest way to improve these skills is, without doubt, an online course.
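To make "basic programming" a little more concrete, here is a minimal sketch in Python. The language choice, the sample data and the names `classify` and `summarize` are my own assumptions for illustration, not a prescribed curriculum. It touches the building blocks listed above: a loop, if-then-else, functions and reading file-like input.

```python
import csv
import io

# Hypothetical sample data; in practice this would be a file on disk.
SAMPLE_CSV = """customer,balance
alice,1200.50
bob,-35.00
carol,410.25
"""


def classify(balance):
    """Label an account balance using plain if-then-else logic."""
    if balance < 0:
        return "overdrawn"
    elif balance < 500:
        return "low"
    else:
        return "healthy"


def summarize(handle):
    """Loop over CSV rows from a file-like object and count each label."""
    counts = {}
    for row in csv.DictReader(handle):
        label = classify(float(row["balance"]))
        counts[label] = counts.get(label, 0) + 1
    return counts


# io.StringIO stands in for open("accounts.csv") so the sketch is self-contained.
summary = summarize(io.StringIO(SAMPLE_CSV))
print(summary)
```

If a snippet like this reads comfortably, the programming zone probably has the foundation it needs; if not, it is a reasonable first target for practice.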
Data modeling & management
Data scientists spend a lot of time with data. They use it, analyze it, move it, massage it, improve on it, study it, mine it, and visualize it. One of the things that makes this easier - or harder - is how the data is organized. Knowing a series of common patterns for data modeling is critical for a data scientist. You don't need to be an expert in data modeling, but understanding relational databases, normalization, star schemas, and the basics of SQL will really help. In contrast, big data is all about... well, data. It has its own patterns, from HDFS to Spark, and draws on other concepts, libraries and formats such as JSON. I realize that's a lot of buzzwords and technical jargon, but then that is rather my point. Data is the basic resource a data scientist uses. Understanding how it is organized and stored - at least the basics - makes manipulating it and deriving value from it much, much easier.
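For readers who haven't met a star schema before, the idea can be sketched with Python's built-in `sqlite3` module. The table names (`fact_sales`, `dim_customer`, `dim_date`) and the rows are invented for illustration: one central fact table holds the measurements, and small dimension tables describe them.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")

# Hypothetical star schema: one fact table surrounded by two dimensions.
conn.executescript("""
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, segment TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE fact_sales (
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    amount REAL
);
INSERT INTO dim_customer VALUES (1, 'retail'), (2, 'corporate');
INSERT INTO dim_date VALUES (10, 2023), (11, 2024);
INSERT INTO fact_sales VALUES (1, 10, 100.0), (1, 11, 50.0), (2, 11, 400.0);
""")

# Basic SQL: join the fact table to its dimensions, then aggregate.
rows = conn.execute("""
SELECT c.segment, d.year, SUM(f.amount) AS total
FROM fact_sales AS f
JOIN dim_customer AS c ON c.customer_id = f.customer_id
JOIN dim_date AS d ON d.date_id = f.date_id
GROUP BY c.segment, d.year
ORDER BY c.segment, d.year
""").fetchall()

for segment, year, total in rows:
    print(segment, year, total)
```

The join-then-aggregate pattern in the query is the part worth recognizing; the same shape shows up in most reporting and feature-building work against relational data.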
Statistics & mathematics
This area is perhaps more controversial. I've been interviewing graduates of various machine learning programs across the US and I've found that the fundamental statistics and mathematical underpinnings of data science models have been a bit neglected by many programs. While terms like "backward propagation" are used with alacrity, listing three different probability distributions is more challenging. While a surface-level familiarity is enough to get by in this area, I don't think it's possible to know too much about it. Looking into the deeper implications of the normal distribution, employing exponentially-based survival models, examining autocorrelation, monitoring time series - these very different techniques can each have benefits, depending on the data and outcomes under consideration. Even more basic mathematics - derivatives, optimization, queuing theory, linear algebra - comes into play when trying to tweak models, improve upon them, or "think outside the box". At its core, all machine learning comes back to some pretty high-powered mathematics. The more you understand about it, the more flexibility you have in employing optimal models effectively.
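As one possible answer to the "three distributions" question, the sketch below draws samples from a normal, an exponential and a binomial distribution using only Python's standard library and compares each sample mean with its theoretical value. The parameters, sample size and seed are arbitrary choices of mine, picked purely for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so repeated runs give the same samples
N = 20_000       # arbitrary sample size

# Normal: mean 10, standard deviation 2.
normal = [random.gauss(10.0, 2.0) for _ in range(N)]

# Exponential with rate 0.5, so the theoretical mean is 1 / 0.5 = 2.
exponential = [random.expovariate(0.5) for _ in range(N)]

# Binomial(n=10, p=0.3) built from 10 yes/no trials; theoretical mean n * p = 3.
binomial = [sum(random.random() < 0.3 for _ in range(10)) for _ in range(N)]

sample_means = {
    "normal": statistics.mean(normal),
    "exponential": statistics.mean(exponential),
    "binomial": statistics.mean(binomial),
}
theoretical_means = {"normal": 10.0, "exponential": 2.0, "binomial": 3.0}

for name, mean in sample_means.items():
    print(f"{name}: sample mean {mean:.3f}, theory {theoretical_means[name]}")
```

Being able to say what each distribution typically models (measurement noise, waiting times, counts of successes) matters more than the code itself.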
Business knowledge & subject matter expertise
There are two parts to this area. On one hand, you need to be able to speak to a business person, understand their goal, pump them for information that could influence the model and persuade them of the accuracy and usefulness of your results. Many of my colleagues draw the line here. They feel strongly that communication with the business is essential, but actually knowing the business isn't necessary. For data scientists, they argue, the data "speaks for itself." I do agree that all of the above is necessary. I just don't agree that it's sufficient. I believe that if you want to be a truly great data scientist you need a "home turf" industry with which you are familiar. Mine is financial services. Sure, I could go out and interview stakeholders and build models in other areas; in fact, I've done so. However, when I'm working with data from a financial services firm, looking to help solve a problem in this domain, things move much, much faster. I know what to expect in the data. I know what data looks okay and what to be suspicious about. I know the "lingo" of the business and the culture - and that improves communications everywhere. Maybe I've convinced you of the need for some focus, maybe I haven't, but for many data scientists, subject matter expertise is a critical skill.
My advice to aspiring data scientists and career-changers
After reading through all those definitions, you deserve some actual advice, so here it comes! The important part when you're wading into the lake of data science is to stay focused. You don't want to drown in the multitude of things you don't (yet) know. Keep an eye on your core skills and grow them strategically so you can brave deeper and deeper water.
I find that for those starting their careers, technical skills are important. Make sure that you can program - and that you're not scared of programming! Focus on building a good background in mathematics and statistics when you're in school. Trust me, it gets harder when you're no longer immersed in an academic environment. Participate in hackathons, talk about school projects, try a Kaggle competition; generally, show aptitude and enthusiasm for data science and you'll get the first job in the field. After the first couple of years, start looking for ways to develop a focus on an industry that stirs your passion. This will both improve your job prospects and help to ensure you do things you love in the longer term.
For those changing careers, figure out which of the four zones is your current strength. Look at the other three zones and strategically plan what you need to do to build a minimal foundation in those areas to move into a full data science role. Look for night school courses at your local college or university, or for the more technical undertakings, look to online providers. Most importantly, don't wait to start networking. Conferences, LinkedIn, meetups: find a way to actively join the conversation with other data scientists as soon as possible.
No matter where you start or where you want to go, remember that being a data scientist is a commitment to life-long learning. New techniques, emerging technology, evolving ideas, innovation and awe-inspiring problems come along every day!