Created by Neal Caplin
2021-12-13
The world has never collected, stored or managed as much data as it does today, and because of this, data scientists dealing with big data have become increasingly important. As the volume, variety and velocity of this data have continued to grow, the role of data analysis has expanded into all walks of life. But as important as these jobs are, we often find that people don't really know what data scientists do, or that their functions are misunderstood. So here at J On the Beach we have spoken to a number of data scientists in the tech industry in order to discover more about these crucially important cogs in the machinery of big data and shine a light on how they keep the industry running smoothly.
A Day in the Life
The industry stretches across many sectors, from more traditional areas like IT, finance, telecoms, healthcare, energy and marketing to more specialised areas such as e-commerce, consulting, tourism, insurance and real estate. Team sizes are similarly mixed: over a third work in a team of 5 to 10 people (the most typical size), almost a quarter work with more than 10 people, another quarter work with 1 to 4 people and almost 10% work alone. So we asked our participants, what is a typical day like for you?
A common feature is starting off with a daily team or stand-up meeting, something which establishes goals for the day or week. Juan, who works in e-commerce, tells us he starts with “…task and meeting organisation for the current and next 2 days and a scrum meeting”. A typical morning could consist of “…main company KPI review, specific KPI analysis, data analysis and forecasting”, and afternoons often involve what he calls “…Monkey tasks: Automatization, data review, and creating new tools.”
Other common answers we heard were analysing and cleaning data to extract value, often for application in deep or machine learning models; designing and implementing DS systems; applying ML & DL algorithms; analysing databases to create prediction models; working on and developing Python code to create data pipelines; testing and improving algorithms; and often meeting with colleagues or clients throughout the day.
These meetings, especially those with clients, are typical of companies that don’t follow the traditional waterfall methodology. Indeed, when we asked, almost 80% said they follow an agile methodology with their colleagues and/or their final users, with a third of respondents following it with both.
Jennifer, a specialist in natural language processing, says: “I need to be in permanent contact with people without DS knowledge.” It’s similar for Iker, who works in the energy sector: “As I work building tools for analysts, I do usually have to contact them to get their inputs so that I can start working with the data. I usually work preparing the data and exploring new data integrations into the production models we have.” For others, the daily meetings are still the main point of contact throughout the day. “I work individually on my code, developing what is expected of me for the day, while having meetings with the team or final users to solve and clarify some aspects,” says Mara.
Some of the less common responses included task automation, chart analysis, dealing with NLO patterns, creating dashboards, NLP projects and writing articles. Angel works in the space sector, so his tasks are more specialised. “I work on the development of geospatial products using satellite imagery and other spatial data, and Artificial Intelligence algorithms. My current work includes the whole chain of design and implementation of prototypes in the geospatial data science team, as well as data engineering, which is a key step when working with satellite data.”
A few participants told us they also worked on product research and wrote scientific papers, but this is atypical in Spain, as work in this area is neither well-funded nor well-regarded and so is not commonly undertaken.
Welcome to the Machine
When asked which machine learning frameworks they had used in the last 5 years, respondents gave 3 clear winners: pandas (93%), NumPy (89%) and scikit-learn (85%). When we asked how many had used machine learning to solve problems, 94% responded with a resounding yes. What’s more, over 50% of those said they also found it easy and almost another 20% said it depended.
Jo works in tourism, but the difficulties she faces can be found in a range of sectors: “Solving a big problem or challenge involves breaking it down into smaller parts. A typical day looks at identifying smaller segment opportunities for proof of concept and deploying models for online experimentation. From there, scale to wider segments or optimise the solution.” However, there are some sector-specific issues too, as we learn from Pablo, who works in the medical industry: “The biggest problem is complying with the regulations and restrictions that come with working with medical data and so we are developing an API that will make it easier.”
In terms of the data, we were interested in two things. First, we asked what types of data they typically interacted with. As this graph shows, the 4 most common types are categorical, time series, tabular and text data with a big drop-off to the other 5 types. We also enquired whether they engaged in real-time data processing or not, with only 24% saying yes, 21% saying sometimes and 55% holding that they never did.
Sharing is Caring
Of course, accessing data is all well and good, but handling algorithms is often a team effort. So we asked two questions: firstly, whether they use Git to manage their algorithms, and secondly, how they usually share their algorithms with their colleagues. The answers showed that 89% use Git or a similar versioning tool to manage their algorithms, and more than half of those who responded use some form of Git as their primary algorithm sharing tool.
We were also curious if they use GPUs to train models, and only just over a quarter categorically stated that they did. The remaining respondents were split evenly, with 37% each saying they sometimes did or they never did.
Tooling Around
Although there are a variety of roles and sectors within the industry, there are certain skills which are desirable across the board. More than three quarters of our participants named Machine Learning, Statistics and SQL as crucially important, with over 95% citing Machine Learning as the most important. Two thirds of our participants named analysing and understanding data to influence product or business decisions and building prototypes to explore applying machine learning to new areas as common activities. 40% also referenced the ability to build and/or run the data infrastructure that their business uses for storing, analysing, and operationalizing data as necessary for their jobs.
Of course, just as important as the skills are the tools used to execute the various tasks.
Almost half of the people we spoke to mentioned cloud-based software, the most popular being AWS, but over 80% pointed to integrated development environments as their primary tool.
The top 3 IDEs used were Jupyter, RStudio and PyCharm, as demonstrated in the following graph:
With a Little Help From My Friends
When asked if there were any extra means in the budget for solving problems, three quarters told us that there weren’t. Only 10% of our group said they had the freedom to spend their budget on resources themselves, with another 10% having to obtain permission from a boss. However, when asked about solving job tasks, we found out that it’s very common to turn to other resources for help. Stack Overflow was named as a go-to resource by almost everybody, with only 4 people failing to mention it. Other sources of aid were blogs, forums, books and simply asking colleagues, as shown on the chart below.
Another common way of exchanging information is via meetups. These have long been a way for like-minded individuals in the industry to interact with each other, and the 3 most popular meetups are DataScience, Python Meetup and DataBeers. There are also groups targeted towards particular subsets within the industry; Cecilia mentions Ladies In Python, Marta cites Big Girls Theory, while Ana says she not only collaborates with groups such as RLadies and Women in ML, but is also an organiser of PyLadiesMadrid.
Equally, we wanted to find out which websites the data scientists visited to learn new things in their field. Overwhelmingly, the most popular options were the Towards Data Science blog (71%) and forums such as Kaggle and fast.ai (60%). The other sources are detailed in this chart:
Soft Sell
It’s also very important that you are in business with trusted vendors for your software, so we queried whether our participants preferred to interact with software vendors by downloading software and using it on their own machines without sharing their email address, or by utilising web-based trials that required registration with an email address. While 25% preferred to download and 20% expressed a preference for web-based trials, the majority said they didn’t mind either way.
We also probed further to find out if and how they use the internet to research vendors. Checking online information and forums in online communities was the most common method, with 52% following this process. 11% used the internet in another way, utilising online platforms such as Gartner, Forrester and G2. The remaining respondents preferred to either ask friends (18%) or ask vendors and do the trials to conduct their research (19%).
Holding Out for A Hero
As talented as the people we surveyed are, there are inevitably giants of the field to whom data scientists look up. Although a wide array of professionals were referenced, Professor Andrew Ng was named by the majority of those surveyed as someone they followed or admired, and it's no surprise considering he was a co-founder of Google Brain and a crucial part of building up the Artificial Intelligence Group at Baidu. And although this is still an overwhelmingly male-dominated industry, as seen by multiple suggestions of people like Jeremy Howard and Yoshua Bengio, it is encouraging that scientists such as Kamelia Aryafar, Nuria Oliver and Lisa Winter were also named as people admired and respected in the field. And now that we know who the most popular and respected figures in the field are, we're going to do our best to book as many of them as possible!
As you can see, these workers are not just one homogeneous stereotype. The work they do and the industries they represent cover a wide spectrum and they are a fundamental part of the modern world. For many, their jobs encompass more than one role. Pablo tells us “I not only have the role of data scientist, but also software developer according to which is needed. We usually start the day looking for new features to apply because of the data we have stored in Bigquery. All our data comes from what our users (and their final users) do.” It’s a similar story for Jennifer: “My job has two very different approaches - one is working with data, which is sometimes boring and repetitive. The other approach is related to drafting models, algorithm design, etc. which is creative and intellectually rewarding.”
As in any profession, job satisfaction is integral to a data scientist's happiness, and thankfully three quarters of those asked said users don't just understand but also value the work that they do. This is vitally important going forward, because without these key workers we would all be a lot more lost.
The Lost Tapes
As part of our process, a series of interviews was carried out over the last year with some data scientists, who gave us valuable insight into how they work. We had intended to include extracts of these interviews in this blog to add further information and increase the depth of our findings. Regrettably, due to an unfortunate series of events, these tapes were lost and could not be recovered. However, we would like to thank Clara Higuera, Gonzalo Estrán, Pelayo Arbués, Yasser Aoujil and Jonathan Espinosa, among others, for the time and energy they put in to help us with our research.
If you would like to see the full collection of data we gathered you can download the Excel file of the survey at this link
Neal Caplin
J On The Beach Organiser