The boom in data science continues unabated. The work of gathering and analyzing data was once just for a few scientists back in the lab. Now every enterprise wants to use the power of data science to streamline their organizations and make customers happy.
The world of data science tools is growing to support this demand. Just a few years ago, data scientists worked with the command line and a few good open source packages. Now companies are creating solid, professional tools that handle many of the common chores of data science, such as cleaning up the data.
The scale is also shifting. Data science was once just numerical chores for scientists to do after the hard work of undertaking experiments. Now it’s a permanent part of the workflow. Enterprises now integrate mathematical analysis into their business reporting and build dashboards that generate smart visualizations, so they can quickly understand what’s going on.
The pace is also speeding up. Analysis that was once an annual or quarterly job is now running in real time. Businesses want to know what’s happening right now so managers and line employees can make smarter decisions and leverage everything data science has to offer.
Here are some of the top tools for adding precision and science to your organization’s analysis of its endless flow of data.
Jupyter Notebooks
These bundles of words, code, and data have become the lingua franca of the data science world. Static PDFs filled with unchanging analysis and content may still command respect because they create a permanent record, but working data scientists love to pop the hood and fiddle with the mechanism underneath. Jupyter Notebooks let readers do more than absorb.
The original versions of the notebooks were created by Python users who wanted to borrow some of the flexibility of Mathematica. Today, the standard Jupyter Notebook supports more than 40 programming languages, and it’s common to find R, Julia, or even Java or C within them.
The notebook code itself is open source, making it merely the beginning of a number of exciting bigger projects for curating data, supporting coursework, or just sharing ideas. Universities run some of their classes with the notebooks. Data scientists use them to swap and deliver ideas. JupyterHub offers a containerized, central server with authentication to handle the chores of deploying all your data science genius to an audience, so they don’t need to install or maintain software on their desktops or worry about scaling compute servers.
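To make the "bundle of words, code, and data" idea concrete, here is a minimal sketch of what a notebook looks like on disk. A notebook file is a JSON document, and the dictionary layout below follows version 4 of the notebook format. The helper names (`markdown_cell`, `code_cell`, `build_notebook`) and the sample content are made up for illustration; only Python's standard library is used.

```python
import json

# Hypothetical helpers: the names are ours, the dictionary layout
# follows version 4 of the notebook file format.
def markdown_cell(text):
    """A cell holding the words: explanatory prose in Markdown."""
    return {"cell_type": "markdown", "metadata": {}, "source": text}

def code_cell(source):
    """A cell holding the code, with room for the data it produces."""
    return {
        "cell_type": "code",
        "execution_count": None,  # the cell has not been run yet
        "metadata": {},
        "outputs": [],            # filled in when a kernel executes the cell
        "source": source,
    }

def build_notebook(cells):
    """Wrap a list of cells in the top-level notebook structure."""
    return {
        "cells": cells,
        "metadata": {
            "kernelspec": {
                "name": "python3",
                "display_name": "Python 3",
                "language": "python",
            }
        },
        "nbformat": 4,
        "nbformat_minor": 4,
    }

notebook = build_notebook([
    markdown_cell("# Quarterly sales\nA short note explaining the analysis."),
    code_cell("sales = [120, 135, 150]\nsum(sales) / len(sales)"),
])

# Serialize to the text that would be saved as an .ipynb file.
notebook_json = json.dumps(notebook, indent=1)
print(notebook_json)
```

Saving `notebook_json` to a file with an `.ipynb` extension should give a notebook that a reader can open, edit, and rerun, which is the part a static PDF cannot offer.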
Notebook lab spaces
Jupyter Notebooks don’t just run themselves. They need a home base where the data is stored and the analysis is computed. Several companies offer this support now, sometimes as a promotional tool and sometimes for a nominal fee. Some of the most prominent include Google’s Colab, GitHub’s Codespaces, Azure Machine Learning lab, JupyterLab, Binder, CoCalc, and Datalore, but it’s often not too hard to set up your own server underneath your lab bench.
While the core of each of these services is similar, there are differences that might be important. Most support Python in some way, but after that, local preferences matter. Microsoft’s Azure Notebooks, for instance, will also support F#, a language developed by Microsoft. Google’s Colab supports Swift, which can also be used for machine learning projects with TensorFlow. There are also numerous differences in the menus and other minor features on offer from each of these notebook lab spaces.
RStudio
The R language was developed by statisticians and data scientists to be optimized for loading working data sets and then applying all the best algorithms to analyze the data. Some like to run R directly from the command line, but many enjoy letting RStudio handle many of the chores. It’s an integrated development environment (IDE) for mathematical computation.
The core is an open-source workbench that enables you to explore the data, fiddle with code, and then generate the most elaborate graphics that R can muster. It tracks your computation history so you can roll back or repeat the same commands, and it offers some debugging support when the code won’t work. If you need some Python, it will also run inside RStudio.
The RStudio company is also adding features to support teams that want to collaborate on a shared set of data. That means versioning, roles, security, synchronization, and more.
Sweave and Knitr
Data scientists who write their papers in LaTeX will enjoy the complexity of Sweave and Knitr, two packages designed to integrate the data-crunching power of R or Python with the formatting elegance of TeX. The goal is to create one pipeline that turns data into a written report complete with charts, tables, and graphs.
The pipeline is meant to be dynamic and fluid but ultimately create a permanent record. As the data is cleaned, organized, and analyzed, the charts and tables adjust. When the result is finished, the data and the text sit together in one package that bundles together the raw input and the final text.
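The idea of one pipeline from data to typeset report can be sketched in a few lines. The example below is not Sweave or Knitr themselves, which weave R code chunks into a LaTeX source file. It is a hand-rolled Python illustration of the same principle: the table in the report is computed from the data, so when the data changes, the table changes with it. The measurements and function names are invented for the example.

```python
from statistics import mean

# Hypothetical measurements standing in for cleaned experimental data.
measurements = {
    "control": [4.1, 3.9, 4.3],
    "treated": [5.2, 5.0, 5.6],
}

def summary_rows(data):
    """Return (group, count, mean) tuples computed from the raw data."""
    return [(name, len(values), mean(values))
            for name, values in sorted(data.items())]

def render_report(data):
    """Build a complete LaTeX document whose table is derived from data."""
    table_lines = [f"{name} & {n} & {avg:.2f} \\\\"
                   for name, n, avg in summary_rows(data)]
    return "\n".join([
        r"\documentclass{article}",
        r"\begin{document}",
        r"\section*{Group means}",
        r"\begin{tabular}{lrr}",
        r"Group & $n$ & Mean \\ \hline",
        *table_lines,
        r"\end{tabular}",
        r"\end{document}",
    ])

report = render_report(measurements)
print(report)
```

Rerunning `render_report` after correcting or extending `measurements` regenerates the whole document, so the numbers in the text never drift away from the data behind them.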
Integrated development environments
Thomas Edison once said that genius was 1% inspiration and 99% perspiration. It often feels like 99% of data science is just cleaning up the data and preparing it for analysis. Integrated development environments (IDEs) are good staging grounds because they support mainstream programming languages such as C# as well as some of the more data science–focused languages like R. Eclipse users, for instance, can clean up their code in Java and then turn to R for analysis with rJava.
Python developers rely on PyCharm to integrate their Python tools and orchestrate Python-based data analysis. Visual Studio juggles regular code with Jupyter Notebooks and specialized data science options.
As data science workloads grow, some companies are building low-code and no-code IDEs that are tuned for much of this data work. RapidMiner, Orange, and JASP are just a few examples of excellent tools optimized for data analysis. They rely on visual editors, and in many cases it’s possible to do everything just by dragging around icons. If that’s not enough, a bit of custom code may be all that’s necessary.
Domain-specific tools
Many data scientists today specialize in specific areas such as marketing or supply-chain optimization, and their tools are following. Some of the best tools are narrowly focused on particular domains and have been optimized for the specific problems that confront anyone working in them.
For instance, marketers have dozens of good options that are now often called customer data platforms. They integrate with storefronts, advertising portals, and messaging applications to create a consistent (and often relentless) information stream for customers. The built-in back-end analytics deliver key statistics marketers expect in order to judge the effectiveness of their campaigns.
There are now hundreds of good domain-specific options that work at all levels. Voyant, for example, analyzes text to measure readability and find correlations between passages. AWS’s Forecast is optimized to predict the future for businesses using time-series data. Azure’s Video Analyzer applies AI techniques to find answers in video streams.
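For a sense of what "predicting the future from time-series data" means at its simplest, here is a sketch of simple exponential smoothing, a textbook forecasting technique. It is not the method used by AWS’s Forecast or any other product named above; the function name, the `alpha` parameter, and the order figures are assumptions chosen for illustration.

```python
def exponential_smoothing_forecast(series, alpha=0.5):
    """Forecast the next value with simple exponential smoothing.

    alpha close to 1 trusts recent observations more;
    alpha close to 0 trusts the long-run level more.
    """
    if not series:
        raise ValueError("series must contain at least one observation")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in the interval (0, 1]")

    level = series[0]  # start from the first observation
    for value in series[1:]:
        # Blend the new observation with the running level.
        level = alpha * value + (1 - alpha) * level
    return level

# Hypothetical weekly order counts.
weekly_orders = [100, 110, 105, 120]
next_week = exponential_smoothing_forecast(weekly_orders, alpha=0.5)
print(next_week)  # 112.5
```

Production forecasting services layer seasonality, trend, and related data sets on top of ideas like this one, which is much of what a domain-specific tool is selling.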
Hardware
The rise of cloud computing options has been a godsend for data scientists. There’s no need to maintain your own hardware just to run analysis occasionally. Cloud providers will rent you a machine by the minute just when you need it. This can be a great solution if you need a huge amount of RAM just for a day. Projects with a sustained need for long-running analysis, though, may find it’s cheaper to just buy their own hardware.
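The rent-or-buy question comes down to a break-even calculation. The sketch below uses invented prices, not quotes from any provider, and it ignores factors such as depreciation, staff time, and idle capacity. The function name and parameters are hypothetical.

```python
def break_even_hours(purchase_price, hourly_rental, hourly_running_cost=0.0):
    """Hours of use at which buying costs the same as renting.

    purchase_price: up-front cost of owning the machine
    hourly_rental: cloud price per hour for a comparable machine
    hourly_running_cost: power, cooling, etc. per hour when you own it
    """
    saving_per_hour = hourly_rental - hourly_running_cost
    if saving_per_hour <= 0:
        # Owning costs at least as much per hour as renting,
        # so the purchase price is never recovered.
        return float("inf")
    return purchase_price / saving_per_hour

# Hypothetical figures: a 12,000 machine versus 3.00 per hour in the cloud,
# with 0.50 per hour in running costs when owned.
hours = break_even_hours(purchase_price=12000.0,
                         hourly_rental=3.0,
                         hourly_running_cost=0.5)
print(hours)       # 4800.0 hours
print(hours / 24)  # 200.0 days of continuous use
```

Under these assumed numbers, a job that runs around the clock would pay off the purchase in about 200 days, while a job that runs a few hours a month would never get close.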
Lately more specialized options for parallel computing jobs have been appearing. Data scientists sometimes use graphics processing units (GPUs) that were once designed for video games. Google makes specialized Tensor Processing Units (TPUs) to speed up machine learning. Nvidia calls some of their chips “Data Processing Units” or DPUs. Some startups, such as d-Matrix, are designing specialized hardware for artificial intelligence. A laptop may be fine for some work, but large projects with complex calculations now have many faster options.
Data
The tools aren’t much good without the raw data. Some businesses are making it a point to offer curated collections of data. Some want to sell their cloud services (AWS, GCP, Azure, IBM). Others see it as a form of giving back (OpenStreetMap). Some are US government agencies that see sharing data as part of their job (Federal repository). Others are smaller, like the cities that want to help residents and businesses succeed (New York City, Baltimore, Miami, or Orlando). Some just want to charge for the service. All of them can save you trouble finding and cleaning the data yourself.