Learn about nine Python libraries for data science and how to install them.
Python is an object-oriented programming language with easy syntax and powerful tools for application development, machine learning, and data science. One reason Python is so useful for data science is its open-source ecosystem: anyone can develop, publish, and download Python libraries for data science. Python has an extensive list of free, open-source libraries and an active community willing to help troubleshoot and provide guidance to users at any level.
According to the TIOBE Index, which uses search engine results to rank programming languages, Python ranked at the top of the list in April 2024, making it one of the most popular programming languages in the world [1]. Explore nine Python libraries you can use for data science, how to install them, and how to start using them.
Python libraries give data scientists access to a range of tools to help them manipulate, analyse, mine, and visualise data in a simple, straightforward manner. Each Python library contains sets of code, classes, values, and templates you can download to add functionality to Python to make analysing data more efficient. Nine Python libraries you can use for data science include:
NumPy
pandas
SciPy
Matplotlib
Seaborn
Pillow
Plotly
Scrapy
AutoViz
Whilst hundreds of thousands of Python libraries are available, each library in this list has a unique set of tools that often work together to perform high-level data science computations. Read on to explore how data scientists use each library and the commands that install them.
Install commands: conda install numpy or pip install numpy
NumPy is a scientific computing package for creating and computing with multidimensional arrays and matrices. It includes routines for Fourier transforms, statistics, linear algebra, and more. NumPy’s tools allow you to manipulate and compute with large data sets efficiently and at a high level.
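As a rough illustration of the features listed above, the following sketch solves a small linear system, computes some statistics, and takes a Fourier transform. The array values are invented for the example and are not from any real data set.

```python
import numpy as np

# A small 2-D array (matrix); the values are made up for illustration.
matrix = np.array([[2.0, 1.0],
                   [1.0, 3.0]])
rhs = np.array([3.0, 5.0])

# Linear algebra: solve matrix @ x = rhs for x.
solution = np.linalg.solve(matrix, rhs)

# Statistics: the mean of each column and the overall standard deviation.
column_means = matrix.mean(axis=0)
overall_std = matrix.std()

# Fourier transform of a short sampled signal.
signal = np.array([0.0, 1.0, 0.0, -1.0])
spectrum = np.fft.fft(signal)

print(solution)
print(column_means)
print(spectrum)
```

Each operation works on whole arrays at once, which is what lets NumPy avoid slow element-by-element Python loops.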
Install commands: conda install -c conda-forge pandas or pip install pandas
pandas uses expressive data structures to make working with labelled data more efficient. It simplifies data analysis in Python by representing missing data explicitly, letting you insert or delete data, and organising data into data frames that you can merge, join, and concatenate. It also has useful input/output (I/O) functionality, which you can use to import data directly from CSV files, Excel files, and databases.
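The sketch below walks through those features on a tiny, invented table. It reads CSV text from memory rather than from a file so that it runs anywhere; the column names and values are assumptions made for the example.

```python
import io

import pandas as pd

# In-memory CSV text stands in for a file on disk; the data is invented.
csv_text = "city,temp_c\nLeeds,11.5\nYork,\nBath,13.0\n"
temps = pd.read_csv(io.StringIO(csv_text))

# The empty field for York is read as NaN, the marker pandas uses for missing data.
missing_count = int(temps["temp_c"].isna().sum())

# Fill the missing value with the mean of the observed values.
filled = temps.fillna({"temp_c": temps["temp_c"].mean()})

# Insert a new column, then produce a copy with it deleted again.
filled["temp_f"] = filled["temp_c"] * 9 / 5 + 32
without_f = filled.drop(columns=["temp_f"])

# Merge with a second table on the shared "city" column.
regions = pd.DataFrame({"city": ["Leeds", "York", "Bath"],
                        "region": ["North", "North", "South West"]})
merged = filled.merge(regions, on="city")

# Concatenate two data frames row-wise.
stacked = pd.concat([temps, temps], ignore_index=True)

print(merged)
```

To read a real file, you would pass its path to pd.read_csv in place of the in-memory text.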
Install commands: conda install scipy or python -m pip install scipy
SciPy is a scientific computing package with high-level algorithms for optimisation, integration, differential equations, eigenvalue problems, linear algebra, and statistics. It builds on NumPy arrays and adds specialised data structures of its own, such as sparse matrices. This gives you an even wider range of ways to analyse and compute data.
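A minimal sketch of two of these algorithm families, optimisation and integration, plus a sparse matrix as one example of a SciPy data structure. The functions being minimised and integrated are toy examples chosen because their exact answers are known.

```python
import numpy as np
from scipy import integrate, optimize, sparse

# Optimisation: minimise (x - 2)^2, whose minimum is at x = 2.
result = optimize.minimize_scalar(lambda x: (x - 2.0) ** 2)

# Integration: integrate x^2 from 0 to 3 (the exact answer is 9).
area, abs_error = integrate.quad(lambda x: x ** 2, 0.0, 3.0)

# A sparse matrix stores only its non-zero entries.
identity = sparse.identity(4, format="csr")
product = identity @ np.arange(4.0)

print(result.x)
print(area)
print(product)
```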
Install commands: conda install matplotlib or python -m pip install -U matplotlib
Matplotlib is an essential tool for data science visualisations because it creates various data plots and graphs in print-ready formats. It can plot pairwise data, statistical distributions, gridded data, irregularly gridded data, and 3D volumes. Matplotlib works in Python scripts, Jupyter Notebook, web applications, and graphical user interface (GUI) toolkits, which makes it a versatile visualisation tool for data scientists.
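As a small illustration of producing a print-ready file from a script, the sketch below draws a line plot and saves it as a PNG. The data, the file name squares.png, and the choice of the temporary directory are assumptions for the example; the Agg backend is selected so that the script also runs on machines without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Invented pairwise data: x against x squared.
x_values = [0, 1, 2, 3, 4]
y_values = [v ** 2 for v in x_values]

fig, ax = plt.subplots()
ax.plot(x_values, y_values, marker="o", label="y = x squared")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()

# Save the figure to a hypothetical location in the temporary directory.
output_path = os.path.join(tempfile.gettempdir(), "squares.png")
fig.savefig(output_path, dpi=150)
plt.close(fig)

print(output_path)
```

In Jupyter Notebook you would normally skip the backend line and let the figure display inline.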
Install commands: conda install seaborn or pip install seaborn
Seaborn is a library built on top of Matplotlib that makes statistical graphics more straightforward. It works with pandas data structures, maps variables in your data to visual characteristics of the plot, creates a legend automatically, and performs the statistical aggregation a plot needs. This makes it an important tool if you want to create high-quality plots and statistical summaries at the same time.
Install commands: conda install anaconda::pillow or python3 -m pip install --upgrade Pillow
Pillow is the actively maintained fork of the older Python Imaging Library (PIL). It allows you to manipulate image pixels directly, and you can combine it with NumPy and SciPy for computations. Pillow is a useful image-processing tool that you can use directly within a Python interpreter. Like other image-processing applications, it can convert files, resize images, create thumbnails, perform colour space conversions, and compute statistics on images.
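The following sketch tries several of those operations on an image created in memory, so no image file is needed. The image size and colour are arbitrary choices for the example.

```python
from PIL import Image, ImageStat

# Create a small solid red RGB image in memory (200 pixels wide, 100 tall).
image = Image.new("RGB", (200, 100), color=(255, 0, 0))

# Resize to exact dimensions.
resized = image.resize((100, 50))

# thumbnail() shrinks the image in place and keeps the aspect ratio.
thumb = image.copy()
thumb.thumbnail((64, 64))

# Colour space conversion: RGB to 8-bit greyscale ("L" mode).
grey = image.convert("L")

# Simple statistics: the mean value of each colour band.
band_means = ImageStat.Stat(image).mean

print(resized.size, thumb.size, grey.mode, band_means)
```

With a real file, Image.open would replace Image.new, and the save method would write the result in another format.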
Install commands: conda install -c plotly plotly=5.20.0 or pip install plotly==5.20.0
Similar to Matplotlib, Plotly produces high-quality graphs, charts, plots, polar graphs, and more. This library also helps you create interactive and print-ready plots. Plotly is a useful library for data visualisations that you can display directly in Jupyter Notebook or Dash, or save as HTML files.
Install commands: conda install -c conda-forge scrapy or pip install Scrapy
Scrapy is a web scraping and extraction tool for data mining. Its use extends beyond just scraping websites; you can also use it as a web crawler and to extract data from APIs, HTML, and XML sources. You can export scraped data as JSON, CSV, or XML files and store them on a local disk or send them to a remote server through file transfer protocol (FTP).
Install commands: conda install conda-forge::autoviz or pip install autoviz
AutoViz helps data scientists find patterns in their data through automated exploratory data analysis. It can be used to train beginner data scientists to see important patterns, or if you are more advanced, it can help ensure that you don’t miss anything crucial. It makes plotting easy, speeds up plot generation with less code, works with any size data set, and even gives a quality assessment of the data. It analyses any CSV or JSON files, and it can work with a pandas data frame.
To start using Python libraries for data science, install some or all of the Python libraries above and use them on your own data. The following steps give you an overview of how to install Python libraries.
If Python is not already on your computer, one of the simplest ways to install Python and its libraries is the Anaconda distribution, which bundles Python with conda, an open-source package and environment manager that you run with the conda command. It makes installing packages simpler by allowing you to use a command line prompt or a graphical user interface to create environments and install Python libraries.
Alternatively, if you already have Python installed or want to install packages as you need them, you can use pip to install packages from the Python Package Index (PyPI).
To install packages into a particular virtual environment, create a new conda environment or use an existing conda environment for Anaconda. Or, if you are using regular Python, you can use venv.
To create a new conda virtual environment in the terminal:
conda create --name conda-env python
To activate the conda environment:
conda activate conda-env
To create a Python venv in the terminal:
Mac/Linux: python -m venv /path/to/new/virtual/environment
Windows: python -m venv c:\path\to\myenv
To activate that venv in the terminal:
Mac/Linux: source /path/to/new/virtual/environment/bin/activate
Windows: c:\path\to\myenv\Scripts\activate
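If you would rather create the environment from a script, Python's built-in venv module offers the same functionality as the python -m venv command. The sketch below is one way to do it, assuming a hypothetical environment called name-env inside a temporary directory; it skips installing pip to keep the example fast, whereas the command line version installs pip by default.

```python
import os
import tempfile
import venv

# Hypothetical location for the new environment; any writable path works.
env_dir = os.path.join(tempfile.mkdtemp(), "name-env")

# with_pip=False is a shortcut for this sketch; use with_pip=True for real work.
# Symlinks are used everywhere except Windows.
venv.create(env_dir, with_pip=False, symlinks=(os.name != "nt"))

# The activate script lives in Scripts on Windows and in bin elsewhere.
scripts_dir = "Scripts" if os.name == "nt" else "bin"
activate_path = os.path.join(env_dir, scripts_dir, "activate")
config_path = os.path.join(env_dir, "pyvenv.cfg")

print(activate_path)
```

The printed path is the script you would run, or source on Mac/Linux, to activate the environment in a terminal.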
Once your virtual environment is active, find and install the Python libraries you want to use within that environment. Each library documents its own installation commands, but the general process uses conda commands in Anaconda environments or pip commands in venv environments.
Ensure your conda virtual environment is active using the steps above. Using NumPy as an example, you can install a library using its specific commands:
conda install numpy
Ensure your venv virtual environment is active using the steps above. Using NumPy as an example, you can install a library using its commands:
pip install numpy
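To confirm that an installation worked, you can ask Python which version of a package it can see in the active environment. This is a minimal sketch that uses only the standard library; the helper name installed_version and the list of packages checked are choices made for the example. The virtual environment check applies to venv environments only, not to conda environments.

```python
import sys
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# True when the interpreter is running inside a venv (not reliable for conda).
in_venv = sys.prefix != sys.base_prefix
print("Running inside a venv:", in_venv)

for name in ["numpy", "pandas", "scipy"]:
    version = installed_version(name)
    print(name, version if version else "not installed")
```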
After you install your packages, deactivate the virtual environment using either:
conda deactivate
Or, if you’re using regular Python:
deactivate
Now, you can start using Python libraries for data science. Reactivate the environment whenever you want to use the packages installed in it.
Python libraries are powerful tools for data science users to mine, analyse, and visualise data to find patterns. To begin developing in-demand skills using Python, try the Python for Everybody Specialisation from the University of Michigan. You can also expand your data science skills through a course like the IBM Data Science Professional Certificate. Both are available on Coursera.
1. TIOBE. “TIOBE Index for April 2024,” https://www.tiobe.com/tiobe-index/. Accessed 5 May 2024.
Editorial Team
Coursera’s editorial team is comprised of highly experienced professional editors, writers, and fact...