Getting into Python

How to get into Python, the most widely used programming language for data science.

Author

Daniel Kapitan

Published

December 21, 2023

Why Python for data science?

Python has emerged over the last couple of decades as a first-class tool for scientific computing tasks, including the analysis and visualization of large datasets. This may have come as a surprise to early proponents of the Python language: the language itself was not specifically designed with data analysis or scientific computing in mind. The usefulness of Python for data science stems primarily from the large and active ecosystem of third-party packages, particularly:

  • NumPy for manipulation of homogeneous array-based data;
  • Pandas for manipulation of heterogeneous and labeled data, and the more recent high-performance data processing libraries such as polars and ibis;
  • SciPy for common scientific computing tasks;
  • Matplotlib for publication-quality visualizations;
  • IPython for interactive execution and sharing of code;
  • Scikit-Learn for machine learning.
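To make the distinction between homogeneous arrays and heterogeneous, labeled data concrete, here is a minimal sketch. It assumes NumPy and pandas are installed; the city names and temperatures are made-up illustration values, not data from any real source.

```python
import numpy as np
import pandas as pd

# NumPy: every element of an array shares one dtype (homogeneous data).
temperatures = np.array([21.5, 23.0, 19.8])
mean_temperature = temperatures.mean()

# pandas: each column can have its own dtype, and columns carry labels.
# The city names below are illustrative only.
cities = pd.DataFrame(
    {"city": ["Amsterdam", "Utrecht", "Delft"], "temp_c": temperatures}
)

# Select rows by a condition and a column by its label.
warm_cities = cities.loc[cities["temp_c"] > 20, "city"].tolist()

print(mean_temperature)
print(warm_cities)
```

The array holds only floats, while the table mixes text and numbers and lets you refer to columns by name.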

How much Python should I know?

As with any other (programming) language, mastering Python takes years, which is beyond the scope of this anthology. Instead, our objective is a working knowledge of Python that is sufficient to learn and apply machine learning. To make this explicit, we take the following book and online resources as our point of reference.

  • A Whirlwind Tour of Python (page numbers refer to the PDF version):
    • Know how to install and use Python on your own computer (pages 1 to 13)
    • Know basic semantics of variables, objects and operators (pages 13 to 24)
    • Know built-in simple values and data structures (pages 24 to 37)
    • Know how to use control flow and functions (pages 37 to 45)
    • Know how to iterate and use list comprehensions (pages 52 to 61)
  • Python for Data Analysis
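As a rough indication of the level meant by the topics listed above, the following sketch combines built-in data structures, a function, control flow, and a list comprehension. It uses only the standard language; the sensor names, values, and the `is_high` helper are hypothetical and serve only as illustration.

```python
# Built-in data structures: a list of dictionaries (made-up values).
measurements = [
    {"sensor": "a", "value": 3.2},
    {"sensor": "b", "value": 7.9},
    {"sensor": "c", "value": 5.1},
]


def is_high(value, threshold=5.0):
    """Return True when value exceeds the threshold."""
    return value > threshold


# Control flow with an explicit loop ...
high_loop = []
for m in measurements:
    if is_high(m["value"]):
        high_loop.append(m["sensor"])

# ... and the equivalent list comprehension.
high_comprehension = [m["sensor"] for m in measurements if is_high(m["value"])]

print(high_loop)
print(high_comprehension)
```

If you can read this snippet and predict that both lists contain the same sensors, you are close to the working knowledge aimed for here.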
PCEP™ – Certified Entry-Level Python Programmer

The learning path proposed here is similar to the PCEP™ – Certified Entry-Level Python Programmer certification. The PCEP™ certification is a good way to assess your current Python knowledge and to prepare for the Machine Learning Foundation course. The certification is offered by the Python Institute. You may opt to obtain this certificate.

How should I learn Python?

RealPython.com is the recommended online learning environment for Python. We have collated a learning path for data science.

Which Python environment should I use?

The options for getting started with Python are listed below.

For those new to Python, it is probably easiest to start with one of these online notebook environments:

  • Deepnote: there is a generous free tier. If you decide to upgrade, you can collaborate and share notebooks privately.
  • Google Colab:

Once you have gained some traction, you can move on to install Python on your local machine.

Visual Studio Code is the recommended data science workbench. To set up your local machine or laptop for data science and machine learning, do the following:

Guidelines for using Python for data science

Using Python for data science is inherently different from using it for, say, building a website. To give you some guidance on the many different ways and styles of using Python, please consider the following:

  • Focus on using existing data science libraries, instead of writing your own basic functions. If you find yourself spending a lot of time reading documentation, you are on the right track.
  • Take a functional approach to programming instead of an object-oriented approach. The former is more fitting for data science, where it is common to structure your work in terms of pipelines and think about each processing step as a function. The latter is more suitable for application development.
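The functional, pipeline-oriented style described above could look roughly like the sketch below. It is one possible illustration in plain Python, not a prescribed pattern: the step functions, the `pipeline` helper, and the sample records are all hypothetical.

```python
from functools import reduce


def drop_missing(rows):
    """Keep only the rows that have an age."""
    return [r for r in rows if r["age"] is not None]


def add_age_group(rows):
    """Return new rows with an added 'group' field; inputs are not mutated."""
    return [
        {**r, "group": "adult" if r["age"] >= 18 else "minor"} for r in rows
    ]


def count_by_group(rows):
    """Count the rows per group."""
    counts = {}
    for r in rows:
        counts[r["group"]] = counts.get(r["group"], 0) + 1
    return counts


def pipeline(data, *steps):
    """Apply each step in order, feeding the output of one into the next."""
    return reduce(lambda acc, step: step(acc), steps, data)


# Made-up sample records.
raw = [
    {"name": "Ann", "age": 34},
    {"name": "Bob", "age": None},
    {"name": "Cy", "age": 12},
    {"name": "Di", "age": 51},
]

result = pipeline(raw, drop_missing, add_age_group, count_by_group)
print(result)
```

Each step is a small function that takes data in and returns new data, so steps can be tested, reordered, or replaced independently. Data science libraries often support the same style directly; pandas, for example, provides `DataFrame.pipe` for chaining functions over a table.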

For those wanting to further develop their Python skills for data science, the following books are recommended:

More on Python