freeprogrammingbooks.com

Data Science at the Command Line: Obtain, Scrub, Explore, and Model Data with Unix Power Tools

By Jeroen Janssens

Data Science at the Command Line: Obtain, Scrub, Explore, and Model Data with Unix Power Tools by Jeroen Janssens is a practical guide to performing end-to-end data science tasks using command-line tools.

Data science workflows often rely on high-level programming environments such as Python or R. However, many data tasks begin long before code is written in a notebook.

Extracting, cleaning, transforming, and inspecting raw data are foundational steps that directly affect analysis quality. Command-line tools, rooted in the Unix philosophy of combining small, focused utilities, offer an efficient and reproducible way to handle these tasks.

Command-line data processing remains highly relevant: it supports scalable pipelines, automation, and integration across systems, which makes it a valuable skill for data scientists, analysts, and engineers.

About the book

The book demonstrates how to combine small yet powerful Unix utilities to obtain, scrub, explore, and model data efficiently. It emphasizes workflow integration and reproducibility, showing how command-line techniques can complement environments such as Python, R, Jupyter, RStudio, and Apache Spark.

The content is suitable for data scientists, analysts, engineers, system administrators, and researchers. Readers are expected to have a basic familiarity with data science concepts. Prior command-line experience is helpful but not strictly required, as the book includes a “Getting Started” chapter and provides a Docker image containing over 100 Unix power tools for cross-platform use.

What you will learn

By working through this book, readers will learn how to:

  • Obtain data from websites, APIs, databases, and spreadsheets
  • Scrub and transform text, CSV, HTML, XML, and JSON files
  • Explore datasets, compute descriptive statistics, and generate visualizations
  • Manage data science workflows using tools such as Make
  • Build custom command-line tools from one-liners and existing Python or R code
  • Parallelize and distribute data-intensive pipelines
  • Apply modeling techniques including dimensionality reduction, regression, and classification
  • Integrate command-line workflows with Python, Jupyter, R, RStudio, and Apache Spark

The book presents the command line as an agile and extensible environment for handling real-world data. It focuses not only on individual commands but also on assembling them into coherent, reproducible pipelines.

Table of contents

  • Welcome
  • Foreword
  • Preface
  • 1 Introduction
  • 2 Getting Started
  • 3 Obtaining Data
  • 4 Creating Command-line Tools
  • 5 Scrubbing Data
  • 6 Project Management with Make
  • 7 Exploring Data
  • 8 Parallel Pipelines
  • 9 Modeling Data
  • 10 Polyglot Data Science
  • 11 Conclusion
  • List of Command-Line Tools

Book details

  • Title: Data Science at the Command Line: Obtain, Scrub, Explore, and Model Data with Unix Power Tools
  • Author: Jeroen Janssens
  • Main category: Data Science
  • Subcategory: Data Analysis
  • License: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License


Legal notice: This book is shared for educational purposes only. The content is distributed under Creative Commons licenses or with explicit permission from the author. FreeProgrammingBooks may host files that comply with their respective licenses.