Name: Data Science at the Command Line: Obtain, Scrub, Explore, and Model Data with Unix Power Tools
Author: Jeroen Janssens

Data science workflows often rely on high-level programming environments such as Python or R. However, many data tasks begin long before code is written in a notebook.

Extracting, cleaning, transforming, and inspecting raw data are foundational steps that directly affect analysis quality. Command-line tools, rooted in the Unix philosophy of combining small, focused utilities, offer an efficient and reproducible way to handle these tasks.

Command-line data processing remains highly relevant because it supports scalable pipelines, automation, and integration across systems. That makes it a valuable skill for data scientists, analysts, and engineers.

About the book

Data Science at the Command Line: Obtain, Scrub, Explore, and Model Data with Unix Power Tools by Jeroen Janssens is a practical guide to performing end-to-end data science tasks using command-line tools.

The book demonstrates how to combine small yet powerful Unix utilities to obtain, scrub, explore, and model data efficiently. It emphasizes workflow integration and reproducibility, showing how command-line techniques can complement environments such as Python, R, Jupyter, RStudio, and Apache Spark.

The content is suitable for data scientists, analysts, engineers, system administrators, and researchers. Readers are expected to have a basic familiarity with data science concepts. Prior command-line experience is helpful but not strictly required, as the book includes a “Getting Started” section and provides a Docker image containing over 100 Unix power tools for cross-platform use.

What you will learn

By working through this book, readers will learn how to:

Obtain data from websites, APIs, databases, and spreadsheets
Scrub and transform text, CSV, HTML, XML, and JSON files
Explore datasets, compute descriptive statistics, and generate visualizations
Manage data science workflows using tools such as Make
Build custom command-line tools from one-liners and existing Python or R code
Parallelize and distribute data-intensive pipelines
Apply modeling techniques including dimensionality reduction, regression, and classification
Integrate command-line workflows with Python, Jupyter, R, RStudio, and Apache Spark

The book presents the command line as an agile and extensible environment for handling real-world data. It focuses not only on individual commands but also on assembling them into coherent, reproducible pipelines.
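
To make the pipeline idea concrete, here is a minimal sketch in Python of three small, single-purpose stages chained together, in the spirit of a shell pipeline. It is an illustration only and is not taken from the book: the sample data and the function names (`read_rows`, `select`, `count`, `pipeline`) are hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data standing in for a raw CSV file.
RAW = """name,city
Ada,London
Grace,New York
Alan,London
"""


def read_rows(text):
    """Obtain: parse CSV text into one dictionary per row."""
    yield from csv.DictReader(io.StringIO(text))


def select(rows, column):
    """Scrub: keep a single column and trim surrounding whitespace."""
    for row in rows:
        yield row[column].strip()


def count(values):
    """Explore: tally values, most frequent first."""
    return Counter(values).most_common()


def pipeline(text, column):
    """Chain the stages so each one consumes the previous one's output."""
    return count(select(read_rows(text), column))


if __name__ == "__main__":
    for value, n in pipeline(RAW, "city"):
        print(n, value)
```

Each stage does one job and passes its result along, so a stage can be swapped or reused without touching the others. This is the same property that makes command-line pipelines easy to rerun and adapt.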

Table of contents

Welcome
Foreword
Preface
1 Introduction
2 Getting Started
3 Obtaining Data
4 Creating Command-line Tools
5 Scrubbing Data
6 Project Management with Make
7 Exploring Data
8 Parallel Pipelines
9 Modeling Data
10 Polyglot Data Science
11 Conclusion
List of Command-Line Tools

Book details

Title: Data Science at the Command Line: Obtain, Scrub, Explore, and Model Data with Unix Power Tools
Author: Jeroen Janssens
Main category: Data Science
Subcategory: Data Analysis
License: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License

Legal notice: This book is shared for educational purposes only. The content is distributed under Creative Commons licenses or with explicit permission from the author. FreeProgrammingBooks may host files that comply with their respective licenses.
