Preface | Systems and Platforms for Big Data Statistical Analysis

The best way to contribute to this book is to fork this repository and work on your forked copy.

This book is also hosted on GitHub: https://github.com/panCtrlV/bigDataStat-book

GitBook Editor is a convenient tool for developing a GitBook offline.

Outline

Big Data is ...

General introduction to big data, big data statistics, and big data computing
Parallel computing, cluster computing, and cloud computing (grid computing?), including hardware and software
Build a virtual machine cluster
R for Statistical Analysis
Shared Memory Parallel Computing in R
R and Distributed Computing Platforms
Native Distributed Computing in R

Useful Reference

The Python 3 wiki page about Python efficiency discusses the efficiency problems (mainly the Global Interpreter Lock, or GIL) faced by the traditional implementation of Python (i.e. CPython) and lists several workarounds for better computational efficiency (e.g. using a free-threaded Python implementation like Jython, using process-based concurrency, ...). The article also explains why "just remove the GIL" isn't an obvious answer, describes the implications of replacing CPython's coarse-grained lock with fine-grained locks, and envisions the future of Python.

Although this article targets multithreading problems in Python, I think most of its arguments and considerations also apply to the current implementation of the R language.
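
To make the process-based workaround mentioned above more concrete, here is a minimal sketch using Python's standard `concurrent.futures` module. It is not taken from the wiki article. The function names (`count_primes`, `split_range`, `parallel_count_primes`) and the prime-counting task are assumptions chosen only for illustration. The idea is that each worker process runs its own interpreter with its own GIL, so a CPU-bound pure-Python loop can use several cores, whereas threads in CPython would take turns holding the single lock.

```python
from concurrent.futures import ProcessPoolExecutor


def count_primes(bounds):
    """Count primes in the half-open range [lo, hi) by trial division (CPU-bound)."""
    lo, hi = bounds
    count = 0
    for n in range(max(lo, 2), hi):
        # n is prime if no d in [2, sqrt(n)] divides it
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def split_range(limit, parts):
    """Split [0, limit) into at most `parts` contiguous chunks (hypothetical helper)."""
    step = max(1, -(-limit // parts))  # ceiling division, at least 1
    return [(lo, min(lo + step, limit)) for lo in range(0, limit, step)]


def parallel_count_primes(limit, workers=4):
    """Sum per-chunk counts computed in separate worker processes."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # Each chunk is pickled, sent to a worker process, and counted there.
        return sum(pool.map(count_primes, split_range(limit, workers)))


if __name__ == "__main__":
    # The guard keeps worker processes from re-running this block on import.
    print(parallel_count_primes(50_000))
```

The trade-off is that arguments and results must be pickled and copied between processes, so this approach tends to pay off only when each chunk of work is large relative to the data being passed around.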