Recently, I was exploring techniques to interpolate some missing environmental data, and stumbled across something called ‘random forest’ analysis. Random what now? I did a little digging and came across the massive and insanely complicated field of machine learning. I couldn’t find a concise guide to machine learning techniques, or when I might want to use one or the other, so I thought I would cobble together a brief guide on my own. Below is a rough stab at explaining and exploring different machine learning techniques, from CARTs to GBMs, using R.
[Updated December 30, 2019: You can read more about the package, new functionality, and other approaches to SEM in my online book (work-in-progress): https://jslefche.github.io/sem_book/]
[Updated October 13, 2015: Active development has moved to GitHub, so please see the link for the latest versions of all functions: https://github.com/jslefche/piecewiseSEM/]
Nature is complex. This seems like an obvious statement, but too often we reduce it to straightforward models: y ~ x and that sort of thing. Not that there’s anything wrong with that: sometimes y is actually directly a function of x, and anything else would be, in the words of Brian McGill, ‘statistical machismo.’
But I would wager that, more often than not, y is not directly a function of x. Rather, y may be affected by a host of direct and indirect factors, which themselves affect one another directly and indirectly. If only there were some way to translate this network of interacting factors into a statistical framework to better and more realistically understand nature. Oh wait, structural equation modeling.