Run a conditional inference tree with additional support

The partykit::ctree() function returns only the best split at each node, i.e. a single tree. This wrapper adds the following:

With recursive = T, all trees meeting the p-value cutoff are produced and saved. Each round of recursion removes the first splitting variable from the input data.frame and reruns runCtree(); the recursion stops when no splitting variable is found.
The info and statistics of each node of each tree are collected and summarized in an Excel file, which also contains URLs to each tree.
Before partykit::ctree() is run, rmNA() and rmNZV() remove low-information columns and rows to reduce computation and the adjustment applied to association p-values.
Inputs that would crash partykit::ctree() are handled, e.g. Inf and -Inf are converted to NA to avoid errors such as "'breaks' are not unique".
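The Inf handling described above can be illustrated in base R. This is a minimal sketch of the idea, not the wrapper's actual code; the data frame and its column names are made up for illustration.

```r
# Toy data with non-finite values (hypothetical column names)
df <- data.frame(
  y  = c(1.2, 3.4, 2.2, 5.1),
  x1 = c(0.5, Inf, 1.5, -Inf),
  g  = c("a", "b", "a", "b")
)

# Replace Inf/-Inf with NA in numeric columns only; other columns are untouched
num_cols <- vapply(df, is.numeric, logical(1))
df[num_cols] <- lapply(df[num_cols], function(v) {
  v[is.infinite(v)] <- NA
  v
})
```

After this step, x1 holds two NA values and no infinite values, so any binning of the column no longer sees repeated infinite break points.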

Usage

runCtree(
  df1,
  cohort,
  oDir,
  yi = 1,
  pCut = 0.05,
  recursive = T,
  getReturn = T,
  ctrlParas = list(minsplit = 10, minbucket = 5, maxsurrogate = 2, alpha = pCut),
  naParas = list(margins = 1, maxNA.perc = 0.95, minNonNA.count = 5),
  nzvParas = list(minUniPerc = 0.05, minUniCount = 5),
  gList = NULL
)

Arguments

df1

data.frame; columns are variables and rows are observations

cohort

char; name of the observation cohort as an annotation in the drawn tree

oDir

char; output directory for the tree plots and a summary Excel file:

One PDF file is written per tree.
Each PDF file is named paste0(oDir,.Platform$file.sep, cohort,'.',yName,'.',gList$counter,'.pdf')
The Excel file holds the stats element of the return value (see Value below) and is named paste0(oDir,.Platform$file.sep,cohort,'.xlsx')
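To make the naming scheme concrete, here is a small sketch that builds both paths from made-up values. The values of oDir, cohort, yName and counter are hypothetical; inside the function, yName and gList$counter are set internally.

```r
# Hypothetical values; yName and the counter are internal to runCtree()
oDir    <- "results"
cohort  <- "cohortA"
yName   <- "age"
counter <- 2

# Same paste0() patterns as documented above
pdf_path  <- paste0(oDir, .Platform$file.sep, cohort, ".", yName, ".", counter, ".pdf")
xlsx_path <- paste0(oDir, .Platform$file.sep, cohort, ".xlsx")
```

With these inputs the PDF file name is cohortA.age.2.pdf and the Excel file name is cohortA.xlsx, both inside oDir.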

yi

int; column index of the y (response) variable in df1

pCut

numeric; p-value cutoff for a significant association; not adjusted.

recursive

logical;

F: only produce the best tree
T: produce all trees meeting pCut
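The recursive mode can be pictured with the following sketch, which calls partykit directly. It assumes partykit is installed; fit_all_trees() is a hypothetical helper written for illustration and is not the wrapper's implementation (it skips the cleaning, plotting and Excel steps).

```r
library(partykit)

# Hypothetical helper: fit a tree, record it, drop the root splitting
# variable, and repeat until no significant split remains
fit_all_trees <- function(df, y_name, alpha = 0.05) {
  trees <- list()
  repeat {
    if (ncol(df) < 2) break                      # no predictors left
    fit <- ctree(reformulate(".", response = y_name), data = df,
                 control = ctree_control(alpha = alpha))
    root <- node_party(fit)
    if (is.terminal(root)) break                 # no split meets the cutoff
    sp_var <- names(fit$data)[varid_split(split_node(root))]
    trees[[sp_var]] <- fit                       # keyed by first splitting variable
    df[[sp_var]] <- NULL                         # remove it for the next round
  }
  trees
}

trees <- fit_all_trees(iris, "Species")
names(trees)
```

Each element of trees is named after the variable that split the root node in that round, which mirrors how the first splitting variable is removed before the next run.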

getReturn

logical; if T, the list described under Value is returned; otherwise nothing is returned. It is also used to reduce the internal data transfer load when recursive = T.

ctrlParas

list; parameters for partykit::ctree_control()
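As an illustration of how such a list maps onto partykit::ctree_control(), the sketch below expands the default values with do.call(). That the wrapper passes the list on in exactly this way is an assumption; the formula and data set are chosen only for the example.

```r
library(partykit)

pCut <- 0.05
ctrlParas <- list(minsplit = 10, minbucket = 5, maxsurrogate = 2, alpha = pCut)

# Assumption: each list element becomes a named argument of ctree_control()
ctrl <- do.call(ctree_control, ctrlParas)

# Numeric predictors only, to keep the surrogate-split setting uncontroversial
fit <- ctree(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
             data = iris, control = ctrl)
```

Any other argument accepted by ctree_control() can be added to the list in the same way.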

naParas

list; parameters for rmNA(); set to NULL to skip this step.
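The sketch below shows one plausible reading of the three default parameters, using a hypothetical helper drop_sparse(). It is not rmNA() itself, and the exact semantics of margins, maxNA.perc and minNonNA.count are assumed from their names (margin 1 for rows, 2 for columns, as in apply()).

```r
# Hypothetical helper, NOT the package's rmNA(): drop rows (margin = 1) or
# columns (margin = 2) that are mostly missing
drop_sparse <- function(df, margin = 1, max_na_perc = 0.95, min_non_na = 5) {
  na_mat  <- is.na(df)                          # logical matrix of missingness
  na_perc <- apply(na_mat, margin, mean)        # fraction of NA per row/column
  non_na  <- apply(!na_mat, margin, sum)        # count of non-NA per row/column
  keep    <- na_perc <= max_na_perc & non_na >= min_non_na
  if (margin == 1) df[keep, , drop = FALSE] else df[, keep, drop = FALSE]
}

toy <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6), c = c(7, NA, 9))
cleaned <- drop_sparse(toy, margin = 1, max_na_perc = 0.95, min_non_na = 2)
```

In this toy case the second row is entirely NA and is dropped, while the other two rows are kept.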

nzvParas

list; parameters for rmNZV(); set to NULL to skip this step.

gList

a listenv list used internally to track the recursion; users should ignore this argument.

Value

If getReturn is T, a list of the following items; nothing otherwise.

df: the cleaned df1; NA if df1 has only one column or fewer than 10 rows with y values.
stats: one of the following:
- NA, when no tree is produced for one of the following reasons:
  - only one column in df1
  - < 10 rows where y is not NA
  - y has low variance and is removed by rmNZV()
  - no tree meeting pCut is found; in this case, try increasing pCut
- A data.frame with the following columns, one tree per row:
  - counter: the index of each tree drawing
  - cohort, y, pCutoff
  - spVar1, pVal1: the name and p-value of the splitting variable at node 1
  - nNode: number of nodes in the tree
  - spVars: a string containing the names and statistics of all splitting variables; each node is formatted as "name,p-val,cut,gtOnRight", and nodes are separated by ';'
  - plot: the path to the tree plot
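The spVars string can be split back into a table with base R. The sketch below uses a made-up string that follows the documented "name,p-val,cut,gtOnRight" layout; the concrete values and the way gtOnRight is printed are assumptions.

```r
# Hypothetical spVars value: two nodes separated by ';'
spVars <- "Petal.Length,1.2e-30,1.9,TRUE;Petal.Width,3.4e-12,1.7,TRUE"

# Split into nodes, then into the four comma-separated fields of each node
nodes  <- strsplit(strsplit(spVars, ";", fixed = TRUE)[[1]], ",", fixed = TRUE)
parsed <- data.frame(
  name      = vapply(nodes, `[`, character(1), 1),
  pVal      = as.numeric(vapply(nodes, `[`, character(1), 2)),
  cut       = vapply(nodes, `[`, character(1), 3),  # kept as text; cuts may be non-numeric
  gtOnRight = vapply(nodes, `[`, character(1), 4),
  stringsAsFactors = FALSE
)
```

This simple split assumes that no field contains ',' or ';' itself; a cut description with embedded commas would need more careful parsing.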

Details

Note:

The packages partykit and openxlsx are loaded, but not attached, by this function.
ctree uses coin::independence_test() to test the association of two variables of any data type. See here for theory behind the test, and here for an explanation of the algorithm.

Examples

# none
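A call might look like the sketch below, which uses only the arguments documented above. It assumes the package providing runCtree() is attached; the guard lets the snippet run harmlessly otherwise. The choice of data set, cohort label and output folder is purely illustrative.

```r
# Assumes the package that provides runCtree() is installed and attached
if (exists("runCtree")) {
  oDir <- file.path(tempdir(), "ctree_out")     # illustrative output folder
  dir.create(oDir, showWarnings = FALSE)
  res <- runCtree(iris[, 1:4],                  # numeric columns only; column 1 is y
                  cohort = "iris", oDir = oDir, yi = 1,
                  pCut = 0.05, recursive = TRUE, getReturn = TRUE)
  str(res$stats)                                # one row per tree, or NA
}
```

With getReturn = TRUE the result is the list described under Value, so res$df and res$stats can be inspected directly.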