
The partykit::ctree() function returns a single tree, built from the best split at each node. This wrapper adds the following:

  • With recursive = T, all trees meeting the p-value cutoff are produced and saved. Each round of recursion removes the first splitting variable from the input data.frame and reruns runCtree(); the recursion stops when no splitting variable is found.

  • The info and stats of each node of each tree are collected and summarized in an Excel file, which also contains URLs to each tree.

  • Before running partykit::ctree(), rmNA() and rmNZV() are run to remove low-information columns and rows, which reduces both the computation and the adjustment applied to association p-values.

  • Inputs that crash partykit::ctree() are handled, e.g. Inf and -Inf are converted to NA to avoid the error "'breaks' are not unique".
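The Inf handling mentioned above can be illustrated with a minimal base-R sketch. The helper name infToNA and the toy data are hypothetical and are not part of this package; the actual conversion inside runCtree() may differ.

```r
# Hypothetical input; column names are illustrative only.
df <- data.frame(
  y  = c(1.2, 3.4, 2.2, 5.1),
  x1 = c(0.5, Inf, 1.5, -Inf),
  x2 = c("a", "b", "a", "b"),
  stringsAsFactors = FALSE
)

# Replace Inf and -Inf with NA in numeric columns; other columns are untouched.
# infToNA is an illustrative helper, not a function exported by this package.
infToNA <- function(d) {
  isNum <- vapply(d, is.numeric, logical(1))
  d[isNum] <- lapply(d[isNum], function(v) {
    v[is.infinite(v)] <- NA
    v
  })
  d
}

clean <- infToNA(df)
```

After this step, the infinite values in x1 are NA and can be handled by the same NA filtering as any other missing value.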

Usage

runCtree(
  df1,
  cohort,
  oDir,
  yi = 1,
  pCut = 0.05,
  recursive = T,
  getReturn = T,
  ctrlParas = list(minsplit = 10, minbucket = 5, maxsurrogate = 2, alpha = pCut),
  naParas = list(margins = 1, maxNA.perc = 0.95, minNonNA.count = 5),
  nzvParas = list(minUniPerc = 0.05, minUniCount = 5),
  gList = NULL
)

Arguments

df1

data.frame; columns are variables and rows are observations

cohort

char; name of the observation cohort as an annotation in the drawn tree

oDir

char; output directory for the tree plots and the summary Excel file:

  • one pdf file for each tree

  • each file is named as paste0(oDir,.Platform$file.sep, cohort,'.',yName,'.',gList$counter,'.pdf')

  • The Excel file holds the stats element of the returned list (see Value below), and is named as paste0(oDir,.Platform$file.sep,cohort,'.xlsx')
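As a quick illustration of the naming scheme above, the sketch below builds both paths from made-up values. Here yName is assumed to be the column name of the y variable, and a plain counter stands in for gList$counter; both are assumptions for illustration.

```r
# Hypothetical values for illustration only.
oDir    <- "results"
cohort  <- "cohortA"
yName   <- "response"  # assumed: the column name of the y variable
counter <- 1           # stands in for gList$counter

# One pdf per tree, plus one summary workbook per cohort.
pdfPath  <- paste0(oDir, .Platform$file.sep, cohort, ".", yName, ".", counter, ".pdf")
xlsxPath <- paste0(oDir, .Platform$file.sep, cohort, ".xlsx")

print(pdfPath)
print(xlsxPath)
```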

yi

int; index of y variable

pCut

numeric; p-value cutoff for a significant association; not adjusted.

recursive

logical;

  • F: only produce the best tree

  • T: produce all trees meeting pCut
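The drop-and-refit idea behind recursive = T can be sketched in base R. This is only an illustration under stated assumptions: firstSplitVar() is a hypothetical stand-in that picks the predictor with the smallest correlation-test p-value, which is not what partykit::ctree() does internally, and the loop assumes y is the first column.

```r
# Stand-in for "find the splitting variable at node 1": return the predictor
# with the smallest correlation-test p-value, or NULL if none meets pCut.
# NOTE: a simplification for this sketch, not the ctree() algorithm.
firstSplitVar <- function(d, yi = 1, pCut = 0.05) {
  xs <- setdiff(seq_along(d), yi)
  if (length(xs) == 0) return(NULL)
  pVals <- vapply(
    xs,
    function(j) stats::cor.test(d[[j]], d[[yi]])$p.value,
    numeric(1)
  )
  best <- which.min(pVals)
  if (pVals[best] > pCut) return(NULL)
  names(d)[xs[best]]
}

# Drop-and-refit loop: record the splitting variable, remove it, repeat
# until no variable meets the cutoff. Assumes y stays in column 1.
recurseSketch <- function(d, pCut = 0.05) {
  found <- character(0)
  repeat {
    v <- firstSplitVar(d, yi = 1, pCut = pCut)
    if (is.null(v)) break
    found <- c(found, v)
    d[[v]] <- NULL
  }
  found
}

# Toy data: y depends strongly on x1, moderately on x2, and not on x3.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
toy <- data.frame(y = x1 + 0.5 * x2 + rnorm(n, sd = 0.1), x1 = x1, x2 = x2, x3 = x3)

res <- recurseSketch(toy)
print(res)
```

Each element of res corresponds to one round of the recursion, i.e. one tree that runCtree() would save when recursive = T.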

getReturn

logical; if T, return the list described under Value; otherwise return nothing. It is also used to reduce the internal data transfer load when recursive = T.

ctrlParas

list; parameters for partykit::ctree_control()

naParas

list; parameters for rmNA(); set to NULL to skip this step.

nzvParas

list; parameters for rmNZV(); set to NULL to skip this step.

gList

a listenv list; it's for internal recursion tracking; users should ignore this argument.

Value

If getReturn is T, a list of the following items; nothing otherwise.

  • df: cleaned df1; NA if df1 has only one column or fewer than 10 rows with non-NA y values.

  • stats: possible values:

    • NA: ctree() does not run, or yields no tree, for one of the following reasons:

      • only one column in df1

      • < 10 rows where y is not NA

      • y has low variance and is removed by rmNZV()

      • no tree meeting pCut is found; in this case, try increasing pCut

    • A data.frame with the following columns, one tree per row:

      • counter: the index of each tree drawing

      • cohort, y, pCutoff

      • spVar1, pVal1: the name and p-value of the splitting variable at node 1

      • nNode: number of nodes of the tree

      • spVars: a string containing the names and stats of all splitting variables. For each node, the format is "name,p-val,cut,gtOnRight"; nodes are separated by ';'.

      • plot: the path to the tree plot
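The spVars string can be unpacked into one row per node with base R. The sketch below uses a made-up spVars value, and parseSpVars() is a hypothetical helper, not part of this package. It keeps cut and gtOnRight as character because their exact encoding is not specified here, and it assumes cut values contain no ',' or ';'.

```r
# Hypothetical spVars value; names and numbers are made up for illustration.
spVars <- "age,0.001,45.5,TRUE;weight,0.02,70,TRUE"

# Split into nodes on ';', then into fields on ','.
# Assumes no field itself contains ',' or ';'.
parseSpVars <- function(s) {
  nodes  <- strsplit(s, ";", fixed = TRUE)[[1]]
  fields <- strsplit(nodes, ",", fixed = TRUE)
  data.frame(
    name      = vapply(fields, `[`, character(1), 1),
    pVal      = as.numeric(vapply(fields, `[`, character(1), 2)),
    cut       = vapply(fields, `[`, character(1), 3),  # kept as text
    gtOnRight = vapply(fields, `[`, character(1), 4),  # kept as text
    stringsAsFactors = FALSE
  )
}

nodeTab <- parseSpVars(spVars)
print(nodeTab)
```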

Details

Note:

  • packages partykit and openxlsx are loaded, but not attached, in this function.

  • ctree uses coin::independence_test() to test the association of two variables of any data type. See here for theory behind the test, and here for an explanation of the algorithm.

Examples

# none