rapport: an R templating system | Home

rapport  an R templating system


Introduction

rapport is an R package that facilitates creation of reproducible statistical report templates. Once created, rapport templates can be exported to various external formats: HTML, LaTeX, PDF, ODT, DOCX etc. The appearance of the page you are just reading is similar to a ‘rapported document’. Apart from R, all you need to know to start writing your own templates is Pandoc’s markdown syntax, and some rapport conventions that allow the reproducibility of the template.

Several predefined templates come bundled with default rapport installation, and their number will increase in future releases. Of course, we strongly encourage you to write your own templates, and/or customise the ones that are shipped with default package installation.

For a brief introduction to rapport, run:

demo(rapport, ask = FALSE)

Frequently Asked Questions

What does “rapport” mean?

Actually a lot of stuff:

  • first of all, it’s a name of absolutely marvellous R package created by two humble developers
  • initially, rapport was just a pun (and a very lame attempt to get an available domain name)
  • it also refers to “the relationship of two or more people who are in sync or on the same wavelength because they feel similar and/or relate well to each other”*
  • in psychotherapy, rapport reflects mutual empathy and understanding between a therapist and client
  • if you’re a German (oder Sie sprechen fließend Deutsch), then it’s just a synonym for “report” (BTW, in Hungarian rapport denotes interrogation in military parlance)
  • if you’re a Kwakiutl Indian or a Khalkha Mongol, rapport probably means nothing in your language, but you should give it a try, it’s bonkers!

* Stewart, Dorothy (1998). Gower handbook of management skills. Gower Publishing. p. 282. ISBN 0566078899

Who is developing this package?

Members of the rapport development team are painstakingly procrastinating on this project, to the utmost limits of endurance. Feel free to contribute and help us localising it!

Why should I prefer rapport over Sweave, brew, knitr and the like?

Short answer: you should not.

Long answer: you should not… oftentimes… especially if you already have an efficient development work flow. But rapport is a bit different from the other popular dynamic report frameworks in R: Sweave, brew, or even knitr. With those nifty tools you can easily embed R code and R output in TeX, PDF, HTML and other formats, which is a common practice in advanced R users.

rapport has a similar goal: to use R to generate dynamic, reproducible templates, which can be easily exported to various external formats, providing (hopefully) nicely formatted template elements, such as tables, graphs or in-line expressions. You may get an impression that rapport is n00b-friendly interface to statistical report creation, but from our POV, it’s just a convenient way of dealing with repetitive tasks.

rapport relies on some predefined/custom templates, which can be easily ran against any dataset and with user-defined input specifications . Don’t forget: custom templates are easy to write!

OK, you got me. How can I use this package?

Read sections on usage and/or writing custom templates.

I found a bug. What should I do?

Please join our discussion list or file an issue on Github tagged as bug!

I have a really cool idea about this package. Are you interested?

Hell yeah! Please join our discussion list or file an issue about your idea on Github tagged as feature, we will be really interested to check it out.

Can I use this package for academic/commercial/any other purpose?

Of course, with some limitations: see license terms for details.

How can I use my custom templates without always specifying the full path and name of the templates?

Check out rapport.path and rapport.path.add. If you add the paths of the directories holding your custom templates, rapport and any related function would easily find it just like you would use a package bundled template. So if you have e.g. mytemplate.rapport in /tmp, adding that to the list of custom paths like rapport.path.add('/tmp'), you can easily call rapport('mytemplate', ...). You don’t even have to include the extension of the file (unless it ends with something other than rapport).

It would be a good practice to add a call to rapport.paths.add(...) to your .Rprofile, which would be evaluated on each R session startup.

Why do you write such complex templates? Why not including those in helpers like rp.anova?

It is based on a theory behind rapport: you can define a function at any part of your template and use it, or just load any library which is installed on your system - even your own, local packages. You might even include a source command in your local template, but bear in mind that this template will not be able to run on other computers!

OK, I understand, but I have a really handy function which should be added to rapport!

We’d be glad to incorporate it in our package, please file an issue about your idea on Github tagged as feature. Or write your own package submitted to CRAN and please attract our attention to add that package to our required or suggested package list.

HTML template exports have ugly fonts in Firefox, but not in other browsers. What’s the catch?

Actually, it’s not a rapport bug, rather a Firefox feature. Type about:config in Awesome Bar™ and search for following property:

security.fileuri.strict_origin_policy

Set it to false (by double-clicking on it) and refresh the page (you may want to bypass the browser cache, too: use Shift + click on Refresh button, or Ctrl+Shift+R). Behold the pretty fonts! See an answer that solved this strange behaviour.

I’m trying to produce those fancy HTML tables you have in rapport. Any tips on how to do that? (Q of Roman Luštrik)

Sure, just put a piece of code that returns a data.frame object in a chunk, and it will be converted to HTML table once you export it. Put something like this in your .rapport file:

<%=
rp.desc("edu", "student", c(min, max, mean, sd), ius2008)
%>

and run rapport.html(<file path>, <data>, <inputs>).

Installation

We’ll start with a brief reminder: make sure that you have Pandoc installed, and then proceed with package installation.

Starting from v.0.2 rapport is hosted on CRAN, so you can install it by calling:

install.packages('rapport')

Or you can grab the latest build from GitHub via nifty function from devtools package:

library(devtools)
install_github('rapport', 'rapporter')

You can also download sources in a zip or tarball archive and build package manually. To do so, please extract archive to an empty directory and run the following commands:

R CMD build rapport
R CMD INSTALL <path to .tar.gz file>

If you’re running R on Windows, you need to install Rtools. Once you have installed it, you can either try out the install_github() approach or install package manually by downloading tarball sources and issuing following commands in the command prompt:

R CMD build --binary <path to .tar.gz file>
R CMD INSTALL <path to .zip file>

In order to export rapport templates to HTML, ODT, DOCX or PDF, please install pandoc.

And an up-to-date version of pander (an R markdown writer) installed from GitHub is also really advocated:

library(devtools)
install_github('pander', 'rapporter')

Usage

After installing, load the package:

library(rapport)

The most obvious goal of this package is to easily reproduce a report by providing a custom dataset.

rapport has some predefined templates distributed with the package. These files can be found in templates directory of the package with rapport extension. Getting a list of all available templates is easy:

> rapport.ls()
 [1] "AnalyzeWizard.rapport"               "ANOVA.rapport"                       "BartlettsTest.rapport"               "BrownForsythTest.rapport"           
 [5] "Correlation.rapport"                 "Crosstable.rapport"                  "Descriptives.rapport"                "Example.rapport"                    
 [9] "FactorAnalysis.rapport"              "FTest.rapport"                       "GenerateBeta.rapport"                "GLM.rapport"                        
[13] "HierarchicalClusterAnalysis.rapport" "HomogeneityCrosstable.rapport"       "KMeansCluster.rapport"               "KolmogorovSmirnovTest.rapport"      
[17] "KruskalWallisTest.rapport"           "LinearRegression.rapport"            "MDS.rapport"                         "Minimal.rapport"                    
[21] "NormalityTest.rapport"               "OutlierTest.rapport"                 "PCA.rapport"                         "TTestOneSample.rapport"             
[25] "TTestTwoSample.rapport"              "WilcoxonTest.rapport"       

Examples

If you, find, say Minimal.rapport promising, you can check it out by calling rapport.example function which prints out the examples specified in the template, prompting you to choose one from the list:

> rapport.example('example')
Enter example ID from the list below: 

(1)	rapport("Minimal", ius2008, var='leisure') 
(2)	rapport("Minimal", ius2008, var='leisure', desc=FALSE) 
(3)	rapport("Minimal", ius2008, var='leisure', desc=FALSE, histogram=T) 
(all)	Run all examples 


Template ID>

After you typed in a template ID (1, 2, 3 or all), press ENTER key to see it in action. For example, running the first example of Minimal template returns (which can be called like rapport.example('Minimal', 1) too):

# Début


Hello, world!

I have just specified a *Variable* in this template named to **leisure**. The label of this variable is "Internet usage in leisure time (hours per day)".

And wow, the mean of *leisure* is _3.199_!

By checking out the [sources of this template](https://github.com/rapporter/rapport/blob/master/inst/templates/Minimal.rapport), you could see that we used all `BRCATCODE`s above from `brew` syntax. `BRCODE` tags are useful when you want to loop through something or optionally add or remove a part of the template. A really easy example of this: if `desc` input equals to `TRUE`, then the resulting report would have that chunk, if set to `FALSE`, it would be left our.

## Descriptive statistics

--------------------------------------------------------
Min.   1st Qu.   Median   Mean   3rd Qu.   Max.   NA's  
------ --------- -------- ------ --------- ------ ------
0.000  2.000     3.000    3.199  4.000     12.000 37    
--------------------------------------------------------

The 5 highest values are: _12_, _12_, _10_, _10_ and _10_.

What was done here? We have executed a simple analysis on the leisure variable found in ius2008 (package bundled) dataset. This simple template only returned a local greeting, the name and the label of the given variable, alongside a variable’s summary and 5 highest values.

As you could see in the examples there are some other parameters of this template besides data and var, for example: desc and hist. In order to, get some info on the template, issue following command:

> rapport.info('Minimal')

`Minimal template`

 by Gergely Daróczi

 This template demonstrates the basic features of rapport. We all hope you will like it!
 
 packages:	lattice, xtable
 
 Examples: 
 rapport("Minimal", ius2008, var='leisure') 
 rapport("Minimal", ius2008, var='leisure', desc=FALSE) 
 rapport("Minimal", ius2008, var='leisure', desc=FALSE, histogram=T) 

Input parameters 

 "var" (Variable)  *required
  A numeric variable.
   - standalone:  no
   - length:		exactly 1 vector

 "desc" (Descriptives)
 Table of the descriptive statistics
   - class:  	logical
   - standalone:	yes
   - value:		~
   
 "histogram" (Histogram)
 Histogram
   - class:    logical
   - standalone:	yes
   - value:		~
   

Okay, we have seen the examples before, but new information appears now too:

  • the name and the author of the template,
  • a brief description of what is the template for, what will be returned,
  • required packages to run the template and
  • some input parameters (see rapport.inputs for details).

These latter shows exactly what we were looking for which can be returned by rapport.info('Minimal', meta = F) command too without meta information.

There we can see that four parameters can be provided. var is the name of the variable we want to analyze, we can set desc to FALSE instead of the default value TRUE not to print descriptive statistics and we can instruct the template to return a histogram too (see: hist parameter).

Let’s run the third example:

> rapport("example", ius2008, var='leisure', desc=FALSE, histogram=T)

# Début


Hello, world!

I have just specified a *Variable* in this template named to **leisure**. The label of this variable is "Internet usage in leisure time (hours per day)".

And wow, the mean of *leisure* is _3.199_!

**For more detailed statistics, you should have set `desc=TRUE`!**

By checking out the [sources of this template](https://github.com/rapporter/rapport/blob/master/inst/templates/example.rapport), you could see that we used all `BRCATCODE`s above from `brew` syntax. `BRCODE` tags are useful when you want to loop through something or optionally add or remove a part of the template. A really easy example of this: if `desc` input equals to `TRUE`, then the resulting report would have that chunk, if set to `FALSE`, it would be left our.

## Histogram

For demonstartion purposes you can find a histogram below:

![A nice histogram](plots/rapport-Minimal-1-1.png)

Here instead of the known rapport.example we used directly rapport which takes the above described input parameters. As you can see the descriptive statistics table is gone, instead we got a histogram. Or at least a path to a png file - which holds that image. You can find that file after running the above command (not exactly on the same path - see ?tempfile for details) on your local machine and check it out. We have attached that here.

Well, this is a quite rough way of checking out plots generated in a template :)

Export

There are a lot easier ways for that:

  • run rapport with modified global options: set graph.record and rapport.graph.replay to TRUE. This way you will see all generated plots pop-up while printing a rapport class object. This way you can even resize the plots and later export the report with modified image dimension (see below).
  • as we can export our reports (to be more technical our rapport classes, see str function on any rapport returned object) to wide range of formats with rapport.export.

Please find the HTML exported versions of the examples of Minimal.rapport here or run on your machine:

rapport.export(rapport.example('Minimal', 'all'))

Which will return the above linked HTML with all examples of Minimal.rapport. Well, not exactly the same :)

Were you aware of the change in the second line which holds the name of the useR? This is set to rapport package team @ https://github.com/rapporter/rapport in our deploying system and set to undefined by default. It might be a good idea to set this to custom strings on all users’ machine, which are hold in options().

Set your name and e-mail address:

options('rapport.user'  = 'userR')

And rerun the following commands to see the changes. For other settings in rapport.export please check out the docs.

If you would like to resize/alter the dimensions of generated images in the exported reports, a nice way of doing this is like:

  • save a rapport class to a variable (e.g.: report <- rapport("Minimal", ius2008, var='leisure', desc=FALSE, hist=T)),
  • print it (e.g.: report)
  • while the plots are shown on the graphics device feel free to resize them, rapport will resave the images to disk with modified dimensions,
  • export the R object (e.g.: rapport.export(report))

If you would export several rapport object at once, you can do that by combining those to a list, see rapport.export (especially the examples) for more details.

There are a bunch of other helper functions in rapport to deal with templates. As most starts with rapport prefix (except for the main rapport function), we can easily list them by typing rapport. and pressing TAB twice:

> tpl.
tpl.body          tpl.example       tpl.info          tpl.list          tpl.paths         tpl.paths.remove  tpl.rerun         
tpl.check         tpl.export        tpl.inputs        tpl.meta          tpl.paths.add     tpl.paths.reset  

By this method you might not find all handy functions, for example rapport.html and rapport.odt which are simple wrappers around rapport and rapport.export. Just use these functions you would use rapport and get a nifty HTML/odt output directly. For example:

rapport.html('example', ius2008, var = 'leisure')

Other functions

As you might have seen there are several general options in rapport which can be as handy as the funtions used while rapporting. Above we have set the username, which affects the result of all run templates. Please see the (almost) full list of available options below:

  • rapport.user: a (user)name to show in exported report (defaults to "Anonymous")
  • rp.file.name: a general filename of generated images and exported documents without extension. Some helper pseudo-code would be replaced with handy strings while running rapport and rapport.export:
    • %t: unique random character strings based on tempfile,
    • %T: template name in action,
    • %N: an auto-increment integer based on similar exported document’s file name,
    • %n: an auto-increment integer based on similar (plot) file names (see: ?evalsOptions)
  • rp.file.path: a directory where generated images and exported documents would take place. Please note, that the path should not contain any spaces at all (exception of this rule is on Windows where shortPathName could handle that problem with traditional directories like "Documents and Settings" etc.).
  • By default rapport function saves plots to image files (see the settings below) and print method just shows the path(s) of the generated image(s). If you would like to see the plot(s) when calling rapport function from an interactive R console, please set evalsOptions('graph.recordplot', TRUE) and rapport.graph.replay option to TRUE beforehand. In that case all generated plots will be displayed after printing the rapport object. These options are set to FALSE by default although we find these settings really handy, as you can resize the images on the fly and export resized images to HTML/ODT/DOCX/PDF etc.
  • Above a special function was called: evalsOptions() which part of pander: an R Pandoc writer besides panderOptions() - both holding a bunch of useful options which are documented on pander’s GH page and worth checking out.

And after all, do not forget to check out other templates of this package too or even write/translate your own templates! :)

Bundled templates

rapport comes bundled with several report templates, and we hope that their number will increase in succeeding releases. To list all available templates, issue the following command: tpl.list(). These package bundled templates are:

Please find below a detailed description about each with sample outputs of those in R console, HTML, odt and pdf too.

ANOVA.tpl

  • ANOVA Template
    by Aleksandar Blagotić, Dániel Nagy

    An ANOVA report with table of descriptives, diagnostic tests and ANOVA-specific statistics.

    Required packages: nortest
    Example:
    rapport("ANOVA", ius2008, resp = "leisure", fac = "gender")  # one-way
    rapport("ANOVA", ius2008, resp = "leisure", fac = c("gender", "partner")) # two-way
  • Input parameters
    ANOVA.tpl template has the following options to be (optionally) set while calling rapport:
    • resp: Response variable
      Dependent (response) variable
      Class: numeric

    • fac: Factor variables
      Independent variables (factors)
      Class: factor

    • fac.intr: Factor interaction
      Include factor interaction
      Class: logical
      Default value: TRUE

  • ##########################################################################################
    ##	 Running: rapport("ANOVA", ius2008, resp = "leisure", fac = "gender")  # one-way
    ##########################################################################################
    
    # Introduction
    
    
    **Analysis of Variance** or **ANOVA** is a statistical procedure that tests equality of means for several samples. It was first introduced in 1921 by famous English statistician Sir Ronald Aylmer Fisher.
    
    
    # Model Overview
    
    One-Way ANOVA was carried out, with _Gender_ as independent variable, and _Internet usage in leisure time (hours per day)_ as a response variable. Factor interaction was taken into account.
    
    
    # Descriptives
    
    
    In order to get more insight on the model data, a table of frequencies for ANOVA factors is displayed, as well as a table of descriptives.
    
    
    ## Frequency Table
    
    
    Below lies a frequency table for factors in ANOVA model. Note that the missing values are removed from the summary.
    
    
    ----------------------------------------
     gender   N    %    Cumul. N   Cumul. % 
    -------- --- ----- ---------- ----------
      male   410 60.92    410       60.92   
    
     female  263 39.08    673        100    
    
     Total   673  100     673        100    
    ----------------------------------------
    
    
    ## Descriptive Statistics
    
    
    The following table displays the descriptive statistics of ANOVA model. Factor levels  lie on the left-hand side, while the corresponding statistics for response variable are given on the right-hand side.
    
    
    ---------------------------------------------------------------------------
     Gender   Min   Max   Mean   Std.Dev.   Median   IQR   Skewness   Kurtosis 
    -------- ----- ----- ------ ---------- -------- ----- ---------- ----------
      male     0    12    3.27    1.953       3       3     0.9443     0.9858  
    
     female    0    12   3.064    2.355       2       3     1.398       1.87   
    ---------------------------------------------------------------------------
    
    
    # Diagnostics
    
    
    Before we carry out ANOVA, we'd like to check some basic assumptions. For those purposes, normality and homoscedascity tests are carried out alongside several graphs that may help you with your decision on model's main assumptions.
    
    
    ## Diagnostics
    
    
    ### Univariate Normality
    
    
    -------------------------------------------------
              Method             Statistic   p-value 
    --------------------------- ----------- ---------
            Lilliefors             0.168      3e-52  
       (Kolmogorov-Smirnov)                          
          normality test                             
    
    Anderson-Darling normality     18.75    7.261e-44
               test                                  
    
    Shapiro-Wilk normality test   0.9001    1.618e-20
    -------------------------------------------------
    
    
    So, the conclusions we can draw with the help of test statistics: 
       
     - based on _Lilliefors test_, distribution of _Internet usage in leisure time (hours per day)_ is not normal
       
     - _Anderson-Darling test_ confirms violation of normality assumption
    
     - according to _Shapiro-Wilk test_, the distribution of _Internet usage in leisure time (hours per day)_ is not normal
     
    
    
    As you can see, the applied tests confirm departures from normality of the Internet usage in leisure time (hours per day).
    
    ### Homoscedascity
    
    
    In order to test homoscedascity, _Bartlett_ and _Fligner-Kileen_ tests are applied.
    
    
    --------------------------------------------------
               Method             Statistic   p-value 
    ---------------------------- ----------- ---------
      Fligner-Killeen test of      0.4629     0.4963  
      homogeneity of variances                        
    
    Bartlett test of homogeneity    10.77    0.001032 
            of variances                              
    --------------------------------------------------
    
    
    
    
    When it comes to equality of variances, applied tests yield inconsistent results. While _Fligner-Kileen test_ confirmed the hypotheses of homoscedascity, _Bartlett's test_ rejected it.
    
    
    ## Diagnostic Plots
    
    
    Here you can see several diagnostic plots for ANOVA model:
    
     - residuals against fitted values
     - scale-location plot of square root of residuals against fitted values
     - normal Q-Q plot
     - residuals against leverages
    
    [![](plots/rapport-ANOVA-31-1.png)](plots/rapport-ANOVA-31-1-hires.png)
    
    # ANOVA Summary
    
    
    ## ANOVA Table
    
    
    ----------------------------------------------------------
        &nbsp;       Df   Sum.Sq   Mean.Sq   F.value   Pr..F. 
    --------------- ---- -------- --------- --------- --------
      **gender**     1    6.422     6.422     1.43     0.2322 
    
     **Residuals**  636    2856     4.49                      
    ----------------------------------------------------------
    
    
    
    _F-test_ for _Gender_ is not statistically significant, which implies that there is no Gender effect on response variable. 
    
    
    ## Post Hoc test
    
    
    ### Results
    
    
    After getting the results of the ANOVA, usually it is advisable to run a [post hoc test](http://en.wikipedia.org/wiki/Post-hoc_analysis) to explore patterns that were not specified a priori. Now we are presenting [Tukey's HSD test](http://en.wikipedia.org/wiki/Tukey%27s_range_test).
    
    
    #### gender
    
    
    --------------------------------------------------------------------
         &nbsp;        Difference   Lower Bound   Upper Bound   P value 
    ----------------- ------------ ------------- ------------- ---------
     **female-male**     -0.206       -0.543         0.132      _0.232_ 
    --------------------------------------------------------------------
    
    
    
    
    There are no categories which differ significantly here.
    
    
    ### Plot
    
    
    Below you can see the result of the post hoc test on a plot.
    
    [![](plots/rapport-ANOVA-31-2.png)](plots/rapport-ANOVA-31-2-hires.png)
    
    
    #######################################################################################################
    ##	 Running: rapport("ANOVA", ius2008, resp = "leisure", fac = c("gender", "partner")) # two-way
    #######################################################################################################
    
    # Introduction
    
    
    **Analysis of Variance** or **ANOVA** is a statistical procedure that tests equality of means for several samples. It was first introduced in 1921 by famous English statistician Sir Ronald Aylmer Fisher.
    
    
    # Model Overview
    
    Two-Way ANOVA was carried out, with _Gender_ and _Relationship status_ as independent variables, and _Internet usage in leisure time (hours per day)_ as a response variable. Factor interaction was taken into account.
    
    
    # Descriptives
    
    
    In order to get more insight on the model data, a table of frequencies for ANOVA factors is displayed, as well as a table of descriptives.
    
    
    ## Frequency Table
    
    
    Below lies a frequency table for factors in ANOVA model. Note that the missing values are removed from the summary.
    
    
    ----------------------------------------------------------
     gender       partner       N    %    Cumul. N   Cumul. % 
    -------- ----------------- --- ----- ---------- ----------
      male   in a relationship 150 23.7     150        23.7   
    
     female  in a relationship 120 18.96    270       42.65   
    
      male        married      33  5.213    303       47.87   
    
     female       married      29  4.581    332       52.45   
    
      male        single       204 32.23    536       84.68   
    
     female       single       97  15.32    633        100    
    
     Total         Total       633  100     633        100    
    ----------------------------------------------------------
    
    
    ## Descriptive Statistics
    
    
    The following table displays the descriptive statistics of ANOVA model. Factor levels and their combinations lie on the left-hand side, while the corresponding statistics for response variable are given on the right-hand side.
    
    
    -------------------------------------------------------------------------------------------------
     Gender   Relationship status   Min   Max   Mean   Std.Dev.   Median   IQR   Skewness   Kurtosis 
    -------- --------------------- ----- ----- ------ ---------- -------- ----- ---------- ----------
      male     in a relationship    0.5   12   3.058    1.969      2.5      2     1.324      2.649   
    
      male          married          0     8   2.985    2.029       3       2     0.862      0.1509  
    
      male          single           0    10   3.503    1.936       3       3     0.7574    0.08749  
    
     female    in a relationship    0.5   10   3.044    2.216       3       3     1.383      1.831   
    
     female         married          0    10   2.481    1.967       2     1.75    2.063      5.586   
    
     female         single           0    12   3.323    2.679       3      3.5    1.185      0.9281  
    -------------------------------------------------------------------------------------------------
    
    
    # Diagnostics
    
    
    Before we carry out ANOVA, we'd like to check some basic assumptions. For those purposes, normality and homoscedascity tests are carried out alongside several graphs that may help you with your decision on model's main assumptions.
    
    
    ## Diagnostics
    
    
    ### Univariate Normality
    
    
    -------------------------------------------------
              Method             Statistic   p-value 
    --------------------------- ----------- ---------
            Lilliefors             0.168      3e-52  
       (Kolmogorov-Smirnov)                          
          normality test                             
    
    Anderson-Darling normality     18.75    7.261e-44
               test                                  
    
    Shapiro-Wilk normality test   0.9001    1.618e-20
    -------------------------------------------------
    
    
    So, the conclusions we can draw with the help of test statistics: 
       
     - based on _Lilliefors test_, distribution of _Internet usage in leisure time (hours per day)_ is not normal
       
     - _Anderson-Darling test_ confirms violation of normality assumption
    
     - according to _Shapiro-Wilk test_, the distribution of _Internet usage in leisure time (hours per day)_ is not normal
     
    
    
    As you can see, the applied tests confirm departures from normality of the Internet usage in leisure time (hours per day).
    
    ### Homoscedascity
    
    
    In order to test homoscedascity, _Bartlett_ and _Fligner-Kileen_ tests are applied.
    
    
    --------------------------------------------------
               Method             Statistic   p-value 
    ---------------------------- ----------- ---------
      Fligner-Killeen test of       1.123     0.2892  
      homogeneity of variances                        
    
    Bartlett test of homogeneity    11.13    0.0008509
            of variances                              
    --------------------------------------------------
    
    
    
    
    When it comes to equality of variances, applied tests yield inconsistent results. While _Fligner-Kileen test_ confirmed the hypotheses of homoscedascity, _Bartlett's test_ rejected it.
    
    
    ## Diagnostic Plots
    
    
    Here you can see several diagnostic plots for ANOVA model:
    
     - residuals against fitted values
     - scale-location plot of square root of residuals against fitted values
     - normal Q-Q plot
     - residuals against leverages
    
    [![](plots/rapport-ANOVA-32-1.png)](plots/rapport-ANOVA-32-1-hires.png)
    
    # ANOVA Summary
    
    
    ## ANOVA Table
    
    
    ---------------------------------------------------------------
           &nbsp;         Df   Sum.Sq   Mean.Sq   F.value   Pr..F. 
    -------------------- ---- -------- --------- --------- --------
         **gender**       1    4.947     4.947     1.085    0.2979 
    
        **partner**       2    31.21     15.61     3.424   0.03324 
    
     **gender:partner**   2    3.038     1.519    0.3332    0.7168 
    
       **Residuals**     593    2703     4.558                     
    ---------------------------------------------------------------
    
    
    
    _F-test_ for _Gender_ is not statistically significant, which implies that there is no Gender effect on response variable. Effect of _Relationship status_ on response variable is  significant. Interaction between levels of _Gender_ and _Relationship status_ wasn't found significant (p = 0.717).
    
    
    ## Post Hoc test
    
    
    ### Results
    
    
    After getting the results of the ANOVA, usually it is advisable to run a [post hoc test](http://en.wikipedia.org/wiki/Post-hoc_analysis) to explore patterns that were not specified a priori. Now we are presenting [Tukey's HSD test](http://en.wikipedia.org/wiki/Tukey%27s_range_test).
    
    
    #### gender
    
    
    --------------------------------------------------------------------
         &nbsp;        Difference   Lower Bound   Upper Bound   P value 
    ----------------- ------------ ------------- ------------- ---------
     **female-male**     -0.186       -0.538         0.165      _0.298_ 
    --------------------------------------------------------------------
    
    
    
    
    There are no categories which differ significantly here.
    
    
    #### partner
    
    
    ----------------------------------------------------------------------------------
                &nbsp;               Difference   Lower Bound   Upper Bound   P value 
    ------------------------------- ------------ ------------- ------------- ---------
     **married-in a relationship**     -0.289       -1.012         0.435      _0.616_ 
    
     **single-in a relationship**      0.371        -0.061         0.803      _0.109_ 
    
          **single-married**            0.66        -0.059         1.379      _0.079_ 
    ----------------------------------------------------------------------------------
    
    
    
    
    There are no categories which differ significantly here.
    
    
    #### gender:partner
    
    
    -------------------------------------------------------------------------------------
                  &nbsp;                Difference   Lower Bound   Upper Bound   P value 
    ---------------------------------- ------------ ------------- ------------- ---------
              **female:in a               -0.014       -0.777         0.749        _1_   
          relationship-male:in a                                                         
              relationship**                                                             
    
         **male:married-male:in a         -0.073        -1.25         1.103        _1_   
              relationship**                                                             
    
        **female:married-male:in a        -0.577       -1.877         0.722      _0.801_ 
              relationship**                                                             
    
         **male:single-male:in a          0.444         -0.23         1.119      _0.412_ 
              relationship**                                                             
    
        **female:single-male:in a         0.264        -0.545         1.074      _0.938_ 
              relationship**                                                             
    
        **male:married-female:in a        -0.059       -1.266         1.148        _1_   
              relationship**                                                             
    
       **female:married-female:in a       -0.563        -1.89         0.764      _0.83_  
              relationship**                                                             
    
        **male:single-female:in a         0.459        -0.267         1.184      _0.461_ 
              relationship**                                                             
    
       **female:single-female:in a        0.279        -0.574         1.132      _0.938_ 
              relationship**                                                             
    
     **female:married-male:married**      -0.504       -2.105         1.097      _0.946_ 
    
       **male:single-male:married**       0.518        -0.635         1.67       _0.794_ 
    
      **female:single-male:married**      0.338        -0.899         1.575      _0.971_ 
    
      **male:single-female:married**      1.022        -0.256          2.3       _0.201_ 
    
     **female:single-female:married**     0.842        -0.512         2.196      _0.481_ 
    
      **female:single-male:single**       -0.18        -0.955         0.594      _0.986_ 
    -------------------------------------------------------------------------------------
    
    
    
    
    There are no categories which differ significantly here.
    
    
    ### Plot
    
    
    Below you can see the result of the post hoc test on a plot.
    
    [![](plots/rapport-ANOVA-32-2.png)](plots/rapport-ANOVA-32-2-hires.png)
  • Exports of example
    pandoc and the other backends provide a great number of formats to export your templates.
    We have generated the following types for demonstartion purposes based on the example command(s):
    rapport("ANOVA", ius2008, resp = "leisure", fac = "gender")  # one-way
    rapport("ANOVA", ius2008, resp = "leisure", fac = c("gender", "partner")) # two-way
    Popular formats:
    Other morkdown formats
  • Source of `ANOVA.tpl`
    <!--head
    meta:
      title: ANOVA Template
      author: Aleksandar Blagotić, Dániel Nagy
      description: An ANOVA report with table of descriptives, diagnostic tests and ANOVA-specific
        statistics.
      email: ~
      packages: nortest
      example:
      - 'rapport("ANOVA", ius2008, resp = "leisure", fac = "gender")  # one-way'
      - 'rapport("ANOVA", ius2008, resp = "leisure", fac = c("gender", "partner")) # two-way'
    inputs:
    - name: resp
      label: Response variable
      description: Dependent (response) variable
      class: numeric
      length:
        min: 1.0
        max: 1.0
      value: ~
      required: yes
      standalone: no
    - name: fac
      label: Factor variables
      description: Independent variables (factors)
      class: factor
      length:
        min: 1.0
        max: 2.0
      value: ~
      required: yes
      standalone: no
    - name: fac.intr
      label: Factor interaction
      description: Include factor interaction
      class: logical
      length:
        min: 1.0
        max: 1.0
      value: yes
      required: no
      standalone: yes
    head-->
    
    <%=
    panderOptions('table.split.table', Inf)
    d <- structure(data.frame(resp, fac), .Names = c(resp.iname, fac.name))
    f.int <- fml(resp.iname, fac.name, join.right = "*")
    f.nonint <- fml(resp.iname, fac.name, join.right = "+")
    fit <- lm(ifelse(isTRUE(fac.intr), f.int, f.nonint), data = d)
    fac.plu <- switch(fac.ilen, '', 's')
    %>
    
    # Introduction
    
    **Analysis of Variance** or **ANOVA** is a statistical procedure that tests equality of means for several samples. It was first introduced in 1921 by famous English statistician Sir Ronald Aylmer Fisher.
    
    # Model Overview
    
    <%= switch(fac.ilen, 'One', 'Two') %>-Way ANOVA was carried out, with <%= p(fac.label) %> as independent variable<%= fac.plu %>, and <%= p(resp.label) %> as a response variable. Factor interaction was<%= ifelse(fac.intr, "", "n't")%> taken into account.
    
    # Descriptives
    
    In order to get more insight on the model data, a table of frequencies for ANOVA factors is displayed, as well as a table of descriptives.
    
    ## Frequency Table
    
    Below lies a frequency table for factors in ANOVA model. Note that the missing values are removed from the summary.
    
    <%=
    (freq <- rp.freq(fac.name, rp.data))
    %>
    
    ## Descriptive Statistics
    
    The following table displays the descriptive statistics of ANOVA model. Factor levels <%=ifelse(ncol(fac) > 1, "and their combinations", "")%> lie on the left-hand side, while the corresponding statistics for response variable are given on the right-hand side.
    
    <%=
    (desc <- rp.desc(resp, fac, c(Min = min, Max = max, Mean = mean, Std.Dev. = sd, Median = median, IQR, Skewness = skewness, Kurtosis = kurtosis)))
    %>
    
    # Diagnostics
    
    Before we carry out ANOVA, we'd like to check some basic assumptions. For those purposes, normality and homoscedascity tests are carried out alongside several graphs that may help you with your decision on model's main assumptions.
    
    ## Diagnostics
    
    ### Univariate Normality
    
    <% if (length(resp) < 5000) { %>
    
    <%= ntest <- htest(resp, lillie.test, ad.test, shapiro.test)
    k <- 0
    l <- 0
    m <- 0
    n <- 0
    p <- 0.05
    if (ntest$p[1] < 0.05) {l <- k + 1}
    if (ntest$p[2] < 0.05) {m <- l + 1}
    if (ntest$p[3] < 0.05) {n <- m + 1}
    ntest
    %>
    So, the conclusions we can draw with the help of test statistics: 
       
     - based on _Lilliefors test_, distribution of _<%= resp.label %>_ is <%= ifelse(ntest[1, 3] < p, "not normal", "normal") %>
       
     - _Anderson-Darling test_ confirms<%= ifelse(ntest[2, 3] < p, " violation of", "") %> normality assumption
    
     - according to _Shapiro-Wilk test_, the distribution of _<%= resp.label %>_ is<%= ifelse(ntest[3, 3] < p, " not", "") %> normal
     
    <% } else { %>
    <%= ntest <- htest(resp, lillie.test, ad.test)
    k <- 0
    l <- 0
    m <- 0
    n <- 0
    p <- 0.05
    if (ntest$p[1] < 0.05) {l <- k + 1}
    if (ntest$p[2] < 0.05) {n <- l + 1}
    ntest
    %>
    
    So, the conclusions we can draw with the help of test statistics: 
       
     - based on _Lilliefors test_, distribution of _<%= resp.label %>_ is <%= ifelse(ntest[1, 3] < p, "not normal", "normal") %>
       
     - _Anderson-Darling test_ confirms<%= ifelse(ntest[2, 3] < p, " violation of", "") %> normality assumption
    <% } %>
    
    <%= if (n > 0) {
    sprintf("As you can see, the applied tests %s of the %s.", ifelse(n > 1, "confirm departures from normality", "yield different results on hypotheses of normality, so you may want to stick with one you find most appropriate or you trust the most in the case"), resp.label) 
    } else { 
    sprintf("reject departures from normality") 
    } 
    %>
    
    
    ### Homoscedascity
    
    In order to test homoscedascity, _Bartlett_ and _Fligner-Kileen_ tests are applied.
    
    <%=
    hsced <- with(d, htest(as.formula(f.nonint), fligner.test, bartlett.test))
    hp <- hsced$p
    hcons <- all(hp < .05) | all(hp > .05)
    hp.all <- all(hp < .05)
    hsced
    %>
    
    
    When it comes to equality of variances, applied tests yield <%= ifelse(hcons, "consistent", "inconsistent") %> results. <%= if (hcons) { sprintf("Homoscedascity assumption is %s.", ifelse(hp.all, "rejected", "confirmed")) } else { sprintf("While _Fligner-Kileen test_ %s the hypotheses of homoscedascity, _Bartlett's test_ %s it.", ifelse(hp[1] < .05, "rejected", "confirmed"), ifelse(hp[2] < .05, "rejected", "confirmed")) } %>
    
    ## Diagnostic Plots
    
    Here you can see several diagnostic plots for ANOVA model:
    
     - residuals against fitted values
     - scale-location plot of square root of residuals against fitted values
     - normal Q-Q plot
     - residuals against leverages
    
    <%=
    par(mfrow = c(2, 2))
    +plot(fit)
    %>
    
    # ANOVA Summary
    
    ## ANOVA Table
    
    <%=
    a <- anova(fit)
    a.f <- a$F
    a.p <- a$Pr
    a.fp <- a.p < .05
    data.frame(a)
    %>
    
    _F-test_ for <%= p(fac.label[1]) %> is <%= ifelse(a.fp[1], "", "not") %> statistically significant, which implies that there is <%= ifelse(a.fp[1], "an", "no") %> <%= fac.label[1] %> effect on response variable. <%= if (fac.ilen == 2) { sprintf("Effect of %s on response variable is %s significant. ", p(fac.label[2]), ifelse(a.fp[2], "", "not")) } else { "" } %><%= if (fac.ilen == 2 & fac.intr) { sprintf("Interaction between levels of %s %s found significant (p = %.3f).", p(fac.label), ifelse(a.fp[3], "was", "wasn't"), a.p[3]) } else { "" } %>
    
    ## Post Hoc test
    
    ### Results
    
    After getting the results of the ANOVA, usually it is advisable to run a [post hoc test](http://en.wikipedia.org/wiki/Post-hoc_analysis) to explore patterns that were not specified a priori. Now we are presenting [Tukey's HSD test](http://en.wikipedia.org/wiki/Tukey%27s_range_test).
    
    <%= 
    aovfit <- aov(fit)
    Tukey <- TukeyHSD(aovfit) 
    %>
    
    <% for (v in names(Tukey)) { %>
    
    #### <%= v %>
    
    <%=  posthoc <- round(Tukey[[v]],3)
    colnames(posthoc) <- c("Difference", "Lower Bound", "Upper Bound", "P value") 
    is.signif <- length(posthoc[,4][which(abs(posthoc[,4]) < 0.05)]) > 0
    length.signif <- length(posthoc[,4][which(abs(posthoc[,4]) < 0.05)])
    if (is.signif) {
    post.signif <- paste(pander.return(lapply(1:length.signif, function(i) paste0(p(c(rownames(posthoc)[which(abs(posthoc[,4]) < 0.05)][i])), ' (', round(posthoc[,4][which(abs(posthoc[,4]) < 0.05)][i], 3), ')'))), collapse = '\n')
    } else {
    post.signif <- NULL
    }
    
    posthoc[,4] <- add.significance.stars(posthoc[,4])
    posthoc
    %>
    
    <% if (is.signif) { %>
    The following categories differ significantly (in the brackets you can see the p-value):
    <% } else { %>
    There are no categories which differ significantly here.
    <% } %>
    <%=
    post.signif
    %>
    
    <% } %>
    
    ### Plot
    
    Below you can see the result of the post hoc test on a plot.
    
    <%= Tukey_plot <- plot(TukeyHSD(aovfit)) %>

BartlettsTest.tpl

  • Bartlett's test
    by Gergely Daróczi, Daniel Nagy

    This template will run the Bartlett's test to check the equality of variances between groups.

    Required packages: nortest
    Example:
    rapport('BartlettsTest.tpl', data=ius2008, resp='age', group='gender')
    rapport('BartlettsTest.tpl', data=ius2008, resp='edu', group='student')
    rapport('BartlettsTest.tpl', data=ius2008, resp='age', group='net.required')
  • Input parameters
    BartlettsTest.tpl template has the following options to be (optionally) set while calling rapport:
    • resp: Response Variable
      This is the depend variables which will be used here
      Class: numeric

    • group: Grouping Variable
      This is the grouping variable which will be used here
      Class: factor

  • ##########################################################################################
    ##	 Running: rapport('BartlettsTest.tpl', data=ius2008, resp='age', group='gender')
    ##########################################################################################
    
    # Introduction
    
    
    Bartlett's test is used to test the homogeneity of the variances, in other words the equality of the tested variable's variances across the groups. With checking that we want to find if the two groups are coming from the same population.
    
    Homogeneity is useful to being tested, because that is an assumption of the One-Way ANOVA.
    
    
    ## References
    
    
      * Snedecor, George W. and Cochran, William G. (1989). _Statistical Methods_. Iowa State University Press
    
    
    # Normality assumption
    
    
    The Bartlett's test has an assumption of normality, thus one should obtain the information if the distribution of the tested variable had a normal distribution.
    
    We will use _Shapiro-Wilk_, _Lilliefors_ and _Anderson-Darling_ tests to screen departures from normality in the response variable.
    
    
    
    
    -------------------------------------------------
              Method             Statistic   p-value 
    --------------------------- ----------- ---------
            Lilliefors             0.17     6.193e-54
       (Kolmogorov-Smirnov)                          
          normality test                             
    
    Anderson-Darling normality     32.16    1.26e-71 
               test                                  
    
    Shapiro-Wilk normality test   0.8216    9.445e-27
    -------------------------------------------------
    
    
    So, the conclusions we can draw with the help of test statistics: 
       
     - based on _Lilliefors test_, distribution of _Age_ is not normal
       
     - _Anderson-Darling test_ confirms violation of normality assumption
    
     - according to _Shapiro-Wilk test_, the distribution of _Age_ is not normal
     
    
    
    As you can see, the applied tests confirm departures from normality.
    
    # Test results
    
    
    After checking the assumptions let's see what the test shows us!
    
    
    --------------------------------------------------
               Method             Statistic   p-value 
    ---------------------------- ----------- ---------
    Bartlett test of homogeneity    0.233     0.6293  
            of variances                              
    --------------------------------------------------
    
    
    
    According to the _Bartlett's test_, the variance of the _Age_ across the groups of _Gender_ does not differs significantly.
    
    We can conclude that, because the p-value is higher than 0.05.
    
    
    
    ###########################################################################################
    ##	 Running: rapport('BartlettsTest.tpl', data=ius2008, resp='edu', group='student')
    ###########################################################################################
    
    # Introduction
    
    
    Bartlett's test is used to test the homogeneity of the variances, in other words the equality of the tested variable's variances across the groups. With checking that we want to find if the two groups are coming from the same population.
    
    Homogeneity is useful to being tested, because that is an assumption of the One-Way ANOVA.
    
    
    ## References
    
    
      * Snedecor, George W. and Cochran, William G. (1989). _Statistical Methods_. Iowa State University Press
    
    
    # Normality assumption
    
    
    The Bartlett's test has an assumption of normality, thus one should obtain the information if the distribution of the tested variable had a normal distribution.
    
    We will use _Shapiro-Wilk_, _Lilliefors_ and _Anderson-Darling_ tests to screen departures from normality in the response variable.
    
    
    
    
    -------------------------------------------------
              Method             Statistic   p-value 
    --------------------------- ----------- ---------
            Lilliefors            0.2223    2.243e-92
       (Kolmogorov-Smirnov)                          
          normality test                             
    
    Anderson-Darling normality     42.04    3.31e-90 
               test                                  
    
    Shapiro-Wilk normality test   0.7985    6.366e-28
    -------------------------------------------------
    
    
    So, the conclusions we can draw with the help of test statistics: 
       
     - based on _Lilliefors test_, distribution of _Internet usage for educational purposes (hours per day)_ is not normal
       
     - _Anderson-Darling test_ confirms violation of normality assumption
    
     - according to _Shapiro-Wilk test_, the distribution of _Internet usage for educational purposes (hours per day)_ is not normal
     
    
    
    As you can see, the applied tests confirm departures from normality.
    
    # Test results
    
    
    After checking the assumptions let's see what the test shows us!
    
    
    --------------------------------------------------
               Method             Statistic   p-value 
    ---------------------------- ----------- ---------
    Bartlett test of homogeneity    36.11    1.863e-09
            of variances                              
    --------------------------------------------------
    
    
    
    According to the _Bartlett's test_, the variance of the _Internet usage for educational purposes (hours per day)_ across the groups of _Student_ significantly differs.
    
    We can conclude that, because the p-value is smaller than 0.05.
    
    
    
    ################################################################################################
    ##	 Running: rapport('BartlettsTest.tpl', data=ius2008, resp='age', group='net.required')
    ################################################################################################
    
    # Introduction
    
    
    Bartlett's test is used to test the homogeneity of the variances, in other words the equality of the tested variable's variances across the groups. With checking that we want to find if the two groups are coming from the same population.
    
    Homogeneity is useful to being tested, because that is an assumption of the One-Way ANOVA.
    
    
    ## References
    
    
      * Snedecor, George W. and Cochran, William G. (1989). _Statistical Methods_. Iowa State University Press
    
    
    # Normality assumption
    
    
    The Bartlett's test has an assumption of normality, thus one should obtain the information if the distribution of the tested variable had a normal distribution.
    
    We will use _Shapiro-Wilk_, _Lilliefors_ and _Anderson-Darling_ tests to screen departures from normality in the response variable.
    
    
    
    
    -------------------------------------------------
              Method             Statistic   p-value 
    --------------------------- ----------- ---------
            Lilliefors             0.17     6.193e-54
       (Kolmogorov-Smirnov)                          
          normality test                             
    
    Anderson-Darling normality     32.16    1.26e-71 
               test                                  
    
    Shapiro-Wilk normality test   0.8216    9.445e-27
    -------------------------------------------------
    
    
    So, the conclusions we can draw with the help of test statistics: 
       
     - based on _Lilliefors test_, distribution of _Age_ is not normal
       
     - _Anderson-Darling test_ confirms violation of normality assumption
    
     - according to _Shapiro-Wilk test_, the distribution of _Age_ is not normal
     
    
    
    As you can see, the applied tests confirm departures from normality.
    
    # Test results
    
    
    After checking the assumptions let's see what the test shows us!
    
    
    --------------------------------------------------
               Method             Statistic   p-value 
    ---------------------------- ----------- ---------
    Bartlett test of homogeneity    23.26    0.0001123
            of variances                              
    --------------------------------------------------
    
    
    
    According to the _Bartlett's test_, the variance of the _Age_ across the groups of _How often does your profession require Internet access?_ significantly differs.
    
    We can conclude that, because the p-value is smaller than 0.05.
  • Exports of example
    pandoc and the other backends provide a great number of formats to export your templates.
    We have generated the following types for demonstartion purposes based on the example command(s):
    rapport('BartlettsTest.tpl', data=ius2008, resp='age', group='gender')
    rapport('BartlettsTest.tpl', data=ius2008, resp='edu', group='student')
    rapport('BartlettsTest.tpl', data=ius2008, resp='age', group='net.required')
    Popular formats:
    Other morkdown formats
  • Source of `BartlettsTest.tpl`
    <!--head
    meta:
      title: Bartlett's test
      author: Gergely Daróczi, Daniel Nagy
      description: This template will run the Bartlett's test to check the equality of variances between groups.
      email: ~
      packages:
      - nortest
      example:
      - rapport('BartlettsTest.tpl', data=ius2008, resp='age', group='gender')
      - rapport('BartlettsTest.tpl', data=ius2008, resp='edu', group='student')
      - rapport('BartlettsTest.tpl', data=ius2008, resp='age', group='net.required')
    inputs:
    - name: resp
      label: Response Variable
      description: This is the depend variables which will be used here
      class: numeric
      length:
        min: 1.0
        max: 1.0
      required: yes
      standalone: no
    - name: group
      label: Grouping Variable
      description: This is the grouping variable which will be used here
      class: factor
      length:
        min: 1.0
        max: 1.0
      required: yes
      standalone: no
    head-->
    
    
    # Introduction
    
    Bartlett's test is used to test the homogeneity of the variances, in other words the equality of the tested variable's variances across the groups. With checking that we want to find if the two groups are coming from the same population.
    
    Homogeneity is useful to being tested, because that is an assumption of the One-Way ANOVA.
    
    ## References
    
      * Snedecor, George W. and Cochran, William G. (1989). _Statistical Methods_. Iowa State University Press
    
    # Normality assumption
    
    The Bartlett's test has an assumption of normality, thus one should obtain the information if the distribution of the tested variable had a normal distribution.
    
    We will use <%=ifelse(length(resp) < 5000, "_Shapiro-Wilk_, ", "")%>_Lilliefors_ and _Anderson-Darling_ tests to screen departures from normality in the response variable.
    
    <% if (length(resp) < 5000) { %>
    
    <%= ntest <- htest(resp, lillie.test, ad.test, shapiro.test)
    k <- 0
    l <- 0
    m <- 0
    n <- 0
    p <- 0.05
    if (ntest$p[1] < 0.05) {l <- k + 1}
    if (ntest$p[2] < 0.05) {m <- l + 1}
    if (ntest$p[3] < 0.05) {n <- m + 1}
    ntest
    %>
    So, the conclusions we can draw with the help of test statistics: 
       
     - based on _Lilliefors test_, distribution of _<%= resp.label %>_ is <%= ifelse(ntest[1, 3] < p, "not normal", "normal") %>
       
     - _Anderson-Darling test_ confirms<%= ifelse(ntest[2, 3] < p, " violation of", "") %> normality assumption
    
     - according to _Shapiro-Wilk test_, the distribution of _<%= resp.label %>_ is<%= ifelse(ntest[3, 3] < p, " not", "") %> normal
     
    <% } else { %>
    <%= ntest <- htest(resp, lillie.test, ad.test)
    k <- 0
    l <- 0
    m <- 0
    n <- 0
    p <- 0.05
    if (ntest$p[1] < 0.05) {l <- k + 1}
    if (ntest$p[2] < 0.05) {n <- l + 1}
    ntest
    %>
    
    So, the conclusions we can draw with the help of test statistics: 
       
     - based on _Lilliefors test_, distribution of _<%= resp.label %>_ is <%= ifelse(ntest[1, 3] < p, "not normal", "normal") %>
       
     - _Anderson-Darling test_ confirms<%= ifelse(ntest[2, 3] < p, " violation of", "") %> normality assumption
    <% } %>
    
    <%= if (n > 0) {
    sprintf("As you can see, the applied tests %s.", ifelse(n > 1, "confirm departures from normality", "yield different results on hypotheses of normality, so you may want to stick with one you find most appropriate or you trust the most.")) 
    } else { 
    sprintf("reject departures from normality") 
    } 
    %>
    
    # Test results
    
    After checking the assumptions let's see what the test shows us!
    
    <%=
    bartlett <- function(resp) bartlett.test(resp, group)
    h <- htest(resp, bartlett)
    h
    p <- .05
    %>
    
    According to the _Bartlett's test_, the variance of the _<%= resp.label %>_ across the groups of _<%= group.label %>_ <%= ifelse(h[1, 3] < p, "significantly differs", "does not differs significantly") %>.
    
    We can conclude that, because <%= ifelse(h[1, 3] < p,"the p-value is smaller than 0.05.","the p-value is higher than 0.05.")%>

BrownForsythTest.tpl

  • Brown-Forsyth test
    by Gergely Daróczi, Daniel Nagy

    This template will run a Brown-Forsyth test to check the equality of variances among groups.

    Required packages: HH
    Example:
    rapport('BrownForsythTest.tpl', data=ius2008, resp='age', group='gender')
    rapport('BrownForsythTest.tpl', data=ius2008, resp='edu', group='gender')
    rapport('BrownForsythTest.tpl', data=ius2008, resp='edu', group='net.required')
  • Input parameters
    BrownForsythTest.tpl template has the following options to be (optionally) set while calling rapport:
    • resp: Response Variable
      This is the depend variables which will be used here
      Class: numeric

    • group: Grouping Variable
      This is the grouping variable which will be used here
      Class: factor

  • #############################################################################################
    ##	 Running: rapport('BrownForsythTest.tpl', data=ius2008, resp='age', group='gender')
    #############################################################################################
    
    # Introduction
    
    
    The Brown-Forsyth test is used for checking the equality of the variances among the groups of one variable, in other words it tests the homogeneity of the variances. Equality of group variances is an assumption of the one-way ANOVA test.
    
    The base of the test is really similar to the Levene's test, but the Brown-Forsyth test uses the deviations from the group medians instead of the mean (what the Levene's does), thus the Brown-Forsyth test is called more robust.
    
    This test has the advantage over the other tests, which are also being used to check the homogeneity (F-test and Bartlett's test), that it does not have a normality assumption, so the variable we investigate do not have to follow a normal distribution.
    
    
    ## References
    
    
      * Brown, M.~B. and Forsyth, A.~B. (1974). Robust tests for equality of variances. _Journal of the American Statistical Association_, 69:364-367.
    
    
    # Result
    
    
    ----------------------------------------
          Method        Statistic   p-value 
    ------------------ ----------- ---------
    hov: Brown-Forsyth   0.3847     0.5353  
    ----------------------------------------
    
    
    
    According to the _Brown-Forsyth test_, the variances of the _Age_ across the groups of _Gender_ does not differs significantly.
    
    We can conclude that, because the p-value is higher than 0.05
    
    
    ## References
    
    
      * Heiberger, Richard M. and Holland, Burt (2004b). _Statistical Analysis and Data Display: An Intermediate Course with Examples in S-Plus, R, and SAS_.
    
    
    
    
    #############################################################################################
    ##	 Running: rapport('BrownForsythTest.tpl', data=ius2008, resp='edu', group='gender')
    #############################################################################################
    
    # Introduction
    
    
    The Brown-Forsyth test is used for checking the equality of the variances among the groups of one variable, in other words it tests the homogeneity of the variances. Equality of group variances is an assumption of the one-way ANOVA test.
    
    The base of the test is really similar to the Levene's test, but the Brown-Forsyth test uses the deviations from the group medians instead of the mean (what the Levene's does), thus the Brown-Forsyth test is called more robust.
    
    This test has the advantage over the other tests, which are also being used to check the homogeneity (F-test and Bartlett's test), that it does not have a normality assumption, so the variable we investigate do not have to follow a normal distribution.
    
    
    ## References
    
    
      * Brown, M.~B. and Forsyth, A.~B. (1974). Robust tests for equality of variances. _Journal of the American Statistical Association_, 69:364-367.
    
    
    # Result
    
    
    ----------------------------------------
          Method        Statistic   p-value 
    ------------------ ----------- ---------
    hov: Brown-Forsyth    4.681     0.03087 
    ----------------------------------------
    
    
    
    According to the _Brown-Forsyth test_, the variances of the _Internet usage for educational purposes (hours per day)_ across the groups of _Gender_ significantly differs.
    
    We can conclude that, because the p-value is smaller than 0.05
    
    
    ## References
    
    
      * Heiberger, Richard M. and Holland, Burt (2004b). _Statistical Analysis and Data Display: An Intermediate Course with Examples in S-Plus, R, and SAS_.
    
    
    
    
    ###################################################################################################
    ##	 Running: rapport('BrownForsythTest.tpl', data=ius2008, resp='edu', group='net.required')
    ###################################################################################################
    
    # Introduction
    
    
    The Brown-Forsyth test is used for checking the equality of the variances among the groups of one variable, in other words it tests the homogeneity of the variances. Equality of group variances is an assumption of the one-way ANOVA test.
    
    The base of the test is really similar to the Levene's test, but the Brown-Forsyth test uses the deviations from the group medians instead of the mean (what the Levene's does), thus the Brown-Forsyth test is called more robust.
    
    This test has the advantage over the other tests, which are also being used to check the homogeneity (F-test and Bartlett's test), that it does not have a normality assumption, so the variable we investigate do not have to follow a normal distribution.
    
    
    ## References
    
    
      * Brown, M.~B. and Forsyth, A.~B. (1974). Robust tests for equality of variances. _Journal of the American Statistical Association_, 69:364-367.
    
    
    # Result
    
    
    ----------------------------------------
          Method        Statistic   p-value 
    ------------------ ----------- ---------
    hov: Brown-Forsyth    15.89    2.131e-12
    ----------------------------------------
    
    
    
    According to the _Brown-Forsyth test_, the variances of the _Internet usage for educational purposes (hours per day)_ across the groups of _How often does your profession require Internet access?_ significantly differs.
    
    We can conclude that, because the p-value is smaller than 0.05
    
    
    ## References
    
    
      * Heiberger, Richard M. and Holland, Burt (2004b). _Statistical Analysis and Data Display: An Intermediate Course with Examples in S-Plus, R, and SAS_.
  • Exports of example
    pandoc and the other backends provide a great number of formats to export your templates.
    We have generated the following types for demonstartion purposes based on the example command(s):
    rapport('BrownForsythTest.tpl', data=ius2008, resp='age', group='gender')
    rapport('BrownForsythTest.tpl', data=ius2008, resp='edu', group='gender')
    rapport('BrownForsythTest.tpl', data=ius2008, resp='edu', group='net.required')
    Popular formats:
    Other morkdown formats
  • Source of `BrownForsythTest.tpl`
    <!--head
    meta:
      title: Brown-Forsyth test
      author: Gergely Daróczi, Daniel Nagy
      description: This template will run a Brown-Forsyth test to check the equality of variances among groups.
      email: ~
      packages:
      - HH
      example:
      - rapport('BrownForsythTest.tpl', data=ius2008, resp='age', group='gender')
      - rapport('BrownForsythTest.tpl', data=ius2008, resp='edu', group='gender')
      - rapport('BrownForsythTest.tpl', data=ius2008, resp='edu', group='net.required')
    inputs:
    - name: resp
      label: Response Variable
      description: This is the depend variables which will be used here
      class: numeric
      length:
        min: 1.0
        max: 1.0
      required: yes
      standalone: no
    - name: group
      label: Grouping Variable
      description: This is the grouping variable which will be used here
      class: factor
      length:
        min: 1.0
        max: 1.0
      required: yes
      standalone: no
    head-->
    
    # Introduction
    
    The Brown-Forsyth test is used for checking the equality of the variances among the groups of one variable, in other words it tests the homogeneity of the variances. Equality of group variances is an assumption of the one-way ANOVA test.
    
    The base of the test is really similar to the Levene's test, but the Brown-Forsyth test uses the deviations from the group medians instead of the mean (what the Levene's does), thus the Brown-Forsyth test is called more robust.
    
    This test has the advantage over the other tests, which are also being used to check the homogeneity (F-test and Bartlett's test), that it does not have a normality assumption, so the variable we investigate do not have to follow a normal distribution.
    
    ## References
    
      * Brown, M.~B. and Forsyth, A.~B. (1974). Robust tests for equality of variances. _Journal of the American Statistical Association_, 69:364-367.
    
    # Result
    
    <%=
    df     <- na.omit(data.frame(resp = as.numeric(resp), group = as.factor(group)))
    BFtest <- function(x) hov(resp ~ group, data = df)
    h      <- htest(resp, BFtest)
    h
    p <- .05
    %>
    
    According to the _Brown-Forsyth test_, the variances of the _<%= resp.label %>_ across the groups of _<%= group.label %>_ <%= ifelse(h[1, 3] < p, "significantly differs", "does not differs significantly") %>.
    
    We can conclude that, because <%= ifelse(h[1, 3] < p,"the p-value is smaller than 0.05","the p-value is higher than 0.05")%>
    
    ## References
    
      * Heiberger, Richard M. and Holland, Burt (2004b). _Statistical Analysis and Data Display: An Intermediate Course with Examples in S-Plus, R, and SAS_.

Correlation.tpl

  • Correlations
    by Daróczi Gergely, Nagy Dániel

    This template will return the correlation matrix of supplied numerical variables.

    Required packages:
    Example:
    rapport('Correlation', data=ius2008, vars=c('age', 'edu'))
    rapport('Correlation', data=ius2008, vars=c('age', 'edu', 'leisure'))
    rapport('Correlation', data=mtcars, vars=c('mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am', 'gear', 'carb'))
  • Input parameters
    Correlation.tpl template has the following options to be (optionally) set while calling rapport:
    • vars: Variable
      Numerical variables
      Class: numeric

  • ##############################################################################
    ##	 Running: rapport('Correlation', data=ius2008, vars=c('age', 'edu'))
    ##############################################################################
    
    # Introduction
    
    
    [Correlation](http://en.wikipedia.org/wiki/Correlation_and_dependence) is one of the most commonly used statistical tool. With the help of that we can get information about a possible [linear relation](http://en.wikipedia.org/wiki/Linear_independence) between two variables. According to the definition of the correlation, one can call it also as the standardized [covariance](http://en.wikipedia.org/wiki/Covariance).
    
    The maximum possible value of the correlation (the so-called [correlation coefficient](http://en.wikipedia.org/wiki/Correlation_coefficient)) could be 1, the minimum could be -1. In the first case there is a perfect positive (thus in the second case there is a perfect negative) linear relationship between the two variables, though perfect relationships, especially in the social sciences, are quite rare. If two variables are independent from each other, the correlation between them is 0, but 0 correlation coefficient only means certainly a [linear independency](http://en.wikipedia.org/wiki/Correlation_and_dependence#Correlation_and_linearity).
    
    Because extreme values occur seldom we have rule of thumbs for the coefficients, like other fields of statistics:
    
      * we call two variables highly correlated if the absolute value of the correlation coefficient between them is higher than 0.7 and
      * we call them uncorrelated if that is smaller than 0.2.
    
    Please note that [correlation has nothing to do with causal models](http://en.wikipedia.org/wiki/Correlation_does_not_imply_causation), it only shows association but not effects.
    
    
    # Variable description
    
    _709_ variables with _2_ cases provided.
    
    
    
    
    
    
    
    There are no highly correlated (r < -0.7 or r > 0.7) variables.
    
    
    
    
    There are no uncorrelated correlated (r < -0.2 or r > 0.2) variables.
    
    
    
    # Correlation matrix
    
    
    -----------------------------------
       &nbsp;     age          edu     
    --------- ------------ ------------
      **age**              0.2185 * * *
    
      **edu** 0.2185 * * *             
    -----------------------------------
    
    Table: Correlation matrix
    
    
    
    Where the stars represent the [significance levels](http://en.wikipedia.org/wiki/Statistical_significance) of the bivariate correlation coefficients: one star for a [p value](http://en.wikipedia.org/wiki/P-value) below `0.05`, two for below `0.01` and three for below `0.001`.
    
    On the plot one can see the correlation in two forms: below the [diagonal](http://en.wikipedia.org/wiki/Main_diagonal) visually, above that one can find the coefficient(s).
    
    [![Scatterplot matrix](plots/rapport-Correlation-46-1.png)](plots/rapport-Correlation-46-1-hires.png)
    
    
    #########################################################################################
    ##	 Running: rapport('Correlation', data=ius2008, vars=c('age', 'edu', 'leisure'))
    #########################################################################################
    
    # Introduction
    
    
    [Correlation](http://en.wikipedia.org/wiki/Correlation_and_dependence) is one of the most commonly used statistical tool. With the help of that we can get information about a possible [linear relation](http://en.wikipedia.org/wiki/Linear_independence) between two variables. According to the definition of the correlation, one can call it also as the standardized [covariance](http://en.wikipedia.org/wiki/Covariance).
    
    The maximum possible value of the correlation (the so-called [correlation coefficient](http://en.wikipedia.org/wiki/Correlation_coefficient)) could be 1, the minimum could be -1. In the first case there is a perfect positive (thus in the second case there is a perfect negative) linear relationship between the two variables, though perfect relationships, especially in the social sciences, are quite rare. If two variables are independent from each other, the correlation between them is 0, but 0 correlation coefficient only means certainly a [linear independency](http://en.wikipedia.org/wiki/Correlation_and_dependence#Correlation_and_linearity).
    
    Because extreme values occur seldom we have rule of thumbs for the coefficients, like other fields of statistics:
    
      * we call two variables highly correlated if the absolute value of the correlation coefficient between them is higher than 0.7 and
      * we call them uncorrelated if that is smaller than 0.2.
    
    Please note that [correlation has nothing to do with causal models](http://en.wikipedia.org/wiki/Correlation_does_not_imply_causation), it only shows association but not effects.
    
    
    # Variable description
    
    _709_ variables with _3_ cases provided.
    
    
    
    
    The highest correlation coefficient (_0.2273_) is between _edu_ and _age_ and the lowest (_-0.03377_) is between _leisure_ and _age_. It seems that the strongest association (r=_0.2273_) is between _edu_ and _age_.
    
    
    
    
    There are no highly correlated (r < -0.7 or r > 0.7) variables.
    
    
    
    
    Uncorrelated (-0.2 < r < 0.2) variables:
    
    
    
      * _leisure_ and _age_ (-0.03)
      * _leisure_ and _edu_ (0.17)
    
    <!-- end of list -->
    
    
    # Correlation matrix
    
    
    ----------------------------------------------------
           &nbsp;     age          edu        leisure   
    ------------- ------------ ------------ ------------
          **age**              0.2273 * * *   -0.0338   
    
          **edu** 0.2273 * * *              0.1732 * * *
    
      **leisure**   -0.0338    0.1732 * * *             
    ----------------------------------------------------
    
    Table: Correlation matrix
    
    
    
    Where the stars represent the [significance levels](http://en.wikipedia.org/wiki/Statistical_significance) of the bivariate correlation coefficients: one star for a [p value](http://en.wikipedia.org/wiki/P-value) below `0.05`, two for below `0.01` and three for below `0.001`.
    
    On the plot one can see the correlation in two forms: below the [diagonal](http://en.wikipedia.org/wiki/Main_diagonal) visually, above that one can find the coefficient(s).
    
    [![Scatterplot matrix](plots/rapport-Correlation-47-1.png)](plots/rapport-Correlation-47-1-hires.png)
    
    
    #############################################################################################################################################
    ##	 Running: rapport('Correlation', data=mtcars, vars=c('mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am', 'gear', 'carb'))
    #############################################################################################################################################
    
    # Introduction
    
    
    [Correlation](http://en.wikipedia.org/wiki/Correlation_and_dependence) is one of the most commonly used statistical tool. With the help of that we can get information about a possible [linear relation](http://en.wikipedia.org/wiki/Linear_independence) between two variables. According to the definition of the correlation, one can call it also as the standardized [covariance](http://en.wikipedia.org/wiki/Covariance).
    
    The maximum possible value of the correlation (the so-called [correlation coefficient](http://en.wikipedia.org/wiki/Correlation_coefficient)) could be 1, the minimum could be -1. In the first case there is a perfect positive (thus in the second case there is a perfect negative) linear relationship between the two variables, though perfect relationships, especially in the social sciences, are quite rare. If two variables are independent from each other, the correlation between them is 0, but 0 correlation coefficient only means certainly a [linear independency](http://en.wikipedia.org/wiki/Correlation_and_dependence#Correlation_and_linearity).
    
    Because extreme values occur seldom we have rule of thumbs for the coefficients, like other fields of statistics:
    
      * we call two variables highly correlated if the absolute value of the correlation coefficient between them is higher than 0.7 and
      * we call them uncorrelated if that is smaller than 0.2.
    
    Please note that [correlation has nothing to do with causal models](http://en.wikipedia.org/wiki/Correlation_does_not_imply_causation), it only shows association but not effects.
    
    
    # Variable description
    
    _32_ variables with _11_ cases provided.
    
    
    
    
    The highest correlation coefficient (_0.902_) is between _disp_ and _cyl_ and the lowest (_-0.8677_) is between _wt_ and _mpg_. It seems that the strongest association (r=_0.902_) is between _disp_ and _cyl_.
    
    
    
    
    Highly correlated (r < -0.7 or r > 0.7) variables:
    
    
    
      * _cyl_ and _mpg_ (-0.85)
      * _disp_ and _mpg_ (-0.85)
      * _hp_ and _mpg_ (-0.78)
      * _wt_ and _mpg_ (-0.87)
      * _disp_ and _cyl_ (0.9)
      * _hp_ and _cyl_ (0.83)
      * _wt_ and _cyl_ (0.78)
      * _vs_ and _cyl_ (-0.81)
      * _hp_ and _disp_ (0.79)
      * _drat_ and _disp_ (-0.71)
      * _wt_ and _disp_ (0.89)
      * _vs_ and _disp_ (-0.71)
      * _qsec_ and _hp_ (-0.71)
      * _vs_ and _hp_ (-0.72)
      * _carb_ and _hp_ (0.75)
      * _wt_ and _drat_ (-0.71)
      * _am_ and _drat_ (0.71)
      * _vs_ and _qsec_ (0.74)
      * _gear_ and _am_ (0.79)
    
    <!-- end of list -->
    
    
    
    
    
    
    
    Uncorrelated (-0.2 < r < 0.2) variables:
    
    
    
      * _gear_ and _hp_ (-0.13)
      * _qsec_ and _drat_ (0.09)
      * _carb_ and _drat_ (-0.09)
      * _qsec_ and _wt_ (-0.17)
      * _am_ and _vs_ (0.17)
      * _carb_ and _am_ (0.06)
    
    <!-- end of list -->
    
    
    # Correlation matrix
    
    
    ----------------------------------------------------
        &nbsp;      mpg           cyl          disp     
    ---------- ------------- ------------- -------------
       **mpg**               -0.8522 * * * -0.8476 * * *
    
       **cyl** -0.8522 * * *               0.9020 * * * 
    
      **disp** -0.8476 * * * 0.9020 * * *               
    
        **hp** -0.7762 * * * 0.8324 * * *  0.7909 * * * 
    
      **drat** 0.6812 * * *  -0.6999 * * * -0.7102 * * *
    
        **wt** -0.8677 * * * 0.7825 * * *  0.8880 * * * 
    
      **qsec**   0.4187 *    -0.5912 * * *   -0.4337 *  
    
        **vs** 0.6640 * * *  -0.8108 * * * -0.7104 * * *
    
        **am** 0.5998 * * *   -0.5226 * *  -0.5912 * * *
    
      **gear**  0.4803 * *    -0.4927 * *  -0.5556 * * *
    
      **carb**  -0.5509 * *   0.5270 * *     0.3950 *   
    ----------------------------------------------------
    
    Table: Correlation matrix (continued below)
    
     
    ----------------------------------------------------
        &nbsp;      hp           drat           wt      
    ---------- ------------- ------------- -------------
       **mpg** -0.7762 * * * 0.6812 * * *  -0.8677 * * *
    
       **cyl** 0.8324 * * *  -0.6999 * * * 0.7825 * * * 
    
      **disp** 0.7909 * * *  -0.7102 * * * 0.8880 * * * 
    
        **hp**                -0.4488 * *  0.6587 * * * 
    
      **drat**  -0.4488 * *                -0.7124 * * *
    
        **wt** 0.6587 * * *  -0.7124 * * *              
    
      **qsec** -0.7082 * * *    0.0912        -0.1747   
    
        **vs** -0.7231 * * *   0.4403 *    -0.5549 * * *
    
        **am**    -0.2432    0.7127 * * *  -0.6925 * * *
    
      **gear**    -0.1257    0.6996 * * *  -0.5833 * * *
    
      **carb** 0.7498 * * *     -0.0908      0.4276 *   
    ----------------------------------------------------
    
    Table: Table continues below
    
     
    ----------------------------------------------------
        &nbsp;     qsec           vs            am      
    ---------- ------------- ------------- -------------
       **mpg**   0.4187 *    0.6640 * * *  0.5998 * * * 
    
       **cyl** -0.5912 * * * -0.8108 * * *  -0.5226 * * 
    
      **disp**   -0.4337 *   -0.7104 * * * -0.5912 * * *
    
        **hp** -0.7082 * * * -0.7231 * * *    -0.2432   
    
      **drat**    0.0912       0.4403 *    0.7127 * * * 
    
        **wt**    -0.1747    -0.5549 * * * -0.6925 * * *
    
      **qsec**               0.7445 * * *     -0.2299   
    
        **vs** 0.7445 * * *                   0.1683    
    
        **am**    -0.2299       0.1683                  
    
      **gear**    -0.2127       0.2060     0.7941 * * * 
    
      **carb** -0.6562 * * * -0.5696 * * *    0.0575    
    ----------------------------------------------------
    
    Table: Table continues below
    
     
    --------------------------------------
        &nbsp;     gear          carb     
    ---------- ------------- -------------
       **mpg**  0.4803 * *    -0.5509 * * 
    
       **cyl**  -0.4927 * *   0.5270 * *  
    
      **disp** -0.5556 * * *   0.3950 *   
    
        **hp**    -0.1257    0.7498 * * * 
    
      **drat** 0.6996 * * *     -0.0908   
    
        **wt** -0.5833 * * *   0.4276 *   
    
      **qsec**    -0.2127    -0.6562 * * *
    
        **vs**    0.2060     -0.5696 * * *
    
        **am** 0.7941 * * *     0.0575    
    
      **gear**                  0.2741    
    
      **carb**    0.2741                  
    --------------------------------------
    
    
    
    Where the stars represent the [significance levels](http://en.wikipedia.org/wiki/Statistical_significance) of the bivariate correlation coefficients: one star for a [p value](http://en.wikipedia.org/wiki/P-value) below `0.05`, two for below `0.01` and three for below `0.001`.
    
    On the plot one can see the correlation in two forms: below the [diagonal](http://en.wikipedia.org/wiki/Main_diagonal) visually, above that one can find the coefficient(s).
    
    [![Scatterplot matrix](plots/rapport-Correlation-48-1.png)](plots/rapport-Correlation-48-1-hires.png)
  • Exports of example
    pandoc and the other backends provide a great number of formats to export your templates.
    We have generated the following types for demonstartion purposes based on the example command(s):
    rapport('Correlation', data=ius2008, vars=c('age', 'edu'))
    rapport('Correlation', data=ius2008, vars=c('age', 'edu', 'leisure'))
    rapport('Correlation', data=mtcars, vars=c('mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am', 'gear', 'carb'))
    Popular formats:
    Other morkdown formats
  • Source of `Correlation.tpl`
    <!--head
    meta:
      title: Correlations
      author: Daróczi Gergely, Nagy Dániel
      description: This template will return the correlation matrix of supplied numerical variables.
      email: gergely@snowl.net
      packages: ~
      example:
      - rapport('Correlation', data=ius2008, vars=c('age', 'edu'))
      - rapport('Correlation', data=ius2008, vars=c('age', 'edu', 'leisure'))
      - rapport('Correlation', data=mtcars, vars=c('mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am', 'gear', 'carb'))
    inputs:
    - name: vars
      label: Variable
      description: Numerical variables
      class: numeric
      length:
        min: 2.0
        max: 50.0
      value: ~
      required: yes
      standalone: no
    head-->
    
    # Introduction
    
    [Correlation](http://en.wikipedia.org/wiki/Correlation_and_dependence) is one of the most commonly used statistical tool. With the help of that we can get information about a possible [linear relation](http://en.wikipedia.org/wiki/Linear_independence) between two variables. According to the definition of the correlation, one can call it also as the standardized [covariance](http://en.wikipedia.org/wiki/Covariance).
    
    The maximum possible value of the correlation (the so-called [correlation coefficient](http://en.wikipedia.org/wiki/Correlation_coefficient)) could be 1, the minimum could be -1. In the first case there is a perfect positive (thus in the second case there is a perfect negative) linear relationship between the two variables, though perfect relationships, especially in the social sciences, are quite rare. If two variables are independent from each other, the correlation between them is 0, but 0 correlation coefficient only means certainly a [linear independency](http://en.wikipedia.org/wiki/Correlation_and_dependence#Correlation_and_linearity).
    
    Because extreme values occur seldom we have rule of thumbs for the coefficients, like other fields of statistics:
    
      * we call two variables highly correlated if the absolute value of the correlation coefficient between them is higher than 0.7 and
      * we call them uncorrelated if that is smaller than 0.2.
    
    Please note that [correlation has nothing to do with causal models](http://en.wikipedia.org/wiki/Correlation_does_not_imply_causation), it only shows association but not effects.
    
    # Variable description
    
    <%=dim(vars)[1]%> variables with <%=dim(vars)[2]%> cases provided.
    
    <%=
    cm <- cor(vars, use = 'complete.obs')
    diag(cm) <- NA
    %>
    
    <% if (length(vars) > 2) { %>
    The highest correlation coefficient (<%=max(cm, na.rm = T)%>) is between <%=row.names(which(cm == max(cm, na.rm = T), arr.ind = T))[1:2]%> and the lowest (<%=min(cm, na.rm = T)%>) is between <%=row.names(which(cm == min(cm, na.rm = T), arr.ind = T))[1:2]%>. It seems that the strongest association (r=<%=cm[which(abs(cm) == max(abs(cm), na.rm = T), arr.ind = T)][1]%>) is between <%=row.names(which(abs(cm) == max(abs(cm), na.rm = T), arr.ind = T))[1:2]%>.
    <% } %>
    
    <%
    cm[upper.tri(cm)] <- NA
    h <- which((cm > 0.7) | (cm < -0.7), arr.ind = T)
    if (nrow(h) > 0) {
    %>
    
    Highly correlated (r < -0.7 or r > 0.7) variables:
    
    <%=paste(pander.return(lapply(1:nrow(h), function(i) paste0(p(c(rownames(cm)[h[i, 1]], colnames(cm)[h[i, 2]])), ' (', round(cm[h[i, 1], h[i, 2]], 2), ')'))), collapse = '\n')%>
    
    <%} else { %>
    
    There are no highly correlated (r < -0.7 or r > 0.7) variables.
    <% } %>
    
    <%
    h <- which((cm < 0.2)&(cm > -0.2), arr.ind = T)
    if (nrow(h) > 0) {
    %>
    
    Uncorrelated (-0.2 < r < 0.2) variables:
    
    <%=
    if (nrow(h) > 0) {
        paste(pander.return(lapply(1:nrow(h), function(i) paste0(p(c(rownames(cm)[h[i, 1]], colnames(cm)[h[i, 2]])), ' (', round(cm[h[i, 1], h[i, 2]], 2), ')'))), collapse = '\n')
    }
    %>
    
    <%} else { %>
    
    There are no uncorrelated correlated (r < -0.2 or r > 0.2) variables.
    <% } %>
    
    # Correlation matrix
    
    <%=
    set.caption('Correlation matrix')
    cm <- format(round(cor(vars, use = 'complete.obs'), 4), decimal.mark = panderOptions('decimal.mark'))
    d <- attributes(cm)
    for (row in attr(cm, 'dimnames')[[1]])
        for (col in attr(cm, 'dimnames')[[2]]) {
            test.p <- cor.test(vars[, row], vars[, col])$p.value
            cm[row, col] <- paste(cm[row, col], ' ', ifelse(test.p > 0.05, '', ifelse(test.p > 0.01, ' *', ifelse(test.p > 0.001, ' * *', ' * * *'))), sep='')
        }
    diag(cm) <- ''
    set.alignment('centre', 'right')
    as.data.frame(cm)
    %>
    
    Where the stars represent the [significance levels](http://en.wikipedia.org/wiki/Statistical_significance) of the bivariate correlation coefficients: one star for a [p value](http://en.wikipedia.org/wiki/P-value) below `0.05`, two for below `0.01` and three for below `0.001`.
    
    On the plot one can see the correlation in two forms: below the [diagonal](http://en.wikipedia.org/wiki/Main_diagonal) visually, above that one can find the coefficient(s).
    
    <%=
    labels <- lapply(vars, rp.name)
    quick.plot <- FALSE
    if (nrow(vars) > 1000) {
        quick.plot <- TRUE
        vars <- vars[sample(1:nrow(vars), size = 1000), ]
    }
    
    ## custom panels
    panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...) {
    
        ## forked from ?pairs
        par(usr = c(0, 1, 0, 1))
        r   <- cor(x, y, use = 'complete.obs')
        txt <- format(c(r, 0.123456789), digits = digits, decimal.mark = panderOptions('decimal.mark'))[1]
        txt <- paste(prefix, txt, sep = "")
        if(missing(cex.cor)) { cex <- 0.8/strwidth(txt) }
        test <- cor.test(x, y)
        Signif <- symnum(test$p.value, corr = FALSE, na = FALSE,
            cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1),
            symbols = c("***", "**", "*", ".", " "))
        text(0.5, 0.5, txt, cex = cex * abs(r) * 1.5)
        text(.8, .8, Signif, cex = cex, col = 2)
    }
    
    ## plot
    set.caption(sprintf('Scatterplot matrix%s', ifelse(quick.plot, ' (based on a sample size of 1000)', '')))
    pairs(vars, lower.panel = 'panel.smooth', upper.panel = 'panel.cor', labels = labels)
    %>

Crosstable.tpl

  • Crosstable
    by Gergely Daróczi, Daniel Nagy

    Returning the Chi-squared test of two given variables with count, percentages and Pearson's residuals table.

    Required packages: descr
    Example:
    rapport('Crosstable', data=ius2008, row='gender', col='dwell')
    rapport('Crosstable', data=ius2008, row='email', col='dwell')
  • Input parameters
    Crosstable.tpl template has the following options to be (optionally) set while calling rapport:
    • row: Row variable
      A categorical variable.
      Class: factor

    • col: Column variable
      A categorical variable.
      Class: factor

  • ##################################################################################
    ##	 Running: rapport('Crosstable', data=ius2008, row='gender', col='dwell')
    ##################################################################################
    
    # Variable description
    
    
    Two variables specified:
    
     * "gender" ("Gender") with _673_ valid values and
     * "dwell" ("Dwelling") with _662_ valid values.
    
    
    ## Introduction
    
    
    [Crosstables](http://en.wikipedia.org/wiki/Cross_tabulation) are applicable to show the frequencies of categorical variables in a matrix form, with a table view.
    
    We will present four types of these crosstables. The first of them shows the *exact numbers of the observations*, ergo the number of the observations each of the variables' categories commonly have.
    
    The second also shows the possessions each of these cells have, but not the exact numbers of the observations, rather the *percentages* of them from the total data.
    
    The last two type of the crosstabs contain the so-called *row and column percentages* which demonstrate us the distribution of the frequencies if we concentrate only on one variable.
    
    After that we present the *tests* with which we can investigate the possible relationships, associations between the variables, like Chi-squared test, Fisher Exact Test, Goodman and Kruskal's lambda.
    
    In the last part there are some *charts* presented, with that one can visually observe the distribution of the frequencies.
    
    
    # Counts
    
    
    -----------------------------------------------------------
           &nbsp;  city   small town   village   Missing   Sum 
    ------------- ------ ------------ --------- --------- -----
         **male**  338        28         19        25      410 
    
       **female**  234        3           9        17      263 
    
      **Missing**   27        2           2         5      36  
    
          **Sum**  599        33         30        47      709 
    -----------------------------------------------------------
    
    Table: Counted values: "gender" and "dwell"
    
    
    
    Most of the cases (_338_) can be found in "male-city" categories. Row-wise "male" holds the highest number of cases (_410_) while column-wise "city" has the utmost cases (_599_).
    
    
    # Percentages
    
    
    -----------------------------------------------------------
           &nbsp;  city   small town   village   Missing   Sum 
    ------------- ------ ------------ --------- --------- -----
         **male** 47.67      3.95       2.68      3.53    57.83
    
       **female**   33       0.42       1.27       2.4    37.09
    
      **Missing**  3.81      0.28       0.28      0.71    5.08 
    
          **Sum** 84.49      4.65       4.23      6.63     100 
    -----------------------------------------------------------
    
    Table: Total percentages: "gender" and "dwell"
    
    
    -----------------------------------------------------
           &nbsp;  city   small town   village   Missing 
    ------------- ------ ------------ --------- ---------
         **male** 82.44      6.83       4.63       6.1   
    
       **female** 88.97      1.14       3.42      6.46   
    
      **Missing**   75       5.56       5.56      13.89  
    
          **Sum** 84.49      4.65       4.23      6.63   
    -----------------------------------------------------
    
    Table: Row percentages: "gender" and "dwell"
    
    
    -----------------------------------------------------------
           &nbsp;  city   small town   village   Missing   Sum 
    ------------- ------ ------------ --------- --------- -----
         **male** 56.43     84.85       63.33     53.19   57.83
    
       **female** 39.07      9.09        30       36.17   37.09
    
      **Missing**  4.51      6.06       6.67      10.64   5.08 
    -----------------------------------------------------------
    
    Table: Column percentages: "gender" and "dwell"
    
    
    # Tests of Independence
    
    
    In the below tests for [independece](http://en.wikipedia.org/wiki/Independence_(probability_theory)) we assume that the row and column variables are independent of each other. If this [null hypothesis](http://en.wikipedia.org/wiki/Null_hypothesis) would be rejected by the tests, then we can say that the assumption must have been wrong, so there is a good chance that the variables are associated.
    
    
    ## Chi-squared test
    
    
    One of the most widespread independence test is the [Chi-squared test](http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test). While using that we have the alternative hypothesis, that two variables have an association between each other, in opposite of the null hypothesis that the variables are independent.
    
    We use the cell frequencies from the crosstables to calculate the test statistic for that. The test statistic is based on the difference between this distribution  and a theoretical distribution where the variables are independent of each other. The distribution of this test statistic follows a [Chi-square distribution](http://en.wikipedia.org/wiki/Chi-squared_distribution).
    
    The test was invented by Karl Pearson in 1900. It should be noted that the Chi-squared test has the disadvantage that it is sensitive to the sample size.
    
    
    ### Criteria
    
    
    Before analyzing the result of the Chi-squared test, we have to check if our data meets some requirements. There are two widely used criteria which have to take into consideration, both of them are related to the so-called expected counts. These expected counts are calculated from the marginal distributions and show how the crosstabs would look like if there were complete independency between the variables. The Chi-squared test calculates how different are the observed cells from the expected ones.
    
    The two criteria are:
    
      - none of the expected cells could be lower than 1
      - 80% of the expected cells have to be at least 5
    
    
    Let's look at on expected values then:
    
    ------------------------------------------
       &nbsp;     city   small town   village 
    ------------ ------ ------------ ---------
      **male**    349      18.91       17.08  
    
     **female**   223      12.09       10.92  
    ------------------------------------------
    
    We can see that the Chi-squared test met the requirements.
    
    
    So now check the result of the test:
    
    
    ------------------------------------
     Test statistic   df     P value    
    ---------------- ---- --------------
         12.64        2   _0.001804_ * *
    ------------------------------------
    
    Table: Pearson's Chi-squared test: `table`
    
    
    
    To decide if the null or the alternative hypothesis could be accepted we need to calculate the number of degrees of freedom. The degrees of freedom is easy to calculate, we substract one from the number of the categories of both the row and the coloumn variables and multiply them with each other.
    
    To each degrees of freedom there is denoted a [critical value](http://en.wikipedia.org/wiki/Critical_value#Statistics). The result of the Chi-square test have to be lower than that value to be able to accept the nullhypothesis.
    
    
    
    
    
    
    It seems that a real association can be pointed out between *gender* and *dwell* by the *Pearson's Chi-squared test* ($\chi$=_12.64_) at the [degree of freedom](http://en.wikipedia.org/wiki/Degrees_of_freedom_(statistics)) being _2_ at the [significance level](http://en.wikipedia.org/wiki/Statistical_significance) of _0.001804_ * *.
    
    The association between the two variables seems to be weak based on [Cramer's V](http://en.wikipedia.org/wiki/Cram%C3%A9r%27s_V) (_0.1001_).
    
    
    
    
    ### References
    
    
      * Fisher, R. A. (1922): On the interpretation of chi-square from contingency tables, and the calculation of P. _Journal of the Royal Statistical Society_ 85 (1): 87-94.
      * Fisher, R.A. (1954): _Statistical Methods for Research Workers_. Oliver and Boyd.
    
    
    ### Adjusted standardized residuals
    
    
    The residuals show the contribution to reject the null hypothesis at a cell level. An extremely high or low value indicates that the given cell had a major effect on the resulting chi-square, so thus helps understanding the association in the crosstable.
    
    
    ------------------------------------------
          &nbsp;  city   small town   village 
    ------------ ------ ------------ ---------
        **male** -3.08      3.43       0.76   
    
      **female**  3.08     -3.43       -0.76  
    ------------------------------------------
    
    Table: Residuals: "gender" and "dwell"
    
    Based on Pearson's residuals the following cells seems interesting (with values higher than `2` or lower than `-2`):
    
     * "male - city"
     * "female - city"
     * "male - small town"
     * "female - small town"
    
    ### References
    
    
      * Snedecor, George W. and Cochran, William G. (1989): _Statistical Methods_. Iowa State University Press.
      * Karl Pearson (1900): _Philosophical Magazine_, Series 5 50 (302): 157-175.
    
    
    ## Fisher Exact Test
    
    
    An other test to check the possible association/independence between two variables, is the [Fisher exact test](http://en.wikipedia.org/wiki/Fisher%27_exact_test). This test is especially useful with small samples, but could be used with bigger datasets as well.
    
    We have the advantage while using the Fisher's over the Chi-square test, that we could get an exact significance value not just a level of it, thus we can have an impression about the power of the test and the association.
    
    The test was invented by, thus named after R.A. Fisher.
    
    
    
    
    
    The variables seems to be dependent based on Fisher's exact test at the [significance level](http://en.wikipedia.org/wiki/P-value) of _0.0008061_ * * *.
    
    
    
    
    
    
    # Direction of relationship
    
    
    ## Goodman and Kruskal's lambda
    
    
    With the help of the [Goodman and Kruskal's lambda](http://en.wikipedia.org/wiki/Goodman_and_Kruskal%27s_lambda) we can look for not only relationship on its own, which have directions if we set one variable as a predictor and the other as a criterion variable.
    
    
    
    The computed value for [Goodman and Kruskal's lambda](http://en.wikipedia.org/wiki/Goodman_and_Kruskal%27s_lambda) is the same for both directions: _0_. For this end, we do not know the direction of the relationship.
    
    
    
    
    # Charts
    
    
    If one would like to investigate the relationships rather visually than in a crosstable form, there are several possibilities to do that.
    
    
    #### Heat map
    
    
    At first we can have a look at on the so-called [heat map](http://en.wikipedia.org/wiki/Heat_map). This kind of chart uses the same amount of cells and a similar form as the crosstable does, but instead of the numbers there are colours to show which cell contains the most counts (or likewise the highest total percentages).
    
    The darker colour is one cell painted, the most counts/the higher total percentage it has.
    
    [![Heatmap](plots/rapport-Crosstable-31-1.png)](plots/rapport-Crosstable-31-1-hires.png)
    
    
    There can be also shown the standardized adjusted residual of each cells:
    
    [![Heatmap of residuals](plots/rapport-Crosstable-31-2.png)](plots/rapport-Crosstable-31-2-hires.png)
    
    #### Mosaic chart
    
    
    In front of the heat map, on the *mosaic charts*, not only the colours are important. The size of the cells shows the amount of the counts one cell has.
    
    The width on the axis of gender determinate one side and the height on the axis of the dwell gives the final shape of the box. The box which demonstrates a cell from the hypothetic crosstable. We can see on the top of the chart which category from the dwell draw the boxes what kind of colour.
    
    [![Mosaic chart](plots/rapport-Crosstable-31-3.png)](plots/rapport-Crosstable-31-3-hires.png)
    
    #### Fluctuation diagram
    
    
    At last but not least have a glance on the *fluctuation diagram*. Unlike the above two charts, here the colours does not have influence on the chart, but the sizes of the boxes, which obviously demonstrates here as well the cells of the crosstable.
    
    The bigger are the boxes the higher are the numbers of the counts/the total percentages, which that boxes denote.
    
    
    [![Fluctuation diagram](plots/rapport-Crosstable-31-4.png)](plots/rapport-Crosstable-31-4-hires.png)
    
    
    #################################################################################
    ##	 Running: rapport('Crosstable', data=ius2008, row='email', col='dwell')
    #################################################################################
    
    # Variable description
    
    
    Two variables specified:
    
     * "email" ("Email usage") with _672_ valid values and
     * "dwell" ("Dwelling") with _662_ valid values.
    
    
    ## Introduction
    
    
    [Crosstables](http://en.wikipedia.org/wiki/Cross_tabulation) are applicable to show the frequencies of categorical variables in a matrix form, with a table view.
    
    We will present four types of these crosstables. The first of them shows the *exact numbers of the observations*, ergo the number of the observations each of the variables' categories commonly have.
    
    The second also shows the possessions each of these cells have, but not the exact numbers of the observations, rather the *percentages* of them from the total data.
    
    The last two type of the crosstabs contain the so-called *row and column percentages* which demonstrate us the distribution of the frequencies if we concentrate only on one variable.
    
    After that we present the *tests* with which we can investigate the possible relationships, associations between the variables, like Chi-squared test, Fisher Exact Test, Goodman and Kruskal's lambda.
    
    In the last part there are some *charts* presented, with that one can visually observe the distribution of the frequencies.
    
    
    # Counts
    
    
    ---------------------------------------------------------
               &nbsp;  city   small town   village   Missing 
    ----------------- ------ ------------ --------- ---------
            **never**   12        0           0         1    
    
      **very rarely**   30        1           3         2    
    
           **rarely**   41        3           1         1    
    
        **sometimes**   67        4           8         8    
    
            **often**  101        10          5         7    
    
       **very often**   88        5           5        10    
    
           **always**  226        9           7        17    
    
          **Missing**   34        1           1         1    
    
              **Sum**  599        33         30        47    
    ---------------------------------------------------------
    
    Table: Counted values: "email" and "dwell" (continued below)
    
     
    -----------------------
               &nbsp;  Sum 
    ----------------- -----
            **never**  13  
    
      **very rarely**  36  
    
           **rarely**  46  
    
        **sometimes**  87  
    
            **often**  123 
    
       **very often**  108 
    
           **always**  259 
    
          **Missing**  37  
    
              **Sum**  709 
    -----------------------
    
    
    
    Most of the cases (_226_) can be found in "always-city" categories. Row-wise "always" holds the highest number of cases (_259_) while column-wise "city" has the utmost cases (_599_).
    
    
    # Percentages
    
    
    ---------------------------------------------------------
               &nbsp;  city   small town   village   Missing 
    ----------------- ------ ------------ --------- ---------
            **never**  1.69       0           0       0.14   
    
      **very rarely**  4.23      0.14       0.42      0.28   
    
           **rarely**  5.78      0.42       0.14      0.14   
    
        **sometimes**  9.45      0.56       1.13      1.13   
    
            **often** 14.25      1.41       0.71      0.99   
    
       **very often** 12.41      0.71       0.71      1.41   
    
           **always** 31.88      1.27       0.99       2.4   
    
          **Missing**  4.8       0.14       0.14      0.14   
    
              **Sum** 84.49      4.65       4.23      6.63   
    ---------------------------------------------------------
    
    Table: Total percentages: "email" and "dwell" (continued below)
    
     
    -----------------------
               &nbsp;  Sum 
    ----------------- -----
            **never** 1.83 
    
      **very rarely** 5.08 
    
           **rarely** 6.49 
    
        **sometimes** 12.27
    
            **often** 17.35
    
       **very often** 15.23
    
           **always** 36.53
    
          **Missing** 5.22 
    
              **Sum**  100 
    -----------------------
    
    
    ---------------------------------------------------------
               &nbsp;  city   small town   village   Missing 
    ----------------- ------ ------------ --------- ---------
            **never** 92.31       0           0       7.69   
    
      **very rarely** 83.33      2.78       8.33      5.56   
    
           **rarely** 89.13      6.52       2.17      2.17   
    
        **sometimes** 77.01      4.6         9.2       9.2   
    
            **often** 82.11      8.13       4.07      5.69   
    
       **very often** 81.48      4.63       4.63      9.26   
    
           **always** 87.26      3.47        2.7      6.56   
    
          **Missing** 91.89      2.7         2.7       2.7   
    
              **Sum** 84.49      4.65       4.23      6.63   
    ---------------------------------------------------------
    
    Table: Row percentages: "email" and "dwell"
    
    
    ---------------------------------------------------------
               &nbsp;  city   small town   village   Missing 
    ----------------- ------ ------------ --------- ---------
            **never**   2         0           0       2.13   
    
      **very rarely**  5.01      3.03        10       4.26   
    
           **rarely**  6.84      9.09       3.33      2.13   
    
        **sometimes** 11.19     12.12       26.67     17.02  
    
            **often** 16.86      30.3       16.67     14.89  
    
       **very often** 14.69     15.15       16.67     21.28  
    
           **always** 37.73     27.27       23.33     36.17  
    
          **Missing**  5.68      3.03       3.33      2.13   
    ---------------------------------------------------------
    
    Table: Column percentages: "email" and "dwell" (continued below)
    
     
    -----------------------
               &nbsp;  Sum 
    ----------------- -----
            **never** 1.83 
    
      **very rarely** 5.08 
    
           **rarely** 6.49 
    
        **sometimes** 12.27
    
            **often** 17.35
    
       **very often** 15.23
    
           **always** 36.53
    
          **Missing** 5.22 
    -----------------------
    
    
    # Tests of Independence
    
    
    In the below tests for [independece](http://en.wikipedia.org/wiki/Independence_(probability_theory)) we assume that the row and column variables are independent of each other. If this [null hypothesis](http://en.wikipedia.org/wiki/Null_hypothesis) would be rejected by the tests, then we can say that the assumption must have been wrong, so there is a good chance that the variables are associated.
    
    
    ## Chi-squared test
    
    
    One of the most widespread independence test is the [Chi-squared test](http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test). While using that we have the alternative hypothesis, that two variables have an association between each other, in opposite of the null hypothesis that the variables are independent.
    
    We use the cell frequencies from the crosstables to calculate the test statistic for that. The test statistic is based on the difference between this distribution  and a theoretical distribution where the variables are independent of each other. The distribution of this test statistic follows a [Chi-square distribution](http://en.wikipedia.org/wiki/Chi-squared_distribution).
    
    The test was invented by Karl Pearson in 1900. It should be noted that the Chi-squared test has the disadvantage that it is sensitive to the sample size.
    
    
    ### Criteria
    
    
    Before analyzing the result of the Chi-squared test, we have to check if our data meets some requirements. There are two widely used criteria which have to take into consideration, both of them are related to the so-called expected counts. These expected counts are calculated from the marginal distributions and show how the crosstabs would look like if there were complete independency between the variables. The Chi-squared test calculates how different are the observed cells from the expected ones.
    
    The two criteria are:
    
      - none of the expected cells could be lower than 1
      - 80% of the expected cells have to be at least 5
    
    
    Let's look at on expected values then:
    
    -----------------------------------------------
         &nbsp;        city   small town   village 
    ----------------- ------ ------------ ---------
        **never**     10.83     0.6134     0.5559  
    
     **very rarely**  30.69     1.738       1.575  
    
       **rarely**     40.62      2.3        2.085  
    
      **sometimes**    71.3     4.038       3.66   
    
        **often**     104.7      5.93       5.374  
    
     **very often**   88.45      5.01       4.54   
    
       **always**     218.4     12.37       11.21  
    -----------------------------------------------
    
    We can see that the Chi-squared test met the requirements.
    
    
    So now check the result of the test:
    
    
    -------------------------------
     Test statistic   df   P value 
    ---------------- ---- ---------
         14.86        12   _0.249_ 
    -------------------------------
    
    Table: Pearson's Chi-squared test: `table`
    
    
    
    To decide if the null or the alternative hypothesis could be accepted we need to calculate the number of degrees of freedom. The degrees of freedom is easy to calculate, we substract one from the number of the categories of both the row and the coloumn variables and multiply them with each other.
    
    To each degrees of freedom there is denoted a [critical value](http://en.wikipedia.org/wiki/Critical_value#Statistics). The result of the Chi-square test have to be lower than that value to be able to accept the nullhypothesis.
    
    
    The requirements of the chi-squared test was not met, so [Yates's correction for continuity](http://en.wikipedia.org/wiki/Yates%27s_correction_for_continuity) applied. The approximation may be incorrect.
    
    
    
    
    
    It seems that no real association can be pointed out between *email* and *dwell* by the *Pearson's Chi-squared test* ($\chi$=_14.86_ at the degree of freedom being _12_) at the significance level of _0.249_.
    
    
    
    
    ### References
    
    
      * Fisher, R. A. (1922): On the interpretation of chi-square from contingency tables, and the calculation of P. _Journal of the Royal Statistical Society_ 85 (1): 87-94.
      * Fisher, R.A. (1954): _Statistical Methods for Research Workers_. Oliver and Boyd.
    
    
    ### Adjusted standardized residuals
    
    
    The residuals show the contribution to reject the null hypothesis at a cell level. An extremely high or low value indicates that the given cell had a major effect on the resulting chi-square, so thus helps understanding the association in the crosstable.
    
    
    -----------------------------------------------
               &nbsp;  city   small town   village 
    ----------------- ------ ------------ ---------
            **never**  1.15     -0.81       -0.77  
    
      **very rarely** -0.41     -0.59        1.2   
    
           **rarely**  0.2       0.49       -0.8   
    
        **sometimes** -1.75     -0.02       2.49   
    
            **often** -1.28      1.9        -0.18  
    
       **very often** -0.17       0         0.24   
    
           **always**  2.1      -1.26       -1.64  
    -----------------------------------------------
    
    Table: Residuals: "email" and "dwell"
    
    Based on Pearson's residuals the following cells seems interesting (with values higher than `2` or lower than `-2`):
    
     * "always - city"
     * "sometimes - village"
    
    ### References
    
    
      * Snedecor, George W. and Cochran, William G. (1989): _Statistical Methods_. Iowa State University Press.
      * Karl Pearson (1900): _Philosophical Magazine_, Series 5 50 (302): 157-175.
    
    
    ## Fisher Exact Test
    
    
    An other test to check the possible association/independence between two variables, is the [Fisher exact test](http://en.wikipedia.org/wiki/Fisher%27_exact_test). This test is especially useful with small samples, but could be used with bigger datasets as well.
    
    We have the advantage while using the Fisher's over the Chi-square test, that we could get an exact significance value not just a level of it, thus we can have an impression about the power of the test and the association.
    
    The test was invented by, thus named after R.A. Fisher.
    
    
    
    
    
    The test could not finish within resource limits.
    
    
    
    
    
    
    # Charts
    
    
    If one would like to investigate the relationships rather visually than in a crosstable form, there are several possibilities to do that.
    
    
    #### Heat map
    
    
    At first we can have a look at on the so-called [heat map](http://en.wikipedia.org/wiki/Heat_map). This kind of chart uses the same amount of cells and a similar form as the crosstable does, but instead of the numbers there are colours to show which cell contains the most counts (or likewise the highest total percentages).
    
    The darker colour is one cell painted, the most counts/the higher total percentage it has.
    
    [![Heatmap](plots/rapport-Crosstable-32-1.png)](plots/rapport-Crosstable-32-1-hires.png)
    
    
    There can be also shown the standardized adjusted residual of each cells:
    
    [![Heatmap of residuals](plots/rapport-Crosstable-32-2.png)](plots/rapport-Crosstable-32-2-hires.png)
    
    #### Mosaic chart
    
    
    In front of the heat map, on the *mosaic charts*, not only the colours are important. The size of the cells shows the amount of the counts one cell has.
    
    The width on the axis of email determinate one side and the height on the axis of the dwell gives the final shape of the box. The box which demonstrates a cell from the hypothetic crosstable. We can see on the top of the chart which category from the dwell draw the boxes what kind of colour.
    
    [![Mosaic chart](plots/rapport-Crosstable-32-3.png)](plots/rapport-Crosstable-32-3-hires.png)
    
    #### Fluctuation diagram
    
    
    At last but not least have a glance on the *fluctuation diagram*. Unlike the above two charts, here the colours does not have influence on the chart, but the sizes of the boxes, which obviously demonstrates here as well the cells of the crosstable.
    
    The bigger are the boxes the higher are the numbers of the counts/the total percentages, which that boxes denote.
    
    
    [![Fluctuation diagram](plots/rapport-Crosstable-32-4.png)](plots/rapport-Crosstable-32-4-hires.png)
  • Exports of example
    pandoc and the other backends provide a great number of formats to export your templates.
    We have generated the following types for demonstartion purposes based on the example command(s):
    rapport('Crosstable', data=ius2008, row='gender', col='dwell')
    rapport('Crosstable', data=ius2008, row='email', col='dwell')
    Popular formats:
    Other morkdown formats
  • Source of `Crosstable.tpl`
    <!--head
    meta:
      title: Crosstable
      author: Gergely Daróczi, Daniel Nagy
      description: Returning the Chi-squared test of two given variables with count, percentages and Pearson's residuals table.
      email: gergely@snowl.net
      packages: 
      - descr
      - ggplot2
      - reshape
      example:
      - rapport('Crosstable', data=ius2008, row='gender', col='dwell')
      - rapport('Crosstable', data=ius2008, row='email', col='dwell')
    inputs:
    - name: row
      label: Row variable
      description: A categorical variable.
      class: factor
      length:
        min: 1.0
        max: 1.0
      value: ~
      required: yes
      standalone: no
    - name: col
      label: Column variable
      description: A categorical variable.
      class: factor
      length:
        min: 1.0
        max: 1.0
      value: ~
      required: yes
      standalone: no
    head-->
    
    # Variable description
    
    Two variables specified:
    
     * "<%=rp.name(row)%>"<%=ifelse(rp.label(row)==rp.name(row), '', sprintf(' ("%s")', rp.label(row)))%> with <%=rp.valid(as.numeric(row))%> valid values and
     * "<%=rp.name(col)%>"<%=ifelse(rp.label(col)==rp.name(col), '', sprintf(' ("%s")', rp.label(col)))%> with <%=rp.valid(as.numeric(col))%> valid values.
    
    ## Introduction
    
    [Crosstables](http://en.wikipedia.org/wiki/Cross_tabulation) are applicable to show the frequencies of categorical variables in a matrix form, with a table view.
    
    We will present four types of these crosstables. The first of them shows the *exact numbers of the observations*, ergo the number of the observations each of the variables' categories commonly have.
    
    The second also shows the possessions each of these cells have, but not the exact numbers of the observations, rather the *percentages* of them from the total data.
    
    The last two type of the crosstabs contain the so-called *row and column percentages* which demonstrate us the distribution of the frequencies if we concentrate only on one variable.
    
    After that we present the *tests* with which we can investigate the possible relationships, associations between the variables, like Chi-squared test, Fisher Exact Test, Goodman and Kruskal's lambda.
    
    In the last part there are some *charts* presented, with that one can visually observe the distribution of the frequencies.
    
    # Counts
    
    <%=
    table <- table(row, col, deparse.level = 0, useNA = 'ifany')
    if (length(which(is.na(rownames(table)))) > 0) {
        rownames(table)[which(is.na(rownames(table)))] <- 'Missing' }
    if (length(which(is.na(colnames(table)))) > 0) {
        colnames(table)[which(is.na(colnames(table)))] <- 'Missing' }
    fulltable	<- addmargins(table)
    set.caption(sprintf('Counted values: "%s" and "%s"', rp.name(row), rp.name(col)))
    set.alignment(row.names = "right")
    fulltable
    %>
    
    <%=
    table.max <- which(table == max(table), arr.ind = TRUE)
    %>
    
    Most of the cases (<%=table[table.max]%>) can be found in "<%=paste(rownames(table)[table.max[, 1]], colnames(table)[table.max[, 2]], sep = '-')%>" categories. Row-wise "<%=names(which.max(rowSums(table)))%>" holds the highest number of cases (<%=max(rowSums(table))%>) while column-wise "<%=names(which.max(colSums(table)))%>" has the utmost cases (<%=max(colSums(table))%>).
    
    # Percentages
    
    <%=
    set.caption(sprintf('Total percentages: "%s" and "%s"', rp.name(row), rp.name(col)))
    set.alignment(row.names = "right")
    fulltable <- round(addmargins(prop.table(table) * 100), 2)
    fulltable
    %>
    
    <%=
    set.caption(sprintf('Row percentages: "%s" and "%s"', rp.name(row), rp.name(col)))
    set.alignment(row.names = "right")
    fulltable <- round(prop.table(addmargins(table, 1), 1) * 100, 2)
    fulltable
    %>
    
    <%=
    set.caption(sprintf('Column percentages: "%s" and "%s"', rp.name(row), rp.name(col)))
    set.alignment(row.names = "right")
    fulltable <- round(prop.table(addmargins(table, 2), 2) * 100, 2)
    fulltable
    %>
    
    # Tests of Independence
    
    In the below tests for [independece](http://en.wikipedia.org/wiki/Independence_(probability_theory)) we assume that the row and column variables are independent of each other. If this [null hypothesis](http://en.wikipedia.org/wiki/Null_hypothesis) would be rejected by the tests, then we can say that the assumption must have been wrong, so there is a good chance that the variables are associated.
    
    ## Chi-squared test
    
    One of the most widespread independence test is the [Chi-squared test](http://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test). While using that we have the alternative hypothesis, that two variables have an association between each other, in opposite of the null hypothesis that the variables are independent.
    
    We use the cell frequencies from the crosstables to calculate the test statistic for that. The test statistic is based on the difference between this distribution  and a theoretical distribution where the variables are independent of each other. The distribution of this test statistic follows a [Chi-square distribution](http://en.wikipedia.org/wiki/Chi-squared_distribution).
    
    The test was invented by Karl Pearson in 1900. It should be noted that the Chi-squared test has the disadvantage that it is sensitive to the sample size.
    
    ### Criteria
    
    Before analyzing the result of the Chi-squared test, we have to check if our data meets some requirements. There are two widely used criteria which have to take into consideration, both of them are related to the so-called expected counts. These expected counts are calculated from the marginal distributions and show how the crosstabs would look like if there were complete independency between the variables. The Chi-squared test calculates how different are the observed cells from the expected ones.
    
    The two criteria are:
    
      - none of the expected cells could be lower than 1
      - 80% of the expected cells have to be at least 5
    
    <%=
    table  <- table(row, col, deparse.level = 0) # no need for NAs from here
    t      <- suppressWarnings(chisq.test(table))
    lambda <- lambda.test(table)
    cramer <- sqrt(as.numeric(t$statistic) / (sum(table) * min(dim(table))))
    o <- t$expected
    num <- nrow(o) * ncol(o)
    k <- 0
    l <- 0
    for (i in 1:nrow(o))
    for (j in 1:ncol(o))
    {
    if (o[i, j] < 5) {k <- k + 1}
    if (o[i, j] < 1) {l <- l + 1}
    }
    %>
    Let's look at on expected values then:
    <%=
    o
    crit <- 0
    if (l>0) {crit <- crit + 1}
    if (k<num/5)  {crit <- crit + 1}
    ifelse(crit > 0,"We can see that the Chi-squared test met the requirements.", "We can see that using the Chi-squared test is not advisable in this case, so you should be careful with the interpretation.")
    %>
    
    So now check the result of the test:
    
    <%= t %>
    
    To decide if the null or the alternative hypothesis could be accepted we need to calculate the number of degrees of freedom. The degrees of freedom is easy to calculate, we substract one from the number of the categories of both the row and the coloumn variables and multiply them with each other.
    
    To each degrees of freedom there is denoted a [critical value](http://en.wikipedia.org/wiki/Critical_value#Statistics). The result of the Chi-square test have to be lower than that value to be able to accept the nullhypothesis.
    
    <%if (inherits(tryCatch(chisq.test(table), warning = function(w) w), 'warning')) { %>
    The requirements of the chi-squared test was not met, so [Yates's correction for continuity](http://en.wikipedia.org/wiki/Yates%27s_correction_for_continuity) applied. The approximation may be incorrect.
    <% } %>
    
    <%if (!is.na(cramer)) { %>
    <%if (t$p.value < 0.05) { %>
    
    It seems that a real association can be pointed out between *<%=rp.name(row)%>* and *<%=rp.name(col)%>* by the *<%=t$method%>* ($\chi$=<%=as.numeric(t$statistic)%>) at the [degree of freedom](http://en.wikipedia.org/wiki/Degrees_of_freedom_(statistics)) being <%=as.numeric(t$parameter)%> at the [significance level](http://en.wikipedia.org/wiki/Statistical_significance) of <%=add.significance.stars(t$p.value)%>.
    
    The association between the two variables seems to be <%=ifelse(cramer < 0.2, "weak", "strong")%> based on [Cramer's V](http://en.wikipedia.org/wiki/Cram%C3%A9r%27s_V) (<%=cramer%>).
    
    <% } else { %>
    
    It seems that no real association can be pointed out between *<%=rp.name(row)%>* and *<%=rp.name(col)%>* by the *<%=t$method%>* ($\chi$=<%=as.numeric(t$statistic)%> at the degree of freedom being <%=as.numeric(t$parameter)%>) at the significance level of <%=add.significance.stars(t$p.value)%>.
    
    <% } %>
    
    ### References
    
      * Fisher, R. A. (1922): On the interpretation of chi-square from contingency tables, and the calculation of P. _Journal of the Royal Statistical Society_ 85 (1): 87-94.
      * Fisher, R.A. (1954): _Statistical Methods for Research Workers_. Oliver and Boyd.
    
    ### Adjusted standardized residuals
    
    The residuals show the contribution to reject the null hypothesis at a cell level. An extremely high or low value indicates that the given cell had a major effect on the resulting chi-square, so thus helps understanding the association in the crosstable.
    
    <%=
    set.caption(sprintf('Residuals: "%s" and "%s"', rp.name(row), rp.name(col)))
    set.alignment(row.names = "right")
    table.res <- suppressWarnings(CrossTable(table, asresid = TRUE))$asr
    table.res <- round(table.res, 2)
    table.res.highlow  <- which(table.res < -2 | table.res > 2, arr.ind = TRUE)
    table.res
    %>
    
    <%=
    if (nrow(table.res.highlow) > 0) {
        sprintf('Based on Pearson\'s residuals the following cells seems interesting (with values higher than `2` or lower than `-2`):\n%s', paste(sapply(1:nrow(table.res.highlow), function(i) sprintf('\n * "%s - %s"', rownames(table)[table.res.highlow[i, 1]], colnames(table)[table.res.highlow[i, 2]])), collapse = ''))
    } else {
        sprintf('No interesting (higher then `2` or lower then `-2`) values found based on Pearson\'s residuals.')
    }
    %>
    
    ### References
    
      * Snedecor, George W. and Cochran, William G. (1989): _Statistical Methods_. Iowa State University Press.
      * Karl Pearson (1900): _Philosophical Magazine_, Series 5 50 (302): 157-175.
    
    ## Fisher Exact Test
    
    An other test to check the possible association/independence between two variables, is the [Fisher exact test](http://en.wikipedia.org/wiki/Fisher%27_exact_test). This test is especially useful with small samples, but could be used with bigger datasets as well.
    
    We have the advantage while using the Fisher's over the Chi-square test, that we could get an exact significance value not just a level of it, thus we can have an impression about the power of the test and the association.
    
    The test was invented by, thus named after R.A. Fisher.
    
    <%=
    f <- tryCatch(suppressWarnings(fisher.test(table, hybrid = TRUE, workspace = 1e6)), error = function(e) e)
    %>
    
    <% if (inherits(f, 'error')) { %>
    
    The test could not finish within resource limits.
    
    <%
        f <- list(p.value = t$p.value)
    } else {
        if (f$p.value < 0.05) {
    %>
    
    The variables seems to be dependent based on Fisher's exact test at the [significance level](http://en.wikipedia.org/wiki/P-value) of <%=add.significance.stars(f$p.value)%>.
    
    <% } else { %>
    
    The variables seems to be independent based on Fisher's exact test at the [significance level](http://en.wikipedia.org/wiki/P-value) of <%=add.significance.stars(f$p.value)%>.
    
    <% }} %>
    
    <% if (t$p.value < 0.05 | f$p.value < 0.05) { %>
    
    # Direction of relationship
    
    ## Goodman and Kruskal's lambda
    
    With the help of the [Goodman and Kruskal's lambda](http://en.wikipedia.org/wiki/Goodman_and_Kruskal%27s_lambda) we can look for not only relationship on its own, which have directions if we set one variable as a predictor and the other as a criterion variable.
    
    <%if (diff(unlist(lambda, use.names = FALSE)) != 0 & !is.na(cramer)) { %>
    
    Based on Goodman and Kruskal's lambda it seems that *<%=c(rp.name(col),rp.name(row))[which.max(lambda)]%>* ($\lambda$=<%=pander.return(max(as.numeric(lambda)))%>) has an effect on *<%=c(rp.name(col),rp.name(row))[which.min(lambda)]%>* ($\lambda$=<%=min(as.numeric(lambda))%>) if we assume both variables to be nominal.
    
    <% } else { %>
    
    The computed value for [Goodman and Kruskal's lambda](http://en.wikipedia.org/wiki/Goodman_and_Kruskal%27s_lambda) is the same for both directions: <%=lambda$row%>. For this end, we do not know the direction of the relationship.
    
    <% }}} else { %>
    Moreover: **it seems that the provided variables do not fit a real crosstable**.
    <% } %>
    
    # Charts
    
    If one would like to investigate the relationships rather visually than in a crosstable form, there are several possibilities to do that.
    
    #### Heat map
    
    At first we can have a look at on the so-called [heat map](http://en.wikipedia.org/wiki/Heat_map). This kind of chart uses the same amount of cells and a similar form as the crosstable does, but instead of the numbers there are colours to show which cell contains the most counts (or likewise the highest total percentages).
    
    The darker colour is one cell painted, the most counts/the higher total percentage it has.
    
    <%=
    set.caption('Heatmap')
    suppressWarnings(suppressMessages(ggfluctuation(table, type = 'colour') + geom_tile() + xlab('') + ylab('') + labs(fill = 'Count'))) # TODO: drop ggfluctuation
    %>
    
    There can be also shown the standardized adjusted residual of each cells:
    
    <%=
    set.caption('Heatmap of residuals')
    table2 <- as.data.frame(t(table))
    table2$Freq <- as.data.frame(t(table.res))$Freq
    names(table2) <- c("x", "y", "result")
    table2 <- transform(table2, x = as.factor(x), y = as.factor(y), freq = result)
    ceiling <- max(table2$freq, na.rm = TRUE)
    ggplot(table2, aes_string(x = "x", y = "y", fill = "freq")) + geom_tile(colour = "grey50") + scale_fill_gradient2('Std. adj. res.', limits = c(-max(abs(range(table2$freq))), max(abs(range(table2$freq)))), midpoint = 0, low = "red", mid = "white", high = "green")
    %>
    
    #### Mosaic chart
    
    In front of the heat map, on the *mosaic charts*, not only the colours are important. The size of the cells shows the amount of the counts one cell has.
    
    The width on the axis of <%=rp.name(row)%> determinate one side and the height on the axis of the <%=rp.name(col)%> gives the final shape of the box. The box which demonstrates a cell from the hypothetic crosstable. We can see on the top of the chart which category from the <%=rp.name(col)%> draw the boxes what kind of colour.
    
    <%=
    set.caption('Mosaic chart')
    glp      <- panderOptions('graph.legend.position')
    panderOptions('graph.legend.position', 'top')
    t        <- melt(table)
    t$x      <- rowSums(table) / sum(table) * 100
    t$xmax   <- cumsum(rowSums(table)) / sum(table) * 100
    t$xmin   <- t$xmax - t$x
    t$y      <- t$value / rep(rowSums(table), ncol(table)) * 100
    t        <- t[with(t, order(Var.1)), ]
    t$ymax   <- cumsum(t$y) - as.vector(sapply(1:nrow(table) - 1, rep, ncol(table))) * 100
    t$ymin   <- t$ymax - t$y
    t$xxtext <- with(t, xmin + (xmax - xmin) / 2)
    t$xytext <- as.vector(sapply(rep(c(103, -3), length.out = nrow(table)), rep, ncol(table)))
    ggplot(t, aes(ymin = ymin, ymax = ymax, xmin = xmin, xmax = xmax, fill = Var.2)) + geom_rect() + geom_rect(colour = 'white', show_guide = FALSE) + geom_text(aes(x = xxtext, y = xytext, label = Var.1), size = 4) + xlab('') + ylab('') + theme(legend.position = 'top') + labs(fill = '')
    panderOptions('graph.legend.position', glp)
    %>
    
    #### Fluctuation diagram
    
    At last but not least have a glance on the *fluctuation diagram*. Unlike the above two charts, here the colours does not have influence on the chart, but the sizes of the boxes, which obviously demonstrates here as well the cells of the crosstable.
    
    The bigger are the boxes the higher are the numbers of the counts/the total percentages, which that boxes denote.
    
    
    <%=
    set.caption('Fluctuation diagram')
    suppressWarnings(suppressMessages(ggfluctuation(table) + xlab('') + ylab(''))) # TODO: drop ggfluctuation
    %>

Descriptives.tpl

  • Descriptive statistics
    by Gergely Daróczi, Daniel Nagy

    This template will return descriptive statistics of a numerical or frequency table of a categorical variable.

    Required packages:
    Example:
    rapport('Descriptives', data=ius2008, var='gender')
    rapport('Descriptives', data=ius2008, var='age')
    rapport('Descriptives', data=mtcars, var='hp')
  • Input parameters
    Descriptives.tpl template has the following options to be (optionally) set while calling rapport:
    • var: Variable
      Categorical or numerical variable. This template will determine the measurement level of the given variable.

  • #######################################################################
    ##	 Running: rapport('Descriptives', data=ius2008, var='gender')
    #######################################################################
    
    # *gender* ("Gender")
    
    
    The dataset has _709_ observations with _673_ valid values (missing: _36_).
    
    
    
    
    ----------------------------------------
     gender   N    %    Cumul. N   Cumul. % 
    -------- --- ----- ---------- ----------
      male   410 60.92    410       60.92   
    
     female  263 39.08    673        100    
    
     Total   673  100     673        100    
    ----------------------------------------
    
    Table: Frequency table: Gender
    
    
    
    The most frequent value is *male*.
    
    
    ## Charts
    
    [![Barplot: Gender](plots/rapport-Descriptives-46-1.png)](plots/rapport-Descriptives-46-1-hires.png)
    
    
    ####################################################################
    ##	 Running: rapport('Descriptives', data=ius2008, var='age')
    ####################################################################
    
    # *age* ("Age")
    
    
    The dataset has _709_ observations with _677_ valid values (missing: _32_).
    
    
    
    ## Base statistics
    
    
    -----------------------------
     Variable   mean   sd    var 
    ---------- ------ ----- -----
       Age     24.57  6.849 46.91
    -----------------------------
    
    Table: Descriptives: Age
    
    
    
    The [standard deviation](http://en.wikipedia.org/wiki/Standard_deviation) equals to  _6.849_ (variance: _46.91_), which shows the unstandardized degree of [homogenity](http://en.wikipedia.org/wiki/Homogeneity_(statistics)): how much variation exists from the average. The [expected value](http://en.wikipedia.org/wiki/Mean) is around _24.57_, somewhere between _24.06_ and _25.09_ with the standard error of _0.2632_.
    
    The highest value found in the dataset is _58_, which is exactly _3.625_ times higher than the minimum (_16_). The difference between the two is described by the [range](http://en.wikipedia.org/wiki/Range_(statistics)): _42_.
    
    
    ## Chart
    
    
    A [histogram](http://en.wikipedia.org/wiki/Histogram) visually shows the [distribution](http://en.wikipedia.org/wiki/Probability_distribution) of the dataset based on artificially allocated [frequencies](http://en.wikipedia.org/wiki/Statistical_frequency). Each bar represents a theoretical interval of the data, where the height shows the count or density.
    
    [![Histogram: Age](plots/rapport-Descriptives-47-1.png)](plots/rapport-Descriptives-47-1-hires.png)
    
    
    If we *suppose* that *Age* is not near to the [normal distribution](http://en.wikipedia.org/wiki/Normal_distribution) (see for example [skewness](http://en.wikipedia.org/wiki/Skewness): _1.925_, [kurtosis](http://en.wikipedia.org/wiki/Kurtosis): _4.463_), checking the median (_23_) might be a better option instead of the mean. The [interquartile range](http://en.wikipedia.org/wiki/Interquartile_range) (_6_) measures the statistics dispersion of the variable (similar to standard deviation) based on median.
    
    
    
    
    ##################################################################
    ##	 Running: rapport('Descriptives', data=mtcars, var='hp')
    ##################################################################
    
    # *hp*
    
    
    The dataset has _32_ observations with _32_ valid values (missing: _0_).
    
    
    
    ## Base statistics
    
    
    -----------------------------
     Variable   mean   sd    var 
    ---------- ------ ----- -----
        hp     146.7  68.56 4701 
    -----------------------------
    
    Table: Descriptives: hp
    
    
    
    The [standard deviation](http://en.wikipedia.org/wiki/Standard_deviation) equals to  _68.56_ (variance: _4701_), which shows the unstandardized degree of [homogenity](http://en.wikipedia.org/wiki/Homogeneity_(statistics)): how much variation exists from the average. The [expected value](http://en.wikipedia.org/wiki/Mean) is around _146.7_, somewhere between _122.9_ and _170.4_ with the standard error of _12.12_.
    
    The highest value found in the dataset is _335_, which is exactly _6.442_ times higher than the minimum (_52_). The difference between the two is described by the [range](http://en.wikipedia.org/wiki/Range_(statistics)): _283_.
    
    
    ## Chart
    
    
    A [histogram](http://en.wikipedia.org/wiki/Histogram) visually shows the [distribution](http://en.wikipedia.org/wiki/Probability_distribution) of the dataset based on artificially allocated [frequencies](http://en.wikipedia.org/wiki/Statistical_frequency). Each bar represents a theoretical interval of the data, where the height shows the count or density.
    
    [![Histogram: hp](plots/rapport-Descriptives-48-1.png)](plots/rapport-Descriptives-48-1-hires.png)
    
    
    If we *suppose* that *hp* is not near to the [normal distribution](http://en.wikipedia.org/wiki/Normal_distribution) (see for example [skewness](http://en.wikipedia.org/wiki/Skewness): _0.726_, [kurtosis](http://en.wikipedia.org/wiki/Kurtosis): _-0.1356_), checking the median (_123_) might be a better option instead of the mean. The [interquartile range](http://en.wikipedia.org/wiki/Interquartile_range) (_83.5_) measures the statistics dispersion of the variable (similar to standard deviation) based on median.
  • Exports of example
    pandoc and the other backends provide a great number of formats to export your templates.
    We have generated the following types for demonstartion purposes based on the example command(s):
    rapport('Descriptives', data=ius2008, var='gender')
    rapport('Descriptives', data=ius2008, var='age')
    rapport('Descriptives', data=mtcars, var='hp')
    Popular formats:
    Other morkdown formats
  • Source of `Descriptives.tpl`
    <!--head
    meta:
      title: Descriptive statistics
      author: Gergely Daróczi, Daniel Nagy
      description: This template will return descriptive statistics of a numerical or frequency table of a categorical variable.
      email: gergely@snowl.net
      packages: ~
      example:
      - rapport('Descriptives', data=ius2008, var='gender')
      - rapport('Descriptives', data=ius2008, var='age')
      - rapport('Descriptives', data=mtcars, var='hp')
    inputs:
    - name: var
      label: Variable
      description: Categorical or numerical variable. This template will determine the
        measurement level of the given variable.
      length:
        min: 1.0
        max: 1.0
      value: ~
      required: yes
      standalone: no
    head-->
    
    # *<%=rp.name(var)%>*<%=ifelse(rp.label(var) == rp.name(var), '', sprintf(' ("%s")', rp.label(var)))%>
    
    The dataset has <%=nvar<-as.numeric(var); length(nvar)%> observations with <%=rp.valid(nvar)%> valid values (missing: <%=rp.missing(nvar)%>).
    
    <% if (is.numeric(var)) { %>
    ## Base statistics
    
    <%=
    set.caption(sprintf('Descriptives: %s', rp.label(var)))
    rp.desc(rp.name(var), NULL, c('mean', 'sd', 'var'), rp.data)
    %>
    
    The [standard deviation](http://en.wikipedia.org/wiki/Standard_deviation) equals to  <%=rp.sd(var)%> (variance: <%=rp.var(var)%>), which shows the unstandardized degree of [homogenity](http://en.wikipedia.org/wiki/Homogeneity_(statistics)): how much variation exists from the average. The [expected value](http://en.wikipedia.org/wiki/Mean) is around <%=rp.mean(var)%>, somewhere between <%=rp.mean(var)-1.96*rp.se.mean(var)%> and <%=rp.mean(var) + 1.96 * rp.se.mean(var)%> with the standard error of <%=rp.se.mean(var)%>.
    
    The highest value found in the dataset is <%=rp.max(var)%>, which is exactly <%=rp.max(var) / rp.min(var)%> times higher than the minimum (<%=rp.min(var)%>). The difference between the two is described by the [range](http://en.wikipedia.org/wiki/Range_(statistics)): <%=diff(range(var, na.rm = TRUE))%>.
    
    ## Chart
    
    A [histogram](http://en.wikipedia.org/wiki/Histogram) visually shows the [distribution](http://en.wikipedia.org/wiki/Probability_distribution) of the dataset based on artificially allocated [frequencies](http://en.wikipedia.org/wiki/Statistical_frequency). Each bar represents a theoretical interval of the data, where the height shows the count or density.
    
    <%=
    set.caption(sprintf('Histogram: %s', rp.label(var)))
    hist(var)
    %>
    
    If we *suppose* that *<%=rp.label(var)%>* is not near to the [normal distribution](http://en.wikipedia.org/wiki/Normal_distribution) (see for example [skewness](http://en.wikipedia.org/wiki/Skewness): <%=rp.skewness(var)%>, [kurtosis](http://en.wikipedia.org/wiki/Kurtosis): <%=rp.kurtosis(var)%>), checking the median (<%=rp.median(var)%>) might be a better option instead of the mean. The [interquartile range](http://en.wikipedia.org/wiki/Interquartile_range) (<%=rp.iqr(var)%>) measures the statistics dispersion of the variable (similar to standard deviation) based on median.
    
    <%} else { %>
    
    <%=
    set.caption(sprintf('Frequency table: %s', rp.label(var)))
    rp.freq(rp.name(var), rp.data, na.rm = FALSE, include.na = TRUE)
    %>
    
    The most frequent value is *<%=t <- table(var); names(t[t == max(t)])%>*.
    
    ## Charts
    
    <%=
    set.caption(sprintf('Barplot: %s', rp.label(var)))
    rp.barplot(var)
    %>
    
    <% } %>

FactorAnalysis.tpl

  • Factor Analysis
    by Daniel Nagy

    In this template Rapporter will present you Factor Analysis.

    Required packages: psych
    Example:
    rapport('FactorAnalysis.tpl', data=mtcars, vars=c('carb', 'gear', 'mpg', 'cyl'), rot.method="varimax")
    rapport('FactorAnalysis.tpl', data=mtcars, vars=c('carb', 'gear', 'mpg', 'cyl'), rot.method="varimax", fact.num = 3)
    rapport('FactorAnalysis.tpl', data=mtcars, vars=c('carb', 'gear', 'mpg', 'cyl', 'drat'), rot.method="varimax", fact.num = 5)
  • Input parameters
    FactorAnalysis.tpl template has the following options to be (optionally) set while calling rapport:
    • vars: Used Variables
      The Variables that will be used in Factor Analysis
      Class: numeric

    • fact.num: Number of Factors
      How many Factors you want to use?
      Class: integer

    • method: Method of factoring
      The used method of the factoring
      Class: character
      Default value: maximum likelihood

    • rot.method: Method of Rotation
      The used method of the rotation
      Class: character
      Default value: varimax

    • fa.scores: Type of scores
      Type of scores to produce
      Class: character
      Default value: regression

    • max.iter: Maximum iterations
      Maximum number of iterations for convergence
      Class: integer

  • ##########################################################################################################################
    ##	 Running: rapport('FactorAnalysis.tpl', data=mtcars, vars=c('carb', 'gear', 'mpg', 'cyl'), rot.method="varimax")
    ##########################################################################################################################
    
    # Introduction
    
    
    [Factor Analysis](http://en.wikipedia.org/wiki/Factor_analysis) is applied as a data reduction or structure detection method. There are two main applications of it: reducing the number of variables and detecting structure in the relationships between variables, thus explore latent structure behind the data, classify variables.
    
    
    
    #### Determining the number of the factors
    
    [![](plots/rapport-FactorAnalysis.tpl-16-1.png)](plots/rapport-FactorAnalysis.tpl-16-1-hires.png)
    
    #### Eigenvalues
    
    
    
    You can find the eigenvalues of the possible factors in the following table (_2_ factors were produced as you set):
    
    -----------------------------
     Factor Number   Eigenvalues 
    --------------- -------------
         **1**        **2.461**  
    
         **2**        **1.273**  
    
           3           0.1485    
    
           4           0.1176    
    -----------------------------
    
    
    ## Factor loadings
    
    
    At the next step let's check the factor loadings. They mean that how deep the impact of a factor for the variables. We emphasized the cells when the explained is higher than 30% of the whole variance.
    
    ---------------------------------
      &nbsp;       ML1        ML2    
    ---------- ----------- ----------
     **carb**  **0.8723**  **0.3756**
    
     **gear**   -0.08811   **0.9343**
    
     **mpg**   **-0.8197** **0.4368**
    
     **cyl**   **0.7988**  **-0.452**
    ---------------------------------
    
    
    
    
    So it can be said that 
    * ML1 is a latent factor of carb 
    * ML1 is a latent factor of mpg 
    * ML1 is a latent factor of cyl 
    * ML2 is a latent factor of carb 
    * ML2 is a latent factor of gear 
    * ML2 is a latent factor of mpg 
    * ML2 is a latent factor of cyl 
    
    <!-- end of list -->
    
    
    
    From them in the cases of the _ML1's impact on mpg_ and _ML2's impact on cyl_, we can say they are negative effects.
        
    
    
    
    
    
    
    
    ## Uniquenesses
    
    
    At last but not least let us say some words about the not explained part of the variables. There are two statistics which help us quantifying this concept: Communality and Uniqueness. They are in a really strong relationship, because Uniqueness is the variability of a variable minus its Communality. The first table contains the Uniqunesses, the second the Communalities of the variables:
    
    ----------------------------------------
      &nbsp;    Uniqunesses   Communalities 
    ---------- ------------- ---------------
     **carb**     0.09805         0.902     
    
     **gear**     0.1193         0.8807     
    
     **mpg**      0.1374         0.8626     
    
     **cyl**      0.1576         0.8424     
    ----------------------------------------
    
    
    
    We can see from the table that variable cyl has the highest Uniqueness, so could be explained the least by the factors and variable carb variance's was explained the most, because it has the lowest Uniqueness. From the communalities we can draw the same conclusion.
    
    
    
    
    ########################################################################################################################################
    ##	 Running: rapport('FactorAnalysis.tpl', data=mtcars, vars=c('carb', 'gear', 'mpg', 'cyl'), rot.method="varimax", fact.num = 3)
    ########################################################################################################################################
    
    # Introduction
    
    
    [Factor Analysis](http://en.wikipedia.org/wiki/Factor_analysis) is applied as a data reduction or structure detection method. There are two main applications of it: reducing the number of variables and detecting structure in the relationships between variables, thus explore latent structure behind the data, classify variables.
    
    
    
    
    
    #### Eigenvalues
    
    
    
    You can find the eigenvalues of the possible factors in the following table (_3_ factors were produced as you set):
    
    -----------------------------
     Factor Number   Eigenvalues 
    --------------- -------------
         **1**        **2.461**  
    
         **2**        **1.273**  
    
         **3**       **0.1485**  
    
           4           0.1176    
    -----------------------------
    
    
    ## Factor loadings
    
    
    At the next step let's check the factor loadings. They mean that how deep the impact of a factor for the variables. We emphasized the cells when the explained is higher than 30% of the whole variance.
    
    --------------------------------------------
      &nbsp;       ML1         ML2        ML3   
    ---------- ----------- ----------- ---------
     **carb**  **0.8771**  **0.3502**  -0.001772
    
     **gear**    -0.0595   **0.9317**  -0.005685
    
     **mpg**   **-0.8132** **0.4641**   0.0972  
    
     **cyl**   **0.7917**  **-0.4774**  0.1361  
    --------------------------------------------
    
    
    
    
    So it can be said that 
    * ML1 is a latent factor of carb 
    * ML1 is a latent factor of mpg 
    * ML1 is a latent factor of cyl 
    * ML2 is a latent factor of carb 
    * ML2 is a latent factor of gear 
    * ML2 is a latent factor of mpg 
    * ML2 is a latent factor of cyl 
    
    <!-- end of list -->
    
    
    
    From them in the cases of the _ML1's impact on mpg_ and _ML2's impact on cyl_, we can say they are negative effects.
        
    
    
    
    
    
    
    
    ## Uniquenesses
    
    
    At last but not least let us say some words about the not explained part of the variables. There are two statistics which help us quantifying this concept: Communality and Uniqueness. They are in a really strong relationship, because Uniqueness is the variability of a variable minus its Communality. The first table contains the Uniqunesses, the second the Communalities of the variables:
    
    ----------------------------------------
      &nbsp;    Uniqunesses   Communalities 
    ---------- ------------- ---------------
     **carb**     0.1081         0.8919     
    
     **gear**     0.1283         0.8717     
    
     **mpg**      0.1138         0.8862     
    
     **cyl**      0.1268         0.8732     
    ----------------------------------------
    
    
    
    We can see from the table that variable gear has the highest Uniqueness, so could be explained the least by the factors and variable carb variance's was explained the most, because it has the lowest Uniqueness. From the communalities we can draw the same conclusion.
    
    
    
    
    ################################################################################################################################################
    ##	 Running: rapport('FactorAnalysis.tpl', data=mtcars, vars=c('carb', 'gear', 'mpg', 'cyl', 'drat'), rot.method="varimax", fact.num = 5)
    ################################################################################################################################################
    
    
    
    Your request cannot be implemented, because there are the same number of variables (_5_) like the number of the requested factors (_5_) . Please set the number of the factors to maximum _4_ with the same number of the variables or extend the number of those variables at least to _6_.
  • Exports of example
    pandoc and the other backends provide a great number of formats to export your templates.
    We have generated the following types for demonstartion purposes based on the example command(s):
    rapport('FactorAnalysis.tpl', data=mtcars, vars=c('carb', 'gear', 'mpg', 'cyl'), rot.method="varimax")
    rapport('FactorAnalysis.tpl', data=mtcars, vars=c('carb', 'gear', 'mpg', 'cyl'), rot.method="varimax", fact.num = 3)
    rapport('FactorAnalysis.tpl', data=mtcars, vars=c('carb', 'gear', 'mpg', 'cyl', 'drat'), rot.method="varimax", fact.num = 5)
    Popular formats:
    Other morkdown formats
  • Source of `FactorAnalysis.tpl`
    <!--head
    meta:
      title: Factor Analysis
      author: Daniel Nagy
      description: In this template Rapporter will present you Factor Analysis.
      email: ~
      packages:
      - psych
      - nFactors
      example:
      - rapport('FactorAnalysis.tpl', data=mtcars, vars=c('carb', 'gear', 'mpg', 'cyl'),
                rot.method="varimax")
      - rapport('FactorAnalysis.tpl', data=mtcars, vars=c('carb', 'gear', 'mpg', 'cyl'),
                rot.method="varimax", fact.num = 3)
      - rapport('FactorAnalysis.tpl', data=mtcars, vars=c('carb', 'gear', 'mpg', 'cyl', 
                'drat'), rot.method="varimax", fact.num = 5)
    inputs:
    - name: vars
      label: Used Variables
      description: The Variables that will be used in Factor Analysis
      class: numeric
      length:
        min: 2.0
        max: 500.0
      required: yes
      standalone: no
    - name: fact.num
      label: Number of Factors
      description: How many Factors you want to use?
      class: integer
      limit:
        min: 1.0
        max: 499.0
      required: no
      standalone: yes
    - name: method
      label: Method of factoring
      description: The used method of the factoring
      class: character
      options:
      - minimum residual (OLS)
      - weighted least squares (WLS)
      - generalized weighted least squares (GLS)
      - maximum likelihood
      - minimize the sample size weighted chi square
      value: maximum likelihood
      matchable: yes
      allow_multiple: no
      required: no
      standalone: yes
    - name: rot.method
      label: Method of Rotation
      description: The used method of the rotation
      class: character
      options:
      - none
      - varimax
      - promax
      - quartimax
      - bentlerT
      - geominT
      - bifactor
      - oblimin
      - simplimax
      - bentlerQ
      - geominQ
      - biquartimin
      - cluster
      value: varimax
      matchable: yes
      allow_multiple: no
      required: no
      standalone: yes
    - name: fa.scores
      label: Type of scores
      description: Type of scores to produce
      class: character
      options:
      - regression
      - Bartlett
      value: regression
      matchable: yes
      allow_multiple: no
      required: no
      standalone: yes
    - name: max.iter
      label : Maximum iterations
      description: Maximum number of iterations for convergence
      class: integer
      limit:
        min: 1
        max: 10000 
      required: no
      standalone: yes
    head-->
    
    <% if (exists('fact.num') && !is.null(fact.num) && fact.num > 0 && fact.num >= ncol(vars)) { %>
    
    Your request cannot be implemented, because there are <%=ifelse(ncol(vars) == fact.num, "the same number of variables", "less variables")%> (<%= ncol(vars) %>) <%=ifelse(ncol(vars) == fact.num, "like", "than")%> the number of the requested factors (<%= fact.num %>) . Please set the number of the factors to maximum <%= ncol(vars) - 1 %> with the same number of the variables or extend the number of those variables at least to <%= fact.num + 1 %>.
    
    <% } else { %>
    
    <% if (exists('fact.num') && !is.null(fact.num) && fact.num > 0) { %>
    <% } else { %>
    <% if (ncol(vars) < 3) { %>
    Your request to automatically calculate the number of the clusters cannot be implemented, because there are not enough variables. Please set at least 3 variables in order to use that automatic function or set the number of the factors (obviously to less than the number of the used variables). Now we are using a 1 factor case.
    <%= 
    fact.num <- 1 
    %>
    <% }} %>
    
    # Introduction
    
    [Factor Analysis](http://en.wikipedia.org/wiki/Factor_analysis) is applied as a data reduction or structure detection method. There are two main applications of it: reducing the number of variables and detecting structure in the relationships between variables, thus explore latent structure behind the data, classify variables.
    
    <% if (exists('fact.num') && !is.null(fact.num) && fact.num > 0) { %>
    <% } else { %>
    #### Determining the number of the factors
    <% } %>
    
    <%= 
    if (method=="minimum residual (OLS)") { method <- "minres" }
    if (method=="weighted least squares (WLS)") { method <- "wls" }
    if (method=="generalized weighted least squares (GLS)") { method <- "gls" }
    if (method=="maximum likelihood") { method <- "ml" }
    if (method=="minimize the sample size weighted chi square") { method <- "minchi" }
    
    fact.matrix <- na.omit(scale(vars))
    ev <- eigen(cor(fact.matrix))
    
    if (exists('fact.num') && !is.null(fact.num) && fact.num > 0) {
    suppressMessages(FA <- fa(fact.matrix, nfactors = fact.num, scores = fa.scores, rotate = rot.method, fm = method, max.iter = max.iter, warnings = TRUE))
    } else {
    ap <- parallel(subject=nrow(fact.matrix), var = ncol(fact.matrix), rep = 100, cent=.05)
    nS <- nScree(x = ev$values, aparallel = ap$eigen$qevpea)
    plotnScree(nS)
    fact.num <- max(which(ev$values >= 1))
    suppressMessages(FA <- fa(fact.matrix, nfactors = fact.num, scores = fa.scores, rotate = rot.method, fm = method, max.iter = max.iter, warnings = TRUE))
    }
    %>
    
    #### Eigenvalues
    
    <% if (exists('fact.num') && !is.null(fact.num) && fact.num > 0) { %>
    You can find the eigenvalues of the possible factors in the following table (<%=fact.num%> factors were produced as you set):
    <%=
    emphasize.strong.rows(1:fact.num)
    cbind("Factor Number" = 1:length(ev$values), "Eigenvalues" = ev$values)
    %>
    <% } else { %>
    As you haven't provided value for the number of the factors, we calculated that automatically based on the eigenvalues, thus it is: <%=max(which(ev$values >= 1))%>. The eigenvalues you can find in the following table:
    <%=
    emphasize.strong.rows(1:max(which(ev$values >= 1)))
    cbind("Factor Number" = 1:length(ev$values), "Eigenvalues" = ev$values)
    %>
    <% } %>
    
    
    ## Factor loadings
    
    At the next step let's check the factor loadings. They mean that how deep the impact of a factor for the variables<%= ifelse(rot.method != "none",""," (Keep in mind that these are the results without rotation, please set the parameter rot.method to see the rotated ones)")%>. We emphasized the cells when the explained is higher than 30% of the whole variance.
    <%= 
    FA_loadings <- FA$loadings
    class(FA_loadings) <- "matrix"
    emphasize.strong.cells(which(abs(FA_loadings) > 0.3, arr.ind = TRUE))
    FA_loadings
    %>
    
    
    So it can be said that <%=paste(pandoc.list.return(paste(colnames(FA_loadings)[which(abs(FA_loadings) > 0.3, arr.ind = TRUE)[, 2]],rp.name(vars)[which(abs(FA_loadings) > 0.3, arr.ind = TRUE)[, 1]], sep = " is a latent factor of ")), collapse = '\n')%>
    
    <% if (length(which(FA_loadings > 0.3)) != length(which(abs(FA_loadings) > 0.3))) { %>
    
    <%=neg.comp <- colnames(FA_loadings)[which(FA_loadings < -0.3, arr.ind = TRUE)[, 2]]%>
    
    From them in the case<%=ifelse(length(neg.comp) > 1, "s", "")%> of the <%=paste(colnames(FA_loadings)[which(FA_loadings < -0.3, arr.ind = TRUE)[, 2]], rp.name(vars)[which(FA_loadings < -0.3, arr.ind = TRUE)[, 1]], sep = "'s impact on ")%>, we can say <%=ifelse(length(neg.comp) > 1, "they are", "that is")%> negative effects.
        
    <% } else { %>
    We can say that <%=ifelse(length(which(abs(FA_loadings) > 0.3)), "none of these impacts are negative", "this impact is positive")%>. 
    
    <% }%>
    
    
    <% if (length(rownames(FA$scores)) > 0 & nrow(vars) < 10) { %>
    
    ## Plot about the distribution of the observations
    
    Now let's check how the observations distribute among the <%= ifelse(fact.num < 2, "factor", "first and the second factors")%>.
    <%= FA_scores <- FA$scores
    plot(FA_scores)
    +text(FA_scores, labels = rownames(fact.matrix), cex = 1)
    %>
    <% } else { %>
    <% } %>
    
    ## Uniquenesses
    
    At last but not least let us say some words about the not explained part of the variables. There are two statistics which help us quantifying this concept: Communality and Uniqueness. They are in a really strong relationship, because Uniqueness is the variability of a variable minus its Communality. The first table contains the Uniqunesses, the second the Communalities of the variables:
    <%= 
    uni <- as.matrix(FA$uniquenesses)
    comm <- as.matrix(FA$communality)
    table <- cbind(uni,comm)
    colnames(table) <- c("Uniqunesses", "Communalities")
    table
    %>
    
    We can see from the table that variable <%=rownames(uni)[which(max(uni) == uni)]%> has the highest Uniqueness, so could be explained the least by the factors and variable <%= rownames(uni)[which(min(uni) == uni)]%> variance's was explained the most, because it has the lowest Uniqueness. From the communalities we can draw the same conclusion.
    
    <% } %>

FTest.tpl

  • F test
    by Gergely Daróczi, Daniel Nagy

    This template will run an F-test to check if two continuous variables have the same means.

    Required packages: nortest
    Example:
    rapport('FTest', data=ius2008, xvar='edu', yvar='age')
    rapport('FTest', data=ius2008, xvar='edu', yvar='age', norm.check=F)
    rapport('FTest', data=mtcars, xvar='cyl', yvar='drat', norm.check=F)
  • Input parameters
    FTest.tpl template has the following options to be (optionally) set while calling rapport:
    • xvar: X Variable(s)
      This is the X variable which will be used here
      Class: numeric

    • yvar: Y Variable(s)
      This is the Y variable which will be used here.
      Class: numeric

    • norm.check: Normality check
      You can set to run a normality check before the F test.
      Class: logical
      Default value: TRUE

  • ##########################################################################
    ##	 Running: rapport('FTest', data=ius2008, xvar='edu', yvar='age')
    ##########################################################################
    
    # Introduction
    
    
    F test compares the means of two continuous variables. In other words it shows if their means were statistically different.
    We should be careful, while using the F test, because of the strict normality assumption, where strict means approximately normal ditribution is not enough to satisfy that.
    
    
    
    
    # Normality assumption check (_Internet usage for educational purposes (hours per day)_)
    
    
    The [_Shapiro-Wilk test_](http://en.wikipedia.org/wiki/Shapiro%E2%80%93Wilk_test), the [_Lilliefors test_](http://en.wikipedia.org/wiki/Lilliefors_test) and the [_Anderson-Darling test_](http://en.wikipedia.org/wiki/Anderson_Darling_test) help us to decide if the above-mentioned assumption can be accepted of the _Internet usage for educational purposes (hours per day)_.
    
    
    
    
    -------------------------------------------------
              Method             Statistic   p-value 
    --------------------------- ----------- ---------
            Lilliefors            0.2223    2.243e-92
       (Kolmogorov-Smirnov)                          
          normality test                             
    
    Anderson-Darling normality     42.04    3.31e-90 
               test                                  
    
    Shapiro-Wilk normality test   0.7985    6.366e-28
    -------------------------------------------------
    
    
    So, the conclusions we can draw with the help of test statistics: 
       
     - based on _Lilliefors test_, distribution of _Internet usage for educational purposes (hours per day)_ is not normal
       
     - _Anderson-Darling test_ confirms violation of normality assumption
    
     - according to _Shapiro-Wilk test_, the distribution of _Internet usage for educational purposes (hours per day)_ is not normal
     
    
    
    As you can see, the applied tests confirm departures from normality.
    
    # Normality assumption check (_Age_)
    
    
    The [_Shapiro-Wilk test_](http://en.wikipedia.org/wiki/Shapiro%E2%80%93Wilk_test), the [_Lilliefors test_](http://en.wikipedia.org/wiki/Lilliefors_test) and the [_Anderson-Darling test_](http://en.wikipedia.org/wiki/Anderson_Darling_test) help us to decide if the above-mentioned assumption can be accepted of the _Internet usage for educational purposes (hours per day)_.
    
    
    
    
    -------------------------------------------------
              Method             Statistic   p-value 
    --------------------------- ----------- ---------
            Lilliefors             0.17     6.193e-54
       (Kolmogorov-Smirnov)                          
          normality test                             
    
    Anderson-Darling normality     32.16    1.26e-71 
               test                                  
    
    Shapiro-Wilk normality test   0.8216    9.445e-27
    -------------------------------------------------
    
    
    So, the conclusions we can draw with the help of test statistics: 
       
     - based on _Lilliefors test_, distribution of _Age_ is not normal
       
     - _Anderson-Darling test_ confirms violation of normality assumption
    
     - according to _Shapiro-Wilk test_, the distribution of _Age_ is not normal
     
    
    
    As you can see, the applied tests confirm departures from normality.
    
    
    
    _In this case it is advisable to run a more robust test, then the F-test._
    
    
    
    
    
    ########################################################################################
    ##	 Running: rapport('FTest', data=ius2008, xvar='edu', yvar='age', norm.check=F)
    ########################################################################################
    
    # Introduction
    
    
    F test compares the means of two continuous variables. In other words it shows if their means were statistically different.
    We should be careful, while using the F test, because of the strict normality assumption, where strict means approximately normal ditribution is not enough to satisfy that.
    
    
    
    # The F-test
    
    
    Here is the the result of the _F test_ to compare the means of _Internet usage for educational purposes (hours per day)_ and _Age_.
    
    
    --------------------------------------------
           Method          Statistic   p-value  
    --------------------- ----------- ----------
    F test to compare two   0.08618   3.772e-180
          variances                             
    --------------------------------------------
    
    
    
    We can see from the table (in the p-value coloumn) that there is a significant difference between the means of _Internet usage for educational purposes (hours per day)_ and _Age_.
    
    
    
    ########################################################################################
    ##	 Running: rapport('FTest', data=mtcars, xvar='cyl', yvar='drat', norm.check=F)
    ########################################################################################
    
    # Introduction
    
    
    F test compares the means of two continuous variables. In other words it shows if their means were statistically different.
    We should be careful, while using the F test, because of the strict normality assumption, where strict means approximately normal ditribution is not enough to satisfy that.
    
    
    
    # The F-test
    
    
    Here is the the result of the _F test_ to compare the means of _cyl_ and _drat_.
    
    
    -------------------------------------------
           Method          Statistic   p-value 
    --------------------- ----------- ---------
    F test to compare two    11.16    1.461e-09
          variances                            
    -------------------------------------------
    
    
    
    We can see from the table (in the p-value coloumn) that there is a significant difference between the means of _cyl_ and _drat_.
  • Exports of example
    pandoc and the other backends provide a great number of formats to export your templates.
    We have generated the following types for demonstartion purposes based on the example command(s):
    rapport('FTest', data=ius2008, xvar='edu', yvar='age')
    rapport('FTest', data=ius2008, xvar='edu', yvar='age', norm.check=F)
    rapport('FTest', data=mtcars, xvar='cyl', yvar='drat', norm.check=F)
    Popular formats:
    Other morkdown formats
  • Source of `FTest.tpl`
    <!--head
    meta:
      title: F test
      author: Gergely Daróczi, Daniel Nagy
      description: This template will run an F-test to check if two continuous variables have the same means.
      email: ~
      packages:
      - nortest
      example:
      - rapport('FTest', data=ius2008, xvar='edu', yvar='age')
      - rapport('FTest', data=ius2008, xvar='edu', yvar='age', norm.check=F)
      - rapport('FTest', data=mtcars, xvar='cyl', yvar='drat', norm.check=F)
    inputs:
    - name: xvar
      label: X Variable(s)
      description: This is the X variable which will be used here
      class: numeric
      length:
        min: 1.0
        max: 1.0
      required: yes
      standalone: no
    - name: yvar
      label: Y Variable(s)
      description: This is the Y variable which will be used here.
      class: numeric
      length:
        min: 1.0
        max: 1.0
      required: yes
      standalone: no
    - name: norm.check
      label: Normality check
      description: You can set to run a normality check before the F test.
      class: logical
      value: yes
      required: no
      standalone: yes
    head-->
    
    # Introduction
    
    F test compares the means of two continuous variables. In other words it shows if their means were statistically different.
    We should be careful, while using the F test, because of the strict normality assumption, where strict means approximately normal ditribution is not enough to satisfy that.
    
    <% if (norm.check) { %>
    
    # Normality assumption check (<%=p(xvar.label)%>)
    
    The <%=ifelse(length(xvar) < 5000, "[_Shapiro-Wilk test_](http://en.wikipedia.org/wiki/Shapiro%E2%80%93Wilk_test), the", "")%> [_Lilliefors test_](http://en.wikipedia.org/wiki/Lilliefors_test) and the [_Anderson-Darling test_](http://en.wikipedia.org/wiki/Anderson_Darling_test) help us to decide if the above-mentioned assumption can be accepted of the <%=p(xvar.label)%>.
    
    <% if (length(xvar) < 5000) { %>
    
    <%= ntest <- htest(xvar, lillie.test, ad.test, shapiro.test)
    k <- 0
    l <- 0
    m <- 0
    n <- 0
    p <- 0.05
    if (ntest$p[1] < 0.05) {l <- k + 1}
    if (ntest$p[2] < 0.05) {m <- l + 1}
    if (ntest$p[3] < 0.05) {n <- m + 1}
    ntest
    %>
    So, the conclusions we can draw with the help of test statistics: 
       
     - based on _Lilliefors test_, distribution of _<%= xvar.label %>_ is <%= ifelse(ntest[1, 3] < p, "not normal", "normal") %>
       
     - _Anderson-Darling test_ confirms<%= ifelse(ntest[2, 3] < p, " violation of", "") %> normality assumption
    
     - according to _Shapiro-Wilk test_, the distribution of _<%= xvar.label %>_ is<%= ifelse(ntest[3, 3] < p, " not", "") %> normal
     
    <% } else { %>
    <%= ntest <- htest(xvar, lillie.test, ad.test)
    k <- 0
    l <- 0
    m <- 0
    n <- 0
    p <- 0.05
    if (ntest$p[1] < 0.05) {l <- k + 1}
    if (ntest$p[2] < 0.05) {n <- l + 1}
    ntest
    %>
    
    So, the conclusions we can draw with the help of test statistics: 
       
     - based on _Lilliefors test_, distribution of _<%= xvar.label %>_ is <%= ifelse(ntest[1, 3] < p, "not normal", "normal") %>
       
     - _Anderson-Darling test_ confirms<%= ifelse(ntest[2, 3] < p, " violation of", "") %> normality assumption
    <% } %>
    
    <%= if (n > 0) {
    sprintf("As you can see, the applied tests %s.", ifelse(n > 1, "confirm departures from normality", "yield different results on hypotheses of normality, so you may want to stick with one you find most appropriate or you trust the most.")) 
    } else { 
    sprintf("reject departures from normality") 
    } 
    %>
    
    # Normality assumption check (<%=p(yvar.label)%>)
    
    The <%=ifelse(length(yvar) < 5000, "[_Shapiro-Wilk test_](http://en.wikipedia.org/wiki/Shapiro%E2%80%93Wilk_test), the", "")%> [_Lilliefors test_](http://en.wikipedia.org/wiki/Lilliefors_test) and the [_Anderson-Darling test_](http://en.wikipedia.org/wiki/Anderson_Darling_test) help us to decide if the above-mentioned assumption can be accepted of the <%=p(xvar.label)%>.
    
    <% if (length(yvar) < 5000) { %>
    
    <%= ntest <- htest(yvar, lillie.test, ad.test, shapiro.test)
    k2 <- 0
    l2 <- 0
    m2 <- 0
    n2 <- 0
    p <- 0.05
    if (ntest$p[1] < 0.05) {l2 <- k2 + 1}
    if (ntest$p[2] < 0.05) {m2 <- l2 + 1}
    if (ntest$p[3] < 0.05) {n2 <- m2 + 1}
    ntest
    %>
    So, the conclusions we can draw with the help of test statistics: 
       
     - based on _Lilliefors test_, distribution of _<%= yvar.label %>_ is <%= ifelse(ntest[1, 3] < p, "not normal", "normal") %>
       
     - _Anderson-Darling test_ confirms<%= ifelse(ntest[2, 3] < p, " violation of", "") %> normality assumption
    
     - according to _Shapiro-Wilk test_, the distribution of _<%= yvar.label %>_ is<%= ifelse(ntest[3, 3] < p, " not", "") %> normal
     
    <% } else { %>
    <%= ntest <- htest(yvar, lillie.test, ad.test)
    k2 <- 0
    l2 <- 0
    m2 <- 0
    n2 <- 0
    p2 <- 0.05
    if (ntest$p[1] < 0.05) {l2 <- k2 + 1}
    if (ntest$p[2] < 0.05) {n2 <- l2 + 1}
    ntest
    %>
    
    So, the conclusions we can draw with the help of test statistics: 
       
     - based on _Lilliefors test_, distribution of _<%= yvar.label %>_ is <%= ifelse(ntest[1, 3] < p, "not normal", "normal") %>
       
     - _Anderson-Darling test_ confirms<%= ifelse(ntest[2, 3] < p, " violation of", "") %> normality assumption
    <% } %>
    
    <%= if (n2 > 0) {
    sprintf("As you can see, the applied tests %s.", ifelse(n2 > 1, "confirm departures from normality", "yield different results on hypotheses of normality, so you may want to stick with one you find most appropriate or you trust the most.")) 
    } else { 
    sprintf("reject departures from normality") 
    } 
    %>
    
    <% if(n + n2 < 1) { %>
    
    # The F-test
    
    Whereas we have checked the normality assumptions, let's see the results of the _F test_ to compare the means of <%= p(xvar.label) %> and <%=p(yvar.label)%>.
    
    <%=
    ftest <- function(xvar) var.test(xvar, yvar)
    Ftest <- htest(xvar, ftest)
    f.p <- Ftest$"p.value"
    p <- 0.05
    Ftest
    %>
    
    We can see from the table (in the p-value coloumn) that there is<%=ifelse(f.p < p,""," not")%> a significant difference between the means of _<%=xvar.label%>_ and _<%=yvar.label%>_.
    
    <% } else { %>
    _In this case it is advisable to run a more robust test, then the F-test._
    <% } %>
    
    <% } else { %>
    # The F-test
    
    Here is the the result of the _F test_ to compare the means of <%= p(xvar.label) %> and <%=p(yvar.label)%>.
    
    <%=
    ftest <- function(xvar) var.test(xvar, yvar)
    Ftest <- htest(xvar, ftest)
    f.p <- Ftest$"p.value"
    p <- 0.05
    Ftest
    %>
    
    We can see from the table (in the p-value coloumn) that there is<%=ifelse(f.p < p,""," not")%> a significant difference between the means of _<%=xvar.label%>_ and _<%=yvar.label%>_.
    <% } %>

GenerateBeta.tpl

  • Generate beta distribution
    by Rapporter Team (@rapporter)

    Random generation for the Beta distribution with parameters.

    Required packages: nortest
    Example:
  • Input parameters
    GenerateBeta.tpl template has the following options to be (optionally) set while calling rapport:
    • n: Observations
      Number of observations
      Class: numeric
      Default value: 100

    • shape1: Alpha
      First parameter of the Beta distribution
      Class: numeric
      Default value: 0.1

    • shape2: Beta
      Second parameter of the Beta distribution
      Class: numeric
      Default value: 0.1

  • No examples provided with template, sorry.
  • No examples provided with template, sorry.
  • Source of `GenerateBeta.tpl`
    <!--head
    meta:
      title: Generate beta distribution
      author: Rapporter Team (@rapporter)
      description: Random generation for the Beta distribution with parameters.
      email: ~
      packages: nortest
      example: ~
    inputs:
    - name: 'n'
      label: Observations
      description: Number of observations
      class: numeric
      length:
        min: 1.0
        max: 1.0
      value: 100.0
      limit:
        min: 1.0
        max: .inf
      required: yes
      standalone: yes
    - name: shape1
      label: Alpha
      description: First parameter of the Beta distribution
      class: numeric
      length:
        min: 1.0
        max: 1.0
      value: 0.1
      limit:
        min: 0.0
        max: 100.0
      required: yes
      standalone: yes
    - name: shape2
      label: Beta
      description: Second parameter of the Beta distribution
      class: numeric
      length:
        min: 1.0
        max: 1.0
      value: 0.1
      limit:
        min: 0.0
        max: 100.0
      required: yes
      standalone: yes
    head-->
    # Histogram
    
    The below [histogram](http://en.wikipedia.org/wiki/Histogram) shows the graphical representation of the estimated probability distribution (generated by the provided parameters).
    
    <%=
    g <- rbeta(n, shape1, shape2)
    set.caption(paste0('rbeta(n = ', n, ', shape1 = ', shape1, ', shape2 = ', shape2, ')'))
    hist(g, main = '', xlab = '')
    %>
    
    The plot was generated with `hist` command, the grid was added automatically.
    
    # Descriptives
    
    The [skewness](http://en.wikipedia.org/wiki/Skewness) is <%=skewness(g)%> and the [kurtosis](http://en.wikipedia.org/wiki/Kurtosis) is <%=kurtosis(g)%> based on the <%=n%> observations, although most [frequentists](http://xkcd.com/1132/) would be rather interested in the [mean](http://xkcd.com/207/) (<%=mean(g)%>) and standard deviation (<%=sd(g)%>).
    
    But did we generated a [normal distribution](http://en.wikipedia.org/wiki/Normal_distribution) after all?
    
    # Normality check
    
    Various hypothesis tests can be applied in order to test if the distribution of given random variable violates normality assumption. These procedures test the H~0~ that provided variable's distribution is _normal_. At this point only few such tests will be covered: the ones that are available in `stats` package (which comes bundled with default R installation) and `nortest` package that is [available](http://cran.r-project.org/web/packages/nortest/index.html) on CRAN.
    
     - **Shapiro-Wilk test** is a powerful normality test appropriate for small samples. In R, it's implemented in `shapiro.test` function available in `stats` package.
     - **Lilliefors test** is a modification of _Kolmogorov-Smirnov test_ appropriate for testing normality when parameters or normal distribution ($\mu$, $\sigma^2$) are not known. `lillie.test` function is located in `nortest` package.
     - **Anderson-Darling test** is one of the most powerful normality tests as it will detect the most of departures from normality. You can find `ad.test` function in `nortest` package.
     - **Pearson $\chi^2$ test** is another normality test which takes more "traditional" approach in normality testing. `pearson.test` is located in `nortest` package.
    
    ## Results
    
    Here you can see the results of applied normality tests (_p-values_ less than 0.05 indicate significant discrepancies):
    
    <%=
    if (length(g) > 5000) {
        h <- htest(g, lillie.test, ad.test, pearson.test)
    } else {
        h <- htest(g, shapiro.test, lillie.test, ad.test, pearson.test)
    }
    p <- .05
    h
    -%>
    
    So, let's draw some conclusions based on applied normality test:
    
    <% if (!is.na(h[1, 3])) { -%>
     - based on _Lilliefors test_, the distribution is <%= ifelse(h[1, 3] < p, "not normal", "normal") %>
    <% } %>
    <% if (!is.na(h[2, 3])) { -%>
     - _Anderson-Darling test_ confirms <%= ifelse(h[2, 3] < p, "violation of", "") %> normality assumption
    <% } %>
    <% if (!is.na(h[3, 3])) { -%>
     - _Pearson's $\chi^2$ test_ classifies the underlying distribution as <%= ifelse(h[3, 3] < p, "non-normal", "normal") %>
    <% } %>
    <% if (length(g) > 5000) { -%>
     - according to _Shapiro-Wilk test_, the distribution is <%= ifelse(h[4, 3] < p, "not", "") %> normal.
    <% } %>
    
    ## Q-Q Plot
    
    "Q" in _Q-Q plot_ stands for _quantile_, as this plot compares empirical and theoretical distribution (in this case, _normal_ distribution) by plotting their quantiles against each other. For normal distribution, plotted dots should approximate a "straight", `x = y` line.
    
    <%=
    set.caption('Q-Q plot')
    qqmath(g, panel=function(x){
                panel.qqmath(x)
                panel.qqmathline(x, distribution = qnorm)
    }, xlab = "Theoretical Quantiles", ylab = "Empirical Quantiles")
    %>
    
    ## Kernel Density Plot
    
    _Kernel density plot_ is a plot of smoothed _empirical distribution function_. As such, it provides good insight about the shape of the distribution. For normal distributions, it should resemble the well known "bell shape".
    
    <%=
    set.caption('Kernel Density Plot')
    rp.densityplot(g)
    %>

GLM.tpl

  • GLM
    by Daniel Nagy

    In this template Rapporter will present you GLM.

    Required packages:
    Example:
    rapport('GLM.tpl', data=ius2008, dep='age', indep=c('leisure','edu'), family='poisson')
    rapport('GLM.tpl', data=ius2008, dep='age', indep=c('leisure','edu'), indep.inter=FALSE, family='poisson')
    rapport('GLM.tpl', data=ius2008, dep='age', indep=c('leisure','edu'), indep.inter=FALSE, family='Gamma')
  • Input parameters
    GLM.tpl template has the following options to be (optionally) set while calling rapport:
    • dep: Dependent Variable
      These are the dependent variables which will be used here
      Class: numeric

    • indep: Independent Variable
      These are the independent variables which will be used here
      Class: numeric

    • indep.inter: Interaction
      Interaction between the independent variables
      Class: logical
      Default value: TRUE

    • family: Method of GLM
      Class: character
      Default value: gaussian

  • ###########################################################################################################
    ##	 Running: rapport('GLM.tpl', data=ius2008, dep='age', indep=c('leisure','edu'), family='poisson')
    ###########################################################################################################
    
    # Introduction
    
    
    [Generalized Linear Model (GLM)](http://en.wikipedia.org/wiki/Generalized_linear_model) is a generalization of the ordinary [Linear Regression](http://en.wikipedia.org/wiki/Linear_regression). While using GLM we don't need the assumption of normality for response variables. There are two basic ideas of the model: It allows the linear model to be related to the response variable via a link function and the magnitude of the variance of each measurement to be a function of its predicted value. An extinsion to the GLM is the [Hierarchical generalized linear model](https://en.wikipedia.org/wiki/Hierarchical_generalized_linear_model).
    
    
    
    #Overview
    
    Multivariate-General Linear Model was carried out, with _Internet usage in leisure time (hours per day)_ and _Internet usage for educational purposes (hours per day)_ as independent variables, and _Age_ as a dependent variable.
    The [interaction](http://en.wikipedia.org/wiki/Interaction) between the independent variables was taken into account.
    
    
    
    
    --------------------------------------------------------------
         &nbsp;        Estimate   Std. Error   z value   Pr(>|z|) 
    ----------------- ---------- ------------ --------- ----------
     **(Intercept)**    3.198      0.02122      150.7       0     
    
       **leisure**     -0.02021    0.005847    -3.457    0.000547 
    
         **edu**       0.01474     0.007586     1.944    0.05196  
    
     **leisure:edu**   0.004439    0.001795     2.472    0.01342  
    --------------------------------------------------------------
    
    Table: Fitting General Linear Model: age based on _leisure_ and _edu_
    
    
    
    From the table one can see that 
    * (Intercept) has significant effect on the dependent variable, the p-value of that is 0 
    * leisure has significant effect on the dependent variable, the p-value of that is 0.001 
    * leisure:edu has significant effect on the dependent variable, the p-value of that is 0.013 
    
    <!-- end of list -->
    
    
    
    ##############################################################################################################################
    ##	 Running: rapport('GLM.tpl', data=ius2008, dep='age', indep=c('leisure','edu'), indep.inter=FALSE, family='poisson')
    ##############################################################################################################################
    
    # Introduction
    
    
    [Generalized Linear Model (GLM)](http://en.wikipedia.org/wiki/Generalized_linear_model) is a generalization of the ordinary [Linear Regression](http://en.wikipedia.org/wiki/Linear_regression). While using GLM we don't need the assumption of normality for response variables. There are two basic ideas of the model: It allows the linear model to be related to the response variable via a link function and the magnitude of the variance of each measurement to be a function of its predicted value. An extinsion to the GLM is the [Hierarchical generalized linear model](https://en.wikipedia.org/wiki/Hierarchical_generalized_linear_model).
    
    
    
    #Overview
    
    Multivariate-General Linear Model was carried out, with _Internet usage in leisure time (hours per day)_ and _Internet usage for educational purposes (hours per day)_ as independent variables, and _Age_ as a dependent variable.
    The [interaction](http://en.wikipedia.org/wiki/Interaction) between the independent variables wasn't taken into account.
    
    
    
    
    --------------------------------------------------------------
         &nbsp;        Estimate   Std. Error   z value   Pr(>|z|) 
    ----------------- ---------- ------------ --------- ----------
     **(Intercept)**    3.163      0.01605      197.1       0     
    
       **leisure**     -0.0095     0.003888    -2.443    0.01455  
    
         **edu**       0.03071     0.003883     7.91    2.581e-15 
    --------------------------------------------------------------
    
    Table: Fitting General Linear Model: age based on _leisure_ and _edu_
    
    
    
    From the table one can see that 
    * (Intercept) has significant effect on the dependent variable, the p-value of that is 0 
    * leisure has significant effect on the dependent variable, the p-value of that is 0.015 
    * edu has significant effect on the dependent variable, the p-value of that is 0 
    
    <!-- end of list -->
    
    
    
    ############################################################################################################################
    ##	 Running: rapport('GLM.tpl', data=ius2008, dep='age', indep=c('leisure','edu'), indep.inter=FALSE, family='Gamma')
    ############################################################################################################################
    
    # Introduction
    
    
    [Generalized Linear Model (GLM)](http://en.wikipedia.org/wiki/Generalized_linear_model) is a generalization of the ordinary [Linear Regression](http://en.wikipedia.org/wiki/Linear_regression). While using GLM we don't need the assumption of normality for response variables. There are two basic ideas of the model: It allows the linear model to be related to the response variable via a link function and the magnitude of the variance of each measurement to be a function of its predicted value. An extinsion to the GLM is the [Hierarchical generalized linear model](https://en.wikipedia.org/wiki/Hierarchical_generalized_linear_model).
    
    
    
    #Overview
    
    Multivariate-General Linear Model was carried out, with _Internet usage in leisure time (hours per day)_ and _Internet usage for educational purposes (hours per day)_ as independent variables, and _Age_ as a dependent variable.
    The [interaction](http://en.wikipedia.org/wiki/Interaction) between the independent variables wasn't taken into account.
    
    
    
    
    --------------------------------------------------------------
         &nbsp;        Estimate   Std. Error   t value   Pr(>|t|) 
    ----------------- ---------- ------------ --------- ----------
     **(Intercept)**    0.0422    0.0008599     49.08   4.612e-212
    
       **leisure**    0.0003828   0.0002093     1.829    0.06785  
    
         **edu**      -0.001182   0.0001948    -6.065   2.332e-09 
    --------------------------------------------------------------
    
    Table: Fitting General Linear Model: age based on _leisure_ and _edu_
    
    
    
    From the table one can see that 
    * (Intercept) has significant effect on the dependent variable, the p-value of that is 0 
    * edu has significant effect on the dependent variable, the p-value of that is 0 
    
    <!-- end of list -->
  • Exports of example
    pandoc and the other backends provide a great number of formats to export your templates.
    We have generated the following types for demonstartion purposes based on the example command(s):
    rapport('GLM.tpl', data=ius2008, dep='age', indep=c('leisure','edu'), family='poisson')
    rapport('GLM.tpl', data=ius2008, dep='age', indep=c('leisure','edu'), indep.inter=FALSE, family='poisson')
    rapport('GLM.tpl', data=ius2008, dep='age', indep=c('leisure','edu'), indep.inter=FALSE, family='Gamma')
    Popular formats:
    Other morkdown formats
  • Source of `GLM.tpl`
    <!--head
    meta:
      title: GLM
      author: Daniel Nagy
      description: In this template Rapporter will present you GLM.
      email: ~
      packages: ~
      example:
      - rapport('GLM.tpl', data=ius2008, dep='age', indep=c('leisure','edu'), family='poisson')
      - rapport('GLM.tpl', data=ius2008, dep='age', indep=c('leisure','edu'), indep.inter=FALSE, family='poisson')
      - rapport('GLM.tpl', data=ius2008, dep='age', indep=c('leisure','edu'), indep.inter=FALSE, family='Gamma')
    inputs:
    - name: dep
      label: Dependent Variable
      description: These are the dependent variables which will be used here
      class: numeric
      length:
        min: 1.0
        max: 100.0
      required: yes
      standalone: no
    - name: indep
      label: Independent Variable
      description: These are the independent variables which will be used here
      class: numeric
      length:
        min: 1.0
        max: 100.0
      required: yes
      standalone: no
    - name: indep.inter
      label : Interaction
      description: Interaction between the independent variables
      class: logical
      length:
        min: 1.0
        max: 1.0
      value: yes
      required: no
      standalone: yes
    - name: family
      label: Method of GLM
      description: 
      class: character
      length:
        min: 1.0
        max: 1.0
      options:
      - gaussian
      - binomial
      - Gamma
      - poisson
      value: gaussian
      matchable: yes
      allow_multiple: no
      required: no
      standalone: yes
    head-->
    
    
    <% if (isTRUE(any(indep.name == dep.name))) { %>
    
    You provided  the same variable as a dependent and as an independent variable. In this case the model does not make sense, please replace the duplicated variables in order to run the GLM.
    
    <% } else { %>
    
    <% if (indep < 0 | indep > 1 & family == "binomial") { %>
    Values of the independent variables must be between 0 and 1 when binomial used as link function. Please change parameter family to one that is usable or switch the independent variable to meet the assumptions.
    
    <% } else { %>
    
    
    # Introduction
    
    [Generalized Linear Model (GLM)](http://en.wikipedia.org/wiki/Generalized_linear_model) is a generalization of the ordinary [Linear Regression](http://en.wikipedia.org/wiki/Linear_regression). While using GLM we don't need the assumption of normality for response variables. There are two basic ideas of the model: It allows the linear model to be related to the response variable via a link function and the magnitude of the variance of each measurement to be a function of its predicted value. An extinsion to the GLM is the [Hierarchical generalized linear model](https://en.wikipedia.org/wiki/Hierarchical_generalized_linear_model).
    
    <%=
    d <- structure(na.omit(data.frame(dep, indep)), .Names = c(dep.name, indep.name))
    indep.int <- fml(dep.name, indep.name, join.right = "*")
    indep.nonint <- fml(dep.name, indep.name, join.right = "+")
    suppressWarnings(fit <- glm(ifelse(indep.inter, indep.int, indep.nonint), data = d, family = family))
    indep.plu <- switch(indep.ilen, '', 's')
    %>
    
    #Overview
    
    <%= ifelse(indep.ilen == 1, '', 'Multivariate-') %>General Linear Model was carried out, with <%= p(indep.label) %> as independent variable<%= indep.plu %>, and <%= p(dep.label) %> as a dependent variable.
    The [interaction](http://en.wikipedia.org/wiki/Interaction) between the independent variables was<%=ifelse(indep.inter, "", "n't")%> taken into account.
    
    
    
    <%=
    set.caption(sprintf('Fitting General Linear Model: %s based on %s', dep.name, p(indep.name)))
    fit
    p_val <- summary(fit)$coefficients[, 4]
    %>
    
    From the table one can see that <%= paste(pandoc.list.return(paste(rownames(summary(fit)$coefficients)[which(p_val < 0.05)], round(p_val, 3)[which(p_val < 0.05)], sep = " has significant effect on the dependent variable, the p-value of that is ")), collapse = '\n')
    %>
    
    <% }} %>

HierarchicalClusterAnalysis.tpl

  • Hierarchical Cluster Analysis
    by Daniel Nagy

    In this template Rapporter will present you Hierarchical Cluster Analysis.

    Required packages: mclust
    Example:
  • Input parameters
    HierarchicalClusterAnalysis.tpl template has the following options to be (optionally) set while calling rapport:
    • vars: Used Variables
      Class: numeric

    • dist.mat: Distance Matrix
      Is your data a Distance matrix?
      Class: logical
      Default value: FALSE

    • method: Method
      Class: character
      Default value: ward

    • clust.num: Number of Clusters
      How many clusters you want to check on the Dendogram?
      Class: integer

  • No examples provided with template, sorry.
  • No examples provided with template, sorry.
  • Source of `HierarchicalClusterAnalysis.tpl`
    <!--head
    meta:
      title: Hierarchical Cluster Analysis
      author: Daniel Nagy
      description: In this template Rapporter will present you Hierarchical Cluster Analysis.
      email: ~
      packages:
      - mclust
      example: ~
    inputs:
    - name: vars
      label: Used Variables
      description:
      class: numeric
      length:
        min: 2.0
        max: 5000.0
      required: yes
      standalone: no
    - name: dist.mat
      label : Distance Matrix
      description: Is your data a Distance matrix?
      class: logical
      value: no
      required: no
      standalone: yes
    - name: method
      label: Method
      description: 
      class: character
      length:
        min: 1.0
        max: 1.0
      options:
      - ward
      - single
      - complete
      - average
      - mcquitty
      - median
      - centroid
      value: ward
      matchable: yes
      allow_multiple: no
      required: no
      standalone: yes
    - name: clust.num
      label: Number of Clusters
      description: How many clusters you want to check on the Dendogram?
      class: integer
      limit:
        min: 2.0
        max: 9999.0
      required: no
      standalone: yes
    head-->
    
    
    # Introduction
    
    [Hierarchical Cluster Analysis](http://en.wikipedia.org/wiki/Hierarchical_clustering) is a data mining method which seeks to build a hierarchy of clusters. Clusters are calculated based on the distances between the observations. At the beginning each observation is assigned to be a single cluster, later in every round the most similar clusters will be joined until all observations are in one cluster. One should not mix it up with [K-means Cluster Analysis](http://en.wikipedia.org/wiki/K-means_clustering), which calculates the clusters based on the final numbers of them.
    
    # HCA
    
    Below you can see on the plot how the clusters were made, how the observations were paired with each other. The horizontal linkage between the vertical lines indicates the stage where two clusters joined to each other. In the bottom of the plot you can see the clustering process in an other way, for each observations the shorter lines indicate later clustering. 
    
    <%=
    variables <- scale(na.omit(vars))
    if (!dist.mat) {
    d <- dist(variables) 
    } else { 
    d <- variables
    }
    try(HCA <- hclust(d,method))
    if (exists('HCA') && !is.null(HCA)) { 
    } else {
    d <- dist(variables)
    HCA <- hclust(d,method)
    }
    %>
    
    <% if (exists('clust.num') && !is.null(clust.num) && clust.num > 0) { %>
    <% if (length(HCA$order) <= clust.num) { %>
    The number of the clusters cannot be equal or higher than the unique cases (<%=length(HCA$order)%>), which you set (<%=clust.num%>), was <%=ifelse(length(HCA$order) == clust.num, "equal", "higher")%>. In the following, we will use the maximum available number of the clusters (<%=length(HCA$order) -1%>).
    <%= clust.num <- length(HCA$order) -1 %>
    <% } %>
    <%=paste("The red boxes shows the last", clust.num, "clusters.") %>
    <%=
    plclust(HCA, labels = F, main = "HCA", xlab = "Hierarchical Cluster Analysis", sub = "")
    +rect.hclust(HCA, k = clust.num, border = "red")
    %>
    <% } else { %>
    <%= 
    plclust(HCA, labels = F, main = "HCA", xlab = "Hierarchical Cluster Analysis", sub = "") 
    %>
    <% } %>
    
    We can say that <%=length(which(HCA$height == 0))%> observations have the same values on the used variables, so they were joined in the first <%=length(which(HCA$height == 0))%> round. After that <%=which(HCA$merge[, 1] >= 0)[1]-length(which(HCA$height == 0))%> times there were only made clusters with 2 observations, the first cluster that contains 3 was made in the round <%=which(HCA$merge[, 1] >= 0)[1]%>.
    
    
    ### Optimal number of clusters
    
    <%= mod1 <-  suppressWarnings(Mclust(variables))%>
    
    According to the BIC for EM initialized by hierarchical clustering for parameterized Gaussian mixture models, the optimum numbers of the clusters are <%=summary(mod1)$G%>.
    <% if (exists('clust.num') && !is.null(clust.num) && clust.num > 0) { %>
    <% if (summary(mod1)$G != clust.num) { %>
    Let's see how the Dendogram looks like when we the optimal number of the clusters plotted in it.
    <%=plclust(HCA, labels = F, main = "HCA", xlab = "Hierarchical Cluster Analysis",sub = "")
    +rect.hclust(HCA, k = summary(mod1)$G, border = "red")%>
    <% } else { %>
    So we can see the optimum number of the clusters on the plot above.
    <% } %>
    <% } else { %>
    Let's see how the Dendogram looks like when we the optimal number of the clusters (<%=summary(mod1)$G%>) plotted in it.
    <%=plclust(HCA, labels = F, main = "HCA", xlab = "Hierarchical Cluster Analysis",sub = "")
    +rect.hclust(HCA, k = summary(mod1)$G, border = "red")%>
    <% } %>

HomogeneityCrosstable.tpl

  • Homogeneity test of factor variables
    by Gergely Daróczi, Daniel Nagy

    Test of homogeneity of a given factor variable split by another factor.

    Required packages: descr
    Example:
    rapport('HomogeneityCrosstable', data=ius2008, var='gender', split='dwell')
  • Input parameters
    HomogeneityCrosstable.tpl template has the following options to be (optionally) set while calling rapport:
    • var: Variable to analyse
      Variable to analyse
      Class: factor

    • split: Split variable
      Split variable
      Class: factor

  • ###############################################################################################
    ##	 Running: rapport('HomogeneityCrosstable', data=ius2008, var='gender', split='dwell')
    ###############################################################################################
    
    # Variable description
    
    
    Analysing "gender" ("Gender") with _673_ valid values whether frequency counts are distributed equally across different categories of "dwell" ("Dwelling").
    
    "dwell" has _3_ categories:
    
    
      * city
      * small town
      * village
    
    <!-- end of list -->
    
    
    
    # Counts
    
    
    ------------------------------------------------
              &nbsp;  male   female   Missing   Sum 
    ---------------- ------ -------- --------- -----
            **city**  338     234       27      599 
    
      **small town**   28      3         2      33  
    
         **village**   19      9         2      30  
    
         **Missing**   25      17        5      47  
    
             **Sum**  410     263       36      709 
    ------------------------------------------------
    
    Table: Counted values: "dwell" and "gender"
    
    
    # Chi-squared test
    
    
    Our [null hypothetis](http://en.wikipedia.org/wiki/Null_hypothesis) says that the proportion of *gender* is indentical in each categories of *dwell*.
    
    
    ---------------------------------
     Test statistic   df    P value  
    ---------------- ---- -----------
         16.18        6   _0.01282_ *
    ---------------------------------
    
    Table: Pearson's Chi-squared test: `table`
    
    
    
    The chi-squared test returned the value of _16.18_ with a degree of freedom being _6_. Based on the returned [p value](http://en.wikipedia.org/wiki/P-value) (_0.01282_) we could state that the null hypothesis is rejected.
  • Exports of example
    pandoc and the other backends provide a great number of formats to export your templates.
    We have generated the following types for demonstartion purposes based on the example command(s):
    rapport('HomogeneityCrosstable', data=ius2008, var='gender', split='dwell')
    Popular formats:
    Other morkdown formats
  • Source of `HomogeneityCrosstable.tpl`
    <!--head
    meta:
      title: Homogeneity test of factor variables
      author: Gergely Daróczi, Daniel Nagy
      description: Test of homogeneity of a given factor variable split by another factor.
      packages: 
      - descr
      example:
      - rapport('HomogeneityCrosstable', data=ius2008, var='gender', split='dwell')
    inputs:
    - name: var
      label: Variable to analyse
      description: Variable to analyse
      class: factor
      length:
        min: 1.0
        max: 1.0
      value: ~
      required: yes
      standalone: no
    - name: split
      label: Split variable
      description: Split variable
      class: factor
      length:
        min: 1.0
        max: 1.0
      value: ~
      required: yes
      standalone: no
    head-->
    
    # Variable description
    
    Analysing "<%=rp.name(var)%>"<%=ifelse(rp.label(var) == rp.name(var), '', sprintf(' ("%s")', rp.label(var)))%> with <%=rp.valid(as.numeric(var))%> valid values whether frequency counts are distributed equally across different categories of "<%=rp.name(split)%>"<%=ifelse(rp.label(split) == rp.name(split), '', sprintf(' ("%s")', rp.label(split)))%>.
    
    "<%=rp.name(split)%>" has <%=split.cat <- names(table(split)); length(split.cat)%> categories:
    <%=as.list(split.cat)%>
    
    # Counts
    
    <%=
    table	<- table(split, var, deparse.level = 0, useNA = 'ifany')
    if (length(which(is.na(rownames(table)))) > 0) {
        rownames(table)[which(is.na(rownames(table)))] <- 'Missing' }
    if (length(which(is.na(colnames(table)))) > 0) {
        colnames(table)[which(is.na(colnames(table)))] <- 'Missing' }
    fulltable	<- addmargins(table)
    set.caption(sprintf('Counted values: "%s" and "%s"', rp.name(split), rp.name(var)))
    set.alignment(row.names = "right")
    fulltable
    %>
    
    # Chi-squared test
    
    Our [null hypothetis](http://en.wikipedia.org/wiki/Null_hypothesis) says that the proportion of *<%=rp.name(var)%>* is indentical in each categories of *<%=rp.name(split)%>*.
    
    <%=
    t <- suppressWarnings(chisq.test(table))
    t
    %>
    
    The chi-squared test returned the value of <%=as.numeric(t$statistic)%> with a degree of freedom being <%=as.numeric(t$parameter)%>. Based on the returned [p value](http://en.wikipedia.org/wiki/P-value) (<%=t$p.value%>) we could state that the null hypothesis is <%=ifelse(t$p.value < 0.05, 'rejected', 'accepted')%>.

KMeansCluster.tpl

  • K-means cluster
    by Gergely Daróczi, Daniel Nagy

    K-means clustering with automatically estimated number of clusters

    Required packages: cluster
    Example:
    rapport('KMeansCluster.tpl', data=ius2008, vars=c('age', 'edu', 'leisure'))
    rapport('KMeansCluster.tpl', data=mtcars, vars=c('drat', 'cyl', 'wt', 'mpg'))
    rapport('KMeansCluster.tpl', data=mtcars, vars=c('drat', 'cyl', 'wt', 'mpg'), clust_num=7)
  • Input parameters
    KMeansCluster.tpl template has the following options to be (optionally) set while calling rapport:
    • vars: Input variables
      These are the input variables which will be used here
      Class: numeric

    • clust_num: Number of Clusters
      One can set the Number of the Clusters for the K-Mean Clustering
      Class: integer

  • ###############################################################################################
    ##	 Running: rapport('KMeansCluster.tpl', data=ius2008, vars=c('age', 'edu', 'leisure'))
    ###############################################################################################
    
    ## Introduction
    
    
    [K-means Clustering](http://en.wikipedia.org/wiki/K-means_clustering) is a specific and one of the most widespread method of [clustering](http://en.wikipedia.org/wiki/Cluster_analysis). With clustering we want to divide our data into groups, which in the objects are similar to each other. K-means clustering is specified in the way, here we set the number of groups we want to make. In our case we will take into account the following variables: _Age_, _Internet usage for educational purposes (hours per day)_ and _Internet usage in leisure time (hours per day)_, to find out which observations are the nearest to each other.
    
    
    ## References
    
    
    J. B. MacQueen (1967). _"Some Methods for classification and Analysis of Multivariate Observations, Proceedings of 5-th Berkeley Symposium on Mathematical Statistics and Probability"_. 1:281-297
    
    
    ## Determining the number of clusters
    
    
    As it was mentioned above, the speciality of the K-means Cluster method is to set the number of groups we want to produce. 
    
    
    
    
    Let's see how to decide which is the ideal number of them!
    
    [![](plots/rapport-KMeansCluster.tpl-46-1.png)](plots/rapport-KMeansCluster.tpl-46-1-hires.png)
    
    
    We can figure out that, as we see how much the Within groups sum of squares decreases if we set a higher number of the groups. So the smaller the difference the smaller the gain we can do with increasing the number of the clusters (thus in this case the larger decreasing means the bigger gain).
    
    
    The ideal number of clusters seems to be _2_.
    
    
    
    
    
    ## Cluster means
    
    
    The method of the K-means clustering starts with the step to set k number of centorids which could be the center of the groups we want to form. After that there comes several iterations, meanwhile the ideal centers are being calculated.
    
    The centroids are the observations which are the nearest in average to all the other observations of their group. But it could be also interesting which are the typical values of the clusters! One way to figure out these typical values is to see the group means. The _2_ cluster averages are:
    
    
    -----------------------------------------
             &nbsp;   age     edu    leisure 
    --------------- ------- ------- ---------
      **1.Cluster** 0.7716   1.298   0.5813  
    
      **2.Cluster** -0.2448 -0.4118  -0.1844 
    -----------------------------------------
    
    
    
    The size of the above clusters are: _145_ and _457_.
    
    
    ## Results
    
    
    On the chart below we can see the produced groups. To distinct which observation is related to which cluster each of the objects from the same groups have the same figure and there is a circle which shows the border of the clusters.
    
    [![](plots/rapport-KMeansCluster.tpl-46-2.png)](plots/rapport-KMeansCluster.tpl-46-2-hires.png)
    
    
    #################################################################################################
    ##	 Running: rapport('KMeansCluster.tpl', data=mtcars, vars=c('drat', 'cyl', 'wt', 'mpg'))
    #################################################################################################
    
    ## Introduction
    
    
    [K-means Clustering](http://en.wikipedia.org/wiki/K-means_clustering) is a specific and one of the most widespread method of [clustering](http://en.wikipedia.org/wiki/Cluster_analysis). With clustering we want to divide our data into groups, which in the objects are similar to each other. K-means clustering is specified in the way, here we set the number of groups we want to make. In our case we will take into account the following variables: _drat_, _cyl_, _wt_ and _mpg_, to find out which observations are the nearest to each other.
    
    
    ## References
    
    
    J. B. MacQueen (1967). _"Some Methods for classification and Analysis of Multivariate Observations, Proceedings of 5-th Berkeley Symposium on Mathematical Statistics and Probability"_. 1:281-297
    
    
    ## Determining the number of clusters
    
    
    As it was mentioned above, the speciality of the K-means Cluster method is to set the number of groups we want to produce. 
    
    
    
    
    Let's see how to decide which is the ideal number of them!
    
    [![](plots/rapport-KMeansCluster.tpl-47-1.png)](plots/rapport-KMeansCluster.tpl-47-1-hires.png)
    
    
    We can figure out that, as we see how much the Within groups sum of squares decreases if we set a higher number of the groups. So the smaller the difference the smaller the gain we can do with increasing the number of the clusters (thus in this case the larger decreasing means the bigger gain).
    
    
    The ideal number of clusters seems to be _2_.
    
    
    
    
    
    ## Cluster means
    
    
    The method of the K-means clustering starts with the step to set k number of centorids which could be the center of the groups we want to form. After that there comes several iterations, meanwhile the ideal centers are being calculated.
    
    The centroids are the observations which are the nearest in average to all the other observations of their group. But it could be also interesting which are the typical values of the clusters! One way to figure out these typical values is to see the group means. The _2_ cluster averages are:
    
    
    ----------------------------------------------
             &nbsp;  drat    cyl     wt      mpg  
    --------------- ------- ------ ------- -------
      **1.Cluster**  0.838  -1.053 -0.8794  0.946 
    
      **2.Cluster** -0.4904 0.6649 0.3078  -0.5118
    
      **3.Cluster** -1.016  1.015   2.169   -1.37 
    ----------------------------------------------
    
    
    
    The size of the above clusters are: _13_, _16_ and _3_.
    
    
    ## Results
    
    
    On the chart below we can see the produced groups. To distinct which observation is related to which cluster each of the objects from the same groups have the same figure and there is a circle which shows the border of the clusters.
    
    [![](plots/rapport-KMeansCluster.tpl-47-2.png)](plots/rapport-KMeansCluster.tpl-47-2-hires.png)
    
    
    ##############################################################################################################
    ##	 Running: rapport('KMeansCluster.tpl', data=mtcars, vars=c('drat', 'cyl', 'wt', 'mpg'), clust_num=7)
    ##############################################################################################################
    
    ## Introduction
    
    
    [K-means Clustering](http://en.wikipedia.org/wiki/K-means_clustering) is a specific and one of the most widespread method of [clustering](http://en.wikipedia.org/wiki/Cluster_analysis). With clustering we want to divide our data into groups, which in the objects are similar to each other. K-means clustering is specified in the way, here we set the number of groups we want to make. In our case we will take into account the following variables: _drat_, _cyl_, _wt_ and _mpg_, to find out which observations are the nearest to each other.
    
    
    ## References
    
    
    J. B. MacQueen (1967). _"Some Methods for classification and Analysis of Multivariate Observations, Proceedings of 5-th Berkeley Symposium on Mathematical Statistics and Probability"_. 1:281-297
    
    
    ## Determining the number of clusters
    
    
    As it was mentioned above, the speciality of the K-means Cluster method is to set the number of groups we want to produce. 
    
    
    
    As you set, there will be a _7_-means cluster analysis provided.
    
    
    ## Cluster means
    
    
    The method of the K-means clustering starts with the step to set k number of centorids which could be the center of the groups we want to form. After that there comes several iterations, meanwhile the ideal centers are being calculated.
    
    The centroids are the observations which are the nearest in average to all the other observations of their group. But it could be also interesting which are the typical values of the clusters! One way to figure out these typical values is to see the group means. The  cluster averages are:
    
    
    ------------------------------------------------
             &nbsp;  drat     cyl     wt      mpg   
    --------------- ------- ------- ------- --------
      **1.Cluster** -0.7735  1.015  0.4729  -0.6771 
    
      **2.Cluster** -0.6856  1.015   2.175  -0.8944 
    
      **3.Cluster**  1.225  -1.225  -1.374   1.655  
    
      **4.Cluster** 0.5925  0.08166 -0.1684 -0.1671 
    
      **5.Cluster** -1.265  -0.105  0.1229  -0.05652
    
      **6.Cluster** 0.4815  -1.225  -0.4459  0.413  
    
      **7.Cluster** -1.181   1.015   2.166   -1.608 
    ------------------------------------------------
    
    
    
    The size of the above clusters are: _10_, _1_, _6_, _6_, _2_, _5_ and _2_.
    
    
    ## Results
    
    
    On the chart below we can see the produced groups. To distinct which observation is related to which cluster each of the objects from the same groups have the same figure and there is a circle which shows the border of the clusters.
    
    [![](plots/rapport-KMeansCluster.tpl-48-1.png)](plots/rapport-KMeansCluster.tpl-48-1-hires.png)
  • Exports of example
    pandoc and the other backends provide a great number of formats to export your templates.
    We have generated the following types for demonstartion purposes based on the example command(s):
    rapport('KMeansCluster.tpl', data=ius2008, vars=c('age', 'edu', 'leisure'))
    rapport('KMeansCluster.tpl', data=mtcars, vars=c('drat', 'cyl', 'wt', 'mpg'))
    rapport('KMeansCluster.tpl', data=mtcars, vars=c('drat', 'cyl', 'wt', 'mpg'), clust_num=7)
    Popular formats:
    Other morkdown formats
  • Source of `KMeansCluster.tpl`
    <!--head
    meta:
      title: K-means cluster
      author: Gergely Daróczi, Daniel Nagy
      description: K-means clustering with automatically estimated number of clusters
      email: ~
      packages:
      - cluster
      - fpc
      example:
      - rapport('KMeansCluster.tpl', data=ius2008, vars=c('age', 'edu', 'leisure'))
      - rapport('KMeansCluster.tpl', data=mtcars, vars=c('drat', 'cyl', 'wt', 'mpg'))
      - rapport('KMeansCluster.tpl', data=mtcars, vars=c('drat', 'cyl', 'wt', 'mpg'), clust_num=7)
    inputs:
    - name: vars
      label: Input variables
      description: These are the input variables which will be used here
      class: numeric
      length:
        min: 1.0
        max: 1000.0
      required: yes
      standalone: no
    - name: clust_num
      class: integer
      label: Number of Clusters
      description: One can set the Number of the Clusters for the K-Mean Clustering
      standalone: yes
      value: ~
      length:
        min: 1.0
        max: 1.0
      limit:
        min: 2.0
        max: 9999.0
      required: no
    head-->
    
    <%=
    evalsOptions('graph.unify', FALSE)
    %>
    
    <%=
    vars <- na.omit(vars)
    varsScaled <- scale(vars)
    maxclust <- nrow(varsScaled) - 1
    %>
    
    ## Introduction
    
    [K-means Clustering](http://en.wikipedia.org/wiki/K-means_clustering) is a specific and one of the most widespread method of [clustering](http://en.wikipedia.org/wiki/Cluster_analysis). With clustering we want to divide our data into groups, which in the objects are similar to each other. K-means clustering is specified in the way, here we set the number of groups we want to make. In our case we will take into account the following variables: <%=vars.label%>, to find out which observations are the nearest to each other.
    
    ## References
    
    J. B. MacQueen (1967). _"Some Methods for classification and Analysis of Multivariate Observations, Proceedings of 5-th Berkeley Symposium on Mathematical Statistics and Probability"_. 1:281-297
    
    ## Determining the number of clusters
    
    As it was mentioned above, the speciality of the K-means Cluster method is to set the number of groups we want to produce. 
    
    <% if (exists('clust_num') && !is.null(clust_num) && clust_num > 0) { %>
    <% if (clust_num >= maxclust) { %>
    The number of the clusters cannot be equal or higher than the unique cases (<%=maxclust + 1%>), which you set (<%=clust_num%>), was <%=ifelse(maxclust +1 == clust_num, "equal", "higher")%>. In the following, we will use two clusters.
    <%= clust_num <- 2 %>
    <% } else { %>
    As you set, there will be a <%=clust_num%>-means cluster analysis provided.
    <% } %>
    <%= 
    cn <- tryCatch(pam(vars, clust_num), error = function(e) e)
    fit <- kmeans(varsScaled, clust_num)
    %>
    
    <% } else { %>
    <%=
    cn <- tryCatch(pamk(vars), error = function(e) e)
    fit <- kmeans(varsScaled, cn$nc)
    cn <- cn$pamobject
    %>
    
    Let's see how to decide which is the ideal number of them!
    
    <%=
    wss <- (nrow(varsScaled) - 1) * sum(apply(varsScaled, 2, var))
    for (i in 2:15) {
        wss[i] <- sum(kmeans(varsScaled, centers = i)$withinss)
    }
    plot(1:15, wss, type = "b", xlab = "Number of Clusters",  ylab = "Within groups sum of squares")
    cn <- tryCatch(pamk(varsScaled), error = function(e) e)
    %>
    
    We can figure out that, as we see how much the Within groups sum of squares decreases if we set a higher number of the groups. So the smaller the difference the smaller the gain we can do with increasing the number of the clusters (thus in this case the larger decreasing means the bigger gain).
    
    <% if (inherits(cn, 'error')) { %>
    <%=
    nc <-  sample(2:5, 1)
    cn <- list(pamobject = pam(varsScaled, nc), nc = nc)
    stop(paste0('Unable to identify the ideal number of clusters, using a random number between 2 and 5: ', cn$nc))
    %>
    <% } else { %>
    The ideal number of clusters seems to be <%=cn$nc%>.
    <% } %>
    
    <% } %>
    
    ## Cluster means
    
    The method of the K-means clustering starts with the step to set k number of centorids which could be the center of the groups we want to form. After that there comes several iterations, meanwhile the ideal centers are being calculated.
    
    The centroids are the observations which are the nearest in average to all the other observations of their group. But it could be also interesting which are the typical values of the clusters! One way to figure out these typical values is to see the group means. The <%=cn$nc%> cluster averages are:
    
    <%=
    res <- fit$centers
    row.names(res) <- paste0(1:nrow(res), '.', 'Cluster')
    set.alignment(rep('centre', ncol(res)), 'right')
    res
    %>
    
    The size of the above clusters are: <%=fit$size%>.
    
    ## Results
    
    On the chart below we can see the produced groups. To distinct which observation is related to which cluster each of the objects from the same groups have the same figure and there is a circle which shows the border of the clusters.
    
    <%=
    if (exists('clust_num') && !is.null(clust_num) && clust_num > 0) {
    cn <- tryCatch(pam(vars, clust_num), error = function(e) e)
    fit <- kmeans(varsScaled, clust_num)
    } else {
    cn <- tryCatch(pamk(vars), error = function(e) e)
    fit <- kmeans(varsScaled, cn$nc)
    cn <- cn$pamobject
    }
    if (ncol(res) > 1) {
    clusplot(cn, fit$cluster, color = TRUE, shade = TRUE, labels = ifelse(nrow(vars) < 100, 2, 4), lines = 1, main = '', col.p = 'black', col.clus = panderOptions('graph.colors'))
    } else {
        warning('Only one variable provided, so there is no sense drawing a 2D plot here.')
    }
    %>

KolmogorovSmirnovTest.tpl

  • Kolmogorov-Smirnov-test
    by Gergely Daróczi, Daniel Nagy

    This template will run a Kolmogorov-Smirnov-test

    Required packages: nortest
    Example:
    rapport('KolmogorovSmirnovTest', data=ius2008, xvar='edu', yvar='age')
    rapport('KolmogorovSmirnovTest', data=mtcars, xvar='cyl', yvar='carb')
  • Input parameters
    KolmogorovSmirnovTest.tpl template has the following options to be (optionally) set while calling rapport:
    • xvar: X Variable
      This is the X variable which will be used here
      Class: numeric

    • yvar: Y Variable
      This is the Y variable which will be used here
      Class: numeric

  • ##########################################################################################
    ##	 Running: rapport('KolmogorovSmirnovTest', data=ius2008, xvar='edu', yvar='age')
    ##########################################################################################
    
    ## Introduction
    
    
    [Kolmogorov-Smirnov test](http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test) is one of the most widely used [nonparametric tests](http://en.wikipedia.org/wiki/Non-parametric_statistics). With the help of that in this case we use to check if two continuous variables had the same distribution. We do not test that here, but there is a possibility to use that in the way to check if a sample/variable followed an expected distribution.
    
    
    ## Distributions
    
    
    Before we use the K-S test to look at the possible statistical differences, it could be useful to see visually the distributions we want to observe. Below lie the [Cumulative Distribution Functions](http://en.wikipedia.org/wiki/Cumulative_distribution_function) of the variables we compared:
    
    [![](plots/rapport-KolmogorovSmirnovTest-31-1.png)](plots/rapport-KolmogorovSmirnovTest-31-1-hires.png)
    [![](plots/rapport-KolmogorovSmirnovTest-31-2.png)](plots/rapport-KolmogorovSmirnovTest-31-2-hires.png)
    
    ## Test results
    
    
    Now we will test if the Internet usage for educational purposes (hours per day) and the Age had statistically the same distribution.
    
    
    ---------------------------------------------------
     Test statistic   P value   Alternative hypothesis 
    ---------------- --------- ------------------------
           1         _0_ * * *        two-sided        
    ---------------------------------------------------
    
    Table: Two-sample Kolmogorov-Smirnov test on Internet usage for educational purposes (hours per day) and Age
    
    
    
    
    The requirements of the Kolmogorov-Smirnov Test test was not met, the approximation may be incorrect.
    
    So the variables do not follow the same distribution, according to the Kolmogorov-Smirnov test statistic.
    
    
    ##########################################################################################
    ##	 Running: rapport('KolmogorovSmirnovTest', data=mtcars, xvar='cyl', yvar='carb')
    ##########################################################################################
    
    ## Introduction
    
    
    [Kolmogorov-Smirnov test](http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test) is one of the most widely used [nonparametric tests](http://en.wikipedia.org/wiki/Non-parametric_statistics). With the help of that in this case we use to check if two continuous variables had the same distribution. We do not test that here, but there is a possibility to use that in the way to check if a sample/variable followed an expected distribution.
    
    
    ## Distributions
    
    
    Before we use the K-S test to look at the possible statistical differences, it could be useful to see visually the distributions we want to observe. Below lie the [Cumulative Distribution Functions](http://en.wikipedia.org/wiki/Cumulative_distribution_function) of the variables we compared:
    
    [![](plots/rapport-KolmogorovSmirnovTest-32-1.png)](plots/rapport-KolmogorovSmirnovTest-32-1-hires.png)
    [![](plots/rapport-KolmogorovSmirnovTest-32-2.png)](plots/rapport-KolmogorovSmirnovTest-32-2-hires.png)
    
    ## Test results
    
    
    Now we will test if the cyl and the carb had statistically the same distribution.
    
    
    -----------------------------------------------------------
     Test statistic       P value       Alternative hypothesis 
    ---------------- ----------------- ------------------------
         0.625       _7.453e-06_ * * *        two-sided        
    -----------------------------------------------------------
    
    Table: Two-sample Kolmogorov-Smirnov test on cyl and carb
    
    
    
    
    The requirements of the Kolmogorov-Smirnov Test test was not met, the approximation may be incorrect.
    
    So the variables do not follow the same distribution, according to the Kolmogorov-Smirnov test statistic.
  • Exports of example
    pandoc and the other backends provide a great number of formats to export your templates.
    We have generated the following types for demonstartion purposes based on the example command(s):
    rapport('KolmogorovSmirnovTest', data=ius2008, xvar='edu', yvar='age')
    rapport('KolmogorovSmirnovTest', data=mtcars, xvar='cyl', yvar='carb')
    Popular formats:
    Other morkdown formats
  • Source of `KolmogorovSmirnovTest.tpl`
    <!--head
    meta:
      title: Kolmogorov-Smirnov-test
      author: Gergely Daróczi, Daniel Nagy
      description: This template will run a Kolmogorov-Smirnov-test
      email: ~
      packages:
      - nortest
      example:
      - rapport('KolmogorovSmirnovTest', data=ius2008, xvar='edu', yvar='age')
      - rapport('KolmogorovSmirnovTest', data=mtcars, xvar='cyl', yvar='carb')
    inputs:
    - name: xvar
      label: X Variable
      description: This is the X variable which will be used here
      class: numeric
      length:
        min: 1.0
        max: 1.0
      required: yes
      standalone: no
    - name: yvar
      label: Y Variable
      description: This is the Y variable which will be used here
      class: numeric
      length:
        min: 1.0
        max: 1.0
      required: yes
      standalone: no
    head-->
    
    
    ## Introduction
    
    [Kolmogorov-Smirnov test](http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test) is one of the most widely used [nonparametric tests](http://en.wikipedia.org/wiki/Non-parametric_statistics). With the help of that in this case we use to check if two continuous variables had the same distribution. We do not test that here, but there is a possibility to use that in the way to check if a sample/variable followed an expected distribution.
    
    ## Distributions
    
    Before we use the K-S test to look at the possible statistical differences, it could be useful to see visually the distributions we want to observe. Below lie the [Cumulative Distribution Functions](http://en.wikipedia.org/wiki/Cumulative_distribution_function) of the variables we compared:
    
    <%=
    plot(ecdf(xvar), main = sprintf('Cumulative Distribution of %s', xvar.label))
    
    
    plot(ecdf(yvar), main = sprintf('Cumulative Distribution of %s', xvar.label))
    %>
    
    ## Test results
    
    Now we will test if the <%=xvar.label%> and the <%=yvar.label%> had statistically the same distribution.
    
    <%=
    set.caption(sprintf('Two-sample Kolmogorov-Smirnov test on %s and %s', xvar.label, yvar.label))
    kstest <- suppressWarnings(ks.test(xvar, yvar))
    kstest
    ksp <- kstest$'p.value'
    p <- 0.05
    %>
    
    <%if (inherits(tryCatch(ks.test(xvar, yvar), warning = function(w) w), 'warning')) { %>
    The requirements of the Kolmogorov-Smirnov Test test was not met, the approximation may be incorrect.
    <% } %>
    <%=ifelse(ksp > p,'So the nullhypothesis, that the variables follow the same distribution was not rejected. Here the stars represent the [significance levels](http://en.wikipedia.org/wiki/Statistical_significance) of the Kolmogorov-Smirnov test coefficients: one star for `0.05`, two for `0.01` and three  for `0.001`.','So the variables do not follow the same distribution, according to the Kolmogorov-Smirnov test statistic.')%>

KruskalWallisTest.tpl

  • Kruskal Wallis test
    by Daniel Nagy

    In this template Rapporter will present you Kruskal Wallis test.

    Required packages:
    Example:
    rapport('KruskalWallisTest.tpl', data=ius2008, var1='age', var2='edu')
    rapport('KruskalWallisTest.tpl', data=mtcars, var1='mpg', var2='drat')
  • Input parameters
    KruskalWallisTest.tpl template has the following options to be (optionally) set while calling rapport:
    • var1: First Used Variable
      This is the first variable which will be used here
      Class: numeric

    • var2: Second Used Variable
      This is the second variable which will be used here
      Class: numeric

  • ##########################################################################################
    ##	 Running: rapport('KruskalWallisTest.tpl', data=ius2008, var1='age', var2='edu')
    ##########################################################################################
    
    # Introduction
    
    
    [Kruskal-Wallis test](http://en.wikipedia.org/wiki/Kruskal-Wallis) is a non-parametric statistical test that assesses hypothesis of equality of two independent sample's/variabels' variances. Most of the time it's being used beacuse the normality assumptions didn't meet for the samples/variables, but we need the assumption of the equal variances, so it can be an alternative of the Two-sample t-test.
    Significant result means difference between the samples/variables.
    
    
    
    ----------------------------------------
     Test statistic   df       P value      
    ---------------- ---- ------------------
          1010        1   _1.056e-221_ * * *
    ----------------------------------------
    
    Table: Kruskal-Wallis test for _Age_ and _Internet usage for educational purposes (hours per day)_
    
    
    
    As you can see in the table the test's degrees of freedom is _1_, the joint test-statistic is _1010_, so the p-value of the Kruskal-Wallis test is _1.056e-221_. Thus we can reject the assumption of the equal variances.
    
    
    
    
    
    
    
    
    
    ##########################################################################################
    ##	 Running: rapport('KruskalWallisTest.tpl', data=mtcars, var1='mpg', var2='drat')
    ##########################################################################################
    
    # Introduction
    
    
    [Kruskal-Wallis test](http://en.wikipedia.org/wiki/Kruskal-Wallis) is a non-parametric statistical test that assesses hypothesis of equality of two independent sample's/variabels' variances. Most of the time it's being used beacuse the normality assumptions didn't meet for the samples/variables, but we need the assumption of the equal variances, so it can be an alternative of the Two-sample t-test.
    Significant result means difference between the samples/variables.
    
    
    
    --------------------------------------
     Test statistic   df      P value     
    ---------------- ---- ----------------
         47.28        1   _6.14e-12_ * * *
    --------------------------------------
    
    Table: Kruskal-Wallis test for _mpg_ and _drat_
    
    
    
    As you can see in the table the test's degrees of freedom is _1_, the joint test-statistic is _47.28_, so the p-value of the Kruskal-Wallis test is _6.14e-12_. Thus we can reject the assumption of the equal variances.
  • Exports of example
    pandoc and the other backends provide a great number of formats to export your templates.
    We have generated the following types for demonstartion purposes based on the example command(s):
    rapport('KruskalWallisTest.tpl', data=ius2008, var1='age', var2='edu')
    rapport('KruskalWallisTest.tpl', data=mtcars, var1='mpg', var2='drat')
    Popular formats:
    Other morkdown formats
  • Source of `KruskalWallisTest.tpl`
    <!--head
    meta:
      title: Kruskal Wallis test
      author: Daniel Nagy
      description: In this template Rapporter will present you Kruskal Wallis test.
      email: ~
      packages: ~
      example:
      - rapport('KruskalWallisTest.tpl', data=ius2008, var1='age', var2='edu')
      - rapport('KruskalWallisTest.tpl', data=mtcars, var1='mpg', var2='drat')
    inputs:
    - name: var1
      label: First Used Variable
      description: This is the first variable which will be used here
      class: numeric
      length:
        min: 1.0
        max: 1.0
      required: yes
      standalone: no
    - name: var2
      label: Second Used Variable
      description: This is the second variable which will be used here
      class: numeric
      length:
        min: 1.0
        max: 1.0
      required: yes
      standalone: no
    head-->
    
    # Introduction
    
    [Kruskal-Wallis test](http://en.wikipedia.org/wiki/Kruskal-Wallis) is a non-parametric statistical test that assesses hypothesis of equality of two independent sample's/variabels' variances. Most of the time it's being used beacuse the normality assumptions didn't meet for the samples/variables, but we need the assumption of the equal variances, so it can be an alternative of the Two-sample t-test.
    Significant result means difference between the samples/variables.
    
    
    <%=
    set.caption(sprintf('Kruskal-Wallis test for %s and %s', p(var1.label), p(var2.label)))
    krusk <- kruskal.test(list(var1, var2))
    krusk
    p.v <- krusk$p.value
    sta <- krusk$statistic
    par <- krusk$parameter
    %>
    
    As you can see in the table the test's degrees of freedom is <%=par%>, the joint test-statistic is <%=sta%>, so the p-value of the Kruskal-Wallis test is <%=p.v%>. Thus we can <%= ifelse(p.v < 0.05,"reject", "accept")%> the assumption of the equal variances.

LinearRegression.tpl

  • Linear Regression
    by Gergely Daróczi, Daniel Nagy

    This template will run a Linear Regression

    Required packages: HH
    Example:
    rapport('LinearRegression.tpl', data=ius2008, dep='age', indep='edu', indep.inter=F)
    rapport('LinearRegression.tpl', data=ius2008, dep='age', indep='edu', indep.inter=T)
    rapport('LinearRegression.tpl', data=mtcars, dep='carb', indep='cyl', indep.inter=F)
  • Input parameters
    LinearRegression.tpl template has the following options to be (optionally) set while calling rapport:
    • dep: Dependent Variable
      This is the dependent variables which will be used here
      Class: numeric

    • indep: Independent Variable(s)
      These are the independent variables which will be used here
      Class: numeric

    • indep.inter: Interaction
      Should be calculated the interaction between the independent variables
      Class: logical

    • crPlots: crplot
      Plot checking linearity
      Class: logical
      Default value: TRUE

  • ########################################################################################################
    ##	 Running: rapport('LinearRegression.tpl', data=ius2008, dep='age', indep='edu', indep.inter=F)
    ########################################################################################################
    
    # Introduction
    
    
    With the help of the [linear regression](http://en.wikipedia.org/wiki/Linear_regression) we can investigate the relationship between two variables. More punctually we can observe if one of the variables, the so-called [dependent](http://en.wikipedia.org/wiki/Dependent_variable) variable, significantly depended on the other variable, if an increase/decrease on the dependent variable's values made an increase/decrease on the independent variable.
    In this case we only observe linear relationships. 
    
    
    # Overview
    
    Linear Regression was carried out, with _Internet usage for educational purposes (hours per day)_ as independent variable, and _Age_ as a dependent variable.
    
    
    
    
    
    # Assumptions
    
    
    
    In order to have reliable results, we have to check if the assumptions of the linear regression met with the data we used.
    
    
    ------------------------------------------
             &nbsp;           Value   p-value 
    ------------------------ ------- ---------
        **Global Stat**       875.9      0    
    
          **Skewness**        378.9      0    
    
          **Kurtosis**        490.7      0    
    
       **Link Function**     0.9379   0.3328  
    
     **Heteroscedasticity**   5.404   0.0201  
    ------------------------------------------
    
    Table: Table continues below
    
     
    ---------------------------------------------------
             &nbsp;                   Decision         
    ------------------------ --------------------------
        **Global Stat**      Assumptions NOT satisfied!
    
          **Skewness**       Assumptions NOT satisfied!
    
          **Kurtosis**       Assumptions NOT satisfied!
    
       **Link Function**      Assumptions acceptable.  
    
     **Heteroscedasticity**  Assumptions NOT satisfied!
    ---------------------------------------------------
    
    
    
    To check these assumptions the so-called GVLMA, the Global Validation of Linear Model Assumptions will help us. The result of that we can see in the table above.
    
    The GVLMA makes a thorough detection on the linear model, including tests generally about the fit, the shape of the distribution of the residuals ([skewness](http://en.wikipedia.org/wiki/Skewness) and [kurtosis](http://en.wikipedia.org/wiki/Kurtosis)), the linearity and the [homoskedasticity](http://en.wikipedia.org/wiki/Homoscedasticity). On the table we can see if our model met with the assumptions.
    As a generally accepted thumb-rule we use the critical [p-value](http://en.wikipedia.org/wiki/P-value)=0.05.
    
    So let's see the results, which the test gave us:
    
     - The general statistic tells us about the linear model, that it does not fit to our data.
    
     - According to the GVLMA the residuals of our model's skewness differs significantly from the normal distribution's skewness.
    
     - The residuals of our model's kurtosis differs significantly from the normal distribution's kurtosis, based on the result of the GVLMA.
    
     - In the row of the link function we can read that the linearity assumption of our model was accepted.
    
     - At last but not least GVLMA confirms the violation of homoscedasticity.
    
    In summary: We can 't be sure that the linear model used here fits to the data.
    
    References:
    
      * Pena, EA and Slate, EH (2006): Global validation of linear model assumptions. _J. Amer. Statist. Assoc._ **101** (473):341-354.
    
    
    
    
    ### Nonlinearity
    
    
    As we want to fit a linear regression model, it is advisable to see if the relationship between the used variables are linear indeed.
    Next to the test statistic of the GVLMA it is advisable to use a graphical device as well to check that linearity. Here we will use the so-called crPlots funtion to do that, which is an abbreviation of the Component + Residual Plot.
    
    [![](plots/rapport-LinearRegression.tpl-46-1.png)](plots/rapport-LinearRegression.tpl-46-1-hires.png)
    
    
    Here comes the question: What do we see on the plot?
    First of all we can see two lines and several circles. The red interrupted line is the best fitted linear line, which means that te square of the residuals are the least while fitting that line in the model. The green curved line is the best fitted line, which does not have to be straight, of all. The observations we investigate are the circles.
    We can talk about linearity if the green line did not lie too far from the red.
    
    
    Next to these options there is a possibility to have a glance on the so-called diagnostic plots, which on we can see the residuals in themselves and in standardized forms.
    [![](plots/rapport-LinearRegression.tpl-46-2.png)](plots/rapport-LinearRegression.tpl-46-2-hires.png)
    
    # Results
    
    
    After successfully checked the assumptions we can finally turn to the main part of the interest, the results of the Linear Regression Model.
    From the table we can read the variables which have significant effect on the dependent variable.
    
    
    --------------------------------------------------------------
         &nbsp;        Estimate   Std. Error   t value   Pr(>|t|) 
    ----------------- ---------- ------------ --------- ----------
         **edu**        0.7373      0.1307      5.643   2.521e-08 
    
     **(Intercept)**    23.07       0.3771      61.17   2.273e-268
    --------------------------------------------------------------
    
    Table: Fitting linear model: age based on _edu_
    
    
    --------------------------------------------------
     Number of Cases   Adjusted R Squared   AIC   BIC 
    ----------------- -------------------- ----- -----
           637              0.04625        4231  4244 
    --------------------------------------------------
    
    
    
    From the table one can see that _(Intercept) has significant effect on the dependent variable, the p-value of that is 0_ and _edu has significant effect on the dependent variable, the p-value of that is 0_
    
    The model does not fit  well, because the Adjusted R Square is _0.04625_.
    
    
    
    
    
    ########################################################################################################
    ##	 Running: rapport('LinearRegression.tpl', data=ius2008, dep='age', indep='edu', indep.inter=T)
    ########################################################################################################
    
    # Introduction
    
    
    With the help of the [linear regression](http://en.wikipedia.org/wiki/Linear_regression) we can investigate the relationship between two variables. More punctually we can observe if one of the variables, the so-called [dependent](http://en.wikipedia.org/wiki/Dependent_variable) variable, significantly depended on the other variable, if an increase/decrease on the dependent variable's values made an increase/decrease on the independent variable.
    In this case we only observe linear relationships. 
    
    
    # Overview
    
    Linear Regression was carried out, with _Internet usage for educational purposes (hours per day)_ as independent variable, and _Age_ as a dependent variable.
    
    
    
    
    
    # Assumptions
    
    
    
    In order to have reliable results, we have to check if the assumptions of the linear regression met with the data we used.
    
    
    ------------------------------------------
             &nbsp;           Value   p-value 
    ------------------------ ------- ---------
        **Global Stat**       875.9      0    
    
          **Skewness**        378.9      0    
    
          **Kurtosis**        490.7      0    
    
       **Link Function**     0.9379   0.3328  
    
     **Heteroscedasticity**   5.404   0.0201  
    ------------------------------------------
    
    Table: Table continues below
    
     
    ---------------------------------------------------
             &nbsp;                   Decision         
    ------------------------ --------------------------
        **Global Stat**      Assumptions NOT satisfied!
    
          **Skewness**       Assumptions NOT satisfied!
    
          **Kurtosis**       Assumptions NOT satisfied!
    
       **Link Function**      Assumptions acceptable.  
    
     **Heteroscedasticity**  Assumptions NOT satisfied!
    ---------------------------------------------------
    
    
    
    To check these assumptions the so-called GVLMA, the Global Validation of Linear Model Assumptions will help us. The result of that we can see in the table above.
    
    The GVLMA makes a thorough detection on the linear model, including tests generally about the fit, the shape of the distribution of the residuals ([skewness](http://en.wikipedia.org/wiki/Skewness) and [kurtosis](http://en.wikipedia.org/wiki/Kurtosis)), the linearity and the [homoskedasticity](http://en.wikipedia.org/wiki/Homoscedasticity). On the table we can see if our model met with the assumptions.
    As a generally accepted thumb-rule we use the critical [p-value](http://en.wikipedia.org/wiki/P-value)=0.05.
    
    So let's see the results, which the test gave us:
    
     - The general statistic tells us about the linear model, that it does not fit to our data.
    
     - According to the GVLMA the residuals of our model's skewness differs significantly from the normal distribution's skewness.
    
     - The residuals of our model's kurtosis differs significantly from the normal distribution's kurtosis, based on the result of the GVLMA.
    
     - In the row of the link function we can read that the linearity assumption of our model was accepted.
    
     - At last but not least GVLMA confirms the violation of homoscedasticity.
    
    In summary: We can 't be sure that the linear model used here fits to the data.
    
    References:
    
      * Pena, EA and Slate, EH (2006): Global validation of linear model assumptions. _J. Amer. Statist. Assoc._ **101** (473):341-354.
    
    
    
    
    ### Nonlinearity
    
    
    As we want to fit a linear regression model, it is advisable to see if the relationship between the used variables are linear indeed.
    Next to the test statistic of the GVLMA it is advisable to use a graphical device as well to check that linearity. Here we will use the so-called crPlots funtion to do that, which is an abbreviation of the Component + Residual Plot.
    
    [![](plots/rapport-LinearRegression.tpl-47-1.png)](plots/rapport-LinearRegression.tpl-47-1-hires.png)
    
    
    Here comes the question: What do we see on the plot?
    First of all we can see two lines and several circles. The red interrupted line is the best fitted linear line, which means that te square of the residuals are the least while fitting that line in the model. The green curved line is the best fitted line, which does not have to be straight, of all. The observations we investigate are the circles.
    We can talk about linearity if the green line did not lie too far from the red.
    
    
    Next to these options there is a possibility to have a glance on the so-called diagnostic plots, which on we can see the residuals in themselves and in standardized forms.
    [![](plots/rapport-LinearRegression.tpl-47-2.png)](plots/rapport-LinearRegression.tpl-47-2-hires.png)
    
    # Results
    
    
    After successfully checked the assumptions we can finally turn to the main part of the interest, the results of the Linear Regression Model.
    From the table we can read the variables and interactions which have significant effect on the dependent variable.
    
    
    --------------------------------------------------------------
         &nbsp;        Estimate   Std. Error   t value   Pr(>|t|) 
    ----------------- ---------- ------------ --------- ----------
         **edu**        0.7373      0.1307      5.643   2.521e-08 
    
     **(Intercept)**    23.07       0.3771      61.17   2.273e-268
    --------------------------------------------------------------
    
    Table: Fitting linear model: age based on _edu_
    
    
    --------------------------------------------------
     Number of Cases   Adjusted R Squared   AIC   BIC 
    ----------------- -------------------- ----- -----
           637              0.04625        4231  4244 
    --------------------------------------------------
    
    
    
    From the table one can see that _(Intercept) has significant effect on the dependent variable, the p-value of that is 0_ and _edu has significant effect on the dependent variable, the p-value of that is 0_
    
    The model does not fit  well, because the Adjusted R Square is _0.04625_.
    
    
    
    
    
    ########################################################################################################
    ##	 Running: rapport('LinearRegression.tpl', data=mtcars, dep='carb', indep='cyl', indep.inter=F)
    ########################################################################################################
    
    # Introduction
    
    
    With the help of the [linear regression](http://en.wikipedia.org/wiki/Linear_regression) we can investigate the relationship between two variables. More punctually we can observe if one of the variables, the so-called [dependent](http://en.wikipedia.org/wiki/Dependent_variable) variable, significantly depended on the other variable, if an increase/decrease on the dependent variable's values made an increase/decrease on the independent variable.
    In this case we only observe linear relationships. 
    
    
    # Overview
    
    Linear Regression was carried out, with _cyl_ as independent variable, and _carb_ as a dependent variable.
    
    
    
    
    
    # Assumptions
    
    
    
    In order to have reliable results, we have to check if the assumptions of the linear regression met with the data we used.
    
    
    ------------------------------------------
             &nbsp;           Value   p-value 
    ------------------------ ------- ---------
        **Global Stat**       19.91  0.0005211
    
          **Skewness**        7.299  0.006899 
    
          **Kurtosis**        5.033   0.02486 
    
       **Link Function**      2.45    0.1175  
    
     **Heteroscedasticity**   5.124   0.0236  
    ------------------------------------------
    
    Table: Table continues below
    
     
    ---------------------------------------------------
             &nbsp;                   Decision         
    ------------------------ --------------------------
        **Global Stat**      Assumptions NOT satisfied!
    
          **Skewness**       Assumptions NOT satisfied!
    
          **Kurtosis**       Assumptions NOT satisfied!
    
       **Link Function**      Assumptions acceptable.  
    
     **Heteroscedasticity**  Assumptions NOT satisfied!
    ---------------------------------------------------
    
    
    
    To check these assumptions the so-called GVLMA, the Global Validation of Linear Model Assumptions will help us. The result of that we can see in the table above.
    
    The GVLMA makes a thorough detection on the linear model, including tests generally about the fit, the shape of the distribution of the residuals ([skewness](http://en.wikipedia.org/wiki/Skewness) and [kurtosis](http://en.wikipedia.org/wiki/Kurtosis)), the linearity and the [homoskedasticity](http://en.wikipedia.org/wiki/Homoscedasticity). On the table we can see if our model met with the assumptions.
    As a generally accepted thumb-rule we use the critical [p-value](http://en.wikipedia.org/wiki/P-value)=0.05.
    
    So let's see the results, which the test gave us:
    
     - The general statistic tells us about the linear model, that it does not fit to our data.
    
     - According to the GVLMA the residuals of our model's skewness differs significantly from the normal distribution's skewness.
    
     - The residuals of our model's kurtosis differs significantly from the normal distribution's kurtosis, based on the result of the GVLMA.
    
     - In the row of the link function we can read that the linearity assumption of our model was accepted.
    
     - At last but not least GVLMA confirms the violation of homoscedasticity.
    
    In summary: We can 't be sure that the linear model used here fits to the data.
    
    References:
    
      * Pena, EA and Slate, EH (2006): Global validation of linear model assumptions. _J. Amer. Statist. Assoc._ **101** (473):341-354.
    
    
    
    
    ### Nonlinearity
    
    
    As we want to fit a linear regression model, it is advisable to see if the relationship between the used variables are linear indeed.
    Next to the test statistic of the GVLMA it is advisable to use a graphical device as well to check that linearity. Here we will use the so-called crPlots funtion to do that, which is an abbreviation of the Component + Residual Plot.
    
    [![](plots/rapport-LinearRegression.tpl-48-1.png)](plots/rapport-LinearRegression.tpl-48-1-hires.png)
    
    
    Here comes the question: What do we see on the plot?
    First of all we can see two lines and several circles. The red interrupted line is the best fitted linear line, which means that te square of the residuals are the least while fitting that line in the model. The green curved line is the best fitted line, which does not have to be straight, of all. The observations we investigate are the circles.
    We can talk about linearity if the green line did not lie too far from the red.
    
    
    Next to these options there is a possibility to have a glance on the so-called diagnostic plots, which on we can see the residuals in themselves and in standardized forms.
    [![](plots/rapport-LinearRegression.tpl-48-2.png)](plots/rapport-LinearRegression.tpl-48-2-hires.png)
    
    # Results
    
    
    After successfully checked the assumptions we can finally turn to the main part of the interest, the results of the Linear Regression Model.
    From the table we can read the variables which have significant effect on the dependent variable.
    
    
    --------------------------------------------------------------
         &nbsp;        Estimate   Std. Error   t value   Pr(>|t|) 
    ----------------- ---------- ------------ --------- ----------
         **cyl**        0.4766      0.1403      3.396    0.001942 
    
     **(Intercept)**   -0.1365      0.9027     -0.1513    0.8808  
    --------------------------------------------------------------
    
    Table: Fitting linear model: carb based on _cyl_
    
    
    --------------------------------------------------
     Number of Cases   Adjusted R Squared   AIC   BIC 
    ----------------- -------------------- ----- -----
           32                0.2536        116.1 120.5
    --------------------------------------------------
    
    
    
    From the table one can see that cyl has significant effect on the dependent variable, the p-value of that is 0.002
    
    The model fits  well, because the Adjusted R Square is _0.2536_.
  • Exports of example
    pandoc and the other backends provide a great number of formats to export your templates.
    We have generated the following types for demonstartion purposes based on the example command(s):
    rapport('LinearRegression.tpl', data=ius2008, dep='age', indep='edu', indep.inter=F)
    rapport('LinearRegression.tpl', data=ius2008, dep='age', indep='edu', indep.inter=T)
    rapport('LinearRegression.tpl', data=mtcars, dep='carb', indep='cyl', indep.inter=F)
    Popular formats:
    Other morkdown formats
  • Source of `LinearRegression.tpl`
    <!--head
    meta:
      title: Linear Regression
      author: Gergely Daróczi, Daniel Nagy
      description: This template will run a Linear Regression
      email: ~
      packages:
      - HH
      - car
      - nortest
      - gvlma
      example:
      - rapport('LinearRegression.tpl', data=ius2008, dep='age', indep='edu', indep.inter=F)
      - rapport('LinearRegression.tpl', data=ius2008, dep='age', indep='edu', indep.inter=T)
      - rapport('LinearRegression.tpl', data=mtcars, dep='carb', indep='cyl', indep.inter=F)
    inputs:
    - name: dep
      label: Dependent Variable
      description: This is the dependent variables which will be used here
      class: numeric
      length:
        min: 1.0
        max: 1.0
      required: yes
      standalone: no
    - name: indep
      label: Independent Variable(s)
      description: These are the independent variables which will be used here
      class: numeric
      length:
        min: 1.0
        max: 100.0
      required: yes
      standalone: no
    - name: indep.inter
      label: Interaction
      description: Should be calculated the interaction between the independent variables
      class: logical
      required: yes
      standalone: yes
    - name: crPlots
      label: crplot
      description: Plot checking linearity
      class: logical
      required: no
      value: yes
      standalone: yes
    head-->
    
    <%=
    d <- structure(na.omit(data.frame(dep, indep)), .Names = c(dep.name, indep.name))
    indep.int <- fml(dep.name, indep.name, join.right = "*")
    indep.nonint <- fml(dep.name, indep.name, join.right = "+")
    fit <- lm(ifelse(indep.inter, indep.int, indep.nonint), data = d)
    indep.plu <- switch(indep.ilen, '', 's')
    gvmodel <- tryCatch(gvlma(fit), error = function(e) e)
    %>
    
    # Introduction
    
    With the help of the [linear regression](http://en.wikipedia.org/wiki/Linear_regression) we can investigate the relationship <%=ifelse(indep.ilen == 1, 'between two variables', 'between the variables')%>. More punctually we can observe if one of the variables, the so-called [dependent](http://en.wikipedia.org/wiki/Dependent_variable) variable, significantly depended on the other variable<%=indep.plu%>, if an increase/decrease on the dependent variable's values made an increase/decrease on the independent variable<%=indep.plu%>.
    In this case we only observe linear relationships. <%=ifelse(indep.ilen == 1, '', 'As we use in the model more than 1 independent variables, we call the method [multivariate regression](http://en.wikipedia.org/wiki/Multivariate_regression_model).')%>
    
    # Overview
    
    <%= ifelse(indep.ilen == 1, '', 'Multivariate-') %>Linear Regression was carried out, with <%= p(indep.label) %> as independent variable<%= indep.plu %>, and <%= p(dep.label) %> as a dependent variable.
    
    <% if (ncol(as.matrix(indep)) > 1) { %>
    The [interaction](http://en.wikipedia.org/wiki/Interaction) between the independent variables was<%=ifelse(indep.inter, "", "n't")%> taken into account.
    <% } %>
    
    
    # Assumptions
    
    <% if (!inherits(gvmodel, 'error')) { %>
    In order to have reliable results, we have to check if the assumptions of the linear regression met with the data we used.
    
    <%=
    (summary(gvmodel))
    decision <- (gvmodel$Decision)
    decision <- summary(gvmodel)[, 3]
    decision1 <- (decision[1] == 'Assumptions NOT satisfied!')
    decision2 <- (decision[2] == 'Assumptions NOT satisfied!')
    decision3 <- (decision[3] == 'Assumptions NOT satisfied!')
    decision4 <- (decision[4] == 'Assumptions NOT satisfied!')
    decision5 <- (decision[5] == 'Assumptions NOT satisfied!')
    decision.any <- (any(decision == 'Assumptions NOT satisfied!'))
    %>
    
    To check these assumptions the so-called GVLMA, the Global Validation of Linear Model Assumptions will help us. The result of that we can see in the table above.
    
    The GVLMA makes a thorough detection on the linear model, including tests generally about the fit, the shape of the distribution of the residuals ([skewness](http://en.wikipedia.org/wiki/Skewness) and [kurtosis](http://en.wikipedia.org/wiki/Kurtosis)), the linearity and the [homoskedasticity](http://en.wikipedia.org/wiki/Homoscedasticity). On the table we can see if our model met with the assumptions.
    As a generally accepted thumb-rule we use the critical [p-value](http://en.wikipedia.org/wiki/P-value)=0.05.
    
    So let's see the results, which the test gave us:
    
     - The general statistic tells us about the linear model, that it <%= ifelse(decision1, "does not fit to our data", "can fit to our data")%>.
    
     - According to the GVLMA the residuals of our model's skewness<%= ifelse(decision2, "", " does not")%> differs significantly from the normal distribution's skewness.
    
     - The residuals of our model's kurtosis<%= ifelse(decision3, "", " does not")%> differs significantly from the normal distribution's kurtosis, based on the result of the GVLMA.
    
     - In the row of the link function we can read that the linearity assumption of our model was <%=ifelse(decision4, "rejected", "accepted")%>.
    
     - At last but not least GVLMA confirms<%= ifelse(decision5, " the violation of", "") %> homoscedasticity.
    
    In summary: We can<%=ifelse(decision.any, " 't", "")%> be sure that the linear model used here fits to the data.
    
    References:
    
      * Pena, EA and Slate, EH (2006): Global validation of linear model assumptions. _J. Amer. Statist. Assoc._ **101** (473):341-354.
    
    <% } else { %>
    
    We could not test the assumptions as the following R error occured: <%=gvmodel$message%>
    
    <% } %>
    
    ### Nonlinearity
    
    As we want to fit a linear regression model, it is advisable to see if the relationship between the used variables are linear indeed.
    Next to the test statistic of the GVLMA it is advisable to use a graphical device as well to check that linearity. Here we will use the so-called crPlots funtion to do that, which is an abbreviation of the Component + Residual Plot.
    
    <%=
    fit.nonint <- lm(indep.nonint, data = d)
    suppressWarnings(crPlots(fit.nonint))
    %>
    
    Here comes the question: What do we see on the plot?
    First of all we can see two lines and several circles. The red interrupted line is the best fitted linear line, which means that te square of the residuals are the least while fitting that line in the model. The green curved line is the best fitted line, which does not have to be straight, of all. The observations we investigate are the circles.
    We can talk about linearity if the green line did not lie too far from the red.
    <%=ifelse(indep.ilen==1,'','The other plots we see are simple [boxplots](http://en.wikipedia.org/wiki/Box_plot).')%>
    
    Next to these options there is a possibility to have a glance on the so-called diagnostic plots, which on we can see the residuals in themselves and in standardized forms.
    <%=
    par(mfrow = c(2, 2))
    +plot(fit)
    %>
    
    <% if (ncol(d) > 2) { %>
    
    ### Multicollinearity
    
    <%=
    mcoll <- as.table(matrix(c(vif(fit), 1 / vif(fit)), ncol(d)-1, 2))
    colnames(mcoll) <- c("VIF", "Tolerance")
    rownames(mcoll) <- c(indep.name)
    mcoll 
    CNM <- as.table(kappa(fit))
    row.names(CNM) <- "The Condition Number of a Matrix"
    CNM
    %>
    <% } else { } %>
    
    # Results
    
    After successfully checked the assumptions we can finally turn to the main part of the interest, the results of the Linear Regression Model.
    From the table we can read the variables <%=ifelse(indep.inter,'and interactions ','')%>which have significant effect on the dependent variable.
    
    <%=
    set.caption(sprintf('Fitting linear model: %s based on %s', dep.name, p(indep.name)))
    fit
    gof <- as.table(c(nrow(d), summary(fit)$adj.r.squared, AIC(fit), BIC(fit)),1,4)
    rownames(gof) <- c("Number of Cases", "Adjusted R Squared", "AIC", "BIC")
    gof
    p_val <- summary(fit)$coefficients[, 4]
    %>
    
    From the table one can see that <%= paste(rownames(summary(fit)$coefficients)[which(p_val < 0.05)], round(p_val, 3)[which(p_val < 0.05)], sep = " has significant effect on the dependent variable, the p-value of that is ")%>
    
    The model <%= sprintf("%s %s well", ifelse(gof[2] > 0.2, "fits", "does not fit"), ifelse(gof[2] > 0.4, "very", "")) %>, because the Adjusted R Square is <%=gof[2]%>.

MDS.tpl

  • Multidimensional Scaling
    by Daniel Nagy

    In this template Rapporter will present you Multidimensional Scaling.

    Required packages:
    Example:
    rapport('MDS.tpl', data=ius2008, vars=c('age', 'edu', 'leisure'), max.dist.num=16)
    rapport('MDS.tpl', data=ius2008, vars=c('age', 'edu', 'leisure'), max.dist.num=17, min.dist.num=30)
    rapport('MDS.tpl', data=mtcars, vars=c('drat', 'cyl', 'mpg'), max.dist.num=17, min.dist.num=30)
  • Input parameters
    MDS.tpl template has the following options to be (optionally) set while calling rapport:
    • vars: Used Variables
      These are the variables which MDS use to differentiate between values
      Class: numeric

    • id: Labels of the cases
      On the plot, these names will be shown as the labels of the cases

    • dist.mat: Distance Matrix
      Is your data a Distance matrix?
      Class: logical
      Default value: FALSE

    • max.dist.num: Maximum Distance Number
      Number of pairs with the biggest differences
      Class: integer
      Default value: 10

    • min.dist.num: Minimum Distance Number
      Number of pairs with the smallest distances
      Class: integer
      Default value: 10

  • ######################################################################################################
    ##	 Running: rapport('MDS.tpl', data=ius2008, vars=c('age', 'edu', 'leisure'), max.dist.num=16)
    ######################################################################################################
    
    # Introduction
    
    
    [Multidimensional-scaling](http://en.wikipedia.org/wiki/Multidimensional_scaling) is a technique which gives us a visual representation about the distances between the observations. 
    
    
    # MDS
    
    
    Below you can see a plot, that presents you the distance between the observations, which was calculated based on _Age_, _Internet usage for educational purposes (hours per day)_ and _Internet usage in leisure time (hours per day)_.
    
    [![](plots/rapport-MDS.tpl-46-1.png)](plots/rapport-MDS.tpl-46-1-hires.png)
    
    ### What can be seen here?
    
    
    #### Outsiders
    
    84 differs the most from the others, and 8 seems to be the most "common" observation, which lie nearest to all other observations.
    
    
    #### Outsider Pairs
    
    _284_ and _84_ (8.02) are the "furthest", _280_ and _1_ (0) are the "nearest" to each other.
    
    
    #### In General
    
    
    Now let's see which observations can be said statistically far/similar to each other in general. The _16_ pairs with the biggest differences and the _10_ pairs with the smallest differences will be presented. In the brackets you can see the amount of the distances between two observations.
    
    
    
    There are _17_ observations which are the most similar, and equal in the same time, that is a higher number than the wanted _16_, thus will not be reported one-by-one. Set _17_ as parameter _max.dist.num_ to check the pairs if you are interested.
    
    
     
    
    
    There are _318_ observations which are the most similar and equal in the same time, that is a higher number than the wanted _10_, thus will not be reported one-by-one. Set _318_ as parameter _min.dist.num_ to check the pairs if you are interested.
    
    
    
    
    
    
    #######################################################################################################################
    ##	 Running: rapport('MDS.tpl', data=ius2008, vars=c('age', 'edu', 'leisure'), max.dist.num=17, min.dist.num=30)
    #######################################################################################################################
    
    # Introduction
    
    
    [Multidimensional-scaling](http://en.wikipedia.org/wiki/Multidimensional_scaling) is a technique which gives us a visual representation about the distances between the observations. 
    
    
    # MDS
    
    
    Below you can see a plot, that presents you the distance between the observations, which was calculated based on _Age_, _Internet usage for educational purposes (hours per day)_ and _Internet usage in leisure time (hours per day)_.
    
    [![](plots/rapport-MDS.tpl-47-1.png)](plots/rapport-MDS.tpl-47-1-hires.png)
    
    ### What can be seen here?
    
    
    #### Outsiders
    
    84 differs the most from the others, and 8 seems to be the most "common" observation, which lie nearest to all other observations.
    
    
    #### Outsider Pairs
    
    _284_ and _84_ (8.02) are the "furthest", _280_ and _1_ (0) are the "nearest" to each other.
    
    
    #### In General
    
    
    Now let's see which observations can be said statistically far/similar to each other in general. The _17_ pairs with the biggest differences and the _30_ pairs with the smallest differences will be presented. In the brackets you can see the amount of the distances between two observations.
    
    
    
    According to the used variables (_Age_, _Internet usage for educational purposes (hours per day)_ and _Internet usage in leisure time (hours per day)_) the _17_ furthest pair of observations are:
    
    
    
      * _284_ and _84_ (8.02)
      * _224_ and _84_ (7.87)
      * _230_ and _84_ (7.84)
      * _84_ and _68_ (7.81)
      * _463_ and _84_ (7.79)
      * _583_ and _84_ (7.79)
      * _582_ and _84_ (7.72)
      * _122_ and _84_ (7.72)
      * _460_ and _84_ (7.72)
      * _606_ and _84_ (7.7)
      * _607_ and _84_ (7.7)
      * _128_ and _84_ (7.69)
      * _253_ and _84_ (7.69)
      * _84_ and _41_ (7.69)
      * _269_ and _84_ (7.65)
      * _376_ and _84_ (7.63)
      * _506_ and _84_ (7.63)
    
    <!-- end of list -->
    
    
    
    
    
     
    
    
    There are _318_ observations which are the most similar and equal in the same time, that is a higher number than the wanted _30_, thus will not be reported one-by-one. Set _318_ as parameter _min.dist.num_ to check the pairs if you are interested.
    
    
    
    
    
    
    ###################################################################################################################
    ##	 Running: rapport('MDS.tpl', data=mtcars, vars=c('drat', 'cyl', 'mpg'), max.dist.num=17, min.dist.num=30)
    ###################################################################################################################
    
    # Introduction
    
    
    [Multidimensional-scaling](http://en.wikipedia.org/wiki/Multidimensional_scaling) is a technique which gives us a visual representation about the distances between the observations. 
    
    
    # MDS
    
    
    Below you can see a plot, that presents you the distance between the observations, which was calculated based on _drat_, _cyl_ and _mpg_.
    
    [![](plots/rapport-MDS.tpl-48-1.png)](plots/rapport-MDS.tpl-48-1-hires.png)
    
    ### What can be seen here?
    
    
    #### Outsiders
    
    Honda Civic differs the most from the others, and Ferrari Dino seems to be the most "common" observation, which lie nearest to all other observations.
    
    
    #### Outsider Pairs
    
    _Honda Civic_ and _Cadillac Fleetwood_ (5.48) are the "furthest", _Mazda RX4 Wag_ and _Mazda RX4_ (0) are the "nearest" to each other.
    
    
    #### In General
    
    
    Now let's see which observations can be said statistically far/similar to each other in general. The _17_ pairs with the biggest differences and the _30_ pairs with the smallest differences will be presented. In the brackets you can see the amount of the distances between two observations.
    
    
    
    According to the used variables (_drat_, _cyl_ and _mpg_) the _17_ furthest pair of observations are:
    
    
    
      * _Honda Civic_ and _Cadillac Fleetwood_ (5.48)
      * _Honda Civic_ and _Lincoln Continental_ (5.39)
      * _Dodge Challenger_ and _Honda Civic_ (5.25)
      * _Toyota Corolla_ and _Cadillac Fleetwood_ (5.1)
      * _Toyota Corolla_ and _Lincoln Continental_ (5.04)
      * _Honda Civic_ and _Merc 450SLC_ (4.85)
      * _Fiat 128_ and _Cadillac Fleetwood_ (4.79)
      * _Honda Civic_ and _Merc 450SE_ (4.74)
      * _Honda Civic_ and _Duster 360_ (4.74)
      * _AMC Javelin_ and _Honda Civic_ (4.74)
      * _Fiat 128_ and _Lincoln Continental_ (4.74)
      * _Honda Civic_ and _Chrysler Imperial_ (4.68)
      * _Honda Civic_ and _Valiant_ (4.68)
      * _Honda Civic_ and _Merc 450SL_ (4.67)
      * _Dodge Challenger_ and _Toyota Corolla_ (4.67)
      * _Pontiac Firebird_ and _Honda Civic_ (4.52)
      * _Honda Civic_ and _Hornet Sportabout_ (4.46)
    
    <!-- end of list -->
    
    
    
    
    
     
    
    
    According to the used variables (_drat_, _cyl_ and _mpg_) the _30_ nearest pair of observations are:
    
    
    
      * _Mazda RX4 Wag_ and _Mazda RX4_ (0)
      * _Chrysler Imperial_ and _Duster 360_ (0.08)
      * _Merc 230_ and _Datsun 710_ (0.13)
      * _Lincoln Continental_ and _Cadillac Fleetwood_ (0.13)
      * _Merc 450SL_ and _Merc 450SE_ (0.15)
      * _AMC Javelin_ and _Merc 450SLC_ (0.15)
      * _Pontiac Firebird_ and _Hornet Sportabout_ (0.15)
      * _AMC Javelin_ and _Chrysler Imperial_ (0.17)
      * _AMC Javelin_ and _Duster 360_ (0.19)
      * _Merc 450SLC_ and _Merc 450SE_ (0.2)
      * _Merc 280C_ and _Merc 280_ (0.23)
      * _AMC Javelin_ and _Merc 450SE_ (0.25)
      * _Merc 450SL_ and _Hornet Sportabout_ (0.28)
      * _Merc 280_ and _Mazda RX4_ (0.3)
      * _Merc 280_ and _Mazda RX4 Wag_ (0.3)
      * _Merc 450SLC_ and _Duster 360_ (0.3)
      * _Chrysler Imperial_ and _Merc 450SLC_ (0.31)
      * _Pontiac Firebird_ and _Merc 450SL_ (0.32)
      * _Merc 450SLC_ and _Merc 450SL_ (0.35)
      * _Toyota Corona_ and _Datsun 710_ (0.35)
      * _Toyota Corolla_ and _Fiat 128_ (0.36)
      * _AMC Javelin_ and _Merc 450SL_ (0.38)
      * _Merc 240D_ and _Datsun 710_ (0.4)
      * _Merc 450SE_ and _Hornet Sportabout_ (0.41)
      * _Chrysler Imperial_ and _Merc 450SE_ (0.41)
      * _Volvo 142E_ and _Merc 230_ (0.42)
      * _Merc 450SE_ and _Duster 360_ (0.44)
      * _Maserati Bora_ and _Camaro Z28_ (0.45)
      * _Toyota Corona_ and _Merc 230_ (0.46)
      * _Pontiac Firebird_ and _Merc 450SE_ (0.46)
    
    <!-- end of list -->
  • Exports of example
    pandoc and the other backends provide a great number of formats to export your templates.
    We have generated the following types for demonstartion purposes based on the example command(s):
    rapport('MDS.tpl', data=ius2008, vars=c('age', 'edu', 'leisure'), max.dist.num=16)
    rapport('MDS.tpl', data=ius2008, vars=c('age', 'edu', 'leisure'), max.dist.num=17, min.dist.num=30)
    rapport('MDS.tpl', data=mtcars, vars=c('drat', 'cyl', 'mpg'), max.dist.num=17, min.dist.num=30)
    Popular formats:
    Other morkdown formats
  • Source of `MDS.tpl`
    <!--head
    meta:
      title: Multidimensional Scaling
      author: Daniel Nagy
      description: In this template Rapporter will present you Multidimensional Scaling.
      email: ~
      packages: ~
      example:
      - rapport('MDS.tpl', data=ius2008, vars=c('age', 'edu', 'leisure'), max.dist.num=16)
      - rapport('MDS.tpl', data=ius2008, vars=c('age', 'edu', 'leisure'), max.dist.num=17, min.dist.num=30)
      - rapport('MDS.tpl', data=mtcars, vars=c('drat', 'cyl', 'mpg'), max.dist.num=17, min.dist.num=30)
    inputs:
    - name: vars
      label: Used Variables
      description: These are the variables which MDS use to differentiate between values
      class: numeric
      length:
        min: 2.0
        max: 100.0
      required: yes
      standalone: no
    - name: id
      label: Labels of the cases
      description: On the plot, these names will be shown as the labels of the cases
      class: ~
      required: no
      standalone: no
    - name: dist.mat
      label : Distance Matrix
      description: Is your data a Distance matrix?
      class: logical
      value: no
      required: no
      standalone: yes
    - name: max.dist.num
      label: Maximum Distance Number
      description: Number of pairs with the biggest differences
      class: integer
      length:
        min: 1.0
        max: 1.0
      limit:
        min: 1.0
        max: 1000.0
      value: 10
      required: no
      standalone: yes
    - name: min.dist.num
      label: Minimum Distance Number
      description: Number of pairs with the smallest distances
      class: integer
      length:
        min: 1.0
        max: 1.0
      limit:
        min: 1.0
        max: 1000.0
      value: 10
      required: no
      standalone: yes
    head-->
    
    
    # Introduction
    
    [Multidimensional-scaling](http://en.wikipedia.org/wiki/Multidimensional_scaling) is a technique which gives us a visual representation about the distances between the observations. 
    
    # MDS
    
    Below you can see a plot, that presents you the distance between the observations, which was calculated based on <%=rp.label(vars)%>.
    
    <%= if (exists('id') && !is.null(id)) {
    id <- as.character(id)
    id[which(is.na(id))] <- "noname"
    dd <- duplicated(id)
    if (any(dd)) {
    warning("Among labels some of them are duplicated. Possibly there is a better way to label, please consider other options. In this report duplications will be distinguished by following '_' and numbers after them.")
    }
    whichisduplicated <- apply(data.frame(need = names(table(id[dd]))), 1, function(i) which(id==i))
    if (class(whichisduplicated) != "list") {
      whichisduplicated <- list('1' = whichisduplicated) }
    lapply(whichisduplicated, function(i) id[i] <<- paste(id[i], 1:length(i), sep = "_") )
    rownames(vars) <- id
    }
    variables <- scale(na.omit(vars))
    if (!dist.mat) {
    d <- dist(variables) 
    } else { 
    d <- variables
    }
    try(fit <- cmdscale(d, eig = TRUE, k = 2))
    if (exists('fit') && !is.null(fit)) { 
    } else {
    d <- dist(variables)
    fit <- cmdscale(d, eig = TRUE, k = 2)
    }
    x <- fit$points[, 1]
    y <- fit$points[, 2]
    plot(x, y, xlab="Coordinate 1", ylab = "Coordinate 2", main = "Metric  MDS",   type = "n")
    +text(x, y, labels = row.names(variables), cex = 0.7, col = panderOptions("graph.colors")[4]) 
    distance <- as.matrix(d)
    maxind <- which(distance == max(distance), arr.ind = TRUE)
    minind <- which(distance == min(distance[distance != min(distance)]), arr.ind = TRUE)
    %>
    
    ### What can be seen here?
    
    #### Outsiders
    <%=
    furthest <- colnames(distance)[which(colSums(distance) == max(colSums(distance)))]
    nearest <- colnames(distance)[which(colSums(distance) == min(colSums(distance)))]%>
    
    <%=furthest %> differ<%= ifelse(length(furthest) > 1, "", "s")%> the most from the others, and <%=nearest%> seem<%= (ifelse (length(nearest) > 1, "", "s"))%> to be the most "common" observation<%= (ifelse (length(nearest) > 1, "s", ""))%>, which <%=ifelse(length(nearest) > 1, "lies", "lie") %> nearest to all other observations.
    
    <%=
    distance[upper.tri(distance, diag = T)] <- NA
    h <- NULL
    notneeded <- apply(data.frame(unique(as.vector(sort(distance[lower.tri(distance)], decreasing = T))[1:max.dist.num])), 1, function(i) h <<- rbind(h, which(distance == i, arr.ind = T)))
    j <- NULL
    notneeded <- apply(data.frame(unique(as.vector(sort(distance[lower.tri(distance)], decreasing = F))[1:min.dist.num])), 1, function(i) j <<- rbind(j, which(distance == i, arr.ind = T)))
    %>
    
    #### Outsider Pairs
    
    <%=paste0(p(c(rownames(distance)[h[1, 1]], colnames(distance)[h[1, 2]])), ' (', round(distance[h[1, 1], h[1, 2]], 2), ')')%> are the "furthest", <%=paste0(p(c(rownames(distance)[j[1, 1]], colnames(distance)[j[1, 2]])), ' (', round(distance[j[1, 1], j[1, 2]], 2), ')') %> are the "nearest" to each other.
    
    #### In General
    
    Now let's see which observations can be said statistically far/similar to each other in general. The <%=max.dist.num%> pairs with the biggest differences and the <%=min.dist.num%> pairs with the smallest differences will be presented. In the brackets you can see the amount of the distances between two observations.
    
    <%if (nrow(h) <= max.dist.num) { %>
    
    According to the used variables (<%=rp.label(vars)%>) the <%=max.dist.num%> furthest pair of observations are:
    
    <%=
    paste(pander.return(lapply(1:nrow(h), function(i) paste0(p(c(rownames(distance)[h[i, 1]], colnames(distance)[h[i, 2]])), ' (', round(distance[h[i, 1], h[i, 2]], 2), ')'))), collapse = '\n')
    %>
    
    <% } else { %>
    
    There are <%=nrow(h)%> observations which are the most similar, and equal in the same time, that is a higher number than the wanted <%=max.dist.num%>, thus will not be reported one-by-one. Set <%=nrow(h)%> as parameter _max.dist.num_ to check the pairs if you are interested.
    <%}%>
    
     
    <%if (nrow(j) <= min.dist.num) { %>
    
    According to the used variables (<%=rp.label(vars)%>) the <%=min.dist.num%> nearest pair of observations are:
    
    <%=
    paste(pander.return(lapply(1:nrow(j), function(i) paste0(p(c(rownames(distance)[j[i, 1]], colnames(distance)[j[i, 2]])), ' (', round(distance[j[i, 1], j[i, 2]], 2), ')'))), collapse = '\n')
    %>
    
    <% } else { %>
    
    There are <%=nrow(j)%> observations which are the most similar and equal in the same time, that is a higher number than the wanted <%=min.dist.num%>, thus will not be reported one-by-one. Set <%=nrow(j)%> as parameter _min.dist.num_ to check the pairs if you are interested.
    <%}%>

NormalityTest.tpl

  • Normality Tests
    by Aleksandar Blagotić

    Overview of several normality tests and diagnostic plots that can screen departures from normality.

    Required packages: nortest
    Example:
    rapport("NormalityTest", ius2008, var = "leisure")
    rapport("NormalityTest", ius2008, var = "leisure", nc.plot = FALSE)
    rapport("NormalityTest", ius2008, var = "leisure", qq.line = FALSE)
  • Input parameters
    NormalityTest.tpl template has the following options to be (optionally) set while calling rapport:
    • var: Test variables
      Variables to test for normality
      Class: numeric

    • nc.plot: Normal curve plot
      Plot normal curve?
      Class: logical
      Default value: TRUE

    • qq.line: Q-Q plot line
      Add line to Q-Q plot?
      Class: logical
      Default value: TRUE

  • ######################################################################
    ##	 Running: rapport("NormalityTest", ius2008, var = "leisure")
    ######################################################################
    
    # Introduction
    
    
    In statistics, _normality_ refers to an assumption that the distribution of a random variable follows _normal_ (_Gaussian_) distribution. Because of its bell-like shape, it's also known as the _"bell curve"_. The formula for _normal distribution_ is:
    
    $$f(x) = \frac{1}{\sqrt{2\pi{}\sigma{}^2}} e^{-\frac{(x-\mu{})^2}{2\sigma{}^2}}$$
    
    _Normal distribution_ belongs to a _location-scale family_ of distributions, as it's defined two parameters:
    
     - $\mu$ - _mean_ or _expectation_ (location parameter)
     - $\sigma^2$ - _variance_ (scale parameter)
    
    [![](plots/rapport-NormalityTest-46-1.png)](plots/rapport-NormalityTest-46-1-hires.png)
    
    # Normality Tests
    
    
    ## Overview
    
    
    Various hypothesis tests can be applied in order to test if the distribution of given random variable violates normality assumption. These procedures test the H~0~ that provided variable's distribution is _normal_. At this point only few such tests will be covered: the ones that are available in `stats` package (which comes bundled with default R installation) and `nortest` package that is [available](http://cran.r-project.org/web/packages/nortest/index.html) on CRAN.
    
     - **Shapiro-Wilk test** is a powerful normality test appropriate for small samples. In R, it's implemented in `shapiro.test` function available in `stats` package.
     - **Lilliefors test** is a modification of _Kolmogorov-Smirnov test_ appropriate for testing normality when parameters or normal distribution ($\mu$, $\sigma^2$) are not known. `lillie.test` function is located in `nortest` package.
     - **Anderson-Darling test** is one of the most powerful normality tests as it will detect the most of departures from normality. You can find `ad.test` function in `nortest` package.
    
    
    ## Results
    
    
    Here you can see the results of applied normality tests (_p-values_ less than 0.05 indicate significant discrepancies):
    
    We will use _Shapiro-Wilk_, _Lilliefors_ and _Anderson-Darling_ tests to screen departures from normality in the response variable. 
    
    
    
    
    -------------------------------------------------
              Method             Statistic   p-value 
    --------------------------- ----------- ---------
            Lilliefors             0.168      3e-52  
       (Kolmogorov-Smirnov)                          
          normality test                             
    
    Anderson-Darling normality     18.75    7.261e-44
               test                                  
    
    Shapiro-Wilk normality test   0.9001    1.618e-20
    -------------------------------------------------
    
    
    So, the conclusions we can draw with the help of test statistics: 
       
     - based on _Lilliefors test_, distribution of _Internet usage in leisure time (hours per day)_ is not normal
       
     - _Anderson-Darling test_ confirms violation of normality assumption
    
     - according to _Shapiro-Wilk test_, the distribution of _Internet usage in leisure time (hours per day)_ is not normal
     
    
    
    As you can see, the applied tests confirm departures from normality.
    
    # Diagnostic Plots
    
    
    There are various plots that can help you decide about the normality of the distribution. Only a few most commonly used plots will be shown: _histogram_, _Q-Q plot_ and _kernel density plot_.
    
    
    ## Histogram
    
    
    _Histogram_ was first introduced by _Karl Pearson_ and it's probably the most popular plot for depicting the probability distribution of a random variable. However, the decision depends on number of bins, so it can sometimes be misleading. If the variable distribution is normal, bins should resemble the "bell-like" shape.
    
    [![](plots/rapport-NormalityTest-46-2.png)](plots/rapport-NormalityTest-46-2-hires.png)
    
    ## Q-Q Plot
    
    
    "Q" in _Q-Q plot_ stands for _quantile_, as this plot compares empirical and theoretical distribution (in this case, _normal_ distribution) by plotting their quantiles against each other. For normal distribution, plotted dots should approximate a "straight", `x = y` line.
    
    [![](plots/rapport-NormalityTest-46-3.png)](plots/rapport-NormalityTest-46-3-hires.png)
    
    ## Kernel Density Plot
    
    
    _Kernel density plot_ is a plot of smoothed _empirical distribution function_. As such, it provides good insight about the shape of the distribution. For normal distributions, it should resemble the well known "bell shape".
    
    [![](plots/rapport-NormalityTest-46-4.png)](plots/rapport-NormalityTest-46-4-hires.png)
    
    
    #######################################################################################
    ##	 Running: rapport("NormalityTest", ius2008, var = "leisure", nc.plot = FALSE)
    #######################################################################################
    
    # Introduction
    
    
    In statistics, _normality_ refers to an assumption that the distribution of a random variable follows _normal_ (_Gaussian_) distribution. Because of its bell-like shape, it's also known as the _"bell curve"_. The formula for _normal distribution_ is:
    
    $$f(x) = \frac{1}{\sqrt{2\pi{}\sigma{}^2}} e^{-\frac{(x-\mu{})^2}{2\sigma{}^2}}$$
    
    _Normal distribution_ belongs to a _location-scale family_ of distributions, as it's defined two parameters:
    
     - $\mu$ - _mean_ or _expectation_ (location parameter)
     - $\sigma^2$ - _variance_ (scale parameter)
    
    
    # Normality Tests
    
    
    ## Overview
    
    
    Various hypothesis tests can be applied in order to test if the distribution of given random variable violates normality assumption. These procedures test the H~0~ that provided variable's distribution is _normal_. At this point only few such tests will be covered: the ones that are available in `stats` package (which comes bundled with default R installation) and `nortest` package that is [available](http://cran.r-project.org/web/packages/nortest/index.html) on CRAN.
    
     - **Shapiro-Wilk test** is a powerful normality test appropriate for small samples. In R, it's implemented in `shapiro.test` function available in `stats` package.
     - **Lilliefors test** is a modification of _Kolmogorov-Smirnov test_ appropriate for testing normality when parameters or normal distribution ($\mu$, $\sigma^2$) are not known. `lillie.test` function is located in `nortest` package.
     - **Anderson-Darling test** is one of the most powerful normality tests as it will detect the most of departures from normality. You can find `ad.test` function in `nortest` package.
    
    
    ## Results
    
    
    Here you can see the results of applied normality tests (_p-values_ less than 0.05 indicate significant discrepancies):
    
    We will use _Shapiro-Wilk_, _Lilliefors_ and _Anderson-Darling_ tests to screen departures from normality in the response variable. 
    
    
    
    
    -------------------------------------------------
              Method             Statistic   p-value 
    --------------------------- ----------- ---------
            Lilliefors             0.168      3e-52  
       (Kolmogorov-Smirnov)                          
          normality test                             
    
    Anderson-Darling normality     18.75    7.261e-44
               test                                  
    
    Shapiro-Wilk normality test   0.9001    1.618e-20
    -------------------------------------------------
    
    
    So, the conclusions we can draw with the help of test statistics: 
       
     - based on _Lilliefors test_, distribution of _Internet usage in leisure time (hours per day)_ is not normal
       
     - _Anderson-Darling test_ confirms violation of normality assumption
    
     - according to _Shapiro-Wilk test_, the distribution of _Internet usage in leisure time (hours per day)_ is not normal
     
    
    
    As you can see, the applied tests confirm departures from normality.
    
    # Diagnostic Plots
    
    
    There are various plots that can help you decide about the normality of the distribution. Only a few most commonly used plots will be shown: _histogram_, _Q-Q plot_ and _kernel density plot_.
    
    
    ## Histogram
    
    
    _Histogram_ was first introduced by _Karl Pearson_ and it's probably the most popular plot for depicting the probability distribution of a random variable. However, the decision depends on number of bins, so it can sometimes be misleading. If the variable distribution is normal, bins should resemble the "bell-like" shape.
    
    [![](plots/rapport-NormalityTest-47-1.png)](plots/rapport-NormalityTest-47-1-hires.png)
    
    ## Q-Q Plot
    
    
    "Q" in _Q-Q plot_ stands for _quantile_, as this plot compares empirical and theoretical distribution (in this case, _normal_ distribution) by plotting their quantiles against each other. For normal distribution, plotted dots should approximate a "straight", `x = y` line.
    
    [![](plots/rapport-NormalityTest-47-2.png)](plots/rapport-NormalityTest-47-2-hires.png)
    
    ## Kernel Density Plot
    
    
    _Kernel density plot_ is a plot of smoothed _empirical distribution function_. As such, it provides good insight about the shape of the distribution. For normal distributions, it should resemble the well known "bell shape".
    
    [![](plots/rapport-NormalityTest-47-3.png)](plots/rapport-NormalityTest-47-3-hires.png)
    
    
    #######################################################################################
    ##	 Running: rapport("NormalityTest", ius2008, var = "leisure", qq.line = FALSE)
    #######################################################################################
    
    # Introduction
    
    
    In statistics, _normality_ refers to an assumption that the distribution of a random variable follows _normal_ (_Gaussian_) distribution. Because of its bell-like shape, it's also known as the _"bell curve"_. The formula for _normal distribution_ is:
    
    $$f(x) = \frac{1}{\sqrt{2\pi{}\sigma{}^2}} e^{-\frac{(x-\mu{})^2}{2\sigma{}^2}}$$
    
    _Normal distribution_ belongs to a _location-scale family_ of distributions, as it's defined two parameters:
    
     - $\mu$ - _mean_ or _expectation_ (location parameter)
     - $\sigma^2$ - _variance_ (scale parameter)
    
    [![](plots/rapport-NormalityTest-48-1.png)](plots/rapport-NormalityTest-48-1-hires.png)
    
    # Normality Tests
    
    
    ## Overview
    
    
    Various hypothesis tests can be applied in order to test if the distribution of given random variable violates normality assumption. These procedures test the H~0~ that provided variable's distribution is _normal_. At this point only few such tests will be covered: the ones that are available in `stats` package (which comes bundled with default R installation) and `nortest` package that is [available](http://cran.r-project.org/web/packages/nortest/index.html) on CRAN.
    
     - **Shapiro-Wilk test** is a powerful normality test appropriate for small samples. In R, it's implemented in `shapiro.test` function available in `stats` package.
     - **Lilliefors test** is a modification of _Kolmogorov-Smirnov test_ appropriate for testing normality when parameters or normal distribution ($\mu$, $\sigma^2$) are not known. `lillie.test` function is located in `nortest` package.
     - **Anderson-Darling test** is one of the most powerful normality tests as it will detect the most of departures from normality. You can find `ad.test` function in `nortest` package.
    
    
    ## Results
    
    
    Here you can see the results of applied normality tests (_p-values_ less than 0.05 indicate significant discrepancies):
    
    We will use _Shapiro-Wilk_, _Lilliefors_ and _Anderson-Darling_ tests to screen departures from normality in the response variable. 
    
    
    
    
    -------------------------------------------------
              Method             Statistic   p-value 
    --------------------------- ----------- ---------
            Lilliefors             0.168      3e-52  
       (Kolmogorov-Smirnov)                          
          normality test                             
    
    Anderson-Darling normality     18.75    7.261e-44
               test                                  
    
    Shapiro-Wilk normality test   0.9001    1.618e-20
    -------------------------------------------------
    
    
    So, the conclusions we can draw with the help of test statistics: 
       
     - based on _Lilliefors test_, distribution of _Internet usage in leisure time (hours per day)_ is not normal
       
     - _Anderson-Darling test_ confirms violation of normality assumption
    
     - according to _Shapiro-Wilk test_, the distribution of _Internet usage in leisure time (hours per day)_ is not normal
     
    
    
    As you can see, the applied tests confirm departures from normality.
    
    # Diagnostic Plots
    
    
    There are various plots that can help you decide about the normality of the distribution. Only a few most commonly used plots will be shown: _histogram_, _Q-Q plot_ and _kernel density plot_.
    
    
    ## Histogram
    
    
    _Histogram_ was first introduced by _Karl Pearson_ and it's probably the most popular plot for depicting the probability distribution of a random variable. However, the decision depends on number of bins, so it can sometimes be misleading. If the variable distribution is normal, bins should resemble the "bell-like" shape.
    
    [![](plots/rapport-NormalityTest-48-2.png)](plots/rapport-NormalityTest-48-2-hires.png)
    
    ## Q-Q Plot
    
    
    "Q" in _Q-Q plot_ stands for _quantile_, as this plot compares empirical and theoretical distribution (in this case, _normal_ distribution) by plotting their quantiles against each other. For normal distribution, plotted dots should approximate a "straight", `x = y` line.
    
    [![](plots/rapport-NormalityTest-48-3.png)](plots/rapport-NormalityTest-48-3-hires.png)
    
    ## Kernel Density Plot
    
    
    _Kernel density plot_ is a plot of smoothed _empirical distribution function_. As such, it provides good insight about the shape of the distribution. For normal distributions, it should resemble the well known "bell shape".
    
    [![](plots/rapport-NormalityTest-48-4.png)](plots/rapport-NormalityTest-48-4-hires.png)
  • Exports of example
    pandoc and the other backends provide a great number of formats to export your templates.
    We have generated the following types for demonstartion purposes based on the example command(s):
    rapport("NormalityTest", ius2008, var = "leisure")
    rapport("NormalityTest", ius2008, var = "leisure", nc.plot = FALSE)
    rapport("NormalityTest", ius2008, var = "leisure", qq.line = FALSE)
    Popular formats:
    Other morkdown formats
  • Source of `NormalityTest.tpl`
    <!--head
    meta:
      title: Normality Tests
      author: Aleksandar Blagotić
      description: Overview of several normality tests and diagnostic plots that can screen
        departures from normality.
      email: ~
      packages: nortest
      example:
      - rapport("NormalityTest", ius2008, var = "leisure")
      - rapport("NormalityTest", ius2008, var = "leisure", nc.plot = FALSE)
      - rapport("NormalityTest", ius2008, var = "leisure", qq.line = FALSE)
    inputs:
    - name: var
      label: Test variables
      description: Variables to test for normality
      class: numeric
      length:
        min: 1.0
        max: 1.0
      value: ~
      required: yes
      standalone: no
    - name: nc.plot
      label: Normal curve plot
      description: Plot normal curve?
      class: logical
      length:
        min: 1.0
        max: 1.0
      value: yes
      required: no
      standalone: yes
    - name: qq.line
      label: Q-Q plot line
      description: Add line to Q-Q plot?
      class: logical
      length:
        min: 1.0
        max: 1.0
      value: yes
      required: no
      standalone: yes
    head-->
    
    # Introduction
    
    In statistics, _normality_ refers to an assumption that the distribution of a random variable follows _normal_ (_Gaussian_) distribution. Because of its bell-like shape, it's also known as the _"bell curve"_. The formula for _normal distribution_ is:
    
    $$f(x) = \frac{1}{\sqrt{2\pi{}\sigma{}^2}} e^{-\frac{(x-\mu{})^2}{2\sigma{}^2}}$$
    
    _Normal distribution_ belongs to a _location-scale family_ of distributions, as it's defined two parameters:
    
     - $\mu$ - _mean_ or _expectation_ (location parameter)
     - $\sigma^2$ - _variance_ (scale parameter)
    
    <%=
    # generate normal curve plot
    if (nc.plot){
       x <- seq(-3, 3, length.out = 1e3)
       plot(x, dnorm(x), type = "l", ylab = "p", xlab = "", main = "Normal distribution")
    }
    %>
    
    # Normality Tests
    
    ## Overview
    
    Various hypothesis tests can be applied in order to test if the distribution of given random variable violates normality assumption. These procedures test the H~0~ that provided variable's distribution is _normal_. At this point only few such tests will be covered: the ones that are available in `stats` package (which comes bundled with default R installation) and `nortest` package that is [available](http://cran.r-project.org/web/packages/nortest/index.html) on CRAN.
    
     - **Shapiro-Wilk test** is a powerful normality test appropriate for small samples. In R, it's implemented in `shapiro.test` function available in `stats` package.
     - **Lilliefors test** is a modification of _Kolmogorov-Smirnov test_ appropriate for testing normality when parameters or normal distribution ($\mu$, $\sigma^2$) are not known. `lillie.test` function is located in `nortest` package.
     - **Anderson-Darling test** is one of the most powerful normality tests as it will detect the most of departures from normality. You can find `ad.test` function in `nortest` package.
    
    ## Results
    
    Here you can see the results of applied normality tests (_p-values_ less than 0.05 indicate significant discrepancies):
    
    We will use <%=ifelse(length(var) < 5000, "_Shapiro-Wilk_, ", "")%>_Lilliefors_ and _Anderson-Darling_ tests to screen departures from normality in the response variable. <%=ifelse(length(var) > 5000, "_Shapiro-Wilk_ test can only be used with samples below 5000 cases, now we have more.", "")%>
    
    <% if (length(var) < 5000) { %>
    
    <%= ntest <- htest(var, lillie.test, ad.test, shapiro.test)
    k <- 0
    l <- 0
    m <- 0
    n <- 0
    p <- 0.05
    if (ntest$p[1] < 0.05) {l <- k + 1}
    if (ntest$p[2] < 0.05) {m <- l + 1}
    if (ntest$p[3] < 0.05) {n <- m + 1}
    ntest
    %>
    So, the conclusions we can draw with the help of test statistics: 
       
     - based on _Lilliefors test_, distribution of _<%= var.label %>_ is <%= ifelse(ntest[1, 3] < p, "not normal", "normal") %>
       
     - _Anderson-Darling test_ confirms<%= ifelse(ntest[2, 3] < p, " violation of", "") %> normality assumption
    
     - according to _Shapiro-Wilk test_, the distribution of _<%= var.label %>_ is<%= ifelse(ntest[3, 3] < p, " not", "") %> normal
     
    <% } else { %>
    <%= ntest <- htest(var, lillie.test, ad.test)
    k <- 0
    l <- 0
    m <- 0
    n <- 0
    p <- 0.05
    if (ntest$p[1] < 0.05) {l <- k + 1}
    if (ntest$p[2] < 0.05) {n <- l + 1}
    ntest
    %>
    
    So, the conclusions we can draw with the help of test statistics: 
       
     - based on _Lilliefors test_, distribution of _<%= var.label %>_ is <%= ifelse(ntest[1, 3] < p, "not normal", "normal") %>
       
     - _Anderson-Darling test_ confirms<%= ifelse(ntest[2, 3] < p, " violation of", "") %> normality assumption
    <% } %>
    
    <%= if (n > 0) {
    sprintf("As you can see, the applied tests %s.", ifelse(n > 1, "confirm departures from normality", "yield different results on hypotheses of normality, so you may want to stick with one you find most appropriate or you trust the most.")) 
    } else { 
    sprintf("reject departures from normality") 
    } 
    %>
    
    
    # Diagnostic Plots
    
    There are various plots that can help you decide about the normality of the distribution. Only a few most commonly used plots will be shown: _histogram_, _Q-Q plot_ and _kernel density plot_.
    
    ## Histogram
    
    _Histogram_ was first introduced by _Karl Pearson_ and it's probably the most popular plot for depicting the probability distribution of a random variable. However, the decision depends on number of bins, so it can sometimes be misleading. If the variable distribution is normal, bins should resemble the "bell-like" shape.
    
    <%=
    rp.hist(var)
    %>
    
    ## Q-Q Plot
    
    "Q" in _Q-Q plot_ stands for _quantile_, as this plot compares empirical and theoretical distribution (in this case, _normal_ distribution) by plotting their quantiles against each other. For normal distribution, plotted dots should approximate a "straight", `x = y` line.
    
    <%=
    if (qq.line){
       qqmath(var, panel=function(x){
                   panel.qqmath(x)
                   panel.qqmathline(x, distribution = qnorm)
       }, xlab = "Theoretical Quantiles", ylab = "Empirical Quantiles")
    } else {
       rp.qqplot(var)
    }
    %>
    
    ## Kernel Density Plot
    
    _Kernel density plot_ is a plot of smoothed _empirical distribution function_. As such, it provides good insight about the shape of the distribution. For normal distributions, it should resemble the well known "bell shape".
    
    <%=
    rp.densityplot(var)
    %>

OutlierTest.tpl

  • Outlier tests
    by Gergely Daróczi, Dániel Nagy

    This template will check if provided variable has any outliers.

    Required packages: outliers
    Example:
    rapport('OutlierTest', data=ius2008, var='edu')
    rapport('OutlierTest', data=ius2008, var='edu', lund.res=FALSE)
    rapport('OutlierTest', data=ius2008, var='edu', lund.res=FALSE, references=FALSE, grubb=FALSE, dixon=FALSE)
  • Input parameters
    OutlierTest.tpl template has the following options to be (optionally) set while calling rapport:
    • var: Variable
      Numerical variable
      Class: numeric

    • lund.res: Residuals
      Return Lund's residuals?
      Class: logical
      Default value: FALSE

    • references: References
      Print references?
      Class: logical
      Default value: TRUE

    • grubb: Grubb's test
      Show Grubb's test?
      Class: logical
      Default value: TRUE

    • dixon: Dixon's test
      Show Dixon's test?
      Class: logical
      Default value: TRUE

  • ###################################################################
    ##	 Running: rapport('OutlierTest', data=ius2008, var='edu')
    ###################################################################
    
    # Introduction
    
    
    An outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs.
    There are several ways to detect the outliers of our data. However, we cannot say one of them is the perfect method for that, thus it could be useful to take different methods into consideration.
    We present here four of them, one by a chart (a Box Plot based on IQR) and three by statistical descriptions (Lund Test, Grubb's test, Dixon's test).
    
    
    ## References
    
    
      * Grubbs, F. E.: 1969, Procedures for detecting outlying observations in samples. Technometrics 11, pp. 1-21.
    
    
    # Charts
    
    
    Among the graphical displays the Box plots are quite widespread, because of their several advantages. For example, one can easily get approximately punctual first impression from the data and one can visually see the positions of the (possible) outliers, with the help of them.
    
    The Box Plot we used here is based on IQR (Interquartile Range), which is the difference between the higher and the lower quartiles. On the chart the blue box shows the "middle-half" of the data, the so-called whiskers shows the border where from the possible values can be called outliers. The lower whisker is placed 1.5 times below the first quartile, similarly the higher whisker 1.5 times above the third quartile.
    
    [![Boxplot: edu](plots/rapport-OutlierTest-46-1.png)](plots/rapport-OutlierTest-46-1-hires.png)
    
    ## References
    
    
      * Chambers, John, William Cleveland, Beat Kleiner, and Paul Tukey, (1983), Graphical Methods for Data Analysis, Wadsworth.
      * Upton, Graham; Cook, Ian (1996). Understanding Statistics. Oxford University Press. p. 55.
    
    
    # Lund test
    
    
    It seems that _4_ extreme values can be found in "Internet usage for educational purposes (hours per day)". These are: _10_, _0.5_, _1.5_ and _0.5_.
    
    
    ## Explanation
    
    
    The above test for outliers was based on *lm(edu ~ 1)*:
    
    
    --------------------------------------------------------------
         &nbsp;        Estimate   Std. Error   t value   Pr(>|t|) 
    ----------------- ---------- ------------ --------- ----------
     **(Intercept)**    2.048      0.07797      26.27   7.939e-105
    --------------------------------------------------------------
    
    Table: Linear model: edu ~ 1
    
    
    ## References
    
    
      * Lund, R. E. 1975, "Tables for An Approximate Test for Outliers in Linear Models", Technometrics, vol. 17, no. 4, pp. 473-476.
      * Prescott, P. 1975, "An Approximate Test for Outliers in Linear Models", Technometrics, vol. 17, no. 1, pp. 129-132.
    
    
    
    
    
    
    # Grubb's test
    
    Grubbs test for one outlier shows that highest value 12 is an outlier (p=_0.0001964_).
    
    
    
    
    ## References
    
    
      * Grubbs, F.E. (1950). Sample Criteria for testing outlying observations. Ann. Math. Stat. 21, 1, 27-58.
    
    
    
    
    
    
    
    
    
    
    # Dixon's test
    
    chi-squared test for outlier shows that highest value 12 is an outlier (p=_7.441e-07_).
    
    
    
    
    ## References
    
    
      * Dixon, W.J. (1950). Analysis of extreme values. Ann. Math. Stat. 21, 4, 488-506.
    
    
    
    
    
    
    ###################################################################################
    ##	 Running: rapport('OutlierTest', data=ius2008, var='edu', lund.res=FALSE)
    ###################################################################################
    
    # Introduction
    
    
    An outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs.
    There are several ways to detect the outliers of our data. However, we cannot say one of them is the perfect method for that, thus it could be useful to take different methods into consideration.
    We present here four of them, one by a chart (a Box Plot based on IQR) and three by statistical descriptions (Lund Test, Grubb's test, Dixon's test).
    
    
    ## References
    
    
      * Grubbs, F. E.: 1969, Procedures for detecting outlying observations in samples. Technometrics 11, pp. 1-21.
    
    
    # Charts
    
    
    Among the graphical displays the Box plots are quite widespread, because of their several advantages. For example, one can easily get approximately punctual first impression from the data and one can visually see the positions of the (possible) outliers, with the help of them.
    
    The Box Plot we used here is based on IQR (Interquartile Range), which is the difference between the higher and the lower quartiles. On the chart the blue box shows the "middle-half" of the data, the so-called whiskers shows the border where from the possible values can be called outliers. The lower whisker is placed 1.5 times below the first quartile, similarly the higher whisker 1.5 times above the third quartile.
    
    [![Boxplot: edu](plots/rapport-OutlierTest-47-1.png)](plots/rapport-OutlierTest-47-1-hires.png)
    
    ## References
    
    
      * Chambers, John, William Cleveland, Beat Kleiner, and Paul Tukey, (1983), Graphical Methods for Data Analysis, Wadsworth.
      * Upton, Graham; Cook, Ian (1996). Understanding Statistics. Oxford University Press. p. 55.
    
    
    # Lund test
    
    
    It seems that _4_ extreme values can be found in "Internet usage for educational purposes (hours per day)". These are: _10_, _0.5_, _1.5_ and _0.5_.
    
    
    ## Explanation
    
    
    The above test for outliers was based on *lm(edu ~ 1)*:
    
    
    --------------------------------------------------------------
         &nbsp;        Estimate   Std. Error   t value   Pr(>|t|) 
    ----------------- ---------- ------------ --------- ----------
     **(Intercept)**    2.048      0.07797      26.27   7.939e-105
    --------------------------------------------------------------
    
    Table: Linear model: edu ~ 1
    
    
    ## References
    
    
      * Lund, R. E. 1975, "Tables for An Approximate Test for Outliers in Linear Models", Technometrics, vol. 17, no. 4, pp. 473-476.
      * Prescott, P. 1975, "An Approximate Test for Outliers in Linear Models", Technometrics, vol. 17, no. 1, pp. 129-132.
    
    
    
    
    
    
    # Grubb's test
    
    Grubbs test for one outlier shows that highest value 12 is an outlier (p=_0.0001964_).
    
    
    
    
    ## References
    
    
      * Grubbs, F.E. (1950). Sample Criteria for testing outlying observations. Ann. Math. Stat. 21, 1, 27-58.
    
    
    
    
    
    
    
    
    
    
    # Dixon's test
    
    chi-squared test for outlier shows that highest value 12 is an outlier (p=_7.441e-07_).
    
    
    
    
    ## References
    
    
      * Dixon, W.J. (1950). Analysis of extreme values. Ann. Math. Stat. 21, 4, 488-506.
    
    
    
    
    
    
    ###############################################################################################################################
    ##	 Running: rapport('OutlierTest', data=ius2008, var='edu', lund.res=FALSE, references=FALSE, grubb=FALSE, dixon=FALSE)
    ###############################################################################################################################
    
    # Introduction
    
    
    An outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs.
    There are several ways to detect the outliers of our data. However, we cannot say one of them is the perfect method for that, thus it could be useful to take different methods into consideration.
    We present here four of them, one by a chart (a Box Plot based on IQR) and three by statistical descriptions (Lund Test, Grubb's test, Dixon's test).
    
    
    ## References
    
    
      * Grubbs, F. E.: 1969, Procedures for detecting outlying observations in samples. Technometrics 11, pp. 1-21.
    
    
    # Charts
    
    
    Among the graphical displays the Box plots are quite widespread, because of their several advantages. For example, one can easily get approximately punctual first impression from the data and one can visually see the positions of the (possible) outliers, with the help of them.
    
    The Box Plot we used here is based on IQR (Interquartile Range), which is the difference between the higher and the lower quartiles. On the chart the blue box shows the "middle-half" of the data, the so-called whiskers shows the border where from the possible values can be called outliers. The lower whisker is placed 1.5 times below the first quartile, similarly the higher whisker 1.5 times above the third quartile.
    
    [![Boxplot: edu](plots/rapport-OutlierTest-48-1.png)](plots/rapport-OutlierTest-48-1-hires.png)
    
    ## References
    
    
      * Chambers, John, William Cleveland, Beat Kleiner, and Paul Tukey, (1983), Graphical Methods for Data Analysis, Wadsworth.
      * Upton, Graham; Cook, Ian (1996). Understanding Statistics. Oxford University Press. p. 55.
    
    
    # Lund test
    
    
    It seems that _4_ extreme values can be found in "Internet usage for educational purposes (hours per day)". These are: _10_, _0.5_, _1.5_ and _0.5_.
    
    
    ## Explanation
    
    
    The above test for outliers was based on *lm(edu ~ 1)*:
    
    
    --------------------------------------------------------------
         &nbsp;        Estimate   Std. Error   t value   Pr(>|t|) 
    ----------------- ---------- ------------ --------- ----------
     **(Intercept)**    2.048      0.07797      26.27   7.939e-105
    --------------------------------------------------------------
    
    Table: Linear model: edu ~ 1
  • Exports of example
    pandoc and the other backends provide a great number of formats to export your templates.
    We have generated the following types for demonstartion purposes based on the example command(s):
    rapport('OutlierTest', data=ius2008, var='edu')
    rapport('OutlierTest', data=ius2008, var='edu', lund.res=FALSE)
    rapport('OutlierTest', data=ius2008, var='edu', lund.res=FALSE, references=FALSE, grubb=FALSE, dixon=FALSE)
    Popular formats:
    Other morkdown formats
  • Source of `OutlierTest.tpl`
    <!--head
    meta:
      title: Outlier tests
      author: Gergely Daróczi, Dániel Nagy
      description: This template will check if provided variable has any outliers.
      email: gergely@snowl.net
      packages: outliers
      example:
      - rapport('OutlierTest', data=ius2008, var='edu')
      - rapport('OutlierTest', data=ius2008, var='edu', lund.res=FALSE)
      - rapport('OutlierTest', data=ius2008, var='edu', lund.res=FALSE, references=FALSE,
        grubb=FALSE, dixon=FALSE)
    inputs:
    - name: var
      label: Variable
      description: Numerical variable
      class: numeric
      length:
        min: 1.0
        max: 1.0
      value: ~
      required: yes
      standalone: no
    - name: lund.res
      label: Residuals
      description: Return Lund's residuals?
      class: logical
      length:
        min: 1.0
        max: 1.0
      value: no
      required: no
      standalone: yes
    - name: references
      label: References
      description: Print references?
      class: logical
      length:
        min: 1.0
        max: 1.0
      value: yes
      required: no
      standalone: yes
    - name: grubb
      label: Grubb's test
      description: Show Grubb's test?
      class: logical
      length:
        min: 1.0
        max: 1.0
      value: yes
      required: no
      standalone: yes
    - name: dixon
      label: Dixon's test
      description: Show Dixon's test?
      class: logical
      length:
        min: 1.0
        max: 1.0
      value: yes
      required: no
      standalone: yes
    head-->
    
    # Introduction
    
    An outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs.
    There are several ways to detect the outliers of our data. However, we cannot say one of them is the perfect method for that, thus it could be useful to take different methods into consideration.
    We present here four of them, one by a chart (a Box Plot based on IQR) and three by statistical descriptions (Lund Test, Grubb's test, Dixon's test).
    
    ## References
    
      * Grubbs, F. E.: 1969, Procedures for detecting outlying observations in samples. Technometrics 11, pp. 1-21.
    
    # Charts
    
    Among the graphical displays the Box plots are quite widespread, because of their several advantages. For example, one can easily get approximately punctual first impression from the data and one can visually see the positions of the (possible) outliers, with the help of them.
    
    The Box Plot we used here is based on IQR (Interquartile Range), which is the difference between the higher and the lower quartiles. On the chart the blue box shows the "middle-half" of the data, the so-called whiskers shows the border where from the possible values can be called outliers. The lower whisker is placed 1.5 times below the first quartile, similarly the higher whisker 1.5 times above the third quartile.
    
    <%=
    set.caption(sprintf('Boxplot: %s', rp.name(var)))
    rp.boxplot(var)
    %>
    
    ## References
    
      * Chambers, John, William Cleveland, Beat Kleiner, and Paul Tukey, (1983), Graphical Methods for Data Analysis, Wadsworth.
      * Upton, Graham; Cook, Ian (1996). Understanding Statistics. Oxford University Press. p. 55.
    
    # Lund test
    
    It seems that <%=out <- rp.outlier(var); length(out)%> extreme values can be found in "<%=rp.label(var)%>". <%=ifelse(length(out) > 0, sprintf('These are: %s.', p(out)), '')%>
    
    ## Explanation
    
    The above test for outliers was based on *lm(<%=rp.name(var)%> ~ 1)*:
    
    <%=
    set.caption(sprintf('Linear model: %s ~ 1', rp.name(var)))
    lm(var ~ 1)
    %>
    
    <% if (lund.res) { %>
    
    ## The residuals returned:
    
    <%=rstandard(lm(var ~ 1))%>
    
    <% } %>
    
    <%if (references) { %>
    
    ## References
    
      * Lund, R. E. 1975, "Tables for An Approximate Test for Outliers in Linear Models", Technometrics, vol. 17, no. 4, pp. 473-476.
      * Prescott, P. 1975, "An Approximate Test for Outliers in Linear Models", Technometrics, vol. 17, no. 1, pp. 129-132.
    
    <% } %>
    
    <%if (grubb & suppressMessages(suppressWarnings(require(outliers)))) { %>
    
    # Grubb's test
    
    <%=test <- grubbs.test(var); test$method%> shows that <%=ifelse(test$p.value>0.05, 'there are no outliers', test$alternative)%> (p=<%=test$p.value%>).
    
    <%if (references) { %>
    
    ## References
    
      * Grubbs, F.E. (1950). Sample Criteria for testing outlying observations. Ann. Math. Stat. 21, 1, 27-58.
    
    <% } %>
    
    <% } %>
    
    
    
    <% if (dixon & suppressMessages(suppressWarnings(require(outliers)))) { %>
    
    # Dixon's test
    
    <%=test <- chisq.out.test(var); test$method%> shows that <%=ifelse(test$p.value > 0.05, 'there are no outliers', test$alternative)%> (p=<%=test$p.value%>).
    
    <% if (references) { %>
    
    ## References
    
      * Dixon, W.J. (1950). Analysis of extreme values. Ann. Math. Stat. 21, 4, 488-506.
    
    <% } %>
    
    <% } %>

PCA.tpl

  • Principal Component Analysis
    by Daniel Nagy

    In this template Rapporter will present you Principal Component Analysis.

    Required packages: psych
    Example:
    rapport('PCA.tpl', data=mtcars, vars=c('carb', 'mpg', 'cyl', 'drat'), components=3)
    rapport('PCA.tpl', data=mtcars, vars=c('carb', 'mpg', 'cyl', 'drat'), components=3, rot.method="oblimin")
  • Input parameters
    PCA.tpl template has the following options to be (optionally) set while calling rapport:
    • vars: Used Variables
      Which variables would you use?
      Class: numeric

    • components: Number of Components
      How many Principal Components you want to use?
      Class: integer
      Default value: 2

    • rot.matrix: Rotation Matrix
      Would you check the Rotation Matrix?
      Class: logical
      Default value: TRUE

    • rot.method: Method of Rotation
      Which rotation method would you use?
      Class: character
      Default value: varimax

  • #######################################################################################################
    ##	 Running: rapport('PCA.tpl', data=mtcars, vars=c('carb', 'mpg', 'cyl', 'drat'), components=3)
    #######################################################################################################
    
    # Introduction
    
    
    [Principal Component Analysis](https://en.wikipedia.org/wiki/Principal_component_analysis) is a dimension reduction method. It produces linearly independent principal components using the variances of the observations in a set of variables.
    
    
    
    # Results
    
    
    ---------------------------------------------------
               &nbsp;             PC1     PC2     PC3  
    ---------------------------- ------ ------- -------
       **Standard deviation**    6.298   1.35   0.9088 
    
     **Proportion of Variance**  0.9354 0.04298 0.01947
    
     **Cumulative Proportion**   0.9354 0.9783  0.9978 
    ---------------------------------------------------
    
    
    From the table above one can see that the first _3_ Principal Components contains the _93.535 %_, _4.298 %_ and _1.947 %_ of the variances, so together the 99.78 % of that.
    
    
    ### Visual representation
    
    
    It could be informative to see visually how the observations lies on these components. On that two dimensional plot below, where the axes are the components which contains the two most variances, you can see (the red vectors) the effect of the variables as well. 
    
    [![](plots/rapport-PCA.tpl-31-1.png)](plots/rapport-PCA.tpl-31-1-hires.png)
    
    ## Rotation
    
    As you wanted to check the Rotation matrix let us present that for you:
    
    -------------------------------------------
      &nbsp;      PC1        PC2        PC3    
    ---------- ---------- ---------- ----------
     **carb**   -0.1486   **0.9728**  -0.08587 
    
     **mpg**   **0.9557**   0.1614     0.2433  
    
     **cyl**    -0.2476    0.07389   **0.9502**
    
     **drat**   0.05777     0.1488    -0.1745  
    -------------------------------------------
    
    
    
    The cells written in bold shows which components explain the most variances of the variables, with the help of them we can draw the following conclusion:
    
    * PC1 is a principal component of mpg 
    * PC2 is a principal component of carb 
    * PC3 is a principal component of cyl 
    
    <!-- end of list -->
    
    
    
    
    We can say that none of these impacts are negative.
    
    
    
    
    #############################################################################################################################
    ##	 Running: rapport('PCA.tpl', data=mtcars, vars=c('carb', 'mpg', 'cyl', 'drat'), components=3, rot.method="oblimin")
    #############################################################################################################################
    
    # Introduction
    
    
    [Principal Component Analysis](https://en.wikipedia.org/wiki/Principal_component_analysis) is a dimension reduction method. It produces linearly independent principal components using the variances of the observations in a set of variables.
    
    
    
    # Results
    
    
    ---------------------------------------------------
               &nbsp;             PC1     PC2     PC3  
    ---------------------------- ------ ------- -------
       **Standard deviation**    6.298   1.35   0.9088 
    
     **Proportion of Variance**  0.9354 0.04298 0.01947
    
     **Cumulative Proportion**   0.9354 0.9783  0.9978 
    ---------------------------------------------------
    
    
    From the table above one can see that the first _3_ Principal Components contains the _93.535 %_, _4.298 %_ and _1.947 %_ of the variances, so together the 99.78 % of that.
    
    
    ### Visual representation
    
    
    It could be informative to see visually how the observations lies on these components. On that two dimensional plot below, where the axes are the components which contains the two most variances, you can see (the red vectors) the effect of the variables as well. 
    
    [![](plots/rapport-PCA.tpl-32-1.png)](plots/rapport-PCA.tpl-32-1-hires.png)
    
    ## Rotation
    
    As you wanted to check the Rotation matrix let us present that for you:
    
    -------------------------------------------
      &nbsp;      PC1        PC2        PC3    
    ---------- ---------- ---------- ----------
     **carb**   -0.1486   **0.9728**  -0.08587 
    
     **mpg**   **0.9557**   0.1614     0.2433  
    
     **cyl**    -0.2476    0.07389   **0.9502**
    
     **drat**   0.05777     0.1488    -0.1745  
    -------------------------------------------
    
    
    
    The cells written in bold shows which components explain the most variances of the variables, with the help of them we can draw the following conclusion:
    
    * PC1 is a principal component of mpg 
    * PC2 is a principal component of carb 
    * PC3 is a principal component of cyl 
    
    <!-- end of list -->
    
    
    
    
    We can say that none of these impacts are negative.
  • Exports of example
    pandoc and the other backends provide a great number of formats to export your templates.
    We have generated the following types for demonstartion purposes based on the example command(s):
    rapport('PCA.tpl', data=mtcars, vars=c('carb', 'mpg', 'cyl', 'drat'), components=3)
    rapport('PCA.tpl', data=mtcars, vars=c('carb', 'mpg', 'cyl', 'drat'), components=3, rot.method="oblimin")
    Popular formats:
    Other morkdown formats
  • Source of `PCA.tpl`
    <!--head
    meta:
      title: Principal Component Analysis
      author: Daniel Nagy
      description: In this template Rapporter will present you Principal Component Analysis.
      email: ~
      packages:
      - psych
      - GPArotation
      example:
      - rapport('PCA.tpl', data=mtcars, vars=c('carb', 'mpg', 'cyl', 'drat'), components=3)
      - rapport('PCA.tpl', data=mtcars, vars=c('carb', 'mpg', 'cyl', 'drat'), components=3, rot.method="oblimin")
    inputs:
    - name: vars
      label: Used Variables
      description: Which variables would you use?
      class: numeric
      length:
        min: 2.0
        max: 1000.0
      required: yes
      standalone: no
    - name: components
      label: Number of Components
      description: How many Principal Components you want to use?
      class: integer
      limit:
        min: 1.0
        max: 999.0
      value: 2.0
      required: yes
      standalone: yes
    - name: rot.matrix
      label : Rotation Matrix
      description: Would you check the Rotation Matrix?
      class: logical
      length:
        min: 1.0
        max: 1.0
      value: yes
      required: no
      standalone: yes
    - name: rot.method
      label: Method of Rotation
      description: Which rotation method would you use?
      class: character
      length:
        min: 1.0
        max: 1.0
      options:
      - none
      - varimax
      - quartimax
      - promax
      - oblimin
      - simplimax
      - cluster
      value: varimax
      matchable: yes
      allow_multiple: no
      required: no
      standalone: yes
    head-->
    
    
    <% if (components > ncol(vars)) { %>
    
    Your request cannot be implemented, because there are more components (<%= components %>) than the number of the used variables (<%= ncol(vars) %>). Please set the number of the components to maximum <%= ncol(vars) - 1 %> with the same number of the variables or extend the number of those variables to <%= components + 1 %>
    
    <% } else { %>
    
    
    # Introduction
    
    [Principal Component Analysis](https://en.wikipedia.org/wiki/Principal_component_analysis) is a dimension reduction method. It produces linearly independent principal components using the variances of the observations in a set of variables.
    
    
    # Results
    
    <%=
    vars <- na.omit(vars)
    summ <- as.matrix(summary(prcomp(vars))$importance[, 1:components])
    if (components == 1) colnames(summ)[1] <- "PC1"
    summ
    %>
    From the table above one can see that the <%=ifelse(components == 1, "", "first")%> <%=components%> Principal Component<%=ifelse(components == 1, '', 's')%> contains the <%=paste(summary(prcomp(vars))$importance[2, 1:components] * 100, "%")%> of the variance<%=ifelse(components == 1, '', 's')%><% if (components != 1) { %>, so together the <%=paste(sum(summary(prcomp(vars))$importance[2, 1:components] * 100), "%")%> of that.<% } else { %>.<% } %>
    
    ### Visual representation
    
    It could be informative to see visually how the observations lies on these components. On that two dimensional plot below, where the axes are the components which contains the two most variances, you can see (the red vectors) the effect of the variables as well. <% if(ncol(vars) > 5) { %> In the case we have several variables, the plot could be a little bit confusing but it still helps us to explore the number of the Principal Components. <% } %>
    
    <%=
    biplot(prcomp(vars))
    %>
    
    <% if (rot.matrix) { %>
    
    ## Rotation
    
    
    <%=
    ifelse(rot.matrix,"As you wanted to check the Rotation matrix let us present that for you:","")
    if (rot.matrix) {
    rot <- as.matrix(prcomp(vars)$rotation[,1:components])
    if (components == 1) colnames(rot)[1] <- "PC1"
    emphasize.strong.cells(which(abs(rot) > 0.3, arr.ind = TRUE))
    rot
    } else { }
    %>
    
    The cells written in bold shows which components explain the most variances of the variables, with the help of them we can draw the following conclusion:
    <%=paste(pandoc.list.return(paste(colnames(rot)[which(abs(rot) > 0.3, arr.ind = TRUE)[, 2]],rp.name(vars)[which(abs(rot) > 0.3, arr.ind = TRUE)[, 1]], sep = " is a principal component of ")), collapse = '\n')%>
    
    <% if (length(which(rot > 0.3)) != length(which(abs(rot) > 0.3))) { %>
    
    <%=neg.comp <- colnames(rot)[which(rot < -0.3, arr.ind = TRUE)[, 2]]%>
    
    From them in the case<%=ifelse(neg.comp < 1, "s", "")%> of the <%=paste(colnames(rot)[which(rot < -0.3, arr.ind = TRUE)[, 2]],rp.name(vars)[which(rot < -0.3, arr.ind = TRUE)[, 1]], sep = "'s impact on ")%>, we can say <%=ifelse(neg.comp < 1, "they are", "that is")%> negative.
    
    <% } else { %>
    We can say that <%=ifelse(length(which(abs(rot) > 0.3)), "none of these impacts are negative", "this impact is positive")%>.
    
    <% }}} %>

TTest.tpl

  • t-test Template
    by Aleksandar Blagotić

    A t-test report with table of descriptives, diagnostic tests and t-test specific statistics.

    Required packages: nortest
    Example:
    rapport("TTest", ius2008, x = "leisure", y = "gender")
    rapport("TTest", ius2008, x = "leisure", mu = 3.2)
  • Input parameters
    TTest.tpl template has the following options to be (optionally) set while calling rapport:
    • x: X variable
      Dependent (response) variable
      Class: numeric

    • y: Y variable
      Independent variable (factor, or another numeric)

    • alter: Alternative hypothesis
      Whether two-sided, greater or less variant will be applied
      Class: character
      Default value: two.sided

    • mu: Mean value
      Mean value for one-sample t-test
      Class: numeric

    • paired: Paired t-test
      Carry out paired t-test or not
      Class: logical
      Default value: FALSE

    • var.equal: Variance equality
      Equal variances assumed: choose automatically or not
      Class: logical
      Default value: FALSE

    • ci.level: Confidence interval
      Confidence interval level
      Class: numeric
      Default value: 0.95

  • ##########################################################################
    ##	 Running: rapport("TTest", ius2008, x = "leisure", y = "gender")
    ##########################################################################
    
    # Introduction
    
    
    In a nutshell, _t-test_ is a statistical test that assesses hypothesis of equality of two means. But in theory, any hypothesis test that yields statistic which follows [_t-distribution_](https://en.wikipedia.org/wiki/Student%27s_t-distribution) can be considered a _t-test_. The most common usage of _t-test_ is to:
    
     - compare the mean of a variable with given test mean value - **one-sample _t-test_**
     - compare means of two variables from independent samples - **independent samples _t-test_**
     - compare means of two variables from dependent samples - **paired-samples _t-test_**
    
    
    # Overview
    
    Independent samples _t-test_ is carried out with _Internet usage in leisure time (hours per day)_ as dependent variable, and _Gender_ as independent variable. Confidence interval is set to 95%. Equality of variances wasn't assumed.
    
    
    # Descriptives
    
    
    In order to get more insight on the underlying data, a table of basic descriptive statistics is displayed below.
    
    
    ------------------------------------------------------
     Gender   min   max   mean   sd    var   median   IQR 
    -------- ----- ----- ------ ----- ----- -------- -----
      male     0    12    3.27  1.953 3.816    3       3  
    
     female    0    12   3.064  2.355 5.544    2       3  
    ------------------------------------------------------
    
    Table: Table continues below
    
     
    ---------------------
     skewness   kurtosis 
    ---------- ----------
      0.9443     0.9858  
    
      1.398       1.87   
    ---------------------
    
    
    # Diagnostics
    
    
    Since _t-test_ is a parametric technique, it sets some basic assumptions on distribution shape: it has to be _normal_ (or approximately normal). A few normality test are to be applied, in order to screen possible departures from normality.
    
    
    ## Normality Tests
    
    
    We will use _Shapiro-Wilk_, _Lilliefors_ and _Anderson-Darling_ tests to screen departures from normality in the response variable (_Internet usage in leisure time (hours per day)_).
    
    
    --------------------------------------------
                 N                p             
    --------------------------- ------ ---------
    Shapiro-Wilk normality test 0.9001 1.618e-20
    
            Lilliefors          0.168    3e-52  
       (Kolmogorov-Smirnov)                     
          normality test                        
    
    Anderson-Darling normality  18.75  7.261e-44
               test                             
    --------------------------------------------
    
    
    
    As you can see, applied tests yield different results on hypotheses of normality, so you may want to stick with one you find most appropriate or you trust the most..
    
    
    
    # Results
    
    Welch Two Sample t-test was applied, and significant differences were found.
    
    
    --------------------------------------------------------
    &nbsp;   statistic   df     p     CI(lower)   CI(upper) 
    ------- ----------- ----- ------ ----------- -----------
     **t**     1.148    457.9 0.2514   -0.1463     0.5576   
    --------------------------------------------------------
    
    
    
    ######################################################################
    ##	 Running: rapport("TTest", ius2008, x = "leisure", mu = 3.2)
    ######################################################################
    
    # Introduction
    
    
    In a nutshell, _t-test_ is a statistical test that assesses hypothesis of equality of two means. But in theory, any hypothesis test that yields statistic which follows [_t-distribution_](https://en.wikipedia.org/wiki/Student%27s_t-distribution) can be considered a _t-test_. The most common usage of _t-test_ is to:
    
     - compare the mean of a variable with given test mean value - **one-sample _t-test_**
     - compare means of two variables from independent samples - **independent samples _t-test_**
     - compare means of two variables from dependent samples - **paired-samples _t-test_**
    
    
    # Overview
    
    One-sample _t-test_ is carried out with _Internet usage in leisure time (hours per day)_ as dependent variable. Confidence interval is set to 95%. Equality of variances wasn't assumed.
    
    
    # Descriptives
    
    
    In order to get more insight on the underlying data, a table of basic descriptive statistics is displayed below.
    
    
    -------------------------------------------------------------
               Variable             min   max   mean   sd    var 
    ------------------------------ ----- ----- ------ ----- -----
    Internet usage in leisure time   0    12   3.199  2.144 4.595
           (hours per day)                                       
    -------------------------------------------------------------
    
    Table: Table continues below
    
     
    ------------------------------------
     median   IQR   skewness   kurtosis 
    -------- ----- ---------- ----------
       3       2     1.185      1.533   
    ------------------------------------
    
    
    # Diagnostics
    
    
    Since _t-test_ is a parametric technique, it sets some basic assumptions on distribution shape: it has to be _normal_ (or approximately normal). A few normality test are to be applied, in order to screen possible departures from normality.
    
    
    ## Normality Tests
    
    
    We will use _Shapiro-Wilk_, _Lilliefors_ and _Anderson-Darling_ tests to screen departures from normality in the response variable (_Internet usage in leisure time (hours per day)_).
    
    
    --------------------------------------------
                 N                p             
    --------------------------- ------ ---------
    Shapiro-Wilk normality test 0.9001 1.618e-20
    
            Lilliefors          0.168    3e-52  
       (Kolmogorov-Smirnov)                     
          normality test                        
    
    Anderson-Darling normality  18.75  7.261e-44
               test                             
    --------------------------------------------
    
    
    
    As you can see, applied tests yield different results on hypotheses of normality, so you may want to stick with one you find most appropriate or you trust the most..
    
    
    
    # Results
    
    One Sample t-test was applied, and significant differences were found.
    
    
    -------------------------------------------------------
    &nbsp;   statistic   df    p     CI(lower)   CI(upper) 
    ------- ----------- ---- ------ ----------- -----------
     **t**   -0.007198  671  0.9943    3.037       3.362   
    -------------------------------------------------------
  • Exports of example
    pandoc and the other backends provide a great number of formats to export your templates.
    We have generated the following types for demonstartion purposes based on the example command(s):
    rapport("TTest", ius2008, x = "leisure", y = "gender")
    rapport("TTest", ius2008, x = "leisure", mu = 3.2)
    Popular formats:
    Other morkdown formats
  • Source of `TTest.tpl`
    <!--head
    meta:
      title: t-test Template
      author: Aleksandar Blagotić
      description: A t-test report with table of descriptives, diagnostic tests and t-test
        specific statistics.
      email: ~
      packages: nortest
      example:
      - rapport("TTest", ius2008, x = "leisure", y = "gender")
      - rapport("TTest", ius2008, x = "leisure", mu = 3.2)
    inputs:
    - name: x
      label: X variable
      description: Dependent (response) variable
      class: numeric
      length:
        min: 1.0
        max: 1.0
      value: ~
      required: yes
      standalone: no
    - name: 'y'
      label: Y variable
      description: Independent variable (factor, or another numeric)
      length:
        min: 1.0
        max: 1.0
      value: ~
      required: no
      standalone: no
    - name: alter
      label: Alternative hypothesis
      description: Whether two-sided, greater or less variant will be applied
      class: character
      length:
        min: 1.0
        max: 1.0
      options:
      - two.sided
      - less
      - greater
      value: two.sided
      matchable: yes
      allow_multiple: no
      required: no
      standalone: yes
    - name: mu
      label: Mean value
      description: Mean value for one-sample t-test
      class: numeric
      length:
        min: 1.0
        max: 1.0
      value: ~
      limit:
        min: -.inf
        max: .inf
      required: no
      standalone: yes
    - name: paired
      label: Paired t-test
      description: Carry out paired t-test or not
      class: logical
      length:
        min: 1.0
        max: 1.0
      value: no
      required: no
      standalone: yes
    - name: var.equal
      label: Variance equality
      description: 'Equal variances assumed: choose automatically or not'
      class: logical
      length:
        min: 1.0
        max: 1.0
      value: no
      required: no
      standalone: yes
    - name: ci.level
      label: Confidence interval
      description: Confidence interval level
      class: numeric
      length:
        min: 1.0
        max: 1.0
      value: 0.95
      limit:
        min: 0.0
        max: 1.0
      required: no
      standalone: yes
    head-->
    <%=
    
    if (!exists('y') || is.null(y)) {
    
        ## if y is NULL, you're stuck with one-sample t-test
        arg.list <- list(x = x, mu = mu, alternative = alter, conf.level = ci.level)
        test <- stats:::t.test
    
        variant <- "one-sample"
    
    } else {
    
        ## "y" is specified, so it's either independent samples or paired samples
        if (!inherits(y, c('factor', 'numeric')))
            stop('"y" has to be either a factor or a numeric vector')
    
        arg.list <- list(alternative = alter, paired = paired, var.equal = var.equal, conf.level = ci.level)
    
        if (is.factor(y)){
           test <- stats:::t.test.formula
           arg.list$formula <- x ~ y
        }
    
        if (is.numeric(y)){
            test <- stats:::t.test
            arg.list$x <- x
            arg.list$y <- y
        }
    
        variant <- ifelse(paired, "paired-samples", "independent samples")
    
    }
    
    tt <- do.call(test, arg.list)
    
    %>
    
    # Introduction
    
    In a nutshell, _t-test_ is a statistical test that assesses hypothesis of equality of two means. But in theory, any hypothesis test that yields statistic which follows [_t-distribution_](https://en.wikipedia.org/wiki/Student%27s_t-distribution) can be considered a _t-test_. The most common usage of _t-test_ is to:
    
     - compare the mean of a variable with given test mean value - **one-sample _t-test_**
     - compare means of two variables from independent samples - **independent samples _t-test_**
     - compare means of two variables from dependent samples - **paired-samples _t-test_**
    
    # Overview
    
    <%= capitalise(variant) %> _t-test_ is carried out with <%= p(x.label) %> as dependent variable<%= if (!is.null(y)) sprintf(", and %s as independent variable", p(y.label)) else ""%>. Confidence interval is set to <%= pct(ci.level * 100, 0) %>. <%= if (!is.null(variant)) sprintf("Equality of variances %s assumed.", ifelse(var.equal, "was", "wasn't")) else "" %>
    
    # Descriptives
    
    In order to get more insight on the underlying data, a table of basic descriptive statistics is displayed below.
    
    <%=
    if (is.null(y)) {
        rp.desc(x, NULL, c('min', 'max', 'mean', 'sd', 'var', 'median', 'IQR', 'skewness', 'kurtosis'), rp.data)
    } else {
        rp.desc(x, y, c('min', 'max', 'mean', 'sd', 'var', 'median', 'IQR', 'skewness', 'kurtosis'), rp.data)
    }
    %>
    
    # Diagnostics
    
    Since _t-test_ is a parametric technique, it sets some basic assumptions on distribution shape: it has to be _normal_ (or approximately normal). A few normality test are to be applied, in order to screen possible departures from normality.
    
    ## Normality Tests
    
    We will use _Shapiro-Wilk_, _Lilliefors_ and _Anderson-Darling_ tests to screen departures from normality in the response variable (<%= p(x.label) %>).
    
    <%=
    (ntest <- htest(x, shapiro.test, lillie.test, ad.test, colnames = c("N", "p")))
    %>
    
    As you can see, applied tests <%= ifelse(all(ntest$p < .05), "confirm departures from normality", "yield different results on hypotheses of normality, so you may want to stick with one you find most appropriate or you trust the most.") %>.
    
    
    # Results
    
    <%= tt$method %> was applied, and significant differences <%= ifelse(tt$p.value < 1 - ci.level, "weren't", "were") %> found.
    
    <%=
    with(tt, data.frame(statistic, df = parameter, p = p.value, `CI(lower)` = conf.int[1], `CI(upper)` = conf.int[2], check.names = FALSE))
    %>

WilcoxonTest.tpl

  • Wilcoxon test
    by Daniel Nagy

    In this template Rapporter will present you Wilcoxon test.

    Required packages:
    Example:
    rapport('WilcoxonTest.tpl', data=mtcars, var1='mpg', var2='cyl')
  • Input parameters
    WilcoxonTest.tpl template has the following options to be (optionally) set while calling rapport:
    • var1: Variable1
      This is the first variable which will be used here
      Class: numeric

    • var2: Variable2
      This is the second variable which will be used here
      Class: numeric

    • alter: Alternative hypothesis
      Class: character
      Default value: two.sided

  • ####################################################################################
    ##	 Running: rapport('WilcoxonTest.tpl', data=mtcars, var1='mpg', var2='cyl')
    ####################################################################################
    
    # Introduction
    
    
    [Wilcoxon test](http://en.wikipedia.org/wiki/Wilcoxon_test) (or its equivalent 
    the Mann-Whitney U test)  is a non-parametric statistical test which can be an alternative of the  paired Student's t-test, t-test for matched pairs, or the t-test for dependent samples. When the Wilcoxon test is used to compare two samples of values which are not paired, it makes no assumption about the shape of the distribution, only that (as null hypothesis) the distribution from which mpg was sampled is the same as the distribution from which cyl was sampled. The null hypothesis states that the variances of the variables are equal.
    
    
    -----------------------------------------------------------
     Test statistic       P value       Alternative hypothesis 
    ---------------- ----------------- ------------------------
          1024       _4.306e-12_ * * *        two.sided        
    -----------------------------------------------------------
    
    Table: Wilcoxon rank sum test with continuity correction mpg and cyl
    
    
    
    As you can see in the table the the p-value of the Wilcoxon test is _4.306e-12_, thus we can accept the assumption of the null hypothesis, so we can say that statistically the variances are equal.
  • Exports of example
    pandoc and the other backends provide a great number of formats to export your templates.
    We have generated the following types for demonstartion purposes based on the example command(s):
    rapport('WilcoxonTest.tpl', data=mtcars, var1='mpg', var2='cyl')
    Popular formats:
    Other morkdown formats
  • Source of `WilcoxonTest.tpl`
    <!--head
    meta:
      title: Wilcoxon test
      author: Daniel Nagy
      description: In this template Rapporter will present you Wilcoxon test.
      email: ~
      packages: ~
      example:
      - rapport('WilcoxonTest.tpl', data=mtcars, var1='mpg', var2='cyl')
    inputs:
    - name: var1
      label: Variable1
      description: This is the first variable which will be used here
      class: numeric
      length:
        min: 1.0
        max: 1.0
      required: yes
      standalone: no
    - name: var2
      label: Variable2
      description: This is the second variable which will be used here
      class: numeric
      length:
        min: 1.0
        max: 1.0
      required: yes
      standalone: no
    - name: alter
      label: Alternative hypothesis
      description: 
      class: character
      options:
      - two.sided
      - less
      - greater
      value: two.sided
      matchable: yes
      allow.multiple: no
      required: no
      standalone: yes
    head-->
    
    # Introduction
    
    [Wilcoxon test](http://en.wikipedia.org/wiki/Wilcoxon_test) (or its equivalent 
    the Mann-Whitney U test)  is a non-parametric statistical test which can be an alternative of the  paired Student's t-test, t-test for matched pairs, or the t-test for dependent samples. When the Wilcoxon test is used to compare two samples of values which are not paired, it makes no assumption about the shape of the distribution, only that (as null hypothesis) the distribution from which <%=var1.label%> was sampled is the same as the distribution from which <%=var2.label%> was sampled. The null hypothesis states that the variances of the variables are equal.
    
    <%=
    set.caption(sprintf('Wilcoxon rank sum test with continuity correction %s and %s', var1.label, var2.label))
    suppressWarnings(wilc <- wilcox.test(var1, var2, alter))
    wilc
    p.v <- wilc$p.value
    %>
    
    As you can see in the table the the p-value of the Wilcoxon test is <%=p.v%>, thus we can <%= ifelse(p.v > 0.05, "reject", "accept")%> the assumption of the null hypothesis, so we can say that statistically the variances are<%= ifelse(p.v > 0.05, " not", "")%> equal.

Writing templates

Writing a custom rapport template or modifying an existing one is not trickier than writing an ordinary statistical report. It requires some basic R skills, and a familiarity with rapport input specifications that we’re about to cover thoroughly. Of course, sophisticated reports would require more proficiency in R.

Input Header

Recent changes

As of version 0.50, rapport is using the new header specification that relies solely on YAML syntax. The old syntax is deprecated, though kept in the package for backwards compatibility.

In order to define a valid rapport template, you’ll have to specify some info at the beginning of the document, in the so-called template header. The header itself is nothing but a YAML syntax placed within custom HTML comment tags: <!--head and head-->. It consists of metadata and inputs sections defined under meta and inputs YAML keys, respectively.

Template metadata

Template metadata can contain following fields:

  • title - template title (required)
  • author - template author (required)
  • description - template description (required)
  • email - email of the author
  • packages - list of packages that the template depends on
  • example - a list of example rapport() calls for the given template.

If you’re familiar with the package development in R, you’ll probably find this specification similar to the DESCRIPTION file. Here’s an example of metadata section:


meta:
  title: Custom template
  author: John Doe
  description: Just a custom template
  packages:
	- lme4
  - nortest
	- ggplot2
  example:
  - rapport("custom-template", mtcars, x = "wt")
	- rapport("custom-template", mtcars, x = c("mpg", "hp"))

As you can see, it depends on lme4, nortest and ggplot2 packages, and it has 2 example calls to rapport.

Just a reminder about the technical details: One should not forget to type in the accurate number of the ‘spaces’ in the beginning of the lines! Before each fields two ‘spaces’ are required, next to them the exact packages and the examples have a hyphen and an other ‘space’ in their front.

Template inputs

The template inputs are probably the most important feature of rapport. By using the template inputs, you can match a dataset variable or custom R object and assign it to a symbol in a template evaluation environment. That way you can use given input’s name throughout the template. Since the version 0.50, we wanted to make inputs more familiar to R users, so we ditched previous input definition syntax as it was inconsistent with R conventions. New input specification relies on R class system and resembles all important methods and/or attributes of R objects.

Template inputs can be divided into two categories:

  • dataset inputs that hold names of the elements of the object provided in data argument in rapport function. This usually refers to a vector containing column names of a data.frame object, but it can be any R object that has named elements.
  • standalone inputs do not depend on the object provided in the data argument. They usually accept an R object passed by the user, or the value of the value attribute provided in the input definition (see below).
General input attributes

Following options are available for all inputs:

  • name (character string, required) - input name. It acts as an identifier for a given input, and is required as such. Template cannot contain duplicate input names. rapport inputs currently have custom naming conventions - see ?guess.input.name for details.
  • label (character string) - input label. It can be blank, but it’s useful to provide an input label as rapport helpers use that information in plot labels and/or exported HTML tables. Defaults to empty string.
  • description (character string) - similar to label, but should contain long(er) description of given input.
  • class (character string) - defines an input class. Currently supported input classes are: character, complex, factor, integer, logical, numeric and raw (all atomic vector classes). Class attribute should usually be provided, but it can also be NULL (default) - in that case the input class will be guessed based on matched R object’s value.
  • required (logical value) - does an input require a value? Defaults to FALSE.
  • standalone (logical value) - indicates that the input depends on a dataset. Defaults to FALSE (inputs are standalone).
  • length (integer value, or a specific key: value pair) - sets restrictions on the matched R object’s lenght attribute. length input attribute can be defined in various ways:
    • an integer value, e.g. length: 10, which require all R object values to have the length of 10.
    • exactly tag - previous example (length: 10) will be interpreted as:
    length:
      exactly: 10
     
    • min and/or max tags that define the range within which an input length must fall. Note that the range limits are inclusive - for instance:
    length:
      min: 2
      max: 10
     

    will accept all R objects with length of at least 2 and at most 10. Either min or max tag can be omitted, and they will default to 1 and Inf, respectively. For example:

    length:
      min: 1
     

    is identical to:

    length:
      min: 1
      max: Inf
     

    Similarly,

    length:
      max: 10
     

    is identical to:

    length:
      min: 1
      max: 10
     
    • if omitted (NULL) length will default to:
    length:
      exactly: 10
     

    It’s worth noting that rapport treats input length in a bit different manner. If you match a subset of, e.g. 10 character vectors from the dataset, the input length will be 10, as you might expect. But if you select only one variable, length will be equal to 1, and not equal to the number of vector elements. This stands both for standalone and dataset inputs. However, if you match a character vector against a standalone input, length will be stored correctly - as the number of vector elements.

  • value (vector(s) of an appropriate class) - this attribute only exists for standalone inputs. Provided value must satisfy rules defined in class and length attributes, as well as any other class-specific rules (see below).

It worth to note, that the inputs with the possible outcome of a logical value (required, standalone) can be set with the TRUE/FALSE and the yes/no options as well.

Class-specific attributes
character
  • nchar - restricts the number of characters of the input value. It accepts the same attribute format as length attribute. If NULL (default), no checks will be performed.
  • regexp (character string) - contains a string with regular expression. If non-NULL, all strings in the character vector must match the given regular expression. Defaults to NULL - no checks are applied.
  • matchable (logical value) - if TRUE, options attribute must be provided, while value is optional, though recommended. options should contain values to be chosen from, just like <option> tag does when nested in <select> HTML tag, while value must contain a value from options or it can be omitted (NULL). allow_multiple will allow values from options list to be matched multiple times. Note that unlike previous versions of rapport, partial matching is not performed.
numeric, integer
  • limit - accepts the same format as length attribute, only that it checks input values rather than input length. limit attribute is NULL by default and checks are performed only when limit is defined.
factor
  • nlevels - accepts the same format as length attribute, but the check is performed on the number of factor levels.
  • matchable - ibid as in character inputs (note that in previous versions of rapport matching was performed against factor levels - well, not any more, now we match against values to make it consistent with character inputs).

Body

The body of the template uses brew syntax with a forked back-end. Please check out pander’s documentation for details, in the followings the most important informations will be elaborated.

Brew syntaxes basically have two parts:

  • normal text
  • commands between special tags

The special tags have two types:

  • <% command %> running R command
  • <%= command %> next to running an R command, applies pander to the returning R object, thus it will appear in a nice Pandoc markdown format.

These two codes look pretty the same and actually doesn’t differ too much, but still in a crucial way. Both of them are good for running R commands, but:

  • the code in the belly of the first tag is good for unprinted results and mostly for conditional statements or loops
  • the second will run each expression after each other and send the markdown format result to the interface.

Let us show you some basic examples to show how they work and the difference between them!

Tags without equal sign:

<% if (2 > 1) { %> Math works, 2 is more than 1! <% } %> The following sentence will be written on the interface we use: “Math works, 2 is more than 1!”

Tags with the equal sign: <%= 2 > 1 %> That was not more complicated than just create a logical value with comparing two integer.

And obviously you can mix the two types as well:

<% if (2 > 1) { %> Math works? Is 2 really more than 1? Yeah, that's <%= 2 > 1 %>. <% } %> That will produce the same sentence like in the first example.

Here are a few useful calls, that you can use in the body section of a template:

  • rapport.data returns the entire active dataset as a data frame
  • rapport.template produces a list with the meta informations of the template
  • rapport.inputs gives you a list with the inputs of the template.

If you need information about the variables from the dataset, you can use the followings (“var” is any input variable):

  • var.name tells you the name of the variable in the dataset (ie.: If you are using the edu variable from the ius2008 dataset in the input variable called x, x.name will return edu.)
  • var.len gives you the length of the variable
  • var.label returns the variable label.

Translating templates

Translating an existing template could not be easier: just copy the content of the desired template to a new file and translate the text in the header and body of the template. Warning: this is harder than it seems!

See e.g. rapport:::rapport.find('i18n/hu/correlations') which is based on rapport:::rapport.find('correlations'). Check out the colorized diff of the two files!

Fields to be translated in header section (between <!--head and head--> tags):

  • Title
  • Author/Email
  • Description
  • Example (updating the name/path of the template is sufficient)
  • variable information (3rd and 4th block)

In the body of the report (after head--> tag) you may change any text, reorder R code (stuff between <%= and %> tags) or even tweak those (if you know what you are doing).

We suggest placing the translated templates in a directory named to the short, alpha-2 abbreviation of the language (e.g. hu for Hungarian).

Translated (or just altered) templates are highly welcome, so please give the pull request a try or open an issue on GitHub.

Discussion

Please request a feature or file a bug report on issues page on GitHub. Feel free to watch and/or fork our project!

You can also start and/or join the discussion on various topics related to rapport (installation, configuration issues, questions, new templates, translation, etc.) via Google group.

License

This package is released under the Affero General Public License (ver. 3) based on FSF suggestions. Feel free to use our package and/or modify its source code in accordance with the licence conditions.

To cite package rapport in publications, please use:

Blagotić, A. and Daróczi, G. (2013). Rapport: a report templating
  system. R package version 0.51, URL
  http://cran.r-project.org/package=rapport

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {Rapport: a report templating system},
    author = {Aleksandar Blagotić and Gergely Daróczi},
    year = {2013},
    note = {R package version 0.51},
    url = {http://cran.r-project.org/package=rapport},
  }

Thanks

Special thanks to:

  • R Development Core Team for bringing beautiful language for statistical computation to millions
  • John MacFarlane for an awesome “universal document converter”: pandoc
  • Stefan Bogdanović for the inspiring logo and miscellaneous design aids
  • Dave Gamache for his nice “collection of CSS & JS files”: Skeleton
  • Prasad Chalasani for MacOS testing feedback
  • Roman Luštrik, Yihui Xie, Prasad Chalasani, Barry Rowlingson and David Purdy for casting their votes on a poll about the package name

and following authors for their immense effort in development of packages that rapport depends (or depended a while ago) on:

* the list is sorted alphabetically based on the package name

Contact

Gergely Daróczi

Ph.D. student at BCE, Hungary and co-founder of rapporter.net.

Aleksandar Blagotić

Psychology student at Faculty of Philosophy, Niš, Serbia and R/web-developer at rapporter.net.


© 2011-2013 rapport Development Team | License (AGPL3) | Styled with skeleton.