RUV How-To

Introduction

The purpose of this how-to is to get you up and running with RUV as quickly and productively as possible. The focus is therefore on a script, ruv_starter_analysis, that runs several RUV analyses for you and formats the output as a web page. It is a quick and easy way to get a "first look" at your data.

Some brief background: RUV is distributed as a collection of R packages. The packages are: ruv, ruv.extras, ruv.htmllatex, and several data packages. The ruv package contains the core statistical routines. The ruv.extras package includes routines for making plots and example scripts (including ruv_starter_analysis). The ruv.htmllatex package is a dependency of ruv.extras that generates html.

The core statistical routines provided by the ruv package are RUV-2, RUV-4, RUV-inv, RUV-rinv, and a few others. These routines are meant for use by you, the end-user, but they are not the focus of this how-to. For more information on these routines, please consult the standard R package documentation. For more information on the statistical methodology behind these routines, please consult the references at http://www-personal.umich.edu/~johanngb/ruv.

IMPORTANT: The analyses performed by ruv_starter_analysis should not be considered complete, final analyses, but rather as a starting point for further investigation. Moreover, the ruv_starter_analysis script should not be considered part of the "core" of RUV. It may be more properly thought of as a demo of what one can do with the ruv package (albeit an elaborate and particularly useful demo). Thus, you are encouraged to modify this script to fit your own individual needs (see "Going Further" below). Moreover, in future releases of RUV, this script may be modified, perhaps in a way that is not backwards-compatible, perhaps beyond recognition, and may even be omitted entirely. For this reason, ruv.extras is not on CRAN.

Example

The simplest usage of ruv_starter_analysis is as follows:

ruv_starter_analysis(Y, X, ctl)

All that is absolutely required is a data matrix Y, a factor of interest X, and a set of control features (control genes). For example, you could try running:

library(ruv.extras)
library(ruv.data.gender.sm)
data(gender.sm)
ruv_starter_analysis(gender.sm$Y, gender.sm$X, gender.sm$hkctl)

However, for our first example we will make use of a few additional features:

library(ruv.extras)
library(ruv.data.gender.sm)
data(gender.sm)
ruv_starter_analysis(gender.sm$Y, gender.sm$X, gender.sm$hkctl, 
                     pctl=gender.sm$pctl, geneinfo=gender.sm$geneinfo, kset=c(1,5,10),
                     outdir="gender1", webtitle="Gender Example 1")

Here, pctl is a set of positive controls, geneinfo is a matrix including gene names and chromosome numbers, kset is a set of the values of K that we wish to consider, outdir is the directory where the web page will be written, and webtitle is the html title. The webpage output by running the above commands looks like this: Gender Example 1

As you can see, the web page is divided up into 9 sections: General Info, Unadjusted, RUV-2 Combined Analyses, RUV-2 Individual Analyses, RUV-4 Combined Analyses, RUV-4 Individual Analyses, RUV-inv, RUV-rinv, and Projection Plot Table. Each of these sections can be hidden / unhidden by clicking on the section title heading. This makes navigating the web page a bit easier (since it can be quite big), and also allows you to more easily compare different analyses (e.g. by hiding the RUV-2 and RUV-4 analyses, the Unadjusted and RUV-inv analyses will be next to each other). Individual subsections (RLE plots, p-value plots, etc.) can be hidden as well.

Now let's go through and see what each of these sections has to offer.

General Info

The first information is how many samples, features (genes), and control features there are, respectively:

m = 84, n = 12600, n_c = 799

After this there are scree plots:

[Table of scree plots. Columns: Y, Y_c]

On the left is a scree plot in which all features (genes) are considered; on the right is one in which only control features are used. Note that the logs of the eigenvalues are plotted, not the eigenvalues themselves. Note also that it is possible to hide rows / columns of the table by unchecking the boxes. This is not very helpful here, but it can be quite helpful for some of the larger tables later.
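As a rough illustration of what the scree plots show, the following base-R sketch computes log eigenvalues for simulated data. It assumes that the eigenvalues are the squared singular values of Y (and of the control columns Y_c); whether ruv_starter_analysis centers or scales the data first is not covered here. The simulated Y and ctl are made up for the example.

```r
# Sketch only: simulated data, not the gender dataset.
set.seed(1)
m <- 20; n <- 200                        # samples, features
Y <- matrix(rnorm(m * n), m, n)
ctl <- rep(c(TRUE, FALSE), c(50, 150))   # hypothetical negative controls

# Assumption: the eigenvalues are the squared singular values.
log_eig_all <- log(svd(Y)$d^2)
log_eig_ctl <- log(svd(Y[, ctl])$d^2)

# Two panels side by side: all features, then control features only.
op <- par(mfrow = c(1, 2))
plot(log_eig_all, type = "b", xlab = "Index", ylab = "log eigenvalue", main = "Y")
plot(log_eig_ctl, type = "b", xlab = "Index", ylab = "log eigenvalue", main = "Y_c")
par(op)
```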

After the scree plots come RLE plots:

[Table of RLE plots. Columns: Y, Y_c]

Note that the vertical scale of the RLE plots is from -0.5 to 0.5. This default is used throughout, in order to make all of the RLE plots comparable. Usually this default is fine, but in this case it is not, because the (unnormalized) gender dataset has a huge amount of unwanted variation.
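For readers who want to reproduce an RLE plot by hand, here is a minimal base-R sketch. It assumes the usual definition of relative log expression (each gene's log expression minus that gene's median across samples) and uses simulated data; it is not the plotting routine from ruv.extras.

```r
# Sketch only: simulated log-scale data, rows = samples, columns = genes.
set.seed(1)
m <- 10; n <- 500
Y <- matrix(rnorm(m * n, sd = 0.2), m, n)

# Subtract each gene's median across samples (assumed RLE definition).
rle <- sweep(Y, 2, apply(Y, 2, median))

# One box per sample, on the fixed vertical scale described above.
boxplot(t(rle), ylim = c(-0.5, 0.5), outline = FALSE,
        xlab = "Sample", ylab = "RLE")
abline(h = 0, lty = 2)
```

Widening ylim in a hand-made plot like this one is a simple way to inspect datasets whose unwanted variation exceeds the default range.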

After the RLE plots comes a canonical correlation plot.

This plot shows, for each value of K, the square of the first canonical correlation between X and the first K left singular vectors of Y (black) and Y_c (green). If Y were just IID noise, we would expect these plots to form a diagonal line from (0,0) to (m,1). The fact that the curves lie above this line is evidence that either 1) the unwanted variation is systematically correlated with the factor of interest or 2) the negative control features (genes) are not good negative controls (and are in fact influenced by the factor of interest). However, the fact that the green curve stays relatively low when the black curve jumps up (around K = 20) suggests that the negative controls are relatively uninfluenced by X, and therefore probably are in fact good negative controls. Finally, by looking at this plot we can also conclude that we are probably safe to make K as large as 12 or so without worrying about strong correlation between W and X. Of course, we may still wish to make K even larger, if other evidence suggests that this is a good idea.
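The quantity in this plot can be approximated with the cancor function from base R's stats package. The sketch below uses simulated noise for Y and a made-up two-group X; it is an illustration of the idea, not the code that ruv_starter_analysis runs, and cancor's default centering is an assumption about how the intercept is handled.

```r
# Sketch only: simulated data.
set.seed(1)
m <- 30; n <- 300
X <- matrix(rep(0:1, each = m / 2), m, 1)   # hypothetical factor of interest
Y <- matrix(rnorm(m * n), m, n)
W <- svd(Y)$u                               # left singular vectors of Y

# Squared first canonical correlation between X and the first K factors.
Kmax <- 10
cc2 <- sapply(1:Kmax, function(K) {
  cancor(X, W[, 1:K, drop = FALSE])$cor[1]^2
})

plot(1:Kmax, cc2, type = "b", ylim = c(0, 1),
     xlab = "K", ylab = "Squared first canonical correlation")
abline(0, 1 / m, lty = 2)   # reference line from (0,0) towards (m,1)
```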

After the canonical correlation plot comes a table of principal component plots. An SVD is taken of Y (and also Y_c), and the left singular vectors are referred to as factors. We denote this matrix of factors by W_all (or W_ctl in the case of Y_c). In these plots, we plot one factor against another. This is helpful for seeing if there are any clusters or other interesting variation in the data, and also whether the variation in the control features (genes) is representative of the variation in the entire dataset. The factors of Y are plotted on the right, and the factors of Y_c are plotted on the left.

Note from the plots of Factor 1 vs Factor 2 that two clear clusters are visible. These clusters turn out to be the two chip types.

Finally, the General Info section concludes with a table of alpha plots. These are similar to the PC plots, except that instead of plotting the factors (columns of W) against one another, we plot the rows of alpha against one another. Here, alpha is equal to W_all'Y in the plots on the right, and W_ctl' Y in the plots on the left.
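The relationship between the factors and alpha can be sketched in a few lines of base R. The data here are simulated and the choice of K = 3 is arbitrary; only the formula alpha = W_all'Y comes from the text above.

```r
# Sketch only: simulated data, rows = samples, columns = genes.
set.seed(1)
m <- 12; n <- 100; K <- 3
Y <- matrix(rnorm(m * n), m, n)

s <- svd(Y)
W_all <- s$u[, 1:K]        # first K factors (left singular vectors)
alpha <- t(W_all) %*% Y    # K by n; one column per gene

# An "alpha plot": one row of alpha against another, one point per gene.
plot(alpha[1, ], alpha[2, ], col = "gray",
     xlab = "alpha, row 1", ylab = "alpha, row 2")
```

Because the columns of W_all are orthonormal, alpha here equals the first K singular values times the corresponding right singular vectors, so an alpha plot is the gene-side counterpart of a PC plot.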

Note that the negative controls are plotted in green, the positive controls are plotted in purple, and everything else is plotted in gray (this is the default coloring scheme, but can be changed). These plots are useful for seeing if the control features (genes) look representative of the other features, or if there are any outliers or any other notable surprises in the data. In this case, the controls look fairly representative, and there don't appear to be any serious outliers or other surprises.

Unadjusted

The Unadjusted section provides RLE plots, plots of p-value distributions, variance plots, and tables of top-ranked features (genes) for an "unadjusted" analysis in which no RUV methods are applied.

The first plots are RLE plots. These plots differ from the RLE plots in the General Info section only in that here, the data has first been adjusted by known covariates Z, if in fact a Z matrix has been supplied.

Following the RLE plots is a table of p-value plots:

[Table of p-value plots. Columns: standard, ebayes, rsvar, rsvar ebayes, evar]

P-values have been computed in 5 different ways: standard, ebayes (empirical Bayes, i.e. using the methods of limma), rsvar (rescaled variances), rsvar ebayes, and evar (empirical variances). Each of these 5 sets of p-values has been plotted in 3 different ways: first, as a histogram; second, by rank (effectively a qq plot); and third, by rank on a log-log scale.

A second table of p-value plots is then shown, in which only p-values of the negative controls are plotted. Ideally, the histograms will be flat, and the qq plots will be straight lines. The extent to which the histograms are not flat and the extent to which the qq plots are not straight lines give us some indication of how much unwanted variation is present in the data.
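The three views of a set of p-values can be reproduced roughly as follows. This is a base-R sketch using simulated uniform p-values, which is what well-behaved negative controls would ideally produce; the layout and labels are assumptions, not the output of ruv.extras.

```r
# Sketch only: simulated null p-values.
set.seed(1)
p <- runif(1000)
ps <- sort(p)

op <- par(mfrow = c(1, 3))
# 1) Histogram: flat if the p-values are uniform.
hist(p, breaks = 20, main = "Histogram", xlab = "p-value")
# 2) p-value against rank: a straight line if the p-values are uniform.
plot(seq_along(ps), ps, main = "By rank", xlab = "Rank", ylab = "p-value")
# 3) The same plot on a log-log scale, which emphasizes the smallest p-values.
plot(seq_along(ps), ps, log = "xy", main = "By rank (log-log)",
     xlab = "Rank", ylab = "p-value")
par(op)
```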

Following the p-value plots are variance plots.

[Table of variance plots. Columns: Standard Coloring, standard, ebayes, rsvar, rsvar ebayes, evar]

In these plots, the squared betahat values are plotted against the estimates of sigma squared. The axes are transformed to be on a fourth-root scale. In addition, the estimated variance of betahat is plotted in black as a function of the estimate of sigma squared. There are six plots. In the first, the variance of betahat is estimated in the usual way, and the features (genes) are colored as they were in the alpha plots and the p-value plots. In the remaining five plots, the variance of betahat is estimated in several different ways, and the features are colored based on the rank of their p-values. The features with the 15 smallest p-values are colored purple, and their rank is plotted as a number. The next 25 most highly ranked features are plotted in blue; the next 35 in cyan; the next 75 in orange; and the next 150 in brown. These plots help us to visualize the consequences of the different methods of estimating the variance of betahat. Also, these plots may help us identify any features that have been "suspiciously" labelled as differentially expressed. For example, if a feature has a very small p-value, but is also an outlier with respect to its estimated value of sigma squared, we might be suspicious that the feature is not truly differentially expressed, but rather that its variance of betahat has simply been poorly estimated, or that the feature is influenced by unwanted variation that has not been properly adjusted for.
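The rank-based coloring described above can be expressed with cut. The following sketch uses simulated p-values; the bin sizes (15, 25, 35, 75, 150) come from the text, while the gray color for all remaining features is an assumption made for the example.

```r
# Sketch only: simulated p-values.
set.seed(1)
p <- runif(1000)
r <- rank(p, ties.method = "first")

# Cumulative rank boundaries: 15, 15+25, 15+25+35, and so on.
bins <- cut(r, breaks = c(0, 15, 40, 75, 150, 300, Inf),
            labels = c("purple", "blue", "cyan", "orange", "brown", "gray"))
rank_col <- as.character(bins)   # "gray" for the rest is an assumption

table(rank_col)
```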

After the variance plots is a set of tables showing how many positive controls are found in the top N most highly-ranked features (genes). By default, values of N = 20, 40, 60, 80, and 100 are shown, but this can be changed using the topcount_thresholds argument. (Note: If no positive controls are specified, this section will not exist.)

method	20	40	60	80	100
standard	7	7	7	8	10
ebayes	6	7	7	7	11
rsvar	7	7	7	8	10
rsvar ebayes	6	7	7	7	11
evar	7	7	9	10	10

We see, for example, that when we calculate p-values the standard way, of the twenty genes with the smallest p-values, 7 are positive controls.
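Counts of this kind are easy to compute by hand. The helper below, count_top_pctl, is a hypothetical name introduced for this sketch (it is not part of ruv or ruv.extras); it takes a vector of p-values and a logical vector of positive controls, and the toy inputs are made up.

```r
# Hypothetical helper: number of positive controls among the N smallest
# p-values, for each threshold N.
count_top_pctl <- function(p, pctl, thresholds = c(20, 40, 60, 80, 100)) {
  ord <- order(p)
  counts <- sapply(thresholds, function(N) {
    sum(pctl[ord[seq_len(min(N, length(p)))]])
  })
  names(counts) <- thresholds
  counts
}

# Toy example with six features, three of which are positive controls.
p    <- c(0.001, 0.5, 0.02, 0.9, 0.03, 0.7)
pctl <- c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE)
count_top_pctl(p, pctl, thresholds = c(2, 4))
```

In the toy example both of the two top-ranked features are positive controls, and no further positive controls appear among the top four.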

Finally, the Unadjusted section concludes with tables of the top N most highly ranked features (genes). By default N is 40, but this can be changed using the topN argument. Tables are provided for each of the 5 methods of computing p-values.

The five tables appear below in the order: standard, ebayes, rsvar, rsvar ebayes, evar.

rank	p	BH	beta	chrom	sym	index
1	0.001	1	1.2	Y	RPS4Y1	11296
2	0.02	1	0.82	Y	DDX3Y	8410
3	0.07	1	-0.66	11	HBB	2045
4	0.1	1	-0.58	NA	NA	1513
5	0.2	1	-0.55	11	HBB	1676
6	0.3	1	0.38	19	CIRBP	9932
7	0.3	1	0.36	Y	KDM5D	7630
8	0.4	1	0.33	Y	USP9Y	5916
9	0.4	1	0.34	2	GAD1	7226
10	0.4	1	0.28	X Y	SLC25A6	10509
11	0.4	1	0.35	12	GAPDH	12563
12	0.4	1	-0.3	X	XIST	8501
13	0.4	1	0.32	16	CRYM	8339
14	0.4	1	0.29	3	PFN2	8898
15	0.5	1	0.23	14	RPL36AL	9924
16	0.5	1	0.26	Y	UTY	4494
17	0.5	1	0.21	4	SLC25A4	2823
18	0.5	1	0.21	NA	NA	3856
19	0.5	1	0.23	6	FBXO9	9050
20	0.5	1	0.26	20	SNAP25	8539
21	0.5	1	0.22	2	RPL31	3685
22	0.5	1	0.3	7	ACTB	12559
23	0.5	1	0.24	3	RHOA	7354
24	0.5	1	0.25	5	RPS23	4820
25	0.5	1	0.3	14	CALM1	11225
26	0.5	1	0.24	6	RPL10A	6826
27	0.5	1	-0.26	4	SPP1	4358
28	0.5	1	-0.25	17	GFAP	10256
29	0.5	1	0.23	12	UBC	395
30	0.5	1	0.23	20	CHGB	3433
31	0.5	1	0.27	3	GAP43	7763
32	0.5	1	0.25	20	RPS21	2744
33	0.5	1	0.24	1	NMNAT2	2083
34	0.5	1	0.23	14	CALM1	11369
35	0.5	1	0.23	12	UBC	2330
36	0.5	1	0.24	12	RAN	912
37	0.5	1	0.28	17	PRKAR1A	1207
38	0.5	1	0.22	7	CYCS	5849
39	0.5	1	0.23	12	ATP2A2	9858
40	0.5	1	0.28	12	GAPDH	12565

rank	p	BH	beta	chrom	sym	index
1	6e-04	1	1.2	Y	RPS4Y1	11296
2	0.02	1	0.82	Y	DDX3Y	8410
3	0.06	1	-0.66	11	HBB	2045
4	0.1	1	-0.58	NA	NA	1513
5	0.1	1	-0.55	11	HBB	1676
6	0.3	1	0.38	19	CIRBP	9932
7	0.3	1	0.36	Y	KDM5D	7630
8	0.3	1	0.35	12	GAPDH	12563
9	0.3	1	0.34	2	GAD1	7226
10	0.3	1	0.33	Y	USP9Y	5916
11	0.4	1	0.32	16	CRYM	8339
12	0.4	1	-0.3	X	XIST	8501
13	0.4	1	0.3	14	CALM1	11225
14	0.4	1	0.3	7	ACTB	12559
15	0.4	1	0.29	3	PFN2	8898
16	0.4	1	0.28	17	PRKAR1A	1207
17	0.4	1	0.28	7	ACTB	12557
18	0.4	1	0.28	X Y	SLC25A6	10509
19	0.4	1	0.28	12	GAPDH	12565
20	0.4	1	0.28	20	GNAS	7496
21	0.4	1	0.27	3	GAP43	7763
22	0.5	1	0.26	20	GNAS	7495
23	0.5	1	-0.26	4	SPP1	4358
24	0.5	1	0.26	20	SNAP25	8539
25	0.5	1	0.26	Y	UTY	4494
26	0.5	1	0.25	19	SLC17A7	6605
27	0.5	1	0.25	5	RPS23	4820
28	0.5	1	0.25	20	RPS21	2744
29	0.5	1	0.25	6	HSP90AB1	178
30	0.5	1	0.25	18	ATP5A1	10166
31	0.5	1	-0.25	17	GFAP	10256
32	0.5	1	0.24	3	RHOA	7354
33	0.5	1	0.24	8	YWHAZ	256
34	0.5	1	0.24	6	RPL10A	6826
35	0.5	1	0.24	13	DCLK1	9017
36	0.5	1	0.24	12	RAN	912
37	0.5	1	0.24	1	NMNAT2	2083
38	0.5	1	0.23	12	UBC	395
39	0.5	1	0.23	22	ATXN10	9752
40	0.5	1	0.23	12	ATP2A2	9858

rank	p	BH	beta	chrom	sym	index
1	3e-24	3e-20	1.2	Y	RPS4Y1	11296
2	3e-16	2e-12	0.82	Y	DDX3Y	8410
3	6e-12	2e-08	-0.66	11	HBB	2045
4	1e-09	4e-06	-0.58	NA	NA	1513
5	2e-08	6e-05	-0.55	11	HBB	1676
6	1e-05	0.03	0.38	19	CIRBP	9932
7	2e-05	0.04	0.36	Y	KDM5D	7630
8	2e-04	0.2	0.33	Y	USP9Y	5916
9	3e-04	0.4	0.34	2	GAD1	7226
10	4e-04	0.6	0.28	X Y	SLC25A6	10509
11	5e-04	0.6	0.35	12	GAPDH	12563
12	5e-04	0.6	-0.3	X	XIST	8501
13	7e-04	0.7	0.32	16	CRYM	8339
14	0.001	1	0.29	3	PFN2	8898
15	0.002	1	0.23	14	RPL36AL	9924
16	0.002	1	0.26	Y	UTY	4494
17	0.002	1	0.21	4	SLC25A4	2823
18	0.003	1	0.21	NA	NA	3856
19	0.003	1	0.23	6	FBXO9	9050
20	0.003	1	0.26	20	SNAP25	8539
21	0.003	1	0.22	2	RPL31	3685
22	0.003	1	0.3	7	ACTB	12559
23	0.004	1	0.24	3	RHOA	7354
24	0.004	1	0.25	5	RPS23	4820
25	0.004	1	0.3	14	CALM1	11225
26	0.004	1	0.24	6	RPL10A	6826
27	0.004	1	-0.26	4	SPP1	4358
28	0.004	1	-0.25	17	GFAP	10256
29	0.005	1	0.23	12	UBC	395
30	0.005	1	0.23	20	CHGB	3433
31	0.005	1	0.27	3	GAP43	7763
32	0.006	1	0.25	20	RPS21	2744
33	0.006	1	0.24	1	NMNAT2	2083
34	0.006	1	0.23	14	CALM1	11369
35	0.006	1	0.23	12	UBC	2330
36	0.006	1	0.24	12	RAN	912
37	0.007	1	0.28	17	PRKAR1A	1207
38	0.007	1	0.22	7	CYCS	5849
39	0.007	1	0.23	12	ATP2A2	9858
40	0.008	1	0.28	12	GAPDH	12565

rank	p	BH	beta	chrom	sym	index
1	1e-45	1e-41	1.2	Y	RPS4Y1	11296
2	9e-22	6e-18	0.82	Y	DDX3Y	8410
3	1e-14	5e-11	-0.66	11	HBB	2045
4	6e-12	2e-08	-0.58	NA	NA	1513
5	7e-11	2e-07	-0.55	11	HBB	1676
6	8e-06	0.02	0.38	19	CIRBP	9932
7	2e-05	0.04	0.36	Y	KDM5D	7630
8	5e-05	0.07	0.35	12	GAPDH	12563
9	7e-05	0.1	0.34	2	GAD1	7226
10	9e-05	0.1	0.33	Y	USP9Y	5916
11	1e-04	0.2	0.32	16	CRYM	8339
12	4e-04	0.4	-0.3	X	XIST	8501
13	5e-04	0.4	0.3	14	CALM1	11225
14	5e-04	0.4	0.3	7	ACTB	12559
15	7e-04	0.6	0.29	3	PFN2	8898
16	9e-04	0.6	0.28	17	PRKAR1A	1207
17	0.001	0.6	0.28	7	ACTB	12557
18	0.001	0.6	0.28	X Y	SLC25A6	10509
19	0.001	0.6	0.28	12	GAPDH	12565
20	0.001	0.7	0.28	20	GNAS	7496
21	0.001	0.8	0.27	3	GAP43	7763
22	0.002	1	0.26	20	GNAS	7495
23	0.002	1	-0.26	4	SPP1	4358
24	0.002	1	0.26	20	SNAP25	8539
25	0.003	1	0.26	Y	UTY	4494
26	0.003	1	0.25	19	SLC17A7	6605
27	0.003	1	0.25	5	RPS23	4820
28	0.004	1	0.25	20	RPS21	2744
29	0.004	1	0.25	6	HSP90AB1	178
30	0.004	1	0.25	18	ATP5A1	10166
31	0.004	1	-0.25	17	GFAP	10256
32	0.004	1	0.24	3	RHOA	7354
33	0.005	1	0.24	8	YWHAZ	256
34	0.005	1	0.24	6	RPL10A	6826
35	0.005	1	0.24	13	DCLK1	9017
36	0.005	1	0.24	12	RAN	912
37	0.006	1	0.24	1	NMNAT2	2083
38	0.006	1	0.23	12	UBC	395
39	0.006	1	0.23	22	ATXN10	9752
40	0.006	1	0.23	12	ATP2A2	9858

rank	p	BH	beta	chrom	sym	index
1	5e-276	6e-272	1.2	Y	RPS4Y1	11296
2	8e-128	5e-124	0.82	Y	DDX3Y	8410
3	1e-83	4e-80	-0.66	11	HBB	2045
4	3e-29	9e-26	0.38	19	CIRBP	9932
5	2e-26	5e-23	0.36	Y	KDM5D	7630
6	3e-25	6e-22	-0.58	NA	NA	1513
7	6e-21	1e-17	0.33	Y	USP9Y	5916
8	7e-19	1e-15	-0.3	X	XIST	8501
9	1e-16	2e-13	0.28	X Y	SLC25A6	10509
10	5e-14	6e-11	0.26	Y	UTY	4494
11	3e-13	3e-10	0.26	20	SNAP25	8539
12	5e-13	5e-10	-0.25	17	GFAP	10256
13	7e-13	7e-10	0.24	3	RHOA	7354
14	2e-12	2e-09	0.25	5	RPS23	4820
15	2e-12	2e-09	0.24	6	RPL10A	6826
16	3e-12	3e-09	0.24	1	NMNAT2	2083
17	4e-12	3e-09	0.23	12	UBC	395
18	6e-12	4e-09	0.23	12	UBC	2330
19	1e-11	9e-09	0.23	20	CHGB	3433
20	1e-11	9e-09	0.23	14	RPL36AL	9924
21	2e-11	1e-08	0.23	6	FBXO9	9050
22	3e-11	2e-08	0.23	14	CALM1	11369
23	5e-11	3e-08	0.22	2	RPL31	3685
24	6e-11	3e-08	0.22	7	CYCS	5849
25	1e-10	6e-08	0.22	NA	NA	1450
26	3e-10	1e-07	0.21	1	GABRD	5121
27	4e-10	2e-07	0.21	NA	NA	3856
28	5e-10	2e-07	0.21	2	NCL	2588
29	7e-10	3e-07	0.21	4	SLC25A4	2823
30	7e-10	3e-07	-0.55	11	HBB	1676
31	4e-09	1e-06	0.2	17	ATP5H	5790
32	6e-09	3e-06	0.2	2	ATP5G3	4832
33	8e-09	3e-06	0.2	NA	NA	11297
34	9e-09	3e-06	0.19	19	GPX4	3943
35	1e-08	4e-06	0.19	10	ATP5C1	10186
36	2e-08	8e-06	0.19	19	CA11	4292
37	3e-08	1e-05	0.19	6	EEF1A1	10965
38	3e-08	1e-05	0.19	12	HNRNPA1	10283
39	4e-08	1e-05	0.19	6	ATP6V1G2	3036
40	5e-08	1e-05	0.19	5	HINT1	10

You may choose to hide some of the tables to make side-by-side comparisons of two tables. Each table includes a p-value, an FDR-adjusted p-value (BH), and any information included in the "geneinfo" matrix. The index column tells us the index of the feature (gene).
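A table with the same columns can be assembled in base R. The sketch below uses simulated p-values and a made-up geneinfo matrix; the BH column is computed with p.adjust, which is assumed (not confirmed by this how-to) to match the adjustment used by the script. The rounding to one significant digit mimics the tables above.

```r
# Sketch only: simulated p-values and hypothetical gene annotation.
set.seed(1)
n <- 50
p <- runif(n)^3
geneinfo <- cbind(chrom = sample(c("1", "X", "Y"), n, replace = TRUE),
                  sym   = paste0("GENE", seq_len(n)))

topN <- 5
bh   <- p.adjust(p, method = "BH")   # Benjamini-Hochberg adjustment
ord  <- order(p)[seq_len(topN)]      # indices of the topN smallest p-values

top <- data.frame(rank = seq_len(topN),
                  p    = signif(p[ord], 1),
                  BH   = signif(bh[ord], 1),
                  geneinfo[ord, , drop = FALSE],
                  index = ord)
print(top, row.names = FALSE)
```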

If you click on an entry of the table, you will get the results of a Google search for that entry. This is useful for quickly looking up the genes.

RUV-2 Combined Analyses

The RUV-2 Combined Analyses section is intended to help you choose a good value of K. This section includes two tables of plots. The first shows the number of top-ranked positive controls as a function of K:

[Table of positive control count plots. Columns: standard, ebayes, rsvar, rsvar ebayes, evar]

Of course, if no positive controls are supplied, this table won't exist.

The second table of plots (not shown) includes RLE plots, a projection plot, and an extensive variety of p-value plots for each value of K. These plots are all duplicated in the RUV-2 Individual Analyses section (below), but are included here in one giant table so that an easy comparison between different values of K can be made. It will be very helpful to hide various rows / columns when viewing this table.

RUV-2 Individual Analyses

There is an individual analysis for each value of K. Each individual analysis is similar to the Unadjusted analysis, except that it also contains a table of projection plots:

[Table of projection plots. Columns: Standard Coloring, standard, ebayes, rsvar, rsvar ebayes, evar]

The layout of these plots is analogous to that of the variance plots. In the first plot the features (genes) are colored as they were in the alpha plots and p-value plots. In the remaining 5 plots, the features are colored according to their rank, just as with the variance plots.

RUV-4 Combined and Individual Analyses

These are just like the RUV-2 Combined and Individual Analyses.

RUV-inv and RUV-rinv

These are just like the RUV-2 / RUV-4 individual analyses, except that there are no RLE plots.

Projection Plot Table

TODO: Describe Projection plot table

Additional Options

The full list of arguments to ruv_starter_analysis is as follows:

Y
The data. An m by n matrix, where m is the number of samples and n is the number of features.

X
The factor of interest. An m by 1 matrix, where m is the number of samples.

ctl
The negative controls. A logical vector of length n.

Z
Any additional covariates to include in the model. Either an m by q matrix of covariates, or simply 1 (the default) for an intercept term.

eta
Gene-wise (as opposed to sample-wise) covariates. These covariates are adjusted for by RUV-1 before any further analysis proceeds. A matrix with n columns.

pctl
Positive controls. A logical vector of length n.

genecoloring
A vector of length n. The colors to use when plotting genes.

samplecoloring
A vector of length m. The colors to use when plotting samples.

genetexts
A vector of length n. Any text to be used in place of symbols, when plotting genes. Elements that are NA are plotted as symbols.

sampletexts
A vector of length m. Any text to be used in place of symbols, when plotting samples. Elements that are NA are plotted as symbols.

genesymbols
A vector of length n. The plot symbols to use when plotting genes.

samplesymbols
A vector of length m. The plot symbols to use when plotting samples.

geneinfo
A matrix with n rows. Each column should contain some information about the genes (such as their names) for use in tables.

rankbybeta
Should the analysis include a ranking of the features based on the absolute value of estimated effect size (betahat)?

topN
The number of top-ranked genes to include in tables.

topcount_thresholds
The thresholds to use when counting the number of top-ranked positive controls.

rankset
The genes to be considered when determining which are top-ranked. A logical vector. NULL implies all genes.

kset
Which values of K should be considered.

factorset
Which factors should be included in the projection plot table.

bin
The bin size in the method of empirical variances.

do_general
Should the "general" analysis be performed?

do_unadjusted
Should the "unadjusted" analysis be performed?

do_ruv2
Should the RUV-2 analysis be performed?

do_ruv4
Should the RUV-4 analysis be performed?

do_ruvinv
Should the RUV-inv analysis be performed?

do_ruvrinv
Should the RUV-rinv analysis be performed?

do_pptable
Should the factor projection plot table be created?

outdir
Directory where the web page should be written.

initialize_collapsed
Should the web page be created so that only headers are shown, and must be manually expanded?

webtitle
The title of the web page.

inputcheck
Perform a basic sanity check on the inputs, and issue a warning if there is a problem.

verbose
Verbose output.
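The shape requirements listed above can be checked by hand before calling the script. The function check_ruv_inputs below is a hypothetical helper written for this how-to; it is not the check performed by the inputcheck argument, whose details are not described here.

```r
# Hypothetical helper: verify the basic shapes of Y, X, ctl and pctl.
check_ruv_inputs <- function(Y, X, ctl, pctl = NULL) {
  m <- nrow(Y); n <- ncol(Y)
  stopifnot(is.matrix(Y),
            is.matrix(X), nrow(X) == m, ncol(X) == 1,
            is.logical(ctl), length(ctl) == n, any(ctl))
  if (!is.null(pctl)) stopifnot(is.logical(pctl), length(pctl) == n)
  invisible(TRUE)
}

# Simulated inputs with the expected shapes.
set.seed(1)
m <- 6; n <- 40
Y   <- matrix(rnorm(m * n), m, n)
X   <- matrix(rep(0:1, each = 3), m, 1)
ctl <- rep(c(TRUE, FALSE), c(10, 30))
check_ruv_inputs(Y, X, ctl)
```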

A few of these arguments warrant further comment.

First note that X must consist of only a single column. Although RUV-2, RUV-4, etc. support an X with more than one column, ruv_starter_analysis does not. This is because many of the plots (e.g. p-value histograms, projection plots) only make sense in the context of a single-column X. If you have several factors of interest, the easiest way to handle this situation is to run ruv_starter_analysis several times, each time setting X to be just one of the factors of interest. If desired, the remaining factors of interest can be included in the model by including them in the Z matrix.
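One way to organize the repeated runs is to build an X and Z pair for each factor in turn. The sketch below uses a made-up two-column matrix of factors. It assumes that an explicit column of ones must be included when Z is supplied as a matrix; check the ruv package documentation before relying on this. The call to ruv_starter_analysis is left as a comment, with hypothetical Y, ctl and outdir values.

```r
# Sketch only: two hypothetical factors of interest for m = 8 samples.
factors <- cbind(treatment = rep(0:1, each = 4),
                 batch     = rep(0:1, times = 4))

# For each factor: X is that single column, Z holds an intercept
# (assumption, see above) plus the remaining factors.
runs <- lapply(seq_len(ncol(factors)), function(j) {
  list(X = factors[, j, drop = FALSE],
       Z = cbind(1, factors[, -j, drop = FALSE]))
})
names(runs) <- colnames(factors)

# Each pair could then be passed on, for example:
# ruv_starter_analysis(Y, runs$treatment$X, ctl, Z = runs$treatment$Z,
#                      outdir = "treatment")
```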

The do_unadjusted, do_ruv2, do_ruv4, do_ruvinv, do_ruvrinv, and do_pptable arguments can be set to FALSE in order to omit these sections of the analysis and speed up execution.

The genecoloring and samplecoloring arguments can be used to specify the colors used in the plots. If there are any NAs in the coloring vector, those samples / features will be plotted in light gray. If a coloring vector is not specified at all, by default, all features are plotted in light gray except for negative controls, which are plotted in green, and positive controls (if they are given), which are plotted in purple. NOTE: The plots are done in a special way, so that points of "rare" colors are plotted last, to ensure they are visible. So, for example, if there are 10,000 gray points, 1,000 green points, and 100 purple points, all 10,000 gray points will be plotted first, then all 1,000 green points, and finally all 100 purple points.
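The "rare colors last" behavior can be imitated in your own plots by ordering the points by how often their color occurs. This is a guess at one possible implementation, written in base R with made-up data; it is not taken from the ruv.extras source.

```r
# Sketch only: ten points, three colors of different frequency.
set.seed(1)
cols <- sample(c(rep("gray", 6), rep("green", 3), "purple"))
x <- rnorm(10); y <- rnorm(10)

# Count how often each color occurs, then draw frequent colors first
# so that points in rare colors end up on top.
freq <- table(cols)
ord  <- order(-as.vector(freq[cols]))

plot(x[ord], y[ord], col = cols[ord], pch = 16, xlab = "x", ylab = "y")
```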

The genesymbols and samplesymbols arguments can be used to specify the symbols used in the plots. If there are any NAs in the symbol vector, those samples / features will be plotted by default as a circle.

The genetexts and sampletexts arguments can be used to specify text that should be plotted instead of a symbol. If there are any NAs in the text vector, those samples / features will be plotted by a symbol instead.

The initialize_collapsed argument can be used to create the web page so that all of the plots / tables are initially hidden. This is particularly useful if the web page will actually be posted on a web server and viewed over the internet, since the page can then load much more quickly.

To see some of these features in action, consider a second example using the gender data:

library(ruv.extras)
library(ruv.data.gender.sm)
data(gender.sm)
genetexts = rep(NA,ncol(gender.sm$Y))
ygenes = which(gender.sm$geneinfo[,1]=="Y")
genetexts[ygenes] = gender.sm$geneinfo[ygenes,2]
ruv_starter_analysis(gender.sm$Y, gender.sm$X, gender.sm$hkctl, 
                     pctl=gender.sm$pctl, geneinfo=gender.sm$geneinfo, kset=c(1,10),
                     genecoloring = gender.sm$genecoloring, samplecoloring=gender.sm$samplecoloring,
                     samplesymbols = gender.sm$X + 1,
                     genetexts = genetexts,
                     do_unadjusted = FALSE, do_ruv2 = FALSE, do_ruvinv = FALSE, do_ruvrinv = FALSE,
                     outdir="gender2", webtitle="Gender Example 2")

The output looks like this: Gender Example 2

In this example, samples are colored by lab / chiptype: Red -- site A, HG-U95A; yellow -- site A, HG-U95Av2; black -- site B, HG-U95A; gray -- site B, HG-U95Av2; cyan -- site C, HG-U95Av2. Males are plotted as triangles, and females are plotted as circles (see PC Plots). Genes are colored as follows: Green -- negative controls; pink -- on X chromosome; blue -- on Y chromosome; purple -- on X and Y chromosomes; gray -- everything else. Moreover, genes from the Y chromosome are plotted using their gene names, instead of the standard circle symbol.

Going Further

Eventually, you will probably want to modify the plots in various ways, generate plots of your own, or simply want to know in more detail what ruv_starter_analysis does. To help you, there are 4 files in the my_ruv sub-folder of this how-to:

my_ruv.R

my_ruv_simpler.R

my_ruv_simplest.R

my_ruv_plots_and_tables.R

The file my_ruv_plots_and_tables.R contains all of the plot routines in the ruv.extras package, but the names of the routines have been given the prefix "my_". For example, "ruv_scree" is renamed "my_ruv_scree." Therefore, you can easily edit the routines in any way you wish, source the file, and then use your version of the plot routines simply by adding the prefix "my_" in any of the code that calls the routines.

The file my_ruv.R is similar in nature. This file contains the script my_ruv_starter_analysis (and supporting subscripts). The only difference is the prefix "my_". Thus, if you source the files my_ruv_plots_and_tables.R and my_ruv.R you will have all of the functionality of the ruv.extras package, except that all the routines now have a "my_" prefix. Of course, you can now edit these files however you like.

my_ruv.R is a rather complicated file, especially when you first look at it. Thus, before tackling this file, it is recommended that you first examine the file my_ruv_simplest.R. This script also contains a version of my_ruv_starter_analysis, but it is greatly simplified. This version does not create a web page. Instead, it simply outputs text and plots to the screen. Also, some of the less important options have been omitted. This script is great for understanding what my_ruv_starter_analysis does. It is also a great script to edit when you wish to create your own analysis.

Finally, my_ruv_simpler.R is somewhere in between. This version does not create a web page, but it does at least save the plots to disk. Once you have an understanding of my_ruv_simplest.R you may wish to examine this file, either as a next step in understanding my_ruv.R, or simply as a convenient way to save any plots you create to disk.
