PHS7045-labs/Homework02-FunctionsProfilingRcpp.qmd at master · EricAnto0/PHS7045-labs · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
---
title: "Homework 02 - Functions, Profiling, and Rcpp"
format: gfm
---

# Due date

Tuesday, February 28

# Background

For this assignment, you'll be quested with speeding up some code using what
you have learned about vectorization and Rcpp.

## Part 1: Vectorizing code

The following functions can be written to be more efficient without using
parallel computing:

1. This function generates a `n x k` dataset with all its entries distributed
Poisson with mean `lambda`.

```{r}
#| label: p1-fun1
fun1 <- function(n = 100, k = 4, lambda = 4) {
  x <- NULL

  for (i in 1:n)
    x <- rbind(x, rpois(k, lambda))

  return(x)
}

fun1alt <- function(n = 100, k = 4, lambda = 4) {
  #sapply(1:k, function(g)rpois(n, lambda))
  matrix(rpois(k*n, lambda), nrow = n)
}

# Benchmarking
bench::mark(
  fun1(),
  fun1alt(), relative = TRUE, check = FALSE
)
```

2. Like before, speed up the following functions (it is OK to use StackOverflow)

```{r}
#| label: p1-fun2
# Total row sums
fun1 <- function(mat) {
  n <- nrow(mat)
  ans <- double(n)
  for (i in 1:n) {
    ans[i] <- sum(mat[i, ])
  }
  ans
}

fun1alt <- function(mat){
  rowSums(mat)
}

# Cumulative sum by row
fun2 <- function(mat){
  n <- nrow(mat)
  k <- ncol(mat)
  ans <- mat
  for (i in 1:n) {
    for (j in 2:k) {
      ans[i,j] <- mat[i, j] + ans[i, j - 1]
    }
  }
  ans
}

fun2alt <- function(mat) {
  #t(cumsum(data.frame(t(dat))))
  t(apply(mat, 1, cumsum))
  #t(sapply(1:nrow(mat), function(k)cumsum(mat[k, ])))
}

# Use the data with this code
set.seed(2315)
dat <- matrix(rnorm(200 * 100), nrow = 200)
# Test for the first
bench::mark(
  fun1(dat),
  fun1alt(dat), relative = TRUE
)

# Test for the second
bench::mark(
  fun2(dat),
  fun2alt(dat), relative = TRUE#, check = FALSE
)
```

3.  Find the column max (hint: Check out the function `max.col()`).

```{r}
#| label: p1-fun3
# Find each column's max value
fun2 <- function(x){
  apply(x, 2, max)
}

fun2alt <- function(x){
  matrixStats::colMaxs(x)
  #diag(t(x)[, max.col(t(x))])
}
# Data Generating Process (10 x 10,000 matrix)
set.seed(1234)
x <- matrix(rnorm(1e4), nrow=10)

# Benchmarking
bench::mark(
  fun2(x),
  fun2alt(x), relative = TRUE, check = F
)
```

## Part 2: Rcpp code

As we saw in the Rcpp week, vectorization may not be the best solution. For this
part, you must write a function using Rcpp that implements the propensity score
matching algorithm. You can use [Week 5's lab](https://github.com/UofUEpiBio/PHS7045-advanced-programming/issues/8#issuecomment-1424974938) as a starting point for the problem. Your C++ file
should look something like the following:


In this C++ function, a control (i.e., untreated) is matched with more than one treated individual such that each treated individual is matched to only one control,
```{Rcpp}
#include<Rcpp.h>

using namespace Rcpp;
// [[Rcpp::export]]
List psmatch(
  NumericVector pscores,
  LogicalVector is_treated
)
{
  /*... setup the problem creating the output...*/
 int n = static_cast<int>(is_treated.size());
  IntegerVector indices(n);
  NumericVector values(n);
    for (int i = 0; i < n; ++i) {
    //  Treated (is_treated == true) must be matched to control (is_treated == false)
    if(is_treated[i] == true){
    // Maximum value
    double cur_best = std::numeric_limits< double >::max();
    int cur_i = 0;

    for (int j = 0; j < n; ++j) {

        // grab treated == false and compare
        if (is_treated[j] == false){

        // If it is lower, then update
        if (std::abs(pscores[i] - pscores[j]) < cur_best) {

          cur_best = std::abs(pscores[i] - pscores[j]);
          cur_i    = j;

        }
      }

    }

    // register the result
    indices[i] = cur_i;
    values[i]  = pscores[cur_i];
    }
  }

  // Returning
  return List::create(
   _["match_id"] = indices + 1, // We add one to match R's indices
    _["match_x"]  = values
  );

}
```
## Example
```{r}
set.seed(1001)
pscore = runif(20);
trt = rbinom(20, 1, 0.5)
res <- psmatch(pscore, trt)
cbind(pscore, trt, id = res$match_id, untreatedpscore = res$match_x)
```