tags= #r_code #ggplot2 #data_science
library(tidyverse)
mydata <- read.delim("my_data.txt", header=TRUE, na.strings="NA")
Stack Overflow - Importing Multiple CSV files in R
temp <- list.files(pattern = "\\.csv$")  # pattern is a regular expression, not a glob
for (i in seq_along(temp)) assign(temp[i], read.csv(temp[i]))  # one object per file, named after the file
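A possible alternative to assign() in a loop is to read every file into one named list. This is only a sketch: the directory csv_demo and the files a.csv and b.csv are made up for illustration and are written to a temporary directory first.

```r
# Hypothetical demo files, written to a temporary directory for illustration only
csv_dir <- file.path(tempdir(), "csv_demo")
dir.create(csv_dir, showWarnings = FALSE)
write.csv(data.frame(x = 1:3), file.path(csv_dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(x = 4:6), file.path(csv_dir, "b.csv"), row.names = FALSE)

# list.files() takes a regular expression, so anchor on the ".csv" extension
files <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)

# Read each file and name the list elements after the files (without extension)
csv_list <- lapply(files, read.csv)
names(csv_list) <- tools::file_path_sans_ext(basename(files))

str(csv_list)
```

Keeping the data frames in a list avoids cluttering the workspace and makes it easy to loop over them later with lapply().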
# tab-delimited text file
write.table(my_data, "my_data_output", sep="\t")
Remove an entire column from a data.frame in R
Remove a column
> Data$genome <- NULL
> head(Data)
chr region
1 chr1 CDS
2 chr1 exon
3 chr1 CDS
4 chr1 exon
5 chr1 CDS
6 chr1 exon
Remove multiple columns
Data[1:2] <- list(NULL) # Marek
Data[1:2] <- NULL # does not work!
Adding a New Column to a Data Frame
df["MY_NEW_COLUMN"] <- NA # That creates the new column named "MY_NEW_COLUMN" filled with "NA"
df[c("E","F","G","H","I")] <- NA #create multiple empty columns
When you want to put an observation into a group (e.g., Case vs. Control), you need to create a new variable. A simple way to do this is to create one or more vectors that contain the group members and then use mutate() with if_else().
In this example, I created a group called “Domestic” for some cars based on their manufacturer variable.
domestic = c("chevrolet", "dodge", "ford", "jeep", "lincoln", "mercury", "pontiac")
mutate(mpg, Location = if_else(manufacturer %in% domestic, "Domestic", "Foreign"))
data <- merge(d1, d2, by="Patient.ID")
Adding multiple columns simultaneously with different values in R -SO
library('tidyverse')
df <- head(cars)
mutate(df, Subject = 'F1', Slide = '1A')
reshape::merge_all(your_list_with_dfs, ...)
pc$Group <- factor(pc$Group, levels = c('Control','Case'),ordered = TRUE)
Purpose: When you want to take numerical values and bin them into categories.
Use cases:
With cut
df$category <- cut(df$a, breaks=c(-Inf, 0.5, 0.6, Inf), labels=c("low","middle","high"))
With dplyr
res <- df %>% mutate(category=cut(a, breaks=c(-Inf, 0.5, 0.6, Inf), labels=c("low","middle","high")))
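A small worked example may help show where values that fall exactly on a break end up. The values in column a are made up for illustration; cut() closes intervals on the right by default, and right = FALSE flips that.

```r
# Made-up values chosen to sit on and around the break points
df <- data.frame(a = c(0.2, 0.5, 0.55, 0.6, 0.9))

# Intervals are right-closed by default: (-Inf, 0.5], (0.5, 0.6], (0.6, Inf]
df$category <- cut(df$a,
                   breaks = c(-Inf, 0.5, 0.6, Inf),
                   labels = c("low", "middle", "high"))

# Pass right = FALSE to close the intervals on the left instead
df$category_left <- cut(df$a,
                        breaks = c(-Inf, 0.5, 0.6, Inf),
                        labels = c("low", "middle", "high"),
                        right = FALSE)
df
```

With the default, 0.5 is binned as "low" and 0.6 as "middle"; with right = FALSE they move up to "middle" and "high".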
select() from dplyr
“The select() function can be used to select columns of a data frame that you want to focus on. Often you’ll have a large data frame containing “all” of the data, but any given analysis might only use a subset of variables or observations. The select() function allows you to get the few columns you might need.” - Excerpt From: Roger D. Peng. “R Programming for Data Science.” iBooks.
> names(chicago)[1:3]
[1] "city" "tmpd" "dptp"
> subset <- select(chicago, city:dptp)
> head(subset)
city tmpd dptp
1 chic 31.5 31.500
2 chic 33.0 29.875
3 chic 33.0 27.375
4 chic 29.0 28.625
5 chic 32.0 28.875
6 chic 40.0 35.125
You get some output from the function and you want it to be stored in the global environment so it can be used elsewhere.
From SO: assign() or <<- can both be used.
You could use assign:
assign("v","hi",envir = globalenv())
This requires that you have the name of the target global variable as a string, but it can be easy to do this even with a vector of dozens of such things.
This question discusses the differences between assign and <<-. The chief difference is that assign lets you specify the environment – so it is easy to use it to store data in a non-global but persistent environment so that you could e.g. emulate static variables in R. While it is possible to use assign to modify the global environment, you should be aware that it is seldom a good thing to do so. There is too much of a danger of accidentally overwriting data that you don’t want to have overwritten. Code which makes heavy use of global variables can almost always be refactored into cleaner code which doesn’t. If you need to get a lot of heterogeneous data from a function to the calling environment, the cleanest solution would be to return the needed data in a list.
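The two suggestions in the paragraph above can be sketched roughly as follows. The function names fit_summary, make_counter, and next_id are hypothetical and only serve the illustration.

```r
# Hypothetical function: return several results in one named list
fit_summary <- function(x) {
  list(mean = mean(x), sd = sd(x), n = length(x))
}
res <- fit_summary(c(2, 4, 6))
res$mean

# Emulating a static variable: the counter lives in the enclosing
# environment of the returned function, not in the global environment
make_counter <- function() {
  n_calls <- 0
  function() {
    n_calls <<- n_calls + 1  # <<- modifies n_calls in the enclosing environment
    n_calls
  }
}
next_id <- make_counter()
next_id()
next_id()
```

Here <<- is used on purpose, but it only touches a variable that make_counter() created, so nothing in the global environment is overwritten.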
summarize() from dplyr
mpg_summary <- mpg %>%
  group_by(class) %>% # group by grouping variable class
  summarize(hwy_sd = sd(hwy), hwy_mean = mean(hwy), n = n()) %>% # sd, mean, and group size per class
  mutate(hwy_se = hwy_sd / sqrt(n)) # standard error of the mean uses the per-group n
mpg_summary
summaryBy (doBy package)
library(doBy)
summaryBy(Age ~ med_year, data = data, FUN = function(x) { c(m = mean(x, na.rm=TRUE), s = sd(x, na.rm=TRUE), n = length(x), r = range(x, na.rm=TRUE)) })
psych package
library(psych)
describeBy(my_data, my_data$Group, mat = TRUE)
Base R
length(which(data_reshaped$Population=="Vglut1-ChR2"))
Tidyverse way using count() or tally()
tally(x, wt, sort = FALSE)
count(x, ..., wt = NULL, sort = FALSE)
add_tally(x, wt, sort = FALSE)
add_count(x, ..., wt = NULL, sort = FALSE)
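A short usage sketch of these helpers, using the built-in mtcars data (chosen here only for illustration) and counting cars per number of cylinders:

```r
library(dplyr)

# Base R: count the rows that match a condition
sum(mtcars$cyl == 4)

# count() groups and counts in one step
by_cyl <- mtcars %>% count(cyl)
by_cyl

# tally() counts the rows of an already grouped data frame
mtcars %>% group_by(cyl) %>% tally()

# add_count() keeps every row and appends the group size as column n
with_n <- mtcars %>% add_count(cyl)
head(with_n)
```

count(cyl) is shorthand for group_by(cyl) followed by tally(); the add_ variants are handy when the group size is needed next to the original rows.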
ggplot2 is a package that is used to make attractive and flexible plots. It’s based on the idea of graphics composed of various layers.
Our new template takes seven parameters, the bracketed words that appear in the template. In practice, you rarely need to supply all seven parameters to make a graph because ggplot2 will provide useful defaults for everything except the data, the mappings, and the geom function.
The seven parameters in the template compose the grammar of graphics, a formal system for building plots. The grammar of graphics is based on the insight that you can uniquely describe any plot as a combination of a dataset, a geom, a set of mappings, a stat, a position adjustment, a coordinate system, and a faceting scheme. (http://r4ds.had.co.nz/data-visualisation.html#first-steps)
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(
mapping = aes(<MAPPINGS>),
stat = <STAT>,
position = <POSITION>
) +
<COORDINATE_FUNCTION> +
<FACET_FUNCTION>
And from socviz
- Tell the ggplot() function what our data is. The data = … step.
- Tell ggplot() what relationships we want to see. The mapping = aes(…) step. For convenience we will put the results of the first two steps in an object called p.
- Tell ggplot how we want to see the relationships in our data. Choose a geom.
- Layer on geoms as needed, by adding them to the p object one at a time.
- Use some additional functions to adjust scales, labels, tick marks, titles. The scale_ family, labs() and guides() functions. We’ll learn more about some of these functions shortly.
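The steps above can be sketched roughly like this. The mtcars data, the chosen geoms, and the label text are assumptions made for the example, not part of the socviz text.

```r
library(ggplot2)

# Steps 1 and 2: data and mappings, stored in p
p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg))

# Steps 3 and 4: choose a geom, then layer on another one
p <- p + geom_point() + geom_smooth(method = "lm", se = FALSE)

# Step 5: adjust scales and labels
p <- p +
  scale_x_continuous(breaks = 2:5) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon",
       title = "Fuel economy by weight")

p  # printing the object draws the plot
```

Because each step only adds to p, it is easy to stop after any line and print p to see the plot so far.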
Add a geom_histogram() layer with aes() set to the variable you’re interested in.
ggplot(mtcars) + geom_histogram(aes(mpg), binwidth = 5)
Set aes() for the variable of interest. Add another geom_histogram() layer doing the same. You can change binwidth and alpha.
ggplot(mtcars) + geom_histogram(aes(disp), fill="red", alpha=0.2, binwidth = 100) + geom_histogram(aes(hp), fill="blue", alpha=0.2, binwidth = 100)
#using ggplot2
ggplot(mtcars) + geom_point(aes(x = mpg, y = disp, color= mpg >25))
For Categorical x Categorical Data (Contingency Table)
library(ggmosaic)
library(NHANES)
ggplot(data = NHANES) +
geom_mosaic(aes(weight = Weight, x = product(SleepHrsNight, AgeDecade), fill=factor(SleepHrsNight)), na.rm=TRUE) + theme(axis.text.x=element_text(angle=-25, hjust= .1)) + labs(x="Age in Decades ", title='f(SleepHrsNight | AgeDecade) f(AgeDecade)') + guides(fill=guide_legend(title = "SleepHrsNight", reverse = TRUE))
Means can be plotted with the stat = "summary" argument. Error bars, however, must be calculated and entered into the data.
Plotting means and error bars - R Cookbook
#get summary statistics
mpg_summary <- mpg %>%
  group_by(class) %>% # group by grouping variable class
  summarize(hwy_sd = sd(hwy), hwy_mean = mean(hwy), n = n()) %>% # sd, mean, and group size per class
  mutate(hwy_se = hwy_sd / sqrt(n)) # standard error of the mean uses the per-group n
#build base bar plot
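p <- ggplot(mpg, aes(x = class, y = hwy)) + geom_bar(stat = "summary", fun = "mean") + geom_jitter(aes(color = factor(year))) # fun replaces the deprecated fun.y
#add error bars (mean +/- sd); inherit.aes = FALSE because mpg_summary has no hwy column
p + geom_errorbar(data = mpg_summary, aes(x = class, ymin = hwy_mean - hwy_sd, ymax = hwy_mean + hwy_sd), inherit.aes = FALSE)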
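library(ggplot2)
library(gridExtra)
plot1 <- ggplot(mtcars, aes(mpg)) + geom_histogram()  # qplot() is deprecated in recent ggplot2
plot2 <- ggplot(mtcars, aes(disp)) + geom_histogram()
grid.arrange(plot1, plot2, ncol = 2)  # two plots side by side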
simple <- ggplot(df, aes(x = lfc, y = pvalue)) +
geom_point(size = 3, alpha = 0.7, na.rm = TRUE) + # Make dots bigger
theme_bw(base_size = 16) + # change theme
ggtitle(label = "Volcano Plot", subtitle = "Simple black & white") + # Add a title
xlab(expression(log[2]("Treatment" / "Untreated"))) + # x-axis label
ylab(expression(-log[10]("adjusted p-value"))) + # y-axis label
geom_vline(xintercept = c(-2,2), colour = "darkgrey") + # Add cutoffs
geom_hline(yintercept = 1.3, colour = "darkgrey") + # Add cutoffs
geom_vline(xintercept = 0, colour = "black") + # Add 0 lines
scale_colour_gradient(low = "black", high = "black", guide = "none") + # Color black
scale_x_continuous(limits = c(-4, 4)) # min/max of lfc
# Plot figure
simple
Color can be added like this.
# Modify dataset to add new column of colors
data <- data %>%
mutate(color = ifelse(lfc > 0 & pvalue > 1.3,
yes = "Treated",
no = ifelse(lfc < 0 & pvalue > 1.3,
yes = "Untreated",
no = "none")))
# Color corresponds to fold change directionality
colored <- ggplot(data, aes(x = lfc, y = pvalue)) +
geom_point(aes(color = factor(color)), size = 1.75, alpha = 0.8, na.rm = TRUE) + # add gene points
theme_bw(base_size = 16) + # clean up theme
theme(legend.position = "none") + # remove legend
ggtitle(label = "Volcano Plot", subtitle = "Colored by directionality") + # add title
xlab(expression(log[2]("Treated" / "Untreated"))) + # x-axis label
ylab(expression(-log[10]("adjusted p-value"))) + # y-axis label
geom_vline(xintercept = 0, colour = "black") + # add line at 0
geom_hline(yintercept = 1.3, colour = "black") + # p(0.05) = 1.3
annotate(geom = "text",
label = "Untreated",
x = -2, y = 85,
size = 7, colour = "black") + # add Untreated text
annotate(geom = "text",
label = "Treated",
x = 2, y = 85,
size = 7, colour = "black") + # add Treated text
scale_color_manual(values = c("Treated" = "#E64B35",
"Untreated" = "#3182bd",
"none" = "#636363")) # change colors
# Plot figure
colored
Goal: With each iteration of the for loop, create a new variable whose name includes the iterator value. This is useful for creating objects as you go through a list or sequence.
library(readxl)  # provides read_excel()
for (i in seq_len(6)) {
  name <- paste0("data_", i)  # data_1, data_2, ...
  assign(name, read_excel("Left_Female.xlsx", sheet = i))  # read sheet i into that name
}
rm(list=ls(pattern="temp"))
Make a list of data frames based on naming pattern
df_list = mget(ls(pattern = "df[0-9]"))
# this would match any object with "df" followed by a digit in its name
# you can test what objects will be got by just running the
ls(pattern = "df[0-9]")
# part and adjusting the pattern until it gets the right objects.
Note: The critical piece here is mget(), which returns the values of the named objects as a list.
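A self-contained sketch of the same idea, with two made-up data frames (df1 and df2 are hypothetical) that are collected by name and then stacked:

```r
# Hypothetical data frames whose names follow the "df<digit>" pattern
df1 <- data.frame(id = 1:2, value = c(10, 20))
df2 <- data.frame(id = 3:4, value = c(30, 40))

# Collect them into a named list; ^ and $ keep the match strict
df_list <- mget(ls(pattern = "^df[0-9]+$"))

# Stack the list elements into one data frame
combined <- do.call(rbind, df_list)

names(df_list)
nrow(combined)
```

Anchoring the pattern with ^ and $ is a design choice here: it avoids accidentally picking up objects such as df_list itself on a second run.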
Filter a list by class of list item
Filter(function(x) is(x, "data.frame"), mget(ls(pattern="_list")))
This is cool.
Sequence Generation
X <- 10
sample(c(0,1), replace=TRUE, size=X)
Highlight points in geom_point or another plot selectively using annotate()
ggplot XY scatter - how to change alpha transparency for select points?
## you start here
library(ggplot2)
special.points <- sample(seq_len(nrow(df)), 7)  # pick 7 random rows of df to highlight
## then add annotate text
ggplot(df, aes(x=SeqIdentityMean,
y=SeqIdentityStdDev)) +
geom_point(alpha=0.05) +
annotate("point",
df$SeqIdentityMean[special.points],
df$SeqIdentityStdDev[special.points],
col="red") +
annotate("text",
df$SeqIdentityMean[special.points],
df$SeqIdentityStdDev[special.points],
#text we want to display
label=round(df$SeqIdentityStdDev[special.points],1),
#adjust horizontal position of text
hjust=-0.1)
Adding Greek symbols
Special characters in labels
#for use with ggplot2
labs(x = expression("Cross-sectional Area"*" ("*mu*"m"^{2}* ")")) # microns.