
The aim of this assignment is to demonstrate your familiarity with data manipulation
and analysis, using the software package R. Write commented R code to address each
individual task or question below. Your code will be checked to confirm that it works
- please ensure that all code functions as expected prior to submitting the assignment!
Marks will be awarded for accurate code and a concise description of how it works
(code with no explanation as to how it works will receive reduced marks). An
example answer is provided at the bottom of the assignment.
The assignment must be submitted by 5pm on the due date via Canvas and further
guidance will be provided when the assignment is set.

An association study of breast cancer has identified a region on chromosome 5,
mapping to 5p12, that is associated with predisposition to the disease. A fine-mapping
analysis of almost 3,500 SNPs has been performed to refine the association signals in
the region. You have been provided with the association statistics from this fine-mapping
study (breastFineMapping_5p12.csv).
1) Read the fine-mapping summary statistics into an R object called snp.data
and determine precisely how many SNPs were genotyped. Note that the dataset
is comma delimited.
For each SNP in the dataset, SNP name (rsid), chromosome, position on human
genome build 37, reference allele, effect allele and minor allele are provided, along
with the minor allele and effect allele frequencies in control samples. The log odds
ratio (OR), corresponding to the effect of each additional copy of the effect allele
upon risk of breast cancer, is shown in the all_beta column and the standard error of
the log OR is provided in the all_se column. Finally, the p-value for association with
risk of breast cancer is given in the all_pvalue column.
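As a hint for task 1, reading a comma-delimited file and counting its rows can be sketched as follows. This is a minimal sketch on made-up data: the object toy.data, the rsToy names, the position column name and all values are hypothetical, and only ref_allele, maf, all_beta, all_se and all_pvalue are column names that appear in this assignment.

```r
# Toy stand-in for a comma-delimited file: three made-up SNP rows.
# The rsid and position column names are assumptions for illustration.
csv_text <- "rsid,position,ref_allele,maf,all_beta,all_se,all_pvalue
rsToy1,100,C,0.30,0.10,0.02,0.001
rsToy2,200,T,0.005,-0.05,0.04,0.20
rsToy3,300,G,0.47,0.01,0.03,0.70"

# read.csv() expects a comma delimiter and a header row by default.
# With a real file, pass the file name instead of text =.
toy.data <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# One row per SNP, so the number of rows is the number of SNPs.
n.snps <- nrow(toy.data)
print(n.snps)

# str() gives a quick check that each column was read with a sensible type.
str(toy.data)
```

Checking column types straight after import is worthwhile because a numeric column that has been read as character will silently break later comparisons.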
2) Using your knowledge of relational operators and subsetting in R, find:
i) the genomic coordinate of SNP rs10941673.
ii) the two possible alleles for SNP rs114796267.
iii) the number of SNPs in the dataset that map within the interval 44,044,000 bp
to 44,188,000 bp.
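The three kinds of lookup in task 2 can be sketched on a toy data frame. The column names rsid, position and effect_allele, the rsToy names and all values are assumptions for illustration and may not match the real file.

```r
# Made-up data frame with hypothetical column names and values.
toy.data <- data.frame(
  rsid          = c("rsToy1", "rsToy2", "rsToy3", "rsToy4"),
  position      = c(100, 250, 400, 900),
  ref_allele    = c("C", "T", "G", "A"),
  effect_allele = c("T", "C", "A", "G"),
  stringsAsFactors = FALSE
)

# One column value for the row whose rsid matches a given name.
pos.toy2 <- toy.data$position[toy.data$rsid == "rsToy2"]

# Several columns for one row: logical test before the comma,
# column names after it.
alleles.toy3 <- toy.data[toy.data$rsid == "rsToy3",
                         c("ref_allele", "effect_allele")]

# Count rows inside a closed interval. sum() of a logical vector
# counts the TRUE values.
n.in.interval <- sum(toy.data$position >= 200 & toy.data$position <= 400)
```

Whether the interval end points are included (>= and <=) or excluded (> and <) is a choice that should be stated in the description of the answer.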
3) Remove all SNPs with MAF of less than 1% from the dataset. How many
SNPs remain?
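Filtering on a frequency threshold, as in task 3, can be sketched as below. The data are made up; maf is the column name used in the example at the bottom of this assignment.

```r
# Made-up minor allele frequencies; rsToy names are hypothetical.
toy.data <- data.frame(
  rsid = c("rsToy1", "rsToy2", "rsToy3", "rsToy4"),
  maf  = c(0.30, 0.005, 0.01, 0.47),
  stringsAsFactors = FALSE
)

# "Less than 1%" is removed, so a SNP with MAF of exactly 0.01 is kept.
toy.filtered <- toy.data[toy.data$maf >= 0.01, ]

# Number of SNPs remaining after the filter.
n.remaining <- nrow(toy.filtered)
print(n.remaining)
```

Assigning the result to a new object keeps the unfiltered data available for checking; overwriting the original object is equally valid if later tasks should only see the filtered SNPs.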
4) Create new columns corresponding to odds ratios and 95% confidence
intervals, rounding to two decimal places, for each of the remaining SNPs in the
dataset. For the subset of SNPs with p-values ≤ 0.05, for how many SNPs is the
effect allele associated with
i) an increased risk of breast cancer?
ii) a decreased risk of breast cancer?
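One common way to obtain an odds ratio and a 95% confidence interval from a log OR and its standard error is to exponentiate the estimate and the estimate plus or minus about 1.96 standard errors (a Wald interval). The sketch below assumes that this is the interval intended here. The new column names or, ci_lower and ci_upper and all values are hypothetical.

```r
# Made-up log odds ratios, standard errors and p-values.
toy.data <- data.frame(
  rsid       = c("rsToy1", "rsToy2", "rsToy3"),
  all_beta   = c(0.10, -0.20, 0.01),
  all_se     = c(0.02, 0.05, 0.03),
  all_pvalue = c(1e-6, 1e-4, 0.70),
  stringsAsFactors = FALSE
)

# Normal quantile for a two-sided 95% interval (about 1.96).
z <- qnorm(0.975)

# Exponentiate the log OR and its interval limits, then round.
toy.data$or       <- round(exp(toy.data$all_beta), 2)
toy.data$ci_lower <- round(exp(toy.data$all_beta - z * toy.data$all_se), 2)
toy.data$ci_upper <- round(exp(toy.data$all_beta + z * toy.data$all_se), 2)

# Restrict to p <= 0.05, then count by direction of effect.
# The sign of the log OR is used so that rounding cannot turn an
# OR slightly above or below 1 into exactly 1.00.
sig <- toy.data[toy.data$all_pvalue <= 0.05, ]
n.increased <- sum(sig$all_beta > 0)
n.decreased <- sum(sig$all_beta < 0)
```

A log OR above 0 is the same as an OR above 1, so either scale can be used to classify direction as long as the unrounded value is tested.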
5) Using the which.min function, extract from snp.data the row of data that
corresponds to the SNP with the smallest p-value. Does the minor allele of this
SNP confer increased or decreased risk of breast cancer?
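The use of which.min in task 5 can be sketched on toy data. The effect_allele and minor_allele column names, the rsToy names and all values are assumptions. The sign flip reflects that all_beta describes the effect allele, which is not necessarily the minor allele.

```r
# Made-up association results with hypothetical allele columns.
toy.data <- data.frame(
  rsid          = c("rsToy1", "rsToy2", "rsToy3"),
  effect_allele = c("A", "G", "T"),
  minor_allele  = c("A", "C", "T"),
  all_beta      = c(0.02, -0.15, 0.05),
  all_pvalue    = c(0.20, 3e-8, 0.04),
  stringsAsFactors = FALSE
)

# which.min() returns the index of the first smallest value,
# which is then used as a row index.
top.row <- toy.data[which.min(toy.data$all_pvalue), ]

# If the minor allele is not the effect allele, the log OR for the
# minor allele has the opposite sign.
minor.beta <- if (top.row$effect_allele == top.row$minor_allele) {
  top.row$all_beta
} else {
  -top.row$all_beta
}
```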
You have been provided with log10 transformed gene expression data from breast
tissue for three genes, FGF10, MRPS30 and HCN1 that map within the vicinity of the
breast cancer predisposition SNP that you have
identified (breastGeneExpressionData.txt). The genotypes of the predisposition SNP
(called SNP_A in this data) for each individual in the gene expression dataset are also
provided and are encoded such that 0 = common allele homozygote, 1 = heterozygote
and 2 = minor allele homozygote.
6) Make a new dataframe in R called gen.exp that corresponds to the gene
expression data. How many breast tissue samples are in the dataset? Rename the
column called SNP_A to the name of the SNP that you identified in task 5.
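Reading a delimited text file and renaming one column, as in task 6, can be sketched as follows. The delimiter of the real file is not stated in this assignment, so the tab separator here is an assumption, as are the values and the replacement name rsToy2.

```r
# Toy stand-in for a tab-delimited text file with a header row.
txt <- "FGF10\tMRPS30\tHCN1\tSNP_A
1.2\t2.1\t0.4\t0
1.1\t2.4\t0.5\t1
1.0\t2.8\t0.3\t2"

# With a real file, pass the file name instead of text = and check
# the delimiter by opening the file first.
toy.exp <- read.table(text = txt, header = TRUE, sep = "\t")

# One row per sample.
n.samples <- nrow(toy.exp)

# Rename a column by matching on its current name rather than its position.
names(toy.exp)[names(toy.exp) == "SNP_A"] <- "rsToy2"
```

Matching on the current name keeps the rename correct even if the column order in the file changes.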
7) Assess the relationship between log10 gene expression for each gene and SNP
genotype using box plots. Label the axes of each plot and give each plot a title.
For each gene, does expression increase or decrease with each additional copy of
the risk allele?
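A labelled box plot of expression by genotype, as in task 7, can be sketched with base R graphics. The data are simulated and the gene names geneX and geneY are placeholders, not the genes in this assignment.

```r
# Simulated data: 10 samples per genotype class, two placeholder genes.
set.seed(1)
genotype <- rep(0:2, each = 10)
toy.exp <- data.frame(
  genotype = genotype,
  geneX = 1.0 + 0.2 * genotype + rnorm(30, sd = 0.1),
  geneY = 2.0 + rnorm(30, sd = 0.1)
)

# One box plot per gene, with axis labels and a title built from the gene name.
for (gene in c("geneX", "geneY")) {
  boxplot(toy.exp[[gene]] ~ toy.exp$genotype,
          xlab = "Genotype (copies of minor allele)",
          ylab = paste("log10", gene, "expression"),
          main = paste(gene, "expression by genotype"))
}

# Group medians give a numeric check of the trend seen in the plot.
group.medians <- tapply(toy.exp$geneX, toy.exp$genotype, median)
```

The x-axis counts copies of the minor allele, so reading the trend in terms of the risk allele depends on which allele was found to increase risk.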
8) Perform an eQTL analysis to test the association between log10 gene expression
and SNP genotype for each gene using the lm function in R. When specifying the
model formula for this linear regression analysis use log10 gene expression as the
response variable and SNP as the predictor variable. The effect estimates from
the linear regression analysis correspond to the expected change in gene
expression for each additional minor allele of the SNP. For which of the three
genes is SNP genotype associated with expression?
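Fitting the linear model in task 8 and extracting the per-allele effect and its p-value can be sketched on simulated data. The gene name geneX and all values are made up; genotype is kept numeric (0, 1, 2) so that the slope is the change per additional minor allele.

```r
# Simulated data with a built-in per-allele effect of 0.3 on geneX.
set.seed(42)
genotype <- rep(0:2, each = 20)
toy.exp <- data.frame(
  genotype = genotype,
  geneX = 1.0 + 0.3 * genotype + rnorm(60, sd = 0.1)
)

# Response on the left of ~, predictor on the right.
fit.x <- lm(geneX ~ genotype, data = toy.exp)

# The coefficient table holds the estimate, standard error,
# t statistic and p-value for each term.
coefs   <- summary(fit.x)$coefficients
slope   <- coefs["genotype", "Estimate"]
p.value <- coefs["genotype", "Pr(>|t|)"]
print(coefs)
```

Because three genes are tested, the description of the result should say which significance threshold was used and whether it accounts for multiple tests.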
9) Based on your findings, write a short report (500 words max) discussing the
breast cancer risk locus at 5p12. The report should include the summary
statistics (SNP name, OR, 95% CIs, P-value and MAF) of the most significantly
associated SNP from the fine-mapping data and the findings from your eQTL
analysis, including your boxplots for each gene. Your report should reference
recently published literature describing the characterisation of this risk locus.

Date set: 27.02.20
Date due: 27.03.20

Example
How many SNPs in the "breastFineMapping_5p12.csv" dataset have either “C”
or “T” reference alleles and a minor allele frequency of greater than 45%?
Answer: 6
Solution:
# Subset the data frame to include only rows that meet the criteria (ref_allele = C or T
# and maf > 0.45) and output the number of rows
dim(snp.data[(snp.data$ref_allele=="C" | snp.data$ref_allele=="T") &
             snp.data$maf>0.45,])[1]
Description: A subset of the snp.data data-frame is created by using square brackets.
To do so, the name of the data-frame to be subset is specified, followed by square
brackets. Since the object to be subset has two dimensions, row and columns, these
must be defined and a comma is used to delineate them, with rows being specified by
arguments to the left of the comma and columns by arguments to the right of the
comma. Since our data-frame comprises one row per SNP, we need only to define
arguments to subset rows of snp.data. The “==” relational operator is used to identify
rows that have either “C” or “T” in the ref_allele column (defined using the $
operator), with the two tests combined by the logical OR operator “|”. Brackets are
placed around the OR argument for ref_allele so that the OR statement is evaluated first
(without them, “&” would take precedence over “|”), before an AND statement evaluates
whether the rows for which the OR statement is true also have a minor
allele frequency greater than 0.45. Finally, the dim function is wrapped around the
subset expression so that the number of rows for which the subset statement is true is
returned, rather than the actual subset data-frame. Since only the row count is
needed, [1] is appended to the dim call to return the first element of its output
(dim returns the number of rows first, then the number of columns).
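The same count can be reached in several equivalent ways, and the effect of the brackets can be checked directly. The sketch below uses a made-up data frame with the ref_allele and maf columns from the example; the values are hypothetical.

```r
# Made-up reference alleles and minor allele frequencies.
toy.data <- data.frame(
  ref_allele = c("C", "T", "G", "C", "A", "C"),
  maf        = c(0.46, 0.30, 0.48, 0.50, 0.49, 0.10),
  stringsAsFactors = FALSE
)

# Logical vector: TRUE for rows meeting both criteria.
keep <- (toy.data$ref_allele == "C" | toy.data$ref_allele == "T") &
  toy.data$maf > 0.45

n.dim  <- dim(toy.data[keep, ])[1]   # approach used in the example
n.nrow <- nrow(toy.data[keep, ])     # nrow() returns the row count directly
n.sum  <- sum(keep)                  # TRUE counts as 1, so no subset is needed

# %in% tests membership in a set and replaces the chained == tests.
n.in <- sum(toy.data$ref_allele %in% c("C", "T") & toy.data$maf > 0.45)

# Without brackets, & is evaluated before |, so every "C" row is counted
# regardless of its maf.
n.no.brackets <- sum(toy.data$ref_allele == "C" |
                       toy.data$ref_allele == "T" & toy.data$maf > 0.45)
```

Any of the first four forms is acceptable; the last shows why the brackets in the example matter.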