Title: | Datasets and Functions for Books by Julian Faraway |
---|---|
Description: | Books are "Linear Models with R" (CRC Press; 1st Ed. August 2004, 2nd Ed. July 2014, 3rd Ed. February 2025; ISBN 9781439887332), "Extending the Linear Model with R" (CRC Press; 1st Ed. December 2005, 2nd Ed. March 2016; ISBN 9781584884248) and "Practical Regression and ANOVA in R", contributed documentation on CRAN (now very dated). |
Authors: | Julian Faraway [aut, cre] |
Maintainer: | Julian Faraway <[email protected]> |
License: | GPL |
Version: | 1.0.9 |
Built: | 2025-02-12 12:37:34 UTC |
Source: | https://github.com/julianfaraway/faraway |
The data comes from the U.S. Historical Climatology Network.
A data frame with 115 observations on the following 2 variables.
year from 1854 to 2000
annual mean temperatures in degrees F in Ann Arbor
United States Historical Climatology Network: https://www.ncei.noaa.gov/products/land-based-station/us-historical-climatology-network
The abrasion data frame has 16 rows and 4 columns. Four materials were fed into a wear testing machine and the amount of wear recorded. Four samples could be processed at the same time and the position of these samples may be important. A Latin square design was used.
This data frame contains the following columns:
The run number 1-4
The position number 1-4
The material A-D
The wear, measured as loss of weight in 0.1mm over the testing period
The Design and Analysis of Industrial Experiments by O. Davies, 1954, published by Wiley
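A Latin square design assigns each treatment exactly once to every level of each of the two blocking factors. The base-R sketch below checks that property. The helper `is_latin_square` and the example layout are hypothetical illustrations and are not part of the package.

```r
# Hypothetical helper: TRUE when every treatment occurs exactly once
# in each row block and in each column block of the design.
is_latin_square <- function(row, col, trt) {
  by_row <- table(row, trt)   # counts of each treatment within each row block
  by_col <- table(col, trt)   # counts of each treatment within each column block
  cells  <- table(row, col)   # each row/column cell must be used exactly once
  all(by_row == 1) && all(by_col == 1) && all(cells == 1)
}

# Illustrative 4 x 4 cyclic square (a made-up layout, not the abrasion data)
design <- expand.grid(run = 1:4, position = 1:4)
design$material <- LETTERS[(design$run + design$position - 2) %% 4 + 1]
is_latin_square(design$run, design$position, design$material)
```

With the package loaded, the same check could be applied to the run, position and material columns of the abrasion data frame, assuming the columns carry those names.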
Aflatoxin B1 was fed to lab animals at varying doses and the number responding with liver cancer was recorded.
A data frame with 6 observations on the following 3 variables.
dose in ppb
number of test animals
number with liver cancer
Gaylor DW (1987) "Linear nonparametric upper limits for low dose extrapolation" ASA Proceedings of the Biopharmaceutical Section.
data(aflatoxin)
Data is a subset of a larger study on factors affecting regime stability in Sub-Saharan Africa
A data frame with 47 observations on the following 9 variables.
number of successful military coups from independence to 1989
number of years the country was ruled by military oligarchy from independence to 1989
Political liberalization - 0 = no civil rights for political expression, 1 = limited civil rights for expression but right to form political parties, 2 = full civil rights
Number of legal political parties in 1993
Percent voting in last election
Population in millions in 1989
Area in 1000 square km
Total number of legislative and presidential elections
Number of regime types
Bratton, Michael, and Nicholas Van De Walle. 1997. “Political Regimes and Regime Transitions in Africa, 1910-1994.” Study Number I06996. Ann Arbor: Inter-University Consortium for Political and Social Research.
"Bayesian Methods: A Social and Behavioral Sciences Approach" by Jeff Gill 2002.
Monthly totals of airline passengers from 1949 to 1960
A data frame with 144 observations on the following 2 variables.
number of passengers in thousands
the date as a decimal
Well known time series example dataset
Brown, R.G.(1962) Smoothing, Forecasting and Prediction of Discrete Time Series. Englewood Cliffs, N.J.: Prentice-Hall.
Box, G.E.P., Jenkins, G.M. and Reinsel, G.C. (1994) Time Series Analysis, Forecasting and Control, 3rd edn. Englewood Cliffs, N.J.: Prentice-Hall.
data(airpass) ## maybe str(airpass) ; plot(airpass) ...
The alfalfa data frame has 25 rows and 4 columns. Data comes from an experiment to test the effects of seed inoculum, irrigation and shade on alfalfa yield. A Latin square design has been used.
This data frame contains the following columns:
Distance of location from tree line divided into 5 shade areas
Irrigation effect divided into 5 levels
Four types of seed inoculum, A-D, with E as control.
Dry matter yield of alfalfa
Petersen, R.G. 1994. Agricultural Field Experiments, Design and Analysis. Marcel Dekker, Inc., New York. Pages 70-74.
A matched case control study carried out to investigate the connection between X-ray usage and acute myeloid leukemia in childhood. The pairs are matched by age, race and county of residence.
A data frame with 238 observations on the following 11 variables.
a factor denoting the matched pairs
0=control, 1=case
F or M
Presence of Downs syndrome: no or yes
Age in years
Did the mother ever have an Xray: no or yes
Did the mother have an Xray of the upper body during pregnancy: no or yes
Did the mother have an Xray of the lower body during pregnancy: no or yes
Did the father ever have an Xray: no or yes
Did the child ever have an Xray: no or yes
Total number of Xrays of the child: 1=none < 2=1 or 2 < 3=3 or 4 < 4=5 or more
Chap T. Le (1998) "Applied Categorical Data Analysis" Wiley.
A doctor at a major London hospital compared the effects of 4 anaesthetics used in major operations. 80 patients were divided into groups of 20.
A data frame with 80 observations on the following 2 variables.
time in minutes to start breathing unassisted
Four treatment groups: A, B, C, D
Chatfield C. (1995) Problem Solving: A Statistician's Guide, 2ed Chapman Hall.
data(anaesthetic) ## maybe str(anaesthetic) ; plot(anaesthetic) ...
Study on infant respiratory disease, namely the proportions of children developing bronchitis or pneumonia in their first year of life by type of feeding and sex.
A data frame with 6 observations on the following 4 variables.
number with disease
number without disease
a factor with levels Boy, Girl
a factor with levels Bottle, Breast, Suppl
Payne, C. (1987). The GLIM System Release 3.77 Manual (2 ed.). Oxford: Numerical Algorithms Group.
data(babyfood) ## maybe str(babyfood) ; plot(babyfood) ...
Grain beetles were exposed to ethylene oxide
A data frame with 10 observations on the following 3 variables.
concentration of ethylene oxide in mg/l
number affected
number exposed
Busvine (1938)
Collett D. "Modelling Binary Data"
data(beetle) ## maybe str(beetle) ; plot(beetle) ...
An experiment measuring death rates for insects, with 30 insects at each of five treatment levels.
A data frame with 5 observations on the following 3 variables.
number dead
number alive
concentration of insecticide
Bliss (1935). The calculation of the dosage-mortality curve. Annals of Applied Biology 22, 134-167.
data(bliss) ## maybe str(bliss) ; plot(bliss) ...
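Grouped counts of this kind (number dead and number alive at each concentration) are commonly modelled with a binomial GLM that takes a two-column response. The sketch below assumes the three columns are named `dead`, `alive` and `conc`. Those names and the helper `fit_dose_response` are assumptions made for illustration, and the package data are used only if faraway is installed.

```r
# Hypothetical helper: logistic dose-response fit for grouped binomial data.
# Assumes columns named `dead`, `alive` and `conc` (an assumption, see above).
fit_dose_response <- function(d) {
  glm(cbind(dead, alive) ~ conc, family = binomial, data = d)
}

# Use the packaged data only when the faraway package is available
if (requireNamespace("faraway", quietly = TRUE)) {
  data(bliss, package = "faraway")
  fit <- fit_dose_response(bliss)
  print(coef(fit))   # intercept and slope on the logit scale
}
```

A positive slope for `conc` would indicate that the odds of death rise with concentration.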
An experiment was conducted to select the supplier of raw materials for production of a component. The breaking strength of the component was the objective of interest. Four suppliers were considered. The four operators can only produce one component each per day. A Latin square design was used.
A data frame with 16 observations on the following 4 variables.
The breaking strength of the component
the operator - a factor with levels op1, op2, op3, op4
the day of production - a factor with levels day1, day2, day3, day4
the supplier of the raw material - a factor with levels A, B, C, D
Lentner M. and Bishop T. (1986) Experimental Design and Analysis, Valley Book Company
A number of growers supply broccoli to a food processing plant. The plant instructs the growers to pack the broccoli into standard size boxes. There should be 18 clusters of broccoli per box and each cluster should weigh between 1.33 and 1.5 pounds. Because the growers use different varieties, methods of cultivation etc, there is some variation in the cluster weights. The plant manager selected 3 growers at random and then 4 boxes at random supplied by these growers. 3 clusters were selected from each box.
A data frame with 36 observations on the following 4 variables.
weight of broccoli
the grower - a factor with levels 1, 2, 3
the box - a factor with levels 1, 2, 3, 4
the cluster - a factor with levels 1, 2, 3
Lentner M. and Bishop T. (1986) Experimental Design and Analysis, Valley Book Company
Average butterfat content (percentages) of milk for random samples of twenty cows (ten two-year old and ten mature (greater than four years old)) from each of five breeds. The data are from Canadian records of pure-bred dairy cattle.
A data frame with 100 observations on the following 3 variables.
butter fat content by percentage
a factor with levels Ayrshire, Canadian, Guernsey, Holstein-Fresian, Jersey
a factor with levels 2year, Mature
Sokal, R. R. and Rohlf, F. J. (1994) Biometry. W. H. Freeman, New York, third edition.
data(butterfat) ## maybe str(butterfat) ; plot(butterfat) ...
Example Dataset from "Practical Regression and Anova"
A dataset with 25 cases
of the cathedral - romanesque or gothic
in feet
in feet
Weisberg, S. (2005). Applied Linear Regression, 3rd edition. New York: Wiley
Reference details may be found in "Practical Regression and Anova" by Julian Faraway
In a study of cheddar cheese from the LaTrobe Valley of Victoria, Australia, samples of cheese were analyzed for their chemical composition and were subjected to taste tests. Overall taste scores were obtained by combining the scores from several tasters.
A data frame with 30 observations on the following 4 variables.
a subjective taste score
concentration of acetic acid (log scale)
concentration of hydrogen sulfide (log scale)
concentration of lactic acid
David S. Moore and George P. McCabe (1993) Introduction to the Practice of Statistics, W. H. Freeman and company, second edition.
data(cheddar) ## maybe str(cheddar) ; plot(cheddar) ...
Data from a 1970's study on the relationship between insurance redlining in Chicago and racial composition, fire and theft rates, age of housing and income in 47 zip codes.
This dataframe contains the following columns
racial composition in percent minority
fires per 100 housing units
theft per 1000 population
percent of housing units built before 1939
new FAIR plan policies and renewals per 100 housing units
median family income in thousands of dollars
North or South side of Chicago
Adapted from "Data : A Collection of Problems from Many Fields for the Student and Research Worker" by D. Andrews and A. Herzberg published by Springer-Verlag, in 1985
Complements the chicago and chmiss datasets by dividing the zip codes into north and south
takes the values "n" (north) and "s" (south)
Reference details may be found in "Practical Regression and Anova" by Julian Faraway
chicago
Data from a 1970's study on the relationship between insurance redlining in Chicago and racial composition, fire and theft rates, age of housing and income in 47 zip codes. Missing values have been randomly added.
This dataframe contains the following columns
racial composition in percent minority
fires per 100 housing units
theft per 1000 population
percent of housing units built before 1939
new FAIR plan policies and renewals per 100 housing units
median family income in thousands of dollars
North or South side of Chicago
Adapted from "Data : A Collection of Problems from Many Fields for the Student and Research Worker" by D. Andrews and A. Herzberg published by Springer-Verlag, in 1985
An experiment was conducted to determine the effect of recipe and baking temperature on chocolate cake quality. 15 batches of cake mix for each recipe were prepared. Each batch was sufficient for six cakes. Each of the six cakes was baked at a different temperature which was randomly assigned. Several measures of cake quality were recorded of which breaking angle was just one.
A data frame with 270 observations on the following 4 variables.
Chocolate for recipe 1 was added at 40C, Chocolate for recipe 2 was added at 60C and recipe 3 had extra sugar
batch number from 1 to 15
temperature at which cake was baked: 175C, 185C, 195C, 205C, 215C, 225C
the breaking angle of the cake
Cochran W. and Cox G. (1992) Experimental Designs, 2nd Edition Wiley
Data from a 1970's study on the relationship between insurance redlining in Chicago and racial composition, fire and theft rates, age of housing and income in 47 zip codes
This dataframe contains the following columns
racial composition in percent minority
fires per 100 housing units
theft per 1000 population
percent of housing units built before 1939
new FAIR plan policies and renewals per 100 housing units
median family income in thousands of dollars
North or South side of Chicago
Adapted from "Data : A Collection of Problems from Many Fields for the Student and Research Worker" by D. Andrews and A. Herzberg published by Springer-Verlag, in 1985
The clotting times of blood for plasma diluted with nine different percentage concentrations with prothrombin-free plasma
This data frame contains the following columns:
time in seconds to clot
concentration in percent
lot number - either one or two
Hurn et al (1945)
McCullagh & Nelder (1989) Generalized Linear Models (2ed)
Social class mobility from 1971 to 1981 for 42425 men from the United Kingdom census. Subjects were aged 45-64.
A data frame with 36 observations on the following 3 variables.
Frequency of observation
social class in 1971 - a factor with levels I (professionals), II (semi-professionals), IIIN (skilled non-manual), IIIM (skilled manual), IV (semi-skilled), V (unskilled)
social class in 1981 - a factor with levels I, II, IIIN, IIIM, IV, V with the same classification
D. Blane and S. Harding and M. Rosato (1999) "Does social mobility affect the size of the socioeconomic mortality differential?: Evidence from the Office for National Statistics Longitudinal Study" JRSS-A, 162 59-70.
Frequencies of various malformations of the central nervous system recorded on live births in South Wales, UK. Study was designed to determine the effect of water hardness on the incidence of such malformations.
A data frame with 16 observations on the following 7 variables.
a factor with levels Cardiff, GlamorganC, GlamorganE, GlamorganW, MonmouthOther, MonmouthV, Newport, Swansea, being areas of South Wales
count of births with no CNS problem
count of Anencephalus births
count of Spina Bifida births
count of other CNS births
water hardness
a factor with levels Manual, NonManual, being the type of work done by the parents
C. Lowe and C. Roberts and S. Lloyd, (1971) Malformations of the central nervous system and softness of local water supplies, British Medical Journal, 15,357-361.
P. McCullagh and J. Nelder (1989), Generalized Linear Models, Chapman and Hall, 2nd Ed.
Dataset comes from a study of blood coagulation times. 24 animals were randomly assigned to four different diets and the samples were taken in a random order.
This dataframe contains the following columns
coagulation time in seconds
diet type - A,B,C or D
"Statistics for Experimenters" by G. P. Box, W. G. Hunter and J. S. Hunter, Wiley, 1978
The composite data frame has 9 rows and 3 columns. Data comes from an experiment to test the strength of a thermoplastic composite depending on the power of a laser and speed of a tape.
This data frame contains the following columns:
interply bond strength of the composite
laser power at 40, 50 or 60W
tape speed, slow=6.42 m/s, medium=13m/s and fast=27m/s
Mazumdar, S and Hoa S (1995) "Application of a Taguchi Method for Process enhancement of an online consolidation technique" Composites 26, 669-673
The relationship between corn yield (bushels per acre) and nitrogen (pounds per acre) fertilizer application was studied in Wisconsin.
A data frame with 44 observations on the following 2 variables.
corn yield in bushels per acre
pounds per acre
Unknown
Data consist of thirteen specimens of 90/10 Cu-Ni alloys with varying iron content in percent. The specimens were submerged in sea water for 60 days and the weight loss due to corrosion was recorded in units of milligrams per square decimeter per day.
This dataframe contains the following columns
Iron content in percent
Weight loss in mg per square decimeter per day
"Applied Regression Analysis" by N. Draper and H. Smith, Wiley, 1998
Projected and actual sales of 20 consumer products. Data have been disguised from original form.
A data frame with 20 observations on the following 2 variables.
projected sales in dollars
actual sales in dollars
G. Whitmore (1986) "Inverse Gaussian Ratio Estimation" Applied Statistics, 35, 8-15.
Makes a Cp plot
Cpplot(cp)
cp | A leaps object returned from leaps() |
Requires leaps package
none
Julian Faraway
leaps()
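A possible call sequence is sketched below: an all-subsets search with `leaps()` produces the object that `Cpplot()` expects. The choice of the built-in `swiss` data is purely illustrative, and the code only runs the search if both the leaps and faraway packages are installed.

```r
# Predictors and response from the built-in `swiss` data (illustration only)
x <- as.matrix(swiss[, -1])   # five predictor columns
y <- swiss$Fertility          # response

# Run only when both packages are available
if (requireNamespace("leaps", quietly = TRUE) &&
    requireNamespace("faraway", quietly = TRUE)) {
  cp <- leaps::leaps(x, y, method = "Cp")  # all-subsets search scored by Mallows' Cp
  faraway::Cpplot(cp)                      # draw the Cp plot from the leaps object
}
```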
A study investigated whether babies take longer to learn to crawl in cold months when they are often bundled in clothes that restrict their movement, than in warmer months. The study sought an association between babies' first crawling age and the average temperature during the month they first try to crawl (about 6 months after birth). Parents brought their babies into the University of Denver Infant Study Center between 1988 and 1991 for the study. The parents reported the birth month and age at which their child was first able to creep or crawl a distance of four feet in one minute. Data were collected on 208 boys and 206 girls (40 pairs of which were twins).
A data frame with 12 observations on the following 4 variables.
average crawling age in weeks
standard deviation of crawling age
sample size
average temperature(F) six months after birth
Benson, Janette. (1993). Infant Behavior and Development
data(crawl) ## maybe str(crawl) ; plot(crawl) ...
An experiment was conducted to study the effects of surface and vision on balance. The balance of subjects was observed for two different surfaces and for restricted and unrestricted vision. Balance was assessed qualitatively on an ordinal four-point scale based on observation by the experimenter. Forty subjects were studied, twenty males and twenty females ranging in age from 18 to 38, with heights given in cm and weights in kg. The subjects were tested while standing on foam or a normal surface and with their eyes closed or open or with a dome placed over their head. Each subject was tested twice in each of the surface and eye combinations for a total of 12 measures per subject.
A data frame with 480 observations on the following 8 variables.
an indicator
a factor with levels female, male
in years
in cm
in kg
a factor with levels foam, norm
a factor with levels closed, dome, open
a four point scale measuring balance
Steele, R. (1998). Effect of surface and vision on balance. Ph.D. thesis, Department of Physiotherapy, University of Queensland.
OzDasl
data(ctsib) ## maybe str(ctsib) ; plot(ctsib) ...
Data on 326 defendants in homicide indictments in 20 Florida counties during 1976-77.
A data frame with 8 observations on the following 4 variables.
a numeric vector
Did the subject receive the death penalty? no or yes
Was the victim black (b) or white (w)?
Was the defendant black (b) or white (w)?
Radelet M. (1981) Racial characteristics and the imposition of the death penalty. Amer. Sociol. Rev. 46 918-927.
Agresti A. (1990) Categorical Data Analysis, Wiley.
The data arise from a large postal survey on the psychology of debt.
A data frame with 464 observations on the following 13 variables.
income group (1=lowest, 5=highest)
security of housing tenure (1=rent, 2=mortgage, 3=owned outright)
number of children in household
is the respondent a single parent?
age group (1=youngest)
does the respondent have a bank account?
does the respondent have a building society account?
self-rating of money management skill (high values=high skill)
how often did s/he use credit cards (1=never... 3=regularly)
does s/he buy cigarettes?
does s/he buy Christmas presents for children?
score on a locus of control scale (high values=internal)
score on a scale of attitudes to debt (high values=favourable to debt)
All yes/no questions are coded 0=no, 1=yes. Locus of control is a personality measure introduced by Rotter, which claims to differentiate people according to how much they feel things that happen to them are as a result of processes within themselves (internal locus of control) or outside events (external locus of control).
Lea, Webley & Walker, 1995, Journal of Economic Psychology, 16, 181-201 Data obtained from http://au.exeter.ac.uk/SEGLea/.
Five suppliers cut denim material for a jeans manufacturer. An algorithm is used to estimate how much material will be wasted given the dimensions of the material supplied. Typically, a supplier wastes more material than the target based on the algorithm although occasionally they waste less. The percentage of waste relative to target was collected weekly for the 5 suppliers. In all, 95 observations were recorded.
A data frame with 95 observations on the following 2 variables.
percentage wastage
a factor with levels 1, 2, 3, 4, 5
Unknown
data(denim) ## maybe str(denim) ; plot(denim) ...
403 African Americans were interviewed in a study to understand the prevalence of obesity, diabetes, and other cardiovascular risk factors in central Virginia.
A data frame with 403 observations on the following 19 variables.
Subject ID
Total Cholesterol
Stabilized Glucose
High Density Lipoprotein
Cholesterol/HDL Ratio
Glycosylated Hemoglobin
County - a factor with levels Buckingham, Louisa
age in years
a factor with levels male, female
height in inches
weight in pounds
a factor with levels small, medium, large
First Systolic Blood Pressure
First Diastolic Blood Pressure
Second Systolic Blood Pressure
Second Diastolic Blood Pressure
waist in inches
hip in inches
Postprandial Time (in minutes) when Labs were Drawn
Glycosylated hemoglobin greater than 7.0 is usually taken as a positive diagnosis of diabetes
Willems JP, Saunders JT, DE Hunt, JB Schorling: Prevalence of coronary heart disease risk factors among rural blacks: A community-based study. Southern Medical Journal 90:814-820; 1997
Schorling JB, Roach J, Siegel M, Baturka N, Hunt DE, Guterbock TM, Stewart HL: A trial of church-based smoking cessation interventions for rural African Americans. Preventive Medicine 26:92-101; 1997
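The diagnostic rule noted above (a hemoglobin reading greater than 7.0 counts as positive) can be written as a one-line indicator. The helper name `flag_diabetes` and the example readings below are hypothetical and chosen only to show how boundary and missing values behave.

```r
# Hypothetical helper: TRUE when the reading is strictly above the cutoff.
# Missing readings stay NA rather than being counted as negative.
flag_diabetes <- function(glyhb, cutoff = 7.0) {
  glyhb > cutoff
}

# Made-up readings for illustration (not taken from the data set)
flag_diabetes(c(4.3, 7.0, 9.2, NA))
```

The comparison is strict, so a reading of exactly 7.0 is not flagged.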
An experiment was conducted to determine the effect of gamma radiation on the numbers of chromosomal abnormalities observed
A data frame with 27 observations on the following 4 variables.
Number of cells in hundreds
Number of chromosomal abnormalities
amount of dose in Grays
rate of dose in Grays/hour
Purott R. and Reeder E. (1976) The effect of changes in dose rate on the yield of chromosome aberrations in human lymphocytes exposed to gamma radiation. Mutation Research. 35, 437-444.
Frome E. and DuFrain R. (1986) Maximum Likelihood Estimation for Cytogenic Dose-Response Curves. Biometrics. 42, 73-84.
Divorce rates in the USA from 1920-1996
A data frame with 77 observations on the following 7 variables.
the year from 1920-1996
divorce per 1000 women aged 15 or more
unemployment rate
percent female participation in labor force aged 16+
marriages per 1000 unmarried women aged 16+
births per 1000 women aged 15-44
military personnel per 1000 population
Unknown
A sample of psychiatry patients were cross-classified by their diagnosis and whether a drug treatment was prescribed.
A data frame with 10 observations on the following 3 variables.
the number of patients
a factor with levels Affective.Disorder, Neurosis, Personality.Disorder, Schizophrenia, Special.Symptoms
a factor with levels no, yes
Helmes E. and Fekken G. (1986) Effects of psychotropic drugs and psychiatric illness on vocational aptitude and interest assessment. J. Clin. Psychol. 42 569-576
Agresti A. (1990) "Categorical Data Analysis" Wiley
The data come from the Australian Health Survey of 1977-78 and consist of 5190 single adults where young and old have been oversampled.
A data frame with 5190 observations on the following 19 variables.
1 if female, 0 if male
Age in years divided by 100 (measured as mid-point of 10 age groups from 15-19 years to 65-69 with 70 or more treated as 72)
age squared
Annual income in Australian dollars divided by 1000 (measured as mid-point of coded ranges Nil, less than 200, 200-1000, 1001-, 2001-, 3001-, 4001-, 5001-, 6001-, 7001-, 8001-10000, 10001-12000, 12001-14000, with 14001- treated as 15000)
1 if covered by private health insurance fund for private patient in public hospital (with doctor of choice), 0 otherwise
1 if covered by government because low income, recent immigrant, unemployed, 0 otherwise
1 if covered free by government because of old-age or disability pension, or because invalid veteran or family of deceased veteran, 0 otherwise
Number of illnesses in past 2 weeks with 5 or more coded as 5
Number of days of reduced activity in past two weeks due to illness or injury
General health questionnaire score using Goldberg's method. High score indicates bad health
1 if chronic condition(s) but not limited in activity, 0 otherwise
1 if chronic condition(s) and limited in activity, 0 otherwise
Number of consultations with a doctor or specialist in the past 2 weeks
Number of consultations with non-doctor health professionals (chemist, optician, physiotherapist, social worker, district community nurse, chiropodist or chiropractor) in the past 2 weeks
Number of admissions to a hospital, psychiatric hospital, nursing or convalescent home in the past 12 months (up to 5 or more admissions which is coded as 5)
Number of nights in a hospital, etc. during most recent admission: taken, where appropriate, as the mid-point of the intervals 1, 2, 3, 4, 5, 6, 7, 8-14, 15-30, 31-60, 61-79 with 80 or more admissions coded as 80. If no admission in past 12 months then equals zero
Total number of prescribed and nonprescribed medications used in past 2 days
Total number of prescribed medications used in past 2 days
Total number of nonprescribed medications used in past 2 days
Cameron A, Trivedi P, Milne F and Piggot J (1988) A Microeconometric model of the demand for health care and health insurance in Australia, Review of Economic Studies 55, 85-106
Relationship between 1998 per capita income dollars from all sources and the proportion of legal state residents born in the United States in 1990 for each of the 50 states plus the District of Columbia
This dataframe contains the following columns
Percentage of population born in the United States
Per capita annual income in dollars
Percentage born in state
Population of state
US Bureau of the Census
The eggprod data frame has 12 rows and 3 columns. Six pullets were placed into each of 12 pens. Four blocks were formed from groups of 3 pens based on location. Three treatments were applied. The number of eggs produced was recorded.
This data frame contains the following columns:
Three treatments: O, E or F
Four blocks labeled 1-4
Number of eggs produced
Mead, R., R.N. Curnow, and A.M. Hasted. 1993. Statistical Methods in Agriculture and Experimental Biology. Chapman and Hall, London, p. 64.
Consistency between laboratory tests is important and yet the results may depend on who did the test and where the test was performed. In an experiment to test levels of consistency, a large jar of dried egg powder was divided up into a number of samples. Because the powder was homogenized, the fat content of the samples is the same, but this fact is withheld from the laboratories. Four samples were sent to each of six laboratories. Two of the samples were labeled as G and two as H, although in fact they were identical. The laboratories were instructed to give two samples to two different technicians. The technicians were then instructed to divide their samples into two parts and measure the fat content of each. So each laboratory reported eight measures, each technician four measures, that is, two replicated measures on each of two samples.
A data frame with 48 observations on the following 4 variables.
a numeric vector
a factor with levels I, II, III, IV, V, VI
a factor with levels one, two
a factor with levels G, H
Bliss, C. I. (1967). Statistics in Biology. New York: McGraw Hill.
data(eggs) ## maybe str(eggs) ; plot(eggs) ...
Data from a clinical trial of 59 epileptics. For a baseline, patients were observed for 8 weeks and the number of seizures recorded. The patients were then randomized to treatment by the drug Progabide (31 patients) or to the placebo group (28 patients). They were observed for four 2-week periods and the number of seizures recorded.
A data frame with 295 observations on the following 6 variables.
number of seizures
identifying number
1=treated, 0=not
0=baseline period, 1=treatment period
weeks of period
in years
Thall, P. F. and S. C. Vail (1990). Some covariance models for longitudinal count data with overdispersion. Biometrics 46, 657-671.
Breslow, N. E. and D. G. Clayton (1993). Approximate inference in generalized linear mixed models. Journal of the American Statistical Association 88, 9-25.
Diggle, P. J., P. Heagerty, K. Y. Liang, and S. L. Zeger (2002). Analysis of Longitudinal Data (2 ed.). Oxford: Oxford University Press.
data(epilepsy) ## maybe str(epilepsy) ; plot(epilepsy) ...
Data was recorded on 44 doctors working in an emergency service at a hospital to study the factors affecting the number of complaints received.
A data frame with 44 observations on the following 6 variables.
the number of patient visits
the number of complaints
is the doctor in residency training: N or Y
gender of doctor: F or M
dollars per hour earned by the doctor
total number of hours worked
Chap T. Le (1998) "Applied Categorical Data Analysis" Wiley
True function is f(x)=sin^3(2pi x^3).
A data frame with 256 observations on the following 3 variables.
input
response
true value
Haerdle, W. (1991). Smoothing Techniques with Implementation in S. New York:Springer.
data(exa) ## maybe str(exa) ; plot(exa) ...
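The true function f(x)=sin^3(2pi x^3) stated for this data set can be evaluated directly, which is handy when comparing a smoother against the truth. In the sketch below the name `true_f` and the evenly spaced grid on [0, 1] are assumptions for illustration; the actual input values should be taken from the data frame.

```r
# True regression function f(x) = sin^3(2 * pi * x^3)
true_f <- function(x) sin(2 * pi * x^3)^3

# Assumed evaluation grid: 256 evenly spaced points on [0, 1]
x <- seq(0, 1, length.out = 256)
fx <- true_f(x)
range(fx)   # values stay within [-1, 1] because it is a cubed sine
```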
True function is f(x)=0
A data frame with 256 observations on the following 3 variables.
input
response
true value
Haerdle, W. (1991). Smoothing Techniques with Implementation in S. New York:Springer.
data(exb) ## maybe str(exb) ; plot(exb) ...
A sample of women are rated for the performance of distance vision in each eye.
A data frame with 16 observations on the following 3 variables.
the observed count
rated vision in the right eye - a factor with levels best, second, third, worst
rated vision in the left eye - a factor with levels best, second, third, worst
A. Stuart (1955) A test for homogeneity of the marginal distributions in a two-way classification, Biometrika, 42, 412-416.
Age, weight, height, and 10 body circumference measurements are recorded for 252 men. Each man's percentage of body fat was accurately estimated by an underwater weighing technique.
A data frame with 252 observations on the following 18 variables.
Percent body fat using Brozek's equation, 457/Density - 414.2
Percent body fat using Siri's equation, 495/Density - 450
Density (gm/$cm^3$)
Age (yrs)
Weight (lbs)
Height (inches)
Adiposity index = Weight/Height$^2$ (kg/$m^2$)
Fat Free Weight = (1 - fraction of body fat) * Weight, using Brozek's formula (lbs)
Neck circumference (cm)
Chest circumference (cm)
Abdomen circumference (cm) at the umbilicus and level with the iliac crest
Hip circumference (cm)
Thigh circumference (cm)
Knee circumference (cm)
Ankle circumference (cm)
Extended biceps circumference (cm)
Forearm circumference (cm)
Wrist circumference (cm) distal to the styloid processes
Johnson R. Journal of Statistics Education v.4, n.1 (1996)
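The derived variables above follow directly from the stated formulas. The sketch below codes them in base R. The function names are hypothetical, and the unit conversions (pounds to kilograms, inches to metres) are added assumptions needed to express the adiposity index in kg/m^2.

```r
# Percent body fat from density (gm/cm^3), using the equations quoted above
brozek <- function(density) 457 / density - 414.2
siri   <- function(density) 495 / density - 450

# Fat free weight (lbs) = (1 - fraction of body fat) * weight, Brozek's formula
fat_free_weight <- function(density, weight_lb) {
  (1 - brozek(density) / 100) * weight_lb
}

# Adiposity index in kg/m^2 from weight in lbs and height in inches
adiposity <- function(weight_lb, height_in) {
  kg <- weight_lb * 0.45359237   # pounds to kilograms
  m  <- height_in * 0.0254       # inches to metres
  kg / m^2
}
```

For example, `siri(1.1)` evaluates to zero percent body fat, because 495/1.1 equals 450.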
In 1972-74, a survey of one in six residents of Whickham, near Newcastle, England was made. Twenty years later, these data were recorded in a follow-up study. Only women who are current smokers or who have never smoked are included.
A data frame with 28 observations on the following 4 variables.
observed count for given combination
a factor with levels yes, no
a factor with levels yes, no
a factor with agegroup levels 18-24, 25-34, 35-44, 45-54, 55-64, 65-74, 75+
D. Appleton, J. French, M. Vanderpump (1996) "Ignoring a Covariate: An Example of Simpson's Paradox" American Statistician, 50, 340-341
Fortune magazine publishes a list of the world's billionaires each year. The 1992 list includes 233 individuals. Their wealth, age, and geographic location (Asia, Europe, Middle East, United States, and Other) are reported.
A data frame with 232 observations on the following 3 variables.
Billions of dollars
age in years
a factor with levels A = Asia, E = Europe, M = Middle East, O = Other, U = USA
Fortune magazine
data(fortune) ## maybe str(fortune) ; plot(fortune) ...
Elections for the French presidency proceed in two rounds. In 1981, there were 10 candidates in the first round. The top two candidates then went on to the second round, which was won by Francois Mitterrand over Valery Giscard d'Estaing. The losers in the first round can gain political favors by urging their supporters to vote for one of the two finalists. Since voting is private, we cannot know how these votes were transferred, but we might hope to infer from the published vote totals how this might have happened. Data is given for vote totals in every fourth department of France:
This dataframe contains the following columns (vote totals are in thousands)
Electeur Inscrits (registered voters)
Voters for Mitterrand in the first round
Voters for Giscard in the first round
Voters for Chirac in the first round
Voters for Communists in the first round
Voters for Ecology party in the first round
Voters for party F in the first round
Voters for party G in the first round
Voters for party H in the first round
Voters for party I in the first round
Voters for party J in the first round
Voters for party K in the first round
Voters for Mitterrand in the second round
Voters for Giscard in the second round
Difference between the number of voters in the second round and in the first round
"The Teaching of Practical Statistics" by C.W. Anderson and R.M. Loynes, Wiley,1987
fround rounds the values in its first argument to the specified number of decimal places with surrounding quotes.
fround(x, digits)
x | a numeric vector. |
digits | integer indicating the precision to be used. |
pfround rounds the values in its first argument to the specified number of decimal places without surrounding quotes.
Andrew Gelman; Yu-Sung Su
Copied from the arm package
x <- 3.1415926
fround(x, digits=2)
pfround(x, digits=2)
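The difference between the quoted and unquoted forms can be reproduced in base R. The sketch below uses hypothetical stand-in functions (`fround_sketch`, `pfround_sketch`). It assumes the behavior described above, and is not necessarily how the package implements it.

```r
# Hypothetical stand-ins for fround() and pfround(), written in base R.
fround_sketch <- function(x, digits) {
  # Round, then format so trailing zeros are kept; returns character strings.
  format(round(x, digits), nsmall = digits)
}

pfround_sketch <- function(x, digits) {
  # Same formatting, but printed without the surrounding quotes.
  print(fround_sketch(x, digits), quote = FALSE)
}

x <- 3.1415926
fround_sketch(x, digits = 2)
pfround_sketch(x, digits = 2)
```

Because the result is a character vector, it is suited to building table cells or labels rather than to further arithmetic.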
The fruitfly data frame has 124 rows and 3 columns. 125 fruitflies
were divided randomly into 5 groups of 25 each. The response was the
longevity of the fruitfly in days. One group was kept solitary, while
another was kept individually with a virgin female each day. Another group
was given 8 virgin females per day. As an additional control the fourth and
fifth groups were kept with one or eight pregnant females per day. Pregnant
fruitflies will not mate. The thorax length of each male was measured as
this was known to affect longevity. One observation in the many group has
been lost.
This data frame contains the following columns:
Thorax length
Lifetime in days
The group: isolated = fly kept solitary, one = fly kept with one pregnant fruitfly, many = fly kept with eight pregnant fruitflies, low= fly kept with one virgin fruitfly, high = fly kept with eight virgin fruitflies.
"Sexual Activity and the Lifespan of Male Fruitflies" by L. Partridge and M. Farquhar, Nature, 1981, 580-581
There are 30 Galapagos islands and 7 variables in the dataset. The
relationship between the number of plant species and several geographic
variables is of interest. The original dataset contained several missing
values which have been filled for convenience. See the galamiss
dataset for the original version.
The dataset contains the following variables
the number of plant species found on the island
the number of endemic species
the area of the island (km$^2$)
the highest elevation of the island (m)
the distance from the nearest island (km)
the distance from Santa Cruz island (km)
the area of the adjacent island (square km)
M. P. Johnson and P. H. Raven (1973) "Species number and endemism: The Galapagos Archipelago revisited" Science, 179, 893-895
There are 30 Galapagos islands and 7 variables in the dataset. The relationship between the number of plant species and several geographic variables is of interest. This is the original version of the dataset containing missing values.
The dataset contains the following variables
the number of plant species found on the island
the number of endemic species
the area of the island (km$^2$)
the highest elevation of the island (m)
the distance from the nearest island (km)
the distance from Santa Cruz island (km)
the area of the adjacent island (square km)
M. P. Johnson and P. H. Raven (1973) "Species number and endemism: The Galapagos Archipelago revisited" Science, 179, 893-895
The X-ray decay light curve of Gamma ray burst 050525a obtained with the X-Ray Telescope (XRT) on board the Swift satellite. The dataset has 63 brightness measurements in the 0.4-4.5 keV spectral band at times ranging from 2 minutes to 5 days after the burst.
A data frame with 63 observations on the following 3 variables.
in seconds since burst
X-ray flux in units of 10^-11 erg/cm2/s, 2-10 keV
measurement error of the flux based on detector signal-to-noise values
A. J. Blustin and 64 coauthors, Astrophys. J. 637, 901-913 2006. Available at http://arxiv.org/abs/astro-ph/0507515.
data(gammaray) ## maybe str(gammaray) ; plot(gammaray) ...
The data comes from the US presidential election in the state of Georgia. The undercount is the difference between the number of ballots cast and votes recorded. Voters may have chosen not to vote for president, voted for more than one candidate (disqualified) or the equipment may have failed to register their choice.
A data frame with 159 observations on the following 10 variables. Each case represents a county in Georgia.
The voting equipment used: LEVER, OS-CC (optical, central count), OS-PC (optical, precinct count), PAPER, PUNCH
economic status of county: middle, poor, rich
percent of African Americans in county
indicator of whether county is rural or urban
indicator of whether county is in Atlanta or not: notAtlanta
number of votes for Gore
number of votes for Bush
number of votes for other candidates
number of votes
number of ballots
Meyer M. (2002) Uncounted Votes: Does Voting Equipment Matter? Chance, 15(4), 33-38
Average Northern Hemisphere Temperature from 1856-2000 and eight climate proxies from 1000-2000AD. Data can be used to predict temperatures prior to 1856.
A data frame with 1001 observations on the following 10 variables.
Northern hemisphere average temperature (C) provided by the UK Met Office (known as HadCRUT2)
Tree ring proxy information from the Western USA.
Tree ring proxy information from Canada.
Ice core proxy information from west Greenland
Sea shell proxy information from Chesapeake Bay
Tree ring proxy information from Sweden
Tree ring proxy information from the Urals
Tree ring proxy information from Mongolia
Tree ring proxy information from Tasmania
Year 1000-2000AD
See the source and references below for the original data. Only some proxies have been included here. Some missing values have been imputed. The proxy data have been smoothed. This version of the data is intended only for demonstration purposes. If you are specifically interested in the subject matter, use the original data.
P.D. Jones and M.E. Mann (2004) "Climate Over Past Millennia" Reviews of Geophysics, Vol. 42, No. 2, RG2002, doi:10.1029/2003RG000143
www.ncdc.noaa.gov/paleo/pubs/jones2004/jones2004.html
data(globwarm) ## maybe str(globwarm) ; plot(globwarm) ...
Data collected from 592 students in an introductory statistics class
A data frame with 16 observations on the following 3 variables.
count of the number of students with given hair/eye combination
a factor with levels green, hazel, blue, brown
a factor with levels BLACK, BROWN, RED, BLOND
Snee R. (1974) Graphical display of two-way contingency tables. American Statistician, 28, 9-12
Makes a half-normal plot
halfnorm( x, nlab = 2, labs = as.character(1:length(x)), ylab = "Sorted Data", ... )
x | a numeric vector |
nlab | number of points to label |
labs | labels for points |
ylab | label for Y-axis |
... | arguments passed to plot() |
none
Julian Faraway
qqnorm
halfnorm(runif(10))
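A half-normal plot pairs the sorted absolute values of the data with quantiles of the half-normal distribution. The sketch below shows one way to build such a plot in base R. The helper names are hypothetical, and the (n + i) / (2n + 1) plotting positions are an assumed convention rather than a statement of what halfnorm does internally.

```r
# Half-normal plotting positions for n points: quantiles of |Z|, Z ~ N(0, 1).
# The (n + i) / (2n + 1) positions are an assumption made for this sketch.
half_normal_quantiles <- function(n) {
  qnorm((n + seq_len(n)) / (2 * n + 1))
}

# Hypothetical helper: plot sorted |x| against the half-normal quantiles and
# label the nlab largest points with their original indices.
half_normal_plot <- function(x, nlab = 2, ylab = "Sorted Data") {
  ax <- abs(x)
  ord <- order(ax)
  q <- half_normal_quantiles(length(ax))
  plot(q, ax[ord], xlab = "Half-normal quantiles", ylab = ylab)
  top <- tail(seq_along(ax), nlab)
  text(q[top], ax[ord][top], labels = ord[top], pos = 2)
  invisible(data.frame(quantile = q, value = ax[ord], index = ord))
}

# Draw the plot only in an interactive session.
if (interactive()) half_normal_plot(runif(10))
```

Points that sit well above the trend of the rest are the candidates for unusual observations, which is why only the largest few are labelled.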
Data were collected from 39 students in a University of Chicago MBA class
mba
A data frame with 39 observations on the following 5 variables.
Happiness on a 10 point scale where 10 is most happy
family income in thousands of dollars
1 = satisfactory sexual activity, 0 = not
1 = lonely, 2 = secure relationships, 3 = deep feeling of belonging and caring
5 point scale where 1 = no job, 3 = OK job, 5 = great job
An object of class data.frame
with 39 rows and 5 columns.
George and McCulloch (1993) "Variable Selection via Gibbs Sampling" JASA, 88, 881-889
16 insulin-dependent diabetic children were enrolled in a study involving a new treatment. 8 children received the new treatment (N) while the other 8 received the standard treatment (S). The age and sex of the child were recorded along with the measured value of glycosylated hemoglobin both before and after treatment.
A data frame with 16 observations on the following 5 variables.
age in years
a factor with levels F, M
a factor with levels N, S
measured value of hemoglobin before treatment
measured value of hemoglobin after treatment
Unknown
data(hemoglobin) ## maybe str(hemoglobin) ; plot(hemoglobin) ...
Data from the Royal Mineral Hospital in Bath. AS (ankylosing spondylitis) is a chronic form of arthritis. A study was conducted to determine whether daily stretching of the hip tissues would improve mobility. 39 “typical” AS patients were randomly allocated to the control (standard treatment) group or the treatment group in a 1:2 ratio. Responses were flexion and rotation angles at the hip measured in degrees. Larger numbers indicate more flexibility.
A data frame with 78 observations on the following 7 variables.
flexion angle before
flexion angle after
rotation angle before
rotation angle after
treatment group - a factor with levels control, treat
side of the body - a factor with levels right, left
id for the individual
Chatfield C. (1995) Problem Solving: A Statistician's Guide, 2ed Chapman Hall.
data(hips) ## maybe str(hips) ; plot(hips) ...
Urinary androsterone (androgen) and etiocholanolone (estrogen) values were recorded from 26 healthy males.
A data frame with 26 observations on the following 3 variables.
concentration
concentration
sexual orientation with levels g, s
Margolese, M. (1970). Homosexuality: A new endocrine correlate. Hormones and Behavior 1, 151-155.
Hand, D. (1981). Discrimination and Classification. Chichester, UK: Wiley.
data(hormone) ## maybe str(hormone) ; plot(hormone) ...
Data on housing prices in 36 US metropolitan statistical areas (MSAs) over 9 years from 1986-1994 were collected.
A data frame with 324 observations on the following 8 variables.
natural log average sale price in thousands of dollars
average per capita income
percentage growth in per capita income
Regulatory environment index (high values = more regulations)
Rent control - a factor with levels 0 = no, 1 = yes
Adjacent to a coastline - a factor with levels 0 = no, 1 = yes
indicator for the MSA
Year 1=1986 to 9=1994
Longitudinal and Panel Data: Analysis and Applications in the Social Sciences, by Edward W. Frees, Cambridge University Press, August 2004.
Data was collected as a subset of the "High School and Beyond" study conducted by the National Education Longitudinal Studies (NELS) program of the National Center for Education Statistics (NCES).
A data frame with 200 observations on the following 11 variables.
ID of student
a factor with levels female, male
a factor with levels african-amer, asian, hispanic, white
socioeconomic class - a factor with levels high, low, middle
school type - a factor with levels private, public
choice of high school program - a factor with levels academic, general, vocation
reading score
writing score
math score
science score
social science score
One purpose of the study was to determine which factors are related to the choice of the type of program, academic, vocational or general, that the students pursue in high school.
National Education Longitudinal Studies (NELS) program of the National Center for Education Statistics (NCES).
Computes the inverse logit transformation
ilogit(x)
x | a numeric vector |
exp(x)/(1+exp(x))
Julian Faraway
logit
ilogit(1:3) #[1] 0.7310586 0.8807971 0.9525741
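The formula exp(x)/(1+exp(x)) can be written directly in base R and compared with `plogis()`, the logistic distribution function in the stats package, which computes the same map. The function name `ilogit_sketch` is hypothetical and is only a literal transcription of the formula, not the package's code.

```r
# Inverse logit written directly from the formula above.
ilogit_sketch <- function(x) exp(x) / (1 + exp(x))

# plogis() is the same transformation, evaluated in a numerically stable way.
ilogit_sketch(1:3)
plogis(1:3)

# The literal formula breaks down for large x because exp(x) overflows.
ilogit_sketch(1000)   # Inf / Inf, which is NaN
plogis(1000)          # 1
```

For moderate inputs the two agree. The overflow case is a reason to prefer a stable implementation when linear predictors can be large.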
The infmort data frame has 105 rows and 4 columns. The infant
mortality in regions of the world may be related to per capita income and
whether oil is exported. The dataset is not recent.
This data frame contains the following columns:
Region of the world, Africa, Europe, Asia or the Americas
Per capita annual income in dollars
Infant mortality in deaths per 1000 births
Does the country export oil or not?
Unknown
Data on natural gas usage in a house. The weekly gas consumption (in 1000 cubic feet) and the average outside temperature (in degrees Celsius) were recorded for 26 weeks before and 30 weeks after cavity-wall insulation had been installed. The house thermostat was set at 20C throughout.
A data frame with 44 observations on the following 3 variables.
a factor with levels After, Before
Outside temperature
Weekly consumption in 1000 cubic feet
MASS package as whiteside
data(insulgas) ## maybe str(insulgas) ; plot(insulgas) ...
In an agricultural field trial, the objective was to determine the effects of two crop varieties and four different irrigation methods. Eight fields were available, but only one type of irrigation may be applied to each field. The fields may be divided into two parts with a different variety planted in each half. The whole plot factor is the method of irrigation, which should be randomly assigned to the fields. Within each field, the variety is randomly assigned.
A data frame with 16 observations on the following 4 variables.
a factor with levels f1, f2, f3, f4, f5, f6, f7, f8
a factor with levels i1, i2, i3, i4
a factor with levels v1, v2
a numeric vector
Found online but source not recorded.
data(irrigation) ## maybe str(irrigation) ; plot(irrigation) ...
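The two-stage randomization described for this split-plot design can be sketched in base R. The level labels follow the factor levels listed above. The balanced allocation of two fields per irrigation method, the `half` column, and the seed are assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

fields <- paste0("f", 1:8)
methods <- paste0("i", 1:4)
varieties <- c("v1", "v2")

# Whole-plot randomization: each irrigation method is assigned to two fields
# (balanced allocation is an assumption of this sketch).
irrigation_by_field <- setNames(sample(rep(methods, each = 2)), fields)

# Split-plot randomization: within each field, the two halves receive the two
# varieties in random order.
layout <- do.call(rbind, lapply(fields, function(f) {
  data.frame(field = f,
             half = 1:2,
             irrigation = irrigation_by_field[[f]],
             variety = sample(varieties))
}))

layout
```

Irrigation varies only between fields while variety varies within them, so comparisons of irrigation methods rest on field-to-field variation and comparisons of varieties on within-field variation.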
Data from the Junior School Project, collected from primary (U.S. term is elementary) schools in inner London.
A data frame with 3236 observations on the following 9 variables.
50 schools code 1-50
a factor with levels 1, 2, 3, 4
a factor with levels boy, girl
class of the father I=1; II=2; III nonmanual=3; III manual=4; IV=5; V=6; Long-term unemployed=7; Not currently employed=8; Father absent=9
test score
student id coded 1-1402
score on English
score on Maths
year of school
Mortimore, P., P. Sammons, L. Stoll, D. Lewis, and R. Ecob (1988). School Matters. Wells, UK: Open Books.
Goldstein, H. (1995). Multilevel Statistical Models (2 ed.). London: Arnold.
data(jsp) ## maybe str(jsp) ; plot(jsp) ...
Sex and species of specimens of kangaroo.
A data frame with 148 observations on the following 20 variables.
a factor with levels fuliginosus, giganteus, melanops
a factor with levels Female, Male
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
a numeric vector
Andrews and Herzberg (1985) Chapter 53.
Andrews, D. F. and Herzberg, A. M. (1985). Data. Springer-Verlag, New York.
data(kanga) ## maybe str(kanga) ; plot(kanga) ...
Data on the cut-off times of lawnmowers was collected. Three machines were randomly selected from those produced by each of manufacturers A and B. Each machine was tested twice at low speed and high speed.
A data frame with 24 observations on the following 4 variables.
Manufacturer - a factor with levels A, B
Lawn mower - a factor with levels m1, m2, m3, m4, m5, m6
Speed of testing - a factor with levels H, L
cut-off time
Unknown.
The data gives the proportion of leaf area affected by leaf blotch on 10 varieties of barley at 9 different sites.
A data frame with 90 observations on the following 3 variables.
proportion of the barley leaf affected by blotch
the physical location - a factor with levels 1, 2, 3, 4, 5, 6, 7, 8, 9
variety of barley - a factor with levels 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
R. W. M. Wedderburn (1974) "Quasilikelihood functions, generalized linear models and the Gauss-Newton method" Biometrika, 61, 439-447.
P. McCullagh and J. Nelder (1989) "Generalized Linear Models" Chapman and Hall, 2nd ed.
Data on the burning time of samples of tobacco leaves
A data frame with 30 observations on the following 4 variables.
nitrogen content by percentage weight
chlorine content by percentage weight
potassium content by percentage weight
burn time in seconds
Steel, R. G. D. and Torrie, J. H. (1980), Principles and Procedures of Statistics, Second Edition, New York: McGraw-Hill
Computes the logit transformation
logit(x)
x | a numeric vector |
Values of x <= 0 or >= 1 will return NA
log(x/(1-x))
Julian Faraway
ilogit
logit(c(0.1,0.5,1.0,1.1)) #[1] -2.197225 0.000000 NA NA
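The NA rule for out-of-range inputs can be made explicit in a few lines of base R. The name `logit_sketch` is hypothetical. It transcribes log(x/(1-x)) together with the stated boundary behavior and is not the package's source.

```r
# Logit with NA returned for any value outside the open interval (0, 1).
logit_sketch <- function(x) {
  out <- rep(NA_real_, length(x))
  ok <- !is.na(x) & x > 0 & x < 1
  out[ok] <- log(x[ok] / (1 - x[ok]))
  out
}

logit_sketch(c(0.1, 0.5, 1.0, 1.1))

# Round trip: applying the inverse logit (plogis in base R) recovers p.
p <- c(0.1, 0.5, 0.9)
all.equal(plogis(logit_sketch(p)), p)
```

Base R's `qlogis()` computes the same transformation inside (0, 1) but returns Inf at 1 rather than NA, so the two are not interchangeable at the boundaries.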
The mammalsleep data frame has 62 rows and 10 columns. Sleep in Mammals: Ecological and Constitutional Correlates
This data frame contains the following columns:
body weight in kg
brain weight in g
slow wave ("nondreaming") sleep (hrs/day)
paradoxical ("dreaming") sleep (hrs/day)
total sleep (hrs/day) (sum of slow wave and paradoxical sleep)
maximum life span (years)
gestation time (days)
predation index (1-5) 1 = minimum (least likely to be preyed upon) to 5 = maximum (most likely to be preyed upon)
sleep exposure index (1-5) 1 = least exposed (e.g. animal sleeps in a well-protected den) 5 = most exposed
overall danger index (1-5) (based on the above two indices and other information) 1 = least danger (from other animals) 5 = most danger (from other animals)
"Sleep in Mammals: Ecological and Constitutional Correlates" by Allison, T. and Cicchetti, D. (1976), Science, November 12, vol. 194, pp. 732-734.
In 1750, Tobias Mayer collected data on various landmarks on the moon in order to determine its orbit. The data involving the position of the Manilius crater resulted in a least squares like problem. The example is discussed in Steven Stigler's History of Statistics.
A data frame with 27 observations on the following 4 variables.
an angle known as h in Stigler's notation
the sin(g-k) where g and k are two angles in Stigler
the cos(g-k) where g and k are two angles in Stigler
one of three groups determined by Mayer
See Stigler for a detailed description.
Stigler, S. (1986) History of Statistics. Belknap Press, Harvard.
Mayer, T. (1750) Abhandlung uber die Umwaltzung des Monds um seine Axe und die scheinbare Bewegung der Mondsflecken published in the Kosmographische Nachrichten und Sammlungen auf das Jahr 1748. 52-183
data(manilius)
Displays the best models from a leaps object
maxadjr(l, best = 3)
l | A leaps object returned from leaps() |
best | An optional argument specifying the number of models to be returned; the default is 3 |
Requires leaps package
A list of the best models
Julian Faraway
leaps()
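The idea of listing the best few models by adjusted R-squared can be illustrated without the leaps package. The sketch below is a brute-force alternative in base R, not the package's method. The helper `best_adjr` is hypothetical, and the built-in `mtcars` data and the chosen predictors are used purely for illustration.

```r
# Hypothetical helper: fit every non-empty subset of the predictors and
# return the `best` subsets ranked by adjusted R-squared.
best_adjr <- function(response, predictors, data, best = 3) {
  subsets <- unlist(
    lapply(seq_along(predictors),
           function(k) combn(predictors, k, simplify = FALSE)),
    recursive = FALSE
  )
  adjr <- vapply(subsets, function(vars) {
    fit <- lm(reformulate(vars, response = response), data = data)
    summary(fit)$adj.r.squared
  }, numeric(1))
  names(adjr) <- vapply(subsets, paste, character(1), collapse = " + ")
  head(sort(adjr, decreasing = TRUE), best)
}

best_adjr("mpg", c("wt", "hp", "qsec", "drat"), data = mtcars)
```

Exhaustive fitting grows as 2^p in the number of predictors, so this approach is only practical for a handful of candidates.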
A Tecator Infratec Food and Feed Analyzer working in the wavelength range 850 - 1050 nm by the Near Infrared Transmission (NIT) principle was used to collect data on samples of finely chopped pure meat. 215 samples were measured. For each sample, the fat content was measured along with a 100 channel spectrum of absorbances. Since determining the fat content via analytical chemistry is time consuming we would like to build a model to predict the fat content of new samples using the 100 absorbances which can be measured more easily.
Dataset contains the following variables
absorbances across a range of 100 wavelengths
fat content
H. H. Thodberg (1993) "Ace of Bayes: Application of Neural Networks With Pruning", report no. 1132E, Maglegaardvej 2, DK-4000 Roskilde, Danmark
Data comes from a study of Malignant Melanoma involving 400 subjects.
A data frame with 12 observations on the following 3 variables.
number of cases
type of tumor - a factor with levels freckle, indeterminate, nodular, superficial
location of tumor on the body - a factor with levels extremity, head, trunk
Dobson A. (2002) An introduction to generalized linear models, Chapman Hall.
In Sweden all motor insurance companies apply identical risk arguments to classify customers, and thus their portfolios and their claims statistics can be combined. The data were compiled by a Swedish Committee on the Analysis of Risk Premium in Motor Insurance. The Committee was asked to look into the problem of analyzing the real influence on claims of the risk arguments and to compare this structure with the actual tariff.
A data frame with 1797 observations on the following 8 variables.
an ordered factor representing kilometers per year with levels 1: < 1000, 2: 1000-15000, 3: 15000-20000, 4: 20000-25000, 5: > 25000
a factor representing geographical area with levels 1: Stockholm, Goteborg, Malmo with surroundings 2: Other large cities with surroundings 3: Smaller cities with surroundings in southern Sweden 4: Rural areas in southern Sweden 5: Smaller cities with surroundings in northern Sweden 6: Rural areas in northern Sweden 7: Gotland
No claims bonus. Equal to the number of years, plus one, since last claim
A factor representing eight different common car models. All other models are combined in class 9
Number of insured in policy-years
Number of claims
Total value of payments in Skr
payment per claim
http://www.statsci.org/data/general/motorins.html
Hallin, M., and Ingenbleek, J.-F. (1983). The Swedish automobile portfolio in 1977. A statistical study. Scandinavian Actuarial Journal, 49-64.
Subjects were asked questions in a study of neighborly help. Questions below are a subset of the full study.
A data frame with 181 observations on the following 8 variables.
About how long have you lived where you do now? Ans is a factor with levels <6mos, 6-12mos, 1-3yrs, 3-10yrs, 10yrs
Where were you living before you moved to your present house? Ans is a factor with levels same, Exeter, Devon, Britain, Abroad
How neighborly do you think the area where you now live is? Ans is a factor with levels Vunfriendly, NVfriendly, Average, FFriendly, VFriendly
Roughly how many people in your street, or in the streets just near you, do you know the names of? Ans is a factor with levels none, 1-5, 6-20, 20+
How many of those people (not counting children) would you call by their first names? Ans is a factor with levels none, 1-5, 6-20, 20+
a factor with levels -18, 18-30, 31-50, 51-65, 65+
a factor with levels 1, 2, 3, 4
a factor with levels female, male
Exeter is a city in the county of Devon which is in Britain. The four districts can be briefly described as follows. District 1 was a long-established residential area near the city centre, with housing dating from the late nineteenth century. Originally working class, it now has a considerable middle class population with some student and other temporary accommodation. District 2 was a working-class housing estate dating from the 1930s, with mainly rented accommodation but some owner occupation. District 3 was the oldest part of a more recently developed, mainly middle-class, almost exclusively owner-occupied estate, dating from the 1960s. District 4 was the most recently developed part of a more sought-after middle-class residential area, with smaller but almost entirely owner-occupied properties dating from the 1970s and 1980s.
P. Webley & S. Lea 1993, Human Relations 46, 65-76.
A subset of the National Education Longitudinal Study of 1988
A data frame with 260 observations on the following 5 variables.
a factor with levels Female, Male
a factor with levels White, Asian, Black, Hispanic
a numeric vector
a factor with levels ba, college, hs, lesshs, ma, phd
a numeric vector
http://www.icpsr.umich.edu/icpsrweb/ICPSR/series/107
data(nels88) ## maybe str(nels88) ; plot(nels88) ...
The data are a subset from a public health study on Nepalese children.
A data frame with 1000 observations on the following 9 variables.
There is a six digit code for the child's ID: 2 digits for the panchayat number; 2 digits for the ward within panchayat; 1 digit for the household; 1 digit for child within household.
1 = male; 2 = female
Child's weight measured in kilograms
Child's height measured in centimeters
Mother's age in years
Indicator of mother's literacy: 0 = no; 1 = yes
The number of children the mother has had that died.
The number of children the mother has ever had born alive
age of child
West KP, Jr., LeClerq SC, Shrestha SR, Wu LS, Pradhan EK, Khatry SK, Katz J, Adhikari R, Sommer A. Effects of vitamin A on growth of vitamin A deficient children: field studies in Nepal. J Nutr 1997;10:1957-1965.
10 variable subset of the 1996 American National Election Study. Missing values and "don't know" responses have been deleted. Respondents expressing a voting preference other than Clinton or Dole have been removed.
A data frame with 944 observations on the following 10 variables.
population of respondent's location in 1000s of people
days in the past week spent watching news on TV
Left-Right self-placement of respondent: an ordered factor with levels extremely liberal, extLib < liberal, Lib < slightly liberal, sliLib < moderate, Mod < slightly conservative, sliCon < conservative, Con < extremely conservative, extCon
Left-Right placement of Bill Clinton (same scale as selfLR): an ordered factor with levels extLib < Lib < sliLib < Mod < sliCon < Con < extCon
Left-Right placement of Bob Dole (same scale as selfLR): an ordered factor with levels extLib < Lib < sliLib < Mod < sliCon < Con < extCon
Party identification: an ordered factor with levels strong Democrat, strDem < weak Democrat, weakDem < independent Democrat, indDem < independent independent, indind < independent Republican, indRep < weak Republican, weakRep < strong Republican, strRep
Respondent's age in years
Respondent's education: an ordered factor with levels 8 years or less, MS < high school dropout, HSdrop < high school diploma or GED, HS < some College, Coll < Community or junior College degree, CCdeg < BA degree, BAdeg < postgraduate degree, MAdeg
Respondent's family income: an ordered factor with levels $3Kminus < $3K-$5K < $5K-$7K < $7K-$9K < $9K-$10K < $10K-$11K < $11K-$12K < $12K-$13K < $13K-$14K < $14K-$15K < $15K-$17K < $17K-$20K < $20K-$22K < $22K-$25K < $25K-$30K < $30K-$35K < $35K-$40K < $40K-$45K < $45K-$50K < $50K-$60K < $60K-$75K < $75K-$90K < $90K-$105K < $105Kplus
Expected vote in 1996 presidential election: a factor with levels Clinton and Dole
Sapiro, Virginia, Steven J. Rosenstone, Donald R. Kinder, Warren E. Miller, and the National Election Studies. AMERICAN NATIONAL ELECTION STUDIES, 1992-1997: COMBINED FILE [Computer file]. 2nd ICPSR version. Ann Arbor, MI: University of Michigan, Center for Political Studies [producer], 1999. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 1999.
Found at http://www.stat.washington.edu/
Votes and other demographic information from 276 wards in the 2008 Democratic Party presidential primary.
A data frame with 276 observations on the following 12 variables.
The voting system used where H is counted by hand and D is counted by machine.
The number of votes for Barack Obama.
The number of votes for Hillary Clinton.
The total number of votes cast in the Democratic primary (there were other candidates besides Clinton and Obama).
The poverty rate as a proportion as determined by the 2000 census.
Per capita annual income in USD in 1999.
The proportion of voters for Howard Dean in the 2004 Democratic primary.
The proportion of voters for John Kerry in the 2004 Democratic primary.
The proportion of non-Hispanic whites according to the 2000 census.
The proportion voting by absentee ballot.
An estimate of the population from 2002.
Proportion voting for Obama
On the 8th January 2008, primaries to select US presidential candidates were held in New Hampshire. In the Democratic party primary, Hillary Clinton defeated Barack Obama contrary to the expectations of pre-election opinion polls. Essentially two different voting technologies were used in New Hampshire. Some wards used paper ballots, counted by hand, while others used optically scanned ballots, counted by machine. Among the paper ballots, Obama had more votes than Clinton while Clinton defeated Obama on just the machine counted ballots. Since the method of voting should make no causal difference to the outcome, suspicions have been raised regarding the integrity of the election.
Herron, M., W. M. Jr, and J. Wand (2008). Voting Technology and the 2008 New Hampshire Primary. Wm. & Mary Bill Rts. J. 17, 351-374.
Data from an experiment to compare 8 varieties of oats. The growing area was heterogeneous and so was grouped into 5 blocks. Each variety was sown once within each block and the yield in grams per 16ft row was recorded.
The dataset contains the following variables
Yield in grams per 16ft row
Blocks I to V
Variety 1 to 8
"Statistical Theory in Research" by R. Anderson and T. Bancroft, McGraw Hill,1952
Data from an experiment to determine the effects of column temperature, gas/liquid ratio and packing height in reducing the unpleasant odor of a chemical product that was being sold for household use
Odor score
Temperature coded as -1, 0 and 1
Gas/Liquid ratio coded as -1, 0 and 1
Packing height coded as -1, 0 and 1
"Statistical Design and Analysis of Experiments" by P. John, Macmillan, 1971
The ohio data frame has 2148 rows and 4 columns. The dataset is a subset of the six-city study, a longitudinal study of the health effects of air pollution.
This data frame contains the following columns:
an indicator of wheeze status (1=yes, 0=no)
a numeric vector for subject id
a numeric vector of age, 0 is 9 years old
an indicator of maternal smoking at the first year of the study
Fitzmaurice, G.M. and Laird, N.M. (1993) A likelihood-based method for analyzing longitudinal binary responses, Biometrika 80: 141–151.
The 1986 crash of the space shuttle Challenger was linked to failure of O-ring seals in the rocket engines. Data was collected on the 23 previous shuttle missions. The launch temperature on the day of the crash was 31F.
A data frame with 23 observations on the following 2 variables.
temperature at launch in degrees F
number of damage incidents out of 6 possible
Presidential Commission on the Space Shuttle Challenger Accident, Vol. 1, 1986: 129-131.
S. Dalal, E. Fowlkes and B. Hoadley (1989) "Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure." Journal of the American Statistical Association. 84: 945-957.
A study of the relationship between atmospheric ozone concentration and meteorology in the Los Angeles Basin in 1976. A number of cases with missing variables have been removed for simplicity.
A data frame with 330 observations on the following 10 variables.
Ozone conc., ppm, at Sandburg AFB.
a numeric vector
wind speed
a numeric vector
temperature
inversion base height
Daggett pressure gradient
a numeric vector
visibility
day of the year
Breiman, L. and J. H. Friedman (1985). Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association 80, 580-598.
data(ozone) ## maybe str(ozone) ; plot(ozone) ...
445 college students were classified according to both frequency of marijuana use and parental use of alcohol and psychoactive drugs.
A data frame with 9 observations on the following 3 variables.
Number of parents using drugs or alcohol - a factor with levels Both, Neither, One
Student usage of marijuana - a factor with levels Never, Occasional, Regular
the number of cases
Ellis, Godfrey J. and Stone, Lorene H. (1979) "Marijuana Use in College: An Evaluation of a Modeling Explanation" Youth and Society 10, 323-334
The peanut data frame has 16 rows and 6 columns. Carbon dioxide effects on peanut oil extraction.
This data frame contains the following columns:
CO2 pressure - two levels low=0, high=1
CO2 temperature - two levels low=0, high=1
peanut moisture - two levels low=0, high=1
CO2 flow rate - two levels low=0, high=1
peanut particle size - two levels low=0, high=1
the amount of oil that could dissolve in the CO2
Kilgo, M (1989) "An Application of Fractional Factorial Experimental Designs" Quality Engineering, 1, 45-54
The production of penicillin uses a raw material, corn steep liquor, which is quite variable and can only be made in blends sufficient for four runs. There are four processes, A, B, C and D, for the production.
A data frame with 20 observations on the following 3 variables.
a factor with levels A, B, C, D
a factor with levels Blend1, Blend2, Blend3, Blend4, Blend5
a numeric vector
Box, G., W. Hunter, and J. Hunter (1978). Statistics for Experimenters. New York: Wiley.
data(penicillin) ## maybe str(penicillin) ; plot(penicillin) ...
Data based on a 5
A data frame with 1115 observations on the following 5 variables.
is the mother Black?
mother's years of education
does the mother smoke during pregnancy?
gestational age in weeks
birth weight in grams
I. T. Elo, G. Rodriguez and H. Lee (2001). Racial and Neighborhood Disparities in Birthweight in Philadelphia. Paper presented at the Annual Meeting of the Population Association of America, Washington, DC 2001.
data(phbirths) ## maybe str(phbirths) ; plot(phbirths) ...
The National Institute of Diabetes and Digestive and Kidney Diseases conducted a study on 768 adult female Pima Indians living near Phoenix.
The dataset contains the following variables
Number of times pregnant
Plasma glucose concentration at 2 hours in an oral glucose tolerance test
Diastolic blood pressure (mm Hg)
Triceps skin fold thickness (mm)
2-Hour serum insulin (mu U/ml)
Body mass index (weight in kg/(height in metres squared))
Diabetes pedigree function
Age (years)
test whether the patient shows signs of diabetes (coded 0 if negative, 1 if positive)
The data may be obtained from UCI Repository of machine learning databases at http://archive.ics.uci.edu/ml/
Researchers at the National Institute of Standards and Technology (NIST) collected data on ultrasonic measurements of the depths of defects in the Alaska pipeline in the field. The depths of the defects were then remeasured in the laboratory. These measurements were performed in six different batches. The laboratory measurements are more accurate than the in-field measurements, but more time consuming and expensive.
A data frame with 107 observations on the following 3 variables.
measurement of depth of defect on site
measurement of depth of defect in the lab
the batch of measurements
Office of the Director of the Institute of Materials Research (now the Materials Science and Engineering Laboratory) of NIST
The data for this example contain the number of coal miners classified by radiological examination into one of three categories of pneumonoultramicroscopicsilicovolcanoconiosis (known as pneumoconiosis for short) and by number of years spent working at the coal face divided into eight categories.
A data frame with 24 observations on the following 3 variables.
number of miners
pneumoconiosis status - a factor with levels mild, normal, severe
number of years service (midpoint of interval)
M. Aitkin and D. Anderson and B. Francis and J. Hinde (1989) "Statistical Modelling in GLIM" Oxford University Press.
The National Youth Survey collected a sample of 11 to 17 year olds - 117 boys and 120 girls - asking questions about marijuana usage.
A data frame with 486 observations on the following 7 variables.
1=Male, 2=Female
1=never used, 2=used no more than once a month, 3=used more than once a month in 1976
1=never used, 2=used no more than once a month, 3=used more than once a month in 1977
1=never used, 2=used no more than once a month, 3=used more than once a month in 1978
1=never used, 2=used no more than once a month, 3=used more than once a month in 1979
1=never used, 2=used no more than once a month, 3=used more than once a month in 1980
Number of cases in this category
ICPSR, University of Michigan
Lang J., McDonald, J and Smith P. (1999) "Association-Marginal Modeling of Multivariate Categorical Responses: A Maximum Likelihood Approach" JASA 94, 1161-
The prostate data frame has 97 rows and 9 columns. A study on 97 men with prostate cancer who were due to receive a radical prostatectomy.
This data frame contains the following columns:
log(cancer volume)
log(prostate weight)
age
log(benign prostatic hyperplasia amount)
seminal vesicle invasion
log(capsular penetration)
Gleason score
percentage Gleason scores 4 or 5
log(prostate specific antigen)
Andrews DF and Herzberg AM (1985): Data. New York: Springer-Verlag
Makes a Partial Residual plot
prplot(g, i)
g | An object returned from lm() |
i | index of predictor |
none
Julian Faraway
data(stackloss)
g <- lm(stack.loss ~ ., stackloss)
prplot(g, 1)
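As a rough illustration of what a partial residual plot shows, the sketch below builds one by hand in base R for the same stackloss model. It assumes the usual definition (the residuals plus the fitted contribution of the chosen predictor, plotted against that predictor). The helper name `partial_resid` is made up for this sketch and is not part of the package, whose own plot may differ in detail.

```r
# Sketch only: partial residuals for predictor i, using the usual
# definition e + beta_i * x_i. `partial_resid` is a hypothetical helper.
partial_resid <- function(g, i) {
  x <- model.matrix(g)[, i + 1]    # column i + 1 skips the intercept
  b <- coef(g)[i + 1]
  list(x = unname(x), pr = unname(b * x + residuals(g)), slope = unname(b))
}

g <- lm(stack.loss ~ ., data = stackloss)
p <- partial_resid(g, 1)           # first predictor: Air.Flow

plot(p$x, p$pr, xlab = "Air.Flow", ylab = "beta*Air.Flow + residuals")
abline(0, p$slope)                 # line through the origin with slope beta_1
```

Because least-squares residuals are orthogonal to every predictor, regressing these partial residuals on the predictor recovers the original coefficient, which is why the reference line has that slope.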
The Panel Study of Income Dynamics (PSID), begun in 1968, is a longitudinal study of a representative sample of U.S. individuals. The study is conducted at the Survey Research Center, Institute for Social Research, University of Michigan and is still continuing. The data represents a small subset of the total data.
A data frame with 1661 observations on the following 6 variables.
age in 1968
years of education
sex of individual, F or M
annual income in dollars
calendar year
ID number for individual
Martha S. Hill, The Panel Study of Income Dynamics: A User's Guide, Sage Publications, 1992, Newbury Park, CA.
The pulp data frame has 20 rows and 2 columns. Data comes from an experiment to test the paper brightness depending on a shift operator.
This data frame contains the following columns:
Brightness of the pulp as measured by a reflectance meter
Shift operator a-d
"Statistical techniques applied to production situations" F. Sheldon (1960) Industrial and Engineering Chemistry, 52, 507-509
Investigators studied physical characteristics and ability in 13 (American) football punters. Each volunteer punted a football ten times. The investigators recorded the average distance for the ten punts, in feet.
A data frame with 13 observations on the following 7 variables.
average distance over 10 punts
hang time
right leg strength in pounds
left leg strength in pounds
right hamstring muscle flexibility in degrees
left hamstring muscle flexibility in degrees
overall leg strength in foot pounds
Unknown
data(punting) ## maybe str(punting) ; plot(punting) ...
Data from an experiment to study factors affecting the production of the plastic PVC. 3 operators used 8 different devices called resin railcars to produce PVC. For each of the 24 combinations, two samples were produced.
Dataset contains the following variables
Particle size
Operator number 1, 2 or 3
Resin railcar 1-8
R. Morris and E. Watson (1998) "A comparison of the techniques used to evaluate the measurement process" Quality Engineering, 11, 213-219
Structural information on 74 2,4-diamino-5-(substituted benzyl)pyrimidines used as inhibitors of DHFR in E. coli. There are 3 positions where chemical activity occurs and 9 attributes per position, leading to 27 total predictors. One predictor had no variability and was removed from the data set, leaving 26 chemical properties of 74 compounds and an activity level.
A data frame with 74 observations on the following 27 variables.
measured on a [0,1] scale
measured on a [0,1] scale
measured on a [0,1] scale
measured on a [0,1] scale
measured on a [0,1] scale
measured on a [0,1] scale
measured on a [0,1] scale
measured on a [0,1] scale
measured on a [0,1] scale
measured on a [0,1] scale
measured on a [0,1] scale
measured on a [0,1] scale
measured on a [0,1] scale
measured on a [0,1] scale
measured on a [0,1] scale
measured on a [0,1] scale
measured on a [0,1] scale
measured on a [0,1] scale
measured on a [0,1] scale
measured on a [0,1] scale
measured on a [0,1] scale
measured on a [0,1] scale
measured on a [0,1] scale
measured on a [0,1] scale
measured on a [0,1] scale
measured on a [0,1] scale
log 1/Ki, where Ki is the inhibition constant as experimentally assayed, scaled to [0,1]
Jonathan D. Hirst, Ross D. King, Michael J. E. Sternberg (1994) Quantitative structure-activity relationships by neural networks and inductive logic programming. I. The inhibition of dihydrofolate reductase by pyrimidines doi:10.1007/BF00125375
data(pyrimidines) ## maybe str(pyrimidines) ; plot(pyrimidines) ...
Makes a labeled QQ plot
qqnorml(
  y,
  main = "Normal Q-Q Plot",
  xlab = "Theoretical Quantiles",
  ylab = "Sample Quantiles",
  ...
)
y | A numeric vector |
main | main label |
xlab | x-axis label |
ylab | y-axis label |
... | arguments passed to plot() |
none
Julian Faraway
qqnorm
qqnorml(rnorm(16))
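For readers without the package to hand, a labeled Q-Q plot can be approximated in base R. The sketch below is an assumption about the general idea (plot each point's case number instead of a symbol), not the package's actual code; `qq_labeled` is a hypothetical name.

```r
# Sketch only: a Q-Q plot that prints each point's index instead of a symbol.
# `qq_labeled` is a hypothetical stand-in, not the package's qqnorml().
qq_labeled <- function(y, ...) {
  qq <- qqnorm(y, plot.it = FALSE)   # theoretical (x) and sample (y) quantiles
  plot(qq$x, qq$y, type = "n",
       xlab = "Theoretical Quantiles", ylab = "Sample Quantiles", ...)
  text(qq$x, qq$y, labels = seq_along(y))
  invisible(qq)
}

set.seed(1)
qq <- qq_labeled(rnorm(16), main = "Normal Q-Q Plot")
```

Labeling by index makes it easy to see which observations sit in the tails.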
A nutritionist studied the effects of six diets on weight gain of domestic rabbits. From past experience with sizes of litters, it was felt that only 3 uniform rabbits could be selected from each available litter. There were ten litters available forming blocks of size three.
The variables in the dataset were
Diet a through f
Weight gain
Block (10 litters)
"Experimental Design and Analysis" by M. Lentner and T. Bishop, Valley Book Company, 1986
The data consist of 5 weekly measurements of body weight for 27 rats. The first 10 rats are on a control treatment while 7 rats have thyroxine added to their drinking water. The remaining 10 rats have thiouracil added to their water.
A data frame with 135 observations on the following 4 variables.
Weight of the rat
Week of the study from 0 to 4
the rat code number
treatment applied to the rat drinking water - a factor with levels control, thiouracil, thyroxine
Unknown
An experiment was conducted as part of an investigation to combat the effects of certain toxic agents.
A data frame with 48 observations on the following 3 variables.
survival time in tens of hours
the poison type - a factor with levels I, II, III
the treatment - a factor with levels A, B, C, D
Box G and Cox D (1964) "An analysis of transformations" J. Roy. Stat. Soc. Series B. 26 211.
The resceram data frame has 12 rows and 3 columns. Shape and plate effects on current noise in resistors.
This data frame contains the following columns:
current noise
the geometrical shape of the resistor, A, B, C or D
the ceramic plate on which the resistor was mounted. Only three resistors will fit on one plate.
Natrella, M (1963) "Experimental Statistics" National Bureau of Standards Handbook 91, Gaithersburg MD.
The data were collected in a salmonella reverse mutagenicity assay in which the numbers of revertant colonies of TA98 Salmonella were observed on each of three replicate plates for different doses of quinoline.
A data frame with 18 observations on the following 2 variables.
numbers of revertant colonies of TA98 Salmonella
dose level of quinoline
Breslow N.E. (1984), Extra-Poisson Variation in Log-linear Models, Applied Statistics, 33, 38-44.
The sat data frame has 50 rows and 7 columns. Data were collected to study the relationship between expenditures on public education and test results.
This data frame contains the following columns:
Current expenditure per pupil in average daily attendance in public elementary and secondary schools, 1994-95 (in thousands of dollars)
Average pupil/teacher ratio in public elementary and secondary schools, Fall 1994
Estimated average annual salary of teachers in public elementary and secondary schools, 1994-95 (in thousands of dollars)
Percentage of all eligible students taking the SAT, 1994-95
Average verbal SAT score, 1994-95
Average math SAT score, 1994-95
Average total score on the SAT, 1994-95
"Getting What You Pay For: The Debate Over Equity in Public School Expenditures" D. Guber, Journal of Statistics Education, 1999
The savings data frame has 50 rows and 5 columns. The data is averaged over the period 1960-1970.
This data frame contains the following columns:
savings rate - personal saving divided by disposable income
percent population under age of 15
percent population over age of 75
per-capita disposable income in dollars
percent growth rate of dpi
Now also appears as LifeCycleSavings in the datasets package
Belsley, D., Kuh, E. and Welsch, R. (1980) "Regression Diagnostics" Wiley.
LifeCycleSavings
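Since the same data ship with base R as LifeCycleSavings, a model can be tried without installing anything. The sketch below uses that copy and its column names (sr, pop15, pop75, dpi, ddpi). The choice of a plain linear regression of the savings rate on the other four variables is only an illustrative assumption.

```r
# Sketch only: uses the copy of the data in the base `datasets` package.
data(LifeCycleSavings)
dim(LifeCycleSavings)                 # rows (countries) and columns

# Regress the savings rate on the four remaining variables
fit <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
round(coef(fit), 4)
```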
Car drivers like to adjust the seat position for their own comfort. Car designers would find it helpful to know where different drivers will position the seat depending on their size and age. Researchers at the HuMoSim laboratory at the University of Michigan collected data on 38 drivers.
The dataset contains the following variables
Age in years
Weight in lbs
Height in shoes in cm
Height bare foot in cm
Seated height in cm
lower arm length in cm
Thigh length in cm
Lower leg length in cm
horizontal distance of the midpoint of the hips from a fixed location in the car in mm
"Linear Models with R" by Julian Faraway, CRC Press, 2004
A biologist analyzed an experiment to determine the effect of moisture content on seed germination. Eight boxes of 100 seeds each were treated with the same moisture level. 4 boxes were covered and 4 left uncovered. The process was repeated at 6 different moisture levels (nonlinear scale).
A data frame with 48 observations on the following 3 variables.
percentage germinated
moisture level
a factor with levels no, yes
Chatfield C. (1995) Problem Solving: A Statistician's Guide, 2nd ed., Chapman & Hall.
data(seeds) ## maybe str(seeds) ; plot(seeds) ...
The semicond data frame has 48 rows and 5 columns.
This data frame contains the following columns:
a numeric vector
a factor with levels 1 to 4 representing etch time.
a factor with levels 1 to 3
a factor with levels 1 to 4
an ordered factor with levels 1/1 < 1/2 < 1/3 < 2/1 < 2/2 < 2/3 < 3/1 < 3/2 < 3/3 < 4/1 < 4/2 < 4/3
Also found in the SASmixed package
Littell, R. C., Milliken, G. A., Stroup, W. W., and Wolfinger, R. D. (1996), SAS System for Mixed Models, SAS Institute (Data Set 2.2(b)).
The data for this example come from a study of the effects of childhood sexual abuse on adult females. 45 women being treated at a clinic, who reported childhood sexual abuse, were measured for post traumatic stress disorder and childhood physical abuse both on standardized scales. 31 women also being treated at the same clinic, who did not report childhood sexual abuse were also measured. The full study was more complex than reported here and so readers interested in the subject matter should refer to the original article.
The variables in the dataset are
Childhood physical abuse on standard scale
Post-traumatic stress disorder on standard scale
Childhood sexual abuse - abused or not abused
N. Rodriguez and S. Ryan and H. Vande Kemp and D. Foy (1997) "Posttraumatic stress disorder in adult female survivors of childhood sexual abuse: A comparison study", Journal of Consulting and Clinical Psychology, 65, 53-59
Data from a questionnaire from 91 couples in the Tucson, Arizona area. Subjects answered the question "Sex is fun for me and my partner". The possible answers were "never or occasionally", "fairly often", "very often" and "almost always".
A data frame with 16 observations on the following 3 variables.
the count
a factor with levels never, fairly, very, always
a factor with levels never, fairly, very, always
Hout, M., Duncan, O. and Sobel M. (1987) Association and heterogeneity: Structural models of similarities and differences. Sociological Methodology, 17, 145-184.
A study was conducted to optimize snail production for consumption. The percentage water content of the tissues of snails grown under three different levels of relative humidity and two different temperatures was recorded. For each combination, 4 snails were observed.
A data frame with 24 observations on the following 3 variables.
percentage water content
temperature in C
relative humidity
Unknown
data(snail) ## maybe str(snail) ; plot(snail) ...
AT&T ran an experiment varying five factors relevant to a wave-soldering procedure for mounting components on printed circuit boards. The response variable, skips, is a count of how many solder skips appeared to a visual inspection.
A data frame with 900 observations on the following 6 variables.
a factor with levels L, M, S
a factor with levels Thick, Thin
a factor with levels A1.5, A3, A6, B3, B6
a factor with levels D4, D6, D7, L4, L6, L7, L8, L9, W4, W9
a numeric vector
count of how many solder skips appeared to a visual inspection
Comizzoli, R. B., J. M. Landwehr, and J. D. Sinclair (1990). Robust materials and processes: Key to reliability. AT&T Technical Journal 69(6), 113-128.
data(solder) ## maybe str(solder) ; plot(solder) ...
Behavioural scientists at Macquarie University conducted an experiment to test the time taken to perform a block design task with 24 fifth grade children (12 boys and 12 girls).
A data frame with 24 observations and 3 variables.
Solution attempted first by row (r) or corner (c)
Time taken to complete the task in seconds
Score on the embedded figures test which is a measure of difficulty in abstracting logical structure of a problem from its context.
Statistical Modelling in GLIM (1989) M. Aitkin and D. Anderson and B. Francis and J. Hinde Oxford University Press
The sono data frame has 16 rows and 8 columns. Sonoluminescence is the process of turning sound energy into light. An experiment was conducted to study factors affecting this process.
This data frame contains the following columns:
Sonoluminescent light intensity
Amount of Solute. The coding is "low" for 0.10 mol and "high" for 0.33 mol.
Solute type. The coding is "low" for sugar and "high" for glycerol.
The coding is "low" for 3 and "high" for 11.
Gas type in water. The coding is "low" for helium and "high" for air.
Water depth. The coding is "low" for half and "high" for full.
Horn depth. The coding is "low" for 5 mm and "high" for 10 mm.
Flask clamping. The coding is "low" for unclamped and "high" for clamped.
Eva Wilcox and Ken Inn of the NIST Physics Laboratory conducted this experiment during 1999; it was published in the NIST/SEMATECH e-Handbook of Statistical Methods, http://www.itl.nist.gov/div898/handbook/
An experiment was conducted to compare the germination rates of the five varieties of soybean. Five plots were available.
A data frame with 25 observations on the following 3 variables.
the variety - a factor with levels arasan, check, fermate, semesan, spergon
the plot - a factor with levels 1, 2, 3, 4, 5
the number of failures out of 100 planted seeds
Snedecor G. and Cochran W. (1967) Statistical Methods (6th Ed) Iowa State University Press
A study to determine the effectiveness of a new teaching method in Economics
A data frame with 32 observations on the following 4 variables.
1 = exam grades improved, 0 = not improved
1 = student exposed to PSI (a new teaching method), 0 = not exposed
a measure of ability when entering the class
grade point average
Spector, L. and Mazzeo, M. (1980), "Probit Analysis and Economic Education", Journal of Economic Education, 11, 37 - 44.
Speedometer cables can be noisy because of shrinkage in the plastic casing material. An experiment was conducted to find out what caused shrinkage by screening a large number of factors. The engineers started with 15 different factors.
The dataset contains the following variables: (variables a-o are 2 level factors, coded "+" and "-" where "+" indicates a higher value where appropriate)
liner outer diameter
liner die
liner material
liner line speed
wire braid type
braiding tension
wire diameter
liner tension
liner temperature
coating material
coating die type
melt temperature
screen pack
cooling method
line speed
percentage shrinkage per specimen
G. P. Box and S. Bisgaard and C. Fung (1988) "An explanation and critique of Taguchi's contributions to quality engineering", Quality and Reliability Engineering International, 4, 123-131
Data on the log of the surface temperature and the log of the light intensity of 47 stars in the star cluster CYG OB1, which is in the direction of Cygnus.
A data frame with 47 observations on the following 3 variables.
a numeric vector
temperature
light intensity
Rousseeuw, P. and A. Leroy (1987). Robust Regression and Outlier Detection. New York: Wiley.
data(star) ## maybe str(star) ; plot(star) ...
Marks from Statistics 500 one year at the University of Michigan
A data frame with 55 observations on the following 4 variables.
a numeric vector
a numeric vector
a numeric vector
a numeric vector
Julian Faraway
data(stat500) ## maybe str(stat500) ; plot(stat500) ...
An experiment was conducted to explore the nature of the relationship between a person's heart rate and the frequency at which that person stepped up and down on steps of various heights.
A data frame with 30 observations on the following 6 variables.
running order within the experiment
Experimenter used
0 if step at the low (5.75in) height, 1 if at the high (11.5in) height
the rate of stepping. 0 if slow (14 steps/min), 1 if medium (21 steps/min), 2 if high (28 steps/min)
the resting heart rate of the subject before a trial, in beats per minute
the final heart rate of the subject after a trial, in beats per minute
Unknown
data(stepping) ## maybe str(stepping) ; plot(stepping) ...
Example Dataset from "Practical Regression and Anova"
Dataframe with 10 cases
inverse total energy
Scattering cross-section/sec
standard deviation
Weisberg, H., Beier, H., Brody, H., Patton, R., Raychaudhari, K., Takeda, H., Thern, R. and Van Berg, R. (1978). s-dependence of proton fragmentation by hadrons. II. Incident laboratory momenta, 30–250 GeV/c. Physical Review D, 17, 2875–2887.
Weisberg, S. (2014). Applied Linear Regression, 4th edition. Hoboken NJ: Wiley.
One year of suicide data from the United Kingdom cross-classified by sex, age and method.
A data frame with 36 observations on the following 4 variables.
number of people
method used - a factor with levels drug (suicide by solid or liquid matter), gas, gun (guns, knives or explosives), hang (hanging, strangling, suffocating or drowning), jump, other
a factor with levels m (middle-aged), o (old), y (young)
a factor with levels f, m
Everitt B. & Dunn G. (1991) "Applied Multivariate Data Analysis" Edward Arnold
Generic summaries for lm, glm and mer objects
sumary(object, ...)
object | An lm, glm or mer object returned from lm(), glm() or lmer() respectively |
... | further arguments passed to or from other methods. |
This generic function provides an abbreviated regression output containing the more useful information. Users wanting to see more are advised to use summary().
returns the same as summary()
Julian Faraway
This function is adapted from the display() function in the arm package
data(stackloss)
object <- lm(stack.loss ~ ., stackloss)
sumary(object)
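To give a feel for what an abbreviated regression summary might contain, here is a base-R sketch that pulls the coefficient table, sample size, residual standard error and R-squared out of `summary()`. The helper `brief_summary` is hypothetical and the layout is an assumption; the output actually printed by `sumary()` may differ.

```r
# Sketch only: `brief_summary` is a hypothetical helper, not sumary() itself.
brief_summary <- function(object, digits = 3) {
  s <- summary(object)
  print(round(coef(s), digits))    # estimates, std. errors, t and p values
  cat("\nn =", length(residuals(object)),
      " p =", length(coef(object)),
      "\nResidual SE =", round(s$sigma, digits),
      " R-squared =", round(s$r.squared, digits), "\n")
  invisible(s)                     # hand back the full summary object
}

object <- lm(stack.loss ~ ., data = stackloss)
s <- brief_summary(object)
```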
The teengamb data frame has 47 rows and 5 columns. A survey was conducted to study teenage gambling in Britain.
This data frame contains the following columns:
0=male, 1=female
Socioeconomic status score based on parents' occupation
in pounds per week
verbal score in words out of 12 correctly defined
expenditure on gambling in pounds per year
Ide-Smith & Lea, 1988, Journal of Gambling Behavior, 4, 110-118
The data come from a multicenter study comparing two oral treatments for toenail infection. Patients were evaluated for the degree of separation of the nail. Patients were randomized into two treatments and were followed over seven visits - four in the first year and yearly thereafter. The patients had not been treated prior to the first visit so this should be regarded as the baseline.
A data frame with 1908 observations on the following 5 variables.
ID of patient
0=none or mild separation, 1=moderate or severe
the treatment A=0 or B=1
time of the visit (not exactly monthly intervals hence not round numbers)
the number of the visit
De Backer, M., De Vroey, C., Lesaffre, E., Scheys, I., and De Keyser, P. (1998). Twelve weeks of continuous oral therapy for toenail onychomycosis caused by dermatophytes: A double-blind comparative trial of terbinafine 250 mg/day versus itraconazole 200 mg/day. Journal of the American Academy of Dermatology, 38, 57-63.
Lesaffre, E. and Spiessens, B. (2001). On the effect of the number of quadrature points in a logistic random-effects model: An example. Journal of the Royal Statistical Society, Series C, 50, 325-335.
G. Fitzmaurice, N. Laird and J. Ware (2004) Applied Longitudinal Analysis, Wiley
Boxes of trout eggs were buried at five different stream locations and retrieved at 4 different times. The number of surviving eggs was recorded. The box was not returned to the stream.
A data frame with 20 observations on the following 4 variables.
the number of surviving eggs
the number of eggs in the box
the location in the stream with levels 1, 2, 3, 4, 5
the number of weeks after placement that the box was withdrawn, with levels 4, 7, 8, 11
Manly B. (1978) Regression models for proportions with extraneous variance. Biometrie-Praximetrie, 18, 1-18.
Hinde J. and Demetrio C. (1998) Overdispersion: Models and estimation. Computational Statistics and Data Analysis. 27, 151-170.
Data on an experiment concerning the production of leaf springs for trucks.
A fractional factorial experiment with 3 replicates was carried out with the objective of recommending production settings to achieve a free height as close as possible to 8 inches.
A data frame with 48 observations on the following 6 variables.
furnace temperature - a factor with levels +, -
heating time - a factor with levels +, -
transfer time - a factor with levels +, -
hold-down time - a factor with levels +, -
quench oil temperature - a factor with levels +, -
leaf spring free height in inches
J. J. Pignatiello and J. S. Ramberg (1985) Contribution to discussion of offline quality control, parameter design and the Taguchi method, Journal of Quality Technology, 17, 198-206.
P. McCullagh and J. Nelder (1989) "Generalized Linear Models" Chapman and Hall, 2nd ed.
Incubation temperature can affect the sex of turtles. There are 3 independent replicates for each temperature.
A data frame with 15 observations on the following 3 variables.
temperature in degrees centigrade
number of male turtles hatched
number of female turtles hatched
Beyond Traditional Statistical Methods Copyright 2000 D. Cook, P. Dixon, W. M. Duckworth, M. S. Kaiser, K. Koehler, W. Q. Meeker and W. R. Stephenson. Developed as part of NSF/ILI grant DUE9751644.
data(turtle)
Life expectancy, doctors and televisions collected on 38 countries in 1993
A data frame with 38 observations on the following 3 variables.
Life expectancy in years
Number of people per television set
Number of people per doctor
Unknown, data for illustration purposes only
data(tvdoctor) ## maybe str(tvdoctor) ; plot(tvdoctor) ...
Study of IQ in twins reared apart
A dataframe with the following variables:
IQ of the fostered child
IQ of the biological child
social class of natural parents
Burt, C. (1966). The genetic estimation of differences in intelligence: A study of monozygotic twins reared together and apart. Br. J. Psych., 57, 147-153.
Weisberg, S. (2014). Applied Linear Regression, 4th edition. Hoboken NJ: Wiley.
A student newspaper conducted a survey of student opinions about the Vietnam War in May 1967. Responses were classified by sex, year in the program and one of four opinions. The survey was voluntary.
A data frame with 40 observations on the following 4 variables.
the count
a factor with levels A (defeat power of North Vietnam by widespread bombing and land invasion), B (follow the present policy), C (withdraw troops to strong points and open negotiations on elections involving the Viet Cong), D (immediate withdrawal of all U.S. troops)
a factor with levels Female, Male
a factor with levels Fresh, Grad, Junior, Senior, Soph
M. Aitkin and D. Anderson and B. Francis and J. Hinde (1989) "Statistical Modelling in GLIM" Oxford University Press.
The uswages data frame has 2000 rows and 10 columns. Weekly wages for US male workers sampled from the Current Population Survey in 1988.
This data frame contains the following columns:
Real weekly wages in dollars (deflated by personal consumption expenditures - 1992 base year)
Years of education
Years of experience
1 if Black, 0 if White (other races not in sample)
1 if living in Standard Metropolitan Statistical Area, 0 if not
1 if living in the North East
1 if living in the Midwest
1 if living in the West
1 if living in the South
1 if working part time, 0 if not
Bierens, H.J., and D. Ginther (2001): "Integrated Conditional Moment Testing of Quantile Regression Models", Empirical Economics 26, 307-324
vif
vif(object)
## Default S3 method:
vif(object)
## S3 method for class 'lm'
vif(object)
object | a data matrix (design matrix without intercept) or a model object |
Computes the variance inflation factors
variance inflation factors
Julian Faraway
data(stackloss)
vif(stackloss[,-4])
#   Air.Flow Water.Temp Acid.Conc.
#     2.9065     2.5726     1.3336
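The variance inflation factor for predictor j is conventionally 1 / (1 - R²_j), where R²_j comes from regressing that predictor on all the others. The sketch below checks this in base R on the stackloss predictors, alongside the equivalent shortcut of taking the diagonal of the inverse correlation matrix. The variable names are invented for the sketch, and it is not claimed to be how the package computes the result.

```r
# Sketch only: variance inflation factors computed two ways in base R.
X <- stackloss[, -4]                     # the three predictors

# (1) VIF_j = 1 / (1 - R^2_j), regressing each predictor on the others
vif_by_regression <- sapply(names(X), function(v) {
  f <- reformulate(setdiff(names(X), v), response = v)
  1 / (1 - summary(lm(f, data = X))$r.squared)
})

# (2) equivalently, the diagonal of the inverse correlation matrix
vif_by_inverse <- diag(solve(cor(X)))

round(vif_by_regression, 4)
round(vif_by_inverse, 4)
```

Both routes should agree with each other up to rounding. Values well above 1 indicate that a predictor is largely explained by the others.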
The acuity of vision for seven subjects was tested. The response is the lag in milliseconds between a light flash and a response in the cortex of the eye. Each eye is tested at four different powers of lens. An object at the distance of the second number appears to be at distance of the first number.
A data frame with 56 observations on the following 4 variables.
a numeric vector
a factor with levels 6/6, 6/18, 6/36, 6/60
a factor with levels left, right
a factor with levels 1, 2, 3, 4, 5, 6, 7
Crowder, M. J. and D. J. Hand (1990). Analysis of Repeated Measures. London: Chapman & Hall.
data(vision) ## maybe str(vision) ; plot(vision) ...
A full factorial experiment with four two-level predictors.
A data frame with 16 observations on the following 5 variables.
a factor with levels -, +
a factor with levels -, +
a factor with levels -, +
a factor with levels -, +
Resistivity of the wafer
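A full factorial in four two-level predictors has 2^4 = 16 runs, which matches the 16 observations. The base R sketch below enumerates such a design. The factor names x1 to x4 are placeholders chosen for illustration, not the names used in the data frame.

```r
# Enumerate every combination of four two-level factors (names are hypothetical).
lv <- c("-", "+")
design <- expand.grid(x1 = lv, x2 = lv, x3 = lv, x4 = lv)

nrow(design)           # 16
anyDuplicated(design)  # 0, so every run is a distinct combination
```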
Myers, R. and Montgomery D. (1997) A tutorial on generalized linear models, Journal of Quality Technology, 29, 274-291.
Components are attached to an electronic circuit card assembly by a
wave-soldering process. The soldering process involves baking and preheating
the circuit card and then passing it through a solder wave by conveyor.
Defects arise during the process. The design has 3 replicates.
A data frame with 16 observations on the following 10 variables.
Number of defects in the first replicate
Number of defects in the second replicate
Number of defects in the third replicate
prebake condition - a factor with levels 1 2
flux density - a factor with levels 1 2
conveyor speed - a factor with levels 1 2
preheat condition - a factor with levels 1 2
cooling time - a factor with levels 1 2
ultrasonic solder agitator - a factor with levels 1 2
solder temperature - a factor with levels 1 2
L. Condra (1993) Reliability improvement with design of experiments. Marcel Dekker, NY.
M. Hamada and J. Nelder (1997) Generalized linear models for quality improvement experiments, Journal of Quality Technology, 29, 292-304
Data come from a study of breast cancer in Wisconsin. There are 681 cases of potentially cancerous tumors, of which 238 are actually malignant. Whether a tumor is really malignant is traditionally determined by an invasive surgical procedure. The purpose of this study was to determine whether a new procedure called fine needle aspiration, which draws only a small sample of tissue, could be effective in determining tumor status.
A data frame with 681 observations on the following 10 variables.
0 if malignant, 1 if benign
marginal adhesion
bare nuclei
bland chromatin
epithelial cell size
mitoses
normal nucleoli
clump thickness
cell shape uniformity
cell size uniformity
The predictor values are determined by a doctor observing the cells and rating them on a scale from 1 (normal) to 10 (most abnormal) with respect to the particular characteristic.
Bennett, K. P., and Mangasarian, O. L., Neural network training via linear programming. In P. M. Pardalos, editor, Advances in Optimization and Parallel Computing, pages 56-57. Elsevier Science, 1992
3154 healthy young men aged 39-59 from the San Francisco area were assessed for their personality type. All were free from coronary heart disease at the start of the research. Eight and a half years later change in this situation was recorded.
A data frame with 3154 observations on the following 13 variables.
age in years
height in inches
weight in pounds
systolic blood pressure in mm Hg
diastolic blood pressure in mm Hg
Fasting serum cholesterol in mm %
behavior type which is a factor with levels A1 A2 B3 B4
number of cigarettes smoked per day
behavior type a factor with levels A (Aggressive) B (Passive)
coronary heart disease developed is a factor with levels no yes
type of coronary heart disease is a factor with levels angina infdeath none silent
Time of CHD event or end of follow-up
arcus senilis is a factor with levels absent present
The WCGS began in 1960 with 3,524 male volunteers who were employed by 11 California companies. Subjects were 39 to 59 years old and free of heart disease as determined by electrocardiogram. After the initial screening, the study population dropped to 3,154 and the number of companies to 10 because of various exclusions. The cohort comprised both blue- and white-collar employees. At baseline the following information was collected: socio-demographic including age, education, marital status, income, occupation; physical and physiological including height, weight, blood pressure, electrocardiogram, and corneal arcus; biochemical including cholesterol and lipoprotein fractions; medical and family history and use of medications; behavioral data including Type A interview, smoking, exercise, and alcohol use. Later surveys added data on anthropometry, triglycerides, Jenkins Activity Survey, and caffeine use. Average follow-up continued for 8.5 years with repeat examinations.
Statistics for Epidemiology by N. Jewell (2004)
Rosenman RH, Brand RJ, Jenkins CD, Friedman M, Straus R, Wurm M (1975). Coronary Heart Disease in the Western Collaborative Group Study: Final Follow-up Experience of 8 1/2 Years. JAMA 233(8):872-877. doi:10.1001/jama.1975.03260080034016.
data(wcgs) ## maybe str(wcgs) ; plot(wcgs) ...
An experiment to investigate factors affecting welding strength.
A data frame with 16 observations on the following 10 variables.
a 0-1 predictor
a 0-1 predictor
a 0-1 predictor
a 0-1 predictor
a 0-1 predictor
a 0-1 predictor
a 0-1 predictor
a 0-1 predictor
a 0-1 predictor
The welding strength
G. Box and R. Meyer (1986) Dispersion effects from fractional designs, Technometrics, 28, 19-27
Age, weight, height, and 10 body circumference measurements are recorded for 184 women. Each woman's percentage of body fat was accurately estimated by an underwater weighing technique.
A data frame with 184 observations on the following 19 variables.
Percent body fat using Siri's equation
Weight (lbs)
Height (inches)
Body Mass Index
Age (yrs)
Neck circumference (cm)
Chest circumference (cm)
Calf circumference (cm)
Extended biceps circumference (cm)
Hip circumference (cm)
Horizontal minimal measurement, at the end of a normal expiration (cm)
Forearm circumference (cm)
(Proximal Thigh) Horizontal measurement immediately distal to the gluteal furrow (cm)
(Middle Thigh) Measurement midway between the midpoint of the inguinal crease and the proximal border of the patella (cm)
(Distal Thigh) Measurement proximal to the femoral epicondyles (cm)
Wrist circumference (cm) distal to the styloid processes
Knee circumference (cm)
A minimal circumference measurement with the elbow extended (cm)
Ankle circumference (cm)
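Two of the variables above are derived quantities. The sketch below shows the standard formulas in base R: Siri's equation, which converts body density in g/cm^3 to percent body fat, and body mass index computed from weight in pounds and height in inches. The function names are hypothetical, and whether the data set's values were produced with exactly these conversions is an assumption.

```r
# Siri's equation: percent body fat from body density (g/cm^3)
siri_percent_fat <- function(density) {
  495 / density - 450
}

# BMI in kg/m^2 from imperial units (function name is illustrative)
bmi_from_imperial <- function(weight_lb, height_in) {
  kg <- weight_lb * 0.45359237  # pounds to kilograms
  m  <- height_in * 0.0254      # inches to metres
  kg / m^2
}

siri_percent_fat(1.05)      # about 21.4
bmi_from_imperial(150, 65)  # about 25.0
```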
Roger W. Johnson (2021): Fitting Percentage of Body Fat to Simple Body Measurements: College Women, Journal of Statistics and Data Science Education, DOI: 10.1080/26939169.2021.1971585 (Note that I have changed some of the variable names to correspond with the older fat data for men)
Insect damage to wheat by variety
A data frame with 13 observations on the following 2 variables.
a numeric vector
a factor with levels A B C D
Unknown
data(wheat) ## maybe str(wheat) ; plot(wheat) ...
Data on players from the 2010 World Cup
A data frame with 595 observations on the following 7 variables.
Country
a factor with levels Defender Forward Goalkeeper Midfielder
Time played in minutes
Number of shots attempted
Number of passes made
Number of tackles made
Number of saves made
None
Lost
data(worldcup) ## maybe str(worldcup) ; plot(worldcup) ...