tr
{
replace `var' = `var'[_n-1] if missing(`var')
}
***************
【问题】
有时候,整理一份数据,或者拿到一份数据,想看一下变量缺失情况。
【命令】
nmissing,npresent
【例子】
ssc install nmissing
\安装 nmising 命令
use yourdata,clear
\调入你的数据
nmising
\显示所有变量的缺失数目
nmising var1
\显示变量var1的缺失数目
nmising,min(10)
\显示缺失超过10个的变量
npresent
\显示所有变量的已有数据,和nmising正好相反。
【问题】
Stata如何快速查看变量缺失情况?
sum可以看到样本量,但不好看到底缺多少。
【方法】
mdesc可以的
【例子】
ssc install mdesc
use http://www.stata-press.com/data/r11/mheart5 ,clear
(Fictional heart attack data; bmi and age missing)
. mdesc
Variable
|
Missing
Total
Percent Missing
----------------+-----------------------------------------------
attack |
0
154
0.00
smokes |
0
154
0.00
age |
12
154
7.79
bmi |
28
154
18.18
female |
0
154
0.00
hsgrad |
0
154
0.00
----------------+-----------------------------------------------
How can I replace missing values with previous or following
nonmissing values or within sequences?
| Title |
|
Replacing missing values |
| Author |
Nicholas J. Cox, Durham University, UK |
| Date |
August 2000; updated January 2012 |
1. The problems
Users often want to replace missing values by neighboring
nonmissing values, particularly when observations occur in some
definite order, often (but not always) a time order. Typically,
this occurs when values of some variable should be identical within
blocks of observations, but, for some reason, values are explicitly
nonmissing within the dataset only for certain observations, most
often the first. So, there is a wish to copy values within blocks
of observations.
Alternatively, users often want to replace missing values in a
sequence, usually in a time sequence. These problems can be solved
with similar methods.
A different situation, not addressed directly in this FAQ, is when
values of some time-varying variable are known only for certain
observations. There is then a need for imputation or interpolation
between known values. Copying the last value forward is unlikely to
be a good method of interpolation unless, as just stated, it is
known that values remained constant at a stated level until the
next stated level. Either way, users applying the methods described
here for imputation or interpolation take on the responsibility for
what they do.
2. Without tsset: copying nonmissing values
Let us first look at the case where you have not
tsset
your data (see, for example,
[TS]
tsset for an explanation), but we will
assume that the data have been put in the correct sort order, say,
by typing
. sort time
If missing values occurred singly, then they could be replaced by
the previous value
. replace myvar = myvar[_n-1] if missing(myvar)
or by the following value
. replace myvar = myvar[_n+1] if missing(myvar)
Here the subscript notation used is that
_n
always refers to any given observation,
_n−1
to the previous observation and
_n+1 to
the following observation, given the current sort order. There is
not, of course, any observation before the first, or after the
last, so
myvar[0] is always missing, as is
myvar for any observation number that is
negative or greater than the number of observations in the data.
See
[U] 13.7
Explicit subscripting for more about
subscripting.
missing(myvar) catches both numeric missings and
string missings. If
myvar is numeric, you
could write
. replace myvar = myvar[_n+1] if myvar >= .
because
. <
.a <
.b < ... <
.z are
the numeric missing values. Most problems involve missing numeric
values, so, from now on, examples will be for numeric variables
only. However, if
myvar were string,
. replace myvar = myvar[_n+1] if myvar == ''
would be correct syntax, not the previous command, because the
empty string
'' is string missing.
3. Copying previous values downwards: the cascade effect
Missing values may occur in blocks of two or more. Suppose you want
to replace missings by the previous nonmissing value, whenever it
occurred, so that given
_n myvar 1 42 2 . 3 . 4 56 5 67 6 78
you want to
replace not only
myvar[2], but also
myvar[3] with
42.
. replace myvar = 42 in 2/3
is an interactive solution, but, for larger datasets, you need a
more systematic way of proceeding. To get this, it helps to know
that
replace always uses the current sort
order: the value for observation 2 is always replaced before that
for observation 3, so the replacement value for 2 may be used in
calculating the replacement value for 3.
. replace myvar = myvar[_n-1] if myvar >= .
achieves this purpose.
myvar[1] is unchanged,
because
myvar[1] is not missing.
myvar[2] is replaced by the value of
myvar[1], namely, 42, because
myvar[2]
is missing. But
myvar[3] is replaced by
the new value of
myvar[2], 42, not its original
value, missing (.). In this way, nonmissing values are copied in a
cascade down the current sort order. Naturally, one or more missing
values at the start of the data cannot be replaced in this way, as
no nonmissing value precedes any of them.
What if you want to use the previous value only and do not want
this cascade effect? You need to copy the variable and replace from
that:
. gen mycopy = myvar . replace myvar = mycopy[_n-1] if myvar
>= .
No replacement is being made in
mycopy, so there is
no cascade effect.
replace just looks across
at
mycopy and back one observation.
4. Copying following values upwards
The opposite case is replacement by following values, but, because
replace respects the current sort order, this
is not just the mirror image of replacement by previous values. In
practice, it is easiest to reverse the series and work the other
way.
. gsort -time . replace myvar = myvar[_n-1] if myvar >=
.
gsort
allows you to get reverse sort order; see
[D]
gsort. The command
sort time
puts highest values last, whereas
gsort −time
puts highest values first. It is as if you had generated a
variable that was
time multiplied by −1 and
sorted on it, and, in fact, this is exactly what
gsort does behind the scenes, although the
variable is temporary and dropped after it has served its
purpose.
. replace myvar = myvar[_n+1] if myvar >= .
does not produce a cascade effect.
myvar[2]
would be replaced by existing
myvar[3],
myvar[3] would be replaced by existing
myvar[4], and so forth. At most, one of any block of
missing values would be replaced. This might, of course, be exactly
what you want.
Once again, nothing can be done about any missing values at the end
of the series (placed at the beginning after the
gsort). After replacement, you will probably want to
reverse the sorting once again by
. sort time
5. Complications: several variables and panel structure
Two common complications are
- You want to do this with several variables: use
foreach. sort or
gsort once, replace all variables using
foreach, and, if necessary, sort
back again.
- You have panel data, so the appropriate replacement is a
neighboring nonmissing value for each individual in the panel.
Suppose that individuals are identified by
id. There
are just a few extra details to review, such as
. by id (time), sort: replace myvar = myvar[_n-1] if myvar >=
.
or
. gsort id -time . quietly by id: replace myvar = myvar[_n-1] if
myvar >= . . sort id time
The key to many data management problems with panel data lies in
following
sort by some computations under
by:. For more information, see the sections of the
manual indexed under
by:.
6. With tsset
If you have
tsset your data, say, by
typing
. tsset time
then
. replace myvar = L.myvar if myvar >= .
has the effect of copying in cascade, whereas
. replace myvar = F.myvar if myvar >=.
has no such effect. The value of
tsset is that
it takes account of gaps in your data and (if you had declared a
panel variable) of any panel structure to your data.
7. Missing values in sequences
In some datasets, time variables come with gaps, something
like
_n year 1 . 2 . 3 1990 4 . 5 . 6 . 7 . 8 1995 9 . 10 .
We can use a similar method and rely on cascading:
. replace year = 1988 in 1 . replace year = year[_n-1] + 1 if
missing(year)
The difference is simply that each value is one more than the
previous one. If data were once per decade, each value would be 10
more, and so forth. Again missing values at the beginning of a
sequence need special surgery, as shown here. With
tsset panel data use
L.year + 1
rather than
year[_n-1] + 1.