[Repost] Handling missing values in Stata

2018-08-31 03:22

http://blog.sina.cn/dpool/blog/u/1420576515

References: http://www.stata.com/support/faqs/data-management/replacing-missing-values/
http://bbs.pinggu.org/thread-2284674-1-1.html
http://bbs.pinggu.org/thread-3082225-1-1.html
http://blog.sina.cn/dpool/blog/s/blog_629bb7580101bvyo.html?vt=4
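Missing values come up constantly in data work. A common fix is to replace each missing value with the value from the previous observation.
Related commands:
nmissing, mdesc (inspect missing values); mvencode (assign a value to missings); carryforward, tsfill (fill in)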
***************
tsset permco month_id
tsfill
* fill within each permco so that values are never copied across firms
foreach var of varlist date fyearq fqtr rdq datafqtr {
    by permco (month_id), sort: replace `var' = `var'[_n-1] if missing(`var')
}
***************
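Of the commands listed above, mvencode is the one that assigns a fixed value to missings instead of copying a neighbor. A minimal sketch, assuming toy variables x and y (hypothetical names) and assuming that 0 is a sensible fill value for them:

```stata
clear
input x y
1 .
. 2
3 .
end

* recode every numeric missing value in x and y to 0
mvencode x y, mv(0)
list
```

mvencode refuses to run if the chosen code already occurs in the data unless the override option is given, and mvdecode performs the reverse mapping.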
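[Problem]
When cleaning a dataset, or when handed one, you often want to see how many values are missing in each variable.
[Commands]
nmissing, npresent
[Example]
ssc install nmissing    // install the nmissing command
use yourdata, clear     // load your data
nmissing                // number of missing values in each variable
nmissing var1           // number of missing values in var1
nmissing, min(10)       // only variables with at least 10 missing values
npresent                // number of nonmissing values in each variable, the opposite of nmissing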
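[Problem]
How can you quickly check how many values are missing in each variable in Stata?
summarize reports the number of observations, but it does not show directly how many values are missing.
[Method]
mdesc does this.
[Example]
ssc install mdesc
use http://www.stata-press.com/data/r11/mheart5, clear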
(Fictional heart attack data; bmi and age missing)

. mdesc
    Variable    |     Missing          Total     Percent Missing
----------------+-----------------------------------------------
         attack |           0            154           0.00
         smokes |           0            154           0.00
            age |          12            154           7.79
            bmi |          28            154          18.18
         female |           0            154           0.00
         hsgrad |           0            154           0.00
----------------+-----------------------------------------------
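
If installing from SSC is not an option, a similar overview is available without extra packages. The sketch below assumes Stata 11 or later, where the built-in misstable command exists, and uses the shipped auto dataset instead of the data above:

```stata
sysuse auto, clear

* list only the variables that contain missing values, with counts
misstable summarize

* cross-check the count for one variable
count if missing(rep78)
```

misstable summarize reports only variables that actually have missing values, so a clean dataset produces an essentially empty table.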

How can I replace missing values with previous or following nonmissing values or within sequences?

Title		Replacing missing values
Author		Nicholas J. Cox, Durham University, UK
Date		August 2000; updated January 2012

1. The problems

Users often want to replace missing values by neighboring nonmissing values, particularly when observations occur in some definite order, often (but not always) a time order. Typically, this occurs when values of some variable should be identical within blocks of observations, but, for some reason, values are explicitly nonmissing within the dataset only for certain observations, most often the first. So, there is a wish to copy values within blocks of observations.
Alternatively, users often want to replace missing values in a sequence, usually in a time sequence. These problems can be solved with similar methods.
A different situation, not addressed directly in this FAQ, is when values of some time-varying variable are known only for certain observations. There is then a need for imputation or interpolation between known values. Copying the last value forward is unlikely to be a good method of interpolation unless, as just stated, it is known that values remained constant at a stated level until the next stated level. Either way, users applying the methods described here for imputation or interpolation take on the responsibility for what they do.

2. Without tsset: copying nonmissing values

Let us first look at the case where you have not tsset your data (see, for example, [TS] tsset for an explanation), but we will assume that the data have been put in the correct sort order, say, by typing
. sort time
If missing values occurred singly, then they could be replaced by the previous value
. replace myvar = myvar[_n-1] if missing(myvar)
or by the following value
. replace myvar = myvar[_n+1] if missing(myvar)
Here the subscript notation used is that _n always refers to any given observation, _n−1 to the previous observation and _n+1 to the following observation, given the current sort order. There is not, of course, any observation before the first, or after the last, so myvar[0] is always missing, as is myvar for any observation number that is negative or greater than the number of observations in the data. See [U] 13.7 Explicit subscripting for more about subscripting.
missing(myvar) catches both numeric missings and string missings. If myvar is numeric, you could write
. replace myvar = myvar[_n+1] if myvar >= .
because . < .a < .b < ... < .z are the numeric missing values. Most problems involve missing numeric values, so, from now on, examples will be for numeric variables only. However, if myvar were string,
. replace myvar = myvar[_n+1] if myvar == ""
would be correct syntax, not the previous command, because the empty string "" is string missing.
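
The equivalence of these tests can be checked on a small example. This is only a sketch: the variables x (numeric) and s (string) and their values are made up for illustration.

```stata
clear
input float x str3 s
1   "a"
.   ""
.a  "b"
4   ""
end

* missing() is true for ., .a, ..., .z and for the empty string
count if missing(x)
count if x >= .
count if missing(s)
count if s == ""
```

Each of the four counts is 2 here: the extended missing value .a is caught by both numeric tests, and the two empty strings by both string tests.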

3. Copying previous values downwards: the cascade effect

Missing values may occur in blocks of two or more. Suppose you want to replace missings by the previous nonmissing value, whenever it occurred, so that given
    _n   myvar
     1      42
     2       .
     3       .
     4      56
     5      67
     6      78
you want to replace not only myvar[2], but also myvar[3] with 42.
. replace myvar = 42 in 2/3
is an interactive solution, but, for larger datasets, you need a more systematic way of proceeding. To get this, it helps to know that replace always uses the current sort order: the value for observation 2 is always replaced before that for observation 3, so the replacement value for 2 may be used in calculating the replacement value for 3.
. replace myvar = myvar[_n-1] if myvar >= .
achieves this purpose. myvar[1] is unchanged, because myvar[1] is not missing. myvar[2] is replaced by the value of myvar[1], namely, 42, because myvar[2] is missing. But myvar[3] is replaced by the new value of myvar[2], 42, not its original value, missing (.). In this way, nonmissing values are copied in a cascade down the current sort order. Naturally, one or more missing values at the start of the data cannot be replaced in this way, as no nonmissing value precedes any of them.
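
The cascade can be reproduced with the six values from the example in this section; this is a sketch for checking the behavior, not part of the original FAQ.

```stata
clear
input myvar
42
.
.
56
67
78
end

* observation 2 is filled first, so observation 3 sees the new value
replace myvar = myvar[_n-1] if myvar >= .
list, sep(0)
```

After the replace, observations 2 and 3 both hold 42 and the nonmissing values are untouched.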
What if you want to use the previous value only and do not want this cascade effect? You need to copy the variable and replace from that:
. gen mycopy = myvar
. replace myvar = mycopy[_n-1] if myvar >= .
No replacement is being made in mycopy, so there is no cascade effect. replace just looks across at mycopy and back one observation.
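
A small sketch of the copy-based approach, using made-up values, shows that only the first missing value in a block is filled:

```stata
clear
input myvar
42
.
.
56
end

* read from an untouched copy so that filled values are not reused
gen mycopy = myvar
replace myvar = mycopy[_n-1] if myvar >= .
drop mycopy
list, sep(0)
```

Observation 2 becomes 42, but observation 3 stays missing because mycopy[2] is still missing.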

4. Copying following values upwards

The opposite case is replacement by following values, but, because replace respects the current sort order, this is not just the mirror image of replacement by previous values. In practice, it is easiest to reverse the series and work the other way.
. gsort -time
. replace myvar = myvar[_n-1] if myvar >= .
gsort allows you to get reverse sort order; see [D] gsort. The command sort time puts highest values last, whereas gsort −time puts highest values first. It is as if you had generated a variable that was time multiplied by −1 and sorted on it, and, in fact, this is exactly what gsort does behind the scenes, although the variable is temporary and dropped after it has served its purpose.
. replace myvar = myvar[_n+1] if myvar >= .
does not produce a cascade effect. myvar[2] would be replaced by existing myvar[3], myvar[3] would be replaced by existing myvar[4], and so forth. At most, one of any block of missing values would be replaced. This might, of course, be exactly what you want.
Once again, nothing can be done about any missing values at the end of the series (placed at the beginning after the gsort). After replacement, you will probably want to reverse the sorting once again by
. sort time
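
Putting the three steps together on made-up data (the values are assumptions chosen so that blocks of missings occur at the start, middle, and end):

```stata
clear
input time myvar
1 .
2 .
3 56
4 .
5 78
6 .
end

* reverse the order, cascade "downwards", then restore the order
gsort -time
replace myvar = myvar[_n-1] if myvar >= .
sort time
list, sep(0)
```

Times 1 and 2 take the value 56 from time 3, time 4 takes 78 from time 5, and time 6 stays missing because nothing follows it.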

5. Complications: several variables and panel structure

Two common complications are

You want to do this with several variables: use foreach. sort or gsort once, replace all variables using foreach, and, if necessary, sort back again.
You have panel data, so the appropriate replacement is a neighboring nonmissing value for each individual in the panel.

Suppose that individuals are identified by id. There are just a few extra details to review, such as
. by id (time), sort: replace myvar = myvar[_n-1] if myvar >= .
or
. gsort id -time
. quietly by id: replace myvar = myvar[_n-1] if myvar >= .
. sort id time
The key to many data management problems with panel data lies in following sort by some computations under by:. For more information, see the sections of the manual indexed under by:.
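
Both complications can be handled at once. The following is a sketch on a made-up two-person panel; the variable names x and y are hypothetical.

```stata
clear
input id time x y
1 1 10 .
1 2 .  5
1 3 .  .
2 1 .  7
2 2 20 .
end

* cascade previous values downwards, separately within each id
foreach v of varlist x y {
    by id (time), sort: replace `v' = `v'[_n-1] if missing(`v')
}
list, sepby(id)
```

Under by id:, the subscript _n-1 is counted within each individual, so x for id 2 at time 1 stays missing and does not inherit 10 from the last observation of id 1.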

6. With tsset

If you have tsset your data, say, by typing
. tsset time
then
. replace myvar = L.myvar if myvar >= .
has the effect of copying in cascade, whereas
. replace myvar = F.myvar if myvar >=.
has no such effect. The value of tsset is that it takes account of gaps in your data and (if you had declared a panel variable) of any panel structure to your data.
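
The difference between subscripts and time-series operators shows up when a time point is absent. In this sketch the data and the variable names byrow and bylag are made up, and time 3 is deliberately left out:

```stata
clear
input time myvar
1 10
2 .
4 .
5 .
end
tsset time

* subscripting copies from the previous row, whatever its time
gen byrow = myvar
replace byrow = byrow[_n-1] if byrow >= .

* L. copies from time - 1, which does not exist for time 4
gen bylag = myvar
replace bylag = L.bylag if bylag >= .
list, sep(0)
```

byrow is 10 everywhere, whereas bylag is filled only at time 2: the lag of time 4 is the absent time 3, so the cascade stops at the gap. Which behavior is appropriate depends on whether a value should be carried across a gap.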

7. Missing values in sequences

In some datasets, time variables come with gaps, something like
    _n   year
     1      .
     2      .
     3   1990
     4      .
     5      .
     6      .
     7      .
     8   1995
     9      .
    10      .
We can use a similar method and rely on cascading:
. replace year = 1988 in 1
. replace year = year[_n-1] + 1 if missing(year)
The difference is simply that each value is one more than the previous one. If data were once per decade, each value would be 10 more, and so forth. Again missing values at the beginning of a sequence need special surgery, as shown here. With tsset panel data use L.year + 1 rather than year[_n-1] + 1.
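
A sketch that rebuilds the ten-observation example of this section and fills the sequence:

```stata
clear
set obs 10
gen year = .
replace year = 1990 in 3
replace year = 1995 in 8

* seed the first observation by hand, then let the cascade count upwards
replace year = 1988 in 1
replace year = year[_n-1] + 1 if missing(year)
list, sep(0)
```

The known values 1990 and 1995 are left alone and happen to agree with the running count, so the result is 1988 through 1997 with no breaks. If a known value disagreed with the count, the sequence would simply restart from that value.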
