Hello everyone!
I've been cheking the data base, and the following is a little piece of what I've found (total observations are 30 361):
HHID | HVIDX | AGE | ALIVE |
"96921 | 2 | 4 | yes |
#740"2 | 6 | 5 | yes |
057=34 | 5 | 2 | no |
146789 | 1 | 3 | yes |
146789 | 2 | 7 | no |
146789 | 3 | 1 | - |
146790 | 1 | 2 | no |
146790 | 2 | 5 | - |
146790 | 3 | 8 | yes |
146790 | 4 | 6 | yes |
146790 | 5 | 4 | - |
I realized that the first three observations of HHID were badly typed. I checked manual reference, but "drop" command didn't convince me at all. Bearing in mind this, I have some questions:
1.- How can I delete the first three badly typed observations? I want to delete the complete row (beginning with HHID and including observations from HVIDX, AGE, and ALIVE variables). The result that I would like to have would be like this:
HHID | HVIDX | AGE | ALIVE |
146789 | 1 | 3 | yes |
146789 | 2 | 7 | no |
146789 | 3 | 1 | - |
146790 | 1 | 2 | no |
146790 | 2 | 5 | - |
146790 | 3 | 8 | yes |
146790 | 4 | 6 | yes |
146790 | 5 | 4 | - |
2.- Then, how can I delete those observations if the AGE>6 (beginning with HHID and including observations from HVIDX, AGE, and ALIVE variables). The result that I would like to have would be like this:
HHID | HVIDX | AGE | ALIVE |
146789 | 1 | 3 | yes |
146789 | 3 | 1 | - |
146790 | 1 | 2 | no |
146790 | 2 | 5 | - |
146790 | 4 | 6 | yes |
146790 | 5 | 4 | - |
3.- How can I delete those observations if ALIVE==. (missing) Is it the same process in the previous question that applies here as well?
Thanks in advance for your help
I'd encourage you to look at a beginner guide. I'm not sure what you mean by not being convinced by the drop help guide, as it features examples such as
drop if medage>32
Here's everything. I prefer to use a command that drops based on attributes (and not just the first 3) in case the data ever becomes sorted differently.
gen byte notnumeric = real(HHID)==. drop if notnumeric == 1 drop notnumeric drop if AGE > 6 drop if missing(ALIVE) *You could also do drop if ALIVE == ""Does your ALIVE variable contain the literal string "-"? Then it would be drop if ALIVE == "-" Last edited by Andrew Castro; 01 Feb 2017, 17:43 .
After sorted "HHID" variable, I detected 36 observations either were blank or the codes were badly typed. Then, I used the code you've posted, and it worked but it deleted 43 observations, seven more than I've founded. Looking for an explanation, I observed what kind of variable was HHID, and it resulted to be a string type, more exactly str15. Given this, does the code you previously wrote have to change?
The ALIVE variable has missing values "."
Last edited by Brian Yalle; 01 Feb 2017, 20:30 .gen byte notnumeric = real(HHID)==. drop if notnumeric == 1
identifies and then drops observations in which HHID is not a string representation of a number. That was Andrew Castro's hypothesis about what characterizes mistyped entries. But perhaps there are some valid HHIDs that are not the string representation of a number. Or perhaps you simply failed to notice 7 observations that fit that criterion and are, indeed mistyped. Hard to say.
Anyhow, go back to your original data and run
browse if real(HHID) == .
The browser will open displaying all observations that meet Andrew's criterion. You can inspect them and figure out why there are 43 of them instead of the 36 you expected.
I have two more doubts, I hope you can give a hand again, here goes:
a) Previously, I mentioned "HHID" variable which is string (str15), but in fact as one can see HHID only contens six digits. So I wish to change this in order to have a HHID with str6. From what I read in the manual, the format of str15 is "%15s", so if I want six digits, it would be like "%6s", then remains to know what function would be the most suitable to apply. In an attempt, I used strtrim, substr, and strlen, but none of them have reduced the number of digits (string). Which would be code not only to reduce but also to enlarge the length of the string format? More details about HHID (str15): let's suppose that "_" means blanks, an observation can look like as:
_ _ _ _ _ _ _ _ _ 1 4 6 7 8 9b) Other worrisome variable is CASEID which is string too but with eighteen digits (str18). The way the numbers have been placed is a little bit complex. Anew, let's imagine that "_" means blanks, an observation can look like as:
_ _ _ _ _ _ _ _ _ 2 3 6 9 1 7 _ _ 3
The structure of this str18 starts with six blanks followed by six six numeric digits, then two more blanks and it ends with one numeric digit.
Similar to the question before, I want to eliminate the first nine spaces from the left, in other words, reduce the length from 18 to 9 digits (str9)
CASEID | REGION | PLACE OF RESIDENCE | HEALTH INSURANCE |
coast | urban | yes | |
2409 | highland | rural | no |
236917 1 | highland | rural | - |
236917 2 | highland | rural | no |
236917 3 | highland | rural | - |
236920 4 | jungle | urban | yes |
236920 5 | jungle | urban | yes |
`" / 252301 4 | coast | urban | - |
As can be seen, the first, second and seventh observation was bad typed and I decided to delete theses three observations This table is just a piece from the complete data I have. With the purpose of failing to notice other bad typed observations, I used the code that Clyde Schechter sugested:
browse if real(CASEID) == .
After typed this, the browse window showed a lot of observations, more exactly were 25 761 obs., which in fact are the total numbers of observations of CASEID.
For the moment ignoring this, I followed with the procedure, and applied the code Andrew Castro posted:
gen byte notnumeric = real(CASEID)==.
drop if notnumeric == 1
It runned, and showed that 25 750 observations were eliminated, leaving only 11 observations in CASEID that were mal typed also. What solution could give?
So, again, the condition real(var) == . picks out observations where the content of the string variable var is not the human-readable representation of a number. It does not, it seems, correspond to what you are looking for as data-entry errors. It certainly will not work for CASEID, because that variable is supposed to contain embedded blanks, and numbers cannot contain embedded blanks. That's why almost nothing is left when you apply this condition.
Now, I find your discussion of HHID confusing and I'm really unsure what it is you want to do. My best guess is that you want to retain only those observations where the value of HHID consists of exactly 6 digit characters. If that's it you can do this with:
keep if regexm(HHID, "[0-9][0-9][0-9][0-9][0-9][0-9]") format HHID %6s
The fact that HHID is stored as str15 suggests that not all its values are (or were earlier in data management) just 6 characters long. And it may be that while they look like they have only 6 characters, it may be that there are leading or trailing blanks that you aren't seeing. So it may be that what you really need is not the code shown just above but:
keep if regexm(trim(HHID), "[0-9][0-9][0-9][0-9][0-9][0-9]") format HHID %6s
As for CASEID, it sounds like what you want is:
replace CASEID = substr(CASEID, 7, .) // REMOVE FIRST 6 CHARACTERS
And if you then want to keep onlly those observations where CASEID now conforms to the pattern of 6 digits, 2 blanks, and 1 digit, follow the immediately preceding code with:
keep if regexm(CASEID, "[0-9][0-9][0-9][0-9][0-9][0-9] [0-9]")
I applied your advices Clyde Schechter but unfortunately none of them have worked out.
keep if regexm(trim(CASEID), "[0-9][0-9][0-9][0-9][0-9][0-9] [1-20]")
Then I used the code above, it showed that delete 25 680 observations from 25761, remaining only 81 obs. It's worth make an accuracy, with [1-20] I mean that there are observations that have just one numeric digit (for example it could be like "1" with a blank that precedes, not "01""), and sometimes it presents two numeric digits.
The code below didn't alter the format I would like to, I noticed that after run "codebook CASEID". The string format still is 18.
format caseid %9s
Maybe I have to point out that I'm using Stata 13. Summarizing a little about what I wanted to do is the following: I checked a bunch of files in which I noticed that "key variables" like HHID or CASEID presented some mistyped observations, then I decided to eliminate them because they would probably cause problems when merging these files into a single one. Let me give you more details, HHID is a household identification variable using six numeric digits in a string format of 15, while CASEID is a personal identification variable that uses in the first part six digits, then a blank and the last part uses one or two digits -these two spaces can be filed with a number between 0 and 20, all of them in string format of 18. The HHID varible presents less observations than CASEID because this one refers in detail to househols members. In both cases, the first six spaces in left are blank.
Afterwards, when I merge them I will only require the first digits of CASEID to match with HHID, so that's the reason what I thougt I would be good to keep the six digits from each "key variable" but also change the format they were placed.
Last edited by Brian Yalle; 07 Feb 2017, 09:18 .It is a good idea when using code posted here to read the manual section on any commands or functions used if you do not already know them well. This is all the more so if you plan to make adjustments to the code. -regexm()- parses POSIX regular expressions, and [1-20] does not mean what you think it means. The POSIX regular expressions do not refer to numeric regions. They refer to sets of characters. So [1-20] will not identify character sequences that humans would recognize as representing numbers between 1 and 20.
In your original description of your problem, I understood you to be looking for sequences of exactly 6 digits. Now it appears that it is really up to 6 digits, possibly preceded by a large number of blank spaces in HHID. And CASEID is similar, but the (up to) 6 digits are followed by a blank and then a number between 1 and 20. So based on those criteria:
replace HHID = trim(itrim(HHID)) replace CASEID = trim(itrim(CASEID)) destring HHID, gen(HHID_check) force gen byte HHID_ok = inrange(HHID_check, 0, 999999) & HHID == int(HHID) split CASEID, gen(part) destring force gen byte CASEID_ok = inrange(part1, 0, 999999) & part1 == int(part1) /// & inrange(part2, 1, 20) & part2 == int(part2)
The logic is this. The first two commands eliminate any leading or trailing blanks, and reduce any sequences of internal blanks to a single blank. This eliminates the difficulties created by varying numbers of blank spaces.
The second two commands deal with HHID by forcing Stata to interpret it as a number. The HHID is valid if the resulting number is an integer between 0 and 999999. The final two commands similarly deal with CASEID, except that now there are two "parts" to deal with. The first is checked for being n integer between 0 and 999999 (just like HHID), and the second part is checked for being an integer between 1 and 20.
Note that this is not tested, and it may contain some typos.