When doing a simple survey, many users lack the expertise to programmatically run through the *.do file to replace the values with their respective labels.
The most basic case of single select questions provide numeric outputs in the data export.
This makes it extra work for the user to then programmatically run through the *.do or to run a vlookup() on excel after manually copying the values from designer before running it through whatever data analysis program that they are using.
Or having to manually change the resultant analysis output by their label.
The modern data analysis programs have no problem with aggregating strings, so it would be great if there was an option to export the data with labels rather than values.
To the extent that I am aware do files may only be run in the Stata software, which allows the user to resave the data with labels instead of codes (values) with a single command. See help for outsheet. I am not sure how the designer is involved. Please clarify.
If you don’t have access to either Stata or SPSS, you’re likely to use the Data Export of Tabular
It’s possible to programmatically parse the *.do file to replace values with labels in the csv.
It’s also possible to use SpssLib or similar to parse the SPSS files and export them as csv with the labels.
However both of these require a level of technical skill not often available to a surveyor, so they might be forced to do a vlookup in a spreadsheet program, having copied the values and their labels from designer.
It would be much easier for the user to just be able to export the data in tabular format with labels rather than values in the export.
Neither SPSS and Stata are free to use.
When you are in an organisation with a very limited budget, buying these programs is simply not an option.
So it would be nice for the data export to consider users without either SPSS or Stata.
Which it does with the DDI and the tabular export, it’d just be nice to have this extra option.
As an R user, I agree that it would be more useful to have the tabular export the value labels instead of the values exported in the Tabular data.
Right now, I have to label them manually in my script by getting the value labels either from Designer, or the do file, or use a package to read in Stata file. All of which are not ideal. I agree with @macuata that it would be nice for the Tabular data export to be easier to use for users without SPSS or Stata.
@macuata, are there particular programs you have in mind? Would the ideal input data have categorical variables be strings corresponding to their labels? Do you see any second-best options (e.g., a easily usable dictionary of variable and value labels)?
Like @l2nguyen, I’m increasingly an R user. For R, there are a few paths to the desired end result with the existing export data:
Get labels from Stata (or SPSS). This would involve reading the Stata (or SPSS) file into R, and then changing the labelled variables from numeric to factor.
Get labels from Stata .do file. This would entail injesting the .do file as a structured text file. The user could subset to the labels of interest, apply them to variables, and then convert numeric variables to factor.
The first option is much easier, in my opinion, than the second. But–here I agree with you both–the workarounds may be beyond the coding skills of many users. For that reason, tab export of categorical variables as factors is, in my opinion, the first-best solution.
Tableau is the program I am using to create visualisations from the data I’m collecting, along with with generating HTML reports with my own custom program for all the villages/districts/provinces.
But even for a spreadsheet program it would be easier to create charts straight from the tab file if there was an option to be exported in pretty much the reverse fashion, with the labels in the tab and values in a separate file.
It is an option to export a dictionary/lookup table but I agree it’d be a distant second option as the first option allows the data to be immediately used by both someone of a low technical background and R users.
I expect the vast majority of the people I will be sharing the data with to run very basic analyses, such as how many houses in the village what are the houses made of, which villages don’t have fresh water, which village has the least toilets per person etc.
It’d be nice to not have to do quite so much post processing in order for the data to be easily usable.
In the absence of the desired feature, here’s an R work-around. The code below offers a simple template for the task of transforming variables with value labels into string variables whose values are the labels. The code only requires 2 inputs: the path to where the file is located (dataPath), which needs to be specified slashes instead of backslashes; and the file name (dataFile). The code could be extended to loop over all Stata files in a folder.
# load necessary libraries
# if you don't have them, do a one-time install via install.packages()
library(haven) # for reading Stata files into memory and preserving labels in labelled class
library(dplyr) # for data wrangling and identifying columns to be converted to character
library(stringr) # for changing the file extension from ".dta" to ".tab"
# read in data
dataPath = "C:/my/file/path/"
dataFile = "someData.dta"
myLabelledData <- read_stata(file = paste0(dataPath, dataFile), encoding = "UTF-8")
# create function to transform labelled variables to character variables
# first, transform labelled to factors, where levels are labels
# then, transform factor levels to characters
makeCharacter = function(x) x = as.character(as_factor(x, levels = "default"))
# apply function to all variables that have value labels
myCharacterData <- myLabelledData %>%
mutate_if(
.predicate = is.labelled, # check whether var has Stata value labels
.funs = makeCharacter) # if so, transform var to string
# write to tab-delimited format
outputFile = str_replace(dataFile, "dta", "tab")
write.table(myCharacterData,
file = paste0(dataPath, outputFile),
sep = "\t",
row.names = FALSE,
fileEncoding = "UTF-8")
Can I suggest @arthurshaw2002’s snippet be placed in the FAQ?
The main issue with this workaround is the need for coding ability.
There should be a data export that a user without much technical knowledge can use immediately.
The users I will be handing over to don’t have any.
I will leave them my program for the existing survey and can even knock up a generic converter, but in order for survey solutions to cater for this user base, it needs a simple export format.
It’s by far the biggest hurdle of the solution for beginner users who have little to no stats experience.
The rest is fantastic at having a simple user experience as a starting point, ramping up to expert when your needs are more complex.
@arthurshaw2002 Since your workaround gets the labels from the Stata datasets, does this mean it would require the user to download both the Stata and tabular formats?
For this, the user would need to download the Stata data only. With the R code, the Stata file provides the labels, and is written to tab-delimited format where questions with labels have string values rather than labelled integers.
I know that you’re working on a solution for getting values from the .do files that accompany the tab-delimited data. I’m looking forward to that as an alternate approach.
Building off the code that @arthurshaw2002, I made some R code that converts all Stata files in a folder to Excel files to share with any users who are more comfortable with Excel.
All the user has to do is change the folder at the end to point it to the folder that they want to convert all the exported Stata data to Excel format and saves it to the same folder. The excel format will have the value labels instead of values.
Fellow R user here!
Just following up with this thread to say that I too was surprised that there wasn’t a default option to replace select options with their associated labels. Thanks to @arthurshaw2002 and @l2nguyen for their code snippets – I was able to put together an R script to get the job done.
Echoing several folks here when I say that it took me quite a bit of time and effort to connect all the dots between the Stata files and the .tab exports. Just figuring out that the Stata files held the labels took a lot of trial and error (and some free software trials).
A mention of the issue on the support page for the data export topic or FAQs would be useful, and perhaps easier than a new feature on the HQ export side. In general, I’ve found myself rooting around a lot on these forums, random web posts and @l2nguyen’s Github page (and Google) for tips on how to handle the API, understand the data structure, and get the analysis going in R. Could I suggest giving these questions and discussions on this site a home under a new Forum Category, say, “Data Analysis”?
Additionally, if there were time/resources available, a brief quick-start guide for analyzing Survey Solutions output in R could be assembled by linking to pre-existing resources compiled by folks on these forums, and then posted in the Support documentation.
Thanks for all the thought and effort you’ve all put into the SuSo community.