Exporting / manipulating paradata

It’s great that Survey Solutions provides paradata.

For each observation, we are looking to compute the time taken to answer each section of the survey. This is to try to identify modules that have been answered too fast to have been administered properly.

We understand the paradata takes the form of one tab file per observation, and the order of the variables in each tab file may vary depending on the sequence of actions.

Is there a way to tailor the export of the paradata, for instance to extract the paradata only for a subset of variables (e.g. the time at which the first and last question of each section was answered), or in the same sequence across all observations (and possibly even in a single database that would include all observations)?

If not, are there examples of programs users have developed to perform these manipulations and extract the time stamps?

We are assessing whether we need to write a script to do this, but pointers from anyone having done such manipulations would be welcome.


Dear Patrick,

At the moment there is no customization of the paradata output: you always get the whole history when you export. We do plan to make some usability improvements in upcoming releases, but those may come in a month or so.

Meanwhile, all the filtering or aggregation can be done in any scripting language you are comfortable with: Stata, R, Python, etc. I’d like to thank Michael for offering his help. We can also try to advise you. Please contact me or any of our team members and we’ll try to walk you through the process.


Dear Patrick,

I played a bit with the paradata a while back to compute response latencies. I wrote all the code in R, but haven’t uploaded it to GitHub. If you contact me privately, I’m happy to share the code.

Essentially my workflow goes like this (if memory serves):

  1. I create a function which reads in one case file. I concatenate the date and time columns, and convert them to POSIX.
  2. I sort the dataset according to the new date time column, and assign an order number.
  3. I create a column which stores the case id.

I apply this function to each data file (using lapply), and store the output in a list. I then collapse (using rbindlist) this list into a single data frame with all the columns from the paradata files plus the case Id, order, and POSIX date and time column. This is my master data set.
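For readers who prefer Python to R, the read-and-combine step above could be sketched roughly as follows, using only the standard library. This is not the R code mentioned in this post. The column names `date` and `time`, the timestamp format, and the use of the file name as the case id are all assumptions, so adjust them to match your actual export.

```python
import csv
from datetime import datetime
from pathlib import Path


def read_case(path, date_col="date", time_col="time"):
    """Read one tab-delimited paradata file into a list of row dicts.

    Assumed (hypothetical) layout: one header row, a date column and a
    time column formatted as YYYY-MM-DD and HH:MM:SS.
    """
    with open(path, newline="", encoding="utf-8") as fh:
        rows = list(csv.DictReader(fh, delimiter="\t"))
    for row in rows:
        # Concatenate date and time, then parse into a datetime object.
        row["timestamp"] = datetime.strptime(
            f"{row[date_col]} {row[time_col]}", "%Y-%m-%d %H:%M:%S"
        )
        # Assumption: the file name (without extension) identifies the case.
        row["case_id"] = Path(path).stem
    # Sort by the new timestamp and number the events within the case.
    rows.sort(key=lambda r: r["timestamp"])
    for order, row in enumerate(rows, start=1):
        row["order"] = order
    return rows


def build_master(folder):
    """Apply read_case to every .tab file in a folder and stack the results."""
    master = []
    for path in sorted(Path(folder).glob("*.tab")):
        master.extend(read_case(path))
    return master
```

Calling `build_master("path/to/paradata")` on a folder of per-case files would then give one list of rows, each carrying the original columns plus `timestamp`, `case_id` and `order`.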

Next, I write a function to create a variable table. This function takes the master dataset as an input, and outputs a table where each row is a variable, with the frequency, average, median, max, min, and standard deviation of the response latency.

The method I use for computing the stats for response latencies is as follows:

  1. I group my master dataset by case Id.
  2. I compute response latency for each question for each case as time(t) - time(t-1) where t is the row number. The latency for the first question is not computed.
  3. I then generate a table for the summary stats using the piping features in the dplyr package. This is intuitive enough for an R user.
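As a rough illustration of the three steps above, here is a standard-library Python sketch (again, not the dplyr code this post refers to). It assumes each row is a dict with hypothetical keys `case_id`, `variable` and `timestamp` (a `datetime`), and it drops the first event of each case because it has no predecessor.

```python
from collections import defaultdict
from itertools import groupby
from statistics import mean, median, stdev


def add_latencies(master):
    """Return rows with 'latency' in seconds: time(t) - time(t-1) within a case."""
    out = []
    keyed = sorted(master, key=lambda r: (r["case_id"], r["timestamp"]))
    for _, rows in groupby(keyed, key=lambda r: r["case_id"]):
        prev = None
        for row in rows:
            if prev is not None:  # first event of a case gets no latency
                seconds = (row["timestamp"] - prev).total_seconds()
                out.append({**row, "latency": seconds})
            prev = row["timestamp"]
    return out


def variable_table(rows):
    """Summarise latencies per variable: n, mean, median, max, min, sd."""
    by_var = defaultdict(list)
    for row in rows:
        by_var[row["variable"]].append(row["latency"])
    table = {}
    for var, vals in by_var.items():
        table[var] = {
            "n": len(vals),
            "mean": mean(vals),
            "median": median(vals),
            "max": max(vals),
            "min": min(vals),
            # The sample standard deviation needs at least two values.
            "sd": stdev(vals) if len(vals) > 1 else None,
        }
    return table
```

One caveat with this definition: the difference between consecutive events attributes any pause (a break, a suspended interview) to the next question, so very large latencies may need to be trimmed before summarising.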

*I know this isn’t perfect, and when you dig in, its flaws become apparent.

At this point, you could append a column which identifies the corresponding module. You could group by module, and then perform operations on the response latencies to compute the time each takes.
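A minimal Python sketch of that module step, under stated assumptions: rows are dicts with hypothetical keys `case_id`, `variable` and `latency` (seconds), and `module_of` is a lookup you build yourself from the questionnaire, mapping each variable name to its module. The threshold passed to `flag_fast` is arbitrary and only there for illustration.

```python
from collections import defaultdict


def module_durations(rows, module_of):
    """Sum latencies per (case_id, module)."""
    totals = defaultdict(float)
    for row in rows:
        # Variables missing from the lookup are grouped under "unmapped".
        module = module_of.get(row["variable"], "unmapped")
        totals[(row["case_id"], module)] += row["latency"]
    return dict(totals)


def flag_fast(totals, min_seconds):
    """Return the (case_id, module) pairs completed in under min_seconds."""
    return sorted(key for key, secs in totals.items() if secs < min_seconds)
```

For example, `flag_fast(module_durations(rows, module_of), 60)` would list every case and module pair that took less than a minute, which is one way to shortlist modules that may have been administered too quickly.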

Good luck, I hope that makes some sense!

For any SuSo users that may be interested, there’s some messy code for tabulating paradata here:


Feel free to fork it and make it less messy.
