Paradata - Difficulty working in Stata

Hi team,

I’m working with some paradata for a current survey and noticed that, since I last worked with it, it has become a fair bit larger, to the extent that it is now difficult to work with in Stata. In particular, I’m having trouble unpacking the parameters variable. I’m using the command:

split parameters, p(||)

Any recommendations on how to work with this more efficiently? I’ve tried ‘compress’ before running the above, as well as dropping as many entries as possible beforehand, but it still freezes Stata or takes a very long time.
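Concretely, the workflow looks roughly like this (paradata.dta stands in for the actual file name):

use paradata.dta, clear
compress
* drop whatever entries are not needed (the exact condition depends on the analysis)
split parameters, p(||)
* ^ this is the step that freezes or runs for a very long time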

Many thanks

I know this may not be helpful, but my suggestion is to work in R or Python instead.

For R, there are a few packages built for this.

For Python, there may be as well, but I’m simply not aware of them.

For Stata, have a look at sursol. While I’ve not used it in a long time, it looks to have some commands for handling paradata, and it may address part of your problem.

From the way you describe it, this appears to be a Stata performance issue, which you can certainly raise with the Stata developers (did you?), and yet “a very long time” may very well be what is needed to process that much data.
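In the meantime, one way to keep Stata responsive is to process the file in pieces and append the results, so that no single split() call touches the whole dataset. A rough sketch, assuming the paradata is saved as paradata.dta and that a 500,000-row chunk fits comfortably in memory (both are placeholders):

describe using paradata.dta, short
local N = r(N)
local chunk 500000
tempfile combined
local first 1
forvalues start = 1(`chunk')`N' {
    local stop = min(`start' + `chunk' - 1, `N')
    * load only one chunk of rows from disk
    use in `start'/`stop' using paradata.dta, clear
    split parameters, p(||)
    if `first' {
        save `combined'
        local first 0
    }
    else {
        append using `combined'
        save `combined', replace
    }
}
use `combined', clear
* note: appending this way reverses the chunk order; sort afterwards if row order matters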

If you could show that, for example, Fortran code can do in 17 seconds what the split() command takes 23 minutes to do in Stata, then one could conclude that optimizations are possible. Clearly split() is not the only solution (nor is it necessarily the right one) for this task, and other code exists that avoids it.
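For instance, if only the first parameter is ever used downstream, it can be extracted directly instead of generating the full set of split variables. A hypothetical sketch (first_param is a made-up name):

* keep single-parameter rows as they are
gen first_param = parameters
* for rows containing the delimiter, keep only the text before the first ||
replace first_param = substr(parameters, 1, strpos(parameters, "||") - 1) if strpos(parameters, "||") > 0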

Hi Sergiy,

Clearly split() is not the only solution (nor is it necessarily the right one) for this task, and other code exists that avoids it.

I’m aware of this, and that is why I’m here asking the question: whether any colleagues have experience working with the paradata and can point me toward some more efficient commands or techniques.

Thanks,

Lachlan

Thanks Arthur, really helpful. I will have to brush up on R a bit!

If you decide to use R, want to do your own data manipulation, and know or want to learn some tidyverse tools (e.g., dplyr, tidyr, etc.), consider checking out tidytable. This package is very performant for large data but offers the well-known, expressive syntax of the tidyverse. (I never quite got into data.table.)

For Python, I don’t know much (yet), but polars is one that is on my radar for code that’s apparently more performant than pandas. Plus, it has the “cool” factor of being written in Rust.

For Stata, I’ve come across one package that’s meant for large data: ftools. I’m not sure whether it will address the parts of the data transformation process that are slowing your work.
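A hypothetical sketch of where ftools might help, assuming it is installed from SSC and that interview__id stands in for whatever the interview identifier is called in your export:

* install once: ssc install ftools
use paradata.dta, clear
* fast numeric group id over a high-cardinality key
fegen long int_id = group(interview__id)
* fast collapse: count paradata events per interview
fcollapse (count) n_events = int_id, by(interview__id)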

Last idea: see if you can get a virtual machine that has more computational resources than your machine.