Data Cleaning in Survey Solutions (SS)

During a survey, no matter how hard we try, we will get data errors: date errors, wrong genders, spelling mistakes, etc. Most of the time these errors are easily corrected without the need to contact the respondent again, but the process that has to be followed to make the correction is tedious. We need to reject the interview to a supervisor, who rejects it to the interviewer. Then the supervisor has to follow up with the interviewer to make sure the required change is made in the form. This takes time and effort for a problem that could easily be solved at the supervisor or HQ level.

Many such errors we ignore, planning to correct them during data cleaning (which can only start after collection is complete). This has its own problems: sometimes the correction gets skipped (the supervisor forgot to note the error), and it makes data cleaning tedious.

I propose a Data Cleaning step where the HQ (or a new Data Cleaning user) is able to correct these errors on the go, so that the final data coming out of Survey Solutions is cleaner. Cleaning data in parallel with data collection is much more efficient.

I understand that this may be considered beyond the scope of a survey management tool, but the transition between the survey tool and the data management tool should be facilitated. This greatly helps with the workflow and saves time.

I believe a similar request has been made earlier for the ability to alter/change responses using the API.

Please provide as much information as possible using the template for new feature suggestions in this branch of the forum.

Please avoid ‘magic’ in the description. You have mentioned so far that:

  • During a survey,… we will get data errors. Date errors, …
  • I propose a Data Cleaning step where the HQ (or a new Data Cleaning user) is able to correct these errors on the go …

Consider the following example: the HQ user received an interview where the person’s date of birth was recorded as July 04, 1976 and the age as 40. The interview date was August 27, 2021.

What should the HQ user change?

  • age 40 to 45?
  • year 1976 to 1981?
  • date July 04, 1976 to March 08, 1981?
  • perhaps, cast doubts on the interview date?
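
For what it is worth, the inconsistency itself is easy to detect mechanically: with an interview date of August 27, 2021, the recorded date of birth implies an age of 45, not 40, so at least one of the three values must be wrong. A minimal Stata-style sketch of such a check (the variable names dob, age, int_date and dob_age_conflict are hypothetical, not actual questionnaire variables):

    * Hypothetical variables dob, age, int_date (Stata daily date values).
    gen implied_age = floor((int_date - dob) / 365.25)
    * For this case: floor((mdy(8,27,2021) - mdy(7,4,1976)) / 365.25) = 45, not 40.
    gen byte dob_age_conflict = abs(implied_age - age) > 1 if !missing(implied_age, age)
    list interview__id dob age implied_age if dob_age_conflict == 1

A check like this can flag the conflict, but it cannot say which of the three values should be changed.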

  • We need to reject the interview to a supervisor, who rejects it to the interviewer.

As an HQ user, you can reject directly to an interviewer, including to your own interviewer account, where you can make corrections.

I have no objection to the current process of rejecting interviews to the Supervisor and then to the Interviewers. I am proposing a new feature that also permits the HQ to make these corrections.

In this example, I would reject it to the interviewer to verify. But consider the case where I have also collected a picture of an ID. Now I can see whether the date of birth or the age is wrong, and I could simply make the correction.

I understand that we open a can of worms here, with Supervisors and HQs making changes to the data even when they aren’t sure of the response. I am not sure how to strike this balance. Some possible approaches:

  1. Rejecting the data to a Data Cleaner who has permission to edit only the variables flagged for correction in the rejection.
  2. HQ has permission to edit.
  3. Specific questions where we can anticipate such corrections may be marked editable in Designer or while importing (like exposed variables).

I apologise: I am describing a problem and not necessarily a solution, only suggestions for possible solutions.

Abstracting from any specific software, as I think this question is not only about Survey Solutions, I’d argue that even for systems where you have the ability to ‘directly’ edit data, you should try to build processes that avoid doing so. The ‘magic’ word (I know @sergiy will not like this :slight_smile: ) would be reproducibility: you want your actions to be documented and easy to review.

Survey Solutions does a great job with paradata to keep track of all the changes, and if HQ-based data editing were implemented those edits would of course be recorded, but it is still extremely difficult to get a full picture of what edits were made, when, and most importantly why.

The alternative option is to have a clear pipeline of the work - data collection => raw data; data-cleaning scripts => cleaned data; data-analysis scripts => report(s).

When working on poverty analysis at the bank (our ‘other’ life outside of Survey Solutions), for many countries we have survey data going back a few years, with derived variables (weights, total expenditure, poverty aggregates, etc.) constructed by a person who no longer works here. If I only have a dataset with ‘cleaned’ variables in it, I have no way of knowing what was cleaned, and how. But if I have the original file and, say, a Stata script that somewhere shows replace age = ... if age > 100, I have the option to exclude that line, keep the original age variable, and deal with the large values of age in some other way…
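
For illustration only, a minimal sketch of what such a script can look like (the file names, variable names and interview__id value below are placeholders, not actual export names):

    * Start from the untouched Survey Solutions export.
    use "raw/hh_members_raw.dta", clear

    * Every correction is an explicit, reviewable, re-runnable line.
    replace age = .  if age > 100                                 // implausible ages set to missing
    replace sex = 2  if interview__id == "PLACEHOLDER" & sex == 1 // correction verified against the ID photo

    save "clean/hh_members_clean.dta", replace

Anyone picking up the project later can read, re-run or selectively disable each correction, which is exactly what is lost with undocumented in-place edits.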

Also, specifically for the Survey Solutions workflow: since there is not much else you can do with an interview once it is collected (other than export it), I don’t see what the benefit of editing would be. You are going to export the data to another system for storage, analysis and reporting anyway, so would it not make sense to do all the data-manipulation work at the same time and place?

Why would it be difficult to get this information? The what and the when should be easy to get from the paradata. The why can also be recorded as a comment.
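
For example, pulling the what, when and who of answer changes out of the paradata export takes only a few lines. A rough sketch, assuming the usual paradata layout (the file name and the column names event, role, responsible, parameters and timestamp may differ between versions):

    import delimited "paradata.tab", clear varnames(1)

    * Keep only the events where an answer was set.
    keep if event == "AnswerSet"

    * parameters is expected to hold "variable||answer||roster vector".
    split parameters, parse("||") generate(p)
    rename (p1 p2) (variable answer)

    * Answer changes made by someone other than the interviewer.
    list interview__id variable answer responsible timestamp if role != "Interviewer"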

Editing/cleaning data is a broad task. It may involve (a) data processing/manipulation (re-categorisation, changes in format, etc.) and (b) correction of data-entry errors.

I only want the ability to correct errors within SS without sending the interviews back to the interviewers. Reassigning interviews to someone else (or to oneself) for corrections has its own problems: we want information on who collected which interview (for follow-up and payment), and reassignment complicates this.

This pipeline is fine for processing data, but I doubt that errors can always be corrected using pre-defined cleaning scripts. It helps if the data coming out of SS is as error-free as possible. What I am proposing is a modification of the pipeline:

Within SS: data collection => raw data => Error Corrections;
Outside SS: data-cleaning scripts => cleaned data; data-analysis scripts => report(s)

Have all interviewer accounts start with ‘i’, for example: iJohn, iPeter, iMary.
Have a corrector account in each team starting with c, for example: cPamela.

  1. After any interview is completed by an i-account, automatically reject it to a c-account in that same team.
  2. After an interview is completed by a c-account, do not automatically process it, but let the supervisor decide on it.
  3. For establishing who collected the data: that would be the i-accounts mentioned in the interview__actions file.
  4. For establishing who corrected the data: that would be the c-account mentioned in the interview__actions file (see the sketch after this list).
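
A rough sketch of pulling both pieces of information from the interview__actions export (the file name and the column names action and originator are assumptions and may differ between versions):

    import delimited "interview__actions.tab", clear varnames(1)

    * Keep only completion events (the coding of the action column may differ).
    keep if action == "Completed"

    * i-accounts collected the data, c-accounts corrected it.
    gen who_collected = originator if substr(originator, 1, 1) == "i"
    gen who_corrected = originator if substr(originator, 1, 1) == "c"

    * One row per interview with both names.
    collapse (firstnm) who_collected (lastnm) who_corrected, by(interview__id)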

If you have more than one corrector per team, you could also do some load balancing, considering, for example, their performance, availability, and currently assigned workload.

PS: @kv700032 may want to practice on this, since this would be useful for your survey as well.

I understand there are multiple workarounds to achieve this (password sharing :shushing_face:). But that’s what they are, workarounds. They are convoluted and tedious.

I would much rather have the ability to correct errors as a feature than as a workaround.

Just to clarify: none of the steps in my post above required any sharing of the password. Furthermore, I don’t see any efficiency gain from sharing a password with anyone.

In Survey Solutions:

A) it is perfectly fine (from the security standpoint) to have two or more accounts, and in particular accounts in different roles (e.g. a headquarters role and an interviewer role), for the same person;

B) it is not ok for two persons (e.g. John and Mary) to have access to the same account (e.g. Super1) simultaneously.

This seems to me a very good practice to carry out in the future ENIGH. However, one of the main problems encountered during the weeks of fieldwork in the last ENIGH was that the field personnel (interviewers and supervisors) did not have the appropriate knowledge to perform data cleaning. The people with the c-accounts would therefore have to be in the office, under the orders of HQ, and the data cleaning would be done under assumptions or with incomplete information, since they would not be in the field obtaining the information directly from the informant.
But I think it is a great idea to have a c-account on each team to take care of cleaning the data for each interview conducted by that team.

I was just hinting that password sharing is another such workaround.

Every team may or may not have a cleaner. Even if there is one, they may receive only the interviews sent to them for cleaning, not all interviews. I would also want a cleaner to be shareable by multiple teams.

In Honduras, during the National Household Income and Expenditure Survey, we thought the same. We had a “cleaning team” called “critics-coders” who were going to reclassify some products and make the corresponding corrections in all the interviews collected in the field, across all teams. In practice the approach was wrong because it caused a big bottleneck: cleaning of the interviews was delayed, the cleaning team got stuck with so many interviews, and often it was not known what corrections had been made to the interviews. For this reason, I think it would be more optimal to have a c-account on each team.

This would depend on the survey. There are surveys, like yours, that would overwhelm a single cleaner, and there would be surveys that do not have the resources for a dedicated cleaner on every team, or that are too small to warrant team-specific cleaners.

It’s a question of choice. The choice of how to structure the survey should lie with the people conducting the survey, and not be limited by the software.

In this suggestion, the cleaner may be optional. If no cleaner is allotted to the team, the system may continue to work the way it works now. A cleaner could be assigned to one or multiple teams. We could have:

  1. 1 cleaner for all teams
  2. 1 cleaner per team
  3. 2 or more cleaners in a team
  4. A cleaner for teams A and B and another for teams C and D
  5. 1 cleaner in team A, no cleaner in team B
  6. no cleaners in any team.

I really like this idea–and this discussion.

Personally, I think that data cleaning should be done, in an ideal world, in a way that is both as transparent and as “reproducible” as possible. Transparency means that one should be able to observe who changed what, when, and why. Reproducibility means that one should be able to replicate changes and, as an important corollary, make alternate changes.

If these two principles conflict in practice, I would like to give the user options for how to resolve the conflict. Currently, the user has both transparency and reproducibility, on the strong condition of doing data cleaning outside of Survey Solutions. But the user does not have an easy path for doing data cleaning inside Survey Solutions, apart from “c-account”-style workarounds.

Another, complementary way of looking at this is as follows. In my view, data cleaning is almost always neglected. To the extent that data cleaning is hard, it will not be done. In my experience, data cleaning typically happens only after data collection has ended, rarely draws on observations made during data collection, and more often than not is fairly superficial. To promote data cleaning during data collection, as a supplement to post-survey data cleaning, I think it would be worth considering giving some class of user the ability to change data. While I share @zurab ’s reservations about cleaning that is not recorded in a script, I also see that necessary cleaning may not happen at all unless it can be done during data collection in Survey Solutions. Put another way, if one wants a productive pipeline, one needs plumbers, and those plumbers are typically too busy when the water is flowing (i.e., those who would do data cleaning are often those managing survey operations, and they neglect important cleaning tasks in favor of urgent survey ones).

By the way, here’s the thread that discussed allowing the API to answer questions. Note that, interestingly, it somewhat evolved into the same discussion we’re having in this thread.

My specific proposal is to allow Headquarters users to change answers and leave a (required) comment on why. On the front end, HQ users would have one more option in the overflow menu associated with each question (i.e., the three vertical dots). On the back end, the changes would be recorded in the data and the paradata. On the front end, data edits that change the state of the interview (for example, by enabling or disabling questions) would need to be played out. That last bit seems like the trickiest part to my layman’s eyes.