Thursday, August 11, 2016

Reader Question: Data Documentation

"Eliminating the Pop"--Thursday, 8/11/16

How about some black and white + light blue + orange outfit inspiration?  Definitely not a combination that springs immediately to mind for me.

From theperfectstormbffs.com

For this one, I used a black polka dot skirt and orange flats on the bottom half.


Black polka dot skirt (Walmart), $3.22/wear
Light blue short-sleeved T (thrifted, Walmart), $1.50/wear
Elbow length black cardigan (thrifted, AB Studio/Kohls), $1.25/wear
Orange flats (Payless), $0.91/wear
White/blue/yellow scarf (Target), $3.00/wear

Outfit total: $9.88/wear

I kept the top very close to the inspiration photo but could not resist adding my Easter scarf that has both light blue and coral/orange in it.  This definitely takes away from the orange "pop" effect but in the end, my style is more matchy-matchy than it is pop.


Sorry, Magnitude.


At least the public health community will approve if I eliminate pop, right?

In other news...Tam sent the following question:

I feel like there's no way on earth to keep enough notes about what I'm doing to actually put it back together later. How do you keep on top of this? I know it's the kind of thing like documenting code where what seems like a substantial time investment up front actually ends up saving enormous time and expense down the line, but unlike documenting code, it doesn't feel natural to me.

Feel free to incorporate your answer into a blog post :-)
 
I'm really struggling to answer this question because this does feel natural to me...some combination of it being in my nature and my having done it for so long.  At a work meeting (on a day I was out sick so I only got to hear about it later) a while back, my team worked with a matrix of people and work functions/areas, and for the "documentation" area, I was the only person they rated this a strength for (which is so, so sadly true).  Just today, my office mate C. asked me, When you did blah-blah end of year report last year, what did you do about such-and-such situation?  And within 15 seconds I had pulled up an SPSS syntax file in which I had documented a section "SUCH AND SUCH ISSUE--Discussion with C. 8-12-2015" and several lines which explained what we had discussed, what our decisions were, and how I implemented them.  BOOM. 
 
So I feel a bit like saying, Well, you take the kinds of notes you need to take given the data and the project and the sources and how often you will be revisiting this.  And that's kind of like answering the question, How do you know that's a chipping sparrow? with the answer, Because it activates the "chipping sparrow" areas of the FFA in my brain....which is to say, not helpful at all.
 
Tam further described her current project this way:

"I'm basically filling out this spreadsheet of requested summary information from another company ...using a variety of different data sources and various procedures. It is complicated by data that isn't easy to access, data that doesn't easily support the type of information I'm trying to get out of it, and my own ignorance of the underlying domain."
 
Let me try tackling this in pieces.

First, the domain/content area piece.  When I was new to my current job, I started a set of documents in which I started tracking the things I learned.  Importantly, I also started tracking the questions I had, how/if they were answered, why I wanted to know...basically everything I could without too much concern with efficiency, layout, etc. (with dates! always with dates! because shit changes on you!).  It became like a narrative of my understanding of the domain.  I still refer back to these notes sometimes but not very often, though I used them constantly in the first year or so at the job.  I now tend to keep information on content areas with specific projects that I'm working on.  
 
But that narrative flavor is common to a lot of my documentation.  When I sent my syntax to C. for her to look at, she reviewed all of my syntax and notes to see how it compared to what she was planning to do for this same report, and she mentioned to me that it was interesting to see that I often would write things in a format like:  "But what about X?  X is blah blah blah.  Oh but later under 'Eliminating Y and Z' I will remove these records, which should take care of the problem."  This is partly because I'm a write-thinker and partly because the same questions come to mind again and again, so it's helpful to see the train of thought.  As a benefit, when I later am changing my procedure to retain Y and Z, I will have a clear indication that doing so will bring up the X problem again.  But I think this aspect is more like documenting code than what Tam is really asking about.  I just thought I'd mention the narrative style because I think sometimes we are so concerned about making our notes "efficient" that we leave out things that we later wish we hadn't. 
 
Tam's project reminded me the most of an every-two-year federal data collection project that I have become the default operational manager for.  Some of the data are related to (and generally available from, with effort) the information systems to which I have access while others are pieces kept by individual departments/programs (including paper records) or are in information systems to which I do not have access (accounting, HR, etc.).  I'll focus on the "data related to systems to which I have access" because coordinating with other people to supply information to you is a huge thing all to itself.

It is fascinating to look at the documentation I was starting with this year compared to last time (the first time I did it)--which is to say, good documentation versus almost no documentation (just some incomplete scraps of syntax with no reference to what data source was used and a final Excel data file that had many, many hundreds of columns without headers that meant nothing to me at all).  My god, it has been so easy to churn out good data this time around now that I know what I'm doing and have all these notes on decisions and procedures and my syntax already written (which only needed to be modified to reflect changes to the data requirements/our systems).

OK, yay for me, but how did I get there?  One document that has been worth its weight in gold--a list (just a Word doc) of the various data elements required with bullets for assumptions, decisions, data sources, warnings (e.g., Note: XYZ definition of Whatsit does not include A or B, unlike the QRS definition we generally use), and points to be clarified.  I kept editing this document as I went last time so it reflected the most up-to-date information (as opposed to the narrative style I talked about above).  And I updated it at the start of this year's process on the basis of the changes to data requirements/our systems.
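The same kind of list could be kept as structured data instead of a Word doc. What follows is only a minimal sketch in Python, not the actual document described above: the `DataElement` class, its field names, the "Whatsit count" element name, and the date are all hypothetical, and the sample warning simply reuses the placeholder Whatsit note.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DataElement:
    """One required data element and its notes (hypothetical layout)."""
    name: str
    last_updated: date
    sources: list[str] = field(default_factory=list)
    assumptions: list[str] = field(default_factory=list)
    decisions: list[str] = field(default_factory=list)
    warnings: list[str] = field(default_factory=list)
    to_clarify: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Return the element as a plain-text entry with one bullet per note."""
        lines = [f"{self.name} (updated {self.last_updated.isoformat()})"]
        sections = [
            ("Data source", self.sources),
            ("Assumption", self.assumptions),
            ("Decision", self.decisions),
            ("Warning", self.warnings),
            ("To clarify", self.to_clarify),
        ]
        for label, items in sections:
            for item in items:
                lines.append(f"  - {label}: {item}")
        return "\n".join(lines)


# Placeholder entry for illustration; the name and date are made up.
element = DataElement(
    name="Whatsit count",
    last_updated=date(2016, 8, 11),
    warnings=["XYZ definition of Whatsit does not include A or B, "
              "unlike the QRS definition we generally use"],
)
entry = element.render()
print(entry)
```

Keeping a date on each element is the one design choice worth copying whatever the format, since requirements and systems change between rounds.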

I vary in whether I keep information about the steps of a procedure in a separate Word document (some people like google docs instead, but I am generally writing only for myself), as part of SPSS syntax, or both.  It depends a bit on the data sources.  When I'm working with data that I download from our server and import directly into SPSS, or files already in SPSS, I'll write it up as comments there.  When I'm working with data that takes a lot of steps to get it into that format, I'll write it up separately.  When I have a bunch of different syntax files that I need to run through to get my result, I will often have a Word doc that just reminds me which files I'm using and the order of them and what data sources to use.

When I'm working with data directly from our big-ass primary database mirror, I have a Sally version of a data dictionary that I keep to remind me of the table.Field where data are stored, calculated fields I routinely create, etc.  
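A personal data dictionary like this can be as simple as a lookup from a friendly name to the table.Field where the data live. Here is a rough sketch in Python; every table, field, and calculated-field name in it is invented for illustration and does not come from any real database.

```python
# Hypothetical entries: friendly name -> where the data are stored, plus a note.
DATA_DICTIONARY = {
    "start_date": {
        "location": "Enrollment.StartDate",
        "note": "Placeholder note on how this field is populated.",
    },
    "end_date": {
        "location": "Enrollment.EndDate",
        "note": "Placeholder note; may be blank for open records.",
    },
}

# Hypothetical calculated fields that get re-created routinely.
CALCULATED_FIELDS = {
    "days_enrolled": "end_date minus start_date, in days",
}


def where_is(name: str) -> str:
    """Return the table.Field location for a friendly name."""
    try:
        return DATA_DICTIONARY[name]["location"]
    except KeyError:
        raise KeyError(f"No dictionary entry for {name!r}; add one first") from None


print(where_is("start_date"))
```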

I will say that my approach when I don't know what the hell I'm doing and whether it will work is usually to write down every single thing I try.  This creates a mess but when I'm done, I can then develop a cleaned up set of instructions to myself for doing this again and (in cases like what I understand Tam's to be) a quick reference key to the data in the spreadsheet.  But I do keep my messy trial version, too, because I sometimes look back at it to fill in gaps.  
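For anyone who prefers a file to paper, the "write down everything, clean it up later" idea can be sketched as a dated log plus a function that pulls out only the steps that worked. This is an illustrative sketch in Python under my own assumptions: the function names and the sample entries are hypothetical, and the messy log is deliberately kept intact alongside the clean version.

```python
from datetime import date
from typing import Optional


def add_note(log: list, tried: str, result: str, worked: bool,
             on: Optional[date] = None) -> None:
    """Append one dated note about something that was tried."""
    log.append({
        "date": (on or date.today()).isoformat(),
        "tried": tried,
        "result": result,
        "worked": worked,
    })


def clean_instructions(log: list) -> list:
    """Return numbered steps built only from the attempts that worked."""
    steps = [note["tried"] for note in log if note["worked"]]
    return [f"{i}. {step}" for i, step in enumerate(steps, start=1)]


# Made-up entries for illustration.
trial_log = []
add_note(trial_log, "Export table from source 1", "Got all rows", True, date(2016, 8, 8))
add_note(trial_log, "Join on name", "Duplicates everywhere", False, date(2016, 8, 9))
add_note(trial_log, "Join on ID", "Row counts match", True, date(2016, 8, 9))

instructions = clean_instructions(trial_log)
print("\n".join(instructions))
```

The failed attempt stays in `trial_log` even though it is left out of the cleaned-up instructions, which is what makes it possible to go back and fill in gaps.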

I guess I like to keep my syntax (code) commented with a bit more of a narrative flavor, and I also keep instructions that are clean and reflect the steps I need to do, documentation of decisions/definitions/etc. that reflect the current understanding of the data needs and what will be/was supplied (in a general way), data dictionaries that are very operational, and other messier documents that show the things I tried/what I was thinking/etc.  Those messy documents are often actual pieces of paper, hah.
 
(Right now I'm kind of in a weird place with multiple projects that exist at least in part as dozens of pieces of paper with notes on them, paper clipped together.  I was noticing today that this was driving me crazy...the paper itself primarily but also being in an awkward place with so many projects.  I'm trying to balance working on some Big Thought stuff with some more routine, quick turn-around reports, and I wish the people I'm waiting on for the data for the reports would hurry up already so I can knock those out and more fully concentrate on a Big Project for more than 30 minutes at a time.  I think this paper confusion reflects my mental confusion, which was definitely a contributor to the frustration I discuss below.)

So this boils down to:  My process is to write down everything I think and everything I try and what happens with it.  Then I clean it up later.  This works for me because a lot of things I can write about as fast as I can think them, and for other things, I can use the writing to slow things down and think them through.  As far as I can tell, doing this doesn't make me appreciably less efficient/productive than my co-workers who don't document things the first time around, and it is always helpful later.
 
Really, in my experience, always helpful.  I have yet to think, I wish I had documented this less thoroughly.  This afternoon I was revisiting this monstrous syntax I wrote a few months ago, bringing together a bunch of different data sources and creating (I kid not) hundreds of new variables to be aggregated in various ways...ugh, just a big mess.  And it was very well documented, but it turns out I would have liked it to be even a little better documented.  I was wanting to add some new variables to it and I got a bit lost figuring out at what point to add them and what "versions" of the variables I needed given the kinds of restructuring that happened later in the syntax.  I was able to figure it out but I could feel that it was harder than it needed to be to reconstruct what the hell was going on.  See, in the normal way, I should only wonder what the hell is going on with other people's syntax.  My own syntax just shows what is going on.

I have no idea whether this is helpful or not, so I will open it up to my readers.  Any suggestions on how to document data-related stuff at work?

6 comments:

Tam said...

That was helpful, thank you. I write down EVERYTHING while doing math, in a style so narrative it doubles as a diary, but I have never done much documenting of workflows at all in a business context.

Sally said...

I'd say, Don't get hung up at first on the "documenting of workflows" piece that makes it sound very formal and like there is a right way and so forth. Let that come after you've figured things out!

Sally said...

It looks like this comment from my dad disappeared. Maybe this will be a re-post but I'm not seeing it when I look at the page so I'm posting it again.

------

I work as the final reviewer of a lot of completely different accounts on my company's books. I mostly use Excel spreadsheets. I just add a "sheet" in Excel (not in Word), and then create an explanation of the theoretical aspects of the account and also of the procedures to follow to get the information to put into the Excel spreadsheets. This explanation sheet is in the same file as the recon, so it is easy for anyone who is looking at my recon to tab over to the explanation sheet to see why and how the reconciliation is done. I always note when my explanation sheet was last revised because we change data collection systems from time to time and I need to update the "how-to" part of the explanation. I purposely use Excel for the explanation because it looks like a Word document, but I can store it directly in with my Excel files w/o having to open a separate Word document.
After I have updated an explanation for a specific account (let's say "tips & gratuities"), I copy the sheet attached to the "Tips & Gratuities Recon" to a master file called "How to Reconcile Accounts." This master file has all the individual explanations all in one place, so if I ever get run over by a Mack truck, anyone with a good knowledge of accounting could pick up where I left off on my recons. We are a large company with hundreds of accountants, analysts, bookkeepers, and clerks. We have a big problem with "brain drain" when an experienced person leaves and all that person's knowledge goes away.

One of my biggest reports is only done once a year and involves extremely large amounts of money, so I make sure to keep good notes about who gave me the data, how to use it, etc. The beauty of Excel is that you can right click your mouse and then insert comments. When looking at the Excel spreadsheets, I know which cells have comments because the upper right corner of the cell is filled in with red. That tells me that I can see comments by clicking in the cell.

As a CPA, I try to explain to others that having the correct numbers on the books is great, but without a good "audit trail" (where did the data come from), the job is incomplete. I might send a "summary recon" to a high level person in the company by "hiding cells" (so I don't "confuse a supervisor" with a lot of detail), or I keep one recon with all the gory details for me and a summary recon on a separate sheet for my monthly reporting requirements. It is frustrating to get reconciliations from other people when there is no note of the purpose of the recon, who prepared it and when, etc. I should not have to redo someone else's work just to review it.

When I worked at a CPA firm in Colorado, the partner always reminded the staff going out to do audits that our work papers (documentation of what we did and why) were what would help protect the CPA firm if it ever got sued by someone using the information contained in our audit. In fact we joked that CPA stands for "cut," "paste," and "apply." Make a photocopy of a bank statement, then CUT it down to size to fit in our work papers, then take PASTE (glue, scotch tape, etc.), and APPLY (attach it firmly) into the work papers. Programmers, accountants, IT personnel, etc. who do not document their work are not doing their job and should be reprimanded. When my company switched from one version of accounting software to a new version of the exact same software, we spent a year documenting and testing to be sure everything would work properly.

Sally said...

Dad,

Thanks for this information--very interesting!

I had forgotten that I too sometimes have an Excel sheet with explanations for certain things. I could do that more often than I do, really.

I didn't realize that you could leave comments on Excel cells. I just tried it out, very cool.

One thing your comments underscored for me is that we shouldn't be overly concerned about whether our managers etc. think that we are spending too much time on documentation, something that Tam mentioned having concern about. I've never had anyone question the time I spent on documentation--I've only had managers be grateful that I do it--but if someone did, I would be confident in explaining that without proper documentation, the job is incomplete. In cases like the one Tam specifically mentioned here, providing data to another company, it seems critical to be able to explain and stand behind your numbers if/when questions come in (and it seems to be always a matter of when, not if).

I will also just say that I wish the accountants I have worked with in my organization kept any notes at all on what data they've given me. For our big data collection, last time they provided me with the wrong data at least twice before it was...I won't say right, but not so wrong that it was patently obvious to a person just looking at it. (For example, one of the sets of numbers is funding from sources A and B, and another set of numbers is funding from sources A, B, and C. They gave me numbers from A, B, and C that were LOWER than from A and B alone. So unless source C is taking money away from us, that is clearly wrong.) I am kind of dreading asking them for data this year because we will be going through this whole cycle again. Asking them for the same data again doesn't mean they will use the same procedure again--I've already established that.
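That kind of "patently obvious" error is easy to check for automatically when data arrive. A minimal sketch in Python, assuming source C can only add funding (never take it away); the function name, program labels, and dollar amounts are all made up for illustration and are not real funding data.

```python
def find_inconsistent_rows(rows):
    """Return labels of rows where the A+B+C total is lower than the A+B total.

    Assumes source C is never negative, so the three-source total
    should be at least as large as the two-source total.
    """
    return [label for label, total_ab, total_abc in rows if total_abc < total_ab]


# Made-up numbers for illustration only.
rows = [
    ("Program 1", 100_000, 125_000),  # consistent: adding C raises the total
    ("Program 2", 80_000, 60_000),    # inconsistent: A+B+C is below A+B
]
problems = find_inconsistent_rows(rows)
print(problems)
```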

Jen M. said...

One issue in programming is having stale documentation. As much as possible I just use well-named, modularized code which is "self-documenting". I'd rather have it readable so anyone could jump in and understand it than write up information to explain it. It is more likely to be high quality code that way and it makes reviews easier. Save the documentation for higher level design, APIs, and data storage that other teams might need to know. And that goes in the wiki!

Sally said...

Jen, that's an interesting point. I could see how people could lean too much on documentation to make sense of something that could have been made sense of without so much documentation if it were better written.

I wonder whether there is a distinction between programming and data reporting that plays into the extent to which documentation is needed. If I understand correctly, all the "steps" of a program are already documented in the code, but with reporting, you are often only seeing the results of the steps.

Even where my syntax is concerned, some parts of it are more like programming where you can see what steps are being carried out and that is sufficient, but in other parts the rationale matters because you are trying to achieve a specific reporting result and the desired result/underlying data source/etc. is so variable.