Over a decade has passed since big data became common across the business world, and giant datasets now appear with growing frequency in the types of investigations, litigations, and regulatory proceedings for which economists offer expert advice and testimony. A government investigation of foreign-exchange manipulation might involve high-frequency trading orders; discovery in an antitrust litigation against a retailer might produce a database of all consumer purchases during a class period; and a state health regulator might receive an entire hospital system’s patient records. These legal and adversarial settings impose constraints on expert economists that do not always hamper data scientists in industry: limited access to information about the data, limited budgets, and limited time. A holistic big-data-management plan can alleviate some of the effects of these constraints. Such a plan should incorporate best practices from the data-engineering industry to satisfy auditability and scientific standards and to maintain the agility and accuracy that legal proceedings demand.
The first step of such a plan is assessing the data requirements of the case and the resources available for fulfilling them. While not always possible, the ideal starting point is to develop testable, quantitative hypotheses and to identify what data exist to test them. For example, a hypothesis may be that a bank discriminates against loan applicants from certain populations by approving their loans less frequently than other applicants’ loans. The data requirements for this analysis include identifying which borrowers belong to which applicant populations. In some cases, this starting point allows for a more focused data request during discovery, since data not relevant to the hypotheses need not be produced.
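To illustrate how such a hypothesis can be made quantitative, the sketch below compares approval rates between two applicant groups with a two-proportion z-test. The file name and the applicant_group and approved columns are hypothetical placeholders, not features of any actual production, and the test shown is only one of several reasonable choices.

```python
# Minimal sketch: test whether approval rates differ between two hypothetical
# applicant groups. Assumes "approved" is coded 0/1 and "applicant_group"
# labels each application as "protected" or "comparison".
import pandas as pd
from scipy.stats import norm

applications = pd.read_csv("loan_applications.csv")  # hypothetical extract

in_group = applications.loc[applications["applicant_group"] == "protected", "approved"]
out_group = applications.loc[applications["applicant_group"] == "comparison", "approved"]

p1, n1 = in_group.mean(), len(in_group)
p2, n2 = out_group.mean(), len(out_group)

# Two-proportion z-test of equal approval rates
pooled = (in_group.sum() + out_group.sum()) / (n1 + n2)
se = (pooled * (1 - pooled) * (1 / n1 + 1 / n2)) ** 0.5
z = (p1 - p2) / se
p_value = 2 * norm.sf(abs(z))

print(f"approval rates: {p1:.3f} vs {p2:.3f}, z = {z:.2f}, p = {p_value:.4f}")
```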
The complexity of the data, as well as the monetary and time costs of handling and storing them, also informs the data request. Different orders of magnitude of data require different computer infrastructures and engineering efforts. Specialist “data-warehousing” firms may be able to host and manage the very largest datasets more quickly and cheaply than the economists querying and analyzing the data can. Further, if the data are very large, sampling parts of the data may be a more practical alternative to working on the entire dataset, and sometimes the only feasible one. For example, litigation discovery may limit the number of retail locations from which consumer data can be retrieved, in which case appropriate sampling is the only statistically reliable way to draw conclusions about the entire population of interest.
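Where sampling is appropriate, the draw itself should be scripted and seeded so it can be reproduced and defended. A minimal sketch under stated assumptions follows; the store_list.csv file, its region column, and the sample sizes are all illustrative.

```python
# Illustrative sketch of sampling retail locations when discovery limits how
# many stores' transaction data can be produced. File and column names are
# assumptions, not drawn from an actual production.
import pandas as pd

stores = pd.read_csv("store_list.csv")  # one row per retail location (hypothetical)
SEED = 20240101                         # fixed seed keeps the draw reproducible and auditable

# Simple random sample of 100 stores
simple_sample = stores.sample(n=100, random_state=SEED)

# Stratified sample: 5% of stores within each region, so the sample
# reflects the geographic mix of the full population
stratified_sample = stores.groupby("region").sample(frac=0.05, random_state=SEED)

stratified_sample.to_csv("sampled_stores.csv", index=False)
```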
The second step of a big-data-management plan is tracking the “chain of custody” of all raw data upon intake, which involves keeping a written inventory of data sources and ensuring that the raw data do not change accidentally. Storing raw data in a space dedicated to this purpose, which data engineers call a data lake, ensures that audits or reproductions of results will be traceable to the original data sources. Common quality-assurance practices can facilitate agreement among litigants that everyone is looking at identical raw data. Such practices can avoid issues that have occurred in litigation, such as a third party producing a complete data file to a plaintiff but, due to technical errors, a truncated copy to the defendant.
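One common way to make the chain of custody verifiable is to record a cryptographic hash of every raw file on intake alongside the written inventory; re-hashing a file later proves it has not changed since it entered the data lake. The sketch below assumes a hypothetical data_lake/raw directory and an intake_inventory.csv log.

```python
# Minimal chain-of-custody sketch: hash every raw file on intake so later
# audits can confirm the data-lake copy is byte-for-byte identical to the
# original production. Paths and file names are hypothetical.
import csv
import hashlib
from datetime import datetime, timezone
from pathlib import Path

DATA_LAKE = Path("data_lake/raw")  # dedicated storage for raw productions

def sha256(path: Path) -> str:
    """Hash a file in 1 MB chunks so very large productions fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

with open("intake_inventory.csv", "a", newline="") as out:
    writer = csv.writer(out)
    for raw_file in sorted(DATA_LAKE.rglob("*")):
        if raw_file.is_file():
            writer.writerow([raw_file.as_posix(),
                             sha256(raw_file),
                             datetime.now(timezone.utc).isoformat()])
```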
The third step is data cleaning, deduplication, and optimization, known in data-engineering parlance as extraction, transformation, and loading (ETL). To the extent possible, it is important to automate every ETL step in computer code for auditability and replicability. Extraction moves data from the data lake into software where they can be viewed and modified. Transformation entails cleaning the data to eliminate duplicates and errors and normalizing them to eliminate repeated information within the dataset. For example, to clean a list of a retailer’s store addresses, one might check the spelling of street names against USPS’s online lookup service. The most important part of transformation is writing a programmatic description of the data’s internal-consistency constraints (e.g., “prices are non-negative,” or “sales occurred between these specific, known dates”). These constraints facilitate deduplication as well as size and speed optimizations. Finally, loading moves the cleaned dataset into a data storage facility and query handler called the data warehouse.
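The sketch below shows what a short, automated ETL script might look like under stated assumptions: a hypothetical sales.csv in the data lake, illustrative column names and date range, and a SQLite file standing in for the data warehouse. The assert statements are the programmatic constraint descriptions discussed above.

```python
# Simplified ETL sketch: extract a raw sales file from the data lake, enforce
# internal-consistency constraints, deduplicate, and load the result into a
# SQLite database standing in for the warehouse. Names are illustrative.
import sqlite3
import pandas as pd

# Extract
sales = pd.read_csv("data_lake/raw/sales.csv", parse_dates=["sale_date"])

# Transform: programmatic statements of the data's internal-consistency constraints
assert (sales["price"] >= 0).all(), "prices must be non-negative"
assert sales["sale_date"].between("2015-01-01", "2020-12-31").all(), \
    "sales must fall within the known class period"

sales = sales.drop_duplicates(subset=["transaction_id"])  # deduplicate
sales["store_id"] = sales["store_id"].astype("int32")     # shrink types for size and speed

# Load into the warehouse
with sqlite3.connect("warehouse.db") as conn:
    sales.to_sql("sales", conn, if_exists="replace", index=False)
```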
Once data are available in the data warehouse, manual review and validation of the data are critical. The goals of validation are curing errors (e.g., misspelled addresses) not caught during prior steps, further deduplication, and identifying missing data. Validation can be implemented by summarizing the data, e.g., checking that the sum of all of a bank’s loans aligns with what the bank reported in its regulatory filings. During this process, it is important not to confuse abnormal observations with unreliable data. For example, abnormally large financial trades may be outliers of investigatory interest rather than erroneous data. For categorical variables (e.g., a loan applicant’s race), tabulations can reveal whether any categories have unexpected frequencies (e.g., a higher proportion of loan applicants of one race than of the population living near the bank under investigation).
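These validation checks can themselves be scripted against the warehouse. The sketch below assumes a hypothetical loans table with loan_amount and applicant_race columns; it reconciles a total against an external benchmark and tabulates a categorical variable.

```python
# Illustrative validation queries: reconcile loan totals with an external
# benchmark and tabulate a categorical field to spot unexpected frequencies.
# Table and column names are assumptions.
import sqlite3
import pandas as pd

with sqlite3.connect("warehouse.db") as conn:
    loans = pd.read_sql("SELECT loan_amount, applicant_race FROM loans", conn)

# Totals check: does the sum of loan amounts line up with the bank's filings?
print(f"total loan amount in warehouse: {loans['loan_amount'].sum():,.0f}")

# Missing-data check and tabulation of a categorical variable
print(f"share of records missing race: {loans['applicant_race'].isna().mean():.3f}")
print(loans["applicant_race"].value_counts(normalize=True))  # frequency of each category
```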
The fourth step of a big-data-management plan is analysis. Analyses include querying the data warehouse to compute statistics of interest, such as averages, frequencies, correlations, and, in more complex cases, regressions. By only ever reading from the data warehouse through its querying interface, expert economists ensure the reproducibility and auditability of the analyses while allowing multiple workstreams to proceed simultaneously.
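A minimal sketch of this read-only pattern follows, again using a SQLite file and the illustrative sales table from the ETL example as stand-ins for a production warehouse and its query interface.

```python
# Analysis sketch: statistics are computed by reading from the warehouse
# through a query, never by editing the stored data in place.
import sqlite3
import pandas as pd

QUERY = """
    SELECT store_id,
           AVG(price) AS avg_price,
           COUNT(*)   AS n_transactions
    FROM sales
    GROUP BY store_id
"""

with sqlite3.connect("warehouse.db") as conn:
    by_store = pd.read_sql(QUERY, conn)  # read-only query against the warehouse

print(by_store.describe())                               # averages and spreads by store
print(by_store[["avg_price", "n_transactions"]].corr())  # a simple correlation of interest
```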
The fifth and final step is reporting the results of the analyses. In some instances, this may require building interactive visualizations such as geographic maps, charts, and network diagrams. An integrated and automated data-management plan provides the flexibility needed to create sophisticated reports even as more data become available or the goals of a case evolve.
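Because reports are generated from warehouse queries by code, they can be regenerated automatically as the data or the case evolve. The static chart below is a minimal example of that query-then-render pattern, reusing the illustrative sales table; an interactive map or dashboard would follow the same structure with different plotting tools.

```python
# Reporting sketch: query the warehouse and render a simple chart.
# The table, columns, and output file name are illustrative.
import sqlite3
import pandas as pd
import matplotlib.pyplot as plt

with sqlite3.connect("warehouse.db") as conn:
    by_store = pd.read_sql(
        "SELECT store_id, COUNT(*) AS n_transactions FROM sales GROUP BY store_id",
        conn,
    )

top = by_store.nlargest(10, "n_transactions")
plt.bar(top["store_id"].astype(str), top["n_transactions"])
plt.xlabel("store_id")
plt.ylabel("number of transactions")
plt.title("Ten busiest stores in the cleaned sales data")
plt.tight_layout()
plt.savefig("busiest_stores.png", dpi=150)
```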
The comprehensive approach to data management described above is well suited to the adversarial data science required in investigations, litigations, and regulatory proceedings. While handling vast amounts of data from multiple sources is complex, this framework can achieve accuracy and cost-effectiveness without compromising speed, flexibility, or auditability.