Data for Data Scientists

A data scientist’s dream is to get consistently clean, complete, accurate, on-demand data sets, in a form they can use. Data scientists measure and analyze multiple data sets, including economic, social, market, environmental, and political information. Portfolio managers expect their data science teams to analyze large, disparate data sets within their models to inform their trading strategies as comprehensively as possible.

At a high level, the data required by data scientists can be viewed as direct, derived, or observed data. All three types are essential in a business context for decision-making and model building. But more crucial is trust in the data, which empowers users to make business decisions. Data scientists need constant access to data in real time, intraday, and at the end of the day. Additionally, there must be a balance between speed of delivery and accuracy for data needed in real time versus at the end of the day. One critical aspect of sourcing data in real time is that its value can erode quickly, as with the depth of the order book. Another is that heavy reliance on end-of-day data can delay a business decision to the point where the data’s usefulness or relevance is entirely lost.

So, what are the practical implications of achieving consistently clean, complete, accurate, and on-demand data sets?

Buy-side portfolio managers hire data scientists to generate opportunities to discover alpha. Data scientists are expensive resources; thus, the time and cost spent building models should be managed prudently. Data scientists are eager to consume data to analyze actual phenomena and their impact on portfolio managers’ portfolios.

But data scientists are spending almost three times as long as expected on data collection, which shortens the time available for data analysis. Something is amiss. There are instances in which a data source needs manual intervention before analysis, but even after discounting that effort, the time spent on data collection does not come down. This is crucial: more time spent on data sourcing means less time spent developing accurate models, ultimately affecting the trading strategies.

Data sourcing involves external and internal data. External data sources include securities pricing and valuation, risk indicators, end-of-day closing prices, various interest rate benchmarks and curves, economic indicators, and other data depending on the portfolio. External data is standardized, to a large extent, by the type of data. But when many vendors supply the same data set, each compelled to differentiate its offering in the market, the data ends up non-standardized and cannot easily be aggregated when it is used. This leads to multiple patch fixes being developed to reduce the complexity of external data and achieve a rough standardization across all the data sources.
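To make the issue concrete, here is a minimal sketch, in Python, of the kind of normalization this forces on firms. The vendor names, field labels, and common schema below are hypothetical, not any particular vendor’s actual format:

    from dataclasses import dataclass
    from datetime import date

    # Hypothetical common schema that every vendor feed is mapped into.
    @dataclass
    class EodPrice:
        isin: str
        price_date: date
        close_price: float
        currency: str

    # Each vendor labels the same fields differently; these mappings are illustrative only.
    VENDOR_FIELD_MAPS = {
        "vendor_a": {"isin": "ISIN", "price_date": "AsOfDate",
                     "close_price": "ClosePx", "currency": "Ccy"},
        "vendor_b": {"isin": "isin_code", "price_date": "valuation_date",
                     "close_price": "last", "currency": "curr"},
    }

    def normalize(vendor: str, record: dict) -> EodPrice:
        """Map a raw vendor record into the common EodPrice schema."""
        fields = VENDOR_FIELD_MAPS[vendor]
        return EodPrice(
            isin=record[fields["isin"]],
            price_date=date.fromisoformat(str(record[fields["price_date"]])),
            close_price=float(record[fields["close_price"]]),
            currency=str(record[fields["currency"]]).upper(),
        )

    # The same security arriving from two vendors collapses into one comparable record.
    a = normalize("vendor_a", {"ISIN": "US0378331005", "AsOfDate": "2023-06-30",
                               "ClosePx": "193.97", "Ccy": "usd"})
    b = normalize("vendor_b", {"isin_code": "US0378331005", "valuation_date": "2023-06-30",
                               "last": 193.97, "curr": "USD"})
    assert a == b

Every new vendor adds another mapping like this, and every vendor format change breaks one, which is why such patch fixes multiply over time.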

Now let’s look at the internal data sources. Internal data includes security master data, client preference data, portfolio strategy data, transaction and position data, and the various rules and regulatory mandates requiring compliance. But every authorized system of record has its own data format for distribution, which creates another concern for data scientists to address. When these systems of record and their data distribution were built, very little attention was given to the data set a data scientist requires.

Expensive data scientist knowledge is now spent standardizing data between external and internal sources so that it can be put to use. Ideally, this should not be the job of a data scientist; the sources themselves should provide correct data. A mechanism should be in place within the enterprise that can serve up data that is ready to be put to work. Further, data science leaders need to start insisting on stronger supporting infrastructures that empower them to deliver on their mission efficiently.

Some technology solution providers have identified this standardization gap and are offering designs and solutions that let firms reduce the time spent gathering data and redirect it to developing models. Other providers continue to hold these capabilities within their software solutions. Another gap observed by the data service providers is the linkage between securities masters and legal entity masters across various jurisdictions. Achieving data standardization and maintaining a sustainable mapping across the various data elements are key to ensuring data scientists are served the consistent data they require.
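As a rough illustration of what such a linkage involves, the sketch below (again in Python, with purely illustrative identifier values and field names) resolves different external security identifiers to one internal record and its issuer’s legal entity:

    # Hypothetical cross-reference: several external identifiers resolve to one
    # internal security ID, which links to the issuer's legal entity record.
    # All identifier values below are illustrative placeholders.
    SECURITY_XREF = {
        ("ISIN", "US0000000001"): "SEC-000123",
        ("CUSIP", "000000000"): "SEC-000123",
        ("SEDOL", "0000001"): "SEC-000123",
    }

    SECURITY_TO_ENTITY = {
        "SEC-000123": {"lei": "LEI-PLACEHOLDER-0001", "issuer": "Example Issuer Inc.",
                       "jurisdiction": "US"},
    }

    def resolve(id_type: str, id_value: str) -> dict:
        """Resolve a supported external identifier to the internal security and its legal entity."""
        internal_id = SECURITY_XREF[(id_type, id_value)]
        return {"internal_id": internal_id, **SECURITY_TO_ENTITY[internal_id]}

    # The same security arrives under different identifiers but maps to one entity record.
    assert resolve("ISIN", "US0000000001") == resolve("CUSIP", "000000000")

Maintaining such a mapping sustainably is the hard part: identifiers are added, reused, and retired over time, and the cross-reference must keep pace across jurisdictions.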

Kuberre Systems, a Wilmington, Massachusetts-based software solution provider, aims with its X-Ref PRO security cross-referencing service to resolve the data sourcing gap, drive toward data standardization, and provide cross-references across data sets to create a single data model. It is striving to make this capability available as a more easily leveraged “service,” eliminating the data-collection bottleneck that data scientists across the industry face when building their quantitative models.

A data scientist and a portfolio manager’s team succeed when their models generate alpha. A data scientist looks for a solution anywhere along the range from fixing the upstream systems to introducing an intermediary process that delivers good-quality data to the team. Thus, out of necessity, data scientists are exploring various options and solutions to standardize data sourcing. A solution like Kuberre Systems’ X-Ref PRO is changing the way data standardization is accomplished at an industrial scale. It would bring external and internal data onto a similar pattern and pace for the data science team to develop exotic model strategies.

Data is to data scientists what fresh, wholesome ingredients are to a chef. Once acquired, it needs to be cleaned, chopped, and mixed with other ingredients before it is ready to serve. Data scientists’ requirements should be addressed so that portfolio managers can enjoy expedited, timely access to models that identify the impact of market changes, develop trading strategies, and generate alpha. As machine learning advances, the fundamental need for clean, accurate, reliable, and complete data has never been more critical; just imagine the volume of bad data machines can generate if data standardization issues are not resolved now. Data scientists’ hunt for clean data continues…
