Image credit: https://www.pexels.com/photo/photo-of-people-doing-handshakes-3183197/
"Hey Data Scientist, please go ahead and create the user stories for the new features that we discussed and provide an estimation of the effort required for each story." - Is this a familiar request?
"And please provide story points for each task so we can plan our sprint effectively." - Sound familiar too?
Estimating effort for data science experiments is inherently difficult: exploratory research is unpredictable, data quality issues surface late, and evaluating Large Language Models (LLMs) can be particularly challenging. Unlike software engineering, which often starts from well-defined functional requirements, data science work hinges on unknowns that can swallow weeks of effort.
"Our competitors are shipping agentic solutions, we need to catch up!" - Is this a common pressure you face?
The pressure to deliver results quickly can lead to rushed experiments and incomplete evaluations. Data scientists may feel compelled to provide estimates that align with business expectations rather than realistic assessments of the work involved. Moreover, with the hype around LLMs, AI frameworks and tools, many people are excited to leverage them for various tasks.
It is important to understand how your data scientists work, what they need to work on, and what they expect.
Starting with hypothesis formulation, data scientists may need to explore multiple avenues before settling on a viable approach. This exploratory phase introduces significant variability in the time required, making it difficult to provide accurate estimates.
Data is needed to prove or disprove hypotheses, and this process can lead to unexpected challenges. A hypothesis that seems promising at first may require significant adjustments or even abandonment based on initial findings. This iterative nature of hypothesis testing makes it difficult to provide precise time estimates.
"How good is the dataset?" - Is this a question you often hear? This is an important question, especially when working with LLMs. The quality of the dataset can significantly impact the performance of the model. However, evaluating the quality can be subjective and context-dependent. Simple metrics such as dataset size or diversity may not fully capture the nuances of the data. Furthermore, the relevance of the dataset to the specific task at hand can be difficult to assess without extensive experimentation. It is good to profile the dataset using various statistical and visualization techniques to get a better understanding of its characteristics. However, this can be time-consuming and may not always lead to clear conclusions. It is important to involve domain experts in the evaluation process to provide insights into the relevance and quality of the dataset. Helping domain experts visualize the data and understand its characteristics can lead to better assessments of its quality. A python notebook with visualizations and summary statistics can be a good starting point for discussions with domain experts.
Once the data is prepared, evaluating the performance of LLMs can be another challenging task. What metrics should be used to evaluate the model's performance? How can we ensure that the model is generating high-quality responses? These questions can be difficult to answer, especially when dealing with complex language tasks. We can start with baseline metrics and then iterate based on user feedback and real-world usage. This is an iterative process that may require multiple rounds of evaluation and refinement. Therefore, it is hard to provide accurate estimates of the time required for this phase.
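As a hedged illustration, a first evaluation harness can be as simple as scoring model answers against expert-provided expectations with a crude baseline metric, then iterating based on feedback. The `keyword_coverage` metric and the `eval_set` records below are assumptions for illustration, not a recommended final metric:

```python
from typing import Callable, Dict, List

def keyword_coverage(answer: str, expected_keywords: List[str]) -> float:
    """Crude baseline: fraction of expected keywords that appear in the answer."""
    answer_lower = answer.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in answer_lower)
    return hits / len(expected_keywords) if expected_keywords else 0.0

def evaluate(eval_set: List[Dict], generate_answer: Callable[[str], str]) -> float:
    """Run the model (via a caller-supplied generate_answer function) over a
    small evaluation set and report the average baseline score."""
    scores = [
        keyword_coverage(generate_answer(example["question"]), example["expected_keywords"])
        for example in eval_set
    ]
    return sum(scores) / len(scores)

# Illustrative evaluation set; in practice these come from domain experts.
eval_set = [
    {"question": "What are the known side effects of drug X?",
     "expected_keywords": ["nausea", "headache"]},
]
```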
On top of these, we have to consider AI Content Safety, Ethical AI, and bias evaluation. This includes jailbreak testing and compliance with regulations in different regions. At Microsoft, we have a dedicated AI Content Safety team that helps us in this regard, as well as guiding principles to assist us in building ethical AI solutions.
To address these challenges, here is some food for thought:
1. Understand the business context
Collaborate closely with stakeholders to clarify objectives and constraints. This helps in setting realistic expectations and aligning efforts with business goals. At the same time, educate stakeholders about the inherent uncertainties of solutions that involve Large Language Models; it is important to set the right expectations from the beginning. Focus on the user experience and the value that the solution brings to the users, rather than just the technical aspects, and involve an EXPERIENCED designer and an EXPERIENCED technical product manager in the discussions from the start. Let the stakeholders tell you the business goals and the user needs, and then work backwards to design the solution; do not make assumptions about either. At the end of the day, the users are the ones who consume the response from the LLM, and they may use it in ways that we do not anticipate.
- Your company is running a business. Hence, it is important to balance the need for exploration and experimentation with the need for delivering value to the business. It is important to focus on delivering small, incremental improvements that can be measured and evaluated, rather than trying to achieve perfection in one go.
- Stakeholders will not approve the solution if it is too expensive to run, so consider the cost of operating it. For instance, how much token usage will the solution incur? Which model will be used? Can we expose tools to the LLM to reduce token usage? Operational metrics are essential here (a back-of-the-envelope cost sketch follows after this list).
- Consider the latency requirements of the solution. If the solution is user-facing, ensure that the response time is within acceptable limits. This may require optimizing the prompts, using smaller models, or caching responses.
- How is the response from the LLM presented to the user? Is it in a format that is easy to understand and use? Consider the user interface and user experience when designing the solution.
- Understand the risks involved in using LLMs and how they can be mitigated. For instance, to avoid doctors missing critical information, we can tune the LLM to be more conservative in its responses; this is why we prioritize recall over precision in certain scenarios.
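To make the cost discussion concrete, here is a back-of-the-envelope sketch of monthly token spend; the prices and volumes are placeholders, not actual model pricing:

```python
def estimate_monthly_cost(
    requests_per_day: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    price_per_1k_input: float,   # placeholder price in USD per 1K input tokens
    price_per_1k_output: float,  # placeholder price in USD per 1K output tokens
) -> float:
    """Rough monthly cost estimate for an LLM-backed feature."""
    daily_cost = requests_per_day * (
        avg_input_tokens / 1000 * price_per_1k_input
        + avg_output_tokens / 1000 * price_per_1k_output
    )
    return daily_cost * 30

# Example with made-up numbers: 10K requests/day, 1.5K input and 300 output tokens each.
print(f"${estimate_monthly_cost(10_000, 1_500, 300, 0.01, 0.03):,.2f} per month")
```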
2. Running experiments
Adopt an iterative approach to experimentation, starting with small-scale pilots to validate hypotheses before scaling up. This allows for adjustments based on initial findings and reduces the risk of large-scale failures.
- Having a well-defined experimental design can help in providing more accurate estimates. This includes defining clear hypotheses, selecting appropriate evaluation metrics, and establishing a timeline for the experiments.
- Visualizing the metrics and the results can help in better understanding the performance of the model. This can also help in identifying areas for improvement and refining the experimental design. For example, using a confusion matrix can help in understanding the performance of a classification model (see the sketch after this list).
- Involve domain experts in the evaluation process to provide insights into the relevance and quality of the dataset. We have seen cases where domain experts identified issues with the dataset that were not apparent to the data scientists.
- Use automated tools and frameworks to streamline the experimentation process. This reduces the time required for data preprocessing, model training, and evaluation. We invested significant effort in setting up automated experimentation pipelines, and it has noticeably shortened the time our experiments take.
- Store artifacts such as datasets, system prompt templates, and evaluation results. This makes experiments reproducible, makes estimates for future experiments more accurate, and lets you rerun the same experiment against different LLMs later, which is unavoidable because LLMs are deprecated frequently.
- Communicate uncertainties: Clearly articulate the uncertainties and risks associated with data science experiments to stakeholders. This helps in managing expectations and fostering a collaborative approach to problem-solving. Many times, stakeholders can provide valuable insights, suggestions, and assistance in overcoming challenges. Hence, it is important to keep them in the loop and involve them in the decision-making process.
- Provide a range of estimates (best-case, worst-case, and most likely) rather than a single point estimate. This conveys the inherent uncertainty of data science experiments. We have seen cases where good results were reported promptly while bad results were held back; share both in a timely manner to manage expectations. This builds trust and credibility with stakeholders, and it allows them to make informed decisions based on the range of possible outcomes and plan for contingencies accordingly.
- Document assumptions and dependencies that may impact the estimates. This provides context and clarity around the estimates and keeps all stakeholders on the same page. We have seen cases where undocumented assumptions and dependencies led to misunderstandings and misaligned expectations.
- Repeat the experiments many times to understand the variability in the results; this supports more accurate estimates. We have seen cases where repeated runs revealed patterns, trends, and potential issues that were not apparent in a single run. Report this variability to stakeholders (a small sketch follows after this list).
- Use visualizations to communicate complex workflows, and at each step, highlight the uncertainties and risks involved. Evaluation metrics at each step can help in conveying the performance and limitations of the model. Additionally, we can discuss simplifying the workflow to reduce uncertainties. When there are many steps where LLMs are involved, the overall uncertainty can be high. The compounding error from each step can lead to significant variability in the final results. Hence, it is important to evaluate the overall workflow and identify areas for improvement.
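As a small sketch of the confusion-matrix visualization mentioned above; the labels and predictions are illustrative:

```python
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Illustrative ground-truth labels and model predictions for a binary task.
y_true = ["relevant", "relevant", "irrelevant", "irrelevant", "relevant"]
y_pred = ["relevant", "irrelevant", "irrelevant", "relevant", "relevant"]

# Plot the confusion matrix to discuss where the model confuses the two classes.
ConfusionMatrixDisplay.from_predictions(y_true, y_pred)
plt.title("Confusion matrix for the baseline classifier")
plt.show()
```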
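And to report the variability from repeated runs as a range rather than a single number, something along these lines can work; `run_experiment` is a placeholder for the actual evaluation pipeline:

```python
import statistics
from typing import Callable, Dict

def report_range(run_experiment: Callable[[], float], n_runs: int = 5) -> Dict[str, float]:
    """Repeat an experiment several times and summarize the spread of scores,
    so stakeholders see best-case, worst-case, and typical outcomes."""
    scores = [run_experiment() for _ in range(n_runs)]
    return {
        "best": max(scores),
        "worst": min(scores),
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
    }
```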
