So thank you, everyone, good morning. My name is Bishan Singh. Thank you all for joining us at the third Data Science CoLab Demo Day. We never thought, when we launched this past round of CoLab, that we'd have to host a demo day virtually; however, I'm pleased to share that we have a record audience, which is awesome, and which the next speaker, Sanjay Koyani, will tell you more about. That said, I hope you're all staying healthy from the safety of your homes.

I've had the pleasure of being the CoLab co-director for this past round. Having been a graduate of CoLab myself in the very first cohort a few years ago, CoLab is very near and dear to me, so I'm really excited for you to see the newest additions to HHS's portfolio of data science tools, which our participants individually built during their eight weeks of the CoLab boot camp.

A couple of quick housekeeping notes before we get started. Presenters, I ask that you please turn on your video and unmute yourselves when it's your turn to present, and then turn off your video and mute yourselves when you're done presenting. My colleague Rachel Mello will be keeping track of your time, so you'll hear from her before you begin to make sure you're good to go, and also if you run over time. Audience members, this event is being recorded and live streamed. If you have any issues with the WebEx, please switch over to the live stream, which is available at the HHS YouTube — sorry, HHS livestream — channel linked here; the link should also be in your email if you registered for the event. I sent it out yesterday through the event platform. Right after the meeting, the recording will be available on HHS's YouTube channel, so feel free to share it and watch it again if you're interested. If you have any questions for the presenters, please ask them via the WebEx Q&A feature, and they can respond to you directly; others can also see your Q&As, which may be informative. Finally, live tweet with us at the HHS CTO office account and use
the hashtag #CoLabDemoDay. I'll now turn it over to Sanjay Koyani, who is the executive director of innovation in the Office of the CTO and leads the ReImagine HHS Data Insights Initiative, under which the Data Science CoLab currently lives. Go right ahead.

Great, thank you. Thank you, Bishan. Can you hear me okay? Yes? Great. Good morning, everyone, and thank you for joining us for our Data Science CoLab Demo Day. As Bishan said, I'm Sanjay Koyani, the executive director of innovation in the Office of the CTO, and I want to welcome colleagues from HHS, other federal agencies, and other interested partners from across the country. From what I understand, this is our largest attendance yet — I think, Bishan, you were telling me that it's just short of two thousand people registered — so it's really terrific to have so much interest in the work that we're doing and some of the incredible results that have been coming out of the demo days.

Before I begin, I want to align CoLab with HHS's larger ReImagine data sharing strategy, so I want to go through a few slides. CoLab has been a collaborative partnership between ASA — our Assistant Secretary for Administration's Office of Business Management and Transformation; Will Kim has been a key player in that — along with us in the Office of the CTO, and both groups have really been driving our alpha and beta versions. Today you're hearing about alpha. I really want to recognize their exceptional work and all the stuff they've done behind the scenes to make this happen. Bishan, next slide please — thank you.

As many of you may know, ReImagine HHS is a robust transformation effort to improve how the department operates and to improve its efficiency and effectiveness in serving the American public. One of the ten initiatives under ReImagine is called the Data Insights Initiative, and it's focused on improving how we do data sharing across the department. This is a general framing of the goal that we have, which is to improve how we at HHS and beyond share,
integrate, analyze, and visualize our federated data to better inform policymaking and our evidence-based decision-making. The reason why we really emphasize "federated" is that we've got incredible programs at HHS — from FDA to CMS to the CDC and so forth — and they've all got their own data systems and data scientists that drive the work that they're doing. What we're trying to do here is create a bridge and linkages between that data, so that when you're looking at opioids and other health and human service issues, we can bring that data together in a lightweight way to really help answer questions that you can't answer independently. Next slide, please.

This here represents just some of the work we've done to get to this point. When we started this, we began with discovery: in 2018 we interviewed HHS stakeholders across the department and developed a current state report on data sharing and opportunities for improvement — if you go to our website at HHS CTO, you can see our current state report. In 2019 we spent time developing the future state roadmap that would inform how we implemented our data sharing strategy; you can also get that report online. Where we are right now is our proof-of-concept phase, where we're building out, in an agile way, an enterprise-wide data sharing platform and the appropriate governance and other systems that enable us to do that data sharing.

If you go to the next slide, Bishan, I'll highlight the key components of our work and where the CoLab fits in. Our data sharing strategy consists of four key parts. If you look at the far right, on the data sharing platform, we're developing a unified, cloud-based, open-source collaborative environment to enable sharing of non-publicly-available data at HHS. HealthData.gov, our external platform, is our reference catalog for all the publicly available data we have; we're creating
something internally now that enables us to look at some of the non-publicly-available data to answer important questions. If you go up to the top: this is driven by our use cases on health and human service issues, which inform the platform build-out and demonstrate the impact of the system in improving our decision-making and our policies. With the platform, it's really important that our users drive it — both in what its functionality is, but also to make sure that it's doing what it needs to do to help improve our decision-making and policymaking. If you go down, you see our data governance, which is basically how we create our data standards, have the right level of oversight, and fit into the existing data systems at the department. And then finally, to the left, you see that we have our CoLab, to train and enable more staff interested in data science to get hands-on experience and eventually drive projects through our platform and through their own data-driven work. So that's how we frame this out; you'll see more of that in our future state report.

Go to the next slide, Bishan. This is just some highlights of the program vision. You're going to hear from the program members themselves, which will be much more informative, but basically our goal here was to create an opportunity to upskill our workforce, and to do it in an innovative way that was very hands-on — much more "do," not just "learn." So we have the boot camp, fully sponsored by HHS, with 30 students per cohort; you're hearing from our latest cohort. A key thing that we want to emphasize here with the iceberg is that it's a nice representation of what we currently analyze, which is just above the surface. There's an incredible amount of data below the surface — in some cases you might call it dark data, or data that is unstructured, stuff we may not even know about — that we really need to bring up to the surface and use to do much better
evidence-based decision-making at the department, and CoLab is one of the facets that helps us get there. That being said, and now tying this back to our larger ReImagine effort, I'm going to turn it back over to Bishan, who will introduce our next speaker. Thank you, Bishan.

Hi, Bishan, it looks like you're muted.

Thank you, Rachel. Our next speaker is Will Brady. Will Brady is the chief of staff to the deputy secretary and senior adviser to the secretary of the Department of HHS. As chief of staff, he advises the deputy secretary on program and policy matters and assists the deputy secretary in the management and operations of the department; in addition, he is responsible for strategic policy initiatives focused on innovation, deregulation, and health care financing. Will, please feel free to take it away.

Yeah, thanks, Bishan, and Sanjay and the CTO team — thanks for having me here today; happy to share some thoughts. You heard a little bit about my background, and more specifically, a lot of the regulatory and policy work that I do is focused on innovation — specifically things like deregulation, interoperability, telemedicine, and a couple of others. One of the themes of all of those efforts, and everything we do, is that they are data-driven. So I think what the Data Science CoLab does is important overall, but it's becoming even more important under the current COVID pandemic, which we're all working through and trying to survive and thrive in. When we think about why data is important, it's no secret: every single day, whether we see it on the news or deal with it in our everyday work, it is always important to use data — but it's also a struggle to find it, figure out how to manipulate it, and gain the insights we need. So I can't stress enough how important it is for everybody to learn how to read, understand, analyze, and work with data. But as you've heard in some
of the conversations so far, there's the question of: well, why is enterprise data sharing a strategic asset? So let me give you some of the basic reasons why we're doing it. There's recent legislation that requires us to be more strategic in how we use data, one of those being the Foundations for Evidence-Based Policymaking Act of 2018. As some of you might be familiar, the Act actually requires us to have a systematic rethinking of government data management, to better facilitate access for evidence-building activities. So one of the reasons why data is so important is that it's now part of our mission in statute: we've been directed, under this new goal, to have a systematic rethinking of government data. It also focuses on using things like artificial intelligence and directs agencies to use federal data and make the models and computing resources available, not just to us internally but also to external experts. So I think the first important thing is that it's our authority — and not just our authority, but we're obligated to use data in this way, following that bill.

Even prior to COVID-19, HHS has been at the forefront of recognizing the value of data and looking at every possible way to use data to make better decisions, come up with better ideas and better solutions, and optimize. We've done that in every single one of our efforts, whether it be the ReImagine activities or our policy development activities, and we've tried to not only do it internally but also share data externally, to allow the private sector and those outside of government to thrive, as they drive a lot of the innovation that occurs. So I think another important thing to stress is ensuring data literacy is developed — the ability to understand data, and then not just to have it, but also to share it, educate people on it, teach it, and show the insights that have already
shown so many successes across HHS and elsewhere — whether it be KidneyX, or the Cancer Moonshot at NIH, or even some of the things that have happened more recently during COVID-19 — and it just underscores the importance of getting data not just internally but also externally. And while we've made significant progress collecting and leveraging data for policy and operational decisions, we recognize that we need to continually work to strengthen data sharing across the enterprise — and not just data sharing, but data knowledge, data hygiene, and all those things that are fundamental to using data. Toward that end, HHS has spent the past few years assessing its current state of internal data and working across its agencies to document a future state, and it is in the process of developing an open-source, cloud-based sharing and analysis platform to improve our access to data insights, access to the raw data, and the ability to have data-driven decision-making, ideation, and improvement.

One thing that I'll share, just in the past two or three months since COVID has really taken off, where to me data has become the lifeblood of what we're doing, is what's known as the Provider Relief Fund, which I help lead out of the secretary's office. We've been asked by Congress to distribute one hundred seventy-five billion dollars to providers as a lifeline to keep them going as they've had to shut down their primary economic resources — whether it be, for doctors' offices, in-person visits, or for hospitals, elective procedures, and so on and so forth. It's shown that everything we do — how we ideate on who we can provide relief to — is driven by the data, not just from a decision-making perspective but also just: can we actually get this to the right people? When we talk about
what providers are receiving relief — well, how do you define "provider"? What data do we have on providers? Do we know whether this person or entity is an actual health care provider that currently exists, and what type of care they provide? Do we have their address, do we have contact information, do we have the actual data necessary if we wanted to ship this person financial relief? So that's the first thing: what can we do? In such an important project, what we can do relies completely on the information — the data — we have. Before you can even get from the idea to the policy, you've got to think about what's actually feasible, and that requires data. So when you talk about data science, you've got to know what resources are available, and that's where you've got to start: understanding what's out there. That means: do you look at what CMS has in their provider network for Medicare fee-for-service? Do you look at what SAMHSA has in its list of grantees for behavioral health? There's a massive amount of data sources, but you've got to understand what they have, what their data points entail, and what they might allow you to operationalize. So that's one of the first things that is incredibly present for me, and front of mind, as we continue this work: people having knowledge of the data that's out there, what the data can actually allow, and how reliable it is. That helps drive ideation based off of what you can actually operationalize.

Then I think another thing: not just knowing what data is out there, but actually being able to use the data, and read it, and understand it, across every level of an organization, from top to bottom — from leadership to new personnel — and understand what the data is presenting, but also being able to manipulate it, play with it, and test with it, so you can have multiple insights and test your hypotheses.
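As a side note on what "understanding what a source has" can look like in practice, a first concrete step is often just combining two provider lists and de-duplicating them on a common identifier. Here's a minimal pandas sketch; the column names, NPI values, and rows are invented placeholders, not an actual CMS or SAMHSA schema:

```python
import pandas as pd

# Invented stand-ins for two provider data sources; the columns and
# values here are illustrative only, not real CMS or SAMHSA data.
cms = pd.DataFrame({
    "npi": ["1001", "1002", "1003"],
    "name": ["Main St Clinic", "Riverside Hospital", "Oak Family Practice"],
})
samhsa = pd.DataFrame({
    "npi": ["1003", "1004"],
    "name": ["Oak Family Practice", "Cedar Behavioral Health"],
})

# Stack the two lists, then keep one row per provider, keyed on the
# National Provider Identifier so the overlap isn't counted twice.
combined = (
    pd.concat([cms, samhsa], ignore_index=True)
      .drop_duplicates(subset="npi")
      .reset_index(drop=True)
)
print(len(combined))  # 4 unique providers across both sources
```

The same pattern scales to real extracts: read each source, normalize the join key, then de-duplicate before asking operational questions of the combined list.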
Even in the work that we've done — in my past I was a financial analyst, and I never thought I'd have to use the Excel programming or other programming capabilities that I used back then — everybody, at every level, had to work with basic tools like Excel and SAS and understand how they work and how to manipulate them to get to the solution that allowed us to distribute seventy-two billion dollars in under fifty days. That only happens because we were fortunate enough to have people who truly understand the data sources and also have the operational skills to manipulate the data, and validate it, and check it across multiple different programs and systems, to where you have a level of confidence to make decisions. The only way that happens fast is when the smart people also know how to use the tools. So I think it can't be underscored strongly enough that fundamental technical ability is important at every level if you want to move quickly — so that you're not going back and forth, and you can pull up the sheet or the data set and start manipulating it in real time. I can share with you that I've had to do that personally, at every level of government, just in the past 60 days, which I think shows that we are past the days when only data scientists or data analysts worked with the data: everybody's got to have that basic skill set to be able to move quickly and operationalize. So I just can't underscore enough how important it is to understand the data sets that are out there, but also to have the ability to put the data in Excel or SAS or Python and start to play with it, to see what you can find and what insights you can draw. It's something that is constantly going to be a growing need, and the importance of having those skills and understanding just can't be stressed enough.
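The kind of basic completeness check Will describes — do we have an address, do we have the fields needed to actually send relief — can be sketched in a few lines of Python. The sample rows and column names below are hypothetical, not a real provider data set:

```python
import pandas as pd
from io import StringIO

# Hypothetical extract standing in for a provider file; blank fields
# represent records we could not act on without further research.
raw = StringIO("""provider_id,name,address,annual_revenue
P001,Main St Clinic,12 Main St,1200000
P002,Riverside Hospital,,4500000
P003,Oak Family Practice,9 Oak Ave,
""")
df = pd.read_csv(raw)

# A simple validation pass: which providers can we actually reach,
# and which lack the revenue data a payment formula would need?
missing_address = df[df["address"].isna()]
missing_revenue = df[df["annual_revenue"].isna()]
print(f"{len(missing_address)} record(s) missing an address")
print(f"{len(missing_revenue)} record(s) missing revenue data")
```

Checks like these are what let a team say, with confidence, which records are ready to act on and which need follow-up before any money moves.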
Then I think the last thing, when we're talking about data — something I wanted to share based off recent experiences — is that when you're doing this work, everybody needs to keep the mindset that it's an iterative process. It's going to require persistence, resiliency, and a huge degree of humility. Even in the past 60 days, everybody has been developing different hypotheses for where you should distribute money, how, and why. And the reason you really want things to be data-driven is that you want the data to actually demonstrate the need and show the evidence. What that requires is a degree of humility: having a hypothesis without pride of ownership. Personally, I and others have developed hypotheses and proposals, then gathered the data to analyze them and said, look, the idea was just off. So you not only have to have the technical ability and the understanding to know where the data is, and to manipulate it and gain insight, but you have to have the humility to say, yeah, the hypothesis was wrong; this is what the data is telling us — and inform the decision-makers so they understand what it actually says and why we need to pivot. I think that's a really important piece to keep in mind as people develop data skills; it reinforces that the reason we pursue the data isn't to prove what we think is right, but to really understand what the data is telling us so we can make better decisions. And like I said, having that end-to-end ability to understand the data sets, manipulate the data, and pull it together in a way that allows for expedient, good decision-making is incredibly critical.

So I can't thank all of you enough for your interest in the demo day, the work that you've done, and all the learnings with the CoLab. Success is only possible when all these diverse perspectives get together, like the ones you see today with the presenters. So I'm delighted to be here today to share my experiences and why
this is so important — not just in the long term for how we'll drive decisions and operate, but in the immediate short term, as people are getting pulled in to understand different data sets and how they can be helpful. The work you're doing today will put us on a path that will help us make better decisions, understand the data better, and not just serve the American people better, but also recover better as organizations and teams. So once again, I want to thank you all for the work that you've done and will continue to pursue. Thank you.

Thank you, Will; we appreciate your insight, talking about your experiences recently, literally distributing the funds, and how data has helped you make those decisions. Our next speaker is Steve Babitch. Steve uses design, data, and technology to build better products and policy. Currently, Steve is the head of the artificial intelligence portfolio for the U.S.
Technology Transformation Services (TTS). TTS AI is focused on helping the U.S. federal government invest in and use AI to achieve their respective missions. In 2015 he was a White House Presidential Innovation Fellow at the FBI, where he orchestrated a user-centered approach to building products and mitigating threats to national security; Steve was awarded for exceptional service in the public interest by FBI Director Christopher Wray. Steve, feel free to jump right in.

All right, super. Thanks, Bishan. You can hear me all right? Yes? Okay, fantastic. Everyone, thanks — first to Bishan and the entire HHS Data Science CoLab for the chance to speak with all of you this morning, or at least for a two-dimensional version of me to; hopefully it will be possible to meet you all in real life sooner rather than later. Like I said, I'm Steve; I head up the AI portfolio for the U.S. Technology Transformation Services, TTS. Just a note on TTS: we are an organization within GSA that basically exists to help government use technology better, to improve the lives of the public and public servants. Within TTS we have groups like the Presidential Innovation Fellows, 18F, the Centers of Excellence, even 10x, which you can think of as a venture or seed fund to move government ideas forward.

Within TTS, following the AI executive order last year and then the White House summit on AI in government in September, with sponsorship from OSTP, the Federal CIO Suzette Kent, and Anil Cheriyan, the head of TTS, we've stood up a focus on artificial intelligence. Every talk starts this way, so it's pretty cliché, but obviously AI has the potential to bring about transformative change — and, equally cliché, it also presents a range of challenges and opportunities, including for government. Within government, we've got to figure out how we invest in and apply AI to help agencies and ultimately the public and the country at large. Whatever we do, as far as TTS is concerned, whatever we're
trying to tackle has to be rooted in the challenges that agencies really have, and how we really support that mission. By and large, unless you're DARPA or DOE or NASA, most agencies are relatively early on in their use of AI, so they're laying the foundations for it, and there are certainly pockets of activity happening. HHS has done a great job laying the foundation, certainly with the data science initiative — the ability to actually share more and more data sets on top of which to take advantage of AI — so congrats to HHS on some of their groundbreaking work.

In terms of TTS and our focus, there are a few areas we're focused on, with our mission overall being to help accelerate the use of AI in government to achieve the mission. I'll just go through a few of those. Bishan, go to the next slide. The first one: within TTS we have a focus on implementation and delivery of AI work, and that connects to the Centers of Excellence. We work with agencies including the DoD's Joint Artificial Intelligence Center and the DOL — that's some of the work that we're doing there. It's really about how we actually help agencies deliver; there are complementary Centers of Excellence — like data analytics, cloud, etc. — that support this as well. So that's one area. Next one, Bishan. In addition to implementation and delivery, we also want to support acceleration through product development. What I mean by that is: what are the things we can do that agencies at large can take advantage of? One of those things is a guide we're actually building right now — a set of content that helps agencies, and agency leaders primarily, think through what it takes to invest in AI, how to start to apply it, and where to get started. That's one thing. We're also building a library: we're hearing that agencies across the
government want to know what else is happening out there and how to learn from it, so we want to build this repository, or library, of a variety of use cases, and we're starting to build that out over time. If you have use cases, we certainly welcome those — please do share them with us. And then we're investigating other ways to become a broader resource for AI learning; we're thinking through ideas right now and having conversations with HHS and others on how we can build that out. So again, that's the product development piece. Next one.

The third one: one platform that we have is the artificial intelligence community of practice. This is a place where we can essentially have talks, panels, and workshops where we share use cases, lessons learned, the pitfalls, the challenges — everything in between. It's somewhere between 800 and a thousand strong now as a community, and we welcome you to sign up; we have monthly events right now. Next part, Bishan. Within the community we're starting to — even today we're going to have an initial discussion on — working groups that essentially ask: across the federal government, how do we get together, interagency, and engage on specific topics relevant to AI? This is something we're starting to build out, and it will likely form subgroups as well. And then lastly, there's this notion of external engagement — that's the next piece; click through, Bishan. We want to engage externally, beyond the government, to meet with academia, industry, and other consortia and think tanks, just to make sure that we're fostering continuous learning and development, and new projects to help federal agencies and the government prepare for, invest in, and use AI.

So that's a quick sort of download of what we're focusing on. Coming back to the community of practice: with HHS, we had them come in in a couple of formats, but one of
which was the Data Science CoLab, to share their Data Insights Initiative and the great work that they've been doing; and then we also had HHS's HR department, with Claire Duncan and her colleagues, share some of the great work they're doing to modernize HR practices. This is sort of the mechanism by which the federal government can get together and focus on the issues related to AI, and a lot of that really is foundational in terms of data sharing, data preparation, data readiness, and the tech challenges.

Then just a couple of last things I'd like to share. If I had to pick one thing to focus on — obviously there's a range of issues that are challenges for the federal government, data and technology related issues, as even Will alluded to — but people, I would argue, are arguably the most important thing. Some of the things we're starting to think about are around people development, and the upskilling and the reskilling — again, HHS has done a great job with this, with the CoLab and the cohorts that they're working through, and then the great work of sharing it out in events like today. But we're thinking about things like: how does data science fit into an agency? What does that capability look like, and what is the structure? Part of that relates to what the career path of a data scientist looks like. We have to set these foundational models and mechanisms up so that when a data scientist comes into an agency, they can see where their professional development will lead and how they'll progress. So this is something worth thinking about, and we want to share perspectives on that, including the professional development career path. And then, how do we build into the DNA of agencies this notion — we hear it all the time, so it's another cliché — of how do we fail fast, but more importantly, how do we learn fast? Certainly AI is hard stuff; we're going to fail
at some things, but we've got to make sure that we pay attention to the long game here and not abandon AI because we failed somewhere. We're going to do that, and we're going to learn; we've got to continue to look for ways to improve, and agency leadership has to support that and recognize the importance of investing in AI for the long term.

As part of that, when you think about the talent itself, one critical area is actually the human resources and talent teams at agencies. They have to have enough AI knowledge to be able to evaluate the talent that is applying to the agencies, so they can bring them in to the right business areas and really get the work done. So how do we give them the training and development needed to recognize and understand what good AI and data science talent looks like? I would say that very much analogous to that is the issue of acquisition. Certainly the government is going to be buying a lot of AI technology — certainly more than building it, unless, again, you're a pure R&D type of agency. If the government is buying a fair amount of AI, the people doing the acquisition have to really understand and work with their technical stakeholders and business stakeholders to understand what questions to ask of the vendor and contracting community, and have enough knowledge to understand the answers that come back, and make sure that we work side by side and have multiple stakeholders at the table, to ensure that we buy good technology, good solutions, good tools. Again, this speaks to the most important issue — people — and how we build that data literacy at both the technical and business levels. And then we've got to start somewhere, do some piloting and experimentation, fail, and learn. At TTS we're trying things, we're building those perspectives, and we're looking to share them out to the federal government writ large. We don't have all the
answers — that's why we want to engage the community, to share what is working well and what is not. We're simply working hard to provide some of those perspectives and bring all of you together. So we certainly welcome you to be a part of the community of practice; we're all in this together, and I look forward to engaging going forward. Thanks for the time; thanks again to Sanjay and Bishan and all of HHS — a pleasure to be with you.

Thank you, Steve. It's great to be a part of a larger community, especially as we try to learn from one another to improve the way the government does AI and data science and builds better data strategies. We appreciate your talk. So now we're ready to get started with the presenters. While all 30 CoLab participants have worked very hard to create amazing tools that they'll be bringing back to their home offices — to develop new insights from their data sets in order to make better business decisions, make their workflows more efficient, or reduce costs — we'd like to share nine projects today to give you a flavor of the possibilities that the applications of data science bring to the department. First up we have Christie Style, who will tell you about her project in the Administration for Native Americans.

Good morning, everyone. My name is Christie Style. I'm a current program analyst with the Administration for Native Americans (ANA), which is underneath the Administration for Children and Families, or ACF. I'm very thankful to everyone at the CoLab, and to y'all tuning in today, as I share a bit about my presentation, which is focused on ACF tribal consultation transcripts as well as the Missing and Murdered Native Americans crisis that's occurring, which is being depicted in the illustration on the right here, shared by ANA during the Awareness Day that just passed. You can go to the next slide. Just to provide a little background information on ANA: we are an agency that provides competitive grant funding in the areas of social and economic development, language, and environmental
topics, and these are available for Native American and Pacific Islander organizations as well as the 574 currently federally recognized tribes in the United States. At ANA we support several different acts within HHS and ACF. If you can go to the next slide: one of those acts is also to support tribal consultation. There are a variety of different types of consultation policies that occur across the agencies, from advisory committees to individual agency consultations, and as you're seeing here, ACF holds an annual consultation. Consultation leads to information exchange, mutual understanding, and informed decision-making, as specified by the ACF tribal consultation policy on the screen. Next slide, great. So with consultation, tribal leaders come and share their voices, and this creates transcripts, as you're seeing on the left here, as well as written testimony that is submitted. These transcripts are currently publicly available for ACF from 2010 on, but they have never been analyzed utilizing data science or text mining techniques. Next one. And why this is extremely important right now is this crisis: the missing and murdered Native Americans, or MMNA, which some of you may know as missing and murdered indigenous women, or MMIW. As you can see, the CDC has reported homicide as the number four cause of death for Native women and the number three cause of death for Native men aged one to nineteen years old. This crisis has been long-standing, occurs across areas, and affects many different communities. It is a complex issue related to the social determinants of health, and it cuts across all of our ACF programs and HHS programs. On November 26, 2019, an executive order created Operation Lady Justice, which was to help bring together the Department of the Interior, the Department of Justice, and HHS to work specifically on MMNA. With Operation Lady Justice there will be future listening sessions and consultations that will occur, and part of my goal for
the CoLab was to take the information that we have from consultations, create a systematic way to analyze that data, help support the work that is being done on MMNA, and also think of ways we can analyze the consultation listening sessions that will occur in the future. Through this, we're also acknowledging the voices of the tribal nations that have already shared their thoughts, their ideas, and their feedback with us throughout the last nine years. You can go to the next one. So really, first, for these consultation transcripts, I wanted to understand who had participated over the last nine years, and through the CoLab I was able to create and code the interactive maps that you're seeing on the left here. This map shows the percentage of federally recognized tribes that have participated in ACF consultation; again, there are several different types of consultation that occur across HHS, but this particular one was looking at the annual consultation that happens with ACF. As you see here, many tribes and organizations have participated, but I really wanted to understand those numbers in comparison to the number of federally recognized tribes in each state. For example, in Alaska many organizations and tribes have participated, but this only makes up 3% of the Alaskan villages, because it is a large state with many villages and potential challenges to participating in consultation, such as traveling to DC to attend, or limited resources even to turn in written testimony. Go to the next slide. Great. So now that we have an idea of who participated, I wanted to compare that against some of the data that does exist on this crisis. This data of course is from several different sources, because some of the MMIW and MMNA data is a challenge to find. One data source is the Urban Indian Health Institute, which put out a report that focused on cases of women and girls from 1943 to 2018, which you're seeing in the red map on the top, and on the bottom
you're seeing the National Missing and Unidentified Persons System, or NamUs, which covers all genders and ages, from cases in the Native populations that I pulled in March 2020. Once I take the consultation participation data and lay that on top (Bishen, you can go to the next slide, or the next tab there, perfect), you can see the colors that are coming through on the bottom, the red and the yellow: these are states that have missing cases but have not necessarily participated in consultation. Not all states have federally recognized tribes, but some states that do, such as Nebraska, which you're seeing on both the top and the bottom, have cases of MMIW as reported in UIHI and in NamUs. So a state such as Nebraska, which has six federally recognized tribes, might be a place where we want to focus our outreach, awareness, or discussions, as those tribes have not participated recently in the nine years. Go ahead to the next slide. Great. So now that we have an idea of who has participated, I wanted to look at the content of these transcripts. You're seeing here on the right a word cloud of all nine years of the words in the transcripts of the different comments and responses that were given. As you'd expect with tribal consultation, the biggest word that comes through is "tribal", the center of, and the reason for, consultation. An interesting word that you're seeing come through is ICWA, the Indian Child Welfare Act, which was, and continues to be, a common topic that we saw across every year of consultation, and which is important to Native communities. Go ahead to the next. So looking more into these most frequent words, I was able to create the top 30 most frequent terms, as you're seeing here. What might be interesting in these 30 words (Bishen, you can hit the next slide there) is that there are a lot of action words: not only are tribes asking for action from the federal agencies, such as
support and funding, but they themselves are taking action, from MMIW to other programs and areas. They're providing solutions, they're creating innovative ideas on their own, and they're trying to share best practices that maybe can help others, through these transcripts and through consultation. So to the next slide. After taking these word frequencies, I created a network graph to really understand how these different years interact together. 2010 mainly focused on creating future consultations, on finding a way to understand how to develop this together, and the graph let me see how the different consultation years interacted. To the next slide. There's also a way to do a term search quickly, through some of the skills I gained in this class. As you're seeing here as an example, one tribe said in 2015 that trafficking was an issue, so we can use term searches to find different ideas and topics as they occur throughout Native communities. And then we can create, hopefully, a public-facing matrix, which you're seeing on the bottom, that organizes transcripts so that tribes can see publicly what comments have been made in the past, how they can help in the future, and how we can relay this information and analyze it further. Next slide. So really, I hope in the future we can continue to do this analysis and build tribal-related resources. This class and this CoLab, which I'm so thankful for, have given a new understanding to our never-before-analyzed transcripts and consultations, and created time savings for organizing these transcripts as well. I hope this continues to support the work on MMNA and, most importantly, continues to honor the relationship we have with tribes. If there's a last slide: I want to thank the CoLab for this opportunity, as well as the staff at the Office of the CTO, ACF, and the ANA staff on the right you're seeing, some of whom have been deeply affected by this crisis and have lost loved ones, and again the tribal nations that have participated in this.
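The term-frequency and term-search steps just described can be sketched in a few lines. This is a toy illustration with made-up transcript snippets, not ANA's actual data or code (the CoLab projects were largely built in R; Python is used here for the sketch):

```python
import re
from collections import Counter

# Hypothetical stand-ins for the consultation transcripts (invented text).
transcripts = {
    2010: "Tribal leaders asked for consultation support and funding.",
    2015: "Trafficking was raised as an issue; tribal programs need support.",
    2019: "Tribal nations shared innovative ideas on child welfare and funding.",
}

STOPWORDS = {"the", "and", "for", "was", "as", "an", "on", "a", "need", "asked"}

def term_frequencies(texts):
    """Count word frequencies across all transcripts, minus stopwords."""
    words = []
    for text in texts:
        words += [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    return Counter(words)

def term_search(term, corpus):
    """Return the years whose transcript mentions the term."""
    return [year for year, text in corpus.items() if term.lower() in text.lower()]

freqs = term_frequencies(transcripts.values())
print(freqs.most_common(3))            # "tribal" dominates, as in the word cloud
print(term_search("trafficking", transcripts))
```

The same two functions cover both the word cloud input (frequencies) and the public-facing term-search matrix described above.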
I think Tara Sweeney, the Assistant Secretary for Indian Affairs at DOI, has really said it best: consultations are important because they are a voice for those who cannot speak. I really thank you for today as we continue to work on this, and I thank the tribal nations for their resiliency and strength throughout this crisis. Thank you, everyone. Thank you, Christy. Our next lightning presenter will be Dong, who will tell you about his project on building operations and maintenance. Hi, everyone, my name is Dong. I'm a database manager from the NIH Office of Research Facilities. First, let me express my appreciation to the organizers and the CoLab teams for providing all 32 of us a tremendous training opportunity in data science technology. I'm so grateful for the eight weeks, not only for the R language programming skills we learned in the classroom, but also for opening my mind, as an entry-level data scientist, to incorporating and correlating the various databases we deal with every day, in order to provide more insights and business intelligence to management. So today the demo I have is a brief layout of my CoLab project: how to use existing building maintenance data to predict overall operation and maintenance costs. Next slide, please. A little bit of background: here you're seeing our campus on the map, and the blue markers with numbers show how many active maintenance work orders there are on each building. This is real-time data, running 24/7. During the last 20 years we have accumulated millions of data and maintenance records; you can see trouble calls, preventive maintenance, and service orders related to each building's operation and maintenance. But how to use those data has always been a big challenge for us. We all know that we can easily generate a report or dashboard to show how well or how poorly our maintenance teams are doing their jobs, or even show how many pipe floodings or outages happened in a given year.
Next slide, please. Real data scientists go beyond the standard reports available from all sources: they can ingest the data and conduct deep analysis. Their purpose is to reveal the hidden business correlations among raw data elements, to specify reasonable classifications, and furthermore to predict data trends we have never seen before. This is our purpose. Next slide. So during the eight weeks of CoLab class I was inspired by all of this, and I decided to give it a try. We know that each fiscal year our office has to report the operation and maintenance costs and a building condition assessment for the campus. Traditionally, a team is set up to collect all the data manually into all kinds of Excel workbooks, including maintenance material costs, labor costs, contract costs, management costs, and so on. But I'm proposing here to use machine learning, taking the yearly maintenance statistics data plus previous fiscal year costs, to predict the FY19 numbers. We used three technical steps in this project: the first one is data preparation, then data visualization, and finally data modeling. Next screen. So the first step: we assembled our data. We adopted FY12 to FY19 data; we classed the building groups, selecting building use types as well as building age, size, and census data. Once we collected all the data, we did basic data mining on the maintenance statistics data, and after that we spent a lot of time cleaning up and standardizing to create a final R data set, ready for data visualization and data modeling. The target variable is what we call O&M costs, together with fourteen predictor variables. Next screen. Here is the beauty of the R language: with one sentence of code you can show the correlation matrix among all the variables. As you can see here, we have
different colors: the darker, the higher the correlation, and anything above 0.8 I treat as a high coefficient. You can see the gross area and the census data are among the highest, while the service orders and labor costs are lower than the preventive maintenance and the trouble call data. You can also see that lab buildings and older buildings show some correlation, so that's very interesting data. Next slide, please. Next, we loaded the high-coefficient variables and fed them into two data models: one is called random forest, which is built from decision trees, and the other is called a generalized linear model. We trained each data model on a sample of the data and used the rest as test data to ensure the results were acceptable, and actually the results were very impressive. Next. So finally, we applied both data models to the FY19 predictor variables and generated the calculated costs for each building. We initially thought we could compare our estimates with the real FY19 costs, but those numbers are not available yet, unfortunately. But we can see both models' histograms side by side: they are very similar, and the difference between the two predictions is only 5%. Next, please. OK, now this project is only a start of our data science journey. There are many more things down the road: we can do a time series model, we can apply a depreciation model to analyze the different types of buildings, or we can even go deeper and analyze a particular mechanical or electrical system within each building to see how well it performs. For the future, we actually already organized a data science panel meeting, gathering people of different backgrounds and different majors, professionals coming together to expand our data sources and widen our data platform.
That's our initiative: to build a team to dig deeper into data science techniques to use in our day-to-day business operations. The final goal is to take machine learning skills into all of our decision-making processes, as pioneers. Thank you, that's all I have today. Thank you, Dong. Our next presenter will be Claire, and she'll share her insights on how to predict a broken heart. Claire, go ahead. Hello. Hi, everyone, my name is Claire, and I work at FDA. Today I'm presenting Predicting a Broken Heart: using machine learning to assess the cardiovascular risk of pharmaceuticals. Next. I work at the FDA Center for Drug Evaluation and Research; my team is called the QT Interdisciplinary Review Team. Next. Our job is to review submissions from pharmaceutical companies and assess how likely a drug is to cause cardiac arrhythmia. If a drug has a high risk of arrhythmia, it might not be safe to use in patients, so we might not approve the drug, or we might put a significant warning in the label. The arrhythmia we are most interested in is called torsade de pointes, which is a potentially fatal arrhythmia; it can be caused by many drugs, and it is associated with the ECG signal. The ECG measurement works by attaching leads to the surface of a patient's body, recording the electrical signals, and using those to assess how the patient's heart is functioning. Next. So traditionally we use the QT interval on the ECG signal to assess the risk of torsade. On the bottom left you see a normal ECG signal; this signal is composed of several waves. Next. And the orange line is showing the QT interval: a prolonged QT interval is really associated with a high risk of causing torsade. Next. However, this is not always the case: there are drugs that prolong the QT interval but actually have a low risk. Therefore, if we only use the QT interval on the
ECG for torsade risk assessment, we would throw away a lot of good drugs. Next. So this is the problem we want to solve. Next. The goal of this project is to find a better way of predicting torsade risk; we're trying to find a more reliable way using machine learning algorithms. Next. And that's where CiPA, the Comprehensive in vitro Proarrhythmia Assay, comes in. CiPA is a new regulatory paradigm that uses electrical signals from the cell level to predict torsade risk. Next. Corresponding to the normal ECG signal on the bottom left is the electrical signal from the cell level, shown like this. Next. So traditionally we tried to use the QT interval in the ECG signal to predict torsade; now we're trying to use characteristics of the electrical signal at the cell level, such as the duration and the amplitude of the signal, to predict torsade. Next. And the combination of those characteristics of the cell-level signal becomes the metrics. Next. So this slide shows the CiPA workflow. First we perform the experiments, where we perfuse cells with the drug and record the electrical signals from the cells. We have a total of 28 drugs with known torsade risk levels, and we separate these drugs into two sets: 12 drugs in the training set and 16 drugs in the validation set. Then we use the data from the experiments to build a mathematical model, and after building the mathematical model we run simulations to generate the metrics. Next. Then we feed these metrics from the computer simulation into a machine learning model, and the output of the machine learning model is a prediction of the torsade risk of the
drug. Next. This presentation focuses on the classification step. The goal here is to use a machine learning model to predict a drug's risk of inducing torsade, tackling the problem from the data science perspective. Next. We first clean the data set, and next we train the classification models: we use the data to train each model to learn how to predict whether a drug has a low or not-low risk level, based on the training data set, which is composed of twelve drugs. The machine learning algorithms I used here are four different algorithms: k-nearest neighbors, logistic regression, decision trees, and random forest. After building the classification models, I applied the trained models to the validation data set, which includes the other 16 drugs, and compared the predicted risk with the known risk to assess the performance of the models. Next. This is what the training data set looks like. Next. It is composed of one target variable, which is the risk, and 12 predictors, which are the metrics. In the training set I have included 12 drugs: 4 of them have a low risk level and 8 of them have a not-low risk level. For each drug I have data from four dosages, and each dosage is composed of 2,000 samples; you can think of these 2,000 samples as 2,000 simulated cells. Next. And in the validation set I have included 16 drugs: 5 of them have low risk and 11 of them have not-low risk. Next. This slide shows the results of the classification model on the 16 validation drugs. Next. This figure is composed of 16 panels; each panel represents the prediction results for one drug from the validation set. Next. In each panel, the x-axis is the dosage, ranging from 1 to 4, and the y-axis is the count of samples, from 0 to 2,000. Each panel represents one drug; the top of the panel shows the name of the drug, and the
color of that label indicates the risk of the drug: if the drug is a low-risk drug the color is green, and if it is a high-risk drug the color is red. Each of the bars indicates the prediction results: the green part of the bar shows how many samples were predicted as low risk, and the red part shows how many samples were predicted as not-low risk. This line separates the high-risk drugs from the low-risk drugs: above the line are the 11 not-low-risk drugs, and below it are the 5 low-risk drugs. If the model works well, you will see mostly red on the top and mostly green on the bottom, which is exactly what you see here. Next. This shows the sensitivity and specificity performance of the models; these values are pretty close to 1, which means this model is doing pretty well. Next. So this is the summary statistics of all the model types I used; you can see that they are all doing pretty well. Next. The best one is the k-nearest neighbor with only one metric, qNet, which is the net charge carried by a few selected currents; this model performs the best, indicating that qNet is a good predictor for the torsade risk. Next. In conclusion, I've built several classification models with different metrics using four algorithms, and then compared model performance based on the resulting predicted risk levels of the validation drugs. The potential impact of this: under the CiPA paradigm, we're continuously developing guidelines using in silico and empirical data to assess torsade risk. This project extends the scope of the classification methods that can be used to predict torsade risk levels, identifies important metrics, and sheds light on the mechanism of torsade, all of which are crucial in developing guidelines. Thanks.
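A minimal sketch of the kind of nearest-neighbor classification described above, using a single made-up qNet-like metric: all numbers are invented for illustration, not real CiPA data, and the real work used 12 metrics, four algorithms, and a proper train/validation protocol:

```python
# Toy 1-nearest-neighbour classifier on one metric, echoing the finding
# that kNN on qNet alone performed best (values are purely illustrative).
train = [(0.050, "low"), (0.061, "low"), (0.020, "not-low"),
         (0.015, "not-low"), (0.031, "not-low"), (0.058, "low")]

def predict(qnet):
    """Label a drug by its nearest training drug in qNet space."""
    return min(train, key=lambda t: abs(t[0] - qnet))[1]

validation = [(0.055, "low"), (0.018, "not-low"), (0.029, "not-low")]
correct = sum(predict(q) == label for q, label in validation)
print(f"validation accuracy: {correct}/{len(validation)}")
```

In the actual project the per-sample predictions (2,000 simulated cells per dosage) were aggregated per drug, which is what the 16-panel figure summarizes.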
Finally, I want to thank my team, my division, and the people from the Division of Applied Regulatory Science for supplying the data, as well as the instructors and organizers of CoLab, without whom this project would not have been possible. Thank you. Thank you, Claire. Next up we have Brian Laird, who will tell you about how he used machine learning to find clinical patterns of autoinflammatory disease. Alright, thank you so much, Bishen. Hi, everybody, thank you for virtually being here today. As Bishen just said, my name is Brian Laird. I'm a post-baccalaureate fellow with the National Human Genome Research Institute, and I work with the clinical team under Dr.
Dan Kastner. So today I'll be going over my project so far, which uses unsupervised machine learning to find clinical patterns of autoinflammatory disease. Next, please. OK, so some quick background on our research. We study a group of rare diseases called autoinflammatory diseases; you can think of these essentially as a dysregulation of the innate immune system, which leads to clinical features such as periodic fever, arthritis, sterile or non-infectious skin lesions, etc. Ours is a natural history study that has been going for 25 years now, and to date we have over 2,000 patients that are enrolled in the study or have been seen by one of our clinicians at the NIH Clinical Center at some point. So it's a very diverse cohort, with a large percentage of the patients we see remaining undiagnosed or undifferentiated: we just know they have some sort of autoinflammatory disease, and we're looking for a cause. And these are very rare diseases, sometimes with only dozens of cases, if that, known worldwide. So my goal was to find a way to better stratify or cluster these patients solely by how their disease presents, to see if we can find any new insights, maybe aid downstream analysis, or in the future do any sort of data-driven medical decision-making based on how they present. But the challenge in doing this is that most of our clinical data is unstructured; it's in free-text electronic health records that have been written by our clinicians. Another challenge, as I said, is that we don't know exactly what we're looking for: there's no outcome variable to train for or target, we're just looking for groups of patients with similar disease. And lastly, when it comes to stratifying patients, another challenge is that we have a curated cohort. We're not looking at all rare diseases; most of our patients have some degree of similarity to each other, especially to the untrained eye. So how can I stratify beyond that? Next
slide, please. So when I applied to CoLab, my proposal was twofold: one, I wanted to use existing natural language processing tools to get clinical data out of our EHR into a workable format, and two, to use that data to cluster patients with unsupervised machine learning techniques. Next, please. So, a quick background on the clinical data I'll be using. There's a biomedical knowledge resource called the Human Phenotype Ontology, which is a structured vocabulary for describing an abnormal phenotype, an abnormal presentation of disease, or a clinical finding, and it's becoming the standard in phenotypic and genomic analysis. You can see in the graph, just for example, if you look at the bottom where it says short stature: short stature is an abnormality of body height, which is a growth abnormality, etc., etc. It's a way to represent how we think about disease, and that's what I want to get from our records. So how can we get these from our records? Well, I could plead with our clinicians to manually review our patient charts and extract these terms, but that would be incredibly tedious. We have over 20,000 clinically relevant notes, and our clinicians are incredibly busy providing care to patients; they don't have time to go through all of these notes. And contracting out would be prohibitively expensive. So I turned my focus to automated tools, many of which have been published in the past decade, many in the past year, and the one I'll be focusing on today is a tool called ClinPhen. So what the overall process looks like is: first I download all of our free-text notes from a data warehouse we have at the NIH called BTRIS. It's a way to get all notes for all of our patients on our protocol, but this data is pretty messy, so you then have to filter it down using basic heuristic filtering, to only keep clinically relevant notes, for instance getting rid of documentation of consent, to remove some clutter,
and then you can take this corpus of notes and apply the NLP tool. Next, please. So ClinPhen was published in late 2019 out of Stanford, and it's a more traditional natural language processing tool: you have a paragraph of text, you break it into sentences, and then you match phrases in the sentences against a dictionary or thesaurus, looking for these Human Phenotype Ontology terms, or HPO terms. I chose this one, despite it being somewhat more basic than other tools, because it does a good job of taking sentence context into consideration. You see in the middle where it says renal disease in blue: that's a true negative. It didn't capture in the output that the patient has renal disease, because the note said there was no further occurrence of renal disease. I was more concerned about having a lot of false positives in the downstream data than about missing some terms, because false positives would just be harder to interpret. Next, please. And this wouldn't be natural language processing without the obligatory word cloud, so this is what the output looks like for our entire cohort. It is a good representation of autoinflammatory disease as a whole, and you can see some quality control issues, such as "falls" in the bottom left, that I'll have to deal with later, but I was happy with the initial output. Next, please. But now that I have this data for our whole cohort, how can I actually use it to cluster patients? Well, rather than starting with everybody, all close to 2,000 patients, I wanted to take a step back and do a proof of concept, to see if I can make sense of it. So I started with a cohort of patients we see in our clinics that all have the same disease, deficiency of adenosine deaminase 2, or DADA2, a 54-patient cohort. I started with this for both pragmatic and clinical reasons. Pragmatically, our team discovered the disease back in 2013-14, so we have a very good understanding of it, and it's a very nuanced and complex disease.
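The sentence-level dictionary matching with negation handling credited to ClinPhen above can be illustrated with a toy version. To be clear, this is not ClinPhen's actual algorithm or API, just a sketch of the idea, with a two-term dictionary and a handful of invented negation cues:

```python
import re

# Toy dictionary matching with simple negation handling, in the spirit of
# the sentence-level matching described above.
HPO_TERMS = {"renal disease": "HP:0000112", "short stature": "HP:0004322"}
NEGATION_CUES = ("no ", "denies ", "without ", "no further occurrence of ")

def extract_hpo(note):
    """Return HPO IDs whose terms appear in the note and are not negated."""
    found = set()
    for sentence in re.split(r"[.;]\s*", note.lower()):
        for term, hpo_id in HPO_TERMS.items():
            if term in sentence and not any(cue + term in sentence for cue in NEGATION_CUES):
                found.add(hpo_id)
    return found

note = "Patient has short stature. No further occurrence of renal disease."
print(extract_hpo(note))
```

As in the slide's example, "renal disease" is dropped as a true negative while "short stature" is kept.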
Our clinicians have already mentally clustered patients into three categories: an immunologic category, a hematologic category, and a vascular and inflammatory category. So I knew I had something to look back on, to see whether I could capture output similar to this using these HPO terms. And then clinically, I was motivated by the fact that, unfortunately, a defining feature of this disease is early-onset stroke: around half of the patients in our cohort have had at least one ischemic or hemorrhagic stroke, with the median onset being five years old. Unfortunately, this disease often doesn't come up on the patient's differential until there has been a potentially devastating stroke, so if we can find a better way to catch signs of this disease earlier on, by finding patterns, that could be incredibly useful in better treating these patients. Next slide, please. So I take the ClinPhen output and transform it into a tabular format, where each row represents a patient and there's a column for every single HPO term that was found within the corpus: just a binary one or zero, has this HPO term shown up somewhere in the patient's record. Next, please. And I took that data and applied two different clustering methods, or pipelines, to it. I don't have time to go into too much detail, but from a very high level: ontology similarity is a more conventional approach, popularized in the 90s, where you use the structure of the Human Phenotype Ontology that I showed you earlier to calculate term frequency, and from there you can extrapolate similarity using different metrics. And then I tried a more modern approach called HPO2Vec, which, for those of you in the audience familiar with the skip-gram model of word2vec for word embedding, is an adaptation of that which works on the graph structure of the HPO and also allows you to enrich the HPO with other biomedical annotation sources.
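One row per patient, one column per HPO term, then pairwise similarity: here is a toy sketch using Jaccard set overlap as a simple stand-in for the ontology-based similarity metrics mentioned above (the actual pipeline used ontology-aware measures and HPO2Vec embeddings; patients and terms below are invented):

```python
from itertools import combinations

# Each patient reduced to the set of HPO terms found in their notes,
# i.e. the nonzero columns of one row of the binary matrix (toy data).
patients = {
    "P1": {"HP:0001", "HP:0002", "HP:0003"},
    "P2": {"HP:0001", "HP:0002"},
    "P3": {"HP:0004", "HP:0005"},
}

def jaccard(a, b):
    """Set-overlap similarity: shared terms over all terms seen in either."""
    return len(a & b) / len(a | b)

scores = {(p, q): round(jaccard(patients[p], patients[q]), 2)
          for p, q in combinations(patients, 2)}
print(scores)   # P1 and P2 overlap strongly; P3 shares nothing with them
```

These pairwise scores are exactly what feeds the network-building step described next.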
In this case, I drew new links between phenotypes if they shared a disease in common, to give the embedding a bit more real-world understanding of what each term means and what it represents. Next, please. So I was actually quite surprised by my initial results. What you're looking at here is a graph, a network, where each of these nodes is an individual patient, one of the 54, and I drew a link between patients if their pairwise similarity score from the ontology similarity metric was greater than the cohort median, just as a way to capture relative difference, because, as I mentioned earlier, everybody already has a high degree of similarity. I then ran a community detection algorithm on the network and used the detected communities as my patient clusters. In order to evaluate the clusters, I initially started with a more basic approach, just looking at which terms are more frequent in one patient cluster than another, and I was quite surprised to see that the qualitative categories over here, from me reviewing the frequent terms in each cluster, do largely represent the way our clinicians already think about these patients mentally. Remember that Venn diagram I showed you earlier? So it's a promising proof of concept, and I'm happy with it so far. If you go to the next slide, please: this is the result from using HPO2Vec, which is quite similar, although a little less granular, so I think I need to tweak it a bit more in the future, but again I am happy with these categories. Next slide, please. So why does this matter? I have these patient clusters now; what can you do with them? Well, one, it was a huge time saver: we didn't have to manually go through the records of these patients ourselves. Actually, would you go back a slide? I'm sorry. So going back to what I was talking about, the strokes and DADA2, I think I'm most excited about this output because each cluster actually has a few patients in it that have had a stroke. At first I was disappointed by that.
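The graph-building step just described (link patients whose similarity beats the cohort median, then find communities) can be sketched with connected components as a crude stand-in for a proper community detection algorithm; the similarity scores below are invented:

```python
import statistics

# Hypothetical pairwise similarity scores for four patients.
sim = {("P1", "P2"): 0.9, ("P1", "P3"): 0.1, ("P2", "P3"): 0.1,
       ("P3", "P4"): 0.8, ("P1", "P4"): 0.05, ("P2", "P4"): 0.1}
median = statistics.median(sim.values())
edges = [pair for pair, s in sim.items() if s > median]

def components(nodes, edges):
    """Union-find grouping of linked patients into clusters."""
    parent = {n: n for n in nodes}
    def find(n):
        while parent[n] != n:
            n = parent[n]
        return n
    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return sorted(groups.values(), key=len, reverse=True)

clusters = components({"P1", "P2", "P3", "P4"}, edges)
print(clusters)
```

A real pipeline would use a modularity-based community detection method rather than connected components, but the thresholding logic is the same.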
stroke is such a defining feature, how are these techniques missing that? But as I think about it, perhaps stroke is a very defining feature clinically, when you're thinking about the patient's differential. As I said, we don't want to even have to see that red flag; we want to capture the patients who have this disease before they have a stroke, so we can provide mediating treatment before that point. So I think these groups are promising in that respect: perhaps there is a better way to categorize these patients without relying so heavily on the most clinically obvious features. Next slide, please.

And I think that can feed a virtuous cycle between our clinical team, our lab, and the patients and families we see. If we're able to better understand these diseases, we can potentially get earlier diagnoses, and that leads to better prognoses for these diseases, gets people on treatment sooner, and helps us provide more targeted care to the patients we see. And of course there are caveats: this was a retrospective study and there's a lot of work still to be done, but I'm very happy so far. Next slide, please.

Lastly, I'd just like to end by thanking our clinical team that has seen these patients, our lab, and everybody else at the NIH who has been supportive of this work and who takes care of our patients; they do a great job. And of course I'd like to thank all of our patients and their families; none of our work would be possible without their commitment and their support. And I'd like to thank the Data Science CoLab and all of the colleagues I've met through it. A year ago I didn't have any programming or data science experience, and now I'm becoming confident doing interesting analyses like this. So thank you to the Data Science CoLab, and thank you all again for virtually being here today.

Thank you, Ryan; I appreciate your presentation, as well as the presentations of our other four presenters.

Produced by the US Department of Health and Human Services at taxpayer expense.
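
[Editor's note: the clustering pipeline the presenter describes, a patient-similarity graph thresholded at the cohort median, Louvain community detection, and a review of the most frequent terms per cluster, could be sketched roughly as follows. This is an illustrative reconstruction, not the presenter's actual code: the patient profiles are toy stand-ins, and Jaccard overlap of term sets is substituted for the ontology-based semantic similarity metric used in the talk.]

```python
# Sketch of the described pipeline: similarity graph -> Louvain
# communities -> frequent terms per cluster. Toy data only.
from collections import Counter
from itertools import combinations
from statistics import median

import networkx as nx  # Louvain is built into networkx >= 3.0

# Toy phenotype-term profiles standing in for patient records.
patients = {
    "P1": {"headache", "stroke", "anemia"},
    "P2": {"headache", "anemia", "fatigue"},
    "P3": {"rash", "fever", "fatigue"},
    "P4": {"rash", "fever", "joint pain"},
}

def similarity(a, b):
    # Stand-in metric: Jaccard overlap of term sets. The talk used
    # an ontology-based semantic similarity metric instead.
    return len(patients[a] & patients[b]) / len(patients[a] | patients[b])

scores = {pair: similarity(*pair) for pair in combinations(patients, 2)}
cutoff = median(scores.values())

# Link two patients only when their similarity exceeds the cohort
# median, to capture *relative* difference in an already-similar cohort.
G = nx.Graph()
G.add_nodes_from(patients)
G.add_edges_from(pair for pair, s in scores.items() if s > cutoff)

# Louvain community detection yields the patient clusters.
clusters = nx.community.louvain_communities(G, seed=42)

# Evaluate each cluster by its most frequent terms.
for i, cluster in enumerate(clusters):
    counts = Counter(t for p in cluster for t in patients[p])
    print(f"cluster {i}: {sorted(cluster)} -> {counts.most_common(2)}")
```

With a real cohort, the term profiles would come from the patients' annotated HPO terms, and the median threshold would be computed over all pairwise ontology-similarity scores, exactly as described in the talk.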
