Data Umbrella

Sphinx for Python Documentation Tutorial - video hits 20000 views

Tue, 31 Mar 2026 07:01:35 +0000

About

Congratulations to Melissa Weber Mendonça for her Sphinx video tutorial hitting 20,000 views on YouTube!

Video: Sphinx for Python Documentation Tutorial (~75 minutes)

This tutorial focuses on a high-level explanation of how the Sphinx tool for generating documentation automatically works for Python packages, using NumPy as an example (but won’t be restricted to the NumPy use case). We’ll talk about advantages and disadvantages of choosing different documentation systems and how to integrate other types of documents (for example, Jupyter Notebooks) in the documentation for a given package.

GitHub repo: minimalsphinx
Slides: Intro to Sphinx for Python Documentation

About the Speaker: Melissa Weber Mendonça

Melissa is an applied mathematician and former university professor turned software engineer. Nowadays she works at Quansight, developing open-source software and leading the Documentation Team for NumPy. She is also a LaTeX, Fortran and free software enthusiast.

GitHub: @melissawm
LinkedIn: @axequalsb

Video Outline: Timestamps

00:00:00 Reshama introduces Data Umbrella
00:05:21 Reshama introduces Melissa
00:06:50 Melissa begins talk
00:07:38 Tutorial Introduction
00:09:01 How Melissa got involved with NumPy / Invitation to collaborate with NumPy documentation
00:10:01 Impostor Syndrome explanation
00:12:08 What is documentation? (Tutorials, How-to Guides, Explanation, Reference)
00:16:54 What is Sphinx?
00:18:00 configuration file: conf.py
00:19:00 Getting started with sphinx (installation) (conf.py, index.rst, _build/html)
00:20:43 What is reStructuredText? (rst files)
00:22:21 demo of MinimalSphinx Project: https://github.com/melissawm/minimals…
00:22:55 pokemon
00:24:05 What is reStructuredText?
00:24:25 Comparing reStructuredText (reST) to Markdown (md)
00:25:51 reStructuredText format II (directives are blocks of explicit markup)
00:27:44 Auto-documenting a python package (autodoc)
00:28:45 Q&A: Do the docstrings have to be in rst?
00:29:30 “:members:” options
00:29:58 Other extensions (sphinx.ext.doctest, sphinx.ext.intersphinx)
00:33:08 Minimal sphinx, example
00:35:00 run “make html” command
00:39:50 Q&A: Should we consider doctest vs pytest? (They are complementary. Use both.)
00:40:30 Example: NumPy docs (how to contribute) (https://numpy.org/doc/stable/dev/)
00:43:20 Getting involved with NumPy, Communicating: https://numpy.org/contribute/
00:43:27 A pull request to the NumPy documentation
00:44:20 Fixing a Typo in NumPy Docs
00:46:20 Fork numpy repo on GitHub
00:47:15 Clone repo
00:47:55 Add remote: git remote add upstream
00:49:05 Branch: git checkout -b doc_fix
00:49:43 Find file: cd doc/source/user/
00:52:30 Set up virtual environment
00:53:40 pip install cython (numpy dependency)
00:56:00 build the docs: “cd doc”, “make html”
01:00:30 Q&A: Can sphinx be used for other languages? (yes)
01:00:55 Q&A: Why did we build NumPy? (we want to generate doc with latest version of numpy)
01:03:33 Jupyter notebooks
01:04:15 Final thoughts by Melissa
01:06:15 Q&A: Why do we build from source?

Transcript

00:00:00 Reshama introduces Data Umbrella

Hello, everyone, and welcome to Data Umbrella’s webinar. The way the webinar will work is that I’m going to do a brief introduction, and then Melissa is going to do her talk. You can ask questions either in chat or Q&A. We’ll answer all the questions, but we’ll answer them when there is a comfortable break in the talk. This webinar is being recorded. A little bit about Data Umbrella: we are an inclusive community for underrepresented persons in data science, and we are a volunteer-run organization. Briefly about me, I’m a statistician and data scientist. I have a master’s in statistics and an MBA from NYU in business analytics, and I’m the founder of Data Umbrella. I am on Twitter, LinkedIn, and GitHub as Reshama S, so if you would like to connect with me, feel free to follow me. We have a code of conduct. The goal of our community is to be as welcoming, inclusive, and professional as possible. The code of conduct is linked on the sticky in the chat, so please adhere to it and contribute to making this a welcoming and friendly community for all. The code of conduct applies to the chat as well. There are various ways that you can support our organization. The primary one is to follow our code of conduct and contribute to the community. We have a Discord server where you can ask and answer general questions. You can share events and job postings there as well. We also have an initiative where we have transcripts for all of our video events, like this webinar right now. You can help us edit the transcripts that come from YouTube, which are in a raw format and need some editing to be more accurate. Another way that you can support us is to donate to our Open Collective. That is on opencollective.com, under Data Umbrella, and any donation you can make would be helpful and welcome to cover our operating costs. Data Umbrella is on all platforms.
Depending on the platform of your choice, you can search for us on Meetup. We’re on YouTube with a great video library. We have a job board and a newsletter which goes out once a month. And we also have a lot of resources on our website related to diversity, allyship, inclusive language, communities. Check it out. And the link to our Discord is on our website. We have a couple of really great playlists on YouTube. And one of them is contributing to the open source. We’ve had a series of open source webinars. NumPy, Scikit-learn, pandas, and core Python. So if you want to learn more about that, check out those videos. They’re really great. Career advice is always in demand. And so we have a playlist by three terrific speakers who have shared their insight on careers in tech and data science. So if that is something that is of interest to you, please check it out. And this is just a snippet of all the other events that we have done. So depending on your interest and what you’re looking for, check them out. We have a job board. It’s jobs@dataumbrella.org. And you can also subscribe to it for updates. So check that out as well. We have a highlighted job here, which is a cloud infrastructure engineer at Coiled. And it is a remote position. And you can find out more information about how to apply and the job description on our job board. And I will share some of these key links in the chat as well as soon as I finish presenting. And this is just a reiteration of– we have a lot of resources on our website. And our website is dataumbrella.org. And also another iteration of where you can find us. We are on LinkedIn. We’re on Facebook. We’re on Twitter. And the best place to find out about upcoming events is on our Meetup. We post it in different places, but Meetup is the first place to find upcoming events. We have an upcoming Scikit Learn sprint. The website is afme2021.dataumbrella.org. We do have a wait list at this time. But feel free to check it out. 
And we do have resources there also on contributing. There’s an upcoming event. It’s a community event. So it’s organized by Global Diversity CFP Day, which is on February 20. It was supposed to be last weekend, but was rescheduled due to some issues that came up. But if you would like to get started in speaking at a conference or a meetup or even just learning how to write CFPs, which is a call for proposal, check out these live streams. They have six live streams for each of the regions in the world. So they’re friendly to the time zone that you’re at. So check it out. And the event is free.

I’d like to introduce today’s speaker. I also just want to share briefly why this event was organized. I’ve attended and organized a bunch of scikit-learn sprints. I was working on a PR, a pull request, for documentation, and it was a mystery to me where these files were being produced. I saw the files produced, but I couldn’t see the code for it. I was searching on Google, and I realized Sphinx is super powerful, and also complex. And I thought, really, I want to learn more about it. So I went on a search to find someone who could speak about Sphinx. I went through three or four different people, and that finally led me to Melissa, who I actually have met. I just didn’t know that Melissa was the person to speak to about Sphinx. So I’m glad that I was referred to Melissa. And Melissa is joining us from Brazil.

00:05:21 Reshama introduces Melissa

Melissa is an applied mathematician and former university professor turned software engineer. She works at Quansight now. Her pronouns are she/her. She is also a LaTeX, Fortran, and free software enthusiast. Melissa is on GitHub and Twitter @melissawm, so check it out. And with that, I am going to turn off my camera and mic and hand over the screen to Melissa.

00:06:50 Melissa begins talk

Thank you, Reshama. I’m so happy to be here. I just want to point out that, as much as I am talking about Sphinx today, I don’t have the most profound knowledge about Sphinx. I think there are other people who could also speak to it. I just think it’s a great opportunity to share and to explain. And sometimes not understanding things is the best way to explain things to other people, because you find the same issues, you have the same troubles, and you find the same difficulties that other people might face when they start doing this. So thank you so much for the invitation. I’m really happy to be a part of Data Umbrella.

00:07:38 Tutorial Introduction

So I’m going to do a little intro to Sphinx for Python documentation. This is going to be in two parts. The first part is going to be more about Sphinx, how it’s built, and how to start to understand how it works. The second part will be more of an applied thing, where we’ll fix something in the NumPy documentation. It will be a trivial fix, and those are usually not encouraged in NumPy, just because we have so few maintainers and so much to do. But it’s just going to be an example. And if you are ever interested in doing a PR for documentation in NumPy, it can be a good example of how to start. So I have a few links there. The first one is for the slides, the actual slides that I’m showing you. You can access them with that link from hackmd.io. There’s a repo, which is called MinimalSphinx, where I have the tiniest example of a module and how you can use Sphinx to generate the documentation for that module. And then I have a link to the NumPy docs, where you can see the kind of documentation that we do.

00:09:01 How Melissa got involved with NumPy / Invitation to collaborate with NumPy documentation

So just to explain who I am and why I’m here, I’ve been working with the NumPy documentation since the beginning of 2020. I’ve been leading the documentation team with some people who are also here. I saw that Ross was here before. I don’t know if there’s other people from the docs team or from NumPy here. But the idea of the docs team is to concentrate our efforts into the documentation that we want to do for NumPy and have a larger vision of how we want this documentation to be improved. So everyone can be a part of the documentation team. You just have to show up. And if you are interested in working in the documentation for NumPy, you can join our open meetings that happen biweekly. So if you’re interested, just ask around. And you can check our contribute page for NumPy as well to see how to do that, how to go to our meetings.

00:10:01 Impostor Syndrome explanation

So first things first, I want to show this picture. Actually, the first time that I saw a picture similar to this one was with Reshama, because we were at the DISC Unconference in New York that was organized by NumFOCUS. For me, this picture was awesome because I could finally understand what I was feeling, and how I could reframe the way that I was thinking about things. This is about impostor syndrome, which is when you think that you know nothing and that people will finally realize that you’re not supposed to be here, or that you don’t know enough to do what you’re doing. In your mind, sometimes you have the picture on the left, which is a big circle of what I think that other people know, and inside there’s a smaller circle that says what I know. So you have an idea that other people know a lot of things, and you know nothing. But reality is actually closer to the picture on the right, which is a bunch of intersecting circles, each contributing a little bit of the information. So what I know and what other people know are actually complementary things. They work well together, and they are not subsets of each other. So the idea that you cannot contribute because you have nothing to say, or you have nothing to add to a project or to something that you want to work with, is usually false, because we each have different perspectives and we’ll each bring something to the table that wasn’t there before. So I strongly encourage you to think about contributing if you have the opportunity, if you have the time, and if you have the resources. Diverse communities make our projects better, and having those different points of view helps.

00:12:08 What is documentation? (Tutorials, How-to Guides, Explanation, Reference)

So let’s talk about documentation. What is documentation? It is funny, because sometimes when you talk to people who are not in an open source context, they have a different understanding of what documentation means. For example, for many people, documentation for a software project means the auto-generated module documentation or the API documentation. This is actually just a tiny piece of what we call documentation in an open source project and in larger projects. The picture that I’m showing you here is by Divio, which is a company that works with Django. They have someone called Daniele Procida, who organized the idea of a documentation system. He realized that, most of the time, we could divide the documentation for a software project into four parts, which are the four quadrants that you see in the picture. The API documentation, or the auto-generated documentation extracted from docstrings or from the interface of the functions or methods of your software project, is what we call reference documentation. It is mainly information about how to use the software project. But there are other kinds of documentation, which are learning-oriented, problem-oriented, or understanding-oriented. So what are those? If you’re thinking about tutorials, you would think about something educational in which you are trying to explain good practices or ideas and concepts that you want people to figure out while using your project. Tutorials are something that we see often in open source projects. But sometimes it’s hard to tell the difference between tutorials and how-to guides, which are the second kind of document that we’re going to talk about. How-to guides are supposed to be problem-oriented: just a series of steps that you would take to solve a problem.
So right now, we have three different kinds of documents. There is the reference documentation. There are tutorials, which are supposed to be educational content, with an underlying idea of teaching you processes, techniques, and best practices. And how-to guides are mostly: how do I solve this problem? You follow this step, and then this step, and then this step, and in the end you have a solution. Maybe you don’t understand every step, but that’s OK. Sometimes people call how-to guides tutorials, but they are something else. It’s interesting to know that you can have those two kinds of focus for documents. The last kind of document that we have is an explanation. So what is an explanation? It could be a longer-form document explaining historical developments for the software project, or why certain design decisions were made. Explanations are not always necessary, and not every project is going to need them, but often they are needed, they just don’t exist, and they need writing. I’m showing you this not to overwhelm you with information, but just to explain that there are different types of documentation. Especially if you’re looking to contribute to an open-source project, you might find yourself wanting to contribute to just one of those kinds of documents, and that’s OK. There are different options for how you can do this. I also have a link in the slides to NEP 44, which is a document that we wrote for NumPy. It expands a bit on the kinds of documentation and the things that we are looking at doing in NumPy, so if you want to know more, you can check that out. Or you can check the Divio website: if you search Google for Divio documentation, you’ll certainly find the documentation system explanation. It’s actually very, very good for whatever project you’re working on.

00:16:54 What is Sphinx?

So in that context, what is Sphinx? I showed you before four different kinds of documentation, and Sphinx, just to be clear, could be used to generate any of them. But usually, you think of Sphinx as a documentation generator in the sense that it extracts docstrings, which are like comments in the code that we leave in our Python modules, and turns them into HTML. Sphinx can take any plain-text source files and generate readable output. For our use case, you can think of it as a program that takes in those plain-text files in a special format, which is called reStructuredText, and outputs HTML. Of course, you can also output PDFs, and Sphinx can output EPUB and other kinds of documents. But usually, what we use in software projects is HTML, to be able to read the docs on the web. It basically needs a configuration file, which is a conf.py file, and that is already a hint about how Sphinx works: it’s basically a Python file with a bunch of dictionaries and lists inside, so it’s supposed to be readable as a Python file. And Sphinx is really extensible. It has a bunch of different extensions that you can add to it to do extremely powerful things, just like Reshama was saying before. The thing is that, exactly because it is so powerful, it can be overwhelming. So we’ll try to do something more contained, and not go too deep or talk about advanced usages of Sphinx, just because this is an introduction. But there is certainly much, much more if you search the internet for information on how to use Sphinx. So how does it work? On the slide I put “an empty project”, but I meant an empty documentation folder, because you will already have your project. Maybe you already have a module, or you just have a Python file with some functions written in it. It doesn’t have to be very complicated. If you have a project with a Python file, you can just install Sphinx.
So for example, you use pip install or you can use conda, whatever your package manager is.

00:18:00 configuration file: conf.py

You can initialize the configuration file using sphinx-quickstart. It will generate a bunch of stuff for you. It will ask for the project’s name, who the author is, and how you want to organize things. Usually, you can just choose the defaults, and it will create a sensible directory structure and organization for your project. You can edit your conf.py file, and any other files that you wish, to customize them to your liking. And then you can build the outputs of your documentation when you are satisfied. Usually, in the default configuration, you will run something like make html, because you want to generate HTML. You can find the generated documents under a folder called build, or _build, or something like this. For the defaults, that’s basically how it works. What is this file format that I mentioned to you? I said that the documentation, the plain-text files, would be written in a specific format called reStructuredText. After running sphinx-quickstart in a project, you get an index.rst file created.
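
As a rough sketch of what such a configuration file can look like, here is a minimal conf.py. The project name, author, release, and the ../src path are placeholders, not values taken from the talk's repo. The three extensions are the ones named later in the talk, and alabaster is the theme that sphinx-quickstart selects by default.

```python
# conf.py: a minimal, illustrative Sphinx configuration.
# All values below are placeholders; adjust them to your own project.
import os
import sys

# Make the package importable so that autodoc can find it.
# Assumption: the code lives in ../src relative to the docs folder.
sys.path.insert(0, os.path.abspath("../src"))

project = "Pokedex"   # hypothetical project name
author = "Your Name"  # hypothetical author
release = "0.1"

# Extensions are plain strings in a list.
extensions = [
    "sphinx.ext.autodoc",      # pull docstrings into the docs
    "sphinx.ext.doctest",      # run the ">>>" examples as tests
    "sphinx.ext.intersphinx",  # link to other projects' docs
]

# Maps a short name to the base URL of another project's documentation.
intersphinx_mapping = {
    "numpy": ("https://numpy.org/doc/stable/", None),
}

html_theme = "alabaster"
exclude_patterns = ["_build"]
```

Everything here is ordinary Python (strings, a list, a dictionary), which is why the file stays readable even before you know much about Sphinx.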

00:19:00 Getting started with sphinx (installation) (conf.py, index.rst, _build/html)

Sorry, I just saw the question in the chat: isn’t make a Linux command? When you install Sphinx, I believe that you will have that command available whatever system you’re on. It’s just supposed to compile the results into your HTML. I’m not completely sure, but I think it will work just the same on whatever operating system. Yes, Ross, thanks: make.bat for Windows. Make is actually a compilation command, so it’s not really a Linux or a Windows thing. It just compiles things for you. If you have a source and you want to compile the result, you can use make, which will read a file with the instructions about how to compile these things into the output that you want. So it will generate the HTML for you.

00:20:43 What is reStructuredText? (rst files)

Coming back to the reStructuredText format: after running sphinx-quickstart, an index.rst file is created. In order to show you this, I am actually going to go to the repo that I mentioned to you. If you want to see that, you can also go to github.com/melissawm/minimalsphinx. I’ll try and paste it in the chat so you can join. If you go there, there’s a big README with instructions about how to do this for your own project. I’ve done this already, so I can go to the docs folder that’s here, and I can see the index.rst file that was generated by Sphinx. If I open this file, I can see that it’s very simple. Let me check the raw format, because it’s easier to see what’s happening. So this is what I have.

00:22:55 pokemon

My module is actually a Pokedex, where I’m listing three Pokemon, which are these little monsters, creatures that were popular. I think they’re still popular. My son likes them. You can adapt this file to your liking, like it says here. The first part is actually a comment, so it won’t show up in your rendered documentation. You have a title here, which is “Welcome to Pokedex documentation”, and you can see there’s a bunch of equal signs under this text. This means that this is a title. When Sphinx reads this file to generate the HTML, it will figure out how to show the results accordingly. You have other things that I’m going to mention later, but this is the basic format of your index.rst. Going back to the presentation: reStructuredText is a markup syntax. You could see that there are little commands and things that you can use to get the generated content that you want. In that sense, it’s similar to Markdown, but you have to be careful. Because it is similar to Markdown, it’s easy to get things wrong, and this is a source of confusion for many people. For example, the standard reST inline markup is one asterisk for emphasis, which is italics, two asterisks for strong emphasis, or boldface, and backquotes for code samples. But you actually need two backquotes instead of one like you would use in Markdown. So you have to be careful about the little differences. It is possible to use Markdown with Sphinx, but I’m not going to go there at this point.
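
To make the markup rules above concrete, here is a small sketch that assembles a reST fragment as a Python string: a comment, a title underlined with equal signs, and the three kinds of inline markup. The wording of the fragment is illustrative and is not copied from the repo's index.rst. It is wrapped in Python only so that the underline can be computed from the title length.

```python
# Build a small reStructuredText fragment (illustrative content only).
TITLE = "Welcome to Pokedex's documentation"

RST_SOURCE = "\n".join([
    # A line starting with ".. " and no directive name is a comment.
    ".. This is a comment; it does not appear in the rendered page.",
    "",
    TITLE,
    # The underline must be at least as long as the title text.
    "=" * len(TITLE),
    "",
    "One asterisk gives *emphasis* (italics).",
    "Two asterisks give **strong emphasis** (boldface).",
    "Double backquotes mark ``code samples``; Markdown would use single ones.",
    "",
])

print(RST_SOURCE)
```

Saving the printed text into an .rst file and building it with Sphinx is one way to see how each piece of markup is rendered.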

00:24:25 Comparing reStructuredText (reST) to Markdown (md)

So the basic usage is with reStructuredText. If you have access to the slides, you can click the link where it says reStructuredText for the basics of the syntax and how you do other, more advanced things with reST, or reStructuredText. It is a very powerful syntax, because it will help you auto-generate the documentation, including the inline markup that I mentioned, but also custom content, powerful linking, and cross-referencing features. So it is pretty powerful. It can do a lot of things. And if you take the time to learn it, I promise it will help you generate your documentation. reST also implements directives, which are blocks of explicit markup that can have arguments, options, and content. I’m going back to the index.rst file now to show you. You can see, for example, this is what we call a directive. It’s a toctree directive, which is meant to generate a table of contents for this file. It has options: for example, maxdepth, which is the depth of nesting that you want to list in your table of contents, and the caption that you’re going to use for the table of contents. You can see the characteristics of this. For example, indentation matters, and you have to use consistent indentation for the compiled documents to generate properly. There are other things that it’s possible to do. For example, here you can see another kind of syntax, which is a list: the asterisks there mark the items of a list. And then you have a reference to another document that’s going to be called genindex. I don’t want to go into the details of all of these things right now, because it can be very overwhelming. But you can check and experiment with this in the MinimalSphinx repo. That’s why it’s there. You can check that out on your own time, experiment, and see what happens if you change one thing for another. We’ll see some concrete examples of this in the NumPy documentation.
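
The toctree directive and the list described above can be sketched as follows. The reST text is held in a Python string so it can be inspected or written to a file. The entry name api is a hypothetical document (an api.rst next to index.rst), and the rest follows the layout that sphinx-quickstart typically generates.

```python
from textwrap import dedent

# Illustrative body of an index.rst; "api" is a hypothetical document name.
INDEX_RST = dedent("""\
    .. toctree::
       :maxdepth: 2
       :caption: Contents:

       api

    Indices and tables
    ==================

    * :ref:`genindex`
    * :ref:`modindex`
    * :ref:`search`
    """)

# The options and entries are indented consistently under the directive.
# Changing that indentation changes how Sphinx parses the block.
print(INDEX_RST)
```

The directive name comes first, its options (maxdepth, caption) are indented below it, and the content (the list of documents) follows after a blank line at the same indentation.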
But you can check a nice summary of reST syntax and a longer reST primer in the links that I’m putting in the presentation. I actually didn’t give a historical explanation for this, but Sphinx was built by Python programmers and developers for Python projects, so it was meant to be specialized for Python. But it isn’t anymore. Some people outside Python are also using it. I’ve heard the Linux kernel developers are using Sphinx to document things as well. So it is pretty powerful. And it supports the inclusion of your docstrings with an extension called autodoc. The main idea is that you’ll have your module with your docstrings, and you can tell Sphinx: OK, now you need to extract those docstrings and put them in the generated HTML documentation. Let me just answer a question from the chat: do the docstrings have to be in reST? Yes, they should use a syntax that is compatible with what you’re doing in Sphinx. There are extensions that allow you to write the docstrings in other formats, for example, in Markdown, so you can check that if you prefer to use Markdown, as many people do. I think MyST will do that, which is to read the docstrings in other formats and output HTML anyway. Many of those things, for example, the extensions and how you want to extract things from your module, are going to be selected in the configuration file, which is the conf.py file. You can then document whole classes or even modules automatically using the :members: option for the auto directives. So this is an example. You would have a module called io, and if you want to document every class and every function inside this module, you just say :members:, which means document everything. OK. And just to mention a couple of other extensions that might be useful, you can do doctests.
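
As a sketch of what autodoc works from, here is a tiny module whose classes and methods carry docstrings. The module name pokedex_example and everything in it are hypothetical and only loosely modelled on the talk's demo. The automodule directive shown in the module docstring is what would go into an .rst file.

```python
"""pokedex_example: a tiny, hypothetical module documented with docstrings.

With ``sphinx.ext.autodoc`` listed in conf.py, an .rst file could pull in
every documented member of this module with:

    .. automodule:: pokedex_example
       :members:
"""


class StarterPokemon:
    """Base class for starter Pokemon.

    :param name: display name of the Pokemon
    :param evolutions: names of the forms it evolves into, in order
    """

    def __init__(self, name, evolutions):
        self.name = name
        self.evolutions = list(evolutions)

    def next_evolution(self):
        """Return the name of the first evolution, or ``None`` if there is none."""
        return self.evolutions[0] if self.evolutions else None


class Charmander(StarterPokemon):
    """A fire-type starter Pokemon."""

    def __init__(self):
        super().__init__("Charmander", ["Charmeleon", "Charizard"])
```

Because the docstrings are written in reST, fields such as :param name: are rendered as a formatted parameter list in the generated HTML.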
So I don’t know if this is something that you’re familiar with, but in the same way that you would test your code (for example, you write unit tests, and then you run your tests to see if your package is consistent and everything is working well), you can create doctests, which are little snippets of code that you put in your docstrings. You can then run those doctests as you generate the documentation, or with a separate command like this one, which is make doctest. It will check whether the examples in your documentation still produce the outputs that you wrote down. So this is great to guarantee that your documentation stays up to date, even if you change your API or the way you’re doing things in your package. If you have doctests, you can guarantee that the documentation is still relevant to the current version of your package. There’s also another super cool extension called intersphinx, which allows you to refer to other projects’ documentation labels easily by using their intersphinx mapping. I’m going to show it to you quickly later. But, for example, in the NumPy documentation, we refer to the SciPy documentation several times. Sometimes we mention something in Matplotlib. Sometimes we’ll mention something from pandas. And because they share their intersphinx mapping openly, we can actually check which functions, modules, and classes they have available. You can refer to those elements in your own documentation, such that the generated link will not be a link that you write yourself, but something provided by the other package. The advantage of this is that you don’t have to create those links manually yourself. Suppose that you are citing another project’s documentation, and there’s a new version, and the URL for that module or function changes. Then your reference is not valid anymore. If you’re using the intersphinx mappings, that doesn’t happen.
That is because the actual URL that you have to click to get to the documentation will be generated automatically by the intersphinx extension. So this is really useful if you’re citing other people and other projects. I’ll try and show you an example later if we have the time. Then I’m going to show you an example of the NumPy docs, but I want to go back a little to show you more about the minimalsphinx repo. So I’m going to open a console here. I’m on Linux, so it’s not going to be the same for you if you’re on another operating system. But I just wanted to show you quickly some things that we can do with Sphinx. This is the pokedex.py file that is in the minimalsphinx repo. It’s a very, very simple package, really, which only has some definitions of Pokemon, how you can call them, which Pokemon they evolve into, and stuff like that. So I just want to check out this repo. I think I’m going to do it differently, like this, so you can see better. There we go. OK. What we have here is a directory structure like this. There’s my README, the license file, and then there’s a source folder which has only one file in it, which is the one that’s open there, pokedex.py. All the other things that are there are actually auto-generated and don’t matter for our purposes. So the main file is pokedex.py. And if we go to the docs folder, there is a bunch of .rst files. There’s a make.bat file, a Makefile, and some other stuff. So I just wanted to show you how you would generate the documentation. My machine’s name is Ahsoka, as in Ahsoka Tano, because I’m a big Star Wars fan. I’m just going to run the make html command here. Like I mentioned, we already have a conf.py, which is the configuration file, so we already started Sphinx in this repo, and we can run make html to generate our documents. The results will be in _build/html, so you can actually access this with your browser.
If I go to file... HTML. Oops. Oh, docs. Sorry. There we go. So I just went to the folder that it pointed me to, and index.html should be my main file, which corresponds to the one that I showed you before. I showed you index.rst, which is this one, and Sphinx generated this HTML. So it’s much nicer. It does have a toctree with the contents that I told it to include. There’s an “Indices and tables” section. Some of those are irrelevant for our purposes here. But the idea is that you can use Sphinx to generate this nice documentation that you can actually read. Then, for example, I have my class starter Pokemon, which is like the parent class. I have another class for Bulbasaur and another one for Charmander. I don’t know how to pronounce those names in English. And I have another one for Squirtle. I can check that the docstrings are here, which are these big comments that I put at the beginning of each class, and I’ll see what was generated from them with Sphinx. So I want to see what’s up with the class Charmander. If I go here, to the generated documentation, and I go to API documentation, there are all the docstrings that I listed before, and they are properly formatted. You can see that for Charmander, for example, I put a note. This was done with this directive, note, where I said, this is something you have to be careful with. In my generated documentation, this becomes a highlighted block. So there’s a bunch of different options. This is just one theme, the default theme, but you can choose different themes if you want. This here is actually a doctest. If I go to the pokedex.py file, I’m going to show you how that’s done. Here it is. There’s a function called “Who is that Pokemon?”, and it will show the Pokemon’s name and its evolutions. I say, hey, Sphinx, this is a doctest, so you have to make sure that those commands that are preceded by the three greater-than signs are actual commands that are going to be executed.
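
A doctest of the kind described here can be sketched as follows. The function name, the data, and the output format are hypothetical stand-ins rather than the code from pokedex.py. In the talk the check is run through Sphinx with make doctest, whereas this sketch uses Python's standard doctest module so that it can be run on its own.

```python
import doctest

# Illustrative data; not taken from the talk's repo.
EVOLUTIONS = {"Bulbasaur": ["Ivysaur", "Venusaur"]}


def who_is_that_pokemon(name):
    """Print a Pokemon's name and its evolutions.

    The line starting with ``>>>`` is executed, and what it prints is
    compared with the line that follows it.

    >>> who_is_that_pokemon("Bulbasaur")
    Bulbasaur evolves into: Ivysaur, Venusaur
    """
    print(f"{name} evolves into: {', '.join(EVOLUTIONS[name])}")


if __name__ == "__main__":
    # Run every ">>>" example found in this module's docstrings.
    print(doctest.testmod())
```

If the Bulbasaur entry in EVOLUTIONS were shortened to only Ivysaur, the printed line would no longer match the docstring and the doctest would report a failure.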
And the output must be this one. So what I’m going to show you is that I’m going to make this wrong. So I’m going to delete all the evolutions of Bulbasaur and leave only Ivysaur. I’ll save my file. And I’ll come here and show you: if I do make doctest, it will say, whoops, test failed. I expected this because this is what’s listed in your doctest. But I actually got this output. So your doctest is wrong or your API is wrong. So you can check that. It’s a pretty nice feature if you have a big API and you want to make sure things are consistent between versions. This is all that I wanted to show you in the minimalsphinx repo. But if you want to check that out, please do. There’s other little things that I did, like linking to other places. And you can check that out. [END PLAYBACK]

OK. Should we be considering doctests versus something like pytest? I think those are complementary things. pytest is meant for testing your code, and doctests are for testing your documentation and how that’s done and your docstrings and things like that. I actually think you should be using both. I don’t know if pytest has ways of testing docstrings as well. I’m not sure.

OK. This concludes the first part, which was mainly the explanation about Sphinx and the idea of how to use that to document your code. And then I just want to show you an example from the NumPy docs because I actually am pretty excited to get new contributors. I want to show you how you can do this. And I know that it can be a bit overwhelming to start because NumPy is such a big project. And you don’t know which steps you should take. We actually have a how to contribute to NumPy documentation page in our docs. You can check that out by clicking this link. And I’m just going to show you here. We explain about the documentation team meetings and what you should know before you join. Or how you can contribute fixes and all that you need to know to contribute. So please check that out. 
And if you have feedback about that page, please let us know. It would be also super useful to be able to make that friendlier for people who want to join. So like I mentioned, we do have a documentation team. So if you’re unsure about how to contribute, you can always come and ask us for help. There will most certainly be people around to help you. And we’re happy to mentor people if they want to get started. So please just join. Don’t worry if English is not your first language. It’s not mine either. And we’ll make sure there are native speakers to make corrections or make sure things make sense. So you can contribute what you have. Start from where you are. And we can help you get there. There’s a couple ideas about how to contribute. You can open an issue requesting content that you think is missing. Maybe you lack an explanation or something. Or you would like to see a specific tutorial. You can just open an issue and suggest that that’s written. The more specific, the better. So if you can list, for example, the things that you would like to see in a document, that would be great. You can suggest video content as well. So we are looking into experimenting with that. So if you have suggestions for good video content that we could generate to help you or other people contribute or understand how NumPy works, let us know. And you can join our communications channel. So we mainly use Slack for our internal communication, meaning if you want to contribute and you want to know more and you want to get in touch with other people who are contributing to NumPy, you can just ask for an invitation for our Slack channel. We don’t vet people. You can just ask and we’ll add you. But it’s a nice place to ask questions privately if you prefer. If not, there’s GitHub and there’s our mailing list. So you can check our communications channel at numpy.org/contribute. OK, so I just want to finish with a pull request to the NumPy documentation. 
So I’m not actually going to do the pull request, but I’m going to show you how to get there. So I want to thank Fatma Trelasi because she’s the one who suggested this to me. There is a typo in our Quick Start guide. So I’m going to open this for you here. We have a Quick Start tutorial, which is like a high level explanation of things that you can do with NumPy, basic functions and methods that you can access, and things like that. And so if you go to Quick Start stacking arrays, which is one section of this Quick Start, there is a typo. So if you go here, I think I missed a link. The text is this one: in general, for arrays with more than two dimensions, hstack stacks along their second axes, vstack stacks along their first axes, and concatenate allows for an optional arguments, giving the number of the axis along which the concatenation should happen. So I think this word should not be here. So I’m going to just search for concatenate allows so that we can find this quickly. Here we go. This is the offending word. So we would like to remove that. We’ll just look into the files and do that.

However, because of things and how things are done, and actually because of... let me just clear my screen. Because of how NumPy is built and how the documentation is built, the documentation for NumPy includes those docstrings that I was mentioning before. This means that you have to have the whole source code for NumPy to be able to touch the docs. So we will actually do that now. So I’m just going to do the following. I’m going to split my screen so we can see the instructions on the left and the commands on the right. I hope that’s not too small. If it is, let me know, and I can maybe make that larger.

So first, we have to look at the NumPy repository on GitHub. [TYPING] So if you go there, you can fork this repository to your own. So for example, you can go here and fork this repository to your own account. So I already have a fork, so I’m not doing this step. 
But this should be your first step in case you’ve never contributed to NumPy before. It’s the plural arguments that should be changed to singular. That’s very possible. We can take a look at that. For the purpose of this presentation, it really doesn’t matter. It’s just more of an example of how to find the file and compile things. But yeah, I think maybe we should check that.

OK, so once you have NumPy forked into your own GitHub profile, you can just git clone git@github.com:<your-username>/numpy.git. This will create a folder called numpy in your own directory structure. So if you’re in your home folder, it will create it inside your home folder. If you’re in a subfolder, it will create it there. You’ll find it later. So let’s see it cloning. Cool. Now, because our final intent is to submit a PR, it’s nice to add a remote called upstream pointing to the main GitHub repo. And I’ll explain what this means in a bit. OK, I had not yet entered my own numpy folder. Now I am in numpy. Now I can add the remote. And I’ll show you what this means. I think this is too big. I’m going to try and make it like this. That’s better. So if you ask it what’s happening now, you have two separate sources for your code. You have origin, which is your own fork. And then you have upstream, which is the main GitHub repo. This is useful if you want to sync your repo to the original NumPy repo later, or if you want to submit your pull request directly. It’s interesting to have those two remotes. So origin is yours, and upstream is NumPy.

Now you can maybe check out a branch. So I’ll check out a branch called docfix. And this will be the branch that I’ll be working on to do the changes. If you’re not familiar with Git, maybe you can check out a Git tutorial. And this should be just a standard workflow for any PR. So there’s nothing different about NumPy here. So now is the time when we have to do our fix. Now that we have our developer environment set up, we can do our fix. 
So we can actually check for the file that we said we were going to change. So this is a Quick Start tutorial. And I don’t know if you can see here, because it’s very, very small. But it says numpy.org/doc/stable/user/quickstart. So that gives you an idea of where this file is. If you look inside the numpy folder here that you have downloaded and cloned from GitHub, it will have a doc folder. Inside the doc folder, you can see a source folder, which holds all the RST files that we mentioned before. And then because our document is under user, this should also be under user here. Now that I am inside the user folder, there’s a bunch of things, including my Quick Start document, which is here, quickstart.rst. So I’m going to check this with Nano just because it’s easier. You can do this with the editor that you prefer.

So in this file, Quick Start, there’s a bunch of things, including the phrase that we were mentioning we thought was not right. So it was under... is it shape manipulation? I think I can search for concatenate allows, just like we did before. Oh, no. Maybe there is a line break. Here you go. Concatenate allows for an optional arguments. Maybe we need to delete the s, like someone mentioned in the chat. So let’s do that: for an optional argument, giving the number of the axis along which the concatenation should happen. OK, so we did our fix. Let’s save this file. Let’s exit. And now I’m just going to go back to the root folder because I want to show you some other stuff. Now that we are here, we would like to build the documentation and see if our fix actually happened and if everything’s going fine, right?
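The lookup described above, going from the page URL to the .rst file in the cloned repository, can be sketched as a small helper. This is only an illustration under the assumption that a hand-written page at numpy.org/doc/<version>/... mirrors the layout under doc/source; the function name `doc_url_to_source_path` is hypothetical, and auto-generated API reference pages do not map to files this way.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def doc_url_to_source_path(url):
    """Guess the .rst source path for a hand-written NumPy docs page.

    Assumes URLs of the form https://numpy.org/doc/<version>/<page>, where
    <page> mirrors the layout under doc/source in the repository.
    """
    parts = PurePosixPath(urlparse(url).path).parts
    # parts looks like ('/', 'doc', 'stable', 'user', 'quickstart.html')
    if len(parts) < 4 or parts[1] != "doc":
        raise ValueError(f"not a recognised docs URL: {url}")
    page = PurePosixPath(*parts[3:])
    return PurePosixPath("doc", "source") / page.with_suffix(".rst")


# prints doc/source/user/quickstart.rst
print(doc_url_to_source_path("https://numpy.org/doc/stable/user/quickstart.html"))
```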

So we would have to set up our developer environment because we have never built NumPy before here. So what I’m going to do is create a virtual environment. You can use Conda if you want. It’s up to you what’s your preferred way of doing this. But it’s a good idea to use some kind of environment setup so you don’t have trouble with your operating system’s Python. So I’m going to activate my virtual environment. And I’m going to pip install with the file called doc requirements. So because NumPy is so big, we have two files, one called doc requirements and another called test requirements. And those two files are separate, with separate packages that you should install with their versions specified. So if you’re just going to build the documentation, you don’t necessarily have to install the test requirements. You only need them if you want to run the complete tests before submitting your PR, for example. We also have to install Cython because NumPy depends on it. So I’m going to install it now. We’re good to go.

So now we should compile NumPy. So I am going to copy this. And you can actually adapt this command to whatever is more convenient for you. So for example, for me, I have 12 cores in this machine. So to make it faster, I will use -j12. So while this is compiling, if you want to ask questions in the chat, I’m available to answer them. I think it’s going to be a couple of minutes while it compiles. Thanks, Ross, for being here. [AUDIO OUT]

So yeah, I just want to mention, if there are no questions, and even if there are, while you’re thinking about them, that if you have questions or if you don’t know how to approach this and you want to help NumPy or any other project, there will most certainly be people available to help you, either doing reviews for the PR or answering questions in forums or communications channels for the project. So for NumPy, I can assure you that there are many people available to answer questions. 
And nobody expects you to know everything from the get-go. So if you’re interested in contributing, you can always come and ask questions and figure out how you can best contribute. Is Cython CPython? I don’t know. I don’t think so. Obrigada, Stephanie. No, CPython is the Python implementation that most of us use, with an underlying C implementation.

OK, so I just want to mention that my compilation finished. So I have NumPy on my machine now, and I just have to build the docs. So to build the docs, I will go into the doc folder. There’s a bunch of things there. You can actually see a bunch of, for example, tests.rst.txt. You can open those files and see what they look like, if you want. But to compile, you just do the same command that I mentioned before, make html. So now we’re going to get something... OK, so because this is the first time that I was building, it asked me to make dist first before make html, just because I just rebuilt NumPy. So it was going to give me... yeah, this is something that we identified yesterday, and it was not supposed to happen here. But I’ll just fix it. Just give me a second. [TYPING] So this is not supposed to happen. I think this should work. [TYPING] OK, I’m going to try and fix this while you ask questions, if that’s OK. [TYPING] So you see, this is done by Ralf just this morning. Maybe I can actually fix this in a better way. [TYPING] Which is git fetch upstream. [TYPING] OK, so this should work, but I’ll have to rebuild. I’m sorry, folks. So we identified this yesterday, actually, which is an error in setup.py. So you can see the actual commit to fix this issue right here by Ralf. So we identified this yesterday, and I thought it was working already. So OK, we had already built this, so it doesn’t have much to do. So I’ll try and build the docs again. Yay, that works. So while it’s building, let’s go back to the questions, maybe. I’m usually... Charlie, you’re using Conda. That’s also a great idea. 
Me, personally, I prefer Conda, so I use that all the time. When presenting, I chose to use virtualenv because I realize that many people are mostly living in virtualenv and using pip. But if you want to use Conda, it should also work. Yes, that is the nature of live coding. And it’s also nice to see, for example, this error: I identified it the other day, and then I mentioned it in our Slack, and Ralf was patiently debugging with me for some time until we found the error and he fixed it. So that’s kind of how it goes when you find those errors.

Can Sphinx be used for other languages? I think it can. It probably depends on extensions of some sort. But yes, I think it’s pretty flexible right now. I know that there are some extensions for JavaScript or Matlab. I know that there’s an extension for Matlab.

Why did we build NumPy? OK, so like I’m doing now, make dist. This means that you want to generate documentation for the newest NumPy version. And so what we did is we created a virtualenv that didn’t have NumPy installed. We just cloned it. But then to build the documentation, you want to run the doctests, right? So you want to actually be able to import NumPy. So you have to build NumPy and install it before you can build the docs so that the doctests pass and NumPy can actually be imported. And I don’t think that’s a newbie question. I think that’s... I don’t agree with the newbie, like beginners, advanced, intermediate difference. I mention this all the time because each of us has our own experience. And maybe we are very experienced in genome sequencing. But we don’t know how to do that in Python. Does that mean we’re like a beginner? I don’t think so. I think we’re maybe a newcomer to this package. But that doesn’t mean we don’t have other things that we know a lot about. So this takes a while. I wish we could accelerate that, but I don’t think we can. So it wouldn’t have worked if we had installed it. 
It will usually complain about a different version like it did here. So I don’t think it would work. You have to have the source code to be able to build the docs. Yeah, I’m so sorry, folks. I don’t think that’s going to work. And it’s something that has been working before. Yeah, it will give me an error. And I think it’s just because we found this error recently. And it’s just bad luck that it’s not fixed right now. But it’s going to be soon, I promise. I don’t think I can extend this for too long. Otherwise, you cannot follow, actually. So yeah, the final step is this one, which is building the HTML docs and then submitting your pull request once you are satisfied with your changes.

I just wanted to leave a small note about Jupyter Notebooks. It’s going to be very, very short. It is possible to use Jupyter Notebooks inside your Sphinx-generated documentation. We are actually doing something like this in our NumPy tutorials repo. So in this repository, you can find some Jupyter Notebooks that have been converted using Sphinx to an HTML site. This is still in progress. So you can check that out. And it’s supposed to contain only tutorials. So thinking about the four quadrants of documentation, this will be only tutorials.

So I had some final thoughts, just mentioning how challenging Sphinx can be. It’s not supposed to be as challenging as I’m finding it now, just because there’s a recent error in the build for the documentation. But it is sometimes challenging. So it’s nice to have someone next to you who can help you understand what’s going on. If you prefer using only Markdown, you can check out MyST, like I mentioned before. There are many other interesting extensions for Sphinx. And for your own project, if you’re using Sphinx to build the documentation, you can use Read the Docs, which is an excellent way to serve those documents on the web. There’s also a bunch of interesting resources about documentation from the Read the Docs folks. 
So those links are all clickable. And you can check them out in the slides later. That’s it. I’m so sorry that I didn’t manage to build the docs in the end. And this is something that worked yesterday. So yeah, I’m sorry. But that’s the nature of live coding and the nature of open source projects as well, I have to say. Oh, Reshama, I think you’re muted. There I am. OK, yeah. I was going to say, yeah, you never know, because this ecosystem is constantly changing. And something that worked yesterday may need to be fixed today. So yes, that is the nature of it. It was a great presentation.

01:06:15 Q&A: Why do we build from source?

I did have a question for you about... I think the question about why do we build from source? And let me know if I’m interpreting this right, because I was wondering the same thing with scikit-learn, where I have a little bit more experience, which is there’s the stable version of the docs, which is by release. And then there’s the dev version, which is the development version. And so when we’re building the virtual environment, we don’t want to use our latest released version of the library. We want to use the one in development. Right? Yes, yes. It’s like it’s alive. It’s like things have been fixed, and they’re constantly being fixed. And I think that’s why we’re doing it, right?

Melissa:
Yes, exactly. And not only that, but also the idea of having the most recent changes to make sure you are not overriding someone else’s work, because that will happen if you don’t have the latest one. So this is why we keep our fork in sync, and we try to do that. Because of course, when you submit your changes, there will be continuous integration. There will be tests. There will be other ways of checking if you are not doing something that’s not following what the API says anymore or something like that. But yeah, you’re supposed to be working with the latest development version. Yeah, because one time I had actually opened up an issue, submitted a PR for a documentation problem. And I was looking at the stable version, and it had already been fixed in the development version. Once I understood that, I was like, oh, now I understand. Like I wasn’t looking at the final docs in the development version. So it was a good learning point for me as well. So I’m always amazed. Every time I work on– even if it’s like a small documentation fix, which is adding a line how much I learn about the whole process from one fairly simple PR. Yeah, absolutely. And I think it’s pretty amazing to work in a project like this one. NumPy is a huge project and an important one as well, which is sometimes– can be intimidating, right? Because you’re doing big changes, and you don’t know if that’s going to work. So it is interesting that you can find ways to contribute in a low emotional cost, low impact things first, if that’s something that concerns you. So it is nice to be able to touch the documentation and see how that goes, see the workflow, see how you interact with people, how the community behaves, and all that. So that’s interesting. That’s great. It would be cool to do something with NumPy that’s hands-on for some of our members. Yeah, I think that this is something that we can talk about. I’d be open to it. I think other people would as well. Yeah, it’s cool. 
I love seeing some of the parallels with Scikit-learn and some of the things that are done a bit differently. So yeah, it was a great learning experience. So I don’t see any more questions. But if anybody else does have any more questions, now is the time to post it in the chat. And just for people to know, I’m going to have the video up. I try to have it up within 24 hours on YouTube. I will take a lot of the links from the chat and put them in a transcript so people can easily access the links that have been posted in the chats as well. And yeah, I think that’s all that I have to say. Melissa, if you have anything else to say, thank you so much for doing this presentation. Yeah, I just want to apologize again for the build problem. But it should be fixed soon. Other than that, it was a great pleasure. And if you ever want me to come back, I’d be happy to. Absolutely. Thank you.

OK, now that we have done our fix, we can build the docs and see how that worked. So we can do make html inside the numpy/doc folder and see what happens. So this will use Sphinx to build all the files. And this might take a while exactly because it’s reading all of the sources, including the docstrings, including the auto-generated documentation, and the extra documents that we have written in RST format. So now that it’s finished reading sources, it’s getting to the end, actually. So it’s executing all the commands that we have inside our documentation. For example, for this tutorial SVD that you were looking at there, there is an image manipulation aspect. And so this clipping input data message that you’re seeing is actually coming from the code inside our documentation. So Sphinx is building everything, executing all the commands that are listed there, and now writing the output to HTML. So you can see that this really takes a while.

And once it’s all built, all the HTML is going to be generated inside the build folder, in the doc folder, in the root numpy folder. So we can go to our browser, in my case, home, Melissa, numpy, doc, and then you can see all the directory structure under there. So we’ll go to build/html. Remember that Quick Start, the file that we changed, was under user. So we can go to user/quickstart.html, and we’ll see our fix, fortunately. So if we go to concatenate allows, it will say: for an optional argument, giving the number of the axis along which the concatenation should happen. So now our fix is done, and we can submit our PR if we want to.
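The sentence being fixed describes how hstack, vstack and concatenate treat arrays with more than two dimensions. A small sketch of that behavior follows; the array shapes are arbitrary choices made for the illustration.

```python
import numpy as np

# Two 3-dimensional arrays with the same shape (values chosen arbitrarily).
a = np.zeros((2, 3, 4))
b = np.ones((2, 3, 4))

h = np.hstack((a, b))               # joins along the second axis (axis=1)
v = np.vstack((a, b))               # joins along the first axis (axis=0)
c = np.concatenate((a, b), axis=2)  # the optional argument picks the axis

print(h.shape, v.shape, c.shape)  # (2, 6, 4) (4, 3, 4) (2, 3, 8)
```

For these inputs, `np.hstack((a, b))` gives the same result as `np.concatenate((a, b), axis=1)`, and `np.vstack((a, b))` the same as `np.concatenate((a, b), axis=0)`.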

Contributing to the NumPy Documentation

Tue, 15 Jul 2025 07:01:35 +0000

Video: Contributing to the NumPy Documentation (~50 minutes)

Slides

About NumPy

NumPy is a fundamental, open-source Python library for N-dimensional array programming used extensively for data analysis and scientific programming. As a community-driven project, NumPy is mainly sustained by open-source contributions. This talk focuses on avenues of contribution to the project documentation, an integral part of the software.

NumPy Video Playlist

NumPy: Its History, Governance & How to Contribute (~60 min)

Live demo of contributing to NumPy (~25 min)

00:22:41 Live demo of contributing (setting up virtual environment, working on an issue, submit a pull request)

Sphinx for Python Documentation Tutorial (~75 min)

Live demo of contributing to NumPy (~15 min)

00:43:27 An example pull request to the NumPy documentation

Intro to NumPy Array Operations (~45 min)

Resources (+ all the links from the Slides)

Slides: Slides
NumPy: https://numpy.org
video: Intro to NumPy Array Operations
video: NumPy: Its History, Governance and How to Contribute
video: Sphinx for Python Documentation Tutorial
NumPy tutorials
NumPy tutorials on GitHub
Style Guide
NumPy development
Documentation as a way to build community
NumPy community (link to slack is on this page)
NumPy playlist: these videos have examples of contributing to the NumPy documentation
NEP 44 — Restructuring the NumPy documentation
Diátaxis - A systematic approach to technical documentation authoring
NumPy Contributor’s Comic
NumPy YouTube
NumPy Community
Setting up your development environment – Building the NumPy API and reference docs

Connect with the Speaker: Mukulika Pahari

Mukulika is a maintainer for the NumPy documentation and has been involved in the community since 2021. She is also studying to be an oceanographer and likes working on scientific software. She was a Technical Writer via Google Season of Docs in 2021.

LinkedIn: @mukulikapahari
GitHub: @Mukulikaa

Questions

If you have any questions on contributing to NumPy, there are various ways to contact the team here: Contributing

Video Outline

00:00 Data Umbrella introduction
03:15 Mukulika begins her presentation

[Note: The remaining timestamps will be completed at a later date.]

Intro to Zenodo - Advancing Open Science

Tue, 15 Jul 2025 05:01:35 +0000

Resources

Videos

Video Part 1 [110a]: The full tutorial Intro to Zenodo: Advancing Open Science (~75 minutes)

Video Part 2 [110b]: Step-by-step tutorial: Upload research files (documents, datasets, code, etc) to Zenodo (~20 minutes)

Video Part 3 [110c]: What are FAIR Principles? (~2 minutes)

About Zenodo

Zenodo is a general data repository where any research output (data, presentations, research articles, software, and much more!) can be shared and preserved for the long term, increasing their visibility and impact.

Zenodo is derived from Zenodotus, the first librarian of the Ancient Library of Alexandria and father of the first recorded use of metadata, a landmark in library history. It was launched in 2013 by CERN, which is The European Organization for Nuclear Research, and it is an intergovernmental organization that operates the largest particle physics laboratory in the world. It was built by researchers to ensure that anyone can join Open Science. The repository welcomes research from all over the world and all disciplines. Zenodo does not impose any requirements on format, size, access restrictions or licence. “Quite literally they wish there to be no reason for researchers not to share!”

The benefits of Zenodo include

Being accessible: It is free (up to 50GB per upload)
Helping researchers receive credit by making the research results citable and, through OpenAIRE, integrating them into existing reporting lines to funding
Providing a Digital Object Identifier (DOI), which is a globally unique persistent identifier for your record and is important for discovery systems to attribute citations correctly
Preserving knowledge: deleted website content (Research shows 25% of web pages posted between 2013 and 2023 have vanished.)
Sharing content accelerates research, which supports Open Science and reproducibility principles

Connect with the Speaker: Esther Plomp

LinkedIn: @estherplomp
GitHub: @estherplomp

Video Outline

00:00 Introduce Esther
01:33 Why use Zenodo?
02:37 Preserve research outputs
03:07 What is a DOI, Digital Object Identifier
04:24 Make research outputs citable
04:51 Disclaimer of “Data available upon request” does not mean data is obtainable
06:52 All the reasons to use Zenodo
08:30 A background on Zenodo (history, location, storage types)
10:34 Zenodo is open source, a general data repository
11:10 FAIR Principles: Findable, Accessible, Interoperable, Reusable
12:34 Licenses
15:32 Using Zenodo over supplementary materials
18:50 How do I use Zenodo?
20:20 Start step-by-step walk through of Zenodo: how to add (or upload) a file
29:15 Q: If you were to upload a paper, dataset and code, would they have the same or separate Digital Object Identifier (DOI)?
43:30 Viewing, editing and making changes to your uploads
45:50 Publishing and versioning uploaded files
50:40 Zenodo Sandbox: place for testing uploads (https://help.zenodo.org/docs/get-star…, https://sandbox.zenodo.org/)
52:02 How to use Zenodo and GitHub
1:01:42 Using Zenodo in research articles
1:04:29 Using Zenodo to share presentations
1:11:50 Resources
1:12:48 Q: Where can we find the images/illustrations used in your presentation?
1:16:16 Q: How does Zenodo prevent people from uploading spam?
1:17:25 Thank you!

Full transcript of Esther Plomp’s Zenodo Tutorial

00:00 Introduce Esther

Reshama:
Hello, welcome to today’s talk. Today’s presentation is Intro to Zenodo by Esther Plomp. Esther is an open science enthusiast and contributes to a more equitable way of knowledge generation, facilitating others and working more transparently. She currently works as a postdoc/research developer at the University of Aruba and is working on an eScience Fellowship project on tracking research objects other than peer-reviewed articles, such as software, as well as a Software Sustainability Institute fellowship on facilitating contributions to open source science communities with a focus on The Turing Way.

Esther particularly cares about open data in the field of isotope archaeology, and she also does some advisory board stuff for open science communities on the side. You can find Esther on LinkedIn. Welcome Esther.

Esther:
Thank you so much for the introduction and thank you so much for letting me talk about Zenodo on the Data Umbrella Series, which I’ve been following for quite some time. So it’s been very exciting that I’m now finally a part of this whole series, so thank you very much. As mentioned, I’ll be talking about Zenodo, why, what and how. So we’ll dive into those details right now. And I like to start any talk by discussing why Zenodo and also perhaps why me. I don’t necessarily work for Zenodo, I don’t represent Zenodo, I’m just a frequent user of Zenodo and a big fan, so that’s why I’m talking about this. I’m sure other people can also talk a lot about Zenodo, but I’ll just show you why I’m using Zenodo and how I’m using that.

01:33 Why use Zenodo?

But we’re first going to talk about why Zenodo in the first place. And that’s because we don’t like broken links, or at least I don’t like broken links. And I’m sure everyone has encountered them when you’re looking for something on the internet and perhaps when you’re looking for something that you’ve shared in the past. And a 404 page is really not what you hope to find when you’re looking for something, and particularly not when you’re looking for some of the research data that you’ve been working on for six years. So we don’t like broken links. And I think as scientists it’s really important that we consider where and how we’re sharing our research in a manner that is more persistent than in our heads, behind paywalls or on a computer that can actually crash at some point, and then we lose access to all of this data and all of this knowledge. And researchers should really consider this more carefully sometimes.

03:07 What is a DOI, Digital Object Identifier

And a solution to no more crashing computers and knowledge locked up in our head is a digital object identifier. If we share this knowledge online and assign it a digital object identifier, the knowledge will be persistently available because a digital object identifier is a way to persistently make things available on the internet as well as uniquely because every output will have a unique identifier. And we see this already a lot in use for journal articles, but you can actually assign it to any data sets or any research outputs. So data sets, software, preprints, presentations like the one that I’m showing you now. That one actually has a persistent identifier at the bottom of the screen where it also says “zenodo” because my presentation is on Zenodo. And these types of DOIs or digital object identifiers, I will probably say DOI a lot throughout the presentation, but this is what I’m referring to. These DOIs are persistent and unique and they avoid these 404 pages on the web so that you can actually find whatever it is that you’re looking for.
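To make the idea of a persistent, unique identifier a little more concrete, here is a minimal sketch of turning a DOI into the link that resolves it through doi.org. The function name `doi_to_url` and the loose pattern check are assumptions made for this illustration, and the DOI used is a made-up placeholder in Zenodo’s `10.5281/zenodo.<number>` style, not the DOI of this presentation.

```python
import re
from urllib.parse import quote

# Loose shape check only (an assumption for this sketch): "10.", a numeric
# registrant code, a slash, then a non-empty suffix without whitespace.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def doi_to_url(doi):
    """Build the doi.org link for a DOI string."""
    doi = doi.strip()
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"does not look like a DOI: {doi!r}")
    # Percent-encode unusual characters but keep the slash separators.
    return "https://doi.org/" + quote(doi, safe="/")


# Placeholder DOI for illustration only.
print(doi_to_url("10.5281/zenodo.1234567"))
```

The link keeps pointing at the record even if the page behind it moves, which is what avoids the 404 pages mentioned above.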

04:24 Make research outputs citable

Another benefit of these DOIs is that it makes research outputs citable. You can see that for publications again we can keep track of who is using what research thanks to these DOIs because you can do it in an automated way because all of this is machine readable. So now you can also, for example in the image, keep track of how software is used by software citations. And another reason why these DOIs are amazing is because it actually makes research outputs available.

04:51 Disclaimer of “Data available upon request” does not mean data is obtainable

We have probably encountered at some point the quote “data will be available upon request”, whether in research articles of others or in complaints on various social media channels from scientists or researchers where they complain that data will be available upon request but not really. There’s actually research being done into this where people are indicating that they did request the data and then concluded that they actually didn’t receive any responses to these data requests for 41% of the data requests. So there’s research in 2021, which is not great. And earlier research in 2014 already indicated that data availability decreases 17% per year and they actually make a very bold statement that says research data cannot be reliably preserved by individual researchers. That sounds a bit harsh but these are not the only two studies that are being done on the topic. So data is actually not available upon request. These types of requests are only successful for 38% of the time and there’s actually lots of research being done about that. And as a fellow researcher I just felt the need to list all of these resources. So this is not something I am making up. This has been researched. Feel free to look up some of these references. We’re not going to go into detail right now because we actually want to get to the points where what do we do?
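To get a feeling for what a 17% yearly decline adds up to, here is a rough Python sketch. It treats the figure quoted in the talk as a simple compounding rate on availability, which is an assumption made only for this illustration; the cited study may define the figure differently, and the function name is made up.

```python
def fraction_still_available(years: int, yearly_decline: float = 0.17) -> float:
    """Fraction of datasets still obtainable after `years`.

    Assumes the decline compounds every year at a constant rate.
    This is a simplification for illustration, not the study's model.
    """
    if years < 0:
        raise ValueError("years must be non-negative")
    return (1.0 - yearly_decline) ** years


for years in (1, 5, 10, 20):
    share = fraction_still_available(years)
    print(f"after {years:2d} years: {share:.1%} still available")
```

Under this simplified reading, the share drops quickly within the first decade, which is the argument for not relying on individual researchers to keep data around.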

06:52 All the reasons to use Zenodo

If research data cannot be reliably preserved by individual researchers, what are we to do? And that actually is where Zenodo comes in. Zenodo actually does all of this for you. It’s making your life a lot easier because it ensures that outputs are stored persistently, that they become citable, and that they’re available on this platform. So it’s an amazing platform and I’ll explain a little bit about why you should use Zenodo in their own words. So this section is actually copy/pasted from their website or their platform.

Zenodo is safe. Your research is safely stored for the future in CERN’s data center. I’ll talk a little bit about what CERN is in a bit. It’s trusted. Again, citable. Every upload is assigned this digital object identifier that I talked about in one of the first slides. So everything is citable, trackable. There is no waiting time. So your upload is made available as soon as you hit publish, and then your DOI is available within seconds. Sometimes it takes a little bit longer. But generally, this is almost instant. You can make it available open or closed. So it’s also for more sensitive data if you put it on restricted access. You can also version data sets or files. So you can upload new versions and indicate what modifications you’ve made. There is a GitHub integration and we’ll talk about that later as well. And you can see a bit about usage statistics if you use this platform. So that’s why Zenodo.

08:30 A background on Zenodo (history, location, storage types)

Now I want to go into a little bit more about what Zenodo is. So Zenodo is built and developed by researchers with the main aim to ensure that everyone can participate in open science. It was launched in 2013 already. So it recently had its 10th anniversary, which is why you see this nice logo with the candles on the slides. And it’s managed by CERN, which I mentioned before, which is the European Organization for Nuclear Research. And that institute is based in Switzerland, so Europe. And the name is actually amazing. It’s derived from Zenodotus, who was the first librarian of the ancient Library of Alexandria and the father of the first recorded use of metadata. And that might not tell you a lot, but as, like, an almost-librarian, this is amazing. It’s a really great name for the platform that they offer. And Zenodo is very inclusive in the sense that it’s welcoming research from all over the world and from every discipline. So it’s a general data repository. And the only requirement is that the material needs to be necessary to understand the scholarly process. So perhaps it’s not suited to put your cat videos on there unless you’re studying your cats and you got ethical approval to do that. But anything related to the scholarly process is very welcome.

It’s also free for uploads up to 50 gigabytes. And they are also open for debate or discussion about bigger uploads. So you can also reach out to them if that is needed. And Zenodo also doesn’t impose any requirements on the formats, the size, the access restriction, so you can share it openly or closed, or the license. And we’ll talk a bit about licenses later. So it’s very much up to you what you’re doing on Zenodo.

10:34 Zenodo is open source, a general data repository

And I would just also like to highlight that Zenodo is actually open source. So the code is open source. It’s built on Invenio, which is also open source. And everything is shared openly on GitHub. And so they also invite contributions to the platform. And so that’s another amazing thing about Zenodo. So what is Zenodo? Zenodo is a general data repository. And a data repository is a place where digital objects, such as research objects, can be stored and shared with others. And it’s in compliance with the FAIR principles.

11:10 FAIR Principles: Findable, Accessible, Interoperable, Reusable

And if today is the first time you’ve heard about FAIR, I’ll briefly explain it. It’s an acronym for Findable, Accessible, Interoperable, and Reusable. So it doesn’t have a lot to do with fair in the sense of ethics, or fair as in equal, and so forth. But it’s an acronym for these terms. And I’ll explain very briefly what each of them means. So data is findable when it has descriptive metadata, so information about the data, as well as this persistent identifier, such as the DOI that we discussed earlier. It is accessible when it is openly available, or when there is an authentication process or procedure in place, so that people can’t necessarily access it immediately, but they eventually can access it if they fulfill the requirements. So that’s also the restricted access option on Zenodo. And data is interoperable when you can integrate it with other data or applications and workflows. And it’s reusable when it’s shared with sufficient documentation explaining what the data is about, as well as a license.
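The four criteria as described here can be written down as a rough self-check. The Python sketch below is only an illustration of the explanation in the talk: the `Record` class, its field names, and the rules in `fair_report` are hypothetical simplifications, not an official FAIR assessment and not Zenodo’s data model.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical description of one research output."""
    title: str
    description: str = ""        # descriptive metadata about the data
    doi: str = ""                # persistent identifier
    access: str = "public"       # "public" or "restricted"
    access_procedure: str = ""   # how to request access when restricted
    open_format: bool = False    # usable with other data, tools, workflows
    license_id: str = ""         # what re-users are allowed to do
    has_documentation: bool = False


def fair_report(record: Record) -> dict[str, bool]:
    """Apply the talk's four one-line criteria to a record."""
    return {
        "findable": bool(record.doi and record.description),
        "accessible": record.access == "public" or bool(record.access_procedure),
        "interoperable": record.open_format,
        "reusable": bool(record.license_id) and record.has_documentation,
    }


example = Record(
    title="test one",
    description="This is a test deposit.",
    doi="10.5281/zenodo.0000000",  # placeholder, not a real DOI
    open_format=True,
    license_id="CC-BY-4.0",
    has_documentation=True,
)
print(fair_report(example))
```

A record without a DOI or description already fails the findable check, which is the same argument made later against supplementary materials.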

12:34 Licenses

And I mentioned license before, and I think I owe you a bit of an explanation, because it’s not always clear what a license is, but a license is a formalized agreement of what re-users can do with data and software. And so if something is openly available on the internet, it doesn’t actually mean that you can just use that for any purpose that you like. It’s actually quite the opposite. If the output doesn’t have any license, it means that all of the copyright is still with the original owner, and you would actually need to ask them for permission to do whatever it is that you’d like with the research objects. And so licenses are a great way around having to email everyone and explaining in detail what they can and can’t do with data and software. And instead, you tell them that from the start, so people immediately know what they can do with the research output. And I won’t go into too much detail, but software and data have different licenses, which you can choose from. And I’ve listed a software license chooser and a data license chooser, or the Open Data Commons, which you can explore to see what license best fits your needs. And for that, you always need to consider whether there are any requirements placed upon you. If you are really free to choose a license, you can choose any of the pre-existing licenses. But perhaps you have a funder that wants you to choose a specific license, or a collaborator with a very strong preference for a particular license. And in that case, you should probably listen to that as well, instead of just making your own choice. So that’s licenses.

And data repositories help with choosing these licenses and with making data FAIR, because they generally assign these persistent identifiers. They have metadata fields, some of them required. So you actually need to provide some information about the data before you share it. They provide a record or a landing page that people can actually access in order to get access to the research outputs. So they really support making all of these research outputs FAIR. So data repositories are great. And also just a disclaimer that we’re talking about Zenodo today. But Zenodo is not the only data repository available. So sometimes it’s more helpful to use a discipline-specific repository, because then that’s more specific for the type of data that you’re working with. So you can find out about other data repositories by, for example, checking out re3data or FAIRsharing. But today we’re focusing on Zenodo.

15:32 Using Zenodo over supplementary materials

And before we go more into depth about why Zenodo and how to use it, I just want to go into a question that I frequently get asked by researchers. Why is it not sufficient to just put all of the data and software in the supplementary materials of my research article? Because then it’s available, right? And there’s even a license associated with it because it’s the same license as the research article. And the first rebuttal to that is that not all research outputs are always associated with a research article. So data repositories can also be used when you don’t necessarily have a research article, but you just want to share a small subset of the data or a script that really helps you, but you can’t really write a software article about it.

Sometimes the publisher actually requires you to use a data repository instead of the supplementary materials. And another thing about supplementary materials is that you actually give up a lot of the control that you have over data. Because just like publications, once you have an update, you can’t really update the publication anymore. So then you would need to write a new article, or you need to make a new data article, etc. So you can’t really update that yourself.

And I will show later how you can actually update an existing research output on Zenodo as well. Then a more of a data stewards answer to this question is that, yeah, research is not just about research articles. Data and code are also primary research outputs. So they really shouldn’t be hidden away in the supplementary materials. And this is quite literally hidden away sometimes. Particularly when articles are published behind the paywall, it’s very difficult to then also get access to the supplementary materials. And that’s not great for availability, and particularly not because supplementary materials themselves don’t have this persistent identifier assigned to them in the majority of the cases. So that means that it’s actually liable to being lost or to have broken links in order to access supplementary materials. And sometimes it is also not great to use the supplementary materials if there are restrictions in place about which file formats or which sizes you’re allowed to use. So sometimes data repositories like Zenodo are more inclusive of all of these types of file formats. And so that would be better to use in some of these cases.

And if you’re still not convinced, I would say that it’s not in accordance with FAIR principles primarily because of this, there’s no persistent identifier associated with the supplementary materials. So it already fails at the findable part of FAIR. Yeah, so plenty of reasons to not use the supplementary materials for this. Although I suppose that using the supplementary materials is still better than saying data is available upon request. All right.

18:50 How do I use Zenodo?

Now that we’ve discussed a bit about why and what, I want to go into how do I actually use Zenodo. And for that, I’ll show this on Zenodo itself as well in a bit. But you do need to make an account, which they made very easy. You can sign up with an existing GitHub account, with ORCID, or you can also sign up with an email, so you have some options. I would personally really recommend that, if you haven’t yet and you’re a researcher, you set up an ORCID ID, because this is a persistent identifier for you as a researcher. And it makes logging into a lot of scholarly systems, platforms, etc., a lot easier for you. So it’s not just making your research outputs more findable for yourself, but also very much improving your life in terms of accessing all of these systems. So I’ll sign in using ORCID in a bit. But do sign up yourself for an ORCID if you haven’t done so yet. All right. Then I’ll show you how to create a new item. So I’m going to stop sharing my presentation and hopefully still sharing my screen. Let’s see. Can I see that? I think I’ll just re-share, just to be sure that you actually see my browser.

So this is the homepage of Zenodo. So here you see, again, this block about why should you use Zenodo. So that’s the exact same information. I just copy-pasted that in my presentation. And here you can see recent uploads. So if I click on this, I’ve never clicked on this before, but apparently this is something about Brazilian flora. So this is a dataset I assume about flora, which is very exciting. We won’t look into that too much, but this is actually already a very nice example about how you can actually version these different versions of this research data, actually. You see this is version number seven already. So they’ve been working on this for quite some time, I assume. And you can actually browse in between all of these versions, which is very nice. And then Zenodo will also let you know, like, hey, there’s a newer version of this record available, but you can still access all of the previous versions as well. So here’s the older dataset. I can go back in time. They named it exactly the same. You can see all of the versions that will take you to a different page. But if I scroll down, let’s see, when did they start the project? 2014 August. Very nice. And so you can download each of these versions. If you press the download button, sometimes there’s also previews, then you can see that a little bit more. But this is how a record looks like once you’ve published it on Zenodo. And you can see that you can also see how many times people viewed it or how many times people downloaded it. So it’s quite a nice way to keep track of what it is that is happening to your research outputs.

Reshama:
Esther, in the views and downloads, what’s under show more details?

Esther:
Let’s go there. A little bit more detail about this version, apparently, which is very cool that you can distinguish if the views are for this particular version or whether it’s accumulated. Very cool. We can also maybe press this one. So here they explain a little bit more about what they actually track. I’ll skip that for now. I think it’s just indeed a unique view, by a visiting person or a robot. But yeah, it doesn’t work to reload the page of your own research objects hundreds of times and then see what happens. So for example, if I refresh now, then that doesn’t mean that the views go up. So it’s a little bit more robust than just me refreshing the page a hundred times. Yeah. All right. So this is how it looks on Zenodo. And right. Let me log in first. So log in is here. And if this is your first time, sign up; this is the sign up page as mentioned. So you can see that I’ve used both an email address from the data stewards at TU Delft, where I no longer work, unfortunately, so I shouldn’t access that, and ORCID. So I will use ORCID instead. So this is what the sign up page looks like. Because I already signed up, I’m going to go back to Zenodo and use the login button instead. And so here you see, again, very similarly, I can sign in with ORCID, GitHub, or an email address. But I’ll use ORCID since I’m no longer a data steward at Delft. So here you see my ORCID. So that’s a number string. And I’ll use that to sign into ORCID. And then I’m signed in to Zenodo. And you can see my username here or my email address.
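The idea that reloading a page a hundred times does not inflate the counter can be sketched as follows. This is not Zenodo’s actual tracking logic; the one-hour window, the visitor identifiers, and the function name are assumptions chosen only to show how repeated views by the same visitor can be collapsed into one.

```python
from datetime import datetime, timedelta

# Hypothetical rule: repeat visits by the same visitor within this window
# count as a single view. The real platform may use different rules.
WINDOW = timedelta(hours=1)


def count_unique_views(events):
    """Count views, ignoring quick repeat visits by the same visitor.

    `events` is an iterable of (visitor_id, timestamp) pairs in any order.
    """
    last_counted = {}
    total = 0
    for visitor, timestamp in sorted(events, key=lambda event: event[1]):
        previous = last_counted.get(visitor)
        if previous is None or timestamp - previous >= WINDOW:
            total += 1
            last_counted[visitor] = timestamp
    return total


start = datetime(2025, 2, 14, 12, 0)
# One visitor refreshing 100 times within two minutes, plus one other visitor.
refreshes = [("visitor-a", start + timedelta(seconds=i)) for i in range(100)]
events = refreshes + [("visitor-b", start)]
print(count_unique_views(events))
```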

20:20 Start step-by-step walk through of Zenodo: how to add (or upload) a file

And in order to make a new upload, I’m pressing this plus sign. I’m selecting new upload. And then, once it loads, you see this interface. And what you see here is that you can drag and drop files or use the computer or the, how do you say that? Not sure. You browse in your computer to get to the file. So it’s quite easy to get a file onto Zenodo. And they start with some basic information. And what you’ll notice is that some of these basic information questions have red asterisks behind them. And that means that these fields are mandatory. So if you try to publish something on Zenodo while not filling out these fields, Zenodo will let you know, like, hey, this is a mandatory field, you should fill this in. And here you can see the question, digital object identifier, do you already have a DOI for this upload? So sometimes when you share a version of a research article, for example, it could be possible that you already have a DOI. And in that case, it’s better to use the existing DOI because otherwise we have multiple DOIs for the same research output. And that’s not very helpful because then you have multiple things in multiple places. And it will all get very messy. It’s also more difficult to keep track of how things are reused. So then you can copy paste the existing DOI here. Or in the majority of the cases when I’m using Zenodo, I say, no, I need a DOI because I don’t have one. And then I can press this button here for getting a DOI now.

And I am not actually going to upload this file. But you can just request a DOI. And that will also be visible in the browser where you see the same number as is listed here. So now it’s already saving what it is that I’m doing. And I can use this DOI and it will be stable until I delete this record. And then, yeah, someone else can probably reuse my DOI if I delete it. But nothing is published yet. So nothing is official yet. But we’ve reserved this DOI. I’ll get back to that at the end of the presentation as well, how you can use that. Then it’s asking about resource type. So you can upload a data set, event type, related information, images, lessons, presentations, physical objects, lots of options here. Lots of different types of publications that you can also share here. And if you’re not sure, there’s always the option other. So I’m going to go for that right now. And for my title, I’m just going to say test one, which is a very uninformative title. And as a data steward, I should warn you that this is not the best title to use.

But just for you to see what Zenodo looks like, I’ll use test one. The publication date, it automatically sets it to today. You can change that. So it uses the year, month, and day format. I will keep this the same for testing purposes. And then it is asking about creators. So here I can add my own name. So that’s Plomp and Esther. If I use my ORCID, I would need to copy paste that. And you can add your affiliation, as well as your role in the construction of the research output. So for example, I’m the data manager of this research output. So I’ll skip my ORCID for now because I unfortunately still don’t know it off the top of my head. But I would recommend you to add your ORCID here for actual purposes of creating a Zenodo output. And so that’s what it looks like now. So under creators, it lists myself, and you can just add creators. There’s also a save and add another creator in this add creator form. So you can add lots of creators, which is very helpful if you’re working together with a lot of people.

29:15 Q: If you were to upload a paper, dataset and code, would they have the same or separate Digital Object Identifier (DOI)?

Reshama:
Now, I had a quick question. If you were to upload, say, a paper, data sets, and code, would they each have a separate DOI or would it be under one?

Esther:
It depends a little bit. I would recommend you to have separate DOIs for data sets and code unless it’s… Yes, it’s really dependent on your own preferences. So your publication gets assigned a DOI anyway, so it’s important to keep that the same for the publication. But I would assign a different DOI for, for example, the data and software together if the software is very closely related to the data and it’s underlying a research article. And you don’t really plan for the software to be reused in other purposes, if that makes sense. So then it can make more sense to put data and software together. But in some of the cases, it makes more sense to also split those apart also because I mentioned that data and software use different licenses. So then you would need to attach two licenses to the same upload if you’re uploading them together. So it can actually be easier to upload them separately so that you can share the data, provide context about the data, choose a data license, and then for the same for software that you…

We will go into how, especially if you also share it via GitHub, it might be easier to have the software as a separate output, if that makes sense. And then it is important to link them all together and make sure that all of these DOIs are referred to in all of these separate uploads, and that you can actually do using Zenodo as well. If you scroll down a little bit to the related works, you can also add, for example, that this research output is supplemented by another research output. And so you can enter the DOI here. I’m not really sure which DOI that is, but it’s a DOI. So we’ll add that here for the scheme. And then here, I can also indicate what it is. Is it a presentation? Is it software? Is it data, et cetera? So then you can put that all together and everything is linking to each other. And that’s also a very important thing for your research article. And since you’ve mentioned that, maybe I should go to that slide for the presentation already. Yeah, no, let’s do the upload form first and I’ll get back to that.

How you link them all together in the research article as well, I’ll come back to. But you can link them all together on Zenodo. So that’s in the related works section. I hope that makes sense. Please do interrupt me if I’m not making any sense. Scrolling.

Reshama:
Yeah. No, I was going to say, that’s a great explanation. And I see now that the datasets can be connected to the paper. So thank you.

Esther:
Yeah, yeah, exactly. So that’s actually where my eScience fellowship project comes in. It’s a lot of frustration about how people do not connect all of these different research outputs. And then it becomes really difficult to find all of your research outputs, especially from an institutional perspective, then we have no idea what it is that you’re doing. And we cannot count it. But also for people trying to find the data underlying a publication. If you don’t actually share in the publication where you’ve stored the data, where you’re sharing the data, then it’s almost as bad as data available upon request because then people need to search and look for these datasets everywhere. So it’s very important indeed to bring that all together to increase findability of these datasets yourself. Yeah. So back to the upload form. So the basic information is the section which is most important. So we’ve already set up the digital object identifier, the resource type, the title, the publication date, and the creator, which is myself. So those are the mandatory fields. So those you need to fill out. But then it’s also very helpful to add a short description. So for example, this is a test deposit. So that’s a very short description, not super helpful. But hopefully I’ll remember in six months’ time that this is my test deposit.
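The mandatory fields listed here (the ones with a red asterisk) can be pictured as a small check that runs before publishing. The Python sketch below is only a model of that behaviour: the field names and the `missing_required` helper are made up for this example and are not the names used by Zenodo’s form or API.

```python
# Hypothetical names for the fields the talk lists as mandatory.
REQUIRED_FIELDS = ("doi", "resource_type", "title", "publication_date", "creators")


def missing_required(draft: dict) -> list[str]:
    """Return the mandatory fields that are absent or empty in a draft."""
    return [name for name in REQUIRED_FIELDS if not draft.get(name)]


draft = {
    "doi": "10.5281/zenodo.0000000",  # placeholder for a reserved DOI
    "resource_type": "other",
    "title": "test one",
    "publication_date": "2025-02-14",
    "creators": [{"name": "Plomp, Esther", "role": "data manager"}],
    "description": "This is a test deposit.",  # optional but helpful
}

problems = missing_required(draft)
if problems:
    print("Please fill in:", ", ".join(problems))
else:
    print("All mandatory fields are filled in.")
```

Optional fields such as the description are deliberately left out of the check, mirroring the point that only the asterisked fields block publication.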

And then what you see here is the license section. And so what you see here is Creative Commons Attribution 4.0 International. And that means that anyone who comes across this work can then redistribute and reuse this work as long as the original creator is appropriately credited. And what that means is just that you cite the original work, you refer to it in the redistribution. So it is quite similar to a research paper. You don’t just take someone else’s conclusions and claim them as your own, you refer to the original publication where they mentioned this first, or at least you should. So that’s this particular license. So they already assign you a license. That doesn’t mean that you need to keep this license. It’s one of my favorite licenses. So I’ll keep it for now. But you can press the edit button. And here you can choose between several licenses. And all of the ones starting with Creative Commons are actually data licenses. And all of these other ones, Apache and MIT, for example, those are for software. And so in these descriptions, you also see a bit about what it is that they require you to do and whether it is for software or not. So for example, this one from Creative Commons actually says not recommended for software. So yeah, different licenses for data and for software. So I’ll keep it as it is for now, because we can do a whole different presentation about licenses. But I’ll skip that for now.
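The rule of thumb from this part of the demo, Creative Commons for data and licenses such as MIT or Apache for software, can be expressed as a small check. This Python sketch is a simplification for illustration: the license identifiers are SPDX-style short names, the two sets are deliberately tiny, and `license_warning` is a hypothetical helper, not legal advice or a Zenodo feature.

```python
from typing import Optional

# Small illustrative sets using SPDX-style identifiers; real choosers list many more.
DATA_LICENSES = {"CC-BY-4.0", "CC0-1.0"}
SOFTWARE_LICENSES = {"MIT", "Apache-2.0"}


def license_warning(kind: str, license_id: str) -> Optional[str]:
    """Return a warning when a license looks mismatched with the upload type.

    `kind` is "data" or "software". Returns None when nothing looks off.
    """
    if kind not in ("data", "software"):
        raise ValueError(f"unknown upload kind: {kind!r}")
    if kind == "software" and license_id.startswith("CC-BY"):
        return f"{license_id} is not recommended for software"
    if kind == "data" and license_id in SOFTWARE_LICENSES:
        return f"{license_id} is a software license; consider a data license"
    return None


print(license_warning("software", "CC-BY-4.0"))
print(license_warning("data", "CC-BY-4.0"))
```

This also shows why uploading data and software separately can be simpler: each upload then carries exactly one license that fits its type.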

For now, I’ll continue in the upload form a little bit, because they let you add a lot more extra information. And so particularly keywords and subjects are very important. If people are searching with the search button on Zenodo, etc., it helps that you actually put some information here. So for example, FAIR, I press enter in order to get that shown up as a keyword. I add a little bit of data management things. Again, enter. And then you can also select pre-existing keywords. Language can be helpful. The majority of my outputs are in English. So I’ll select English for this, even though it’s a test output. So you can basically select any language. And I can select a date. I’ll just use the date which I already used. So that will be the 14th of February, 2025. And it is available on this date. I can add a version. So for most of the outputs, that might not be relevant. But again, if, like with the Brazilian flora data set, you need to assign a particular version to the data set, you can also do it here. And the publisher is Zenodo, or if you are sharing an existing research article, then here you might have to change the publisher. You can add information about funding. So if you have a funding agency, I recommend you add that information here.

It can also be done using persistent identifiers nowadays. There are alternative identifiers, if your research output also has different DOIs associated with it. We can link it with related works, which we just did. So you can say that your data set is a part of a research article, or it’s part of software, or the data set is another version of a data set. So there are lots of relations between these DOIs that you can have. You can also add any references that your research output has separately here. That’s recommended because then it becomes machine readable.
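The related works section can be thought of as a set of typed links between DOIs. The Python sketch below shows one way to make sure both uploads point at each other, so "everything is linking to each other". The relation names are modeled on the DataCite relation-type vocabulary; the `RelatedWork` class, the `link_both_ways` helper, and the DOIs are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass

# Relation names modeled on DataCite relation types, paired with their inverses.
INVERSE = {
    "IsSupplementedBy": "IsSupplementTo",
    "IsSupplementTo": "IsSupplementedBy",
    "IsPartOf": "HasPart",
    "HasPart": "IsPartOf",
    "IsNewVersionOf": "IsPreviousVersionOf",
    "IsPreviousVersionOf": "IsNewVersionOf",
}


@dataclass(frozen=True)
class RelatedWork:
    """One entry in a record's related works list (hypothetical model)."""
    relation: str
    identifier: str
    scheme: str = "doi"


def link_both_ways(source_doi: str, relation: str, target_doi: str) -> dict:
    """Return the entry each record should carry so the link is reciprocal."""
    if relation not in INVERSE:
        raise ValueError(f"unknown relation: {relation!r}")
    return {
        source_doi: RelatedWork(relation, target_doi),
        target_doi: RelatedWork(INVERSE[relation], source_doi),
    }


# Placeholder DOIs for a dataset and the code that supplements it.
links = link_both_ways(
    "10.5281/zenodo.1111111", "IsSupplementedBy", "10.5281/zenodo.2222222"
)
for doi, entry in links.items():
    print(doi, "->", entry.relation, entry.identifier)
```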

Whereas if you put it in a PDF file, Zenodo does not take that up and then makes that automatically available in a machine readable way. So publishers do this for you. Zenodo does not do this for you. So it’s helpful to do this yourself. Here they particularly ask about software. So you can link to a repository, which language you’re using, what the status is of the software project. But also the publishing information. If this is for a journal, you can fill out all of these details regarding where it’s published, which volume, what is the book chapter, etc. And if it’s a thesis, which university it’s been awarded for. And it also means that if this is not relevant, so you can’t have both information about software and about publishing, etc., you don’t need to fill it out because there’s no red asterisks behind it. So if any of this is not relevant for you, you can just skip it. So for the test, I’m also going to skip it. Conference, if it’s part of a conference. And you can also add more information about which fields you are working in.

So I’ll skip that for now, because there are so many field options. But there’s really an opportunity for you to fill this out in as much detail as possible, which will enhance the findability of your research. All right.

Reshama:
A quick question. Can you go back in, for example, if we forget to put in the conference that the content was presented at, can you go back and add it later?

Esther:
Yeah. Yeah. So, yeah, no, it’s great. So if you want to fill that out later, you can actually do that. So the only thing that you can’t edit once it’s been published is actually the file that you’ve uploaded. So if you want to make any changes there, you can. But then it will be a new version of the record. But all of the metadata information, that you can actually change later. So it’s not too bad in that sense. I can show you for a different output in a bit. For now, I will just save this draft. And I will show... I already didn’t do some things. Yeah. So I screwed something up with the date. Please provide a valid date or an interval. So I probably should have put an ending in that date field. And I didn’t upload any files, which I won’t be doing. So it’s saved, but it has some errors. So Zenodo will let you know whenever you need to improve some of your upload information. But for now, because I want to move on to making changes and making sure that you see how this looks, I need to fix that before I can do the preview. Let me see if I can remove the dates. I think the issue with the date is the formatting, I think. But I haven’t done this before. So I don’t know.
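The error message seen here, "Please provide a valid date or an interval", suggests the field accepts either a single year-month-day date or a start and end date. The Python sketch below checks input along those lines. The slash-separated interval notation is an assumption borrowed from ISO 8601 style intervals, and `parse_date_or_interval` is a made-up helper, not the validation Zenodo actually runs.

```python
from datetime import date


def parse_date_or_interval(text: str) -> tuple:
    """Parse 'YYYY-MM-DD' or 'YYYY-MM-DD/YYYY-MM-DD' into date objects.

    The slash-separated interval form is an assumption for this sketch.
    Raises ValueError for anything else.
    """
    parts = text.strip().split("/")
    if len(parts) not in (1, 2):
        raise ValueError("expected a single date or start/end")
    # date.fromisoformat raises ValueError on malformed input.
    parsed = tuple(date.fromisoformat(part) for part in parts)
    if len(parsed) == 2 and parsed[0] > parsed[1]:
        raise ValueError("interval start is after its end")
    return parsed


for candidate in ("2025-02-14", "2025-02-14/2025-03-01", "14 February 2025"):
    try:
        print(candidate, "->", parse_date_or_interval(candidate))
    except ValueError as error:
        print(candidate, "-> invalid:", error)
```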

I’ll remove it for now and I’ll see if I can get to the preview. No. I need to upload a file. Okay. I will – well, we already looked at an existing upload. And then I can show you what it looks like if you try to update it. And as mentioned, visibility. Also, red asterisk. It’s on public automatically. But you can also put it on restricted. So – or you can apply an embargo and then you need to fill out the date in the correct format until the embargo lasts. And then it can – I think it will be automatically lifted. But similarly to other changes, you can also extend the embargo if you change that date. Yeah. Normally, you can make your research outputs publicly available. So I’ll select this for now. I’ll save it. What that looks like in your dashboards. So moving to my dashboard now. Something like this. So I already have some uploads. So it looks a bit more populated than if you’ve never used it. So this is where you can go back to your uploads. So this is my test one, which is a terrible name again.
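The three visibility options mentioned here (public, restricted, embargoed) can be summarised in a few lines. In this Python sketch the embargo is lifted automatically once the date is reached, which matches what the speaker expects but is not confirmed in the talk; the option names, the choice to open the files on the embargo date itself, and the `files_visible` function are all assumptions for illustration.

```python
from datetime import date
from typing import Optional


def files_visible(visibility: str, embargo_until: Optional[date], today: date) -> bool:
    """Decide whether anyone can see the files of a published record.

    Assumes an embargo lifts automatically on its end date (unconfirmed).
    """
    if visibility == "public":
        return True
    if visibility == "restricted":
        return False
    if visibility == "embargoed":
        if embargo_until is None:
            raise ValueError("an embargo needs an end date")
        return today >= embargo_until
    raise ValueError(f"unknown visibility: {visibility!r}")


embargo_end = date(2025, 6, 1)  # made-up embargo date
print(files_visible("embargoed", embargo_end, date(2025, 2, 14)))
print(files_visible("embargoed", embargo_end, date(2025, 6, 2)))
```

Extending the embargo, as described in the talk, would simply mean editing `embargo_end` to a later date before it is reached.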

But if I click on it, I can then go back and make any changes. And again, this is a draft. So no one else will see this. Here you can see there is a draft. Going back to my dashboard. Then here you will, for example, see one of my more recent uploads. A data management plan and section templates for the faculty that I worked at. And so here, if I want to edit any information. So for example, this link is not clickable. That’s not super helpful. So let me see if I can edit that using the edit button. So only I can do that because I created this record. So this is not something that you can do for other people’s outputs. Because that would be a mess. So we’ll select edit. And then we’ll go into the same form. But here you can see I’ve uploaded some files. And all of this information is already filled out. Now let’s see if we can actually make this link clickable. So I’m pasting the URL here. And I want a new window for that. I press save. And then now that should be clickable. So I think that’s the only edit that I want to make. I’ll just press save draft. And then I’ll do publish. Let’s see what happens. And here you see this warning, which we’ll see again at some point as well. Which is asking you whether you’re sure you want to publish this record. Because once the record is published, you will no longer be able to change the files in the upload. So again, the files. Yes.

Reshama:
Can we see the preview first before you publish?

Esther:
Yeah. Let’s see if it works for this one. Yeah. So no longer be able to change the files. So that’s what it’s warning me about. Let’s go to preview, indeed.

So here it says I’m in preview mode. You are previewing changes that have not yet been published. So it looks almost the same as the page that it would have normally. And that we previously accessed. But here you can already see this is the change. Now if I click this link, it’s actually a link. And people don’t have to copy paste that. So it’s displaying the change that I’ve just made. And everything else just, yeah, it looks the same as it would look like whenever you’ve published it. Let’s go back to the edits. All right. Okay. So try to publish this now. Again, the warning. No longer be able to change the files. However, you will still be able to update the record’s metadata later. And that’s exactly what we did. We updated the record’s metadata. So if we now publish, then here we can see this change in the link. And again, it looks almost the same as the preview version. Other than that here, you can now see the added new version button. And this is still the same version. So it’s still version one. Because I didn’t touch any of the files. So you can just very easily adjust any of that information without it having an effect on the version of your file. So it will have an effect if you update the files. Let’s see if I can get to one of my, so I’m moving to my dashboard again. So this looks different whenever you upload a new version of the files in your uploads. And so here you can see the date and the version. So I have a lot of version ones in all of my files.

But here we have a three. So I’m going to assume that this has three versions. I know. I mentioned that this is a version three. Because it’s actually based upon a previously set up guidance document. Which I should hopefully also indicate in the related works. Yes. So I indicate it here. This is actually an alternative version of a different output. Which is also on Zenodo. So this is how you link them together. And if I click through to this. Then this is version one. I actually called my previous version version three. Because, yeah, we made multiple iterations of this. So this is how you can version things. I did that because it’s quite different. And the contributor authors were quite different. So that’s why I made a separate upload. But going back to my dashboards, there should also be an upload where I use the same uploads. Let’s see.

Yeah. So as you see, the majority of the times you just have one version of your output. Particularly if this is slides. Aha. But this one has four versions. So we’ll go to this one for now. So this is a Data Carpentry lesson on spreadsheets. I’ve done that a couple of times. So I have multiple versions. So version one looks like this. And I use completely different slides, I think, in this version if it loads. Yeah. So completely different slides. Quite old. Old picture, et cetera. And each of these versions, as you’ll see, has a different DOI. So you can link directly to this different DOI for each of the versions that you have. And if you use that DOI, you get directed to that particular version. But it’s also possible to use this overarching DOI. Like an umbrella DOI of some sorts. And that’s listed here. So if you want to cite all of the versions, you can use this DOI. So for example, if we use that one, you get automatically redirected to the latest version of the record. Also, if I now upload a new version, version five, then it will automatically go to that one. But if I want to link to a particular version, because I want to mention this version to my course participants for this particular course in 2021, then I should be using one of the older DOIs. And you can just copy that link from these different versions. And you’ll also see that I used these links in the presentations themselves. So here I didn’t use the overarching DOI link. But for example, for a data set, that might make more sense. Or again, up to your personal preferences. So that’s what it looks like when you edit things and you actually edit your file and you get a new version. All right.
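The difference between per-version DOIs and the overarching DOI described above can be pictured with a small toy model. This is only a sketch of the behavior as described in the talk, not Zenodo's actual implementation; the class name `VersionedRecord` and the DOIs below are made-up placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedRecord:
    """Toy model: one overarching ("umbrella") DOI plus one DOI per version."""

    concept_doi: str  # hypothetical overarching DOI covering all versions
    version_dois: list[str] = field(default_factory=list)  # oldest first

    def add_version(self, doi: str) -> None:
        """Register the DOI of a newly published version."""
        self.version_dois.append(doi)

    def resolve(self, doi: str) -> str:
        """Return the version DOI that a given DOI leads to."""
        if doi == self.concept_doi:
            if not self.version_dois:
                raise LookupError("no published versions yet")
            # The overarching DOI always follows the latest version.
            return self.version_dois[-1]
        if doi in self.version_dois:
            # A version DOI stays pinned to that exact version.
            return doi
        raise LookupError(f"unknown DOI: {doi}")


# Placeholder DOIs for illustration only.
record = VersionedRecord("10.1234/example.100")
record.add_version("10.1234/example.101")  # version one
record.add_version("10.1234/example.102")  # version two
latest = record.resolve("10.1234/example.100")
pinned = record.resolve("10.1234/example.101")
```

Here `latest` points at the newest version and keeps moving as versions are added, while `pinned` keeps pointing at version one, which is the choice to make when citing the exact slides used in one particular course.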

Yeah, let me check my presentation in terms of did I cover everything that I wanted to cover. I think you can see my slides as well. We created a new item. We did the basic information. Right. So now I showed you a test upload. And I – let me see if I can stop sharing and also reshare this screen. So I showed you a test upload that I didn’t upload. And I made an edit and an existing work that was actually already uploaded. But I didn’t share a new one yet.

The Sandbox Environment

And I can imagine that it’s also very daunting to upload something for the very first time if you’ve never used Zenodo. So there is a way around this. Zenodo has a sandbox environment, actually, which you can also use. So I won’t be using that for today’s webinar. But if you’re very scared about putting something online permanently for the very first time, you can use sandbox.zenodo.org. And just have a go and see how it feels. But when you go to the sandbox, it’s also saying that it’s only for testing purposes. So anything you do on there will actually not be assigned an actual DOI. And no one will be able to see that, which is the whole purpose of the test service. But so if you then want to really upload something, you would have to do it again on the real Zenodo version of the platform, the actual version. Yeah, this is a nice way for you to get started if you’re really scared about putting something out there online. Okay, let’s see. Yes.

GitHub and Zenodo

So the next thing that I wanted to discuss briefly is how do you actually use Zenodo and GitHub together if you want to share some software. And that is important to do because again, putting software on GitHub alone is not following the FAIR principles in the sense that GitHub does not assign a DOI to your research output. And it’s also not a great place for long-term preservation, because you can decide to just delete your GitHub repository at any moment, or perhaps Microsoft will decide that your repository has some type of word that they don’t like in there, and then they might decide that they will delete it for you. So it’s important to use GitHub alongside a data repository to ensure long-term preservation of your software. And so it’s quite easy to use Zenodo with GitHub. I’ve linked to a couple of guides and a video that also explain it in more detail if you want to see that. But we’ll go over it a little bit in this webinar as well.

So for you to use GitHub and Zenodo integration, you would first need to link your accounts. So in the menu, and let’s see. Yeah, I’ll share my screen again. Sorry about that. I pressed this game. So I will show you how that works on Zenodo as well. Let’s see. So if I now go to the top right menu, so where your profile or your email is listed, you can see linked accounts here as one of the options, I go there. And then for me, you will already see that I linked my accounts to GitHub and ORCID. So I’m connected with both of these services. Not with OpenAIRE, I’m not actually sure what that actually does. So I’ll skip that for now. But I’m connected with GitHub and ORCID. But if you do this for the very first time, you will need to select the service. So I, for example, if I do that for OpenAIRE, I’ll probably get directed to their platform, and then I’ll need to indicate or log in in order to integrate my accounts there. Yeah, so I should sign in with GitHub or ORCID, for example. I won’t do that for now. But this is how it would look like for GitHub as well. So you will be redirected to GitHub. And you need to make sure that everything is connected. Before you can now go to the menu, select GitHub. So again, menu GitHub. And this is how it looks like once you’ve linked your accounts. So you see this big header where it says get started, flip the switch, create a release and get the badge. So this is their three-step process that you can use with Zenodo. So what they mean with flip the switch.

So here you see a couple of repositories of mine that have already been published on both GitHub and Zenodo. So if I flip this switch on or off, that means that if I then publish a release on GitHub, so you don’t actually use Zenodo for that, but you use GitHub to publish a release of a new version, then it automatically will get pushed to Zenodo. So for example, if I now publish a new version of this figures, and the data on GitHub, then it will also automatically publish a new version on Zenodo because I have this switch flipped on, I can also turn that off at any point in time. And this could be because you don’t want all of the versions on GitHub archived on Zenodo. And again, that’s up to your personal preferences. If you don’t want 50 versions of your software on Zenodo and only major changes, then you can flip it off. And then at any point, you can flip that back on. So you’re fully in control of that. Here, for example, you see some of my repositories below which are not switched on. And I can then switch those on at any point. And then whenever I’m reloading the page, it says, please reload the page. Then they will be listed here. And so here you can see that this one is not yet published on Zenodo because it doesn’t have this badge yet. Whereas the other repositories do have this badge.

I’ll turn that off for now because that’s not something I would like to put on Zenodo. But yeah, you get a badge. And how that looks like on GitHub. So going to this repository on GitHub. So this is the homepage of the repository. So here you see again, this DOI badge that I put there. So you do need to do that manually. But you will see that. Yeah. It has one release. So I literally only wrote a release for this repository. So it would go on GitHub. So if you go there on releases, let’s see, should be able to make a new one. We go back to releases. If I draft now a new release, I can put a release title, write some descriptions about changes that I didn’t make. So we’re not actually going to do this. And then here I can publish the release. And then any changes that I’ve made in this GitHub repository will also be pushed to Zenodo. Right now I don’t have a really good example for code. So I won’t be doing that.

But that’s how that would work on GitHub. So you need to go to the releases and draft a new release. But for now, I won’t be doing that. Because I don’t have a new release of this code because this is my PhD research. And I hope to never touch that again. No, just kidding. But yeah, in order to get this badge on your readme page, you can use markdown text to just add the badge to your readme. And Zenodo is very kind with that. You can actually, if I go to the Zenodo page of this output, so here’s what it looks like on Zenodo. So it doesn’t provide a lot of information. So this you need to edit the Zenodo page yourself. But this is what it looks like from GitHub. So all of the files from GitHub are there. And yeah, and it looks quite similar to the deposits that we’ve already seen. You see me as a creator, etc. But in order to get this nice DOI button, we can go a little bit below. Here you see this DOI button. If you press on it, then here you see the markdown text. So you don’t need to remember anything. You can just go to this DOI button, copy paste the markdown text or just the URL or the HTML output if that is needed. Copy paste that and then put it into your readme file. And then you have this nice button.

On your GitHub repository. And then because I apparently did very well, I also made the citation file. So here you can see that adequate citation. I actually did this very well. I’m impressed by 2021 Esther. So I’m pasting the URL, for example, but also the DOI from Zenodo. So here for the citation, it’s very important to add in the DOI because otherwise it becomes less trackable. And all of the author information is there as well. And if you want to learn more about CITATION.cff, there’s more information on The Turing Way as well. But if you create a new file in your repository, let’s see. Yes, I’m logged in. Great. If you name it CITATION.cff, yeah, it should provide you with a template and then you can insert that example and then you can fill out all of that information. And then here they also use a template Zenodo DOI, but that’s how you then use that by copy-pasting your own DOI in there. And they also pre-fill some of the information so you don’t even have to copy-paste everything. Which is probably why that URL is there for me. Right. So I think that’s all I wanted to show for GitHub and Zenodo.
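The two README-related steps above, the DOI badge and the citation file, can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the speaker's own files: the badge URL pattern is assumed to match what Zenodo's DOI button dialog offers (copying the Markdown from that dialog remains the safer route), the field names follow the Citation File Format 1.2.0, and the title, author, and DOI are made-up placeholders.

```python
def zenodo_badge_markdown(doi: str) -> str:
    """Return Markdown for a DOI badge linking to the DOI landing page.

    Assumption: the badge image lives at zenodo.org/badge/DOI/<doi>.svg.
    """
    badge_url = f"https://zenodo.org/badge/DOI/{doi}.svg"
    return f"[![DOI]({badge_url})](https://doi.org/{doi})"


def citation_cff(title, authors, doi, version, date_released) -> str:
    """Build the text of a minimal CITATION.cff file.

    `authors` is a list of (family_name, given_name) tuples.
    Values are inserted as-is, so they must not contain double quotes.
    """
    lines = [
        "cff-version: 1.2.0",
        'message: "If you use this software, please cite it as below."',
        f'title: "{title}"',
        "authors:",
    ]
    for family, given in authors:
        lines.append(f'  - family-names: "{family}"')
        lines.append(f'    given-names: "{given}"')
    lines.append(f"doi: {doi}")  # the DOI keeps the citation trackable
    lines.append(f'version: "{version}"')
    lines.append(f"date-released: {date_released}")
    return "\n".join(lines) + "\n"


# Placeholder values for illustration only.
badge = zenodo_badge_markdown("10.1234/example.1")
cff_text = citation_cff(
    title="Example analysis code",
    authors=[("Doe", "Jane")],
    doi="10.1234/example.1",
    version="1.0.0",
    date_released="2021-01-01",
)
```

The `badge` string would be pasted into the README, and `cff_text` saved as `CITATION.cff` in the repository root.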

Yes. So the slides also contain a bit of examples of how that then looks on Zenodo, et cetera, because we actually just went through the whole process. I won’t be going over these slides, but just for your reference, I also describe again how you can copy-paste this DOI button. But for now, I wanted to go to the point made earlier that it’s very important to link all of these research objects together. And so hopefully you still see my screen and the presentation. But yeah, how do you then actually link all of these data and codes and research article together, because this is all over the place and very confusing. I referred earlier to The Turing Way about the CITATION.cff file, but The Turing Way also has some information about how do you now link all of these research objects together. So you can go there for a bit more detail. But the best way to do this is to do things.

1:01:42 Using Zenodo in research articles

Publications nowadays have a data availability statement or software availability statement, where you can say, well, the data underlying the research article is available on Zenodo, and here’s the DOI. And then it becomes very important that you also put that DOI to your data set in the references, because otherwise we cannot automatically pick up that you shared the data and we can’t automatically link from the publication data to the other research outputs. So please, please cite your data and your software outputs.

Also, if you reuse other people’s data and software outputs. So it’s very important to do that. And what that looks like. So this is a very nice example, where they say, actually, we don’t have additional data available for the article. So all of the data underlying the results are available as part of the article, which is perfectly fine. Not every research article needs data, but they do have a great software availability statement. So here they say that software is available from the Bioconductor platform. It’s a package. The source code is available from a GitHub link. And they archive the source code at the time of the publication. So they really indicate this particular version of the code on GitHub is available on Zenodo. So what we just did, we released a particular version on Zenodo. And then they cite it because the 26th at the end is actually the reference to their own software. So it’s both available in this software availability statement, as well as in the references. So this is perfect. This is more than perfect because they also list the license. So now you know exactly what it is that you can do with the software, which is a lot because the MIT license doesn’t place a lot of restrictions on software use. So yeah, you can have a look at this example on the link from the DOI in the slide. And yeah, please, please, please do link all of your research outputs together.

And then one of the last things that I wanted to share with you is how do I use Zenodo to share the presentations? And this is also something that I’m currently doing and what I would like to show again. So we reserved the DOI when we were in the upload form in the basic information. This is one of the first things you do actually, where I said, no, I do not have a DOI yet. I need one. And I press the button, get a DOI now. You saw on the upload form how it magically appeared. And then I can use that DOI. So what that means, and what you also saw, is in that upload form, I didn’t need to put any files in there. So Zenodo complained a little bit about there’s no file here and we can’t preview it. But you don’t need to put a file in the upload form in order to already get that DOI. So that means that you can copy paste that DOI and put it into your slides or put that DOI into your research article before you publish the data and the codes. So that’s why reserving this DOI works really great to combine all of these research outputs, but particularly for presentations. And just a note that I wanted to make, because in the presentation, you also see just the DOI and not the link in front of it. So if you just copy paste the DOI and not the URL, it’s not a link. But I don’t want to put the https://doi.org/ in front of all of these numbers, because then it’s longer, it takes up more space in my presentation. So I embed that link into the DOI.
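The point about a bare DOI not being a link can be made concrete with a small helper. This is a minimal sketch; the function names `doi_to_url` and `doi_markdown_link` are hypothetical, and the DOI is a placeholder. The only fact relied on is that prefixing a DOI with `https://doi.org/` yields a resolvable link.

```python
def doi_to_url(doi: str) -> str:
    """Turn a bare DOI (or one already prefixed) into a resolvable URL."""
    doi = doi.strip()
    # Strip common prefixes so the function accepts several input styles.
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    if not doi.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    return "https://doi.org/" + doi


def doi_markdown_link(doi: str) -> str:
    """Show only the short DOI as link text, with the full URL embedded."""
    url = doi_to_url(doi)
    short = url[len("https://doi.org/"):]
    return f"[{short}]({url})"


# Placeholder DOI for illustration only.
url = doi_to_url("10.1234/example.1")
link = doi_markdown_link("doi:10.1234/example.1")
```

The second helper mirrors the slide approach described above: the visible text stays short, while the link target carries the full resolver URL.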

But just to keep in mind that if you add doi.org in front of any DOI, it will automatically go to the landing page of the DOI. And so it is also important to make sure that if you’re presenting your slides, to make sure to have that difference there available before people just start typing over the number of the DOI, etc. So it’s not a direct link unless you put that doi.org in front of it. All right, so I reserved this DOI for the presentation. So if I now go to this link, because I reserved the DOI, it comes to this page where it says, “DOI is not found.” That’s because my presentation is actually not shared yet. So we’ll go to that now. I’ll go to Zenodo. Yes, I’m going to leave this page because I don’t want to save that. I already had a citation file. I’ll go to my dashboard, because my dashboard is where all of my draft uploads are. You see the tests that we used before. But you also see the Zenodo, the why, what, and how presentation. And you see that this is still in draft stage. So this is a draft. It’s marked as a draft. It has a red upload file instead of this green tick, which means that it’s publicly available. So I’ll go there and do edits. And so here you see I already uploaded the slides so that you don’t have to wait for me waiting for things to be uploaded. But again, you don’t have to upload your slides in order to get this DOI. So we saw that in the test form. So I can just copy paste that. And I can go to the slides. And then I can copy paste this DOI into the slides. And I tend to put that onto any of my slides, just in case someone wants to reuse just a single slide. I tend to just put that DOI into all of them. Yeah. So what we’ll do now is we’ll upload this version of the presentation that I made and publish it so that we actually have a working DOI at the time where you’ll be watching this recording. So I checked this deposit carefully. So I very carefully filled out the basic information. It’s a presentation.
The title is Zenodo, Why, What and How. Publication date is today. This is me. I’ve put in the description. Thanks, Reshama, for helping with the description here because that’s all your work and not mine. So thanks for making my life a lot easier for the description. And I’ve added some keywords to make it more visible. I’ve added the language. But I did not add any of the other identifiers. So I kept everything else here. I kept that empty. This is not part of a conference. Well, I could actually put in Data Umbrella. I might do that later because it is part of a webinar series. So that would be good to link back to that to make it more clear what this output is actually about.

All right. So that’s what the upload form looks like. I am now just going to publish this. And then again, the warning, you can’t change the files. You can update the record’s metadata later. So I might do that. I add Data Umbrella there. I press publish. And now the slides are available. And no one has viewed them yet. Yeah. So this is what that then looks like. And now, and it might take a couple of seconds, but now if you actually use this DOI in my presentation slides, so if we go back to the presentation and I will go open the link. Ah, because of my cookie history, it is probably still saying, “DOI not found.” So this might take a couple of seconds. It’s also good to remember if you’re uploading a presentation on Zenodo five seconds before you do the presentation, you might want to do it a little bit longer. But in general, it should work almost immediately. Maybe I can open it in Chrome actually. Maybe then it, I did not install Chrome on this laptop. So we’re going to go to Microsoft. Let’s see if that works. And let me copy paste that. Copy this link. And then I’m hoping. No. Okay. So it will take longer than a couple of seconds, but it is really there. So hopefully, when you watch this recording, you will not see this DOI not found window anymore. But you will see the actual upload here. And then you can download the slides here. And because there’s a CC BY 4.0 license attached to it, which you can see here, feel free to reuse any of the slides and to use that for your own purposes.

1:11:50 Resources

And with that, I think I would like to conclude with a couple of resources from other people that you can also watch. And perhaps they explain more clearly how you can use Zenodo to upload your research, as well as some information in Spanish, which I don’t master. So this is why the presentation is in English. But if you prefer to have that information available in Spanish, there’s different resources available to you. And I think with that, that’s what I wanted to show. But perhaps you have some questions or clarifications or other things you wanted to know about Zenodo.

Reshama:
Esther, thank you so much for your presentation. We have a very comprehensive presentation. And when the video is posted, we will link to your Zenodo slides. And in the past, we haven’t done that. Sometimes we just link to Google Slides, which you never know that might be deleted or access might be changed. So we will link to your Zenodo slides in the video description.

1:12:48 Q: Where can we find the images/illustrations used in your presentation?

Reshama:
I do have one question, which is, I love the illustrations that you showed in your presentation from The Turing Way. And can you show us where people can find them on Zenodo? I notice you link to a DOI on Zenodo, so I assume that’s where they are.

Esther:
Yes, indeed. And that’s why I skipped my thank you slide because I actually wanted to thank The Turing Way community for all of these images, indeed, in my slides, which you can also reuse for any purposes because of the CC BY license, as long as you cite them, which is why I indeed refer to them in my slides, saying something along the lines of, let’s see, where’s an image? Should not take me long to find one. Yes, the data repository tree. So as long as I refer to this image, where it’s available on Zenodo, let’s see if that link clicks. I think they don’t allow that for the preview versions of Zenodo outputs. But here you can see illustrations from The Turing Way shared under CC BY 4.0 for reuse. And here you can find all of these images, which are very strange backgrounds for me, and it should probably be white instead of black. But yeah, here you can now see all of these images on Zenodo. And you can now preview them separately here. This looks more like what I would be expecting with the white background. I think that’s probably because my computer is on dark mode or something.

But yeah, this is how you find them on Zenodo. You can also, again, go to previous versions, because every time there’s a new version released, I think what that looks like is also here that you can go to previous versions. But that’s how you can find them on Zenodo. For some of them, it will be easier to go to The Turing Way. So that’s book.the-turing-way.org, with, what’s that little stripe called again? Sorry, I keep confusing that. But these little separations in the link. But if you Google The Turing Way, it’s also the first hit. So that’s where you can also find, for example, the data repository image should be somewhere in research data management. Data repository, where do we put that? Here, the one above. And there you can then also find them. So for some of it, it will be easier to directly copy paste it from the actual book instead of browsing to Zenodo, which if I go back, did I use that here? Yeah. So it’s not that great to browse through it on Zenodo. One of our community members is actually working on it with our shiny app tool to make this a little bit more browsable and also search only for keywords, et cetera. But it’s not yet there. Something very, very much needed. Anyway, rambling on too much about The Turing Way. I’m very excited about The Turing Way.

1:16:16 Q: How does Zenodo prevent people from uploading spam?

Reshama:
Thanks, Esther. I have another question, which is how does Zenodo, and I don’t know if you’re the right person to ask this too, but how does Zenodo prevent people from uploading spam?

Esther:
It doesn’t necessarily prevent people from doing that, but it will remove things that are not related to scientific output. So I’ve never encountered spam, but indeed, basically anyone, as long as you make an account and you fill in the mandatory fields, you can put up that spam, but there is some monitoring going on which will have that removed, because otherwise, if we go to the main page, otherwise this page would probably be flooded with people trying to spam it. But there’s a mechanism in place, but I’m not 100% sure about the details, unfortunately. Okay. And I guess also if people connect their GitHub or their ORCID, then that is a verification that it is legitimate content that people want to share. So that probably helps as well. Yeah. Exactly. Imagine. Okay.

Reshama:
And that is, thank you for being patient with all my questions along the way. And that is the end of my questions. So thank you so much. This recording is going to be up soon. If you have any questions, please ask on the video description. There’s a comment section, and we will be in touch with Esther to get those questions answered. Thank you so much, Esther. Thank you.

The Evolution of Open Source: From Hacker Culture to the Age of AI

Sun, 06 Jul 2025 07:00:35 +0000

This pivotal and comprehensive presentation explores how open source software has revolutionized technology, collaboration, and innovation from the 1960s to present day.

This blog post is based on a thought-provoking Data Umbrella webinar featuring Juan Luis Cano Rodríguez, a prolific open source contributor. In this talk, Juan Luis takes us on a journey through the cultural, philosophical, and technical milestones of open source software, from its hacker roots to its current role in the age of AI. But its story is much more than a timeline of programming languages and licenses: it’s a story about people, ideals, communities, and the evolving meaning of “freedom” in the digital world.

📺 Watch the Full Webinar!

This 75-minute video is a must-watch! It provides the evolution of open source, with all its values and challenges.

Slides: on Zenodo

Timestamps

00:00 Data Umbrella introduction
03:38 Juan Luis introduction
05:14 Juan Luis begins presentation
06:11 About Juan Luis
07:25 Disclaimer
09:20 1979: the rise and fall of hacker culture
11:44 1969: Bell Labs, Unix, C
14:15 1976: Copyright Act (in USA)
17:09 1979: Scribe markup language
18:05 1980-1989: GNU Project and Four Freedoms
18:45 Richard Stallman printer incident at MIT
22:00 Four Freedoms
25:24 Switzerland, CERN
25:55 1990-1998: The Linux bazaar and the Open Source definition
26:20 1991: World Wide Web
27:00 1993: Mosaic: first web browser
27:45 Linux (Ref: video, Intro to the Linux Operating System)
31:11 Python and more (R, Vim, Lua, Java, etc)
33:04 1997: The Cathedral and the Bazaar (an essay)
34:43 Netscape, JavaScript
35:45 1998: Open Source initiative
37:55 The Divide: why open source misses the point of free software
38:55 Digital age and the Big Data Explosion
40:10 dot com bubble
41:15 2001: animosity towards open source
42:00 orgs, Foundations: Apache, Linux, Python Software Foundation, Eclipse
44:04 2003: Google File System
44:45 2005: a new era of collaborative software development (Git, GitHub)
47:08 2011: Why software is eating the world
47:41 2012-2018: what do open source maintainers eat?
48:44 Growth of Python
49:30 open source is not sustainable (OpenSSL, Heartbleed, left-pad)
51:43 Nadia Eghbal, Roads and Bridges
52:17 open source system begins to fragment (licenses)
56:15 2019-2023: Post-open source and the Gen AI volcano
01:00:04 a new kind of open source license
01:00:55 2024: What’s next?
01:04:00 Q: Do you think the open source divide was avoidable?
01:05:36 Q: Are you optimistic about the future of open source in terms of funding?
01:07:45 Q: Web 3 and Blockchain
01:09:25 Q: Can you discuss OSPO’s? (Open Source Program Office)
01:12:30 Q: What do you think of the potential for the UFDA (user friendly developers association) model?

In this blog, we trace the history of open source software, starting from the early days of room-sized computers and academic code sharing, through the founding of the GNU project and the free software movement, the explosive growth of Linux and the World Wide Web in the 1990s, the rise of GitHub and big data in the 2000s, to the complex challenges of sustainability, ethics, and generative AI that define today’s landscape.

This post is based on a webinar hosted by Data Umbrella, a community-funded nonprofit supporting underrepresented voices in data science. Featuring a deep and thoughtful talk by Juan Luis—product manager at QuantumBlack, AI by McKinsey and a longtime advocate for the PyData community—we’ll walk through the key milestones, cultural shifts, and philosophical debates that have shaped open source into what it is today.

Whether you’re new to open source or a seasoned contributor, this journey offers insights not just into the software we use every day, but into the values, struggles, and future possibilities of the open technology movement.

The Genesis of Computing and Hacker Ethos in the 1950s and 60s

Let’s begin at the dawn of computing, exploring what we might call “the rise and fall of hacker culture”, a subjective framing, as Juan Luis emphasized: “So as I said I’m going to start with the very beginning of the history of computing. I call this session the rise and fall of hacker culture and just to make it very clear that this is a subjective perspective…” The 1950s and 60s were a world away from today’s computing landscape. Programming involved punch cards and other “arcane methods” to input programs into room-sized computers like the PDP-10, a staple of university and research departments.

Crucially, early programming was largely an academic pursuit. As Juan Luis put it: “At the very beginning of this story programming was mostly an academic activity.” Languages were low-level, with some code even written directly in assembly language. Debugging cycles were extremely long, and computing wasn’t yet ready for widespread business applications.

We’re talking about very low level languages, some programs were still written in Assembly, and the limitations of the time meant that development was slow and labor-intensive. But the interesting thing is that this academic notion of sharing knowledge, publishing articles, and exchanging ideas permeated the world of computing, having a profound effect on everything that followed.

Sharing code was common practice in the 60s and early 70s, even through physical means. While “open source” or “free software” weren’t yet defined terms, the ethos of sharing was the prevailing norm. In Juan Luis’s words: “So interestingly it was very normal in the 60s and beginning of the 70s to just share the codes, you know, with analog methods back then, but even though there was no notion yet of open source or free software, it was still pretty much the norm.”

The GNU Project and the Foundational Four Freedoms of Free Software (1980s)

The 1970s saw a shift as programmers began imposing restrictions on software, moving away from the open academic culture. An anecdote from Richard Stallman, then at MIT, illustrates this change. When using a markup language called Scribe, Stallman encountered “time bombs” placed by the author to prevent unlicensed use. This sparked outrage, not against charging for software, but against restricting user freedom. So with all these things in mind, towards the end of the 70s programmers were increasingly imposing restrictions on the software, and the academic culture that was in place in the 50s, 60s and 70s, of just sharing the code worry-free, was starting to fall. There’s one anecdote that I found quite interesting from Richard Stallman; we’re going to talk about him in a moment. At the time he was a young student at MIT and he was using a markup language called Scribe, and the author of that system, Brian Reid, placed some time bombs in the source code so that users could not access an unlicensed copy of the software. Apparently the reaction was that this was a crime against humanity: not necessarily charging for the software, but restricting the user freedom…

This sentiment fueled the GNU Project, initiated by Richard Stallman in the 1980s. Now we arrive at the 80s and I want to talk to you about the GNU project and the four freedoms, which is still a precursor of the open source movement that is the center of this talk, but it’s crucial to understand everything that came after that. So the GNU project was an endeavor that Richard Stallman started…

Another key event further motivated Stallman: losing access to the source code for a new office printer, rendering his custom printer enhancements useless. And then apparently there was one anecdote with an office printer at MIT that kick-started everything. Richard Stallman had added some custom code to the previous printer they had, so that every time someone printed a document it would send a message to the administrator, and if there were too many jobs in the printer queue it would send a message to the users who were waiting for their documents to be printed. This was an additional development that was not provided by the printer vendor and that Richard Stallman did himself. Then at some point the department decided to buy a new printer and suddenly he did not have access to the source code, so his initial development to add all these messaging systems was rendered useless. Apparently this was the moment Mr. Stallman realized that retaining user freedom and protecting it was the most important thing he wanted to devote his life to.

In 1983, Stallman announced the GNU project, aiming to create a “complete Unix compatible software system” and give it away “free to everyone.” So Mr. Stallman sent an email to different mailing lists in 1983 that said: starting this Thanksgiving I’m going to write a complete Unix compatible software system called GNU, for GNU’s Not Unix, and give it away free to everyone who can use it; contributions of time, money, programs and equipment are greatly needed. This is considered the beginning of the GNU project… The goal was a full operating system, requiring both a kernel and user-facing utilities. Stallman’s emphasis on “free” was crucial: “free as in freedom, not as in free beer,” a distinction that is easily lost in English, where “free” conflates both concepts. And there is one word in this announcement that’s critical, because Mr. Stallman said that he wanted to give it away free to everyone who can use it, and Richard Stallman has spent most of his life after this email clarifying that he meant free as in freedom and not as in free beer. I find that completely fascinating because I’m a Spanish native speaker and in Spanish “libre” and “gratis” are two different words, but of course this distinction doesn’t exist in English, so it’s so interesting how language can condition these misconceptions.

The GNU Manifesto (1985) and the Free Software Foundation (FSF) (1985) followed, solidifying the principles of the GNU project and user freedom. So after starting with the GNU project developments, Stallman wrote the GNU Manifesto in 1985 and he established the Free Software Foundation, a non-profit that would protect the interests of the GNU project and user freedom. This organization is still active to this day and in fact Richard Stallman is still the head of the organization almost 40 years on. In 1986, the FSF articulated the Four Freedoms of Free Software. And then in 1986 the Free Software Foundation wrote the four freedoms, and these were the underpinnings of everything that came after that in the free software movement.

Freedom 0: The freedom to run the program for any purpose. So Freedom 0, because of course they were programmers and they wanted to start with zero, is the freedom to run the program for any purpose. This means that nobody can restrict what you want to use the software for, and we’re going to see that this has some implications that still reverberate to our day.
Freedom 1: The freedom to study and change the program. Freedom 1 is the freedom to study and change the program, which was of course primarily useful for hackers themselves, because the moment any software did not work for whatever reason they would want to see the source code, possibly change it, and so on.
Freedom 2: The freedom to redistribute copies. And then Freedoms 2 and 3 refer to the freedom to redistribute copies of the software, so essentially the moment you have a piece of software you should be able to give it away to your friend or neighbor.
Freedom 3: The freedom to distribute modified versions. And finally the freedom to distribute modified versions of the software: if you are studying a piece of software and you have the freedom to change it as well, you should also have the freedom to distribute those modified versions to other people.

These freedoms are meant to be retained by users. This principle led to copyleft licenses like the GNU General Public License (GPL), ensuring these freedoms are passed on to everyone who receives the software. The consequence of that is that the licenses that the GNU project created, which are called the GNU public licenses or GPL, are said to be copyleft licenses, in the sense that these four freedoms are transmitted in a transitive way: everybody that receives a copy of the program must be able to retain these freedoms, otherwise the distributor would be in breach of the license. The challenge of pricing software while upholding these freedoms has spurred diverse business models, a puzzle many still grapple with. And in fact the difficulty of putting a price tag on software while retaining all these four freedoms has given birth to lots of different business models, and this is still something that lots of entities and companies and freelance developers are trying to figure out.

Around the same time, in Switzerland, Tim Berners-Lee at CERN began envisioning a new information system to organize documents, which would later become the World Wide Web. Meanwhile in Switzerland, by the end of the 80s, there was someone called Tim Berners-Lee who was starting to imagine what a new information system would look like to organize the documents of CERN, the particle accelerator that’s across Switzerland and France, and this would later give birth to what we today call the World Wide Web.

The 1990s: The World Wide Web, Linux, and an Explosion of Innovation

The 1990s marked a period of rapid acceleration. So now we enter the 90s and this is where the things start getting interesting and accelerating a lot. The pace of innovation becomes so intense that a linear timeline becomes difficult to maintain.

In 1991, the World Wide Web was released to the public by CERN. So in 1991 the World Wide Web was opened to the public; it had been in development for some years already, as I mentioned… Crucially, CERN made the web protocol and code freely available, extending the principles of software freedom to the very foundation of the web. And the truly interesting thing is that CERN made the WWW protocol and the code available royalty free. So in a sense they took these ideas of software freedom and so on and they applied them to something as foundational as the web. This decision laid the groundwork for the public digital infrastructure we rely on today. Mosaic, an early and highly influential web browser, appeared shortly after, paving the way for countless browsers to come, all built upon the foundational elements of HTML and HTTP. And what you see here on screen is a screenshot of the first web browser ever, which is called Mosaic, and this is Mosaic’s own website as it looked in 1997. This of course was only the first of many web browsers that came after that, but the elements that are common to all of them, so the HTML markup language, HTTP as a protocol and so on, were already there back at the beginning of the 90s.

1991: Linux

Also in 1991, Linus Torvalds, a Finnish computer science student, announced Linux. At the same time a young computer science student in Finland, Linus Torvalds, also in 1991, sent an email to a few mailing lists saying I’m doing a free operating system… Torvalds initially conceived it as a hobby project: a free operating system for 386(486) AT clones, distinct from the GNU project. While the GNU project had made significant progress on utilities, their kernel, GNU Hurd, was still under development. And if you remember from the GNU project, the original idea for the GNU operating system needed two things: the kernel and the utilities. The utilities had been seeing lots of progress during the late 80s, and they had basically a re-implementation of the Unix system that they already knew from their interaction with these old computers I mentioned at the very beginning, but they were not quite ready to have a kernel. They had a project called GNU Hurd that to this day is still in development and hasn’t gone very far.

The combination of the Linux kernel and GNU utilities resulted in the GNU/Linux family of operating systems, often simply called Linux. The “coolest thing about Linux” was the rapid emergence of numerous Linux distributions. But the truly revolutionary thing here is that by combining the Linux kernel, as it came to be named, and the GNU utilities we had the GNU/Linux family of operating systems, which I’m going to be calling Linux for the rest of the presentation for brevity. And the coolest thing about Linux is that at that point dozens and dozens of Linux distributions started to appear. These distributions bundled desktop environments, GNU utilities, and other software. Slackware, the first Linux distribution, appeared in 1993, followed by Debian in the same year, which became one of the most influential distributions, parent to Ubuntu, Linux Mint, and many others. Remember that Linux was announced in 1991. In 1993 there was the first distribution of Linux, called Slackware, and when I say distribution here I mean a collection of desktop environments, the GNU utilities and some extra stuff. Then in the same year, 1993, the Debian distribution was created, and this came to be among the most successful Linux distributions ever and parent to Ubuntu, Linux Mint and many others.

The mid-90s saw the rise of commercial Linux distributions like SUSE Linux and Red Hat Linux, the predecessor of Red Hat Enterprise Linux. And in 1994 and 95 there were two commercial Linux distributions, and this is really interesting because in a way they were fulfilling the dream of Richard Stallman of having software that could be free as in freedom while at the same time they could build a viable business on top of it and still distribute the source code. These two distributions are SUSE Linux and Red Hat Enterprise Linux… These distributions demonstrated the viability of building businesses on free software while still adhering to the principles of source code distribution, fulfilling Stallman’s vision.

Beyond the web and Linux, the 90s witnessed an explosion of new programming languages:

1991: Python (Guido van Rossum) and R (Ross Ihaka). So in 1991 Guido van Rossum, a programmer and mathematician from the Netherlands, created Python, which is a software that I’ve been working with for half my life, so I owe it my whole professional career basically. Also in that same year the R programming language was created by Ross Ihaka, a researcher in New Zealand…
1991: Vim editor (Bram Moolenaar). Also Bram Moolenaar, a programmer from the Netherlands, created the Vim editor, and it’s still popular to this day.
1992: X Window System ported to Linux, enabling graphical interfaces. In 1992 the X Window System was ported to Linux, which made it possible to have visual interfaces on Linux very early on.
1993: Lua scripting language (Roberto Ierusalimschy et al.) and LaTeX (Leslie Lamport). In 1993 some researchers from Brazil, Roberto Ierusalimschy and many others, created the scripting language Lua, which is very popular in the video game community, and Leslie Lamport from the USA created LaTeX.
1990s: Numeric and Numarray, precursors to NumPy. And Python was already gaining traction in the scientific numerical community, and both Numeric and Numarray were created; these were the precursors of NumPy, which appeared some years after that.
1996: Java (James Gosling). And in 1996 Java was created by James Gosling from Canada, and this came to be one of the most important enterprise programming languages of the following two decades.

This flurry of activity culminated in Eric Raymond’s influential essay, “The Cathedral and the Bazaar” (1997).

So you see all these things were happening at the beginning of the 90s, and this all culminated in 1997 with the writing of the essay The Cathedral and the Bazaar by Eric Raymond. Raymond contrasted the “cathedral” model of traditional free software development, conducted behind closed doors, with the “bazaar” model exemplified by Linux. In the bazaar model, development was open, with public mailing lists and transparent patch management. And so the idea from Eric Raymond here was that the previous generation of free software had been developed behind closed doors: everybody could get the code and everybody could get the four freedoms, but in between releases nobody could see the intermediate stages of the software, the mailing lists were private, and so on. This is what Raymond called the cathedral model. Then he described the Linux development model, which was completely different because everything was happening in the open and Linus was sharing how he was or was not merging patches, and he called this the bazaar model… Raymond also formulated Linus’s Law: “Given enough eyeballs, all bugs are shallow,” advocating for transparency in development and highlighting the weakness of “security by obscurity.” And he coined what he called Linus’s law, which says given enough eyeballs all bugs are shallow, which is a way of saying that security by obscurity is not the way to go and instead we should strive for transparency: the more transparent we are, the sooner we’re going to find the problems in our code. Again this is one idea that still reverberates in our time, and I find it so fascinating that all of this was a consequence of this exciting period at the beginning of the 90s.

From Dot-Com Bubble to Open Source Definition: A Turning Point

The 1990s also saw significant business developments that would reshape the landscape. In 1994, Netscape Corporation was founded, producing the Netscape Navigator web browser. However, there were also some interesting things happening on the business side that would have tremendous consequences as well. So in 1994 the Netscape Corporation was founded to produce a different web browser that was called Navigator, so Netscape Navigator. Netscape introduced JavaScript in 1995 and went public that same year, marking the beginning of the dot-com bubble. They introduced the JavaScript programming language in 1995 and later that year the company went public, so this was considered the beginning of what would in the end be the dot-com bubble, and we’re going to see what the consequences of that were. Concurrently, young Stanford students Sergey Brin and Larry Page started developing the foundations of Google. At the same time some very young students at Stanford University, Sergey Brin and Larry Page, had just started creating the very basics of what would then become Google, one of the most powerful tech corporations these days.

This period culminated in 1998 with the formalization of the term “open source”. All this activity culminated in 1998, when the open source term finally became official. Inspired by Raymond’s essay, Netscape released the source code for its browser, a move that resonated throughout the tech industry. This open Netscape Navigator ultimately became Firefox. So, influenced by Eric Raymond’s essay, Netscape released the source code of the browser, and this was so shocking and unusual that it sent waves through the tech industry and motivated lots of enthusiasts and technologists to try to channel that energy. This open Netscape Navigator was the precursor of Firefox, which is again a browser that we still use to this day.

A strategy meeting, attended by Eric Raymond, Bruce Perens, and others, addressed the confusion surrounding “free software.” Concerns arose that the requirement for transitive freedoms in free software licenses was deterring corporate adoption. So there was a strategy meeting and several people were present there, Eric Raymond himself, Bruce Perens, who is there in the picture, and many others, and they were discussing the fact that this terminology, this free software thing, was too confusing. Also there were hints that the fact that these freedoms had to be transitive, so they were forced to transmit them to downstream users, was not very interesting for corporations. Christine Peterson suggested “open source” as an alternative term, one already in limited use. Bruce Perens and Eric Raymond then founded the Open Source Initiative (OSI), and Tim O’Reilly organized an “Open Source Summit” to promote the new terminology. So Christine Peterson from the United States, from the Foresight Institute, suggested using the term open source instead of free software. It was a term that was already in use at the end of the 1980s and beginning of the 90s, but she is credited with coming up with the idea of using it more broadly. So Bruce Perens then, with the help of Eric Raymond, founded the Open Source Initiative, and that same year Tim O’Reilly organized what he called an open source summit, and they started spreading the word about the new terminology and all these new ideas.

A crucial difference emerged: the open source definition did not mandate transitive freedoms. Interestingly, the open source definition did not mandate that these freedoms are transmitted to users, and so this created a divide in the community… This philosophical divergence created a lasting divide between “free software” and “open source” proponents. While Stallman initially considered embracing “open source,” he ultimately recognized this key difference and dedicated himself to advocating for free software, leading to a rift that persists within the community. Richard Stallman at some point had considered embracing the open source terminology, but then he realized that there was a subtle but very important philosophical difference between the two, in that open source was not guaranteeing the downstream freedom of the users. And so, sadly, he devoted a very large part of his life to fighting open source proponents, and hence this divide was created that is still mentioned by many members of the community to this day.

The 2000s: Open Source Foundations and the Dawn of Big Data

As the 90s ended and the new millennium began, the digital age truly dawned. Historians often mark 2002 as the turning point when digital data storage surpassed analog. This explosion of digital data paved the way for the “big data era.” So we’re about to finish the 90s and we’re about to enter the new millennium and I want you to fixate on this image on the right because it might not seem too long ago but the truth is that what we today take for granted, so having everything on digital storage, is something that’s relatively new and historians usually consider 2002, so 22 years ago, the beginning of the digital age. So up to that point there was more information stored in analog means than in digital means but then after that point there was an exponential growth in hard drives, CDs, DVDs, even digital tape and so on. So that was the moment that marked the true beginning of the digital era and with this explosion of data everywhere came of course the big data era but we’re going to get to that in a moment.

The dot-com bubble burst around 2000, a dramatic moment that, despite initial disappointment, ultimately reflected market dynamics. Before that it’s my turn to talk about the dot-com bubble. So with the advent of the new web and all these open source technologies and the miniaturization of computers and so on, there was this fever and everybody wanted to get into the internet and do digital business. This created a bubble that reached its peak in 2000, and then most of these companies, in the span of two years or less, completely disappeared or saw their value wiped out by more than half. So this was a very dramatic moment that was seen by many as a big disappointment in the potential of the internet, but in the end it was the dynamics of the market. In 2001, corporate hostility towards open source peaked, exemplified by Steve Ballmer’s (Microsoft CEO) infamous statement that “Linux is a cancer.” In 2001, the corporate animosity towards open source reached its peak, and it’s impossible not to remember this quote by Steve Ballmer, who was the CEO of Microsoft back then, having taken over from Bill Gates, and who said at some point that Linux is a cancer that attaches itself to everything it touches.

However, the open source community was organizing. The 2000s saw the establishment of key open source foundations:

1999: Apache Software Foundation, underpinning foundational data technologies like Apache Parquet and Apache Arrow. So in 1999 the Apache Software Foundation was established, and this is still thriving to this day and underpins lots of the data fundamentals such as Apache Parquet and Apache Arrow and so on.
2000: Linux Foundation, supporting Linux kernel development. In the year 2000 the Linux Foundation was established, and they employ Linus Torvalds to this day, and they’re a very successful coalition of companies pushing for development of the Linux kernel.
2001: Python Software Foundation, managing the Python language and organizing PyCon. In 2001 the Python Software Foundation was established, also in the USA, and they started creating the trademark for the language and soon after organizing PyCon and so on.
2004: Eclipse Foundation, influential in the 2000s. And back in Europe in 2004 the Eclipse foundation was established which was also very influential especially in the 2000s.

Major tech companies, challenging Microsoft’s dominance, began embracing open source principles. In 2003-2004, Google published papers on the Google File System and MapReduce, outlining scalable, distributed computing concepts. But also the contenders for Microsoft’s dominance were embracing these knowledge sharing ideas, which was accelerating innovation to a pace that had never been seen. So in 2003 Google released the Google File System paper and the next year they released the MapReduce ideas. These papers, though academic, demonstrated the cost-effectiveness of scaling out computing tasks across multiple machines. In 2005, Yahoo! implemented these ideas in the open, leading to the 2006 release of Hadoop. However, in 2005 one of the big tech giants at the time, Yahoo!, started implementing those ideas, and they did that in the open. This culminated in the release in 2006 of Hadoop, which then became the basis of the whole big data craze for the next 10 years. Hadoop became the cornerstone of the big data movement for the next decade, enabling petabyte-scale data processing with open source technology. Hadoop is not a project that is very popular these days, but it was nevertheless the first wave of technologies that enabled companies to process petabytes of data using fully open source technology.

Another pivotal open source project emerged from Linus Torvalds in 2005: Git. Frustrated with a proprietary version control system, Torvalds created Git in just one month. And finally, something that was extremely influential as well was happening on the side. So in 2005 Linus Torvalds from the Linux kernel fame was pissed because they were using a proprietary version control system and they were enjoying a free license, but at some point the company took the license from them and they revoked it. And so Linus Torvalds took some time off and he wrote the Git distributed version control in one month. I find this extremely impressive. While Git itself is a command-line tool without a built-in collaboration layer (Linux kernel development relied on mailing lists for patch exchange), the need for collaborative open source development platforms was growing. And you know, Git is a tool, it’s a command line tool, but it had no collaboration layer. In fact, the way the Linux kernel is developed is over mailing lists and people exchange software patches on there with the lines they want to modify and so on. Which back then was a very advanced system and to the best of my knowledge they still keep using that. However, for projects that were smaller in scale than the Linux kernel, having all this collaboration layer was very annoying. This need was met in 2008 with the launch of GitHub. And in 2008 GitHub launched and this has become the biggest and most important repository of open source and free software ever. GitHub became the largest open source repository, embodying Raymond’s “Bazaar” model by making development history, issue tracking, and pull requests transparent and accessible. And the interesting thing is that this puts the idea of Eric Raymond’s Bazaar into practice, because now on GitHub you can see all the history of commits, you can open issues, you can send a pull request, essentially you can see all the development process in the open. 
So this fulfilled the dream of spreading open source far and wide, but also it introduced these social aspects into the development.

In 2011, Marc Andreessen, Netscape co-founder, penned the influential article “Why Software is Eating the World,” further fueling the open source movement and attracting a new wave of innovators. In 2011, which is the cutoff point that I chose for this section, Marc Andreessen, who was also one of the founders of Netscape some years prior, wrote this article, Why Software is Eating the World, which again reverberated with the tech industry and unleashed a new wave of innovators that were drawn to the fresh world of open source software.

Challenges to Sustainability and the Fragmentation of Open Source in the 2010s

The 2010s brought a new set of challenges to the open source ecosystem. And so we reach the past decade, 2012 to 2018. I’m going to accelerate now a little bit because by this time, and judging by the demographics of Data Umbrella, most of us were already alive and some of us were already even contributing to open source. And this is where some cracks started to appear in the whole system. While software was “eating the world,” and open source was “eating software,” a critical question arose: “what do maintainers get?” The xkcd comic depicting the fragility of the digital infrastructure, reliant on the unpaid efforts of a lone maintainer, captured this growing concern. So there was this notion that software was eating the world and later on that open source was eating software, but “what do maintainers get?” was the question that everybody was asking. And there is this famous xkcd cartoon that depicts the whole modern digital infrastructure depending on a project that some random person in Nebraska has been thanklessly maintaining since 2003. And to be honest, the reality is not far off from that.

The decade saw the explosive growth of Python, fueled by the data science boom and libraries like pandas (released around 2010) and scikit-learn, the dominant classical machine learning framework. So during this decade we had an immense growth of the Python programming language, driven mostly by the data science craze. Bear in mind that pandas, the library created by Wes McKinney, was unveiled in 2010, 2009 more or less. And during this time Python became the number one option to do all sorts of data manipulation. At the same time scikit-learn became the most important classical machine learning framework in existence. However, the unsustainability of open source became increasingly apparent. Built largely on volunteer time, projects initially conceived as hobbies were now critical infrastructure. Vulnerabilities like Heartbleed in OpenSSL, a project powering secure web browsing and often maintained by a single person, highlighted this fragility. But while all these things were happening, the world was discovering that open source was not sustainable at all. It was built on top of maintainers’ free time. In the 80s and 90s, all of these projects started as fun projects that, in the words of Linus Torvalds, would never become big and professional. People were doing them for fun, to scratch their own itch, to learn with their peers and so on. But suddenly we had all this massive infrastructure that was built on top of these very fragile projects, and some vulnerabilities started to appear. So for example OpenSSL, which powers most of the secure web browsing that happens today, had a vulnerability that was very easy to exploit. And that day the world realized that OpenSSL was basically maintained by one person.

The left-pad incident in 2016 further underscored the precariousness of the ecosystem. Another example of something that happened was this left-pad story. A programmer, Azer Koçulu, removed his tiny JavaScript library, left-pad, from npm, breaking countless projects dependent on it. So basically Azer Koçulu, a programmer who was living in the USA, pulled left-pad, which was an extremely short JavaScript library. The library is so short that the whole source code could fit in about 11 lines of code, as you can see here in the screenshots. He was pissed because of some of the decisions of the Node package manager, and one day he decided to remove his library. It turns out that thousands and thousands of projects were dependent on it, and the continuous integration systems of virtually all the web front-end ecosystem completely broke. This event exposed the deep dependency chains and single points of failure in the modern web ecosystem.
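
To give a sense of how small such a dependency can be, here is a minimal Python sketch of what a left-pad helper does. It is an illustration only, not the original JavaScript source; the function name `left_pad` and its parameters are assumptions made for this example.

```python
def left_pad(text, length, fill=" "):
    """Pad `text` on the left with `fill` until it is `length` characters long.

    Hypothetical helper for illustration; not the npm package's actual code.
    """
    text = str(text)
    if len(fill) != 1:
        raise ValueError("fill must be a single character")
    missing = length - len(text)
    if missing <= 0:
        # Already long enough: return the text unchanged.
        return text
    return fill * missing + text


print(left_pad("42", 5, "0"))
```

In Python the built-in `str.rjust` already covers this need, which is part of why the incident prompted so much discussion about depending on tiny packages.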

Nadia Eghbal’s work, including her book “Working in Public,” provided crucial insights into the organization and motivations of open source maintainers. In 2016 Nadia Eghbal, a former lawyer from the USA who became super interested in the whole open source world, wrote this seminal work, Roads and Bridges, with the Ford Foundation, in which she explained all the ways in which open source projects were organizing, how maintainers were motivated to do what they were doing, and so on. It was an essential piece of work that helped us understand how this whole ecosystem was evolving.

Towards the decade’s end, the open source ecosystem began to fragment. Businesses, while benefiting from open source, sought to protect themselves from competitors leveraging their code. This led to the emergence of “non-open source” licenses like the Business Source License, Commons Clause, and Server Side Public License. Finally, towards the end of the past decade the open source ecosystem began to fragment in two opposite directions, but for very similar reasons. So on one hand, businesses that were using open source or even copyleft licenses started to see that their competitors were taking their code and basically competing with them. And some companies saw this as normal, but some others were not so happy with the outcome. So lots of non-open source licenses started to appear. So we had the Fair Source license in 2015, the Business Source License from MariaDB in 2016, the Commons Clause in 2018, the Server Side Public License in 2018. These licenses restricted commercial competition, often by limiting Freedom 0 (freedom of purpose), preventing users from reselling the software or offering managed services based on it. While intended to protect innovation, they were rejected by the OSI as not truly open source. All of these licenses one way or another were restricting the competition. So for example they’re telling you that you can take the code, you can do whatever you want with it, but they were restricting freedom zero, freedom of purpose, and telling you that you could not take that code and sell it, for example, or you could not take that code and do a managed service on top of it. It was a tool for companies to protect their innovations. But you know there was a period of confusion during these years because they tried to pass these licenses off as open source, because maybe they were open in spirit, but the Open Source Initiative had very clear guidelines of what the open source definition was, and they rejected all the requests to consider any of these licenses open source.

Conversely, concerns arose about open source being used for unethical purposes. This led to the development of “ethical licenses” like the Cooperative Non-violent License and the Hippocratic License. At the same time, on the other side of the spectrum, people were concerned that open source software was being used for nefarious purposes. So for example there’s a family of projects that created the copyfarleft licenses, and these essentially are a family of copyleft licenses, but they mandate that only cooperatives and other non-capitalistic actors could use the code. And then in 2018 Coraline Ada Ehmke created the Hippocratic License, or do-no-harm license, after she saw that GitHub was working closely with ICE, with the Department of Immigration of the US government, and she wanted to create a license that developers could use to put their code out there and make sure that it would never be used for purposes of war, discrimination and so on. These licenses, while driven by ethical considerations, also restricted Freedom 0 by limiting the permissible uses of the software, and were similarly rejected as open source by the OSI. And they also tried to submit these as open source licenses. In fact, Coraline ran for the Open Source Initiative election at some point, but again they were restricting Freedom Zero, so they were putting constraints on the purpose that was allowed for the software. So again these are not considered open source licenses and the OSI rejected them all, but we started to see this tension emerging right here. These developments highlighted the growing tensions and fragmentation within the open source world.

The Post-Open Source Era and the Impact of Generative AI (2019-2023)

In a surprising turn, 2018 saw a reconciliation between historical adversaries as Microsoft, once critical of open source, embraced it by acquiring GitHub. And then in 2018 old enemies became friends, and Microsoft, who had said 20 years prior that Linux was a cancer, was embracing open source for once. So Satya Nadella, who was the next CEO after Steve Ballmer, completely changed the culture of Microsoft, and in 2018 they acquired GitHub, effectively becoming the owner of the biggest code repository in the world. Under CEO Satya Nadella, Microsoft shifted its culture and became the owner of the world’s largest code repository.

Entering the period of 2019-2023, we find ourselves in a “post-open source era,” significantly impacted by the rise of generative AI. And now 2019-2023: I only included one album cover here because we’re in the middle of the decade, so I’m hoping to discover some new music in the coming years, and of course we are now in the post-open source era and being affected by the generative AI volcano. Generative AI models like DALL-E, Midjourney, Stable Diffusion, and ChatGPT have become ubiquitous. And I’m just going to say a couple of words about this because there’s truly no need to say anything else about generative AI. You’ve seen all of it all over the place; you’ve probably played with some of these systems: DALL-E, DALL-E 2, Midjourney, Stable Diffusion, but also ChatGPT and all the derivatives and so on.

A critical challenge arises: generative AI companies face lawsuits regarding copyright infringement. OpenAI, the creator of ChatGPT, acknowledged that training useful AI models is impossible without copyrighted material. But one really important thing is that these companies are now facing some lawsuits from journalists, artists and so on, and OpenAI, the company that’s behind ChatGPT, recognized that it’s impossible to create useful AI models without copyrighted material. This brings us back to the 1976 Copyright Act and fundamental questions about copyright in the digital age. Current generative AI models were trained on datasets scraped from the internet, including copyrighted images and code, often without explicit consent. And I find this so fascinating, that we’re going back to 1976 and the Copyright Act in the USA and essentially questioning what copyright means in the modern era, you know. I’m sure some of you have had, maybe depending on where you live, some problems when you were downloading torrents of pirated movies or TV series and things like that. But it turns out that the current generation of generative AI models was created from datasets made by scraping the whole internet, using images without consent, because you cannot obtain consent at a scale like that, and using source code that was on GitHub and that had very strong licenses, for example strong copyleft licenses.

This raises complex questions. For example, is code generated by GitHub Copilot a derivative work under GPL, requiring copyleft licensing? So there’s a very interesting question now about what happens when Copilot, for example, gives you a chunk of code. Is that a derivative work by the definition of the GPL, and as such should you release it under a copyleft license? Generative AI models, like Copilot and ChatGPT, operate as “black boxes,” obscuring the contributions of digital pioneers from previous decades. So suddenly Copilot, ChatGPT and so on have become black boxes that somehow shield us from all the effort that the digital pioneers put in during the 80s, the 90s, the 2000s and so on, and we’re still in the process of figuring out how we should proceed next.

The computational resources needed for training and running these AI systems are immense. Reportedly, Sam Altman seeks seven trillion US dollars to compete with Nvidia in GPU production, a staggering figure that dwarfs national GDPs and global needs like ending world hunger. And also the amount of computational resources that training and even using these systems requires is just out of this world. So Sam Altman reportedly is looking for seven trillion US dollars of funding (these are American trillions), and he wants to create competition against Nvidia so that OpenAI and other companies do not depend on Nvidia’s GPUs to train and do inference with these generative AI models. But if you put these seven trillion into perspective with anything, the GDP of your country or the budget that we would need to end world hunger or anything like that, it’s just astonishing. So whether or not we will be able to gather these resources in a moment in which we’re starting to face water scarcity, climate change and so on is up for debate.

The future remains uncertain. A current initiative, involving Chad Whitacre and David Cramer from Sentry, is exploring new license terms that are “open source in spirit” but allow companies to protect innovation while contributing to the software commons. We’re writing the future as we speak: there’s this thread that is active, I am slightly participating in it, that is trying to figure out what term we should use for licenses that are not really open source but are open source in spirit, and that companies want to use to protect their innovations while at the same time giving back to the software commons. So this is a coalition led by Chad Whitacre and David Cramer from Sentry; there are lots of other companies involved and there are many names that have been proposed, and you can make your own contribution and write history just by commenting on GitHub. This initiative invites community contribution on GitHub.

The path forward is unclear: potential outcomes include the proliferation of non-open licenses, copyright lawsuits against AI models, or even a redefinition of copyright itself. I’m going to leave it here. I think I spoke a lot; I truly hope that you liked it. What’s next? Nobody knows, only a big question mark. This could be a proliferation of non-open licenses, this could be lots of lawsuits against AI models, or maybe a redefinition of how copyright even works. We are at the forefront of this evolution, with the opportunity to shape the future, just like the pioneers who built the digital infrastructure we rely on today. The truth is that nobody knows, but we are here, we’re at the forefront of it, and we can make a dent in this and become one more of these pioneers that have been building the digital infrastructure for the past 50 years.

Contributing to Core Python

Wed, 02 Jul 2025 05:01:35 +0000

Resources

Learn about contributing to Python:

Python Developer’s Guide
Discourse: discuss.python.org
Python GitHub Issue Tracker
- for bug list: Python List of Issues
Python core mentorship

Video

About the Speaker

Carol Willing is a member of Python’s Steering Council and a core developer of CPython. She’s a Python Software Foundation Fellow and former Director. In 2019, she was awarded the Frank Willison Award for technical and community contributions to Python. Carol is a long-time contributor and ACM Software Systems award winner for Project Jupyter. She sits on its Steering Council and works as a Core Developer on JupyterHub and mybinder.org. She serves as a co-editor of The Journal of Open Source Education (JOSE) and co-authored an open source book, Teaching and Learning with Jupyter. Carol has an MS in Management from MIT and a BSE in Electrical Engineering from Duke University.

GitHub: @willingc
LinkedIn: carolwilling

Video Outline

Note: the timestamps are included in the video description.

00:00 Reshama introduces Data Umbrella
04:20 Reshama introduces Carol Willing
05:25 Carol begins talk
08:28 Python Steering Council
10:00 Slides recap
11:41 Kind of contributions needed in Core Python
14:44 Contributing to Data/Scientific projects vs to CPython
20:01 How to start contributing to CPython
23:05 Reasons for contributing to CPython
24:05 CPython Dev Guide https://devguide.python.org
29:23 How to build and test CPython
32:27 Resources to learn about CPython Internals
35:35 Resources for contributing to python
37:10 CPython community
38:21 Getting started with Q&A
38:38 Q: Is it needed to reproduce old bugs against new versions?
40:40 Q: Multiprocessing vs Threading to increase performance
41:19 Q: Does the project have any process to find a mentor?
44:00 Q: From your viewpoint what is the future of Python?
48:47 Q: What are the efforts of the Python Steering Council to make the community more inclusive?
52:29 Q: What will be the impact of the PEG parser?
54:12 Q: Development time vs. Code performance optimal performance

Full transcript of Carol Willing’s presentation

00:00 Reshama introduces Data Umbrella

Hello, everyone. Thank you for joining our webinar for today. Thanks for joining Data Umbrella. I’m going to do a quick introduction. Carol Willing is going to do her talk and we’ll have a Q&A session at the end. And this webinar is being recorded. A little bit about me. I’m a statistician and data scientist. I’m the founder of Data Umbrella and I am on Twitter, LinkedIn, and GitHub as “reshamas”. Feel free to follow me. We have a code of conduct. We’re dedicated to providing a harassment-free, professional, and respectful experience for everyone. This applies to the chat as well. Thank you for helping make this a welcoming and friendly community for all of us. About Data Umbrella, we’re an inclusive community for underrepresented persons in data science. We welcome allies to join us. And we are a volunteer-run organization.

How to support Data Umbrella

So how can you support Data Umbrella? The first and foremost is to follow our code of conduct. The second is we have a Discord community chat. So feel free to join it. The link is on our website. And there you can ask questions and answer questions as well for the community. We have an open collective. Feel free to also, if you want to donate there, to cover our meetup dues and other operational costs. And we have a new initiative, we’re doing transcripts for all of our talks. The transcripts are on GitHub and it requires knowing some markdown and how to submit a PR, either via terminal or GitHub. So check out this link for more information. We also have a job board, which is under jobs.dataumbrella.org. If you are looking for a position or just curious, feel free to check it out. We have two highlighted jobs today. The first is a software engineer position at Oscar Health, which is based in New York City. It’s a health insurance company. And they are developing seamless technology and provide personalized support to members to navigate their health care. And they have some other roles open as well. So check it out. Our next job that we’re highlighting is a machine learning engineer by Development Seed. They are based in Washington, D.C. or Lisbon, Portugal, or they can be a remote position for now. And what they do is they’re mapping elections from Afghanistan to the U.S., analyzing public data and economic data and leading strategy and development behind Data.World Bank and other social enterprise initiatives.

Website resources

These are just a sprinkling of what we have available on our website. There’s a lot of resources on responsibility, on accessibility, on open source. So check that out on your own. We have a monthly newsletter. It is dataumbrella.substack.com. So feel free to sign up for it. We only send it out once a month. So we promise not to spam you with too much email. Data Umbrella is on a bunch of different platforms under the name Data Umbrella. So depending on what your preference is, the best place to actually be a member is the Meetup Group because that’s where all the events are posted. Our website has resources. We are on Twitter. We are on LinkedIn. We are on YouTube. If you want to subscribe to our channel, I will post some of these links on the chat once I finish this presentation so you can follow it. Before we begin, I just want to let you know that next week our upcoming event is using Streamlit for data science with Thomas Fan. Thomas is also a core contributor to scikit-learn. Streamlit is a way to build and share data apps in Python. I think it’s going to be a great presentation. I’m so glad that Thomas said yes when I invited him and asked him.

04:20 Introducing Carol Willing

A little bit about today’s speaker, Carol Willing. Carol has an MS in Management from MIT and a BSc in electrical engineering from Duke University. She is a core developer of Core Python. She is a member of Python Steering Council. For Project Jupyter, Jupyter Notebook, as many of us have used, she’s on the Steering Council and works as a core developer on Jupyter Hub and MyBinder. She is also co-editor of the Journal of Open Source Education and she co-authored an open source book, Teaching and Learning with Jupyter. There’s a lot more that Carol has accomplished, but I was running out of space. So I will let Carol take it over from here. Oh, and just one more thing. We have a Q&A tab on this platform. So if you have any questions, just post them there. If you do post them on the chat, it’s not a problem. I can easily move them over to Q&A. And if you want to upvote them, you know, I think we’ll have time to answer everyone’s questions, but feel free to upvote to see what’s really exciting and important for you. And I will hand it over to Carol.

05:25 Carol begins

Thanks, Reshama. Thank you all for being here and thank you for the sponsors and everybody who organizes groups. It’s really important and it helps us as an ecosystem get stronger. I am going to start sharing my screen, hopefully, and once Reshama lets me know that you guys can see everything, then I will start. There was a little bit of a lag before, but okay. Well, I’m going to assume you can see my slide deck. Otherwise, somebody please shout. So today, I’m going to talk about contributing to core Python. But I’m going to do it from the lens of a scientist or a data scientist. So it’s going to be my opinionated view of working in all of these communities. And for those that are advanced users, you will find something useful in this. For those of you that have never contributed to open source, you should also find a lot of really useful stuff. As Reshama went through my bio, the ending part talked about education and using tools. And that is really where my heart and passion is. Less about the technology and more about building tools that empower other people to do good things in the world. And I think our science and data science ecosystem helps us do that really well. So let’s get started. So, yes, core Python and CPython are the same thing. You will hear them used interchangeably. And for all intents and purposes, you can just assume that they are actually the same. So for today’s talk, we’re going to look at this from a data and science perspective. So the primary audience is data scientists, scientists, data engineers. But other folks that might get value out of this are computer science, compiler engineers, operating system experts, language geeks. But that won’t be the focus of the talk. It will be the data folks and science folks. Okay.

08:28 Python Steering Council, How Python is organized

So contributing to core Python today, I’m going to walk a little bit through how core Python is organized today. Then I’m going to compare it to other open source projects in our ecosystem of data and science. scikit-learn is a great example of that. And then we’ll talk about getting started and how you can go further with your contributions to either CPython, open source, your local community, and more.

Governance

So we had an interesting thing happen a couple of years ago. Guido van Rossum, who had been the benevolent dictator for life of Python, stepped down. And we were faced with having to create a new governance for core Python. And what we came up with was a small steering council. These are the members currently who sit on the steering council. And really, we are to sort of do and set the direction of things that Guido did by himself in terms of organization in the project and direction of the project.

PEP (Python Enhancement Proposals)

And specifically, we are tasked with ensuring the quality and stability of the language, moving towards contributions that are accessible, inclusive, and sustainable, fostering a stronger relationship with the Python Software Foundation, and continuing to facilitate the decision-making process for PEPs, which are Python Enhancement Proposals. When there are large changes made to the language or proposed to the language or workflow, that’s something that is open for everybody to read and comment on. And our goal is to seek consensus both with each other, but also with the community at large, because we don’t want to be a dictator of the direction of Python. So, as we look at core Python, what kinds of contributions are needed? Lots. But I want to first step back and look at the Python Software Foundation, which is a sister organization to the core Python developers.

Sorry, just in general, we see you, but we don’t see your slides. So, maybe if you can just share again. Let me stop sharing and resharing. Let’s see if escape will do it. Worked in practice. Let’s do it again. Okay. Now you should see my whole desktop. Is that correct? Okay. It takes just a couple of seconds for the lag. It is loading, which is good. And now we can see your whole desktop. Yep. And now you should be able to see the talk? Yep. Now we can see the talk and we can see you. Okay.

10:00 Slides recap

So, speed version. The title of the talk: core Python, an opinionated guide for scientists and data scientists. Yes, core Python is the same thing as CPython. I’m targeting this to the scientists and data scientists and data folks in the audience. We’re going to go through a few different ways of contributing to core Python and beyond. I started with a quick discussion of governance, our introduction to the steering council, what the steering council is responsible for. And now back to you.

11:41 Kind of contributions needed in Core Python

Which contributions does CPython need? And looking at the mission of the Python Software Foundation is a good place to start. The Python Software Foundation is a sister organization to the core Python group. And its goal is to promote, protect, and advance Python the language as well as to grow a diverse and international community of Python programmers. So, as you think of making contributions, keep that in the back of your head because those are really going to be the most valuable contributions. Some ways you can contribute. And oftentimes, people look at writing new code as the only way to contribute. And it is a great way to contribute. But in many ways, a lot of these other ways to contribute are equally, if not more, important. When you’re adding new code, it’s got to be with backward compatibility. We’re a 30-year-old language. People are using this in production. They don’t want things breaking. Maintaining security, anything you can do to maintain the security of the language. Or improve core development workflows. So, Mariatta has put together a lot of bots that make our core development workflow much easier for folks to get started. And also, for GitHub and the power of technology to help with some of the more tedious tasks.

Writing and running tests.
Writing and editing documentation. Both of those are absolutely critical to the success of any open source project. Python as well.
One huge need we have is folks to triage bugs for reproducibility. And in addition to that, people to review open PRs.
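As a small illustration of the "writing and running tests" contribution mentioned above, here is a minimal sketch using the standard library's `unittest` module. The test class and test names are made up for this example; CPython's own test layout and the commands for running its test suite are described in the dev guide, not here.

```python
import unittest


class TestStrMethods(unittest.TestCase):
    """Hypothetical tests for built-in str behavior, for illustration only."""

    def test_upper(self):
        self.assertEqual("python".upper(), "PYTHON")

    def test_split_rejects_non_string_separator(self):
        # str.split only accepts a string (or None) as the separator.
        with self.assertRaises(TypeError):
            "a,b".split(2)


# Load and run the tests programmatically so the outcome can be inspected.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestStrMethods)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```

A test like this both documents the expected behavior and guards against regressions, which matters a great deal for a language with strict backward-compatibility expectations.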

Right now, we’re split: our code base is on GitHub, but our bug issue tracker is on bugs.python.org. So, there’s sort of an extra step between the two. So, it’s helpful with decreasing the backlog of PRs and bugs for people to help triage and review PRs. And in the last year, we’ve actually started a triaging team, which is in many ways a stepping stone to core development. And then, really, anything that you do to share your knowledge with the community, like:

Maintaining projects,
Giving talks,
Writing blog posts,
Organizing or attending meetups,

that is a contribution to core Python. So, congratulations. You’ve all made your first contribution to core Python. Or at least the Python ecosystem.
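For the bug triage mentioned earlier, reproducing a report usually starts with knowing exactly which interpreter and platform are involved. The following is a hedged sketch of collecting that information; the function name `environment_report` and the choice of fields are assumptions for illustration, not a format the CPython project prescribes.

```python
import platform
import sys


def environment_report():
    """Collect interpreter details that help others reproduce a bug.

    Hypothetical helper; the selection of fields is an assumption.
    """
    return {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "executable": sys.executable,
    }


for key, value in environment_report().items():
    print(f"{key}: {value}")
```

Pasting output like this into a bug report, together with a minimal script that triggers the problem, makes it much easier for a triager to confirm whether the bug still occurs on a newer version.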

14:44 Contributing to Data/Scientific projects vs to CPython

So, I’m going to spend a minute comparing the different projects. And comparing and contrasting Python versus other projects that are younger like Jupyter, Matplotlib, scikit-learn, and more.

So, what’s similar?

Well, most of these projects use a GitHub workflow.
And with that, I mean we host our source code on GitHub or GitLab.
And we operate with a pull request mindset. And what that means is a pull request is, hey, I would like you to take my code and add it to your code base. And then a maintainer will either say, yes, that sounds like a great idea. I accept it. Or, gee, could you make these changes and then resubmit it? Or in some cases when it’s just not an appropriate contribution, we might say, you know what? This would be better as a third party project or another project.
All projects have code review. And that’s when maintainers and core developers look through code that’s being submitted or documentation that’s being submitted. And make suggestions, hopefully very kindly, about what could be changed or improved.
And many strong projects have automated testing and continuous integration. And it’s really valuable to have that as part of your project because it provides sort of an independent view of what’s going on. So, as a contributor, it sort of provides you guidance. Okay, I’ve done, like, the correct syntax or formatting if I’m submitting code or documentation.
And one thing that I forgot to mention, but it is really important, is that healthy open source projects have not only a code of conduct, but also an onboarding guide or a developer guide that helps new contributors get started and helps existing contributors build their skills.

17:00 Differences between CPython and other Data Science Projects

So, what differs?

Well, there’s a fair amount of stuff that differs between CPython and other data science projects like Jupyter, JupyterHub, and Binder that I’ve been involved with.

And one of the most important or biggest ones is the velocity at which new features are added to the project. Python, as a 30-year-old language, has a long history and a lot of code out in the wild that is being used in production and in scientific research. And as such, we can’t just go and change things as quickly as we could in a newer project because we have more embedded users.
The other thing that’s different about core Python is it’s a language that’s used far beyond just data and science. So, not only do we have to satisfy the data and science community, but we also have communities around web development, sysadmin, DevOps, embedded systems, teaching, and so forth. So, when we look at Python, we’re looking at, okay, how do we keep stability and backward compatibility and security while adding new features? And in many of these other scientific projects, it’s a bit flipped.
We’re looking for adding new features while also creating, you know, a stable, though often changing environment. We might deprecate things, which means we stop using or offering them, within a two-year window, whereas in Python it would be much longer: it could be five years, it could be ten years. And then the context in which these projects are used is also different. CPython is a tool that we use to build things. And it’s a tool that is used as a foundation for many of these projects like Jupyter or Matplotlib to create their own project.
So, the stability of CPython is critical to the stability of our entire data and science ecosystem because what we don’t want to do when we make a new Python release is break a whole bunch of other projects.
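To make the idea of deprecating something over a window of time more concrete, here is a minimal sketch using the standard library's `warnings` module. The functions `old_mean` and `new_mean` are hypothetical names invented for this example; how long a deprecated name stays available is a project policy decision, not something the code enforces.

```python
import warnings


def new_mean(values):
    """Preferred function: arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def old_mean(values):
    """Deprecated alias kept so that existing callers do not break."""
    warnings.warn(
        "old_mean() is deprecated; use new_mean() instead",
        DeprecationWarning,
        stacklevel=2,  # point the warning at the caller, not at this function
    )
    return new_mean(values)


# The old name still works, but emits a warning that callers can act on.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = old_mean([1, 2, 3])

print(result)
print(caught[0].category.__name__)
```

Keeping the old name working while warning about it is what lets downstream projects migrate on their own schedule instead of breaking on the next release.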

And so, yeah, that should give you a sense of the fact that Python is going to move a little slower and really emphasize code quality and security, whereas some of these scientific and data science projects are going to move much quicker. And you’ll see a lot of change in even a year or two years.

20:01 How to start contributing to CPython

So, how do you get started contributing? Or if you’re already contributing, how do you continue and perhaps grow your skills? Well, if you’re a first-time contributor to open source, I want to strongly encourage you to consider making your first contribution a contribution to a project in the scientific and data science community. And I’m going to go so far, since it’s my opinionated talk, as to say I would encourage you to do that over making CPython your first contribution. And the reason why is, going back to the differences between the projects and the Python language, you will find the velocity of change is much higher in these projects like scikit-learn. So, it actually winds up being a way to learn while doing and make an impact while you’re learning, which is much harder to do in CPython. And I highly encourage you to go watch both of these or read the transcripts. They’re excellent.
So, as you get ready to contribute to CPython, and this would apply to most open source projects as well, you want to kind of get into a mindset that will set you up for success.
And one of the first things you can do is sort of check your intent or really identify why you want to contribute. And the reason I say that is when things get bumpy along the way, it’s easier to persist in the process when you have clear goals of what you’re trying to accomplish and why.
And I would encourage you to think about your initial impact and initial scope. Keep it small. It’s much easier to start small and then work up to big.
And I want to also say that patience is probably one of the best things that you can have, along with communication skills, in open source. Most projects are run by volunteers. The vast majority of core developers for CPython are volunteers. All the stuff I do is as a volunteer on my own time. Noteable gives me a lot more latitude than most companies to work on open source, but it’s not my primary job. And there’s less than a handful of folks within the core development team where it is their primary job. All right. So, you’ve got the right mindset. You’re setting yourself up for success.

23:05 Reasons for contributing to CPython

What are some of the common reasons that people contribute to open source? And these are just a few of many. You’re using a project, whether it’s a scientific Python project, and you hit a bug, and that bug possibly is related to something in CPython. So, you might want to fix something in CPython to make your other project run better. You might have come across something that perplexed you or was complicated to learn. So, you might want to improve the documentation so the next person doesn’t have to go through the same process. A lot of people just think it would be cool to contribute to Python, which is a totally valid reason. Many people want to just understand more about how things work and to strengthen their development skills. So, this is just a small subset, but some things that come up time and again.

24:05 CPython Dev Guide

So, your most important resource when contributing to core Python is what we call the dev guide. And it’s located at devguide.python.org. It is a comprehensive guide to contributing to Python. And it’s maintained by the core developers that also maintain the language. It is pretty much everything you ever wanted to know about contributing to Python and then some. As such, there is a quick reference guide, which is really a great place to start. And we’ll talk about it in a little bit. But there’s also things about how to submit a pull request, how to get help, how to run tests, and many other resources. This is sort of your one stop, if you will, towards getting started. And many other projects have something similar, perhaps not as long. But, yeah. So, another helpful prerequisite that will improve your contribution experience with CPython is to take a little time to understand CPython’s culture. And there are different aspects of the culture. Because we have people that have been with the Python language for 30 years as well as people who are relatively new to core Python in the last five years, you’re going to have people with different perspectives and different work styles within the language.

Discourse: discuss.python.org

Personally, I spend a lot of time looking at Discourse, which is discuss.python.org. And less time looking at mailing lists. Partially because I find more value in Discourse than in the mailing list. But there’s a lot of core developers that do the reverse. And then I also spend a lot of time looking at the pull requests and the code itself. So, understanding the culture and where to find information and how the pace at which things happen is really important. Also, understanding the difference between the core language and the standard library. The core language is a smaller subset of what you would think of as core Python. And then the standard library is actually many other smaller libraries that provide additional functions that gave Python the batteries included name. And core Python – the core language is really things like data types and really the fundamentals that you would have in any software development language.

Be kind when contributing to open source

Again, I can’t reiterate enough that core developers are volunteers, so be kind. Many of us are wearing many different hats. On the flip side, the core developers should be kind to you as well. So, we do have a code of conduct. And, you know, I encourage you, if you’re seeing behavior that is not professional, please let folks know. Understanding Git and GitHub workflow is, I think, very important for contributing to core Python. At the point where people are contributing to core Python, the general assumption in the community is that you have basic Git and GitHub workflow experience and understanding. And if you jump back to where I said, oh, I’m encouraging you to also consider contributing to other data science and science libraries, partially it’s because those projects tend to be a little bit gentler with new contributors that are still learning Git and GitHub. And there’s lots of information out there about doing it. Software Carpentry has a great guide. And there’s a talk which maybe we can link to in the future. Unfortunately, it wasn’t recorded, but there is a slide deck that I put together, gosh, probably in 2016, for complete beginners at Write/Speak/Code who are learning Git and GitHub. And it really is a very gentle introduction to both but has been very popular. And then the other prerequisite is some familiarity with Python. It’s not necessary to know C. It’s not necessary to know C++. So, most of the language is written in Python so you can be very effective without really understanding much, if at all, from C or C++.

29:23 How to build and test CPython from source

So, that’s a lot to digest. And I want to – for those of you that want a great exercise later or when you’re ready, I want to just give you the brief directions on how to build core Python from source. And people often think it’s a really super complicated process. And the quick reference guide in the dev guide actually runs through these steps. There may be some subtleties based on the operating system that you’re running. But essentially, what you’re going to do is fork and clone the source code from GitHub, which is python/cpython. Then you’re going to use a C compiler, and most of the time a C compiler is kind of available with your operating system, to configure and build Python. It’s one command to configure and build it with Unix, Linux, and Mac. Make is what will actually do the building. And then on Windows, there’s a .bat file that combines those same steps. So, to configure and build Python is one step. So, what would you want to do as you’re starting? Well, a good place is just to run the tests and again, like building and configuring Python, it’s one line of code. And this should look pretty familiar to folks that have done Python. It’s just basically executing the test library. So, there you go. You have now learned how to build, configure, and run tests for Python. The dev guide will definitely give you additional questions and answers on how to contribute. Again, remember, changes to the language are submitted as GitHub pull requests. Our continuous integration will run the automated tests just like you’re running them locally. It will run them in an automated way across all operating systems and a number of different versions of Python. The next step in the process, if you submit a PR, would be to wait for some review from a core developer or a contributor. Address the feedback as appropriate. And hopefully then you will have a final core developer review. And if luck will have it, it will be merged into the code base and folks can use it from then on.
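The build-and-test steps described above can be sketched as two shell functions for Linux or macOS. This is a sketch under assumptions rather than the dev guide's exact text: the function names are hypothetical, the GitHub username is a placeholder you supply for your own fork, and the `--with-pydebug` flag and the job counts are common choices, not requirements.

```shell
# Hypothetical helper: clone your fork, then configure and build CPython.
# Usage: build_cpython <your-github-username>
build_cpython() {
  git clone "https://github.com/${1:?GitHub username required}/cpython.git" &&
  cd cpython &&
  ./configure --with-pydebug &&   # configure a debug build
  make -j4                        # compile the interpreter with 4 parallel jobs
}

# Hypothetical helper: run the test suite with the freshly built interpreter.
# Run it from the cpython directory. On macOS the binary is ./python.exe.
run_cpython_tests() {
  ./python -m test -j4
}
```

Nothing runs until you call the functions, for example `build_cpython your-username` followed by `run_cpython_tests`. The dev guide remains the reference for platform-specific details, including the Windows `.bat` build script mentioned in the talk.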

35:35 Resources for contributing to python

For those of you that want to go deeper in your understanding of CPython, there is a wonderful blog post, which is now a book, by Anthony Shaw: a guide to the CPython source code. It is the most accessible yet highly technical explanation of how CPython works, you know, down to the lowest levels. In fact, it was so good that when it first came out, I took the entire blog post, copied it all, had it bound in a spiral, you know, thing so that I could refer to it on a day-to-day basis. And I encouraged Anthony to write a book based on it and he did. So we’re very lucky that there’s many ways of accessing his materials. And he’s also a very prolific speaker. So there’s lots of stuff on YouTube as well. Another great resource for learning about Python. I don’t know if you can see me or if you’re just seeing my slides. But there is a book called High Performance Python by Micha Gorelick and Ian Ozsvald. And it is, I think, an outstanding book for both learning Python and how it is built and comes together.

Performance in Python

But also the things you can do to improve the performance of Python. One of the things you might hear in the media is Python is slow, Python, you know, isn’t performant because there is this global interpreter lock, or GIL. And what the GIL does is it limits, at certain points, the processing to one thread, if you will, at a time. And that tends to be a bottleneck. Because right now we have many multi-core processors and things like that. You’d want to use all that and not have to bottleneck into things. But High Performance Python runs through things like how to profile your code, how to use multiprocessing, and how to use Cython, which is a great project in our ecosystem that allows us to do a lot more CPU-intensive stuff. I believe it also covers Numba, which is another great project that uses just-in-time compilation to get around the GIL. So, yeah, there’s lots of ways you can improve your performance. And these two resources are just some of the many that are out there. In addition to those resources, there are many core developers who keep websites that have a lot of technical content: Victor Stinner, Brett Cannon, and Guido, with a lot of historical information from Guido as well.

Sprints in Python

Much like the sprints that have been held for scikit-learn, CPython typically runs sprints when we have in-person conferences, which sadly 2020. But the hope is that there will be some being done virtually. And if anybody wants to run a CPython sprint, let me know. I’d be happy to kind of help guide you with that. Many of the Python – PyCon talks, both from PyCon US and beyond, explain how to contribute to core Python. I think my PyCon talk from 2015 was how to contribute to core Python when you’re not a core developer. Mariatta has given great talks about contributing both to the language and to the workflow, as has Victor Stinner. And for those of you that have any interest in asynchronous programming, Lukasz has a great series on YouTube about asyncio. It’s about seven parts. And it’s a really great introduction. So we can always use help from folks that are interested in asynchronous programming.

37:10 CPython community

And this is just a small subset, maybe a third of the core developers. As volunteers, our time is limited. But the community is key. And it’s what lets Python, both as a language and ecosystem, thrive. So I want to encourage you to go forward and contribute. Join the discussion on Discourse, which is discuss.python.org. And whether you’re contributing to Python or any of the projects in the ecosystem, you are creating real change and helping others solve important problems in the world. So I want to thank you for listening to me. And thank you to Reshama and the organizers. And I’m going to shut off my screen. This is already available on Speaker Deck. And it will be available through Data Umbrella as well. And I am happy to take any questions that might have come up. Okay.

38:21 Getting started with Q&A

So there’s a question about the slides. So I will find the Speaker Deck link for that. It’s on Speaker Deck. Okay.

38:38 Q: Is it needed to reproduce old bugs against new versions?

So the next question is, is the need to reproduce old bugs against the newest version of 3.9 a need of the project? Depends on how old the bug is. If the bug is like within two years old, I’d say, yeah, it’s probably useful to reproduce those. In general, my view, and it’s strictly my view, it’s not an official view in any way, shape, or form, is that personally I would close the vast majority of issues that are over, let’s say, three years old, because they’re still accessible to people to find if needed, but the likelihood of them being worked on, I think, is fairly low. We are going to move the issue tracker to GitHub. There is work in progress to do so. And that should make the whole process more streamlined. And it will let us do some things with notebooks and some of the tools that we have in our ecosystem to help surface issues that haven’t been looked at and to help recognize contributors. There’s many different things we can do with the data that is available within the repo. Hopefully that answers the question. Okay.

40:40 Q: Multiprocessing vs Threading to increase performance

The next question is: the GIL is a problem, but the advice I’ve seen to avoid being limited to one core is to use multiprocessing over threads. » Okay. So there’s a lot of different perspectives. With the GIL, which is the global interpreter lock, there are different ways of getting performance that get around its limitations. Multiprocessing is a great way. I would say threads would not be my first choice in how to do that. But there are also things like Cython, Numba, and, depending on the use case, asyncio. But, yeah, it’s a great question. And the resources that I mentioned, particularly High Performance Python, will give you excellent advice on how to get the most out of your deployments in a safe and efficient way. » Okay.

41:19 Q: Does the project have any process to find a mentor?

And the next question is, what is the best way to attract a mentor from the core team? Does the project have a formal process for being mentored? » So there is a core mentorship mailing list which used to be more active. Right now, mentors, you know, because we’re all volunteers, people mentor as they’re able to. I typically, my mentoring is basically being welcoming and answering questions that people directly ask me. I just don’t have the time and bandwidth to mentor individuals. But one of the things that we do have is the triage team. So if you’ve been involved for a while, somebody may ask you, hey, do you want to join the triage team? And the triage team, you know, has some additional capabilities that can do things like make comments or attach labels to different issues. And that’s really a great way that recently folks have found mentors as a result of the work that they were doing within triage.

And then Victor Stinner has been really great at mentoring folks, as well as Barry and Raymond Hettinger. There’s a lot of people that do: Eric Snow and Pablo Galindo Salgado, who’s our release manager for 3.10 and 3.11, have also mentored folks. And Victor’s got a lot of writing on it. So hopefully that helps. And feel free to reach out to me directly. And I can probably provide additional resources. » What’s the best way to reach you, Carol? » Reshama knows this. I stink at email. GitHub is really the best way: @mention me on something. But, you know, my DMs are open on Twitter. And, you know, I will get to it when I can get to it. It’s not that I don’t want to respond. But the volume of stuff that comes in is such that I could spend my whole day doing email and nothing else. So, you know, I tell folks that I meet, you know, be persistent if you don’t hear from me once or twice. Do not hesitate to email me a third time. And I will do my best to answer.

44:00 Q: From your viewpoint what is the future of Python?

Definition: GIL
The Python Global Interpreter Lock (GIL) is a mutex, or a mutual exclusion lock, within the CPython interpreter (the default and most widely used implementation of Python). Its primary purpose is to ensure thread safety by allowing only one thread to execute Python bytecode at a time, even on multi-core processors.

Reshama: The next question is from your viewpoint, what does the future hold for Python for instance for the next 30 years?

Carol: Wow. That’s 30 years. I hope that Python is as vibrant 30 years from now as it is today. I think in many ways, you know, the whole data science, scientific Python ecosystem has been one way that has really revitalized the language over the last five years. Another place that we’re seeing a lot of growth is in embedded systems, things like MicroPython and CircuitPython. And, you know, that’s really exciting to me from an open hardware and education standpoint and getting young folks or folks that maybe aren’t computer scientists to start with involved. I would love to see us improve the performance of the language. And what that will look like, I’m not entirely certain. There’s definitely going to be efforts over the next five years to do that. Some of it is time, some of it is funding. But I – the one thing I hope stays the same 30 years from now is the readability of the language. And I think because Python is so readable, that actually makes it much more accessible for folks, and tools like the notebook kind of break things down and provide a good education tool for learning more about Python. So the community is going to be the one that drives where Python goes. And truth be told, there’s another language that I also find really interesting that probably is relevant to many members in this community, and that is Julia. It’s still a very young language, but I think it has a lot of potential and promise. And it’s very similar to Python in a lot of ways, but gets around issues like the GIL and relies more on C, C++. And should you ever want to compare Julia code with Python code, Tom Sargent’s QuantEcon website is excellent for doing that because even if you’re not an economics person, it has economics code written in Python and similar economics code written in Julia and it’s a good compare and contrast when you’re learning. So I don’t know. That’s about all I can say for the future.

Speed in Python

There’s been a lot of discussion I see going on because O’Reilly published a report too about Python. And I guess speed is the thing that’s most under discussion, right? It’s interesting because oftentimes, and this has been true historically, I’m 54, I’ve been in this industry a long time, speed has always been the key question. Is this faster than the other? Is Vim faster than Emacs? Speed is relative. And it depends on what you’re measuring. I would say one of the things that is not often discussed and really should be is the speed to create a project. And Python is a really efficient language for going from no code to prototype to production. And there’s a value in that beyond just pure processing power. And now, that said, there are certain things where, if you’re CPU bound, like high CPU operations versus I/O operations, and web is going to be different than pure number crunching, your performance is going to vary. And one of the things I think that would be really useful is to try things out: profile your code, look at the other tools for increasing the speed. There’s tradeoffs regardless of what language you use. And I encourage you to try other languages as well.

48:47 Q: What are the efforts of the Python Steering Council to make the community more inclusive?

Reshama: There’s another question about, you mentioned the global reach of the tech community. Aside from events like this, what is the role of the Python steering committee in making the community more inclusive?

Carol: So, the steering council has been a part of some of the code of conduct decisions and discussions related to core developers. And we have this year taken some actions to redirect or remove folks that were probably not contributing constructively to the community.

So, you know, adopting the code of conduct throughout the project was one way.
The steering council also looks at efforts, whether it is the core language summit or the core developer sprints. Those used to be exclusive to only core developers. And in more recent years, we’ve invited more people to them, both to increase the diversity on a number of dimensions and to broaden the use cases represented.
My hope has long been that diversity will benefit from the move to GitHub. I was an early proponent of moving the code base to GitHub and have been a strong advocate for moving the bug tracker to GitHub or it could have been GitLab. But something where the tooling is modern and is more accessible to more people. Because one of the things I personally feel when you’re an underrepresented group, your time is more limited than if you are in the majority. So, you have to pick and choose what you work on much more carefully. And by having a roadblock, like having to learn an issue tracker versus using tooling that you’re used to from your day-to-day work, I think that adds a barrier to inclusion.
So, yeah. So, it’s really trying to shift the culture towards a more inclusive culture. And I think that’s a really good question.
Because, you know, the Python Software Foundation is also connected to improving inclusivity in the Python space. Yes. That’s a great suggestion or comment. Because we also have within PyLadies, there’s the ability to ask questions. Just like with R-Ladies, there’s an ability to ask questions on core development within those channels.

52:29 Q: What will be the impact of the PEG parser?

Another question here is, what will be the impact of the new PEG parser in future Python tools and libraries? And maybe you could explain what PEG means to someone like me who doesn’t know. So, we affectionately call it the PEG parser. And when you’re creating a language, you have a grammar. You know, your syntax. And the language has to have a way to sort of know, okay, this is a key word. This is a variable. This is an operation. Things like that. And that is the job of the parser. I believe that the PEG parser will give us more flexibility over time. It is probably equally as performant as the prior parser. And there was a lot of extensive testing done during an entire release cycle to make sure that, you know, it wasn’t introducing weird regressions and things like that. But one of the things, it is much easier to write grammar rules now with the PEG parser than it was with the old parser that we have. So, hopefully that answered a little bit what it is and why it is. And someone who can provide far more answers than I would be Pablo Salgado and do reach out to him. He’s probably got some talks on the PEG parser as well.

54:12 Q: Development time vs. Code performance optimal performance

And one last question, which is, I heard a talk that is a twist on the speed demand. Instead of focusing developer time and expertise on optimizing to O(1), go with O(n²) performance because developer time is more expensive than RAM is. What are your thoughts on that? » I think it really comes back to use cases. And I think the difficulty with Python is because we serve so many different communities, what is optimal for one community may not be optimal for other communities. So, you know, it’s an approach. Is it the approach? I don’t know. To be really honest. It wouldn’t be my first approach, I guess is what I’m saying.

55:10 Thank you!!

Reshama: So, with that, we are reaching the top of the hour. And so, thank you so much, Carol, for taking the time to join us and sharing about core Python. I have a link to the slides. I put them in the chat. And I’ll include them in the video when it’s uploaded to YouTube. And there’s a bunch of links that you’ve mentioned, which I’ve been putting in chat. And I will link to it in the YouTube description or somewhere very convenient. Thank you so much.

Carol: Great. Thank you. And thank you, everybody, for attending. I hope you learned something about contributing to open source. And I hope you do so as you’re ready to do it.

Reshama: Thank you.

Interview with Sam Miyamoto: Giving My First Talk

Mon, 31 Mar 2025 07:00:35 +0000

We speak with Sam Miyamoto about her experience giving her first talk.

Tell us about yourself

Hi all! My name is Sam Miyamoto, and I’ve been a contributing member of the Data Umbrella community since 2023. I am based in the Los Angeles, California area. I was introduced to Data Umbrella by way of SPEC and Hackbright Academy.

Have you given a talk or tutorial before or participated in panel discussions?

“Intro to Git” was my first solo talk on a technical topic in a virtual format. However, I have a fairly broad definition of what a talk means. I’m thankful to every educator I’ve ever had in any setting who included student-given presentations in their lesson plans, from elementary school until now. Not to mention the friends and colleagues who encouraged shared curiosity. They are giants!

When was your first talk?

My first talk was with the Data Umbrella community. I reached out to Data Umbrella, as I often contribute to timestamps for their webinars. The talk was Intro to Git in December 2024.

What inspired you to do that presentation?

That’s a good question… it was a combination of a few factors. Speaking candidly, I wanted to present on Git from my beginner’s perspective because I wanted to learn and understand more by teaching.

Git is complex for me, and I was fortunate to have a clear starting point provided from work and volunteer experience from which to “branch” off and explore while communicating in team-based work in pursuit of a shared objective.

Do you have any suggestions for first-time speakers?

I would suggest / encourage that first-time speakers ask thoughtful and personal questions of themselves while thinking of a topic on which to present. Some examples can include: Why that topic? Why now? Or later? What about it resonates with you? How will you sustain the audience’s attention? What journey will you navigate the audience on?

Virtual event speaking can sometimes be a balancing act, and the Data Umbrella community walked me through those steps.

And if you couldn’t already tell… sometimes the rambles go on for a bit.

Also, virtual talks helped me with landing a job, as they demonstrate a variety of communication skills.

What has been your experience since you have presented? What are the benefits of public speaking?

Public speaking (especially one that is recorded for anybody to see on the web at any hour of the day) can be intimidating sometimes. Doing that presentation helped me practice some information synthesis, overcome some fear, and test / refine some technical communication skills in a live setting as well. Communication refinement is (hopefully) a never-ending cycle. Data Umbrella’s audience and community is respectful.

Anything else you would like to share?

Thank you and the Data Umbrella community for the time. Presenting was a special and challenging experience that I valued, and I look forward to tuning in to additional events.

Sam Miyamoto presenting to the Data Umbrella community

Connecting with Sam Miyamoto

LinkedIn: Sam Miyamoto
Website: www.smiyamoto.dev
GitHub: @samvmdev

Getting Started with Bash Scripting

Sat, 29 Mar 2025 09:30:00 +0000

This post explores the fundamentals of Bash scripting, focusing on how it’s used in data science workflows and how to write more effective scripts. It builds upon the core concepts presented in the Data Umbrella webinar, Intro to Bash Scripting.

Intro to Bash Scripting: by Rebecca BurWei

Resources

Repo: https://github.com/rebecca-burwei/intro-to-bash-scripting/
Slides: https://docs.google.com/presentation/d/1X9pOOEFOIK2oI26VvuNKRBC8psIM8HnGqOZ6fOUF8jM/
Bash file examples: https://github.com/rebecca-burwei/intro-to-bash-scripting/tree/main/bin

Section Timestamps of Video

00:00 Data Umbrella Introduction
04:05 Rebecca begins presentation
05:24 Agenda
05:39 What is bash? A brief history of shells
06:56 What is bash used for?
07:58 Scripting basics + resources
09:36 What does this code do? Analyzing an ETL script
11:20 (three min. of quiet time begins)
14:24 Begin code walkthrough
15:33 Deeper dive on curl and ssconvert
18:33 Continuing code walkthrough
20:16 Quick summary
20:45 Making the code into a script
22:45 Making the script executable
24:17 Running the script
25:52 Turning code into a script - best practices
26:01 Ways to customize your bash environment
26:54 Customize bash environment - .bashrc example
28:36 Song break + interactive chat time
33:06 Returning from break, reviewing chat
34:27 Improving the script - error handling + demo
41:04 Error handling - more resources
41:24 Improving the script - logging + demo
46:15 Improving the script - options + demo
50:00 (two min of quiet time begins)
52:05 Continuing options demo
54:16 Summary + mentorship/collaboration
55:07 Q&A - ssconvert/gnumeric, GNU, esac

What is Bash and Why Use It?

Bash is a command-line interpreter that allows users to interact with Unix-like operating systems (e.g., macOS, Linux). A fun fact: Bash stands for “Bourne Again Shell”, a play on the name of the earlier Bourne shell (sh). It’s a powerful tool for tasks like file management, process control, and job automation. While not typically used for core data analysis tasks like model building or statistical analysis, Bash is invaluable for setting up environments, managing files, and automating repetitive tasks.

A Basic ETL Script in Bash

Let’s examine a simple script that performs an Extract, Transform, Load (ETL) process. This script downloads an Excel file, converts it to CSVs, extracts a specific column, and then cleans up the downloaded files.

#!/bin/bash
# Title: Process Data
# Date: 2024-03-14
# Author: Rebecca BurWei
# Version: 1.0
# Description: Download data, convert to CSV, extract column, and clean up.
# Options: None

URL="http://www.econ.yale.edu/~shiller/data/ie_data.xls"
echo "Downloading data from $URL"  # Print a message indicating the download source.
curl -o stock_data.xls "$URL" # Download the file using curl. The -o option saves it under the given name (stock_data.xls).
ssconvert -S stock_data.xls stock_data.csv # Convert the Excel file to multiple CSV files (one per sheet).
cut -d, -f10 stock_data.csv.4 | head  # Extract the 10th column (using comma as delimiter) and display the first few lines.
rm stock_data* # Remove all downloaded and created files (using wildcard *).

The script begins with a shebang (#!/bin/bash), a special type of comment, also called a hashbang, that specifies the interpreter. The comments that follow record the title, date, author, version, and description. The script then defines a variable URL containing the download link. The echo command prints a message to the console. The curl command downloads the file, and ssconvert converts it into multiple CSV files; the script will not work if ssconvert (part of Gnumeric) is not installed. The cut command extracts the 10th column, using a comma (,) as the delimiter, and head displays the first few lines of the output. Finally, rm removes the downloaded and created files.
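Because the script depends on external tools such as curl and ssconvert, it can help to check for them up front. The following is a minimal sketch, not part of the original script; the function name `require` is hypothetical.

```shell
# Hypothetical helper: report any command that is not available on PATH.
require() {
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "Missing required command: $cmd" >&2
      return 1
    fi
  done
}
```

A script could then start with `require curl ssconvert cut || exit 1`, which stops early with a readable message instead of failing halfway through.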

Figure 1: Resources

To make this code executable as a script, save it in a file (e.g., process.sh), and use the chmod command to grant execute permissions:

chmod u+x process.sh

Here, u+x gives the owner (user) of the file execute permission so that the script can be run.
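To see the effect of chmod u+x, the following sketch creates a throwaway script and checks its execute permission before and after. The temporary file stands in for process.sh and is an assumption made for the demonstration.

```shell
# Create a throwaway script; mktemp files start without execute permission.
demo_file="$(mktemp)"
printf '#!/bin/bash\necho "hello"\n' > "$demo_file"

[ -x "$demo_file" ] || echo "before chmod: not executable"

chmod u+x "$demo_file"   # grant execute permission to the owner only

[ -x "$demo_file" ] && echo "after chmod: executable"
```

Running `ls -l` on the file before and after shows the same change as an added `x` in the owner's permission triplet.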

You can run the script using either source process.sh (runs in the current shell) or ./process.sh (runs in a subshell). If you type just the path to the script, the current shell creates a subshell and runs your script in that subshell only, so anything the script changes, such as variables, does not carry over into your session. The latter is generally preferred for cleaner execution.

source process.sh
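The difference between the two ways of running a script can be seen with a small experiment. This is a sketch under assumptions: the temporary script and the GREETING variable are made up for the demonstration, and `bash "$demo_script"` stands in for `./process.sh` because both run the script in a separate process.

```shell
# Create a throwaway script that only sets a variable.
demo_script="$(mktemp)"
printf '#!/bin/bash\nGREETING="set by the script"\n' > "$demo_script"

unset GREETING

# Run it as a separate process: the variable does not survive.
bash "$demo_script"
echo "after running in a subshell: ${GREETING:-<unset>}"

# Source it: the commands run in the current shell, so the variable persists.
. "$demo_script"     # "." is the portable spelling of "source"
echo "after sourcing: $GREETING"

rm -f "$demo_script"
```

The first echo reports the variable as unset, while the second prints the value assigned inside the script.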

Bash provides several files to [customize your environment](https://youtu.be/1pQ527fGhVQ?si=lGkFFhyLODqmZLEP&t=1559).  Key files include:

1.  **`.bash_profile`**: Executed upon login (opening a new terminal).
2.  **`.bashrc`**: Executed when a new subshell is started.
3.  **`.bash_logout`**: Executed upon logout, useful for cleanup tasks.

These files can be used to set aliases, define environment variables, export variables, and configure shell behavior.
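As an illustration, a `.bashrc` might contain lines like the following. The alias name and the PROJECT_DATA_DIR variable are hypothetical examples, not taken from the webinar.

```shell
# Illustrative ~/.bashrc contents (names are made up for this sketch).
alias ll='ls -alF'                     # shortcut for a detailed directory listing
export PROJECT_DATA_DIR="$HOME/data"   # environment variable visible to child processes
```

After editing the file, run `source ~/.bashrc` or open a new shell for the changes to take effect.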

## Enhancing the Script: Error Handling

By default, Bash continues execution even if a command encounters an error. This behavior can be modified. One approach is to use `set -e` (or `set -o errexit`), which causes the script to exit immediately upon encountering an error. [You can set errexit, and this will make Bash stop when it runs into an error.](https://www.youtube.com/watch?v=1pQ527fGhVQ&t=796s)
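The effect of `set -e` can be seen by running the same two commands with and without it. This is a minimal sketch that uses `bash -c` to run each variant in its own process, with `false` standing in for any failing command.

```shell
# Without errexit: the failure of `false` is ignored and the echo still runs.
bash -c 'false; echo "without errexit: still running"'

# With errexit: bash stops at `false`, so the inner echo is never reached.
bash -c 'set -e; false; echo "never printed"' ||
  echo "with errexit: script stopped with status $?"
```

The second command prints the status reported by the stopped script, which is the exit status of the command that failed.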

A more robust approach is to use `trap`. `trap` allows you to define custom actions to be taken when specific signals are received. A common use case is to handle `ERR`, a Bash pseudo-signal that is triggered when a command exits with a non-zero status (indicating an error).

Here's how to implement error handling with `trap` in the script:

```bash
#!/bin/bash
# ... (previous comments) ...

# Print the failing line number and exit status, then stop the script.
handle_error() {
  echo "Error on line $1 with exit status $?"
  exit 1
}

# Run handle_error whenever a command exits with a non-zero status.
trap 'handle_error $LINENO' ERR

URL="invalid_url" # Introduce an error for demonstration.
echo "Attempting to download data from $URL"
curl -o stock_data.xls "$URL"
ssconvert -S stock_data.xls stock_data.csv
cut -d, -f10 stock_data.csv.0 | head
rm stock_data*
```

Figure 2: Code explanation

In this modified script, handle_error is a function that prints an error message including the line number ($1, passed from $LINENO) and the exit status ($?). $1 refers to the first argument or the first positional parameter and $? refers to the status code or the exit status of the last command that was run. The trap command associates this function with the ERR signal. Setting URL to an invalid value demonstrates the error handling in action.
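The two special parameters used by handle_error can be tried on their own. The following sketch uses a made-up function name, `describe`, and the built-in commands `false` and `true` as stand-ins for failing and succeeding commands.

```shell
# $1 is the first positional parameter passed to a function (or script).
describe() {
  echo "first positional parameter: $1"
}
describe apple banana

# $? holds the exit status of the most recently executed command.
false
status_after_false=$?
echo "exit status after false: $status_after_false"

true
status_after_true=$?
echo "exit status after true: $status_after_true"
```

Because `$?` is overwritten by every command, the sketch copies it into a variable right away before using it.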

Enhancing the Script: Logging

Instead of printing output directly to the terminal, it’s often beneficial to redirect output and errors to files. This is called logging.

You can achieve this using redirection operators:

> redirects standard output (stdout) to a file, overwriting it.
>> appends standard output to a file instead of overwriting it.
2> redirects standard error (stderr) to a file.

Here’s how to modify the script execution to implement logging:

./process.sh > output.txt 2> errors.txt

This command runs process.sh and redirects its standard output to output.txt and its standard error to errors.txt, so anything that would have gone to the terminal is written to those files instead.
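The redirection operators can be tried without the full ETL script. In this sketch, the `emit` function is a hypothetical stand-in for process.sh that writes one line to each stream, and the log files live in a temporary directory.

```shell
# Stand-in for a script: one line to stdout, one line to stderr.
emit() {
  echo "to stdout"
  echo "to stderr" >&2
}

log_dir="$(mktemp -d)"

# First run: > and 2> create (or overwrite) the two log files.
emit > "$log_dir/output.txt" 2> "$log_dir/errors.txt"

# Second run: >> and 2>> append to the existing files instead.
emit >> "$log_dir/output.txt" 2>> "$log_dir/errors.txt"

cat "$log_dir/output.txt"   # shows "to stdout" twice, once per run
```

Nothing from `emit` reaches the terminal directly; the final `cat` is what displays the captured output. The temporary directory in `$log_dir` can be deleted once the files have been inspected.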

To encapsulate the logging within a separate script, you can use a wrapper script like this. Here is the logger.sh script:

#!/bin/bash
# ... (comments) ...
# Options:
#   $1: Path to the script to be logged.

"$1" > "${1}_output.txt" 2> "${1}_errors.txt"

Figure 3: Making the code into script

This logger.sh script takes the path to another script (passed as the first positional parameter, $1) and executes it, redirecting output and errors to files with appropriate names.

Enhancing the Script: Adding Options

Using positional parameters ($1, $2, etc.) can be fragile because the order matters. Options (e.g., -f, -d) provide a more robust and user-friendly way to configure script behavior. The getopts built-in command helps parse options.

Here’s a modified version of the script that accepts an -f option to specify the columns to extract:

#!/bin/bash
# ... (previous comments) ...
# Options:
#   -f <fields>: Comma-separated list of fields to extract.

while getopts "f:" opt; do
  case $opt in
    f)
      fields="$OPTARG"
      ;;
    \?)
      echo "Usage: $0 [-f <fields>]"
      exit 1
      ;;
  esac
done

URL="http://www.econ.yale.edu/~shiller/data/ie_data.xls"
echo "Downloading data from $URL"
curl -O "$URL"
ssconvert -S stock_data.xlsx stock_data.csv
cut -d, -f"$fields" stock_data.csv.0 | head
rm stock_data*

This version uses getopts "f:" opt to parse the -f option. The string "f:" lists the valid options, and the colon after f indicates that -f requires an argument. The case statement handles each option: if -f is provided, its value (accessed via $OPTARG) is stored in the fields variable, so fields would be, for example, 10 or 10,1. If an invalid option is provided, a usage message is printed, and the script exits. The cut command now uses -f"$fields" to extract the specified columns. The esac keyword, which is case spelled backwards, closes the case statement and tells Bash that the case block is done.
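The option parsing can be tested without downloading any data. The sketch below wraps the same getopts loop in a function; the function name parse_fields, the default value of 10 when -f is omitted, and the sample a,b,c input are assumptions made for illustration and are not part of the original script.

```shell
#!/bin/bash
# Sketch: the getopts loop wrapped in a function so it can be tried with different arguments.

parse_fields() {
  local OPTIND=1 opt        # reset getopts state on every call
  local fields="10"         # assumed default when -f is not given
  while getopts "f:" opt; do
    case $opt in
      f) fields="$OPTARG" ;;
      \?) echo "Usage: parse_fields [-f <fields>]" >&2; return 1 ;;
    esac
  done
  shift $((OPTIND - 1))     # drop the parsed options; "$@" now holds any remaining arguments
  echo "$fields"
}

parse_fields -f 10,1        # prints 10,1
parse_fields                # prints 10

# Feed the parsed value to cut, as the script does.
printf 'a,b,c\n' | cut -d, -f"$(parse_fields -f 1,3)"   # prints a,c
```

Setting a default avoids calling cut with an empty field list when the option is omitted. Resetting OPTIND matters only because the loop runs inside a function that may be called more than once.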

These examples show how to build upon a simple Bash script to add functionality, improve robustness, and make it more user-friendly.

About the Speaker: Rebecca BurWei

Rebecca BurWei is a Staff Data Scientist at Mozilla. She has a patent in computer vision and a PhD in mathematics. She learned to code in open-source communities, and is passionate about developing the technical leadership of others.

Connect with the Speaker

GitHub: rebecca-burwei
LinkedIn: Rebecca BurWei

Introduction to Git

Wed, 22 Jan 2025 12:01:35 +0000

Event Introduction

Join us for an introductory session on Git, the quintessential version control software used by developers to collaborate and manage projects. Discover some fundamental concepts, explore its practical applications, and learn how to efficiently use both the command line and the Visual Studio Code graphical user interface (GUI). Whether you’re a seasoned developer or just starting your coding journey, this webinar will equip you with essential skills.

Video

Resources

Slides
Repo: Sam’s live working example
Download: VS Code
Download: Git

Section Timestamps of Video

00:00 Data Umbrella Introduction
04:19 Sam begins presentation
05:19 Agenda / Outline
05:41 What is Git? 30,000 ft overview
08:15 The 3 stages of a file in Git
09:27 Git terminology
11:00 Branching
12:38 Git Branching - parallel development
14:32 Git Demo and pre-requisites
17:25 VS Code
17:40 Configure git identity
21:36 Create a repo with: git init, VS Code, GitHub
25:40 Make a commit
33:50 Create a branch via VS Code
36:24 Collaborating in the cloud: GitHub
44:05 Q: Is there a way to see the global configurations file?
44:44 Git Graph (add this extension): visualize and act on branches
45:52 Q: What is the idea of a branch?
46:48 Q: Are main and master branch the same?
47:43 Q: What is the difference between “git remote add origin” and “git clone”?
49:18 Q: Show the code for pushing to a branch
51:03 Q: How to visualize Git Graph?
51:43 Q: What are differences between “git pull” and “git fetch”?
52:38 Q: Explain “git stash” and how it differs from regular staging (“git add”)
53:22 Q: What is the difference between using the HTTPS and SSH keys for repo URLs?
55:27 Troubleshooting some common git issues
56:00 Merge conflicts
01:02:33 Git Graph
01:11:30 git rebase
01:21:32 Submit a pull request (PR) on GitHub
01:22:38 git ignore file
01:24:35 Continuing learning
01:26:30 Q: git stash (temporarily store changes), Can you go through an example?
01:32:18 git stash show, to unstash: git stash pop

About the Speaker

Bio

Sam Miyamoto, MPH is a software engineer and civic tech enthusiast based in the Los Angeles, California area. With experience in varying roles in sectors like clean transportation, public health, data storage, renewable energy, and more, she is dedicated to solutions that have impact. Fun fact: She once went composting with a group of friends at Griffith Park.

Connect with the Speaker

GitHub: @samvmdev
LinkedIn: Sam Miyamoto

Key Links

Data Umbrella Interview: Joe Lucas

Mon, 08 Jan 2024 12:01:35 +0000

Authors: Reshama Shaikh, Beryl Kanali & Joe Lucas

Tell us about yourself.

I’ve moved all over the United States, but I’m originally from Pennsylvania and now live in Texas with my wife and two daughters. I just bought a pair of cowboy boots, so I guess you can say I’m getting used to the change. I’m currently on the AI Red Team at NVIDIA, which perfectly blends my two professional interests: math and cybersecurity. “Red Teaming” comes from the concept of “red versus blue” where the Blue team consists of developers and security folks, while the Red team thinks and acts like the adversary. By red teaming, we test defensive tools and processes to help provide recommendations. In the end, we’re all contributing towards a more secure product and developer ecosystem. So an AI Red Team performs security reviews, penetration tests, and adversary emulation operations to probe and test the security of MLOps and deployed AI systems. I originally got interested in the machine learning security field when I learned about adversarial images (images that look correct to humans but are intentionally misclassified by classifiers), but the ride has gotten wilder since the most recent AI wave.

How did you learn of the scikit-learn June 2020 (first online) global sprint, organized by Data Umbrella?

I think I saw it announced on Twitter. At the time, I was just leaving the Army and it had always been one of my goals to contribute to Open Source. I had been using scikit-learn in my job as a data scientist and this seemed like the perfect opportunity to learn how to give back.

How was your experience?

I’m incredibly grateful for the experience. At the time, I understood the concept of open source software, but didn’t really understand how to actively deliver and contribute (especially to a project as established as scikit-learn). The structure of the Data Umbrella sprint resolved many of these unknowns and enabled me to feel like I’d both made a meaningful contribution and learned how to interact in the open source community.

I wrote an article, “My Open Source Adventure”, to document and share my experience.

What was the value of the open source sprint? How has that shaped your path in open source?

The sprint gave me the experience and confidence to continue engaging with open source projects. I learned more about maintainers’ perspectives and challenges and now understand that there is a wide spectrum of valuable contributions.

To which OSS projects and communities do you contribute?

The Data Umbrella sprint showed me that not all valuable contributions are new features; there may be tests, documentation, or continuous integration modifications, for example. Lately, I’ve been identifying ways to leverage my professional application security experience to contribute to the security of projects. I started out by joining the Jupyter Security Subcouncil and have recently been contributing as the chair of the NumFOCUS Security Committee. Just like I learned during the Data Umbrella sprint, we can contribute to the security of the projects we use and appreciate through education, establishing standards, enhancing processes, technical reviews, and security services.

Any advice or tips you have for people starting out in open source?

Just do it! Be aware that you will likely need help from the maintainers, and that help is part of the larger open source “gift economy”. Be gracious, communicative, and willing to learn.

What are your favorite resources, books, courses, conferences, etc?

More than recommending a specific conference, I’ll recommend that you submit to speak at a conference. It’s a great mechanism to really test your understanding of the material and meet people with similar interests. Even if you’re new to these communities, you likely will have some valuable experience to share and many conferences offer mentors to help new presenters prepare.

What are your hobbies, outside of work and open source?

I have too many hobbies… lately I’ve been spending a lot of time training my puppy and trying to learn how to skateboard (inspired by the Olympics).

Resources for Contributing to scikit-learn

References

Joe Lucas: scikit-learn Open Source Adventure

Data Umbrella Year in Review

Sat, 30 Dec 2023 15:44:03 +0000

Intro

Data Umbrella was founded in December of 2019, and we are happy to have celebrated our 4th birthday! Here we are sharing our accomplishments and acknowledging all the people who made it happen.

Highlight Accomplishments

Talks

We are grateful to our esteemed speakers, the pioneers and experts who generously share their insights and expertise with our community. We also appreciate our community members: your active participation, contributions, and enthusiasm have created an environment for learning and collaboration. We had a total of 23 talks in 2023. All these talks are available on our YouTube channel.

Timestamps

We have made significant progress in the timestamps project. There are only 9 timestamps left!

PyMC Sprint & Study Groups

We had a total of 8 PyMC study group sessions in 2023, completing the series of 12 PyMC study group sessions. Take a look at the Study Sessions Report on our blog.

We also had a successful PyMC sprint in March 2023. We would like to thank Sandra Meneses & Oriol Abril Pla for setting up the Spanish translation of the sprint website.

Community Partnerships

We are proud to have partnered with PyGotham, PyData Global, PyLadiesCon, NODES 2023, All Things Open, WIA conference, and Diversify Tech. We look forward to more awesome community partnerships in 2024.

Website

We are proud of our new and improved website! We hope you enjoy it.

Milestones

YouTube

We reached 3,000 subscribers on YouTube!

Meetup

Our Meetup membership grew to:

Data Umbrella: 2808 members
Data Umbrella Africa: 1066 members

Community Contributors

Neo4j has been a significant contributor to our community.

We also received great support from our contributors on Open Collective.

Team

Special thanks to the Data Umbrella team for keeping our community moving and growing.

Thank You

To each and every individual who has contributed to Data Umbrella’s journey, we extend our gratitude. We are truly grateful for your support and commitment to the Data Umbrella community. Thank you for being a part of our journey!