Newbie's Guide to Python Module Management

I've always been confused by Python modules and how they work, but was able to kinda muddle thru on my own, as I expect most of us do. I recently sat down and actually taught myself how they work, and I think I have a handle on them now. To test it, I refactored a chunk of code in one of my projects that was bothering me, and as far as I can tell, I did so successfully!

This post documents that process, to hopefully help others in figuring stuff out.

The Problem

My main open-source project, Bikeshed, has to maintain a set of data files. These get updated frequently, so users can call bikeshed update to get new data for them, straight from the data sources. Each data source gets its own independent processing; there's not really any shared code between the different data files.

Originally there were only two types of data, and I wasn't doing anything too complicated with either of them, so I just went ahead and crammed both of the update functions into the same file, update.py. Fast-forward two years, and I've now got six independent update functions in this file, and several of them have gotten substantially more complicated. Refactoring code into helper functions is becoming hard, because it makes it more difficult to find the "main" functions buried in the sea of code.

What I'd really like is for each independent updater function, and its associated suite of helper functions, to live in a separate file. But I've already got a lot of files in my project - it would be great to have them all grouped into a subfolder.

Intro to Python Packages/Modules

So each foo.py file in your project automatically defines a module, named foo. You can import these files and get access to their variables with from . import foo, or from .foo import someVariable. (This is using absolute package-relative imports, which you should be using, not the "implicit relative imports" that Python2 originally shipped with; the . indicates "look in this module's parent".)

Each foo folder in your project defines a package named foo, if the folder has an __init__.py file in it. Packages are imported exactly like modules, with from . import foo/etc; the only difference is that packages can contain submodules (and subpackages) in addition to variables. This is how you get imports like import foo.bar.baz - foo and bar are packages (with bar a subpackage of foo), baz is either a package or a module.

Whenever you import a package, Python will run the __init__.py file and expose its variables for importing. (This is all the global variable names the code in the module can see, including modules that that the code imports!) It also automatically exposes any submodules in the package, regardless of whether __init__.py imports them or not: you can write import foo.bar if the foo/ folder contains a bar.py file, without foo/__init__.py having to do anything special. (Same for nested packages.)

Finally, whenever you do a * import (like from foo import *), Python will go ahead and pull in all the variables that foo/__init__.py defines and dump them into your namespace, but it does not dump submodules in unless __init__.py explicitly imported them already. (This is because the submodules might not be supposed to be part of the public API, and importing may have side-effects, since it just runs an __init__.py, and you might not want those side-effects to automatically happen.) Instead, it looks to see if __init__.py defined a magical __all__ variable; if it did, it assumes it's a list of strings naming all the submodules that should be imported by a * import, and does so.

(AKA, if your __init__.py already imports all the submodules you use or intend to expose, you're fine. If there are more that __init__.py doesn't use, but you want to expose to *-imports, set __all__ = ["sub1", "sub2"] in __init__.py.)

The Solution

So now we have all the information we need.

Step 1 is creating an update/ folder, and adding a blank __init__.py file. We now have an update package ready to import, even tho it's empty right now.

Step 2 is copying out all the code into submodules; I created an update/updateCrossRefs.py file and copied the cross-ref updater code into it, and so on. Now that the code is in separate files, I can rename the updater functions to all be just def update() for simplicity; no need to mention what they're updating when that's already in the module name.

Now that the code has moved from a top-level module in my project to a submodule, their import statements are wrong - anything that mentions from . import foo will look in the update package, not the overall project. Easy to fix, I just have to change these to from .. import foo; you can add as many dots as you want to move further up the package tree if you need.

At this point I'm already mostly done; I can run import update, then later call update.updateCrossRefs.update(), and it magically works! The last step is in handling "global" code, and putting together a good __init__.py.

For Step 3, I have one leftover piece of code, the general update() function that updates everything (or whatever subset of stuff I want). This is the only function the outside world ever actually calls; it's the only thing that calls the more specific updaters.

There's a few ways to do this - you can just put it directly in __init__.py and call it a day. But that exposes the imports it uses, and I want to keep the update module’s API surface nice and clean. Instead, I create another submodule, main.py, and put the function over there. Then, in __init__.py, I just call from .main import update. Now the outside world can say from . import update, and then just call update.update(), without having to know that the function is actually defined in a submodule.

Now that this is all done, I can finally delete the original update.py file in my main project directory. It's empty at this point, after all. ^_^

The End Result

I end up with the following directory structure:

bikeshed/
  ...other stuff...
  update/
    __init__.py
    main.py
    updateCrossRefs.py
    updateBiblio.py
    ...

And my __init__.py just says:

from .main import update, fixupDataFiles

__all__ = ["updateCrossRefs", "updateBiblio", 
           "updateCanIUse", "updateLinkDefaults", 
           "updateTestSuites", "updateLanguages"]

Then my project code, which was already doing from . import update, and calling update.update() (or update.fixupDataFiles()), continues to work and never realizes anything has changed at all!