I've always been confused by Python modules and how they work, but was able to kinda muddle thru on my own, as I expect most of us do. I recently sat down and actually taught myself how they work, and I think I have a handle on them now. To test it, I refactored a chunk of code in one of my projects that was bothering me, and as far as I can tell, I did so successfully!
This post documents that process, to hopefully help others in figuring stuff out.
My main open-source project, Bikeshed, has to maintain a set of data files. These get updated frequently, so users can call
bikeshed update to get new data for them, straight from the data sources. Each data source gets its own independent processing; there's not really any shared code between the different data files.
Originally there were only two types of data, and I wasn't doing anything too complicated with either of them, so I just went ahead and crammed both of the update functions into the same file,
update.py. Fast-forward two years, and I've now got six independent update functions in this file, and several of them have gotten substantially more complicated. Refactoring code into helper functions is becoming hard, because it makes it more difficult to find the "main" functions buried in the sea of code.
What I'd really like is for each independent updater function, and its associated suite of helper functions, to live in a separate file. But I've already got a lot of files in my project - it would be great to have them all grouped into a subfolder.
Intro to Python Packages/Modules
Any foo.py file in your project automatically defines a module, named
foo. You can import these files and get access to their variables with
from . import foo, or
from .foo import someVariable. (This is using explicit relative imports, which you should be using, not the "implicit relative imports" that Python2 originally shipped with; the
. indicates "look in the package that contains this module".)
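To make that concrete, here's a minimal sketch that builds a tiny package in a temp folder and imports from it. The myproject package and its app.py file are made-up names for illustration, not anything from Bikeshed.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk; "myproject" and "app.py" are hypothetical.
root = Path(tempfile.mkdtemp())
pkg = root / "myproject"
pkg.mkdir()
(pkg / "__init__.py").write_text("")
(pkg / "foo.py").write_text("someVariable = 42\n")
(pkg / "app.py").write_text(
    "from . import foo              # the whole sibling module\n"
    "from .foo import someVariable  # just one name out of it\n"
)

# Make the temp folder importable, then load myproject.app.
sys.path.insert(0, str(root))
importlib.invalidate_caches()
app = importlib.import_module("myproject.app")

print(app.foo.someVariable)  # reached through the module object
print(app.someVariable)      # reached through the directly imported name
```

Both prints show the same value, since the two import forms are just two ways of reaching the same variable in foo.py.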
Any foo folder in your project defines a package named
foo, if the folder has an
__init__.py file in it. Packages are imported exactly like modules, with
from . import foo/etc; the only difference is that packages can contain submodules (and subpackages) in addition to variables. This is how you get imports like
import foo.bar.baz -
foo and bar are packages (with
bar a subpackage of
foo), and baz is either a package or a module.
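Here's a small sketch of that nesting, assuming a made-up layout where outer/ and outer/inner/ are packages and leaf.py is a plain module. One handy way to tell the two apart at runtime is that packages have a __path__ attribute and plain modules don't.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical layout: outer/ and outer/inner/ are packages, leaf.py is a module.
root = Path(tempfile.mkdtemp())
inner = root / "outer" / "inner"
inner.mkdir(parents=True)
(root / "outer" / "__init__.py").write_text("")
(inner / "__init__.py").write_text("")
(inner / "leaf.py").write_text("VALUE = 'hello from leaf'\n")

sys.path.insert(0, str(root))
importlib.invalidate_caches()

import outer.inner.leaf  # same shape as `import foo.bar.baz`


def is_package(module):
    """Packages carry a __path__ attribute; plain modules do not."""
    return hasattr(module, "__path__")


print(outer.inner.leaf.VALUE)
print(is_package(outer), is_package(outer.inner), is_package(outer.inner.leaf))
```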
Whenever you import a package, Python will run the
__init__.py file and expose its variables for importing. (This is all the global variable names the code in the module can see, including modules that the code imports!) It also automatically exposes any submodules in the package, regardless of whether
__init__.py imports them or not: you can write
import foo.bar if the
foo/ folder contains a
bar.py file, without
foo/__init__.py having to do anything special. (Same for nested packages.)
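A quick sketch of both behaviors, using a made-up toolbox package. One wrinkle it shows: a submodule is always importable by its dotted name, but it only shows up as an attribute on the package object after something has actually imported it.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical package: toolbox/__init__.py plus a bar.py it never mentions.
root = Path(tempfile.mkdtemp())
pkg = root / "toolbox"
pkg.mkdir()
(pkg / "__init__.py").write_text(
    "import json        # imported modules become package attributes too\n"
    "GREETING = 'hi'\n"
)
(pkg / "bar.py").write_text("ANSWER = 42\n")

sys.path.insert(0, str(root))
importlib.invalidate_caches()

import toolbox  # runs toolbox/__init__.py

greeting = toolbox.GREETING
exposes_json = hasattr(toolbox, "json")   # __init__.py's own import is visible
bar_before = hasattr(toolbox, "bar")      # bar.py has not been imported yet

import toolbox.bar  # works even though __init__.py never mentions bar

bar_after = hasattr(toolbox, "bar")
answer = toolbox.bar.ANSWER

print(greeting, exposes_json, bar_before, bar_after, answer)
```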
Finally, whenever you do a
* import (like
from foo import *), Python will go ahead and pull in all the variables that
foo/__init__.py defines and dump them into your namespace, but it does not dump submodules in unless
__init__.py explicitly imported them already. (This is because the submodules might not be meant to be part of the public API, and importing may have side-effects, since it runs the submodule's code (or a subpackage's
__init__.py), and you might not want those side-effects to automatically happen.) Instead, it looks to see if
__init__.py defined a magical
__all__ variable; if it did, it treats it as the complete list of names (as strings) that should be brought in by a
* import, importing any listed submodules that haven't been loaded yet.
(AKA, if your
__init__.py already imports all the submodules you use or intend to expose, you're fine. If there are more that
__init__.py doesn't use, but you want to expose to * imports, put
__all__ = ["sub1", "sub2"] in __init__.py.)
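Here's a sketch comparing a package without __all__ to one with it. Both package names, plain_pkg and listed_pkg, are invented for the demo, and I run the * import through exec() purely so the resulting namespace can be inspected. One detail worth knowing: when __all__ is present, it is the whole list of what a * import brings in, so other names defined in __init__.py get left out.

```python
import importlib
import sys
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())

# plain_pkg: no __all__, one variable, one submodule that __init__.py never imports.
plain = root / "plain_pkg"
plain.mkdir()
(plain / "__init__.py").write_text("PUBLIC = 1\n")
(plain / "sub1.py").write_text("X = 'plain sub1'\n")

# listed_pkg: the same shape, but __init__.py names sub1 in __all__.
listed = root / "listed_pkg"
listed.mkdir()
(listed / "__init__.py").write_text("HELPER = 2\n__all__ = ['sub1']\n")
(listed / "sub1.py").write_text("X = 'listed sub1'\n")

sys.path.insert(0, str(root))
importlib.invalidate_caches()


def star_import(package_name):
    """Run `from <package_name> import *` in a fresh namespace; return the names added."""
    namespace = {}
    exec(f"from {package_name} import *", namespace)
    return sorted(name for name in namespace if not name.startswith("__"))


plain_names = star_import("plain_pkg")    # variables only, no submodules
listed_names = star_import("listed_pkg")  # exactly what __all__ lists

print(plain_names)
print(listed_names)
```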
So now we have all the information we need.
Step 1 is creating an
update/ folder, and adding a blank
__init__.py file. We now have an
update package ready to import, even tho it's empty right now.
Step 2 is copying out all the code into submodules; I created an
update/updateCrossRefs.py file and copied the cross-ref updater code into it, and so on. Now that the code is in separate files, I can rename the updater functions to all be just
def update() for simplicity; no need to mention what they're updating when that's already in the module name.
Now that the code has moved from a top-level module in my project to a submodule, its import statements are wrong - anything that mentions
from . import foo will look in the
update package, not the overall project. Easy to fix, I just have to change these to
from .. import foo; you can add as many dots as you want to move further up the package tree if you need.
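A minimal sketch of the two-dot form, under some assumptions: the sketchproj project name, its config.py, and the body of update() are all invented here, and only the update/updateCrossRefs.py layout mirrors what I described above.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical project: sketchproj/config.py sits one level above sketchproj/update/.
root = Path(tempfile.mkdtemp())
project = root / "sketchproj"
updaters = project / "update"
updaters.mkdir(parents=True)
(project / "__init__.py").write_text("")
(project / "config.py").write_text("NAME = 'sketchproj'\n")
(updaters / "__init__.py").write_text("")
(updaters / "updateCrossRefs.py").write_text(
    "from .. import config   # two dots: go up past update/ to sketchproj/\n"
    "\n"
    "def update():\n"
    "    return 'updating cross-refs for ' + config.NAME\n"
)

sys.path.insert(0, str(root))
importlib.invalidate_caches()

crossrefs = importlib.import_module("sketchproj.update.updateCrossRefs")
result = crossrefs.update()
print(result)
```

With a single dot, that import would have looked for config inside the update package and failed.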
At this point I'm already mostly done; I can run
import update.updateCrossRefs, then later call
update.updateCrossRefs.update(), and it magically works! The last step is handling "global" code, and putting together a good __init__.py.
For Step 3, I have one leftover piece of code, the general
update() function that updates everything (or whatever subset of stuff I want). This is the only function the outside world ever actually calls; it's the only thing that calls the more specific updaters.
There's a few ways to do this - you can just put it directly in
__init__.py and call it a day. But that exposes the imports it uses, and I want to keep the
update module’s API surface nice and clean. Instead, I create another submodule,
main.py, and put the function over there. Then, in
__init__.py, I just write
from .main import update. Now the outside world can say
from . import update, and then just call
update.update(), without having to know that the function is actually defined in a submodule.
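Here's roughly what that re-export looks like as a runnable sketch. The miniproj name and the stub updater bodies are assumptions for the demo; the real updaters obviously do a lot more than return a string.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical project with an update/ package shaped like the one described above.
root = Path(tempfile.mkdtemp())
project = root / "miniproj"
updaters = project / "update"
updaters.mkdir(parents=True)
(project / "__init__.py").write_text("")

# Stub updaters standing in for the real ones.
(updaters / "updateCrossRefs.py").write_text(
    "def update():\n    return 'cross-refs updated'\n"
)
(updaters / "updateBiblio.py").write_text(
    "def update():\n    return 'biblio updated'\n"
)

# main.py holds the general update() and the imports it needs.
(updaters / "main.py").write_text(
    "from . import updateCrossRefs, updateBiblio\n"
    "\n"
    "def update():\n"
    "    return [updateCrossRefs.update(), updateBiblio.update()]\n"
)

# __init__.py only re-exports the one function callers care about.
(updaters / "__init__.py").write_text("from .main import update\n")

sys.path.insert(0, str(root))
importlib.invalidate_caches()

from miniproj import update  # the package, not the function

results = update.update()
same_function = update.update is update.main.update

print(results)
print(same_function)
```

The caller only ever touches update.update(); the fact that it lives in main.py is an implementation detail.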
Now that this is all done, I can finally delete the original
update.py file in my main project directory. It's empty at this point, after all. ^_^
The End Result
I end up with the following directory structure:
bikeshed/
    ...other stuff...
    update/
        __init__.py
        main.py
        updateCrossRefs.py
        updateBiblio.py
        ...
__init__.py just says:
from .main import update, fixupDataFiles
__all__ = ["updateCrossRefs", "updateBiblio", "updateCanIUse", "updateLinkDefaults", "updateTestSuites", "updateLanguages"]
Then my project code, which was already doing
from . import update, and calling
update.fixupDataFiles(), continues to work and never realizes anything has changed at all!