SoK Proposal for "LabPlot: Download/Import of data sets from kaggle.com" Project
Project Abstract
LabPlot is data visualization and analysis software. For testing and study purposes, LabPlot provides easy access to multiple collections of publicly available datasets. In addition to the collections already available, support for https://www.kaggle.com is desired, since it appears to be the central place for datasets used in the data science community today. The purpose of this project is to extend LabPlot to download and import datasets from https://www.kaggle.com.
Proposal
Interacting with Kaggle
LabPlot can interact with kaggle.com via the Kaggle CLI tool, a program that provides a command line interface to the Kaggle API. For LabPlot to run the tool, it must either be in $PATH, or the user can specify the path to the tool under Settings > Configure LabPlot > Datasets > Kaggle CLI Path.
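The lookup described above could be sketched roughly as follows. This is a minimal illustration in plain standard C++, not existing LabPlot code: the function name resolveKaggleCli and its parameters are hypothetical, and the choice to let a configured path take precedence over $PATH is an assumption. It does not check execute permission.

```cpp
#include <cassert>
#include <filesystem>
#include <optional>
#include <sstream>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper, not existing LabPlot code. `configuredPath` stands for
// the "Kaggle CLI Path" setting, `pathEnv` for the contents of $PATH.
// Returns the executable to launch, or std::nullopt if none was found.
std::optional<fs::path> resolveKaggleCli(const std::string& configuredPath,
                                         const std::string& pathEnv,
                                         const std::string& program = "kaggle") {
    assert(!program.empty());
    std::error_code ec;  // non-throwing overloads: unreadable entries count as "not found"

    // Assumption: a path set explicitly by the user wins over $PATH.
    if (!configuredPath.empty()) {
        if (fs::is_regular_file(configuredPath, ec))
            return fs::path(configuredPath);
        return std::nullopt;
    }

    // Otherwise try each $PATH entry in order (':' is the POSIX separator).
    std::istringstream entries(pathEnv);
    std::string dir;
    while (std::getline(entries, dir, ':')) {
        if (dir.empty())
            continue;
        const fs::path candidate = fs::path(dir) / program;
        if (fs::is_regular_file(candidate, ec))
            return candidate;
    }
    return std::nullopt;
}
```

Returning std::nullopt for a configured path that does not exist, instead of silently falling back to $PATH, keeps a typo in the setting visible to the user. Within Qt, QStandardPaths::findExecutable would probably be the more natural way to do the $PATH search.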
Dialog for Searching, Viewing and Importing Kaggle Datasets
LabPlot already has a dialog for importing datasets from collections over the internet under File > Import Data > From Dataset Collection. We can add a new “Kaggle” option to the existing QComboBox and use the existing widgets in the dialog for searching, viewing and importing datasets from kaggle.com.
Performing Checks
When the user selects the “Kaggle” collection in the QComboBox, we will need to test that we can run the Kaggle CLI tool via the kaggle command. If the test fails, we can use a KMessageWidget to inform the user of the failure and disable the widgets for searching, viewing and importing datasets for Kaggle.
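One possible shape for this check is sketched below. It assumes that running the tool with --version and getting exit status 0 is a good enough test, and it assumes a POSIX shell. The KaggleCheck struct and checkKaggleCli function are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical result type: what the dialog needs to know after the check.
struct KaggleCheck {
    bool available;       // whether to enable the search/view/import widgets
    std::string message;  // text for the KMessageWidget when unavailable
};

// Runs "<program> --version" through the shell and treats exit status 0 as
// success. Assumes a POSIX shell. The simple quoting below does not handle a
// single quote inside the path; a Qt implementation would more likely start
// the program via QProcess with an argument list, which avoids the shell.
KaggleCheck checkKaggleCli(const std::string& program = "kaggle") {
    assert(!program.empty());
    const std::string command = "'" + program + "' --version > /dev/null 2>&1";
    if (std::system(command.c_str()) == 0)
        return {true, ""};
    return {false,
            "Could not run the Kaggle CLI tool ('" + program + "'). Check that it is "
            "installed and in $PATH, or set its path in the settings."};
}
```

The dialog would then enable or disable its Kaggle widgets from `available` and show `message` only on failure.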
Searching for Kaggle Datasets
The Kaggle CLI tool provides the kaggle datasets list -s SEARCH command, which returns the first 20 datasets from kaggle.com that match the SEARCH query. We can request the next 20 datasets, and so on, by passing a -p PAGE option with the command. We can add "Prev" and "Next" QToolButtons for getting the previous and next 20 datasets that match the SEARCH query.
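The paging logic behind the two buttons might look roughly like this. It is a sketch in plain standard C++: searchArgs and SearchPager are hypothetical names, page numbering is assumed to start at 1, and the --csv flag is added on the assumption that CSV output is easier to parse than the default table.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Arguments for "kaggle datasets list -s SEARCH -p PAGE".
// "--csv" asks the CLI for CSV instead of a table, which is easier to parse.
std::vector<std::string> searchArgs(const std::string& query, int page) {
    assert(page >= 1);  // assumption: page numbering starts at 1
    return {"datasets", "list", "-s", query, "-p", std::to_string(page), "--csv"};
}

// Hypothetical state behind the "Prev" and "Next" buttons.
class SearchPager {
public:
    static constexpr std::size_t pageSize = 20;  // datasets per page, per the CLI

    int page() const { return m_page; }
    bool canGoPrev() const { return m_page > 1; }
    // A full page suggests more results may follow; a short page is the last.
    bool canGoNext() const { return m_lastCount == pageSize; }

    // Call with the number of datasets returned for the current page.
    void setResultCount(std::size_t count) { m_lastCount = count; }

    void newSearch() { m_page = 1; m_lastCount = 0; }
    void next() { if (canGoNext()) { ++m_page; m_lastCount = 0; } }
    void prev() { if (canGoPrev()) { --m_page; m_lastCount = 0; } }

private:
    int m_page = 1;
    std::size_t m_lastCount = 0;  // 0 until results arrive, so "Next" stays disabled
};
```

With this heuristic, a query with exactly 20 (or 40, 60, ...) matches lets the user step onto one empty page at the end. That seems an acceptable trade-off, since the list command as described does not report a total count.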
Viewing Metadata for Kaggle Datasets
The Kaggle CLI tool provides the kaggle datasets metadata -p PATH DATASET command, which downloads the metadata for the specified DATASET to the specified PATH. This command can be used to fetch and then display the metadata for the currently selected dataset.
Downloading Kaggle Datasets
The Kaggle CLI tool provides the kaggle datasets download -p PATH DATASET command, which downloads the files contained in the specified DATASET to the specified PATH. This command can be run when the user agrees to import the currently selected dataset.
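Both the metadata and the download command take a target directory and a dataset identifier, so their argument lists can be built the same way. The sketch below assumes the identifier has the form owner/dataset-name. The helper names are hypothetical, and passing --unzip so that the CLI extracts the archive is a suggestion, not something the proposal requires.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Assumption: a dataset is identified as "owner/dataset-name". Rejects empty
// parts, extra slashes, and a leading '-' (which the CLI could mistake for an
// option).
bool isDatasetRef(const std::string& ref) {
    const std::size_t slash = ref.find('/');
    return slash != std::string::npos && slash > 0 && slash + 1 < ref.size()
        && ref.find('/', slash + 1) == std::string::npos
        && ref.front() != '-';
}

// Arguments for "kaggle datasets metadata -p PATH DATASET".
std::vector<std::string> metadataArgs(const std::string& dataset, const std::string& dir) {
    assert(isDatasetRef(dataset));
    return {"datasets", "metadata", "-p", dir, dataset};
}

// Arguments for "kaggle datasets download -p PATH DATASET".
// "--unzip" makes the CLI extract the downloaded archive into PATH.
std::vector<std::string> downloadArgs(const std::string& dataset, const std::string& dir) {
    assert(isDatasetRef(dataset));
    return {"datasets", "download", "-p", dir, "--unzip", dataset};
}
```

Keeping the arguments as a list, not a single command string, means they can be handed to a process API without any shell quoting.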
Storing Kaggle Datasets
We can cache our Kaggle datasets in QStandardPaths::AppDataLocation + "/datasets_local/" like other datasets obtained over the internet. We can also use the existing functionality under Settings > Configure LabPlot > Datasets > Clear for managing the cache size.
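A possible cache layout is sketched below in plain standard C++. The appDataDir parameter stands for whatever directory QStandardPaths::AppDataLocation resolves to. The kaggle/owner/name subdirectory structure and both function names are assumptions made for illustration, not how LabPlot currently lays out its cache.

```cpp
#include <cassert>
#include <filesystem>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// `appDataDir` stands for the directory QStandardPaths::AppDataLocation
// resolves to. The "kaggle/<owner>/<name>" layout is an assumption.
fs::path kaggleCacheDir(const fs::path& appDataDir,
                        const std::string& owner,
                        const std::string& name) {
    assert(!owner.empty() && !name.empty());
    return appDataDir / "datasets_local" / "kaggle" / owner / name;
}

// A dataset counts as cached if its directory exists and holds at least one
// entry. Errors (missing or unreadable directory) are treated as "not cached".
bool isCached(const fs::path& cacheDir) {
    std::error_code ec;
    if (!fs::is_directory(cacheDir, ec))
        return false;
    return fs::directory_iterator(cacheDir, ec) != fs::directory_iterator();
}
```

Checking isCached before starting a download would let the dialog skip the network round trip for datasets the user has already imported.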
Timeline
- Week 1 - 2: Add widgets for collecting Kaggle CLI path
- Week 3 - 4: Modify existing widgets to allow searching for Kaggle datasets
- Week 5 - 6: Modify existing widgets to allow viewing of Kaggle dataset metadata
- Week 7 - 8: Add functionality for downloading Kaggle datasets
- Week 9 - 10: Add functionality for saving and deleting Kaggle datasets
Foreseen Challenges
Handling Different File Types from Kaggle Datasets
Since Kaggle datasets can contain any type of file, a user may try to import a file type that LabPlot does not currently support. We can maintain a list of supported file types and only allow the user to import files of those types. We can also extend LabPlot to add support for currently unsupported file types.
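The filtering idea could be sketched as below. The extension list is purely illustrative and is not LabPlot's actual list of supported formats, which would more sensibly be derived from its existing import filters. The names kSupportedExtensions, isImportable and importableFiles are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <filesystem>
#include <iterator>
#include <set>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Illustrative list only; the real one would come from LabPlot's import filters.
const std::set<std::string> kSupportedExtensions = {".csv", ".tsv", ".txt", ".json"};

// True if the file's extension (compared case-insensitively) is in `supported`.
bool isImportable(const fs::path& file,
                  const std::set<std::string>& supported = kSupportedExtensions) {
    assert(!supported.empty());  // an empty list would silently reject everything
    std::string ext = file.extension().string();
    std::transform(ext.begin(), ext.end(), ext.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return supported.count(ext) > 0;
}

// Keeps only the files the user should be allowed to pick for import.
std::vector<fs::path> importableFiles(const std::vector<fs::path>& files) {
    std::vector<fs::path> result;
    std::copy_if(files.begin(), files.end(), std::back_inserter(result),
                 [](const fs::path& f) { return isImportable(f); });
    return result;
}
```

Filtering by extension is only a first approximation: a file with a supported extension can still have unexpected content, so the import step itself would still need to handle parse failures.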
References/Relevant Background Info
Pre-Proposal Work
I have successfully built LabPlot's code locally on a Linux machine. I am currently working on implementing some of the design ideas above.
- Progress on implementing the previous button, next button and searching on kaggle.com: 2024-01-03_18-11-32
- Draft Merge Request for adding new TimedLineEdit widget to LabPlot
Prior Experience with Open Source
I was an Outreachy intern with Debian from May to August 2022. I then attended DebConf23 in India to give a talk on my contributions and experience as an Outreachy intern in the Debian JavaScript team. I continued contributing to the Debian JavaScript team after my internship by updating packages, fixing bugs and contributing to tooling. I now mentor new Outreachy interns in the Debian JavaScript team. Of note, I packaged corepack and yarn (important package management tools in the Node.js ecosystem) for Debian. I am currently working on updating emscripten in Debian to improve packaging related to WebAssembly.
How To Reach Me
- Timezone: GMT+1 (Lagos, Nigeria)
- Working Hours: 08:00 - 21:00 (Preferred availability)
- Email: izzygaladima@gmail.com
- Matrix: @izzygala:matrix.org
About Me
Hi, my name is Israel. I recently graduated from university, where I studied Electrical and Electronics Engineering. I’ve worked with C++ in the past, but this will be my first experience with the Qt Framework. I’ve done some GUI programming, albeit with Dear ImGui. I'm also an active contributor to the Debian Project. I look forward to contributing to KDE. Thanks.
- LinkedIn profile: https://www.linkedin.com/in/israel-galadima-446a54198/
- Github: https://github.com/israelsgalaxy/
- Debian New Member page: https://nm.debian.org/person/izzygala/
/cc @teams/season-of-kde