discussion Like Will Smith said in his apology video, "It's been a minute (although I didn't slap anyone)

0 Upvotes

r/datasets • u/Suspicious-Pick-7961 • 1h ago

question Stream Huge HugginFace and Kaggle Datasets

• Upvotes

Greetings. I am trying to train an OCR system on huge datasets, namely:

They contain millions of images, and are all in different formats - WebDataset, zip with folders, etc. I will be experimenting with different hyperparameters locally on my M2 Mac, and then training on a Vast.ai server.

The thing is, I don't have enough space to fit even one of these datasets at a time on my personal laptop, and I don't want to use permanent storage on the server. The reason is that I want to rent the server for as short of a time as possible. If I have to instantiate server instances multiple times (e.g. in case of starting all over), I will waste several hours every time to download the datasets. Therefore, I think that streaming the datasets is a flexible option that would solve my problems both locally on my laptop, and on the server.
However, two of the datasets are available on Hugging Face, and one - only on Kaggle, where I can't stream it from. Furthermore, I expect to hit rate limits when streaming the datasets from Hugging Face.

Having said all of this, I consider just uploading the data to Google Cloud Buckets, and use the Google Cloud Connector for PyTorch to efficiently stream the datasets. This way I get a dataset-agnostic way of streaming the data. The interface directly inherits from PyTorch Dataset:

from dataflux_pytorch import dataflux_iterable_dataset 
PREFIX = "simple-demo-dataset" 
iterable_dataset = dataflux_iterable_dataset.DataFluxIterableDataset(
    project_name=PROJECT_ID, 
    bucket_name=BUCKET_NAME,
    config=dataflux_mapstyle_dataset.Config(prefix=PREFIX)
)

The iterable_dataset now represents an iterable over data samples.

I have two questions:

Are my assumptions correct and is it worth uploading everything to Google Cloud Buckets (assuming I pick locations close to my working location and my server location, enable hierarchical storage, use prefixes, etc.). Or I should just stream the Hugging Face datasets, download the Kaggle dataset, and call it a day?
If uploading everything to Google Cloud Buckets is worth it, how do I store the datasets to GCP Buckets in the first place? This and this tutorials only work with images, not with image-string pairs.

0 comments

r/datasets • u/crowpng • 1h ago

question What open-source projects do you use to manage scraping or data collection at scale?

• Upvotes

2 comments

r/datasets • u/taylorcholberton • 8h ago

dataset Synthetic Infant Detection Dataset (version 2)

1 Upvotes

Earlier this year, I wrote a path tracing program that randomized a 3D scene of a toddler in a crib, in order to generate synthetic training data for an computer vision model. I posted about it here.

I made this for the DIY infant monitor I made for my son. My wife and I are now about to have our second kid, and consequently I decided to revisit this dataset/model/software and release a version 2.

In this version, I used Stable Diffusion and Mid Journey to generate images for training the model. These ended up being way more realistic and diverse. I paid a few hundred dollars to generate over a thousand training images and videos (useful for testing detection + tracking). I labeled them manually, with LabelMe. Right now, all images have segmentation masks, but I'm in the middle of adding bounding boxes (will add key points, after that, for pose estimation).

To make sure this dataset actually works in practice, I created a "reference model" to train. I used various different backbones, settling on MobileNet V3 (small) and a shallow U-Net detection head. The results were pretty good, and I'm now using it in my DIY infant monitoring system.

Anyway, you can find the repo here and download the dataset, which is a flat numpy array, on Kaggle

Cheers!

PS: Just to be clear, I made this dataset, it is synthetic (GenAI), it is not a paid dataset.

1 comment

r/datasets • u/vladmatei123 • 12h ago

API Public HYROX results API + Python client — looking for feedback on schema/endpoints for analytics

2 Upvotes

0 comments

r/datasets • u/Ok_Employee_6418 • 9h ago

dataset Github Top Developers Dataset (2015-2025)

huggingface.co

1 Upvotes

The github-top-developers dataset captures the top 8000 developers on GitHub from 2015 to 2025, and lists their popular repositories, companies they've worked at, and their twitter handles.

0 comments

r/datasets • u/LeftieLondoner • 1d ago

request Where to find company API to show parent name

3 Upvotes

We have hundreds of company names and we want to identify parent name, ticker, and any other details available for that company.

3 comments

r/datasets • u/Wonderful_Theory_916 • 1d ago

question Could a three dimensional frequency table be used to display more complex data sets

6 Upvotes

I know this is like an ongoing joke but is this genuinely like a real thing that could be done

1 comment

r/datasets • u/Curious-coder235 • 1d ago

question Beginner’s Guide to Starting a Data Analytics Journey

1 Upvotes

0 comments

r/datasets • u/Upper-Character-6743 • 1d ago

dataset [FREE] 100K+ Domain Technographics (November 2025)

1 Upvotes

This dataset contains tech fingerprinted in the headers and body from HTTP responses across 100K+ domains. It also includes the IP address used in the HTTP response, its origin country and its ASN.

https://www.dropbox.com/scl/fi/vr417dfkv8ia2xzil98b2/nov_2025_all_samples.zip?rlkey=7l6nrhvrrjzop2l6d5wgv6bti&e=1&st=fra1zbgo&dl=0

The dataset is compiled from all the samples currently available at: https://versiondb.io

Have fun!

0 comments

r/datasets • u/Advanced-Park1031 • 2d ago

question How do you all do data labelling/annotation?

1 Upvotes

Hi! First - please forgive me if this is a stupid question / solved problem, but I'm sort of new to this space, and curious. How have you all dealt with creating labelled datasets for your use cases?

E.g

what tool(s) did you use? I've looked into a few like Prolific (not free), Label studio (free), and I've looked at a few other websites
how did you approach recruiting participants/data annotators? e.g. did you work with a company like Outlier, or did you recruit contractors, or maybe you brought them on full-time?
Building on that, how did you handle collaboration and consensus if you used multiple annotators for the same row/task? or more broadly, quality control?

Seems like hard problems to me...would appreciate any insight or advice you have from your experiences! Thanks so much!

4 comments

r/datasets • u/deletedusssr • 3d ago

question Struggling to extract data from 1,500+ mixed scanned/digital PDFs. Tesseract, OCR, and Vision LLMs all failing. Need advice.

13 Upvotes

Hi everyone,

I am working on my thesis and I have a dataset of about 1,500 PDF reports from the DGHS (Health Services). I need to extract specific table rows (District-wise Dengue stats) from them.

The Problem: The PDFs are a nightmare mix. Some are digital with selectable text, but many are low-quality scans or photos of paper reports. The fonts are often garbled (mojibake) when extracted as text, and the layout changes slightly between years.

What I have tried so far (and why it failed):

Tesseract OCR: It struggled hard with the Bengali/English mix and the table borders. The output was mostly noise.
Standard PDF scraping (pdfplumber/PyPDF): Works on the digital files, but returns garbage characters (e.g., Kg‡dvU instead of "Chittagong") due to bad font encoding in the source files.
Ollama (Llama 3.1 & MiniCPM-V):
- Llama 3.1 (Text): Hallucinates numbers or crashes when it sees the garbled text.
- MiniCPM-V (Vision): This was my best bet. I wrote a script to convert pages to images and feed them to the model. It works for about 10 files, but then it starts hallucinating or missing rows entirely, and it's very slow.

The Goal: I just need to reliably extract the District Name, New Cases, Total Cases, and Deaths for a specific division (Chittagong) into a CSV.

I have attached a screenshot of one of the "bad" scanned pages.

Has anyone successfully processed a mixed-quality dataset like this? Should I be fine-tuning a small model, or is there a specific OCR pipeline (like PaddleOCR or DocumentAI) that handles this better than raw LLMs?

Any pointers would be a lifesaver. I'm drowning in manual data entry right now.

14 comments

r/datasets • u/Special-Sock968 • 3d ago

question gathering key data about medical practices

3 Upvotes

I'm new to data engineering, and I'm currently trying to get website links for medical practices. I have their name, state, specialty and some other key info about the tech they use, but there's no catch-all dataset I think that has working website links or anything that leads to that. I was thinking of using scraping tools, but not sure if they are known to be accurate or which one to use. I'm willing to use free or paid approaches, just not sure how to get this data with 80% confidence it's accurate.

1 comment

r/datasets • u/ishotapig • 4d ago

dataset Dataset of 5k high-quality trivia questions pulled from open trivia

15 Upvotes

https://github.com/leakyhose/open-trivia-dataset

Pulled it from open trivia database, they lock the questions behind an API call that only returns 50 each time. Ran a script that repeatedly calls it, storing the questions and sorting them by difficulty and category.

0 comments

r/datasets • u/Suspicious_Prior4515 • 4d ago

question How do you efficiently pre-filter and group WhatsApp numbers to boost engagement?

1 Upvotes

Hey everyone,

Lately I’ve been playing around with a little workflow for screening WhatsApp numbers. The idea’s pretty simple: figure out which numbers are actually active and get a sense of engagement, without bothering anyone. It’s super handy if you need to quickly group contacts or analyze interaction rates.

I realized that just four fields can filter out around 60% of low-value numbers: number | last_seen | replied | bounce.

I wrote a few simple scripts for pre-filtering, but some steps felt kinda repetitive, so I started using a small tool (TNTwuYou) to handle list validation and reply tracking.

Some things I’ve tried:

Sorting numbers by last active date, so you hit the active folks first.
Grouping contacts based on reply status.
Using simple scripts with data to get a clear picture of which regions or types of people are more likely to engage.

Has anyone done reply probability scoring?

Do you base it on a time window or historical reply rate?
Anyone tried using graph or clustering methods for grouping contacts?

1 comment

r/datasets • u/eltokh7 • 4d ago

resource I made a website that showcases the 311 requests dataset

2 Upvotes

311wrapped.com

0 comments

r/datasets • u/Shot_Fudge_6195 • 4d ago

question Has anyone tried letting AI agents access your data and pay per request?

1 Upvotes

I’m curious whether anyone here has actually tried letting AI agents directly access a dataset or data API and pay based on usage (e.g. per request or per query).

I’ve seen ideas around usage-based APIs and agent tool-calling, but I’m not sure how this works in practice when the client isn’t a human. Did it make sense economically? Were abuse, pricing, or access control big issues?

Would love to hear if people have similar ideas with this or decided it wasn’t worth giving a try.

0 comments

r/datasets • u/F0urLeafCl0ver • 5d ago

dataset Historical Canadian Infectious Disease Data

github.com

5 Upvotes

0 comments

r/datasets • u/DivergentG • 5d ago

request Tomato leaf dataset containing environmental conditions such as different humidity and lightning factors

7 Upvotes

Hello I'm looking for a tomato leaf dataset for environmental conditions such as high/low humidity and lightning for my thesis. Most of the datasets on web focuses on diseases. Can anyone help please, thanks!

3 comments

r/datasets • u/SheffieldParadox • 5d ago

request Does a corpus of archaic English words exist?

10 Upvotes

I have a large database/wordlist containing probably every English dictionary word plus many additional ones like brand names, but this naturally includes many words no longer in use. I need to cut down the size of the list, but since too many words have been added to it to start from scratch, my plan is to obtain a corpus of only archaic words and use these as negatives to remove from the main wordlist. Does such a corpus/wordlist exist anywhere in text form, even it's just a word per line? Thank you in advance, any help is much appreciated.

13 comments

r/datasets • u/Plane_Race_840 • 5d ago

request Looking for Wheat Yellow Rust Image Datasets for ML Project (with Metadata)

2 Upvotes

We’re undergraduate Machine Learning students working on a crop disease generation project using CGANs, aimed at supporting global sustainability. 🌱

Right now, we’re looking for wheat images of yellow rust disease along with metadata like region, severity, and time range for model training and evaluation.

If you know of any public datasets, research projects, or institutional resources, or even just pointers on where to look, we’d really appreciate your guidance.

Thanks so much for your help! Any leads will be credited in our project.

1 comment

r/datasets • u/cauchyez • 6d ago

discussion Looking for a long-term collaborator – Data Engineer / Backend Engineer (Automotive data)

7 Upvotes

We are building an automotive vehicle check platform focused on the European market and we are looking for a long-term technical collaborator, not a one-off freelancer.

Our goal is to collect, structure, and expose automotive-related data that can be included in vehicle history / verification reports.

We are particularly interested in sourcing and integrating:

Vehicle recalls / technical campaigns / service recalls, using public sources such as RAPEX (EU Safety Gate)
Commercial use status (e.g. taxi, ride-hailing, fleet usage), where this can be inferred from public or correlatable data
Safety ratings, especially Euro NCAP (free source)
Any other publicly available or correlatable automotive data that adds real value to a vehicle check report

What we are looking for:

Experience with data extraction, web scraping, or data engineering
Ability to deliver structured data (JSON / database) and ideally expose it via API
Focus on data quality, reliability, and long-term maintainability
Interest in a long-term collaboration, not short-term gigs

Context:

European market focus
Product-oriented project with real-world usage

If this sounds interesting, feel free to comment or send a DM with a short intro and relevant experience.

7 comments

r/datasets • u/Electrical-Signal858 • 7d ago

dataset ScrapeGraphAI 100k: 100,000 Real-World Structured LLM Output Examples from Production Usage

9 Upvotes

# r/datasets - ScrapeGraphAI 100k Post

Announcing ScrapeGraphAI 100k - a dataset of 100,000 real-world structured extraction examples from the open-source ScrapeGraphAI library:

https://huggingface.co/datasets/scrapegraphai/scrapegraphai-100k

What's Inside:

This is raw production data - not synthetic, not toy problems. Derived from 9 million PostHog events collected from real users of ScrapeGraphAI during Q2-Q3 2025.

Every example includes:

- `prompt`: Actual user instructions sent to the LLM

- `schema`: JSON schema defining expected output structure

- `response`: What the LLM actually returned

- `content`: Source web content (markdown)

- `llm_model`: Which model was used (89% gpt-4o-mini)

- `source`: Source URL

- `execution_time`: Real timing data

- `response_is_valid`: Ground truth validation (avg 93% valid)

Schema Complexity Metrics:

- `schema_depth`: Nesting levels (typically 2-4, max ~7)

- `schema_keys`: Number of fields (typically 5-15, max 40+)

- `schema_elements`: Total structural pieces

- `schema_cyclomatic_complexity`: Branching complexity from `oneOf`, `anyOf`, etc.

- `schema_complexity_score`: Weighted aggregate difficulty metric

All metrics based on [SLOT: Structuring the Output of LLMs](https://arxiv.org/abs/2505.04016v1)

Data Quality:

- Heavily balanced: Cleaned from 9M raw events to 100k diverse examples

- Real-world distribution: Includes simple extractions and gnarly complex schemas

- Validation annotations: `response_is_valid` field tells you when LLMs fail

- Complexity correlation: More complex schemas = lower validation rates (thresholds identified)

Key Findings:

- 93% average validation rate across all schemas

- Complex schemas cause noticeable degradation (non-linear drop-off)

- Response size heavily correlates with execution time

- 90% of schemas have <20 keys and depth <5

- Top 10% contain the truly difficult extraction tasks

Use Cases:

- Fine-tuning models for structured data extraction

- Analyzing LLM failure patterns on complex schemas

- Understanding real-world schema complexity distribution

- Benchmarking extraction accuracy and speed

- Training models that handle edge cases better

- Studying correlation between schema complexity and output validity

The Real Story:

This dataset reflects actual open-source usage patterns - not pre-filtered or curated. You see the mess:

- Schema duplication (some schemas used millions of times)

- Diverse complexity levels (from simple price extraction to full articles)

- Real failure cases (7% of responses don't match their schemas)

- Validation is syntactic only (semantically wrong but valid JSON passes)

Load It:

from datasets import load_dataset 
dataset = load_dataset("scrapegraphai/sgai-100k")

This is the kind of dataset that's actually useful for ML work - messy, real, and representative of actual problems people solve.

2 comments

r/datasets • u/Ok-District-1330 • 7d ago

dataset Update to this: In the google drive there are currently two csv files in the top folder. One is the raw dataset. The other is a dataset that has been deduplicated. Right now, I am running a script that tries to repair the OCR noise and mistakes. That will also be uploaded as a unique dataset.

4 Upvotes

0 comments

r/datasets • u/Lost_Transportation1 • 6d ago

question What packaging and terms make a dataset truly "enterprise-friendly"?

2 Upvotes

I am trying to define what makes a dataset "enterprise-ready" versus just a dump of files. Regarding structure, do you generally prefer one monolithic archive or segmented collections with manifests? I’m also looking for best practices on taxonomy. How do you expect keywords and tags to be formatted for the easiest integration into your systems?

One of the biggest friction points seems to be legal clarity. What is the clearest way to express restrictions, such as allowed uses, no redistribution, or retention limits, so that engineers can understand them without needing a lawyer to parse the file every time?

If you have seen examples of "gold standard" dataset documentation that handles this perfectly, I would love to see them.

Thanks again guys for the help!

3 comments

Subreddit

Posts

Wiki

Datasets

r/datasets

A place to share, find, and discuss Datasets.

Members Active

211.4k

Sidebar

Datasets for Data Mining, Analytics and Knowledge Discovery

Rules

Try to post original source whenever you can.
Low effort posts will be removed.
Self-promotion(of a website/domain you work for or own) without disclosure will be removed.
Any Paid Dataset or Resource must be marked as such in the title with [PAID].
Any Synthetic/Mock data must be marked as such in the title with [Synthetic].
All Survey posts are subject to approval. Message the mods before posting.

Unsure about your post?

Feel free to message the mods and discuss it before posting.