datafromlopes/README.md

Hi, I'm Diego Lopes 🤘🏻

Lead Data Engineer · MSc. @ IME-USP · NLP Researcher · Natural Language Interfaces for Databases

I build large-scale distributed data systems and research natural language interfaces for structured data. My engineering work focuses on high-throughput, mission-critical data platforms, distributed storage, real-time pipelines, and federated query systems operating at scale.

My research at the University of São Paulo (USP) sits at the intersection of NLP and databases: specifically, how large language models can be fine-tuned to understand complex data schemas and translate natural language into executable SQL. I work on problems at the boundary of computational linguistics, neural language modeling, and production data infrastructure.

Research Interests

  • Natural Language Processing & Computational Linguistics — formal and statistical representations of language for machine reasoning
  • Natural Language Interfaces for Databases — making structured data accessible through language
  • LLM Fine-Tuning — adapting language models to specialized structured data domains (PEFT, LoRA, AdaLoRA)
  • Semantic Parsing — mapping linguistic structure to executable queries
  • Low-Resource NLP — Portuguese language support in text-to-SQL and related tasks

Current Project

🔭 Natural Language Interfaces for Databases: fine-tuning and evaluation of LLMs for Text-to-SQL tasks, with a focus on complex schemas, cross-domain generalization, and Portuguese language support.

👉 github.com/datafromlopes/geo-nlq-to-sql
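For readers new to Text-to-SQL evaluation, here is a minimal sketch of one common scoring idea, execution match: a predicted query counts as correct when it returns the same rows as the reference query on the same database. This is an illustration only and not necessarily how the project above evaluates models. The `city` table, its toy rows, and the helper names (`build_demo_db`, `execution_match`, `execution_accuracy`) are all assumptions made up for this example.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with toy (made-up) rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE city (name TEXT, state TEXT, population INTEGER);
        INSERT INTO city VALUES ('Alfa', 'SP', 100);
        INSERT INTO city VALUES ('Beta', 'SP', 30);
        INSERT INTO city VALUES ('Gama', 'RJ', 50);
        """
    )
    return conn


def execution_match(conn, predicted_sql, gold_sql):
    """Return True when both queries yield the same rows, ignoring row order."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        # A prediction that fails to execute is scored as a miss.
        return False
    gold_rows = conn.execute(gold_sql).fetchall()
    # Sort by repr so rows containing NULLs or mixed types still compare safely.
    return sorted(predicted_rows, key=repr) == sorted(gold_rows, key=repr)


def execution_accuracy(conn, pairs):
    """Fraction of (predicted_sql, gold_sql) pairs whose results match."""
    if not pairs:
        return 0.0
    hits = sum(execution_match(conn, pred, gold) for pred, gold in pairs)
    return hits / len(pairs)


if __name__ == "__main__":
    conn = build_demo_db()
    # Hypothetical question: "Quais cidades de SP têm população maior que 40?"
    gold = "SELECT name FROM city WHERE state = 'SP' AND population > 40"
    predictions = [
        "SELECT name FROM city WHERE population > 40 AND state = 'SP'",
        "SELEC name FROM city",  # malformed on purpose
    ]
    print(execution_accuracy(conn, [(p, gold) for p in predictions]))
```

Execution match is forgiving about surface form (clause order, aliases) but can reward a wrong query that happens to return the same rows on a small database, so it is usually reported alongside other checks.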

Engineering Background

7+ years designing and operating production-grade distributed data systems:

  • High-throughput transactional platforms (Apache Cassandra, billions of writes, microsecond latency)
  • Lakehouse architecture (Apache Iceberg on AWS, federated queries with Trino)
  • Real-time and batch ELT pipelines at scale
  • Data platform reliability and performance engineering

Technologies

Languages: Python · C++ · SQL · Scala

Databases & Storage: PostgreSQL/PostGIS · Apache Cassandra · MongoDB · MySQL

Distributed Systems & Processing: Apache Spark · Kafka · Trino · Hive · Hadoop

Data Platform: Apache Airflow · Apache Iceberg · dbt · AWS · Docker

ML & NLP: PyTorch · HuggingFace Transformers · PEFT (LoRA, AdaLoRA, IA³)

Open to Collaborating On

  • Natural language interfaces for relational and non-relational databases
  • LLM fine-tuning and evaluation for specialized or low-resource domains
  • NLP datasets and benchmarks in Portuguese
  • Large-scale data platform architecture

Contact

📫 datafromlopes.com


⚡ Private pilot, aviation and aircraft systems enthusiast ✈️

Pinned

  1. matrix_multiply_optimizer (Public)

    This repository provides optimized implementations of matrix multiplication algorithms in C. It explores techniques like blocking, vectorization, and loop reordering to maximize performance.

    C 1