Senior Software Engineer - Web Crawling


SENIOR SOFTWARE ENGINEER, Web Crawling Architecture, $160K-$180K, Austin/Remote

About the company

Datafiniti was founded with the vision of empowering people and organizations with data. If someone was building technology that required data to power it, we wanted to remove the hurdle of data acquisition so that that technology could come to life. Ten years later, we have brought that vision to life and are looking to expand it as far as possible. We currently have a wide variety of customers in finance, retail, proptech, and marketing, who use our technology to develop their products.

We collect a wide variety of information and content from thousands of sources and transform it all into highly-structured, instantly accessible data that can be integrated right away into any application or analysis someone is building. Startups and Fortune 500s alike use our various data sets and APIs to power thousands of solutions, including fraud prevention, investment algorithms, pricing analysis, mobile apps, lead generation, and much, much more.

In order to efficiently support such a wide variety of customers, we focus on building a highly flexible, robust, and scalable data infrastructure. Our technology is capable of ingesting, processing, and serving out billions of data points every day. Our small, close-knit team works together to develop technology and operational capabilities that allow us to meet the needs of an ever-expanding set of use cases.

Over the last two years, we have doubled our customer base and revenue each year and are on course to do so again this year! As we enter a new phase of growth, we are seeking to build a new engineering team from scratch and are specifically looking for people who are eager to "own" the technology and scale it for growth. This is a unique opportunity that provides an experience similar to joining a start-up on Day 1, while already having an established customer base and strong technology foundation. We’re hoping to bring on new team members who are excited to work on the challenges of our unique business and push the boundaries of what “scale” truly means.

About the role

We wish to hire an experienced software engineer who has a strong passion for developing highly scalable distributed systems, particularly around web crawling or web scraping. A significant part of our technology stack is dedicated to:

Efficiently fetching content from a large volume of URLs (15 million per day)
Running self-contained “micro applications” on each URL’s page content to convert it into semi-structured data
Passing that data to a separate stack responsible for data warehousing

At a high level, this engineer will be responsible for the maintenance, improvement, and expansion of this stack.

Responsibilities

Specific responsibilities can be broken down into the following categories:

Maintenance of existing systems

Maintaining optimal throughput within the crawling engine’s pipeline
Maintaining and building upon each individual microservice to tackle problems such as:
Preventing duplicate URLs from being crawled
Distributing the correct amount of work to different microservices
Safely running arbitrary code against HTML without locking up an entire process
Maintaining queryable queues that hold hundreds of millions of URLs while keeping retrieval within milliseconds
Leveraging one or many different proxy providers to increase the likelihood of a data-yielding response
Collecting and reporting valuable data points and metrics throughout all parts of the system (90 million data points per minute)
Streaming large volumes of normalized web data to multiple points of interest
Maintaining an API that sits in front of the crawling engine

New systems to support upcoming business needs

Architecting and implementing a robust solution to increase the likelihood of getting a successful response from any number of domains. This solution will leverage IPs provided by various proxy providers, along with an elegant approach to rotating these IPs and pairing them with cookies generated for a given domain
Implementing a solution that provides more insight/vision into the health and status of data acquisition from any domain
Architecting and implementing an elegant, dynamic, and performant distribution mechanism that provides us with full control over what we’re crawling at any given time

Additional responsibilities include:

Diagnosing and fixing highly complex technical issues independently
Supporting the build and deployment pipeline and, when necessary, diagnosing and solving production support issues
Communicating individual and project-level development statuses, issues, risks, and concerns to technical leadership and management
Identifying and communicating cross-team dependencies to respective peers
Writing specification documents that include the feature-set being developed, explaining how these features will be implemented, and gaining stakeholder approval for the feature-set
Conducting thorough QA as a part of the development lifecycle prior to a production release

Qualifications

Specific technologies required for this role include:

Node.js (5+ years of experience): All of the microservices that make up our crawling engine are written in Node.js, and each microservice leverages Node’s native clustering capabilities for parallelism
Express.js or a similar API framework such as Hapi, Koa, or Restify: The API that sits in front of our crawling engine is written in Express.js
MongoDB: We use MongoDB as our primary user database
MySQL: We use MySQL in a novel way to maintain our URL queues
Redis: We use Redis to maintain global state within our distributed system
AWS and Docker: All of our microservices are containerized via Docker and deployed to AWS

This role will also require a deep understanding of the following concepts or skills:

Thorough understanding of distributed systems and how to make them reliable, scalable and maintainable
Experience with web crawling/scraping at a very large scale including the use of proxy rotation
Deep understanding of the design, implementation, and consumption of REST APIs
Excellent verbal and written communication skills
Strong analytical, problem solving, debugging and troubleshooting skills

Additional skills that will be considered a plus but are not required:

StatsD / Graphite / Grafana

Compensation & Benefits

Compensation and benefits for this role include:

$160K - $180K annual base salary
Equity in a company that is doubling its revenue every year
Comprehensive health insurance (medical, dental, vision, life)
Unlimited PTO, but 15 days MINIMUM required, preferably more
7 federal holidays + 4 quarterly company-wide holidays

Additional benefits include:

Highly flexible work-life balance: If working in Austin, we ask that you work from the office at least twice a week (for team bonding!), and you are free to work from home on your own schedule otherwise.
High degree of job autonomy: Team members are encouraged to experiment with their own implementations, propose ideas for company needs, and explore new solutions.
Career development: Executive leaders work with team members to align personal goals with company goals.
A supportive team environment: Team culture is focused on providing a supportive and positive environment for everyone.

NOTE: We prefer to hire someone who can work from our Austin office, but are open to a remote hire for someone who is a great fit for the role.

Job Type: Full-time

Pay: $160,000.00 - $180,000.00 per year

COVID-19 considerations:

Our office is open again. We are allowing fully vaccinated employees to work from the office without wearing masks. Anyone who is not yet fully vaccinated is asked to continue working from home.