Case study
8M+ Records, 7 Legacy Systems, Zero Data Loss
Enterprise pharma platform migration with zero data loss across 8M+ records
The problem
Sanofi's commercial operations ran on seven legacy CRM and data systems across Europe and APAC. Together they held 8M+ records with duplicate identities, structural inconsistency, and historical gaps. Three of the seven systems had no API. A standard bulk-migration approach would have produced thousands of validation failures and unacceptable regulatory exposure for a pharmaceutical operator.
What I did
A canonical data model was designed and signed off by compliance, legal, and IT before any extraction. Custom connectors pulled data from the API-less systems. A deduplication, enrichment, and value-mapping pass resolved 340,000 potential duplicates and 89,000 inconsistencies before load. Migration ran in load-validate-reconcile waves with a zero-tolerance halt threshold.
At a glance
- Client: Sanofi
- Sector: Life sciences
- Engagement: 6 months
- My role: Lead Solution Architect, Migration and Data Integrity
- Salesforce clouds: Sales Cloud · Service Cloud
- Outcome: 0 data loss incidents
Before / After
Before
- Seven separate CRM and data systems across Europe and APAC.
- 8M+ records with duplicate identities, no canonical schema.
- 340,000 potential duplicates, 89,000 inconsistencies, 12,000 records missing mandatory fields.
- Three source systems with no API access.
- No reconciliation framework, no audit trail for regulator review.
After
- One consolidated Salesforce platform, seven legacy systems decommissioned.
- 8M+ records under a canonical model, owners named per entity.
- 847 records reviewed by humans, the rest resolved automatically with logged criteria.
- Custom connectors and validation suite packaged as reusable assets.
- 100 percent source-to-target reconciliation across all 8M+ records.
Situation
Sanofi’s commercial operations across Europe and APAC had accumulated seven legacy CRM and data systems over a decade of acquisitions, regional buildouts, and platform migrations that never fully completed. Customer and HCP (Healthcare Professional) data existed in fragments. Some sat in aging on-premises systems, some in regional Salesforce sandboxes, some in spreadsheet-based processes that had never been formalised.
The mandate was to consolidate all commercial data into a single Salesforce platform, retire the seven legacy systems, and do so with zero tolerance for data loss. In pharmaceuticals, data integrity is not a performance metric. It is a regulatory requirement. Records of HCP interactions, prescribing data, promotional compliance documentation, and adverse event history must be complete and auditable. Any migration that loses, corrupts, or cannot account for a record creates regulatory exposure.
The scale was 8 million records across the seven source systems, with overlapping data models, inconsistent field mappings, duplicate entities, and no agreed canonical schema for what a healthcare-professional contact record should contain. Three of the seven systems had no API access, requiring custom extraction tooling.
Challenge
The primary risk was not technical complexity. It was the combination of regulatory non-negotiability and data-model divergence. Standard migration approaches (bulk export, transform, load) carry a level of risk that is acceptable in most commercial contexts but not under pharmaceutical compliance.
The source systems had three distinct data-quality problems. Duplicate records: the same HCP existed in multiple systems under slightly different names, addresses, and specialisation codes. Structural inconsistency: what one system called “account type” another called “customer segment,” with different value sets and no cross-reference. Historical gaps: several systems had been used inconsistently, producing records with missing mandatory fields that would fail validation against the target schema.
A migration that loaded these records as-is would have generated thousands of validation failures on day one. That is precisely the scenario regulators and business stakeholders could not accept. The work had to resolve data quality before migration, not after. The three systems with no API access added extraction risk on top: custom connectors needed to be built and tested, with every extraction validated against source-system record counts and checksums.
What I told the steering committee before extraction began: “The load is the easy part. The engagement is data quality. If we get the model and the cleansing right, the migration itself will be uneventful.”
Action
The architectural call was to design the canonical data model before a single line of migration code was written. The target schema established the authoritative definition of each entity (Healthcare Professional, Account, Product, Interaction, Consent, Compliance documentation). Every source-system field mapped to a target field with explicit conflict-resolution rules. Compliance, legal, and IT signed off before extraction began.
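To make the idea concrete, here is a minimal Python sketch of a field and value mapping onto a canonical schema. It is an illustration only, not the programme's actual tooling (which used DataWeave transformations and Apex validation). The system names `crm_a` and `crm_b`, the value sets, and the function `to_canonical` are all hypothetical.

```python
# Hypothetical source-system names, field names, and value sets, for illustration only.
FIELD_MAP = {
    ("crm_a", "account type"): "account_type",
    ("crm_b", "customer segment"): "account_type",
}

VALUE_MAP = {
    ("crm_a", "account_type", "HOSP"): "Hospital",
    ("crm_b", "account_type", "Hospital / Clinic"): "Hospital",
    ("crm_b", "account_type", "Retail Pharm."): "Pharmacy",
}


class UnresolvedMapping(Exception):
    """Raised when a source field or value has no agreed canonical target."""


def to_canonical(system: str, record: dict) -> dict:
    """Translate one source record into canonical field names and values."""
    canonical = {}
    for source_field, source_value in record.items():
        try:
            target_field = FIELD_MAP[(system, source_field)]
            canonical[target_field] = VALUE_MAP[(system, target_field, source_value)]
        except KeyError as exc:
            # Fail loudly: an unmapped field or value becomes remediation work,
            # never a silently defaulted record.
            raise UnresolvedMapping(f"{system}: {source_field}={source_value!r}") from exc
    return canonical
```

Under these assumptions, `to_canonical("crm_b", {"customer segment": "Hospital / Clinic"})` returns `{"account_type": "Hospital"}`, while any value missing from the signed-off tables raises `UnresolvedMapping` instead of loading.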
Extraction and profiling
Custom connectors built for the three API-less systems via direct database queries and screen-scraping automation. Every extraction validated against record counts and checksums. Profiling quantified the cleansing scope: 340,000 potential duplicates, 12,000 records missing mandatory fields, 89,000 structural inconsistencies.
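As an illustration of validating an extract against record counts and checksums, the sketch below fingerprints a set of rows in Python. The functions `extract_fingerprint` and `verify_extract`, the choice of SHA-256, and the row-as-dict shape are my assumptions for the example, not a description of the connectors that were built.

```python
import hashlib

_MODULUS = 2 ** 256  # keeps the combined checksum at a fixed width


def extract_fingerprint(rows):
    """Return (row_count, checksum) for an iterable of dict rows.

    Each row is hashed on its own and the digests are summed modulo 2**256,
    so the result does not depend on the order in which rows are read.
    """
    count = 0
    combined = 0
    for row in rows:
        # Serialise fields in sorted order so dict ordering cannot change the hash.
        payload = "\x1f".join(f"{k}={row[k]}" for k in sorted(row))
        digest = hashlib.sha256(payload.encode("utf-8")).digest()
        combined = (combined + int.from_bytes(digest, "big")) % _MODULUS
        count += 1
    return count, f"{combined:064x}"


def verify_extract(source_rows, extracted_rows):
    """Raise ValueError unless the extract matches the source exactly."""
    source = extract_fingerprint(source_rows)
    extract = extract_fingerprint(extracted_rows)
    if source != extract:
        raise ValueError(
            f"extract mismatch: source has {source[0]} rows, extract has {extract[0]}"
        )
    return source
```

The count catches dropped or repeated rows cheaply; the checksum catches rows that arrived but changed in transit, such as a truncated field from a screen-scrape.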
Transformation and quality resolution
Deterministic deduplication using national identifier, name plus address proximity, and specialty code. Automated enrichment from public HCP registries reduced the missing-field set from 12,000 to 847 for human review. Value-mapping tables resolved structural inconsistencies; no record migrated with an unresolved mapping.
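A rule-based match of this kind might be sketched in Python as follows. The field names, the 0.85 threshold, exact matching on normalised names, and the use of `difflib.SequenceMatcher` as a stand-in for address proximity are all assumptions made for illustration; the source does not specify how proximity was measured.

```python
from difflib import SequenceMatcher


def normalise(text):
    """Lower-case and collapse whitespace so formatting noise does not block a match."""
    return " ".join(str(text).lower().split())


def match_decision(a, b, address_threshold=0.85):
    """Return (is_match, criteria) for two HCP records given as dicts.

    The criteria string is what would be written to the audit log.
    """
    id_a, id_b = a.get("national_id"), b.get("national_id")
    if id_a and id_b:
        # Assumption: when both identifiers exist they decide the outcome either way.
        if id_a == id_b:
            return True, "national_id"
        return False, "national_id_conflict"

    # Fallback rule: same name and specialty code, with textually close addresses.
    same_name = normalise(a["name"]) == normalise(b["name"])
    same_specialty = a.get("specialty_code") == b.get("specialty_code")
    address_score = SequenceMatcher(
        None, normalise(a["address"]), normalise(b["address"])
    ).ratio()
    if same_name and same_specialty and address_score >= address_threshold:
        return True, f"name+specialty+address({address_score:.2f})"
    return False, "no_rule_matched"
```

Because every rule is explicit and ordered, the same pair of records always yields the same decision and the same logged criteria, which is the property a reviewer needs.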
Load, validate, reconcile
Parallel waves: non-production environments first, full validation before production load. Each batch ran a load-validate-reconcile cycle. A zero-tolerance threshold halted any batch missing a single record. Three halts during the programme; all three traced to extraction issues, corrected, re-loaded.
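The halt rule can be sketched in a few lines of Python. This is a simplified illustration: `load` and `fetch_loaded` are hypothetical callables standing in for the real load tooling and read-back queries, and `source_id` is an assumed key field.

```python
class ReconciliationHalt(Exception):
    """Raised when a batch does not reconcile exactly; stops the whole wave."""


def reconcile(source_batch, target_rows, key="source_id"):
    """Return the source keys that are missing from, or differ in, the target."""
    target_by_key = {row[key]: row for row in target_rows}
    return [row[key] for row in source_batch if target_by_key.get(row[key]) != row]


def run_wave(batches, load, fetch_loaded):
    """Load batches in order, halting on the first batch with any discrepancy.

    `load` writes a batch to the target; `fetch_loaded` reads back what was
    written for that batch. Both are supplied by the caller.
    """
    completed = 0
    for batch in batches:
        load(batch)
        discrepancies = reconcile(batch, fetch_loaded(batch))
        if discrepancies:  # zero tolerance: one unmatched record is enough
            raise ReconciliationHalt(
                f"batch {completed + 1}: {len(discrepancies)} unreconciled, "
                f"first key {discrepancies[0]!r}"
            )
        completed += 1
    return completed
```

Raising an exception rather than logging and continuing is the point of the design: the next batch cannot start until someone has investigated and re-loaded the failed one.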
After cutover, a 30-day parallel-operation period ran the legacy systems and Salesforce simultaneously. Automated comparison queries confirmed data consistency between systems. Decommissioning sign-off required 100 percent reconciliation across all 8M+ records.
Result
Eight million records were migrated from seven legacy systems with zero data-loss incidents. The parallel-operation period confirmed complete reconciliation. Every record in every legacy system was accounted for in the Salesforce target, with a full audit trail from source extraction through transformation decisions to final load.
Seven legacy systems were decommissioned on schedule, eliminating the maintenance costs and compliance risk associated with aging infrastructure. The consolidated Salesforce platform reduced the annual IT overhead of maintaining multiple CRM systems by €1.2M.
The migration framework developed for Sanofi proved reusable. The extraction connectors, transformation logic, and validation suite were packaged as assets for future migration programmes. The data-quality methodology (profile, resolve deterministically, flag residuals for human review, document every decision) has since been applied to two further migration programmes at other clients.
The migration also established a clean, well-governed data foundation that positions Sanofi’s commercial operations for future AI and Data Cloud initiatives. Clean data with complete audit trails is the prerequisite for intelligent automation. The commercial org now sits on a platform that can support Agentforce agents.
Reflection
This pattern works when regulatory rigour is the binding constraint and the target architecture has to meet a higher quality bar than the source. It works when leadership accepts that the engagement is data quality, not the load step. It works when compliance and IT can be brought into the canonical-model design before extraction begins.
It works less well when the org chases a faster cutover by deferring data-quality work to post-migration cleanup. That is the scenario regulators will not accept and that produces years of remediation on the back end.
Worth doing earlier: the canonical model. Naming entities, fields, and owners before extraction starts removes most of the late-stage decision-making that turns regulated migrations into delivery risks.
Technologies used: Salesforce Sales Cloud, Service Cloud, Data Loader, custom extraction connectors, Apex validation frameworks, DataWeave transformations, external data-profiling tooling
Glossary
- Canonical data model
- A single agreed shape for each core entity (Healthcare Professional, Account, Product, Interaction, Consent). Every source system maps to it; the model is the source of truth, not the systems.
- Deterministic deduplication
- Matching records to a single real-world entity using rule-based criteria (national identifier, name plus address proximity, specialty code). Every match decision is logged with the criteria used, creating an audit trail for regulatory review.
- Load-validate-reconcile cycle
- Each migration wave loads a batch, runs automated queries comparing source and target record counts and field values, and reconciles any discrepancy before the next batch starts. A single unmatched record halts the wave.
- Parallel operation period
- After cutover, legacy systems and the new platform run side by side for a fixed window with automated comparison queries. Decommissioning sign-off requires 100 percent reconciliation across that window.
Frequently asked
- Source systems were not reliable definitions of what a record should look like. Three of them had been used inconsistently. Mapping target-to-source after extraction would have anchored the new platform to legacy quality. Building the canonical model first and signing it off with compliance, legal, and IT meant every extraction and transformation had a fixed reference point. Records that did not fit the canonical schema surfaced as remediation work, not silent data debt.
- Every wave ran a load-validate-reconcile cycle. Automated queries compared source and target record counts and field values. Any batch that failed reconciliation by even one record halted the migration and triggered investigation. Three halts occurred during the programme. All three were extraction issues in legacy systems, not transformation errors, and all three were corrected and re-loaded before the next wave.
- Data profiling identified 12,000 records with missing mandatory fields. Automated enrichment from public HCP registry data resolved the bulk of them. The 847 residuals went through a structured human review process with named data stewards before migration. No record was loaded with an unresolved mapping. Every review decision was documented and available for audit.
- Yes. The methodology (profile, resolve deterministically, flag residuals for human review, document every decision) has since been applied to two further migration programmes at other clients. The custom extraction connectors, transformation logic, and validation suite were packaged as reusable assets. The regulatory rigour adds discipline that benefits any migration; in commercial contexts it is simply faster because the audit-trail bar is lower.
- Clean data with complete audit trails is the prerequisite for intelligent automation. Agentforce cannot reason on fragmented identities or stale records. Sanofi's commercial org now operates on a consolidated platform with named entity owners, canonical definitions, and quality monitoring in place. The data layer is ready for Data Cloud ingestion and for Agentforce pilots that can reference the customer's actual state.